
Extracting Trends and Truths from Oil, Gas and Mining Contracts: Text Analytics and

As a global repository of more than 1,000 mining and petroleum contracts between governments and extractives companies, the site now provides a large corpus of data for advanced text analytics. The site, re-launched in November, has been upgraded to include digitized text as well as scans of original documents. One objective is to provide journalists, researchers and policymakers with a rich source of information on the deals governments have been signing around the world.

In this blog we illustrate what is available on the platform and explore some examples of the kind of analysis that is now possible.

The contracts database now contains a total of 1,106 documents, including 863 digitized contracts; it hosts approximately 12.5 million words that can be searched, summarized and aggregated in response to queries. Most of the contracts available in the database are in English. The figure below illustrates the distribution of English-language contract documents across time, with 395 contracts (almost 80 percent) in the database originating during the 2000s.

Illustrative research objective:
Exploring how the prevalence of the key terms "infrastructure" and "profits taxes" has changed across petroleum and mineral contracts in recent years

To guide potential users through the techniques available on the site and through its application programming interface (API), we use an illustrative research inquiry. Here we are interested in the evolution of key terms in resource contracts, to understand which specific terms may have become more or less prevalent over time. This can be a first step for a citizen or government official to understand how trends in resource contracts might compare to contractual terms in their own countries.


First, we queried the prevalence (count) of the term "infrastructure" across all the contracts in our sample. You can do so by searching for the term using the main site search tool, or by using the scripted Elasticsearch query function via the API.
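As a rough illustration of what a scripted query for this count might look like, the sketch below builds an Elasticsearch request body in Python. It only constructs and prints the JSON; it does not contact any server. The field name `text` and the helper `build_term_count_query` are assumptions made for this example, not the site's documented schema, so check the API documentation for the real field names and endpoint before use.

```python
import json


def build_term_count_query(term: str, text_field: str = "text") -> dict:
    """Build an Elasticsearch request body that counts documents containing `term`.

    `text_field` is a hypothetical field name; the real index mapping may differ.
    """
    return {
        "size": 0,                 # return no documents, only the hit count
        "track_total_hits": True,  # request an exact total (newer Elasticsearch versions)
        "query": {"match_phrase": {text_field: term}},
    }


query_body = build_term_count_query("infrastructure")
print(json.dumps(query_body, indent=2))
```

The `match_phrase` clause also works for multi-word terms such as "social infrastructure", because it requires the words to appear together and in order.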

Three hundred and thirteen contracts contain the term "infrastructure." Contracts mostly mention the term in the sections specific to infrastructure requirements and provisions. In some cases, this constitutes part of an agreement with the host government to share or jointly pay for that infrastructure with other parties (commonly termed "multi-user" or "multi-access" provisions). Another context where the term occurs is the provision of "social infrastructure": facilities such as schools, hospitals and, in general, assets accommodating community services (42 of 578 total contracts).

Looking at the graph above, we observe several notable trends. First, overall the prevalence of the term "infrastructure" is higher in mining contracts than in hydrocarbon (petroleum) contracts. (Forty-eight percent of mining contracts compared with 36 percent of hydrocarbons contracts contain at least one mention of the term.) Second, prevalence rises over time, with higher relative occurrence in the 2010-2014 period compared to earlier periods for both mining and hydrocarbons. Third, its prevalence in mining contracts has grown the most rapidly in our sample period: it begins at a lower or similar relative prevalence compared to petroleum and, by the end of the period, records a significantly larger prevalence.
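The prevalence figures discussed here are shares: the number of contracts in a group that mention the term at least once, divided by all contracts in that group. A minimal sketch of that calculation follows. The records are made up for illustration and are not drawn from the database, and the five-year bins are an assumption chosen to line up with the 2010-2014 period mentioned above.

```python
from collections import defaultdict

# Made-up records for illustration only: (sector, signature_year, text).
contracts = [
    ("mining", 2003, "The company shall build rail infrastructure."),
    ("mining", 2012, "Shared-use infrastructure is open to third parties."),
    ("mining", 2013, "Royalties are payable quarterly."),
    ("hydrocarbons", 2004, "The contractor shall pay income tax."),
    ("hydrocarbons", 2011, "Pipeline infrastructure remains state property."),
    ("hydrocarbons", 2012, "Cost oil is recovered annually."),
]


def period_of(year: int, width: int = 5) -> str:
    """Map a year to a fixed-width bin label, e.g. 2012 -> '2010-2014'."""
    start = year - year % width
    return f"{start}-{start + width - 1}"


def prevalence(records, term: str) -> dict:
    """Share of contracts per (sector, period) that mention `term` at least once."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sector, year, text in records:
        key = (sector, period_of(year))
        totals[key] += 1
        if term.lower() in text.lower():  # case-insensitive substring match
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


shares = prevalence(contracts, "infrastructure")
for (sector, period), share in sorted(shares.items()):
    print(f"{sector:13s} {period}  {share:.0%}")
```

Counting each contract once, rather than counting every occurrence of the term, keeps long contracts from dominating the comparison.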

The general upward trend in both petroleum and mining contracts could be explained by two factors. First, commodity prices rose significantly during the 2000-2011 period, including for petroleum and many minerals. This increased the attractiveness of extraction for investors, increased rents and may have given governments sufficient bargaining power to ask for increased infrastructure provisions by companies, or to propose new models for infrastructure provision (e.g., multi-access or multi-use forms). Second, the rise in prices led investors to seek opportunities in frontier countries that typically do not have the infrastructure required to support major extraction projects. Thus, in general, contracts record a rising prevalence of infrastructure requirements as servicing extraction sites becomes an increasingly important consideration.

Profits tax

Second, we queried using the term “profits tax” and its synonym “income tax.” (Here we report the results jointly.) These terms relate to a variety of common tax types used in extractives contracts to capture revenues for government; they are associated with the profitability (or income) a company is earning from its extraction activities. In particular, these terms relate to the existence of excess profits taxes in addition to the more commonly used corporate income taxes.
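Reporting the two terms jointly means a contract counts once if it contains either phrase. The sketch below shows one way to do that with Python's standard `re` module. The helper name `mentions_any` and the sample sentences are hypothetical, and the matching rule (case-insensitive, tolerant of line breaks between words) is an assumption rather than the method used for the figures in this post.

```python
import re

SYNONYMS = ["profits tax", "income tax"]


def mentions_any(text: str, phrases) -> bool:
    """True if any phrase occurs in `text`, ignoring case and allowing any
    whitespace (including line breaks) between the words of a phrase.

    Whole-word matching means plural forms such as "income taxes" do not match.
    """
    for phrase in phrases:
        words = (re.escape(word) for word in phrase.split())
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


# Made-up sentences for illustration only.
samples = [
    "The contractor is subject to Income  Tax at the general rate.",
    "An additional profits\ntax applies once costs are recovered.",
    "A royalty is payable on gross production.",
]
joint_count = sum(mentions_any(text, SYNONYMS) for text in samples)
print(joint_count, "of", len(samples), "sample texts mention either term")
```

Scanned contracts often contain irregular spacing and line breaks, which is why the pattern joins words with `\s+` instead of a single space.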

We observe that a much lower proportion of mining contracts than hydrocarbons contracts include the words “profits tax” and “income tax”: on average, 14 percent of mining contracts compared with 55 percent of hydrocarbons contracts.

This marked difference might be the result of systematic differences in the way governments seek to tax the respective sectors. A lower prevalence of “income tax” or “profits tax” in contracts could occur when companies are subject to generally applicable legislation, for example standard corporate income tax, thus reducing the need for contracts to discuss such terms. It might therefore be more common for hydrocarbons contracts to set out exceptions to the generally applicable legislation or to include a rule that is separate from the general legislation. This in turn may be driven by the differences in rents available: in hydrocarbons, governments may be more inclined to set sector-specific profits taxes to capture the higher excess profits associated with oil extraction, compared to mineral extraction. Indeed, according to the IMF, the effective tax rates were higher for hydrocarbons than for minerals during this period, which may support this hypothesis.

Research techniques

The illustrative research cases presented above are simple examples of the possibilities that the site offers as a data source. What makes this contract repository a promising platform for new research is the ability to search, download and analyze the digital text of each document it contains, including with powerful query tools such as Elasticsearch. This allows for a wide variety of queries and research tasks. We summarize three options:

  • Full-text search is made possible by turning optically scanned PDF pages into searchable text, thanks to high-quality optical character recognition techniques. When source documents are of too poor a quality to optically transform to text, they are transcribed via Mechanical Turk, a paid service from Amazon. Full-text search across all contracts can be used to filter contracts based on a specific term in the text, allowing for some preliminary exploratory analysis and suggestion of directions of research. Annotations can be searched as well, making the database an even more precise research engine.
  • Another essential feature of the site is the way in which each contract has been tagged with relevant attributes, making metadata filtering possible. Currently, it is possible to filter contracts based on signature year, country, resource, company name, corporate group, contract type and annotation group.
  • The third type of research that the site has made possible is quantitative text analytics, as showcased above. Text analytics can be defined as the use of one or more methods for drawing statistical inferences from text populations. This approach combines the use of simple full-text search and the information stored in the metadata. It can range from simple count techniques (as seen above) to more sophisticated treatments, such as correspondence analysis and classification methods for document clustering.
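The first two options above can be combined in a few lines once contract text and metadata have been downloaded: filter on metadata, then search the remaining text. The sketch below uses a hypothetical `Contract` record with a small subset of the metadata fields listed above and made-up sample data. It is not the site's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical record holding a few metadata fields plus the contract text."""
    country: str
    resource: str
    signature_year: int
    text: str


def filter_contracts(contracts, term=None, **metadata):
    """Keep contracts whose metadata equals every keyword argument given and,
    if `term` is set, whose text contains it (case-insensitive)."""
    selected = []
    for contract in contracts:
        if any(getattr(contract, field) != value for field, value in metadata.items()):
            continue
        if term is not None and term.lower() not in contract.text.lower():
            continue
        selected.append(contract)
    return selected


# Made-up sample data for illustration only.
sample = [
    Contract("Country A", "gold", 2012, "Social infrastructure includes schools."),
    Contract("Country A", "oil", 2012, "Pipeline infrastructure is shared."),
    Contract("Country B", "gold", 2005, "Royalties are payable quarterly."),
]
matches = filter_contracts(sample, term="infrastructure", resource="gold")
print([(c.country, c.signature_year) for c in matches])
```

The filtered list can then feed any of the count-based or more sophisticated techniques described in the third option.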


There are many ways to take this research further: potential researchers could focus on a certain jurisdiction or on a region and combine text analysis with insights from national and international legislation—current laws as well as those in place when the contract was signed. Alternatively, the focus could be on a certain topic (e.g., a certain type of provision or clause) across different countries and/or companies.

Has this post prompted you to think about how you can use this data? We are currently awarding grants of up to USD 10,000 for researchers, journalists and civil society actors who wish to use the site for applied research questions or investigations. Learn more here. If you have an idea, we hope to hear from you!

Giorgia Cecchinato is an NRGI research and data associate. Jim Cust is the director of research and data.