Draft Contract Filenaming Convention
Purpose of a File Naming Convention
The contracts in the initial OpenOil repository have been gathered from more than 60 sources across the Internet. Regardless of their content, they come in different formats (image PDFs, scrapeable PDFs, flat text files), and little attention has been paid to file naming, perhaps in part because contract disclosure, where it has happened to date, has usually been perceived as a one-off or isolated event.
But as contract transparency gains ground as a norm or best practice in extractive industries, it becomes more important to create information infrastructure that will support disclosure at scale. OpenOil has therefore prototyped a file naming convention in this repository. No convention will be effective in the long run, of course, if it is not widely adopted by consensus. This convention should therefore be seen as no more than a draft which serves the immediate purpose of ordering the contracts curated so far, and of promoting discussion of whether a convention at the level of the file would be useful and, if so, what its optimal form would be.
Below are some brief considerations:
- Many-to-many exchange. One key purpose of the convention, as with most naming conventions, is to enable file sharing on a many-to-many scale. As more contracts are disclosed by more entities, it will become increasingly unlikely that any one database or repository will contain a complete and up-to-date archive. Imagine, for example, that individuals or groups decide to search deeper into financial regulator websites, such as that of the Securities and Exchange Commission in the USA, and succeed in finding contracts that they want to work on. They may or may not upload them to, or inform, current repositories in real time. So contracts may lie on the Internet unutilised. But if a sufficiently efficient file naming convention were in broad use, any user wanting to find a contract relating to a particular concession or field area anywhere in the world could simply construct a putative filename and Google it with a reasonable chance of success, no matter where the file was stored.
- Human hints. A convention would also allow users to see something of the scope of the contract from the name alone, and thereby make it easier to look through directories. The overall number of contractual documents likely to be released into the public domain is likely to be small compared to “big data sets”, even with optimistic assumptions about the progress of contract transparency. This means that it will be some time before the number of petroleum contracts released becomes so large that it is unfeasible to see what is available simply by scrolling through flat directories at the level of country. Look through the current repository listings, for example. Once a user understands that “cn” as the first element in the filename means China, it would take them two minutes to find and start downloading all the public domain contracts from China.
- Machine-to-machine. If contract transparency advances strongly, at some stage there could be large scale disclosures of hundreds of documents at once. A file naming convention would enable machine-to-machine transfer and update at such a scale.
- Integration into Open Contracting Partnership. A convention around naming at the level of the file is clearly only a small potential sub-set of useful metadata that could be attached to petroleum contracts. In this context, the ongoing work of the Open Contracting Partnership, and in particular its current efforts to extend a core Open Contracting Data Standard to the extractive industries, is key. We are keen to contribute to the debate to define a process which achieves the optimum mix of richness of metadata against any constraints that may attach to institutions and parties that choose to embrace contract transparency at an institutional level. A file naming convention would only be one small part of such a richer definition.
- Continued unco-ordinated disclosures. At the same time, two important constraints are likely to remain which lend a file naming convention some utility for some time. First, disclosure is likely to be happening outside OCP frameworks for some time to come - either through continued corporate disclosures, which could remain entirely outside the scope of OCP, or by governments outside the context of OCP in response to other imperatives. While, therefore, the amount of order and richness of metadata that a naming convention at the level of a file can create is extremely limited, it may remain the biggest single contributor of meta-data across large parts of the corpus of public domain contracts for some time to come.
The overall approach of this file naming convention is to encode as much information in human-readable form in the file name as fits within 50 characters. The contracts that have entered the public domain have done so from multiple sources. Since no standard method exists in the public domain to reference concession blocks globally, the elements of the file name are divided into two categories. There are three normative elements at the start which are mandatory and unambiguous: jurisdiction, concession block name or number as used by the jurisdiction granting the concession, and date. In addition, two non-normative elements are appended: a typology of the kind of contractual document the file is, and one or more of the companies who signed it. These elements are optional and are not normative, in the sense that both could be subject to more than one implementation, for reasons stated in more detail in the description of elements below.
- Core unambiguous (in theory) elements. The three core elements identify a contractual document against a unique concession in the area and the date on which it was signed. The dating sequence in particular offers indications of succession and sequence within the contractual framework applied to any concession block.
- Contract-area related. This standard can therefore only apply to contracts which relate to specific concession blocks. The initial publication of the archive consists only of Host Government Contracts of one kind or another, where one of the signatories is either the government itself or a government-nominated party such as a state-owned national oil company. In theory, this convention could be extended to contracts between two private companies if the contract pertained to specific concession blocks. An example would be a gas purchase agreement between company A and company B relating to gas produced under a separate contract, such as a Production Sharing Contract, between company A and a government.
- Suggested, not prescriptive. Finally, the use of this convention is not intended to be prescriptive, but rather to open a discussion as to what an optimal file naming convention could be. The archive can be renamed in batch to any other structured file format that subsequently emerges as an optimal standard.
The draft standard uses the two-letter ISO 3166-1 codes, written in lower case, for naming countries.
- So for example Angola = “ao” and Yemen = “ye”.
In cases where a contract was signed by a sub-national entity, this element would default to the ISO 3166-2 convention.
- So for example if the region of Papua in Indonesia were to be the direct signatory, the jurisdiction denominator would be “id-ij”.
A derogation has been created in this initial release to reference contracts signed by the Kurdistan Regional Government in Iraq, since although Iraq does have ISO 3166-2 codes for its 18 governorates, the KRG does not map unambiguously to them. Nor does it seem appropriate to code these contracts with Iraq’s ISO 3166-1 code “iq”, since the federal government of Iraq does not recognise their legitimacy. A custom jurisdiction denominator “krg” has therefore been used.
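The shape of the jurisdiction element described above can be checked mechanically. The following is a minimal sketch only: the function name `is_valid_jurisdiction` and the regular expression are illustrative assumptions, not part of the draft convention, and the check covers the form of the code (two lower-case letters, an optional hyphenated subdivision of one to three characters, or the custom “krg”), not whether the code is actually assigned by ISO.

```python
import re

# Assumed shape: ISO 3166-1 two-letter code, optionally followed by a hyphen
# and a 1-3 character ISO 3166-2 subdivision code, or the custom code "krg".
JURISDICTION_RE = re.compile(r"(?:krg|[a-z]{2}(?:-[a-z0-9]{1,3})?)")


def is_valid_jurisdiction(element: str) -> bool:
    """Return True if the element has the shape of a jurisdiction denominator.

    This checks form only; it does not look the code up in the ISO lists.
    """
    return JURISDICTION_RE.fullmatch(element) is not None
```

A fuller implementation would look the code up against the published ISO 3166 lists rather than relying on its shape alone.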
Contract Area Name
By preference, the contract area is named as it is referred to in the contract itself (for the avoidance of doubt, this takes precedence in case of conflict with the concession area as named in a block allocation map or other resource from a government source).
- For example, a Production Sharing Contract for coal bed methane in China signed by Pacific Asia Petroleum refers on the title page to the “Zijinshan Area”. The first two elements of the filename are therefore “cn_Zijinshan-Area”.
If the same contract covers multiple concession areas, they are listed with underscores between them, in the order in which they appear in the contract, up to a maximum of three.
- For example, a consortium led by BP signed a Production Sharing Contract with the government of Azerbaijan. The first two elements of that document in the repository are “az_Azeri_Chirag_Deepwater-Gunashli”, indicating that it refers to the Azeri and Chirag fields and the deepwater portion of the Gunashli field.
If there are more than three concession blocks, the fourth and subsequent blocks are not entered in the filename.
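The rules for the contract area element can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a reference implementation: the function name `contract_area_element` and its `max_areas` parameter are hypothetical, and the input is assumed to be the area names exactly as they appear in the contract, in contract order.

```python
from typing import Sequence


def contract_area_element(areas: Sequence[str], max_areas: int = 3) -> str:
    """Build the contract area element from area names in contract order.

    Words within one area name are joined with hyphens, separate areas are
    joined with underscores, and areas beyond the maximum are dropped.
    """
    kept = list(areas)[:max_areas]
    return "_".join("-".join(name.split()) for name in kept)
```

For the two examples above, `contract_area_element(["Zijinshan Area"])` gives `Zijinshan-Area`, and `contract_area_element(["Azeri", "Chirag", "Deepwater Gunashli"])` gives `Azeri_Chirag_Deepwater-Gunashli`.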
Big Ended Date
The big-ended (big-endian) date format yyyymmdd means that filenames sort chronologically in file directories without any further effort.
- Examples: D-Day in World War Two: 19440606; Kennedy assassination: 19631122; storming of the Bastille: 17890714; traditional beginning of the Christian era: 00001225
The date is preceded by “dd” so that text parsing can tell where the preceding concession names, of which there may be more than one, end and the date begins.
If no month and day are available, pad out mmdd with zeros. As this is an impossible actual date, it clearly indicates that only the year was available at the time of naming.
- So for example, in the contract signed by Geopark with the government of Chile, only the year was immediately available, so the first three elements of the filename read: “cl_Fell-Block-Area_dd20050000”.
The date component allows the chronological sequence of contractual documents to be built in a concession block where more than one document is available.
- So for example in the Essaouira offshore block in Morocco there are two documents from successive years: the normative elements of the first one are “ma_Essaouira-Offshore_dd20110909” and of the second “ma_Essaouira-Offshore_dd20121219”. They therefore appear in time sequence in any directory containing both.
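The date rules above (the “dd” prefix, zero padding for unknown month and day, and chronological sorting) can be illustrated as follows. The function name `date_element` is a hypothetical helper introduced here for illustration, not something defined by the draft convention.

```python
from typing import Optional


def date_element(year: int, month: Optional[int] = None,
                 day: Optional[int] = None) -> str:
    """Build the date element: "dd" followed by yyyymmdd.

    An unknown month or day is padded with zeros, as the convention describes.
    """
    mm = month if month is not None else 0
    dd = day if day is not None else 0
    return f"dd{year:04d}{mm:02d}{dd:02d}"


# Because the date is big-ended, a plain string sort of filenames that share
# the same jurisdiction and contract area is also a chronological sort.
names = [
    "ma_Essaouira-Offshore_" + date_element(2012, 12, 19),
    "ma_Essaouira-Offshore_" + date_element(2011, 9, 9),
]
ordered = sorted(names)
```

Here `date_element(2005)` yields `dd20050000`, matching the Chilean example, and `ordered` lists the 2011 document before the 2012 one.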
Non-normative Elements (optional)
These elements have been added to give extra information where possible. They refer to the kind of contract signed and the company signing.
Condensed Name of Contract
A normative approach was considered but rejected as being too complex for this filenaming convention given other constraints. Although petroleum contracts are frequently classified into three main categories (royalty/tax, production sharing and risk service agreement), in practice there are many other kinds of agreement, and even agreements which are primarily one kind of contract can include elements of another. For example, many PSCs also include royalties. There is the further complication that the convention should be able to capture not just main Host Government Contracts but also all ancillary agreements. Finally, classification according to any normalised system would require considerable time spent analysing each contract by someone with extensive domain experience and knowledge of the language of the contract. This was not feasible.
Instead, therefore, we adopted a non-normative approach. This element therefore maps onto however the contract refers to itself on the title page.
But there was a second problem. Frequently the title is too long to be included in full in a filename along with all the other elements. So we adopted a “condensation” approach to abbreviate it.
- For example, the contract signed by Hupecol in Colombia retrieved from the SEC website has on its title page: “Hydrocarbon Exploration and Production Contract”. Therefore this element appears in the repository as “Ex-Prod-Contract”, or, with the preceding elements, “co_La-Cuerva_dd20080416_Ex-Prod-Contract”.
While it would be possible in theory to create a set of unambiguous rules around condensation, in practice there are so many permutations against such a heterogeneous data set that the complexity of formulating and applying them would defeat the purpose. This element is therefore subjective. We have therefore collected all the common abbreviations we have used in this file here, together with their expression in full.
Name of Company
This is another element whose purpose is not to provide unambiguous machine-readable encoding but additional information useful to a human. It also covers a highly complex area, since the distinction between a business group such as BP and the individual legal affiliate which signs a contract such as “BP Exploration Holdings Limited” is not widely understood, and often not referenced even in official or authoritative contexts, such as government or company documents. We therefore adopted a similar subjective “condensation” approach as with the name of the contract.
Also, although all contracts themselves contain the full legal names of the signing entities, it would take a large amount of time to find and transcribe them and would add to the length of the filename. So we use only the shortest common name by which a company is known.
- For example, a contract amendment for the East Wadi Araba concession in Egypt cites the full legal names of the signing entities, including “Dover Investments Limited”, “Transpacific Petroleum Corporation”, “Mogul Energy Limited” and “Sea Dragon Energy Limited”. But the amendment provides contractions of these names on the title page itself, respectively “Dover”, “Transpacific”, “Mogul” and “Sea Dragon”. The filename therefore uses these contracted names and is “eg_East-Wadi-Araba_dd20060413_Amending-Agreement_Dover_Mogul_Sea-Dragon.pdf”.
It can be noted from the above that we do not include all company names. There are a maximum of three names, chosen subjectively. We anticipate acquiring the full legal names of all signing entities to public domain contracts by the end of the first quarter 2015. It is our view that once full legal affiliates are established any open data repository of contracts should use the OpenCorporates naming convention to describe corporate entities, as it provides a simple unique global identifier which is dereferenceable, and does not depend for maintenance on any one organisation. Under this scenario, for example, any metadata for a contract to which BP Exploration Company Limited was a signatory would include the identifier gb/SC000792.
- Underscore separates one element from the next. So: de_Leipzig_
- Hyphen separates words within the same element. So: Production-Sharing-Contract
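Taken together, the separators and the “dd” date prefix make a filename straightforward to split back into its elements. The sketch below is one possible parser, not part of the convention: the names `ContractName` and `parse_filename` are hypothetical, and it assumes that the first element after the date, if present, is the condensed contract name and that anything after it is a company name.

```python
import os
import re
from typing import List, NamedTuple, Optional

DATE_RE = re.compile(r"dd\d{8}")


class ContractName(NamedTuple):
    jurisdiction: str
    areas: List[str]
    date: str                      # yyyymmdd, without the "dd" prefix
    contract_type: Optional[str]   # assumed to be the first optional element
    companies: List[str]           # assumed to be the remaining elements


def parse_filename(filename: str) -> ContractName:
    """Split a filename into its elements using the underscore separator."""
    stem = os.path.splitext(filename)[0]   # drop an extension such as ".pdf"
    parts = stem.split("_")
    # The "dd"-prefixed date marks the end of the contract area names.
    date_idx = next(
        (i for i, part in enumerate(parts) if DATE_RE.fullmatch(part)), None
    )
    if date_idx is None or date_idx < 2:
        raise ValueError(f"no jurisdiction, area and date found in {filename!r}")
    tail = parts[date_idx + 1:]
    return ContractName(
        jurisdiction=parts[0],
        areas=parts[1:date_idx],
        date=parts[date_idx][2:],
        contract_type=tail[0] if tail else None,
        companies=tail[1:],
    )
```

Applied to the Egyptian example, `parse_filename("eg_East-Wadi-Araba_dd20060413_Amending-Agreement_Dover_Mogul_Sea-Dragon.pdf")` returns the jurisdiction `eg`, the single area `East-Wadi-Araba`, the date `20060413`, the contract type `Amending-Agreement` and the companies `Dover`, `Mogul` and `Sea-Dragon`. Because the optional elements are non-normative, the split between contract type and company names is a guess whenever a filename omits the contract name.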