The International Consortium of Investigative Journalists (ICIJ) has made its Offshore Leaks Database available for direct downloads as a set of comma-separated-values (CSV) files. That type of text file can be easily imported into a spreadsheet like Microsoft Excel or read by network-analysis software packages like Gephi. A more ambitious way to analyse and visualise the data would be to write your own code, in one of three programming languages most popular among data journalists: Python, R or Processing.
In order to use the data directly, we have to first familiarise ourselves with their content and format. ICIJ download includes 4 data files and a “readme” document, describing the data. The data depict a network of relationships between individuals and offshore corporations. Each individual, corporation and relationship has been assigned an unique ID number for the purpose of keeping track of them. Each has several attributes. For example, an individual has a name, a function he or she fulfils (master client, officer, etc.) and an address. A corporation would have a name, an address and additional information about its tax status, jurisdiction, whether it’s still active etc.
An individual or an offshore corporation (also referred to as an “entity”) would be a “node” in the network, while a relationship between them would be referred to as an “edge”. A node, depicted as a dot, may have several edges coming out of it, while an edge – a line – would always connect only two nodes.
There are four data files in the database:
nodes.csv – includes data on 251,405 nodes of the network: individuals, entities or addresses.
countries.csv – lists 275 countries with their identifiers and codes used in the other files.
node_countries.csv – assigns a country code to each node
edges_1direction.csv – lists 525,157 network edges, or relationships between nodes. Each edge has a starting node and an ending node, therefore it has also a direction.
One should think of it as a table containing rows and columns. Each row would be a node and the fields, in this case separated by semicolons, correspond to columns. The first row contains the column headers.
In order to get a better handle on the contents of this file, let’s import it into Microsoft Excel, by opening an empty spreadsheet and selecting
File -> Import. We must be careful, though, to make a couple small adjustments in step 2 of the process, since the data are separated by semicolons, not commas, and surrounded by quotation marks. The adjustments are illustrated here:
When the process is complete, the data will be presented in a spreadsheet with 251,406 rows. Many fields will be empty because they are not relevant to the particular node (for example only entities have “jurisdiction”, but individuals don’t), or because some data have been withheld by ICIJ. This is what the first few rows of the spreadsheet look like:
In order to be able to analyse the data in aggregate, we need to import all four files into the same Excel workbook but into separate sheets, using exactly the same steps as above. We’re creating a 92.3-MByte file, so not all computers will be able to handle this task. Microsoft Excel is an extremely powerful tool in the hands of experts, but even without much experience, there are ways of retrieving useful information from data, as we’ll see in the next post.
Originally published on December 20, 2013