Extracting stories from public DATA formerly unstructured and in PDFs.
After finding out that the Senate had published its expenses since 2004 as raw PDFs, some of them images and all completely unstructured, the LA NACION data team managed to scrape, transform, normalize and structure three datasets into one. It then began an interrogation process that produced front-page stories, drew replies from current and former Argentine vice presidents (who also serve as Senate presidents), and prompted a judicial investigation of Vice President Amado Boudou over these expenses. This series of front-page stories led to more stories and to different approaches to keeping the Senate accountable.
As we converted these PDFs into OCR text files, we realized that a lot of information was being lost as a consequence of very noisy PDFs. We also realized that there would be more stories if more eyes helped us classify and enter this data. So we decided to ask for help: inspired by The Guardian's MPs' Expenses and ProPublica's Free the Files, we asked our 2013 Knight-Mozilla OpenNews Fellow, Manuel Aristaran, to help us develop “Vozdata”, a platform for crowdsourcing data in a structured way.
He developed “Crowdata” working together with Gabriela Rodriguez, our 2014 OpenNews Fellow. We launched our first Vozdata project, “Senate Expenses”, with a dataset of more than 6,700 PDFs that took two months to process.
To get through them, we again asked for collaboration and activated our community, organizing two “Civic Marathons” with NGOs, universities and users. See all the details here.
At the same time, one of our journalists heard that the number of Senate employees had grown sharply. Because our data team had been scraping the lists of permanent, temporary and contracted Senate employees for 30 months (since November 2011), we could release a unique and original analysis: a new finding backed by data and visualizations. In this period, Senate employees and contractors went from 3,700 to 5,700, which meant growth of 55%. Again, Vice President Amado Boudou replied to these articles through the official channel on national TV, but he could not deny any of the numbers in the reporting.
Here are the details of the data collection and data analysis process, along with the articles and visualizations.
Because there were many Senate Expenses stories, and some of them are still under judicial investigation, we decided to gather them all on a tag home page.
So here is the process and how we did it, together with new stories:
Thanks to building this dataset from scratch and analysing dates, we also found out that some expense claims for official trips had overlapping dates, and some even included trips that were never made.
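The date analysis mentioned above comes down to an interval-overlap check: two claims by the same person are suspicious if their date ranges intersect. The following is a minimal Python sketch of that idea, not the team's actual tooling. The `Trip` record, the `overlapping_trips` helper and all sample data are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Trip:
    claim_id: str    # hypothetical identifier of the expense claim
    traveller: str   # person the claim was filed for
    start: date      # first day of the trip
    end: date        # last day of the trip (inclusive)


def overlapping_trips(trips):
    """Return pairs of claim ids, for the same traveller, whose date ranges intersect."""
    by_traveller = {}
    for trip in trips:
        by_traveller.setdefault(trip.traveller, []).append(trip)

    pairs = []
    for person_trips in by_traveller.values():
        ordered = sorted(person_trips, key=lambda t: t.start)
        for a, b in combinations(ordered, 2):
            # Two inclusive ranges overlap when each one starts
            # on or before the day the other one ends.
            if a.start <= b.end and b.start <= a.end:
                pairs.append((a.claim_id, b.claim_id))
    return pairs


# Invented sample data, not taken from the Senate dataset.
trips = [
    Trip("A-001", "Traveller X", date(2013, 3, 1), date(2013, 3, 5)),
    Trip("A-002", "Traveller X", date(2013, 3, 4), date(2013, 3, 8)),
    Trip("A-003", "Traveller X", date(2013, 4, 1), date(2013, 4, 3)),
    Trip("B-001", "Traveller Y", date(2013, 3, 4), date(2013, 3, 6)),
]

print(overlapping_trips(trips))  # prints [('A-001', 'A-002')]
```

Comparing every pair is quadratic per traveller, which is fine for a few hundred trips per person. A flagged pair is only a lead for a reporter to check against the original document, not proof of an irregularity.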
I. THE PROCESS
Detect -> Scrape -> OCR (transform PDFs of varying quality: all images, some very noisy and some password-protected) -> Design a data model -> Parse data -> Structure data -> Analyze data -> Report and… keep it updated!
The source of raw data: the LA NACION Data team built a dataset from more than 37,000 PDFs gathered from three different sources of Senate expense reports:
a) Senate presidential decrees (DP)
b) Senate administration department (DGA)
c) Senate accounting department (DGC)
The following video gives a detailed, step-by-step explanation (in English) of how this dataset was built: how documents are scraped and transformed, how we handled the noisy PDFs that are presented as images using different OCR solutions, and how we managed to structure and use this information for reporting:
The whole process is divided into several phases, each of which requires different tools:
1) Download the PDF files (scanned images of paper documents) – We developed an application based on Excel macros (VBA, or Visual Basic for Applications). It connects to the site and searches four different sections for all the PDFs, or just for the new ones.
2) Remove the PDFs' protection against printing and copying.
3) Convert the PDFs to searchable files with OmniPage 18 (batch processing).
4) Analyze data – The same application described in 1) opens the TXT files and searches for names of senators, names of companies, amounts of money (in pesos, USD, euros and pounds), dates, and specific keywords (like PURCHASE, DIRECT PURCHASE, SECURITY AGENTS, TRAVEL, FURNITURE, AIR or GROUND TRANSPORT, etc.). It inserts the full text and each of these entities into different columns, assigning one row to each of the 37,000 TXT files.
5) The 37,000-row worksheet obtained in 4) can be used to investigate simply by applying Excel filters to narrow the scope.
6) Based on the Excel worksheet described in 4), a new macro analyzes the 37,000 rows searching for “SECURITY AGENTS” and records in a new worksheet the name of the senator, the number of bodyguards, the destination (national or international), the date range, and the amount of money requested.
7) The worksheet described in 6) was imported into Microsoft Project to generate a Gantt chart that showed, on a timeline, the distribution of the trips and their suspicious overlaps.
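Steps 4) and 6) above can be sketched in a few lines of Python. This is only an illustration of the idea, not the team's Excel/VBA application. The regular expressions, the amount and date formats, the helper names (`parse_amount`, `extract_row`, `security_agent_rows`), the file name and the sample text are all assumptions made for this example, and real OCR output would be noisier and in Spanish.

```python
import re
from decimal import Decimal

# Subset of the keywords listed in step 4 (illustrative, not the full list).
KEYWORDS = ["DIRECT PURCHASE", "SECURITY AGENTS", "TRAVEL", "FURNITURE"]

# Assumed formats: amounts such as "$ 12.500,00" or "USD 3.200",
# and dates such as "04/03/2013".
AMOUNT_RE = re.compile(
    r"(?P<currency>\$|USD|EUR|GBP)\s?(?P<value>\d{1,3}(?:\.\d{3})*(?:,\d{2})?)"
)
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def parse_amount(value):
    """Turn '12.500,00' (dot for thousands, comma for decimals) into a Decimal."""
    return Decimal(value.replace(".", "").replace(",", "."))


def extract_row(file_name, text, known_names):
    """Build one row per TXT file, with one column per kind of entity."""
    upper = text.upper()
    return {
        "file": file_name,
        "names": [n for n in known_names if n.upper() in upper],
        "amounts": [
            (m.group("currency"), parse_amount(m.group("value")))
            for m in AMOUNT_RE.finditer(text)
        ],
        "dates": DATE_RE.findall(text),
        "keywords": [k for k in KEYWORDS if k in upper],
        "full_text": text,
    }


def security_agent_rows(rows):
    """Step 6: keep only the rows that mention SECURITY AGENTS."""
    return [r for r in rows if "SECURITY AGENTS" in r["keywords"]]


# Invented sample text standing in for one OCR'd document.
sample = (
    "Decree 123/13. SECURITY AGENTS assigned to Senator Jane Doe for TRAVEL "
    "from 04/03/2013 to 08/03/2013. Amount requested: $ 12.500,00 and USD 3.200."
)
row = extract_row("dp_0123.txt", sample, ["Jane Doe", "John Roe"])
print(row["names"], row["dates"], row["keywords"], row["amounts"])
```

Matching names against a known list, rather than trying to detect names freely, keeps false positives low on noisy OCR text. Amounts are parsed into `Decimal` so that later sums do not pick up floating-point rounding errors.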
II. THE REPORTING
In the analysis and reporting stage we worked with three journalists from the Politics section: Laura Serra, Ivan Ruiz and Maia Jastreblansky. All these stories come from the same dataset. Some are about the vice president's expenses and others are about other senators' expenses.
Last year we had three impact stories: the first was about the number of bodyguards and assistants Vice President Boudou took on his trips, the second was about the expensive Italian furniture he bought for his office using emergency funds, and the third was about strange overlapping dates in his trip expense claims.
Story No. 1) In 2013 Boudou doubled his trips to other countries
Story No. 3) The number of Senate employees has grown 55% since Boudou took charge (in English)
III. DATA Home page for Senate Expenses stories
We decided to build a home page by integrating a data topic tag into our CMS, so we could gather and present all the Senate expenses stories extracted from this dataset, as well as the impact of the investigation in the courts:
The Senate expenses project is part of our strategy to bring data to life and help journalists and citizens go through the details and stories hidden in data.
In a country without a FOIA law and ranked 106th out of 175 in the Corruption Perceptions Index, LA NACION believes that media must be proactive and open up data to promote a change towards transparency and innovation. The LA NACION DATA initiative was born to develop data journalism in Argentina and to open data as we report. Using public data will drive demand for more public data.