Extracting stories from public DATA that was formerly unstructured and buried in PDFs.
After finding out that the Senate had published its expenses since 2004 as raw PDFs, some of them scanned images and all of them completely unstructured, the LA NACION data team scraped, transformed, normalized and structured three datasets into one. The team then began an interrogation process that led to front page stories, replies from current and former Argentine vice presidents (who also serve as Senate presidents), and a judicial investigation into Vice President Amado Boudou regarding these expenses.
Thanks to building this dataset from scratch and analysing dates, we also found out that some expenses for official trips were filed with overlapping dates, and that some even covered trips that were never made.
I. THE PROCESS
Detect -> Scrape -> OCR (transform PDFs of varying quality, all of them images, some very noisy and some password protected) -> Design a data model -> Parse data -> Structure data -> Analyze DATA -> Report and... keep it updated!
The source of raw data: the LA NACION Data team built a dataset from more than 30,000 PDFs gathered from three different sources of Senate expense reports:
a) Senate presidential decrees (DP)
b) Senate administration department (DGA)
c) Senate accounting department (DC)
The following video gives a detailed explanation (in English) of how this dataset was built step by step: how documents are scraped and transformed, how we handled noisy PDFs delivered as images by using different OCR solutions, and how we managed to structure and use this information for reporting.
The whole process is divided into several phases, each of which requires different tools:
1) Download the PDF files (scanned images of paper documents). We developed an application based on Excel macros (VBA, Visual Basic for Applications) that connects to the site and searches 4 different sections for all the PDFs, or only for new ones.
2) Remove the PDFs' protection against printing and copying.
3) Convert the PDFs to searchable files with OmniPage 18 (batch processing).
4) Analyze data. The same application described in 1) opens the TXT files and searches within them for names of senators, names of companies, amounts of money (in pesos, USD, euros and pounds), dates, and specific keywords (such as PURCHASE, DIRECT PURCHASE, SECURITY AGENTS, TRAVEL, FURNITURE, AIR or GROUND TRANSPORT, etc.). It inserts the full text and each of these entities into separate columns, assigning one row to each of the 33,000 TXT files.
5) The 33,000-row worksheet obtained in 4) can be used to investigate simply by applying Excel filters to narrow the scope.
6) Based on the Excel worksheet described in 4), a new macro analyzes the 33,000 rows searching for “SECURITY AGENTS” and writes to a new worksheet the name of the senator, the number of bodyguards, the destination (national or international), the date range, and the amount of money requested.
7) The worksheet described in 6) was imported into Microsoft Project to generate a Gantt chart that showed the distribution of the trips on a timeline, along with their suspicious overlaps.
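To make the entity extraction in step 4) more concrete, here is a minimal sketch in Python. It is not the team's Excel/VBA application: the regular expressions, the function names (`parse_amount`, `extract_row`) and the sample text are all assumptions made for illustration, and real OCR output would be in Spanish and far noisier.

```python
import re

# Keywords quoted in step 4); plain substring matching is an assumption.
KEYWORDS = ["DIRECT PURCHASE", "PURCHASE", "SECURITY AGENTS", "TRAVEL", "FURNITURE"]

# Hypothetical patterns: a currency marker followed by a number written with
# "." as thousands separator and "," as decimal separator, and dd/mm/yyyy dates.
AMOUNT_RE = re.compile(r"(USD|EUR|GBP|\$)\s?((?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d{2})?)")
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def parse_amount(text):
    """Convert a number such as '1.500,50' to the float 1500.5."""
    return float(text.replace(".", "").replace(",", "."))


def extract_row(filename, text):
    """Build one spreadsheet-like row (a dict) for one OCR'd text file."""
    upper = text.upper()
    return {
        "file": filename,
        # A text containing DIRECT PURCHASE also matches PURCHASE.
        "keywords": [k for k in KEYWORDS if k in upper],
        "amounts": [(cur, parse_amount(num)) for cur, num in AMOUNT_RE.findall(text)],
        "dates": ["/".join(parts) for parts in DATE_RE.findall(text)],
        "full_text": text,
    }


# Invented sample text, not taken from the Senate documents.
sample = "SECURITY AGENTS for TRAVEL on 03/02/2012, advance of USD 10.000 and $ 1.500,50."
print(extract_row("sample.txt", sample))
```

Each returned dictionary plays the role of one worksheet row, so filtering rows on the `keywords` column would correspond roughly to the Excel filters in step 5) and the SECURITY AGENTS macro in step 6).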
II. THE REPORTING
In the analysis and reporting stage we worked with two journalists from the Politics section: Laura Serra and Maia Jastreblansky.
Three big stories about the Vice President's expenses came out of the same dataset. The first was about the number of bodyguards and assistants he took on his trips, the second was about the expensive Italian furniture he bought for his office using emergency funds, and the third was about strange overlapping dates in his travel expenses.
DATA and Journalism: what we found out
Laura Serra, a reporter from the Politics section, wrote the first of a series of articles based on this dataset. These stories are about Senate president Boudou (who is also the country's vice president), his spending on bodyguards, and the companions (assistants) who travelled with him, whom we found through further data analysis.
As an example, for a one-day conference in Switzerland he travelled for 6 days with 4 bodyguards and 7 assistants, at a cost of 100,000 USD. The bodyguards took 10,000 USD “for unexpected events” and spent 10,820 USD.
For this first part of the story, published on February 10th, we sorted the Tableau visualization by amount of money spent rather than by date. The date analysis was presented afterwards.
Impact: That same night Vice President Amado Boudou responded on the official TV Publica channel (video), showing a stack of papers that he said were the decrees that LA NACION could not find online.
The following day Laura Serra, our politics data journalist, was invited by a news channel (video) to explain this detailed investigation in person.
Audience Engagement: This story, told in three articles, gathered more than 3000 comments and 6000 likes on Facebook.
In the same week, on Feb 15th 2013, another reporter from the politics section, Maia Jastreblansky, found in the dataset that the Vice President had replaced the furniture in his offices, spending twice the permitted amount in direct purchases. She also found that this had not been disclosed to the judge who was investigating the matter.
Again, the Vice President alleged that he had received the office in bad shape, a claim that previous Vice President Cobos again denied publicly.
Impact: In this recent publication, a judge announces that he will reopen the case on the Vice President's excessive spending.
User engagement: These stories gathered more than 8,753 comments and 11,285 Facebook likes.
On April 3rd LA NACION revealed how dates overlap in Vice President Boudou's travel expense reports, using a Gantt chart in an interactive visualization made with Tableau Public.
The visualization also shows expense reports for cancelled trips.
We extracted these expenses from the same PDF dataset and included the original documents in the article to support the story.
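The date check behind such a Gantt chart can be sketched in a few lines of Python. This is only an illustration of the idea, not the Excel, Microsoft Project and Tableau workflow the team actually used; the function name `find_overlaps` and the sample trips are invented for the example.

```python
from datetime import date
from itertools import combinations


def find_overlaps(trips):
    """Return pairs of trip labels whose inclusive date ranges intersect."""
    overlaps = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(trips, 2):
        # Two inclusive ranges overlap when each one starts on or before
        # the day the other one ends.
        if a_start <= b_end and b_start <= a_end:
            overlaps.append((a, b))
    return overlaps


# Invented sample rows (label, start, end), not figures from the Senate dataset.
trips = [
    ("Trip A", date(2012, 3, 1), date(2012, 3, 6)),
    ("Trip B", date(2012, 3, 5), date(2012, 3, 9)),
    ("Trip C", date(2012, 4, 1), date(2012, 4, 3)),
]
print(find_overlaps(trips))
```

Comparing every pair of trips is quadratic in the number of rows, which is acceptable for one official's trips. A larger table could be sorted by start date first so that each trip is only compared with its neighbours.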
User Engagement: this story, just published, got more than 1500 comments and 2200 Facebook likes.
III. DATA Home page for Senate Expenses stories
We decided to build a home page by integrating a Data-topical-TAG in our CMS, so that we could gather and present all the Senate expenses stories extracted from this dataset, as well as the impact of the investigation in the courts.
The Senate expenses project is part of our strategy to bring data to life and help journalism and citizens go through the details and stories hidden in data.
In a country with neither a FOIA law nor open data portals, ranked 102 out of 180 in the Corruption Perceptions Index, LA NACION believes that media must be proactive and open data to promote a change towards transparency and innovation. The LA NACION DATA initiative was born to develop data journalism in Argentina and to open data as we report. Using public data will drive demand for more public data.
IV. Team Members. More on LNdata.
Laura Serra: Reporter, LA NACION Politics section
Maia Jastreblansky: Reporter, Lanacion.com Politics Section
Ivan Ruiz: Reporter, Lanacion.com Politics Section
Ricardo Brom: Senior developer and Data engineer, LA NACION DATA
Mariana Trigo Viera: Interactive designer in chief, Lanacion.com
VIDEO: How we open data in a country without open data government portals or a FOIA law (presented at Strata Conference 2013, Santa Clara, California)