Extracting stories from public data that was formerly unstructured and locked in PDFs.
After finding out that the Senate had published its expenses since 2004 as raw PDFs, some of them scanned images and all of them completely unstructured, the LA NACION data team managed to scrape, transform, normalize and structure three datasets into one. We then began an interrogation process that produced front page stories, drew replies from current and former vice presidents of Argentina (who also serve as Senate presidents), and prompted a judicial investigation into vice president Amado Boudou over these expenses. This series of front page stories led to more stories and to different approaches to keeping the Senate accountable.
As we converted these PDFs into OCR text files, we realized that a lot of information was being lost because the PDFs were very noisy. We also realized that there would be more stories if more eyes helped us classify and enter this data. So we decided to ask for help: inspired by The Guardian's MPs' Expenses and ProPublica's Free the Files, we asked our 2013 Knight-Mozilla OpenNews Fellow, Manuel Aristaran, to help us develop "Vozdata", a platform for crowdsourcing data in a structured way.
He developed "Crowdata" together with Gabriela Rodriguez, our 2014 OpenNews Fellow. We launched our first Vozdata project, "Senate Expenses", with a dataset of more than 6,700 PDFs that took two months to process.
To get that done, we again asked for collaboration and activated our community, organizing two "Civic Marathons" with NGOs, universities and users. See all the details here.
At the same time, one of our journalists heard that the number of Senate employees had grown sharply. Because our data team had been scraping the lists of permanent, temporary and contracted Senate employees for 30 months (since November 2011), we could release a unique and original analysis that became a new finding backed by data and visualizations. In this period, Senate employees and contractors went from 3,700 to 5,700, a growth of about 55%. Again, vice president Amado Boudou replied to these articles through the official channel on national TV, but he could not deny any of the numbers in the reporting.
Here are the details of the data collection and data analysis process, along with the articles and visualizations.
Because there were many Senate Expenses stories, and some of them are still under judicial investigation, we decided to gather them all on a tag home page:
http://www.lanacion.com.ar/gastos-en-el-senado-t49163
So here is the process and how we did it, together with new stories:
Play it in HD!
Thanks to building this dataset from scratch and analysing the dates, we also found that some official trip expenses were filed with overlapping dates, and some even covered trips that were never made.
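A date check of this kind can be sketched in a few lines of Python. This is only an illustration, not the code the team ran: the `Trip` record, its field names and the `find_overlaps` helper are hypothetical, and it assumes each expense has already been structured into a traveler plus an inclusive start and end date.

```python
from datetime import date
from itertools import groupby
from typing import NamedTuple


class Trip(NamedTuple):
    """Hypothetical structured record for one trip expense."""
    traveler: str
    start: date
    end: date  # inclusive last day of the trip


def find_overlaps(trips):
    """Return pairs of trips by the same traveler whose date ranges overlap."""
    overlaps = []
    # Sort so that each traveler's trips are adjacent and ordered by start date.
    ordered = sorted(trips, key=lambda t: (t.traveler, t.start, t.end))
    for _, group in groupby(ordered, key=lambda t: t.traveler):
        group = list(group)
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if second.start > first.end:
                    # Trips are sorted by start date, so no later trip
                    # can overlap `first` either.
                    break
                overlaps.append((first, second))
    return overlaps
```

Called on a list of such records, `find_overlaps` returns each suspicious pair, which a reporter could then check against the original PDF. Treating the end date as inclusive means two trips that share a single day are flagged; whether that counts as a real conflict is an editorial decision.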