INCAA II – Step by step, how we built these datasets from scratch

>>> PREVIOUS: Subsidies to Argentina's movie industry 2008-2014

Scraping and building this dataset from scratch, by Ricardo Brom


  • We built this dataset with data from 2008 onward, and we keep it updated today

  • The source is more than 70 PDF files

  • Some PDFs are monthly, some are annual and some cover a whole semester

Some challenges:

The PDFs came in different formats and data layouts. For example, some had 5 columns:

while others had 4 columns:
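A common way to reconcile layouts like these is to pad the narrower rows to one shared schema before merging the files. The Python sketch below is only an illustration: the column names, the position of the missing column and the sample values are hypothetical, because the actual PDF layouts are not described in the text.

```python
# Hypothetical target schema: the real column names are not given in the text.
TARGET_COLUMNS = ["col_a", "col_b", "col_c", "col_d", "col_e"]


def normalize_row(row, missing_index=2):
    """Return a 5-field row; 4-field rows get None inserted at missing_index.

    missing_index is an assumption: it marks where the 4-column layout
    lacks a field compared with the 5-column layout.
    """
    if len(row) == len(TARGET_COLUMNS):
        return list(row)
    if len(row) == len(TARGET_COLUMNS) - 1:
        padded = list(row)
        padded.insert(missing_index, None)
        return padded
    raise ValueError(f"unexpected number of columns: {len(row)}")


def normalize_table(rows, missing_index=2):
    """Map every extracted row to a dict keyed by the target column names."""
    return [
        dict(zip(TARGET_COLUMNS, normalize_row(row, missing_index)))
        for row in rows
    ]
```

Rows that match neither layout raise an error instead of being silently padded, so a new PDF format is noticed as soon as it appears.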


LA NACION DATA: Open Data Journalism for Change

In 2014-2015 LA NACION continued its effort to use data as a new raw material for journalism and to contribute to open data in Argentina, both through its reporting and as a way to stimulate demand for more public information in a country that still has no FOI law.

After another year of training and consistent work on LA NACION's data project, opportunities for new projects appeared. We continued reporting from datasets that our team built from scratch and then transformed, opened and shared in print, online and on social media.


VozData II: an online collaborative crowdsourcing platform for investigative reporting

VozData is a collaborative tool that converts public documents trapped in closed formats into a structured database. The processed information is accessible to a general audience and can be analysed and reported on by journalists. The application was inspired by The Guardian's “MPs' Expenses” and ProPublica's “Free the Files”.

VozData's first initiative was Senate Expenses (Argentina), divided into 3 different periods. Over the course of a couple of months, LA NACION digitized more than 10,000 PDFs. The work was carried out by 1,000 volunteers. The data obtained was published online in real time as rankings of recipients and types of expense. The platform also includes a ranking of the users who review and classify documents.

At the end of each data project, LA NACION's data team reviewed representative samples and published the dataset for download in open data formats (CSV, XLS, etc.). The code driving VozData was open sourced by OpenNews Fellows and named Crowdata.

LA NACION's aggregated reporting on Senate Expenses, including findings from VozData projects.

VozData video demo in English

CIVIC OPEN COLLABORATION: Partnering with NGOs, universities and the general public!

Students from the Masters in Journalism at Universidad Torcuato di Tella during a civic marathon at LA NACION

In May 2014 we started a campaign to try to finish “the stack of PDFs” from the first set of Senate Expenses during #SemanadeMayo, a patriotic week that culminates on May 25th. That historic day in 1810 is associated with the phrase “El Pueblo Quiere Saber” (“The people want to know”). So we decided to organize what we called “Civic Marathons” to open data using VozData.

We made banners for social media, which were shared via the LA NACION and LNdata Twitter and Facebook accounts.


Public officials' salaries and assets for reporting and accountability

“Click to access the integrated Statements of Assets tag”

LA NACION decided to fight for transparency in public officials' salaries and declarations of assets. We feel that, even in a country without a FOIA, journalists and citizens must know, and share what they know, about how politicians earn their money and how they compare with others or with other periods. This tool is also helping to detect cases of corruption involving public spending and companies owned by officials' relatives or friends.

To aggregate stories, we integrated the news application with the open dataset and our CMS, using a tag that gathers all the stories published about these declarations of assets or about the salaries of the president and ministers, which we also requested and opened.

Main stories and data:


Argentina's Senate Expenses 2004-2013

Extracting stories from public data that was formerly unstructured and locked in PDFs.

After finding out that the Senate had been publishing its expenses since 2004 in raw PDFs, some of them images and all of them completely unstructured, LA NACION's data team managed to scrape, transform, normalize and structure three datasets into one. It then began a process of interrogating the data that produced front-page stories, drew replies from current and former vice presidents of Argentina (who preside over the Senate), and prompted a judicial investigation into Vice President Amado Boudou over these expenses. This series of front-page stories led to more stories and to different approaches to keeping the Senate accountable.

As we converted these PDFs into OCR text files, we realized that a lot of information was being lost because the PDFs were very noisy. We also realized that there would be more stories if more eyes helped us classify and enter this data. So we decided to ask for help: inspired by The Guardian's MPs' Expenses and ProPublica's Free the Files, we asked our 2013 Knight-Mozilla OpenNews Fellow, Manuel Aristaran, to help us develop “VozData”, a platform for crowdsourcing data in a structured way.

He developed “Crowdata” together with Gabriela Rodriguez, our 2014 OpenNews Fellow. We launched our first VozData project, “Senate Expenses”, with a dataset of more than 6,700 PDFs that took two months to process.

To get that done, we again asked for collaboration and activated our community by organizing two “Civic Marathons” with NGOs, universities and users. See all the details here.

At the same time, one of our journalists heard that the number of Senate employees had grown sharply. Because our data team had been scraping the lists of permanent, temporary and contracted Senate employees for 30 months (since November 2011), we could release a unique, original analysis that became a new finding backed by data and visualizations. In this period, Senate employees and contractors went from 3,700 to 5,700, which meant growth of 55%. Again, Vice President Amado Boudou replied to these articles through the official channel on national TV, but he could not deny any of the numbers in the reporting.

Here are the details of the data collection and analysis process, along with the articles and visualizations.

Because the Senate Expenses stories were so many, and some of them are still under judicial investigation, we decided to put them all together on a tag home page.

So here is the process and how we did this, together with new stories:

Play it in HD!

Because we built this dataset from scratch and analysed the dates, we also found that some expense claims for official trips were filed with overlapping dates, and that some even covered trips that were never made.
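A date check of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the team's actual procedure: the record layout (person, start date, end date), the function name and the sample trips are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def find_overlapping_trips(trips):
    """Return pairs of trips by the same person whose date ranges overlap.

    Each trip is a (person, start_date, end_date) tuple. Dates are treated
    as inclusive, so a trip that starts on the day another ends counts as
    an overlap.
    """
    by_person = defaultdict(list)
    for person, start, end in trips:
        by_person[person].append((start, end))

    overlaps = []
    for person, ranges in by_person.items():
        ranges.sort()  # order by start date, then end date
        for i, (start_a, end_a) in enumerate(ranges):
            for start_b, end_b in ranges[i + 1:]:
                if start_b > end_a:
                    break  # sorted by start: no later trip can overlap trip a
                overlaps.append((person, (start_a, end_a), (start_b, end_b)))
    return overlaps


# Hypothetical sample data, for illustration only.
sample_trips = [
    ("Official A", date(2012, 3, 1), date(2012, 3, 5)),
    ("Official A", date(2012, 3, 4), date(2012, 3, 8)),
    ("Official B", date(2012, 3, 1), date(2012, 3, 2)),
]
flagged = find_overlapping_trips(sample_trips)
```

Here `flagged` holds the single pair of Official A's trips, which share March 4th and 5th. A check like this only flags candidates: each pair still has to be verified against the original documents.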


Argentina's official advertising funds distribution 2009 – 2013: friends, politicians and a stylist.

This data journalism project built, transformed and opened the first normalized and comprehensible dataset of official advertising in Argentina, covering this four-year period and grouped by the companies' shareholders, who were researched in a separate dataset. It produced front-page and full-page stories and home page news that had a lot of impact. It was led by data journalist José Crettaz, working with LA NACION's data team and its interactive and infographics departments.

So this is the story:

More than 2,000 companies and individuals received official advertising in Argentina between 2009 and 2013, but 50% of the total went to 10 media groups … the ones closest to the Government.

Only seven companies received more than 100 million pesos in this period, including three national TV channels (first, third and fourth by audience), four cable news channels and several radio stations. But that list includes none of the largest and most traditional newspapers in Argentina. Even a hairdresser (stylist) received more advertising money than these newspapers.

Discrimination against independent media is putting freedom of the press at risk in Argentina, not only through the arbitrary distribution of official ads now documented by this data journalism project, but also through private advertisers being ordered by the government to stop advertising in the country's top newspapers, in a bid to weaken independent media companies.

All of this comes on top of journalists suffering layoffs and threats from public media or from media that receive most of the official ads, as well as journalists being harassed in public.

The dataset was built from scratch with raw data published more than two years late. After two FOIA-style requests (Argentina still has no FOIA law) from LA NACION and transparency NGOs in Argentina, the Jefatura de Gabinete de Ministros released one year of data in two half-yearly PDFs of 30 or more pages each. LA NACION's data team added this new information to a three-year dataset that had been transformed, normalized, cleaned, enriched and then opened, and that is again available for everyone in Argentina to reuse.


How Argentina's Senate grew its number of permanent and temporary employees by 55%

On December 10th, 2011, Vice President Amado Boudou started his administration. After several articles were published about a large number of employees being added to the National Senate, in January 2012 we developed an Excel application in Visual Basic for Applications. Run every month, it recorded the evolution of the Senate's temporary, permanent and contracted staff, as well as the details of each category, which differ from one another.





The application performed several tasks. First, it loaded each month's data into a separate Excel sheet, labelled with the date of the data collection.

Then, for each date, we added up the headcount of each type of contract to observe how it evolved.
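The original tool was an Excel application written in VBA. As a rough equivalent, the same tally can be sketched in Python. The record layout, the function names and the sample records below are assumptions made for illustration and are not taken from the team's code.

```python
from collections import Counter, defaultdict


def headcount_by_date(records):
    """Count employees per snapshot date and staff type.

    Each record is a (snapshot_date, staff_type) pair, one per employee
    listed in that month's scrape.
    """
    totals = defaultdict(Counter)
    for snapshot_date, staff_type in records:
        totals[snapshot_date][staff_type] += 1
    return {day: dict(counts) for day, counts in totals.items()}


def percent_growth(first_total, last_total):
    """Percentage change between two headcounts."""
    return (last_total - first_total) / first_total * 100


# Hypothetical sample records, for illustration only.
sample_records = [
    ("2011-11", "permanent"),
    ("2011-11", "temporary"),
    ("2011-12", "permanent"),
    ("2011-12", "permanent"),
    ("2011-12", "contracted"),
]
monthly = headcount_by_date(sample_records)
```

Keeping one count per date and staff type mirrors the one-sheet-per-month design: each snapshot stays intact, and the evolution of a category is read by comparing the same key across dates.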
