INCAA II – Step by step, how we built these datasets from scratch

>>> PREVIOUS: Subsidies to the Argentina´s movies industry 2008-2014

Scraping and building this dataset from scratch, by Ricardo Brom

 

  • We built this dataset starting from 2008, and we keep it updated today

  • It draws on more than 70 PDF files

  • Some PDFs are monthly, some are annual and some cover a whole half-year

Some challenges:

The PDFs had different formats and different data layouts. For example, some had 5 columns:

while others had 4 columns:


LA NACION DATA: Open Data Journalism for Change

In 2014-2015 LA NACION continued its effort to use data as a new raw material for journalism and to contribute to open data in Argentina, both through its reporting and as a way to stimulate demand for more public information in a country that still has no FOI law.

After another year of training and consistent work on LA NACION's data project, opportunities for new projects appeared. We continued reporting from datasets built from scratch and then transformed, opened and shared by our team in print, online and on social media.



Vozdata II: crowdsourcing online collaborative platform for investigative reporting

VozData is a collaborative tool to convert public documents trapped in closed formats into a structured database. The processed information is suitable for a general audience to understand and for journalists to analyse and report on. The application was inspired by The Guardian's “MPs' Expenses” and ProPublica's “Free the Files”.

VozData's first initiative was Senate Expenses (Argentina), divided into 3 periods. Over the course of a couple of months, LA NACION digitized more than 10,000 PDFs. The work was carried out by 1,000 volunteers. The data obtained was published online in real time as rankings of recipients and types of expense. The platform also includes a ranking of the users who review and classify documents.

At the end of each data project, LA NACION's data team reviewed representative samples and published the dataset in open data formats for download (CSV, XLS, etc.). The code driving VozData was open sourced by OpenNews Fellows and named Crowdata.

Aggregated LA NACION’s reporting on Senate Expenses, including findings in Vozdata projects.

VozData video demo in English

CIVIC OPEN COLLABORATION: Partnering with NGOs, universities and the general audience!

Students of the Universidad Torcuato di Tella Masters in Journalism during a civic marathon at LA NACION

In May 2014 we started a campaign to try to finish “the stack of PDFs” of the first set of Senate Expenses during #SemanadeMayo, a patriotic week that culminates on May 25th. That historic day in 1810 is associated with the phrase “El Pueblo Quiere Saber” (“The people want to know”). So we decided to organize what we named “Civic Marathons” to open data using VozData.

 

We made banners for social media, which were shared via the LA NACION and LNdata Twitter and Facebook accounts.


Public officials' salaries and assets for reporting and accountability

“Click to access the integrated Statements of Assets tag”

LA NACION decided to fight for transparency in public officials' salaries and asset declarations because we feel that, even in a country without a FOIA, journalists and citizens must know and share what they know about how politicians earn their money, and how they compare with others or with other periods. This tool is also helping to detect cases of corruption involving public spending and companies owned by officials' relatives or friends.

To aggregate stories, we integrated the news application with the open dataset and Lanacion.com's CMS using a tag that gathers all the stories published about these Declarations of Assets and about the salaries of the president and ministers, which we also requested and opened.

Main stories and data:


Argentina's Senate Expenses 2004-2013

Extracting stories from public data that was formerly unstructured and locked in PDFs.

After finding out that the Senate had published its expenses since 2004 in raw PDFs, some of them scanned images and completely unstructured, LA NACION's data team managed to scrape, transform, normalize and structure three datasets into one. It then began an interrogation process that produced front page stories, drew replies from current and former vice presidents of Argentina (who preside over the Senate), and prompted a judicial investigation into Vice President Amado Boudou regarding these expenses. This series of front page stories led to more stories and to different approaches to keeping the Senate accountable.

As we converted these PDFs into text files with OCR, we realized that a lot of information was being lost because the PDFs were very noisy. We also realized that there would be more stories if more eyes helped us classify and enter this data. So we decided to ask for help: inspired by The Guardian's MPs' Expenses and ProPublica's Free the Files, we asked our 2013 Knight-Mozilla OpenNews Fellow, Manuel Aristaran, to help us develop “VozData”, a platform for crowdsourcing data in a structured way.

He developed “Crowdata” working together with Gabriela Rodriguez, our 2014 OpenNews Fellow. We launched our first “Senate Expenses” VozData project with a dataset of more than 6,700 PDFs that took two months to process.

To get that done, we again asked for collaboration and activated our community by organizing two “Civic Marathons” with NGOs, universities and users. See all the details here.

At the same time, one of our journalists heard that the number of Senate employees had grown sharply. Because our data team had been scraping the lists of permanent, temporary and contracted Senate employees for 30 months (since November 2011), we could release a unique and original analysis, a new finding supported by data and visualizations. In this period, Senate employees and contractors went from 3,700 to 5,700, which meant growth of 55%. Again, Vice President Amado Boudou replied to these articles through the official channel on national TV, but he could not deny any of the numbers in the reporting.

Here are the details of the data collection and data analysis process, together with the articles and visualizations.

Because there were many Senate Expenses stories, and some of them are still under judicial investigation, we decided to gather them all on a tag home page:

http://www.lanacion.com.ar/gastos-en-el-senado-t49163

So here is the process and how we did this, together with new stories:

Play it in HD!

Thanks to building this dataset from scratch and analysing dates, we also found out that some expenses for official trips were filed with overlapping dates, and some even covered trips that were never made.
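The kind of date check that surfaces overlapping trips can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual method: the record layout, the names and the dates below are all invented, and the real dataset's fields may differ.

```python
from datetime import date
from itertools import combinations

# Hypothetical trip records: (traveller, destination, start, end).
# Names and dates are invented for illustration only.
trips = [
    ("Senator A", "Cordoba", date(2012, 3, 1), date(2012, 3, 5)),
    ("Senator A", "Salta", date(2012, 3, 4), date(2012, 3, 8)),
    ("Senator A", "Mendoza", date(2012, 4, 10), date(2012, 4, 12)),
    ("Senator B", "Rosario", date(2012, 3, 2), date(2012, 3, 3)),
]


def overlapping_trips(records):
    """Return pairs of trips by the same traveller whose date ranges intersect."""
    by_person = {}
    for rec in records:
        by_person.setdefault(rec[0], []).append(rec)

    flagged = []
    for person_trips in by_person.values():
        for a, b in combinations(person_trips, 2):
            # Two closed date ranges overlap when each starts
            # on or before the day the other one ends.
            if a[2] <= b[3] and b[2] <= a[3]:
                flagged.append((a, b))
    return flagged


conflicts = overlapping_trips(trips)
for a, b in conflicts:
    print(f"{a[0]}: {a[1]} ({a[2]} to {a[3]}) overlaps {b[1]} ({b[2]} to {b[3]})")
```

A flagged pair is only a lead for a reporter to verify, since a trip with several legs could legitimately produce overlapping claims.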


Argentina's official advertising funds distribution 2009 – 2013: Friends, politicians and a stylist.

This data journalism project built, transformed and opened the first normalized and comprehensible dataset of official advertising in Argentina, covering this 4-year period and grouped by company shareholders, who were researched in a separate dataset. It produced front page and full page stories, and home page news on Lanacion.com, that had a lot of impact. It was led by data journalist José Crettaz, working together with LA NACION's data team and LA NACION's interactive and infographics departments.

So this is the story:

More than 2,000 companies and individuals received advertising in Argentina between 2009 and 2013, but 50% of the total went to 10 media groups … the ones closest to the Government.

Only seven companies received more than 100 million pesos in this period, including three national TV channels (first, third and fourth by audience), four cable news channels and several radio stations. But that list includes none of the largest and most traditional newspapers in Argentina. Even a hairdresser (stylist) received more advertising money than these newspapers.

Independent media face discrimination and freedom of the press is at risk in Argentina, not only because of the arbitrary distribution of official ads now documented by this data journalism project, but also because private advertisers are being ordered by the government to stop advertising in the country’s top newspapers, in a bid to weaken independent media companies.

All these actions come on top of journalists suffering layoffs and threats from public media or from media receiving most of the official ads, and journalists being harassed in public.

The dataset was built from scratch with raw data published more than two years late. After two FOIA-style requests (Argentina still has no FOIA law) from LA NACION and transparency NGOs in Argentina, the Jefatura de Gabinete de Ministros released one year of data in two half-year PDFs of 30 or more pages each. LA NACION's data team added this new information to a three-year dataset that had been transformed, normalized, cleaned, enriched and then opened, and that is again available for everyone in Argentina to reuse.


How Argentina's Senate grew its number of permanent and temporary employees by 55%

On January 10th, 2011, Vice President Amado Boudou started his administration. After several articles were published about the hiring of a large number of employees by the National Senate, in January 2012 we developed an Excel application in Visual Basic for Applications that, run every month, would record the evolution of the Senate's transitory, permanent and contracted staff, as well as the details of each workforce, which differ from one another.

DATA: STEP BY STEP STARTING FROM RAW

Permanent:

Transitory:

Contracted:

The application performed several tasks. First, it loaded each month's data into a separate Excel sheet, labelled with the date of the data collection.

Then we added up the headcount of each type of employment by date to observe how it evolved.
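The original tool was an Excel application written in VBA. The Python sketch below only illustrates the same aggregation idea: count employees per collection date and employment type, then compare totals over time. The row layout, the employee identifiers and the helper names `headcount_by_date` and `growth_pct` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical monthly snapshots: (collection date, employment type, employee id).
# All values are invented for illustration only.
snapshots = [
    ("2011-11", "permanent", "P001"),
    ("2011-11", "permanent", "P002"),
    ("2011-11", "transitory", "T001"),
    ("2011-11", "contracted", "C001"),
    ("2011-12", "permanent", "P001"),
    ("2011-12", "permanent", "P002"),
    ("2011-12", "transitory", "T001"),
    ("2011-12", "transitory", "T002"),
    ("2011-12", "contracted", "C001"),
]


def headcount_by_date(rows):
    """Count employees per collection date and employment type."""
    counts = defaultdict(lambda: defaultdict(int))
    for collected, kind, _employee in rows:
        counts[collected][kind] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {collected: dict(kinds) for collected, kinds in counts.items()}


def growth_pct(first, last):
    """Percentage growth from the first headcount to the last."""
    return (last - first) / first * 100


evolution = headcount_by_date(snapshots)
totals = {collected: sum(kinds.values()) for collected, kinds in evolution.items()}

for collected in sorted(evolution):
    print(collected, evolution[collected], "total:", totals[collected])

months = sorted(totals)
print(f"growth: {growth_pct(totals[months[0]], totals[months[-1]]):.1f}%")
```

Keeping one row per employee and snapshot, rather than only the monthly totals, also makes it possible to see who joined or left between two dates.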


Making public the salaries of the President and ministers in less than 24 hours

Press release announcing the publication of the salaries of the President and ministers


From the beginning we wanted to make the most of Declaraciones Juradas Abiertas (Open Asset Declarations), the site we created to present, in an understandable and accessible way, the statements of wealth of the main officials of the three branches of government. That is why we thought of working with their salaries.

The investigation started when we wanted to analyze the evolution of the salaries of the president and her ministers for the period 2012-2014. The first thing we did was to search for this information in the asset statements, which contain a specific field where officials have to give this information in detail. Here we found a first problem: some specified their monthly wages instead of the annual amount. To solve it, we decided to request the information from the office of the General Secretary of the Presidency, making use of the current regulation on access to public information (decree 1172/03).

In 2012 this agency had published on its website a form specifying the gross and net salaries of these officials. In the process we learned that an NGO had asked for the same information a year before and had been refused, with the argument that the salaries were personal data. After brief meetings with members of the newspaper's staff, we agreed to ask for the same information in a different manner. Thus, we asked for the latest version of the document in which the Presidency reported the remuneration of the highest posts in the executive branch.

The answer was again negative: “We are informing you that salaries are considered sensitive information, according to the law 25.326 (law of personal data).” So we turned this refusal into a story. On February 20 we reported what had happened (The presidency denied the information on Cristina Kirchner and her ministers' salaries).

Less than 10 hours after this was published, the Presidency issued a press release stating that “there had been an error in not providing the information required relative to the current salary of Madam President, Dr. Cristina Fernández de Kirchner and her ministers, and that by her express indication they have been published, with the corresponding apology”.

At the same time, the state agency published on its website the document in PDF format with the current information.

To make the data more understandable and add value for the reader, the PDF was converted into text (with an OCR tool) and the information was put online in a spreadsheet. The following day a table was added with a search tool and the option of downloading the data in CSV, XLS and PDF formats (“Cristina published her salary: She earns 48366 pesos per month”).

In journalistic terms, by turning a refusal to provide information into a story, we got the National Government to publish data of great interest to all citizens.

All the stories on the case

  1. Presidency denied information on Cristina Kirchner and her ministers' salaries
  2. After LA NACION published its story, presidency changed tack and published the salaries of Cristina Kirchner and her ministers
  3. President Cristina Kirchner and her ministers' salaries
  4. Cristina published her salary. She earns 48366 per month
  5. Presidency reverses its attitude. It informed that Cristina Kirchner earns 48366
  6. Oscar Parrilli: “The salary of the president is increased if there is an increase in the salaries of civil servants”
  7. Capitanich justified the salaries of the cabinet

Some repercussions of the story

The next day the news was on the front page of the main print media.

The cabinet chief justified Cristina’s and the ministers' salaries in a press conference.


President’s private secretary resigns after his wealth comes to light

Thanks to a new regulation on statements of wealth approved in 2013, all material on the wealth of officials is published online on the Anticorruption Office site. From there, at the beginning of 2014, we took the asset statements of the President's private secretaries: Martín Federico Aguirres and Pablo Erasmo Barreiro. Both, according to the documents they presented, were able to increase their assets substantially in 2012. According to an analysis by journalist Iván Ruiz, the former tripled his wealth, while Barreiro increased his by 70%.

Alongside the official information, we searched different press archives to further investigate the assets of these private secretaries.

IMPACT

Almost 4 months after the publication of the story, Aguirres resigned from his position following accusations of illicit enrichment against him.


Salaries of officials of the City of Buenos Aires: a minister's complaint, a payslip and the wrong data from the Government

The Government of the City of Buenos Aires has a catalog of data where it publishes information in open format to make different areas of its administration more open to the public eye.

We began searching the available datasets for those most relevant to the public interest.

We decided to download the datasets related to the salaries of the cabinet officials and the chief of the Buenos Aires administration.

Working with the database

As a first task, we downloaded the two files with the salaries for 2012 and 2013 and merged them into one Excel spreadsheet so we could calculate the monthly variations. The data was distributed across several columns to facilitate the analysis and to give an overall view of the whole set.

Because the published databases were clean, with no important mistakes, there was no need to do a lot of work on the information published by the City administration.

For a first story (More than 40 percent pay rise for BA ministers) we decided to calculate the percentage variations between July 2012 and July 2013, since the published information ran up to that month.

In the second story (Another pay rise for BA ministers: 20% in the last semester) with the complete 2013 data, the percentage difference was calculated for the period July-December 2013.

To facilitate the journalistic work, in both cases rankings by official were created according to salaries for all of 2012, and changes in ranking were analyzed according to salaries received in 2013.
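The two calculations described above, percentage variation between two months and a ranking by salary, can be sketched as follows. The original work was done in an Excel spreadsheet; this Python version is only an illustration, and the officials, the amounts and the helper names `pct_change` and `ranking` are invented for the example.

```python
# Hypothetical gross monthly salaries in pesos, keyed by (official, "YYYY-MM").
# Names and amounts are invented for illustration only.
salaries = {
    ("Official A", "2012-07"): 30000.0,
    ("Official A", "2013-07"): 42000.0,
    ("Official B", "2012-07"): 25000.0,
    ("Official B", "2013-07"): 36000.0,
    ("Official C", "2012-07"): 28000.0,
    ("Official C", "2013-07"): 35000.0,
}


def pct_change(table, start, end):
    """Percentage variation per official between two months, rounded to 1 decimal."""
    officials = sorted({name for name, _month in table})
    changes = {}
    for name in officials:
        before = table[(name, start)]
        after = table[(name, end)]
        changes[name] = round((after - before) / before * 100, 1)
    return changes


def ranking(table, month):
    """Officials ordered from highest to lowest salary in the given month."""
    month_rows = [(name, amount) for (name, m), amount in table.items() if m == month]
    return [name for name, _amount in sorted(month_rows, key=lambda r: r[1], reverse=True)]


changes = pct_change(salaries, "2012-07", "2013-07")
rank_2012 = ranking(salaries, "2012-07")
rank_2013 = ranking(salaries, "2013-07")

for name in rank_2013:
    moved = rank_2012.index(name) - rank_2013.index(name)
    print(f"{name}: {changes[name]}% rise, rank change {moved:+d}")
```

Comparing the same month in both years, as the first story did, avoids mixing in seasonal payments that would distort a month-to-month comparison.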

The visualizations

When the work on the database was done, the journalists and the multimedia design team met to create a visualization that would clearly show the conclusions they had arrived at.

For the first publication, the percentage salary rise between July 2012 and July 2013 was considered.

 

The second visualization worked with the information of the second semester of 2013.
