Data Section in lanacion.com

LA NACION has added a new DATA section to the main navigation bar on the home page of lanacion.com.

LNData is a place where people in Argentina can find stories built on open data and data journalism. Some of the datasets are extracted from public or government sites; most are built, transformed, normalized and/or manually typed by the LNData team.

The data section includes a data homepage, an open data catalog, a data blog, our data journalism projects, dataviz, the @LNdata Twitter account, a Facebook page, and daily reporting with data articles. Continue reading


Argentina’s Senate Expenses 2004-2013

Extracting stories from public data that was formerly unstructured and locked in PDFs.


After finding out that the Senate has published its expenses since 2004 as raw PDFs, some of them images and all completely unstructured, the LA NACION data team managed to scrape, transform, normalize and structure three datasets into one. The team then began interrogating the data, a process that produced front-page stories, drew replies from current and former vice presidents of Argentina (who preside over the Senate), and prompted a judicial investigation into Vice President Amado Boudou regarding these expenses.

By building this dataset from scratch and analysing the dates, we also found that some expenses for official trips were filed with overlapping dates, and that some of the trips claimed were never made.
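The date analysis described above can be illustrated with a minimal sketch. This is not the team's actual code: the record layout, the names and the sample trips are all hypothetical, and the check only assumes that each trip has a person, a start date and an end date.

```python
from datetime import date
from itertools import combinations

# Hypothetical trip records: (official, destination, start, end).
trips = [
    ("Official A", "Cordoba", date(2012, 3, 1), date(2012, 3, 5)),
    ("Official A", "Salta", date(2012, 3, 4), date(2012, 3, 8)),
    ("Official A", "Mendoza", date(2012, 4, 10), date(2012, 4, 12)),
    ("Official B", "Rosario", date(2012, 3, 2), date(2012, 3, 3)),
]


def find_overlaps(records):
    """Return pairs of trips by the same person whose date ranges intersect."""
    overlaps = []
    for a, b in combinations(records, 2):
        same_person = a[0] == b[0]
        # Two closed intervals intersect when each starts on or before the other ends.
        intersects = a[2] <= b[3] and b[2] <= a[3]
        if same_person and intersects:
            overlaps.append((a, b))
    return overlaps


for first, second in find_overlaps(trips):
    print(first[0], ":", first[1], "overlaps", second[1])
```

Comparing every pair is quadratic in the number of trips, which is fine for a sketch; a larger dataset would more likely be grouped by person and sorted by start date first.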

  

 

I. THE PROCESS

Detect -> Scrape -> OCR (transform PDFs of varying quality: all of them images, some very noisy and some password-protected) -> Design a data model -> Parse data -> Structure data -> Analyze data -> Report and... keep it updated!

The source of raw data: the LA NACION Data team built a dataset from more than 30,000 PDFs gathered from three different sources of Senate expense reports:

a) Senate presidential decrees (DP)

b) Senate administration department (DGA)

c) Senate accounting department (DC)
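Merging the three sources into a single dataset could be sketched roughly as follows. Everything here is an assumption made for illustration: the field names (`fecha`, `importe`, `detalle`), the `Expense` model and the sample rows are hypothetical and do not reflect the team's real schema.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Expense:
    source: str       # "DP", "DGA" or "DC"
    day: date
    amount: float     # pesos
    description: str


def parse_amount(text):
    """Convert an Argentine-style amount such as '1.234,50' to a float."""
    return float(text.replace(".", "").replace(",", "."))


def normalize(source, raw):
    """Map one raw row (hypothetical keys) onto the shared Expense model."""
    return Expense(
        source=source,
        day=datetime.strptime(raw["fecha"], "%d/%m/%Y").date(),
        amount=parse_amount(raw["importe"]),
        # Collapse the stray whitespace that OCR output often contains.
        description=" ".join(raw["detalle"].split()),
    )


# Hypothetical rows, as they might look after OCR and parsing.
raw_by_source = {
    "DP": [{"fecha": "05/03/2012", "importe": "1.234,50", "detalle": "Pasajes  aereos"}],
    "DGA": [{"fecha": "06/03/2012", "importe": "800,00", "detalle": "Viaticos"}],
    "DC": [{"fecha": "05/03/2012", "importe": "1.234,50", "detalle": "Pasajes aereos"}],
}

dataset = [normalize(src, row) for src, rows in raw_by_source.items() for row in rows]
dataset.sort(key=lambda e: (e.day, e.source))
```

Keeping the `source` field on every record makes it possible to cross-check the same expense across the three departments once everything shares one model.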

The following video gives a detailed explanation (in English) of how this dataset was built, step by step: how the documents were scraped and transformed, how we handled noisy PDFs presented as images using different OCR solutions, and how we managed to structure the information and use it for reporting:

 

The whole process is divided into several phases, each of which requires different tools: Continue reading

News Application: Statements of assets from Argentina’s main public officials

 

How can we show the statements of assets from public officials in a user-friendly way?

That was the first question we asked ourselves when we started the project. After four months of intensive work, on January 13th we launched an interactive application that lets users read a large volume of information and make comparisons through an easy visual interface, including the option to explore every original PDF document in detail using DocumentCloud.

The result combines an end-to-end data journalism process: FOIA requests, 100% manual data entry, data checking, data analysis, journalism, database design, interface design, programming and data mining. The project therefore involved more than 10 people with very different profiles. Continue reading
