Argentina’s Senate Expenses 2004-2013

Extracting stories from public data that was previously unstructured and locked in PDFs.


After finding out that the Senate had been publishing its expenses since 2004 as raw PDFs, some of them as images, and completely unstructured, the LA NACION data team managed to scrape, transform, normalize and structure three datasets into one. The team then began an interrogation process that led to front-page stories, replies from current and former Argentine vice presidents (who also serve as Senate presidents), and a judicial investigation of Vice President Amado Boudou over these expenses.

Because we built this dataset from scratch and analysed the dates, we also found that some expense claims for official trips were submitted with overlapping dates, and some even covered trips that were never made.
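As an illustration of the kind of date analysis described above, here is a minimal Python sketch that flags trips by the same person whose date ranges overlap. It is not the team's actual code: the record fields (`id`, `person`, `start`, `end`) and the sample trips are hypothetical, and two trips that share a single day are treated as overlapping.

```python
from collections import defaultdict
from datetime import date


def find_overlapping_trips(trips):
    """Return (id, id) pairs of trips by the same person whose dates overlap."""
    by_person = defaultdict(list)
    for trip in trips:
        by_person[trip["person"]].append(trip)

    overlaps = []
    for person_trips in by_person.values():
        # Sort by start date so we can stop scanning as soon as a later
        # trip starts after the current one has ended.
        ordered = sorted(person_trips, key=lambda t: t["start"])
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                if second["start"] > first["end"]:
                    break  # no later trip can overlap `first`
                overlaps.append((first["id"], second["id"]))
    return overlaps


# Hypothetical sample data, for illustration only.
trips = [
    {"id": "A", "person": "Official 1", "start": date(2012, 3, 1), "end": date(2012, 3, 5)},
    {"id": "B", "person": "Official 1", "start": date(2012, 3, 4), "end": date(2012, 3, 8)},
    {"id": "C", "person": "Official 1", "start": date(2012, 3, 10), "end": date(2012, 3, 12)},
    {"id": "D", "person": "Official 2", "start": date(2012, 3, 4), "end": date(2012, 3, 6)},
]

print(find_overlapping_trips(trips))
```

A check like this only becomes possible once the dates have been extracted from the PDFs into structured fields, which is why the dataset had to be built first.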

  

 

I. THE PROCESS

Detect -> Scrape -> OCR (transform PDFs of varying quality: all of them images, some very noisy and some password protected) -> Design a data model -> Parse data -> Structure data -> Analyze data -> Report and... keep it updated!

The source of raw data: the LA NACION data team built a dataset from more than 30,000 PDFs gathered from three different sources of Senate expense reports:

a) Senate presidential decrees (DP)

b) Senate administration department (DGA)

c) Senate accounting department (DC)
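To give an idea of what merging three sources into one dataset can involve, here is a minimal Python sketch that maps records from each source onto a shared schema. Everything specific in it is an assumption made for illustration: the per-source field names, the `dd/mm/yyyy` date format and the Argentine-style number format are hypothetical, since the actual layouts of the DP, DGA and DC documents are not described here.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from each source's field names to a common schema.
FIELD_MAPS = {
    "DP": {"fecha": "date", "beneficiario": "payee", "importe": "amount"},
    "DGA": {"fecha_pago": "date", "nombre": "payee", "monto": "amount"},
    "DC": {"dia": "date", "proveedor": "payee", "total": "amount"},
}


def parse_amount(text):
    """Convert a number written like '1.234,50' into Decimal('1234.50')."""
    return Decimal(text.replace(".", "").replace(",", "."))


def normalize(source, raw):
    """Map one raw record from `source` onto the common schema."""
    mapping = FIELD_MAPS[source]
    record = {target: raw[field].strip() for field, target in mapping.items()}
    # Assumed date format: day/month/year.
    record["date"] = datetime.strptime(record["date"], "%d/%m/%Y").date()
    record["amount"] = parse_amount(record["amount"])
    # Collapse repeated whitespace and unify case so names can be compared.
    record["payee"] = " ".join(record["payee"].upper().split())
    record["source"] = source  # keep track of where each record came from
    return record


# Hypothetical record, for illustration only.
example = normalize(
    "DGA",
    {"fecha_pago": "05/03/2012", "nombre": "  Juan   Perez ", "monto": "1.234,50"},
)
print(example)
```

Keeping the `source` field on every record makes it possible to trace any figure in the unified dataset back to the document family it came from.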

The following video gives a detailed explanation (in English) of how this dataset was built, step by step: how documents are scraped and transformed, how we handled noisy PDFs presented as images using different OCR solutions, and how we managed to structure and use this information for reporting:

 

The whole process is divided into several phases, each of which requires different tools.