LA NACION decided to place a new section, DATA, in the main navigation bar on the home page of lanacion.com.
LNdata is a place where people in Argentina can find stories with open data and data journalism ingredients. Some are datasets extracted from public or government sites, most of which were built, transformed, normalized and/or manually typed by the LNdata team.
The data section includes a data homepage, an open data catalog, a data blog, our data journalism projects, dataviz, the @LNdata Twitter account, a Facebook page and daily reporting with data articles. Continue reading →
Extracting stories from public DATA formerly unstructured and in PDFs.
Play it in HD!
After finding out that the Senate had been publishing its expenses since 2004 as raw PDFs, some of them images and all completely unstructured, the LA NACION data team managed to scrape, transform, normalize and structure three datasets into one. The team then began an interrogation process that led to front page stories, drew replies from current and former vice presidents of Argentina (who also serve as Senate presidents), and prompted a judicial investigation of Vice President Amado Boudou over these expenses.
Because we built this dataset from scratch and analysed the dates, we also found that some official trip expenses were filed with overlapping dates, and that some of the trips claimed were never made.
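The date analysis described above can be illustrated with a minimal sketch. This is not the team's actual code: the `Trip` record, its field names and the sample rows are all hypothetical, and the check simply assumes that two trips filed for the same person should never share a calendar day.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Trip:
    # Hypothetical record layout; the real dataset's columns are not shown in the post.
    traveler: str
    start: date
    end: date  # inclusive last day of the trip


def find_overlaps(trips):
    """Return pairs of trips by the same traveler whose date ranges intersect."""
    by_traveler = {}
    for trip in trips:
        by_traveler.setdefault(trip.traveler, []).append(trip)

    overlaps = []
    for traveler_trips in by_traveler.values():
        ordered = sorted(traveler_trips, key=lambda t: t.start)
        for a, b in combinations(ordered, 2):
            # Two inclusive ranges intersect when each starts before the other ends.
            if a.start <= b.end and b.start <= a.end:
                overlaps.append((a, b))
    return overlaps


# Invented sample rows, for illustration only.
sample_trips = [
    Trip("Official A", date(2012, 3, 1), date(2012, 3, 5)),
    Trip("Official A", date(2012, 3, 4), date(2012, 3, 8)),
    Trip("Official A", date(2012, 4, 1), date(2012, 4, 2)),
    Trip("Official B", date(2012, 3, 4), date(2012, 3, 6)),
]
overlaps = find_overlaps(sample_trips)
for first, second in overlaps:
    print(first.traveler, first.start, first.end, "overlaps", second.start, second.end)
```

The pairwise comparison is quadratic per traveler, which is simple to read and should be adequate when each person has a modest number of trips. A flagged pair is only a lead for reporting, not proof of an irregularity.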
Detect -> Scrape -> OCR (convert PDFs of varying quality: all images, some very noisy and some password-protected) -> Design a data model -> Parse data -> Structure data -> Analyze data -> Report and… keep it updated!
The source of raw data: the LA NACION Data team built a dataset from more than 30,000 PDFs gathered from three different sources of Senate expense reports.
The following video gives a detailed, step-by-step explanation (in English) of how this dataset was built: how the documents were scraped and transformed, how we handled noisy PDFs presented as images using different OCR solutions, and how we managed to structure and use this information for reporting.
The whole process is divided into several phases, each of which requires different tools: Continue reading →
How to show the statements of assets from public officials in a friendly way?
That was the first question we asked ourselves when starting the project. After four months of intensive work, on January 13th we launched an interactive application with an easy visual interface that lets users read a large volume of information, make comparisons, and explore every original PDF document in detail using DocumentCloud.
The result combines an end-to-end data journalism process: FOIA requests, 100% manual data typing, data checking, data analysis, journalism, database design, interface design, programming and data mining. As a result, the project involved more than 10 people with very different profiles. Continue reading →