tesseract + tensorflow + humans + lots of PDF data tables = ???

aaronkyle · June 6, 2020, 3:31am

Hi Observable Community,

In a previous discussion (which I would link but cannot find), there was an indication that posting ‘jobs’ on this forum was acceptable. I am looking to ‘structure’ some difficult-to-wrangle public data (task at hand) and to help apply these data to tracking and visualizing changes over time (future task).

I have a relatively ‘contained’ problem, and I suspect it can be cracked swiftly in a novel way. The government of Nepal (statistics bureau) publishes reports on a semi-regular basis with district-level socio-economic data. The data I encounter are in PDF, and it takes a long time and a constellation of tools to parse PDF tables in a meaningful way. To get at the data in these reports, I’ve tried tools like tesseract.js, which help me grab data out manually (sometimes with errors in symbol recognition, which often make correcting the data as painful as writing the table out again by hand). When I don’t need OCR because a PDF is editable, I sometimes still have to convert between encodings (such as from ASCII to Unicode). Since I continue to pull data from PDF into JSON manually (piecemeal), I figured it would be worth developing a tool for this; it would save my sanity and greatly enhance my productivity.
I’m interested in whether some sort of AI could help (and I asked about tensorflow.js because there are great examples on Observable).

I would ask an AI to help me ensure that the data captured are accurate and correctly lined up, as well as to help me make decisions about data capture from PDFs. At its core, I’d like the AI to be able to learn how to read tables with nested column headers and create a JSON data output from them, recognizing and accounting for column and row spans, so that it could help visually capture essentially any PDF. Of course, there are other tasks a ‘smart’ computer could be asked to do in order to verify and improve the quality of data capture. For example, it could analyze a file to first determine whether it has readable text, and then try a combination of image analysis, comparison of extracted content (text copied from PDF tables, when it breaks, breaks in a patterned way), and human-corrected data.
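To make the target output concrete, here is a minimal sketch in Python of turning nested column headers with column spans into flat JSON records. It assumes the header rows have already been extracted as (label, colspan) pairs; the function names, the " / " separator, and the sample table are all hypothetical and made up for illustration, not taken from any actual report.

```python
import json


def flatten_headers(header_rows):
    """Expand (label, colspan) header rows into one joined key per leaf column."""
    expanded = []
    for row in header_rows:
        labels = []
        for label, colspan in row:
            # Repeat a spanning label once for every column it covers.
            labels.extend([label] * colspan)
        expanded.append(labels)
    if len({len(labels) for labels in expanded}) != 1:
        raise ValueError("header rows do not cover the same number of columns")
    # Join the header levels above each column, skipping empty labels.
    return [" / ".join(part for part in column if part) for column in zip(*expanded)]


def table_to_records(header_rows, body_rows):
    """Pair every body row with the flattened header keys."""
    keys = flatten_headers(header_rows)
    return [dict(zip(keys, row)) for row in body_rows]


# Hypothetical two-level header: "Population" spans the Male and Female columns.
header_rows = [
    [("District", 1), ("Population", 2), ("Literacy rate", 1)],
    [("", 1), ("Male", 1), ("Female", 1), ("", 1)],
]
# Made-up values, for illustration only.
body_rows = [
    ["District A", 100, 110, 65.5],
    ["District B", 200, 190, 70.1],
]

records = table_to_records(header_rows, body_rows)
print(json.dumps(records, indent=2))
```

The hard part, of course, is producing the (label, colspan) pairs reliably in the first place; this only shows what the JSON could look like once the spans are known.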

This first data set is at least ‘bounded’ (there will be reports from fewer than 76 districts), and really my immediate goal is to have the data in these reports in a format that I can use for analysis. If we develop the AI publicly and the model works for others, or could be further adapted or trained using corrected data, that’d be sweet too.

I write here because this is the only community in the world that I know of in which people have this particular skill set. If you do, and if you’re potentially interested in collaborating (for remuneration), please write to me and let me know.

I seek to work with others to improve social development outcomes for people. This community helps me every day to do just that. Thank you.

P.S. - This is not well put together, but a comparable example of the type of work involved is contained in this notebook (all the individual data tables):

pabloesm · June 16, 2020, 3:13pm

Hi Aaron,

The problem you are describing sounds really interesting. It’s inspiring to see the work you are doing to help with social development.

I think the use of Deep Learning to parse PDF tables is quite a reasonable approach. In fact, there is some literature on the topic, and even a “standard” set of data + tasks for comparing the performance of different methods (2013 Table Competition, 12th International Conference on Document Analysis and Recognition).

In addition, there are some commercial tools (e.g. https://pdftables.com) that may also be interesting for your purposes. Have you checked any of them?

aaronkyle · June 16, 2020, 8:03pm

Hi @pabloesm, and thanks for the compliments, information, and links!

I have tried pdftables when I was working on the Indonesia example. It would take me a minute to track it down, but when I extracted those data tables for Indonesia, a friend helped create a script for MSExcel that could analyse and extract the PDF tables with relatively high precision. So I know that the extraction, at least, can be done with relative precision.

The machine learning aspect that I am envisioning as part of this process would focus on how to structure extracted data. Especially where there are weird things in data tables, like nested headers, merged cells, etc., the difficulty is not only extracting these data but also deciding what to ‘do’ with this information when you encounter it. I imagine a machine could learn some common ‘rules’ that we readers take for granted when looking at a funky table.
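As one example of such a ‘rule’: when a merged cell spans several rows, extraction often leaves the value in the first row and blanks in the rows below, and a reader silently carries the value down. A minimal sketch of that single rule in Python, assuming the table is already a list of rows (the function name and the sample rows are hypothetical, made up for illustration):

```python
def fill_merged_cells(rows, columns):
    """Copy the last non-empty value downward in the given column indexes."""
    filled = []
    last_seen = {}
    for row in rows:
        new_row = list(row)  # copy so the input rows are left untouched
        for col in columns:
            if new_row[col] in ("", None):
                # Blank cell under a merged cell: reuse the value from above.
                new_row[col] = last_seen.get(col)
            else:
                last_seen[col] = new_row[col]
        filled.append(new_row)
    return filled


# Made-up rows: the first column was a merged cell spanning two rows each time.
rows = [
    ["Province 1", "District A", 10],
    ["", "District B", 20],
    ["Province 2", "District C", 30],
    [None, "District D", 40],
]

filled = fill_merged_cells(rows, columns=[0])
for row in filled:
    print(row)
```

A hand-written rule like this only works when you already know which columns were merged; the learning part would be figuring that out from the table itself.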

Of course, getting the character recognition right for a variety of scripts/alphabets/characters is also a challenge.

I imagine that, once this tool is built, it’d very quickly learn how to impose order on data that are ‘locked’ away in PDFs. Certainly there are several countries with ‘Open Data’ policies, but for which all data are saved as PDF tables.

Thanks for the encouragement! Please anyone write me if interested!

noise-machines · June 18, 2020, 4:12am

Hi folks,

This problem looks fascinating! Just sent you a message, @aaronkyle. Would love to help with this!

pabloesm · June 19, 2020, 9:34am

I think that having an end-to-end system that performs the mapping “table --> JSON data output” with a high level of confidence is quite ambitious, especially for borderless tables with complex structures. Maybe it’s more reasonable to use the machine learning approach to detect table cells, tesseract (or an alternative) to get the textual information for each cell, and some heuristics to determine the cell hierarchy from the cells’ positions.
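The last step, going from cell positions to a grid, could be sketched roughly as follows in Python. This assumes a detector has already produced one record per cell with the top-left corner (x, y) and the OCR’d text; the record layout, the function names, the 5-unit tolerance, and the sample cells are all assumptions made for illustration. Detection and OCR themselves are left out.

```python
def cluster_positions(values, tolerance):
    """Group sorted coordinates whose gaps are within tolerance; return group centres."""
    centres = []
    group = []
    for value in sorted(values):
        if group and value - group[-1] > tolerance:
            centres.append(sum(group) / len(group))
            group = []
        group.append(value)
    if group:
        centres.append(sum(group) / len(group))
    return centres


def nearest_index(value, centres):
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def cells_to_grid(cells, tolerance=5.0):
    """Place each detected cell into a row/column grid based on its position."""
    row_centres = cluster_positions([c["y"] for c in cells], tolerance)
    col_centres = cluster_positions([c["x"] for c in cells], tolerance)
    grid = [[None] * len(col_centres) for _ in row_centres]
    for c in cells:
        r = nearest_index(c["y"], row_centres)
        k = nearest_index(c["x"], col_centres)
        grid[r][k] = c["text"]
    return grid


# Made-up detections with slightly jittered coordinates, as a detector might return.
cells = [
    {"x": 10, "y": 12, "text": "District"},
    {"x": 120, "y": 10, "text": "Population"},
    {"x": 11, "y": 40, "text": "District A"},
    {"x": 119, "y": 41, "text": "1000"},
]

grid = cells_to_grid(cells)
for row in grid:
    print(row)
```

Positions in the grid that stay `None` are where no cell was detected, which is one hint that a neighbouring cell spans more than one row or column.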

On the other hand, the reports published by the statistics bureau of Nepal seem to have a consistent, well-framed style of tables.

What methods (if any) are you currently using to extract data from the tables?

P.S.: Hi @noise-machines, glad to hear you are also interested!

aaronkyle · June 19, 2020, 11:47am

Hi @pabloesm - and thanks again!

Currently I do not have any methods for extracting data from these documents. Some PDFs have text that can be copied, but often this process requires converting Preeti font to Unicode before I can really do anything with it (like dropping it into Google Translate). Other PDFs were saved as flat image files; these would require some sort of OCR (hence tesseract), though I’ve had little success extracting Nepali script.
