Hi Observable Community,
In a previous discussion (which I would link to, but cannot find), there was an indication that posting ‘jobs’ on this forum is acceptable. I am looking to structure some difficult-to-wrangle public data (the task at hand) and then to apply these data to tracking and visualizing changes over time (a future task).
I have a relatively ‘contained’ problem, and I suspect it can be swiftly cracked in a novel way. The government of Nepal (its statistics bureau) publishes reports on a semi-regular basis with district-level socio-economic data. The data I encounter are in PDF, and it takes a long time and a constellation of tools to parse PDF tables in a meaningful way. To get at the data in these reports, I’ve tried tools like tesseract.js, which help me manually grab out data, though symbol-recognition errors sometimes make correcting the data as painful as writing the table out again by hand. When I don’t need OCR and a PDF is editable, I sometimes have to convert scripts, such as from ASCII to Unicode. Since I continue to manually pull data from PDFs into JSON, piecemeal, I figured it would be worth developing a tool for this; it would save my sanity and greatly enhance my productivity.
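To give a flavor of the manual step I keep repeating, here is a minimal sketch of turning text copied from a PDF table into JSON records. The column names, the sample values, and the two-or-more-spaces delimiter are all assumptions for illustration; each real report would need its own tuning.

```javascript
// Sketch: parse text copied out of a PDF table into JSON records.
// Assumes the header row comes first and columns are separated by
// runs of two or more spaces — a common (but not universal) pattern.
function parseCopiedTable(text) {
  const rows = text
    .trim()
    .split("\n")
    .map((line) => line.trim().split(/\s{2,}/));
  const [header, ...body] = rows;
  return body.map((cells) =>
    Object.fromEntries(header.map((key, i) => [key, cells[i] ?? null]))
  );
}

// Hypothetical sample resembling a district-level table:
const sample = [
  "District    Population    Literacy",
  "Kaski       492098        82.4",
  "Mustang     13452         66.2",
].join("\n");

const records = parseCopiedTable(sample);
// records[0] → { District: "Kaski", Population: "492098", Literacy: "82.4" }
```

In practice the value of a real tool would be in learning each report’s break pattern rather than hard-coding the delimiter, as this sketch does.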
I’m interested in whether some sort of AI could help (I asked about tensorflow.js because there are great examples on Observable).
I would ask an AI to help me both ensure that the captured data are accurate and correctly lined up, and to help me make decisions about data capture from PDFs. At its core, I’d like the AI to learn how to read tables with nested column headers and produce JSON output, recognizing and accounting for column and row spans so that it could essentially capture any PDF table visually. Of course, there are other tasks a ‘smart’ computer could be asked to do to verify and improve the quality of data capture. For example: analyzing a file to first determine whether it has readable text, then trying some combination of image analysis, comparison of extracted content (text copied from PDF tables, when it breaks, breaks in a patterned way), and human-corrected data.
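One possible JSON shape for the nested-header part is a small header tree, where each node may span several leaf columns and flattening the tree yields dot-joined column keys. This is a sketch of an assumed structure, not an existing standard, and the "Population" / "District" labels are hypothetical:

```javascript
// Sketch: represent nested column headers as a tree, then flatten it
// into dot-joined column paths that flat JSON records can use as keys.
function flattenHeaders(nodes, prefix = []) {
  return nodes.flatMap((node) =>
    node.children && node.children.length
      ? flattenHeaders(node.children, [...prefix, node.label])
      : [[...prefix, node.label].join(".")]
  );
}

// Hypothetical nested header: "Population" splits into "Male"/"Female"
// (a column span), while "District" occupies both header rows (a row span).
const headerTree = [
  { label: "District" },
  { label: "Population", children: [{ label: "Male" }, { label: "Female" }] },
];

const columns = flattenHeaders(headerTree);
// columns → ["District", "Population.Male", "Population.Female"]
```

A model that could emit this kind of tree from a table image or extracted text would cover most of what I do by hand today.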
This first data set is at least ‘bounded’ (there will be reports from fewer than 76 districts), and really my immediate goal is to have the data in these reports in a format I can use for analysis. If we develop the AI publicly and the model works for others, or could be further adapted / trained using corrected data, that’d be sweet too.
I write here because this is the only community in the world I know of whose people have this particular skill set. If you do, and if you’re potentially interested in collaborating (for remuneration), please write to me and let me know.
I seek to work with others to improve social development outcomes for people. This community helps me every day to do just that. Thank you.
P.S. - This is not well put together, but a comparable example of the type of work involved is contained in this notebook (all the individual data tables):