pdf.js in Observable?

aaronkyle · May 12, 2019, 5:24pm

I would like to put up a view to a PDF file, but I haven’t figured out how to require all the components.

Here’s my attempt:

I started by trying to follow this JSFiddle example, which loads pdf.js. But pdf.js itself is not on npm, so I have to find a workaround for requiring it.

pdfjs =  require('https://mozilla.github.io/pdf.js/build/pdf.js').catch(() => window.pdfjs)

… appears to work, but the JSFiddle example returns Error: No "GlobalWorkerOptions.workerSrc" specified.

pdfjs-dist is on npm, and can be required:

require('pdfjs-dist@2.0.943/build/pdf.js')

… But using this doesn’t make the error go away.

There are some discussions online about workarounds and alternative approaches. For example, one thread suggests explicitly requiring the worker. I tried:

PDFJSWorker =  require('https://mozilla.github.io/pdf.js/build/pdf.worker.js').catch(() => window.PDFJSWorker)

but no dice.

Any insights?

[sorry for the many edits; I accidentally hit ‘submit’ early in drafting this question…]

aaronkyle · May 12, 2019, 9:21pm

Trying to recreate Mozilla’s helloworld example, I have managed to boil this all down better:

… still haven’t made it work yet, but resolved nearly all errors. Happy for help to get over this last hurdle!

mootari · May 12, 2019, 10:49pm

Hope this helps:
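
For anyone reading without the notebook at hand, here is a minimal sketch of the usual way around the workerSrc error. It is an illustration under assumptions, not necessarily what the notebook does: the helper names pdfjsUrls and configureWorker are made up for this example, and the jsDelivr base URL is an assumption about where the pdfjs-dist files are served. The idea is to load the library and the worker from the same pinned version, then point GlobalWorkerOptions.workerSrc at the worker file.

```javascript
// Sketch only: pdfjsUrls and configureWorker are hypothetical helper names.
// Assumption: the npm package pdfjs-dist is served by a CDN (jsDelivr here)
// with pdf.js and pdf.worker.js under build/.
function pdfjsUrls(version, cdn = "https://cdn.jsdelivr.net/npm") {
  const base = `${cdn}/pdfjs-dist@${version}/build`;
  return { lib: `${base}/pdf.js`, worker: `${base}/pdf.worker.js` };
}

// Tell PDF.js where its worker script lives. This is the setting that the
// 'No "GlobalWorkerOptions.workerSrc" specified' error is asking for.
function configureWorker(pdfjsLib, urls) {
  pdfjsLib.GlobalWorkerOptions.workerSrc = urls.worker;
  return pdfjsLib;
}

// In an Observable cell this might be used roughly as (untested here):
//   pdfjsLib = {
//     const urls = pdfjsUrls("2.0.943");
//     return configureWorker(await require(urls.lib), urls);
//   }
```

Keeping the library and the worker on the same version matters, since PDF.js may complain when the two differ.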

aaronkyle · May 12, 2019, 11:30pm

Thanks Fabian! It certainly does!!

jrus · June 15, 2019, 1:10am

Also check out @tom’s awesome video,

Which I stumbled across completely by chance. The Observable team should remember to cross-promote useful stuff like this on this forum, since I suspect many other readers are not followers of your YouTube channel / independent Twitter accounts / etc.

aaronkyle · June 15, 2019, 10:53am

Thanks @jrus. I had actually seen that video back when Tom first posted (but did indeed lose track of it, so thanks for the reminder) and have subsequently played around with it a bit (after @mootari generously helped me on the require). Like you, I re-encountered it randomly!

…I’ve been busy, but slowly keep trying to work out how to grab information from old Indonesian census documents using JS. So much to learn… I’ll chime back in when I’ve put in more time and have questions that I can’t answer by searching and reading.

mootari · June 15, 2019, 12:12pm

Sounds interesting! Do you have a notebook somewhere with your current progress?

aaronkyle · June 16, 2019, 12:44am

Thanks for the interest, Fabian!

I don’t know that it’s much in the way of ‘progress’ (rather than an amalgamation of other people’s work in two very slow-to-load notebooks), but here’s a bit of it:

… This notebook seeks to re-create the census form itself (currently only a few sections). The idea will be to compare the data collected by the national census to data collected by local governments.

And here’s the notebook where I’m working on pulling together these different data from regional governments:

(After all the maps) this second one uses the PDF.js require that you supplied to render out one of the PDFs. A while back, I manually pulled the data, and recently I put it on GitHub. Using Tom’s table function I have re-visualized (roughly, and messily) how the different pages of the census report should look. My next step is to see if I can capture the table data with JS (rather than doing it again by hand).

… Elsewhere I have (manually and laboriously) compiled all data for 2013 across all regional governments in Tanimbar, which demonstrates that few of these indicators are consistently tracked across all governments. I can go more into why this is important, how it relates to the national census, and what ideas I have for trying to improve consistent data capture… but it’ll take a while.

Please let me know your thoughts, feedback and ideas! And thanks again for your help and interest!

aaronkyle · June 16, 2019, 1:16am

Ah - Sorry! I just realized that neither of these notebooks contains the experiments I was doing based on Tom’s Effects of light rail / PDF-capture notebook! And as it turns out, my notebook is buried deep somewhere in my private notebooks. I’ll poke around a bit tomorrow and see if I can dig it back up. I really didn’t get too far though…

mootari · June 16, 2019, 6:35am

Has the actual census data already been digitized? Because, if all the data is just scans of the original form, I can see no way to automatically pull the data itself from the PDFs.

aaronkyle · June 16, 2019, 11:29am

Yes and No. The Government of Indonesia’s 2010 census is digital(-ish). For example, they publish data in tabular form on a number of topics, such as this table on Population by Age Group and Citizenship for Maluku Province.

I don’t think that they have an API to access these data, unfortunately.

As for the census itself (and the questions it uses)… sadly, no. This is precisely why I started re-creating it. I have copies of the physical forms used to collect data from across several years, going back to the early 1980’s, but that’s about it. While my work is only beginning (and now only a proof of concept), I was planning to work through a good part of the census and convert it to inputs. After that, I intend to work out a way to select relevant subsets of census questions (coupled with data collected by local government and not in the census) to inform baseline socio-economic profiles of local areas.

I fear that I’m trying to cover too much ground here and really ought to sit down and be methodical in the write-up, so I’ll try to wrap things up quickly.

In 2013, I used Excel Visual Basic macros to re-create the tables from the reporting PDF for local Maluku governments, one of which is covered in the MTB notebook I linked above. The outcomes were less than perfect, and all the data had to be manually verified and then re-compiled into a single data sheet containing all indicators, across multiple years. A year of working on this and I never really finished. Data would overlap and be different as reported in one publication to the next, one year to the next; the local data rarely tied in to the census; and the census seems to report on only a fraction of the information it collects (as summary figures).

Tom’s PDF tool is able to parse most of the copies of the census form that I have, as well as much of the PDF data publications of local governments. At the very least, I can use it to speed up my copy-and-paste work. Yet this really isn’t sustainable from a time perspective. I imagine that if Visual Basic can be set to reproduce tabular data, so too can JavaScript! Since it’s been a few years and local governments have 2 or 3 more annual statistical reports that I can access, I’d like to re-run this exercise, create a time-series data set, and then see if I can identify trends (such as increases or decreases in health and educational services in different areas).

Let’s see! Maybe something is possible. It’s great that Indonesia has an open data policy and that super-local reports are accessible. It’s a shame that these data are not structured or systematically collected… but maybe we can change all of this. I’ll keep you posted!

mootari · June 17, 2019, 6:02pm

Can you set some constraints on the data that you’re trying to obtain? Like time span, region, type of information? Perhaps you could outline the goal of your research?

What does MTB stand for?

To cite this paper from 2016:

After observing open data implementation from several big cities, province and ministries, we found that the most available files are PDF. There are a small number of files available in machine readable format. According to the five-star scheme by Tim Berners-Lee, most of the government websites are awarded 1-star. A few websites are awarded 2-star and almost no website is awarded 3-star. There is no website is awarded 4-star and 5- star. This result also shows that Indonesia is still in the first stage according to open stage model. Indonesia is still aggregating the government data and publish it without linking to the other data. We expect that the government has not taken the open data issue seriously yet. To increase the open data implementation, we suggest policy changing, rule implementation, socialization of open data and technical training to provide open data.

I also found this open data portal. However, I only took a very brief look at it, so I have no idea if the available data is relevant to your research. But the fact that they have JS graphs and widgets for the census data gives me hope that they may have converted the underlying information to a consumable format.

aaronkyle · June 17, 2019, 6:29pm

Yep! This is exactly what I experienced (though the authors’ conclusion about the government not taking open data seriously reads a bit harsh, to me at least; I think there’s clearly some seriousness in the government’s intention, but it takes time to build capacity… and the statistics bureau is highly decentralized…).

I’ll get back to you in greater detail in a bit (mostly by writing content into the notebooks that clarifies why I am doing what I am doing and what I hope to achieve). The long and the short of it is this:

Indonesia has an Open Data Policy, but (as pointed out well in the paper you cite) while data are available, they’re not really all that ‘open’ (from the perspective of enabling quantitative analysis).
For those working on social and economic development in Indonesia, it’s important that we all try to build on (or clean, refine, extend) existing data.
To build on existing data, first we ought to take inventory of the tools and approaches that exist, and to align our approaches with things that are already happening… at national and local levels.
One way toward achieving this (like I did in the past when researching potential socio-economic impacts of a large-scale development project in Maluku Province, as discussed above) is to manually re-compile everything… but this work is time-consuming and error-prone.
Alternatively, one might be able to ask machines to help us [And this is about where I’m at].

As next steps, I’ll dig out the Visual Basic scripts that I used to grab tabular data from the annual social and economic statistical reports produced by regional governments in Maluku Province (most of which follow the same basic structure). I’ll see if I can convert this script to something that works with PDF.js, following Tom’s video, internet guides, etc. Maybe with some help, I can use Observable to re-create the data capture process that I had working a few years back, and then re-run it for the reports available from 2013 to today (I last worked on this in 2014, and a lot more has come out since then).
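
To make the idea concrete, here is a rough sketch of what the capture step might look like. It is not Tom’s code and not a port of the Visual Basic scripts; groupIntoRows is a hypothetical helper, and the 2-unit tolerance is a guess. It assumes the PDF has a real text layer (not a scan) and that each text item looks like the ones PDF.js returns from page.getTextContent(), i.e. a str plus a transform array whose last two numbers are the x and y position.

```javascript
// Sketch only: groupIntoRows is a hypothetical helper, not part of PDF.js.
// Each item is assumed to look like a PDF.js text item:
//   { str: "text", transform: [a, b, c, d, x, y] }
function groupIntoRows(items, tolerance = 2) {
  // Ignore whitespace-only items, then sort top to bottom
  // (PDF y coordinates grow upward, so larger y comes first).
  const sorted = items
    .filter((item) => item.str.trim() !== "")
    .sort((p, q) => q.transform[5] - p.transform[5]);

  const rows = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = rows[rows.length - 1];
    // Items whose baseline is within `tolerance` of the row's first item
    // are treated as belonging to the same table row.
    if (last && Math.abs(last.y - y) <= tolerance) last.cells.push(item);
    else rows.push({ y, cells: [item] });
  }

  // Within each row, order cells left to right and keep only the text.
  return rows.map((row) =>
    row.cells
      .sort((p, q) => p.transform[4] - q.transform[4])
      .map((cell) => cell.str.trim())
  );
}

// With a loaded page this might be used roughly as (untested here):
//   const text = await page.getTextContent();
//   const table = groupIntoRows(text.items);
```

Real reports will need more than this (merged cells, multi-line headers, columns that drift between pages), but row grouping by position is a reasonable first pass before any manual verification.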

The outcome I am hoping for? I’d like to ask questions of the information like: Do more or fewer people now have access to a doctor near them? What’s the average distance of a person in a given location from a hospital? School? Etc? What are the trends in terms of fisheries production? Agricultural production? Etc? And from these, it’s also helpful to imagine futures: things we’d like to see (like better access to sanitary facilities) and things we’d like to avoid (like over-fishing).

… Not the most well-defined project, but such is the nature of social analysis, I fear. Most of the challenge is getting access to reasonable data… and then verifying these data, triangulating, etc.

And by the way: MTB = Maluku Tenggara Barat [the name of the notebook I linked]; sorry for the shorthand.

Thank you again for your time, interest and encouragement. More to come!

Martien · March 16, 2020, 3:04pm

Codified @tom’s YouTube video at https://observablehq.com/@martien/parse-pdf.

Welcome exercise. Learned a lot. Thanks Tom.
