DOMParser or d3.html

I’ve been trying to pull in an HTML page to then convert into a data source. When I use d3.html('https://en.wikipedia.org/wiki/List_of_towns_and_boroughs_in_Pennsylvania') I get an empty result:

```
HTMLDocument {
  location: null
}
```

I also cannot get a valid result with DOMParser. For example:

```js
window.DOMParser().parseFromString(d3.text('https://en.wikipedia.org/wiki/List_of_towns_and_boroughs_in_Pennsylvania'), 'text/html')
```

I get the same result. Is there something I’m missing about importing HTML pages? Is there a good way to fetch HTML content in a notebook?

In this case it doesn’t quite matter whether it’s d3.html or d3.text: to load any data from one webpage into another, you’ll need the resource to support CORS. Wikipedia doesn’t enable it for its webpages, because those are typically consumed by browsers, not loaded by JavaScript. Here’s an example of loading that page:

There are two options that work - either pulling from Wikipedia through a proxy that adds CORS, or pulling from the Wikipedia API.
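Here’s a rough sketch of the API route, assuming you want the rendered HTML of the article (the page title and the prop=text parameter are taken from your URL and what you seem to need). The MediaWiki API allows CORS for anonymous requests when you add origin=*:

```js
// Build the MediaWiki API request; origin=* opts into CORS for anonymous use.
const url = "https://en.wikipedia.org/w/api.php" +
  "?action=parse" +
  "&page=List_of_towns_and_boroughs_in_Pennsylvania" +
  "&prop=text" +
  "&format=json" +
  "&origin=*";

const response = await fetch(url);
const json = await response.json();

// parse.text["*"] holds the rendered HTML of the article body as a string,
// which you can then hand to DOMParser.
const doc = new DOMParser().parseFromString(json.parse.text["*"], "text/html");
```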

Most of the time, when you’re loading things with JavaScript, CORS will be supported and what you’re loading is data. Webpages aren’t usually regarded as data and usually don’t support CORS - and they’re pretty tricky to parse, as this one is, if your intent is to get that list of towns and boroughs. That’s also why, for data in Wikipedia, there’s Wikidata, a sister project that provides better APIs and easier-to-use data. Unfortunately it doesn’t have coverage for this particular table.

Anyway, best of luck. I also threw in the code for parsing the entries from that table in that notebook.
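For reference, that parsing code looks roughly like this - a sketch that assumes the `doc` produced above and that the list lives in a wikitable; the selectors and column order are guesses you’d want to check against the actual markup:

```js
// Pull every data row out of the table; header rows have only th cells,
// so they produce empty arrays and get filtered out.
const rows = Array.from(doc.querySelectorAll("table.wikitable tr"))
  .map(tr => Array.from(tr.querySelectorAll("td")).map(td => td.textContent.trim()))
  .filter(cells => cells.length > 0);

// The column indices here are guesses - inspect the table to confirm which
// cell holds the borough name and which holds the county.
const boroughs = rows.map(cells => ({name: cells[0], county: cells[1]}));
```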

I have written a ‘fetcher’ for that purpose.

It uses cors-anywhere to bypass the CORS restriction.
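A minimal sketch of that approach, assuming the public cors-anywhere demo as the proxy (the actual notebook may wrap it differently):

```js
// cors-anywhere proxies whatever URL you append to its own and adds CORS headers.
const PROXY = "https://cors-anywhere.herokuapp.com/";

async function fetchHTML(url) {
  const response = await fetch(PROXY + url);
  const text = await response.text();
  // Parse the proxied page into a document you can query.
  return new DOMParser().parseFromString(text, "text/html");
}
```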

Please be aware that the cors-anywhere demo is rate limited. The current rate limits are listed in its documentation, and some alternative CORS proxies are listed in this gist and its comments.

Yes, I know, should have mentioned that :slight_smile: