DOMParser or d3.html

I’ve been trying to pull in an HTML page to then convert into a data source. When I use d3.html('https://en.wikipedia.org/wiki/List_of_towns_and_boroughs_in_Pennsylvania') I get an empty result:

```
HTMLDocument {
  location: null
}
```

I also cannot get a valid result with DOMParser. For example:

```js
window.DOMParser().parseFromString(d3.text('https://en.wikipedia.org/wiki/List_of_towns_and_boroughs_in_Pennsylvania'), 'text/html')
```

I get the same result. Is there something I’m missing about importing HTML pages? Is there a good way to fetch HTML content in a notebook?

In this case it doesn’t quite matter whether it’s d3.html or d3.text: to load any data from one webpage into another, you’ll need the resource to support CORS. Wikipedia doesn’t enable it for its webpages, because those are typically consumed by browsers, not loaded by JavaScript. Here’s an example of loading that page:

There are two options that work - either pulling from Wikipedia through a proxy that adds CORS, or pulling from the Wikipedia API.
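Here’s a rough sketch of the API route, assuming you want the rendered HTML of the article (the page title and the prop=text parameter are taken from your URL and what you seem to need). The MediaWiki API allows CORS for anonymous requests when you add origin=*:

```js
// Build the MediaWiki API request; origin=* opts into CORS for anonymous use.
const url = "https://en.wikipedia.org/w/api.php" +
  "?action=parse" +
  "&page=List_of_towns_and_boroughs_in_Pennsylvania" +
  "&prop=text" +
  "&format=json" +
  "&origin=*";

const response = await fetch(url);
const json = await response.json();

// parse.text["*"] holds the rendered HTML of the article body as a string,
// which you can then hand to DOMParser.
const doc = new DOMParser().parseFromString(json.parse.text["*"], "text/html");
```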

Most of the time, when you’re loading things with JavaScript, CORS will be supported and what you’re loading is data. Webpages aren’t usually regarded as data and usually don’t support CORS - and they’re pretty tricky to parse, as this one is, if your intent is to get that list of towns and boroughs. That’s also why, for data in Wikipedia, there’s Wikidata, a sister project that provides better APIs and easier-to-use data. Unfortunately it doesn’t have coverage for this particular table.

Anyway, best of luck. I also threw in the code for parsing the entries from that table in that notebook.
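For reference, that parsing code looks roughly like this - a sketch that assumes the `doc` produced above and that the list lives in a wikitable; the selectors and column order are guesses you’d want to check against the actual markup:

```js
// Pull every data row out of the table; header rows have only th cells,
// so they produce empty arrays and get filtered out.
const rows = Array.from(doc.querySelectorAll("table.wikitable tr"))
  .map(tr => Array.from(tr.querySelectorAll("td")).map(td => td.textContent.trim()))
  .filter(cells => cells.length > 0);

// The column indices here are guesses - inspect the table to confirm which
// cell holds the borough name and which holds the county.
const boroughs = rows.map(cells => ({name: cells[0], county: cells[1]}));
```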

I have written a ‘fetcher’ for that purpose.

It uses cors-anywhere to bypass the CORS restriction.
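A minimal sketch of that approach, assuming the public cors-anywhere demo as the proxy (the actual notebook may wrap it differently):

```js
// cors-anywhere proxies whatever URL you append to its own and adds CORS headers.
const PROXY = "https://cors-anywhere.herokuapp.com/";

async function fetchHTML(url) {
  const response = await fetch(PROXY + url);
  const text = await response.text();
  // Parse the proxied page into a document you can query.
  return new DOMParser().parseFromString(text, "text/html");
}
```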

Please be aware that the cors-anywhere demo is rate limited. The current rate limits are listed in its documentation, and some alternative CORS proxies are listed in this gist and its comments.

Yes, I know, should have mentioned that :slight_smile: