
streaming geojson - possible in the same manner as streaming shapefile?

In Streaming Shapefiles, Mike demonstrates a method for progressively reading in information from a somewhat large (13.6 MB) shapefile. Before I go too far trying to reproduce this with a JSON file that is 1,684,583 KB (1.6 GB), I wanted to ask if the operation is possible.

What gives me pause is that QGIS is also able to ‘stream’ in shapefiles—rendering them progressively as it reads through the file data. However it can’t do this with a geojson: I must wait for it to fully load in order to render.

Mike’s notebook uses a utility shapefile@0.6 to open the shapefile, and more specifically uses this command:

shapefile.open(await FileAttachment("UScounties.shp").stream())

He also notes that for this to work, a browser must support streaming fetch.

I’ve referred to the docs on the fetch API and on consuming fetch as a stream, but it’s all very technical and really far over my head. Especially confusing for me is the (apparently simple) example they give for the fetch stream method:

// Fetch the original image
fetch('./tortoise.png')
// Retrieve its body as ReadableStream
.then(response => response.body)

To me, this conceptually doesn’t look like it would work in the same way as the shapefile > open > stream sequence from Mike’s example (rather, it reads like it would first need to fetch everything, then return it).

Or maybe this is all just ‘automagical’ with today’s browsers? When I try just fetching the hosted URL in Observable, my cell just hangs there for about 10 minutes before finally returning the (complete) fetched data.

It goes without saying that I realize I should just cut down the data in QGIS first so as to focus on my specific problem area. At the same time, there’s something to be said for hosting a complete record, and then using JS to clip out the bits I need and re-export those clips without having to open QGIS each time.

Thanks in advance for your help and guidance!


Yes, this is possible!

Although 1.6 GB is a bit much for a JSON data structure to be loaded in your browser’s memory, if your computer has enough memory available, it will work.

Depending on your use case, you may not need to hold everything at once. Perhaps you only need to look at each feature once and draw it directly on a canvas, or you are only interested in a small part of the data and can filter each incoming GeoJSON Feature to decide whether to keep or discard it. In those cases you never keep the complete 1.6 GB in your browser’s memory, and your notebook’s memory needs stay much smaller.
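As a rough sketch of that keep-or-discard idea: the helper below walks any async iterable of GeoJSON Features (for example, an async generator that parses a stream) and retains only the ones that pass a test, so memory grows with the matches rather than with the whole file. The name `collectMatching` and the `district` property in the usage comment are made up for illustration; they don’t come from any of the notebooks mentioned here.

```javascript
// Hypothetical helper: keep only the features that pass `keep`,
// discarding everything else as soon as it has been inspected.
async function collectMatching(features, keep) {
  const kept = [];
  let seen = 0;
  for await (const feature of features) {
    seen += 1;
    if (keep(feature)) kept.push(feature);
    // Features that fail the test are never stored, so they can be
    // garbage-collected right away.
  }
  return { kept, seen };
}

// Possible usage, assuming each feature carries a `district` property:
// const { kept } = await collectMatching(
//   features,
//   (f) => f.properties.district === "Kaski"
// );
```

If the goal is drawing rather than collecting, the same loop could call a canvas drawing routine in place of `kept.push`.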

You can have a look at this notebook, which uses streaming fetch to load an NDJSON file (NDJSON is a bit like CSV, but for JSON) with geospatial data:
https://observablehq.com/@aboutaaron/racial-demographic-dot-density-map. In this notebook, the NDJSON file contains a GeoJSON Feature on each line of the file. The NDJSON file itself is not valid GeoJSON, but the individual lines are.

You could restructure your data as NDJSON and import the ndjsonStream function. The ndjsonStream function reads data from the stream, splits the stream on each incoming newline character, and returns each line as JSON. You could also rewrite that function to split the incoming streaming GeoJSON FeatureCollection data at the end of each incoming GeoJSON Feature (as the JSONStream module does for Node.js streams).
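To make the newline-splitting idea concrete, here is a minimal sketch of how such a reader could work. This is not the notebook’s actual ndjsonStream code; `parseNdjson` is a hypothetical name. It assumes the input is a ReadableStream of bytes, such as `response.body` from fetch, and that every non-empty line is a complete JSON value.

```javascript
// Hypothetical helper (not the notebook's ndjsonStream): yields one parsed
// JSON value per line of a byte stream.
async function* parseNdjson(stream) {
  const reader = stream.getReader();
  const decoder = new TextDecoder("utf-8");
  let buffer = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // `stream: true` keeps multi-byte characters that straddle two
      // chunks from being decoded incorrectly.
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      // The last piece may be an incomplete line; hold it for the next chunk.
      buffer = lines.pop();
      for (const line of lines) {
        if (line.trim() !== "") yield JSON.parse(line);
      }
    }
    // Flush the decoder and parse a final line that has no trailing newline.
    buffer += decoder.decode();
    if (buffer.trim() !== "") yield JSON.parse(buffer);
  } finally {
    reader.releaseLock();
  }
}

// Possible usage (the URL is a placeholder):
// const response = await fetch("https://example.com/features.ndjson");
// for await (const feature of parseNdjson(response.body)) { /* ... */ }
```

Because each line is parsed as soon as its newline arrives, the first features are available long before the download finishes.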


Hi @bert, and thank you for this reply! It’s currently very early for me (3:38 a.m.) so I’ll return to this in the morning in order to work through your tips! I wanted just to respond with a quick note of thanks.

For some additional clarification on the workflow I was thinking about maybe working toward:

I wanted to load in the OSM ‘built up area’ files for Nepal, which come as shapefiles but which I intended to convert to GeoJSON. [Actually, I plan to load in several large data layers.] In most instances I only need data for one or two districts or wards at a time, but since I contribute to a portfolio of projects across the country, I thought that, rather than cutting out each district each time, I might create a ‘project set up’ notebook of sorts where I specify the layers that I want to use, then identify the regions on which I will focus, and then download my focus area. If this were something that could be accomplished in the browser, that would be excellent (since I don’t always have access to QGIS and b/c I have been finding it very powerful to work with JavaScript / D3.js for other aspects of data analysis, and it’d be great to have everything all in one place).

Again - Thank you for the response and guidance! 🙏 I’ll report back very soon when I have a chance to play 🙂 [And to determine whether this workflow will fall apart quickly when trying to start from several massive data files].

I’ve created a notebook that reads GeoJSON from a URL using the Streams API and then parses each individual Feature: https://observablehq.com/@bertspaan/streaming-geojson.


Thank you, @bert – You’re way ahead of me! I meant to test this out today, but the day is fast escaping me (and I board a plane tomorrow for USA… so a bit hectic in the throes of packing).

Your caution on file size was prescient: When I dropped my file’s S3 URL into your notebook, it returned an error after a split second of reading my file (caught at JSON.parse).

Sorry for the slowness to test and follow-up! More soon.

Sincere thanks.

I didn’t thoroughly test my GeoJSON parsing code… maybe that’s why the JSON.parse error occurred. If you send me the URL of the failing GeoJSON file, I’ll have a look.


Thanks again @bert! I sent you a private message (reluctant to open a 1.6 gig download to the Internet at large 😉).

Inspired by your help, your references, and your notebook, I’ve been reading more into this and still plan to post a more detailed reply when I settle down in the States – in a couple more days. For now, I’d like to comment that I really, really like how your notebook shows the array features rolling in–almost like a download progress bar of sorts! I see also that you updated your notebook to render out some data into a table. Maybe it’s my browser or the limited number of features, but it seems that the addition of the table as a dependent cell somehow affects the geojson cell so that it’s now more prone to getting stuck at zero until the full 14 features are loaded in.

Thank you for your generosity and mentoring!

Hey Aaron,

Happy to help! I’ve been planning to create a streaming GeoJSON reader for a while, your question came at the right moment. I’d like to change a few things to my notebook, and display the features on a map instead of in a table. I’ll try to find time to do that sometime this week.


Hi again, Bert, and thank you for all the time and guidance!

On the first example of your notebook, you were using a different data source and didn’t yet have the table chart loaded in. The result was a ‘counting updater’:

In your new update, that data cell no longer does this auto counting in the same way (and has fewer array values):

If you’re quick enough on page load, you can see that the array value now stays ‘pending’ until all 14 arrays load in. Is that possibly because your table cell now relies on it?

Please forgive me, but I am too inexperienced yet to appreciate many of the intricacies here, so I’d also like to raise a novice / uninformed question:

Is an external library needed to achieve streaming fetch?

Looking at the consuming fetch as a stream example, it shouldn’t be. When I try to plug their example into a notebook with a JSON file attachment, I get this:

… However I can’t seem to ‘access’ this stream in the sense of being able to make it load in a piece at a time… such as when trying to render the stream as data into Mike’s GeoJSON Viewer

Since the file is so large, it’d be fun to see the areas slowly build up in the browser, like with the Streaming Shapefiles notebook. Or in a table, like in yours!
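On the external-library question, a minimal sketch using only the browser’s Streams API: a ReadableStream such as `response.body` can be consumed piece by piece by taking a reader and calling `read()` in a loop. The function name `readInChunks` and its callback are hypothetical, chosen just for this example. Note that the chunks are raw bytes whose boundaries fall at arbitrary points in the file, so turning them into GeoJSON Features still needs separate parsing logic.

```javascript
// Hypothetical helper: reads a byte stream one chunk at a time and reports
// each chunk (a Uint8Array) plus the running byte total to `onChunk`.
async function readInChunks(stream, onChunk) {
  const reader = stream.getReader();
  let totalBytes = 0;
  while (true) {
    // Each read() resolves as soon as the next chunk is available,
    // not when the whole download has finished.
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    onChunk(value, totalBytes);
  }
  return totalBytes;
}

// Possible usage (the URL is a placeholder):
// const response = await fetch("https://example.com/big.geojson");
// await readInChunks(response.body, (chunk, total) => {
//   console.log(`received ${total} bytes so far`);
// });
```

The running total gives a simple progress indicator, which is roughly the ‘counting updater’ effect described above, only measured in bytes instead of features.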