Takes 1 minute to summarize a million datapoints?

Regarding this notebook: Sacramento River Flow Gage / Jeffrey Baker | Observable

It takes a whole minute to change anything about the graph, and I am curious whether there is anything simple I can change to prevent that. I’m not trying to throw a million points on the screen; the plot is binned and summarized by max. Since that minute works out to hundreds of thousands of CPU cycles spent per point in the dataset, it seems like there must be some extreme inefficiency, either on my part or inside d3.js.

Ideally, it would be possible to switch rivers or change the graph between log and linear interactively, in less than a second.

The file you’re downloading, processing, and drawing is nearly 100 MB and contains well over a million points. That’s far too large to expect end users to consume. If at all possible, I’d preprocess that data and make a filtered version available via a file attachment.

You can speed it up as is, though, by filtering prior to drawing. Even if you grab only every thousandth point, you get pretty much the same picture. That is, define your data something like so:

d = raw.value.timeSeries[0].values[0].value
  .filter((e, i) => i % 1000 === 0)  // <---- keep only every 1,000th point
  .map((e) => ({
    v: +e.value,                     // flow as a number
    d: timeParse(e.dateTime)         // timestamp as a Date
  }))

In the first place, I dispute that simply taking every 15,000th minute as a datapoint is a legitimate tactic. That reduces this 15-minute dataset to one arbitrarily-chosen point per ten days! Peak flows only last for an hour or two (arguably, they are instantaneous), so reducing over windows, keeping the max of each window, is the correct tactic; see the sketch below.

Secondly, I don’t see anything particularly large about a million points. In a local Python notebook I can draw this graph in the blink of an eye.
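To make the point concrete, here is roughly what I mean by reducing over windows rather than decimating (just a sketch against the `d` array defined above; the 1,000-sample window size is arbitrary):

// Take the max over each run of 1,000 consecutive 15-minute samples
// instead of keeping a single arbitrary sample per window, so that
// peaks inside a window survive. (The window size is arbitrary here.)
binnedMax = d3.range(0, d.length, 1000).map((start) => {
  const chunk = d.slice(start, start + 1000);
  return {
    d: chunk[0].d,                 // label the window by its first timestamp
    v: d3.max(chunk, (e) => e.v)   // keep the peak flow within the window
  };
})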

I don’t know anything about this data; I’m simply pointing out the similarity in the pictures.

Yes, I agree. Working with a file on your hard drive in Python is going to be much quicker. JavaScript, however, is optimized to give end users browsing the web a snappy experience. Most web pages supposedly lose half their audience for every second it takes a page to load; thus, loading a 100 MB file is already out the window. That’s not to say you can’t do it, but it’s not the priority when it comes to JavaScript’s implementation.

This is all exactly why I recommend that you preprocess the file.

I guess one way to do it would be to pre-summarize the data prior to 2023, load the current data live, and draw them on the same graph. I am not sure how to determine in advance or control the boundaries of the bin transform, though; a rough sketch of what I have in mind follows below.

Plus, the amount of data I’m going to have to get from USGS for every stream in the state since 1987 is going to be sort of large.
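Something like this is what I have in mind, though I’m not sure the thresholds option actually pins the bin boundaries the way I want (`historical` and `recent` are placeholder names for the pre-summarized attachment and the live USGS data, both shaped like {d: Date, v: number}):

// Stitch the pre-summarized history onto the live data.
combined = [...historical, ...recent]

Plot.plot({
  marks: [
    // I believe thresholds can be given a d3 time interval to fix the
    // bin boundaries to calendar months in advance, but I'm not certain.
    Plot.rectY(combined, Plot.binX({y: "max"}, {x: "d", y: "v", thresholds: d3.utcMonth})),
    Plot.ruleY([0])
  ]
})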

If you want to pre-process the data to get the max per day, you can reduce it:

maxByDay = d3.rollups(
  d,
  (v) => d3.max(v, (e) => e.v),        // reduce each day's group to its maximum flow
  (e) => new Date(e.d.toDateString())  // group by calendar day (local time)
)

and then plot it. It reduces to 12,347 records. Example:
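Something like this (a minimal sketch; `maxByDay` is an array of [day, max] pairs produced by the rollup above):

Plot.plot({
  marks: [
    // Each element of maxByDay is a [day, max] pair.
    Plot.line(maxByDay, {x: ([day]) => day, y: ([, max]) => max}),
    Plot.ruleY([0])
  ]
})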


Binning is slower here than I’d expect, and it probably has to do with the fact that we need to use bisection to determine which bin each observation falls into. (Calendar intervals such as months are not of uniform length, so we can’t simply quantize.) #454
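To illustrate the difference (a rough sketch, not Plot’s actual internals; the date range is just an example):

// Uniform bins: the bin index is a constant-time computation.
uniformBin = (x, x0, step) => Math.floor((x - x0) / step)

// Calendar bins: month lengths vary, so each observation needs a
// binary search over precomputed month boundaries instead.
monthStarts = d3.utcMonths(new Date("1987-01-01"), new Date("2024-01-01"))

calendarBin = (date) => d3.bisectRight(monthStarts, date) - 1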

Fortunately you can use the group transform instead if you know what time interval makes sense for your data. For example, here is grouping by month, which renders in about 300 ms on my laptop:

Plot.plot({
  y: {
    transform: d => d / 1000
  },
  marks: [
    Plot.line(data, Plot.groupX({y: "max"}, {x: d => d3.utcMonth(d.dateTime), y: "value"})),
    Plot.ruleY([0])
  ]
})

I also saved the file as a file attachment (compressed as a zip so that it fits under the 50 MB limit; it reduces nicely to 4 MB!). It takes ~30 seconds to download from USGS, so this cuts out most of that time, assuming you are satisfied working with a snapshot of the data.
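Reading the snapshot back might look something like this (a sketch; the attachment and entry names are placeholders for whatever the notebook actually uses):

raw = {
  // Unzip the attached snapshot and parse the JSON file inside.
  const archive = await FileAttachment("usgs-snapshot.zip").zip();
  return archive.file("usgs-snapshot.json").json();
}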


The latest release of Plot (0.6.2) has made this use case about 60 times faster (<300 ms). I’ve included an example in the linked notebook above. Thanks for the feedback!
