d3.csv() and DuckDB: caching remote files

Hi there! Can you suggest the best way to cache downloaded files when using d3.csv() and DuckDB in the browser? I use Observable Framework and Google Spreadsheet CSV files to build a marketing and analytics trends dashboard. Here is a code example:

// Load the published sheet as CSV, then register it as a DuckDB table named "trends"
const trendweek = await d3.csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vStElZLXAh6_Zy1z6HqCc6xf6_xOVBJMOzVvNQGLOf6KmsVY07etajt3_gFplnqoCCOi17bBkib_Xv0/pub?gid=2033815436&single=true&output=csv')
const sql = await DuckDBClient.sql({trends: trendweek})

Every time I reload the page or open it in a new tab, it takes too long to download all the CSV data, which can distract users and makes the page feel slow.

So even requesting just the metadata (28 bytes of information) takes nearly as long as the full download:

// A HEAD request asks the server for the response headers only, with no body
(
  await fetch(
    "https://docs.google.com/spreadsheets/d/e/2PACX-1vStElZLXAh6_Zy1z6HqCc6xf6_xOVBJMOzVvNQGLOf6KmsVY07etajt3_gFplnqoCCOi17bBkib_Xv0/pub?gid=2033815436&single=true&output=csv",
    {
      method: "HEAD"
    }
  )
).headers.get("content-length")
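
For comparison, here is a quick way to time both request types (a sketch; timeRequest is a hypothetical helper, and the URL is the same published CSV as above):

// Sketch: compare HEAD vs. full GET latency with performance.now()
const csvUrl = "https://docs.google.com/spreadsheets/d/e/2PACX-1vStElZLXAh6_Zy1z6HqCc6xf6_xOVBJMOzVvNQGLOf6KmsVY07etajt3_gFplnqoCCOi17bBkib_Xv0/pub?gid=2033815436&single=true&output=csv";

async function timeRequest(url, method = "GET") {
  const t0 = performance.now();
  const res = await fetch(url, { method });
  await res.arrayBuffer(); // drain the body (empty for a HEAD response)
  return performance.now() - t0;
}

const headMs = await timeRequest(csvUrl, "HEAD"); // headers only
const getMs = await timeRequest(csvUrl);          // full download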

So that API endpoint is just slow. It doesn’t expose cache headers or a version identifier that I can see (e.g. an ETag).

The Google Sheets API v4 supports gzip and is probably faster too, as it’s designed for programmatic access.
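
Something like this should work if you enable that API, though I haven’t tested it against your sheet (SHEET_ID and API_KEY are placeholders you would have to supply, and the "Sheet1" range is an assumption):

// Sketch: fetch cell values via the Sheets API v4 and reshape them into row objects.
// fetch() negotiates gzip automatically via the Accept-Encoding header.
const SHEET_ID = "YOUR_SHEET_ID"; // placeholder
const API_KEY = "YOUR_API_KEY";   // placeholder

const res = await fetch(
  `https://sheets.googleapis.com/v4/spreadsheets/${SHEET_ID}/values/Sheet1?key=${API_KEY}`
);
const { values } = await res.json(); // 2D array: first row is the header
const [header, ...rows] = values;
const table = rows.map((r) => Object.fromEntries(header.map((h, i) => [h, r[i]])));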

Just checking my notes… this is what I have used in the past:

faster = {
  const sheetId = "1Z7Dja43FepxVOJc5_pMdP0etERM6h0BPAWT74zjdbno";
  const sheetName = "Sheet1"; // Change if your sheet name is different
  const url = `https://docs.google.com/spreadsheets/d/${sheetId}/gviz/tq?tqx=out:csv&sheet=${sheetName}`;

  return d3.csv(url);
}

Which seems faster.

It definitely works, thanks!
But it forces me to make the whole Google document accessible on the internet, and it contains some data I don’t want to share with others :slightly_smiling_face:
The second thing is that users still have to wait for the data to load after reloading the page or opening it in new tabs.
I asked ChatGPT, and it suggested using the browser’s storage API (like localStorage) to store data locally, but maybe there is another, more convenient way to achieve the same result using Observable functionality or another API?

Yes, you can use localStorage to keep the assets in the browser, and perhaps revalidate in the background so repeat page loads have zero latency. You can also download the CSV and serve it through the framework (e.g. as a data loader), which is better optimized. Or you can switch from CSV to compressed Parquet (see hyparquet, a Parquet file parser for JavaScript: https://github.com/hyparam/hyparquet).
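
For the localStorage route, a minimal stale-while-revalidate sketch could look like this (assuming the CSV fits within localStorage’s ~5 MB quota; the "trends-csv" key and the loadTrendsCsv helper are just names I made up):

// Sketch: serve the cached CSV immediately, refresh it in the background
const CSV_URL = "https://docs.google.com/spreadsheets/d/e/2PACX-1vStElZLXAh6_Zy1z6HqCc6xf6_xOVBJMOzVvNQGLOf6KmsVY07etajt3_gFplnqoCCOi17bBkib_Xv0/pub?gid=2033815436&single=true&output=csv";
const KEY = "trends-csv"; // hypothetical cache key

async function loadTrendsCsv() {
  const cached = localStorage.getItem(KEY);
  // Always start a network fetch; update the cache when it completes.
  const fresh = fetch(CSV_URL)
    .then((res) => res.text())
    .then((text) => {
      try { localStorage.setItem(KEY, text); } catch { /* quota exceeded: skip caching */ }
      return text;
    });
  // Use the cached copy if present (zero latency); otherwise wait for the network.
  const text = cached ?? (await fresh);
  return d3.csvParse(text);
}

const trendweek = await loadTrendsCsv();
const sql = await DuckDBClient.sql({trends: trendweek});

The first visit still pays the full download; repeat loads are instant, at the cost of possibly showing slightly stale data until the next reload.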

Also, are you sure it’s the file that is the slow part? Often I find the 7 MB DuckDB dependency is the actual cause of slow starts. Did you measure in the network tab of the browser?

Taking a look at the network measurements is a good idea, I’ll get back with what I find.

Okay, I’m probably doing something wrong, but I set up “Slow 4G” throttling with the cache disabled in DevTools (Chrome on Linux) to test weak network conditions.
Spreadsheet request: waiting and downloading take 700 ms and 400 ms respectively, which is not bad at all.

But then I saw this:

Request to https://yugin-pro.nl/_npm/@duckdb/duckdb-wasm@1.29.0/dist/duckdb-eh.wasm: waiting and downloading take 600 ms and 26,000 ms respectively. :grimacing:
I have no idea why this is.

Yeah, that’s DuckDB, it’s massive. Unless I need SQL, I use hyparquet.
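
In case it helps, reading a remote Parquet file with hyparquet looks roughly like this (a sketch from memory of its README; the URL is a placeholder and exact signatures may vary between versions):

// Sketch: load a remote Parquet file into an array of row objects with hyparquet
import { asyncBufferFromUrl, parquetReadObjects } from "npm:hyparquet";

const file = await asyncBufferFromUrl({ url: "https://example.com/trends.parquet" }); // placeholder URL
const rows = await parquetReadObjects({ file }); // [{column: value, ...}, ...]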
