Notebooks for generating and publishing versioned datasets

I really like the idea of this kind of notebook which fetches, tidies and analyses data.

It also, importantly, exports the final data for use in other notebooks.

However, whenever another notebook uses that data, it has to re-run all the fetching and tidying of the data, and can’t guarantee that the data hasn’t changed or the data sources haven’t disappeared.

It would be useful if there was a way in a cell to say “write this data to a file attachment and export that as a versioned artifact”. Then, a way to import that versioned file attachment into another notebook. Also, a way to re-run the original fetching and processing scripts to generate a new version of the data.

As a bonus, it would be nice to be able to browse through different versions of the same dataset!
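To make the request concrete, here is a rough sketch of the behaviour I have in mind, written as plain TypeScript rather than against any real Observable feature. Every name in it (`VersionedDataset`, `publish`, `get`, `list`) is hypothetical, the store lives in memory instead of in file attachments, and the version id is just a short content hash, which is one possible choice among many.

```typescript
// Hypothetical sketch: none of these names are Observable APIs.
type Row = Record<string, string | number>;

interface DatasetVersion {
  version: string;     // short content hash of the serialized rows
  publishedAt: string; // ISO timestamp supplied by the caller
  rows: Row[];
}

// FNV-1a 32-bit hash, used only to derive a deterministic version id.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 0x01000193 using shifts, then wrap to 32 bits.
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

class VersionedDataset {
  private versions: DatasetVersion[] = [];

  // Publish a snapshot. Identical data maps to the same version id and is stored once.
  publish(rows: Row[], publishedAt: string): string {
    const serialized = JSON.stringify(rows);
    const version = fnv1a(serialized);
    if (!this.versions.some((v) => v.version === version)) {
      // Store a copy so later edits to `rows` cannot change a published version.
      this.versions.push({ version, publishedAt, rows: JSON.parse(serialized) as Row[] });
    }
    return version;
  }

  // Import a pinned version; throws if that version was never published.
  get(version: string): Row[] {
    const found = this.versions.filter((v) => v.version === version)[0];
    if (!found) throw new Error("Unknown dataset version: " + version);
    return found.rows;
  }

  // Browse the published versions, oldest first.
  list(): { version: string; publishedAt: string }[] {
    return this.versions.map((v) => ({ version: v.version, publishedAt: v.publishedAt }));
  }
}
```

A consuming notebook would pin the id returned by `publish` and call `get` with it, so re-running the fetching and tidying code upstream adds a new version without changing what existing consumers see.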

I realise that this is somewhat contrary to the general aim of Observable, which is a place to visualise and examine data that already exists, but it could also be useful as a place to gather and publish that data for use in other notebooks?

Have you seen @tomlarkworthy’s serverless cells? You can build something along those lines on top. Serverless Cells

This recent GitHub project might also be useful to you: GitHub OCTO | Flat Data. Edit: that’s what you just linked to, never mind.


Maybe the two can be combined: make your GitHub Action fetch from a serverless cells page driven by a notebook.

Yeah, serverless-cells could be a good building block. They just got streaming support, so they can chop up quite big datasets now. What kind of datasets are you thinking about? There are static datasets that might get updated once in a while, and there are timeseries, which are ongoing by the nature of the data.

We have all the pieces now

http: Serverless Cells / Endpoint Services / Observable
cron: Schedule Regular Tasks with Cron / Endpoint Services / Observable
storage: Store Files with Storage / Endpoint Services / Observable
auth: Federated IndieAuth Server / Endpoint Services / Observable

The http handler has a CDN so storing intermediate results in the http cache can be simpler than storage.

I am also noticing a strong desire for SQL, judging by the interest in Network efficient SQLite querying from static file hosting. / Tom Larkworthy / Observable. Caching SQL results is a desired feature too (How to cache BigQuery results in a public Notebook with Firebase Storage or a Cloud Bucket / Endpoint Services / Observable).

What would be the ultimate interface? My guess:

Import into storage on a schedule (by admin user)
Transform with SQL (by admin user)
Cache in storage on a schedule (by admin user)
Read from cache (public), as CSV/Arrow (?)

What do you think? Anything else critical? You need to be able to jszip things too I guess, so SQL transformation is maybe a stretch goal and just transforming with code is the base case.
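As a rough illustration of the base case (transforming with code rather than SQL), the four steps could look something like the sketch below. It is not built on the Endpoint Services notebooks: `CachedPipeline`, `refresh`, `read` and `toCsv` are hypothetical names, the "storage" is a field in memory, the source is a plain function standing in for a fetch, and the scheduling and admin check are left out entirely.

```typescript
// Hypothetical sketch: the names below are illustrative, not Endpoint Services APIs.
type DataRow = Record<string, string | number>;

// Serialize rows to CSV, quoting fields that contain commas, quotes or newlines.
// Assumes every row has the same columns as the first row.
function toCsv(rows: DataRow[]): string {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const escape = (value: string | number): string => {
    const text = String(value);
    return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
  };
  const header = columns.map((c) => escape(c)).join(",");
  const lines = rows.map((row) => columns.map((c) => escape(row[c])).join(","));
  return [header].concat(lines).join("\n");
}

class CachedPipeline {
  private source: () => DataRow[];                     // step 1: import
  private transform: (rows: DataRow[]) => DataRow[];   // step 2: transform with code
  private cachedCsv: string | null = null;             // step 3: the cached snapshot

  constructor(source: () => DataRow[], transform: (rows: DataRow[]) => DataRow[]) {
    this.source = source;
    this.transform = transform;
  }

  // Admin side: import, transform and cache. Meant to be called on a schedule.
  refresh(): void {
    this.cachedCsv = toCsv(this.transform(this.source()));
  }

  // Public side: serve the cached CSV without touching the source.
  read(): string {
    if (this.cachedCsv === null) {
      throw new Error("No cached snapshot yet; run refresh() first");
    }
    return this.cachedCsv;
  }
}
```

The point of the split is that `read` never reaches the upstream source, so public readers keep getting the last good snapshot even if the source changes or disappears between refreshes.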

Then how
