I really like the idea of this kind of notebook which fetches, tidies and analyses data.
It also, importantly, exports the final data for use in other notebooks.
However, whenever another notebook uses that data, it has to re-run all the fetching and tidying, and it can’t guarantee that the data hasn’t changed or that the sources haven’t disappeared.
It would be useful if there was a way in a cell to say “write this data to a file attachment and export that as a versioned artifact”. Then, a way to import that versioned file attachment into another notebook. Also, a way to re-run the original fetching and processing scripts to generate a new version of the data.
I realise that this is somewhat contrary to the general aim of Observable, which is a place to visualise and examine data that already exists, but it could also be useful as a place to gather and publish that data for use in other notebooks?
Yeah, serverless-cells could be a good building block. They just got streaming support, so they can chop up quite big datasets now. What kind of datasets are you thinking about? There are static datasets that might get updated once in a while, and time series which are ongoing by the nature of the data.
1. Import into storage on a schedule (by admin user)
2. Transform with SQL (by admin user)
3. Cache in storage on a schedule (by admin user)
4. Read from cache (public), as CSV/Arrow (?)
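The "cache on a schedule" step could be as simple as a staleness check before serving. A minimal sketch, where `maybeRefresh` is a hypothetical name and the real thing would run inside a scheduled serverless cell:

```javascript
// Refresh the cached copy only when it is older than the interval;
// otherwise serve the cache untouched. `fetchFn` stands in for the
// import/transform steps above.
function maybeRefresh(cache, fetchFn, intervalMs, now = Date.now()) {
  if (!cache.data || now - cache.updatedAt > intervalMs) {
    return { data: fetchFn(), updatedAt: now };
  }
  return cache;
}
```

Public readers then only ever hit the cached copy, never the original sources.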
What do you think? Anything else critical? You'd need to be able to JSZip things too, I guess, so SQL transformation is maybe a stretch goal and transforming with code is the base case.