Scalability advice for COVID-19 Preprint Review Notebook

We’ve built this notebook that pulls in the latest preprints from BioRxiv/MedRxiv (currently ~1,300 preprints) using their API and integrates reviews written by immunologists from Mount Sinai. The number of preprints is growing fast (as many as 100 per day) and might reach ~5,000 in a few months. I wanted to ask if you all had any advice on how to maintain performance as the amount of data in a notebook scales up.

For instance, I’m thinking of pre-calculating the word cloud elsewhere if the number of papers grows too large.
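One way the word cloud could be pre-calculated elsewhere is sketched below in Python. This is only an illustration, not the notebook's actual code: the function name `word_frequencies`, the tiny stop-word list, and the sample titles are all assumptions. The idea is that the notebook would load a short list of word counts instead of tokenizing every paper itself.

```python
import json
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "with", "is", "on"}

def word_frequencies(titles, top_n=100):
    """Count lowercase words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        # Words may contain digits and hyphens (e.g. "sars-cov-2").
        words = re.findall(r"[a-z0-9][a-z0-9\-]*", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    # Keep only the most common words so the output stays small.
    return [{"word": w, "count": c} for w, c in counts.most_common(top_n)]

if __name__ == "__main__":
    # Made-up sample titles, for illustration only.
    sample_titles = [
        "Structure of the SARS-CoV-2 spike protein",
        "Spike protein binding in SARS-CoV-2",
    ]
    print(json.dumps(word_frequencies(sample_titles, top_n=5)))
```

The printed JSON could be saved to a file and loaded by the notebook, so the cost of counting words no longer grows with the number of papers at page load.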

Thanks,
Nick

Yes — limiting the amount of data loaded by the notebook, and running pre-computation offline are both great ideas to keep things working quickly and smoothly.

One nice pattern for this sort of thing is our File Attachments feature. You can implement a data processing and filtering cell to run an expensive computation to cut your data down to the minimum size you want, and then use the cell menu to the left to download the pre-processed version of the data.

Then, you can re-upload that file as a File Attachment (perhaps to a different notebook) to provide a more optimized version of the same dataset.
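The cut-down step could also be done outside the notebook. Here is a minimal Python sketch of that variant, assuming each record is a dictionary; the field names in `KEEP_FIELDS` are hypothetical and the real preprint records may look different.

```python
import json

# Hypothetical field names; the real preprint records may differ.
KEEP_FIELDS = ("doi", "title", "date")

def trim_records(records, keep=KEEP_FIELDS):
    """Keep only the listed fields from each record, dropping the rest."""
    return [{k: r[k] for k in keep if k in r} for r in records]

def to_compact_json(records):
    """Serialize without extra whitespace to keep the file small."""
    return json.dumps(records, separators=(",", ":"))

if __name__ == "__main__":
    # Made-up record, for illustration only.
    raw = [{"doi": "10.1101/example", "title": "An example",
            "date": "2020-04-01", "abstract": "A long abstract..."}]
    print(to_compact_json(trim_records(raw)))
```

Dropping large fields the notebook never displays (abstracts, for instance) is usually where most of the size savings come from.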

Doing that sort of thing live — where the notebook would run periodically, updating the dataset without you needing to perform any explicit action — is something that we’re actively thinking about, but don’t yet support on the platform itself. (You can always connect to APIs of your own elsewhere.)

Hi @jashkenas, thanks for the advice. I’m doing something similar for the Clustergrammer-GL heatmap: I’m calculating the clustering in a Jupyter notebook, pushing the result to GitHub, and loading the JSON into the Observable notebook. Similarly, I’m grabbing Altmetric scores for >1,000 papers in a Jupyter notebook, saving them to a JSON file, and then making a GET request to GitHub from Observable.

It would be cool if that kind of thing could be done on Observable in a more automated manner. For instance, if we could load a cell with the results of a long-to-run calculation into another notebook (not sure if this can already be done).

The only thing we really need/want to be automatic is grabbing the latest papers from the BioRxiv API.
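If the bulk of the papers were cached offline, the automatic part could shrink to fetching only recent papers and merging them into the cache. A rough Python sketch of the merge step follows; the name `merge_new_papers` and the use of a `doi` field as the unique key are assumptions, and the actual API call is deliberately left out.

```python
def merge_new_papers(cached, fetched, key="doi"):
    """Return cached papers plus any fetched papers not already cached.

    Assumes every paper is a dict with a unique identifier under `key`.
    """
    seen = {p[key] for p in cached}
    merged = list(cached)  # copy, so the cached list is not mutated
    for paper in fetched:
        if paper[key] not in seen:
            seen.add(paper[key])
            merged.append(paper)
    return merged

if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    cached = [{"doi": "a"}]
    fetched = [{"doi": "a"}, {"doi": "b"}]
    print(merge_new_papers(cached, fetched))
```

With this split, only the most recent day or two of papers would need to come from the live API on each load.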

How about using a database connection? https://observablehq.com/@observablehq/connecting-to-databases?collection=@observablehq/introduction

Thanks @maliky, I’ll look into it.