Scalability advice for COVID-19 Preprint Review Notebook

cornhundred · April 9, 2020, 6:03pm

We’ve built this notebook that pulls in the latest preprints from BioRxiv/MedRxiv (currently ~1,300 preprints) using their API and integrates reviews written by immunologists from Mount Sinai. The number of preprints is growing fast (as many as 100 per day) and might reach ~5,000 in a few months. I wanted to ask if you all had any advice on how to maintain performance in a notebook as the amount of data scales up.

For instance, I’m thinking of pre-calculating the word cloud elsewhere if the number of papers grows too large.
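
A minimal sketch of what that offline step might look like, assuming the word cloud only needs per-word counts taken from paper titles. The stop-word list, the `text`/`count` field names, and the function names are illustrative assumptions, not taken from the actual notebook:

```python
import json
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "with", "on"}


def word_counts(titles, top_n=100):
    """Return the top_n most frequent words as [{"text": ..., "count": ...}]."""
    counts = Counter()
    for title in titles:
        # Keep hyphenated tokens such as "sars-cov-2" together.
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", title.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return [{"text": w, "count": c} for w, c in counts.most_common(top_n)]


def save_word_counts(titles, path, top_n=100):
    """Write the counts to a small JSON file that the notebook can load."""
    with open(path, "w") as f:
        json.dump(word_counts(titles, top_n), f)
```

The notebook would then load only the small counts file rather than tokenizing every title each time it runs.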

Thanks,
Nick

jashkenas · April 9, 2020, 8:42pm

Yes — limiting the amount of data loaded by the notebook, and running pre-computation offline are both great ideas to keep things working quickly and smoothly.

One nice pattern for this sort of thing is our File Attachments feature. You can implement a data processing and filtering cell to run an expensive computation to cut your data down to the minimum size you want, and then use the cell menu to the left to download the pre-processed version of the data.

Then, you can re-upload that file (perhaps to a different notebook) as a file attachment to provide a more optimized version of the same dataset.
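
The same cut-down step could also be run outside the notebook. A rough sketch, assuming each preprint is a dict and that the notebook only reads a few fields; the field names in `KEEP_FIELDS` are hypothetical:

```python
import json

# Assumed field names; adjust to whatever the notebook actually reads.
KEEP_FIELDS = ("doi", "title", "date")


def slim_records(records, keep_fields=KEEP_FIELDS):
    """Drop every field the notebook does not need (e.g. long abstracts)."""
    return [{k: r[k] for k in keep_fields if k in r} for r in records]


def to_compact_json(records):
    """Serialize without extra whitespace to keep the file small."""
    return json.dumps(records, separators=(",", ":"))
```

The resulting string can be written to a file and attached to the notebook in place of the full dataset.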

Doing that sort of thing live — where the notebook would run periodically, updating the dataset without you needing to perform any explicit action — is something that we’re actively thinking about, but don’t yet support on the platform itself. (You can always connect to APIs of your own elsewhere.)

cornhundred · April 9, 2020, 9:32pm

Hi @jashkenas, thanks for the advice. I’m doing something similar for the Clustergrammer-GL heatmap: I’m calculating the clustering in a Jupyter notebook, pushing the result to GitHub, and loading the JSON into the Observable notebook. Similarly, I’m grabbing Altmetric scores for >1,000 papers from Altmetric in a Jupyter notebook, saving them to a JSON file, and then fetching that file from GitHub with a GET request in Observable.

It would be cool if that kind of thing could be done on Observable in a more automated manner. For instance, if we could load a cell holding the results of a long-running calculation into another notebook (not sure if this can already be done).

The only thing we really need/want to be automatic is grabbing the latest papers from the BioRxiv API.
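
One way to keep that automatic part cheap is to request only records newer than what is already cached and merge them in. A sketch under stated assumptions: each record is a dict with `doi` and an ISO-formatted `date` field (both names are hypothetical), and the actual API call is left out because its parameters are not described here:

```python
def latest_date(records, default="2020-01-01"):
    """Most recent ISO date in the cache; ISO date strings sort correctly as text."""
    return max((r["date"] for r in records), default=default)


def merge_preprints(cached, fetched):
    """Merge fetched records into the cache, keyed on DOI.

    A fetched record replaces a cached one with the same DOI (for example,
    a revised version). Returns the merged list and the number of new DOIs.
    """
    by_doi = {r["doi"]: r for r in cached}
    added = 0
    for record in fetched:
        if record["doi"] not in by_doi:
            added += 1
        by_doi[record["doi"]] = record
    return list(by_doi.values()), added
```

A scheduled job could call `latest_date` on the cached file, ask the API for anything newer, and write the merged result back for the notebook to load.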

maliky · April 9, 2020, 10:03pm

How about using a database connection? https://observablehq.com/@observablehq/connecting-to-databases?collection=@observablehq/introduction

cornhundred · April 9, 2020, 10:28pm

Thanks @maliky, I’ll look into it.
