Retrieving data from Google Drive / Box

I have a 96 MB .csv file I would like to load into a notebook. It appears to be too large to attach directly to the notebook, but I could easily upload it to Google Drive or Box. However, I don’t see a way to then retrieve the file from either of those services. I don’t know anything about SQL, so maybe that is the solution, but I’d love to know any info about how to get this data into my notebook. Thanks!

Hi @cskn. Unfortunately I don’t think Google Drive supports a CORS-accessible URL, which is what you’ll need to access the file from your browser in Observable.

I would probably create a repository on GitHub for your file, and then use the GitHub raw link to fetch the file in your notebook.

For example, here’s a CSV file hosted on GitHub: https://github.com/maxogden/csv-spectrum/blob/v1.0.0/csvs/comma_in_quotes.csv

If you click the Raw button, it’ll give you the raw URL:

https://raw.githubusercontent.com/maxogden/csv-spectrum/v1.0.0/csvs/comma_in_quotes.csv

You can then fetch and parse this CSV file in Observable as:

d3.csvParse(await fetch("https://raw.githubusercontent.com/maxogden/csv-spectrum/v1.0.0/csvs/comma_in_quotes.csv").then(response => response.text()))

Thanks for the tip. The issue I’m running into is that it seems like GitHub’s max file size is 25 MB. Mine is 96 MB. Do you know of any other service for getting a .csv uploaded, or any other solutions?

I believe the 25 MB limit only applies to files uploaded through the browser. If you push from the terminal using git, individual files can be up to 100 MB, so yours should be fine.
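
For reference, a rough sketch of what that looks like from the terminal (the file name and branch are just placeholders for your own):

# inside a local clone of your GitHub repository
git add mydata.csv
git commit -m "add csv dataset"
git push origin main   # or master, depending on your default branch

Then grab the raw URL for the pushed file as described above.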

I’ve also written a tool that creates a GitHub Gist from the command line:

Thanks for passing this along! As I was trying to implement it (I’m very much a novice on the command line), I spoke with a friend who knows much more about programming and the internet than me (but doesn’t know about Observable / Vega-Lite), and he suggested that retrieving a 96 MB .csv file might actually be a horrible idea in the first place: depending on how Observable / Vega-Lite work, users would each have to download the file, and then anytime a chart was re-rendered based on user input, all 96 MB might have to be re-processed. This may be too off-topic for this thread, but I figured I’d see if you had thoughts about that.


96 MB is fairly big but not necessarily prohibitive. It depends on how quickly you want your notebook to load and whether your readers would be willing to wait.

A common path is to start with the large dataset for exploratory visualization, figure out what’s interesting, and then download and re-attach a much smaller dataset to use in your published notebook. For example, if you have a dataset with dozens of columns available in a cell named data, you could define another cell that extracts just two of the columns:

data.map(({foo, bar}) => ({foo, bar}))

Then you can click Download CSV in the cell menu for this cell, and you’ll get a new, smaller CSV file that you can use to replace the larger one. (The original file will still be available in your history if you want to go back, or you could start a new notebook if you want to keep the two separate.)

And there are a variety of other techniques for making datasets smaller, such as aggregating using d3.group or d3.rollup.
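
For instance, here’s a minimal sketch of aggregating with d3.rollup, assuming a hypothetical dataset with a date column and a numeric value column (swap in your own fields):

aggregated = Array.from(
  d3.rollup(
    data,
    rows => d3.sum(rows, d => d.value), // reduce each group to a single total
    d => d.date                         // group rows by date
  ),
  ([date, total]) => ({date, total})    // turn the resulting Map back into an array of objects
)

You could then Download CSV on this aggregated cell, just like the column-extraction example above.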

This approach can also be used to reduce files that are larger than 15 MB, by working with local files instead of file attachments.
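
As a rough sketch of the local-file route (assuming the standard Observable Inputs library; the cell names are just examples), each of the following would be its own cell:

// a file input that reads a file from your machine without attaching it to the notebook
viewof localFile = Inputs.file({label: "Choose a local CSV", accept: ".csv", required: true})

// parse the selected file as a typed CSV
localData = localFile.csv({typed: true})

The large file never gets attached to the notebook; you reduce localData and only attach the smaller result.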


Just yesterday I was working with a 120 MB CSV data file that I reduced to 500 kB by using a binary format. You can cook your own by building an ArrayBuffer, or use industry standards such as Apache Arrow (which Vega reads natively) or NumPy’s .npy format.
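
For the roll-your-own route, here’s a minimal sketch of reading a packed binary attachment back in Observable (the file name and the Float32 layout are just assumptions about how the data was written):

// interpret the raw bytes of the attachment as 32-bit floats
values = new Float32Array(await FileAttachment("values.bin").arrayBuffer())

For Apache Arrow, you would instead parse the file with the Arrow JavaScript library, or pass it straight to Vega, which reads it natively.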

Here are a few links:
