Retrieving data from Google Drive / Box

I have a 96 MB .csv file I would like to load into a notebook. It appears to be too large to attach directly to the notebook, but I could easily upload it to Google Drive or Box. However, I don’t see a way to then retrieve the file from either of those services. I don’t know anything about SQL, so maybe that is the solution, but I’d love to know any info about how to get this data into my notebook. Thanks!

Hi @cskn. Unfortunately I don’t think Google Drive supports a CORS-accessible URL, which is what you’ll need to access the file from your browser in Observable.

I would probably create a repository on GitHub for your file, and then use the GitHub raw link to fetch the file in your notebook.

For example, here’s a CSV file hosted on GitHub: https://github.com/maxogden/csv-spectrum/blob/v1.0.0/csvs/comma_in_quotes.csv

If you click the Raw button, it’ll give you the raw URL:

https://raw.githubusercontent.com/maxogden/csv-spectrum/v1.0.0/csvs/comma_in_quotes.csv

You can then fetch and parse this CSV file in Observable as:

d3.csvParse(await fetch("https://raw.githubusercontent.com/maxogden/csv-spectrum/v1.0.0/csvs/comma_in_quotes.csv").then(response => response.text()))

Thanks for the tip. The issue I’m running into is that it seems like GitHub’s max file size is 25 MB. Mine is 96 MB. Do you know of any other service for getting a .csv uploaded, or any other solutions?

I believe the 25 MB limit only applies to files uploaded through the browser. If you push from the terminal using git, individual files can be up to 100 MB, so yours should be fine.
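
For reference, a rough sketch of what that looks like from the terminal (the file name and branch are just placeholders for your own):

# inside a local clone of your GitHub repository
git add mydata.csv
git commit -m "add csv dataset"
git push origin main   # or master, depending on your default branch

Then grab the raw URL for the pushed file as described above.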

I’ve also written a tool that creates a GitHub Gist from the command line:

Thanks for passing this along! As I was trying to implement it (I’m very much a novice on the command line), I spoke with a friend who knows much more about programming and the internet than me (but doesn’t know about Observable / Vega-Lite), and he suggested that retrieving a 96 MB .csv file might actually be a horrible idea in the first place: depending on how Observable / Vega-Lite work, users would each have to download the file, and then anytime a chart was re-rendered based on user input, all 96 MB might have to be re-processed. This may be too off-topic for this thread, but I figured I’d see if you had thoughts about that.


96 MB is fairly big but not necessarily prohibitive. It depends on how quickly you want your notebook to load and whether your readers would be willing to wait.

A common path is to start with the large dataset for exploratory visualization, figure out what’s interesting, and then download and re-attach a much smaller dataset to use in your published notebook. For example, if you have a dataset with dozens of columns available in a cell named data, you could define another cell that extracts just two of the columns:

data.map(({foo, bar}) => ({foo, bar}))

Then you can click Download CSV in the cell menu for this cell, and you’ll get a new, smaller CSV file that you can use to replace the larger one. (The original file will still be available in your history if you want to go back, or you could start a new notebook if you want to keep the two separate.)

And there are a variety of other techniques for making datasets smaller, such as aggregating using d3.group or d3.rollup.
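
For instance, here’s a minimal sketch of aggregating with d3.rollup, assuming a hypothetical dataset with a date column and a numeric value column (swap in your own fields):

aggregated = Array.from(
  d3.rollup(
    data,
    rows => d3.sum(rows, d => d.value), // reduce each group to a single total
    d => d.date                         // group rows by date
  ),
  ([date, total]) => ({date, total})    // turn the resulting Map back into an array of objects
)

You could then Download CSV on this aggregated cell, just like the column-extraction example above.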

This approach can also be used to reduce files that are larger than 15 MB, by working with local files instead of file attachments.
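
As a rough sketch of the local-file route (assuming the standard Observable Inputs library; the cell names are just examples), each of the following would be its own cell:

// a file input that reads a file from your machine without attaching it to the notebook
viewof localFile = Inputs.file({label: "Choose a local CSV", accept: ".csv", required: true})

// parse the selected file as a typed CSV
localData = localFile.csv({typed: true})

The large file never gets attached to the notebook; you reduce localData and only attach the smaller result.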


Just yesterday I was working with a 120 MB CSV data file that I reduced to 500 kB by using a binary format. You can cook your own by building an ArrayBuffer, or use industry standards such as Apache Arrow (which Vega reads natively) or NumPy’s .npy format.
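
For the roll-your-own route, here’s a minimal sketch of reading a packed binary attachment back in Observable (the file name and the Float32 layout are just assumptions about how the data was written):

// interpret the raw bytes of the attachment as 32-bit floats
values = new Float32Array(await FileAttachment("values.bin").arrayBuffer())

For Apache Arrow, you would instead parse the file with the Arrow JavaScript library, or pass it straight to Vega, which reads it natively.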

Here are a few links:
