What's the best data format for a raw data file of 50-300 MB? And if a gist won't work, where is a better place to host it for free?

I had been using CSV for the raw data since the beginning of the project, and the initial versions fit within the 15 MB FileAttachment limit. After some schema evolution, a few extra columns needed nested data types, so I switched to JSON. The unprocessed data is newline-delimited JSON; I then turn it into a valid JSON array of objects by prefixing each record line with a comma and bracketing the whole file, keeping one record per line so rows are easy to count with wc and the file is easy to load with d3.json(…). The total number of rows has grown to ~50K and the file size is now more than 15 MB, so it can no longer be embedded as a FileAttachment. I then found that a secret gist is a good place to host the raw data, but now that the file has grown past 50 MB it fails to upload to the gist (is 50 MB the gist limit?).

[ {"date":"xxx","id": "...","articleId":"...","userName":"...","tags":[],"appreciatedBy":[{"date":"","amount":3,"userName":"..."},{...}]... }
, {"date":"xxx","id": "...","articleId":"...", ... }
, {"date":"xxx","id": "...","articleId":"...", ... }
...
, {"date":"xxx","id": "...","articleId":"...","tags":["tag1","tag2"] ... } ]

I tried converting the ndjson to the Avro binary format (~17 MB uncompressed) so I could keep using gist, but require('avsc') fails because it is not a loadable module. It seems to have a for-browser build, but loading the unpkg URL also fails:

require('https://unpkg.com/avsc@5.4.22/etc/browser/avsc.js')

I just found the @theneuralbit/introduction-to-apache-arrow notebook; not sure if Arrow is a better file format for this use case, and not sure yet whether it supports nested fields?

And considering the current growth rate, the raw data file may reach 300 MB within the next year. What would be a better data format for loading all the raw records into Observable?

Thanks,

Hi! Have you considered importing the data to an external relational database? You can connect to external DBs from your notebook through Observable’s database connectors.

If you’re still exploring your options, AWS also has several free DB-related products.
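A minimal sketch of what that could look like, assuming a database connector named "mydb" has already been configured for the notebook (the connection, table, and column names below are placeholders):

// Hypothetical Observable cells: query only the slice of data the notebook
// needs, instead of downloading the whole raw file.
client = DatabaseClient("mydb")

recent = client.query(
  "SELECT date, id, articleId, userName FROM records ORDER BY date DESC LIMIT 1000"
)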


Not sure if this is an ideal solution, but I found this notebook by @mootari to be really helpful: https://observablehq.com/@mootari/data-images

I tried it out in this notebook: https://observablehq.com/@a10k/zinzi. An SQLite DB is loaded as an image and then stored with localforage in browser storage so that reopening the notebook is much quicker.

I used DB Browser (a free SQLite desktop app) to load the CSV files directly into an SQLite file as tables, then converted that file to an image using the notebook above. I have yet to compare the performance/pain of crossfilter.js/Arrow versus plain SQLite in the browser (this notebook https://observablehq.com/@a10k/sqlite-sql-js-with-webworkers uses sql.js with web workers).
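As a rough sketch of the sql.js side (hedged: the database URL, the cache key db.sqlite, and the table name records are placeholders, and the exact require paths may need adjusting for other versions):

// Hypothetical Observable cells: cache the raw SQLite bytes with localforage,
// then open them with sql.js and run a query.
db = {
  const [initSqlJs, localforage] = await Promise.all([
    require("sql.js@1.5.0/dist/sql-wasm.js"),
    require("localforage@1.9.0")
  ]);
  const SQL = await initSqlJs({
    locateFile: (file) => `https://unpkg.com/sql.js@1.5.0/dist/${file}`
  });

  // Reuse the cached bytes on later visits so the notebook reopens quickly.
  let bytes = await localforage.getItem("db.sqlite");
  if (!bytes) {
    const buffer = await fetch("https://example.com/db.sqlite").then((r) => r.arrayBuffer());
    bytes = new Uint8Array(buffer);
    await localforage.setItem("db.sqlite", bytes);
  }
  return new SQL.Database(bytes);
}

rows = db.exec("SELECT date, id, articleId FROM records LIMIT 100")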

For hosting, Google Firebase provides free storage: https://observablehq.com/@a10k/serverless-notebooks-with-firebase. You can set up a free account, then use the Storage API to upload and download files. (You can also just use their web interface to upload files and make them public, without having to do it programmatically.)
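Once a file is public, loading it is just a fetch. A hedged sketch, where the bucket name and object path are placeholders (copy the real download URL from the Firebase console):

// Hypothetical Observable cell: load a public JSON file hosted on Firebase
// Storage. "my-project.appspot.com" and "records.json" are placeholders.
records = fetch(
  "https://firebasestorage.googleapis.com/v0/b/my-project.appspot.com/o/records.json?alt=media"
).then((r) => r.json())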


Thanks, both database solutions sound good; I haven't tried them yet. So far my dataset is just a bit over 50 MB of plain text.
Before reaching 50 MB of JSON it was still acceptable to work with it all in browser memory; I guess the real in-memory limit of the browser is somewhat above 50 MB, and maybe 100 MB of plain-text JSON would still be OK.

Using a PNG header to compress the JSON should cut the file size to below 20 MB; that's a nice trick I'd like to try.
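The decoding side could look roughly like this. This is only a hedged sketch, not the exact scheme of @mootari/data-images: it assumes the bytes were packed into the RGB channels with alpha fixed at 255, and that the first 4 decoded bytes store the payload length.

// Hypothetical decoder for a "data image": read bytes back out of a PNG's
// RGB channels via canvas. Assumes alpha was kept at 255 when encoding (to
// avoid premultiplied-alpha corruption) and a 4-byte big-endian length prefix.
async function loadDataImage(url) {
  const img = new Image();
  img.crossOrigin = "anonymous";
  img.src = url;
  await img.decode();

  const canvas = document.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);

  const {data} = ctx.getImageData(0, 0, img.width, img.height);
  const bytes = [];
  for (let i = 0; i < data.length; i += 4) {
    bytes.push(data[i], data[i + 1], data[i + 2]); // skip the alpha channel
  }

  const length = new DataView(Uint8Array.from(bytes.slice(0, 4)).buffer).getUint32(0);
  const payload = Uint8Array.from(bytes.slice(4, 4 + length));
  return JSON.parse(new TextDecoder().decode(payload));
}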

Not sure if @theneuralbit can see this thread and comment on the practical limit for working with the Arrow format entirely in browser memory? I see the example scrabble.arrow in use is ~45 MB on gist and still performs fast in browser memory. Has anyone worked with an even bigger file in browser memory, and what is the maximum Arrow file size that can be used on Observable with acceptable responsiveness?

➸ curl -I https://gist.githubusercontent.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5/raw/c146baf28a8e78cfe982c6ab5015207c4cbd84e3/scrabble.arrow
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 46978004    (~45 MB here, close to the apparent 50 MB gist limit)
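For anyone trying the same thing, loading an .arrow file like that into a notebook could look roughly like this. A sketch only: it assumes require("apache-arrow") resolves to a UMD bundle recent enough to expose tableFromIPC (older releases used Table.from instead).

// Hypothetical Observable cells: fetch an Arrow IPC file and read it as a table.
Arrow = require("apache-arrow")

scrabble = {
  const url = "https://gist.githubusercontent.com/TheNeuralBit/64d8cc13050c9b5743281dcf66059de5/raw/c146baf28a8e78cfe982c6ab5015207c4cbd84e3/scrabble.arrow";
  const buffer = await fetch(url).then((r) => r.arrayBuffer());
  return Arrow.tableFromIPC(new Uint8Array(buffer));
}

scrabble.numRows  // ~1.5M rows in this example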

There shouldn’t be a specific limit. How much memory you can utilize depends entirely on your system’s resources.

I was asking about the Arrow file size limit empirically. For example, when comparing SVG and canvas, ~10k visual elements is an empirical limit for SVG, beyond which canvas gives better performance on the majority of computers.

Talking about the raw data file size limit in memory, there should still be a number: a minimum size that is definitely too big to handle in memory. Maybe 300 MB for plain-text formats (CSV or JSON)?
The example scrabble.arrow is ~45 MB with 1,542,642 records on the client and still performs well; the true limit may be 10x that size?

That depends entirely on your target audience and will vary greatly from device to device. If you don't want to exclude any users, the first step would be not to force a 300 MB download onto them. Instead, store your data in a (remote) database and query only the subset that you want to display.

I am well aware of the general answer that "it depends on the target audience";

but I am still asking for an empirical file size limit here, and I have given the example of 10k visual elements in SVG.

Maybe it's just that not many people have played with Arrow files yet? How about other binary or compressed dataset formats loaded on the fly (csv.gz, json.gz, etc.)? What is the file size limit for those, empirically?
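Loading a .json.gz on the fly could be sketched like this (hedged: the URL is a placeholder, and it assumes pako's ungzip with the to: "string" option):

// Hypothetical Observable cells: fetch a gzipped JSON file and inflate it in
// the browser with pako.
pako = require("pako@1.0.11/dist/pako.min.js")

records = {
  const buffer = await fetch("https://example.com/records.json.gz").then((r) => r.arrayBuffer());
  const text = pako.ungzip(new Uint8Array(buffer), {to: "string"});
  return JSON.parse(text);
}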

Because the Introduction to Apache Arrow page above loads a ~45 MB Arrow file on the fly, could that mean ~45 MB is OK for the majority of Observable users? How about 100 MB? How about 300 MB?
(Maybe only 1 GB+ is a definite no.)

How about I rephrase it this way: targeting the majority of users who can comfortably load many Observable pages (rather than insisting on not excluding any users).

Most computers in 2020, at least for people who do video meetings every week, should already have enough network bandwidth and enough CPU/memory resources.

Google Sheets has a way to publish a sheet as a CSV or JSON document.
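Loading a published sheet could then be as simple as this sketch (hedged: the URL below is a placeholder; use whatever File → Publish to the web gives you for CSV output):

// Hypothetical Observable cells: load a Google Sheet that was published as CSV.
d3dsv = require("d3-dsv@1")

rows = {
  const url = "https://docs.google.com/spreadsheets/d/e/PLACEHOLDER/pub?output=csv";
  const text = await fetch(url).then((r) => r.text());
  return d3dsv.csvParse(text);
}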