I had been using CSV for the raw data since the beginning of the project, and the early versions fit within the 15MB FileAttachment limit. As the schema evolved, some extra columns needed nested data types, so I switched to JSON. The unprocessed data is newline-delimited JSON, and I wrap the lines with `[ , , , ]` to turn it into one valid JSON array of objects, which keeps it easy to count rows with `wc -l` and easy to load with `d3.json(…)` (a rough conversion sketch is below the sample). The total has since grown to ~50K rows and the file is now over 15MB, so it can no longer be embedded as a FileAttachment. A secret gist turned out to be a good place to host the raw data, but once the file grew past 50MB the upload to the gist started failing (50MB seems to be the gist limit?). The records look like this:
[ {"date":"xxx","id": "...","articleId":"...","userName":"...","tags":[],"appreciatedBy":[{"date":"","amount":3,"userName":"..."},{...}]... }
, {"date":"xxx","id": "...","articleId":"...", ... }
, {"date":"xxx","id": "...","articleId":"...", ... }
...
, {"date":"xxx","id": "...","articleId":"...","tags":["tag1","tag2"] ... } ]
I have tried converting the NDJSON to the Avro binary file format (~17MB uncompressed) so I could keep using a gist, but `require('avsc')` fails with "not a module". There seems to be a for-browser build, but loading it from the unpkg URL fails too:
require('https://unpkg.com/avsc@5.4.22/etc/browser/avsc.js')
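For reference, the Node-side conversion itself is not the problem (avsc loads fine in Node); it was roughly along these lines, though this is a sketch rather than my exact script, and inferring the schema from one sample record is a shortcut that a hand-written schema would handle better:

```js
// Rough sketch of the NDJSON -> Avro conversion in Node (file names are
// placeholders). require('avsc') works here; the failure above only happens
// when trying to load it inside an Observable notebook.
const fs = require("fs");
const avsc = require("avsc");

const records = fs
  .readFileSync("records.ndjson", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map(JSON.parse);

// Inferring the schema from a single record is a shortcut; empty arrays like
// "tags": [] can trip up inference, so a hand-written schema is safer.
const type = avsc.Type.forValue(records[0]);

// Stream every record into an Avro object-container file.
const encoder = avsc.createFileEncoder("records.avro", type);
for (const record of records) encoder.write(record);
encoder.end();
```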
I just found @theneuralbit/introduction-to-apache-arrow. I'm not sure whether Arrow is a better file format for this use case, and I'm not yet sure whether it supports nested fields.
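If Arrow does fit, this is roughly how I'd expect to read it in a cell (untested sketch: the gist URL is a placeholder, and loading the library may itself need a browser-specific build, like the avsc issue above):

```js
arrowTable = {
  // Untested sketch: fetch an Arrow IPC file from a raw gist URL (placeholder)
  // and parse it client-side. Older apache-arrow releases expose Arrow.Table.from;
  // newer ones use tableFromIPC instead.
  const Arrow = await require("apache-arrow");
  const response = await fetch("https://gist.githubusercontent.com/<user>/<id>/raw/records.arrow");
  const bytes = new Uint8Array(await response.arrayBuffer());
  return Arrow.Table.from(bytes);
}
```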
Also, at the current growth rate the raw data file may reach ~300MB within the next year. What would be a better data format for loading all the raw records into Observable?
Thanks,