File attachement and loading over network efficiency ?

maliky · April 12, 2020, 1:24pm

I have a notebook with 2 datasets of 8 tables each. In the biggest dataset, 3 of the tables have about 1 millions records. I am running in loading time issue.

Would there be significative difference between loading files from gist and using file attachement ?

My understanding is that using DB connexion would be ideal. But I am not yet able to do that, so I am exploring file attachements with the following function:

async function get_data_(table_, course_ = course) {
  let dataset1_foo = FileAttachment("dataset1_circuit.json").json(), // ~
    dataset1_baz = FileAttachment("dataset1_autNcir.json").json(); // ~ 900 000
  // 6 more tables ...

  let dataset2_foo = d3.csvParse(
      await FileAttachment("dataset2_circuit.csv").text(),
      d3.autoType
    ),
    dataset2_alice = FileAttachment("dataset2_auteur.json").json();
  // 6 more tables ...

  return {
    dataset1: {
      circuit: dataset1_foo,
      auteur: dataset1_baz
    },
    dataset2: {
      circuit: dataset2_foo,
      autNcir: dataset2_alice
    }
  }[course_][table_];
}

async is really new to me.

Is calling get_data_ loading all the files memory to return juste one ? Or does it, as I hope, only load in memory the file that it returns?

jashkenas · April 13, 2020, 5:54pm

Hi @maliky,

It looks like, the way you have things set up currently, all of the remote FileAttachments are going to be fetched and loaded into memory, each time you call the function. After the function completes, most of that memory should then be garbage collected (the courses and tables that are not selected at the end).

A database connection (or an API) would indeed be better, because it would allow you to transfer only the subset of data you actually want to use.

But, if you just want to optimize your use of FileAttachment, I would recommend placing them in separate cells, so that they’re only requested once when the notebook is loaded, and not every time get_data_ is called. Something like:

dataset1_foo = FileAttachment("dataset1_circuit.json").json()

dataset1_baz = FileAttachment("dataset1_autNcir.json").json()

… and so on.

maliky · April 13, 2020, 6:57pm

Thank you. You helped me do what I wanted to do but did not do because I though it would be too much code repetition. In the end it is not. I can select one database or the other just by adding one test before each table loading. I understand that it is not possible to build fileAttachment names on the fly.

In the end, I may go the the DB connexion.

jashkenas · April 13, 2020, 7:13pm

Actually, to elaborate on this further — although it’s not possible to build FileAttachment names dynamically (because we use static analysis to make them securely available to your notebook) — they’re not actually requested until you use them, so if you change your setup to something like this:

dataset_1 = FileAttachment("dataset_1.json")

dataset_2 = FileAttachment("dataset_2.json")

data = (toggled ? dataset_1 : dataset_2).json()

… only one of the datasets will actually be requested and loaded into memory. If you’re planning on eventually loading everything anyway, I’d still go with the first approach — but if any given use of the notebook will only load a subset of the files, this might help make it possible to leave the other half unloaded.

maliky · April 13, 2020, 7:20pm

Is your code different from a memory point of view, from what I have ?

dataset_1= FileAttchment('dataset_1.json').json()

dataset_2 = d3.csvParse(await FileAttachment("dataset_2.csv").text(), d3.autoType);

data = toggled ? dataset_1 : dataset_2;

Note:
Event with carefully loading file attachment there is still too much latency to make the visualization usable. What are common DB service used with observable ? Amazon RDS ?

jashkenas · April 13, 2020, 7:30pm

Yes. Mine only loads one of the two files into memory , because .json() is when the loading actually occurs, not FileAttachment(). Your version loads them both.

Postgres, MySQL, BigQuery and Snowflake are the database connections we currently support. Amazon RDS should work just fine. For more info, see: Connecting to databases / Observable | Observable.

Another option that may or may not work for your use case is saving your dataset as an SQLite DB file, and using SQL.js to query it all in the browser. Here’s a demonstration of doing things that way, with a 93.5MB database: SQLite / Mike Bostock | Observable

maliky · April 13, 2020, 7:38pm

Excellent! I think the sqlite is the most portable solution for me.
Great !

Topic		Replies	Views
Problem with awaiting a FileAttachment Community Help	6	1178	September 14, 2020
Evaluate from FileAttachment Community Help	9	882	September 16, 2020
FileAttachment now supports CSV (and TSV)	7	1821	February 7, 2021
how to (elegantly) parse a *.csv FileAttachment ? Community Help	3	2378	November 27, 2019
Loading a duckdb database Community Help	7	509	March 23, 2024

File attachement and loading over network efficiency ?

Related topics