File attachement and loading over network efficiency ?

I have a notebook with 2 datasets of 8 tables each. In the biggest dataset, 3 of the tables have about 1 millions records. I am running in loading time issue.

Would there be significative difference between loading files from gist and using file attachement ?

My understanding is that using DB connexion would be ideal. But I am not yet able to do that, so I am exploring file attachements with the following function:

async function get_data_(table_, course_ = course) {
  let dataset1_foo = FileAttachment("dataset1_circuit.json").json(), // ~
    dataset1_baz = FileAttachment("dataset1_autNcir.json").json(); // ~ 900 000
  // 6 more tables ...

  let dataset2_foo = d3.csvParse(
      await FileAttachment("dataset2_circuit.csv").text(),
      d3.autoType
    ),
    dataset2_alice = FileAttachment("dataset2_auteur.json").json();
  // 6 more tables ...

  return {
    dataset1: {
      circuit: dataset1_foo,
      auteur: dataset1_baz
    },
    dataset2: {
      circuit: dataset2_foo,
      autNcir: dataset2_alice
    }
  }[course_][table_];
}

async is really new to me.

Is calling get_data_ loading all the files memory to return juste one ? Or does it, as I hope, only load in memory the file that it returns?

Hi @maliky,

It looks like, the way you have things set up currently, all of the remote FileAttachments are going to be fetched and loaded into memory, each time you call the function. After the function completes, most of that memory should then be garbage collected (the courses and tables that are not selected at the end).

A database connection (or an API) would indeed be better, because it would allow you to transfer only the subset of data you actually want to use.

But, if you just want to optimize your use of FileAttachment, I would recommend placing them in separate cells, so that they’re only requested once when the notebook is loaded, and not every time get_data_ is called. Something like:

dataset1_foo = FileAttachment("dataset1_circuit.json").json()
dataset1_baz = FileAttachment("dataset1_autNcir.json").json()

… and so on.

Thank you. You helped me do what I wanted to do but did not do because I though it would be too much code repetition. In the end it is not. I can select one database or the other just by adding one test before each table loading. I understand that it is not possible to build fileAttachment names on the fly.

In the end, I may go the the DB connexion.

Actually, to elaborate on this further — although it’s not possible to build FileAttachment names dynamically (because we use static analysis to make them securely available to your notebook) — they’re not actually requested until you use them, so if you change your setup to something like this:

dataset_1 = FileAttachment("dataset_1.json")
dataset_2 = FileAttachment("dataset_2.json")
data = (toggled ? dataset_1 : dataset_2).json()

… only one of the datasets will actually be requested and loaded into memory. If you’re planning on eventually loading everything anyway, I’d still go with the first approach — but if any given use of the notebook will only load a subset of the files, this might help make it possible to leave the other half unloaded.

Is your code different from a memory point of view, from what I have ?

dataset_1= FileAttchment('dataset_1.json').json()
dataset_2 = d3.csvParse(await FileAttachment("dataset_2.csv").text(), d3.autoType);
data = toggled ? dataset_1 : dataset_2;

Note:
Event with carefully loading file attachment there is still too much latency to make the visualization usable. What are common DB service used with observable ? Amazon RDS ?

Yes. Mine only loads one of the two files into memory , because .json() is when the loading actually occurs, not FileAttachment(). Your version loads them both.

Postgres, MySQL, BigQuery and Snowflake are the database connections we currently support. Amazon RDS should work just fine. For more info, see: https://observablehq.com/@observablehq/connecting-to-databases.

Another option that may or may not work for your use case is saving your dataset as an SQLite DB file, and using SQL.js to query it all in the browser. Here’s a demonstration of doing things that way, with a 93.5MB database: https://observablehq.com/@mbostock/sqlite

Excellent! I think the sqlite is the most portable solution for me.
Great !

1 Like