What is a guaranteed way to force a cell to be evaluated?

I have a notebook where the data is structured into objects. The shape (number of, depth) of the objects is created first and then other cells fill in the data. This is partially done for debugging, so that I can check intermediate steps. To make sure all of the parts are done correctly I need things to happen in a certain order.

Currently, in the notebook, I have to go in after the page is loaded and manually fire cells to get the correct results.

I have tried referencing cells as the value of a new const to force them to be evaluated when needed, but it doesn’t seem to work.

I feel like I am being pushed to put all of my data handling in one cell, which is particularly cumbersome and seems antithetical to Observable’s design…

It sounds like you are using mutation, and in particular mutation across cells. This is considered an antipattern in Observable because it doesn’t work well with dataflow (as you’ve discovered).

If you want to split the work across multiple cells, instead of mutating the values in place, you should return a transformed copy of the data. For example, say you have data as an array of strings:

data = [
  "2022-01-01",
  "2022-01-02",
  "2022-01-03",
  "2022-01-04",
  "2022-01-05",
  …
]

Using mutation, you might say:

{
  for (let i = 0; i < data.length; ++i) {
    data[i] = new Date(data[i]);
  }
}

After the above cell runs, data will be an array of dates instead of an array of strings. But the problem is that for any other cell that references data, it is undefined whether that cell sees data as an array of strings or an array of dates. Furthermore, when the cell that mutates data runs, it won’t cause any other cell that references data to run again automatically. (See How Observable Runs.)

So instead, create a transformed copy of the data, and make the dependencies in your code explicit.

strings = [
  "2022-01-01",
  "2022-01-02",
  "2022-01-03",
  "2022-01-04",
  "2022-01-05",
  …
]
dates = strings.map(s => new Date(s))

Now any cell that references dates will see an array of dates. And if you change your transformation logic by editing the dates cell, or if you change your data by editing the strings cell, anything downstream will run again automatically.

I mention this topic in Learn D3: Data:

A subtle consideration when working with data in Observable is whether to put code in a single cell or separate cells. A good rule of thumb is that a cell should either define a named value to be referenced by other cells (such as data above), or it should display something informative to the reader (such as this prose, the chart above, or cells which inspect the data).

A critical implication of this rule is that you should avoid implicit dependencies between cells: only depend on what you can name.

Why? A notebook’s dataflow is built from its cell references. If a cell doesn’t have a name, it cannot be referenced and cells that implicitly depend on its effect may run before it. (Cells run in topological order, not top-to-bottom.) Implicit dependencies lead to nondeterministic behavior :scream: during development, and possible errors :boom: when the notebook is reloaded!

For example, if one cell defines an array and a second cell modifies it in-place, other cells may see the array before or after the mutation. To avoid this nondeterminism, make the dependency explicit by giving the second cell a name and copying the array using Array.from or array.map. Or combine the two cells so that only the already-modified array is visible to other cells.
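As a minimal sketch of the copy-instead-of-mutate advice, here is the same idea in plain JavaScript outside a notebook. The names strings and dates are illustrative; in Observable each const would be its own named cell.

```javascript
// Plain JavaScript sketch; in a notebook each const would be a separate named cell.
const strings = ["2022-01-01", "2022-01-02", "2022-01-03"];

// Array.from accepts an optional map function, so it copies and transforms in one step.
const dates = Array.from(strings, (s) => new Date(s));

// The original array is untouched, so anything reading `strings` always sees strings,
// and anything reading `dates` always sees Date objects.
console.log(typeof strings[0]); // "string"
console.log(dates[0] instanceof Date); // true
```

Because dates is a new array with its own name, a cell that needs dates has to reference dates, which is what makes the dependency explicit.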

If you do want to use mutation, you can also limit the effect to within a single cell, even if your code is written in multiple cells, by using functions. For example:

function coerceToDates(data) {
  for (let i = 0; i < data.length; ++i) {
    data[i] = new Date(data[i]);
  }
  return data;
}
data = coerceToDates([
  "2022-01-01",
  "2022-01-02",
  "2022-01-03",
  "2022-01-04",
  "2022-01-05",
  …
])

You won’t be able to inspect the value of data prior to coercion, but you can call your function in another cell to test and debug its behavior. Writing unit tests for such functions is often effective at finding bugs.
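As a rough illustration of testing such a function, the sketch below repeats coerceToDates so it is self-contained and adds a hypothetical testCoerceToDates check. In a notebook the check could live in its own cell; the function name and the error messages are assumptions made for this example.

```javascript
// The mutating helper from above, repeated so this sketch is self-contained.
function coerceToDates(data) {
  for (let i = 0; i < data.length; ++i) {
    data[i] = new Date(data[i]);
  }
  return data;
}

// Hypothetical check: throws on the first failed expectation, returns "ok" otherwise.
function testCoerceToDates() {
  const input = ["2022-01-01", "2022-01-02"];
  const output = coerceToDates(input);
  if (output !== input) throw new Error("expected the same array, mutated in place");
  if (!output.every((d) => d instanceof Date)) throw new Error("expected every element to be a Date");
  // Date-only ISO strings are parsed as UTC midnight.
  if (output[1].toISOString() !== "2022-01-02T00:00:00.000Z") throw new Error("unexpected date value");
  return "ok";
}

const result = testCoerceToDates();
console.log(result); // "ok"
```

The test passes a fresh array each time, so the mutation never leaks into data that other cells can see.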

First, I want to thank you for the response and info. I appreciate the time and effort to give educational replies.

Note that I did read the tutorials. I consider the output of cells to be decent unit tests and try to make them such. One problem is that the data manipulation isn’t as simple as a 1:1 transform. The data needs to be remapped in different ways. Doing that in one function ends up with one large data handling function that is difficult to debug.

Currently, my cells are named for the data structures they are creating. They have dependencies on each other. One creates the outline, others take the outline and do counting (one of the cells that is firing in the wrong order) and building their own structures (for which cells work well to check their output) and then a last function does a pass through the whole data set to fill the original structure with the parts.

I already have a lot of variables to track so I would prefer to not create tons of interstitial variables that need to be tracked (a la 1st example above).

I feel like Obs has a model that is good some of the time (simple notebooks) but obstructive at other times (handling complex data). At some point I don’t appreciate or desire Obs deciding the order of operation. It ends up getting in the way of a desirable outcome. There wouldn’t be a mutability issue. I know the order I want things to be evaluated. In that case I would appreciate a mechanism for forcing the order in which cells are called, so that I can arrange them in a way that makes sense to me, see outputs and be assured that they will run appropriately.

I will try, once again, to rewrite my code to follow Obs model, but I think the result will either be a big, difficult to debug function, or a proliferation of cells and variables that are difficult to deal with.

When notebooks become complex you need control over dataflow. The tools for this are

viewof Inputs.input(...)
Inputs.bind

I am trying to write this up into patterns, like How to 1-of-n switch Dataflow streams on Observable / Tom Larkworthy / Observable

The use case that I found particularly difficult is “auto-save” + “get/create on page load”. Mistakes can cause IO thrashing.

There are a set of low level patterns I see.

Cells are either:

  • Evaluated exactly once, during setup of the notebook. They are similar in role to dependency injection.
  • Reevaluated repeatedly as part of a stream in a dataflow graph. This is stream processing.

Note that a classic function declaration is the first type: it should be evaluated once. You can still get values into the function either by cell reference or by function argument. The first is dependency injection, the other is a normal call.

CONSEQUENCE: A function should never be passed a stream as a cell dependency!!

(And it’s useful to be clear about what role your cell has. I think adding a “once(f => {})” might be a useful way to be explicit about which cells are “const”.)

Another thing I noticed is that, because cells “combineLatest” their dependencies, it’s almost always a mistake to have two or more streaming cells as upstream dependencies for a given cell. It ends up multiplying their frequencies, which usually causes weirdness everywhere. So you need to orchestrate these to have a primary dependency that is a dataflow dependency, and terminate the others in an Inputs.input() so you can poll from them.

RULE 2: Cells should have a maximum of 1 streaming upstream cell dependency.

I dunno how to best communicate these practices. Observable does not do a great job of visualizing the dataflow graph when viewofs are in play, plus the polling would be invisible anyway. Personally I think it would be great if we could visualize dataflow a step deeper. I think if people saw the flow better they might build their own intuition on dataflow manipulation; still, it’s kind of annoying that the skill is needed. You do need dataflow control flow skills for notebook IO, data processing, plugins, etc.

All of my “mutable” cells are the “evaluated once on setup of the notebook” type. They should never be recalculated.

Observable isn’t deciding the order of operation—you are, by how you name and reference cells. The difference with Observable is that the order of operations is not dictated by the document order (the order in which cells appear) but by the dependency order (the order in which cells reference each other). If you want the document order to match the order of evaluation, you certainly can, but it’s not required.

If you want to control the order in which cells run explicitly (because you want to use mutation, or whatever) you can use a sequence of named cells like this:

step1 = {
  // do some work here
}
step2 = {
  step1; // wait for step1 to complete
  // do more work here
}
step3 = {
  step2; // wait for step2 to complete
  // do more work here
}

Even though the step1; statement doesn’t do anything, because the step2 cell references step1, it will cause the step2 cell to wait for step1 to complete. And likewise running or editing the step1 cell will cause all downstream steps to run again, in order.
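To make the idea of dependency order concrete, here is a toy model in plain JavaScript. It is not the Observable runtime and does not use its API; the cells array, the deps lists and the evaluate function are all hypothetical names for this sketch. It only shows that when each cell declares what it references, the run order follows those references even if the cells are listed in a different document order.

```javascript
// Toy model of dependency-ordered evaluation; this is NOT the Observable runtime.
// Each hypothetical "cell" lists the names it references and a function to run.
const cells = [
  { name: "step3", deps: ["step2"], run: () => "step3 done" },
  { name: "step1", deps: [], run: () => "step1 done" },
  { name: "step2", deps: ["step1"], run: () => "step2 done" }
];

function evaluate(cells) {
  const byName = new Map(cells.map((c) => [c.name, c]));
  const done = new Set();
  const visiting = new Set();
  const order = [];

  function visit(name) {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`circular definition involving ${name}`);
    const cell = byName.get(name);
    if (!cell) throw new Error(`${name} is not defined`);
    visiting.add(name);
    for (const dep of cell.deps) visit(dep); // run everything upstream first
    visiting.delete(name);
    cell.run();
    done.add(name);
    order.push(name);
  }

  for (const cell of cells) visit(cell.name);
  return order;
}

const order = evaluate(cells);
console.log(order); // ["step1", "step2", "step3"]
```

Although step3 appears first in the list, it runs last because it references step2, which in turn references step1.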

Something that might help new users is a tutorial on how to efficiently use a flat dataset that has complex inter-relationships in Obs. Examples tend to be for very simple data sets that don’t have complex relationships.

To your suggestion @mbostock …I think I’ve tried that and it isn’t working (for some reason).

I 1st tried explicitly calling the cells at the end of the cell doing all of the restructuring:

locations = Object.assign(...[locationsArr.map(loc => ({ id: loc, type: '', color: '', roles: [] }))])
// Creates set of unique locations, also separately done for roles and connections.
restructureData = {
    // Loop through each row of data and extract details into various data structures, including 
    data.forEach

        // Code to find unique roles and assign data
       locations.roles.forEach

        // Code to find unique locations, connections and assign data
 
        } // End locations.roles.forEach
    } // End data.forEach

individualRoles; // Being called to get number that setColors needs; 
setColors; // Being called to make sure it happens after individualRoles;
} 
setColors = {
  locations.forEach((loc, i) => {
    // Iterate locations and add color to each role per location.
  });
}
individualRoles = locations.find(l => l.id === individualsLocation).roles.length
// Once roles have been assigned to locations count the number of roles in the "Individuals" location.

I also tried moving individualRoles into beginning of setColors, to no effect:

setColors = {
  individualRoles;

  locations.forEach((loc, i) => {
    // Iterate locations and add color to each role per location.
  });
}

I could break up the restructuring of the data more, but I was trying to minimize the number of times I iterate through the data. Right now I run through the data 4 times: three data.map calls to find unique roles, locations and connections, and a 4th pass to assign data to the unique objects. If I do roles, locations and connections separately, each would need 3+ iterations: first to create unique ids, then to get base values, and again to get computed values from the other data structures. That would mean evaluating the full data set 9+ times and rebuilding transient data selections and structures each time.

When you say

restructureData = {
  doSomething();
  individualRoles;
  setColors;
}

You’re saying that the restructureData cell has to wait for the individualRoles and setColors cells to run before the restructureData cell runs (and thus before doSomething() runs).

So, if you want setColors to run after individualRoles, then the setColors cell needs to reference individualRoles, like you had. Similarly, if you want individualRoles to run after restructureData mutates your data or locations, you’d need to reverse the references:

restructureData = {
  doSomething();
}
individualRoles = {
  restructureData; // wait for restructureData to run
  return locations.find(l => l.id === individualsLocation).roles.length;
}
setColors = {
  individualRoles; // wait for individualRoles to run (and by extension restructureData)
  doSomethingElse();
}

Cells can only express their upstream dependencies (what they wait for); they can’t imperatively force other cells to run after them.

If it helps, you can use the minimap in the side pane to see what’s upstream (before/left) and downstream (after/right) of the current cell.

Also, you can remove the Object.assign(…[…]) here, as it’s not doing anything. This:

locations = Object.assign(...[locationsArr.map(loc => ({ id: loc, type: '', color: '', roles: [] }))])

Is equivalent to this:

locations = locationsArr.map(loc => ({ id: loc, type: '', color: '', roles: [] }))

If you call Object.assign with one argument, it just returns whatever you passed in.
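A quick way to check this in plain JavaScript is sketched below. The contents of locationsArr are made up for the example.

```javascript
const locationsArr = ["north", "south"]; // hypothetical location ids

const mapped = locationsArr.map((loc) => ({ id: loc, type: "", color: "", roles: [] }));

// Spreading a one-element array passes exactly one argument: the mapped array itself.
const assigned = Object.assign(...[mapped]);

// With a single argument, Object.assign returns that same object unchanged.
console.log(assigned === mapped); // true
```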

Appreciate you taking the time to explain this. I will likely never be more than an adequate programmer but I am very interested in creating visualizations, so this kind of help is invaluable.

Question: Is creating objects a suggested way to handle cases where you need to compute relationships from tabular data in order to render (i.e., finding unique sets, computing relationships, building hierarchies, assigning attributes like color and position, etc.), or would you do it differently? I put all the data into objects because it needs to be organized to render but not recomputed for interaction.

I would reiterate how useful a complex example would be for teaching the principles of taking tabular data, creating the needed relationships, and then using them to render a vis.

Thank you again for taking the time!

Absolutely, it is very common to derive secondary data structures as part of preparing data for analysis or visualization. These are sometimes plain objects, but they can also be arrays, maps, sets, and other more elaborate structures such as trees.
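As one possible sketch of deriving such structures, the plain JavaScript below builds a Set of unique roles and a Map of locations in a single pass over tabular rows. The rows, their field names (location, role) and the buildIndex function are assumptions made for illustration, not the schema from the notebook discussed above.

```javascript
// Hypothetical tabular rows; the field names are assumptions for this example.
const rows = [
  { location: "Individuals", role: "author" },
  { location: "Individuals", role: "editor" },
  { location: "Teams", role: "author" },
  { location: "Individuals", role: "author" }
];

// One pass builds a Set of unique roles and a Map from location id to its record.
// All mutation stays inside this function; callers only see the finished result.
function buildIndex(rows) {
  const roles = new Set();
  const locations = new Map();
  for (const { location, role } of rows) {
    roles.add(role);
    if (!locations.has(location)) {
      locations.set(location, { id: location, roles: new Set() });
    }
    locations.get(location).roles.add(role);
  }
  return { roles, locations };
}

const index = buildIndex(rows);
const individualRoles = index.locations.get("Individuals").roles.size;
console.log(individualRoles); // 2
```

In a notebook, index and individualRoles could each be a named cell, so the count can only be computed after the index exists.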

Here are a few examples of working with data in Observable: