Observable Plot: What is the idiomatic way to derive new data fields?

I am just exploring Observable Plot, and in particular how it compares with Vega-Lite.

A common use case is the need to be able to derive new data fields based on one or more existing fields. My question is, what is the idiomatic way to do this with Plot?

As an example, which works but looks quite verbose to me, I can reclassify data values using a custom transform function. Is there a better way? Is it generally recommended to do the transformations outside of Plot first?

popData = fetch(
  "https://cdn.jsdelivr.net/npm/vega-datasets@2.2/data/population.json"
).then((f) => f.json())

{
  return Plot.plot({
    color: {
      legend: true,
      type: "categorical"
    },
    marks: [
      Plot.rectX(
        popData,
        Plot.binY(
          { x: "sum" },
          {
            transform: (data, facets) => ({
              data: data
                .filter((d) => d.year === 2000)
                .map((d) =>
                  d.sex === 1 ? { ...d, sex: "male" } : { ...d, sex: "female" }
                ),
              facets: facets
            }),
            x: "people",
            y: "age",
            fill: "sex",
            thresholds: 20
          }
        )
      )
    ]
  });
}

In this case I’d probably do:

...
        {
          filter: (d) => d.year === 2000,
          fill: (d) => d.sex === 1 ? "male" : "female",
          …
        }

Note that if your transform returns an array where the indices do not match the indices of the original array (for example, because of filtering), you should also transform the facets so that they return indices in the new array. In your use-case it doesn’t show, since there are no facets, and the index [0, 1, 2, 3…, 569] still covers the whole range of the filtered array, but it would be wrong if we had facets or composition with other transforms.

In effect, the filter: fn shorthand is equivalent to:

 transform: (data, facets) => {
   facets = facets.map((I) => I.filter((i) => data[i].year === 2000));
   return { data, facets };
 },

See basic.js in the observablehq/plot repository on GitHub (commit 2b59461f5c05107922da3ba0a602fb3faa271a2f).
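To make the index issue concrete, here is a minimal sketch in plain JavaScript, without Plot itself. The rows are made-up sample data (only the field names mirror the population dataset), and filterYear is a hypothetical name for a hand-written version of the transform above. It illustrates the idea and is not a copy of Plot's internals.

```javascript
// Hypothetical sample rows; only the field names mirror the population dataset.
const rows = [
  { year: 1990, sex: 1, people: 10 },
  { year: 2000, sex: 1, people: 20 },
  { year: 2000, sex: 2, people: 30 },
  { year: 1990, sex: 2, people: 40 }
];

// A single facet covering every row, as in the unfaceted case described above.
const facets = [[0, 1, 2, 3]];

// Index-preserving filter: the data array is returned untouched,
// and only the facet indices are narrowed.
const filterYear = (data, facets) => ({
  data,
  facets: facets.map((I) => I.filter((i) => data[i].year === 2000))
});

const result = filterYear(rows, facets);
console.log(result.facets); // [[1, 2]]
console.log(result.data === rows); // true

// By contrast, returning a shorter array while keeping the old indices
// leaves some indices pointing past the end of the new array.
const shrunk = rows.filter((d) => d.year === 2000);
const dangling = facets[0].filter((i) => i >= shrunk.length);
console.log(dangling); // [2, 3]
```

Because the data array keeps its identity and length, any other channel or transform that refers to rows by index stays valid.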

hth!

Ah, that makes sense – thanks. It looks like I was over-engineering a simpler task.

Given the mapping is being applied to the fill, is there any optimisation if the same transformation is applied to the same source data elsewhere in the document? Just wondering if it is necessarily good practice to minimise the points where data transformations are specified, or whether transforming at the point of use, as in your example, is generally fine.

It would be optimal to filter outside of Plot. As you suggest, there are two possible optimizations for the future:

  1. apply the mapping only on the values we want to use;
  2. if the same transform is applied in several places, memoize and reuse the result.

Another solution here would be to skip the mapping and use the raw values 1 and 2 as they are (they then form the scale’s domain), with tickFormat showing the relevant label in the legend.

Another technique is that you can pass a column of values as a channel. For example:

Plot.rectX(
  popData,
  Plot.binY(
    { x: "sum" },
    {
      filter: d => d.year === 2000,
      x: "people",
      y: "age",
      fill: popData.map(d => d.sex === 1 ? "male" : "female"),
      thresholds: 20
    }
  )
)

So, if you want to avoid deriving the column in multiple places, you could extract that work like so:

sex = popData.map(d => d.sex === 1 ? "male" : "female")

And then reference it as fill: sex as needed. The same thing works for the other options, including filter itself. For example:

selectedYear = popData.map(d => d.year === 2000)

And then filter: selectedYear.

Live example: Untitled / Observable | Observable
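As a rough illustration of why a precomputed column works as a channel, the sketch below uses plain JavaScript and made-up rows (people, sexColumn, selectedYearColumn and the other names are hypothetical). It assumes what the examples above rely on: a column holds one value per row, in the same order as the data. The last step emulates by hand what a filter column and a fill column select together. It is not Plot's actual code path.

```javascript
// Hypothetical sample rows standing in for popData.
const people = [
  { year: 1990, sex: 1 },
  { year: 2000, sex: 1 },
  { year: 2000, sex: 2 }
];

// Derived columns: one value per row, aligned with the data by index.
const sexColumn = people.map((d) => (d.sex === 1 ? "male" : "female"));
const selectedYearColumn = people.map((d) => d.year === 2000);

// Hand-rolled equivalent of filtering by one column and reading another.
const keptIndices = people
  .map((_, i) => i)
  .filter((i) => selectedYearColumn[i]);
const keptFill = keptIndices.map((i) => sexColumn[i]);

console.log(keptIndices); // [1, 2]
console.log(keptFill); // ["male", "female"]
```

The alignment is the important part: if a column were derived from an already filtered copy of the data, its indices would no longer match the original rows.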

I also expect that in the future Plot may do some internal optimizations when the same data and channel definitions are reused across marks. It’ll be a little tricky since I suspect that you’ll need identity equality for the channel definitions (meaning if you say d => d.sex === 1 ? "male" : "female" in multiple places, Plot will assume they are distinct definitions, so you’ll need to pull these definitions out into a local variable or similar). But it seems like Plot could readily do some internal caching rather than shifting that burden to the user.
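The identity-equality point can be seen in plain JavaScript. The sketch below is a user-side illustration only: cachedChannel and channelCache are hypothetical names, the rows are made up, and nothing here claims to describe how Plot does or will cache channels.

```javascript
// Two textually identical accessors are still distinct function objects.
const labelA = (d) => (d.sex === 1 ? "male" : "female");
const labelB = (d) => (d.sex === 1 ? "male" : "female");
console.log(labelA === labelB); // false

// Hypothetical cache keyed first by data array identity,
// then by accessor identity.
const channelCache = new WeakMap();
let evaluations = 0;

function cachedChannel(data, definition) {
  let perData = channelCache.get(data);
  if (perData === undefined) {
    perData = new Map();
    channelCache.set(data, perData);
  }
  if (!perData.has(definition)) {
    evaluations++; // count how often the column is actually computed
    perData.set(definition, data.map(definition));
  }
  return perData.get(definition);
}

const census = [{ sex: 1 }, { sex: 2 }];
cachedChannel(census, labelA);
cachedChannel(census, labelA); // cache hit: same data, same function object
cachedChannel(census, labelB); // cache miss: different function object
console.log(evaluations); // 2
```

This is why pulling the accessor out into a single shared variable matters for any identity-based cache: inline copies of the same arrow function would each be computed separately.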

Oh… and if you just want more readable labels in your legend, you can do that without deriving a new column by using the tickFormat option:

color: {
  legend: true,
  type: "categorical",
  tickFormat: d => d === 1 ? "male" : "female"
}

All very helpful comments - thanks Mike and Philippe.

One of the challenges with a (relatively) new technology isn’t so much learning the API as working out the good-practice, idiomatic ways of getting common tasks done, so all of the above helps in this respect.
