Dynamic rollup object for Arquero

Is there a JS syntax that can be used in the cell rollupObj in this notebook that does not utilize the eval method?

    rollupObj = {
      const rollupObj = {};

      correlationPairings.forEach(({metric_1, metric_2}) => {
        rollupObj[`${metric_1}_${metric_2}`] =
          eval(`d => op.corr(d["${metric_1}"], d["${metric_2}"])`);
        // d => op.corr(d[metric_1], d[metric_2])  // this does not work
      });

      return rollupObj;
    }

Maybe related to Get month name in Arquero table - #2 by severo

Your references (1, 2) were extremely helpful in understanding exactly what is going on. Thank you!

The 'A few words about table expressions and op…' section in Introducing Arquero really explains the issue.

Unfortunately, for my use case, I’m dynamically creating new columns/attributes for the rollup-table and can’t reference columns/attributes that are being generated in the rollup as parameters for the op.corr method.

The params method/pattern is only applicable for calculating values as a function of the current table row being processed. It is a similar scenario to what @bmschmidt explained in this notebook.

So my use of eval appears to be the only option so far for my specific situation.


I think the reason this is difficult is that the data columns are not tidy. If you fold the data into a long format first, the cross-product correlation is relatively straightforward in Arquero. The only weird bit is a custom join function to avoid duplicating keys on the left and right.

  const long = aq
    .from(data)
    .fold(aq.not("Date"), {as: ["company", "price"]});
  return long
    .join(long, (a, b) => op.equal(a.Date, b.Date) && a.company < b.company)
    .groupby("company_1", "company_2")
    .rollup({correlation: op.corr("price_1", "price_2")})
    .orderby(aq.desc("correlation"))
    .view();
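Note that the rollup above calls op.corr with column-name strings instead of wrapping it in an arrow function. Assuming that shorthand accepts any string computed at runtime, the same call might be used to build the dynamic rollup object from the original question without eval. A minimal sketch: `buildRollup` is a hypothetical helper name, and `op` is passed in as an argument so the snippet does not depend on how Arquero is imported.

```javascript
// Hypothetical helper (not from the notebook): builds a rollup spec from
// {metric_1, metric_2} pairings using the column-name shorthand.
// `op` is assumed to be Arquero's op object, passed in by the caller.
function buildRollup(op, correlationPairings) {
  const rollupObj = {};
  for (const { metric_1, metric_2 } of correlationPairings) {
    // Column names are plain strings here, so no closure needs to be
    // serialized and no eval is involved.
    rollupObj[`${metric_1}_${metric_2}`] = op.corr(metric_1, metric_2);
  }
  return rollupObj;
}
```

In a notebook this would presumably be used as `table.rollup(buildRollup(op, correlationPairings))`.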

Ben:

Thanks for your elegant approach, but it appears to apply only when one knows ahead of time exactly which data columns will participate in the correlation and which column uniquely identifies (keys) each row, in this case ‘Date’.

Knowing that ‘Date’ is the row identifier allows your logic to negate/remove it from the fold and also use it in the custom join.

The generalized approach is data-driven and dimension-independent. It uses the data profile to determine which columns are measures and then dynamically proceeds with the correlation analyses.

If the data has anything that is being ‘measured’, it will correlate against generic metric_1, metric_2 pairings and then rank those correlations. The stock data was used as a convenient data source to illustrate that analysis pattern.
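To make the data-driven idea concrete, here is a rough sketch of how measure columns and their pairings could be derived from plain row objects. This is not the notebook's actual code: `measureColumns` and `metricPairings` are hypothetical names, and the sketch assumes a column counts as a measure only when every row holds a JavaScript number in it.

```javascript
// Hypothetical helper: returns the names of columns whose values are
// numbers in every row. Assumes `rows` is an array of plain objects
// that all share the keys of the first row.
function measureColumns(rows) {
  const names = Object.keys(rows[0] ?? {});
  return names.filter(name => rows.every(row => typeof row[name] === "number"));
}

// Hypothetical helper: returns every unordered pair of column names
// as {metric_1, metric_2}, the shape used by correlationPairings.
function metricPairings(columns) {
  const pairs = [];
  for (let i = 0; i < columns.length; i++) {
    for (let j = i + 1; j < columns.length; j++) {
      pairs.push({ metric_1: columns[i], metric_2: columns[j] });
    }
  }
  return pairs;
}
```

No key column such as ‘Date’ has to be named in advance, because non-numeric columns simply drop out of the list of measures.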

I added a ‘Usage’ section to the notebook.

Here’s how you would use correlationAnalysis against the ‘beers’ data from the Arquero introduction:

  • import {beers} from "@uwdata/introducing-arquero"
  • beersData = beers.objects();
  • import {correlationAnalysis} with {beersData as data} from "@mariodelgadosr/dow-jones-industrial-average-correlation-analysis-with-ar"
  • correlationAnalysis.view(correlationAnalysis.numRows())

A very careful reading of the API documentation for Table Expressions results in the following solution:

…At first glance table expressions look like normal JavaScript functions… but hold on! Under the hood, Arquero takes a set of function definitions, maps them to strings, then parses, rewrites, and compiles them to efficiently manage data internally…
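The quoted behavior would explain why the commented-out arrow function in the original cell fails. A small plain-JavaScript illustration, independent of Arquero (the variable names and ticker values are only examples): converting a function to a string yields its source text, which contains the names of outer variables but not their values.

```javascript
// Example values; any column names would do.
const metric_1 = "AAPL";
const metric_2 = "MSFT";

// `op` is never called here; the function is only converted to text.
const expr = d => op.corr(d[metric_1], d[metric_2]);

// The source text mentions the identifiers metric_1 and metric_2,
// but the values they hold are not part of it.
const source = String(expr);
console.log(source);
```

Anything that works only from this text has no way to learn that metric_1 was "AAPL", so the column names must already appear in the expression itself.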

With this in mind, the Limitations section has this critical recommendation:

…Alternatively, for programmatic generation of table expressions one can fallback to generating a string – rather than a proper function definition – and use that instead

The string inside the eval method can be passed directly as a Table expression. See the solution implemented in cell rollupObj.

Be careful to follow the exact string formatting requirements:

using an identifier other than d will fail. In contrast, with an explicit function definition you are free to rename the argument as you see fit…
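For readers without the notebook open, the string-based pattern can be sketched roughly as follows. `buildRollupStrings` is a hypothetical name for illustration, and using JSON.stringify to quote the column names is my own choice so that names containing quotes or backslashes are escaped. The argument is named d, per the requirement quoted above.

```javascript
// Hypothetical helper: builds a rollup spec whose values are table
// expression strings rather than functions, so eval is not needed.
function buildRollupStrings(correlationPairings) {
  const rollupObj = {};
  for (const { metric_1, metric_2 } of correlationPairings) {
    // JSON.stringify emits a double-quoted, escaped string literal.
    const a = JSON.stringify(metric_1);
    const b = JSON.stringify(metric_2);
    // The argument must be called `d` for a string expression.
    rollupObj[`${metric_1}_${metric_2}`] = `d => op.corr(d[${a}], d[${b}])`;
  }
  return rollupObj;
}
```

The resulting object would then be passed to the rollup verb in place of the eval-generated functions.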


The suggestion made by @bmschmidt in this post was generalized and compared to the original pattern in this notebook: Arquero Correlation Speed Test / Mario Delgado / Observable.

The difference in run time between the two analyses is significant, in favor of the dynamic rollup pattern originally introduced in Dow Jones Industrial Average Correlation Analysis with Arquero / Mario Delgado / Observable.