Observable for research - (advanced) statistics

Hey Observables folks :wave:

Here’s something that I’d like to discuss with you:
Are you planning on using Observable notebooks for research / teaching? Or are you already doing so? If not, what is keeping you from doing it?

Here’s my current stance on it:
Pros
One idea I absolutely love about Observables is that users do not need to
install any software in order to edit and run a notebook.
You can just send a link to supervisors, students, reviewers, fellow researchers,
as well as your parents, and they can have a look at your notebook and play
around with it.

Another great benefit is that, due to the reactive approach, the execution order does not matter in an Observable notebook. Just last month the JetBrains Datalore team examined 10,000,000 Jupyter notebooks on GitHub and found that ~36% would not run right away due to out-of-order execution (source).

I think the combination of the two points not only shows the potential of Observables to overcome the technical part of the reproducibility crisis, but also highlights the lower entrance barrier to statistics / data science / programming in general.

Cons
Unfortunately, there still seems to be a lack of a well maintained (advanced) statistics library in the JS ecosystem, especially when it comes to doing regression analysis.
While I think simple-statistics, jStat, and ml.js are great libraries, they are missing the functionality to perform (still relatively basic) regressions on (1) multiple-linear models, (2) linear models including higher powers of one or more predictor variables, (3) models with interaction terms, and (4) models including weights (see here for a non-exhaustive list).

The lack of these functions currently pushes me (and probably others) to using Python/R/Matlab for the actual data analysis, and uploading the results to Observable afterwards for visualization.
Unfortunately, this way we’re losing the benefit of having the data, analysis, and visualization all in one place, where students / fellow researchers / reviewers etc. can easily play around with the data / method.
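To make the gap concrete, here is a minimal sketch of what a weighted multiple regression could look like in plain JS. The function name `weightedLeastSquares` and the toy data are made up for illustration and are not part of simple-statistics, jStat, or ml.js. It solves the normal equations directly, which is easy to read but numerically weaker than the QR decomposition that R's lm relies on, so treat it as a teaching aid rather than a replacement.

```javascript
// Weighted least squares via the normal equations (X'WX) b = X'Wy.
// Hypothetical helper, not part of simple-statistics, jStat, or ml.js.
function weightedLeastSquares(rows, y, weights) {
  const n = rows.length;
  const w = weights ?? new Array(n).fill(1);
  // Design matrix with a leading intercept column.
  const X = rows.map((r) => [1, ...r]);
  const p = X[0].length;

  // Build the augmented system [X'WX | X'Wy].
  const A = Array.from({ length: p }, () => new Array(p + 1).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < p; j++) {
      for (let k = 0; k < p; k++) A[j][k] += w[i] * X[i][j] * X[i][k];
      A[j][p] += w[i] * X[i][j] * y[i];
    }
  }

  // Gauss-Jordan elimination with partial pivoting.
  for (let col = 0; col < p; col++) {
    let pivot = col;
    for (let r = col + 1; r < p; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    }
    if (Math.abs(A[pivot][col]) < 1e-12) throw new Error("Singular design matrix");
    [A[col], A[pivot]] = [A[pivot], A[col]];
    for (let r = 0; r < p; r++) {
      if (r === col) continue;
      const f = A[r][col] / A[col][col];
      for (let k = col; k <= p; k++) A[r][k] -= f * A[col][k];
    }
  }
  // A is now diagonal; divide through to read off the coefficients.
  return A.map((row, j) => row[p] / row[j]);
}

// Cases (2) and (3) are just extra columns: x1^2 and x1 * x2.
const data = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]];
const rows = data.map(([x1, x2]) => [x1, x2, x1 * x1, x1 * x2]);
const y = data.map(([x1, x2]) => 1 + 2 * x1 - x2 + 0.5 * x1 * x1 + 0.25 * x1 * x2);
// Case (4): per-observation weights (arbitrary values for the demo).
const beta = weightedLeastSquares(rows, y, [1, 1, 2, 2, 1, 1]);
```

Since the toy response is generated without noise, `beta` should come back close to the coefficients used to build `y`. What is still missing compared to lm is everything around the fit: standard errors, t-tests, R², and a formula interface.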

Synopsis
While I am aware that it is possible to execute R from an Observable notebook, I still think that there is a need for an actual JS (or WebAssembly) implementation of the core functions (IMHO something similar to lm would be needed the most).

Right now we are teaching students R, but I sometimes think teaching JS/Observable might get more people interested in the subject as it also allows them to produce Websites / d3 visualizations.

Really interested in your thoughts and ideas here!

10 Likes

Yes I agree. We had a kinda similar chat on the Zulip server about how numerical computing holds back JS. (https://d3js.zulipchat.com/#narrow/stream/273726-observable/topic/numerical.20computing.20on.20the.20web)

R, Python and Matlab link to Fortran’s actively maintained BLAS and LAPACK. This is the missing foundation that stats is built on IMHO.

Python has finalizers, which means it can make an ergonomic wrapper around the memory-managed primitives (numpy). I think JS did not have that until recently, with WeakRef and FinalizationRegistry. So basically it was impossible to attach ergonomically to the de facto linear algebra libraries (GitHub - likr/emlapack: BLAS / LAPACK for JavaScript). You would have to call “free()” in JS land.

Anyway, we were literally just chatting about how it might actually be possible now (not in a cross-browser way). But AFAIK no-one has combined FinalizationRegistry and WASM for LAPACK, so it would be a pretty researchy thing to do and probably a fail.
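Roughly what I have in mind, as a sketch only. `fakeHeap` stands in for the allocator a WASM build would export, and the names `malloc`, `free`, and `NativeVector` are assumptions for illustration, not emlapack's actual API. The point is that the wrapper object registers its pointer, and the registry callback frees it if the wrapper is ever collected.

```javascript
// Stand-in for a WASM module's allocator. A real build would expose its
// own malloc/free; these names and this object are assumptions.
const fakeHeap = {
  next: 8,
  live: new Set(),
  malloc(bytes) {
    const ptr = this.next;
    this.next += bytes;
    this.live.add(ptr);
    return ptr;
  },
  free(ptr) {
    this.live.delete(ptr);
  },
};

// Called some time after a wrapper is garbage collected, if at all.
const registry = new FinalizationRegistry((ptr) => fakeHeap.free(ptr));

class NativeVector {
  constructor(length) {
    this.length = length;
    this.ptr = fakeHeap.malloc(length * 8); // room for `length` doubles
    // The held value must not reference `this`, or the wrapper could
    // never be collected. `this` doubles as the unregister token.
    registry.register(this, this.ptr, this);
  }

  // Deterministic release for code that cannot wait for the GC.
  dispose() {
    if (this.ptr === null) return;
    registry.unregister(this);
    fakeHeap.free(this.ptr);
    this.ptr = null;
  }
}

const v = new NativeVector(1000);
v.dispose(); // explicit free; otherwise the registry frees it eventually
```

The catch is that finalization callbacks are not guaranteed to run promptly (or at all), so large temporary matrices could pile up on the WASM heap before the GC gets to them. That is why I would still keep an explicit `dispose()` around.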

But yeah, definitely agree the inability to do linear programming, eigenvectors, lasso regression or a trillion other off-the-shelf algorithms almost mandates using something else or sticking to primitive analysis. It’s very annoying.

3 Likes

I use Observable to teach elementary statistics regularly but probably not in quite the way you are thinking. It sounds like you are thinking of using Javascript in an Observable notebook kinda like one might use R in RStudio or Python in Jupyter. That’s not what I do. Rather, I build interactive illustrations of the main ideas with Observable. If you’re familiar with R, it’s somewhat analogous to the way one might build an illustration with Shiny.

You can take a look at my class web page from last semester to see what I’m talking about. Illustrations built with Observable are littered all over it. Having said that, you’ll also notice that I still use Python within Google Colab as the computational environment.

On the other hand, I have built some elementary computational tools for intro stats, like this T-Distribution calculator which is built on this Observable notebook. I’ve thought of expanding those greatly so that students could easily import or upload CSV files and run elementary statistical tests. I don’t think it would be too hard to build most of what one needs in intro stats. Not sure if I’ll ever get to that or not.

5 Likes

Thanks a lot for your insights! I remember checking out the emlapack repo back when I first read this blog post on how R actually fits a linear model (A Deep Dive Into How R Fits a Linear Model). Until then I didn’t know that it’s actually running Fortran in the background! Also, I just stumbled upon this pure JS re-write: GitHub - R-js/blasjs: Pure Javascript manually written implementation of BLAS.

Back then I was naïvely thinking I could re-write the underlying R code (up to the point where R calls the C function Cdqrls, which then calls Fortran), and turn it into a native module for Node.js… However, I quickly realised that the task is beyond my capabilities.

However, what are your thoughts on expanding existing libraries like simple-statistics to a point where a simple ANOVA or the previously mentioned (non-basic) linear regressions are possible? The other day I started implementing a function for weighted linear regressions, but adding support for multiple predictors is on another level!

That’s great to hear that you use Observable in class already! With @tomlarkworthy’s response in mind, I was thinking that it might be an intermediate step to extend a library such as simple-statistics, before someone starts writing a NumPy/SciPy equivalent in JS.

Just as you said, it would be really nice for students to be able to upload a CSV file, and then perform an ANOVA / (Multiple) Linear-Regression analysis right in the notebook. While they might not be the most performant alternatives, I guess there already are packages for most of the required functions for this. However, I couldn’t find something even remotely close to R’s own lm yet, especially when it comes to multiple linear regressions.
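For the ANOVA part, the arithmetic itself is small. Here is a minimal sketch of the one-way case; `oneWayAnova` is a made-up name and not taken from any of the libraries mentioned. It only computes the sums of squares and the F statistic. The p-value would additionally need the CDF of the F distribution from a statistics library.

```javascript
// One-way ANOVA F statistic. Hypothetical helper for illustration only.
function oneWayAnova(groups) {
  const all = groups.flat();
  const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const grandMean = mean(all);

  let ssBetween = 0; // variation of group means around the grand mean
  let ssWithin = 0; // variation of observations around their group mean
  for (const g of groups) {
    const m = mean(g);
    ssBetween += g.length * (m - grandMean) ** 2;
    for (const x of g) ssWithin += (x - m) ** 2;
  }
  const dfBetween = groups.length - 1;
  const dfWithin = all.length - groups.length;
  const F = ssBetween / dfBetween / (ssWithin / dfWithin);
  return { ssBetween, ssWithin, dfBetween, dfWithin, F };
}

const result = oneWayAnova([[1, 2, 3], [2, 3, 4], [3, 4, 5]]);
// result.F is 3, with 2 and 6 degrees of freedom
```

In a notebook, `groups` could come from grouping the rows of an uploaded CSV by a factor column.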

1 Like

I’m afraid I have to offer a word of caution on simple-statistics; I’m just a bit leery of the quality under the hood.

I wrote a simple calculator using ss.inverseErrorFunction for students to compute some percentile ranks and they just weren’t quite getting the right answers. I traced the problem down to the implementation which, as I recall, used something like an interpolation of some sparsely pre-computed values from a table. I ended up quickly writing an implementation that used Newton’s method to invert the normal CDF. I’m sure that’s not particularly efficient but it worked.
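For anyone curious, the idea can be sketched as follows. This is not the code from my calculator, just a minimal reconstruction of the approach; the function names are made up, and the series used for erf is only adequate for moderate arguments (say |x| below 3), so inputs very close to ±1 would need something better.

```javascript
// erf(x) from its Maclaurin series; adequate for moderate |x| (say below 3).
function erf(x) {
  let term = x;
  let sum = x;
  for (let n = 1; n < 200; n++) {
    term *= (-x * x) / n;
    const delta = term / (2 * n + 1);
    sum += delta;
    if (Math.abs(delta) < 1e-17 * Math.abs(sum)) break;
  }
  return (2 / Math.sqrt(Math.PI)) * sum;
}

// Solve erf(y) = x with Newton's method, using
// erf'(y) = 2 / sqrt(pi) * exp(-y^2).
function erfInv(x, tol = 1e-14) {
  if (!(x > -1 && x < 1)) throw new RangeError("x must lie in (-1, 1)");
  let y = 0;
  for (let i = 0; i < 100; i++) {
    const step = (erf(y) - x) / ((2 / Math.sqrt(Math.PI)) * Math.exp(-y * y));
    y -= step;
    if (Math.abs(step) < tol) break;
  }
  return y;
}

// Standard normal quantile: the z with P(Z <= z) = p.
const normalQuantile = (p) => Math.SQRT2 * erfInv(2 * p - 1);
```

Starting from y = 0 the iterates approach the root from one side, so the loop is well behaved on the open interval. It is slower than a closed-form approximation, but the accuracy is limited by the erf evaluation rather than by the inversion.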

Since then, I’ve moved to jstat and have been quite happy with it.

I think this sheds light on the general problem. Building quality numerical tools is a long road. In this particular case, we have a truly outstanding developer but who’s not a statistician. It’s also not clear how much vetting the library has been through. You need a community of folks who are good programmers and good scientists.


The idea of implementing BLAS in Javascript is not fathomable to me. :confused:

I use Observable for what I would call “research” projects, but it doesn’t involve statistical analysis. (And most of the actual ‘research’ part is done with pen and paper. There is also sometimes some Photoshop, Python, and Matlab used as tools along the way.)

As for BLAS: Maybe someone can compile Fortran to wasm effectively. A web search turns up GitHub - StarGate01/Full-Stack-Fortran: Fortran to WebAssembly

1 Like

The implementation hasn’t changed in 6 years, and this does not seem like an accurate description. The approximation used is Winitzki’s. The reference is this google hosted PDF file: erf-approx.pdf - Google Drive
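For reference, Winitzki's closed form for the inverse looks roughly like this. I wrote the sketch below from the published formula rather than copying it from the library, so the function name `erfInvWinitzki` and the exact arrangement are mine, and the constant `a` is one commonly quoted choice.

```javascript
// Winitzki's closed-form approximation to the inverse error function.
// Written from the published formula; not copied from simple-statistics.
function erfInvWinitzki(x) {
  const a = (8 * (Math.PI - 3)) / (3 * Math.PI * (4 - Math.PI)); // about 0.140012
  const L = Math.log(1 - x * x);
  const t = 2 / (Math.PI * a) + L / 2;
  // Guard against a tiny negative value from rounding near x = 0.
  const inner = Math.max(0, Math.sqrt(t * t - L / a) - t);
  return Math.sign(x) * Math.sqrt(inner);
}
```

It is a single expression with no iteration and no table, which makes it cheap, at the price of a fixed approximation error.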

There are a number of possible approximations to inverse erf, depending on your needs. It is plausible that this approximation is not accurate enough for your students’ needs though. @mbostock made a notebook describing this and a couple of others: Error Function / Mike Bostock / Observable

2 Likes

Thanks for the clarification. Of course, there are often trade-offs to be made in numerical computation and my guess is that the code in Simple Statistics favors speed over precision. Whether that’s right or wrong, my comparisons with Mathematica indicate that

ss.inverseErrorFunction(x)

is off in the fourth decimal place for all x with |x|>0.68. That just wasn’t sufficient for my purposes. Hence, my word of caution. :neutral_face:

I agree it would probably be better to default to something relatively accurate to at least 7 or 8 digits, or even try to compute something to full accuracy with an approximation available as an explicit alternative. Or if not, to make the limitations crystal clear in the docs.

They would probably take a patch of a more accurate approximation. This one seems to have been added in 2015 by a user without too much other demand or discussion, and hasn’t been mentioned since in github issues, so probably isn’t seeing much practical use. Add inverses for error_function and cumulative_std_normal_probability by ericfischer · Pull Request #85 · simple-statistics/simple-statistics · GitHub

I think https://stdlib.io/ has a BLAS package, if that helps?

https://stdlib.io/docs/api/v0.0.90/@stdlib/blas

One challenge with stdlib is that it’s currently in a monolithic 223 MB package, which does not make it easily consumable in a web browser. I suspect that there’s a way to repackage it, though.

There’s an example of emlapack on Observable, too:

1 Like

@rreusser has been working on this

1 Like

Thanks, that seems to work (though it might be more efficient bundled):

blas = import("https://cdn.jsdelivr.net/npm/@stdlib/esm@0.0.3/blas.js")

1 Like

For what it’s worth, I started collecting discussions around stdlib-js and similar libraries here:

1 Like

Hi folks,

This is a bit distant from the discussion, but in line with the topic heading of this thread, I thought it might be worthwhile flagging one major challenge of using JavaScript (in general) for advanced statistics, namely the breakdown of computational capacity for very large numbers:
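One common form of this, as a generic illustration (the names below are made up for the example): JS numbers are IEEE doubles, so integers stop being exact past 2^53 and factorials overflow at 171!. Working on the log scale, or switching to BigInt for exact integers, are the usual workarounds.

```javascript
// Doubles hold integers exactly only up to 2^53 - 1.
const unsafe = 2 ** 53 + 1 === 2 ** 53; // true: the +1 is lost

// Naive factorial overflows to Infinity at 171!.
function factorial(n) {
  let f = 1;
  for (let k = 2; k <= n; k++) f *= k;
  return f;
}
// Infinity / Infinity gives NaN for large n.
const naiveChoose = (n, k) => factorial(n) / (factorial(k) * factorial(n - k));

// Working with logarithms keeps intermediate values small.
function logFactorial(n) {
  let s = 0;
  for (let k = 2; k <= n; k++) s += Math.log(k);
  return s;
}
const logChoose = (n, k) => logFactorial(n) - logFactorial(k) - logFactorial(n - k);

// BigInt gives exact integer results at the cost of speed.
function bigFactorial(n) {
  let f = 1n;
  for (let k = 2n; k <= BigInt(n); k++) f *= k;
  return f;
}
```

Here `naiveChoose(1000, 500)` is NaN, while `logChoose(1000, 500)` stays finite and can be exponentiated or compared on the log scale as needed.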

Thanks for a concise breakdown of the technical barriers. For people who don’t follow Javascript-land that closely, but are kind of waiting for it to get there, it’s hard to get a sense for what exactly is the holdup, or to name specific technical issues that are holding things back (that might give some sense of a timeline for when the issues might be addressed).

The other side of the issue (what would researchers need for this to be viable) I guess would be the front-end (slice syntax and operator overloading for an R/NumPy/Matlab vectorized look), which might be where Observable could play a role. I recall seeing NumCalc a few years ago, and being pretty interested to see that kind of language extension (which I guess is kind of an alternative to operator overloading for the whole of Javascript, something like that). I was wondering if Observable might at some point use the cell separation to allow some of that syntax experimentation (for all the scientist/domain people who might be pouring into Javascript-land at some point, and want their home notation to work).

(I was following what Iodide was doing a while ago wrt some of these steps, but I think they’ve kind of slowed down.)

1 Like

Operator overloading would be super helpful. There is a proposal for it: https://github.com/tc39/proposal-operator-overloading. But I don’t expect it to happen any time soon.

1 Like

Like @mcmcclur, I often use Observable to prepare interactive teaching materials (for undergraduate cosmology) but I don’t expose the underlying notebooks to the students. I would guess this is a more common approach when students are not expected to be Javascript literate. My student-facing materials are here and the corresponding notebooks are in this collection.

My use of observable for teaching is definitely limited by the availability of high-quality numerical libraries, but I have been slowly implementing the pieces I need (e.g. this adaptive integrator).
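To give a flavour of the kind of building block I mean, here is a minimal adaptive Simpson sketch. It is not the implementation in the linked notebook, just a textbook version with made-up names and default tolerances chosen for the example.

```javascript
// Adaptive Simpson quadrature: split an interval until the two half-interval
// estimates agree with the whole-interval estimate to within `tol`.
function adaptiveSimpson(f, a, b, tol = 1e-10, maxDepth = 50) {
  const simpson = (fa, fm, fb, h) => (h / 6) * (fa + 4 * fm + fb);

  function recurse(a, fa, b, fb, m, fm, whole, tol, depth) {
    const lm = (a + m) / 2;
    const rm = (m + b) / 2;
    const flm = f(lm);
    const frm = f(rm);
    const left = simpson(fa, flm, fm, m - a);
    const right = simpson(fm, frm, fb, b - m);
    const diff = left + right - whole;
    if (depth <= 0 || Math.abs(diff) <= 15 * tol) {
      return left + right + diff / 15; // Richardson correction
    }
    // Refine each half with half of the error budget.
    return (
      recurse(a, fa, m, fm, lm, flm, left, tol / 2, depth - 1) +
      recurse(m, fm, b, fb, rm, frm, right, tol / 2, depth - 1)
    );
  }

  const m = (a + b) / 2;
  const fa = f(a);
  const fm = f(m);
  const fb = f(b);
  return recurse(a, fa, b, fb, m, fm, simpson(fa, fm, fb, b - a), tol, maxDepth);
}

const area = adaptiveSimpson(Math.sin, 0, Math.PI);
```

Function values at shared endpoints are passed down rather than recomputed, which matters when the integrand is expensive.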

1 Like

Hi David,

You might find useful the handful of things I put at @jrus/cheb about 2.5 years ago. There’s a ton more that could conceivably go in there. If you want to chat about this, stop by here.

2 Likes

I certainly don’t either! While I share the implementations on this forum, the demos are embedded into class webpages on my site.