Density contours change dramatically based on size/aspect

I’m not sure if the best place for this note is here, or in a GitHub issue, or in a private email to @mbostock, or…

But anyway, in:

… I notice that when switching to a full screen view, the contours completely change. It seems like the “bandwidth” or radius is a single value, which means that the contours depend entirely on the chosen aspect ratio and size.

This seems to me like a questionable choice. You probably want to allow separate specification of the amount of blurring to do in x and y directions, since in the general case there’s no inherent reason why the two axes should have the same units, and you probably don’t want the contours constantly changing under changes of aspect ratio.

Here’s a pair of screenshots showing what I mean:

You could make the same argument about scatterplots by saying the position encoding is misleading because a pixel distance in x is not commensurate with the same pixel distance in y. But often x and y aren’t even in the same units; they happen to be the same here, but that’s not true of scatterplots or contour plots in general.

So, the behavior of the plot seems reasonable to me. And if you wanted to address the aspect ratio issue, you would do so by ensuring that the x and y scales used the same units per pixel (here, the same minutes per pixel in both dimensions) regardless of the window size. You could also adjust the bandwidth based on the chart size.
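As a rough sketch of the last idea, assuming the bandwidth was tuned by eye at some reference chart width: the helper below is hypothetical (its name and parameters are not part of d3), and it only rescales a pixel bandwidth in proportion to the rendered width. The result could then be passed to the `bandwidth` setting of `d3.contourDensity()`.

```javascript
// Hypothetical helper (not part of d3): keep the amount of smoothing roughly
// constant in data units along x by scaling a pixel bandwidth with chart width.
// baseBandwidth: bandwidth in pixels that looked right at baseWidth pixels wide.
// width: the current rendered chart width in pixels.
function scaledBandwidth(baseBandwidth, baseWidth, width) {
  return baseBandwidth * (width / baseWidth);
}
```

Note that this only compensates for resizing at a fixed aspect ratio. If the width and height change by different factors, a single bandwidth value cannot keep the smoothing constant along both axes.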

Having separate bandwidths for x and y might be useful too, I suppose. (And is certainly feasible since the blur is already done separately per dimension.) But I’m not sure whether such a feature would be used in practice. In either case, the right place to make that suggestion would be an issue in the d3-contour repo.
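To illustrate why a per-dimension blur makes separate bandwidths feasible, here is a minimal sketch of a separable blur with independent radii. It is not d3-contour's actual code: it does a single naive box-blur pass per axis on a row-major grid, treats cells outside the grid as zero, and the function names `blurAxis` and `blurXY` are made up for this example.

```javascript
// Blur a row-major grid (width x height) along one axis with a box filter.
// Cells outside the grid are treated as zero.
function blurAxis(values, width, height, radius, horizontal) {
  const out = new Float64Array(values.length);
  const n = horizontal ? width : height;      // length of each line being blurred
  const lines = horizontal ? height : width;  // number of such lines
  const step = horizontal ? 1 : width;        // index stride along a line
  const lineStride = horizontal ? width : 1;  // index stride between lines
  const w = 2 * radius + 1;                   // box filter width
  for (let l = 0; l < lines; ++l) {
    const base = l * lineStride;
    for (let i = 0; i < n; ++i) {
      let sum = 0;
      for (let k = -radius; k <= radius; ++k) {
        const j = i + k;
        if (j >= 0 && j < n) sum += values[base + j * step];
      }
      out[base + i * step] = sum / w;
    }
  }
  return out;
}

// Apply the horizontal pass with radius rx, then the vertical pass with radius ry.
function blurXY(values, width, height, rx, ry) {
  const blurredX = blurAxis(values, width, height, rx, true);
  return blurAxis(blurredX, width, height, ry, false);
}
```

Because the two passes are independent, choosing `rx` different from `ry` costs nothing extra; the open question is only how a caller would pick the two values.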

Right. It just makes the code a bit less convenient, because you can’t use the existing x and y scales used for plotting.

In this case I don’t think you want the same minutes/pixel in both directions for the blur filter; despite being both units of time, it doesn’t really make sense to consider Euclidean distances per se.

That the axes in typical scatterplots are not in commensurable units (and therefore diagonal distances are not directly comparable) was my general point.

Noted. I’ll take future similar questions to GitHub issues. I asked here because it’s partly a question about d3-contour, and partly a question about this specific notebook (or more generally, the use of this type of tool in a context where size can change and cause reflow).

It looks like geom_density2d from ggplot2 (and the underlying kde2d) takes a two-element vector for its bandwidth argument:

If not specified, it computes the normal reference bandwidth:

That would be my preferred default behavior, though I’m not sure if my bin-then-blur implementation is compatible with this approach.
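For anyone who wants to experiment, here is a JavaScript sketch of the normal reference rule as I understand it from R's `MASS::bandwidth.nrd`: 4 × 1.06 × min(standard deviation, IQR / 1.34) × n^(−1/5), computed separately for each variable. The function names are my own, and the quantile interpolation assumes R's default method (type 7), so treat it as an approximation of the R behavior rather than a verified port.

```javascript
// Linear-interpolation quantile on an already sorted array
// (assumed to match R's default quantile type 7).
function quantileSorted(sorted, p) {
  const h = (sorted.length - 1) * p;
  const lo = Math.floor(h);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
}

// Normal reference bandwidth for one variable, in that variable's own units.
function bandwidthNrd(values) {
  const n = values.length;
  const sorted = Float64Array.from(values).sort(); // numeric sort
  const mean = sorted.reduce((a, b) => a + b, 0) / n;
  // Sample variance (divides by n - 1), as R's var() does.
  const variance = sorted.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  const iqrScaled =
    (quantileSorted(sorted, 0.75) - quantileSorted(sorted, 0.25)) / 1.34;
  return 4 * 1.06 * Math.min(Math.sqrt(variance), iqrScaled) * Math.pow(n, -1 / 5);
}
```

Calling it once on the x values and once on the y values gives a two-element bandwidth in data units. Each value would still have to be converted to pixels (or grid cells) through the corresponding scale before it could drive a bin-then-blur approach.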

I’m not too familiar with the statistical graphics literature. It’s also possible there has been work on estimating the appropriate “bandwidth” directly from the available data. I’m sure there are a variety of other methods for estimating bivariate probability distributions from discrete samples.

But for example, in this particular chart the bandwidth should probably be fairly narrow in the y direction, especially near the bottom; there are no data points with eruption time < 1:40 or so, so the density should probably drop quickly to zero after that point. More along the lines of…

…except perhaps broader than that in the x direction.

It would be interesting to do a thorough literature review at some point; I don’t have the time in the near future though. (Edit: here are two books, one by Silverman and one by Scott, if anyone poking their head in this thread wants some homework. :))

The discussion continues at