Giter Club home page Giter Club logo

Comments (4)

mbostock avatar mbostock commented on May 3, 2024 1

If we don’t filter the thresholds, then probably the first bin’s bin.x0 and the last bin’s bin.x1 should be undefined. This would be consistent with threshold.invertExtent from d3-scale. But this would need a major version bump, and while perhaps more flexible, it also makes it more cumbersome in the common case because you’d need to handle the unbounded extreme bins specially when rendering the histogram; see the example.

So, I think it probably makes more sense to keep the current behavior where the histogram always has a finite domain (either set explicitly or determined from the data), and then the thresholds (even if set explicitly) must be constrained to that domain.

Since the lower bound of the bin is inclusive, we should tweak the threshold filtering of the upper bound (as described above) so that the threshold 99 isn’t dropped if the domain is [0, 99]:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;

We should also document this behavior.

from d3-array.

curran avatar curran commented on May 3, 2024

I believe including the upper threshold value in the max bin is standard histogram behavior.

I wonder, would histogram libraries in the R or Python worlds give the same result?

from d3-array.

mbostock avatar mbostock commented on May 3, 2024

Per the API reference:

Thresholds are defined as an array of values [x0, x1, …]. Any value less than x0 will be placed in the first bin; any value greater than or equal to x0 but less than x1 will be placed in the second bin; and so on. Thus, the generated histogram will have thresholds.length + 1 bins.

But, what does not appear to be documented in the API reference is how histogram.domain affects the thresholds. From histogram.js:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] >= x1) tz.pop(), --m;

So, any threshold value that is less than or equal to domain[0] is dropped, as is any threshold value that is greater than or equal to domain[1].

Thus, if you have 100 thresholds [0, 1, 2, … 97, 98, 99] (d3.range(100)), and the default domain [0, 99] (d3.extent(d3.range(100))), you in fact only have 98 thresholds [1, 2, … 97, 98] because the 0 and 99 thresholds are removed.

This means the histogram will generate 99 (98 + 1) bins, with any value less than 1 ([0]) in the first bin, (bins[0]), any value greater than or equal to 1 and less than 2 ([1]) in the second bin (bins[1]), and so on, with the 99th bin (bins[98]) containing any value greater than or equal to 98 ([98, 99]).

I think it’s possible that the logic should be tweaked slightly since the upper bound of the bin is exclusive:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;

That would produce 99 thresholds [1, 2, … 98, 99] and 100 bins, [[0], [1], … [98], [99]]. But that’s a little weird because it means that the last bin can only contain values that are exactly equal to the maximum value. This change feels to tailored to the contrived example in question.

So, perhaps it makes more sense to only remove thresholds when those thresholds aren’t specified explicitly; that is, to only remove thresholds when they are specified as a count. This would also apply to the builtin threshold count estimators d3.thresholdFreedmanDiaconis, d3.thresholdScott and d3.thresholdSturges. Then if 100 threshold values are specified explicitly, you’d always get 101 thresholds regardless of the value of histogram.domain. However, depending on the domain and the input values, some of those bins would be empty, as expected. And the default would be [[], [0], [1], … [98], [99]]. I think this change would make a lot of sense, since when you don’t specify the threshold values explicitly, the specified count is only a hint anyway.

from d3-array.

mbostock avatar mbostock commented on May 3, 2024

I’m still not 100% sure this is a good idea since the first and last bin’s bin.x0 and bin.x1 will be equal. Actually it’s worse than that because the first bin’s bin.x0 is set to domain[0] and the last bin.x1 is set to domain[1], which is wrong if the thresholds are outside or coincident with the domain endpoints.

from d3-array.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.