
Comments (4)

sharkdp commented on August 18, 2024

Thank you for the feedback!

Please consider adding JSON support to pass input to Numbat structs. Presently I need to hard-code data in Numbat structs, and that quickly becomes tiresome.

We have talked about this in the past — I would also love to see this feature!

Your request comes at the right point in time, since Numbat just gained support for structs and lists.

One thing that is not yet clear to me is what the API would look like. I generally think it was a good idea to have physical dimensions as types in Numbat, not units. But here, it's a bit of a disadvantage. Let's take a concrete example. Imagine we have a JSON file with basic data about the planets (ChatGPT generated, I did not fact-check it):

{
    "planets": [
      {
        "name": "Mercury",
        "perihelion": 0.3075,
        "orbital_period": 87.969,
        "mean_radius": 2439.7,
        "mass": 3.3011e23
      },
      {
        "name": "Venus",
        "perihelion": 0.7184,
        "orbital_period": 224.701,
        "mean_radius": 6051.8,
        "mass": 4.8675e24
      },
      {
        "name": "Earth",
        "perihelion": 0.9833,
        "orbital_period": 365.256,
        "mean_radius": 6371.0,
        "mass": 5.97237e24
      },
        ...
    ]
}

With the following units/dimensions:

    Field           Unit   Dimension
    perihelion      AU     Length
    orbital_period  days   Time
    mean_radius     km     Length
    mass            kg     Mass

Ideally, I would like to be able to define a struct in Numbat, and then just call parse_json("planets.json") and have it 'magically' fill out that struct. Something like:

struct Planet {
    name: String,
    perihelion: Length,
    orbital_period: Time,
    mean_radius: Length,
    mass: Mass
}

struct Planets {
  planets: List<Planet>
}

let planets: Planets = parse_json("planets.json")

And I think we should actually be able to implement something like this with the current state of Numbat. parse_json would have the type forall T. String -> T, and T would be inferred to be Planets in the case above, so the parse_json function would be instantiated with the right type. We could use that internally to parse the JSON file with the same structure (and throw errors in case anything doesn't fit). As for the parse_json call, we could also introduce Rust-style turbofish syntax, so that users could also write:

let planets = parse_json::<Planets>("planets.json")

instead.
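To sketch what such a parse_json could do internally, here is a rough Python illustration (not the actual Rust implementation; the encoding of struct types as dicts and all names are invented for this sketch). The expected type is walked alongside the parsed JSON, and a type error is raised on any mismatch:

```python
import json

# Hypothetical encoding of the Planet struct type: field name -> expected type.
# (Numbers with a decimal point or exponent parse as Python floats.)
PLANET = {"name": str, "perihelion": float, "orbital_period": float,
          "mean_radius": float, "mass": float}

def check(value, expected):
    """Validate parsed JSON against an expected type, raising on mismatch."""
    if isinstance(expected, dict):              # a struct type
        if not isinstance(value, dict) or set(value) != set(expected):
            raise TypeError("field mismatch")
        return {k: check(value[k], t) for k, t in expected.items()}
    if isinstance(expected, list):              # List<T>, encoded as [T]
        return [check(v, expected[0]) for v in value]
    if not isinstance(value, expected):         # a primitive type
        raise TypeError(f"expected {expected.__name__}")
    return value

data = json.loads(
    '{"name": "Earth", "perihelion": 0.9833, "orbital_period": 365.256, '
    '"mean_radius": 6371.0, "mass": 5.97237e24}'
)
earth = check(data, PLANET)
```

The real implementation would of course derive the expected shape from the inferred Numbat type rather than a hand-written table.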

But there is a problem. How do we specify the units? How would we know that the perihelion is measured in AU and the mean_radius in kilometers? Both are just type-annotated with a physical dimension of Length.

There are multiple ways to solve this:

  1. Disallow this use case and only support a subset of this where quantities must always be specified in the base unit of a particular physical dimension. For the current Numbat Prelude, that would mean lengths in meters, time in seconds, etc. But the Prelude is fully customizable and someone could (in theory) provide one based on Imperial units, for example.
  2. Make it even stricter and only allow Scalar to be used for floating point numbers. This would require users to have a second conversion step inside their Numbat programs into a dimensionful struct.
  3. Extend the language to allow fields to be annotated/decorated with the expected units. This would be somewhat similar to what serde does for Rust. We could re-use our decorator syntax for that. Something like:
    struct Planet {
        name: String,
    
        @serialization_unit(AU)
        perihelion: Length,
    
        @serialization_unit(days)
        orbital_period: Time,
    
        @serialization_unit(km)
        mean_radius: Length,
    
        @serialization_unit(kg)
        mass: Mass
    }
  4. Use a completely different API that does not have this problem (I can't think of any right now)
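To make option 3 a bit more concrete, here is an illustrative Python sketch (not Numbat, and not a proposed implementation) of what the @serialization_unit annotations would amount to internally: each annotated field carries a conversion factor into the base unit of its dimension. The factor table and helper names are invented for illustration:

```python
# Assumed conversion factors into SI base units.
FACTORS = {
    "AU": 1.495978707e11,   # metres per astronomical unit
    "days": 86400.0,        # seconds per day
    "km": 1000.0,           # metres per kilometre
    "kg": 1.0,              # kilograms are already the base unit
}

# Field -> serialization unit, as the decorators in option 3 would declare.
PLANET_UNITS = {
    "perihelion": "AU",
    "orbital_period": "days",
    "mean_radius": "km",
    "mass": "kg",
}

def deserialize_planet(raw: dict) -> dict:
    """Convert raw JSON numbers into base-unit quantities."""
    out = {"name": raw["name"]}
    for field, unit in PLANET_UNITS.items():
        out[field] = raw[field] * FACTORS[unit]
    return out

mercury = deserialize_planet({
    "name": "Mercury",
    "perihelion": 0.3075,
    "orbital_period": 87.969,
    "mean_radius": 2439.7,
    "mass": 3.3011e23,
})
```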

We could, of course, also require that this information be present in the input files somehow. For example, using additional schemas or by using perihelion: "0.3075 AU" or perihelion: { value: 0.3075, unit: "AU" } to add units to the JSON file itself. But ideally, I would like to target a solution that would allow us to parse all possible JSON files, not just those that were specifically created for Numbat.
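For completeness, reading either of those in-file encodings is straightforward; a hedged sketch (the helper name is invented, and this assumes the two encodings shown above):

```python
import json

def read_quantity(field):
    """Return (value, unit) from either in-file encoding."""
    if isinstance(field, str):                  # "0.3075 AU"
        value, unit = field.split(maxsplit=1)
        return float(value), unit
    if isinstance(field, dict):                 # {"value": 0.3075, "unit": "AU"}
        return float(field["value"]), field["unit"]
    raise TypeError("quantity must carry a unit")

doc = json.loads('{"perihelion": "0.3075 AU"}')
value, unit = read_quantity(doc["perihelion"])
```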

What do you think?

If we could pass data into a Numbat struct from JSON, we would benefit from some sanity checking of the input data (using a tool such as ajv-cli), and so be able to readily obtain sanity-checked and unit-checked output from more extensive data sets.

I would love to hear more about your use case.

from numbat.

commented on August 18, 2024

I'm thinking of the following.

A common form of data is the 'dreaded spreadsheet', or database output. Tabular formats (spreadsheet, database, R dataframe, et cetera) are commonly used in many fields of science and engineering, and often the values are in one column with the units in an adjacent column. Spreadsheets are still so very common because their barrier to entry is so low and they support comments, cell formatting, et cetera ... but their flexibility can readily obscure errors. Data entry into spreadsheets is often 'manual' or via copy-and-paste; calculations are 'hidden', and can reference multiple parts of a tab or multiple tabs. It gets messy quickly; mistakes are common and often undetected. Numbat to the rescue!

Calculation errors in tabular data can be difficult to spot when the spreadsheet/database table has many rows and columns (Excel currently supports about 1M rows and 16.4k columns, and then there are multiple tabs!), the scale of measurements is wide (e.g. spanning orders of magnitude), and various units are used throughout. Such use cases are common in science and engineering, for example chemical concentration data exported from a LIMS (laboratory information management system). LIMS are typically used to provide data output from commercial laboratories, whereupon it is sent as a CSV or Excel file to customers. Scientists and engineers may receive LIMS data from multiple suppliers (including multiple suppliers from within their home institution!), and the data comes in various units: some laboratory instruments will spit out ppb, others ppm, et cetera. The point here is that a single LIMS export will include data accompanied by a variety of units, but certainly a common format is to have the values in one column and the units in another. (Note that I'm ignoring the issue of values provided as "less than" or "greater than", since I think Numbat should not have to solve this. Concrete example: "< LOD" is a 'value' that means "less than the Limit Of Detection". Where you have such data you would need to recode it outside of Numbat using a 'censoring rule'.)

Table-like data can be readily exported to JSON with a little scripting (e.g. TypeScript or Python in the case of Excel), so that the JSON objects carry values and units.
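As a hedged illustration of that export step (column names and layout are assumptions, not a fixed format), a value column and a unit column can be turned into value/uom objects with a few lines of Python:

```python
import csv
import io
import json

# A tiny stand-in for a real exported CSV file with a value column
# and an adjacent units column.
table = io.StringIO(
    "parameter,value,uom\n"
    "Ammonia,60,mg/L\n"
    "Microcystin,34,ug/L\n"
)

# Keep numbers and units of measurement in separate keys.
rows = [
    {"parameter": rec["parameter"],
     "value": float(rec["value"]),
     "uom": rec["uom"]}
    for rec in csv.DictReader(table)
]

out = json.dumps(rows)
```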

Since data can be structured in so many ways, I think Numbat should aim, certainly initially, to support the use case where the JSON input to Numbat is constrained to a simple format, structured as in the following example:

{
    "value": 60,
    "uom": "s"
}

That is, the JSON key:value pairs fed to Numbat should typically not mix values (numbers) and units of measurement (uom). You could still have JSON arrays, but not mixed arrays (those with numbers and units of measurement in the same array).

So the tabular information (spreadsheet, database, R dataframe, et cetera) would typically have the values in one column and the units in another, and it would be written to JSON in the above format. In such cases, JSON may be considered a carrier of information between the tabular data store and Numbat ... but with the significant advantage that some sanity checking of the data can be achieved via a JSON schema and the various JSON schema tools. A JSON schema can be used to check both values (e.g. it can get hot where I live, but not (yet) above 50 degrees Celsius!) and units (e.g., we are expecting ppb, not pps!). A little JSON schema, combined with the excellent JSON schema tools, could go a long way.
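As a minimal illustration, a schema for the value/uom object above might look like this (the 50-degree bound and the "degC" unit string are just the examples from this paragraph, not a proposed standard):

```json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["value", "uom"],
    "additionalProperties": false,
    "properties": {
        "value": { "type": "number", "maximum": 50 },
        "uom": { "enum": ["degC"] }
    }
}
```

A validator such as ajv-cli could then reject out-of-range values and unexpected unit strings before the data ever reaches Numbat.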

I also like the idea of a language extension using annotations/decorations (your example 3), as that would allow use of JSON that doesn't come with units of measurement, or where some data specifier is 'hard-coded' into values, such as ISO-8601 timestamps (e.g. 2024-06-16T00:38:56+00:00). Handling timestamps would be nice, as most real-world data will come with them. :)

Finally, numbat could support output of JSON (I suggest in the simple structure described above, i.e. no mixed arrays), which could then be imported back to the spreadsheet (or other tabular data store) for ready comparison with the original data ... allowing spreadsheet errors to be fixed!


commented on August 18, 2024

JSON like the following is a more concrete example. When the JSON is in such a format, you can maximise the use of JSON schema for sanity checking the parameter names (e.g. Ammonia) and parameter units of measurement (mg/L in the case of Ammonia). This comes at the expense of needing to generate JSON in that format, and it may be more complicated to pass to a Numbat struct.

[
    {
        "timestamp": "2024-06-16T00:17:00",
        "Ammonia": 60,
        "Ammonia-UoM": "mg/L"
    },
    {
        "timestamp": "2024-06-16T00:17:00",
        "Microcystin": 34,
        "Microcystin-UoM": "μg/L"
    },
    {
        "timestamp": "2024-06-16T00:17:00",
        "Chlorpyrifos": 10,
        "Chlorpyrifos-UoM": "ppb"
    }
]


sharkdp commented on August 18, 2024

That last example you gave is problematic. mg/L and µg/L are both of type MassDensity = Mass/Length³, but ppb is a scalar unit (a number). We don't have sum types, which would be required to represent something like concentration: Either<MassDensity, Scalar>. We also don't have overloaded functions, which we could use to turn both into a quantity of the same type. How do you expect this to be handled?
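One escape hatch, sketched here purely for illustration (in Python, outside Numbat), would be to normalize all concentrations to a single unit in a pre-processing step, so the struct only ever sees one type. Note that turning ppb into a mass density requires an assumption about the sample: for dilute aqueous samples, 1 ppb is commonly treated as 1 µg/L (since water has a density of about 1 kg/L). The factor table and function name are invented:

```python
# Conversion factors into mg/L. The ppb entry assumes a dilute
# aqueous sample, where 1 ppb is conventionally read as 1 ug/L.
TO_MG_PER_L = {
    "mg/L": 1.0,
    "ug/L": 1e-3,
    "ppb": 1e-3,
}

def normalize(value: float, uom: str) -> float:
    """Return the concentration in mg/L."""
    return value * TO_MG_PER_L[uom]
```

With every row normalized up front, the Numbat-side field could simply be typed as MassDensity.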

from numbat.
