phrase2unit 📖🔬

XKCD 2260 - Reaction Maps

© Randall Munroe, 2020

This tool interprets a phrase as if it were a sequence of units: it slices the phrase into unit symbols that cancel out into the smallest unit covering as much of the phrase as possible. It thereby implements XKCD 2312. The engine inserts SI prefixes where they make the resulting unit smaller. It uses unit symbols, common names, and conversions to SI provided by Wikipedia, specifically the Lua table in Module:Convert/data that is generated from that page.

Live implementation at https://units.lam.io

Technical overview

The engine is written in Haskell and is made up of two parts:

  1. A dynamic-programming algorithm that finds the unit that bridges the current string position to a later one with the smallest resulting SI unit, and
  2. A heuristic knapsack-problem solver to convert the final SI unit back to a more familiar worded form (e.g. m/s → speed).

Since neither the DP nor the knapsack solver is optimal, the results aren't always strictly minimal, but they're usually small and sensible.
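To make the first part concrete, here is a minimal Python sketch of a position-by-position DP over a phrase. It is not the Haskell engine: the `UNITS` table, the `size` measure (sum of absolute exponents), and the function names are all assumptions made for illustration. It also ignores SI prefixes and scale factors, and it requires the whole phrase to be covered.

```python
from typing import Dict, List, Optional, Tuple

Dims = Dict[str, int]  # SI base unit -> exponent

# Hypothetical toy table (an assumption, not the project's data).
UNITS: Dict[str, Dims] = {
    "m": {"m": 1},
    "s": {"s": 1},
    "ms": {"s": 1},   # millisecond: same dimension as s, scale ignored here
    "Hz": {"s": -1},
    "N": {"kg": 1, "m": 1, "s": -2},
}


def combine(a: Dims, b: Dims) -> Dims:
    """Multiply two units by adding exponents, dropping any that cancel."""
    out = dict(a)
    for base, exp in b.items():
        total = out.get(base, 0) + exp
        if total:
            out[base] = total
        else:
            out.pop(base, None)
    return out


def size(dims: Dims) -> int:
    """Assumed 'smallness' measure: sum of absolute exponents."""
    return sum(abs(e) for e in dims.values())


def segment(phrase: str) -> Optional[Tuple[Dims, List[str]]]:
    """Keep one best (dims, symbols) candidate per string position."""
    n = len(phrase)
    best: List[Optional[Tuple[Dims, List[str]]]] = [None] * (n + 1)
    best[0] = ({}, [])
    for i in range(n):
        if best[i] is None:
            continue  # no slicing reaches this position
        dims, symbols = best[i]
        for sym, sym_dims in UNITS.items():
            if phrase.startswith(sym, i):
                j = i + len(sym)
                cand = combine(dims, sym_dims)
                if best[j] is None or size(cand) < size(best[j][0]):
                    best[j] = (cand, symbols + [sym])
    return best[n]
```

With this toy table, `segment("Hzs")` returns `({}, ["Hz", "s"])`, since hertz times seconds cancels to a dimensionless result. Keeping only one candidate per position is what makes a DP of this shape fast but not strictly optimal: a prefix that looks larger now might have cancelled better later.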

Further details on implementation can be found on my blog.
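For the second part, mapping an SI unit back to named quantities, a greedy loop gives the flavour of a heuristic knapsack-style solver. This is a sketch under stated assumptions, not the engine's actual heuristic: the `UTYPES` table and the names `divide`, `size`, and `describe` are hypothetical, and only multiplication by named quantities is considered.

```python
from typing import Dict, List, Tuple

Dims = Dict[str, int]  # SI base unit -> exponent

# Hypothetical unit types (an assumption, not the project's lim_utypes list).
UTYPES: Dict[str, Dims] = {
    "speed": {"m": 1, "s": -1},
    "length": {"m": 1},
    "time": {"s": 1},
    "area": {"m": 2},
    "force": {"kg": 1, "m": 1, "s": -2},
    "mass": {"kg": 1},
}


def size(dims: Dims) -> int:
    """Assumed cost of a residual unit: sum of absolute exponents."""
    return sum(abs(e) for e in dims.values())


def divide(dims: Dims, by: Dims) -> Dims:
    """Divide one unit by another by subtracting exponents."""
    out = dict(dims)
    for base, exp in by.items():
        total = out.get(base, 0) - exp
        if total:
            out[base] = total
        else:
            out.pop(base, None)
    return out


def describe(dims: Dims) -> Tuple[List[str], Dims]:
    """Greedily factor out the unit type that leaves the smallest residual."""
    remaining = dict(dims)
    words: List[str] = []
    while remaining:
        name, left = min(
            ((n, divide(remaining, d)) for n, d in UTYPES.items()),
            key=lambda pair: size(pair[1]),
        )
        if size(left) >= size(remaining):
            break  # no unit type makes progress; return what is left over
        words.append(name)
        remaining = left
    return words, remaining
```

Here `describe({"m": 1, "s": -1})` returns `(["speed"], {})`, and an energy-like unit `{"kg": 1, "m": 2, "s": -2}` comes back as `(["force", "length"], {})`. Anything the table cannot express is returned in the second element instead of being forced into words.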

Installation

This is a Python-Haskell-React project. Yes, it's time to get funky with package managers.

  1. Install and build using:
$ npm i
$ pip install -r requirements.txt
$ cabal build
  2. This repo contains the data I used at the time of creation, so it's possible to run the server directly with cabal run, where it starts listening on port 8000 by Happstack's default.
  3. However, if you want to update the data from Wikipedia and/or tweak the conversion:
    1. prep/dat.lua is the extracted set of Lua unit tables at Wikipedia's Module:Convert/data, with all_units exposed. prep/convert.lua just uses a tweaked version of rxi's json.lua to ignore the weird mixed tables and yank the data into JSON, like so: cd prep; lua convert.lua > wikiunits.json

    2. This data contains aliases of unit symbols that need to be flattened (e.g. U.S.gal -> USgal) and ratios whose constituent units need to be resolved (e.g. L/100 km). Further, some symbols are referenced but not specified because they are prefixed. These include:

      cm2, km2, um, mm, km, cm3, km3, ml, dL, ug, mg, kg, Mg, kPa, kN, kJ, MJ
      

      I added these by hand to wikiunits.json.

      prep/convert.py converts the data to a form the Haskell engine can use via cd prep; python convert.py wikiunits.json utypes.json > ../hs-data/u2si.json.

    3. There is a minified list of utypes called hs-data/lim_utypes for use by the knapsack solver that removes redundant options (like volume per area vs. length) and unitless utypes (like gradient) to help it solve faster.
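The alias flattening and ratio resolution described in the steps above can be sketched roughly as follows. This is not the contents of prep/convert.py: the field names `target`, `per`, `scale`, and `dims`, and the sample entries, are assumptions chosen for illustration, and the real JSON layout may differ.

```python
from typing import Any, Dict, FrozenSet

Entry = Dict[str, Any]


def resolve(units: Dict[str, Entry], sym: str,
            seen: FrozenSet[str] = frozenset()) -> Entry:
    """Follow aliases and expand ratios until a plain unit entry is reached."""
    if sym in seen:
        raise ValueError(f"alias cycle at {sym!r}")
    entry = units[sym]
    seen = seen | {sym}
    if "target" in entry:  # assumed alias form, e.g. U.S.gal -> USgal
        return resolve(units, entry["target"], seen)
    if "per" in entry:     # assumed ratio form: [numerator, denominator]
        num, den = (resolve(units, s, seen) for s in entry["per"])
        dims = dict(num["dims"])
        for base, exp in den["dims"].items():
            total = dims.get(base, 0) - exp
            if total:
                dims[base] = total
            else:
                dims.pop(base, None)
        return {"scale": num["scale"] / den["scale"], "dims": dims}
    return entry


def flatten(units: Dict[str, Entry]) -> Dict[str, Entry]:
    """Return a table in which every symbol maps to a resolved entry."""
    return {sym: resolve(units, sym) for sym in units}


# Hypothetical sample data in the assumed layout.
SAMPLE: Dict[str, Entry] = {
    "USgal": {"scale": 0.003785411784, "dims": {"m": 3}},
    "U.S.gal": {"target": "USgal"},
    "L": {"scale": 0.001, "dims": {"m": 3}},
    "km": {"scale": 1000.0, "dims": {"m": 1}},
    "L/km": {"per": ["L", "km"]},
}
```

Calling `flatten(SAMPLE)` maps `U.S.gal` to the same entry as `USgal` and gives `L/km` the dimensions of an area. A symbol that is referenced but missing from the table, such as the prefixed ones listed above, raises a `KeyError`, which is one way to discover which entries need adding by hand.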

The Wikipedia data is also not the most complete. Notably it's missing candela, and to my eye it's also sparse on electrical units (like the volt, farad, and henry). I'm looking for better datasets (see #1) that should improve the quality of the solver's results.

Contributors

acrylic-origami

phrase2unit's Issues

Better units dataset

The Wikipedia Convert/data is a pretty good source for units and unit symbols to SI, but it's lacking in a few areas, especially in electrical units (e.g. V, F, H) and surprisingly doesn't include candela. Furthermore, it would be nice to get the really obscure units in, to make for more interesting matches.

This may be more of a project to slowly bring more Wiki data in than a one-time fix, but a search for other unit datasets is also warranted.
