stillwater-sc / universal

Large collection of number systems providing custom arithmetic for mixed-precision algorithm development and optimization for AI, Machine Learning, Computer Vision, Signal Processing, CAE, EDA, control, optimization, estimation, and approximation.

License: MIT License

universal's Introduction

Universal: a header-only C++ template library of custom arithmetic plug-in types

System: More information
  • Codacy Code Quality: Code Quality Assessment
  • FOSSA Status: Open-source license dependency scanner
  • GitHub Actions: Latest Linux/MacOS/Windows builds and regression tests
  • Development Branch: Development Branch
  • Regression Status: Regression Status
  • Code Coverage: Code coverage scanner
  • Docker Pulls: Container pulls
  • Awesome C++: Awesome C++ Libraries
  • JOSS Markdown: Journal of Open-Source Software paper
  • Zenodo: Zenodo DOI Badge

The goal of the Universal Numbers Library is to offer alternatives to native integer and floating-point for mixed-precision algorithm development and optimization. Tailoring the arithmetic types to the application's precision and dynamic range enables a new level of application performance and energy efficiency, particularly valuable for embedded applications that need autonomy through intelligent behavior.

Deep Learning algorithms in particular provide a core application vertical where alternative formats and precisions, such as half-precision floating-point and bfloat16, yield speed-ups of two to three orders of magnitude, making rapid innovation in AI possible.

The Universal Library is a ready-to-use header-only library that provides a plug-in replacement for native types and offers a low-friction environment to explore alternatives to IEEE-754 floating-point in AI, DSP, HPC, and HFT algorithms.

The basic use pattern is as simple as:

// bring in the parameterized type of interest, in this case
// a fixed-sized, arbitrary configuration classic floating-point
#include <universal/number/cfloat/cfloat.hpp>
#include <cmath>     // std::sqrt
#include <iostream>  // std::cout

// define your computational kernel parameterized by arithmetic type
template<typename Real>
Real MyKernel(const Real& a, const Real& b) {
    return a * b;  // replace this with your kernel computation
}

constexpr double pi = 3.14159265358979323846;

int main() {
    // if desired, create an application type alias to avoid errors
    using Real = sw::universal::half; // half-precision IEEE-754 floating-point

    Real a = std::sqrt(2.0); // computed in double, then converted to Real
    Real b = pi;
    // finally, call your kernel with your desired arithmetic type
    std::cout << "Result: " << MyKernel(a, b) << std::endl;
}
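
The same kernel pattern extends to mixed precision by giving the storage type and the accumulation type separate template parameters. The sketch below is an illustration only: the name dot is hypothetical, and native float and double stand in for the Universal types so that it compiles without the library.

```cpp
#include <cstddef>
#include <vector>

// Dot product with separate storage (Input) and accumulation (Accum) types.
// Hypothetical helper for illustration; not part of the Universal API.
template<typename Input, typename Accum>
Accum dot(const std::vector<Input>& x, const std::vector<Input>& y) {
    Accum sum = Accum(0);
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    for (std::size_t i = 0; i < n; ++i) {
        // widen both operands first so the product is formed in Accum
        sum += static_cast<Accum>(x[i]) * static_cast<Accum>(y[i]);
    }
    return sum;
}
```

With x = {1.0e8f, 1.0f, -1.0e8f} and y = {1.0f, 1.0f, 1.0f}, dot<float, float> returns 0 because the 1.0 is absorbed when it is added to 1.0e8 in single precision, while dot<float, double> returns 1. Any type that supports construction from 0, +=, and * should be usable in place of either parameter.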

The library contains fast implementations of special IEEE-754 formats that do not have universal hardware implementations across x86, ARM, POWER, RISC-V, and GPUs. Special formats such as quarter precision (quarter), half precision (half), and quad precision (quad) are provided, as well as vendor-specific extensions, such as NVIDIA TensorFloat, Google's Brain Float (bfloat16), and TI DSP fixed-points (fixpnt). In addition to these often-used specializations, Universal supports static and elastic integers, decimals, fixed-points, rationals, linear floats, tapered floats, logarithmic, interval and adaptive-precision integers, rationals, and floats. There are example number system skeletons to get you started quickly if you desire to add your own.
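
To make one of these formats concrete: bfloat16 keeps the sign bit and the 8 exponent bits of a 32-bit float and shortens the fraction to 7 bits, so a conversion amounts to rounding away the lower 16 bits. The sketch below shows that idea with the standard library only. The function names are hypothetical, it assumes float is IEEE-754 binary32, and it is not how Universal implements its own bfloat16 type.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Round a binary32 value to bfloat16 with round-to-nearest, ties-to-even.
// NaN inputs are not treated specially in this sketch.
std::uint16_t float_to_bfloat16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that survives
    bits += 0x7FFFu + lsb;                        // bias so the shift below rounds
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widen bfloat16 bits back to binary32 by restoring 16 zero fraction bits.
float bfloat16_bits_to_float(std::uint16_t b) {
    const std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, 3.14159265f maps to the bit pattern 0x4049, which widens back to 3.140625: the dynamic range of float is preserved, but only about two to three decimal digits of precision remain.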

Communication channels

  • GitHub Issue: bug reports, feature requests, etc.
  • Forum: discussion of alternatives to IEEE-754 for computational science.
  • Slack: online chats, discussions, and collaboration with other users, researchers and developers.

Citation

Please cite our work if you use Universal.

@article{omtzigt2023universal,
  title={Universal Numbers Library: Multi-format Variable Precision Arithmetic Library},
  author={Omtzigt, E Theodore L and Quinlan, James},
  journal={Journal of Open Source Software},
  volume={8},
  number={83},
  pages={5072},
  year={2023}
}

@inproceedings{Omtzigt:2022,
  title        = {Universal: Reliable, Reproducible, and Energy-Efficient Numerics},
  author       = {E. Theodore L. Omtzigt and James Quinlan},
  booktitle    = {Conference on Next Generation Arithmetic},
  pages        = {100--116},
  year         = {2022},
  organization = {Springer}
}

@article{Omtzigt2020,
    author     = {E. Theodore L. Omtzigt and Peter Gottschling and Mark Seligman and William Zorn},
    title      = {{Universal Numbers Library}: design and implementation of a high-performance reproducible number systems library},
    journal    = {arXiv:2012.11011},
    year       = {2020},
}

Talks and Presentations

The following presentations describe Universal and the number systems it contained as of the time of publication.

Slides of a presentation at FPTalks'21

Presentation: Application-Driven Custom Number Systems

Slides of the presentation at CoNGA'22

Presentation: Universal: Reliable, Reproducible, and Energy-Efficient Numerics

A quick description of the structure of the number system parameterization can be found here.

Quick start

If you just want to experiment with the number system tools and test suites and don't want to bother cloning and building the source code, there is a Docker container to get started:

> docker pull stillwater/universal
> docker run -it --rm stillwater/universal bash
stillwater@b3e6708fd732:~/universal/build$ ls
CMakeCache.txt       Makefile      cmake-uninstall.cmake  playground  universal-config-version.cmake
CMakeFiles           applications  cmake_install.cmake    tests       universal-config.cmake
CTestTestfile.cmake  c_api         education              tools       universal-targets.cmake

Here is a quick reference of what the command line tools have to offer.

How to build

If you do want to work with the code, the universal numbers software library is built using cmake version v3.23. Install the latest cmake. There are interactive installers for MacOS and Windows. For Linux, a portable approach downloads the shell archive and installs it at /usr/local:

> wget https://github.com/Kitware/CMake/releases/download/v3.23.1/cmake-3.23.1-Linux-x86_64.sh 
> sudo sh cmake-3.23.1-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir

For Ubuntu, snap will install the latest cmake, and would be the preferred method:

> sudo snap install cmake --classic

The Universal Library is a pure C++ template library without any further dependencies, even for the regression test suites, to enable hassle-free installation and use.

Clone the GitHub repo, and you are ready to build the different components of the Universal library.
The library contains tools for using integers, decimals, fixed-points, floats, posits, valids, and logarithmic number systems. It includes educational programs that showcase simple use cases to familiarize yourself with different number systems and application examples to highlight the use of other number systems to gain performance or numerical accuracy. Finally, each number system offers its own verification suite.

The easiest way to become familiar with all the options in the build process is to fire up the CMake GUI (or ccmake if you are on a headless server). The CMake output will summarize which options have been set.
The output will look something like this:

$ git clone https://github.com/stillwater-sc/universal
$ cd universal
$ mkdir build
$ cd build
$ cmake ..

 _____  _____  ____  _____  _____  ____   ____  ________  _______     ______        _       _____
|_   _||_   _||_   \|_   _||_   _||_  _| |_  _||_   __  ||_   __ \  .' ____ \      / \     |_   _|
  | |    | |    |   \ | |    | |    \ \   / /    | |_ \_|  | |__) | | (___ \_|    / _ \      | |
  | '    ' |    | |\ \| |    | |     \ \ / /     |  _| _   |  __ /   _.____`.    / ___ \     | |   _
   \ \__/ /    _| |_\   |_  _| |_     \ ' /     _| |__/ | _| |  \ \_| \____) | _/ /   \ \_  _| |__/ |
    `.__.'    |_____|\____||_____|     \_/     |________||____| |___|\______.'|____| |____||________|

-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- No default build type specified: setting CMAKE_BUILD_TYPE=Release
-- C++20 has been enabled by default
-- Performing Test COMPILER_HAS_SSE3_FLAG
-- Performing Test COMPILER_HAS_SSE3_FLAG - Success
-- Performing Test COMPILER_HAS_AVX_FLAG
-- Performing Test COMPILER_HAS_AVX_FLAG - Success
-- Performing Test COMPILER_HAS_AVX2_FLAG
-- Performing Test COMPILER_HAS_AVX2_FLAG - Success
--
-- PROJECT_NAME                = universal
-- PROJECT_NAME_NOSPACES       = universal
-- PROJECT_SOURCE_DIR          = /home/stillwater/dev/clones/universal
-- PROJECT_VERSION             = 3.68.1.80df9073
-- CMAKE_C_COMPILER            = /usr/bin/cc
-- CMAKE_CXX_COMPILER          = /usr/bin/c++
-- CMAKE_CURRENT_SOURCE_DIR    = /home/stillwater/dev/clones/universal
-- CMAKE_CURRENT_BINARY_DIR    = /home/stillwater/dev/clones/universal/build_gcc
-- GIT_COMMIT_HASH             = 80df9073
-- GIT_BRANCH                  = v3.68
-- include_install_dir         = include
-- include_install_dir_full    = include/universal
-- config_install_dir          = share/universal
-- include_install_dir_postfix = universal
--
-- ******************* Universal Arithmetic Library Configuration Summary *******************
-- General:
--   Version                          :   3.68.1.80df9073
--   System                           :   Linux
--   C++ Language Requirement         :   C++20
--   C compiler                       :   /usr/bin/cc
--   Release C flags                  :   -O3 -DNDEBUG -Wall -Wpedantic -Wno-narrowing -Wno-deprecated
--   Debug C flags                    :   -g -Wall -Wpedantic -Wno-narrowing -Wno-deprecated
--   C++ compiler                     :   /usr/bin/c++
--   Release CXX flags                :   -O3 -DNDEBUG   -Wall -Wpedantic -Wno-narrowing -Wno-deprecated -Wall -Wpedantic -Wno-narrowing -Wno-deprecated
--   Debug CXX flags                  :   -g   -Wall -Wpedantic -Wno-narrowing -Wno-deprecated -Wall -Wpedantic -Wno-narrowing -Wno-deprecated
--   Build type                       :   Release
--
--   BUILD_ALL                        :   OFF
--   BUILD_CI                         :   OFF
--
--   BUILD_DEMONSTRATION              :   ON
--   BUILD_NUMBERS                    :   OFF
--   BUILD_NUMERICS                   :   OFF
--   BUILD_BENCHMARKS                 :   OFF
--   BUILD_MIXEDPRECISION_SDK         :   OFF
--
--   BUILD_CMD_LINE_TOOLS             :   ON
--   BUILD_EDUCATION                  :   ON
--   BUILD_APPLICATIONS               :   ON
--   BUILD_PLAYGROUND                 :   ON
--
--   BUILD_NUMBER_INTERNALS           :   OFF
--   BUILD_NUMBER_NATIVE_TYPES        :   OFF
--   BUILD_NUMBER_ELASTICS            :   OFF
--   BUILD_NUMBER_STATICS             :   OFF
--   BUILD_NUMBER_CONVERSIONS         :   OFF
--
--   BUILD_NUMBER_EINTEGERS           :   OFF
--   BUILD_NUMBER_DECIMALS            :   OFF
--   BUILD_NUMBER_RATIONALS           :   OFF
--   BUILD_NUMBER_EFLOATS             :   OFF
--   BUILD_NUMBER_EPOSITS             :   OFF
--
--   BUILD_NUMBER_INTEGERS            :   OFF
--   BUILD_NUMBER_FIXPNTS             :   OFF
--   BUILD_NUMBER_BFLOATS             :   OFF
--   BUILD_NUMBER_CFLOATS             :   OFF
--   BUILD_NUMBER_DFLOATS             :   OFF
--   BUILD_NUMBER_AREALS              :   OFF
--   BUILD_NUMBER_UNUM1S              :   OFF
--   BUILD_NUMBER_UNUM2S              :   OFF
--   BUILD_NUMBER_POSITS              :   OFF
--   BUILD_NUMBER_VALIDS              :   OFF
--   BUILD_NUMBER_LNS                 :   OFF
--   BUILD_NUMBER_LNS2B               :   OFF
--   BUILD_NUMBER_SORNS               :   OFF
--
--   BUILD_NUMERIC_FUNCTIONS          :   OFF
--   BUILD_NUMERIC_QUIRES             :   OFF
--   BUILD_NUMERIC_CHALLENGES         :   OFF
--   BUILD_NUMERIC_UTILS              :   OFF
--   BUILD_NUMERIC_FPBENCH            :   OFF
--
--   BUILD_BENCHMARK_ERROR            :   OFF
--   BUILD_BENCHMARK_ACCURACY         :   OFF
--   BUILD_BENCHMARK_REPRODUCIBILITY  :   OFF
--   BUILD_BENCHMARK_PERFORMANCE      :   OFF
--   BUILD_BENCHMARK_ENERGY           :   OFF
--
--   BUILD_MIXEDPRECISION_ROOTS       :   OFF
--   BUILD_MIXEDPRECISION_APPROXIMATE :   OFF
--   BUILD_MIXEDPRECISION_INTEGRATE   :   OFF
--   BUILD_MIXEDPRECISION_INTERPOLATE :   OFF
--   BUILD_MIXEDPRECISION_OPTIMIZE    :   OFF
--   BUILD_MIXEDPRECISION_TENSOR      :   OFF
--
--   BUILD_LINEAR_ALGEBRA_BLAS        :   OFF
--   BUILD_LINEAR_ALGEBRA_VMATH       :   OFF
--
--
--   BUILD_C_API_PURE_LIB             :   OFF
--   BUILD_C_API_SHIM_LIB             :   OFF
--   BUILD_C_API_LIB_PIC              :   OFF
--   BUILD_DOCS                       :   OFF
--
-- Regression Testing Level:
--   BUILD_REGRESSION_SANITY          :   ON
--
-- Dependencies:
--   SSE3                             :   NO
--   AVX                              :   NO
--   AVX2                             :   NO
--   Pthread                          :   NO
--   TBB                              :   NO
--   OMP                              :   NO
--
-- Utilities:
--   Serializer                       :   NO
--
-- Install:
--   Install path                     :   /usr/local
--

 _____  _____  ____  _____  _____  ____   ____  ________  _______     ______        _       _____
|_   _||_   _||_   \|_   _||_   _||_  _| |_  _||_   __  ||_   __ \  .' ____ \      / \     |_   _|
  | |    | |    |   \ | |    | |    \ \   / /    | |_ \_|  | |__) | | (___ \_|    / _ \      | |
  | '    ' |    | |\ \| |    | |     \ \ / /     |  _| _   |  __ /   _.____`.    / ___ \     | |   _
   \ \__/ /    _| |_\   |_  _| |_     \ ' /     _| |__/ | _| |  \ \_| \____) | _/ /   \ \_  _| |__/ |
    `.__.'    |_____|\____||_____|     \_/     |________||____| |___|\______.'|____| |____||________|

-- Configuring done
-- Generating done

As you can see in the cmake output there are many build targets. Each build target is designed to provide focus and fast build turnarounds when working with different number systems. Each number system has its own build target allowing fast and efficient regression testing.

The build options are enabled/disabled as follows:

> cmake -DBUILD_EDUCATION=OFF -DBUILD_NUMBER_POSITS=ON ..

After building, issue the command make test to run the complete test suite of all the enabled components, as a regression capability when you are modifying the source code. This will touch all the corners of the code.

> git clone https://github.com/stillwater-sc/universal
> cd universal
> mkdir build
> cd build
> cmake ..
> make -j $(nproc)
> make test

For Windows and Visual Studio, there are CMakePredefinedTargets that accomplish the same tasks:

    - ALL_BUILD will compile all the projects
    - INSTALL   will install the Universal library
    - RUN_TESTS will run all tests

Here is the layout of all the projects contained in V3.68 of Universal:

[Image: visual-studio-project]

In the Applications section, you will find application examples to demonstrate the use of Universal arithmetic types to accomplish different numerical goals, such as reproducibility, accuracy, performance, or precision. These examples are great starting points for your own application requirements.

[Image: example-applications]

How to develop and extend Universal

The Universal library contains hundreds of example programs to demonstrate the use of the arithmetic types and to enable new developers to get up to speed. In each number system type's regression suite there is an api/api.cpp that chronicles all the invocation and use cases to provide an executable example of how to use the type. In general, the api section of the regression tests has code examples showing how to use the different library components, such as manipulators, attributes, number traits, exceptions, and special cases.

In the education build target (BUILD_EDUCATION), there are individual test programs that demonstrate how to use the different types.

The docs directory contains the descriptions of the command line tools, a tutorial explaining the parameterization design of the arithmetic types in Universal, and several conference presentations (FPTalks and CoNGA22) describing the arithmetic types. The docs directory also contains ready-to-use value tables and dynamic range comparisons of many key small arithmetic types of interest in AI and DSP applications.

Each number system comes with a complete regression suite to verify the functionality of assignment, conversion, arithmetic, logic, exceptions, number traits, and special cases. These regression suites are run for each PR or push to the version branch. Universal uses standard GitHub Actions for this, so add your branch to the workflow cmake yaml to trigger CI for your own branch.

The easiest way to get started is to pick up and copy the directory structure under ROOT/include/universal/number/skeleton_1param or ROOT/include/universal/number/skeleton_2params. They are configured to get you all the constituent pieces of a number system Universal-style.

Installation and usage

After cloning the library, building and testing it in your environment, you can install it via:

> cd universal/build
> cmake .. -DCMAKE_INSTALL_PREFIX:PATH=/your/installation/path
> cmake --build . --config Release --target install -- -j $(nproc)

or manually via the Makefile target in the build directory:

> make -j $(nproc) install

The default install directory is /usr/local under Linux. There is also an uninstall command:

> make uninstall

If you want to use the number systems provided by Universal in your own project, you can use the following CMakeLists.txt structure:

project("my-numerical-experiment")

find_package(UNIVERSAL CONFIG REQUIRED)

add_executable(${PROJECT_NAME} src/mymain.cpp)
set_property(TARGET ${PROJECT_NAME} PROPERTY CXX_STANDARD 17)
target_link_libraries(${PROJECT_NAME} universal::universal)

Controlling the build to include different components

The default build configuration will build the command line tools, a playground, educational and application examples. If you want to build the full regression suite across all the number systems, use the following cmake command:

cmake -DBUILD_ALL=ON ..

For performance, the build configuration can enable specific x86 instruction sets (SSE/AVX/AVX2). For example, if your processor supports the AVX2 instruction set, you can build the test suites and educational examples with the AVX2 flag turned on. This typically yields a 20% performance boost.

cmake -DBUILD_ALL=ON -DUSE_AVX2=ON ..

The library builds a set of useful command utilities to inspect native IEEE float/double/long double types, as well as the custom number systems provided by Universal. Assuming you have built and installed the library, the inspection commands available are:

    ieee           -- show the components of the full set of IEEE floating point values
    quarter        -- show the components and traits of a quarter precision floating-point value (FP8)
    half           -- show the components and traits of a half precision IEEE-754 value (FP16)
    single         -- show the components and traits of a single precision IEEE-754 value (FP32)
    double         -- show the components and traits of a double precision IEEE-754 value (FP64)
    longdouble     -- show the components and traits of a native long double IEEE-754 value
    quad           -- show the components and traits of a quad precision IEEE-754 value (FP128)

    signedint      -- show the components and traits of a signed integer value
    unsignedint    -- show the components and traits of an unsigned integer value
    fixpnt         -- show the components and traits of a fixed-point value
    posit          -- show the components and traits of a posit value
    lns            -- show the components and traits of a logarithmic number system value

    float2posit    -- show the conversion process of a Real value to a posit

    propenv        -- show the properties of the execution (==compiler) environment that built the library
    propp          -- show numerical properties of a posit environment including the associated quire
    propq          -- show numerical properties of a quire

For example:

$ ieee 1.234567890123456789012
compiler              : 7.5.0
float precision       : 23 bits
double precision      : 52 bits
long double precision : 63 bits

Decimal representations
input value:             1.23456789012
      float:                1.23456788
     double:        1.2345678901199999
long double:    1.23456789011999999999

Hex representations
input value:             1.23456789012
      float:                1.23456788    hex: 0.7f.1e0652
     double:        1.2345678901199999    hex: 0.3ff.3c0ca428c1d2b
long double:    1.23456789011999999999    hex: 0.3fff.1e06521460e95b9a

Binary representations:
      float:                1.23456788    bin: 0.01111111.00111100000011001010010
     double:        1.2345678901199999    bin: 0.01111111111.0011110000001100101001000010100011000001110100101011
long double:    1.23456789011999999999    bin: 0.011111111111111.001111000000110010100100001010001100000111010010101101110011010

Native triple representations (sign, scale, fraction):
      float:                1.23456788    triple: (+,0,00111100000011001010010)
     double:        1.2345678901199999    triple: (+,0,0011110000001100101001000010100011000001110100101011)
long double:    1.23456789011999999999    triple: (+,0,001111000000110010100100001010001100000111010010101101110011010)

Universal triple representation (sign, scale, fraction):
input value:             1.23456789012
      float:                1.23456788    triple: (+,0,00111100000011001010010)
     double:        1.2345678901199999    triple: (+,0,0011110000001100101001000010100011000001110100101011)
long double:    1.23456789011999999999    triple: (+,0,001111000000110010100100001010001100000111010010101101110011010)
      exact: TBD

This ieee command is very handy to quickly determine how your development environment represents (truncates) a specific value.

The specific commands single, double, and longdouble focus on float, double, and long double representations respectively.
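
The sign, exponent, and fraction fields that these commands report can be recovered from a native float with a few shifts and masks. The following is a minimal sketch of that decomposition, not the source of the ieee tool: Float32Fields and decompose are hypothetical names, and the code assumes float is IEEE-754 binary32.

```cpp
#include <cstdint>
#include <cstring>

struct Float32Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits, hidden bit not included
};

// Split a binary32 value into its three encoding fields.
Float32Fields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bit pattern
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For the value used in the example above, decompose(1.23456789012f) yields sign 0, exponent 0x7f, and fraction 0x1e0652, matching the float line of the hex representation (0.7f.1e0652).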

There is also a command posit to help you visualize and compare the posit component fields for a given value, for example:

$ posit 1.234567890123456789012
posit< 8,0> = s0 r10 e f01000 qNE v1.25
posit< 8,1> = s0 r10 e0 f0100 qNE v1.25
posit< 8,2> = s0 r10 e00 f010 qNE v1.25
posit< 8,3> = s0 r10 e000 f01 qNE v1.25
posit<16,1> = s0 r10 e0 f001111000001 qNE v1.234619140625
posit<16,2> = s0 r10 e00 f00111100000 qNE v1.234375
posit<16,3> = s0 r10 e000 f0011110000 qNE v1.234375
posit<32,1> = s0 r10 e0 f0011110000001100101001000011 qNE v1.2345678918063641
posit<32,2> = s0 r10 e00 f001111000000110010100100001 qNE v1.2345678880810738
posit<32,3> = s0 r10 e000 f00111100000011001010010001 qNE v1.2345678955316544
posit<48,1> = s0 r10 e0 f00111100000011001010010000101000110001011010 qNE v1.2345678901234578
posit<48,2> = s0 r10 e00 f0011110000001100101001000010100011000101101 qNE v1.2345678901234578
posit<48,3> = s0 r10 e000 f001111000000110010100100001010001100010110 qNE v1.2345678901233441
posit<64,1> = s0 r10 e0 f001111000000110010100100001010001100010110011111101100000000 qNE v1.2345678901234567
posit<64,2> = s0 r10 e00 f00111100000011001010010000101000110001011001111110110000000 qNE v1.2345678901234567
posit<64,3> = s0 r10 e000 f0011110000001100101001000010100011000101100111111011000000 qNE v1.2345678901234567

The fields are prefixed by their first characters, for example, "posit<16,2> = s0 r10 e00 f00111100000 qNE v1.234375"

  • sign field = s0, indicating a positive number
  • regime field = r10, indicates the first positive regime, named regime 0
  • exponent field = e00, indicates two bits of exponent, both 0
  • fraction field = f00111100000, a full set of fraction bits

The field values are followed by a quadrant descriptor and a value representation in decimal:

  • qNE = North-East Quadrant, representing a number in the range "[1, maxpos]"
  • v1.234375 = the value representation of the posit projection
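
The field breakdown above can be reproduced with a short decoder. The sketch below follows the standard posit decoding rule, value = (1 + fraction) x 2^(regime x 2^es + exponent), and uses only the standard library. PositFields and decode_posit are hypothetical names for illustration and are not the Universal API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

struct PositFields {
    bool negative;      // sign field
    int regime;         // regime value k
    unsigned exponent;  // exponent field value, in [0, 2^es)
    double fraction;    // fraction field as a value in [0, 1)
    double value;       // decoded real value
};

// Decode the low nbits of `encoding` as a posit<nbits, es> (hypothetical helper).
PositFields decode_posit(std::uint64_t encoding, int nbits, int es) {
    assert(nbits >= 2 && nbits <= 32 && es >= 0 && es <= 4);
    const std::uint64_t mask = (std::uint64_t{1} << nbits) - 1;
    std::uint64_t bits = encoding & mask;
    PositFields p{false, 0, 0u, 0.0, 0.0};
    if (bits == 0) return p;                          // zero
    if (bits == (std::uint64_t{1} << (nbits - 1))) {  // NaR: sign bit only
        p.value = std::numeric_limits<double>::quiet_NaN();
        return p;
    }
    p.negative = ((bits >> (nbits - 1)) & 1u) != 0;
    if (p.negative) bits = (~bits + 1) & mask;        // two's complement magnitude

    // regime: run of identical bits starting just below the sign bit
    int i = nbits - 2;
    const std::uint64_t first = (bits >> i) & 1u;
    int run = 0;
    while (i >= 0 && ((bits >> i) & 1u) == first) { ++run; --i; }
    p.regime = first ? run - 1 : -run;

    // i now indexes the regime terminator, so i bits remain below it
    const int remaining = i > 0 ? i : 0;
    const int ebits = es < remaining ? es : remaining;
    const int fbits = remaining - ebits;
    // exponent bits cut off by a long regime are treated as zeros
    p.exponent = static_cast<unsigned>(
        ((bits >> fbits) & ((std::uint64_t{1} << ebits) - 1)) << (es - ebits));
    const std::uint64_t f = bits & ((std::uint64_t{1} << fbits) - 1);
    p.fraction = std::ldexp(static_cast<double>(f), -fbits);

    const int scale = p.regime * (1 << es) + static_cast<int>(p.exponent);
    const double magnitude = std::ldexp(1.0 + p.fraction, scale);
    p.value = p.negative ? -magnitude : magnitude;
    return p;
}
```

The posit<16,2> example corresponds to the bit pattern 0 10 00 00111100000, or 0x41E0. Calling decode_posit(0x41E0, 16, 2) gives regime 0, exponent 0, fraction 0.234375, and value 1.234375.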

The positive regime for a posit shows a very specific structure, as can be seen in the image below: regime structure

Leveraging the Universal library in your own mixed-precision algorithm research

To bootstrap any new mixed-precision algorithm development and optimization project quickly and robustly, there is a github template repo available that will set up a complete working development environment with dependent libraries, development containers, VS Code integration, and a Github CI workflow. The template repo can be found at mpadao-template.

The template repo is the easiest way to get started with mixed-precision algorithm development using Universal.

Motivation

Modern Deep Learning AI applications are very demanding high-performance applications. Runtimes to train models are measured in terms of weeks, and target latencies for inference are 10-100 msec. Standard double, and even single, precision IEEE-754 floating-point have been too expensive to use in addressing the performance and power requirements of AI applications in both the cloud and the edge. Both Google and Microsoft have jettisoned traditional floating-point formats for their AI cloud services to gain two orders of magnitude better performance. Similarly, AI applications for mobile and embedded applications are requantized to small integers to fit their very stringent power budgets. The AI domain has been researching better number systems to address both power and performance requirements, but all these efforts have worked in isolation, with results being difficult to reproduce.

AI applications are only some of the applications that expose the limitations of traditional hardware.
Inefficiencies in numeric storage and operations also limit cloud scale, IoT, embedded, and HPC applications. A simple change to a new number system may improve the scale and cost of these applications by orders of magnitude.

When performance and power efficiency are the differentiating attributes for a use case, arithmetic systems that are tailored to the needs of the application are desired.

In particular, there are two concerns when using the IEEE-754 floating-point formats:

  • inefficient representation of the real numbers
  • irreproducibility in the context of concurrency

More specifically,

  1. Wasted Bit Patterns - 32-bit IEEE floating point has around sixteen million ways to represent NaN (Not-A-Number), namely 2^24 - 2 when both signs are counted, while 64-bit floating point has about nine quadrillion (2^53 - 2, approximately 9.007x10^15). A NaN is an exceptional value to represent undefined or invalid results, such as the result of dividing zero by zero.
  2. Mathematically Incorrect
       - The format specifies two zeroes, a negative and positive zero, with different behaviors.
       - Loss of associative and distributive law due to rounding after each operation. This loss of associative and distributive arithmetic behavior creates an irreproducible result of concurrent programs that use the IEEE floating point. This is particularly problematic for embedded and control applications.
  3. Overflows to ± inf and underflows to 0 - Overflowing to ± inf increases the relative error by an infinite factor while underflowing to 0 loses sign information.
  4. Unused dynamic range - The dynamic range of double precision floats is a whopping 2^2047, whereas most numerical software is architected to operate around 1.0.
  5. Complicated Circuitry - Denormalized floating point numbers have a hidden bit of 0 instead of 1. This creates a host of special handling requirements that complicate compliant hardware implementations.
  6. No Gradual Overflow and Fixed Accuracy - If accuracy is defined as the number of significand bits, IEEE floating-point has fixed accuracy for all numbers except denormalized numbers because the number of significand digits is fixed. Denormalized numbers are characterized by decreased significand digits when the value approaches zero due to having a zero hidden bit. Denormalized numbers fill the underflow gap (i.e., between zero and the least non-zero values). The counterpart for gradual underflow is gradual overflow which does not exist in IEEE floating points.
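
Several of these behaviors can be observed directly with native doubles. The helpers below are a minimal sketch with hypothetical names. They assume IEEE-754 binary64 doubles, the default round-to-nearest mode, and a compiler that is not running in a fast-math mode.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Rounding after each operation means the grouping of additions can change the result.
bool addition_is_associative(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}

// +0 and -0 compare equal, yet 1/x separates them into +inf and -inf.
bool zeros_equal_but_distinguishable(double pos_zero, double neg_zero) {
    return pos_zero == neg_zero && (1.0 / pos_zero) != (1.0 / neg_zero);
}

// A finite value scaled past the largest finite double becomes infinity.
bool doubling_overflows_to_infinity(double x) {
    return std::isfinite(x) && std::isinf(x * 2.0);
}

// NaN encodings for a format with the given fraction width, counting both signs:
// the exponent is all ones and the fraction is any non-zero pattern.
std::uint64_t nan_encodings(int fraction_bits) {
    return 2 * ((std::uint64_t{1} << fraction_bits) - 1);
}
```

For instance, addition_is_associative(1.0e16, -1.0e16, 1.0) returns false: (a + b) + c evaluates to 1, whereas in a + (b + c) the 1 is rounded away before the cancellation and the result is 0. Under the counting convention stated in the comment, nan_encodings(23) is 16,777,214 and nan_encodings(52) is 9,007,199,254,740,990.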

In contrast, the posit number system was designed to overcome these negatives:

  1. Economical - No bit patterns are redundant. There is one representation for infinity denoted as ± inf and zero. All other bit patterns are valid distinct non-zero real numbers. ± inf serves as a replacement for NaN.
  2. Preserves Mathematical Properties - There is only one representation for zero, and the encoding is symmetric around 1.0. Associative and distributive laws are supported through deferred rounding via the quire, enabling reproducible linear algebra algorithms in any concurrency environment.
  3. Tapered Accuracy - Tapered accuracy is when values with small exponents have more precision and values with large exponents have fewer digits of accuracy. This concept was first introduced by Morris (1971) in his paper "Tapered Floating Point: A New Floating-Point Representation".
  4. Parameterized precision and dynamic range - posits are defined by a size, nbits, and the number of exponent bits, es. This enables system designers the freedom to pick the right precision and dynamic range required for the application. For example, we may pick 5- or 6-bit posits without any exponent bits for AI applications to improve performance. For embedded DSP applications, such as 5G base stations, we may select a 16-bit posit with one exponent bit to improve performance per Watt.
  5. Simpler Circuitry - There are only two exceptional cases, Not a Real and Zero. No denormalized numbers, overflow, or underflow.
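
The effect of the two parameters nbits and es on dynamic range follows from the standard posit definitions: useed = 2^(2^es), maxpos = useed^(nbits - 2), and minpos = 1 / maxpos. A small sketch, using hypothetical helper names and double as the carrier type (so it is only valid while the scale stays within the range of double):

```cpp
#include <cmath>

// useed = 2^(2^es): the scale factor contributed by each additional regime bit.
double useed(int es) {
    return std::ldexp(1.0, 1 << es);
}

// Largest finite posit value: maxpos = useed^(nbits - 2).
double maxpos(int nbits, int es) {
    return std::ldexp(1.0, (nbits - 2) * (1 << es));
}

// Smallest positive posit value: minpos = 1 / maxpos.
double minpos(int nbits, int es) {
    return std::ldexp(1.0, -(nbits - 2) * (1 << es));
}
```

For the configurations mentioned above, an 8-bit posit without exponent bits spans 1/64 to 64, while a 16-bit posit with one exponent bit spans 2^-28 to 2^28. Adding exponent bits widens the range at the cost of fraction bits near 1.0.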

However, as Deep Learning has demonstrated, there are many different requirements to optimize an arithmetic and tailor it to the needs of the application. Universal brings all the machinery together to efficiently compare and contrast different arithmetic number system designs before committing them to hardware.

Goals of the library

The Universal library started as a bit-level arithmetic reference implementation of the evolving unum Type III (posit and valid) standard. However, the demands for supporting number systems, such as adaptive-precision integers to solve large factorials, adaptive-precision floats to act as Oracles, or comparing linear and tapered floats provided the opportunity to create a complete platform for numerical analysis and computational mathematics. With this Universal platform, we enable a new direction for optimizing algorithms to take advantage of mixed-precision computation to maximize performance and minimize energy demands. Energy efficiency is going to be the key differentiator for embedded intelligence applications.

As a reference library, Universal offers an extensive test infrastructure to validate number system arithmetic operations, and there is a host of utilities to inspect the internal encodings and operations of the different number systems.

The design space for custom arithmetic is vast, and any contribution to expanding the capability of the Universal library is encouraged.

Contributing to universal

We are happy to accept pull requests via GitHub. The only requirement is that the entire regression test suite passes.

Verification Suite

Typically, the verification suite is run as part of the build directory's make test command. However, it is possible to run specific test suite components, for example, to validate algorithmic changes to more complex arithmetic functions, such as square root, exponent, logarithm, and trigonometric functions.

Here is an example:

>:~/dev/universal/build$  make posit_logarithm
[ 50%] Building CXX object static/posit/CMakeFiles/posit_logarithm.dir/math/logarithm.cpp.o
[100%] Linking CXX executable posit_logarithm
[100%] Built target posit_logarithm
>:~/dev/universal/build$ static/posit/posit_logarithm
posit logarithm function validation: results only
               4 -> log(4) =  1.3862943649292
0110000000000000 -> log(4) = 0100011000101110 (reference: 0100011000101110)   PASS

posit<2,0>                                                   log PASS
posit<3,0>                                                   log PASS
posit<3,1>                                                   log PASS
posit<4,0>                                                   log PASS
posit<4,1>                                                   log PASS
posit<5,0>                                                   log PASS
posit<5,1>                                                   log PASS
posit<5,2>                                                   log PASS
posit<8,4>                                                   log PASS
posit<8,4>                                                   log2 PASS
posit<8,4>                                                   log10 PASS
posit logarithm function validation: PASS

Structure of the tree

The universal library contains a set of functional groups to organize the development and validation of different number systems. Each number system type has a single include file that brings together the arithmetic number type and all the extensions that Universal has standardized so that working with numeric types is more productive. These extensions include facilities for number traits, an arithmetic exception hierarchy, number system attributes, manipulators, and finally, a math library specialized for the type.

The number system types are categorized as static or elastic. Static types are arithmetic types that have a constant, that is, static, size, and thus can be used for sharing composite data structures, such as matrices and tensors, between general-purpose CPUs and special-purpose hardware accelerators. Elastic types are arithmetic types that can grow and shrink during computation, typically to accommodate error-free, or closed, computations.

In addition to the static and elastic classification, we also recognize that the base of the number system is a key parameter in the arithmetic and numerical traits of the type. In particular, the tree will specialize in binary and decimal forms of arithmetic.

Here is a complete list:

static fixed-sized configurations

  • universal/number/integer - arbitrary configuration fixed-size integer
  • universal/number/fixpnt - arbitrary configuration fixed-size fixed-point number system
  • universal/number/areal - arbitrary configuration fixed-size faithful floating-point with uncertainty bit
  • universal/number/cfloat - arbitrary configuration fixed-size classic floating-point number system
  • universal/number/posit - arbitrary configuration fixed-size posit number system, a tapered floating-point
  • universal/number/valid - arbitrary configuration fixed-size valid number system, a tapered floating-point interval number system
  • universal/number/unum - arbitrary configuration unum Type 1 number system
  • universal/number/unum2 - arbitrary configuration unum Type 2 number system
  • universal/number/lns - arbitrary configuration logarithmic number system with fixed-point exponent
  • universal/number/dbns - double base number system with integer exponents
  • universal/number/sorn - set of real numbers

elastic adaptive-precision configurations

  • universal/number/decimal - adaptive-precision decimal integer
  • universal/number/einteger - adaptive-precision binary integer
  • universal/number/rational - adaptive-precision rational number system
  • universal/number/efloat - adaptive-precision linear floating-point
  • universal/number/eposit - adaptive-precision tapered floating-point

super-accumulator facilities

  • universal/number/quire - arbitrary configuration fixed-size super accumulator number system (add/sub/abs/sqrt)
  • universal/number/float - contains the implementation of the IEEE floating point augmentations for reproducible computation

And each of these functionality groups has an associated test suite located in ".../universal/tests/..."

Background information

Universal numbers, unums for short, express real numbers and ranges of real numbers. There are two modes of operation, selectable by the programmer, posit mode and valid mode.

In posit mode, a unum behaves like a floating-point number of fixed size, rounding to the nearest expressible value if the result of a calculation is not expressible exactly. A posit offers more accuracy and a wider dynamic range than floats with the same number of bits.

In valid mode, a unum represents a range of real numbers and can be used to bound answers rigorously, much as interval arithmetic does.

Posit configurations have a specific relationship to one another. When expanding a posit, the new value falls 'between' the old values of the smaller posit. The new value is the arithmetic mean of the two numbers if the expanding bit is a fraction bit, and it is the geometric mean of the two numbers if the expanding bit is a regime or exponent bit.

[Figure: visualization of the expansion of posit<2,0> to posit<7,1>]
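
The arithmetic-mean and geometric-mean relationship can be checked numerically with a small decoder. The following is a minimal sketch, not the library's implementation: `decode_positive_posit` is a hypothetical helper that interprets a positive, nonzero posit bit pattern of up to 32 bits using the standard regime/exponent/fraction layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a positive, nonzero posit<nbits, es> bit pattern into a double.
// Hypothetical helper for illustration; not part of the Universal API.
inline double decode_positive_posit(unsigned nbits, unsigned es, std::uint32_t bits) {
    assert(nbits >= 2 && nbits <= 32);
    assert(bits != 0 && (bits >> (nbits - 1)) == 0);  // sign bit clear, not zero

    int pos = static_cast<int>(nbits) - 2;            // first regime bit
    const unsigned first = (bits >> pos) & 1u;
    int run = 0;                                      // length of the regime run
    while (pos >= 0 && ((bits >> pos) & 1u) == first) { ++run; --pos; }
    const int k = first ? run - 1 : -run;             // regime value
    --pos;                                            // skip the regime terminator, if any

    unsigned e = 0;                                   // exponent bits; truncated bits read as 0
    for (unsigned i = 0; i < es; ++i) {
        e <<= 1;
        if (pos >= 0) { e |= (bits >> pos) & 1u; --pos; }
    }

    double fraction = 1.0;                            // hidden bit plus remaining fraction bits
    for (double weight = 0.5; pos >= 0; --pos, weight *= 0.5)
        if ((bits >> pos) & 1u) fraction += weight;

    const int scale = k * (1 << es) + static_cast<int>(e);
    return std::ldexp(fraction, scale);
}
```

With this helper, posit<3,0> has the neighbors 1 (pattern 010) and 2 (pattern 011); expanding to posit<4,0> adds a fraction bit, and the new pattern 0101 decodes to 1.5, their arithmetic mean. For posit<3,1> the same neighbors are 1 and 4; expanding to posit<4,1> adds an exponent bit, and 0101 decodes to 2, their geometric mean.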

Public Domain and community resources

The unum format is a public domain specification, and a collection of web resources is available to manage information and discussions about unums.

Posit Hub

Unum-computing Google Group

Projects that leverage posits

Matrix Template Library

The Matrix Template Library incorporates modern C++ programming techniques to provide an easy and intuitive interface to users while enabling optimal performance. The natural mathematical notation in MTL4 empowers all engineers and scientists to implement their algorithms and models in minimal time. All technical aspects are encapsulated in the library. Think of it as MATLAB for applications.

G+SMO

G+Smo (Geometry + Simulation Modules, pronounced "gismo") is a new open-source C++ library that brings together mathematical tools for geometric design and numerical simulation. It is developed mainly by researchers and PhD students. It implements the relatively new paradigm of isogeometric analysis, which suggests the use of a unified framework in the design and analysis pipeline. G+Smo is an object-oriented, cross-platform, template C++ library and follows the generic programming principle, with a focus on both efficiency and ease of use. The library is partitioned into smaller entities, called modules. Examples of available modules include the dimension-independent NURBS module, the data fitting and solid segmentation module, the PDE discretization module and the adaptive spline module, based on hierarchical splines of arbitrary dimension and polynomial degree.

FEniCS

FEniCS is a popular open-source (LGPLv3) computing platform for solving partial differential equations (PDEs). FEniCS enables users to quickly translate scientific models into efficient finite element code. With the high-level Python and C++ interfaces to FEniCS, it is easy to get started, but FEniCS offers also powerful capabilities for more experienced programmers. FEniCS runs on a multitude of platforms ranging from laptops to high-performance clusters.

ODEINT-v2

Odeint is a modern C++ library for numerically solving Ordinary Differential Equations. It is developed in a generic way using Template Metaprogramming which leads to extraordinary high flexibility at top performance. The numerical algorithms are implemented independently of the underlying arithmetics. This results in an incredible applicability of the library, especially in non-standard environments. For example, odeint supports matrix types, arbitrary precision arithmetics and even can be easily run on CUDA GPUs.

Several AI and Deep Learning libraries are being reengineered to enable the use of posits for both training and inference. They will be announced as they are released.

License

FOSSA Status

universal's People

Contributors

allanleal, billzorn, cjdelisle, codacy-badger, davidmallasen, davidsummers, floedelmann, fossabot, gussmith23, hypercubestart, ibmtomtzigt, jamesquinlan, jtodd440, lvandam, mmoelle1, mrpnk, petergottschling, ravenwater, shikharvashistha, silvan-k, snyk-bot, spakin, willwray, xman, yboettcher

universal's Issues

adaptive precision linear floating-point addition/subtraction

implement addition for arbitrary precision linear floating point

We have arbitrary precision and adaptive precision vocabulary.

arbitrary precision arithmetic is any arithmetic using a given fixed precision, whereas adaptive precision changes representations during computation to capture all relevant bits.

arbitrary precision arithmetic is, say float<nbits, es> where nbits = 32, and es = 8, which would represent single precision floating point.

adaptive precision arithmetic is afloat, without any template parameters as the size and representation will adapt to the computation.

adaptive precision linear floating-point multiplication

implement multiplication for arbitrary precision linear floating point

We have arbitrary precision and adaptive precision vocabulary.

arbitrary precision arithmetic is any arithmetic using a given fixed precision, whereas adaptive precision changes representations during computation to capture all relevant bits.

arbitrary precision arithmetic is, say float<nbits, es> where nbits = 32, and es = 8, which would represent single precision floating point.

adaptive precision arithmetic is afloat, without any template parameters as the size and representation will adapt to the computation.

Standard 64bit_posit arithmetic test suite is failing

The standard posit<64,3> has a maximum fraction at 1.0 of nbits - sign_bit - 2_regime_bits - 3_exponent_bits = 64 - 1 - 2 - 3 = 58 bits. This is more bits than a regular double can capture, and thus doubles can't be used to create golden references.
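
The bit count above can be turned into a compile-time check. This is a small sketch under the same assumptions (one sign bit, a two-bit regime around 1.0, es exponent bits); the helper names are made up for illustration. It compares the posit significand width, fraction bits plus the hidden bit, against `std::numeric_limits<T>::digits` of a candidate reference type.

```cpp
#include <limits>

// Maximum number of fraction bits of posit<nbits, es>, reached around 1.0
// where the regime occupies only two bits. Hypothetical helper.
constexpr int max_fraction_bits(int nbits, int es) {
    return nbits - 1 - 2 - es > 0 ? nbits - 1 - 2 - es : 0;
}

// True when the significand of native type T (hidden bit included) is wide
// enough to hold the widest posit<nbits, es> significand without rounding.
template<typename T>
constexpr bool significand_fits(int nbits, int es) {
    return max_fraction_bits(nbits, es) + 1 <= std::numeric_limits<T>::digits;
}

static_assert(max_fraction_bits(64, 3) == 58, "matches the count in the text");
static_assert(!significand_fits<double>(64, 3), "a 53-bit double significand is too narrow");
```

Querying `std::numeric_limits<long double>::digits` in the same way reports what a given platform actually provides, which avoids hard-coding an assumption about the long double format.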

Unfortunately, different platforms provide different implementations of long double, which creates a lot of gnarly environmental checking to find the right bits to extract. At this point, something is going wrong, and the test suite is consistently failing on both Windows and Linux (gcc). Windows provides no long double, and Linux gcc 5.4.0 provides an 80-bit long double. It does provide a true long double __int128_t implementation, but I haven't been able to investigate how to use this.

However, from a general verification test suite perspective, long doubles would just be a stopgap measure. A more appropriate approach would be to leverage an arbitrary precision floating point library, like MPFR. However, this would create an external dependency that would make the posit lib more difficult to use, so we would like to get at least the posit<64,3> working with native types, and then split out a larger, more comprehensive verification environment that would use MPFR.

native posit serialization code

the posit operator<<() marshals its value through standard IEEE long double, thus limiting the accurate output of posit values that lie outside of the precision and dynamic range of IEEE long double.

we need a native serialization implementation to support arbitrary large posits. This will require an arbitrary precision float and its serialization.

Consistent exception design across all number systems

With the posit number system we added a conditionally compiled exception mechanism to enable the use case to trap a piece of code and catch any arithmetic errors, such as division by zero. This initial design was relatively ad hoc. With the addition of modulo and saturating number systems, the number of interesting exception conditions has increased, begging for a revisit of the overall design/architecture, and possibly a better design to aid numerical debugging and root cause analysis.

C api conversions between posits uses double

When converting from one posit type to another, the conversion implementation converts the posit to a double and then converts the double to the other type posit, which may lose bits.
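
The loss can be illustrated without the library. In the sketch below a 64-bit integer stands in for a wide posit significand (this is an assumption made for illustration, not how the C API stores values), and the helper reports whether a value survives a round trip through double.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when v is unchanged after converting to double and back.
// A double carries a 53-bit significand, so wider bit patterns get rounded.
inline bool survives_double_round_trip(std::uint64_t v) {
    assert(v < (std::uint64_t(1) << 63));  // keep the conversion back well defined
    const double d = static_cast<double>(v);
    return static_cast<std::uint64_t>(d) == v;
}
```

Any value that needs more than 53 significant bits, such as 2^58 + 1, comes back changed, which mirrors what happens to the low fraction bits of a large posit that is marshalled through double.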

Redesign project structure

With the universal library now being a collection of number systems, we need to reorganize the project structure so that each number system has its own segment. This will make it easier to work with the code and provide a better overview of all the functionality in the library.

Compilation with Clang leads to constexpr errors

Compiling with Clang 8 on Mac:

/Users/gus/universal-llvm/./include/universal/posit/posit.hpp:1732:27: error: constexpr variable 'nbits' must be initialized by a constant expression
        constexpr size_t nbits = number.nbits;
                                 ^~~~~~~~~~~~
/Users/gus/universal-llvm/numerical/thin_triangle.cpp:223:35: note: in instantiation of function template specialization
      'sw::unum::to_binary<sw::unum::posit<32, 2> >' requested here
        ostr << "    a  = " << sw::unum::to_binary(a) << " " << sw::unum::to_base2_scientific(a) << " : " << std::showpos << a << std::noshowpos << std::endl;
                                         ^
/Users/gus/universal-llvm/numerical/thin_triangle.cpp:289:3: note: in instantiation of function template specialization
      'printTriangleConfiguration<sw::unum::posit<32, 2> >' requested here
                printTriangleConfiguration<Posit>(cout, a, b, c);
                ^

There are many instances of this same error. Any ideas?

compilation errors with gcc9 in std random

/home/stillwater/universal/tests/functions/lerp.cpp: In function 'int main(int, char**)':
/home/stillwater/universal/tests/functions/lerp.cpp:67:15: error: no matching function for call to 'std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>::mersenne_twister_engine(Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >&)'
67 | Rand rng(seed);
| ^
In file included from /usr/local/include/c++/9.3.0/random:49,
from /home/stillwater/universal/tests/functions/lerp.cpp:6:
/usr/local/include/c++/9.3.0/bits/random.h:530:9: note: candidate: 'template<class _Sseq, class> std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>::mersenne_twister_engine(_Sseq&)'
530 | mersenne_twister_engine(_Sseq& __q)
| ^~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/c++/9.3.0/bits/random.h:530:9: note: template argument deduction/substitution failed:
/usr/local/include/c++/9.3.0/bits/random.h: In substitution of 'template<class _Sseq, class _Engine, class _Res, class _GenerateCheck> using __is_seed_seq = std::_and<std::_not<std::is_same<typename std::remove_cv<typename std::remove_reference<_Tp>::type>::type, _Engine> >, std::is_unsigned, std::_not<std::is_convertible<_Sseq, _Res> > > [with _Sseq = Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >; _Engine = std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>; _Res = long unsigned int; _GenerateCheck = void]':
/usr/local/include/c++/9.3.0/bits/random.h:491:8: required by substitution of 'template<class _UIntType, long unsigned int __w, long unsigned int __n, long unsigned int __m, long unsigned int __r, _UIntType __a, long unsigned int __u, _UIntType __d, long unsigned int __s, _UIntType __b, long unsigned int __t, _UIntType __c, long unsigned int __l, _UIntType __f> template using _If_seed_seq = typename std::enable_if<std::__detail::__is_seed_seq<_Sseq, std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>, _UIntType>::value>::type [with _Sseq = Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >; _UIntType = long unsigned int; long unsigned int __w = 64; long unsigned int __n = 312; long unsigned int __m = 156; long unsigned int __r = 31; _UIntType __a = 13043109905998158313; long unsigned int __u = 29; _UIntType __d = 6148914691236517205; long unsigned int __s = 17; _UIntType __b = 8202884508482404352; long unsigned int __t = 37; _UIntType __c = 18444473444759240704; long unsigned int __l = 43; _UIntType __f = 6364136223846793005]'
/usr/local/include/c++/9.3.0/bits/random.h:528:32: required from here
/usr/local/include/c++/9.3.0/bits/random.h:197:13: error: no type named 'result_type' in 'class Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >'
197 | using __is_seed_seq = _and<
| ^~~~~~~~~~~~~
/usr/local/include/c++/9.3.0/bits/random.h:519:7: note: candidate: 'std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>::mersenne_twister_engine(std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>::result_type) [with _UIntType = long unsigned int; long unsigned int __w = 64; long unsigned int __n = 312; long unsigned int __m = 156; long unsigned int __r = 31; _UIntType __a = 13043109905998158313; long unsigned int __u = 29; _UIntType __d = 6148914691236517205; long unsigned int __s = 17; _UIntType __b = 8202884508482404352; long unsigned int __t = 37; _UIntType __c = 18444473444759240704; long unsigned int __l = 43; _UIntType __f = 6364136223846793005; std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>::result_type = long unsigned int]'
519 | mersenne_twister_engine(result_type __sd)
| ^~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/c++/9.3.0/bits/random.h:519:43: note: no known conversion for argument 1 from 'Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >' to 'std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>::result_type' {aka 'long unsigned int'}
519 | mersenne_twister_engine(result_type __sd)
| ~~~~~~~~~~~~^~~~
/usr/local/include/c++/9.3.0/bits/random.h:516:7: note: candidate: 'std::mersenne_twister_engine<_UIntType, __w, __n, __m, __r, __a, __u, __d, __s, __b, __t, __c, __l, __f>::mersenne_twister_engine() [with _UIntType = long unsigned int; long unsigned int __w = 64; long unsigned int __n = 312; long unsigned int __m = 156; long unsigned int __r = 31; _UIntType __a = 13043109905998158313; long unsigned int __u = 29; _UIntType __d = 6148914691236517205; long unsigned int __s = 17; _UIntType __b = 8202884508482404352; long unsigned int __t = 37; _UIntType __c = 18444473444759240704; long unsigned int __l = 43; _UIntType __f = 6364136223846793005]'
516 | mersenne_twister_engine() : mersenne_twister_engine(default_seed) { }
| ^~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/c++/9.3.0/bits/random.h:516:7: note: candidate expects 0 arguments, 1 provided
/usr/local/include/c++/9.3.0/bits/random.h:461:11: note: candidate: 'constexpr std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>::mersenne_twister_engine(const std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>&)'
461 | class mersenne_twister_engine
| ^~~~~~~~~~~~~~~~~~~~~~~
/usr/local/include/c++/9.3.0/bits/random.h:461:11: note: no known conversion for argument 1 from 'Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >' to 'const std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>&'
/usr/local/include/c++/9.3.0/bits/random.h:461:11: note: candidate: 'constexpr std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>::mersenne_twister_engine(std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>&&)'
/usr/local/include/c++/9.3.0/bits/random.h:461:11: note: no known conversion for argument 1 from 'Seed<std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005> >' to 'std::mersenne_twister_engine<long unsigned int, 64, 312, 156, 31, 13043109905998158313, 29, 6148914691236517205, 17, 8202884508482404352, 37, 18444473444759240704, 43, 6364136223846793005>&&'
make[2]: *** [tests/functions/CMakeFiles/functions_lerp.dir/build.make:82: tests/functions/CMakeFiles/functions_lerp.dir/lerp.cpp.o] Error 1

Reciprocation needs some work

Currently, the division pipeline a / b is configured as a multiply by reciprocal a * 1/b. The reciprocal uses the reflective property between upper and lower quadrants of the projective circle, which we know does not work for non-powers of 2.

The current idea is to use the reflection as the first guess and then use an iterative approximation like Newton-Raphson to complete the reciprocation correctly. The iterative approximation is currently missing.
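
A minimal sketch of the proposed refinement, written for double rather than posit so it stays self-contained. The power-of-two first guess is only a stand-in for the posit reflection, and the function name and iteration count are assumptions. The update x = x * (2 - b * x) is the Newton-Raphson step for f(x) = 1/x - b and needs only multiplication and subtraction.

```cpp
#include <cassert>
#include <cmath>

// Approximate 1/b for a positive, normal double using Newton-Raphson.
inline double reciprocal_newton(double b, int iterations = 6) {
    assert(std::isnormal(b) && b > 0.0);
    int e = 0;
    (void)std::frexp(b, &e);          // b = m * 2^e with m in [0.5, 1)
    double x = std::ldexp(1.0, -e);   // power-of-two first guess, so b * x = m
    for (int i = 0; i < iterations; ++i) {
        x = x * (2.0 - b * x);        // the relative error squares on every step
    }
    return x;
}
```

Because the initial relative error is at most 0.5 and squares on every step, six iterations bring it below double precision. A posit version would need its own analysis of how many steps the reflection-based guess requires.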

Unable to convert posits to integer.

Compiled using g++ new.cpp gives me this error.

new.cpp: In function ‘int main()’:
new.cpp:12:4: error: cannot convert ‘sw::unum::posit<32, 2>’ to ‘int’ in assignment
x=a;

new.cpp

   #include "universal/posit/posit"

   int main()
   {

      sw::unum::posit<32,2> a = 1.2;
      cout << a << endl;

      int x;
      x=a;
      cout << x << endl;

      return 0;
   }

g++ new.cpp

Is there a way to convert posit to int ?
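
The error message says that no implicit conversion from the posit type to int is available. A common C++ design for number types is to offer explicit conversion operators only, so that narrowing has to be requested with a cast. The sketch below models that design with a hypothetical stand-in type, `real_like`; whether a particular version of the posit header provides `explicit operator int()` is an assumption to check against that header.

```cpp
// Stand-in for a number type that only allows explicit conversion to int.
struct real_like {
    double value;
    explicit operator int() const { return static_cast<int>(value); }
};

inline int to_int(const real_like& r) {
    // int x = r;               // would not compile: the conversion is explicit
    return static_cast<int>(r); // an explicit cast selects the conversion operator
}
```

With such a type, the assignment `x = a;` from the report would become `x = static_cast<int>(a);` or `x = int(a);`.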

quire for floating point types is failing

The quire type is implemented as a generic super-accumulator divided in three segments:

  • lower segment for bits right of the radix point
  • upper segment for bits left of the radix point till maxpos^2
  • capacity segment for bits left of the maxpos^2 scale bit

The input to the quire is a (sign, scale, fraction) triplet of arbitrary size. This makes the core quire implementation also usable for float/double/long double, and for experimenting with deferred rounding with IEEE floating-point types. The only difference is the dynamic range calculation to size the super-accumulator.
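
For posits, the dynamic range calculation follows from maxpos = 2^((nbits-2) * 2^es) and minpos = 1/maxpos. The sketch below derives the widths of the lower and upper segments from those relations. The function names are hypothetical, and the capacity segment is left out because its size is a separate design parameter.

```cpp
#include <cassert>

// Binary scale of maxpos^2 for posit<nbits, es>: twice the scale of maxpos.
inline int maxpos_squared_scale(int nbits, int es) {
    assert(nbits >= 2 && es >= 0);
    return 2 * (nbits - 2) * (1 << es);
}

// Bits right of the radix point: weights 2^-1 down to minpos^2 = 2^-scale.
inline int lower_segment_bits(int nbits, int es) {
    return maxpos_squared_scale(nbits, es);
}

// Bits left of the radix point: weights 2^0 up to and including 2^scale.
inline int upper_segment_bits(int nbits, int es) {
    return maxpos_squared_scale(nbits, es) + 1;
}
```

Counting the bit at weight 2^0 is what makes the upper segment one bit wider than the lower one in this derivation. For posit<16,1> it gives 56 bits below and 57 bits above the radix point.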

The current test suite is still just an experimental program, and some thought needs to go into what is important to test.

literals in logic operators need support

this code:
posit<16,1> p(1);
...
if (p > 0) {
// do stuff when p is positive
} else {
// do stuff when p is negative
}
will require some syntactic love to enable.
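
One way to provide this support is a pair of comparison overloads per operator that accept any native arithmetic type on either side. The following is a sketch on a hypothetical stand-in type, `number_like`, and is not the library's actual solution; a real implementation would convert the literal to the number type instead of comparing through double.

```cpp
#include <type_traits>

// Stand-in number type with an explicit constructor, so literals do not
// convert implicitly and mixed comparisons need their own overloads.
struct number_like {
    double value;
    constexpr explicit number_like(double v) : value(v) {}
};

constexpr bool operator>(const number_like& lhs, const number_like& rhs) {
    return lhs.value > rhs.value;
}

// number_like > literal
template<typename T, typename = typename std::enable_if<std::is_arithmetic<T>::value>::type>
constexpr bool operator>(const number_like& lhs, T rhs) {
    return lhs.value > static_cast<double>(rhs);
}

// literal > number_like
template<typename T, typename = typename std::enable_if<std::is_arithmetic<T>::value>::type>
constexpr bool operator>(T lhs, const number_like& rhs) {
    return static_cast<double>(lhs) > rhs.value;
}
```

With these overloads in place, `if (p > 0)` compiles for a `number_like p(1)`, and the remaining relational operators would follow the same pattern.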

command propenv does not use fast posits

image

propenv is configured to use the generalized posits, instead of fast posits, which are optimal with respect to memory storage.

Two options: move the generalized posits to use blockbinary storage that is optimal, or enable fast posits in propenv.

Quicker option is to enable fast posits and recompile.

Compiling programs with multiple sources returns multiple definition error.

Trying to compile main.cpp foo.cpp and foo2.cpp gives me this error.

CMakeFiles/edu_main.dir/foo.o: In function `sw::unum::ulp(float const&)': foo.cpp:(.text+0x0): multiple definition of `sw::unum::ulp(float const&)' CMakeFiles/edu_main.dir/main.o:main.cpp:(.text+0x0): first defined here CMakeFiles/edu_main.dir/foo.o: In function `sw::unum::ulp(double const&)': foo.cpp:(.text+0x4a): multiple definition of `sw::unum::ulp(double const&)' CMakeFiles/edu_main.dir/main.o:main.cpp:(.text+0x4a): first defined here CMakeFiles/edu_main.dir/foo2.o: In function `sw::unum::ulp(float const&)': foo2.cpp:(.text+0x0): multiple definition of `sw::unum::ulp(float const&)' CMakeFiles/edu_main.dir/main.o:main.cpp:(.text+0x0): first defined here CMakeFiles/edu_main.dir/foo2.o: In function `sw::unum::ulp(double const&)': foo2.cpp:(.text+0x4a): multiple definition of `sw::unum::ulp(double const&)'

Tried compiling the same using g++ and it still throws the same error

/tmp/ccaDTIuZ.o: In function `sw::unum::ulp(float const&)': foo.cpp:(.text+0x0): multiple definition of `sw::unum::ulp(float const&)' /tmp/ccQBbsJ7.o:main.cpp:(.text+0x0): first defined here /tmp/ccaDTIuZ.o: In function `sw::unum::ulp(double const&)': foo.cpp:(.text+0x4a): multiple definition of `sw::unum::ulp(double const&)' /tmp/ccQBbsJ7.o:main.cpp:(.text+0x4a): first defined here /tmp/ccOh6lJT.o: In function `sw::unum::ulp(float const&)': foo2.cpp:(.text+0x0): multiple definition of `sw::unum::ulp(float const&)' /tmp/ccQBbsJ7.o:main.cpp:(.text+0x0): first defined here /tmp/ccOh6lJT.o: In function `sw::unum::ulp(double const&)': foo2.cpp:(.text+0x4a): multiple definition of `sw::unum::ulp(double const&)' /tmp/ccQBbsJ7.o:main.cpp:(.text+0x4a): first defined here

main.cpp

#include "posit"

using namespace std;
using namespace sw::unum;	// standard namespace for posits

template<size_t nbits, size_t es>
sw::unum::posit<nbits,es> foo(sw::unum::posit<nbits,es> x);
template<size_t nbits, size_t es>
sw::unum::posit<nbits,es> foo2(sw::unum::posit<nbits,es> x);

int main()
{
	const size_t nbits = 32;
	const size_t es = 2;
	
	posit<nbits, es> s1, s2, s3;

	s1 = 9;        
	s2=foo(s1);
	s3=foo2(s1);
	cout << "s1          : " << setw(3) << s2 << '\n';
	cout << "s2          : " << setw(3) << s3 << '\n';

}

foo.cpp

#include "posit"

template<size_t nbits, size_t es>
sw::unum::posit<nbits,es> foo(sw::unum::posit<nbits,es> x)
{
	return (x+1);
}

foo2.cpp

#include "posit"

template<size_t nbits, size_t es>
sw::unum::posit<nbits,es> foo2(sw::unum::posit<nbits,es> x)
{
	return (x-1);
}

g++ -I./universal/posit main.cpp foo.cpp foo2.cpp
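
The linker output is consistent with non-template functions being defined in a header without `inline`, so that every translation unit including the header emits its own definition. Assuming that is the cause, marking such functions `inline` lets them be included from several source files. The sketch below shows the pattern in a `demo` namespace; the function bodies are illustrative and are not the library's `ulp` implementation.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace demo {

// Header-style definitions: 'inline' permits identical definitions in
// multiple translation units without a multiple-definition link error.
inline float ulp(const float& a) {
    assert(std::isfinite(a));
    return std::nextafter(a, std::numeric_limits<float>::infinity()) - a;
}

inline double ulp(const double& a) {
    assert(std::isfinite(a));
    return std::nextafter(a, std::numeric_limits<double>::infinity()) - a;
}

}  // namespace demo
```

Separately, the function templates foo and foo2 in the report are defined only in their .cpp files. Unless they are explicitly instantiated there for posit<32,2> or moved into a header, the linker would be expected to report them as undefined once the ulp error is resolved.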

Adaptive precision linear floating-point implementation

For many numerical tests we would benefit from a high-performance arbitrary precision floating point representation. We don't want to create a dependency on BOOST or MPFR as it would make using the universal library more difficult, so a native implementation is desired.

adaptive precision linear floating-point division

implement division for arbitrary precision linear floating point

We have arbitrary precision and adaptive precision vocabulary.

arbitrary precision arithmetic is any arithmetic using a given fixed precision, whereas adaptive precision changes representations during computation to capture all relevant bits.

arbitrary precision arithmetic is, say float<nbits, es> where nbits = 32, and es = 8, which would represent single precision floating point.

adaptive precision arithmetic is afloat, without any template parameters as the size and representation will adapt to the computation.

Quire capacity and overflow

I was reading the Posit Standard and notice that it says that the quire cannot overflow up to 2^c-1 additions (with c=nbits-1, which, if I understand correctly, is the parameter capacity).

My question is about the quire implementation, since it mentions that the capacity indicates the power of 2 number of accumulations of maxpos^2 the quire can support. Does this mean that this implementation supports an additional accumulation? Since the standard has that -1 in the capacity.
I also noticed that the upper_range is greater than the half_range by 1 bit, because of the maxpos^2 scale. However, according to the standard, the upper (integer) and half (fraction) should have the same size. And if the upper range has 1 more bit, then I believe the quire can actually accumulate 2^(c+1)-1 products maxpos^2 without overflow.

Moreover, total_bits() returns qbits+1. qbits is equal to range+capacity. Therefore, the total size would be sign + capacity + range.
Looking again to the Standard, the quire structure indicates that the range corresponds to the size of the integer+fraction. However, in this implementation, the upper_range is 1 bit larger than the half_range; does that mean that this quire would actually have 1 bit more?

Sorry for the questions, but I've been trying to understand and wrap my head around this, with no success.
Am I misunderstanding the implementation or the standard? Thank you in advance!

(I'm seeing this standard: https://posithub.org/docs/posit_standard.pdf )

mac compilation problem

trying build on mac / clang etc.
complains about popcntl, ctzl and __N
can you hint where those should normally come from? tried searching but failed to find so far

performance improvements: we want a fastPosit sw emulation

Right now, the core bit implementation is based on std::bitset<>. Some of the key operators on posits need to be implemented by bit searches in the bitset<>, which is particularly painful for big posits. The current lib runs in the 1MOPS range for small posit, but in the 100KOPS range for big posits.

Looking for help to create a fastPosit implementation that is based on integer arithmetic that will improve performance by several orders of magnitude. Should be a drop-in replacement, so that an application can swap out the two interfaces seamlessly.

Babylonian method for sqrt fails for small, coarse posit configurations

A first simple attempt to implement sqrt using the Babylonian method runs into a problem: the basic loop invariant does not hold for highly discrete posit configurations.

The Babylonian method depends on a loop invariant that compares the difference between the square of the current iterate and the input value against a small epsilon. For coarse posits this invariant does not lead to convergence, because the difference may never drop below epsilon.
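
A minimal sketch of the iteration described above, written on double as a stand-in for a posit type. The function name and the two extra exit conditions (fixed point reached, iteration cap) are assumptions added for illustration, not the library's implementation.

```cpp
#include <cassert>
#include <cmath>

// Babylonian (Heron) iteration: x <- (x + v/x) / 2.
// Exits when |x*x - v| < eps, when the iterate stops changing,
// or after max_iter steps. The last two exits are what a coarse
// number system needs, since the epsilon test may never succeed.
inline double babylonian_sqrt(double v, double eps, int max_iter = 100) {
    assert(v >= 0.0);
    if (v == 0.0) return 0.0;
    double x = (v > 1.0) ? v : 1.0;  // initial guess at or above sqrt(v)
    for (int i = 0; i < max_iter; ++i) {
        if (std::fabs(x * x - v) < eps) break;   // the epsilon test from the text
        const double next = 0.5 * (x + v / x);
        if (next == x) break;                    // no representable progress left
        x = next;
    }
    return x;
}
```

On a coarse type the iterate can also oscillate between two neighbouring values, in which case only the iteration cap terminates the loop.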

test suite needs strengthening with multiprecision floating point reference calculations

The test suite is set up to generate a reference value in float/double/long double. The rounding that takes place in the reference computation limits the size of the posit configuration that can be tested.

floats have a 24-bit significand (23 fraction bits plus the hidden bit), so a multiply generates up to 48 significand bits.
doubles have a 53-bit significand, so a multiply generates up to 106 significand bits.
long doubles, when implemented as IEEE quad precision, have a 113-bit significand, so a multiply generates up to 226 significand bits.

Validating the quire accumulation would require access to all these bits and IEEE floating point hardware doesn't support any of that.

This means we really need to go for a multiprecision reference calculation, such as MPFR, for validating posits bigger than 26 bits, and definitely for any quire validation.
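
The bit counts above can be illustrated with native types. Because 48 bits fit inside a 53-bit significand, the product of two floats is exact when computed in double. The helper names below are hypothetical; the sketch only shows why a wider reference type is needed, and why double itself runs out once operands carry more than 26 significand bits.

```cpp
#include <cassert>

// Exact product of two floats: 24 + 24 = 48 significand bits,
// which fits in double's 53-bit significand without rounding.
inline double exact_product(float a, float b) {
    return static_cast<double>(a) * static_cast<double>(b);
}

// Rounding error committed by a float multiply, measured against
// the exact product held in double.
inline double float_product_error(float a, float b) {
    const float rounded = a * b;  // rounded to 24 bits
    return static_cast<double>(rounded) - exact_product(a, b);
}
```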

Build errors when integrated with downstream TVM

Hi @Ravenwater !

I am working on the Bring Your Own DataTypes PR (apache/tvm#5812),
and unfortunately we are having trouble building because of warnings from this project.

I can open a new PR if you are okay with it

See: (https://ci.tvm.ai/blue/organizations/jenkins/tvm/detail/PR-5812/16/pipeline) or below:

/workspace/3rdparty/universal/posit/value.hpp: In function 'uint8_t Posit8es2Sqrt(uint8_t)':

/workspace/3rdparty/universal/posit/value.hpp:219:7: error: '_exponent' may be used uninitialized in this function [-Werror=maybe-uninitialized]

       _scale = _exponent - 1;

       ^

/workspace/3rdparty/universal/posit/value.hpp:217:11: note: '_exponent' was declared here

       int _exponent;

           ^

/workspace/3rdparty/universal/posit/value.hpp: In function 'uint16_t Posit16es2Sqrt(uint16_t)':

/workspace/3rdparty/universal/posit/value.hpp:219:7: error: '_exponent' may be used uninitialized in this function [-Werror=maybe-uninitialized]

       _scale = _exponent - 1;

       ^

/workspace/3rdparty/universal/posit/value.hpp:217:11: note: '_exponent' was declared here

       int _exponent;

           ^

/workspace/3rdparty/universal/posit/value.hpp: In function 'uint32_t Posit32es2Sqrt(uint32_t)':

/workspace/3rdparty/universal/posit/value.hpp:219:7: error: '_exponent' may be used uninitialized in this function [-Werror=maybe-uninitialized]

       _scale = _exponent - 1;

       ^

/workspace/3rdparty/universal/posit/value.hpp:217:11: note: '_exponent' was declared here

       int _exponent;
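
A common way to silence -Wmaybe-uninitialized for this pattern is to give the variable an initial value. The function below is a hypothetical stand-in that assumes the exponent is filled in by std::frexp; it is not the code in value.hpp.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the flagged pattern.
// Initializing 'exponent' means every path reads a defined value,
// which is what the compiler could not prove in the log above.
inline int scale_of(double rhs) {
    int exponent = 0;
    std::frexp(rhs, &exponent);  // rhs = m * 2^exponent with 0.5 <= |m| < 1
    return exponent - 1;         // scale of the leading bit
}
```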

Geometric Rounding remains a mystery

There is a known issue of the library making incorrect rounding choices for posit configurations with exponent bits. It is related to the logic that deals with geometric rounding when bits propagate from the exponent field to the regime field.

The current design tries to create combinatoric equations to set the bits as it tries to model how hardware would compute these fields. The data path pipeline is set up as a floating point value pipeline with values being represented by their triple (sign, binary scale, fraction without hidden bit). As we are trying to reinterpret a triple into a posit triple (regime, exponent, fraction without hidden bit), there is ambiguity in how to set up the equations for regime and exponent extraction as a function of nbits and es. This ambiguity appears to be the cause of this particular bug.

Initializing posit with long double yields incorrect representation

When I try to initialize a posit (nbits = 32, es = 2) with a value cast as long double, the representation when printed is not close to the actual initial value.

posit<32, 2> E_pos((long double)0.79432823472428150206586100479);
cout << setprecision(30) << fixed << E_pos << endl;
cout << pretty_print(E_pos)  << endl;

Output

0.500071857124567031860351562500
s0 r01 e11 f000000000000100101101011001 qSE v0.50007185712456703

Part of the fraction bits ("100101101011001") seems to be correct, but the closest representation should have this part shifted to the left by 12 positions.

100101101011001000110000110 // fbits correct representation
000000000000100101101011001 // fbits actual representation

Is this a bug, or a mistake in my small example? Would love to know.

Cheers,
Laurens

clang warnings

There are many warnings:

/usr/ports/math/universal/work/universal-7fb24e1058eccd17699d6147fefffd2a5468876c/./include/universal/posit/specializations.hpp:33:9: warning: unknown pragma ignored [-Wunknown-pragmas]
#pragma warning( disable : 4365 ) // warning C4365: 'initializing': conversion from 'long' to 'uint32_t', signed/unsigned mismatch
        ^
/usr/ports/math/universal/work/universal-7fb24e1058eccd17699d6147fefffd2a5468876c/./include/universal/posit/specializations.hpp:31:9: warning: unknown pragma ignored [-Wunknown-pragmas]
#pragma warning( disable : 4242 ) // warning C4242: 'argument': conversion from 'int32_t' to 'const int8_t', possible loss of data
        ^
/usr/ports/math/universal/work/universal-7fb24e1058eccd17699d6147fefffd2a5468876c/./include/universal/posit/specializations.hpp:45:9: warning: unknown pragma ignored [-Wunknown-pragmas]
#pragma warning( pop )
        ^

clang=9
FreeBSD 12.1
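
One possible fix, sketched under the assumption that the pragmas are only meant for MSVC: guard them with the _MSC_VER macro so clang and gcc never see them. The helper function is hypothetical and only stands in for the code the pragmas protect.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable : 4242)  // conversion, possible loss of data
#pragma warning(disable : 4365)  // signed/unsigned mismatch
#endif

// Hypothetical narrowing helper standing in for the guarded code.
inline std::int8_t narrow_to_int8(std::int32_t v) {
    assert(v >= -128 && v <= 127);
    return static_cast<std::int8_t>(v);
}

#if defined(_MSC_VER)
#pragma warning(pop)
#endif
```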

Tests pde_laplace and blas_matrix_ops fail on FreeBSD

$ ./work/.build/examples/pde/pde_laplace
PI = 3.1416 16.1x5922p
Assertion failed: (A.rows() == m * n), function laplace2D, file /usr/ports/math/universal/work/universal-513bb30bac688b4b28eeb7429d601e7c40d2e3f3/./include/universal/blas/laplace2D.hpp, line 17.
Abort trap
$ ./work/.build/examples/blas/blas_matrix_ops
1 0 0 0 0 
0 1 0 0 0 
0 0 1 0 0 
0 0 0 1 0 
0 0 0 0 1 

Assertion failed: (A.rows() == m * n), function laplace2D, file /usr/ports/math/universal/work/universal-513bb30bac688b4b28eeb7429d601e7c40d2e3f3/./include/universal/blas/laplace2D.hpp, line 17.
Abort trap

Stateless quire function support

In order to implement an API which can be fulfilled by a stateless FPU yet is able to efficiently support quire operations, I need the following operations:

  • posit[2] posit_add_exact(posit a, posit b):
    • the arguments are 2 posits a and b of the same parameters
    • the return value is a pair of posits, the first one is the nearest value to the actual sum and the second result is the difference between the first value and the exact result.
    • if the exponents of a and b are such that there is no bit-overlap in the mantissas, this function returns max(a,b), min(a,b)
  • posit[2] posit_sub_exact(posit a, posit b): same as add_exact with b negated
  • posit<nbits*2,es+1> posit_mul_promote(posit a, posit b):
    • the arguments are 2 posits a and b of the same parameters
    • the result is a posit with nbits twice that of the arguments and an es one more than that of the arguments
    • this function is equivalent to converting a and b to the larger size and then multiplying.
    • this function should never round (if it does then I've made a mistake)
  • posit<nbits*2,es+1> posit_div_promote(posit a, posit b):
    • Result should be the same as posit_div( posit<nbits*2,es+1>(a), posit<nbits*2,es+1>(b) )
  • posit posit_frexp(posit a, int* exp_out):
  • posit posit_ldexp(posit a, int exp):
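
The add_exact contract resembles the classic TwoSum error-free transformation. The sketch below uses double as a stand-in for a posit and Knuth's TwoSum; the sign convention (residual = exact sum minus rounded sum) is an assumption, and whether the residual is always representable in a posit with tapered precision is not established here.

```cpp
#include <cassert>
#include <utility>

// Knuth's TwoSum on double: returns {s, e} where s is the rounded sum
// and e is the residual, so that a + b == s + e holds exactly
// (barring overflow, with round-to-nearest arithmetic).
inline std::pair<double, double> add_exact(double a, double b) {
    const double s  = a + b;
    const double bb = s - a;                      // the part of b that made it into s
    const double e  = (a - (s - bb)) + (b - bb);  // what was rounded away
    return {s, e};
}

// sub_exact is add_exact with b negated, as in the list above.
inline std::pair<double, double> sub_exact(double a, double b) {
    return add_exact(a, -b);
}
```

When the operands do not overlap, for example 1e16 and 1.0, the result is simply the larger and the smaller operand, which matches the no-bit-overlap case described in the list.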

information on IEEE floats

As mentioned in the README, 32-bit single-precision IEEE floats have about 8 million NaN representations, namely (2^23) - 1, where 23 is the number of fraction bits. All-zero fraction bits would encode infinity, hence the minus 1.

https://github.com/stillwater-sc/universal/blob/master/README.md#L138

For 64 bit double precision IEEE floats with 52 fraction bits, this would mean ((2^52) - 1) NaN representations which amounts to 4503599627370495 or 4.503x10^15
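
The arithmetic can be checked with a small helper (the name is hypothetical). It counts encodings for one value of the sign bit, as the figures above do; counting both signs doubles each number.

```cpp
#include <cassert>
#include <cstdint>

// NaN encodings for one sign value: exponent field all ones and any
// nonzero fraction, i.e. 2^fraction_bits - 1 patterns.
inline std::uint64_t nan_patterns_per_sign(unsigned fraction_bits) {
    assert(fraction_bits > 0 && fraction_bits < 64);
    return (std::uint64_t(1) << fraction_bits) - 1;
}
```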

Build is failing

I suggest moving work-in-progress commits into a separate branch. The master branch should be kept as stable as possible so that third parties can compile it and try it out. Judging from the error messages below, I can't find where extract_sign() and related functions are defined. Perhaps the function renaming wasn't complete?

universal.git $ git grep extract_sign
Binary file bin/cmd_dc matches
Binary file bin/cmd_fc matches
Binary file bin/cmd_ieee_fp matches
Binary file bin/cmd_ldc matches
Binary file bin/cmd_pc matches
deprecated/posit/posit_conversion.hpp: bool _negative = extract_sign(rhs);
deprecated/posit/posit_conversion.hpp: bool _negative = extract_sign(rhs);
posit/value.hpp: _sign = extract_sign((double)rhs);
tests/posit/extract.cpp: bool _sign = extract_sign(f);
tests/posit/extract.cpp: sign = extract_sign(f);
tests/posit/extract.cpp: sign = extract_sign(f);

$ make
[ 1%] Building CXX object tests/float/CMakeFiles/float_quires.dir/quires.cpp.o
In file included from /home/xman/speedgo/projects/sgc/universal.git/tests/float/quires.cpp:15:0:
/home/xman/speedgo/projects/sgc/universal.git/tests/float/../../posit/value.hpp: In member function ‘sw::unum::value& sw::unum::value::operator=(long double)’:
/home/xman/speedgo/projects/sgc/universal.git/tests/float/../../posit/value.hpp:221:39: error: there are no arguments to ‘extract_sign’ that depend on a template parameter, so a declaration of ‘extract_sign’ must be available [-fpermissive]
_sign = extract_sign((double)rhs);
^
/home/xman/speedgo/projects/sgc/universal.git/tests/float/../../posit/value.hpp:221:39: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/xman/speedgo/projects/sgc/universal.git/tests/float/../../posit/value.hpp:222:44: error: there are no arguments to ‘extract_exponent’ that depend on a template parameter, so a declaration of ‘extract_exponent’ must be available [-fpermissive]
_scale = extract_exponent((double)rhs) - 1;
^
tests/float/CMakeFiles/float_quires.dir/build.make:62: recipe for target 'tests/float/CMakeFiles/float_quires.dir/quires.cpp.o' failed
make[2]: *** [tests/float/CMakeFiles/float_quires.dir/quires.cpp.o] Error 1
CMakeFiles/Makefile2:87: recipe for target 'tests/float/CMakeFiles/float_quires.dir/all' failed
make[1]: *** [tests/float/CMakeFiles/float_quires.dir/all] Error 2
Makefile:94: recipe for target 'all' failed
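
The diagnostic is the two-phase name lookup rule: a call whose arguments do not depend on a template parameter is resolved when the template is defined, so a declaration must already be visible at that point. The sketch below uses hypothetical stand-ins (the body of extract_sign and the class value_sketch are not the library's code) to show the shape that compiles.

```cpp
#include <cassert>
#include <cmath>

// Declared before the template, so the non-dependent call below resolves.
// Hypothetical stand-in body; the library's extract_sign may differ.
inline bool extract_sign(double v) { return std::signbit(v); }

template <unsigned fbits>
class value_sketch {
public:
    value_sketch& operator=(long double rhs) {
        // The argument is a plain double, independent of fbits, so the
        // name is looked up here, at definition time.
        sign_ = extract_sign(static_cast<double>(rhs));
        return *this;
    }
    bool sign() const { return sign_; }
private:
    bool sign_ = false;
};
```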

Support conda install via conda-forge

Hi @Ravenwater !

Thanks for this amazing project. I think it would be great to have conda support to install the universal library using:

conda install universal

This would also be beneficial to those using conda as a dependency manager, in which an environment.yml file on the root directory of the project specifies all dependencies of the library.

The best way to accomplish this is via conda-forge. Here is a list of all the packages/libraries for which they already manage automatic builds whenever a new release is detected.

Here is how to write a recipe that tells conda-forge how to build and package a library. The best way, though, is to copy and paste an existing recipe (e.g., eigen recipe and autodiff recipe) and adapt it to universal.

Minor typo

If you want to give me permission to push a feature/bug branch, I can PR this. I might have some more minor wording cleanup when I review more.

% git diff master
diff --git a/README.md b/README.md
index c7a46a3..a8d3325 100644
--- a/README.md
+++ b/README.md
@@ -151,7 +151,7 @@ In contrast, the _posit_ number system is designed to be efficient, symmetric, a
 1. **Economical** - No bit patterns are redundant. There is one representation for infinity denoted as ± inf and zero.
 All other bit patterns are valid distinct non-zero real numbers. ± inf serves as a replacement for NaN.
 2. **Mathematical Elegant** - There is only one representation for zero, and the encoding is symmetric around 1.0. Associative and distributive laws are supported through deferred rounding via the quire, enabling reproducible linear algebra algorithms in any concurrency environment.
-3. **Tapered Accuracy** - Tapered accuracy is when values with small exponent have more digits of accuracy and values with large exponents have less digits of accuracy. This concept was first introduced by Morris (1971) in his paper ”Tapered Floating Point: A New Floating-Point Representation”.
+3. **Tapered Accuracy** - Tapered accuracy is when values with small exponent have more digits of accuracy and values with large exponents have fewer digits of accuracy. This concept was first introduced by Morris (1971) in his paper ”Tapered Floating Point: A New Floating-Point Representation”.
 4. **Parameterized precision and dynamic range** -- posits are defined by a size, _nbits_, and the number of exponent bits, _es_. This enables system designers the freedom to pick the right precision and dynamic range required for the application. For example, for AI applications we may pick 5 or 6 bit posits without any exponent bits to improve performance. For embedded DSP applications, such as 5G base stations, we may select a 16 bit posit with 1 exponent bit to improve performance per Watt.
 5. **Simpler Circuitry** - There are only two special cases, Not a Real and Zero. No denormalized numbers, overflow, or underflow.

Arithmetic with posits

Dear developers of universal,

We are developing a fully templated simulation package for the numerical solution of PDE problems that would allow us to test the arithmetic performance of unums/posits in realistic applications.

Does universal support the 'standard' numerical functions like std::sin, std::cos, ... next to the binary operators +,-,/,*?

Kind regards,
Matthias

posit32_tod yields wrong result

Posit32 with bit pattern 0xb0bfe591 should (afaict) correspond to double value -3.81260083615779876708984375, this test shows that posit32_tod() is outputting 0.0000000000000000000000000003092334748466... instead. The test includes posit32_fromd() which behaves as expected.

static void testp32tod() {
    posit32_t p32;
    p32.v = 0xb0bfe591u;
    double d = -3.8126008361577987670898437500000000000000;
    double d2 = posit32_tod(p32);
    posit32_t _p32 = posit32_fromd(d);
    printf(" p32.v = %08x\n_p32.v = %08x\n\n d = %.40f\n d2 = %.40f\n", p32.v, _p32.v, d, d2);
}
notgay:build user$ ./c_api/test/posit/c_api_posit8
 p32.v = b0bfe591
_p32.v = b0bfe591

 d = -3.8126008361577987670898437500000000000000
 d2 = 0.0000000000000000000000000003092334748466
notgay:build user$
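
The expected value can be cross-checked by decoding the bit pattern by hand. The sketch below is an independent decoder for a 32-bit posit with es = 2, written from the posit field layout (sign, regime, exponent, fraction); the function name is hypothetical and this is not the library's posit32_tod.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a posit<32,2> bit pattern to double.
inline double decode_posit32_es2(std::uint32_t bits) {
    if (bits == 0u) return 0.0;
    if (bits == 0x80000000u) return std::nan("");  // NaR
    const bool negative = (bits & 0x80000000u) != 0;
    if (negative) bits = ~bits + 1u;               // two's complement magnitude

    std::uint32_t rest = bits << 1;                // drop sign; regime now at the MSB
    const bool run_of_ones = (rest & 0x80000000u) != 0;
    int run = 0;
    while (run < 31 && ((rest & 0x80000000u) != 0) == run_of_ones) {
        rest <<= 1;
        ++run;
    }
    const int k = run_of_ones ? run - 1 : -run;    // regime value
    rest <<= 1;                                    // skip the regime terminator bit

    const int e = static_cast<int>(rest >> 30);    // two exponent bits
    const std::uint32_t fraction = rest << 2;      // remaining bits, left-aligned
    const double significand =
        1.0 + std::ldexp(static_cast<double>(fraction), -32);
    const double magnitude = std::ldexp(significand, 4 * k + e);  // useed = 2^4
    return negative ? -magnitude : magnitude;
}
```

For 0xb0bfe591 the fields come out as k = 0, e = 1 and fraction 0x7401a6f (27 bits), which gives -3.81260083615779876708984375 and agrees with the value posit32_fromd round-trips to.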

math functions need implementations

This is a research topic. In the first pass we'll likely have straightforward implementations, but we are experimenting with implementations that leverage the quire. This enables low-precision posits to generate very high accuracy results. For example, 16-bit posits can beat 32-bit floats.

There are a hundred different ways these functions can be implemented, so we are inviting the community to contribute fast algorithms.

cmake's "install" target is broken: No such file or directory.

===>   Generating temporary packing list
[0/1] cd /usr/ports/math/universal/work/.build && /usr/local/bin/cmake -DCMAKE_INSTALL_DO_STRIP=1 -P cmake_install.cmake
-- Install configuration: "Release"
-- Up-to-date: /usr/ports/math/universal/work/stage/usr/local/share/Universal/universal-config.cmake
-- Up-to-date: /usr/ports/math/universal/work/stage/usr/local/share/Universal/universal-config-version.cmake
CMake Error at cmake_install.cmake:48 (file):
  file INSTALL cannot find
  "/usr/ports/math/universal/work/universal-1c244646b8f616881f8b5b95ee542c96e49e5580/universal":
  No such file or directory.

Eliminating libstdc++ from c_api

One annoyance for C projects is that C++ code is full of references to features that end up requiring their own support libraries. Exceptions, for example, require a support library. The C API is going to be very cool for mixed C/C++ projects, but a maintainer who wants to keep a project in C will find it a hard sell to adopt a codebase that requires linking libstdc++; at that point they might as well switch to C++ for their own code...

It is possible to write C++ code which doesn't use any of this, but you just need to be acutely aware of which features you're taking advantage of...
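
As a rough sketch of that style: a C-callable entry point that reports failure through a return code instead of throwing, and uses no iostream or std::string. The function name is hypothetical. Translation units written this way can typically be built with the GCC/Clang flags -fno-exceptions and -fno-rtti, though whether the whole C API can be restructured like this is an open question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C-callable function: returns 0 on success and -1 on a
// bad argument, rather than throwing something like shift_too_large.
extern "C" int sketch_shift_left8(std::uint8_t in, unsigned shift, std::uint8_t* out) {
    if (out == nullptr || shift >= 8u) return -1;  // error code, no exception
    *out = static_cast<std::uint8_t>(in << shift);
    return 0;
}
```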

Here is the linker error from compiling a trivial program which calls posit_integer_assign128 when libstdc++ is not present. It shows the list of C++ symbols which the linker wants, and why:

notgay:build user$ gcc-7 -o xxx ../xxx.c -L.//c_api/posit/ -std=c99 -Wall -Wextra -pedantic -lposit_c_api
../xxx.c: In function 'main':
../xxx.c:6:16: warning: unused variable 'four' [-Wunused-variable]
     posit128_t four = posit_integer_assign128(4);
                ^~~~
Undefined symbols for architecture x86_64:
  "std::runtime_error::what() const", referenced from:
      vtable for shift_too_large in libposit_c_api.a(posit_c_api.cpp.o)
      vtable for posit_internal_exception in libposit_c_api.a(posit_c_api.cpp.o)
      vtable for integer_divide_by_zero in libposit_c_api.a(posit_c_api.cpp.o)
      vtable for bitblock_arithmetic_exception in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::__basic_string_common<true>::__throw_length_error() const", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::locale::use_facet(std::__1::locale::id&) const", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::ios_base::getloc() const", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::logic_error::logic_error(char const*)", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::out_of_range::~out_of_range()", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::runtime_error::runtime_error(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)", referenced from:
      posit_internal_exception::posit_internal_exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
      bitblock_arithmetic_exception::bitblock_arithmetic_exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::runtime_error::~runtime_error()", referenced from:
      shift_too_large::~shift_too_large() in libposit_c_api.a(posit_c_api.cpp.o)
      shift_too_large::~shift_too_large() in libposit_c_api.a(posit_c_api.cpp.o)
      posit_internal_exception::~posit_internal_exception() in libposit_c_api.a(posit_c_api.cpp.o)
      posit_internal_exception::~posit_internal_exception() in libposit_c_api.a(posit_c_api.cpp.o)
      integer_divide_by_zero::~integer_divide_by_zero() in libposit_c_api.a(posit_c_api.cpp.o)
      integer_divide_by_zero::~integer_divide_by_zero() in libposit_c_api.a(posit_c_api.cpp.o)
      bitblock_arithmetic_exception::~bitblock_arithmetic_exception() in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__init(char const*, unsigned long)", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<64ul, 3ul>(sw::unum::posit<64ul, 3ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<128ul, 4ul>(sw::unum::posit<128ul, 4ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__init(unsigned long, char)", referenced from:
      std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::append(char const*, unsigned long)", referenced from:
      posit_internal_exception::posit_internal_exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
      bitblock_arithmetic_exception::bitblock_arithmetic_exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::resize(unsigned long, char)", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::overflow(int) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::push_back(char)", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::overflow(int) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      _posit_format8 in libposit_c_api.a(posit_c_api.cpp.o)
      _posit_format16 in libposit_c_api.a(posit_c_api.cpp.o)
      _posit_format32 in libposit_c_api.a(posit_c_api.cpp.o)
      _posit_format64 in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::operator=(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_istream<char, std::__1::char_traits<char> >::~basic_istream()", referenced from:
      construction vtable for std::__1::basic_istream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_istream<char, std::__1::char_traits<char> >::~basic_istream()", referenced from:
      construction vtable for std::__1::basic_istream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::sentry::sentry(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::sentry::~sentry()", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::~basic_ostream()", referenced from:
      construction vtable for std::__1::basic_ostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::~basic_ostream()", referenced from:
      construction vtable for std::__1::basic_ostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(float)", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(unsigned long)", referenced from:
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<64ul, 3ul>(sw::unum::posit<64ul, 3ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<128ul, 4ul>(sw::unum::posit<128ul, 4ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::sync()", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::imbue(std::__1::locale const&)", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow()", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::setbuf(char*, long)", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsgetn(char*, long)", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsputn(char const*, long)", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::showmanyc()", referenced from:
      vtable for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::basic_streambuf()", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<64ul, 3ul>(sw::unum::posit<64ul, 3ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::basic_streambuf<char, std::__1::char_traits<char> >::~basic_streambuf()", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::ctype<char>::id", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::locale::~locale()", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::ios_base::__set_badbit_and_consider_rethrow()", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::ios_base::init(void*)", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<64ul, 3ul>(sw::unum::posit<64ul, 3ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::__1::ios_base::clear(unsigned int)", referenced from:
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "std::__1::basic_ios<char, std::__1::char_traits<char> >::~basic_ios()", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "std::terminate()", referenced from:
      ___clang_call_terminate in libposit_c_api.a(posit_c_api.cpp.o)
  "typeinfo for std::__1::basic_istream<char, std::__1::char_traits<char> >", referenced from:
      construction vtable for std::__1::basic_istream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "typeinfo for std::__1::basic_ostream<char, std::__1::char_traits<char> >", referenced from:
      construction vtable for std::__1::basic_ostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "typeinfo for std::__1::basic_iostream<char, std::__1::char_traits<char> >", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "typeinfo for std::__1::basic_streambuf<char, std::__1::char_traits<char> >", referenced from:
      typeinfo for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "typeinfo for std::out_of_range", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "typeinfo for std::runtime_error", referenced from:
      typeinfo for posit_internal_exception in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for bitblock_arithmetic_exception in libposit_c_api.a(posit_c_api.cpp.o)
  "vtable for __cxxabiv1::__si_class_type_info", referenced from:
      typeinfo for std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for posit_internal_exception in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for shift_too_large in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for bitblock_arithmetic_exception in libposit_c_api.a(posit_c_api.cpp.o)
      typeinfo for integer_divide_by_zero in libposit_c_api.a(posit_c_api.cpp.o)
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "vtable for std::out_of_range", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
  "non-virtual thunk to std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "non-virtual thunk to std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_istream<char, std::__1::char_traits<char> >::~basic_istream()", referenced from:
      construction vtable for std::__1::basic_istream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_istream<char, std::__1::char_traits<char> >::~basic_istream()", referenced from:
      construction vtable for std::__1::basic_istream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_ostream<char, std::__1::char_traits<char> >::~basic_ostream()", referenced from:
      construction vtable for std::__1::basic_ostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_ostream<char, std::__1::char_traits<char> >::~basic_ostream()", referenced from:
      construction vtable for std::__1::basic_ostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "virtual thunk to std::__1::basic_iostream<char, std::__1::char_traits<char> >::~basic_iostream()", referenced from:
      construction vtable for std::__1::basic_iostream<char, std::__1::char_traits<char> >-in-std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> > in libposit_c_api.a(posit_c_api.cpp.o)
  "operator delete(void*)", referenced from:
      std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      non-virtual thunk to std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      virtual thunk to std::__1::basic_stringstream<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringstream() in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_stringbuf() in libposit_c_api.a(posit_c_api.cpp.o)
      shift_too_large::~shift_too_large() in libposit_c_api.a(posit_c_api.cpp.o)
      posit_internal_exception::~posit_internal_exception() in libposit_c_api.a(posit_c_api.cpp.o)
      integer_divide_by_zero::~integer_divide_by_zero() in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "operator new(unsigned long)", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::str() const in libposit_c_api.a(posit_c_api.cpp.o)
  "___cxa_allocate_exception", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "___cxa_begin_catch", referenced from:
      ___clang_call_terminate in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::overflow(int) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "___cxa_end_catch", referenced from:
      std::__1::basic_stringbuf<char, std::__1::char_traits<char>, std::__1::allocator<char> >::overflow(int) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long) in libposit_c_api.a(posit_c_api.cpp.o)
  "___cxa_free_exception", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "___cxa_throw", referenced from:
      std::__1::bitset<8ul>::set(unsigned long, bool) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<8ul, 0ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<16ul, 1ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::regime<32ul, 2ul>::assign_regime_pattern(int) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<8ul>& sw::unum::convert_to_bb<8ul, 0ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<8ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<16ul>& sw::unum::convert_to_bb<16ul, 1ul, 23ul>(bool, int, sw::unum::bitblock<23ul> const&, sw::unum::bitblock<16ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::bitblock<32ul>& sw::unum::convert_to_bb<32ul, 2ul, 52ul>(bool, int, sw::unum::bitblock<52ul> const&, sw::unum::bitblock<32ul>&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
  "___gxx_personality_v0", referenced from:
      sw::unum::to_string(sw::unum::posit<8ul, 0ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<16ul, 1ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      sw::unum::to_string(sw::unum::posit<32ul, 2ul> const&, long) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<8ul, 0ul>(sw::unum::posit<8ul, 0ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<16ul, 1ul>(sw::unum::posit<16ul, 1ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<32ul, 2ul>(sw::unum::posit<32ul, 2ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > sw::unum::posit_format<64ul, 3ul>(sw::unum::posit<64ul, 3ul> const&) in libposit_c_api.a(posit_c_api.cpp.o)
      ...
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
notgay:build user$

TODO: need to convert subnormals

In the division QA tests we are seeing subnormal IEEE doubles. Right now the float_assign method punts on subnormals; we need to fix that.
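As a starting point for the fix, here is a minimal sketch of one way to detect a subnormal double and recover a normalized significand and exponent using only the standard library. It is not the library's float_assign; the names `decoded` and `decode_double` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper type (not part of Universal): sign, unbiased scale,
// and a 53-bit significand with the hidden bit made explicit (bit 52 set).
struct decoded {
    bool sign;
    int scale;                  // |value| = 2^scale * significand / 2^52
    std::uint64_t significand;
    bool subnormal;             // true if the input was an IEEE subnormal
};

// Decompose a finite, nonzero double. std::frexp normalizes subnormal
// inputs as well as normal ones, so no special-case bit handling is needed.
inline decoded decode_double(double v) {
    assert(std::isfinite(v) && v != 0.0);
    decoded d;
    d.sign = std::signbit(v);
    d.subnormal = (std::fpclassify(v) == FP_SUBNORMAL);
    int e = 0;
    const double m = std::frexp(std::fabs(v), &e);  // m in [0.5, 1), |v| = m * 2^e
    d.scale = e - 1;                                 // rescale so the significand is in [1, 2)
    // Exact conversion: m carries at most 53 significant bits.
    d.significand = static_cast<std::uint64_t>(std::ldexp(m, 53));
    return d;
}
```

For the smallest subnormal, `decode_double(std::ldexp(1.0, -1074))` reports `subnormal == true`, a scale of -1074, and a significand with only the hidden bit set, so the downstream conversion can treat it like any other normalized value.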

Building without c++14 support

I am working on an older machine that uses gcc 4.8 (without C++14 support). I do not have admin rights to upgrade. Is there an easy way to run it without C++14?

Documentation

Is there any API documentation? It's a bit of a drag to read through the code of the Posit class to learn the API, especially since there is no division between source and header.

investigate if we can decouple an expression template front-end from an arithmetic back-end

boost::multiprecision is organized in a number class front-end and an arithmetic back-end

https://www.boost.org/doc/libs/1_75_0/libs/multiprecision/doc/html/boost_multiprecision/perf/overhead.html

The number class provides expression template functionality, but adds a little overhead: about 0.5% for simple types. So a 100 MOPS code would run at about 99.5 MOPS with expression templates enabled.

We know that for small types, the extra processing inside the expression templates to avoid temporaries is not going to be a win, but for arbitrary precision types that could need thousands of bytes, any copies avoided will be a big win.

Since Universal is mostly about providing high-performance, tailored number systems that will have a hardware data path executing them, expression templates are not all that attractive, but as we are planning to incorporate arbitrary precision number systems, having a pluggable expression template front-end would be attractive.
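To make the front-end/back-end split concrete, here is a minimal sketch of the idea, assuming a toy back-end. None of these names (`int_backend`, `expr`, `add_expr`, `number`, `accumulate_into`) come from Universal or from boost::multiprecision; they are hypothetical and only illustrate how an expression template layer can stay independent of the arithmetic it drives.

```cpp
#include <cassert>

// Hypothetical arithmetic back-end: owns the representation and the
// in-place operations, and knows nothing about expressions.
struct int_backend {
    long long v;
    explicit int_backend(long long x = 0) : v(x) {}
    void add_in_place(const int_backend& o) { v += o.v; }
};

// CRTP base that marks a type as an expression node.
template <typename E>
struct expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy sum node: stores references and does no arithmetic when built.
// Nodes must not outlive the full expression that created them.
template <typename L, typename R>
struct add_expr : expr<add_expr<L, R>> {
    const L& lhs;
    const R& rhs;
    add_expr(const L& l, const R& r) : lhs(l), rhs(r) {}
    template <typename Backend>
    void accumulate_into(Backend& out) const {
        lhs.accumulate_into(out);
        rhs.accumulate_into(out);
    }
};

// Front-end number: a leaf expression wrapping one back-end value.
template <typename Backend>
struct number : expr<number<Backend>> {
    Backend b;
    number() : b() {}
    explicit number(const Backend& be) : b(be) {}
    // Evaluate an expression tree directly into our own storage.
    template <typename E>
    number(const expr<E>& e) : b() { e.self().accumulate_into(b); }
    template <typename E>
    number& operator=(const expr<E>& e) {
        Backend tmp;                      // guards against aliasing, e.g. a = a + b
        e.self().accumulate_into(tmp);
        b = tmp;
        return *this;
    }
    void accumulate_into(Backend& out) const { out.add_in_place(b); }
};

// operator+ only builds a node; the back-end is touched on assignment.
template <typename L, typename R>
add_expr<L, R> operator+(const expr<L>& l, const expr<R>& r) {
    return add_expr<L, R>(l.self(), r.self());
}
```

With this layout, `number<int_backend> r(a + b + c);` walks the tree once and adds each leaf into `r`, without creating an intermediate `number` for `a + b`. Swapping `int_backend` for a type with the same `add_in_place` interface would leave the front-end untouched, which is the decoupling the issue asks about.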

The key comparisons we should use to drive this implementation are:

  1. 32-bit posits
  2. 256-bit posits
    posits are designed to provide oracle services to other number systems without being arbitrary precision.
    256-bit posits also have hardware support and run at tera-op order (>10^12 operations per second)
  3. 64-bit unum type 1
    64-bit type 1 is a dynamic precision format with at most 64-bit representations
  4. 512-bit fixed-points
    the 512-bit fixed-point format is the base type of the quire for 32-bit posits
  5. arbitrary precision floats
  6. arbitrary precision integers
  7. arbitrary precision posits
    this is still an open question: arbitrary precision posits would basically forgo the benefits of tapering,
    but it would provide again a nice oracle mechanism for validation studies.

Quire += operator seems to fail

I'm working with posit<16,1> at the moment. Modeling quire usage after the code in examples/blas/blas.hpp line 62, I have the following:

for (int row = 0; row < n; ++row) {
     quireX sum = sw::unum::quire_mul(beta, ycoefs[row]); // sum is initially 0
     for (LocalOrdinalType i = Arowoffsets[row]; i < Arowoffsets[row+1]; ++i) {
          positX tempPosit;
          sum += sw::unum::quire_mul(Acoefs[i], xcoefs[Acols[i]]);
          // Debug check. testCompare, sumHolderPosit, tempQuireMult and
          // tempQuireMultholder are set by debug code not shown in this excerpt.
          if (row > 263 && row < 269) {
               tempPosit.convert(sum.to_value());
               if (tempPosit < testCompare) {
                        std::cout << "WRONG! Sum after += is less than -1.0: " << tempPosit << ", after taking sum (" << sumHolderPosit << ") += quire_mul(" << Acoefs[i] << ", " << xcoefs[Acols[i]] << ")" << std::endl;
                        std::cout << "We essentially took " << sumHolderPosit << " += " << tempQuireMultholder << " and somehow ended up with " << tempPosit << std::endl;
                        std::cout << "Incorrect sum as a quire: " << sum << std::endl;
                        std::cout << "Delta Quire: " << tempQuireMult << ", (as posit) = " << tempQuireMultholder << std::endl;
                        std::cout << "Row: " << row << ", Acoefs[i]: " << Acoefs[i] << ", Acols[i]: " << Acols[i] << ", xcoefs[" << Acols[i] << "]: " << xcoefs[Acols[i]] << std::endl;
               }
          }
     }
}

The sum should never go below -1.0, and if I run an alternate version of this code that does not use the quire, it doesn't.
With the quire enabled, I get the following output. The WRONG! tag near the bottom marks where the error occurs; I have included the output leading up to it, where the sum shrinks as expected.

Row is 266. Acoefs[5334] = -0.016571, xcoefs[Acols[5334]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000001100000110001100000010000000000000000000000
Sum before += quire_mul (as posit)9.25064e-05
product of quire_mul = -1.66893e-05
Row is 266. Acoefs[5335] = -0.00828552, xcoefs[Acols[5335]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000001001111011010000011010000000000000000000000
Sum before += quire_mul (as posit)7.53403e-05
product of quire_mul = -8.10623e-06
Row is 266. Acoefs[5336] = -0.0165405, xcoefs[Acols[5336]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000001000110101110010100110000000000000000000000
Sum before += quire_mul (as posit)6.77109e-05
product of quire_mul = -1.66893e-05
Row is 266. Acoefs[5337] = 7.62939e-06, xcoefs[Acols[5337]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000000110101011000111010110000000000000000000000
Sum before += quire_mul (as posit)5.05447e-05
product of quire_mul = 1.49012e-08
Row is 266. Acoefs[5338] = -0.0165405, xcoefs[Acols[5338]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000000110101011001011011100000000000000000000000
Sum before += quire_mul (as posit)5.05447e-05
product of quire_mul = -1.66893e-05
Row is 266. Acoefs[5339] = -0.00828552, xcoefs[Acols[5339]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000000100100000100000001100000000000000000000000
Sum before += quire_mul (as posit)3.43323e-05
product of quire_mul = -8.10623e-06
Row is 266. Acoefs[5340] = -0.016571, xcoefs[Acols[5340]] = 0.000999451
Sum before += quire_mul  1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000000011011011000010011000000000000000000000000
Sum before += quire_mul (as posit)2.6226e-05
product of quire_mul = -1.66893e-05
-------------------------------------------------------------------------------------------------------------------
WRONG! Sum after += is less than -1.0: -2.68435e+08, after taking sum (2.6226e-05) += quire_mul(-0.016571, 0.000999451)
We essentially took 2.6226e-05 += -1.66893e-05 and somehow ended up with -2.68435e+08 (= -maxpos)
Incorrect sum as a quire: -1: 111111111111111111111111111111_111111111111111111111111111111111111111111111111111111111.11111111111111110101111111001010000000000000000000000000
Delta Quire: -1: 000000000000000000000000000000_000000000000000000000000000000000000000000000000000000000.00000000000000010001010111011101000000000000000000000000, (as posit) = -1.66893e-05
Row: 266, Acoefs[i]: -0.016571, Acols[i]: 398, xcoefs[398]: 0.000999451
Row is 266. Acoefs[5341] = -0.00828552, xcoefs[Acols[5341]] = 0.000999451
-------------------------------------------------------------------------------------------------------------------

Note that -2.68435e+08 is negative maxpos for a <16,1> posit. So what I'm wondering is: why does my quire suddenly jump from very small positive values to a massive negative value?
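The maxpos figure quoted above can be checked from the standard posit definition, maxpos = useed^(nbits-2) with useed = 2^(2^es). The helper below is a small sketch for that check; `posit_maxpos` is a hypothetical name, not a function of the library.

```cpp
#include <cassert>
#include <cmath>

// maxpos of a posit<nbits, es> is useed^(nbits-2) with useed = 2^(2^es),
// which simplifies to 2^((nbits-2) * 2^es).
inline double posit_maxpos(unsigned nbits, unsigned es) {
    assert(nbits > 2);
    const int exponent = static_cast<int>((nbits - 2u) << es);
    return std::ldexp(1.0, exponent);
}
```

For posit<16,1> this gives 2^28 = 268435456, which prints as 2.68435e+08 at the default stream precision, so the failing sum has saturated at -maxpos rather than landing on some arbitrary value.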

My positX and quireX are declared as

namespace posit_shape {
    const uint es = UNUM_ES_SIZE; //1
    const uint nbits = UNUM_NBIT_SIZE; //16
}
namespace quire_shape {
    const uint es = UNUM_ES_SIZE;
    const uint nbits = UNUM_NBIT_SIZE;
    const uint capacity = UNUM_QUIRE_CAPACITY; //10 - as per stillwater example
}
typedef sw::unum::posit<posit_shape::nbits, posit_shape::es> positX;
typedef sw::unum::quire<quire_shape::nbits, quire_shape::es> quireX;

All evidence points to a bug in the quire's += operator. Any feedback is welcome.
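One reading of the dump, offered as a hypothesis rather than a confirmed diagnosis: taking the first 32 fraction bits of each quire printed above, the sum before the update is 0x0001B613, the delta is 0x000115DD, and the incorrect result is 0xFFFF5FCA. The correct difference is 0x1B613 - 0x115DD = 0xA036, and 0xFFFF5FCA is exactly the two's complement of 0xA036. That is the pattern a sign-magnitude accumulator would produce if it subtracted the larger magnitude from the smaller one and kept the wrapped bits. The sketch below reproduces that arithmetic with a hypothetical `sm_value` type; it does not use or describe the library's quire implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sign-magnitude accumulator, reduced to the 32 fraction
// bits that are nonzero in the dump above.
struct sm_value {
    bool negative;
    std::uint32_t magnitude;
};

// Sign-magnitude addition: when the signs differ, subtract the smaller
// magnitude from the larger and take the sign of the larger operand.
inline sm_value sm_add(sm_value a, sm_value b) {
    if (a.negative == b.negative) {
        return { a.negative, a.magnitude + b.magnitude };  // overflow ignored in this sketch
    }
    if (a.magnitude == b.magnitude) return { false, 0u };
    if (a.magnitude > b.magnitude)  return { a.negative, a.magnitude - b.magnitude };
    return { b.negative, b.magnitude - a.magnitude };
}

// The suspected failure mode: subtracting in the wrong order wraps
// modulo 2^32, and the wrapped bits are then read back as a magnitude.
inline sm_value sm_add_wrong_order(sm_value a, sm_value b) {
    return { b.negative, b.magnitude - a.magnitude };
}
```

Calling `sm_add` with the sum `{false, 0x0001B613}` and the delta `{true, 0x000115DD}` yields a positive 0xA036, while `sm_add_wrong_order` yields a negative 0xFFFF5FCA, matching the bits in the report. It may also be relevant that, within the excerpt, the failing step is the first one where the sum and the delta have their leading one at the same bit position, which would fit a magnitude comparison that only looks at the scale.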
