parallella / pal
An optimized C library for math, parallel processing and data movement
License: Apache License 2.0
Hi,
I'm trying to run the examples, but for instance:
./simple_example
Running p_wait
Running p_close (0xffffffea)
Running p_finalize(0xffffffda)
Do you know how to make it work?
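For what it's worth, those hex values look like negative errno-style return codes printed as unsigned 32-bit integers. This is my guess at the convention, not something the PAL docs confirm, but the arithmetic is easy to check:

```c
#include <stdint.h>

/* Reinterpret an unsigned hex return code as a signed value.
   0xffffffea == -22 (EINVAL) and 0xffffffda == -38 (ENOSYS on Linux)
   under two's complement; the errno mapping is an assumption. */
static int pal_err(uint32_t code)
{
    return (int32_t)code;   /* two's-complement reinterpretation */
}
```

If that reading is right, p_close failed with an invalid-argument error and p_finalize with a not-implemented error.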
It would be better to have a faster/approximate math functions section in PAL, specially designed for running math kernels on Epiphany with selectable precision.
For example:
- expf() from newlib
- e_approx_exp()
- e_fast_exp()
I think this design is much more applicable for actual application programmers.
And here's a validated approximate exp() implementation for Epiphany:
https://github.com/syoyo/parallella-playground/blob/master/math_exp/e_fast_exp.c
25 ~ 74 clocks for each exp(x) evaluation, within 1.0e-5 relative error.
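To give a flavor of the speed/precision trade-off, here is a generic range-reduction-plus-Taylor sketch. This is not the linked e_fast_exp; the constants, term count, and accuracy are my own and only illustrative:

```c
#include <math.h>

/* Illustrative fast expf: split x = k*ln2 + r, so exp(x) = 2^k * e^r,
   with e^r from a short Taylor polynomial over |r| <= ln2/2.
   Accuracy is roughly 1e-4 relative; more terms buy more precision. */
static float fast_expf(float x)
{
    const float ln2 = 0.69314718f;
    /* round x/ln2 to the nearest integer k */
    int k = (int)(x / ln2 + (x >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * ln2;               /* |r| <= ln2/2 */
    /* 5-term Taylor series for e^r */
    float p = 1.0f + r * (1.0f + r * (0.5f
                + r * (1.0f / 6.0f + r * (1.0f / 24.0f))));
    return ldexpf(p, k);                        /* scale by 2^k */
}
```

The selectable-precision idea then amounts to choosing how many polynomial terms (or which implementation) a given precision tier uses.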
Issue:
There's 0.0f as the starting comparison value.
Fix:
Substitute the current value with the maximum float value.
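In other words, the pattern should look something like this (a minimal sketch of the fix, not the actual PAL source):

```c
#include <float.h>

/* Seed the minimum search with FLT_MAX, not 0.0f, so arrays containing
   only positive values still report the correct minimum. */
static float min_f32(const float *a, int n)
{
    float m = FLT_MAX;          /* was 0.0f: wrong for all-positive input */
    for (int i = 0; i < n; i++)
        if (a[i] < m)
            m = a[i];
    return m;
}
```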
The headers of those functions declare just float*, while the implementations use const float* for the input vector.
In response to build fail in PR #9
Hi, I would like to implement the mentioned functions, but the description of the arguments is unclear to me. My guess (for the 8x8 case; the 16x16 case is analogous):
SAD works on an input source image x with dimensions rows x cols. There is a second reference image with the same dimensions as x. m is a pointer to the upper left corner of an 8x8 block within this second image. (Therefore m is not an 8x8 array.) The result vector r is a 2D array with dimensions (rows - 7) x (cols - 7), holding the sum of absolute differences for each possible offset of the 8x8 block within the source image.
Please tell me if these assumptions are correct. Thank you.
Greetings, Alexander
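To make the assumptions above concrete, here is a plain-C sketch that matches that reading. The name and prototype are mine, not PAL's actual p_sad8x8 signature:

```c
#include <math.h>

/* Sketch matching the interpretation above (8x8 case): m points at the
   top-left of an 8x8 block inside a reference image that shares the
   source's row stride (cols). r has (rows-7) x (cols-7) entries. */
static void sad8x8(const float *x, const float *m,
                   float *r, int rows, int cols)
{
    for (int oy = 0; oy < rows - 7; oy++)
        for (int ox = 0; ox < cols - 7; ox++) {
            float sad = 0.0f;
            for (int i = 0; i < 8; i++)
                for (int j = 0; j < 8; j++)
                    sad += fabsf(x[(oy + i) * cols + (ox + j)]
                                 - m[i * cols + j]);
            r[oy * (cols - 7) + ox] = sad;
        }
}

/* tiny self-check: an 8x8 image compared against itself gives SAD 0 */
static float demo_sad(void)
{
    float img[64], r[1];
    for (int i = 0; i < 64; i++)
        img[i] = (float)i;
    sad8x8(img, img, r, 8, 8);
    return r[0];
}
```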
I suggest adding the keyword "const" to the type specifiers for parameters, like the following.
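For instance (the exact parameter list is assumed here, not copied from pal.h):

```c
/* Marking the read-only inputs const documents intent and lets the
   compiler reject accidental writes through a or b. */
static void p_add_f32(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* tiny usage check */
static float demo_add(void)
{
    float c[2];
    p_add_f32((const float[]){1.0f, 2.0f}, (const float[]){3.0f, 4.0f}, c, 2);
    return c[0] + c[1];
}
```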
Would you like to apply the advice from the article to more places in your source files?
It's all about inconsistencies between header and implementation.
Of the 43 implementation files, 38 suffer from this issue.
One of them even has '32f' in both header and implementation, which is rather puzzling.
Would you like to replace more defines for constant values by enumerations to stress their relationships?
Actually, it's a bug: two constants are both named M_DIV15 and have the same value.
Issue:
Scalar values are being used as vectors, e.g. iterated through.
Fix:
Stop treating them like vectors 😉
This Issue tracks my contribution of p_median() to the math library.
p_sin_f32 uses 6 terms to approximate the sine function, while p_cos_f32 uses 5 terms. This results in p_cos_f32 being significantly less accurate than p_sin_f32.
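A quick way to see the effect of the term count (an illustrative Taylor evaluation, not PAL's actual polynomial or coefficients):

```c
#include <math.h>

/* Evaluate cos(x) with a given number of Taylor terms:
   cos(x) = sum_{k} (-1)^k x^(2k) / (2k)!  -- illustrative only. */
static float cos_terms(float x, int terms)
{
    float x2 = x * x, t = 1.0f, s = 1.0f;
    for (int k = 1; k < terms; k++) {
        t *= -x2 / (float)((2 * k - 1) * (2 * k));  /* next Taylor term */
        s += t;
    }
    return s;
}
```

At x = 2, for example, the 5-term sum is off by a few 1e-4 while the 6-term sum is off by under 1e-5, which is roughly the gap being described.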
When you run ./bootstrap, you get a bunch of generated files that git should ignore; they should be added to .gitignore.
I passionately dislike libraries that put the user API documentation into the .c files. It clearly belongs in the header files, where the user (or the user's IDE) can find it, even when a precompiled lib + headers are used. In addition, there might be several implementations for different architectures in the future, each having its own .c file. You still want a common interface definition for them, maintained in a single set of headers.
This is of course a kind suggestion, but with a strong opinion ;-)
Task:
-Improve documentation
Guidelines:
-documentation at the beginning of each function source file
-descriptions should be as short as possible but not shorter
-doxygen compatible
-explain basics and anything unusual about function
For:
-anyone
There should be no dependencies in the PAL math library. A first priority is to remove those dependencies. Once we have a clean starting point, we will start with optimization.
One function at a time....
Would this be possible?
-Build and report code size for ARM, x86, Epiphany
-Would require developers to install the ARM and Epiphany tool chain dependencies.
Some of the current contributions are too x86-centric; this might help bring everyone down by one level?
In OpenCV, the output image is the same size as the input image. The indices x,y in the output image correspond to x,y in the input.
But in PAL, the 3x3 methods have output sizes that are two pixels smaller in width and height than the input size. This is more economical on memory, but x,y locations no longer correspond. Is this the preferred method?
The size of the output images should be added to the comments and documentation for these methods.
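To spell out the convention in question, here is a valid-mode 3x3 averaging sketch (not PAL's actual filter code; the shape rule is what matters):

```c
/* Valid-mode 3x3 filtering: input rows x cols -> output (rows-2) x (cols-2).
   Output pixel (i-1, j-1) corresponds to input pixel (i, j), i.e. the
   output is shifted by one in each dimension relative to the input. */
static void box3x3(const float *in, float *out, int rows, int cols)
{
    for (int i = 1; i < rows - 1; i++)
        for (int j = 1; j < cols - 1; j++) {
            float s = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    s += in[(i + di) * cols + (j + dj)];
            out[(i - 1) * (cols - 2) + (j - 1)] = s / 9.0f;
        }
}

/* tiny self-check: a constant 4x4 input yields a constant 2x2 output */
static float demo_box(void)
{
    float in[16], out[4];
    for (int i = 0; i < 16; i++)
        in[i] = 2.0f;
    box3x3(in, out, 4, 4);
    return out[0] + out[1] + out[2] + out[3];
}
```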
Refactoring includes the 32f to f32 suffix change.
Is p_a_inv needed? It has many branches and maybe not so good precision.
p_inv is already fast.
Some platforms have hardware division that is faster than software division.
I think we can either make an inline function like this:
static inline float PAL_DIV(float a, float b)
{
#ifdef HAVE_DIV
    return a / b;
#else
    /* compute the inverse 1/b */
    union {
        float f;
        uint32_t x;
    } u = { b };
    /* First approximation */
    u.x = 0x7EEEEBB3 - u.x;
    /* Refine with Newton-Raphson: x' = x * (2 - b * x) */
    u.f = u.f * (2.0f - b * u.f);
    u.f = u.f * (2.0f - b * u.f);
    u.f = u.f * (2.0f - b * u.f);
    return a * u.f;
#endif
}
(plus #include <stdint.h> for uint32_t)
and use it when division is needed
or use division in every case and let the compiler do the division.
Do we really need this function?
And can't we just let the compiler decide the rounding mode?
Reference:
https://github.com/parallella/pal/blob/master/src/math/p_ftoi.c
Task:
-create API reference manual using doxygen
Questions:
-example of best practices?
-one per dir or one per project?
-ability to create a well-structured, linked PDF automatically?
For:
-anyone
🚧 Work in progress 🚧
Following on from the discussion in pull request #105: should those 2 functions have a hard limit on how many iterations they run, or should they just keep iterating until convergence?
Personally, I think there should instead be a pair of functions for each: one with a fixed iteration limit (lower accuracy) and one that keeps iterating until an accuracy set by the user is reached (e.g. iterate until each iteration changes the result by 1% or less).
@aolofsson , it'd also be great if you could tell us which one would be better
Most of my own work will be focused on the Epiphany, but clearly the goal for the project is to make PAL something universal. There is really nothing equivalent out there, so it does make sense.
Guidelines:
-POSIX (not OpenMP) to maximize portability
-Start with "good enough" C but assume that you will eventually need to use assembly...
One at a time....
Besides grabbing the lowest-hanging fruit (#82), I actually have a criticism of the design of the math library. The majority of the library is functions that output multiple values, but there are a few that output single values from multiple inputs (min, max, sum, etc.). Especially given that this is a parallel math library, I'd argue that these single-output functions are wasting valuable names better given to fully parallel operations.
For example, min and max. Presently, these functions return a single min or max value for all values in the input array. This contradicts other functions like abs, add, mul, and many others that output multiple values.
I'd recommend renaming functions like p_min_f32 to p_min_value_f32 or p_minv_f32. That way you know from the name that it outputs a single value instead of many.
This would then free up p_min_f32 and p_max_f32 to operate on two input arrays and output an array of min or max values, consistent with the rest of the functions. It also means the namespace is more open to adding other useful functions like clamp, sign, floor, and ceil, to name a few.
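The proposed elementwise behaviour would look something like this (the signature shape is assumed, and I've used a neutral name rather than claiming PAL's):

```c
/* Elementwise minimum of two arrays, matching the shape of abs/add/mul:
   c[i] = min(a[i], b[i]). This is what the proposal would give p_min_f32. */
static void elementwise_min_f32(const float *a, const float *b,
                                float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] < b[i] ? a[i] : b[i];
}

/* tiny usage check */
static float demo_min(void)
{
    float c[2];
    elementwise_min_f32((const float[]){1.0f, 5.0f},
                        (const float[]){2.0f, 4.0f}, c, 2);
    return c[0] * 10.0f + c[1];
}
```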
I can `make runtest`, but runtest requires an argument, which is another program, and what that should be is unclear. I don't know how all the gold.h stuff comes into play.
Our goal is continuous integration in terms of build, test, and measure.
As part of that goal, we need to build a framework for measuring and displaying (tabulating) the code size and performance of all of the functions across all of the supported platforms.
The results should be published and readable in markdown in the root directory of PAL.
During code development I found it very annoying to have the compiler-generated files alongside the tracked ones.
As I don't know how to achieve this with autoconf, my suggestion is to use CMake. With it, all generated files can live in a separate directory. Additionally, and this is the best part, it can generate all kinds of project files (Eclipse, VS, Code::Blocks, Makefile, etc.) on demand. You can also glob all *.c files in your project instead of declaring each one in a list.
Claims to normalize to [-pi..pi]
Bad example:
M_NORMALIZE_RADIANS(3.141593)=-21.991148
In PR #117 I have an alternate function.
First of all, p_sincos has one input and two outputs, which is different from most functions in PAL, so it must have a proper test implementation.
Secondly, the current (single) reference vector for testing purposes holds entirely wrong values, since it was calculated using the log10f function:
#include <math.h>
#include "simple.h"

void generate_ref(float *out, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = log10f(ai[i]); /* copy-paste mistake */
}

Namely in this line.
Issue:
Every value change is written through the global memory pointer.
Fix:
Introduce a temporary variable that is written to the global memory pointer only once, after the function finishes its work.
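The fix looks something like this (a hedged sketch; the real PAL function's signature may differ):

```c
/* Accumulate into a local variable that stays in a register, and store
   through the (possibly remote) output pointer exactly once at the end. */
static void sum_f32(const float *a, float *r, int n)
{
    float acc = 0.0f;            /* local accumulator */
    for (int i = 0; i < n; i++)
        acc += a[i];             /* no per-iteration store through r */
    *r = acc;                    /* single write to the output pointer */
}

/* tiny usage check */
static float demo_sum(void)
{
    float r;
    sum_f32((const float[]){1.0f, 2.0f, 3.0f}, &r, 3);
    return r;
}
```

On Epiphany this matters doubly, since the output pointer may point at slow external or remote-core memory.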
This Issue tracks my contribution of p_mode() to the math library.
Would you like to replace any double quotes with angle brackets around file names in include statements?
pal_math.h: defined constants to be used
The current test environment is quite primitive. It works with a set of golden vectors in a tabular text format.
Example:
pal/src/math/test/p_log10_f32.dat
Currently some of the functions are missing test vectors, and certainly all functions need more exhaustive testing...open to suggestions with respect to framework. (I am sure there is a lot out there).
I have had great success with plain text based unit testing in the past. (at least as an intermediate format).
The current pal/src/test_main.c is VERY primitive. What I would prefer not to have is a personalized test function for each function, copy-pasted or auto-generated by some other program (been there, done that...). Since all of the functions are quite similar and math-oriented, it seems that a common single test framework with input data and expected data is the way to go...
🚧 Work in progress 🚧
Refers to p_popcount.
In the function definition one argument is missing (compared to the declaration and the docs).
Issue to keep track of my contribution of p_median to the math library.
In many functions there are parameters of type int whose value will always be non-negative (processor count, width, height, etc.). In order to reduce the overhead of parameter validity checks (which, btw, I have not seen anywhere), I suggest changing these to unsigned int.
For further bit-level optimization it might also be nice to use fixed-size types from stdint.h everywhere, e.g. uint32_t. This should improve the portability of low-level optimizations.
As a basic building block, we need a very fast optimized linear algebra call that runs single threaded. The parallel framework will be built on top of this basic building block.
A great starting point is BLIS from the University of Texas:
Major tasks:
-Create the optimized assembly macro needed at the base (basically a 4x4 matrix multiply)
-Run BLIS through the Epiphany tool chain to create the library. (sounds easy, doesn't it...)
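As a reference for what the base macro has to compute, here is the 4x4 multiply in plain C. The optimized Epiphany version would be hand-scheduled assembly; this sketch only pins down the contract (row-major, C += A * B):

```c
#include <math.h>

/* 4x4 micro-kernel contract: row-major matrices, accumulate C += A * B.
   Plain-C reference; the real kernel would use fused multiply-adds. */
static void mm4x4(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = c[i * 4 + j];
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
}

/* tiny self-check: identity * B == B */
static float demo_mm(void)
{
    float a[16] = {0}, b[16], c[16] = {0}, d = 0.0f;
    for (int i = 0; i < 4; i++)
        a[i * 4 + i] = 1.0f;
    for (int i = 0; i < 16; i++)
        b[i] = (float)i;
    mm4x4(a, b, c);
    for (int i = 0; i < 16; i++)
        d += fabsf(c[i] - b[i]);
    return d;
}
```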
I'm opening this issue to track my contribution of p_sort()
to the math library. PR will follow shortly.
Issue:
There's a magic number as the starting comparison value.
Fix:
Substitute the current value with the minimum float value.
p_memcpy(void*, void*, int, int) needs to be implemented.
There are a lot of implementations of memcpy -- newlib has a fantastically huge one for each platform.
Some notes from the source:
* 0. User space
* 1. Specific to bsb/device/os
* 2. Should be lightning fast
* 3. Add "safety" compile switch for error checking
* 4. Function should not call any
* 5. Need a different call per chip, board, O/S?
and some personal observations:
Some would name the "safety" switch FAIL_FAST, but others would name it __SAFE_MEMCPY.
memcpy isn't a terribly special function, but it gets a lot of press for being a pusher of bits.
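For discussion, here is a minimal sketch of points 2-3 from the notes. The issue's prototype has four parameters and the fourth's meaning isn't stated, so this keeps a conventional three-argument shape; the real p_memcpy may well differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch only. A naive byte loop with an optional "safety"
   compile switch; a fast version would move aligned words instead. */
static void *p_memcpy_sketch(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
#ifdef P_MEMCPY_SAFE
    if (dst == NULL || src == NULL)   /* "safety" switch (note 3 above) */
        return dst;
#endif
    while (n--)
        *d++ = *s++;
    return dst;
}

/* tiny usage check */
static int demo_cpy(void)
{
    char dst[6] = {0};
    p_memcpy_sketch(dst, "hello", 6);
    return dst[0] == 'h' && dst[4] == 'o' && dst[5] == '\0';
}
```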