
Intel® Implicit SPMD Program Compiler

License: BSD 3-Clause "New" or "Revised" License

ispc's Introduction

Intel® Implicit SPMD Program Compiler (Intel® ISPC)

ispc is a compiler for a variant of the C programming language, with extensions for single program, multiple data programming. Under the SPMD model, the programmer writes a program that generally appears to be a regular serial program, though the execution model is actually that a number of program instances execute in parallel on the hardware.

Overview

ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs and GPUs; it frequently provides a 3x or more speedup on architectures with 4-wide vector SSE units and 5x-6x on architectures with 8-wide AVX vector units, without any of the difficulty of writing intrinsics code. Parallelization across multiple cores is also supported by ispc, making it possible to write programs that achieve performance improvement that scales by both number of cores and vector unit size.

There are a few key principles in the design of ispc:

To build a small set of extensions to the C language that would deliver excellent performance to performance-oriented programmers who want to run SPMD programs on the CPU and GPU.
To provide a thin abstraction layer between the programmer and the hardware--in particular, to have an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
To make it possible to harness the computational power of SIMD vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
To explore opportunities from close coupling between C/C++ application code and SPMD ispc code running on the same processor--to have lightweight function calls between the two languages and to share data directly via pointers without copying or reformatting.

ispc is an open source compiler with the BSD license. It uses the remarkable LLVM Compiler Infrastructure for back-end code generation and optimization and is hosted on GitHub. It supports Windows, macOS, and Linux as host operating systems and can also target Android, iOS, and PS4/PS5. It currently supports multiple flavours of x86 (SSE2, SSE4, AVX, AVX2, and AVX512), ARM (NEON), and Intel® GPU architectures (Gen9 and Xe family).

Features

ispc provides a number of key features to developers:

Familiarity as an extension of the C programming language: ispc supports familiar C syntax and programming idioms, while adding the ability to write SPMD programs.
High-quality SIMD code generation: the performance of code generated by ispc is often close to that of hand-written intrinsics code.
Ease of adoption with existing software systems: functions written in ispc directly interoperate with application functions written in C/C++ and with application data structures.
Portability across over a decade of CPU generations: ispc has targets for x86 SSE2, SSE4, AVX, AVX2, and AVX512, as well as ARM NEON and recent Intel® GPUs.
Portability across operating systems: Microsoft Windows, macOS, Linux, and FreeBSD are all supported by ispc.
Debugging with standard tools: ispc programs can be debugged with standard debuggers.

Installation

Official Release Binaries

You can download the official release binaries from the latest release page. Choose the appropriate version for your operating system and architecture.

Linux (Snap Store)

Linux users can install ispc using the Snap Store:

snap install ispc

Intel® oneAPI Distribution

ispc is distributed as part of the Intel® oneAPI. You can install it from the corresponding repositories for DEB-based and RPM-based Linux distributions. Follow the instructions below:

DEB-based Linux (Ubuntu, Debian, etc.)

First, download the key to the system keyring:

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

Next, add the signed entry to apt sources and configure the APT client to use the Intel repository:

echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

Update the package list and install ispc:

sudo apt-get update
sudo apt-get install intel-oneapi-ispc

The installation location is inside the /opt/intel/ directory. To use ispc, either use the full path /opt/intel/oneapi/ispc/latest/bin/ispc or add the bin directory to your PATH:

source /opt/intel/oneapi/ispc/latest/env/vars.sh

Other Package Managers

Thanks to community support, ispc is also available through a variety of package managers on multiple operating systems.

Windows

To install ispc on Windows, download the latest release as a zip archive from the latest release page and unpack it to a directory of your choice. It is the user's responsibility to set up permissions for this directory according to the principle of least privilege.

Moreover, ispc depends on the run-time components of Visual C++ (DLLs). These libraries can be installed with the Microsoft Visual C++ Redistributable package. Instructions for installing them can be found here.

Additional Resources

The latest ispc binaries corresponding to the main branch can be downloaded from Appveyor for Linux and Windows. See also additional documentation and additional performance information. If you have a bug report or a question, you are welcome to open an issue or start a discussion on GitHub.

ispc's Issues

Add support for in-memory half float types

It would be nice to have optimized routines that load/store fp16 values from memory into float types for computation.
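For reference, the per-element conversion that such a load routine has to perform can be sketched in plain C. This is a scalar illustration only, not an optimized routine and not ispc code; the names half_bits_to_float and load_half_array are hypothetical, and the input is assumed to be IEEE 754 binary16 stored as raw 16-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one IEEE 754 binary16 value (raw bits) to a 32-bit float. */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exponent = (h >> 10) & 0x1fu;
    uint32_t mantissa = h & 0x3ffu;
    uint32_t bits;
    float result;

    if (exponent == 0) {
        if (mantissa == 0) {
            bits = sign;                         /* signed zero */
        } else {
            /* Subnormal half: renormalize so the implicit bit is set. */
            uint32_t e = 113;
            while ((mantissa & 0x400u) == 0) {
                mantissa <<= 1;
                --e;
            }
            bits = sign | (e << 23) | ((mantissa & 0x3ffu) << 13);
        }
    } else if (exponent == 31) {
        bits = sign | 0x7f800000u | (mantissa << 13);   /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    }
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Load `count` half values from memory into a float array. */
static void load_half_array(const uint16_t *src, float *dst, size_t count) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i)
        dst[i] = half_bits_to_float(src[i]);
}
```

An optimized version would convert a whole vector of values at once; the branchy scalar form above only pins down the expected results.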

Fix unnecessary scalarization of some uniform vector type converts

(See comment around line 4013 of expr.cpp, in TypeCastExpr.cpp.)

Currently, when we're type converting from e.g. a uniform float<4> to a uniform int<4>, we are still serializing, doing the float->int for each element, even though there's an SSE instruction to do this all at once. We should special-case type conversion from uniform short vectors to differently-typed uniform short vectors so that it doesn't serialize. (We still need the loop over elements for both varying short vectors as well as uniform->varying short-vector type conversions.)

Should add support for multi-element swizzle

For the short-vector types, it would be nice to have support for HLSL-style swizzles like:

float<3> foo = ...;
float<3> bar = foo.zyx;
float<2> bat = foo.yz;

One detail that comes up is passing swizzles as references:

void foo(reference float<3> vec) { vec.y = 0; }
...
float<3> vec = ...;
foo(vec.yxz);

The code that does pass by value/result in FunctionCallExpr::GetValue() needs to handle this case as well.

Should stop using sign extension for bool->int conversions

Currently the type conversion code in lTypeConvAtomic() in expr.cpp uses sign extension (rather than zero extension) to convert bool values to integer types. (So e.g. "true" is 0xffffffff, not 1, for a 32-bit int.) It would be better to use zero extension for more consistency with C/C++, though there is currently some code in stdlib.ispc that expects that if it casts a bool to an int, it will have sign extension (e.g. some of the any(), all(), and popcnt() stuff that uses __mask explicitly.) That code should be changed to call an __sext() intrinsic or somesuch.
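To make the difference concrete, here is a small C sketch of the two conversions and of the kind of mask-based code that silently depends on sign extension. The function names are hypothetical (sext_bool stands in for the proposed __sext() intrinsic), and the bitwise select is only an illustration of the pattern, not code from stdlib.ispc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* C/C++ behaviour: true becomes 1. */
static uint32_t zext_bool(bool b) { return b ? 1u : 0u; }

/* Sign extension: true becomes 0xffffffff (all bits set). */
static uint32_t sext_bool(bool b) { return b ? 0xffffffffu : 0u; }

/* Bitwise select. This only works if `mask` is all ones or all zeros,
 * which is exactly what sign extension of a bool produces. */
static uint32_t select_by_mask(uint32_t mask, uint32_t if_on, uint32_t if_off) {
    assert(mask == 0u || mask == 0xffffffffu || mask == 1u);
    return (if_on & mask) | (if_off & ~mask);
}
```

With a zero-extended mask the select keeps only bit 0 of if_on and almost all of if_off, which is why code written against the current behaviour would need an explicit sign-extending call after the change.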

Passing structs on the stack from C/C++ code is broken

Passing structs on the stack from the app code to ispc code gives garbled data. I believe that the underlying issue is the one brought up in this LLVM mailing list discussion (http://old.nabble.com/struct-passing-on-X86-64-td31812612.html)--some x86_64 ABI weirdness. Passing structs in this manner is currently caught by lCheckForStructParameters() in module.cpp (and an error is issued), but it'd be nice to make this work someday. (Though in general, we expect people to pass structs by reference most of the time anyway...)

Add support for multi-target binaries

It would be nice to be able to compile an ispc source file into an object file that had the original source compiled to multiple targets (e.g. SSE2, SSE4, AVX), perhaps also including a little trampoline function that dispatched to the appropriate version at runtime based on the actual system the program was running on.

very long compile times for some tests

Among a few others, tests/array-mixed-unif-vary-indexing-2.ispc takes about 45s to compile on my laptop. Something very bad is obviously happening there; should profile and figure out if this is an ispc bug or an llvm bug.

reduce functions for double-precision numbers

I've noticed that there are no reduce functions for double precision (only int and float).

Should short-circuit && and || operators

Unlike C/C++, ispc doesn't currently short-circuit evaluation of && and ||. It probably should.
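For comparison, this is the C behaviour that programmers tend to rely on: the right-hand operand of && is not evaluated when the left-hand operand is false. The sketch below is plain C with hypothetical names, and it says nothing about how a varying (per-lane) condition would have to be handled in ispc.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the right-hand operand was actually evaluated. */
static int deref_calls = 0;

static int is_positive(const int *p) {
    ++deref_calls;
    assert(p != NULL);   /* would fire if && did not short-circuit */
    return *p > 0;
}

/* Safe only because C guarantees is_positive() is skipped when p is NULL. */
static int safe_is_positive(const int *p) {
    return p != NULL && is_positive(p);
}
```

Without short-circuiting, a guard like `p != NULL && ...` does not protect the dereference, so code ported from C can misbehave.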

StructType should maintain SourcePos values for where each member was declared

In particular, this is the position we'd like to pass to llvm::DIBuilder::createMemberType() in StructType::GetDIType().

Need to support hexadecimal floating-point constants on Windows

The atof() function on Windows returns 0 if given a C99-style hexadecimal constant like "0x1.b4c904p+5"; this is a problem, since the parser calls atof() when it encounters a hexadecimal floating-point constant. Should figure out if there is a Windows function that handles this, or else find an alternative.
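One possible alternative is to parse the constant by hand, which avoids depending on the C library at all. The following is a minimal sketch under stated assumptions: parse_hex_float is a hypothetical helper, it handles only unsigned constants of the form 0x<hex digits>[.<hex digits>]p<decimal exponent> with no type suffix, and it does no overflow or rounding handling for very long mantissas or huge exponents.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if `c` is not one. */
static int hex_digit_value(int c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse e.g. "0x1.b4c904p+5". Returns 0 on success, -1 on malformed input. */
static int parse_hex_float(const char *s, double *out) {
    double mantissa = 0.0;
    int exponent = 0, digits = 0, seen_point = 0, exp_sign = 1, e = 0;

    assert(s != NULL && out != NULL);
    if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X')) return -1;
    for (s += 2;; ++s) {
        int d;
        if (*s == '.' && !seen_point) { seen_point = 1; continue; }
        d = hex_digit_value((unsigned char)*s);
        if (d < 0) break;
        mantissa = mantissa * 16.0 + d;
        if (seen_point) exponent -= 4;   /* each fractional digit is 2^-4 */
        ++digits;
    }
    if (digits == 0 || (*s != 'p' && *s != 'P')) return -1;
    ++s;
    if (*s == '+' || *s == '-') { if (*s == '-') exp_sign = -1; ++s; }
    if (*s < '0' || *s > '9') return -1;
    for (; *s >= '0' && *s <= '9'; ++s) e = e * 10 + (*s - '0');
    if (*s != '\0') return -1;
    exponent += exp_sign * e;

    /* Scale by 2^exponent; each step is exact barring overflow/underflow. */
    for (; exponent > 0; --exponent) mantissa *= 2.0;
    for (; exponent < 0; ++exponent) mantissa *= 0.5;
    *out = mantissa;
    return 0;
}
```

The lexer would still need to recognize the token; this only replaces the atof() call for that one case.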

Should add support for enums

The compiler doesn't currently support enums; this should be a relatively straightforward extension.

2D array support

The following code:

export void test(uniform float array[][], uniform int length)
{
  for (uniform int i = 0; i < length; i + programCount)
    {
      float a = array[1][i + programIndex];
      float b = array[2][i + programIndex];
      array[3][i + programIndex] = a + b;
    }
}

produces the following assembly:

    .file   "2darray.ispc"
    .text
    .globl  test___REFUf_5B__5D__5B__5D_Ui
    .align  16, 0x90
    .type   test___REFUf_5B__5D__5B__5D_Ui,@function
test___REFUf_5B__5D__5B__5D_Ui:         # @"test___REFUf[][]Ui"
# BB#0:                                 # %allocas
    movmskps    %xmm0, %eax
    cmpl    $15, %eax
    jne .LBB0_3
# BB#1:                                 # %for_test.preheader
    testl   %esi, %esi
    jle .LBB0_7
    .align  16, 0x90
.LBB0_2:                                # %__load_masked_32.exit.us
                                        # =>This Inner Loop Header: Depth=1
    jmp .LBB0_2
.LBB0_3:                                # %for_test238.preheader
    testl   %esi, %esi
    jle .LBB0_7
# BB#4:                                 # %for_loop240.lr.ph
    testl   %eax, %eax
    jne .LBB0_6
    .align  16, 0x90
.LBB0_5:                                # %__load_masked_32.exit693.us.us
                                        # =>This Inner Loop Header: Depth=1
    jmp .LBB0_5
    .align  16, 0x90
.LBB0_6:                                # %__load_masked_32.exit693.us729
                                        # =>This Inner Loop Header: Depth=1
    jmp .LBB0_6
.LBB0_7:                                # %cif_done
    ret
.Ltmp0:
    .size   test___REFUf_5B__5D__5B__5D_Ui, .Ltmp0-test___REFUf_5B__5D__5B__5D_Ui

    .globl  test
    .align  16, 0x90
    .type   test,@function
test:                                   # @test
# BB#0:                                 # %allocas
    testl   %esi, %esi
    jle .LBB1_2
    .align  16, 0x90
.LBB1_1:                                # %__load_masked_32.exit.us
                                        # =>This Inner Loop Header: Depth=1
    jmp .LBB1_1
.LBB1_2:                                # %cif_done
    ret
.Ltmp1:
    .size   test, .Ltmp1-test


    .section    ".note.GNU-stack","",@progbits

and the following header:

//
// 2darray.h
// (Header automatically generated by the ispc compiler.)
// DO NOT EDIT THIS FILE.
//

#ifndef ISPC_2DARRAY_H
#define ISPC_2DARRAY_H

#include <stdint.h>

#ifdef __cplusplus
namespace ispc {
#endif // __cplusplus


///////////////////////////////////////////////////////////////////////////
// Functions exported from ispc code
///////////////////////////////////////////////////////////////////////////
#ifdef __cplusplus
extern "C" {
#endif // __cplusplus
    extern void test(float[] *array, int32_t length);
#ifdef __cplusplus
}
#endif // __cplusplus

#ifdef __cplusplus
}
#endif // __cplusplus

#endif // ISPC_2DARRAY_H

Improve error reporting

This program (with a single misspelling) issues a cascade of error messages. Even worse, the "print the context of the error" code is printing 7 lines of code each time by the time it gives up.

void unpack_int8_to_int32(unsigned int32 in, referenec int32 v0, reference int32 v1,
reference int32 v2, reference int32 v3) {
v0 = (in >> 24) & 0xff;
v1 = (in >> 16) & 0xff;
v2 = (in >> 8) & 0xff;
v3 = in & 0xff;
}

Implementations of __load_masked_{32,64}() have lurking dangers

The stdlib-sse4x2.ll versions of these don't even check the mask; they should make sure it isn't all off before doing the load--otherwise the address may be bogus.

The stdlib-sse.ll version at least doesn't have this bug, but uses the logic "if any mask element is on, then it's safe to load the whole vector's worth". This is unsafe in the case where the vector load would straddle a page boundary but where the mask is off for the portion on one of the two pages and where that page isn't valid to load from.

Really, we can only do an n-wide load if at minimum the first and last lane masks are on. Maybe should just change this to do the load if it's all on and serialize otherwise.
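The "do the wide load if the mask is all on, serialize otherwise" policy could look roughly like this in scalar C. It is an emulation with a hypothetical 4-lane width and a bitmask (bit i set means lane i is active), not the LLVM implementation in the stdlib-*.ll files.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Load LANES 32-bit values from `src` into `dst` under `mask`.
 * Inactive lanes of `dst` are left untouched and `src` is never read
 * for them, so an invalid page behind an inactive lane is harmless. */
static void load_masked_32(const int32_t *src, unsigned mask, int32_t dst[LANES]) {
    const unsigned all_on = (1u << LANES) - 1u;
    assert((mask & ~all_on) == 0);

    if (mask == all_on) {
        /* Every lane is active, so the whole vector is readable:
         * a single wide load is safe. */
        memcpy(dst, src, sizeof(int32_t) * LANES);
        return;
    }
    /* Otherwise serialize and touch only the active lanes. */
    for (int i = 0; i < LANES; ++i) {
        if (mask & (1u << i))
            dst[i] = src[i];
    }
}
```

The weaker condition mentioned above (first and last lanes on) would allow the wide load in more cases; the all-on test is simply the easiest one to reason about.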

A number of tests fail in windows (but not OSX/Linux)

Note that some fail in ispc and some cause ispc_test to crash.

$ ./run_tests.sh  -v
Running all correctness tests
00760089 (0x003FF74C 0x003FF6CC 0x0041BC08 0x00000000) <unknown module>  
01231DAA (0x0CE81104 0x01560044 0x00000001 0x00000000), lRunTest()+2970 bytes(s), c:\users\mmp\ispc\ispc_test.cpp, line 253
012322EF (0x003FF828 0x01333FFC 0x00000002 0x005711F8), main()+0479 bytes(s), c:\users\mmp\ispc\ispc_test.cpp, line 306+0012 byte(s)
Test tests/cfor-test-101.ispc FAILED ispc_test

Test tests/cfor-test-72.ispc FAILED ispc_test

Test tests/load-int16-1.ispc FAILED ispc compile

00630089 (0x0042F914 0x0042F894 0x001BBC08 0x00000000) <unknown module>
009B1DAA (0xBE44207C 0x00CE0044 0x00000001 0x00000000), lRunTest()+2970 bytes(s), c:\users\mmp\ispc\ispc_test.cpp, line 253
009B22EF (0x00000002 0x007E11F8 0x007E23A0 0xBE4420E0), main()+0479 bytes(s), c:\users\mmp\ispc\ispc_test.cpp, line 306+0012 byte(s)
00AB3FFC (0x7EFDE000 0x0042FA38 0x77989F02 0x7EFDE000), __tmainCRTStartup()+0290 bytes(s), f:\dd\vctools\crt_bld\self_x86\crt\src\crtexe.c, line 555+0023 byte(s)
77433677 (0x7EFDE000 0x77BD4BCC 0x00000000 0x00000000), BaseThreadInitThunk()+0018 bytes(s)
77989F02 (0x00AB411D 0x7EFDE000 0x00000000 0x00000000), RtlInitializeExceptionChain()+0099 bytes(s)
77989ED5 (0x00AB411D 0x7EFDE000 0x00000000 0x00000000), RtlInitializeExceptionChain()+0054 bytes(s)
Test tests/test-101.ispc FAILED ispc_test

Test tests/varying-struct-4.ispc FAILED ispc compile

Running failing tests06.

LLVM assertion hits under linux when building aobench_instrumented example

ispc -O2 --fast-math --instrument ao.ispc -o objs/ao_ispc.o -h objs/ao_ispc.h
ispc: MCELFStreamer.cpp:43: virtual void llvm::MCELFStreamer::EmitLabel(llvm::MCSymbol*): Assertion `Symbol->isUndefined() && "Cannot define a symbol twice!"' failed.
make: *** [objs/ao_ispc.h] Segmentation fault

This only happens under Linux, but seems to happen with both LLVM 2.9 and LLVM dev TOT. Need to determine if this is an LLVM bug that should be filed or if ispc is doing something wrong.

Add support for int8/int16 types

It would be nice to have support for 8 and 16-bit integer types. Implementation in ispc should mostly be straightforward plumbing through the parser, types, and type conversion code. Note that LLVM 2.9 (and LLVM TOT, up until mid-June) doesn't generate great code for vectors of i8 and i16 values; current LLVM TOT should be much better, though.

Add support for 'switch' statements

It would be good to add support for 'switch'.

For 'uniform' type switch expressions, this should be a very straightforward mapping to the LLVM SwitchInst.

How to efficiently implement it for 'varying' switch expressions is an interesting question. A correct-but-possibly-slow baseline would be to transform it into the equivalent set of if/elses, updating the mask at each block. One could also imagine something more efficient along the lines of:

lanes = current active simd lanes
while (lanes != 0) {
find first lane in lanes that is on
figure out which switch target it wants to jump to
figure out which other active lanes, if any, want to jump to that target
set the mask accordingly
run the code for those lanes
update 'lanes' to turn off the bits for the lanes that just ran
}
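The loop above can be written out as a scalar C emulation. This is a sketch under assumptions: an 8-lane width, a bitmask of active lanes, and a hypothetical callback (case_body_fn) standing in for "run the code for those lanes"; the record_group helper exists only so the grouping can be inspected.

```c
#include <assert.h>

#define LANES 8

/* Called once per distinct switch target with the mask of lanes taking it. */
typedef void (*case_body_fn)(int target, unsigned group_mask, void *user);

/* Runs `body` once per group of active lanes that share a switch value.
 * Returns the number of groups executed. */
static int run_varying_switch(const int value[LANES], unsigned active,
                              case_body_fn body, void *user) {
    unsigned lanes = active;
    int groups = 0;
    assert((active >> LANES) == 0);

    while (lanes != 0) {
        int first = 0;
        unsigned group = 0;
        int target;

        while (!(lanes & (1u << first)))   /* first lane that is still on */
            ++first;
        target = value[first];
        for (int i = first; i < LANES; ++i) {
            if ((lanes & (1u << i)) && value[i] == target)
                group |= 1u << i;          /* lanes wanting the same target */
        }
        body(target, group, user);         /* run with mask = group */
        lanes &= ~group;                   /* those lanes are done */
        ++groups;
    }
    return groups;
}

/* Example body: records which groups ran, in order. */
struct group_log { int count; int target[LANES]; unsigned mask[LANES]; };

static void record_group(int target, unsigned group_mask, void *user) {
    struct group_log *log = user;
    log->target[log->count] = target;
    log->mask[log->count] = group_mask;
    log->count++;
}
```

The loop runs once per distinct target among the active lanes, so the cost scales with divergence rather than with the number of case labels. Fall-through between cases is not modeled here.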

regression on Ubuntu Linux (32Bit) with gcc-4.6

compiling the current git head fails on my Ubuntu Linux (32bit) with gcc-4.6:
/ispc$ make
Updating dependencies
Creating objs/ directory
Compiling builtins.cpp
Compiling ctx.cpp
Compiling decl.cpp
Compiling expr.cpp
expr.cpp: In member function ‘virtual void ConstExpr::Print() const’:
expr.cpp:3540:38: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘int64_t’ [-Wformat]
expr.cpp:3547:39: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘uint64_t’ [-Wformat]
Compiling ispc.cpp
Compiling llvmutil.cpp
Compiling main.cpp
Compiling module.cpp
Compiling opt.cpp
opt.cpp:63:53: fatal error: llvm/Support/PassManagerBuilder.h: No such file or directory
compilation terminated.
make: *** [objs/opt.o] Error 1

Mandelbrot sample: SSE access is not aligned to 16x, segfault

I have a Celeron E3400, so I switched some Makefile settings:

uname -a

Linux home 2.6.39-ARCH #1 SMP PREEMPT Tue Jun 7 05:49:02 UTC 2011 i686 Intel(R) Celeron(R) CPU E3400 @ 2.60GHz GenuineIntel GNU/Linux

cd examples/mandelbrot

git diff Makefile

--- a/examples/mandelbrot/Makefile
+++ b/examples/mandelbrot/Makefile
@@ -1,8 +1,8 @@

CXX=g++
-CXXFLAGS=-Iobjs/ -O3 -Wall
+CXXFLAGS=-Iobjs/ -g -O2 -Wall -msse -msse2 -mstackrealign # hoping it would help align, but it didnt
ISPC=ispc
-ISPCFLAGS=-O2 --target=sse4x2
+ISPCFLAGS=-O2 -g --target=sse2 --arch=x86 --cpu=core2

make

/bin/mkdir -p objs/
ispc -O2 -g --target=sse2 --arch=x86 --cpu=core2 mandelbrot.ispc -o objs/mandelbrot_ispc.o -h objs/mandelbrot_ispc.h
g++ mandelbrot.cpp -Iobjs/ -g -O2 -Wall -msse -msse2 -mstackrealign -c -o objs/mandelbrot.o
g++ mandelbrot_serial.cpp -Iobjs/ -g -O2 -Wall -msse -msse2 -mstackrealign -c -o objs/mandelbrot_serial.o
g++ -Iobjs/ -g -O2 -Wall -msse -msse2 -mstackrealign -o mandelbrot objs/mandelbrot.o objs/mandelbrot_ispc.o objs/mandelbrot_serial.o -lm

./mandelbrot

Segmentation fault

gdb ./mandelbrot

(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0x08049030 in mandelbrot_ispc (x1=, y1=Unhandled dwarf expression opcode 0x0
) at mandelbrot.ispc:37
37 for (i = 0; i < count; ++i) {

(gdb) set disassembly-flavor intel

(gdb) display /i $pc
1: x/i $pc
=> 0x8049030 <mandelbrot_ispc+496>: movaps XMMWORD PTR [eax+edi*1],xmm1

(gdb) p /x $eax
$1 = 0xb7b9c008

Oops, not aligned to 16 bytes ... hence the segfault.

Bug in ispc, or am I doing something wrong?
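For anyone reproducing this, a quick way to check a pointer before handing it to ispc is the usual power-of-two mask test, sketched below in C. The helper names are hypothetical. Note that movaps faults on the full effective address (eax+edi here); the report only prints eax, so the value checked in the usage note is that base value.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if `address` is a multiple of `alignment` (a power of two). */
static int is_aligned(uintptr_t address, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}

/* Round `address` up to the next multiple of `alignment` (a power of two). */
static uintptr_t align_up(uintptr_t address, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address + alignment - 1) & ~(alignment - 1);
}
```

For the value above, is_aligned(0xb7b9c008, 16) is false because the low four bits are 0x8, and align_up(0xb7b9c008, 16) gives 0xb7b9c010. On the application side, allocating buffers with an aligned allocator such as posix_memalign() avoids the problem.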

Fix bugs with indexing into arrays of uniform short-vectors

See failing_tests/masked-scatter-vector and failing_tests/scatter-vector for two test cases. The issue is that the compiler is computing the wrong offsets into the arrays.

More generally, we should have a lot more tests of that functionality.

Add support for varying lvalues in more FunctionEmitContext methods

The PtrToIntInst and IntToPtrInst methods in FunctionEmitContext should be generalized to support varying lvalues (i.e. arrays of pointers rather than just pointers). See FunctionEmitContext::BitCastInst() for an example of where this is implemented. (This functionality isn't currently needed, but it would be nice to have for completeness. For now LLVM should throw an assert if we try passing arrays of pointers to them in the future.)

Can simplify some type-related code by introducing a new base-class

Currently VectorType and ArrayType benefit from inheriting from SequentialType; there are a bunch of places in the code that benefit from inheriting from a common base-class that abstracts the notion of "type of a sequence of some number of things with the same type".

If we generalize this to the notion of "type of a sequence of some number of things with possibly-differing types", then StructType can fall under this umbrella, and we can further simplify a bunch of code. (One example: lUniformVariableToVarying() in expr.cpp). It's not immediately obvious what a good name for such a type is.

An added advantage of this would be cleaning up the messiness that SequentialType has a GetElementCount() method, but StructType's corresponding method is NumElements().

Note that we probably still want to maintain a separate SequentialType, which lets us keep code that checks "can I index over this type" clean.

Get a better preprocessor solution

Currently we popen() out to /usr/bin/cpp on Linux/Mac and require the user to run it by hand on Windows. Neither of these is great (the Windows part more so).

Syoyo suggests either mcpp (http://mcpp.sourceforge.net/) or grabbing clang's C preprocessor. mcpp has the advantage of being designed to be integrated into other things, but it would be nice to wire it up to build as part of ispc without making the ispc build too complicated...

Add support for aliasing pointers/references

ispc currently expects/requires that there be no aliasing in pointers from the app and references in the ispc program. I think that this is the right default, but it would be nice to allow the programmer to declare which pointers may alias and use that information accordingly during compilation. (i.e. the inverse of "restrict")

Should have more --fast-math versions of transcendental functions in stdlib

Currently --fast-math uses the same implementations of the transcendentals in the stdlib as regular compiles. It would be nice to find some lower-precision but more efficient versions of those for when --fast-math is enabled.

Should support VectorTypes in cast instructions in FunctionEmitContext

The FunctionEmitContext::{TruncInst,CastInst,FPCastInst,SExtInst,ZExtInst} methods should be generalized so that they emit correct code if passed VectorType values (which will be LLVM arrays of vectors at this point in the program execution), along the lines of BinaryOperator(), etc. Calling code doesn't currently depend on this functionality, but it would be nice to have for completeness.

Setting cpu doesn't limit targets

I ran into this while trying to get the examples to run on my netbook. I'm not actually sure if it's a bug or not.

If I build any of the examples out of the box, they will generate sse4 instructions. This compiles successfully, but obviously doesn't run (correctly).

If I add the "--cpu=atom" flag, the code will no longer compile. LLVM throws an error, saying the blendvps sse4 instruction cannot be selected.

If I add the "--target=SSE2" flag, then the code compiles and runs.

I'm not sure if this is expected behaviour or not. It confused me for a while, as I expected setting the cpu type to limit the available targets.

Cheers,
Sam

aobench is 50% slower with llvm TOT vs 2.9 release

This should probably be packaged into a bug to file with LLVM.

Should be able to index more than 4G bytes from the start of arrays

There are many places where offsets from base pointers are represented by int32 values (e.g. all of the scatter/gather code). This means that, even when compiling to a 64-bit target, we can't index more than 4G bytes from the start of an array.

Open question: what is the performance impact from going to 64-bit offsets everywhere? This will probably make the indexing calculations more expensive, since we can't e.g. do a nice 4-wide int32 add on SSE targets, but will have to do 2 64-bit ones. If this ends up being unacceptable, need to figure out how to support both. (e.g. as part of the declaration of the array type).
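The limit is easy to demonstrate with plain integer arithmetic. The C sketch below uses unsigned 32-bit offsets so that the wraparound is well defined (the compiler's offsets are described above as int32, where overflow is worse still); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset computed in 32 bits: wraps modulo 2^32. */
static uint32_t byte_offset_32(uint32_t index, uint32_t element_size) {
    return index * element_size;
}

/* Byte offset computed in 64 bits. */
static uint64_t byte_offset_64(uint64_t index, uint64_t element_size) {
    return index * element_size;
}

/* Nonzero if index * element_size is representable in 32 bits. */
static int offset_fits_in_32_bits(uint64_t index, uint64_t element_size) {
    assert(element_size != 0);
    return index <= UINT32_MAX / element_size;
}
```

Indexing element 1,500,000,000 of a float array needs a byte offset of 6,000,000,000; in 32 bits that wraps to 1,705,032,704, which silently addresses a different element.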

Fix MSVC solution files in examples to use fast math

They currently use /fp:precise, which doesn't lead to an apples-to-apples comparison.

Improve implementation of __masked_store_blend_64() for AVX target

This function should be fixed to do the "do two 32-bit masked stores" (with appropriate bitcasting and the mask values doubled up) that the implementations of this for other targets do (see e.g. stdlib-sse4.ll)

(Need AVX support to work before this can be tested.)
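The "two 32-bit masked stores with the mask doubled up" idea can be emulated in scalar C as follows. This is only a sketch of the data movement with a hypothetical 4-lane width and hypothetical function names; it is not the LLVM IR from stdlib-sse4.ll, and it assumes each mask element is either all ones or all zeros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4

/* 32-bit blend: keep the old value where the mask is off. */
static void masked_store_blend_32(uint32_t *dst, const uint32_t *val,
                                  const uint32_t *mask, int count) {
    for (int i = 0; i < count; ++i)
        dst[i] = (val[i] & mask[i]) | (dst[i] & ~mask[i]);
}

/* 64-bit blend expressed as a 32-bit blend over twice as many elements:
 * each 64-bit lane becomes two 32-bit halves that share one mask value. */
static void masked_store_blend_64(uint64_t dst[LANES], const uint64_t val[LANES],
                                  const uint32_t mask[LANES]) {
    uint32_t dst32[2 * LANES], val32[2 * LANES], mask32[2 * LANES];

    memcpy(dst32, dst, sizeof dst32);      /* "bitcast" to 32-bit elements */
    memcpy(val32, val, sizeof val32);
    for (int i = 0; i < LANES; ++i) {
        assert(mask[i] == 0u || mask[i] == 0xffffffffu);
        mask32[2 * i] = mask32[2 * i + 1] = mask[i];   /* double up the mask */
    }
    masked_store_blend_32(dst32, val32, mask32, 2 * LANES);
    memcpy(dst, dst32, sizeof dst32);
}
```

Because both halves of a 64-bit lane use the same mask value, the result does not depend on byte order.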

Gather/scatter optimization pass improvements

GSImprovementsPass currently looks for ops that either all go to the same location or go to a linear sequence of locations. There are a few generalizations that could be useful:

Look for ops that go to a linear sequence of locations, just not necessarily in order. This could be handled with a vector swizzle before or after doing a regular vector store/load.
Look for cases like "reading two 4-wide linear sequences with an 8-wide target"; for e.g. AVX, this could be turned into two 4-wide loads and a vector shuffle.

Logic in util.cpp/GetDirectoryAndFileName() will fail on Windows

The idea that only a path that starts with '/' is an absolute path isn't right on Windows. (This is an academic issue for now since this function is only called when generating debugging information, and that isn't currently supported for COFF-format object files.)
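A more tolerant test might look like the C sketch below. It is deliberately simplified and the function name is hypothetical: it treats both '/' and '\' as separators, accepts a drive letter followed by a separator, and treats drive-relative forms such as "C:file" as not absolute. Accepting '\' as a separator is a Windows convention; on POSIX systems a backslash is an ordinary filename character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int is_path_separator(char c) {
    return c == '/' || c == '\\';
}

/* Nonzero if `path` looks absolute under POSIX or Windows conventions. */
static int is_absolute_path(const char *path) {
    assert(path != NULL);
    /* "/usr/include", "\dir\file" or a UNC path such as "\\server\share". */
    if (is_path_separator(path[0]))
        return 1;
    /* Drive letter followed by a separator: "C:\src" or "c:/src". */
    if (isalpha((unsigned char)path[0]) && path[1] == ':' &&
        is_path_separator(path[2]))
        return 1;
    return 0;
}
```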

Add a prefetch() builtin?

It might be useful to have a mapping to the prefetch intrinsic available via the standard library.

parse errors issued from short-vector members of structures

This code:

struct Foo {
    float<4> e;
};

Causes a slew of syntax errors to be issued; something is clearly incorrect in the grammar as far as struct parameter parsing...

Clean up implementation of GSImprovementsPass?

This optimization goes through a lot of trouble to try to determine if the gather or scatter is either all going to the same memory location, or accessing a linear sequence of memory locations. The current implementation is relatively complex for what it does. Would it be possible to simplify it substantially by doing something simpler, e.g. an early pass to flatten out GEPs when the size is known, then do LLVM's constant folding, then flatten into an array, etc.? Then to check for the linear sequence case, we could subtract out a vector (0, 4, 8, ...) and see if the result is all the same value, etc.
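Once the per-lane offsets are known constants, the check described at the end could be as simple as the following C sketch. It assumes the offsets have already been flattened into an array of byte offsets; the enum and function names are hypothetical and this is not the code in GSImprovementsPass.

```c
#include <assert.h>
#include <stdint.h>

enum access_pattern { ACCESS_GENERAL, ACCESS_SAME_LOCATION, ACCESS_LINEAR };

/* Classify per-lane byte offsets. For the linear test, subtract
 * (0, size, 2*size, ...) from the offsets and check that every lane
 * ends up with the same value as lane 0. */
static enum access_pattern classify_offsets(const int32_t *offsets, int lanes,
                                            int32_t element_size) {
    int all_same = 1, linear = 1;
    assert(offsets != 0 && lanes > 0 && element_size > 0);

    for (int i = 1; i < lanes; ++i) {
        if (offsets[i] != offsets[0])
            all_same = 0;
        if (offsets[i] - i * element_size != offsets[0])
            linear = 0;
    }
    if (all_same)
        return ACCESS_SAME_LOCATION;   /* can become a scalar load + broadcast */
    if (linear)
        return ACCESS_LINEAR;          /* can become a regular vector load/store */
    return ACCESS_GENERAL;             /* a real gather/scatter */
}
```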

Clean-up syntax for task launch

The task launch syntax currently is a little goofy:

    launch < mandelbrot_scanlines(j, j+span, x0, dx, y0, dy, width, maxIterations, output) >;

It would probably be a good idea to fix up the grammar to remove those < >s.

Stack-allocated uniform arrays should have target vector width alignment

Currently, they're just aligned to their element size; we should align them to the target vector width alignment. (I suspect there are lurking bugs where we expect them to be aligned elsewhere, since we do require that pointers coming in from the app be vector width aligned...)

Clean up the decl code so that it doesn't have so many tendrils

It would be nice to further isolate the grungy decl processing code that's implemented in decl.h/decl.cpp; currently there's too much code in module.cpp and stmt.cpp that has to be aware of that stuff. Better would be to get those decls converted to proper Type *s and Symbol *s as soon as possible so that the module and stmt code can just work with those types directly.

Add support for unaligned pointers from app

ispc currently expects that all pointers from the app will be aligned to the target's natural vector alignment (16 bytes on SSE, 32 on AVX, ...). I think this is a reasonable default going forward, but it would be good to add the capability for the app to also provide not-necessarily-aligned data. ("unaligned" keyword?)

Optionally provide built-in task systems

It would be nice to make it even easier to launch tasks for users who don't want to provide their own task systems by optionally providing the ones in examples/mandelbrot_tasks as builtins.

For example, we could add a --task-system= command line option, with options "concrt", "gcd", and "pthreads" and maybe a "builtin" that maps to the appropriate one for the target system. We could use the same trick we use to compile stdlib-c.c to precompile those ones to bitcode to get them linked into the compiler executable and could then link them into the module if asked for. There could be a (very small) benefit in that the calls out to the task enqueue and sync routines could then be inlined.

Users who wanted to hook in with their own task systems could do --task-system=user and then we'd just leave the corresponding symbols unimplemented and for the user to take care of.

Unnecessary blending in simple program

Given the following simple program:

float f(float a, float b) {
    if (a < b)
        a += b;
   return a;
}

The generated code does two blends, one based on the (a<b) test and one with the mask passed into the function. Only the first one is necessary; the second shouldn't be there.

(Presumably this has performance implications in more complex programs as well.)
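A scalar C emulation of the two variants shows why the second blend is redundant for the caller. This is a sketch of the semantics with a hypothetical 4-lane width and function mask, not the generated code: both versions agree on every active lane, and the caller only consumes active lanes.

```c
#include <assert.h>

#define LANES 4

/* One blend: select between a and a+b based on the (a < b) test. */
static void f_one_blend(const float a[LANES], const float b[LANES],
                        float out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        float sum = a[i] + b[i];
        out[i] = (a[i] < b[i]) ? sum : a[i];
    }
}

/* Two blends: the same select, followed by a second select against the
 * mask passed into the function (bit i set means lane i is active). */
static void f_two_blends(const float a[LANES], const float b[LANES],
                         unsigned mask, float out[LANES]) {
    assert((mask >> LANES) == 0);
    for (int i = 0; i < LANES; ++i) {
        float sum = a[i] + b[i];
        float blended = (a[i] < b[i]) ? sum : a[i];       /* blend 1 */
        out[i] = (mask & (1u << i)) ? blended : a[i];     /* blend 2 */
    }
}
```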

Need parser support for 64-bit constants

Currently integer constants are treated as int32 types and floating-point constants are treated as float32s. The parser should be extended to support 64-bit int and double constants as well, though the type promotion code should probably be updated so that an ostensibly double-precision constant like "2.0" doesn't cause type promotion to double if everything else in the expression is float32 (vs "2.0f"). At minimum, there should be a warning if this does happen.
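For reference, this is how C treats the same situation: an unsuffixed floating constant is a double, so mixing it with a float promotes the whole expression to double. The sketch below just checks expression types via sizeof (assuming float and double have different sizes, as on common platforms); the function names are hypothetical.

```c
#include <assert.h>

/* Checks the C promotion rules that the proposed ispc behaviour would
 * deliberately differ from. */
static void check_constant_promotion(void) {
    /* float * float constant: the expression has type float. */
    assert(sizeof(1.0f * 2.0f) == sizeof(float));
    /* float * unsuffixed constant: the expression has type double. */
    assert(sizeof(1.0f * 2.0) == sizeof(double));
}

/* Stays in single precision. */
static float halve_float(float x) { return x * 0.5f; }

/* Silently computed in double precision because of the "0.5" constant. */
static double halve_promoted(float x) { return x * 0.5; }
```

On a SIMD target the promoted version halves the number of values per vector, which is the reason for wanting at least a warning.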

More general support for in-memory int8/int16 data

(The following would be superseded by first-class support for int8 and int16 types in the language)

The routines for loading from/storing to int8/16 types in memory only support the equivalent of unaligned vector loads. We should also have support for loading just a single value from memory as well as for scattering/gathering to them.

All of the existing optimization infrastructure for distinguishing between actual scatters/gathers vs loads/stores should just directly apply to this.

Add support for 64-bit types in the standard library

Currently there's no support for doubles and 64-bit int types in the standard library. It would be good to provide the same range of functionality for them that is there for 32-bit types:

min/max/clamp
sqrt / rsqrt
reduce_*
transcendentals
abs

When implemented, we also need a whole bunch of tests for those...

Language compatibility with C / scalar code option

It might be nice to be able to emit scalar code (or straight up C code), for compatibility with targets that don't have ispc backends. There are a few options:

Have a set of #defines that turn ispc code into valid C code (e.g. "#define uif if", "#define programCount 1", etc.)
- +s: easy, just works
- -s: can't do this with the current language syntax (e.g. "launch < foo(); >"). Fix language syntax?
Use the LLVM C backend.
- Need to fix bugs in that when given vector code (probably should do that in any case)
- May want a scalar target for ispc anyway, which presumably would play well with LLVM's C backend.
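The first option could look something like the C sketch below. The set of macros is an assumption for illustration (only "uif" and "programCount" are named above); the function body is written in ispc style and compiles as ordinary C once the qualifiers are defined away, with a single program instance.

```c
#include <assert.h>

/* Hypothetical compatibility macros: run ispc-style code as scalar C
 * with exactly one program instance. */
#define uniform
#define varying
#define export
#define uif if
#define programCount 1
#define programIndex 0

/* ispc-style source, unchanged apart from being compiled as C. */
export void scale(uniform float a[], uniform int n, uniform float s) {
    assert(n >= 0);
    for (uniform int i = 0; i < n; i += programCount) {
        varying int idx = i + programIndex;
        uif (idx < n)
            a[idx] *= s;
    }
}
```

As the drawback above notes, this only works for constructs whose syntax survives the C preprocessor; the current task launch syntax does not.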

Values of threadIndex and threadCount from tasks_concrt.cpp and tasks_gcd.cpp are bogus

(Specifically, the versions in examples/mandelbrot_tasks).

The task system is supposed to pass two integers into the task function's code when it runs the function: threadCount, the number of hardware threads available to run tasks, and threadIndex, an index from 0 to threadCount-1 that uniquely identifies the HW thread that is actually running the task.

With pthreads, this is easy, since we just assign each HW thread a number from 0 to n-1. With MSFT Concurrency Runtime and Apple's GCD, the task system abstracts away the notion of how many threads are running as well as which thread a given task is running on, so the ispc task code that uses them in mandelbrot_tasks just passes bogus values.

One option would be to have a little map between HW thread numbers and integers starting at zero.

Alternatively, maybe we need to kill threadIndex and threadCount and have a different enabling mechanism for private thread-specific data.

Add support for JIT compilation

It would be good to add support for JIT/runtime compilation to ispc. This is a significant project.

Memory management: ispc currently makes almost no effort to free dynamically allocated memory.
- Option 1: a thorough pass through the code to implement destructors that delete any memory that was dynamically allocated may be in order
- Option 2: (probably better) rewrite the system so that all dynamic allocation during compilation is done from a memory arena, so that at the end of compilation we can just free the entire arena. The arena should be supplied via user-supplied memory allocation callbacks, for ease of embedding in complex apps.
Memory management: we also need to free as much of the LLVM-allocated memory as possible, while retaining the final generated code. Search the LLVM mailing lists from ~a year ago for postings from Larry Gritz asking about (and finding a solution to) this issue for the osl project.
API: we need an API for parameter discovery of the functions that are compiled
** This should also include some functionality for function specialization
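To give a feel for Option 2, here is a minimal arena sketch in C with user-supplied allocation callbacks. All names are hypothetical and the design is an assumption, not a proposal for the actual ispc code: memory is carved out of large blocks, individual allocations are never freed, and everything is released in one call at the end of compilation. It is not thread-safe and uses a fixed 16-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*arena_alloc_fn)(size_t size, void *user);
typedef void (*arena_free_fn)(void *ptr, void *user);

struct arena_block { struct arena_block *next; size_t used, capacity; };

struct arena {
    struct arena_block *blocks;   /* most recent block first */
    size_t block_size;            /* default payload size per block */
    arena_alloc_fn alloc;         /* user-supplied callbacks */
    arena_free_fn release;
    void *user;
};

#define ARENA_ALIGN ((size_t)16)

static size_t arena_round_up(size_t n) {
    return (n + ARENA_ALIGN - 1) & ~(ARENA_ALIGN - 1);
}

static void arena_init(struct arena *a, size_t block_size,
                       arena_alloc_fn alloc, arena_free_fn release, void *user) {
    assert(a != NULL && alloc != NULL && release != NULL && block_size > 0);
    a->blocks = NULL;
    a->block_size = arena_round_up(block_size);
    a->alloc = alloc;
    a->release = release;
    a->user = user;
}

/* Bump-allocate `size` bytes; returns NULL if the callback fails. */
static void *arena_allocate(struct arena *a, size_t size) {
    size_t header = arena_round_up(sizeof(struct arena_block));
    struct arena_block *b = a->blocks;
    void *p;

    size = arena_round_up(size ? size : 1);
    if (b == NULL || b->capacity - b->used < size) {
        size_t capacity = size > a->block_size ? size : a->block_size;
        b = a->alloc(header + capacity, a->user);
        if (b == NULL) return NULL;
        b->next = a->blocks;
        b->used = 0;
        b->capacity = capacity;
        a->blocks = b;
    }
    p = (char *)b + header + b->used;
    b->used += size;
    return p;
}

/* Free every block at once, e.g. at the end of compilation. */
static void arena_release_all(struct arena *a) {
    while (a->blocks != NULL) {
        struct arena_block *next = a->blocks->next;
        a->release(a->blocks, a->user);
        a->blocks = next;
    }
}

/* Example callbacks: malloc/free plus a live-block counter in `user`. */
static void *counting_malloc(size_t size, void *user) {
    ++*(int *)user;
    return malloc(size);
}

static void counting_free(void *ptr, void *user) {
    --*(int *)user;
    free(ptr);
}
```

An embedding application would pass its own callbacks in place of counting_malloc and counting_free, and every compiler-side allocation would go through arena_allocate.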