intellabs / t2sp

Productive and portable performance programming across spatial architectures (FPGAs, etc.) and vector architectures (GPUs, etc.)

License: Other


dsl language compiler fpga gpu productivity portability performance systolic-arrays

t2sp's Introduction

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

T2SP (Temporal To Spatial Programming, previously called T2S) enables software programmers to build systolic arrays for dense tensor computes with portable performance across spatial architectures (like FPGAs) and vector architectures (like GPUs) in a constructive way.

T2SP is available under a permissive license, the BSD+Patent license.

Currently, we support only Intel FPGAs and GPUs. We assume your device is local to you or within Intel DevCloud, and that the operating system is Linux. We have tried Ubuntu 18.04 and CentOS 7.9, but our system is not tied to any specific Linux distribution or version. Other platforms might also work, although they have not been tested.

Our newest paper, Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs (to appear in FCCM 2023), is currently a separate project released at pku-liang/Lasa.

[DevCloud] Open an account (once)

Register at Intel's FPGA DevCloud. This enables access to both the FPGAs and the GPUs in the cloud. Currently, the cloud offers Arria 10 and Stratix 10 FPGAs, and GEN 9.5 (Intel UHD Graphics P630) and GEN 12 (Intel Iris Xe MAX Graphics) GPUs.
Follow the instructions in the approval email to set up your connection to DevCloud.
Connect to DevCloud. You are now at the head node, named login-2.

Add the following to your .bashrc:

```
if [ -f /data/intel_fpga/devcloudLoginToolSetup.sh ]; then
    source /data/intel_fpga/devcloudLoginToolSetup.sh
fi
```

Then

```
source .bashrc
```

Clone T2SP (once)

```
git clone https://github.com/IntelLabs/t2sp
```

Install tools (once)

[DevCloud] From the head node, submit a job with one of the following commands, based on the type of device you will use:

```
# For Arria 10 FPGA
qsub -q batch@v-qsvr-fpga -l nodes=arria10:ppn=2 -d $HOME/t2sp $HOME/t2sp/install-tools.sh

# For Stratix 10 FPGA
qsub -q batch@v-qsvr-fpga -l nodes=darby:ppn=2 -d $HOME/t2sp $HOME/t2sp/install-tools.sh

# For GEN 9.5 GPU
qsub -l nodes=1:gen9:ppn=2 -d $HOME/t2sp $HOME/t2sp/install-tools.sh

# For GEN 12 GPU
qsub -l nodes=1:iris_xe_max:ppn=2 -d $HOME/t2sp $HOME/t2sp/install-tools.sh
```

This may take 1-5 hours on DevCloud, depending on the specific machine allocated for the job.
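The four job submissions above differ only in the queue and node selector, so they can be wrapped in a small helper. The following is a minimal sketch, not part of T2SP: the function name `t2sp_install_cmd` and the short device keywords (`a10`, `s10`, `gen9`, `gen12`) are hypothetical, while the `qsub` lines themselves are copied from the commands listed above. The helper only prints the command, so nothing is submitted by accident.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the install job command for a given device.
# The function name and device keywords are hypothetical; the qsub lines
# are copied from the commands listed in this README.
t2sp_install_cmd() {
    # $HOME is kept literal here so the printed command matches the README.
    local common='-d $HOME/t2sp $HOME/t2sp/install-tools.sh'
    case "$1" in
        a10)   echo "qsub -q batch@v-qsvr-fpga -l nodes=arria10:ppn=2 $common" ;;
        s10)   echo "qsub -q batch@v-qsvr-fpga -l nodes=darby:ppn=2 $common" ;;
        gen9)  echo "qsub -l nodes=1:gen9:ppn=2 $common" ;;
        gen12) echo "qsub -l nodes=1:iris_xe_max:ppn=2 $common" ;;
        *)     echo "unknown device: $1 (expected a10|s10|gen9|gen12)" >&2
               return 1 ;;
    esac
}

# Show the command for an Arria 10 job.
t2sp_install_cmd a10
```

To actually submit the job from the head node, one could run `eval "$(t2sp_install_cmd a10)"` after checking the printed command.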

A known issue: on a GEN 9.5 GPU machine, errors may appear while installing m4. That package is not necessary on that machine, so the errors can be ignored.

[Local machine with an FPGA or a GPU]
```
cd $HOME/t2sp
./install-tools.sh
```
[Local machine with an FPGA] Also download Intel FPGA SDK for OpenCL, and install with
```
tar -xvf AOCL-pro-*-linux.tar 
./setup_pro.sh
```

Note:

We assume your system has python >= 2.7 already installed.
The above install-tools.sh command installs llvm-clang >= 9.0, gcc >= 7.5.0, and python's numpy and matplotlib packages. It installs all of them, together with every dependency we know of, so that the system is self-contained. If your system already has some of these tools installed, you can edit install-tools.sh to disable their installation, then modify the environment setting as shown below.

Modify the environment setting (once)

The environment setting file is $HOME/t2sp/setenv.sh.

If you have your own gcc, llvm or clang, and thus did not use the above install-tools.sh command to install them, modify the following path variables in setenv.sh appropriately:

```
GCC_PATH=...
export LLVM_CONFIG=...
export CLANG=...
```

If you installed the Intel FPGA SDK for OpenCL for your local FPGA, check the following variables, and modify if needed:
```
ALTERA_PATH=...
AOCL_VERSION=...
FPGA_BOARD_PACKAGE=...
export FPGA_BOARD=...
export LM_LICENSE_FILE=...
```
Here is an example of how to find the board package and board (assuming Intel FPGA SDK for OpenCL 19.1 was installed under directory $HOME/intelFPGA_pro):
```
$HOME/intelFPGA_pro/19.1/hld/bin/aoc -list-boards
   Board list:
     a10gx
       Board Package: $HOME/intelFPGA_pro/19.1/hld/board/a10_ref
  
     a10gx_hostpipe
       Board Package: $HOME/intelFPGA_pro/19.1/hld/board/a10_ref
```
There is one board package with two boards in this case, so you should set FPGA_BOARD_PACKAGE=a10_ref, and either export FPGA_BOARD=a10gx or export FPGA_BOARD=a10gx_hostpipe.
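The board and board package names can also be pulled out of the listing mechanically. The sketch below assumes the output of `aoc -list-boards` has exactly the layout shown in the example above (a board name on its own line, followed by a `Board Package:` line); the function name `list_boards` is hypothetical, and the sample text is copied from that example rather than captured from a live tool.

```shell
#!/usr/bin/env bash
# Sample text copied from the example in this README; on a real system one
# would pipe the output of `aoc -list-boards` into list_boards instead.
sample_output='   Board list:
     a10gx
       Board Package: $HOME/intelFPGA_pro/19.1/hld/board/a10_ref

     a10gx_hostpipe
       Board Package: $HOME/intelFPGA_pro/19.1/hld/board/a10_ref'

# Print "<board> <package-name>" for every board in the listing.
# Assumes each board name is a single word on its own line.
list_boards() {
    awk '
        /Board list:/    { next }
        /Board Package:/ { n = split($NF, parts, "/"); print board, parts[n]; next }
        NF == 1          { board = $1 }
    '
}

printf '%s\n' "$sample_output" | list_boards
```

The first column is a candidate value for FPGA_BOARD and the second for FPGA_BOARD_PACKAGE.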

Open a terminal on a compute node

[DevCloud] From the head node, log into a compute node:

FPGA:
```
  devcloud_login
```
Choose
```
6) Enter Specific Node Number
```
Enter the name of a node with Arria 10 Release 1.2.1, or with Stratix 10.

GPU: to request a compute node with GEN 9.5 or GEN 12,

```
# GEN 9.5
qsub -I -l nodes=1:gen9:ppn=2

# GEN 12
qsub -I -l nodes=1:iris_xe_max:ppn=2
```

[Local] Open a bash shell

For all the steps below, we assume you are either on a compute node of DevCloud or on a local machine, unless explicitly stated otherwise.

Set up the environment (whenever a terminal is open)

```
cd $HOME/t2sp
source ./setenv.sh (devcloud|local) (fpga|gpu)
```

The options specify whether you are working on DevCloud or locally, and whether to use an FPGA or a GPU.

Build T2SP (whenever you change the source code)

```
cd $HOME/t2sp/Halide
make -j
```

Regression tests

Currently, the regression tests are for FPGAs only. On a machine with an FPGA,

```
cd $HOME/t2sp/t2s/tests/correctness
./test.sh
```

After the testing, each sub-directory there will contain a success.txt and/or a failure.txt, which list the command lines for compiling and running every test. These tests are small examples one can play with.
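To get a quick overview after a regression run, the result files can be counted per sub-directory. This is a minimal sketch under the assumption that success.txt and failure.txt sit directly inside each sub-directory, as described above; the function name `summarize_tests` is hypothetical and not part of test.sh.

```shell
#!/usr/bin/env bash
# Count the sub-directories of a test root that contain success.txt
# and/or failure.txt. The root defaults to the current directory.
summarize_tests() {
    local root="${1:-.}" pass=0 fail=0 dir
    for dir in "$root"/*/; do
        # Skip the unexpanded glob when there are no sub-directories.
        [ -d "$dir" ] || continue
        if [ -f "${dir}success.txt" ]; then pass=$((pass + 1)); fi
        if [ -f "${dir}failure.txt" ]; then fail=$((fail + 1)); fi
    done
    echo "with success.txt: $pass, with failure.txt: $fail"
}
```

For example, `summarize_tests "$HOME/t2sp/t2s/tests/correctness"` prints both counts for the regression directory. Because a sub-directory may hold both files, the two counts can overlap.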

To remove all the temporary files generated during the regression testing:

```
./test.sh clean
```

Performance tests

The current release contains SGEMM, 2-D convolution and Capsule convolution on Arria 10 FPGA and GEN 9.5 GPU. For every kernel, we write a single specification that gets mapped to the different kinds of hardware. This reflects our concept of "write a kernel once, and run with high performance across spatial and vector architectures".

Summary of throughput:

| Kernel | A10 | S10 | GEN 9.5 | GEN 12 |
| --- | --- | --- | --- | --- |
| SGEMM | 620 GFLOPS, 97% DSP efficiency | 1790 GFLOPS, 99% DSP efficiency | 410 GFLOPS, 90% machine peak | 2165 GFLOPS, 85% machine peak |
| 2-D convolution | 605 GFLOPS, 99% DSP efficiency | 1509 GFLOPS, 99% DSP efficiency | 421 GFLOPS, 92% machine peak | 2236 GFLOPS, 88% machine peak |
| Capsule convolution | 568 GFLOPS, 96% DSP efficiency | 885 GFLOPS, 56% DSP efficiency | 398 GFLOPS, 87% machine peak | 1850 GFLOPS, 73% machine peak |
| PairHMM | 41.8 GCups, 95% PE efficiency | 47.9 GCups, 93% PE efficiency | 4.25 GCups | 14.8 GCups |

To reproduce the performance,

```
cd $HOME/t2sp/t2s/tests/performance
```

then

[DevCloud head node] Submit a job:

```
# Test all kernels
./devcloud-jobs.sh (a10|gen9)

# Or test 1 kernel
./devcloud-job.sh (gemm|conv|capsule) (a10|gen9) (tiny|large) (hw|emulator)
```

[A DevCloud compute node, or a local machine] Use the pre-generated bitstreams:

```
# By default, files *.aocx are excluded. You can pull all the files:
git lfs pull --include="*.aocx" --exclude=""

# Or a specific file for test (e.g., gemm on A10):
git lfs pull --include="t2s/tests/performance/gemm/bitstream/a10/a.aocx" --exclude=""

# Test all kernels
./tests.sh (devcloud|local) (a10|s10) bitstream

# Or test 1 kernel
./test.sh (devcloud|local) (gemm|conv|capsule|pairhmm) (a10|s10) (tiny|large) (hw|emulator) bitstream
```

[A DevCloud compute node, or a local machine] Test directly:

```
# Test all kernels
./tests.sh (devcloud|local) (a10|gen9)

# Or test 1 kernel
./test.sh (devcloud|local) (gemm|conv|capsule) (a10|gen9) (tiny|large) (hw|emulator)
```

Note:

The emulator option is applicable only to FPGAs and tiny size.
Synthesis of an FPGA design takes hours, so on DevCloud we recommend submitting a job for testing on FPGAs.
As for the results, look for the synthesis report of an FPGA design in KERNEL/a/reports/report.html. Here KERNEL is gemm, conv, etc.
Look for the performance of an FPGA design in a roofline model that is automatically generated in KERNEL/roofline.png.
Look for the performance of a GPU design from the standard output.
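The restriction on the emulator option can be checked before launching a long run. The following sketch encodes only the rule stated in the note above (emulator applies to FPGA targets and the tiny size); the function name `check_test_args` is hypothetical, and treating `a10` and `s10` as the FPGA targets is an assumption based on the target names used in the commands above.

```shell
#!/usr/bin/env bash
# Reject argument combinations that the note rules out.
# Arguments: <target> <size> <mode>, e.g. "a10 tiny emulator".
check_test_args() {
    local target="$1" size="$2" mode="$3"
    if [ "$mode" = "emulator" ]; then
        case "$target" in
            a10|s10) ;;  # assumed to be the FPGA targets
            *) echo "emulator applies only to FPGA targets" >&2
               return 1 ;;
        esac
        if [ "$size" != "tiny" ]; then
            echo "emulator applies only to the tiny size" >&2
            return 1
        fi
    fi
    return 0
}
```

A possible use is `check_test_args a10 tiny emulator && ./test.sh local gemm a10 tiny emulator`, so that an invalid combination fails immediately instead of partway through a run.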

Features

The current release contains the following features:

Expressing systolic arrays: UREs (uniform recurrence equations) and space-time transforms are supported for expressing systolic arrays in general. Currently, a space-time transform must be unimodular.

Defining an abstract, performance-portable memory hierarchy: A memory hierarchy is defined for each tensor by streaming the tensor across DRAM, SRAM, and registers. The memory hierarchy is then specialized by the compiler for specific hardware with portable performance.

Isolation: Split a compute into spatial pieces, so that each piece can be optimized individually.

Data optimizations: Data gathering, scattering, double buffering, serialization and de-serialization.

Loop optimizations: Loop flattening, removal, unrolling, vectorization.
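As a reminder of what the unimodularity requirement on space-time transforms means: a square integer matrix is unimodular when its determinant is +1 or -1, so its inverse is also an integer matrix and it maps integer loop iterations one-to-one onto integer space-time points. The worked example below is illustrative only; the loop names i and j and the matrix T are assumptions chosen for the example, not taken from a T2SP kernel.

```latex
\[
T = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix},
\qquad
\begin{pmatrix} s \\ t \end{pmatrix}
= T \begin{pmatrix} i \\ j \end{pmatrix}
= \begin{pmatrix} i \\ i + j \end{pmatrix},
\qquad
\det T = 1 \cdot 1 - 0 \cdot 1 = 1,
\qquad
T^{-1} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}.
\]
```

Here iteration (i, j) is placed at space coordinate s = i and time step t = i + j. By contrast, a matrix with rows (1, 0) and (0, 2) has determinant 2, so it is not unimodular: it would leave every odd time step unused.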

Tutorials

A 10-minute video introduces the basic concept of T2SP. There is an initial version of a programming guide. There is also a set of tutorials at DevCloud.

Citation

If you use T2SP, please cite the following position paper:

@article{T2SP,
  author    = {Hongbo Rong},
  title     = {Programmatic Control of a Compiler for Generating High-performance Spatial Hardware},
  journal   = {CoRR},
  volume    = {abs/1711.07606},
  year      = {2017},
  url       = {http://arxiv.org/abs/1711.07606},
  archivePrefix = {arXiv},
  eprint    = {1711.07606},
  timestamp = {Mon, 13 Aug 2018 16:46:47 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1711-07606.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  note      = {Open source available at https://github.com/IntelLabs/t2sp}
}

Publications

SuSy: a programming model for productive construction of high-performance systolic arrays on FPGAs. Yi-Hsiang Lai, Hongbo Rong, Size Zheng, Weihao Zhang, Xiuping Cui, Yunshan Jia, Jie Wang, Brendan Sullivan, Zhiru Zhang, Yun Liang, Youhui Zhang, Jason Cong, Nithin George, Jose Alvarez, Christopher Hughes, and Pradeep Dubey. ICCAD, 2020.
T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations. Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, David Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff Lowney, Adam Herr, Christopher Hughes, Timothy Mattson, and Pradeep Dubey. FCCM, 2019.

Acknowledgement


t2sp's Issues

QRD failed in compilation

There are several warnings, and I am not sure whether they matter. In the end, an error is thrown.

cd t2s/tests/performance/qrd
g++ qrd-mgs-batch.cpp -g -I ../util  -I ../../../../Halide/include -L ../../../../Halide/bin $EMULATOR_LIBHALIDE_TO_LINK -lz -lpthread -ldl -std=c++11 -DVERBOSE_DEBUG
env HL_DEBUG_CODEGEN=4 PRAGMAUNROLL=1 BITSTREAM="${HOME}/tmp/a.aocx" CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=1 INTEL_FPGA_OCL_PLATFORM_NAME="$EMULATOR_PLATFORM" AOC_OPTION="$EMULATOR_AOC_OPTION -board=${FPGA_BOARD} -emulator-channel-depth-model=strict " ./a.out
...
Warning: Failed to serialize an input in function AFeeder: path condition to     ....                                                                                                      
Warning: Failed to serialize an output in function QCollector: path condition to ....
...
 Combining channels ...
User error triggered at /home/u89062/a10-intelLabs-t2sp-old/Halide/../t2s/src/CombineChannels.cpp:341
Warning: (at ./qrd-mgs-batch.cpp:107) Failed to combine channels Q.channel and R.channel in function vec_a: Path conditions differ                                                      
    (vec_a.s0.j == vec_a.s0.i) vs. (vec_a.s0.i < min(vec_a.s0.j, 127))
...
Late fuse...
...
terminate called after throwing an instance of 'Halide::InternalError'
what():  Internal Error at /home/u89062/a10-intelLabs-t2sp-old/Halide/../t2s/src/Utilities.cpp:103 triggered by user code at : Condition failed: ends_with(str, postfix):

Cannot run T2SP emulation on local machine

I am trying to run the AOT regression cases on a local machine after installing T2SP, and I got the following errors:

g++ gemm-generate.cpp -g -I ../util -I ../../../../Halide/include -L ../../../../Halide/bin -lHalide -lz -lpthread -ldl -std=c++11
env BITSTREAM=b.aocx AOC_OPTION="-march=emulator -board=a10gx -emulator-channel-depth-model=strict " ./a.out

g++ gemm-run.cpp host.cpp ../../../src/AOT-OpenCL-Runtime.cpp ../../../src/SharedUtilsInC.cpp -g -DLINUX -DALTERA_CL -fPIC -I../../../src/ -I ../../../../Halide/include -I/work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/examples_aoc/common/inc /work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/examples_aoc/common/src/AOCLUtils/opencl.cpp /work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/examples_aoc/common/src/AOCLUtils/options.cpp -I/work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/host/include -L/work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/linux64/lib -L/linux64/lib -L/work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/host/linux64/lib -lOpenCL -L ../../../../Halide/bin -lelf -lHalide -lz -lpthread -ldl -std=c++11

env CL_CONTEXT_EMULATOR_DEVICE_INTELFPGA=1 INTEL_FPGA_OCL_PLATFORM_NAME="Intel(R) FPGA Emulation Platform for OpenCL(TM)" BITSTREAM=b.aocx ./a.out

ERROR: UNRECOGNIZED ERROR CODE (-1001)
Location: /work/shared/common/CAD_tool/Intel/intelFPGA_pro//19.4/hld/examples_aoc/common/src/AOCLUtils/opencl.cpp:297
Query for number of platforms failed

Similarly, if I run realize() directly on the FPGA target (using emulation mode), I get the following errors from JITModule:

CL: halide_opencl_init_kernels (user_context: 0x0, state_ptr: 0x7ff98ff7b000, program: 0x7ff98ff76280, size: 10331
    load_libopencl (user_context: 0x0)
    Loaded OpenCL runtime library: libOpenCL.so
halide_acquire_cl_context 
    create_opencl_context (user_context: 0x0)
Error: CL: clGetPlatformIDs failed: <Unknown error> -1001
Aborted (core dumped)

@ronghongbo @haoxiaochen
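For context on the error code in this report: -1001 is CL_PLATFORM_NOT_FOUND_KHR, which the OpenCL ICD loader returns from clGetPlatformIDs when it finds no registered platform. On Linux, the Khronos ICD loader discovers platforms through .icd files, conventionally under /etc/OpenCL/vendors. The sketch below lists those files as a first diagnostic step. It is an assumption that this is the cause on the reporter's machine, and the function name `list_icds` is hypothetical.

```shell
#!/usr/bin/env bash
# List the OpenCL ICD registration files in a directory and show the
# library each one points to. Returns 1 if no .icd file is found.
list_icds() {
    local dir="${1:-/etc/OpenCL/vendors}" f found=0
    for f in "$dir"/*.icd; do
        # Skip the unexpanded glob when the directory has no .icd files.
        [ -f "$f" ] || continue
        found=1
        echo "$f -> $(cat "$f")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no .icd files in $dir"
        return 1
    fi
}
```

Running `list_icds` with no argument checks the conventional directory. If nothing is listed, or no entry refers to the Intel FPGA emulation runtime, the loader has no platform to report, which would be consistent with the failure above.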

capsule: annotation affects correctness

Check out the current release, and test on a devcloud gen9 machine with tiny input (Follow README.md for how to log onto a GEN9 machine):

u89062@s001-n141:~/gen9-xiaochen-t2sp/t2s/tests/performance$ ./test.sh devcloud capsule gen9 tiny hw
...
capsule-run-gpu.out: capsule-run-gpu.cpp:64: void check_correctness(float*, float*, float*): Assertion `fabs(golden - V[o_0 + SIZE_O_0 * o_1]) < 0.005*fabs(golden)' failed.

However, just by changing two annotations, it passes the test:

--- a/t2s/tests/performance/capsule/capsule.cpp
+++ b/t2s/tests/performance/capsule/capsule.cpp
@@ -48,7 +48,7 @@ int main(void)

     // Inputs
 #ifdef GPU
-    ImageParam P("P", TTYPE, 2), W("W", TTYPE, 2);
+    ImageParam P("I", TTYPE, 2), W("K", TTYPE, 2);

This issue seems to exist only for GPU, not for FPGA.

symbol not found (probably due to devectorization)

In t2s/tests/performance/gemm/gemm.cpp

add another URE W(P)=matrixC(total_j, total_i), add another ImageParam matrixC.
Z(P) = select(kkk == 0 && kk == 0 && k == 0, W(P), ...
W.merge_ures(X, Y, Z, Out);
W.set_bounds(...);
W.space_time_transform(...);

Follow t2s/tests/performance/gemm/README to compile it on an FPGA emulator. We will see Internal Error at /home/u128292/t2sp/Halide/src/CodeGen_LLVM.cpp:1465 triggered by user code at : Symbol not found: W.s0.kkk

IR:

  .......
  Z.shreg.temp() = ... (float32)read_shift_reg("W.shreg", W.s0.jjj, W.s0.iii, W.s0.kkk) ...
  unrolled (W.s0.kkk, 0, 4) {... }

Note the operand 'read_shift_reg("W.shreg", W.s0.jjj, W.s0.iii, W.s0.kkk)' is before the kkk loop, but it refers to kkk. This looks like an issue with devectorization.

clang-9: error: linker command failed with exit code 1 (use -v to see invocation)

Hi, I ran into a serious problem when running make in /t2sp/t2s/preprocessor/src.
I first logged into a gen9 compute node on DevCloud and ran setenv.sh. Then I went to /t2sp/t2s/preprocessor/src and ran make, and this error occurred: clang-9: error: linker command failed with exit code 1 (use -v to see invocation).
Changing -std=c++17 to c++11 did not help.

Bare bones GEMM

Segmentation fault (core dumped) when trying to compile bare bones T2X GEMM.
gemm.cpp.txt

Commands used to build:

# Replace gemm.cpp with gemm.cpp.txt attached 
cd ~/t2s/tests/performance/gemm
cp ~/gemm.cpp.txt ./gemm.cpp

# Compile for emulation using the following commands
source ../../../../setenv.sh devcloud
g++ gemm.cpp -g -I ../util -I $T2S_PATH/Halide/include -L $T2S_PATH/Halide/bin $EMULATOR_LIBHALIDE_TO_LINK -lz -lpthread -ldl -std=c++11 -DTINY
env BITSTREAM=a.aocx AOC_OPTION="$EMULATOR_AOC_OPTION -board=$FPGA_BOARD -emulator-channel-depth-model=strict" ./a.out

# received the following output
# Segmentation fault (core dumped)

Intermediate results from systolic array is not vectorized

gemm: C = alpha * A @ B + beta * C
intermediate result of alpha * A @ B is currently passed to the next kernel as individual scalars, instead of being vectorized
