dsa-docs's Issues
Adding task features in stream dataflow
Microbenchmarks
- Simple task-parallel program
Read array A and copy it to array B on three premapped cores, with core IDs 0, 1, and 2.
Ordering between A's and B's elements is maintained.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}
DFG
Input: A
B = A
Output: B
TaskProp[0]: (coreMask=0111)
Stream code
if tid < 3:
    SS_LINEAR_READ(&array[tid * N / 3], N / 3, P_A);
    SS_LINEAR_WRITE(P_B, N / 3, &output[tid * N / 3]);
SS_GLOBAL_WAIT(3);
-
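The per-core slicing above can be modeled in plain Python. This is only a sketch: `linear_copy` is a hypothetical name, the loop over `tid` stands in for the SS_LINEAR_READ/SS_LINEAR_WRITE streams issued on each premapped core, and an array length divisible by the core count is assumed.

```python
def linear_copy(array, n_cores=3):
    """Model of the single-task linear copy: each premapped core tid
    copies its own N/n_cores slice, so element order is preserved."""
    n = len(array)
    assert n % n_cores == 0          # assume N divisible by the core count
    out = [None] * n
    for tid in range(n_cores):       # one premapped core per tid
        lo = tid * n // n_cores      # models &array[tid * N / n_cores]
        for i in range(lo, lo + n // n_cores):
            out[i] = array[i]        # B = A (identity DFG)
    return out

print(linear_copy([0, 1, 2, 3, 4, 5]))  # -> [0, 1, 2, 3, 4, 5]
```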
- Heterogeneous cores
Copy the first half of array A to array B in parallel on two premapped cores, with core IDs 0, 1.
Copy the second half of array A, plus 1, to array B in parallel on two premapped cores, with core IDs 2, 3.
Ordering between A's and B's elements is maintained.
Example
- Heterogeneous cores
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}
DFG
Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
Stream code
if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 4], N / 4, P_A);
    SS_LINEAR_WRITE(P_B, N / 4, &output[(tid - 2) * N / 4]);
else:
    SS_LINEAR_READ(&array2[tid * N / 4], N / 4, P_A2);
    SS_LINEAR_WRITE(P_B2, N / 4, &output2[tid * N / 4]);
SS_GLOBAL_WAIT(4);
-
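The two heterogeneous tasks can be modeled in Python as well. This sketch follows the prose description and the example data rather than the stream code; `hetero_copy` is a hypothetical name.

```python
def hetero_copy(a, n_cores=4):
    """Task 0 (cores 0, 1): B = A on the first half.
    Task 1 (cores 2, 3): B2 = Add(A2, 1) on the second half."""
    n = len(a)
    half = n // 2
    out = [None] * n
    for tid in range(n_cores):
        if tid < 2:                               # task 0: plain copy
            lo = tid * half // 2
            for i in range(lo, lo + half // 2):
                out[i] = a[i]
        else:                                     # task 1: add one
            lo = half + (tid - 2) * half // 2
            for i in range(lo, lo + half // 2):
                out[i] = a[i] + 1
    return out

print(hetero_copy([0, 1, 2, 3]))  # -> [0, 1, 3, 4], matching array_B
```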
- Creation dependence
Read array A on two premapped cores with core IDs 2, 3 and copy their contents to array B on two premapped cores with core IDs 0, 1. Ordering between A's and B's elements is not maintained.
Example
- Creation dependence
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}
DFG
Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)
Stream code
if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
else:
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
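A minimal model of why ordering is lost under a creation dependence: the producer cores create one consumer task per element, and the consumers write B in whatever order those tasks arrive. The shuffle below is a stand-in for that nondeterministic arrival order; `creation_dep_copy` is a hypothetical name.

```python
import random

def creation_dep_copy(a, seed=None):
    """Producers (cores 2, 3) stream A and create consumer tasks;
    consumers (cores 0, 1) write B as tasks arrive, so B is a
    permutation of A rather than an ordered copy."""
    tasks = list(a)                      # tasks created by the producers
    random.Random(seed).shuffle(tasks)   # model nondeterministic arrival
    return tasks

b = creation_dep_copy([0, 1, 2, 3], seed=1)
assert sorted(b) == [0, 1, 2, 3]         # same elements, order unspecified
```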
- Near-data scheduling
Read array A on two premapped cores with core IDs 2, 3 and copy their contents to an array on two cores with core IDs 0, 1. The destination among core IDs 0, 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained.
Example
- Near-data scheduling
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}
DFG
Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)
Stream code
if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
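The `spatial=arg0` annotation means each produced value selects its consumer core from its own payload. A sketch of that near-data routing, with parity as the selector (`spatial_by_parity` is a hypothetical name):

```python
def spatial_by_parity(a):
    """Each value streamed by the producer cores is routed to the
    consumer core named by its own payload: even -> core 0, odd -> core 1."""
    core_bufs = {0: [], 1: []}       # per-consumer-core task buffers
    for v in a:                      # producer cores stream A
        core_bufs[v % 2].append(v)   # near-data destination choice
    return core_bufs

print(spatial_by_parity([0, 1, 2, 3]))  # -> {0: [0, 2], 1: [1, 3]}
```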
- Producer-consumer consumption with single operand
Read array A on two premapped cores with core IDs 2, 3 and copy their contents to array B on two premapped cores with core IDs 0, 1 in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained.
Example
- Producer-consumer consumption with single operand
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}
DFG
Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)
Stream code
if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
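The direct TaskDep with `bytes=bytes_out` releases the producer's output to the consumer in fixed-size chunks. A sketch of that handoff (`chunked_handoff` is a hypothetical name; 64-bit elements assumed, so bytes = 8 * chunk_elems):

```python
def chunked_handoff(a, chunk_elems=2):
    """Producer output is released in fixed-size chunks, one chunk per
    consumer core. Order holds within a chunk; across chunks it depends
    on consumer scheduling, which is why ordering is not maintained."""
    chunks = [a[i:i + chunk_elems] for i in range(0, len(a), chunk_elems)]
    out = []
    for chunk in chunks:            # consumer cores 0, 1, ...
        out.extend(chunk)           # B2 = A2 within the chunk
    return out, len(chunks)

b, n_chunks = chunked_handoff([0, 1, 2, 3])
print(b, n_chunks)
```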
- Producer-consumer consumption with multiple operands (single source->multiple dest)
Read array A on two premapped cores with core IDs 2, 3 and send their contents to two cores with core IDs 0, 1 in chunks of two elements per core.
Read array I on any two cores with core IDs 0, 1 and send their contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering between A's, B's, and C's elements is maintained.
Is I needed for mapping? That sounds like a limitation.
Example
- Producer-consumer consumption with multiple operands (single source->multiple dest)
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}
DFG
Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)
Stream code
if tid > 1:
    // this will produce data at (B, dummy, count, bytes_out, CHILD).
    // (dummy) will be automatically sent to (CHILD's task buffer)
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    // a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
    // When acknowledged, it will read "count" and use it as an index to read data...
    SS_INDIRECT(&array2[0], P_I, N / 2, P_C2);
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
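The SS_INDIRECT access pattern used above is a gather: indices arrive on one port and select elements from a base array. A sketch of that semantics (`ss_indirect` is a hypothetical stand-in for the intrinsic, and `array2` here is illustrative data):

```python
def ss_indirect(base, idx, n):
    """For each of the n indices arriving on the index port (P_I),
    read base[idx[i]] and push it to the destination port (P_C2)."""
    return [base[idx[i]] for i in range(n)]

array2 = [10 * v for v in range(16)]
print(ss_indirect(array2, [1, 2, 4, 8], 4))  # -> [10, 20, 40, 80]
```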
- Producer-consumer consumption with multiple operands (multiple source->multiple dest)
Read array A on either of the two cores with core IDs 2, 3 (using a dummy task) and send their contents to any two cores with core IDs 0, 1 in chunks of two elements per core.
Read array I on any two cores with core IDs 0, 1 and send their contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering between A's, B's, and C's elements is maintained.
Example
- Producer-consumer consumption with multiple operands (multiple source->multiple dest)
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}
DFG
Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// DIFFERENT BETWEEN SCHED AND INDEX: FILTERED??? CHILD is distributed among STRM_CHILD and STRM_PARENT using dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/, sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)
Stream code
if tid == -1:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid > 1:
    // this will produce data at (B, dummy, count, bytes_out, CHILD).
    // (dummy) will be automatically sent to (CHILD's task buffer)
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    // a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
    // When acknowledged, it will read "count" and use it as an index to read data...
    SS_INDIRECT(&array2[0], P_I, N / 2, P_C2);
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
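SS_REM_PORT forwards values from one port to a port on another core; here it routes each scheduled child-core ID back to the core running the producer. A sketch of that forwarding as FIFO-to-FIFO transfer (`ss_rem_port` models the intrinsic; the child-core IDs are illustrative):

```python
from collections import deque

def ss_rem_port(src_port, n, dst_port):
    """Model of SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT): forward n
    values from one port's FIFO to another core's port FIFO."""
    for _ in range(n):
        dst_port.append(src_port.popleft())

child = deque([0, 1, 0, 1])   # hypothetical scheduled child-core ids
parent = deque()
ss_rem_port(child, 4, parent)
print(list(parent))  # -> [0, 1, 0, 1]
```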
- Coarse-grained read reuse (Batch memory requests and update near-data out-of-order)
Read array A on two premapped cores with core IDs 2, 3 (using a dummy task) and send their contents to any of the four cores with core IDs 0, 1, 2, 3.
Read array A2 on the corresponding dynamic cores (IDs 0-3) and copy it to B2. Ordering between A's, A2's, and B2's elements is maintained.
Example
- Coarse-grained read reuse (Batch memory requests and update near-data out-of-order)
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}
DFG
Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)
Stream code
if tid == -1:
    SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
-
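The `type=coal` dependence batches per-core memory requests before issuing them, which is why updates can complete near data and out of order. A sketch of request coalescing (`coalesce` is a hypothetical name; requests are (address, bytes) pairs sharing one coalescing ID):

```python
def coalesce(requests):
    """Sort (addr, nbytes) requests that share a coalescing id and
    merge adjacent ones into a single batched memory request."""
    merged = []
    for addr, nbytes in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == addr:
            # request starts exactly where the previous one ends: extend it
            merged[-1] = (merged[-1][0], merged[-1][1] + nbytes)
        else:
            merged.append((addr, nbytes))
    return merged

print(coalesce([(0, 8), (8, 8), (32, 8)]))  # -> [(0, 16), (32, 8)]
```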
- Coarse-grained read reuse (Multicast data -> remote destination in-order)
Read array A on two premapped cores with core IDs 2, 3 (using a dummy task) and send their contents to any of the four cores with core IDs 0, 1, 2, 3.
Read array A2 on the corresponding dynamic cores (IDs 0-3) and send their contents to the corresponding dynamic core (ID 0 or 1) for the final copy to B2. Ordering between A's, A2's, and B2's elements is maintained.
Example
- Coarse-grained read reuse (Multicast data -> remote destination in-order)
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}
DFG
Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)
Stream code
if tid == -1:
    SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid > 1:
    SS_LINEAR_READ(&array[(tid - 2) * N / 2], N / 2, P_A);
    SS_CONST(P_bytes_in, 8, N/2);
    // inform the coalescing location about the child location
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N / 2, &output2[tid * N / 2]);
SS_GLOBAL_WAIT(4);
Notes from microbenchmarks for other options
- From example 1: data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will hang.
- From example 2: it is unclear what coreMask contributes when memory must still be distributed explicitly, as with pthreads (it should work like OpenMP).
- From example 8: there is no way to define an explicit task and its annotations, so we need a dummy TaskDep[0:1].
  The ACK edge annotation sounds like "Task 2's property".
  Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
- From example 10: can we have a non-stream-based way of handshaking? Kind of like short streams...
- From example 11: not sure whether multiple data locations add any complexity.
Workloads in the ASPLOS TaskStream paper
Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready
- Cholesky
- GCN
- DB->ML
- kNN
- SpMM