dsa-docs contributors: dkupsh, sihaoliu, tony-nowatzki, were

dsa-docs's Issues

Adding task features in stream dataflow

Microbenchmarks

  1. Simple task-parallel program
    Copy array A to array B in parallel on three premapped cores (core IDs 0, 1, and 2).
    Ordering between A's and B's elements is maintained.
    Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}

DFG

Input: A
B = A
Output: B
TaskProp[0]: (coreMask=1110)

Stream code

if tid < 3:
	SS_LINEAR_READ(&array[N/3*tid], N/3, P_A);
	SS_LINEAR_WRITE(P_B, N/3, &output[N/3*tid]);
SS_GLOBAL_WAIT(3);
    • Heterogeneous cores
      Copy the first half of array A to array B in parallel on two premapped cores (core IDs 0 and 1).
      Copy the second half of array A, incremented by 1, to array B in parallel on two premapped cores (core IDs 2 and 3).
      Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}

DFG

Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)

Stream code

if tid < 2:
	SS_LINEAR_READ(&array[N/4*tid], N/4, P_A);
	SS_LINEAR_WRITE(P_B, N/4, &output[N/4*tid]);
else:
	SS_LINEAR_READ(&array[N/2 + N/4*(tid-2)], N/4, P_A2);
	SS_LINEAR_WRITE(P_B2, N/4, &output[N/2 + N/4*(tid-2)]);
SS_GLOBAL_WAIT(4);
    • Creation dependence
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two premapped cores (core IDs 0 and 1). Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
else:
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Near-data scheduling
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two cores (core IDs 0 and 1). The destination among core IDs 0 and 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with single operand
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two premapped cores (core IDs 0 and 1) in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with multiple operands (single source -> multiple dest)
      Read array A on two premapped cores (core IDs 2 and 3) and send its contents to two cores (core IDs 0 and 1) in chunks of two elements per core.
      Read array I on any two cores (core IDs 0 and 1) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering among A's, B's, and C's elements is maintained.
      Is I needed for mapping? That sounds like a limitation.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid > 1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD)
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send it to (CHILD)
	// when acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[N/2*tid], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with multiple operands (multiple source -> multiple dest)
      Read array A on either of two cores (core IDs 2 and 3, using a dummy task) and send its contents to any two cores (core IDs 0 and 1) in chunks of two elements per core.
      Read array I on any two cores (core IDs 0 and 1) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering among A's, B's, and C's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// DIFFERENT BETWEEN SCHED AND INDEX: FILTERED??? CHILD is distributed among STRM_CHILD and STRM_PARENT using dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/,  sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid == -1:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid > 1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD)
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send it to (CHILD)
	// when acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[N/2*tid], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Coarse-grained read reuse (batch memory requests and update near-data out of order)
      Read array A on two premapped cores (core IDs 2 and 3, using a dummy task) and send its contents to any of four cores (core IDs 0, 1, 2, and 3).
      Read array A2 on the corresponding dynamic four cores (core IDs 0, 1, 2, and 3) and copy it to B2. Ordering among A's, A2's, and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)

Stream code

SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Coarse-grained read reuse (multicast data -> remote destination in order)
      Read array A on two premapped cores (core IDs 2 and 3, using a dummy task) and send its contents to any of four cores (core IDs 0, 1, 2, and 3).
      Read array A2 on the corresponding dynamic four cores (core IDs 0, 1, 2, and 3) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final copy to B2. Ordering among A's, A2's, and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)

Stream code

SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
	// inform the coalescing location about the child location
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);

Notes from microbenchmarks for other options

  1. From example 1: data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will hang.
  2. From example 2: it is unclear what coreMask contributes when memory must still be distributed explicitly, pthread-style (it should work like OpenMP).
  3. From example 8: there is no way to define an explicit task and its annotations, so we need a dummy TaskDep[0:1].
    The ACK edge annotation sounds like "Task 2's property".
    Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
  4. From example 10: can we have a non-stream-based way of handshaking? Kind of like short streams...
  5. From example 11: not sure whether multiple data locations add any complexity.

Workloads in the ASPLOS TaskStream paper

Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready

  1. Cholesky
  2. GCN
  3. DB->ML
  4. kNN
  5. SpMM
