dsa-docs contributors: dkupsh, sihaoliu, tony-nowatzki, were

dsa-docs's Issues

Adding task features in stream dataflow

Microbenchmarks

  1. Simple task-parallel program
    Copy array A to array B in parallel on three premapped cores (core IDs 0, 1, and 2).
    Ordering between A's and B's elements is maintained.
    Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}

DFG

Input: A
B = A
Output: B
TaskProp[0]: (coreMask=1110)

Stream code

if tid < 3:
	SS_LINEAR_READ(&array[N/3*tid], N/3, P_A);
	SS_LINEAR_WRITE(P_B, N/3, &output[N/3*tid]);
SS_GLOBAL_WAIT(3);
    • Heterogeneous cores
      Copy the first half of array A to array B in parallel on two premapped cores (core IDs 0 and 1).
      Copy the second half of array A, incremented by 1, to array B in parallel on two premapped cores (core IDs 2 and 3).
      Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}

DFG

Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)

Stream code

if tid < 2:
	SS_LINEAR_READ(&array[N/4*tid], N/4, P_A);
	SS_LINEAR_WRITE(P_B, N/4, &output[N/4*tid]);
else:
	SS_LINEAR_READ(&array[N/2 + N/4*(tid-2)], N/4, P_A2);
	SS_LINEAR_WRITE(P_B2, N/4, &output[N/2 + N/4*(tid-2)]);
SS_GLOBAL_WAIT(4);
    • Creation dependence
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two premapped cores (core IDs 0 and 1). Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
else:
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Near-data scheduling
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two cores (core IDs 0 and 1). The destination among core IDs 0 and 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with single operand
      Read array A on two premapped cores (core IDs 2 and 3) and copy its contents to array B on two premapped cores (core IDs 0 and 1) in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)

Stream code

if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with multiple operands (single source -> multiple dest)
      Read array A on two premapped cores (core IDs 2 and 3) and send its contents to two cores (core IDs 0 and 1) in chunks of two elements per core.
      Read array I on any two cores (core IDs 0 and 1) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering among A's, B's, and C's elements is maintained.
      Is I needed for mapping? That sounds like a limitation.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid > 1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD)
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send it to (CHILD)
	// when acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[N/2*tid], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Producer-consumer consumption with multiple operands (multiple source -> multiple dest)
      Read array A on either of two cores (core IDs 2 and 3, using a dummy task) and send its contents to any two cores (core IDs 0 and 1) in chunks of two elements per core.
      Read array I on any two cores (core IDs 0 and 1) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final addition and storage in B. Ordering among A's, B's, and C's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// DIFFERENT BETWEEN SCHED AND INDEX: FILTERED??? CHILD is distributed among STRM_CHILD and STRM_PARENT using dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/,  sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid == -1:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid > 1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD)
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send it to (CHILD)
	// when acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[N/2*tid], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Coarse-grained read reuse (batch memory requests and update near-data out of order)
      Read array A on two premapped cores (core IDs 2 and 3, using a dummy task) and send its contents to any of four cores (core IDs 0, 1, 2, and 3).
      Read array A2 on the corresponding dynamic four cores (core IDs 0, 1, 2, and 3) and copy it to B2. Ordering among A's, A2's, and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)

Stream code

SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);
    • Coarse-grained read reuse (multicast data -> remote destination in order)
      Read array A on two premapped cores (core IDs 2 and 3, using a dummy task) and send its contents to any of four cores (core IDs 0, 1, 2, and 3).
      Read array A2 on the corresponding dynamic four cores (core IDs 0, 1, 2, and 3) and send its contents to the corresponding dynamic core (ID 0 or 1) for the final copy to B2. Ordering among A's, A2's, and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)

Stream code

SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
if tid > 1:
	SS_LINEAR_READ(&array[N/2*(tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
	// inform the coalescing location about the child location
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2*tid]);
SS_GLOBAL_WAIT(4);

Notes from microbenchmarks for other options

  1. From example 1: data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will hang.
  2. From example 2: it is unclear what coreMask contributes when memory must still be distributed explicitly, pthread-style (it should work like OpenMP).
  3. From example 8: there is no way to define an explicit task and its annotations, so we need a dummy TaskDep[0:1].
    The ACK edge annotation sounds like "Task 2's property".
    Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
  4. From example 10: can we have a non-stream-based way of handshaking? Kind of like short streams...
  5. From example 11: not sure whether multiple data locations add any complexity.

Workloads in the ASPLOS TaskStream paper

Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready

  1. Cholesky
  2. GCN
  3. DB->ML
  4. kNN
  5. SpMM
