Comments (4)
from yask.
I see, I understand your point. Basically, in cache mode, temporal blocking allows the temporally blocked region to stay in MCDRAM when the input is larger than MCDRAM, bringing the achieved performance closer to what can be achieved with smaller problem sizes that fully fit in MCDRAM.
However, I believe this is different from the traditional way of taking advantage of temporal blocking, which is to speed up memory-bound stencils beyond the roofline imposed by the external memory bandwidth. I am actually more interested in maximizing the performance of a given stencil, with a size that fits in MCDRAM (512^3 or 1024^3), and since the reported performance seems to be lower than what I expected, I thought maybe I should use temporal blocking to see if the performance can be improved any further.
I'd really appreciate it if you could shed some light on the following issues:
- I seem to be encountering a discrepancy in the reported GFLOPS and points per second for the 3axis stencil. From what I understand, the 3axis stencil is the standard 3D star-shaped stencil that has (6 * radius + 1) points, and a computation intensity of (12 * radius + 1) FLOP per point (6 * radius + 1 FPMUL and 6 * radius FPADD). Hence, for this stencil, I expect the reported GFLOPS to be equal to (points/sec * 13) for radius=1, and (points/sec * 25) for radius=2. However, the numbers reported by YASK seem to be a lot lower than this:
radius=1, size=512^3:
────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step: 134.218M
num-writes-per-step: 134.218M
num-est-FP-ops-per-step: 1.07374G
num-steps-done: 50
elapsed-time (sec): 0.390961
throughput (num-points/sec): 17.1651G
throughput (est-FLOPS): 137.321G
throughput (num-writes/sec): 17.1651G
time in halo exch (sec): 0 (0%)
────────────────────────────────────────────────────────────
radius=2, size=512^3:
────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step: 134.218M
num-writes-per-step: 134.218M
num-est-FP-ops-per-step: 2.01327G
num-steps-done: 50
elapsed-time (sec): 0.336311
throughput (num-points/sec): 19.9544G
throughput (est-FLOPS): 299.316G
throughput (num-writes/sec): 19.9544G
time in halo exch (sec): 0 (0%)
────────────────────────────────────────────────────────────
Specifically, the conversion ratio in YASK seems to be 8 for radius=1 and 15 for radius=2. Can you please elaborate on how the reported GFLOPS figure is calculated here?
- Should I expect temporal blocking to improve performance at all for the 3axis stencil when the grid already fits in MCDRAM?
Edit: I realized that the coefficients in YASK are shared, so the FLOP per point of the default 3axis stencil differs from what I had in mind. I therefore withdraw the first issue.
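For anyone hitting the same confusion, the arithmetic can be checked with a short sketch. It assumes that "shared" means one coefficient per distance from the center (so radius + 1 multiplications in total); that reading is my assumption, not taken from the YASK source, but it reproduces the ratios of 8 and 15 in the logs above. The function name `flops_per_point` is made up for this example.

```python
def flops_per_point(radius: int, shared_coefficients: bool) -> int:
    """Estimated FLOP per output point for a 3D star-shaped stencil."""
    points = 6 * radius + 1          # center plus 2*radius neighbors per axis
    adds = points - 1                # summing `points` terms needs points - 1 additions
    if shared_coefficients:
        # Assumption: one coefficient per distance 0..radius from the center.
        muls = radius + 1
    else:
        # One distinct coefficient per stencil point.
        muls = points
    return adds + muls


# (num-est-FP-ops-per-step, num-points-per-step) copied from the logs above.
reported = {
    1: (1.07374e9, 134.218e6),
    2: (2.01327e9, 134.218e6),
}

for radius, (flops_per_step, points_per_step) in reported.items():
    ratio = flops_per_step / points_per_step
    print(
        f"radius={radius}: "
        f"unshared={flops_per_point(radius, False)}, "
        f"shared={flops_per_point(radius, True)}, "
        f"reported ratio={ratio:.2f}"
    )
```

With unshared coefficients the count is 12 * radius + 1 (13 and 25), while the shared-coefficient count is 7 * radius + 1 (8 and 15), which matches the reported est-FLOPS divided by points per second.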
from yask.
As currently implemented in YASK, I have not seen a case where temporal blocking improves performance for problems that fit inside MCDRAM. In theory, you might expect that a temporal-block size approximately equal to the combined size of the L2 caches would improve performance, but I have not been able to produce that effect.
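To give a rough sense of what "approximately the combined size of the L2 caches" means in grid points, here is a back-of-the-envelope sketch. The aggregate L2 size (32 MiB), the 4-byte element size, and the two-grid working set (one read, one written) are all placeholder assumptions chosen for illustration, not YASK defaults or a specification of any particular processor; `max_cubic_block_edge` is a hypothetical helper.

```python
def max_cubic_block_edge(total_l2_bytes: int,
                         bytes_per_element: int = 4,
                         num_grids: int = 2) -> int:
    """Largest edge n such that num_grids cubes of n^3 elements fit in total_l2_bytes."""
    max_points = total_l2_bytes // (bytes_per_element * num_grids)
    # Start from the floating-point cube root, then correct for rounding error.
    edge = int(round(max_points ** (1.0 / 3.0)))
    while edge ** 3 > max_points:
        edge -= 1
    while (edge + 1) ** 3 <= max_points:
        edge += 1
    return edge


# Placeholder value: 32 MiB of aggregate L2 across all cores (an assumption).
assumed_total_l2 = 32 * 1024 * 1024
edge = max_cubic_block_edge(assumed_total_l2)
print(f"cubic temporal block of about {edge}^3 points")
```

Under these assumptions the block is far smaller than a 512^3 grid, so any gain would have to come from reuse across time steps within such a block, and halo overlap between blocks would reduce it further. As noted above, that gain has not been observed in practice.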
from yask.
I see, thank you for the clarification.
The ticket can be closed.
from yask.