Tuning parameters for temporal blocking about yask HOT 4 CLOSED

intel commented on July 23, 2024

Tuning parameters for temporal blocking

Comments (4)

chuckyount commented on July 23, 2024

Hi,

Actually, temporal blocking helps only with very large domains. The benefit of temporal blocking on Xeon Phi is that it lets you get closer to MCDRAM performance with DDR capacity when using the Xeon Phi in MCDRAM cache mode. (In other words, it does not help you improve on MCDRAM performance.) Have you looked at one of the papers describing temporal blocking? If not, you may want to scan this paper <https://drive.google.com/open?id=0Bz7DSHk3JHkpU0ZWdzFVZlhaRTQ> and/or these slides <https://drive.google.com/open?id=0Bz7DSHk3JHkpSmttY2h2dFpvYjA>.

Here are some commands that illustrate using temporal blocking:

    % # make sure you’re in cache mode
    % numactl -H
    available: 1 nodes (0)
    …
    % make clean; make -j stencil=3axis arch=knl
    …
    % # run without wavefronts.
    % bin/yask.sh -stencil 3axis -arch knl -exe_prefix 'numactl -m 0' -d 2048 -t 1
    …
    best-throughput (num-points/sec): 6.84733G
    % # run with wavefronts.
    % bin/yask.sh -stencil 3axis -arch knl -exe_prefix 'numactl -m 0' -d 2048 -t 1 -rt 25 -r 512
    …
    best-throughput (num-points/sec): 12.7478G

(These are not the best settings, just an example.) If you’re not seeing something similar, please send me your relevant log file(s) from the logs directory.

Chuck

From: H.R. Zohouri
Sent: Friday, January 26, 2018 7:31 AM
To: intel/yask
Subject: [intel/yask] Tuning parameters for temporal blocking (#82)

I am trying to evaluate the performance of the 3axis stencil in your framework with different radius values on a Xeon Phi 7210F with the MCDRAM configured in the default cache mode. My compiler is ICC 2018.1. I am trying to improve the performance of this stencil using temporal blocking (-r and -rt switches), since the general consensus is that this stencil is memory-bound, especially at low order, and could benefit from temporal blocking.

However, I have not succeeded in improving the performance using temporal blocking, except when the input size is small (smaller than 128^3). Should I expect any performance improvement from temporal blocking for this stencil when large inputs are used (512^3 and above), or is the ~140 GFLOPS that can be achieved using the default settings the maximum achievable performance?

zohourih commented on July 23, 2024

I see, I understand your point. Basically, in cache mode, performing temporal blocking will allow the temporally-blocked region to stay in MCDRAM when the input is larger than the MCDRAM size, bringing the achieved performance closer to what can be achieved with smaller stencils that fully fit in MCDRAM.
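
To make the size argument concrete, here is a rough working-set estimate for the domain sizes mentioned in this thread. It is only a sketch: the 4 bytes per point, the two grid copies, and the 16 GiB MCDRAM capacity are assumptions for illustration, not values taken from the YASK logs, and halos are ignored.

```python
# Rough working-set estimate for an N^3 grid (illustrative assumptions only).
BYTES_PER_POINT = 4          # assumption: single-precision values
GRID_COPIES = 2              # assumption: one input and one output copy of the grid
MCDRAM_BYTES = 16 * 2**30    # assumption: 16 GiB of MCDRAM

def footprint_bytes(n, bytes_per_point=BYTES_PER_POINT, copies=GRID_COPIES):
    """Bytes needed to hold `copies` cubic grids of edge length n (halos ignored)."""
    return n**3 * bytes_per_point * copies

def fits_in_mcdram(n):
    """True if the estimated working set is no larger than the assumed MCDRAM size."""
    return footprint_bytes(n) <= MCDRAM_BYTES

for n in (512, 1024, 2048):
    gib = footprint_bytes(n) / 2**30
    print(f"{n}^3: {gib:.0f} GiB, fits in MCDRAM: {fits_in_mcdram(n)}")
```

Under these assumptions the 512^3 and 1024^3 grids fit in MCDRAM, while the 2048^3 grid used in the example commands does not, which is the regime where the wavefront run showed a gain.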

However, I believe this is different from the traditional way of taking advantage of temporal blocking, which is to speed up memory-bound stencils beyond the roofline imposed by the external memory bandwidth. I am actually more interested in maximizing the performance of a given stencil, with a size that fits in MCDRAM (512^3 or 1024^3), and since the reported performance seems to be lower than what I expected, I thought maybe I should use temporal blocking to see if the performance can be improved any further.

I'd really appreciate it if you could shed some light on the following issues:

- I seem to be encountering a discrepancy in the reported GFLOPS and points per second for the 3axis stencil. From what I understand, the 3axis stencil is the standard 3D star-shaped stencil that has (6 * radius + 1) points, and a computation intensity of (12 * radius + 1) FLOP per point (6 * radius + 1 FPMUL and 6 * radius FPADD). Hence, for this stencil, I expect the reported GFLOPS to be equal to (points/sec * 13) for radius=1, and (points/sec * 25) for radius=2. However, the numbers reported by YASK seem to be a lot lower than this:

radius=1, size=512^3:

────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step:                    134.218M
num-writes-per-step:                    134.218M
num-est-FP-ops-per-step:                1.07374G
num-steps-done:                         50
elapsed-time (sec):                     0.390961
throughput (num-points/sec):            17.1651G
throughput (est-FLOPS):                 137.321G
throughput (num-writes/sec):            17.1651G
time in halo exch (sec):                0 (0%)
────────────────────────────────────────────────────────────

radius=2, size=512^3:

────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step:                    134.218M
num-writes-per-step:                    134.218M
num-est-FP-ops-per-step:                2.01327G
num-steps-done:                         50
elapsed-time (sec):                     0.336311
throughput (num-points/sec):            19.9544G
throughput (est-FLOPS):                 299.316G
throughput (num-writes/sec):            19.9544G
time in halo exch (sec):                0 (0%)
────────────────────────────────────────────────────────────

~~Specifically, the conversion ratio in YASK seems to be 8 for radius=1, and 15 for radius=2. Can you please elaborate how the reported GFLOPS is calculated here?~~

- Should I expect temporal blocking to improve performance at all for the 3axis stencil when the grid already fits in MCDRAM?

Edit: I realized that the coefficients in YASK are shared and because of this, the FLOP per point in the default 3axis stencil is different from what I had in mind. Hence I scratched the first issue.
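
For anyone who runs into the same mismatch, the two counts can be compared against the logged throughputs with a few lines of Python. This is a sketch under one assumption: that "shared" means one coefficient for the center point plus one coefficient per distance, reused by the six neighbors at that distance. That accounting reproduces the ratios of 8 and 15 in the logs above, but it is inferred from the numbers, not taken from the YASK source; the function names are made up for this example.

```python
def flops_distinct_coeffs(radius):
    """FLOP per point if every stencil point has its own coefficient."""
    points = 6 * radius + 1
    muls = points            # one multiply per stencil point
    adds = points - 1        # sum of `points` products
    return muls + adds       # equals 12 * radius + 1

def flops_shared_coeffs(radius):
    """FLOP per point assuming one coefficient per distance (hypothetical accounting)."""
    muls = radius + 1        # center coefficient plus one per distance
    adds = 6 * radius        # every neighbor value is still added once
    return muls + adds       # equals 7 * radius + 1

# (num-points/sec, est-FLOPS) copied from the two logs above, keyed by radius.
reported = {1: (17.1651e9, 137.321e9), 2: (19.9544e9, 299.316e9)}

for radius, (points_per_sec, est_flops) in reported.items():
    ratio = est_flops / points_per_sec
    print(f"radius={radius}: reported ratio {ratio:.2f}, "
          f"distinct {flops_distinct_coeffs(radius)}, "
          f"shared {flops_shared_coeffs(radius)}")
```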

chuckyount commented on July 23, 2024

As currently implemented in YASK, I have not seen a case where temporal blocking improves problems that fit inside MCDRAM. In theory, you might expect that a temporal-block size approximately equal to the combined size of the L2 caches would improve performance, but I have not been able to produce that effect.
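
As a back-of-the-envelope illustration of what "the sum of L2 caches" would mean for a block size, the sketch below computes the largest cubic block whose data fits in the aggregate L2. The cache figures (1 MiB of L2 per tile, 32 tiles), the 4 bytes per point, and the two grid copies are assumptions for illustration; halos and the shape of the wavefront are ignored, so this is not a recommended YASK setting.

```python
L2_BYTES_PER_TILE = 2**20   # assumption: 1 MiB of L2 per two-core tile
NUM_TILES = 32              # assumption: 64 cores arranged as 32 tiles

def max_cubic_block_edge(cache_bytes, bytes_per_point=4, copies=2):
    """Largest integer edge e such that `copies` grids of e^3 points fit in cache_bytes."""
    max_points = cache_bytes // (bytes_per_point * copies)
    edge = 0
    # Integer search avoids floating-point cube-root rounding issues.
    while (edge + 1) ** 3 <= max_points:
        edge += 1
    return edge

total_l2 = L2_BYTES_PER_TILE * NUM_TILES
edge = max_cubic_block_edge(total_l2)
print(f"aggregate L2: {total_l2 // 2**20} MiB, largest cubic block edge: {edge}")
```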

zohourih commented on July 23, 2024

I see, thank you for the clarification.

The ticket can be closed.
