Comments (4)

chuckyount avatar chuckyount commented on July 23, 2024

from yask.

zohourih avatar zohourih commented on July 23, 2024

I see, I understand your point. Basically, in cache mode, temporal blocking allows the temporally-blocked region to stay in MCDRAM when the input is larger than the MCDRAM size, bringing the achieved performance closer to what can be achieved with smaller stencils that fit fully in MCDRAM.

However, I believe this is different from the traditional way of taking advantage of temporal blocking, which is to speed up memory-bound stencils beyond the roofline imposed by the external memory bandwidth. I am actually more interested in maximizing the performance of a given stencil, with a size that fits in MCDRAM (512^3 or 1024^3), and since the reported performance seems to be lower than what I expected, I thought maybe I should use temporal blocking to see if the performance can be improved any further.

I'd really appreciate it if you shed some light on the following issues:

- I seem to be encountering a discrepancy in the reported GFLOPS and points per second for the 3axis stencil. From what I understand, the 3axis stencil is the standard 3D star-shaped stencil that has (6 * radius + 1) points, and a computation intensity of (12 * radius + 1) FLOP per point (6 * radius + 1 FPMUL and 6 * radius FPADD). Hence, for this stencil, I expect the reported GFLOPS to be equal to (points/sec * 13) for radius=1, and (points/sec * 25) for radius=2. However, the numbers reported by YASK seem to be a lot lower than this:

radius=1, size=512^3:

────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step:                    134.218M
num-writes-per-step:                    134.218M
num-est-FP-ops-per-step:                1.07374G
num-steps-done:                         50
elapsed-time (sec):                     0.390961
throughput (num-points/sec):            17.1651G
throughput (est-FLOPS):                 137.321G
throughput (num-writes/sec):            17.1651G
time in halo exch (sec):                0 (0%)
────────────────────────────────────────────────────────────

radius=2, size=512^3:

────────────────────────────────────────────────────────────
Running 1 performance trial(s) of 50 step(s) each...
────────────────────────────────────────────────────────────
num-points-per-step:                    134.218M
num-writes-per-step:                    134.218M
num-est-FP-ops-per-step:                2.01327G
num-steps-done:                         50
elapsed-time (sec):                     0.336311
throughput (num-points/sec):            19.9544G
throughput (est-FLOPS):                 299.316G
throughput (num-writes/sec):            19.9544G
time in halo exch (sec):                0 (0%)
────────────────────────────────────────────────────────────

Specifically, the conversion ratio in YASK seems to be 8 for radius=1 and 15 for radius=2. Can you please elaborate on how the reported GFLOPS figure is calculated here?

- Should I expect temporal blocking to improve performance at all for the 3axis stencil when the grid already fits in MCDRAM?

Edit: I realized that the coefficients in YASK are shared, so the FLOP per point in the default 3axis stencil differs from what I had in mind. I have therefore struck out the first issue.
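
For reference, the arithmetic behind the two ratios can be reproduced with a short sketch. The per-step values are copied from the two logs above; the function names are made up for illustration, and the "expected" count is only the unshared-coefficient formula from the first issue, not how YASK itself estimates FLOPs.

```python
# Per-step values copied from the two YASK logs above (512^3 grid).
runs = {
    1: {"points": 134.218e6, "est_fp_ops": 1.07374e9},  # radius=1
    2: {"points": 134.218e6, "est_fp_ops": 2.01327e9},  # radius=2
}


def expected_flop_per_point(radius):
    """Count assumed in the question: (6r + 1) multiplies plus 6r adds."""
    return (6 * radius + 1) + 6 * radius


def reported_flop_per_point(run):
    """Ratio implied by the log: estimated FP ops per step / points per step."""
    return run["est_fp_ops"] / run["points"]


for radius, run in runs.items():
    print(
        f"radius={radius}: expected {expected_flop_per_point(radius)} FLOP/point, "
        f"log implies about {reported_flop_per_point(run):.2f}"
    )
```

The logged values imply roughly 8 and 15 FLOP per point, against 13 and 25 from the unshared-coefficient formula, which is the gap described above.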

chuckyount avatar chuckyount commented on July 23, 2024

As currently implemented in YASK, I have not seen a case where temporal blocking improves performance for problems that fit inside MCDRAM. In theory, you might expect that a temporal-block size approximately equal to the combined size of the L2 caches would improve performance, but I have not been able to produce that effect.

zohourih avatar zohourih commented on July 23, 2024

I see, thank you for the clarification.

The ticket can be closed.
