Comments (17)
I talked with @ambrad, and he suggested I take this issue (assuming no one else was planning on it).
from ekat.
We need to handle BFB reduction of packs in a BFB build. Non-BFBness can arise like this:
F90:

```fortran
do k = 1,nlev ! scalar k
  a = a + b(k)
end do
```
C++:

```cpp
||r(k_pack, a) { // ||r: shorthand for a parallel_reduce over packs
  a += reduce_sum(b(k_pack)); // but this has a problem
}
```
The value of `a` will not be BFB-equal because `reduce_sum` first accumulates `b(k_pack)` into a partial sum and only then adds that to the accumulating `a`, which changes the order of the floating-point additions. So `reduce_sum` actually needs to be used like this:
```cpp
reduce_sum(a, b(k_pack));
```
Then the impl can be something like this:
```cpp
#ifdef BFB_BUILD
  for s..., a += b[s]
#else
  b_reduction = ... // vector impl of pack reduction
  a += b_reduction
#endif
```
from ekat.
Adding the hommexx-nh impl of this feature as a reference. It can definitely be improved and/or simplified.
from ekat.
For similarity with Intel's AVX512 instruction, I would call it `reduce_add`, or maybe `reduce_sum`; having `reduce` in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.
from ekat.
Also, is there really any benefit in decorating the loop with simd? I know someone might say "worst case scenario, nothing happened, so you get the same behavior as without the `vector_simd` macro", but I'm always wary of adding things "just in case", when they could be misleading...
from ekat.
> For similarity with Intel's AVX512 instruction, I would call it `reduce_add`, or maybe `reduce_sum`; having `reduce` in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.
I think that's fine. Wouldn't this also be the case for `min` and `max`, too? Can you point me to a link that describes the AVX-512 instruction you're referring to?

Regarding `vector_simd`: I had no specific reason for including it. I just copy-pasted the `min` (or `max`) implementation. I'm not versed in all of this machinery just yet.
from ekat.
This is the Intel intrinsics guide, which includes plenty of vector-extension versions. Probably you're most interested in AVX2 and AVX512 (the latter comes in n+1 flavors).

I don't think using names similar to avx is strictly necessary, but it might help convey the message that packs are there to enhance vectorization (in case it wasn't already clear).

BTW, in this case, without using intrinsics, I'm not sure the compiler is smart enough to use vector instructions to achieve better-than-scalar performance. Probably not. One would have to manually implement a divide-and-conquer strategy (like `sum(pack) = sum(pack[0:N/2)) + sum(pack[N/2:N))`). However, I don't know if this is worth the effort. I think one would have to do many reduction ops before their cost became noticeable compared to the rest.
from ekat.
Thanks for the link! I've updated the proposed solution above.
from ekat.
You need `vector_simd` (which maps to `#pragma omp simd`; see the docs for that pragma to understand the motivation) in there. All pack routines must include it. We're telling the compiler with maximum intensity that it must vectorize the loop.

Edit: Sometimes we find the compiler does something buggy. In that case, to be clear about why we're not using `vector_simd`, we use `vector_disabled`, with a comment about the compiler version. We could actually use compiler-version cpp defs to fine-tune that.
from ekat.
I was just wondering whether the simd pragma does anything here, since the different iterations all update the same quantity. In this scenario, it seems to me that the correct pragma would be `#pragma omp simd reduction(+:output_var_name)`. I wonder whether a simple `omp simd` does anything at all here (or whether it does the right thing). There is not much online about data races on the output variable in simd operations, and all the reduction examples add the `reduction(+:var)` clause.
from ekat.
Good point. Actually, for min, max, this new sum, and `Mask::any`/`all`, we probably do want the reduction clause, as that might help the compiler even more. An action item here is to look at the emitted code (use `icpc -S -fsource-asm -c foo.cpp` to get annotated assembly) with and without the reduction clause.
from ekat.
Yeah, that's the best way to check and be sure.
from ekat.
Thanks for the info, @ambrad!
from ekat.
Great, thanks!
Does there already exist a version of `#ifdef BFB_BUILD` in scream?
from ekat.
Ah, I thought we did, but I grepped and there's nothing in scream. In Ekat, I believe debug builds are always meant to have bfb enabled. However, ekat tests also access `tests/ekat_test_config.h`, which has (or doesn't, depending on config) the `EKAT_TEST_STRICT_FP` macro. You could use that to select the impl in your test.
from ekat.
Note there are two parts to this issue: (1) the pack reduction, `packed_sum.reduce_add` in the code Luca links, and (2) the wrapper that builds on the separate `parallel_reduce` wrapper from issue #32.
from ekat.
This was completed with #37.
from ekat.