Comments (17)

tcclevenger commented on August 12, 2024

I talked with @ambrad, and he suggested I take this issue (assuming no one else was planning on it).

from ekat.

ambrad commented on August 12, 2024

We need to handle BFB reduction of packs in a BFB build. Non-BFBness can arise like this:

F90:

    do k = 1,nlev  ! scalar k
       a = a + b(k)
    end do

C++:

    ||r(k_pack, a) {  // shorthand for a parallel_reduce lambda
       a += reduce_sum(b(k_pack)); // but this has a problem
    }

The value a will not be BFB-equal to the Fortran result because reduce_sum first accumulates the entries of b(k_pack) into a partial sum and then adds that to the accumulating a, which changes the order of the floating-point additions. So reduce_sum actually needs to be used like this:

    reduce_sum(a, b(k_pack));

Then the impl can be something like this:

    #ifdef BFB_BUILD
        for s..., a += b[s]  // serial accumulation, matches the F90 order
    #else
        b_reduction = ... // vector impl of pack reduction
        a += b_reduction
    #endif

bartgol commented on August 12, 2024

Adding the hommexx-nh impl of this feature as a reference. It can definitely be improved and/or simplified.

https://github.com/E3SM-Project/E3SM/blob/b7b269b3cfaca869e861fa01e6d65e201dc588a3/components/homme/src/share/cxx/ColumnOps.hpp#L474

bartgol commented on August 12, 2024

For similarity with Intel's AVX512 instruction, I would call it reduce_add, or maybe reduce_sum; having reduce in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.

bartgol commented on August 12, 2024

Also, is there really any benefit in decorating the loop with simd? I know someone might say "worst case, nothing happens, so you get the same behavior as without the vector_simd macro", but I'm always wary of adding things "just in case" when they could be misleading...

jeff-cohere commented on August 12, 2024

> For similarity with Intel's AVX512 instruction, I would call it reduce_add, or maybe reduce_sum; having reduce in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.

I think that's fine. Wouldn't this be the case for min and max, too? Can you point me to a link that describes the AVX-512 instruction you're referring to?

Regarding vector_simd: I had no specific reason for including it. I just copy-pasted the min (or max) implementation. I'm not versed in all of this machinery just yet.

bartgol commented on August 12, 2024

This is the Intel intrinsics guide, which covers plenty of vector extension versions. You're probably most interested in AVX2 and AVX512 (the latter comes in n+1 flavors).

I don't think using names similar to AVX is strictly necessary, but it might help convey the message that packs are there to enhance vectorization (in case it wasn't already clear).

BTW, in this case, without using intrinsics, I'm not sure the compiler is smart enough to use vector instructions to achieve better-than-scalar performance. Probably not. One would have to manually implement a divide-and-conquer strategy (like sum(pack) = sum(pack[0:N/2)) + sum(pack[N/2:N))). However, I don't know if this is worth the effort; I think one would have to do many reduction ops for its cost to become noticeable compared to the rest.

jeff-cohere commented on August 12, 2024

Thanks for the link! I've updated the proposed solution above.

ambrad commented on August 12, 2024

You need vector_simd (which maps to #pragma omp simd; see docs for that to understand the motivation) in there. All pack routines must include it. We're telling the compiler with maximum intensity that it must vectorize the loop.

Edit: Sometimes we find the compiler does something buggy. In that case, to be clear about why we're not using vector_simd, we use vector_disabled, with a comment about compiler version. We could actually use compiler version cpp defs to fine-tune that.

bartgol commented on August 12, 2024

I was just wondering whether the simd pragma does anything here, since the different iterations all update the same quantity. In this scenario, it seems to me that the correct pragma would be #pragma omp simd reduction(+:output_var_name). I wonder if a plain omp simd does anything at all here (or whether it does the right thing). There is not much online about data races on the output variable in simd operations, and all the reduction examples add the reduction(+:var) clause.

ambrad commented on August 12, 2024

Good point. Actually, for min, max, this new sum, and Mask::any, all, we probably do want the reduction clause, as that might help the compiler even more. An action item here is to look at the emitted code (use icpc -S -fsource-asm -c foo.cpp to get annotated assembly) with and without the reduction clause.

bartgol commented on August 12, 2024

Yeah, that's the best way to check and be sure.

jeff-cohere commented on August 12, 2024

Thanks for the info, @ambrad!

tcclevenger commented on August 12, 2024

Great, thanks!

Does a version of #ifdef BFB_BUILD already exist in scream?

bartgol commented on August 12, 2024

Ah, I thought we did, but I grepped and there's nothing in scream. In Ekat, I believe debug builds are always meant to have BFB enabled. However, ekat tests also include tests/ekat_test_config.h, which defines (or doesn't, depending on the configuration) the EKAT_TEST_STRICT_FP macro. You could use that to select the impl in your test.

ambrad commented on August 12, 2024

Note there are two parts to this issue: (1) the pack reduction, packed_sum.reduce_add in the code Luca links, and (2) the wrapper that builds on the separate parallel_reduce wrapper from issue #32.

bartgol commented on August 12, 2024

This was completed with #37.
