Comments (17)
I talked with @ambrad, and he suggested I take this issue (assuming no one else was planning on it).
from ekat.
We need to handle BFB reduction of packs in a BFB build. Non-BFBness can arise like this:
F90:

```fortran
do k = 1,nlev ! scalar k
  a = a + b(k)
end do
```
C++:

```cpp
||r(k_pack, a) { // ||r: shorthand for a parallel_reduce over packs
  a += reduce_sum(b(k_pack)); // but this has a problem
}
```
The value of `a` will not be BFB-equal because `reduce_sum` first accumulates `b(k_pack)` into a partial sum and only then adds that to the accumulating `a`, which changes the order of the floating-point additions. So `reduce_sum` actually needs to be used like this:
```cpp
reduce_sum(a, b(k_pack));
```
Then the impl can be something like this:
```cpp
#ifdef BFB_BUILD
  for s..., a += b[s]
#else
  b_reduction = ... // vector impl of pack reduction
  a += b_reduction
#endif
```
from ekat.
Adding the hommexx-nh impl of this feature as a reference. It can definitely be improved and/or simplified.
from ekat.
For similarity with Intel's AVX512 instruction, I would call it `reduce_add`, or maybe `reduce_sum`; having `reduce` in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.
from ekat.
Also, is there really any benefit in decorating the loop with simd? I know someone might say "worst case scenario, nothing happened, so you get the same behavior as without the `vector_simd` macro", but I'm always wary of adding things "just in case", when they could be misleading...
from ekat.
> For similarity with Intel's AVX512 instruction, I would call it `reduce_add`, or maybe `reduce_sum`; having `reduce` in the name makes it clear it is summing within the vector (and it's also easier to grep). But that's just a personal preference.
I think that's fine. Wouldn't this also be the case for `min` and `max`, too? Can you point me to a link that describes the AVX-512 instruction you're referring to?

Regarding `vector_simd`: I had no specific reason for including it. I just copy-pasted the `min` (or `max`) implementation. I'm not versed in all of this machinery just yet.
from ekat.
This is the Intel intrinsics guide, which includes plenty of vector-extension versions. Probably you're most interested in AVX2 and AVX512 (the latter comes in n+1 flavors).

I don't think using names similar to avx is strictly necessary, but it might help convey the message that packs are there to enhance vectorization (in case it wasn't already clear).

BTW, in this case, without using intrinsics, I'm not sure the compiler is smart enough to use vector instructions to achieve better-than-scalar performance. Probably not. One would have to manually implement a divide-and-conquer strategy (like `sum(pack) = sum(pack[0:N/2)) + sum(pack[N/2:N))`). However, I don't know if this is worth the effort. I think one would have to do many reduction ops before their cost became noticeable compared to the rest.
from ekat.
Thanks for the link! I've updated the proposed solution above.
from ekat.
You need `vector_simd` (which maps to `#pragma omp simd`; see the docs for that pragma to understand the motivation) in there. All pack routines must include it. We're telling the compiler with maximum intensity that it must vectorize the loop.

Edit: Sometimes we find the compiler does something buggy. In that case, to be clear about why we're not using `vector_simd`, we use `vector_disabled`, with a comment about the compiler version. We could actually use compiler-version cpp defs to fine-tune that.
from ekat.
I was just wondering whether the simd pragma does anything here, since the different iterations all update the same quantity. In this scenario, it seems to me that the correct pragma would be `#pragma omp simd reduction(+:output_var_name)`. I wonder whether a simple `omp simd` does anything at all here (or whether it does the right thing). There is not much online about data races on the output variable in simd operations, and all the reduction examples add the `reduction(+:var)` clause.
from ekat.
Good point. Actually, for min, max, this new sum, and `Mask::any`/`all`, we probably do want the reduction clause, as that might help the compiler even more. An action item here is to look at the emitted code (use `icpc -S -fsource-asm -c foo.cpp` to get annotated assembly) with and without the reduction clause.
from ekat.
Yeah, that's the best way to check and be sure.
from ekat.
Thanks for the info, @ambrad!
from ekat.
Great, thanks!
Does there already exist a version of `#ifdef BFB_BUILD` in scream?
from ekat.
Ah, I thought we did, but I grepped and there's nothing in scream. In Ekat, I believe debug builds are always meant to have bfb enabled. However, ekat tests also access `tests/ekat_test_config.h`, which has (or doesn't, depending on config) the `EKAT_TEST_STRICT_FP` macro. You could use that to select the impl in your test.
from ekat.
Note there are two parts to this issue: (1) the pack reduction, `packed_sum.reduce_add` in the code Luca links, and (2) the wrapper that builds on the separate `parallel_reduce` wrapper from issue #32.
from ekat.
This was completed with #37.
from ekat.