I found that if the primitive array has no null values, Auto vectorized can outperform

Auto vectorized vs packed_simd in arrow about arrow2 HOT 10 CLOSED

sundy-li commented on August 14, 2024 1

Auto vectorized vs packed_simd in arrow

from arrow2.

Comments (10)

sundy-li commented on August 14, 2024 1

OK, so is it possible to change the non-null branch to the auto-vectorized version?
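For readers following along, here is a minimal sketch of what a separate non-null branch could look like. It is not arrow2's actual code: `nonnull_sum`, `sum_with_validity`, the `LANES` constant and the `&[bool]` validity mask are all hypothetical names chosen for illustration (arrow2 stores validity as a bitmap, not a bool slice). The idea is only that, when there is no validity mask, the loop has a fixed-width shape with no per-element branching, which the compiler can often auto-vectorize.

```rust
/// Hypothetical helper: sums a slice known to contain no nulls.
/// Keeping LANES independent accumulators gives the optimizer a
/// fixed-width inner loop that it can often turn into SIMD adds.
fn nonnull_sum(values: &[f32]) -> f32 {
    const LANES: usize = 8; // assumption: 8 x f32 fits one AVX register
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();
    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Reduce the lane accumulators, then add the leftover tail.
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}

/// Hypothetical entry point: picks the branch based on the validity mask.
/// `None` means "no nulls"; `Some(mask)` marks valid slots with `true`.
fn sum_with_validity(values: &[f32], validity: Option<&[bool]>) -> f32 {
    match validity {
        None => nonnull_sum(values),
        Some(mask) => values
            .iter()
            .zip(mask)
            .map(|(v, &valid)| if valid { *v } else { 0.0 })
            .sum(),
    }
}
```

Calling `sum_with_validity(&values, None)` takes the non-null branch. Whether it actually beats an explicit SIMD version has to be measured on the target machine.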

from arrow2.

Dandandan commented on August 14, 2024 1

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:
https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

from arrow2.

Dandandan commented on August 14, 2024 1

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Rust can cross-compile to a different target architecture if you like, but the extra versions are only generated when compiling for the specified target. E.g. x86_64+avx creates the additional compiled versions only when x86_64 is the target; it won't do so for aarch64 or x86 (it wouldn't make sense, as the code would be invalid).
When the target matches, it includes multiple versions and detects the CPU features at the first call of the function.
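As a rough illustration of the mechanism described above, the following sketch does the same kind of dispatch by hand using only the standard library. It is not how the multiversion crate is implemented and it does not use its macros. The function names `sum_fallback`, `sum_avx` and `dispatch_sum` are hypothetical. The AVX version is only compiled when the target is x86_64, and the choice between versions is made at run time via `is_x86_feature_detected!`.

```rust
/// Portable version, compiled with the build's baseline target features.
fn sum_fallback(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Same body, but the compiler may use AVX instructions inside it.
/// Only exists when compiling for x86_64.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Hypothetical dispatcher: chooses a version based on the running CPU.
pub fn dispatch_sum(values: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was checked on the line above.
            return unsafe { sum_avx(values) };
        }
    }
    // Other architectures, or x86_64 CPUs without AVX, end up here.
    sum_fallback(values)
}
```

A binary built this way on any x86_64 machine still runs on a CPU without AVX, because the AVX version is only entered after the check succeeds. On aarch64 the AVX function is not compiled at all.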

from arrow2.

jorgecarleitao commented on August 14, 2024

yeap, it has been a battle. I actually have not used packed_simd for a while, but the null case was so important and I was unable to hit the right instructions, and so ended up adding it.

from arrow2.

jorgecarleitao commented on August 14, 2024

it is faster; it is simpler => definitely :)

from arrow2.

leiysky commented on August 14, 2024

Hi, @jorgecarleitao. I have a question about the compatibility of vectorization here.

There are many SIMD instruction sets (e.g. SSE, AVX, FMA), and which ones are available depends on the microarchitecture (e.g. Intel Skylake, AMD Zen2). If we only do simple cross compilation, that is, only specify the target architecture, we may not make good use of SIMD.

AFAIK, this issue is usually solved by function multiversioning.

In the C++ world, there are approaches like the GCC target attribute, which can generate multiple versions of a function (typically for different SIMD instruction sets) and dispatch between them at load time.

And I noticed that there is a multiversion crate https://docs.rs/multiversion/0.6.1/multiversion/, but I haven't tested it yet.

Is it possible to support this in arrow2?

from arrow2.

leiysky commented on August 14, 2024

Hey @leiysky

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

See for an example here:

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/aggregate/sum.rs#L22

Nice!

I only read the code here, and find there seems no special handling.

https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/arithmetics/basic/add.rs

Sorry for my misunderstanding.

from arrow2.

sundy-li commented on August 14, 2024

I only read the code here, and find there seems no special handling.

I had some doubts about this before: I think it may not work on a platform without AVX support, but I have not tested it.

from arrow2.

ritchie46 commented on August 14, 2024

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

from arrow2.

leiysky commented on August 14, 2024

Yes, we use the multiversion crate right now for achieving auto-vectorization with specific SIMD instructions.

To utilize avx at runtime. This still has to be compiled with a machine that supports this right? Or can it cross compile for different targets?

Multiversioning lets you define targets (e.g. avx, sse) for a function; the compiler then always produces the specified versions of the function and dispatches between them at load time (not at run time, which is how it achieves zero overhead).

from arrow2.
