
Comments (16)

ktf commented on August 16, 2024

> I guess, you are asking for a bulk api on a single transport only?

You mean that all the messages of the bulk are created on the same transport? Yes. See the line above for what I actually want to optimise.


rbx commented on August 16, 2024

@ktf Can you provide more context here? What are you running, how many messages, and of what size? What impact does this have?


dennisklein commented on August 16, 2024

Should the message count be a run-time or compile-time argument? e.g.

std::vector<MessagePtr> CreateMessages(size_t n, std::vector<size_t> const & sizes, std::vector<Alignment> const & alignments);
assert(n == sizes.size());
assert(n == alignments.size());

vs

template <size_t N>
std::array<MessagePtr, N> CreateMessages(std::array<size_t, N> const & sizes, std::array<Alignment, N> const & alignments); // or spans
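For illustration, a call to the run-time variant might look like this (hypothetical, since CreateMessages is only being proposed here; fair::mq::Alignment stands in for the Alignment type above):

std::vector<size_t> sizes{1024, 2048, 512};
std::vector<fair::mq::Alignment> alignments{{64}, {64}, {64}};
// one call, one lock, three messages on the same transport (proposed API, not existing)
auto msgs = transport->CreateMessages(sizes.size(), sizes, alignments);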

I guess, in case of a failure to create a single msg, the whole call is expected to fail?


dennisklein commented on August 16, 2024

I guess, you are asking for a bulk api on a single transport only?


ktf commented on August 16, 2024

This is ITS readout, but it's a general issue IMHO. Whenever we use shallow copies to increase parallelism on the same data when running in shared memory, we need to copy messages one by one, as done in:

https://github.com/AliceO2Group/AliceO2/blob/dev/Framework/Core/src/DataProcessingDevice.cxx#L637

It would be good if there was a way to reduce the impact of the mutex.
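For reference, the pattern at that line is roughly the following (a paraphrased sketch, not the actual AliceO2 code; inputs and transport are placeholders):

std::vector<fair::mq::MessagePtr> copies;
copies.reserve(inputs.Size());
for (auto& msg : inputs) {
    auto copy = transport->CreateMessage();
    copy->Copy(*msg);  // each Copy can take the shm segment mutex
    copies.emplace_back(std::move(copy));
}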


ktf commented on August 16, 2024

> Should the message count be a run-time or compile-time argument? e.g.

Runtime.

> I guess, in case of a failure to create a single msg, the whole call is expected to fail?

I would say so.


dennisklein commented on August 16, 2024

Since this mutex is internal to the boost shm library, we can only avoid it by making one big allocation from the boost shmem segment and then chunking it up according to the requested fmq msgs. However, when the fmq msgs are destroyed, we would have to track and delay the deallocation on the boost shmem segment until all the msgs the large allocation was chunked into are destroyed.

This will not be trivial, and we will have to see in measurements how it compares performance-wise. It could be that we only shift the overhead to msg destruction time - which may of course still help your use case. But just to set the expectations right.
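A rough sketch of that idea, with all names hypothetical (the deferred deallocation at the end is the tracking mentioned above):

#include <atomic>
#include <cstddef>

// One large shm allocation is split into n message buffers; a control block
// counts live chunks, and the big block is freed only when the last chunk dies.
struct BulkBlock {
    void* base;                      // the single large allocation from the segment
    std::atomic<std::size_t> live;   // number of chunks still alive
};

struct Chunk {
    BulkBlock* block;
    void* data;
    std::size_t size;
    ~Chunk() {
        if (block->live.fetch_sub(1) == 1) {
            // deferred bulk free; segment_deallocate() is a placeholder for
            // whatever returns the original allocation to the boost shmem segment
            // segment_deallocate(block->base);
        }
    }
};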

It would be a different story, of course, if boost shmem also provided a bulk allocation api, but I don't know (yet) whether it does.


rbx commented on August 16, 2024

There is a bulk API, but it may have fragmentation effects (positive or negative), as it will attempt to put all buffers contiguously:
https://www.boost.org/doc/libs/1_83_0/doc/html/interprocess/managed_memory_segments.html#interprocess.managed_memory_segments.managed_memory_segment_advanced_features.managed_memory_segment_multiple_allocations

We also need to see how it deals with (custom) alignment.
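For reference, the linked docs show usage roughly like this (adapted from the Boost example; popping before writing matters because the chain's link hooks live inside the buffers):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <cstring>

int main() {
    using namespace boost::interprocess;
    shared_memory_object::remove("bulk_demo");
    managed_shared_memory segment(create_only, "bulk_demo", 65536);

    // Allocate 16 buffers of 100 bytes each in a single (locked) call.
    managed_shared_memory::multiallocation_chain chain;
    segment.allocate_many(100, 16, chain);

    void* buffers[16];
    for (int i = 0; i != 16; ++i) {
        buffers[i] = &*chain.front();
        chain.pop_front();               // pop before touching the memory
        std::memset(buffers[i], 0, 100);
    }
    for (int i = 0; i != 16; ++i) {
        segment.deallocate(buffers[i]);  // buffers remain individually deallocatable
    }
    shared_memory_object::remove("bulk_demo");
    return 0;
}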


ktf commented on August 16, 2024

Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?


dennisklein commented on August 16, 2024

> Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?

Yeah, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?


dennisklein commented on August 16, 2024

> Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?

> Yeah, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?

Hmm, even if we speculatively pre-allocate those 2-byte buffers to pay the overhead only once every couple of shmem::Message::Copy()s, I guess we will still need a bulk Copy() api, because we would need to synchronize access to the pre-allocated refcount buffer pool as well, right? Maybe we could get away with one or two atomic integers for this?
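Something like the following, perhaps (illustrative only; slots are never recycled here, which a real pool would need to handle):

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// A pre-allocated pool of 2-byte ref-count slots handed out via one atomic
// index: taking a slot costs a fetch_add instead of the segment mutex.
struct RefCountPool {
    std::array<std::atomic<std::uint16_t>, 4096> slots;
    std::atomic<std::size_t> next{0};

    std::atomic<std::uint16_t>* Acquire() {
        std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
        if (i >= slots.size()) { return nullptr; }  // exhausted: fall back to the locked path
        return &slots[i];
    }
};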


ktf commented on August 16, 2024

Yes, I suppose we are hitting that case here, assuming UR means UnmanagedRegion. Are those allocations in shared memory? That could explain the fragmentation as well, no?


rbx commented on August 16, 2024

Yes, they are in shared memory. Certainly they contribute to overall fragmentation.

Either we make a bulk call for Copy that stores them together, or, since in this case we know the exact data size we need (a header to store the ref count... but still an unknown number of them), we could make use of a pool allocator.
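A pool allocator along those lines already exists in Boost.Interprocess, e.g. node_allocator, which serves fixed-size nodes from cached blocks (a hedged sketch; RefCountHeader is hypothetical):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/allocators/node_allocator.hpp>
#include <cstdint>

struct RefCountHeader { std::uint16_t count; };  // hypothetical fixed-size header

int main() {
    using namespace boost::interprocess;
    shared_memory_object::remove("pool_demo");
    managed_shared_memory segment(create_only, "pool_demo", 65536);

    // node_allocator hands out fixed-size nodes from cached blocks, so most
    // allocations avoid a fresh (locked) segment allocation.
    node_allocator<RefCountHeader, managed_shared_memory::segment_manager>
        pool(segment.get_segment_manager());
    auto p = pool.allocate(1);
    pool.deallocate(p, 1);

    shared_memory_object::remove("pool_demo");
    return 0;
}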


ktf commented on August 16, 2024

I would definitely use a pool allocator for those. We probably have thousands of them being created for each timeframe. It would actually be good if we could also have a bulk allocator for messages below 256 bytes or so, since roughly every other message in our data model is an around-80-byte header.


rbx commented on August 16, 2024

A small update on this: in the dev branch there is now a first implementation using pools.

For reference counting messages from an UnmanagedRegion, it will internally store the ref counts in a separate segment (1 per UnmanagedRegion) and use a pool to track them. Technically this still syncs on every message, but it should be faster since it 1. uses a separate segment from the main data and 2. uses a pool. It should also certainly improve fragmentation in the main data segment, as nothing is allocated there anymore for UR messages. The segment size is 10,000,000 bytes by default, which should be good for up to 5,000,000 messages (probably a bit less). If you expect that many or more, you can bump the size in the region configuration: https://github.com/FairRootGroup/FairMQ/blob/dev/fairmq/UnmanagedRegion.h#L137.
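Presumably that looks something like the following (rcSegmentSize is the field at the linked line; the exact CreateUnmanagedRegion overload taking a RegionConfig may differ in your FairMQ revision):

fair::mq::RegionConfig cfg;
cfg.rcSegmentSize = 50000000;  // roughly 5x the default ref-count capacity
auto region = factory->CreateUnmanagedRegion(regionSize, callback, cfg);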

Could you give it a try and provide some feedback on how this behaves with the bottlenecks that you observe?

A further step could be to make use of cached pools to reduce synchronization. These could be effective together with a bulk message copy API.
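A cached pool here could mean a per-thread cache that is refilled from the shared pool in bulk, so the lock is taken once per batch rather than once per message (a self-contained sketch, all names hypothetical):

#include <cstddef>
#include <mutex>
#include <vector>

struct SharedPool {
    std::mutex m;
    std::vector<void*> freeSlots;
    void RefillBulk(std::vector<void*>& out, std::size_t n) {
        std::lock_guard<std::mutex> lk(m);  // one lock per batch of n slots
        while (n-- && !freeSlots.empty()) {
            out.push_back(freeSlots.back());
            freeSlots.pop_back();
        }
    }
};

struct CachedPool {
    std::vector<void*> localCache;  // per-thread cache, accessed lock-free
    void* Acquire(SharedPool& shared) {
        if (localCache.empty()) { shared.RefillBulk(localCache, 64); }
        if (localCache.empty()) { return nullptr; }  // shared pool exhausted
        void* slot = localCache.back();
        localCache.pop_back();
        return slot;
    }
};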


ktf commented on August 16, 2024

Thanks! I will try ASAP.

