
Comments (16)

ktf commented on August 16, 2024

> I guess, you are asking for a bulk api on a single transport only?

You mean that all the messages of the bulk are created on the same transport? Yes. See the line above for what I actually want to optimise.


rbx commented on August 16, 2024

@ktf Can you provide more context here? What are you running, how many messages, and of what size? What impact does this have?


dennisklein commented on August 16, 2024

Should the message count be a run-time or compile-time argument? e.g.

std::vector<MessagePtr> CreateMessages(size_t n, std::vector<size_t> const & sizes, std::vector<Alignment> const & alignments);
assert(n == sizes.size());
assert(n == alignments.size());

vs

template <size_t N>
std::array<MessagePtr, N> CreateMessages(std::array<size_t, N> const & sizes, std::array<Alignment, N> const & alignments); // or spans
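For illustration, a call to the run-time variant might look like this (hypothetical, since CreateMessages is only being proposed here; fair::mq::Alignment stands in for the Alignment type above):

std::vector<size_t> sizes{1024, 2048, 512};
std::vector<fair::mq::Alignment> alignments{{64}, {64}, {64}};
// one call, one lock, three messages on the same transport (proposed API, not existing)
auto msgs = transport->CreateMessages(sizes.size(), sizes, alignments);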

I guess, in case of a failure to create a single msg, the whole call is expected to fail?


dennisklein commented on August 16, 2024

I guess, you are asking for a bulk api on a single transport only?


ktf commented on August 16, 2024

This is ITS readout, but it's a general issue IMHO. Whenever we use shallow copies to increase parallelism on the same data when running in shared memory, we need to copy messages one by one, as done in:

https://github.com/AliceO2Group/AliceO2/blob/dev/Framework/Core/src/DataProcessingDevice.cxx#L637

It would be good if there was a way to reduce the impact of the mutex.
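For reference, the pattern at that line is roughly the following (a paraphrased sketch, not the actual AliceO2 code; inputs and transport are placeholders):

std::vector<fair::mq::MessagePtr> copies;
copies.reserve(inputs.Size());
for (auto& msg : inputs) {
    auto copy = transport->CreateMessage();
    copy->Copy(*msg);  // each Copy can take the shm segment mutex
    copies.emplace_back(std::move(copy));
}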


ktf commented on August 16, 2024

> Should the message count be a run-time or compile-time argument? e.g.

Runtime.

> I guess, in case of a failure to create a single msg, the whole call is expected to fail?

I would say so.


dennisklein commented on August 16, 2024

Since this mutex is internal to the boost shm library, we can only avoid it by making one big allocation from the boost shmem segment and then chunking it up according to the requested fmq msgs. However, when the fmq msgs are destroyed, we would have to track and delay the deallocation on the boost shmem segment until all the msgs the large allocation was chunked into are destroyed.

This will not be trivial, and we will have to see in measurements how it compares performance-wise. It could be that we only shift the overhead to msg destruction time - which may of course still help your use case. But just to set the expectations right.
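A rough sketch of that idea, with all names hypothetical (the deferred deallocation at the end is the tracking mentioned above):

#include <atomic>
#include <cstddef>

// One large shm allocation is split into n message buffers; a control block
// counts live chunks, and the big block is freed only when the last chunk dies.
struct BulkBlock {
    void* base;                      // the single large allocation from the segment
    std::atomic<std::size_t> live;   // number of chunks still alive
};

struct Chunk {
    BulkBlock* block;
    void* data;
    std::size_t size;
    ~Chunk() {
        if (block->live.fetch_sub(1) == 1) {
            // deferred bulk free; segment_deallocate() is a placeholder for
            // whatever returns the original allocation to the boost shmem segment
            // segment_deallocate(block->base);
        }
    }
};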

It would be a different story, of course, if boost shmem also provided a bulk allocation api, but I don't know (yet) whether it does.


rbx commented on August 16, 2024

There is a bulk API, but it may have fragmentation effects (positive or negative), as it will attempt to put all buffers contiguously:
https://www.boost.org/doc/libs/1_83_0/doc/html/interprocess/managed_memory_segments.html#interprocess.managed_memory_segments.managed_memory_segment_advanced_features.managed_memory_segment_multiple_allocations

We also need to see how it deals with (custom) alignment.
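For reference, the linked docs show usage roughly like this (adapted from the Boost example; popping before writing matters because the chain's link hooks live inside the buffers):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <cstring>

int main() {
    using namespace boost::interprocess;
    shared_memory_object::remove("bulk_demo");
    managed_shared_memory segment(create_only, "bulk_demo", 65536);

    // Allocate 16 buffers of 100 bytes each in a single (locked) call.
    managed_shared_memory::multiallocation_chain chain;
    segment.allocate_many(100, 16, chain);

    void* buffers[16];
    for (int i = 0; i != 16; ++i) {
        buffers[i] = &*chain.front();
        chain.pop_front();               // pop before touching the memory
        std::memset(buffers[i], 0, 100);
    }
    for (int i = 0; i != 16; ++i) {
        segment.deallocate(buffers[i]);  // buffers remain individually deallocatable
    }
    shared_memory_object::remove("bulk_demo");
    return 0;
}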


ktf commented on August 16, 2024

Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?


dennisklein commented on August 16, 2024

> Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?

Yeah, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?


dennisklein commented on August 16, 2024

> Ok, but how about we simplify the problem? In the end, what I need to do is shallow-copy messages. How about we optimise that case?

> Yeah, this is what I am still wondering: why do you see an allocation at all? I guess you are hitting this case?

Hmm, even if we speculatively pre-allocate those 2-byte buffers to pay the overhead only once every couple of shmem::Message::Copy()s, I guess we will still need a bulk Copy() api, because we would need to synchronize access to the pre-allocated refcount buffer pool as well, right? Maybe we could get away with one or two atomic integers for this?
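Something like the following, perhaps (illustrative only; slots are never recycled here, which a real pool would need to handle):

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// A pre-allocated pool of 2-byte ref-count slots handed out via one atomic
// index: taking a slot costs a fetch_add instead of the segment mutex.
struct RefCountPool {
    std::array<std::atomic<std::uint16_t>, 4096> slots;
    std::atomic<std::size_t> next{0};

    std::atomic<std::uint16_t>* Acquire() {
        std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
        if (i >= slots.size()) { return nullptr; }  // exhausted: fall back to the locked path
        return &slots[i];
    }
};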


ktf commented on August 16, 2024

Yes, I suppose we are hitting that case here, assuming UR means UnmanagedRegion. Are those allocations in shared memory? That could explain the fragmentation as well, no?


rbx commented on August 16, 2024

Yes, they are in shared memory. Certainly they contribute to overall fragmentation.

Either we make a bulk call for Copy that stores them together, or, since in this case we know the exact data size we need (a header to store the ref count... but still an unknown number of them), we could make use of a pool allocator.
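A pool allocator along those lines already exists in Boost.Interprocess, e.g. node_allocator, which serves fixed-size nodes from cached blocks (a hedged sketch; RefCountHeader is hypothetical):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/allocators/node_allocator.hpp>
#include <cstdint>

struct RefCountHeader { std::uint16_t count; };  // hypothetical fixed-size header

int main() {
    using namespace boost::interprocess;
    shared_memory_object::remove("pool_demo");
    managed_shared_memory segment(create_only, "pool_demo", 65536);

    // node_allocator hands out fixed-size nodes from cached blocks, so most
    // allocations avoid a fresh (locked) segment allocation.
    node_allocator<RefCountHeader, managed_shared_memory::segment_manager>
        pool(segment.get_segment_manager());
    auto p = pool.allocate(1);
    pool.deallocate(p, 1);

    shared_memory_object::remove("pool_demo");
    return 0;
}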


ktf commented on August 16, 2024

I would definitely use a pool allocator for those. We probably have thousands of them being created for each timeframe. It would actually be good if we could also have a bulk allocator for messages below 256 bytes or so, since roughly every other message in our data model is an around-80-byte header.


rbx commented on August 16, 2024

A small update on this: in the dev branch there is now a first implementation using pools.

For reference counting messages from an UnmanagedRegion, it will internally store the ref counts in a separate segment (1 per UnmanagedRegion) and use a pool to track them. Technically this still syncs on every message, but it should be faster since it 1. uses a separate segment from the main data and 2. uses a pool. It should also certainly improve fragmentation in the main data segment, as nothing is allocated there anymore for UR messages. The segment size is 10,000,000 bytes by default, which should be good for up to 5,000,000 messages (probably a bit less). If you expect that many or more, you can bump the size in the region configuration: https://github.com/FairRootGroup/FairMQ/blob/dev/fairmq/UnmanagedRegion.h#L137.
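Presumably that looks something like the following (rcSegmentSize is the field at the linked line; the exact CreateUnmanagedRegion overload taking a RegionConfig may differ in your FairMQ revision):

fair::mq::RegionConfig cfg;
cfg.rcSegmentSize = 50000000;  // roughly 5x the default ref-count capacity
auto region = factory->CreateUnmanagedRegion(regionSize, callback, cfg);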

Could you give it a try and provide some feedback on how this behaves with the bottlenecks that you observe?

A further step could be to make use of cached pools to reduce synchronization. These could be effective together with a bulk message copy API.
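A cached pool here could mean a per-thread cache that is refilled from the shared pool in bulk, so the lock is taken once per batch rather than once per message (a self-contained sketch, all names hypothetical):

#include <cstddef>
#include <mutex>
#include <vector>

struct SharedPool {
    std::mutex m;
    std::vector<void*> freeSlots;
    void RefillBulk(std::vector<void*>& out, std::size_t n) {
        std::lock_guard<std::mutex> lk(m);  // one lock per batch of n slots
        while (n-- && !freeSlots.empty()) {
            out.push_back(freeSlots.back());
            freeSlots.pop_back();
        }
    }
};

struct CachedPool {
    std::vector<void*> localCache;  // per-thread cache, accessed lock-free
    void* Acquire(SharedPool& shared) {
        if (localCache.empty()) { shared.RefillBulk(localCache, 64); }
        if (localCache.empty()) { return nullptr; }  // shared pool exhausted
        void* slot = localCache.back();
        localCache.pop_back();
        return slot;
    }
};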


ktf commented on August 16, 2024

Thanks! I will try ASAP.

