Recently I analyzed the encoding performance of this library for my clients, who need

proposal for performance improvements about gpb HOT 9 OPEN

sasa1977 commented on September 28, 2024

proposal for performance improvements

from gpb.

Comments (9)

tomas-abrahamsson commented on September 28, 2024

Hi, interesting stuff!

About switching MsgDef to a map: I'm getting the impression you are using the gpb module directly as opposed to generating an encoder/decoder module and using it? If yes, then you would probably get some performance improvement more-or-less for free if you can switch over to using a generated module. Can you, or are you limited to gpb due to something? Maybe you or your clients are using the Elixir protobuf, in case it would imply some restrictions? I'm not familiar with how it works. I'd be interested to learn more about your use case. The generated encoder/decoder does not traverse the list of definitions. Instead, the definitions are embedded in the structure of the generated code.

I've kept the data-driven encoder/decoder in the gpb module mostly to be able to cross-check encoding/decoding results against generated modules, but I have generally not spent any efforts on performance or other bells and whistles here. This has gone into the code generator instead.

Regarding maps, I just opened an issue over in hexpm/hex_core#134 for what the oldest OTP version to support. (If it is still 17, it would impose some limitations on how one can phrase the maps expressions)

Regarding the api of the (assumedly intended) gpb module, I think an approach could be to work on maps internally, but the api would still need to accept defs also as a list (bwd compat), and in that case convert the defs to maps as a first step. As the documented definitions format is a list, I guess it would make sense to also expose, as an api function, that function that will turn the definition list into a map. (An alternative is of course to define a new version of the definitions format, but I think would be more work, since then the code generator needs to be adapted as well.)

Regarding io-lists, do you have any performance figures for only this part of the proposed change? Currently, the code relies on the Erlang optimization that binaries are initially write-appendable under the hood. This is at least for the generated code, maybe also for the gpb module (don't remember,) but again, are you referring to the data-driven encoder/decoder in the gpb module? I tried earlier with iolists instead, but didn't find it made any much of a speedup, if I remember correctly. Unfortunately, I don't think I have any results to share anymore and it was quite some time ago.

from gpb.

sasa1977 commented on September 28, 2024

I'm getting the impression you are using the gpb module directly as opposed to generating an encoder/decoder module and using it?

Yeah, I was benching and optimizing gpb:encode_msg. I didn't look at the generated encoder, but I naively assumed that the code in those modules would internally use gpb. Is that not the case, i.e. you're saying that gpb is basically a parallel implementation of the encoder/decoder?

Regarding io-lists, do you have any performance figures for only this part of the proposed change?

As mentioned in the first comment, switching to iolists brought an extra 2x improvement (on the particular data structure I was benching).

Currently, the code relies on the Erlang optimization that binaries are initially write-appendable under the hood.

Yes, binary is write-appendable, but it still needs to be reallocated if it doesn't have enough space (source). Given a large enough input data structure, I'd expect frequent reallocation. With iolists, these reallocations can be avoided. Only at the end we invoke an efficient iolist_to_binary to produce the encoded bytes.

are you limited to gpb due to something?

I'll double check and report back.

from gpb.

sasa1977 commented on September 28, 2024

As mentioned in the first comment, switching to iolists brought an extra 2x improvement (on the particular data structure I was benching).

Sorry, this statement is misleading. The speed up was 2x after the maps optimization. But disregarding that, and speaking in absolute numbers, with the test input I'm using the master version encoding takes 5ms on average. If I switch to iolists, it takes 4ms on average.

from gpb.

tomas-abrahamsson commented on September 28, 2024

Is that not the case, i.e. you're saying that gpb is basically a parallel implementation of the encoder/decoder?

Yes, that's correct, it is a parallel implementation, there is no run-time dependency from the generated code to gpb.

but it still needs to be reallocated if it doesn't have enough space [...] with the test input I'm using the master version encoding takes 5ms on average. If I switch to iolists, it takes 4ms on average.

Good point about reallocations, I didn't think about that. And 20% improvement on encoding (for this particular input) is indeed something :) So this seems it would be a worthwhile improvement.

There could probably a break-even somewhere if a binary of an iolist is small, to use integers instead in case they are below 256. I'm thinking about memory usage for binaries vs integers, as described in the efficiency guide For example let's say we have a field that is of type bytes, let's say the field's number is 1 and its value is 17 bytes. The wiretype for a length-delimited field with number 1 is (1 bsl 3) + 2 = 10 which is below 256. The length 17 is below 128, so the varint-encoding of the length will also be below 256. Then it would be slightly more memory-efficient to store it in iolist as [10, 17, <<17 bytes>>] than the more general [<<10>>, <<17>>, <<17 bytes>>] or [<<10, 17>>, <<17 bytes>>] But this could well be a premature optimization that just results in too complex code.

from gpb.

sasa1977 commented on September 28, 2024

Yes, that's correct, it is a parallel implementation, there is no run-time dependency from the generated code to gpb.

How stable would you say the gbp implementation is? I'm asking because my clients are currently using that one, and it seems that switching to the generated code might be a pretty large undertaking at this point.

from gpb.

tomas-abrahamsson commented on September 28, 2024

Stable in what sense? I'd say both implementations are fairly well tested. Most work has gone into the generated code. Both since it is a bit more complex problem, but also because there are more options for it.

I forgot to mention that for the generated code, there is also a nif option to generate code that uses Google's C++ protobuf library via NIFs to encode and decode, and a bit more performance can be squeezed out with the bypass_wrappers option. But there are some caveats if you plan to use or switch between overlapping set of proto definitions, see this section of the README.nif-cc and the build process becomes yet a bit more complex of course.

from gpb.

sasa1977 commented on September 28, 2024

OK thanks! Let me know if you're interested in accepting these two perf improvements for the gpb module.

from gpb.

tomas-abrahamsson commented on September 28, 2024

Yes, definitely, I think they'd be nice improvements.

from gpb.

tomas-abrahamsson commented on September 28, 2024

Thanks for both PRs. I will take a look.

from gpb.

proposal for performance improvements about gpb HOT 9 OPEN

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent