
Comments (2)

osheroff commented on August 24, 2024

Can you give some numbers showing how expensive each issue you're calling out is? Whether to proceed with large changes like this is really just a matter of how much speed you'll win by doing so.

As far as breaking the interface goes: generally, the answer is no. Ignore the low version number; this library is stable and used heavily in production all over the place. I'm not quite sure, of course, who relies on the specific interfaces of UUIDSet etc., but I'd still prefer not to change interfaces unless there is a very, very large win possible.

regarding two different classes for maria/mysql:

The MariaDB GTID support is more recent (and so could in theory break a little), but the interface as it stands can be used reasonably well without having to specify flags for connecting to MariaDB vs MySQL, which I like.

we could explore introducing some kind of (Abstract)TransactionState/GtidState/ReplicationState

I like this idea, especially if we can keep the old interfaces the same but add a migration path to the new. I'm also not against marking the old interfaces as deprecated, if you come up with a great approach to organizing this information... but it'd probably be a long long migration to the new interfaces.

from mysql-binlog-connector-java.

janickr commented on August 24, 2024

I created a POC to compare the allocation and CPU profiles for points 1) and 2). Point 3) is more of a design change, IMO.

TL;DR
IMO it is worth fixing the allocations in GtidEventDataDeserializer; the GtidSet speedup, not so much.

The setup

The POC app starts a MySQL OneTimeServer with binary logs in GTID mode. It creates a table and populates it with 300,000 rows. The total size of the binlog is around 130 MiB.
It then starts the various binary log client implementations one after another: first without profiling, to ensure all classes are loaded at least once, then with CPU profiling, and finally with allocation profiling.
The binlog clients use the EventDeserializer with the default EventDataDeserializers, so in addition to the GTID data, the WriteRowsEventData is also deserialized during the run.

1) The allocations caused by GtidEventDataDeserializer:

First the allocation profile of the current binarylogclient:
alloc-original
More than half of the allocations are caused by the String operations in GtidEventDataDeserializer

If instead we read the gtid into a MySqlGtid object with a UUID sourceId and a long transactionId:
alloc-new-gtid-deserializer-without-strings
The GtidEventDataDeserializer accounts for less than 1% of the allocations
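
To make the idea concrete, here is a minimal sketch of reading a GTID without going through any String. It is not the POC code: the class shape and the `read` helper are hypothetical, and it assumes the payload at this point is the 16-byte source id (in the same byte order as the textual UUID) followed by an 8-byte little-endian transaction number, with any leading flags byte already consumed by the caller.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.UUID;

/** Hypothetical value type; only the two field names come from the comment above. */
final class MySqlGtid {
    private final UUID sourceId;
    private final long transactionId;

    MySqlGtid(UUID sourceId, long transactionId) {
        this.sourceId = sourceId;
        this.transactionId = transactionId;
    }

    UUID getSourceId() { return sourceId; }

    long getTransactionId() { return transactionId; }

    /**
     * Assumed layout: 16 bytes of source id, then 8 bytes of
     * little-endian transaction number. No hex formatting, no String.
     */
    static MySqlGtid read(ByteBuffer in) {
        ByteOrder saved = in.order();
        try {
            in.order(ByteOrder.BIG_ENDIAN);
            long mostSigBits = in.getLong();
            long leastSigBits = in.getLong();
            // Read big-endian and flip, which equals a little-endian read.
            long transactionId = Long.reverseBytes(in.getLong());
            return new MySqlGtid(new UUID(mostSigBits, leastSigBits), transactionId);
        } finally {
            in.order(saved); // leave the caller's byte order untouched
        }
    }

    /** The textual form is only built when somebody actually asks for it. */
    @Override
    public String toString() {
        return sourceId + ":" + transactionId;
    }
}
```

The only allocations per event in this sketch are the UUID and the MySqlGtid itself; the string form is deferred to `toString()`.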

With the allocations of the GtidEventDataDeserializer out of the picture, I also noticed that the EventHeaderV4Deserializer kept creating arrays for each event:
alloc-original-eventheader-deserializer
This was causing more than half of the allocations

When an index is built once and reused, these allocations can be avoided:
alloc-new-eventheader-deserializer
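
A sketch of what "building an index once" can look like, assuming the per-event arrays came from calling `values()` on the enum (which returns a fresh copy on every call in Java). The enum below is a hypothetical stand-in, not the library's EventType, and the event numbers are for illustration only.

```java
/** Hypothetical stand-in for an event type enum; members and numbers are illustrative. */
enum SketchEventType {
    UNKNOWN(0), QUERY(2), ROTATE(4), GTID(33);

    private final int eventNumber;

    // Built exactly once, when the enum class is initialized.
    private static final SketchEventType[] BY_NUMBER = buildIndex();

    SketchEventType(int eventNumber) {
        this.eventNumber = eventNumber;
    }

    private static SketchEventType[] buildIndex() {
        int max = 0;
        for (SketchEventType type : values()) {
            max = Math.max(max, type.eventNumber);
        }
        SketchEventType[] index = new SketchEventType[max + 1];
        for (SketchEventType type : values()) {
            index[type.eventNumber] = type;
        }
        return index;
    }

    /** Allocation-free lookup; unmapped or out-of-range numbers fall back to UNKNOWN. */
    static SketchEventType byEventNumber(int number) {
        if (number < 0 || number >= BY_NUMBER.length || BY_NUMBER[number] == null) {
            return UNKNOWN;
        }
        return BY_NUMBER[number];
    }
}
```

After class initialization, `byEventNumber` only reads from the shared array, so the per-event lookup no longer allocates.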

The string formatting is also very visible in the cpu profile (the byteArrayToHex method):
cpu-original-gtid-deserializer-2

This is improved when we deserialize to a gtid object:
cpu-gtid-without-strings-2

2) the cpu performance of the GtidSet
... or the perfect example of Amdahl's law

The cpu usage of the original GtidSet.add method:
cpu-original-2
is 3.3% of the total, mostly dominated by string parsing

When we use the gtid object instead (no string parsing):
cpu-without-strings-original-gtidset-2
it is reduced to 1.2% of the total CPU usage

With a new GtidSet implementation:
cpu-new-gtidset-2
only 0.1% remains
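
For readers wondering what a string-free GtidSet could look like: the sketch below is one possible shape, not necessarily the one used in the POC. It is an assumption throughout: the class name, the methods, and the choice of a per-source TreeMap of inclusive intervals are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

/** Hypothetical interval-based set keyed by source UUID; no string parsing on add. */
final class IntervalGtidSet {
    // Per source: interval start -> inclusive interval end.
    private final Map<UUID, TreeMap<Long, Long>> intervals = new HashMap<>();

    /** Adds one transaction id, merging with adjacent intervals. Returns false if already present. */
    boolean add(UUID sourceId, long transactionId) {
        TreeMap<Long, Long> bySource = intervals.computeIfAbsent(sourceId, k -> new TreeMap<>());
        Map.Entry<Long, Long> left = bySource.floorEntry(transactionId);
        if (left != null && left.getValue() >= transactionId) {
            return false; // already covered by an existing interval
        }
        long start = transactionId;
        long end = transactionId;
        if (left != null && left.getValue() == transactionId - 1) {
            start = left.getKey(); // extend the interval on the left
        }
        Map.Entry<Long, Long> right = bySource.higherEntry(transactionId);
        if (right != null && right.getKey() == transactionId + 1) {
            end = right.getValue(); // absorb the interval on the right
            bySource.remove(right.getKey());
        }
        bySource.put(start, end);
        return true;
    }

    boolean contains(UUID sourceId, long transactionId) {
        TreeMap<Long, Long> bySource = intervals.get(sourceId);
        if (bySource == null) {
            return false;
        }
        Map.Entry<Long, Long> left = bySource.floorEntry(transactionId);
        return left != null && left.getValue() >= transactionId;
    }

    int intervalCount(UUID sourceId) {
        TreeMap<Long, Long> bySource = intervals.get(sourceId);
        return bySource == null ? 0 : bySource.size();
    }
}
```

In the common case of consecutive transaction ids from one source, each `add` just extends the last interval.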

Some concluding thoughts
I think most gains can be made by changing the gtid deserializer (and the EventType.byEventNumber method). This does imply changing the protected gtid field in the BinaryLogClient to a common supertype of MariadbGtid and a new MySqlGtid. The most sensible common supertype is Object; inventing another one is a bit artificial.
This smells, and therefore I proposed (3) introducing some kind of (Abstract)TransactionState/GtidState/ReplicationState. That change is harder (keeping the API stable, maybe that also includes the protected gtid-related fields?) and maybe not feasible within the boundaries of the current API.

While the new GtidSet is significantly faster than the current GtidSet, it probably does not matter because the current GtidSet.add method is only a fraction of the full cpu profile. So although I like my implementation, I don't think it would add much value.
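
As a rough back-of-the-envelope check, Amdahl's law gives the overall speedup S when a component that takes a fraction p of the runtime is made s times faster. Taking p = 0.033 from the original GtidSet.add profile, and assuming that share is representative of the whole run, even an infinitely fast GtidSet caps the overall gain at about 3.4%:

```latex
S = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
S_{\max} = \lim_{s \to \infty} S = \frac{1}{1 - p}
         = \frac{1}{1 - 0.033} \approx 1.034
```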

Sources
The html flamegraphs (the source for the screenshots) are here:
flamegraphs-2.zip

You can find the source code to review and reproduce the profiling at https://github.com/janickr/mysql-binlog-connector-java/tree/gtd-profiling-poc

Run the poc with
./mvnw test-compile exec:exec -Dexec.executable=java -Dexec.args="-cp %classpath -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints com.github.shyiko.mysql.binlog.gtidprofilingpoc.ProfileMain" -Dexec.classpathScope=test

It will generate the flamegraphs in the project root directory.

Edit: new screenshots of a run with better JVM warmup

