
Comments (34)

tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 16

  • The "abandoned command" thing turned out to be something different - it was possible, if two commands had the same creationTimestamp, and would both end up at the end of the same priority queue, for the second command to just get lost. This is fixed now! (Note: this exists in production as well) I've successfully run my test where I create 30 conflicting commands for each node in a three-node cluster, spam them in parallel at bedrock, and wait for it to process them all, successfully 1,000 times in a row, which is to say 90,000 total commands all being inserted and replicated properly.

So as of right now, there are no known strange crashes!

What's next:

  • Abandoned commands - I want to build a test for this to make sure these are handled correctly.
  • HTTPS - still not addressed, but still seems mostly straightforward to do (though less straightforward to test).
  • Unique and creationTimestamped requests used to get special handling which currently doesn't exist; this needs to be re-added.
  • Status needs to report the status of the rest of the cluster, as well.
  • PRAGMA schema_version should still be looked at again, I'm not sure if it's a performance issue, though it's currently mostly-disabled (which is fine if queries don't change the DB schema).
  • It would be nice to test performance. We can use a large-ish test DB and do a bunch of UPDATES and see if they conflict roughly as frequently or infrequently as we expect.

Testing some of these things is hard (how do I know if an HTTPS request was repeated, for example), so I've added some scaffolding to the testing so we'll be able to inspect the logs as part of the tests.

I feel like at this point, all of these things are loose ends. Nothing major is holding this up, we just need to verify we haven't broken anything and add back a few small things that are missing.

Then we'll get to see if it's actually any faster.

from bedrock.

tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 15

  • The main problem we had around COMMIT just plain blocking for 30+ seconds for no reason is fixed! Or rather, avoided, as it only seems to happen on NFS filesystems. The tests are no longer being run with DBs on NFS filesystems.

  • We may want to re-visit the PRAGMA schema_version issue now, and see if it's still a thing.

  • It seems there might be a bug around handling abandoned commands. This bug is very hard to trigger, though not so hard I wouldn't expect it to happen in production. More investigation is needed. I've updated the tests locally to try and put myself in a position to investigate this. This causes the test to hang until it times out, though that might not be required. First I need to figure out why a command is getting abandoned in the first place, and then I need to handle that case correctly. In my last test run it took me 38,000 commits to trigger this bug.

  • Other issues from before remain, notably cleanup and the incomplete STATUS command.

Current priority is the abandoned command issue.


mcnamamj commented on August 27, 2024

For administrative purposes, is it super painful to ALWAYS have raw SQL and MAY have binary changeset? If binary exists, use that, otherwise use raw SQL? We use the raw SQL multiple times a week for auditing purposes and debugging.


danielk1977 commented on August 27, 2024

That looks like a good plan to start with.

For steps 3 and 4, I think you need to use the code on the "changebatch" branch of the SQLite project:

https://www.sqlite.org/src/timeline?t=changebatch

It may also be that the processing in step 2 should be in SQLite::commitTransaction() or similar, so that the changeset for the transaction is constructed right before the transaction is committed.

The following is copied from an email from drh on August 24 that explains more of the details. There are a couple of follow ups in the same thread to do with thread-safety issues that are probably worth glancing at as well.

Dan Kennedy.

We now have what we believe to be a reasonable changeset conflict
detector. To use our conflict detector:

(1) Compile using the latest code on the "changebatch" branch of SQLite.

All versions on that branch:
     https://www.sqlite.org/src/timeline?t=changebatch
Tarball of the latest code:
     https://www.sqlite.org/src/tarball/sqlite.tar.gz?uuid=changebatch

Using the tarball above, build using "./configure; make sqlite3.c" then
link the resulting "sqlite3.c" source file as you normally would.

(2) Also link in the source file "sqlite3changebatch.c" with your build.
That source file is found in the ext/session subfolder of the tarball.
There is a corresponding "sqlite3changebatch.h" header file.

(3) Use the sqlite3session_fullchangeset() interface (in place of
sqlite3session_changeset()) to extract the changesets. No other
core sessions interfaces need to change.

Docs: https://www.sqlite.org/src/artifact/c772b5440f?ln=284-295

(4) See the comments in the sqlite3changebatch.h header file for
interface documentation for the conflict detector.

Summary:
  *  There is a new sqlite3_changebatch object.
  *  sqlite3changebatch_new() is the constructor for the object.
  *  sqlite3changebatch_delete() is the destructor.
  *  The sqlite3changebatch_add() method examines a single new full changeset
     object and reports if the new changeset conflicts with any
     previously examined changeset.  (The first call always reports
     no conflicts.)
  *  The sqlite3changebatch_zero() method resets the accumulated
     conflict information.  This is similar to _delete() followed
     by _new(), only faster.

(5) Be sure to compile everything with SQLITE_ENABLE_SESSION.


quinthar commented on August 27, 2024

@danielk1977 Awesome, thank you. I've updated based on your comments, can you please re-review? Also, a couple questions:

  1. Is it expected that the changebatch code will forever remain on a branch, and not in the mainline? If so, are you updating this branch with each release?

  2. Do you anticipate any significant performance impact from using the session interface to gather the changeset?

  3. Would you expect applying the changeset to be faster than executing the full query itself? In other words, would you expect binary replication to be more efficient than SQL replication in terms of slave CPU usage?

  4. Would you expect the size of the changeset to be larger or smaller than the SQL itself? (Obviously this depends heavily on the query, but let's say for a simple INSERT.)


tylerkaraszewski commented on August 27, 2024

I feel like there are probably some improvements we could make in the suggestion for Phase 3. It might involve a bit of an overhaul to some of bedrock's internals, but we could probably make things a bit more straightforward than the combination of what's suggested above and the current implementation (for example - why distinguish read vs. write threads at all? Make them all read/write threads).

But before that, we've got a bit of work to do in phase 1 just to prove this is workable, so we can probably defer the design discussion for phase 3 for a bit.


danielk1977 commented on August 27, 2024

That all looks to be as discussed a few months back. With much more detail though.

  1. Is it expected that the changebatch code will forever remain on a branch, and not in the mainline? If so, are you updating this branch with each release?

It's on a branch at the moment to preserve flexibility - once it's on the trunk it's more difficult to change the interface or functionality. If we determine that it can be used in Bedrock we'll move it to the trunk.

I updated the branch to match the SQLite trunk yesterday. We can update the branch with each release while you're experimenting with changebatch easily enough.

  2. Do you anticipate any significant performance impact from using the session interface to gather the changeset?

Possibly. But there are lots of factors. There might be three sources of overhead:

  • The first time a session sees a change on a table it has to query SQLite for the schema of that table, which involves a couple of queries. This overhead might mean we need an interface to zero a session object - so that you don't need to create a new session for every transaction, but can zero an existing one so that it can be reused by the next transaction.

  • Each time a change is made to the database, the PK of the affected row is stored in memory by the sessions module. I think this overhead is fairly minimal.

  • When you generate a changeset, sessions runs a db query for every PK in memory (to get the final values to put in the changeset). Not sure how significant this is going to be - all the pages these queries hit will be in the cache (since they were just written), but there will still be CPU overhead.

  3. Would you expect applying the changeset to be faster than executing the full query itself? In other words, would you expect binary replication to be more efficient than SQL replication in terms of slave CPU usage?

I think it will be quicker, yes. How much depends on what the original queries were. Applying the binary changeset might use less CPU as:

  • It will do all its work using pre-prepared queries.
  • The queries will sometimes be simpler (as for UPDATE and DELETE statements the WHERE clauses will be straight lookups-by-PK, even if the original queries were more complex).

  4. Would you expect the size of the changeset to be larger or smaller than the SQL itself? (Obviously this depends heavily on the query, but let's say for a simple INSERT.)

For an INSERT, the changeset contains the same information, so they probably compress to roughly the same size. Uncompressed the changeset is likely a bit smaller, simply because an SQL INSERT statement uses more bytes of space for formatting.

For UPDATE and DELETE statements that affect many rows, the changeset might be larger than the SQL. For UPDATE and DELETE operations that affect only a single row, a "patchset" likely contains similar information as the SQL UPDATE or DELETE, so will likely compress down to the same size. Uncompressed, it's probably a bit smaller, for the same reason that an INSERT is. A "changeset" or "fullchangeset" might be a bit larger.


quinthar commented on August 27, 2024


tylerkaraszewski commented on August 27, 2024

@quinthar On it!


quinthar commented on August 27, 2024

@tylerkaraszewski Any progress on this?


tylerkaraszewski commented on August 27, 2024

Not much. Starting on it again today, we wanted to finish up auto-export by the end of the year, and combined with the holidays, not a lot else got done on this in the past couple weeks. I expect to be working on this through offshore (unless it's completed sooner).


quinthar commented on August 27, 2024

@danielk1977 - It just occurred to me that our plan for multi-threaded writes depends on two features that aren't in the trunk:

  • BEGIN CONCURRENT to lock individual pages rather than the entire database, for writes
  • "deferred page allocation" to avoid multiple transactions competing to lock the page that contains the "free page list" by doing all new page assignment in the COMMIT

What is the status of each of these, and can we get a branch that has both of these and "changebatches"?


danielk1977 commented on August 27, 2024

Sorry about the delay on this. The "begin-concurrent" branch now has the changebatch functionality on it:

http://www.sqlite.org/src/info/50fb1eb368f14e4d

Since neither of these changes are on the trunk, they are both still experimental. Which means we can change them to respond to Bedrock's requirements. If Bedrock can use either or both in production, we'll likely move them to trunk.


quinthar commented on August 27, 2024


danielk1977 commented on August 27, 2024

Deferred page allocation, assuming I'm not misunderstanding things, is part of the begin-concurrent feature. So it's already on the branch.


tylerkaraszewski commented on August 27, 2024

Alright, I'm downloading the tarball here: http://www.sqlite.org/src/info/50fb1eb368f14e4d to work on this from, as that seems to be the latest in the 'begin-concurrent' branch. Let's see what happens.


tylerkaraszewski commented on August 27, 2024

This seems to build a suitable sqlite3 console binary that has all of JSON, BEGIN CONCURRENT, and the SESSION extension.

gcc shell.c sqlite3.c -lpthread -ldl -DSQLITE_ENABLE_JSON1 -DHAVE_READLINE=1 -o sqlite3 -L/usr/lib/x86_64-linux-gnu -lreadline -DSQLITE_ENABLE_SESSION -DSQLITE_ENABLE_PREUPDATE_HOOK

This requires having built the SQLite amalgamation (which you can do with ./configure && make sqlite3.c, as long as you have tclsh installed).

So I think I've probably got the right code to start from here, now to stuff this into Bedrock and see if it works...


tylerkaraszewski commented on August 27, 2024

And there's a branch in bedrock using this SQLite here: https://github.com/Expensify/Bedrock/compare/tyler-multi-write?expand=1

It doesn't do anything yet except build against a new sqlite3.c.


tylerkaraszewski commented on August 27, 2024

One interesting thing to add (and I plan to do this inside the implementation for "phase 1") is to time how long we spend inside the lock held in commit(). This needs to be significantly smaller than our overall CPU usage for us to gain much benefit from this change (it probably is, but if it isn't, then this isn't worth doing).
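A minimal sketch of that measurement, using Python's threading as a stand-in for Bedrock's worker threads (process_command and the sleep durations are hypothetical stand-ins, not Bedrock code):

```python
import threading
import time

# Measure what fraction of total command-processing time is spent
# holding the commit lock, across several concurrent workers.
commit_lock = threading.Lock()
stats_lock = threading.Lock()
lock_time = 0.0    # seconds spent holding commit_lock, summed over threads
total_time = 0.0   # seconds spent processing commands, summed over threads

def process_command(work_s, commit_s):
    global lock_time, total_time
    start = time.perf_counter()
    time.sleep(work_s)                    # stand-in for process()
    with commit_lock:
        commit_start = time.perf_counter()
        time.sleep(commit_s)              # stand-in for COMMIT()
        held = time.perf_counter() - commit_start
    elapsed = time.perf_counter() - start
    with stats_lock:
        lock_time += held
        total_time += elapsed

threads = [threading.Thread(target=process_command, args=(0.02, 0.002))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

fraction = lock_time / total_time
print(f"time under commit lock: {fraction:.1%}")
```

If the measured fraction is small, most of each command's wall time is parallelizable work rather than serialized commit.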


tylerkaraszewski commented on August 27, 2024

Here's a reiteration of @quinthar's proposal, just for my own understanding. There are some important questions later on in here, though.

Reiteration of @quinthar's Proposal

When the master node starts up, we create a changebatch. Call this changebatch 0. (really, whatever the last changebatch was, +1).

For each transaction (regardless of number of threads), we create a changeset. When we get to the point where we're about to commit this changeset, we add it to the changebatch, to see if it conflicts with any other change in the batch.

If it doesn't, great! We can commit it, and send off the transaction to be committed on all our other nodes. It's marked as being part of changebatch 0, as well as having its own ID within that. It doesn't matter which thread on master did the commit, or which thread writes it on a slave.

But if it does conflict: Hold the phone! At this point, changebatch 0 is done, we close it and create changebatch 1. We re-add this changeset to changebatch 1 (which won't conflict, because there's nothing in there), and commit it to the DB. And send it off to be committed on our other nodes. When those nodes receive it, they'll see they're still on changebatch 0 and they'll hold off. They can't apply this change until they've got all the changes from changebatch 0...

How do they know this? I don't think we've worked this out yet. Maybe this message includes the highest session number from the previous changebatch? Do we have to make this node go SYNCHRONIZING if it hasn't seen all the entries from the previous changebatch?

Then, when all the nodes are back on the same changebatch, things proceed as normal.
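One possible mechanism, per the "highest session number" idea above (hypothetical sketch - the message fields and ChangebatchBuffer are made up for illustration): each replicated commit carries (batch, seq, prev_high), where prev_high is the number of commits in the previous changebatch, and a slave holds commits from batch N+1 until it has applied all of batch N.

```python
class ChangebatchBuffer:
    """Slave-side buffer that applies commits in changebatch order."""

    def __init__(self):
        self.current_batch = 0
        self.seen = 0        # commits applied from the current batch
        self.held = []       # commits we can't apply yet
        self.applied = []    # (batch, seq) in the order actually applied

    def receive(self, batch, seq, prev_high):
        self.held.append((batch, seq, prev_high))
        self._drain()

    def _drain(self):
        progressed = True
        while progressed:
            progressed = False
            for item in list(self.held):
                batch, seq, prev_high = item
                # Same batch: apply immediately (order within a batch is
                # free). Next batch: only once every commit of the
                # current batch (prev_high of them) has been applied.
                ready = (batch == self.current_batch or
                         (batch == self.current_batch + 1
                          and self.seen == prev_high))
                if ready:
                    if batch != self.current_batch:
                        self.current_batch, self.seen = batch, 0
                    self.seen += 1
                    self.applied.append((batch, seq))
                    self.held.remove(item)
                    progressed = True

buf = ChangebatchBuffer()
buf.receive(1, 1, 3)   # batch 1 arrives early: held back
buf.receive(0, 1, 0)
buf.receive(0, 2, 0)
buf.receive(0, 3, 0)   # batch 0 complete; the held commit now applies
print(buf.applied)     # → [(0, 1), (0, 2), (0, 3), (1, 1)]
```

Whether a node that falls too far behind should instead go SYNCHRONIZING, as asked above, is still an open question.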

Really Important Question

So this work exists to make replication to slaves faster. That'd be nice, but it's not really our current problem. Our current problem is that we can only process commands on one thread.

This new proposal still has all of our writes serialized. We need to grab a mutex to do a COMMIT, so fundamentally, that's still our choke point, and even if we have 16 threads running, they're all going to bottleneck here. This is fine, if most of the work we do isn't in the COMMIT (and it probably isn't), but if it is, we don't actually gain any performance here.

On the other hand, if the COMMIT is (say) 5% of our processing time, then we should be able to run 20 simultaneous threads before we start saturating this choke point.

So, what we really want to do then, is have multiple threads only need to lock the DB on COMMIT, and not from BEGIN TRANSACTION through COMMIT which is the current, single-threaded model.

Nothing in the above proposal addresses this. Presumably, the BEGIN CONCURRENT change helps with this, and it seems like that's its intention, but we haven't addressed that here. In a previous email thread, this was said (by Richard Hipp):

The "COMMIT" command itself is still serialized, but all of the reading and updating that happens in between the BEGIN and the COMMIT can, in theory, run in parallel on multiple threads. The COMMIT might fail here, if another concurrent transaction changes some part of the database that the current transaction has read, and so the writer must be prepared to abandon the transaction and start over. But assuming that the concurrent writers are making truly independent changes, concurrency is possible here.

That's exactly what we want, but we haven't addressed this issue of "the writer must be prepared to abandon the transaction and start over," which I think is really the core problem we need to solve. If we can't solve what to do when two changes conflict here, we don't have multi-threaded writes.

Maybe we solve this by simply changing BEGIN TRANSACTION to BEGIN CONCURRENT in the existing code, and if a COMMIT fails due to a conflict, we re-queue the command that conflicted and let it run again. Then we just need to add a second write thread and see what happens.

This doesn't use the changebatch or changeset functionality at all, but, if we figure out the conflicts, it gets us the ability to use 16 threads simultaneously on our master node, which is really what we need more than anything.
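The re-queue idea can be sketched generically (hypothetical code: ConflictError stands in for SQLITE_BUSY_SNAPSHOT from a failed COMMIT, and run_with_requeue/process are made-up names, not Bedrock internals):

```python
import queue

class ConflictError(Exception):
    """Stand-in for SQLITE_BUSY_SNAPSHOT returned at COMMIT time."""

def run_with_requeue(commands, process, max_attempts=5):
    q = queue.Queue()
    for cmd in commands:
        q.put((cmd, 1))
    results = []
    while not q.empty():
        cmd, attempt = q.get()
        try:
            results.append(process(cmd))     # BEGIN CONCURRENT ... COMMIT
        except ConflictError:
            if attempt >= max_attempts:
                raise                        # give up on a hot conflict
            q.put((cmd, attempt + 1))        # re-queue and run again
    return results

# Simulated workload: command "b" conflicts once, then succeeds on retry.
state = {"b_conflicted": False}

def process(cmd):
    if cmd == "b" and not state["b_conflicted"]:
        state["b_conflicted"] = True
        raise ConflictError()
    return f"committed {cmd}"

results = run_with_requeue(["a", "b", "c"], process)
print(results)   # → ['committed a', 'committed c', 'committed b']
```

Note the conflicting command commits later than it was queued, which is fine as long as commands are independent.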

Here are some examples of this case:

  1. On master, two threads (essentially) simultaneously open two commands.
  2. These threads both (essentially) simultaneously call BEGIN CONCURRENT.
  3. These threads each perform some work.
  4. One thread or the other finishes first, grabs the COMMIT mutex, and writes to the DB.
  5. The other thread finishes second, grabs the COMMIT mutex, and writes to the DB.

Say both threads did roughly the same thing:
UPDATE table SET column = $x WHERE key = 1; where $x was different for each command. Do we let the last writer win, or do we let the second writer fail and start over?

Or say we have a different case: one command deletes a policy, where another moves a report to that policy. If these are processed in parallel, the commits don't even conflict, do they? They update different tables. (maybe having to read from the policies table from the other command will cause a conflict? Maybe only if reports.policy is a foreign key from policies or something like that?)

If we can handle those cases with two write threads, then we can get to a situation where we don't hit a bottleneck until we've got so many threads that most of them are waiting on the COMMIT lock.

On the other hand, if we don't solve this problem, then multi-threaded replication isn't even interesting to talk about, because it's less of a bottleneck than multi-threaded writes are, and we haven't figured that out.

Unless I'm misunderstanding something here, I'd be more in favor of doing this change differently:

  1. Make the change to use BEGIN CONCURRENT.
  2. Add handling for conflicting COMMITs on concurrent transactions.
  3. Increase the number of write threads and see how that affects performance.
  4. Worry about parallel serialization after this works.

So, the title of this issue is "Add Multi-threaded replication", and that's accurate for what it describes, but what I think we really need to address first is "Add multi-threaded writing to DB" and see if we can make that work.


danielk1977 commented on August 27, 2024

Say both threads did roughly the same thing:
UPDATE table SET column = $x WHERE key = 1; where $x was different for each command. Do
we let the last writer win, or do we let the second writer fail and start over?

The way I understood things, the latter. SQLite will return SQLITE_BUSY_SNAPSHOT when the second of the two writers attempts to COMMIT. It has no option but to execute a ROLLBACK and start again.

I'm not sure I completely understand the reports/policies example. Say command 1 reads table "reports" and writes "policies", and command 2 reads "policies" and writes "reports". Command 1 is committed first. Then, when the user tries to commit command 2, SQLite will observe that command 2 had read pages (from table reports) that were modified concurrently by command 1 and refuse to commit. In other words, it will only be possible to commit both commands if they could have been run serialized with exactly the same results for all read operations.
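Stock SQLite in WAL mode (without the begin-concurrent branch) already enforces this snapshot rule for a connection that reads and then tries to write. A minimal demonstration, using Python's standard-library sqlite3 module as a stand-in, assuming a filesystem that supports WAL:

```python
import os
import sqlite3
import tempfile

# A file-backed DB is needed so two connections share the same WAL.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

c1 = sqlite3.connect(path, timeout=0.1)
c1.execute("PRAGMA journal_mode=WAL")
c1.execute("CREATE TABLE t(k INTEGER PRIMARY KEY, v INTEGER)")
c1.execute("INSERT INTO t VALUES (1, 0)")
c1.commit()

c2 = sqlite3.connect(path, timeout=0.1)
c2.isolation_level = None                  # manage transactions explicitly
c2.execute("BEGIN")
c2.execute("SELECT v FROM t WHERE k = 1").fetchone()  # pins a read snapshot

c1.execute("UPDATE t SET v = 1 WHERE k = 1")          # concurrent writer
c1.commit()

try:
    c2.execute("UPDATE t SET v = 2 WHERE k = 1")      # snapshot is stale
    conflicted = False
except sqlite3.OperationalError:                      # "database is locked"
    conflicted = True
c2.execute("ROLLBACK")
print("second writer must roll back and retry:", conflicted)
```

The second writer's only option, as described above, is ROLLBACK and start again.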

While working on the performance problem caused by the SQLite upgrade last year we saw some data that suggested COMMIT processing time was relatively small. That said, I think that's your biggest risk too - that the BEGIN CONCURRENT mechanism does a poor job of increasing concurrency on the master.

Speculative Question

I was trying to figure out a contingency plan for this (BEGIN CONCURRENT providing insufficient concurrency) late last year as it happens. It seemed to me then that using SQLite's wal mode is limiting the amount of concurrency we can get in a couple of ways:

  • all writers have to write to the same wal file, which means serialized COMMIT operations, and
  • there is no way for writers to coordinate page-allocations mid-transaction, which means we need the deferred page-allocation strategy - slowing down COMMIT operations.

One way around this is to leave the wal file out of the system by building more conventional READ/WRITE page-level locking on top of SQLite's rollback mode. Without the wal file the system loses MVCC. The way I worked it out, this would mean:

  • Readers, as well as writers, would have to hold READ locks on each page read for the duration of the read-transaction. So long-running readers could block any writers trying to take a WRITE lock on the same page. This is less concurrency than provided by the current "BEGIN CONCURRENT", where read-only transactions never interfere with writers.
  • There would be a limit to the maximum number of connections to a single database file. That limit might be somewhere around 50.

How serious are the above for Bedrock? Are ideas that include those limitations worth pursuing?


tylerkaraszewski commented on August 27, 2024

@danielk1977 - I think you've answered my most important question with this:

SQLite will observe that command 2 had read pages (from table reports) that were modified concurrently by command 1 and refuse to commit.

That means we can do this:

and so the writer must be prepared to abandon the transaction and start over.

Without worrying about cases in which the written rows don't conflict, but read rows do. This is something I wasn't sure about.

suggested COMMIT processing time was relatively small.

This is what I expect, too, and our current model works like this:

  1. BEGIN TRANSACTION
  2. Perform significant work (not all of which is DB-centric)
  3. COMMIT

It looks like BEGIN CONCURRENT lets us parallelize the "perform work" step above, even though step 3 remains mostly serial. This, I think, would be a big win for us, but I can't really confirm until we try it. It's also highly dependent on how often we have to "give up and start over" on a transaction because of conflicting commits. As long as that number is small, I think we've got lots of performance to gain.

Assume we spend 90% of our time in "perform work" and 10% of our time in COMMIT. This should let us run 10 parallel operations before we start to see a performance slowdown. If, say, half of those 10 operations cause conflicts (this number seems exceptionally high to me, I'd guess in the real world it's more like 1%), then those initial 10 operations become 15 when we include re-processing the 5 conflicting ones (and, potentially, the summation of the series where we keep redoing half of the operations each time we start over), but even in that worst-case scenario, we end up processing 2x the amount of operations we started with, but with 10x the capacity, for a 5x performance improvement.
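The back-of-envelope arithmetic above can be checked directly (the percentages are the thread's illustrative assumptions, not measurements):

```python
# COMMIT is assumed to be 10% of a command's time, so roughly 10
# commands can run in parallel before the commit lock saturates.
commit_fraction = 0.10
parallelism = 1 / commit_fraction               # 10 threads

# If half of all attempts conflict and must be redone, total attempts
# per committed command form a geometric series: 1 + 1/2 + 1/4 + ... = 2.
conflict_rate = 0.5
attempts_per_commit = 1 / (1 - conflict_rate)   # 2.0

# Net speedup: 10x the capacity, but 2x the work per committed command.
speedup = parallelism / attempts_per_commit
print(speedup)   # → 5.0
```

At a more realistic 1% conflict rate the redo overhead is negligible and the speedup approaches the full 10x.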

Given this information, I think the first change we should make is switching to BEGIN CONCURRENT, and then, using the current single-writer model, measuring the amount of time we spend in COMMIT. This will show us our theoretical maximum performance increase from using BEGIN CONCURRENT.

If we find that this number is low (say, the 10% in the above example, or lower), then I think we have lots of room to work with, and we implement COMMIT conflict resolution by re-trying transactions, and add a second writer.

If that works well, and we're seeing performance improvements, we can move on to improving synchronization performance between DB nodes. Otherwise, we'll have to look at other options, potentially including your speculative options, but I don't think we're there yet.


tylerkaraszewski commented on August 27, 2024

Bedrock Parallel Writer Design Doc


This document describes V1 of a version of Bedrock that can do multiple parallel writes. This is a complex project that has shown a need for adjustment as new understanding of the limitations of existing code comes to light, so it's expected that this document will be somewhat fluid. The goal of V1 of this project is mainly to be a proof-of-concept to demonstrate that we can build a working, parallel-writer Bedrock with demonstrable performance improvements over the existing single-writer bedrock. It's expected that we'll add refinement and other improvements in V2. Some portions of a parallel-writer Bedrock will explicitly be left to V2. For example, parallel synchronization between nodes is left to V2. V1 is supposed to be the least work possible to see a real-world performance increase from multiple writers.

Fundamental Support for Parallel Writes - BEGIN CONCURRENT.

The core bit of technology that allows us to do this is a new SQLite feature named BEGIN CONCURRENT. This is equivalent to BEGIN TRANSACTION, except that no pages in the DB are locked until a COMMIT statement is issued. It is possible that more than one concurrent transaction attempts to modify the same page. This will result in a "first-wins" scenario where the second writer will receive a CONFLICT error upon commit. We need to be able to reasonably handle this conflict error in order to proceed forward with this plan.

SQLiteNode::update() and the Existing COMMIT Flow.

We currently commit transactions from inside SQLiteNode::update(). If we are the MASTER node, we will run process(), which is the only function that can write to the DB, from inside update(). We then do a distributed COMMIT, wherein we wait for some or all of our peers to acknowledge and approve the transaction. This looks much like the following pseudo-code:

while(true) {
    if (_currentTransaction) {
        // Have our peers acknowledged/approved this?
        if (sufficientApproval) {
            COMMIT();
            _currentTransaction = 0;
        }
    }

    if (commandToProcess) {
        process(command);
        
        // Did the command try to write to the DB?
        // If so, we're ready to COMMIT, but need approval of peers.
        if (command.writeQueries.size()) {
            for (peer : peers) {
                sendToPeer(peer, command.writeQueries);
                _currentTransaction = command;
            }
        }
    }
}

You can see here that when we're ready to COMMIT a transaction, what we actually do is send that transaction to our peers for approval. Then, on the following loop iteration, if enough of our peers have responded to the message, and approved the transaction, then we'll perform the actual COMMIT, and only THEN will we start to work on the next command to process. This means that a COMMIT, while intended to be a quite quick operation on the local DB, is inherently slow, because we've turned it into a network operation.

Our postulation is that COMMIT should be quite fast, while process() should be quite slow, so if we lock around COMMIT (as SQLite requires), we can still process() many commands in parallel, and they'll wait on a COMMIT lock, but this won't be a problem because COMMIT is significantly faster than process(). The current model doesn't look like it works for this, because of the slow network operations for commit().

We would still, strictly, be able to do a lock around the networked COMMIT, if process() is sufficiently slow that we can accomplish multiple networked COMMITs in the time it takes to run a single process(), but this will have limited scalability. Further, and perhaps more importantly, it makes rolling back a conflict convoluted. The networked version of COMMIT conflict handling looks like:

if (sufficientApproval) {
    result = COMMIT();
    if (result == conflict) {
        for (peer : peers) {
            sendToPeer("ROLLBACK");
        }
        handleConflict(command);
    }
    else {
        for (peer : peers) {
            sendToPeer("COMMIT");
        }
    }
    _currentTransaction = 0;
}

Proposed COMMIT Flow

We need to speed up our COMMITs to make the lock around them scale sufficiently for multiple writers to work. The proposal is to remove the network aspect of our commits, such that the update() loop from above changes to:

while(true) {
    for (command : _currentTransactions) {
        if (sufficientApproval(command)) {
            _currentTransactions.remove(command);
        } else {
            // TODO: V2 Detect and resolve.
            FAIL_LOUDLY();
        }
    }

    if (commandToProcess) {
        process(command);
        
        // Did the command try to write to the DB?
        // If so, we're ready to COMMIT, but need approval of peers.
        if (command.writeQueries.size()) {
            result = COMMIT();
            if (result != conflict) {
                _currentTransactions.push_back(command);
                for (peer : peers) {
                    sendToPeer(peer, command.writeQueries);
                }
            }
            else {
                handleConflict(command);
            }
        }
    }
}

This means that we never communicate about a transaction that results in a conflict to slaves, simplifying the conflict resolution case drastically. It also means that we don't have to wait for network operations to complete a COMMIT, drastically reducing the amount of time we need to hold a lock for each transaction, allowing for more parallelism.

However, this opens up the possibility for the master DB to fork from its slaves, by committing transactions that the slaves haven't approved. The most likely scenario for this case is that the master node has lost connectivity to the cluster, which has elected a new master, which is committing its own transactions to its database. We need to detect and resolve this somehow. The proposal for V1 is essentially to fail loudly as soon as we see any problem, and then bring the node back up as a slave. We'll need to roll back recent commits in this case, which is as yet an incomplete exercise. For V2, we can use the reversibility of SQLite's changesets to undo a commit, but details of both solutions are to be worked out in the future. In practice, distributed transactions are almost always accepted, and thus we can handle only the success case for a proof-of-concept version of this, as long as we notice the (very exceptional) failure case.

Parallelizing the new flow.

Above, we've shown an update loop that is no longer dependent on network operations to perform commits. Parallelizing this loop is, in principle, fairly simple, but in practice has some nuance to make sure it works correctly. First, let's look at the parallelized version of the pseudo-code loop:


static mutex commit_lock;

// THE FOLLOWING LOOP RUNS IN MULTIPLE THREADS:

while(true) {
    for (command : _currentTransactions) {
        if (sufficientApproval(command)) {
            _currentTransactions.remove(command);
        } else {
            // TODO: V2 Detect and resolve.
            FAIL_LOUDLY();
        }
    }

    if (commandToProcess) {
        process(command);
        
        // Did the command try to write to the DB?
        // If so, we're ready to COMMIT, but need approval of peers.
        if (command.writeQueries.size()) {
            lock(commit_lock);
            result = COMMIT();
            if (result != conflict) {
                _currentTransactions.push_back(command);
                for (peer : peers) {
                    queueMessage(peer, command.writeQueries);
                }
            }
            else {
                handleConflict(command);
            }
            unlock(commit_lock);
        }
    }
}

This adds a lock around COMMIT, which also encompasses sending messages to peers for that COMMIT. This ensures that all peers receive the queries in the correct order, and can apply them the same way MASTER did. However, while each peer must process, and then respond to, these messages strictly in order, there's no guarantee that responses from different peers arrive in order relative to one another: we may see one peer's response for a later command before another peer's response for an earlier one. We hold a list of outstanding commands, and as peers acknowledge them, we remove those commands from the outstanding list. If at any point we decide a command has failed, we fall into our FAIL_LOUDLY() call and must revert to an earlier version of the DB. We can keep track of the last transaction that was verified by sufficient peers, but we haven't yet designed the mechanism for rolling back to that version of the DB.

Alternative COMMIT idea

It's been brought up that certain RDBMSes (Postgres, Oracle?) support a PREPARE COMMIT statement used for two-phase commit. If SQLite supported this, we could use it to grab locks in the DB at the point in the original COMMIT flow where we send the transaction to peers. The PREPARE could fail in exactly the cases where we would get a CONFLICT on COMMIT, and we could then handle the conflict with less change to our existing architecture.

Actually Implementing This With Multiple Threads.

Because we will have multiple threads running update() simultaneously, this code needs to be made thread-safe. This includes, at least, synchronizing reads from and writes to peers, synchronizing the transitions to and from the MASTERING state, and the locking already mentioned around COMMITs. It also requires changes to SQLiteNode, such that some member variables, like the PeerList, are shared between a pool of parallel SQLiteNodes. Other proposals have been made, such as having each SQLiteNode connect to its own essentially separate network of peers on the other servers in the cluster, which would avoid this particular issue, but I think that solution is much more complex in general than making PeerList a construct shared between multiple nodes.

Notably, this doesn't require many changes at all outside of the code for the MASTERING state, as there will not be multiple threads running update() in that scenario. This is explained further in the next section.

What Threads Do We Start, And What Do They Do?

We currently have a model with one write thread per node, and several read threads. Here's what happens now:

On a SLAVE:
A read thread processes commands, peeks them, and escalates them to the write thread if they require write privileges. The write thread communicates with MASTER, sending and receiving escalated commands, and synchronizing commits.

On MASTER:
A read thread performs a peek() on a command, and queues it for the write thread if it needs additional privileges. The write thread does all the important work inside update, of both running process() and sending messages to peers.

Here's the proposal for the change:
This proposal keeps these threads for the most part, but changes them to a single replication thread and multiple worker threads. The existing read and write loops are kept, but worker threads can seamlessly move from one execution loop to the other. The replication thread always runs the replication loop, while on each loop iteration, worker threads perform the following check:

while(true) {
    if (MASTER) {
        writeLoop();
    } else {
        readLoop();
    }
}

This means that for a slaving node, nothing changes. The replication thread does the same thing it always has, and the worker threads always execute the same read loop they always have. But as soon as the node is promoted to MASTER, all of the worker threads become write threads and work in writeLoop. This is almost unchanged from the existing write loop; it just adds a provision that on MASTER, our threads will look for escalated requests to process, and if they don't find any, they'll also look for regular queued requests to peek, like read threads normally do.

This has one interesting property in that we can remove the ability for a SQLite object to be created in a read-only mode. Since we'll be able to move seamlessly from read to read-write mode on promotion to master, we need to open our DB with the ability to write at all times. The only alternative to this is to close and re-open all our DB handles on promotion to MASTER (and again on demotion to any other state), and I don't see that as worth the work.

HTTPSRequests

One more thing worth discussing is the process of starting HTTPSRequests. This can currently only be done on MASTER, which is a property we'll keep, but it needs to be possible from any thread on MASTER. For this reason, I think it's best if we move handling the network activity for these requests out of each write thread (or, any thread operating in writeLoop, which is all of them on MASTER), and into the main thread, where all other network activity is handled. This is actually a fairly straightforward change, with the exception that a write thread needs to notify the main thread when it has added work to be processed (to interrupt its poll call), and vice versa, the main thread needs to interrupt the write threads when HTTPSRequests are completed. This can be done using the same mechanism that we currently use in MessageQueue. We could keep the existing model, but that requires each write thread to service the HTTPSRequests that it spawned, which isn't great if it's stuck on a long command while other threads sit idle when the response to the HTTPSRequest comes in.

Outstanding Issues:

Currently, every transaction adds a row to the end of the journal table, which we use for determining if nodes are in sync, and if not, how far behind they are. It's not clear if this will cause a conflict on every single commit, and if so, how to handle that.

from bedrock.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 9

  • In general, all existing tests pass.
  • Most of the "weirdnesses" that are easily reproducible have been worked out (i.e., things like "once in a while, it hangs, but if you kill it and restart it, the test completes successfully"). One of these remains: roughly every dozen test runs or so, I can get the test to hang, though it looks like all transactions have been completed and written to disk.
  • Status commands may or may not work correctly. There used to be special handling to escalate them from 'read' threads to the 'write' thread, but never from slaves to master. These concepts don't really make sense any more, so exactly what gets returned by Status needs to be looked into.
  • We don't currently handle conflicts for commands with https requests correctly. This is probably an easy fix.
  • There's plenty of cleanup to do around various things (lots of [TYLER] loglines in random places and such).
  • No performance metrics have been added (we'd like to see how much time we spend waiting on the core COMMIT mutex and things like that, as well as process() time per thread, count of conflicts and commits, etc).

Current priority is the random and infrequent hang-ups during testing.

quinthar avatar quinthar commented on August 27, 2024

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 10

  • The last random hang seems to have been fixed, in most scenarios. If we comment out the failover and restoreMaster tests before the "Conflict Spam" test, then it works 100% of the time (in several runs of 100 tests in a row). If we leave failover and restoreMaster in place beforehand, it occasionally fails - and restoreMaster itself fails approximately 5-10% of the time, which may be related to the Status command not being handled properly.

Next step is to fix restoreMaster and verify Status. Once those are verified working correctly, then we'll re-visit the 100x test runs of the "Conflict Spam" test, and see if the problem is gone for good.

Otherwise, the same issues left from yesterday still exist - mostly today I've been working on this change, which will also need to be merged into multi-write.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 13

  • Status has been fixed, where by "fixed" I mean, "doesn't return an error". It currently doesn't return the list of peers or the command queue for its own node, unless it just happens to run on the sync thread. This is an issue to fix further along.
  • With Status returning valid data, the clusterUp, failover and restoreMaster suite of tests works great.
  • The conflictSpam test still occasionally does something strange, and I've narrowed it down to two things in SQLite:
  1. We use PRAGMA schema_version to determine whether we've updated the schema of the DB when writing to it. This query doesn't seem to count as part of a transaction, and thus can cause any thread that runs it against SQLite to block indefinitely. I feel like we can work around this by banning changes to the DB schema after startup if we have to, but I've asked SQLite about the issue as well.

  2. More importantly, COMMIT occasionally takes 30+ seconds to return, even for trivially small databases. I've asked SQLite about this as well. I think this is what causes the conflictSpam test to fail - it simply times out. Regardless of whether or not this is the underlying cause of the conflictSpam failure, it's important to fix, because it blocks all write activity in bedrock until it completes. I'll revisit the conflictSpam test once we've figured out the slow COMMIT issue.

Otherwise, I've added support for logging the percentage of time we spend with the commitLock mutex locked, which is going to be our main performance bottleneck. We'll be able to graph this the same way we currently do master CPU usage.

Other issues from before remain. If we can figure out the COMMIT performance issues, then there are no other strange confounding issues that require much investigation currently known.

quinthar avatar quinthar commented on August 27, 2024

mcnamamj avatar mcnamamj commented on August 27, 2024

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 20

  • Less completed today than I really wanted, but Unique and creationTimestamp have been pretty thoroughly investigated. Unique requests are no longer supported (the Jobs plugin has its own notion of unique Jobs, which is separate from this). Their semantics were strange and there was no current use for them.

creationTimestamp requests are retained and working, though with the same restrictions as before - i.e., they only work correctly when sent directly to master. This is because responding to an ESCALATE message at some arbitrary time in the future doesn't make sense. This functionality could be added but isn't needed currently.

What's next (same as before, minus the above):

  • Abandoned commands - I want to build a test for this to make sure these are handled correctly.
  • HTTPS - still not addressed, but still seems mostly straightforward to do (though less straightforward to test).
  • Status needs to report the status of the rest of the cluster, as well.
  • PRAGMA schema_version should still be looked at again, I'm not sure if it's a performance issue, though it's currently mostly-disabled (which is fine if queries don't change the DB schema).
  • It would be nice to test performance. We can use a large-ish test DB and do a bunch of UPDATES and see if they conflict roughly as frequently or infrequently as we expect.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 21

I have a bit of a cold and am operating at less than 100%, but still got a fair amount done today:

  • Abandoned commands - There's a test for these now and they seem to work fine (i.e., not cause any problems for subsequent commands).
  • HTTPS - Tests created, passing, and verified. HTTPS seems to work just fine as long as the commands are sent directly to master (which we always do - they might work when sent to slaves, but this is untested). Even if commands end up conflicting and being re-tried, they won't re-send the HTTPS request.
  • Status - fixed and a test in place, uses the same basic mechanism that I built yesterday to hand creationTimestamp commands to the sync thread.

What's next (same as before, minus the above):

  • PRAGMA schema_version should still be looked at again, I'm not sure if it's a performance issue, though it's currently mostly-disabled (which is fine if queries don't change the DB schema).
  • It would be nice to test performance. We can use a large-ish test DB and do a bunch of UPDATES and see if they conflict roughly as frequently or infrequently as we expect.

New stuff to do:

  • I need to uncomment the lines that automatically add the request ID to log lines. Super-minor, but noting it here so that I don't forget.
  • final cleanup and PR review.

Looking at this list, I expect to have a PR created and in review by the end of the week.

EDIT: I also updated the Auth plugin branch for this, merged master, and verified all of its tests still pass, so that looks good as well.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

Current state of multi-write Bedrock

Daily update, Feb 22

  • PRAGMA schema_version: I've looked over this again and find no evidence to suggest any problems, so I'm returning it to its previous behavior and checking it off.
  • Log line prefixes are fixed.
  • I started trying to look into how to test this for performance, and for how many conflicts it might cause when we use it in production, but coming up with any effective test is challenging. I've decided to continue on this as I move into review.

What's next (same as before, minus the above):

  • final cleanup and PR review.
  • Continue performance testing investigation.

I expect to have a PR ready for review by tomorrow.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

PR created, further updates will be in there.

tylerkaraszewski avatar tylerkaraszewski commented on August 27, 2024

This issue is no longer useful.
