
Comments (11)

aborg-dev commented on September 28, 2024

> Andrei, thanks for your profiling! Can you please repeat this on this commit to check if (and how) my changes reduce the time consumed by self.post_json?

Yes, I plan to do this as part of the refactoring started in #11364. Concretely, I plan to add an API that allows the Locust user to submit a transaction directly, without waiting for its completion.
I'm still thinking about how exactly the Locust user should behave to avoid overloading the system with these async transactions, as we lose the backpressure mechanism that comes from waiting for each transaction to complete.
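For illustration, here is a minimal sketch of one way to keep backpressure with async submission - capping the number of in-flight transactions with a gevent semaphore (Locust runs users as gevent greenlets). The helper names (node.post_json, poll_until_final) are placeholders, not the final API:

import gevent
from gevent.lock import BoundedSemaphore

MAX_IN_FLIGHT = 100  # assumed cap; would need tuning per setup
in_flight = BoundedSemaphore(MAX_IN_FLIGHT)

def poll_until_final(node, tx_id):
    # Placeholder for the existing result-polling logic (poll_tx_result).
    pass

def submit_tx_async(node, signed_tx_b64):
    # Block the calling user greenlet once MAX_IN_FLIGHT transactions are
    # pending - this restores the backpressure lost with broadcast_tx_async.
    in_flight.acquire()
    response = node.post_json("broadcast_tx_async", [signed_tx_b64])
    tx_id = response.json()["result"]
    # Wait for the outcome in a background greenlet; free the slot when done.
    gevent.spawn(poll_until_final, node, tx_id).link(lambda g: in_flight.release())
    return tx_id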


aborg-dev commented on September 28, 2024

This should be the fix: aabf18b
Let's see if we can productionize it.


MCJOHN974 commented on September 28, 2024

Small note:

This fix uses FastHttpUser instead of HttpUser, which, according to the Locust docs, can perform up to 16000 requests per second instead of 4000. In my experience with our setup, switching from HttpUser to FastHttpUser gives us around 10-20% extra TPS, not 300%.
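For context, the switch is essentially a one-line change in the user class; a minimal sketch (the class name and host are made up for illustration; FastHttpUser is backed by geventhttpclient instead of python-requests):

from locust import FastHttpUser, task

class NearRpcUser(FastHttpUser):  # previously: class NearRpcUser(HttpUser)
    host = "http://127.0.0.1:3030"  # assumed local NEAR RPC endpoint

    @task
    def status(self):
        # The self.client API is compatible enough that most tasks
        # work unchanged after the switch.
        self.client.post("/",
                         json={"jsonrpc": "2.0", "method": "status",
                               "params": [], "id": "dontcare"})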


MCJOHN974 commented on September 28, 2024

This difference in ratio probably shows that in our case the bottleneck for Locust is the computation to generate and sign a transaction, not sending it to the node. (But I haven't yet read what the actual difference between FastHttpUser and the regular HttpUser is in Locust.)


MCJOHN974 commented on September 28, 2024

TL;DR: I'm not surprised that generating NEAR transactions is 10 times slower than the "best case scenario".

And it sounds reasonable to me: they claim about 4k TPS in the "best case scenario", so in their benchmark they are probably just continuously sending thousands of identical requests like "GET / HTTP/1.1" with "Host: example.com".

So their request generation probably takes close to zero CPU (while for us, CPU is the bottleneck for Locust), leaving all of the CPU for Locust internals. They probably also ran the server on the same Mac machine and made it as simple as possible, e.g. just sending back "HTTP/1.1 200 OK" with "Content-Length: 0".
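A hypothetical reconstruction of such a "do nothing" benchmark server, just to make the comparison concrete (everything below is my guess at their setup, not something from the Locust docs):

from socketserver import ThreadingTCPServer, StreamRequestHandler

class OkHandler(StreamRequestHandler):
    def handle(self):
        self.rfile.readline()  # read the request line, ignore everything else
        self.wfile.write(b"HTTP/1.1 200 OK\r\n"
                         b"Content-Length: 0\r\n"
                         b"Connection: close\r\n\r\n")

if __name__ == "__main__":
    ThreadingTCPServer(("127.0.0.1", 8080), OkHandler).serve_forever()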

But maybe I'm wrong.


aborg-dev commented on September 28, 2024

All fair observations - I want to confirm this by running Locust under a profiler.

Here is a sample profile of send_tx function:

Timer unit: 1e-09 s

Total time: 2612.09 s
File: /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py
Function: send_tx at line 268

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   268                                               @line_profiler.profile
   269                                               def send_tx(self, tx: Transaction, locust_name: str) -> dict:
   270                                                   """
   271                                                   Send a transaction and return the result, no retry attempted.
   272                                                   """
   273      3115        4e+11    1e+08     14.2          block_hash = self.final_block_hash()
   274      2937 5597757423.0    2e+06      0.2          signed_tx = tx.sign_and_serialize(block_hash)
   275                                           
   276      2937   10644362.0   3624.2      0.0          meta = self.new_locust_metadata(locust_name)
   277      2937    2838567.0    966.5      0.0          start_perf_counter = time.perf_counter()
   278                                           
   279      2937     717732.0    244.4      0.0          try:
   280      2937     467654.0    159.2      0.0              try:
   281                                                           # To get proper errors on invalid transaction, we need to use sync api first
   282      4222        2e+12    5e+08     85.5                  result = self.post_json(
   283      2937     698936.0    238.0      0.0                      "broadcast_tx_commit",
   284      2937   19713730.0   6712.2      0.0                      [base64.b64encode(signed_tx).decode('utf8')])
   285      1462  162273411.0 110994.1      0.0                  evaluate_rpc_result(result.json())
   286         1        747.0    747.0      0.0              except TxUnknownError as err:
   287                                                           # This means we time out in one way or another.
   288                                                           # In that case, the stateless transaction validation was
   289                                                           # successful, we can now use async API without missing errors.
   290                                                           submit_raw_response = self.post_json(
   291                                                               "broadcast_tx_async",
   292                                                               [base64.b64encode(signed_tx).decode('utf8')])
   293                                                           meta["response_length"] = len(submit_raw_response.text)
   294                                                           submit_response = submit_raw_response.json()
   295                                                           # extract transaction ID from response, it should be "{ "result": "id...." }"
   296                                                           if not "result" in submit_response:
   297                                                               meta["exception"] = RpcError(message="Didn't get a TX ID",
   298                                                                                            details=submit_response)
   299                                                               meta["response"] = submit_response.content
   300                                                           else:
   301                                                               tx.transaction_id = submit_response["result"]
   302                                                               # using retrying lib here to poll until a response is ready
   303                                                               self.poll_tx_result(meta, tx)
   304         1        328.0    328.0      0.0          except NearError as err:
   305                                                       logging.warn(f"marking an error {err.message}, {err.details}")
   306                                                       meta["exception"] = err
   307                                           
   308      4386    4688350.0   1068.9      0.0          meta["response_time"] = (time.perf_counter() -
   309      2924    1080117.0    369.4      0.0                                   start_perf_counter) * 1000
   310                                           
   311                                                   # Track request + response in Locust
   312      1462   77682295.0  53134.3      0.0          self.request_event.fire(**meta)
   313      1462     583757.0    399.3      0.0          return meta

2612.09 seconds - /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py:268 - send_tx

99% of the wall-clock time is split between these two lines:

  1. block_hash = self.final_block_hash()
  2. result = self.post_json("broadcast_tx_commit", ...)

Which are both RPC requests to the node.

The transaction signing takes 0.2% of the wall-clock time, which is non-zero but not that significant.

I think the next step is to understand the threading model that Locust is using - if these transaction submissions are executed on a single thread in a blocking fashion, that will surely limit the throughput. Instead, we should be using something like asyncio here.
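A back-of-envelope check with the numbers from the profile above (each broadcast_tx_commit call blocks for roughly 0.5s per hit):

rpc_latency_s = 0.5                        # per-hit time of broadcast_tx_commit above
tps_per_blocking_user = 1 / rpc_latency_s  # ~2 TPS for a user that blocks on the RPC
users_for_1000_tps = 1000 * rpc_latency_s  # ~500 concurrent users per 1000 TPS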


aborg-dev commented on September 28, 2024

To analyse this further: signing takes 2ms of wall-clock time, so we can do around 500 signatures per second in a single Python process, which would be a bottleneck in a single-worker setup. For 16 workers that should not be a problem, though.
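Spelled out, the signing ceiling implied by those numbers looks like this:

sign_time_ms = 2                           # measured wall-clock time per signature
tps_per_worker = 1000 / sign_time_ms       # ~500 TPS per Python process
tps_with_16_workers = 16 * tps_per_worker  # ~8000 TPS across 16 workers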

It would be nice to improve the performance of signing, but that is probably not the root cause of the limited throughput that we see with multiple shards.


aborg-dev commented on September 28, 2024

Here is another profile from a longer run with more users:

Timer unit: 1e-09 s

Total time: 6.04117e+06 s
File: /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py
Function: send_tx at line 268

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   268                                               @line_profiler.profile
   269                                               def send_tx(self, tx: Transaction, locust_name: str) -> dict:
   270                                                   """
   271                                                   Send a transaction and return the result, no retry attempted.
   272                                                   """
   273   2062925        5e+15    2e+09     80.7          block_hash = self.final_block_hash()
   274   1889490        2e+12    1e+06      0.0          signed_tx = tx.sign_and_serialize(block_hash)
   275                                           
   276   1889490 4870509915.0   2577.7      0.0          meta = self.new_locust_metadata(locust_name)
   277   1889490 1717009524.0    908.7      0.0          start_perf_counter = time.perf_counter()
   278                                           
   279   1889490  341764477.0    180.9      0.0          try:
   280   1889490  230492041.0    122.0      0.0              try:
   281                                                           # To get proper errors on invalid transaction, we need to use sync api first
   282   2528557        1e+15    5e+08     19.3                  result = self.post_json(
   283   1889490  337802006.0    178.8      0.0                      "broadcast_tx_commit",
   284   1889490 8958336492.0   4741.1      0.0                      [base64.b64encode(signed_tx).decode('utf8')])
   285    812412        8e+10 104359.0      0.0                  evaluate_rpc_result(result.json())
   286      2262    1064247.0    470.5      0.0              except TxUnknownError as err:
   287                                                           # This means we time out in one way or another.
   288                                                           # In that case, the stateless transaction validation was
   289                                                           # successful, we can now use async API without missing errors.
   290                                                           submit_raw_response = self.post_json(
   291                                                               "broadcast_tx_async",
   292                                                               [base64.b64encode(signed_tx).decode('utf8')])
   293                                                           meta["response_length"] = len(submit_raw_response.text)
   294                                                           submit_response = submit_raw_response.json()
   295                                                           # extract transaction ID from response, it should be "{ "result": "id...." }"
   296                                                           if not "result" in submit_response:
   297                                                               meta["exception"] = RpcError(message="Didn't get a TX ID",
   298                                                                                            details=submit_response)
   299                                                               meta["response"] = submit_response.content
   300                                                           else:
   301                                                               tx.transaction_id = submit_response["result"]
   302                                                               # using retrying lib here to poll until a response is ready
   303                                                               self.poll_tx_result(meta, tx)
   304      2262     490022.0    216.6      0.0          except NearError as err:
   305      2188  144508862.0  66046.1      0.0              logging.warn(f"marking an error {err.message}, {err.details}")
   306      2188     804307.0    367.6      0.0              meta["exception"] = err
   307                                           
   308   2437236 2564044058.0   1052.0      0.0          meta["response_time"] = (time.perf_counter() -
   309   1624824  519629076.0    319.8      0.0                                   start_perf_counter) * 1000
   310                                           
   311                                                   # Track request + response in Locust
   312    812412        4e+10  49686.0      0.0          self.request_event.fire(**meta)
   313    812412  303569684.0    373.7      0.0          return meta

6041174.94 seconds - /home/akashin/repos/work/nearcore/pytest/tests/loadtest/locust/common/base.py:268 - send_tx

Here final_block_hash is taking a massive 80% of the time and would be the first candidate to optimize.

Signing takes 1ms on average, which is similar to the previous run.

The remaining 20% is in broadcast_tx_commit, which takes 500ms on average.

I'll go ahead and try a few local optimizations to improve the single-thread performance.
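One candidate sketch for the final_block_hash hot spot (an assumption on my side, not the change that actually landed): a NEAR transaction only needs a reasonably recent block hash, so the hash can be cached for a short TTL instead of being refetched over RPC for every transaction:

import time

_cached_hash = None
_cached_at = 0.0
CACHE_TTL_S = 1.0  # assumed; far below the window in which a block hash stays valid

def cached_final_block_hash(node):
    # Refresh the hash at most once per CACHE_TTL_S instead of once per
    # transaction, eliminating most of the calls behind the 80% hot spot.
    global _cached_hash, _cached_at
    now = time.monotonic()
    if _cached_hash is None or now - _cached_at > CACHE_TTL_S:
        _cached_hash = node.final_block_hash()  # the expensive RPC call
        _cached_at = now
    return _cached_hash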


aborg-dev commented on September 28, 2024

For reference, here is the result of running the sampling profiler py-spy in the same setup:
[image: py-spy flame graph of the same run]


MCJOHN974 commented on September 28, 2024

Andrei, thanks for your profiling! Can you please repeat this on this commit to check if (and how) my changes reduce the time consumed by self.post_json?


aborg-dev commented on September 28, 2024

We can now generate 1000 TPS with a single Locust process at 0.8 CPU core usage, which should be good enough for our benchmarks.

