c0c0n3 avatar c0c0n3 commented on July 23, 2024

Hello @NunopRolo and thanks for reporting this issue!

We also noticed a certain amount of data loss in our load tests. Under heavy load, the QuantumLeap notify endpoint lost about 4% of the NGSI entities.

We've never had the time to figure out exactly why, but it could be a combination of our code being too resource intensive and too few worker threads specified in the Gunicorn config---I mean too few for the request workload.

But your scenario is taking this to a whole new level :-) We're talking about an order of magnitude higher data loss, i.e. you lost about 40% of the incoming NGSI entities. One avenue to explore is Orion notification throttling. If memory serves, the default is to notify subscribers only of the last entity update received during the previous second. So if Orion got an average of 10 entity updates (to the same entity) per second, it'd only notify QuantumLeap of one entity update per second. Can you try deleting your subscription and then adding it back with a throttling: 0 field and see if it makes any difference?
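For illustration, here is a minimal Python sketch of a subscription body with an explicit throttling: 0 field. The QuantumLeap host name and port, the catch-all idPattern and the build_subscription helper are assumptions made up for the example, not details of the reporter's setup. The resulting JSON is what would be POSTed to Orion's /v2/subscriptions endpoint once the old subscription has been deleted.

```python
import json


def build_subscription(notify_url, throttling=0):
    """Build an NGSI v2 subscription body (hypothetical helper).

    notify_url is where Orion should send notifications, e.g. the
    QuantumLeap notify endpoint. throttling is the minimum number of
    seconds between two consecutive notifications; 0 asks for none.
    """
    return {
        "description": "Forward entity updates to QuantumLeap",
        # Assumption: match every entity; narrow this down in practice.
        "subject": {"entities": [{"idPattern": ".*"}]},
        "notification": {"http": {"url": notify_url}},
        "throttling": throttling,
    }


if __name__ == "__main__":
    # Assumption: QuantumLeap is reachable at this host and port.
    body = build_subscription("http://quantumleap:8668/v2/notify")
    print(json.dumps(body, indent=2))
```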

If you're still losing a lot of data, then it could be you need to beef up your test environment. E.g. run Crate and QL on separate boxes, making sure Crate gets lots of RAM.

Hope this helps!

from ngsi-timeseries-api.

NunopRolo avatar NunopRolo commented on July 23, 2024

Hello @c0c0n3
Thanks for the answer

I tried setting throttling: 0 and the same problem happens. I also tried to use threadpool mode in Orion Broker, and I tried to add more workers to QuantumLeap and it improved a little, but I still have a big loss of data.

It's a shame, knowing that the problem is in the component itself, and not in some wrong configuration on my part. I will then continue to try by adjusting some parameters, to see if I can avoid this problem.
If you have any more tips please let me know.

Thank you for your help.

SBlechmann avatar SBlechmann commented on July 23, 2024

We were facing similar issues when conducting performance tests, even with wq. I am very much looking forward to a solution to this.

BTW: 0 is the default value when not explicitly stating another value to throttling.

c0c0n3 avatar c0c0n3 commented on July 23, 2024

Hi @NunopRolo :-)

Since tweaking configuration didn't help, my guess is that you'll need more beefy hardware. Unfortunately QuantumLeap wasn't really designed with performance in mind from the outset and our stack requires lots of horsepower to handle that kind of workload---if I understand correctly, you're trying to process 18,000 requests a minute.

The first thing I'd do is move away from Docker Compose. I'd set up a separate server machine with 4 CPUs + 16GB RAM + SSD to run Crate. Then on another server machine (same specs) I'd run Orion and QuantumLeap with about 20 workers. Finally, JMeter should run on your client machine.

Also keep in mind we have a work queue solution to mitigate data loss:

It should raise your throughput a little w/r/t vanilla QuantumLeap and keep data loss to the bare minimum---possibly no loss at all. But it's way more complex to deploy.

c0c0n3 avatar c0c0n3 commented on July 23, 2024

@SBlechmann sorry to hear you're having issues too. Work queue should help, but you'll need beefy hardware to run it smoothly, see my previous comment about it. Anyway, WQ isn't a game changer. Like I said, the problem is that performance wasn't really a design goal from the start and at this point trying to turn QuantumLeap into a high-performance solution would require a complete redesign of the architecture and a rewrite of the code from scratch.

c0c0n3 avatar c0c0n3 commented on July 23, 2024

BTW: 0 is the default value when not explicitly stating another value to throttling.

Do you have a reference for that? I always struggle to remember what the defaults are and couldn't find any explicit mention of that in the docs---or the default being one second for that matter. But I've bumped into this

SBlechmann avatar SBlechmann commented on July 23, 2024

Hey @c0c0n3 ,

well, we did some performance tests as well... and 300 req / s is not much and a python script should be able to handle this imho.
I don't have the data with me atm, but I believe we ran several tests from 100 to 700 req / s. For low rates the data was saved persistently while at 700 req / s it was less than 50 %.

I know of a colleague who also ran some tests but with more hardware resources... let me reach out.

SBlechmann avatar SBlechmann commented on July 23, 2024

BTW: 0 is the default value when not explicitly stating another value to throttling.

Do you have a reference for that? I always struggle to remember what the defaults are and couldn't find any explicit mention of that in the docs---or the default being one second for that matter. But I've bumped into this

* https://fiware-orion.readthedocs.io/en/master/admin/perf_tuning.html#subscription-cache

Indeed, I can't find any ref for that... but I was sure I found out about that in the past.
I just did a little test and posted three subscriptions.

Sub 1: leave out throttling option
Sub 2: throttling:0
Sub 3: throttling:1

Running a GET against Orion's /v2/subscriptions will not show throttling except for Sub 3.
Yet in MongoDB, it says throttling:0 for Subs 1 and 2 and throttling:1 for Sub 3.

c0c0n3 avatar c0c0n3 commented on July 23, 2024

@SBlechmann

300 req / s is not much and a python script should be able to handle this imho.

agree :-) indeed QL + WQ can handle that actually. We happened to experience that exact workload in a prod scenario and there was no data loss. But you'll need a setup similar to the one I mentioned earlier w/ different boxes for Crate, QL and Redis.

100 to 700 req / s. For low rates the data was saved persistently while at 700 req / s it was less than 50 %.

Did you run this test through Docker Compose on a single machine with QL+WQ? I experienced something similar when running everything in Docker Compose on my laptop, but the reason was that I didn't have enough horsepower so the test client pumping requests would keep on getting 500s b/c there wasn't enough CPU and RAM to handle that workload. Also keep in mind, when using WQ, you might have to wait a few minutes before checking if the data is in the DB b/c of the WQ exponential backoff algorithm that retries failed inserts. See:

In general, if you want high performance and efficient resource usage, QL is not the right solution. The QL architecture wasn't designed for performance and it's utterly wasteful when it comes to resource usage. We desperately tried to bolt on performance improvements when we started hitting prod issues, but like I said earlier there's only so much you can do without rewriting the software from scratch using a different architecture.

To see why that's the case, think of a scenario where a device sends measurements every 5 seconds. That's one call to Orion, followed by one to MongoDB, followed by a notification to QL which finally issues an insert in the time series DB. The approach doesn't scale well. Just think about the QL bit of the journey: you pay the price for one DB insert every 5 secs. Now imagine you had 1,000 devices sending data every 5 secs. Well, that's 12,000 inserts a minute. Surely you can think of a different architecture where readings are buffered and then bulk-inserted into the DB. In this architecture you would e.g. only do a bulk insert of 6,000 records every 30 secs which is 2 inserts a minute vs 12,000 a minute.
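The buffering idea can be sketched roughly as follows in Python. This is not QuantumLeap code: BufferedWriter, its bulk_insert callback and the simulate helper are hypothetical names, and timestamps are passed in explicitly so the example stays deterministic. It only shows how 12,000 single-row inserts a minute collapse into 2 bulk inserts of 6,000 rows each.

```python
class BufferedWriter:
    """Collect readings and hand them to the DB in bulk (hypothetical)."""

    def __init__(self, bulk_insert, flush_interval_secs=30):
        self._bulk_insert = bulk_insert        # called with a list of rows
        self._flush_interval = flush_interval_secs
        self._buffer = []
        self._last_flush = None

    def add(self, reading, now):
        """Buffer one reading; flush first if the interval has elapsed."""
        if self._last_flush is None:
            self._last_flush = now
        if now - self._last_flush >= self._flush_interval:
            self.flush(now)
        self._buffer.append(reading)

    def flush(self, now):
        """Send everything buffered so far as a single bulk insert."""
        if self._buffer:
            self._bulk_insert(self._buffer)
            self._buffer = []
        self._last_flush = now


def simulate(devices=1000, period_secs=5, duration_secs=60,
             flush_interval_secs=30):
    """Return the size of each bulk insert issued during the run."""
    batches = []
    writer = BufferedWriter(batches.append, flush_interval_secs)
    for t in range(0, duration_secs, period_secs):
        for device in range(devices):
            writer.add({"device": device, "t": t}, now=t)
    writer.flush(now=duration_secs)  # drain what is left at the end
    return [len(batch) for batch in batches]


if __name__ == "__main__":
    print(simulate())  # → [6000, 6000]
```

A real implementation would also have to bound the buffer size and decide what happens to buffered readings if the process dies before a flush, which is the price paid for fewer inserts.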

I just did a little test and posted three subscriptions.
...
Yet in MongoDB, it says throttling:0 for Subs 1 and 2 and throttling:1 for Sub 3.

Ha! that's great, thank you so much for this, very valuable piece of info indeed!

StWiemann avatar StWiemann commented on July 23, 2024

I did some load-testing about 2 years ago, as well. The results weren't great for QL.
I did my testing in a Kubernetes-Cluster and assigned 3 Worker-Nodes (8 Cores, 64GB RAM) to handle Fiware.
It became clear that QL is the biggest bottleneck, followed by IoT-Agents. Orion seems to be fine with a lot more load than either of those could handle. After starting enough instances of QL and IoT-Agents I was able to get about 6000 inserts/s working. Most of those tests were missing roughly 0-10 Data-Points out of 180k with enough scaling.
I ran 3 Crate-Nodes and no replica sets for best performance.
It might be important that I didn't reach a limit there. It was just sufficient for my proof of concept. Throwing enough hardware and a load balancer at stuff like this solves most problems, I guess. But it might not be the best solution like c0c0n3 already said.

A big part of this is of course SSDs. If you are running this on some kind of hosting service, sometimes your volumes are mounted on rather slow hardware or you get restricted IOPS. I wasn't able to get past 120 requests/s because of that initially. Caching notifications in redis helps with bursts.

When I did that Scorpio was just about to be useable. I don't know if that is another feasible route to go (provided it doesn't rely on QL as well), since QL won't be rewritten I guess and imho python might not be the most performant choice there.

c0c0n3 avatar c0c0n3 commented on July 23, 2024

@StWiemann

I did some load-testing about 2 years ago, as well. The results weren't great for QL.

Not surprised to be honest :-)

Throwing enough hardware and a load balancer at stuff like this solves most problems

Yep it does, but that hurts your pocket :-)
Anyway, your results are totally in line with our experience and performance tests.

QL won't be rewritten I guess

You guessed right :-) At this point in time we only have barely enough resources for minor improvements, but I wish we could do a rewrite to solve most of the problems we have, with performance and NGSI-LD at the top of my list...

imho python might not be the most performant choice there

I couldn't agree more. If we ever do a rewrite, it's most likely going to be Rust...

NunopRolo avatar NunopRolo commented on July 23, 2024

Thanks for all the comments. I already understand that the problem is in the QL and there won't be a change soon.
I'm running these tests on a laptop (4 cores, 16GB RAM, NVME SSD), as I'm still investigating the fiware stuff, and apparently it has few resources to deal with the QL. I will try it on a better machine to see if that solves the problem.

If I still have problems with a better machine, I will try to find an alternative, to receive Orion Broker notifications on another service with better performance (I don't know if it will be possible, but I will investigate).

Thank you all

c0c0n3 avatar c0c0n3 commented on July 23, 2024

Pleasure! Keep in mind you could also turn on telemetry in QuantumLeap and then analyse the telemetry data with pandas to figure out exactly what's happening.
