
Comments (16)

vellamike commented:

Monitoring memory on the server while this happens:

  1. When the simulation is loaded, memory increases by approx. 500 MB.
  2. It then creeps up by a further 175 MB before the crash.
  3. Memory then returns to the value it had after the simulation was loaded.

vellamike commented:

Another observation: running on another, much faster server, the same thing happens, only much faster. Additionally, when I run this on a client with a better network connection to the original server, what happens is:

  1. The simulation runs faster on the client
  2. More simulation steps happen before the crash

I would like to bounce a theory off @gidili and @tarelli: the server is generating scenes and sending them to the client, but the client is not rendering them fast enough (possibly due to network lag or other reasons). This causes a buffer to build up on the server, which eventually causes a crash when memory runs out.

gidili commented:

Thanks for trying and reporting this Mike.

It used to be that the steps buffer (it's actually a state tree now) had a configurable maximum size (usually around 100), as you can see here from a few revisions ago: https://github.com/openworm/org.geppetto.simulation/blob/d72ab4be08acf63a32b937119552e0437f4d8261/src/main/java/org/geppetto/simulation/SimulationCallbackListener.java#L65

On the latest release (0.0.3, latest on master) this doesn't seem to be the case anymore (https://github.com/openworm/org.geppetto.simulation/blob/master/src/main/java/org/geppetto/simulation/SimulationCallbackListener.java#L98). There is a check, but the visitor that removes extra "steps" is never applied - unless the elements are removed from the tree somewhere else now.

@tarelli @jrmartin anybody know anything about this? Maybe some dodgy merge?

vellamike commented:

Does this mean that network lag could be slowing the simulation down, though?

gidili commented:

No, it's not lag - there's a timer that sends updates to the client at a fixed rate, and whenever a time step is sent to the client it is also removed from the buffer.

On a "slow" machine this is not a problem, but on a fast machine the number of steps stored grows faster than the number of updates being sent (and removed from the buffer), and this fills up memory, as you are experiencing.

The fixed limit on the number of items stored in the buffer was meant to prevent this from happening - it must've been removed for a reason, but it sounds like we need to do something about it.

vellamike commented:

Does the server send the client the data from the latest timestep or the whole buffer?

gidili commented:

It just sends the oldest timestep it has stored, and as it's sent it gets removed from the buffer.

From looking at the code, it seems my first diagnosis was partially wrong. There is a while loop (https://github.com/openworm/org.geppetto.simulation/blob/master/src/main/java/org/geppetto/simulation/SimulationCallbackListener.java#L93) that waits when the number of steps stored equals the configured buffer maximum (it looks like 100 steps by default).

This means that we need to reduce the size of the buffer or increase the available memory for the JVM on your setup, since on your machine(s) the simulation runs much faster, filling up the buffer and (surprisingly) running out of memory.

I am guessing this doesn't happen on slower machines as the updates to the client keep the buffer size down, but I can almost swear I remember seeing the buffer filled up to the max many times on my machine as I was blocking some part of the code with breakpoints.

vellamike commented:

Setting this environment variable on Ubuntu increased the heap memory to 9 GB and prevented the crash:

_JAVA_OPTIONS: -Xmx9g
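For reference, a minimal way to apply this (assuming a bash shell and that the server is launched from that same shell; the exact startup script may differ on other setups):

    # raise the JVM heap limit before starting the Geppetto server
    export _JAVA_OPTIONS="-Xmx9g"
    # equivalently, pass -Xmx9g directly on the java command line that starts the server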

vellamike commented:

@gidili and I just had a 3-hour diagnosis of this issue, and we have a hypothesis that this problem (and a related problem - slower simulations over remote comms) is the result of network communication. Here are the facts we observed:

  1. Crashes do not occur on the server when the Java heap size is increased. They also do not occur if the server is also the client (i.e. pointing the Geppetto frontend to localhost).
  2. A server (with a GPU) runs a simulation faster when the client is localhost than when the client is remote.
  3. Update times in the browser are faster for smaller simulations.
  4. A server (with CPU) running a large SPH simulation initially has high CPU utilization (~30%), which suddenly drops to <2%.
  5. The drop in CPU utilization mentioned in 4 occurs around the time the crash would occur without the increased Java heap size.
  6. The theoretical lower bandwidth bound for a large PCISPH scene (60,000 particles) with 20 ms updates is ~500 Mb/s (unless my maths is way off) - probably higher in the current implementation because numbers are represented as characters. A rough check of this figure is sketched after the next paragraph.

Our conclusion is that Geppetto simulations are slowing down with remote clients because the network hardware on the server cannot handle the amount of data being generated; the bottleneck becomes communication rather than computation. This is not network lag - each scene is a minimum of ~1 MB of data which the network card has to handle every 20 ms, and it probably cannot keep up.
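As a rough sanity check of the ~1 MB per scene and ~500 Mb/s figures above, here is a small back-of-the-envelope sketch (assuming 60,000 particles, 4 floats of 32 bits each per particle, raw binary encoding and one scene every 20 ms; class and variable names are illustrative only):

    // Back-of-the-envelope estimate of raw scene size and required bandwidth.
    public class SceneBandwidthEstimate {
        public static void main(String[] args) {
            long particles = 60_000;
            long bitsPerParticle = 4 * 32;                   // 4 floats per particle
            long bitsPerScene = particles * bitsPerParticle; // 7,680,000 bits
            double megabytesPerScene = bitsPerScene / 8.0 / 1_000_000.0;
            double scenesPerSecond = 1000.0 / 20.0;          // one scene every 20 ms
            double megabitsPerSecond = bitsPerScene * scenesPerSecond / 1_000_000.0;
            System.out.printf("raw scene size:     ~%.2f MB%n", megabytesPerScene);
            System.out.printf("required bandwidth: ~%.0f Mb/s%n", megabitsPerSecond);
        }
    }

This prints roughly 0.96 MB per scene and ~384 Mb/s, the same order of magnitude as the figures quoted above; the current character-based JSON encoding is probably several times larger again.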

This hypothesis explains the above observations:

  1. When the client is on localhost, data reaches it much faster, a queue of updates does not build up, and the buffer remains almost empty. With a remote host the buffer (500 scenes max) builds up to approx. 3 GB of memory, causing the crash. Increasing the heap size resolves this.
  2. The bottleneck is no longer the communication time, so updates happen every 20 ms.
  3. When the data to be transferred is small, the bottleneck goes back to being computation and updates happen every ~20 ms.
  4. While the buffer is being filled, CPU utilization is high; after that the CPU is waiting for the communication process to clear space in the buffer to write to, and CPU utilization drops.
  5. See the previous point.
  6. This probably represents a significant amount of data.

tarelli commented:

Mike, thanks for trying this out. I am sorry about the 3-hour diagnosis; I could probably have saved you some of it. As you found out, there is a buffer that gets filled as the simulation runs and gets emptied as the steps are streamed to the client. This is known as the producer-consumer paradigm: the simulation in this case is the producer of the data and the client is the consumer.

process Producer {
    while (1) {
        produce data;
        Buffer.Insert(data);
    }
}

process Consumer {
    while (1) {
        data = Buffer.Remove();
        consume data;
    }
}

The wait loop that Gio found acts like a monitor: the two processes have to wait for each other when the buffer is full (the producer waits for the consumer to consume some data) and when the buffer is empty (the consumer waits for the producer to produce some data).
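As an illustration only (not the actual Geppetto code; class and variable names here are made up), a bounded blocking queue gives the same behaviour as that wait loop - the producer blocks when the buffer is full and the consumer blocks when it is empty, so memory stays bounded:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal sketch of the bounded producer-consumer pattern described above.
    public class BoundedStepBuffer {
        public static void main(String[] args) {
            // Cap the buffer at 100 steps, the default mentioned above.
            BlockingQueue<String> buffer = new ArrayBlockingQueue<>(100);

            // Producer: the simulation generating scenes.
            Thread producer = new Thread(() -> {
                try {
                    int step = 0;
                    while (true) {
                        // put() blocks when the buffer is full, which is what the
                        // wait loop / monitor achieves: the simulation pauses
                        // instead of growing memory without bound.
                        buffer.put("scene " + step++);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Consumer: the timer streaming the oldest step to the client.
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        String scene = buffer.take(); // blocks when the buffer is empty
                        Thread.sleep(20);             // ~20 ms update period
                        System.out.println("sent " + scene);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.start();
            consumer.start();
        }
    }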

The out-of-memory error, as you found out, happened because the memory you had allocated was not enough to hold a full buffer of the big scene (since the system was not consuming it fast enough). The takeaway is that we should work out how much memory is needed for a full buffer of an arbitrarily big scene, and also re-evaluate the buffer size (is 100 too much?).

As for the rest of your analysis, what you experienced is the network not keeping up with the data - the same as when you try to watch a Full HD 1080p video on YouTube and the throughput of your connection is not high enough. Could you measure what upload bandwidth those machines have? Another takeaway is that we should do more analysis to establish how much bandwidth is needed for simulations of different sizes, and potentially evaluate different formats to transport the information.

vellamike commented:

Thanks for confirming our ideas, Matteo. The CPU server has a tested upload speed of 16 Mb/s - in fact I monitored upload from the server during the simulation and that is the rate it was hitting. I've done some calculations on the required bandwidth below.

Let's estimate the particles per second (p/s) we can send assuming a 1 Mb/s connection (numerically the same as particles per megabit):

1 particle has 4 floats, i.e. 128 bits;
therefore 1 megabit can represent 1048576/128 = 8192 (call it ~8,000) particles;
if we want to send updates every 20 ms we need to divide this number by 50, giving ~160 particles per update.

So for every Mb/s of bandwidth, with a 20 ms update time, you can transport information for ~160 particles. Even on my 16 Mb/s connection this only amounts to ~2,500 particles.

On 16 Mb/s, with 60,000 particles, you would expect a slowdown of 60000/2500 = 24x and a corresponding update time of ~0.5 s, which is approximately what I observed.

Let's say we want 1 million particles with a 20 ms update time. This would require a ~6 Gb/s connection. If the update time can be slowed to 100 ms (remember, computation is also slowing down at that scale anyway) it's more like ~1 Gb/s. With a 90% compression ratio, ~100 Mb/s. Then do some sampling (don't send all the particles) - say sample only 5% of them - ~5 Mb/s.

So I guess it's realistic to stream the particles we want, but it's more of a challenge than we expected.
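For convenience, a small sketch that reproduces the arithmetic above (the rounding to 8,000 and 2,500 follows the original estimate; names are illustrative only):

    // Reproduces the bandwidth arithmetic in the comment above.
    public class ParticleThroughputEstimate {
        public static void main(String[] args) {
            double bitsPerParticle = 4 * 32;                          // 4 floats = 128 bits
            double particlesPerMegabit = 1_048_576 / bitsPerParticle; // = 8192, rounded to ~8000 above
            double updatesPerSecond = 1000.0 / 20.0;                  // 20 ms update period
            double particlesPerMbps = particlesPerMegabit / updatesPerSecond; // ~164 (~160 above)

            double particlesAt16Mbps = particlesPerMbps * 16;         // ~2600 (~2500 above)
            double slowdown = 60_000 / 2_500.0;                       // = 24x for the 60k-particle scene
            double updateTimeSeconds = slowdown * 0.020;              // ~0.48 s per update

            System.out.printf("particles per Mb/s at 20 ms updates: %.0f%n", particlesPerMbps);
            System.out.printf("particles on a 16 Mb/s link:         %.0f%n", particlesAt16Mbps);
            System.out.printf("slowdown for 60,000 particles:       %.0fx (~%.2f s per update)%n",
                    slowdown, updateTimeSeconds);
        }
    }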

tarelli commented:

I just confirmed your finding on an Amazon instance. I only had to increase the heap to 2g to avoid the crash (from the 512m that is there by default). In the next release of Geppetto I will default it to 2g. Regarding the amount of data: we are sending a lot, and there is some low-hanging fruit for quick optimisation. I was sniffing my connection as I was receiving data from Amazon:

....A../e\":\"Particle\",\"id\":\"p[5870]\",\"position\":{\"x\":11.909845352172852,\"y\":7.037267684936523,\"z\":108.53021240234375}},{\"type\":\"Particle\",\"id\":\"p[5871]\",\"position\":{\"x\":13.922074317932129,\"y\":7.0179972648620605,\"z\":110.3287582397461}},{\"type\":\"Particle\",\"id\":\"p[5872]\",\"position\":{\"x\":13.698995590209961,\"y\":6.940342903137207,\"z\":112.6390609741211}},{\"type\":\"Particle\",\"id\":\"p[5873]\",\"position\":{\"x\":12.391883850097656,\"y\":6.510068416595459,\"z\":114.03316497802734}},{\"type\":\"Particle\",\"id\":\"p[5874]\",\"position\":{\"x\":10.983614921569824,\"y\":6.262143611907959,\"z\":115.64854431152344}},{\"type\":\"Particle\",\"id\":\"p[5875]\",\"position\":{\"x\":14.749643325805664,\"y\":6.5971574783325195,\"z\":117.860595703125}},{\"type\":\"Particle\",\"id\":\"p[5876]\",\"position\":{\"x\":13.7555513381958,\"y\":5.842294692993164,\"z\":120.65282440185547}},{\"type\":\"Particle\",\"id\":\"p[5877]\",\"position\":{\"x\":15.313730239868164,\"y\":5.548252105712891,\"z\":122.5022964477539}},{\"type\":\"Particle\",\"id\":\"p[5878]\",\"position\":{\"x\":15.530537605285645,\"y\":4.753219127655029,\"z\":125.45529174804688}},{\"type\":\"Particle\",\"id\":\"p[5879]\",\"position\":{\"x\":12.753305435180664,\"y\":4.378335475921631,\"z\":126.2254409790039}},{\"type\":\"Particle\",\"id\":\"p[5880]\",\"position\":{\"x\":14.148794174194336,\"y\":5.024
13:41:29.142770 IP ec2-54-213-120-136.us-west-2.compute.amazonaws.com.http-alt > 192.168.1.72.52078: Flags [.], seq 12693800:12695200, ack 1, win 122, options [nop,nop,TS val 827155 ecr 1104476463], length 1400

and there is a lot of redundant information there.

More generally though, we might have to look at switching to a binary format and experiment with the trade-offs in terms of extracting the packets. Again, thanks a lot for starting this investigation; I will come up with some actions to follow up on this.
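To give a feel for the low-hanging fruit, here is a rough, illustrative comparison of the per-particle payload in the JSON above versus a packed binary layout (the 16-byte layout - an int id plus three floats - is an assumption for illustration, not an actual Geppetto format):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Rough size comparison: one particle as JSON text vs. packed binary.
    public class ParticlePayloadSize {
        public static void main(String[] args) {
            // One particle copied from the sniffed traffic above.
            String json = "{\"type\":\"Particle\",\"id\":\"p[5870]\",\"position\":"
                    + "{\"x\":11.909845352172852,\"y\":7.037267684936523,\"z\":108.53021240234375}}";
            int jsonBytes = json.getBytes(StandardCharsets.UTF_8).length;

            // Hypothetical binary layout: 4-byte id + three 4-byte floats = 16 bytes.
            ByteBuffer binary = ByteBuffer.allocate(16);
            binary.putInt(5870);
            binary.putFloat(11.909845f);
            binary.putFloat(7.0372677f);
            binary.putFloat(108.53021f);

            System.out.println("JSON bytes per particle:   " + jsonBytes);
            System.out.println("binary bytes per particle: " + binary.capacity());
        }
    }

On this sample particle the JSON is roughly 115 bytes against 16 bytes of binary, i.e. around a 7x reduction before any compression.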

gidili commented:

The low-hanging fruit is compressing the string we are sending or switching to binary data (or both) - but that's just going to push the limit a bit higher. I think we need some way to reduce the amount of information being sent. Time to get creative.

tarelli commented:

Useful comparisons: https://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

vellamike commented:

I think this issue has been solved; should we open a new one about solving the comms bottleneck?

tarelli commented:

Agreed
