Comments (24)

fommil commented on June 30, 2024

Hmm, some of the results here sound like they might be sporadic, and you're not seeing some things fail that should fail. E.g. "reference native" should be equivalent to "debian system liblapack3". I can't guarantee that the reference implementation is thread safe, but it would surprise me if it were not.

If you're seeing threading problems on some system implementations (e.g. ATLAS) but not others (veclib) I'd be tempted to say that ATLAS is simply not thread safe.

What concerns me more is that you see threading issues on the reference implementation. I'll have a look at your standalone project later tonight, thank you for putting it together!

fommil commented on June 30, 2024

btw, on another topic, I assume you've seen #45?

fommil commented on June 30, 2024

incidentally, I'd be surprised if you ever saw a performance improvement by running these sorts of things in parallel. See my talk (linked from the netlib-java homepage) at ScalaX last December, which explains the low-level details of what's happening. Adding more threads/cores will probably slow things down.

fommil commented on June 30, 2024

My memory of JVM memory barriers is a little shaky. Can you confirm that you still get errors (just test one of the failing cases) if you do the matrix equality assertion inside a synchronized block (to force memory retrieval in that thread)?
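
Something like this minimal sketch of what I mean (the matrix type and helper name are just placeholders for whatever your test uses; strictly, the lock only proves anything if the writing thread touches it too):

```java
import no.uib.cipr.matrix.DenseMatrix;

// Sketch only: wrap the equality assertion in a synchronized block as a
// crude memory barrier, to rule CPU-cache visibility in or out.
public class SynchronizedAssert {
    private static final Object BARRIER = new Object();

    public static void assertEquals(DenseMatrix expected, DenseMatrix actual, double eps) {
        synchronized (BARRIER) { // acquire the monitor before reading the results
            for (int i = 0; i < expected.numRows(); i++) {
                for (int j = 0; j < expected.numColumns(); j++) {
                    double diff = Math.abs(expected.get(i, j) - actual.get(i, j));
                    if (diff > eps) {
                        throw new AssertionError("mismatch at (" + i + "," + j + "): " + diff);
                    }
                }
            }
        }
    }
}
```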

jonnywray commented on June 30, 2024

It is certainly somewhat sporadic. The test runs multiple repeats, so the error surfaces. In fact, for smaller test matrices we don't see the issue. We just double-checked the Debian reference vs. system libraries and it is still fail vs. non-fail.

What we want to do isn't to process one matrix across multiple threads, but to calculate the SVD of multiple different matrices at the same time in different threads. That's what the test attempts to replicate, roughly the pattern in the sketch below.
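
For reference, the pattern is roughly this (a simplified sketch, not our actual test code; the matrix list is a stand-in for the test data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import no.uib.cipr.matrix.DenseMatrix;
import no.uib.cipr.matrix.SVD;

// Sketch: each task factorizes its own matrix in its own thread;
// nothing is shared between tasks.
public class ConcurrentSvdSketch {
    public static List<Future<SVD>> factorizeAll(List<DenseMatrix> matrices) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<SVD>> results = new ArrayList<Future<SVD>>();
        for (final DenseMatrix m : matrices) {
            results.add(pool.submit(new Callable<SVD>() {
                public SVD call() throws Exception {
                    return SVD.factorize(m); // independent SVD per task
                }
            }));
        }
        pool.shutdown();
        return results;
    }
}
```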

I hadn't seen the pseudo-inverse implementation, thanks.

I'll try the synchronized thing and let you know.

jonnywray commented on June 30, 2024

I can confirm the errors still occur (on Windows) with the synchronized block around the matrix equals assertion.

jonnywray commented on June 30, 2024

Another data point. We tried downloading, compiling and installing ATLAS via apt-get source atlas; apt-get build-dep atlas and then fakeroot debian/rules custom in the ATLAS source directory. The tests then pass.

The ATLAS version that comes packaged and installed via aptitude install libatlas3-base on Debian fails.

fommil commented on June 30, 2024

ok, I'm happy calling ATLAS broken. The really concerning thing for me is that the reference implementation fails. I didn't think it had threading problems at all.

BTW, I mean that running multiple SVDs at the same time won't give you a performance boost. The BLAS will optimise the hell out of all your CPU caches and cores, so trying to do more than one at once will just result in page swapping. The best thing you can do is arrange a single queue for accessing the BLAS layer. Watch my talk for more details. But... that's all about performance. Certainly it should give the correct numbers no matter how parallel it is being used.
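
A minimal sketch of that single-queue idea (the class and method names here are mine, not anything in MTJ or netlib-java):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import no.uib.cipr.matrix.DenseMatrix;
import no.uib.cipr.matrix.SVD;

// Sketch: funnel every decomposition through one worker thread so the
// native BLAS/LAPACK only ever runs one job at a time and keeps the
// caches and cores to itself.
public class SerialLapackQueue {
    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    public Future<SVD> submit(final DenseMatrix m) {
        return queue.submit(new Callable<SVD>() {
            public SVD call() throws Exception {
                return SVD.factorize(m);
            }
        });
    }

    public void shutdown() {
        queue.shutdown();
    }
}
```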

SVD basically only calls LAPACK's dgesdd, which is here https://github.com/fommil/netlib-java/blob/master/netlib/LAPACKE/lapacke_dgesdd.c and here https://github.com/fommil/netlib-java/blob/master/netlib/LAPACK/dgesdd.f and as far as I can tell it doesn't use any global variables. Might be a volatile problem.

jonnywray commented on June 30, 2024

Yeah, it's been surprising to us also. Hopefully it's just something idiotic I'm doing in my Java, which would be the easy solution, but I don't think so. I had a quick look through the code for SVD and couldn't see anything obvious.

I will take a look at the talk. We're plugging MTJ into our own library for processing graphs/networks and calculating properties based on the adjacency matrix. That library is used in a system for distributed and parallel computing over a grid of machines (it uses GridGain). We'll obviously need to figure out the optimal approach, as the default will be multi-threaded.

fommil commented on June 30, 2024

you might want to look at using the sparse stuff, especially the ARPACK example 😉 that is the kind of problem domain (in an industrial setting) that got me started with these projects years ago.

FWIW, I don't see any obvious problems in your Java. One thing you might want to try is hitting some kind of memory barrier before the copy operation. e.g. maybe do the copy in a synchronized block.

fommil commented on June 30, 2024

also, does the problem fix itself if you synchronize access to your pseudoInverse method? (I realise that's a hack, but if it fails under those circumstances we'll learn something).

Try this with both sharing and not sharing the PseudoInverseMTJ instance.
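
Rough sketch of what I mean; I'm assuming your wrapper exposes something like pseudoInverse(DenseMatrix), so adjust to the real signature:

```java
import no.uib.cipr.matrix.DenseMatrix;

// Sketch: a class-wide lock so only one pseudo-inverse runs at a time,
// regardless of how many PseudoInverseMTJ instances exist.
public class LockedPseudoInverse {
    private static final Object LOCK = new Object();

    private final PseudoInverseMTJ delegate;

    public LockedPseudoInverse(PseudoInverseMTJ delegate) {
        this.delegate = delegate; // pass a fresh or a shared instance to compare the two cases
    }

    public DenseMatrix pseudoInverse(DenseMatrix a) {
        synchronized (LOCK) {
            return delegate.pseudoInverse(a); // assumed signature, see note above
        }
    }
}
```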

fommil commented on June 30, 2024

and one more: can you find some norm distance (even just for the one failing element) between the failing and expected matrices? It would be good to see if the failure is systemic.
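
e.g. something along these lines, using MTJ's built-in norms (the helper class is just a sketch):

```java
import no.uib.cipr.matrix.DenseMatrix;
import no.uib.cipr.matrix.Matrix;

// Sketch: Frobenius norm of (expected - actual), absolute and relative to
// the expected matrix, to see how far off the failing result really is.
public class MatrixDistance {
    public static double absoluteDistance(DenseMatrix expected, DenseMatrix actual) {
        Matrix diff = expected.copy().add(-1.0, actual); // diff = expected - actual
        return diff.norm(Matrix.Norm.Frobenius);
    }

    public static double relativeDistance(DenseMatrix expected, DenseMatrix actual) {
        return absoluteDistance(expected, actual) / expected.norm(Matrix.Norm.Frobenius);
    }
}
```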

jonnywray commented on June 30, 2024

I can certainly try those but it'll be a week or so before I'm back in front of my computer. I'll let you know what I find.

Thanks for the pointers on ARPACK; some of the things we do are based on the eigenspectra of either the adjacency matrix or the Laplacian, so it should be useful.

jonnywray commented on June 30, 2024

Got a few minutes to look at this before I leave for my conference.

I tried synchronizing access to the pseudoInverse method. When the PseudoInverseMTJ instance is created fresh in each task/callable, it still fails. If I share the instance and reuse it across all tasks, the tests pass.

This is the opposite of what I'd expect: no shared state fails, but shared state passes? I hope you have more of an idea about what's going on here than I do. Given you suggested the approach, I guess you have something in mind.

jonnywray commented on June 30, 2024

I committed some code to illustrate. There are now two very similar test methods that differ only in whether they share the PseudoInverseMTJ instance or not.

fommil commented on June 30, 2024

😕 my gut keeps telling me this is about visibility and CPU caches... I wonder if something inside MTJ or netlib-java is being initialised in the PseudoInverseMTJ class and needs to be made volatile.

jonnywray commented on June 30, 2024

I just made a similar test class that looks at SVD on its own, without the additional logic of PseudoInverseMTJ, and it passes. So yes, something is going on in PseudoInverseMTJ. I'll push it in a sec.

But there's no shared state in that class at all; it's just a helper method?

Also, we first noticed the issue using the Sandia wrapper over MTJ, which was being used in certain places in our code. PseudoInverseMTJ was written to replicate that functionality while removing that code from the problem.

By the way, all of this morning's tests were done on OS X.

fommil commented on June 30, 2024

I'm possibly more confused than you are. I can't see anything weird in your wrapper class at all. I'll have a third look.

jonnywray commented on June 30, 2024

Got a little time to look at this again this afternoon.

So, ignore what I said about SVD. It does fail on its own; I just wasn't testing everything. The s vector is always the same, but the U and V matrices have different values when calculated in threaded mode.

So, it's not the wrapper. That's now less confusing.

jonnywray commented on June 30, 2024

Just read through the netlib-java issue. You think this could just be rounding errors and such?

I should say, this isn't stopping us from using the library. It's more just an itch to scratch now. And thanks for the work, it's been very useful.

jonnywray commented on June 30, 2024

One other data point: when I look at the two versions of the test, with and without sharing of the PseudoInverseMTJ instance, I see a big difference in performance:

  • the shared version (which works) shows pretty large variance in time taken for one task, from 700ms to over 10s
  • the non-shared version (which fails) shows hardly any variance in time taken for a task.

The time taken is obviously related to your comments above about performance, so I thought this might give you a clue.

fommil commented on June 30, 2024

this just gets weirder, especially on performance.

Can you maybe reduce your tolerance to errors and see if the problem persists?

jonnywray commented on June 30, 2024

Yep, I tried a bunch of that this afternoon and saw nothing useful. I changed both my definition of equality for a double in the tests and the threshold used in the pseudo-inverse for defining zero. I also tried the code from EJML that calculates a dynamic zero threshold during the pseudo-inverse. Nothing really changed.

I did this after my comment on tolerances in the other issue. The magnitude of the differences suggests it isn't a tolerance thing either: they are big differences, easily similar in size to the actual values.

fommil commented on June 30, 2024

it's just got to be memory barrier stuff. Maybe you don't need to do the calculations in a synchronized block, but can you do a synchronized block before and after (and in the comparison) to make sure all the right data is seen by the threads doing the work?
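
Something like this hack, roughly (PseudoInverseMTJ/pseudoInverse again stand in for your wrapper, whatever its real signature; the test's comparison would synchronize on the same FENCE object):

```java
import no.uib.cipr.matrix.DenseMatrix;

// Sketch of the memory-barrier hack: grab one shared lock immediately
// before and after the work, leaving the calculation itself unsynchronized.
public class BarrierHack {
    public static final Object FENCE = new Object();

    public static DenseMatrix compute(PseudoInverseMTJ inverter, DenseMatrix a) {
        synchronized (FENCE) { /* barrier before the work */ }
        DenseMatrix result = inverter.pseudoInverse(a); // assumed signature
        synchronized (FENCE) { /* barrier after the work */ }
        return result;
    }
}
```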
