Comments (20)
OK, I agree with the workflow, but then we should first merge the ma00aa branch with master (I will do some final testing) and then create a new branch called something like "fastspec" and work on these matters from there. I will get this set up tomorrow and let you know.
from spec.
@SRHudson wrote:
I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I
intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a
large number of interfaces is really with ma00aa or somewhere else.
I performed a study with the l2stell.sp testcase, increasing Fourier resolution (Mpol=Ntor=4,6,8), measuring the total computation time, and comparing the two SPEC versions ("fast" is from ma00aa branch, "slow" is from master branch). Find attached the result of this exercise.
l2stell_cputime_versus_fourier.pdf
For example, at m=n=8, the slow code takes ~1558 sec and the fast code takes ~645 sec, a speedup factor of ~2.4, which seems to remain constant with increasing Fourier resolution.
Moreover, during these runs I did some profiling, and found that the ma00aa routine takes an increasing percentage of the total CPU time as the Fourier resolution increases. This means that overall code speedup relies on ma00aa optimization even more at high Fourier resolution.
To illustrate this point, here are some numbers:
"slow spec": ma00aa 74% at m=n=4 ; 85% at m=n=8
"fast spec": ma00aa 35% at m=n=4 ; 52% at m=n=8
Comment One) What you are calling "slow spec" is actually non-stellarator-symmetric spec, so comparisons between "fast spec", which exploits stellarator symmetry, and "slow spec" are not particularly meaningful. Recall that the only speed improvements I made to ma00aa (beyond the re-looping and the elimination of divisions suggested by Sam) were to not compute the non-stellarator-symmetric terms, which were nowhere being used.
Comment Two) It is interesting that at higher M, N ma00aa takes a higher percentage of the total time. I will continue to work on this, as I have described earlier (by moving the calculation of DToocc, DToocs, DToosc and DTooss from ma00aa to preset).
@SRHudson not sure I understand your Comment One:
the only speed improvements that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.
Does this mean that you modified ma00aa.h in such a way that IF the system is stellarator symmetric THEN those terms are not calculated, but that in general they are still calculated, and thus not much speedup is expected?
Ultimately, it would be nice to determine a scaling law, e.g.,
T \sim L^a M^b N^c Nv^d
for each subroutine, where L, M and N are the radial, poloidal and toroidal resolutions, Nv is the number of volumes, and a, b, c and d are exponents.
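The scaling-law idea above can be tested numerically: given times measured at several resolutions, a least-squares fit of log T against log L, log M, log N and log Nv recovers the exponents. A minimal sketch with synthetic data (both the measurements and the true exponents here are invented purely for illustration):

```python
# Fit T ~ L^a * M^b * N^c * Nv^d from timing measurements by linear
# least squares on log T = log C + a log L + b log M + c log N + d log Nv.
import numpy as np

# Hypothetical runs: columns are (L, M, N, Nv) at each resolution tested.
runs = np.array([
    [2, 4, 4, 1],
    [4, 4, 4, 1],
    [4, 8, 8, 1],
    [4, 8, 8, 2],
    [8, 8, 4, 2],
])
# Synthetic timings generated from T = 2 * L * M^2 * N^2 * Nv,
# i.e. true exponents (a, b, c, d) = (1, 2, 2, 1).
T = 2.0 * runs[:, 0] * runs[:, 1]**2 * runs[:, 2]**2 * runs[:, 3]

# Design matrix: a constant column for log C, then the log resolutions.
A = np.column_stack([np.ones(len(runs)), np.log(runs)])
coeffs, *_ = np.linalg.lstsq(A, np.log(T), rcond=None)
logC, a, b, c, d = coeffs
print(np.round([a, b, c, d], 3))  # recovers the exponents used above
```

With real data the fit will of course not be exact, and one would want many more runs than unknowns.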
It is very interesting that matrix seems to be taking a lot of time. There is NO computation performed in matrix; it just loops over and assigns matrix elements. I guess that by re-looping, the time spent in matrix can be reduced significantly.
In
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
I have color-coded the stellarator symmetric terms and the non-stellarator symmetric terms.
As of last week, all the terms were being calculated regardless of Istellsym. All I did was to not compute the non-stellarator-symmetric terms if Istellsym = 1.
For non-stellarator-symmetric equilibria, as you can see from
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
there will be four times as many quantities to compute. We can expect that non-stellarator-symmetric equilibria will take much longer. You can test this: just set Istellsym = 0 in the l2stell.sp input file and take a look. If the boundary is stellarator symmetric, then the equilibria computed with Istellsym = 1 and Istellsym = 0 should be identical.
@SRHudson OK I understand, although somehow I do not see any red terms when I open the PDF.
How are you doing the timing?
I implemented a timing capability. It might be clumsy, and I don't know if it works everywhere. Set Ltiming=1 in the &diagnos namelist and a screen output should give the time taken by each routine. When running on a single CPU (with Linitialize=1 in l2stell.sp) I see:
mp00ac : 45.35s
ma00aa : 21.96s
tr00ab : 7.68s
You will see that each call to a subroutine is performed using the macro WCALL, which automatically adds time differences to a T(filename) variable. See also the macro BEGIN, which also adds time.
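The WCALL idea can be sketched in a few lines of Python; the routine name below is a placeholder, not the actual SPEC interface. Each wrapped call adds a wall-clock time difference to a per-routine total, analogous to the T(filename) variables:

```python
# A minimal sketch of the WCALL mechanism: wrap each routine so the
# elapsed wall-clock time of every call is accumulated per routine.
import time
from collections import defaultdict
from functools import wraps

T = defaultdict(float)  # accumulated wall time per routine, like T(filename)

def wcall(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()          # analogous to MPI_WTIME()
        result = fn(*args, **kwargs)
        T[fn.__name__] += time.perf_counter() - t0
        return result
    return wrapper

@wcall
def ma00aa_like():
    time.sleep(0.01)  # stand-in for real work

for _ in range(3):
    ma00aa_like()

print(f"ma00aa_like : {T['ma00aa_like']:.2f}s")
```

The accumulated totals are what gets printed at the end of the run when Ltiming=1.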
@SRHudson I believe @jloizu is doing the timing using the standard GNU profiling extension. This is useful as it automatically calculates the timing for all routines.
So, do the results from the different timings agree?
No, the times do not agree. This is how I do it, you can try:
- compile the code with the profiling option, e.g. make CC=intel_prof
- run the code normally; after execution, that should generate a file called gmon.out
- finally, type the command gprof xspec > somename.out
- you can then see all the profiling information inside somename.out
OK, I made with CC=intel_prof, ran xspec, ran gprof on it, and got this:
  %   cumulative     self              self    total
 time    seconds   seconds    calls  s/call   s/call  name
73.54      22.10     22.10      410    0.05     0.06  ma00aa_
 8.92      24.78      2.68      410    0.01     0.01  matrix_
 4.36      26.09      1.31       43    0.03     0.69  dforce_
 3.76      27.22      1.13      412    0.00     0.00  volume_
which I think is what Joaquim has been reporting, namely that ma00aa is where all the time is. This method also shows that the code took about 30 sec to complete.
My timing shows that mp00ac was taking all the time, and that the code took about 83 sec. I guess that most of you will regard the standard GNU profiling extension as being the more reliable, but my timing agrees with my watch!!
Perhaps the difference is because I am using MPI_WTIME() to provide the time. Which is more important: the "timing" provided by the so-called standard GNU profiling extension, or the time that the user has to wait to get the result?
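This is the distinction between CPU time and wall-clock time: gprof attributes CPU time spent in the instrumented code, while MPI_WTIME() measures elapsed wall time, so time spent blocked or inside un-instrumented libraries shows up only in the latter. A toy demonstration, using Python's two timers as stand-ins:

```python
# Wall-clock time vs. CPU time: a blocked process accumulates wall time
# but almost no CPU time, which is how the two kinds of "timing" diverge.
import time

t_wall0 = time.perf_counter()   # wall clock, like MPI_WTIME()
t_cpu0 = time.process_time()    # CPU time, roughly what a profiler attributes

time.sleep(0.2)                 # blocked: wall time passes, CPU time does not

wall = time.perf_counter() - t_wall0
cpu = time.process_time() - t_cpu0
print(f"wall = {wall:.2f}s, cpu = {cpu:.2f}s")
```

The same effect occurs when the real work happens in an external library that the profiler cannot see into.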
I am still curious about the timing. Can the GNU profiling handle the calls to the NAG library?
Also, is there a way to "profile" the memory requirements? I have absolutely no clue how much memory SPEC uses, but I am sure that the memory can be reduced significantly.
@SRHudson, no, it can't tell you the time spent in those subroutines because we don't compile those libraries, we only link to them. However, while gprof is nice, there are better tools. And the discrepancy between gprof and wall-clock timing is well documented.
Regarding memory, you just need to compile with the -g option and run with valgrind as a first pass. The tool is good for detecting uninitialized variables, failures to deallocate variables, and other memory-related issues. At some level, memory reduction begins to interfere with the readability of codes. I suspect this is why many legacy codes read more like letter salad, as they attempt to reuse variables.
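Valgrind is the right tool for a compiled code like SPEC; purely to illustrate the general idea of memory profiling (attributing allocations to the code that made them), here is a minimal Python sketch using tracemalloc:

```python
# Illustration only: tracemalloc tracks current and peak memory allocated
# by Python code between start() and stop(), the same idea a memory
# profiler applies to a compiled code.
import tracemalloc

tracemalloc.start()
big = [0.0] * 1_000_000  # an allocation we want to attribute (~8 MB of pointers)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
del big

print(f"current = {current / 1e6:.1f} MB, peak = {peak / 1e6:.1f} MB")
```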
OK, I tried the Ltiming option and also got that mp00ac takes most of the time (by far: 10 times more than any other subroutine). However, it is weird, because if I rerun the same case it prints out different numbers, e.g. for the mp00ac subroutine:
FIRST RUN
finish : 24.02 : time spent in mp00ac = 17.33 ;
SECOND RUN
finish : 75.73 : time spent in mp00ac = 54.48 ;
while both runs took the same "watch time"... Also, on the IPP cluster I get all zeros if I try Ltiming... perhaps some environment variable related to timing is messing things up?
In any case, it looks like Ltiming says that mp00ac dominates the CPU time, while gprof says that ma00aa dominates: maybe the latter neglects NAG calls, which could be the reason. If this is the case, then we should trust the Ltiming output and try to optimize mp00ac.
- I think that the Ltiming option will only be reliable if only one cpu is used. Could this explain your strange result?
- mp00ac does little more than call a NAG routine to solve a linear equation for the Beltrami field in each volume, so it is entirely consistent that if gprof ignores NAG calls then mp00ac will not show up. Also, remember that the primary calculation that spec performs is to compute the Beltrami field in each volume, so it is entirely expected that this will be a significant cost. Also, ma00aa computes the volume integrals of the Chebyshev-Fourier-metric products, so it is similarly expected that this will be costly.
- If you have a look around in the source, you will see the GETTIME and WCALL macros, which call MPI_WTIME(). If this is not appropriate on a given machine, then Ltiming will be meaningless.
- Yesterday, I identified that the linear solver in mp00ac was dominating, and Josh has kindly begun work on this. Initial timing improvements are very encouraging, and there is more that Josh and I will do over the next few days to further reduce the computation.
- I will not concentrate on memory at present, but I hope that Joaquim will, in the not-too-distant future, investigate spec from a memory perspective and identify any areas of concern.
If I understand correctly, there is an ongoing "speedup effort" which translates into modifications of the ma00aa branch (which should change to a more generic name, since it seems that other routines like mp00ac are crucial). If that is the case, then we should cancel the pull request that has been open for a while with the initial idea of merging that branch into master - but that should take quite some time.
I have pushed my changes to ma00aa. I will take a break on this topic now, so this would be a good time for you to pull and merge this branch with the master. We should also merge the NAG_REPLACE with the master. I would expect that we should not let different branches get too far removed from each other, and it is helpful for checking the modifications to NAG_REPLACE if the executable is fast.
Speed improvements to spec will be an ongoing exercise, which suggests that we close ma00aa and open another branch called "rapido", or "azogue", or "relampago". I will let our Spanish expert have the final say on this.
The ma00aa branch has been merged into master and deleted.
I would wait until the NAG_REPLACE branch reaches a point where the code runs well (from the discussions, it seems that some debugging is under way). Then I can perform some tests and compare the master and NAG_REPLACE branches before considering merging.
If some new "speedup" improvements are to be made, e.g. by exploiting symmetries in mp00ac, then the person starting such changes should create a new branch,
git branch relampago
make some initial changes, and then,
git add -A
git commit -m "initial commit message"
git push -u origin relampago