Comments (20)
OK, I agree with the workflow, but then we should first merge the ma00aa branch with master (I will do some final testing) and then create a new branch called something like "fastspec" and work on these matters from there. I will get this set up tomorrow and let you know.
from spec.
@SRHudson wrote:
I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I
intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a
large number of interfaces is really with ma00aa or somewhere else.
I performed a study with the l2stell.sp testcase, increasing Fourier resolution (Mpol=Ntor=4,6,8), measuring the total computation time, and comparing the two SPEC versions ("fast" is from ma00aa branch, "slow" is from master branch). Find attached the result of this exercise.
l2stell_cputime_versus_fourier.pdf
For example, at m=n=8, the slow code takes ~1558 sec and the fast code takes ~645 sec, a speedup factor of ~2.4, which seems to remain constant with increasing Fourier resolution.
Moreover, during these runs I did some profiling, and found that the ma00aa routine takes an increasing percentage of the total CPU time as the Fourier resolution increases. This means that overall code speedup relies on ma00aa optimization even more at high Fourier resolution.
To illustrate this point, here are some numbers:
"slow spec": ma00aa 74% at m=n=4 ; 85% at m=n=8
"fast spec": ma00aa 35% at m=n=4 ; 52% at m=n=8
Comment One) What you are calling "slow spec" is actually non-stellarator-symmetric spec, so comparisons between "fast spec", which exploits stellarator symmetry, and "slow spec" are not particularly meaningful. Recall that the only speed improvements I made to ma00aa (beyond the re-looping and the elimination of divisions suggested by Sam) were to not compute the non-stellarator-symmetric terms, which were nowhere being used.
Comment Two) It is interesting that at higher M, N ma00aa takes a higher percentage of the total time. I will continue to work on this, as I have described earlier (by moving the calculation of DToocc, DToocs, DToosc and DTooss from ma00aa to preset).
@SRHudson not sure I understand your Comment One:
the only speed improvements that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.
Does this mean that you modified ma00aa.h in such a way that IF the system is stellarator symmetric THEN those terms are not calculated, but that in general they are still calculated, and thus not much speedup is expected?
Ultimately, it would be nice to determine a scaling law, e.g.,
T \sim L^a M^b N^c Nv^d
for each subroutine, where L, M and N are the radial, poloidal and toroidal resolutions, Nv is the number of volumes, and a, b, c and d are exponents.
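The scaling-law idea above can be tested numerically: given times measured at several resolutions, a least-squares fit of log T against log L, log M, log N and log Nv recovers the exponents. A minimal sketch with synthetic data (both the measurements and the true exponents here are invented purely for illustration):

```python
# Fit T ~ L^a * M^b * N^c * Nv^d from timing measurements by linear
# least squares on log T = log C + a log L + b log M + c log N + d log Nv.
import numpy as np

# Hypothetical runs: columns are (L, M, N, Nv) at each resolution tested.
runs = np.array([
    [2, 4, 4, 1],
    [4, 4, 4, 1],
    [4, 8, 8, 1],
    [4, 8, 8, 2],
    [8, 8, 4, 2],
])
# Synthetic timings generated from T = 2 * L * M^2 * N^2 * Nv,
# i.e. true exponents (a, b, c, d) = (1, 2, 2, 1).
T = 2.0 * runs[:, 0] * runs[:, 1]**2 * runs[:, 2]**2 * runs[:, 3]

# Design matrix: a constant column for log C, then the log resolutions.
A = np.column_stack([np.ones(len(runs)), np.log(runs)])
coeffs, *_ = np.linalg.lstsq(A, np.log(T), rcond=None)
logC, a, b, c, d = coeffs
print(np.round([a, b, c, d], 3))  # recovers the exponents used above
```

With real data the fit will of course not be exact, and one would want many more runs than unknowns.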
It is very interesting that matrix seems to be taking a lot of time. There is NO computation performed in matrix; it just loops over and assigns matrix elements. I guess that by re-looping, the time spent in matrix can be reduced significantly.
In
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
I have color-coded the stellarator symmetric terms and the non-stellarator symmetric terms.
As of last week, all the terms were being calculated regardless of Istellsym. All I did was to not compute the non-stellarator-symmetric terms if Istellsym = 1.
For non-stellarator-symmetric equilibria, as you can see from
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
there will be four times as many quantities to compute. We can expect that non-stellarator-symmetric equilibria will take much longer. You can test this: just set Istellsym = 0 in the l2stell.sp input file and take a look. If the boundary is stellarator symmetric, then the equilibria computed with Istellsym = 1 and Istellsym = 0 should be identical.
@SRHudson OK I understand, although somehow I do not see any red terms when I open the PDF.
How are you doing the timing?
I implemented a timing capability. It might be clumsy, and I don't know if it works everywhere. Set Ltiming=1 in the &diagnos namelist and a screen output should give the time taken by each routine. When running on a single CPU (with Linitialize=1 in l2stell.sp) I see:
mp00ac : 45.35s
ma00aa : 21.96s
tr00ab : 7.68s
You will see that each call to a subroutine is performed using the macro WCALL, which automatically adds time differences to a T(filename) variable. See also the macro BEGIN, which also adds time.
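The WCALL idea can be sketched in a few lines of Python; the routine name below is a placeholder, not the actual SPEC interface. Each wrapped call adds a wall-clock time difference to a per-routine total, analogous to the T(filename) variables:

```python
# A minimal sketch of the WCALL mechanism: wrap each routine so the
# elapsed wall-clock time of every call is accumulated per routine.
import time
from collections import defaultdict
from functools import wraps

T = defaultdict(float)  # accumulated wall time per routine, like T(filename)

def wcall(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()          # analogous to MPI_WTIME()
        result = fn(*args, **kwargs)
        T[fn.__name__] += time.perf_counter() - t0
        return result
    return wrapper

@wcall
def ma00aa_like():
    time.sleep(0.01)  # stand-in for real work

for _ in range(3):
    ma00aa_like()

print(f"ma00aa_like : {T['ma00aa_like']:.2f}s")
```

The accumulated totals are what gets printed at the end of the run when Ltiming=1.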
@SRHudson I believe @jloizu is doing the timing using the standard GNU profiling extension. This is useful as it automatically calculates the timing for all routines.
So, do the results from the different timings agree?
No, the times do not agree. This is how I do it, you can try:
- compile the code with the profiling option, e.g. make CC=intel_prof
- run the code normally; after execution, that should generate a file called gmon.out
- finally, type the command gprof xspec > somename.out
- you can then see all the profiling information inside somename.out
OK, I made with CC=intel_prof, ran xspec, ran gprof on it, and got this:
  %   cumulative     self              self    total
 time    seconds   seconds    calls  s/call   s/call  name
73.54      22.10     22.10      410    0.05     0.06  ma00aa_
 8.92      24.78      2.68      410    0.01     0.01  matrix_
 4.36      26.09      1.31       43    0.03     0.69  dforce_
 3.76      27.22      1.13      412    0.00     0.00  volume_
which I think is what Joaquim has been reporting, namely that ma00aa is where all the time is. This method also shows that the code took about 30 sec to complete.
My timing shows that mp00ac was taking all the time, and that the code took about 83 sec. I guess that most of you will regard the standard GNU profiling extension as being the more reliable, but my timing agrees with my watch!!
Perhaps the difference is because I am using MPI_WTIME() to provide the time. Which is more important: the "timing" provided by the so-called standard GNU profiling extension, or the time that the user has to wait to get the result?
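This is the distinction between CPU time and wall-clock time: gprof attributes CPU time spent in the instrumented code, while MPI_WTIME() measures elapsed wall time, so time spent blocked or inside un-instrumented libraries shows up only in the latter. A toy demonstration, using Python's two timers as stand-ins:

```python
# Wall-clock time vs. CPU time: a blocked process accumulates wall time
# but almost no CPU time, which is how the two kinds of "timing" diverge.
import time

t_wall0 = time.perf_counter()   # wall clock, like MPI_WTIME()
t_cpu0 = time.process_time()    # CPU time, roughly what a profiler attributes

time.sleep(0.2)                 # blocked: wall time passes, CPU time does not

wall = time.perf_counter() - t_wall0
cpu = time.process_time() - t_cpu0
print(f"wall = {wall:.2f}s, cpu = {cpu:.2f}s")
```

The same effect occurs when the real work happens in an external library that the profiler cannot see into.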
I am still curious about the timing. Can the GNU profiling handle the calls to the NAG library?
Also, is there a way to "profile" the memory requirements? I have absolutely no clue how much memory SPEC uses, but I am sure that the memory can be reduced significantly.
@SRHudson, no, it can't tell you the time spent in those subroutines because we don't compile those libraries, we only link to them. However, while gprof is nice, there are better tools. And the discrepancy between gprof and wall-clock timing is well documented.
Regarding memory, you just need to compile with the -g option and run with valgrind as a first pass. The tool is good for detecting uninitialized variables, failures to deallocate variables, and other memory-related issues. At some level, memory reduction begins to interfere with the readability of codes. I suspect this is why many legacy codes read more like letter salad, as they attempt to reuse variables.
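Valgrind is the right tool for a compiled code like SPEC; purely to illustrate the general idea of memory profiling (attributing allocations to the code that made them), here is a minimal Python sketch using tracemalloc:

```python
# Illustration only: tracemalloc tracks current and peak memory allocated
# by Python code between start() and stop(), the same idea a memory
# profiler applies to a compiled code.
import tracemalloc

tracemalloc.start()
big = [0.0] * 1_000_000  # an allocation we want to attribute (~8 MB of pointers)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
del big

print(f"current = {current / 1e6:.1f} MB, peak = {peak / 1e6:.1f} MB")
```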
OK, I tried the Ltiming option and also got that mp00ac takes most of the time (by far: 10 times more than any other subroutine). However, it is weird, because if I rerun the same case it prints out different numbers, e.g. for the mp00ac subroutine:
FIRST RUN
finish : 24.02 : time spent in mp00ac = 17.33 ;
SECOND RUN
finish : 75.73 : time spent in mp00ac = 54.48 ;
while both runs took the same "watch time"... Also, on the IPP cluster I get all zeros if I try Ltiming... perhaps some environment variable related to timing is messing things up?
In any case, it looks like Ltiming says that mp00ac dominates the CPU time, while gprof says that ma00aa dominates: maybe the latter neglects NAG calls, which could be the reason. If this is the case, then we should trust the Ltiming output and try to optimize mp00ac.
- I think that the Ltiming option will only be reliable if only one cpu is used. Could this explain your strange result?
- mp00ac does little more than call a NAG routine to solve a linear equation for the Beltrami field in each volume, so it is entirely consistent that if gprof ignores NAG calls then mp00ac will not show up. Also, remember that the primary calculation that spec performs is to compute the Beltrami field in each volume, so it is entirely expected that this will be a significant cost. Also, ma00aa computes the volume integrals of the Chebyshev-Fourier-metric products, so it is similarly expected that this will be costly.
- If you have a look around in the source, you will see the GETTIME and WCALL macros, which call MPI_WTIME(). If this is not appropriate on a given machine, then Ltiming will be meaningless.
- Yesterday, I identified that the linear solver in mp00ac was dominating, and Josh has kindly begun work on this. Initial timing improvements are very encouraging, and there is more that Josh and I will do over the next few days to further reduce the computation.
- I will not concentrate on memory at present, but I hope that Joaquim will, in the not-too-distant future, investigate spec from a memory perspective and identify any areas of concern.
If I understand correctly, there is an ongoing "speedup effort" which translates into modifications of the ma00aa branch (which should change to a more generic name, since it seems that other routines like mp00ac are crucial). If that is the case, then we should cancel the pull request that has been open for a while with the initial idea of merging that branch into master - but that should take quite some time.
I have pushed my changes to ma00aa. I will take a break on this topic now, so this would be a good time for you to pull and merge this branch with the master. We should also merge the NAG_REPLACE with the master. I would expect that we should not let different branches get too far removed from each other, and it is helpful for checking the modifications to NAG_REPLACE if the executable is fast.
Speed improvements to spec will be an ongoing exercise, which suggests that we close ma00aa and open another branch called "rapido", or "azogue", or "relampago". I will let our Spanish expert have the final say on this.
The ma00aa branch has been merged into master and deleted.
I would wait until the NAG_REPLACE branch reaches a point where the code runs well (from the discussions, it seems that some debugging is under way). Then I can perform some tests and compare the master and NAG_REPLACE branches before considering merging.
If some new "speedup" improvements are to be made, e.g. by exploiting symmetries in mp00ac, then the person starting such changes should create a new branch,
git branch relampago
make some initial changes, and then,
git add -A
git commit -m "initial commit message"
git push -u origin relampago