Comments (9)
Offline discussion; reply from Mikael:
Regarding your reservation about using Python for HPC: how big are we talking? Using Python on HPC clusters has a startup cost compared to low-level C or Fortran. The startup is only a few seconds for a low number of MPI cores, but if you are planning on running on several thousand, it can be a few minutes, depending on the architecture. I'm running on https://www.hpc.kaust.edu.sa/content/shaheen-ii and it does not really bother me. The thing is, if you are planning to run 100,000 time steps over three days, then a five-minute startup due to Python is not really a relevant cost.
The shenfun code tries to be as efficient as possible, and as long as one writes Python code cleverly, it should not need to be slower than C. All important and computationally heavy array operations call C in the background. And I always find that I can write Cython code that is as fast as C. Cython is used for computationally heavy things like linear algebra, and anything that for some reason is rather slow with NumPy.
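The point about heavy array operations calling C in the background can be illustrated with a generic sketch (plain NumPy, nothing shenfun-specific): a vectorized expression executes one compiled loop, while the equivalent pure-Python loop pays interpreter overhead on every element.

```python
import numpy as np

# Generic sketch (not shenfun code): np.dot dispatches to compiled BLAS/C,
# while the pure-Python loop is interpreted element by element.
x = np.linspace(0.0, 1.0, 100_000)

def dot_python(a, b):
    s = 0.0
    for ai, bi in zip(a, b):
        s += ai*bi
    return s

s_loop = dot_python(x, x)       # slow: Python-level loop
s_vec = float(np.dot(x, x))     # fast: single compiled call

# Both compute the same dot product, up to floating-point roundoff.
assert abs(s_loop - s_vec) < 1e-6*abs(s_vec)
```

The two results agree to roundoff; timing them (e.g. with `%timeit`) shows the vectorized call is orders of magnitude faster.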
from shenfun.
Here is a concrete example from a single-layer shallow water solver.
Computation of the RHS of the equations of motion is performed with this part of the code:
```python
def compute_rhs(duvh, uvh):
    global count
    count += 1
    duvh.fill(0)
    u, v, h = uvh[:]
    du, dv, dh = duvh[:]
    #
    du[:] = -g*D(hf, dv0, 0)            # -g*dhdx
    du += -u*D(uf, dv0, 0)              # -u*dudx
    du += -v*D(uf, dv0, 1)              # -v*dudy
    du += -Cd/H*u*np.sqrt(u**2 + v**2)
    du += f*v
```
where:
```python
# init variables for derivatives
uf = Function(T, False, buffer=u)
vf = Function(T, False, buffer=v)
hf = Function(T, False, buffer=h)
dv0 = np.zeros_like(u)

def D(varf, dvar, dim, padded=False):
    '''Wrapper around Dx'''
    if not padded:
        dvar[:] = T.backward(project(Dx(varf, dim, 1), T))
    else:
        dvar[:] = Tp.backward(project(Dx(varf, dim, 1), T))
    return dvar
```
Does this look like clever Python to you? Can anything be done better?
Hi,
Unfortunately, the `project` function is not very efficient for Fourier bases, so I would replace it. If the `D` function is only about computing derivatives, I think you can speed this up by a few factors by using the wavenumbers directly, and there is no need for padding. Padding is for nonlinear terms. Here's a suggestion for a faster function `D`:
```python
work = Array(T, True)
K = T.local_wavenumbers(scaled=True, eliminate_highest_freq=True)

def D(varf, dvar, dim):
    '''Wrapper around Dx'''
    global work
    work = T.forward(varf, work)
    dvar = T.backward((1j*K[dim])*work, dvar)
    return dvar
```
You can profile the program quite easily using `line_profiler`. Install it, add the decorator `@profile` above `def compute_rhs`, and run with `kernprof -lv swater_1L_physical.py`.
Thanks Mikael, that is incredibly useful.
Here is an excerpt from the line_profiler output for the original version:
```
Wrote profile results to swater_1L_physical.py.lprof
Timer unit: 1e-06 s

Total time: 11.2486 s
File: swater_1L_physical.py
Function: compute_rhs at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                           @profile
   103                                           def compute_rhs(duvh, uvh):
   104                                               global count
   105       576       1163.0      2.0      0.0      count += 1
   106       576      16547.0     28.7      0.1      duvh.fill(0)
   107       576      11322.0     19.7      0.1      u, v, h = uvh[:]
   108       576       5874.0     10.2      0.1      du, dv, dh = duvh[:]
   109                                               #
   110       576    1152978.0   2001.7     10.2      du[:] = -g*D(hf,dv0,0) # -g*dhdx
   111       576    1095948.0   1902.7      9.7      du += -u*D(uf,dv0,0)   # -u*dudx
   112       576    1091663.0   1895.2      9.7      du += -v*D(uf,dv0,1)   # -v*dudy
   113       576     100679.0    174.8      0.9      du += -Cd/H*u*np.sqrt(u**2+v**2)
   114       576      20936.0     36.3      0.2      du += f*v
   115                                               #
```
and for the optimized version:
```
(shenfun) br222-003:swater_1L aponte$ python -m line_profiler swater_1L_physical.py_opt.lprof
Timer unit: 1e-06 s

Total time: 3.60774 s
File: swater_1L_physical.py
Function: compute_rhs at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           @profile
   101                                           def compute_rhs(duvh, uvh):
   102                                               global count
   103       576       1484.0      2.6      0.0      count += 1
   104       576      18370.0     31.9      0.5      duvh.fill(0)
   105       576      14141.0     24.6      0.4      u, v, h = uvh[:]
   106       576       6692.0     11.6      0.2      du, dv, dh = duvh[:]
   107                                               #
   108       576     349008.0    605.9      9.7      du[:] = -g*D(h,dv0,0)  # -g*dhdx
   109       576     326703.0    567.2      9.1      du += -u*D(u,dv0,0)    # -u*dudx
   110       576     321089.0    557.4      8.9      du += -v*D(u,dv0,1)    # -v*dudy
   111       576     117991.0    204.8      3.3      du += -Cd/H*u*np.sqrt(u**2+v**2)
   112       576      24379.0     42.3      0.7      du += f*v
   113                                               #
```
This is quite a difference indeed!
Hi @apatlpo, in case you're interested in squeezing out another few percent of performance: the next step is to use Cython, Numba, weave, f2py or something similar on the term `1j*K[dim]*work`. This term is not as efficient as pure C due to NumPy broadcasting, which is fast but not fully optimal here. An implementation based on Numba is easily added:
```python
from numba import jit, complex128, float64, int64

@jit((complex128[:, ::1], complex128[:, ::1], float64[::1], float64[::1], int64), nopython=True)
def deriv(work, wx, kx, ky, dim):
    if dim == 0:
        for i in range(wx.shape[0]):
            k0 = kx[i]*1j
            for j in range(wx.shape[1]):
                wx[i, j] = work[i, j]*k0
    else:
        for i in range(wx.shape[0]):
            for j in range(wx.shape[1]):
                k1 = ky[j]*1j
                wx[i, j] = work[i, j]*k1
```
```python
work2 = Array(T, True)

def D(varf, dvar, dim):
    '''Wrapper around Dx'''
    global K, work, work2
    work = T.forward(varf, work)
    deriv(work, work2, np.squeeze(K[0]), np.squeeze(K[1]), dim)
    dvar = T.backward(work2, dvar)
    return dvar
```
For me this is about 50 % faster than `1j*K[dim]*work`. It may not be worth the trouble, though, because the forward and backward transforms are much more costly (around 90 % of the cost of `D`). A Cython implementation would be very similar, but requires compilation.
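To see what the loop kernel computes, here is a pure-NumPy sketch (illustrative shapes, not shenfun code) checking that explicit loops over rows or columns agree with the broadcast expression `1j*K[dim]*work`:

```python
import numpy as np

# Sketch with made-up shapes: the explicit loops compute the same result as
# the broadcast expression 1j*K[dim]*work; Numba's advantage is only in how
# fast the loops run, not in what they compute.
rng = np.random.default_rng(0)
work = rng.standard_normal((8, 5)) + 1j*rng.standard_normal((8, 5))
kx = np.arange(8.0)   # wavenumbers along axis 0
ky = np.arange(5.0)   # wavenumbers along axis 1

def deriv_loops(work, kx, ky, dim):
    wx = np.empty_like(work)
    if dim == 0:
        for i in range(work.shape[0]):
            wx[i, :] = work[i, :]*(1j*kx[i])
    else:
        for j in range(work.shape[1]):
            wx[:, j] = work[:, j]*(1j*ky[j])
    return wx

assert np.allclose(deriv_loops(work, kx, ky, 0), 1j*kx[:, None]*work)
assert np.allclose(deriv_loops(work, kx, ky, 1), 1j*ky[None, :]*work)
```

The broadcast version allocates a temporary for `1j*K[dim]` and another for the product; a fused loop (as in the Numba kernel) avoids those temporaries, which is where the extra speed comes from.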
Thanks again Mikael, this is very helpful!
I put up a comparison of the three implementations here; the profiling outputs are found here.
Unless I screwed up the Numba implementation, it does not seem to make a big difference for me.
My next step is to add a Chebyshev case.
I don't think anything is messed up. It's just that the optimized term accounts for no more than 5-10 % of the total computation time, so the effect of the optimization is not so large. If you isolate `1j*K[dim]*work` on a line of its own, you can compare it directly to the line calling `deriv`. You should see a speed-up for this line.
Ok, thanks, I get it now. I've updated the links above with better isolation of the optimized sections.
I find a similar speed-up to yours for these.