
Comments (9)

apatlpo commented on June 1, 2024

offline discussion, reply from Mikael:

Regarding your reservation about using Python for HPC: how big are we talking? Using Python on HPC clusters has a startup cost compared to low-level C or Fortran. The startup is only a few seconds for a low number of MPI cores, but if you are planning on running on several thousand, then it can be a few minutes, depending on the architecture. I'm running on https://www.hpc.kaust.edu.sa/content/shaheen-ii and it does not really bother me. The thing is, if you are planning to run 100,000 time steps over three days, then a 5-minute startup due to Python is not a really relevant cost.

The shenfun code tries to be as efficient as possible, and as long as one writes Python code cleverly, it should not need to be slower than C. All important and computationally heavy array operations call C in the background. And I always find that I can write Cython code that is as fast as C. Cython is used for computationally heavy things like linear algebra, and for anything that for some reason is rather slow with Numpy.
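As a generic illustration of that last point (plain Numpy, not shenfun-specific; just a sketch of why vectorized array operations keep pace with compiled code):

import numpy as np
from timeit import timeit

a = np.random.rand(1_000_000)

def squared_sum_loop():
    # explicit Python loop: every iteration goes through the interpreter
    s = 0.0
    for x in a:
        s += x*x
    return s

def squared_sum_numpy():
    # same reduction, but the loop runs in compiled C inside Numpy
    return float(np.dot(a, a))

print(timeit(squared_sum_loop, number=10))    # much slower
print(timeit(squared_sum_numpy, number=10))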


apatlpo commented on June 1, 2024

Here is a concrete example from a single-layer shallow water solver.

The RHS of the equations of motion is computed with this part of the code:

def compute_rhs(duvh, uvh):
    global count
    count += 1
    duvh.fill(0)
    u, v, h = uvh[:]
    du, dv, dh = duvh[:]
    #
    du[:] = -g*D(hf,dv0,0) # -g*dhdx
    du += -u*D(uf,dv0,0)   # -u*dudx
    du += -v*D(uf,dv0,1)   # -v*dudy
    du += -Cd/H*u*np.sqrt(u**2+v**2)
    du += f*v

where:

# init variables for derivatives
uf = Function(T, False, buffer=u)
vf = Function(T, False, buffer=v)
hf = Function(T, False, buffer=h)
dv0 = np.zeros_like(u)
def D(varf,dvar,dim, padded=False):
    ''' Wrapper around Dx
    '''
    if not padded:
        dvar[:] = T.backward(project(Dx(varf, dim, 1), T))
    else:
        dvar[:] = Tp.backward(project(Dx(varf, dim, 1), T))
    return dvar

Does this look like clever Python to you?
Can anything be done better?


mikaem commented on June 1, 2024

Hi
Unfortunately, the project function is not very efficient for Fourier bases, so I would replace it. If the D function is only about getting derivatives, I think you can speed it up by a factor of a few by using the wavenumbers directly, and there is no need for padding; padding is for nonlinear terms. Here's a suggestion for a faster function D:

work = Array(T, True)   # scratch array that will hold the spectral coefficients
K = T.local_wavenumbers(scaled=True, eliminate_highest_freq=True)
def D(varf, dvar, dim):
    ''' Derivative of varf along dimension dim, taken by multiplying
    by 1j*k directly in spectral space (no project/Dx call).
    '''
    global work
    work = T.forward(varf, work)                 # physical -> spectral
    dvar = T.backward((1j*K[dim])*work, dvar)    # multiply by i*k, spectral -> physical
    return dvar
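For reference, the reason this works: differentiating a Fourier mode exp(1j*k*x) simply multiplies it by 1j*k, so the derivative can be taken entirely in spectral space. A minimal standalone sketch of the same identity with plain Numpy FFTs (not the shenfun API):

import numpy as np

N = 64
L = 2*np.pi
x = np.arange(N)*L/N
k = 2*np.pi*np.fft.fftfreq(N, d=L/N)        # angular wavenumbers

u = np.sin(3*x)
du = np.fft.ifft(1j*k*np.fft.fft(u)).real   # derivative via 1j*k in spectral space
assert np.allclose(du, 3*np.cos(3*x))       # matches the analytical d/dx sin(3x)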

You can profile the program quite easily using line_profiler. Install it and add the decorator @profile above def compute_rhs, then run with kernprof -lv swater_1L_physical.py.
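For completeness, roughly what that looks like (the profile name is injected as a builtin by kernprof -l; the try/except fallback is optional and only there so the script still runs without kernprof):

try:
    profile                    # defined by `kernprof -l` at run time
except NameError:
    profile = lambda f: f      # no-op decorator when run normally

@profile
def compute_rhs(duvh, uvh):
    ...

# shell: kernprof -lv swater_1L_physical.py prints the annotated listing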


apatlpo commented on June 1, 2024

Thanks Mikael, that is incredibly useful.

Here is an excerpt from the line_profiler output for the original version:

Wrote profile results to swater_1L_physical.py.lprof
Timer unit: 1e-06 s

Total time: 11.2486 s
File: swater_1L_physical.py
Function: compute_rhs at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                           @profile
   103                                           def compute_rhs(duvh, uvh):
   104                                               global count
   105       576       1163.0      2.0      0.0      count += 1
   106       576      16547.0     28.7      0.1      duvh.fill(0)
   107       576      11322.0     19.7      0.1      u, v, h = uvh[:]
   108       576       5874.0     10.2      0.1      du, dv, dh = duvh[:]
   109                                               #
   110       576    1152978.0   2001.7     10.2      du[:] = -g*D(hf,dv0,0) # -g*dhdx
   111       576    1095948.0   1902.7      9.7      du += -u*D(uf,dv0,0)   # -u*dudx
   112       576    1091663.0   1895.2      9.7      du += -v*D(uf,dv0,1)   # -v*dudy
   113       576     100679.0    174.8      0.9      du += -Cd/H*u*np.sqrt(u**2+v**2)
   114       576      20936.0     36.3      0.2      du += f*v
   115                                               #

and for the optimized version:

(shenfun) br222-003:swater_1L aponte$ python -m line_profiler swater_1L_physical.py_opt.lprof
Timer unit: 1e-06 s

Total time: 3.60774 s
File: swater_1L_physical.py
Function: compute_rhs at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           @profile
   101                                           def compute_rhs(duvh, uvh):
   102                                               global count
   103       576       1484.0      2.6      0.0      count += 1
   104       576      18370.0     31.9      0.5      duvh.fill(0)
   105       576      14141.0     24.6      0.4      u, v, h = uvh[:]
   106       576       6692.0     11.6      0.2      du, dv, dh = duvh[:]
   107                                               #
   108       576     349008.0    605.9      9.7      du[:] = -g*D(h,dv0,0) # -g*dhdx
   109       576     326703.0    567.2      9.1      du += -u*D(u,dv0,0)   # -u*dudx
   110       576     321089.0    557.4      8.9      du += -v*D(u,dv0,1)   # -v*dudy
   111       576     117991.0    204.8      3.3      du += -Cd/H*u*np.sqrt(u**2+v**2)
   112       576      24379.0     42.3      0.7      du += f*v
   113                                               #

This is quite a difference indeed!


mikaem commented on June 1, 2024

Hi

@apatlpo In case you're interested in squeezing out another few percent of performance: the next step is to use Cython, Numba, weave, f2py or something similar on the term 1j*K[dim]*work. This term is not as efficient as pure C due to Numpy broadcasting, which is fast, but not fully optimal here. A Numba-based implementation can easily be added:

from numba import jit, complex128, float64, int64

@jit((complex128[:, ::1], complex128[:, ::1], float64[::1], float64[::1], int64), nopython=True)
def deriv(work, wx, kx, ky, dim):
    # wx[i, j] = 1j*k * work[i, j], with k = kx[i] for dim == 0 and k = ky[j] for dim == 1
    if dim == 0:
        for i in range(wx.shape[0]):
            k0 = kx[i]*1j
            for j in range(wx.shape[1]):
                wx[i, j] = work[i, j]*k0
    else:
        for i in range(wx.shape[0]):
            for j in range(wx.shape[1]):
                k1 = ky[j]*1j
                wx[i, j] = work[i, j]*k1

work2 = Array(T, True)   # second scratch array for the differentiated coefficients
def D(varf, dvar, dim):
    ''' Derivative of varf along dimension dim, with the 1j*k
    multiplication done in the Numba-compiled deriv above.
    '''
    global K, work, work2
    work = T.forward(varf, work)                                 # physical -> spectral
    deriv(work, work2, np.squeeze(K[0]), np.squeeze(K[1]), dim)  # work2 = 1j*K[dim]*work
    dvar = T.backward(work2, dvar)                               # spectral -> physical
    return dvar
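A quick way to sanity-check the kernel (array sizes below are arbitrary, just for illustration) is to compare it against the broadcasting expression it replaces, using the deriv function defined above:

import numpy as np

nx, ny = 128, 65
w_in = np.random.rand(nx, ny) + 1j*np.random.rand(nx, ny)
w_out = np.zeros_like(w_in)
kx = np.arange(nx, dtype=np.float64)
ky = np.arange(ny, dtype=np.float64)

deriv(w_in, w_out, kx, ky, 0)
assert np.allclose(w_out, 1j*kx[:, None]*w_in)   # same result as the broadcasting version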

And for me this is about 50 % faster than 1j*K[dim]*work. It may not be worth the trouble, though, because the forward and backward transforms are much more costly (around 90 % of the cost of D). A Cython implementation would be very similar, but requires compilation.


apatlpo commented on June 1, 2024

Thanks again Mikael, this is very helpful!


apatlpo commented on June 1, 2024

I put up a comparison of the three implementations here; the profiling outputs are found here.
Unless I screwed up the Numba implementation, it does not seem to make a big difference for me.

My next step is to add a Chebyshev case.


mikaem commented on June 1, 2024

I don't think anything is messed up. It's just that the optimized term accounts for no more than 5-10 % of the total computation time, so the effect of the optimization is not so large. If you isolate 1j*K[dim]*work on a line of its own, then you can compare it directly with the line calling deriv; you should see a speed-up for that line.
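Concretely, that just means pulling the multiplication out of the backward call in the broadcasting version, something like this (same T, K and work names as above; wk is just a temporary for illustration):

@profile
def D(varf, dvar, dim):
    global work
    work = T.forward(varf, work)
    wk = 1j*K[dim]*work            # isolated so line_profiler reports it on its own line
    dvar = T.backward(wk, dvar)
    return dvar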


apatlpo commented on June 1, 2024

OK, thanks, I get it now. I've updated the links above with better isolation of the optimized sections.
I find a similar speed-up to yours for these.

