
Copyright 2007.  Los Alamos National Security, LLC. This material was
produced under U.S. Government contract DE-AC52-06NA25396 for Los
Alamos National Laboratory (LANL), which is operated by Los Alamos
National Security, LLC for the U.S. Department of Energy. The
U.S. Government has rights to use, reproduce, and distribute this
software.  NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL SECURITY,
LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY
FOR THE USE OF THIS SOFTWARE.  If software is modified to produce
derivative works, such modified software should be clearly marked, so
as not to confuse it with the version available from LANL.

Additionally, this program is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version. Accordingly, this
program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
===================================================================

               Sandia/LANL DNS code
                    Mark Taylor
                 [email protected]

===================================================================
README: updated 3/19/2013:  
added 'pseudospectral.pdf' to svn repo

README: updated 2/20/2009:  

Added support for the SDSC P3DFFT parallel FFT library.
P3DFFT is up to 2x faster than our internal transpose + FFTW code.
To use P3DFFT, build the "dnsp3" model.
This obsoletes our previous "dnsp" optimized model.


README: updated 2/5/2009:  

Added new optimizations to the dnsp code to skip
on-processor transpose_to/from_y operations.
To enable them, the padding must be "0 2 0" and the reference
decomposition must be a y-pencil.  The vorticity3 trick uses some FFTs
with this optimization and some FFTs without, so one should benchmark
with and without "-nov3".
To verify that this optimization is enabled, check that the code
prints this message to stdout:

"Ref decomp is y-pencil decomp: skipping transpose_to_y calls"


README: updated 9/26/2008:  

Updated to note that for the pure x-y slab decomposition
for a problem of size N^3, NCPUS can be as large as N.
(We used to have to switch to a pencil decomposition for
NCPUS>N/2.)

README: updated 7/22/2008:  
minor edits

README: updated 1/20/2008:  

We now support exact dealiasing based on phase shifting +
spherical truncation.  To measure/test dealiasing error, see
testing/dealias.sh.


README: updated 8/16/2007:  

For pencil decompositions, we now (as of 8/15/2006) have a more efficient
model "dnsp" that should be used instead of "dns" (but dnsp does not
yet allow for passive tracers).  To compile, type "make dnsp" instead of
"make dns", and the executable will be named "dnsp".


===================================================================

This directory and its subdirectories contain the Sandia/LANL DNS
code.  This code solves viscous fluid dynamics equations in a periodic
rectangular domain (2D or 3D) with a pseudo-spectral method or 4th
order finite differences and with the standard RK4 time stepping
scheme.

The code is documented in 
[1] Taylor, Kurien and Eyink, Phys. Rev. E 68, 2003. 
[2] Kurien and Taylor, Los Alamos Science 29, 2005. 
[3] 'pseudospectral.pdf' included with the source code

The code has options to solve the following equations:
1. Navier-Stokes (primitive variables)  2D and 3D
2. Navier-Stokes (stream function/vorticity)  2D only
3. Lagrangian-Averaged Navier-Stokes (the Alpha Model)  2D and 3D
4. Shallow water and shallow water alpha  2D only
5. Boussinesq (with stratification)  3D only

It can optionally allow for rotation, arbitrary aspect ratio and an
arbitrary number of passive scalars.  It is an MPI code and can use a
3D domain decomposition (allowing for slab, pencil
or cube decompositions).  For a grid of size N^3, it can
run on up to N^3/8 processors.  It has been run on as many as 18432
processors (with grids up to 4096^3).

This README file contains instructions to compile and run
the purest form of the code: Navier-Stokes with deterministic
low wave number forcing, using RK4 and a pseudo-spectral
method.  The equations are solved in a triply periodic cube
of side length 1 with no passive scalars or rotation.
The deterministic low wave number forcing is a simplified
version of Overholt and Pope, Comput. Fluids 27, 1998, and
is documented in detail in [1].  Some additional, but incomplete,
documentation for this code is in the files pseudospectral.tex and dns.doc.
Documentation for running the Boussinesq version is in
rotation_doc/rotation.tex.



The steps required to compile and run the code are:

0.  Choose the resolution
1.  set up the grid dimensions
2.  compile the code
3.  edit the input file, forcing12.inp or forcing12-bench.inp
4.  run 
5.  validate the results 
6.  analyze the output


Step 0.
Choose the resolution and resolution requirements.

The parameters set by the user are the grid resolution and the viscosity
coefficient.  The forcing we use is such that epsilon=3.58 and
KE=1.89, and so the eddy turnover time is 1.05.  (These values are
from N=1024^3; at other values of N they may change slightly.)
If we assume these values of KE and epsilon, we can then
determine the correct viscosity to use for a given N:

grid of size:             N^3
resolution constraint:    G = eta*kmax

The suggested value of G for this problem (forced low wave number
turbulence) is G=1.0.  (For improved resolution in the viscous regime,
some people use the more restrictive G=1.5.)

The code supports several types of dealiasing:

1. partial spherical (the most efficient, not fully dealiased)
2. 2/3 rule (the most efficient in terms of retained modes)
3. phase shifting (the most efficient in terms of maximum spherical
   wave number)

With phase shifting or partial spherical: kmax = 2*pi*N*sqrt(2)/3
With 2/3 dealiasing:                      kmax = 2*pi*N/3

(the 2pi shows up in the wave number because our box is of side
length 1). 

Given the choice of G and N, we can then determine the viscosity.

Using eta = (mu^3/epsilon)^(1/4) together with G = eta*kmax (so that
eta = G/kmax), solving for mu gives:

mu = epsilon**(1/3) * (G/kmax)**(4/3)

The resulting Taylor Reynolds number is:

Rl = KE * sqrt(20/(3*mu*epsilon))


Example 1:
    1024^3, with 2/3 dealiasing and G=1.  The viscosity
    coefficient used should be: mu = 5.5e-5.  
    The expected Rl = 347

Example 2:
    1024^3, with spherical dealiasing and G=1.  The viscosity
    coefficient used should be: mu = 3.5e-5
    The expected Rl = 437

Example 3:
    64^3, with spherical dealiasing and G=1.  The viscosity
    coefficient used should be: mu = .0014
    The expected Rl = 69
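
To avoid hand-arithmetic mistakes, here is a small Python sketch (my
own helper, not part of the DNS code) that reproduces the three
examples from the formulas above; the default epsilon and KE are the
values quoted at the start of Step 0:

    import math

    def viscosity(N, G=1.0, dealias="spherical", epsilon=3.58, KE=1.89):
        # kmax depends on the dealiasing choice (box of side length 1)
        if dealias in ("spherical", "phase"):
            kmax = 2 * math.pi * N * math.sqrt(2) / 3
        elif dealias == "2/3":
            kmax = 2 * math.pi * N / 3
        else:
            raise ValueError("unknown dealiasing type")
        mu = epsilon ** (1.0 / 3) * (G / kmax) ** (4.0 / 3)
        Rl = KE * math.sqrt(20 / (3 * mu * epsilon))
        return mu, Rl

    print(viscosity(1024, dealias="2/3"))        # mu ~ 5.5e-05, Rl ~ 347
    print(viscosity(1024, dealias="spherical"))  # mu ~ 3.5e-05, Rl ~ 437
    print(viscosity(64,   dealias="spherical"))  # mu ~ 0.0014,  Rl ~ 69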


Step 1.
Set up the grid dimensions.  

I use a python script, 'gridsetup.py', which will create the
needed 'params.h' file.  To run an N^3 problem with a
domain decomposition of nx*ny*nz (so NCPUS=nx*ny*nz):

      ./gridsetup.py  nx ny nz  N N N 0 0 2

The trailing "0 0 2" specifies how the arrays are padded in the x, y
and z directions.  "0 0 2" is required when using P3DFFT.  When using
the internal parallel FFT, "2 0 0" is optimal.  P3DFFT requires that nx=1.


Here are some examples:
For P3DFFT, which only supports slab and pencil decompositions:

      A 32x32x32 grid, to run using just 1 cpu:
      % cd dns/src
      % ./gridsetup.py 1 1 1 32 32 32 0 0 2

      A 64x64x64 grid, on 4 cpu:
      (parallel decomposition: 1x1x4, so 4 hyperslabs in the z-direction)
      % cd dns/src
      % ./gridsetup.py 1 1 4 64 64 64 0 0 2

      A 64x64x64 grid, on 64 cpu:
      (parallel decomposition: 1x2x32, so a pencil decomposition)
      % cd dns/src
      % ./gridsetup.py 1 2 32 64 64 64 0 0 2

The DNS code's internal parallel FFT supports cube decompositions
and also supports slab decompositions up to size N:
      % ./gridsetup.py 1 1 4 64 64 64 2 0 0
      % ./gridsetup.py 2 2 4 64 64 64 2 0 0
      % ./gridsetup.py 1 1 64 64 64 64 2 0 0


For a grid N^3, N must be a product of powers of 2, 3 and 5, and
N/nx, N/ny and N/nz must be even integers (with one exception: the
case nx=1, ny=1, nz=N is allowed).

There are some other restrictions because
of how we wrote our transpose routines.  gridsetup.py will issue
warnings if they are violated, and, if the code is run anyway, it will
print error messages and stop.

If you do not have python installed on your system, you can instead
copy the file 'params.h.test' to 'params.h' and then edit params.h by
hand.  By default, it is set up to run a 32^3 simulation on 1 cpu.

NOTE:  The most efficient configuration for large processor counts
will be with nx=1 and ny<nz.  For example, to run a 4096^3 simulation
on 8192 processors, I would modify the instructions below to:

./gridsetup.py 1 4 2048 4096 4096 4096 0 0 2
Other possibilities which might be more efficient:
(we need a performance model :-)
./gridsetup.py 1 8 1024 4096 4096 4096 0 0 2
./gridsetup.py 1 16 512 4096 4096 4096 0 0 2 
./gridsetup.py 1 32 256 4096 4096 4096 0 0 2 
./gridsetup.py 1 64 128 4096 4096 4096 0 0 2 
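
The constraints above are easy to enumerate.  Here is a small Python
sketch (my own, not part of gridsetup.py) that lists candidate pencil
decompositions; for N=4096 on 8192 processors it reproduces exactly
the possibilities listed above:

    def pencil_candidates(N, ncpus):
        # (1, ny, nz) decompositions with ny*nz = ncpus, ny <= nz,
        # and N/ny, N/nz even integers (see the constraints above)
        out = []
        ny = 1
        while ny * ny <= ncpus:
            if ncpus % ny == 0:
                nz = ncpus // ny
                if (N % ny == 0 and (N // ny) % 2 == 0 and
                        N % nz == 0 and (N // nz) % 2 == 0):
                    out.append((1, ny, nz))
            ny += 1
        return out

    print(pencil_candidates(4096, 8192))
    # [(1, 4, 2048), (1, 8, 1024), (1, 16, 512), (1, 32, 256), (1, 64, 128)]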






Step 2.
Compilation

A makefile is included which should run on Linux, SGI, OSF1 (Compaq),
AIX and SunOS, but some editing will probably be required.  On Linux,
the makefile by default uses the Intel F90 compiler, but you can edit
the file and switch this to Lahey or PGI.  For the other systems, it
uses the vendor-supplied F90 compiler.

P3DFFT: For the best performance with high resolution runs on large
processor counts, use the version of the code that uses the SDSC
P3DFFT parallel FFT.  Both P3DFFT and FFTW must be built and
installed in advance.  Currently we require that P3DFFT be built
in double precision and with the -DSTRIDE1 option.  You must also
edit the makefile to build with -DUSE_P3DFFT -DUSE_FFTW and
configure appropriate include and lib paths.  Then:

      % cd dns/src
      % make dnsp3

To use the DNS code's internal parallel FFT:

      % cd dns/src
      % make dns

There is also an optimized version of dns, "dnsp", which uses the DNS code's
internal parallel FFT but is limited to pencil decompositions like
P3DFFT.  "dnsp" is no longer supported, since "dnsp3" is faster.

For low resolutions on moderate numbers of processors, there is almost
no difference between dns, dnsp and dnsp3.


Step 3.
Edit the input file

    There are two choices for input files for this forced case in
    the src directory:

      forcing12.inp         runs 1 eddy turnover time, with output

      forcing12-bench.inp   runs 5 timesteps with only diagnostic output;
                            reports cpu time per timestep, averaged
                            over the last 4 timesteps.
                             
    Parameters of interest:

    viscosity coefficient mu (line 9)    change to the value computed above
                                         in step 0.
    derivative method (line 14)          set to "fft-dealias" for the 2/3 rule
                                         (exact dealiasing),
                                         or "fft-sphere" for spherical dealiasing
                                         (partial dealiasing, suitable for k^-5/3 or steeper spectra),
                                         or "fft-phase" for phase shifting + spherical dealiasing
                                         (exact dealiasing, 2x more expensive)

    time to run (line 26)                time is measured in dimensional units.
                                         For this problem, 1 eddy turnover time
                                         (after the code equilibrates) is
                                         close to 1 dimensional time, so set
                                         this to 1.0
 

Step 4.
Run the code:
    To run the code on a single processor:

        ./dnsp3  -i forcing12-bench.inp output_name  

    using the input file 'forcing12-bench.inp'.  This runs a triply 
    periodic forced turbulence problem.  The forcing is in
    wave numbers 1 and 2. 

    The output files are:

        output_name0000.0000.u         u component of velocity
        output_name0000.0000.v         v component of velocity
        output_name0000.0000.w         w component of velocity

        output_name0000.0000.spec      power spectrum (1D and spherical)
        output_name0000.0000.spect     transfer spectrum
   
        output_name0000.0000.scalars         KE, dissipation rates, etc...
        output_name0000.0000.scalars-turb    skewness, other scalars...

    where 0000.0000 is the time of the snapshot.  So for t=0.25,
    the filename would be:  output_name0000.2500.u.
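
    The time field appears to be a fixed-width, zero-padded value; a
    guess at the convention, in Python, based only on the example above:

        def snapshot_name(base, t, comp="u"):
            # hypothetical helper: time written as a 9-character,
            # 4-decimal-place field
            return "%s%09.4f.%s" % (base, t, comp)

        print(snapshot_name("output_name", 0.25))  # output_name0000.2500.u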


    For a parallel run:
        mpirun -np X  ./dnsp3 -mio -i forcing12.inp output_name  

    The "-mio" option will turn on MPI-IO.  Without MPI-IO, all output
    will be funneled through processor 0.  With MPI-IO, the code will
    still produce identical files, but will have up to M
    processors write data with asynchronious non-overlapping
    writes.  The default value of M is 32, but it can be tweaked by
    editing subroutine "mpi_io_init".   M can be set to be equal to
    the number of processes if your MPI-IO library and/or parallel file
    system has good I/O aggregation.


    These output files are also used as restart files.
    The restart is exact.  To do a restart run, copy (or create links to)
    the snapshot files to be used as restart.u, restart.v and restart.w,
    and then add the "-r" option when running the code.
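
    For example, to restart from the t=0.25 snapshot named as in the
    example above:

        % ln -s output_name0000.2500.u restart.u
        % ln -s output_name0000.2500.v restart.v
        % ln -s output_name0000.2500.w restart.w
        % mpirun -np X ./dnsp3 -r -mio -i forcing12.inp output_name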
   

Step 5.
Validate the results.

There are several test cases and scripts which run short problems
and make sure the output is identical to the reference solutions.
(These are not yet documented.)

A quick test:  run the 64^3 problem, with the forcing12.inp
input file, with mu=.0014, "fft-sphere", and time=1.0.  

To run this test on 2 processors:
cd src
./gridsetup.py 1 1 2 64 64 64 0 0 2
make dnsp3
./dnsp3 -i forcing12.inp

To run on more processors, see scripts/readme.job

At simulation time 1.0, the data (given in stdout) should be approximately:
             

*** dns code w/ FFT99 ********************************************************
             AMD Athlon     PGI compiler     g95           AMD X2
           FC4   FC6         pentium M     powerPC        F7 gfortran   
          intel  gfortran   Linux RHEL4   MAC OS10.4    2.1GHZ DDR2-800 
                                            2cpu         1 core   2 core
kmax*eta = .9990 1.0214       .9839         .9626        1.0214   same  
KE       =1.6800 1.6265      1.6828        1.7197        1.6265  
R_lambda =61.47   62.22       59.7          58.4         62.22  
run time  20min   22.8       66min         12.65          8.41    4.96  


2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.30min

*** dnsp code w/ FFT99 ******************************************************

            intel compiler  PGI compiler    g95         intel on Thunderbird
              AMD Athlon    pentium M      2ghz powerPC   intel x86_64
              Linux FC4     Linux RHEL4    MAC OS10.4     Linux
CPUS     =                                 1cpu  2cpu      1x1x32  2x1x32  4x1x32
kmax*eta =                                .9626  same     .9990   SAME    SAME
KE       =                               1.7197           1.6801 
R_lambda =                                 58.4           61.47 
d/dt vis =                               -4.1263          -3.5569         
d/dt f   =                                4.3034           4.5351 
run time =                                14.5m   11.4m    .42m


Other collected run times:
2.4ghz Xeon Linux, ifort, 1 cpu:  13.81
2.4ghz quad core, DDR2-666, gfortran, 1 core: 2.97
2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.26min


*** dnsp3 code (P3DFFT) ************************************************
2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.23min

kmax*eta =  0.9641
KE       =  1.6525
R_lambda =  58.9
d/dt vis =  -3.747
d/dt f   =   4.122



**************************************************************************

The initial condition has random phases, and so these results are
compiler and OS dependent.  I haven't done a detailed sensitivity study,
but you can get a sense of the fluctuations by looking at the above
numbers.
 



Step 6.
Looking at the data.

Not yet documented - contact Mark Taylor ([email protected]) for help.

There are matlab scripts in the dns/matlab directory for
reading and processing all the output produced by the code.

Some more complex analysis, such as computing structure
functions and PDFs, is done with fortran programs in the dns/src
directory.















