I work on a workstation with an NVIDIA card and a multicore Intel CPU (Intel's OpenCL SDK is installed).
I am trying to use the CPU as the OpenCL device, but I get nan at training time and training is then aborted. I tried various options and share their outputs below. The OpenCL-on-CPU case runs kernel compilation and vectorization before the training steps begin. Is this expected?
I also tried OpenCL-on-CPU training with the fork:
https://github.com/hughperkins/char-rnn
which keeps training, but with losses alternating between nan and ordinary floating-point values.
Is there a solution to get rid of these nans?
I am sharing the lengthy outputs for anyone interested in a performance comparison.
Lastly, I want to draw your attention to the fork's training output in the last section: there are two iterations (15 and 21) whose loss is not nan but an ordinary floating-point value.
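As a minimal check to narrow the problem down, basic pointwise and reduction ops could be tested for nan directly on the Intel device. This is only a sketch, assuming the usual cltorch API (`cltorch.setDevice`, the `:cl()` tensor conversion) and that device 1 is the Intel CPU, as in the device listing below:

```lua
require 'torch'
require 'cltorch'

cltorch.setDevice(1)  -- device 1 is the Intel CPU in my setup (see listing below)

-- a small tensor of ordinary values, copied to the OpenCL device
local a = torch.Tensor(5, 5):uniform()
local acl = a:cl()

-- exp and sum exercise the pointwise-apply and reduceAll kernels
local s = torch.exp(acl):sum()

-- nan is the only value that compares unequal to itself
print(s, s ~= s)
```

If this already prints nan/true, the problem is in the elementary kernels on the Intel platform rather than anything specific to char-rnn.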
cutorch and cltorch module listings:
th> require('cutorch')
{
streamWaitFor : function: 0x4058b558
deviceReset : function: 0x4058ba98
test : function: 0x409a58f0
_state : userdata: 0x0220bca0
streamSynchronize : function: 0x4058b928
manualSeed : function: 0x4058bf20
setStream : function: 0x4058b2b0
getMemoryUsage : function: 0x4058bcd8
setDefaultStream : function: 0x4058b488
getBlasHandle : function: 0x4058b030
CudaHostAllocator : torch.Allocator
getNumStreams : function: 0x4058b1e0
manualSeedAll : function: 0x4058bfe0
initialSeed : function: 0x4058bec0
getStream : function: 0x4058b370
setRNGState : function: 0x405840b0
setBlasHandle : function: 0x4058af68
seed : function: 0x4058bda0
getDeviceProperties : function: 0x4058bc18
reserveStreams : function: 0x4058b100
withDevice : function: 0x409a5958
setDevice : function: 0x4058bd40
seedAll : function: 0x4058be60
getNumBlasHandles : function: 0x4058af00
getDeviceCount : function: 0x4058bb58
createCudaHostTensor : function: 0x409a5998
getState : function: 0x40584170
getDevice : function: 0x4058b9b0
synchronize : function: 0x4058ad00
getRNGState : function: 0x40584048
streamWaitForMultiDevice : function: 0x4058b650
reserveBlasHandles : function: 0x4058ae30
streamBarrierMultiDevice : function: 0x4058b838
streamBarrier : function: 0x4058b720
}
[1.0545s]
th> require('cltorch')
{
setAddFinish : function: 0x41f1bd90
getDeviceCount : function: 0x41f1bbd8
getDeviceProperties : function: 0x41f1bcc8
getState : function: 0x41f1bcf0
getDevice : function: 0x412c43e0
setDevice : function: 0x412c4440
_state : userdata: 0x02467900
dumpTimings : function: 0x41f1bb28
setTrace : function: 0x41f1bd40
synchronize : function: 0x412c4468
finish : function: 0x41f1bbb0
}
[0.0146s]
th> cltorch.getDeviceProperties(1)
{
deviceType : "CPU"
maxClockFrequency : 2000
deviceName : " Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz"
maxMemAllocSizeMB : 16089
globalMemCachelineSizeKB : 0
deviceVersion : "OpenCL 1.2 (Build 43)"
localMemSizeKB : 32
openClCVersion : "OpenCL C 1.2 "
maxWorkGroupSize : 8192
globalMemSizeMB : 64358
platformVendor : "Intel(R) Corporation"
maxComputeUnits : 32
}
[0.0011s]
th> cltorch.getDeviceProperties(2)
{
deviceType : "GPU"
maxClockFrequency : 705
deviceName : "Quadro 410"
maxMemAllocSizeMB : 128
globalMemCachelineSizeKB : 0
deviceVersion : "OpenCL 1.1 CUDA"
localMemSizeKB : 47
openClCVersion : "OpenCL C 1.1 "
maxWorkGroupSize : 1024
globalMemSizeMB : 509
platformVendor : "NVIDIA Corporation"
maxComputeUnits : 1
}
training on CPU
th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/tinyshakespeare/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor...
saving data/tinyshakespeare/vocab.t7
saving data/tinyshakespeare/data.t7
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766416, grad/param norm = 4.5006e-01, time/batch = 1.74s
2/21150 (epoch 0.005), train_loss = 4.10134056, grad/param norm = 6.3375e-01, time/batch = 2.14s
3/21150 (epoch 0.007), train_loss = 3.44502399, grad/param norm = 9.4798e-01, time/batch = 1.48s
4/21150 (epoch 0.009), train_loss = 3.45054399, grad/param norm = 1.1340e+00, time/batch = 2.03s
5/21150 (epoch 0.012), train_loss = 3.33238818, grad/param norm = 7.8976e-01, time/batch = 1.62s
6/21150 (epoch 0.014), train_loss = 3.37363688, grad/param norm = 7.0334e-01, time/batch = 1.61s
7/21150 (epoch 0.017), train_loss = 3.36438210, grad/param norm = 6.5300e-01, time/batch = 2.70s
8/21150 (epoch 0.019), train_loss = 3.33342581, grad/param norm = 7.6950e-01, time/batch = 1.72s
9/21150 (epoch 0.021), train_loss = 3.29173263, grad/param norm = 6.1282e-01, time/batch = 1.79s
10/21150 (epoch 0.024), train_loss = 3.38000728, grad/param norm = 4.1881e-01, time/batch = 2.92s
training on GPU - CUDA
th train.lua -data_dir data/tinyshakespeare/ -opencl 0 -gpuid 0
using CUDA on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.16315975, grad/param norm = 4.5507e-01, time/batch = 0.50s
2/21150 (epoch 0.005), train_loss = 4.06560737, grad/param norm = 6.1592e-01, time/batch = 0.45s
3/21150 (epoch 0.007), train_loss = 3.50594769, grad/param norm = 1.2221e+00, time/batch = 0.45s
4/21150 (epoch 0.009), train_loss = 3.45355825, grad/param norm = 1.3675e+00, time/batch = 0.44s
5/21150 (epoch 0.012), train_loss = 3.35222242, grad/param norm = 1.2052e+00, time/batch = 0.44s
6/21150 (epoch 0.014), train_loss = 3.37636928, grad/param norm = 8.7048e-01, time/batch = 0.44s
7/21150 (epoch 0.017), train_loss = 3.36737326, grad/param norm = 6.1815e-01, time/batch = 0.44s
8/21150 (epoch 0.019), train_loss = 3.32496874, grad/param norm = 4.2533e-01, time/batch = 0.44s
9/21150 (epoch 0.021), train_loss = 3.29095509, grad/param norm = 4.5369e-01, time/batch = 0.44s
10/21150 (epoch 0.024), train_loss = 3.38070163, grad/param norm = 4.3267e-01, time/batch = 0.44s
training on GPU - OPENCL
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1
registering spatialconvolutionmm
using OpenCL on GPU 1...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using NVIDIA Corporation platform: NVIDIA CUDA
Using device: Quadro 410
statefultimer v0.6
number of parameters in the model: 240321
cloning criterion
cloning rnn
1/21150 (epoch 0.002), train_loss = 4.19766393, grad/param norm = 4.5006e-01, time/batch = 1.31s
2/21150 (epoch 0.005), train_loss = 4.10134039, grad/param norm = 6.3375e-01, time/batch = 1.10s
3/21150 (epoch 0.007), train_loss = 3.44484827, grad/param norm = 9.4796e-01, time/batch = 1.11s
4/21150 (epoch 0.009), train_loss = 3.45040853, grad/param norm = 1.1346e+00, time/batch = 1.10s
5/21150 (epoch 0.012), train_loss = 3.33218116, grad/param norm = 7.8938e-01, time/batch = 1.09s
6/21150 (epoch 0.014), train_loss = 3.37349831, grad/param norm = 7.0234e-01, time/batch = 1.04s
7/21150 (epoch 0.017), train_loss = 3.36418301, grad/param norm = 6.5344e-01, time/batch = 0.96s
8/21150 (epoch 0.019), train_loss = 3.33336397, grad/param norm = 7.7021e-01, time/batch = 0.95s
9/21150 (epoch 0.021), train_loss = 3.29151368, grad/param norm = 6.1312e-01, time/batch = 0.95s
10/21150 (epoch 0.024), train_loss = 3.37983895, grad/param norm = 4.1876e-01, time/batch = 0.95s
training on CPU - OPENCL
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
number of parameters in the model: 240321
cloning criterion
cloning rnn
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (8)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClTensorMathTransformReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_transformReduceOuterDimIndex> was successfully vectorized (4)
Kernel <THClTensor_kernel_transformReduceInnermostDimIndex> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceNoncontigDim> was successfully vectorized (4)
Kernel <THClTensor_reduceContigDim> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_gather> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_scatterFill> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 84.92s
loss is exploding, aborting.
training on CPU - OPENCL - fork of hughperkins
th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 0
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Intel(R) Corporation platform: Intel(R) OpenCL
Using device: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
statefultimer v0.6
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
number of parameters in the model: 240321
cloning criterion
cloning rnn
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (8)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClTensorMathTransformReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_transformReduceOuterDimIndex> was successfully vectorized (4)
Kernel <THClTensor_kernel_transformReduceInnermostDimIndex> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduce.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceNoncontigDim> was successfully vectorized (4)
Kernel <THClTensor_reduceContigDim> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClGather.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_gather> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
/home/bozkalayci/torch-distro/updates/cltorch/lib/THCl/THClScatter.cpp build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_kernel_scatterFill> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was not vectorized
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClApply.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_pointwiseApplyD> was successfully vectorized (4)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
THClReduceAll.cl build log:
Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <THClTensor_reduceAll> was successfully vectorized (8)
Kernel <THClTensor_reduceAllPass1> was successfully vectorized (4)
Kernel <THClTensor_reduceAllPass2> was successfully vectorized (8)
Done.
1/21150 (epoch 0.002), train_loss = nan, grad/param norm = 1.0629e+02, time/batch = 87.60s
2/21150 (epoch 0.005), train_loss = nan, grad/param norm = 1.0279e+02, time/batch = 2.42s
3/21150 (epoch 0.007), train_loss = nan, grad/param norm = 9.8932e+01, time/batch = 2.40s
4/21150 (epoch 0.009), train_loss = nan, grad/param norm = 9.5078e+01, time/batch = 2.41s
5/21150 (epoch 0.012), train_loss = nan, grad/param norm = 9.1377e+01, time/batch = 2.38s
6/21150 (epoch 0.014), train_loss = nan, grad/param norm = 8.7885e+01, time/batch = 2.43s
7/21150 (epoch 0.017), train_loss = nan, grad/param norm = 8.4618e+01, time/batch = 2.38s
8/21150 (epoch 0.019), train_loss = nan, grad/param norm = 8.1572e+01, time/batch = 2.38s
9/21150 (epoch 0.021), train_loss = nan, grad/param norm = 7.8735e+01, time/batch = 2.38s
10/21150 (epoch 0.024), train_loss = nan, grad/param norm = 7.6093e+01, time/batch = 2.38s
11/21150 (epoch 0.026), train_loss = nan, grad/param norm = 7.3629e+01, time/batch = 2.64s
12/21150 (epoch 0.028), train_loss = nan, grad/param norm = 7.1329e+01, time/batch = 2.44s
13/21150 (epoch 0.031), train_loss = nan, grad/param norm = 6.9178e+01, time/batch = 2.40s
14/21150 (epoch 0.033), train_loss = nan, grad/param norm = 6.7162e+01, time/batch = 2.40s
15/21150 (epoch 0.035), train_loss = 4.19833310, grad/param norm = 2.8035e-01, time/batch = 2.42s
16/21150 (epoch 0.038), train_loss = nan, grad/param norm = 6.5226e+01, time/batch = 2.38s
17/21150 (epoch 0.040), train_loss = nan, grad/param norm = 6.3410e+01, time/batch = 2.39s
18/21150 (epoch 0.043), train_loss = nan, grad/param norm = 6.1704e+01, time/batch = 2.48s
19/21150 (epoch 0.045), train_loss = nan, grad/param norm = 6.0097e+01, time/batch = 2.56s
20/21150 (epoch 0.047), train_loss = nan, grad/param norm = 5.8581e+01, time/batch = 2.62s
21/21150 (epoch 0.050), train_loss = 4.20001215, grad/param norm = 2.8320e-01, time/batch = 2.57s
22/21150 (epoch 0.052), train_loss = nan, grad/param norm = 5.7113e+01, time/batch = 2.43s
23/21150 (epoch 0.054), train_loss = nan, grad/param norm = 5.5728e+01, time/batch = 2.40s
24/21150 (epoch 0.057), train_loss = nan, grad/param norm = 5.4416e+01, time/batch = 2.38s
25/21150 (epoch 0.059), train_loss = nan, grad/param norm = 5.3173e+01, time/batch = 2.37s
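Regarding the two finite losses in the fork's run above (iterations 15 and 21): to catch the first nan with some context instead of training through it, a guard could be dropped into train.lua right after the loss is computed. This is only a sketch; `train_loss`, `params`, `grad_params`, and the loop variable `i` are the names I assume the training loop uses:

```lua
-- hypothetical guard after the forward/backward pass in train.lua:
-- nan compares unequal to itself, so x ~= x detects it
if train_loss ~= train_loss then
    print(string.format('nan loss at iteration %d, grad/param norm = %.4e',
        i, grad_params:norm() / params:norm()))
    os.exit(1)
end
```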