The memory requirement of DTW grows quadratically with the length of the input. The current approach is therefore only practical for source audio up to roughly 60 - 120 minutes long (at a low granularity level, and with a minimally sized window).
Most of this memory is taken by the cost matrix built during the DTW computation, whose size also depends on the width of the Sakoe-Chiba band (also called the "window duration").
The size of the cost matrix (in bytes) is computed by:
costMatrixMemorySizeBytes = sequence1Length * Min(sequence2Length, windowLength) * 4
Where:
- sequence1Length: length of the synthesized transcript, in audio frames
- sequence2Length: length of the source audio, in audio frames
- windowLength: length of the Sakoe-Chiba band, in audio frames
- 4: the byte size of each matrix element (a 32-bit float)
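As a sketch, the formula above can be written as a small helper function (the function and parameter names are illustrative, not part of any actual API):

```python
def cost_matrix_size_bytes(sequence1_frames: int,
                           sequence2_frames: int,
                           window_frames: int) -> int:
    """Memory needed for the DTW cost matrix, assuming 32-bit float elements.

    The band width caps the number of columns stored per row, so the
    effective second dimension is the smaller of the source length and
    the window length.
    """
    bytes_per_element = 4  # 32-bit float
    return sequence1_frames * min(sequence2_frames, window_frames) * bytes_per_element
```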
The default number of audio frames per second is in the range of 25 - 100.
For a 10 minute reference (synthesized) audio (10 * 60 * 100 = 60000 frames) and a 2.5 minute window (2.5 * 60 * 100 = 15000 frames), at high granularity (100 frames per second), we get:
costMatrixMemorySizeBytes = 60000 * 15000 * 4 = 3600000000
Which is a total memory size of 3.6GB for the cost matrix.
- 20 minute source, 5 minute window: 14.4GB
- 30 minute source, 5 minute window: 21.6GB
- 1 hour source, 10 minute window: 86.4GB!
Using low granularity
Using low granularity, which has 25 frames per second (40ms frames), makes both the source and reference frame counts smaller by a factor of 4, so the size of the matrix shrinks quadratically, by a factor of 16. In the 10 minute case, it only requires 225MB:
costMatrixMemorySizeBytes = 15000 * 3750 * 4 = 225000000
Other sizes:
- 20 minute source, 5 minute window: 900MB
- 30 minute source, 5 minute window: 1.35GB
- 1 hour source, 10 minute window: 5.4GB
- 2 hour source, 20 minute window: 21.6GB
Using x-low granularity
Using x-low granularity, which has only 12.5 frames per second (80ms frames), can reduce memory size further. However, note that at this granularity accuracy is mostly phrase-level: individual word timings may be significantly inaccurate:
- 10 minute source, 2.5 minute window: 56MB
- 20 minute source, 5 minute window: 225MB
- 30 minute source, 5 minute window: 337.5MB
- 1 hour source, 10 minute window: 1.35GB
- 2 hour source, 20 minute window: 5.4GB
- 4 hour source, 30 minute window: 16.2GB
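The figures in the lists above can be regenerated with a short script. It assumes, as the examples do, that the reference and source audio have the same duration, and uses the frames-per-second values quoted in this document for each granularity:

```python
# Frames per second for each granularity level, as quoted in this document
FRAMES_PER_SECOND = {"high": 100, "low": 25, "x-low": 12.5}

def matrix_gb(source_minutes: float, window_minutes: float, granularity: str) -> float:
    """Cost matrix size in gigabytes, assuming the reference audio is
    the same duration as the source audio."""
    fps = FRAMES_PER_SECOND[granularity]
    source_frames = int(source_minutes * 60 * fps)
    window_frames = int(window_minutes * 60 * fps)
    return source_frames * window_frames * 4 / 1e9

for granularity in ("high", "low", "x-low"):
    print(granularity, round(matrix_gb(60, 10, granularity), 3), "GB")
```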
Using two-stage DTW
It's possible to combine low and high granularities to get fine-grained alignment with lower memory requirements, using a technique called "multi-stage DTW" (also known as "hierarchical DTW"):
- First, run a coarse alignment stage with low or x-low granularity and a larger window size, which may be as wide as 20 - 30 minutes or more. This stage produces a rough estimate of the high-level alignment path.
- Then, run a second alignment stage with high, or even x-high, granularity and a smaller window duration (30 seconds up to about 1 minute), centered around the alignment path found in the first stage.
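A minimal sketch of how the second stage's band could be centered on the first stage's result (the helper name is hypothetical, and the coarse path is assumed to be a list of (referenceFrame, sourceFrame) pairs at the coarse frame rate):

```python
def fine_stage_band_centers(coarse_path, coarse_fps, fine_fps, fine_seq1_len):
    """For each fine-granularity reference frame, estimate the source frame
    the narrow second-stage window should be centered on, by scaling the
    coarse alignment path up to the fine frame rate."""
    scale = fine_fps / coarse_fps
    # Scale coarse path coordinates up to fine-granularity frame indices
    scaled = [(int(r * scale), int(s * scale)) for r, s in coarse_path]

    centers = [0] * fine_seq1_len
    idx = 0
    for ref_frame in range(fine_seq1_len):
        # Advance along the scaled path to the latest entry at or before this row
        while idx + 1 < len(scaled) and scaled[idx + 1][0] <= ref_frame:
            idx += 1
        centers[ref_frame] = scaled[idx][1]
    return centers
```

The second-stage DTW would then restrict each row's computed columns to a fixed-width window around these centers, instead of a single global band.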
This approach may be able to work for audio inputs of up to several hours in length.
Possible enhancement: cutting down memory by half using 16-bit quantization of matrix elements
It's possible to experiment with 16-bit integer quantization of the matrix elements. This may work, but would require careful tuning and scaling of the cost values stored in the matrix.
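One possible scaling scheme is sketched below. The scale factor and clipping bound are illustrative assumptions; the bound would have to be tuned to the actual range of the cost function in use, since any cost above it is clipped:

```python
import numpy as np

def quantize_costs(costs: np.ndarray, max_expected_cost: float) -> np.ndarray:
    """Map float32 cost values into uint16, halving matrix memory.
    Costs above max_expected_cost are clipped to the top of the range."""
    scale = 65535.0 / max_expected_cost
    return np.clip(costs * scale, 0, 65535).astype(np.uint16)

def dequantize_costs(quantized: np.ndarray, max_expected_cost: float) -> np.ndarray:
    """Recover approximate float32 costs from the uint16 representation."""
    return quantized.astype(np.float32) * (max_expected_cost / 65535.0)
```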
Possible enhancement: storing the matrix on disk
Currently, a large cost matrix may already be swapped to disk via virtual memory, allowing sizes greater than physical RAM. Direct file-system I/O could be used instead to manage the data, but it isn't clear whether that would be worth the added complexity.
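As a sketch of the explicit disk-backed approach, with NumPy the matrix could be memory-mapped onto a file (the file name and dimensions are illustrative; real matrices would be much larger):

```python
import numpy as np

# Allocate a disk-backed cost matrix instead of an in-RAM float32 array.
# The operating system pages parts of the file in and out as rows are
# accessed, so the matrix can grow beyond physical RAM.
rows, window = 1000, 250  # small illustrative dimensions
cost_matrix = np.memmap("cost-matrix.bin", dtype=np.float32,
                        mode="w+", shape=(rows, window))
cost_matrix[0, :] = np.inf  # rows get filled in as the DTW recurrence proceeds
cost_matrix.flush()
```

Since the DTW recurrence only ever touches the current and previous rows, access is sequential, which is a favorable pattern for a memory-mapped file.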