The memory requirement of DTW grows quadratically with the length of the input. The current approach is therefore only practical for source audio up to roughly 60 - 120 minutes long (at a low granularity level, and with a minimally sized window).
Most of this memory is taken by the cost matrix built during the DTW computation, whose size also depends on the width of the Sakoe-Chiba band (also called the "window duration").
The size of the cost matrix (in bytes) is computed by:
costMatrixMemorySizeBytes = sequence1Length * Min(sequence2Length, windowLength) * 4
Where:
- sequence1Length: length of the synthesized transcript, in audio frames
- sequence2Length: length of the source audio, in audio frames
- windowLength: length of the Sakoe-Chiba band, in audio frames
- 4: the byte size of each matrix element (a 32-bit float)
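As a sketch, the formula above can be written as a small helper function (the function and parameter names are illustrative, not part of any actual API):

```python
def cost_matrix_size_bytes(sequence1_frames: int,
                           sequence2_frames: int,
                           window_frames: int) -> int:
    """Memory needed for the DTW cost matrix, assuming 32-bit float elements.

    The band width caps the number of columns stored per row, so the
    effective second dimension is the smaller of the source length and
    the window length.
    """
    bytes_per_element = 4  # 32-bit float
    return sequence1_frames * min(sequence2_frames, window_frames) * bytes_per_element
```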
The default number of audio frames per second is in the range of 25 - 100.
For a 10 minute reference (synthesized) audio (10 * 60 * 100 = 60000 frames) and a 2.5 minute window (2.5 * 60 * 100 = 15000 frames), at high granularity (100 frames per second), we get:
costMatrixMemorySizeBytes = 60000 * 15000 * 4 = 3600000000
Which is a total memory size of 3.6GB for the cost matrix.
- 20 minute source, 5 minute window: 14.4GB
- 30 minute source, 5 minute window: 21.6GB
- 1 hour source, 10 minute window: 86.4GB!
Using low granularity
Using low granularity, which has 25 frames per second (40ms frames), makes both the source and reference frame counts smaller by a factor of 4, so the size of the matrix shrinks quadratically, by a factor of 16. In the 10 minute case, it only requires 225MB:
costMatrixMemorySizeBytes = 15000 * 3750 * 4 = 225000000
Other sizes:
- 20 minute source, 5 minute window: 900MB
- 30 minute source, 5 minute window: 1.35GB
- 1 hour source, 10 minute window: 5.4GB
- 2 hour source, 20 minute window: 21.6GB
Using x-low granularity
Using x-low granularity, which has only 12.5 frames per second (80ms frames), can reduce memory size further. However, note that at this granularity accuracy is mostly phrase-level: individual word timings may be significantly inaccurate:
- 10 minute source, 2.5 minute window: 56MB
- 20 minute source, 5 minute window: 225MB
- 30 minute source, 5 minute window: 337.5MB
- 1 hour source, 10 minute window: 1.35GB
- 2 hour source, 20 minute window: 5.4GB
- 4 hour source, 30 minute window: 16.2GB
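The figures in the lists above can be regenerated with a short script. It assumes, as the examples do, that the reference and source audio have the same duration, and uses the frames-per-second values quoted in this document for each granularity:

```python
# Frames per second for each granularity level, as quoted in this document
FRAMES_PER_SECOND = {"high": 100, "low": 25, "x-low": 12.5}

def matrix_gb(source_minutes: float, window_minutes: float, granularity: str) -> float:
    """Cost matrix size in gigabytes, assuming the reference audio is
    the same duration as the source audio."""
    fps = FRAMES_PER_SECOND[granularity]
    source_frames = int(source_minutes * 60 * fps)
    window_frames = int(window_minutes * 60 * fps)
    return source_frames * window_frames * 4 / 1e9

for granularity in ("high", "low", "x-low"):
    print(granularity, round(matrix_gb(60, 10, granularity), 3), "GB")
```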
Using two-stage DTW
It's possible to combine low and high granularities to get fine-grained alignment with lower memory requirements, using a technique called "multi-stage DTW" (also known as "hierarchical DTW"):
- First, run a coarse alignment stage with low or x-low granularity and a larger window size, which may be as wide as 20 - 30 minutes or more. This stage produces a rough estimate of the high-level alignment path.
- Then, run a second alignment stage with high, or even x-high, granularity and a smaller window duration (30 seconds up to about 1 minute), centered around the alignment path found in the first stage.
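A minimal sketch of how the second stage's band could be centered on the first stage's result (the helper name is hypothetical, and the coarse path is assumed to be a list of (referenceFrame, sourceFrame) pairs at the coarse frame rate):

```python
def fine_stage_band_centers(coarse_path, coarse_fps, fine_fps, fine_seq1_len):
    """For each fine-granularity reference frame, estimate the source frame
    the narrow second-stage window should be centered on, by scaling the
    coarse alignment path up to the fine frame rate."""
    scale = fine_fps / coarse_fps
    # Scale coarse path coordinates up to fine-granularity frame indices
    scaled = [(int(r * scale), int(s * scale)) for r, s in coarse_path]

    centers = [0] * fine_seq1_len
    idx = 0
    for ref_frame in range(fine_seq1_len):
        # Advance along the scaled path to the latest entry at or before this row
        while idx + 1 < len(scaled) and scaled[idx + 1][0] <= ref_frame:
            idx += 1
        centers[ref_frame] = scaled[idx][1]
    return centers
```

The second-stage DTW would then restrict each row's computed columns to a fixed-width window around these centers, instead of a single global band.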
This approach may be able to work for audio inputs of up to several hours in length.
Possible enhancement: cutting down memory by half using 16-bit quantization of matrix elements
It's possible to experiment with 16-bit integer quantization of the matrix elements. This may work, but would require careful tuning and scaling of the cost values stored in the matrix.
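One possible scaling scheme is sketched below. The scale factor and clipping bound are illustrative assumptions; the bound would have to be tuned to the actual range of the cost function in use, since any cost above it is clipped:

```python
import numpy as np

def quantize_costs(costs: np.ndarray, max_expected_cost: float) -> np.ndarray:
    """Map float32 cost values into uint16, halving matrix memory.
    Costs above max_expected_cost are clipped to the top of the range."""
    scale = 65535.0 / max_expected_cost
    return np.clip(costs * scale, 0, 65535).astype(np.uint16)

def dequantize_costs(quantized: np.ndarray, max_expected_cost: float) -> np.ndarray:
    """Recover approximate float32 costs from the uint16 representation."""
    return quantized.astype(np.float32) * (max_expected_cost / 65535.0)
```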
Possible enhancement: storing the matrix on disk
Currently, a large cost matrix may already be swapped to disk via virtual memory, allowing sizes greater than physical RAM. Direct file-system I/O could be used instead to manage the data, but it isn't clear whether that would be worth the added complexity.
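As a sketch of the explicit disk-backed approach, with NumPy the matrix could be memory-mapped onto a file (the file name and dimensions are illustrative; real matrices would be much larger):

```python
import numpy as np

# Allocate a disk-backed cost matrix instead of an in-RAM float32 array.
# The operating system pages parts of the file in and out as rows are
# accessed, so the matrix can grow beyond physical RAM.
rows, window = 1000, 250  # small illustrative dimensions
cost_matrix = np.memmap("cost-matrix.bin", dtype=np.float32,
                        mode="w+", shape=(rows, window))
cost_matrix[0, :] = np.inf  # rows get filled in as the DTW recurrence proceeds
cost_matrix.flush()
```

Since the DTW recurrence only ever touches the current and previous rows, access is sequential, which is a favorable pattern for a memory-mapped file.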