lowerquality / gentle Goto Github PK

View Code? Open in Web Editor NEW

1.4K 1.4K 294.0 1.47 MB

gentle forced aligner

Home Page: https://lowerquality.com/gentle/

License: MIT License

Makefile 1.22% Shell 1.57% Python 64.78% C++ 17.28% HTML 14.39% Dockerfile 0.77%

gentle's People

Contributors

Stargazers

Watchers

Forkers

hihihippp soumith seantrue soroushmehr kastnerkyle saisus ronen therewasaguy wanjinchang leo23 engassa bit somaticapi ddbourgin saxenauts freehayden saquibajaz maheshm stucksubstitute rohithkodali cesarodriguez4 ecoblockchain enfratec davegreenwood conde-nast-international jiangyangbo woifram kwstewar leoq vimily opennewslabs pandeydivesh15 opengxia adbrebs wan luckyarthur tshields44 koyo922 lparam tyaq somtts edicon yifan dummy-ai prouhard mimbres davidfumo worthwhileindustries small-yellow-duck deepread kundan2510 prabhasp ml-lab linghuichen dacson madhusud r9y9 rajivpoddar lincaiming litetoooooom heypinch chuckcho copperdong bkthesisteam humblewolf sirifarif melspectrum007 wan54 twistedmove jasonray716 unforeseenocean adam1105 af258963 dacrusa romanscott canius tlc rakeshshrestha31 wkranti sensecollective afcarl feeltraincoop descriptinc icesmith acwooding llf-amy robinsax rafaelvalle pengyulong montzkie18 merik-vt emlynazuma dariusk statmuse hackalog kdavis-mozilla kishorekolla lvscar cmc1023 xiaotingfu

gentle's Issues

Add Example Alignments

So people can have an idea of what it does before downloading, we should link to some example outputs from the documentation page. What are some fun public domain audio files that we could align?

Alignment struggles at 20-second intervals

The "grain" of arbitrary, 20-second endpointing results in a high propensity for multi-word error around the seams.

Adding to Gentle's Pronunciation Dictionary

I'm a big fan of Gentle but have repeatedly run into words which are OOV (which I take to mean Out Of Vocabulary).

After reading some other questions here on Github, I saw that Gentle uses ARPAnet phonemes (as does the CMU Pronunciation Dictionary).

I would greatly appreciate being able to add words and their corresponding ARPAnet phonemes to Gentle's pronunciation dictionary (even if it only applied to my local instance).

Looking through the source here on Github, I have not been able to locate where this is stored.
If you could direct me to where the file is located, I would appreciate it.

Building an interface into Gentle where OOV words are listed and the user is presented a form to enter the corresponding ARPAbet phonemes for each word before rerunning alignment would also be desirable.

Any help you can provide me to this end would be welcome.

P.S. In the meantime, I have been attempting to use homophone phrases as stand-ins for OOV words in order to gain timing matches with essentially correct phonemes, but this is time-consuming when thinking of the best homophones and introduces many unwanted complications into maintaining my master transcript.

Remove sil phones

The current alignment process includes some leading and trailing silence in the output offsets. It would be better to return tighter bounds around each spoken word.

I think we can get a lot closer if we use the phone-level alignment and ignore remove silence phones.

For instance, this one can just start at 17.77:

{
  "case": "success", 
  "end": 17.85, 
  "phones": [
    {
      "duration": 3.62, 
      "phone": "sil"
    }, 
    {
      "duration": 0.08, 
      "phone": "ay_S"
    }
  ], 
  "alignedWord": "i", 
  "start": 14.15, 
  "word": "I"
},

Bug Report: Recent Instability

Early this morning I gave Gentle's hosted version three audio files without transcripts.

The third completed but the first two (alignments 488ca0e1 and 4467f32d) stalled and are still stuck in the transcription phase.

This afternoon, when retrying to submit the first and second, the first finished without error, however the second dumped me out at http://gentle-demo.lowerquality.com/transcriptions (without a transcription ID) and presented me with the following:

<html><head><title>web.Server Traceback (most recent call last)</title></head><body><b>web.Server Traceback (most recent call last):</b>

<div>
  <style type="text/css">
    div.error {
      color: red;
      font-family: Verdana, Arial, helvetica, sans-serif;
      font-weight: bold;
    }

    div {
      font-family: Verdana, Arial, helvetica, sans-serif;
    }

    div.stackTrace {
    }

    div.frame {
      padding: 1em;
      background: white;
      border-bottom: thin black dashed;
    }

    div.frame:first-child {
      padding: 1em;
      background: white;
      border-top: thin black dashed;
      border-bottom: thin black dashed;
    }

    div.location {
    }

    span.function {
      font-weight: bold;
      font-family: "Courier New", courier, monospace;
    }

    div.snippet {
      margin-bottom: 0.5em;
      margin-left: 1em;
      background: #FFFFDD;
    }

    div.snippetHighlightLine {
      color: red;
    }

    span.code {
      font-family: "Courier New", courier, monospace;
    }
  </style>

  <div class="error">
    <span>exceptions.IOError</span>: <span>[Errno 2] No such file or directory: 'www/view_alignment.html'</span>
  </div>
  <div class="stackTrace">
    <div class="frame">
      <div class="location">
        <span>/usr/local/lib/python2.7/dist-packages/twisted/web/server.py</span>:<span>183</span> in
        <span class="function">process</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">182</span>
          <code class="code"> &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;self._encoder = encoder</code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">183</span>
          <code class="code"> &#160; &#160; &#160; &#160; &#160; &#160;self.render(resrc)</code>
        </div><div class="snippetLine">
          <span class="lineno">184</span>
          <code class="code"> &#160; &#160; &#160; &#160;except:</code>
        </div>
      </div>
    </div><div class="frame">
      <div class="location">
        <span>/usr/local/lib/python2.7/dist-packages/twisted/web/server.py</span>:<span>234</span> in
        <span class="function">render</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">233</span>
          <code class="code"> &#160; &#160; &#160; &#160;try:</code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">234</span>
          <code class="code"> &#160; &#160; &#160; &#160; &#160; &#160;body = resrc.render(self)</code>
        </div><div class="snippetLine">
          <span class="lineno">235</span>
          <code class="code"> &#160; &#160; &#160; &#160;except UnsupportedMethod as e:</code>
        </div>
      </div>
    </div><div class="frame">
      <div class="location">
        <span>/usr/local/lib/python2.7/dist-packages/twisted/web/resource.py</span>:<span>250</span> in
        <span class="function">render</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">249</span>
          <code class="code"> &#160; &#160; &#160; &#160; &#160; &#160;raise UnsupportedMethod(allowedMethods)</code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">250</span>
          <code class="code"> &#160; &#160; &#160; &#160;return m(request)</code>
        </div><div class="snippetLine">
          <span class="lineno">251</span>
          <code class="code"></code>
        </div>
      </div>
    </div><div class="frame">
      <div class="location">
        <span>serve.py</span>:<span>219</span> in
        <span class="function">render_POST</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">218</span>
          <code class="code"> &#160; &#160; &#160; &#160;# Copy over the HTML</code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">219</span>
          <code class="code"> &#160; &#160; &#160; &#160;shutil.copy(get_resource('www/view_alignment.html'), os.path.join(outdir, 'index.html'))</code>
        </div><div class="snippetLine">
          <span class="lineno">220</span>
          <code class="code"></code>
        </div>
      </div>
    </div><div class="frame">
      <div class="location">
        <span>/usr/lib/python2.7/shutil.py</span>:<span>119</span> in
        <span class="function">copy</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">118</span>
          <code class="code"> &#160; &#160; &#160; &#160;dst = os.path.join(dst, os.path.basename(src))</code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">119</span>
          <code class="code"> &#160; &#160;copyfile(src, dst)</code>
        </div><div class="snippetLine">
          <span class="lineno">120</span>
          <code class="code"> &#160; &#160;copymode(src, dst)</code>
        </div>
      </div>
    </div><div class="frame">
      <div class="location">
        <span>/usr/lib/python2.7/shutil.py</span>:<span>82</span> in
        <span class="function">copyfile</span>
      </div>
      <div class="snippet">
        <div class="snippetLine">
          <span class="lineno">81</span>
          <code class="code"></code>
        </div><div class="snippetHighlightLine">
          <span class="lineno">82</span>
          <code class="code"> &#160; &#160;with open(src, 'rb') as fsrc:</code>
        </div><div class="snippetLine">
          <span class="lineno">83</span>
          <code class="code"> &#160; &#160; &#160; &#160;with open(dst, 'wb') as fdst:</code>
        </div>
      </div>
    </div>
  </div>
  <div class="error">
    <span>exceptions.IOError</span>: <span>[Errno 2] No such file or directory: 'www/view_alignment.html'</span>
  </div>
</div>

</body></html>

Multi-pass alignment

I suspect accuracy on long inputs could be significantly improved with a two-pass alignment. When the input has line or paragraph breaks, if the initial alignment is reasonably good, the audio from each paragraph could be isolated and re-run (with a correspondingly smaller language model).

Other language models - where to get

Not the actual issue of gentle. But could you advice where it is better to look for appropriate models for other (major) languages (DE, SP, IT, FR)?

Add Confidence Measure

Not sure what the best one would be... This would helpful to get higher quality results when doing supercut-like experiments with large datasets.

Unable to compile Kaldi version after El Capitan

Since upgrading to El Capitan I haven't been able to install the included version of Kaldi. There's an issue with gcc-fortran. Were you able to get it compiling?

Can I adjust the data represented within align.csv?

Hello again, I see that align.json has the timestamp data for phonemes. I was wondering if there is a way to get align.csv to also list that info. Maybe by adjusting the why in which the data is being parsed? Any thoughts on the best way to go about this?

[investigate] Allow Relative Paths in Kaldi Config

I want to make a patch to Kaldi that allows relative paths in config files. How hard will it be?

IPA Transcription

Is the English-language transcription intentional? The International Phonetic Alphabet would be generalizable and more precise.

C++ Exceptions Hard To Trace

We're not capturing stderr from standard_kaldi so it's really hard to figure out what went wrong when there's a problem. I know we disabled it because of some problem with the Mac app packaging. However, I can't remember the specific issue. Is it safe to re-enable stderr so we can get better exceptions?

Hyphenated words cannot be aligned

get_matched_kaldi_sequence returns words with hyphens intact, while the alignment returns hyphenated words as separate tokens. Probably, get_matched_kaldi_sequence should return hyphenated words as separate tokens, and then a post-processing step can re-merge (or leave separate) such words.

Consider Removing numpy Dep

Numpy is really huge. Installation would be much faster if you dropped the numpy dep and did the audio chunking in pure stdlib Python.

Does Gentle use alternate pronunciation phoneme sets where available?

I have I have noticed that whether a speaker says "probably" or "prob'ly", Gentle seems to always use the phoneme set for "prob'ly" (P R AA B L IY).

At first, I suspected that this was due to a defect/eccentricity in CMU's pronunciation dictionary or that perhaps it only listed one pronunciation of each word.

After looking a couple of copies of the dictionary, however, I see that:
CMUSphinxDict lists both:
PROBABLY P R AA B AH B L IY
PROBABLY(2) P R AA B L IY

and CMUDict-0.7b lists only:
PROBABLY P R AA1 B AH0 B L IY2

Could you shed some light on what is happening in cases such as these?

In future, it would be desirable (given that Gentle has correctly marked the beginning and the end of the word) to tell Gentle that specific instances in the transcript are actually using the other pronunciation and have the aligner re-examine them for new phoneme timing on the basis of that alternate phoneme set.

view_alignment "update" race condition

On very long alignments, get_json("align.json") may hit the file as it's being written, leading to a (fatal) parse error.

We should either catch the exception, be smarter about clobbering align.json on the server, or both.

Feature Request: Identify OOV words prior to alignment.

It would save time and processing power if Gentle were able to tell the user before it ever attempts alignment if they have words in their transcript which are out of Gentle's pronunciation dictionary and allow them to re-attempt validation after edits have been made to the transcript to compensate, until they are able to submit a transcript for alignment which contains no OOV words (if desired).

I remember your comments about wanting to use Phonetisaurus to dynamically generate pronunciations for OOV words, and while I agree that that is a better long term solution, I think that this could be helpful in the mean time.

Model File Layout

I wanted to clarify how you see the model files being laid out. I think there's been some confusion and the code as-is doesn't work with the latest model files from lowerquality.com.

Is this the intended file layout?

data
├── nnet_a_gpu_online
│   ├── conf
│   │   ├── ivector_extractor.conf
│   │   ├── ivector_extractor.conf.orig
│   │   ├── mfcc.conf
│   │   ├── mfcc.conf.orig
│   │   ├── online_cmvn.conf
│   │   ├── online_cmvn.conf.orig
│   │   ├── online_nnet2_decoding.conf
│   │   ├── online_nnet2_decoding.conf.orig
│   │   ├── splice.conf
│   │   └── splice.conf.orig
│   ├── final.mdl
│   ├── ivector_extractor
│   │   ├── final.dubm
│   │   ├── final.ie
│   │   ├── final.mat
│   │   └── global_cmvn.stats
│   └── smbr_epoch2.mdl
└── smbr_epoch2.mdl

PROTO_LANGDIR/
├── graphdir
│   ├── phones
│   │   ├── disambig.int
│   │   ├── disambig.txt
│   │   ├── silence.csl
│   │   ├── word_boundary.int
│   │   └── word_boundary.txt
│   ├── phones.txt
│   └── words.txt
├── langdir
│   ├── L.fst
│   ├── L_disambig.fst
│   ├── phones
│   │   ├── disambig.int
│   │   ├── disambig.txt
│   │   ├── silence.csl
│   │   ├── word_boundary.int
│   │   └── word_boundary.txt
│   ├── phones.txt
│   └── words.txt
├── modeldir
│   ├── final.mdl
│   └── tree

I'll update the code to match whatever the correct layout is.

Long files crash Gentle DMG

Reported on Twitter by @jarm:

struggling with Gentle using large files (window dies), should I chop them up?

same again with just audio (80MB; 2hrs47m). I tried with smaller clip (8.7MB;4m) and that works fine.

seems to hang when transcription finishes and page layout changes

I wasn't watching moments crashes happened, but both seem to be when transcription ends. this is w/ offline version btw.

"damaged and can't be opened" message on startup

I get this message with the 0.5.0 dmg on my mac running El Capitan

"Prons" alignment is inconsistent

Often, the phonemes returned seem to cross between words. For instance, here the iy of "Memory" is grouped under "and."

    {
      "duration": 0.43,
      "k_word": "memory",
      "phones": [
        {
          "duration": 0.03,
          "phone": "sil"
        },
        {
          "duration": 0.1,
          "phone": "m_B"
        },
        {
          "duration": 0.09,
          "phone": "eh_I"
        },
        {
          "duration": 0.09,
          "phone": "m_I"
        },
        {
          "duration": 0.12,
          "phone": "er_I"
        }
      ],
      "start": 36.24,
      "word": "Memory"
    },
    {
      "duration": 0.42,
      "k_word": "and",
      "phones": [
        {
          "duration": 0.03,
          "phone": "iy_E"
        },
        {
          "duration": 0.23,
          "phone": "ae_B"
        },
        {
          "duration": 0.08,
          "phone": "n_I"
        },
        {
          "duration": 0.07,
          "phone": "d_E"
        },
        {
          "duration": 0.01,
          "phone": "d_B"
        }
      ],
      "start": 36.67,
      "word": "and"
    },

Handle exceptions in status.html

Currently the program just silently fails if there's an exception and doesn't let the user know to retry or file a bug.

Make Me Owner

Could you make an an owner of the lowerquality organization? I need to be an owner to enable Jenkins CI.

Once that's done we can automatically build, test, and distribute Docker images!

How do I insert a picture into the result HTML

Hi, Gentle is such an amazing project! Thank you the team for your great work!

One question: Gentle reuqire audio and text-only input, and the output index.html is text-only.
The epub books or articles my project uses often have picture elements in them, and I have to delete the pictures and their descriptions, so that it fit gentle's requirement.

Is there any way for me to insert pictures in the output index.html? or input text?

If picture is not supported officially, can you please point to me a direction on how should I do this as less painfully as possible?

Bug: Transcription result has non-linear time sequence for transcript words

TL;dr:

The TODO note at https://github.com/lowerquality/gentle/blob/master/gentle/transcription.py#L52 which says # Combine chunks / # TODO: remove overlap? ...or just let the sequence aligner deal with it is correct I think: Leaving it to the sequence aligner to deal with leads to weird results, in some weird cases.

I'm happy to code up a PR to remove the overlap. If you have any thoughts on how best to do that, let me know!

Bug description

I've attached a zip file with audio and transcript: rita_pierson.zip (FYI they're from this TED talk)

There's a section of the transcript which has:

... so we could show everybody else how to do it."
One of the students said, "Really?"
(Laughter)
I said, "Really. We have to show the other classes
how to do it ...

When I run Gentle, the resulting transcribed words have timing glitches. Most noticeably they're not in linearly increasing order! Also, there's an inappropriately long gap between "students" and "said". Here's an exceprt of the data:

start	case	word
211.73	success	the
211.84	success	students
215.74	success	said
216.16	success	Really
None	not-found-in-audio	Laughter
None	not-found-in-audio	I
216	success	said
216.17	success	Really
216.64	success	We
216.86	success	have

Analysis...

This is due to a perfect storm of the transcript having repeated words combined with misidentified words and not-found-in-audio words, all in the overlap region of two chunks, causing the diff_align result to end up putting together a bit of a frankenstein creation. Here's the details:

There are two overlapping chunks:

One chunk spans t=198 to t=218, covering this part of the transcript:
- transcript: ... One of the students said, "Really?" (Laughter) I said, "Really. We have to show the other
- kaldi finds] ... one of the students they pay me you like i said really we have to show the other
Notice that kaldi misidentified said, Really? (Laughter) as they pay me you like.
The other chunk spans t=216 to t=236, covering this part of the transcript:
- transcript: "said, "Really. We have to show the other classes how to do it..."
- kaldi finds: said really we have to show the other classes how to do it...
Here kaldi has identified all the words correctly.

Concatenating the chunk results gives:

... one of the students they pay me you like i said really we have to show the other said really we have to show the other classes how to do it ...

And because "said really" appears twice in the concatenation, diff_align makes it work by using both occurrences:

... one of the students ~~they pay me you like i~~ said really ~~we have to show the other~~ [Laughter i] said really we have to show the other classes now to do it...

Since both occurrences of "said really" were at about t=216, that's what they're both listed as in the final result, which means they end up overlapping temporally. That also explains the timing gap between "students" and "said", since it deleted a few seconds worth of words.

(Also, along the way it dropped the "i" found in the first chunk, and brought it back from the transcript, causing "i" to be listed as out-of-transcript even though it was actually be identified properly.)

What to be done about it?

As per the comment at https://github.com/lowerquality/gentle/blob/master/gentle/transcription.py#L52, it seems likely that the answer is to remove the overlap rather than letting the diff_aligner deal with it.

Removing the overlap without introducing new bugs could be slightly tricky:

The overlapped words don't have the exact same timings
What if the overlapped regions didn't find the exact same words?

I'll put some thought into how to deal with those robustly and try to code up a PR, but as I said above, If you have any suggestions, I'm all ears!

Cheers

Bug report: Sequential out-of-transcript words disappear during realignment

In some cases, the final result ends up not including some words from the transcript, rather than keeping them all and marking them as out-of-transcript.

Tracking it down, this happens e.g. when the first alignment pass ends up with several out-of-transcript words in a row, and then the realignment just finds a single [oov]; the splice that inserts the realignment word-by-word ends up dropping all but the first of the transcript words.

A PR with a fix for this is on its way...

e2e tests fail in Docker container

I was able to get Gentle to build in Docker. My results are in my dockerfile branch. Please check it out and see if it works for you.

There are still some issues:

Model files install to unexpected locations and must be manually moved
The e2e test fails (likely blas/numerical randomness)

Once these things are resolved I'll make a Travis build so our unit tests and Linux binary builds run automatically.

Retaining punctuation

Gentle works great! We have a feature request: to retain punctuation, treating each punctuation mark as a token. From this input:

john, are you hungry?

-- we'd like this sort of output, where the punctuation mark inherits the end time of the preceeding word:

john,john,6.74,6.92
",",",",6.92,6.92
are,are,6.92,7.21
you,you,7.48,7.970000000000001
hungry,hungry,7.97,8.09
"?","?",8.09,8.09

Is this something anyone has looked at? Did you have a reason for omitting punctuation?

Cheers,
David

Retrying (Partial) Alignment Within User-Specified Time Boundaries

I have two separate ideas that both fit into this category.

The first is more basic, wherein, for a specific selection in the text of an extended consecutive string of failed words, the user can specify the beginning and ending time codes to analyze the audio again.

That is to say, give the user a way to limit Gentle's attention for a range of text to only the portion of audio which contains the missed words and retry alignment on only those missed words. I know that strob recently added second pass alignment, so this seems even more in the realm of possibilities now.

The second idea is a scenario like I mentioned in #78 where the user has exported the time codes for all words to an external tool such as Aegisub in order to manually create start and end time codes for all words which Gentle was unable to align, but having done so wishes to feed those word boundaries back to Gentle as input to re-acquire the phoneme timing for as many words as possible.

Matching the timing information in the CSV to a complete transcript word list...

I apologize for asking such a simple question, but I would like to take the timing information from the words which Gentle has matched (as represented in the CSV output file) and align those with a list of every single word in the transcript.

That is to say, I wish to have a similar CSV file where every single word in the transcript is seen in the order that it occurred. (This is not complicated. I can derive this by searching for spaces, hyphens which are directly adjacent to alpha characters, and periods/full stops which are directly followed by an alpha character and replacing them with said character (space, hyphen, or period) plus a line break in a word processor.)

But I wish to have all of Gentle's timing information next to the words which it has matched.
This will directly open up a little bit of basic search and replacing and allow me to paste the entire transcript into a subtitle and caption editor such as Aegisub in order to be able to use its GUI to manually correct any errant timing from Gentle and also create timing information for words which could not be automatically aligned.

The resulting subtitle can be pasted back into Excel (or the spreadsheet of your choice) and presumably mapped back to Gentle's HTML output file.

If you could point me to the simplest way of arriving at this end, I would be very grateful, as this will remove some roadblocks that I've been facing for the better part of a month.

Is there any plan/roadmap to adding numeral pronunciation to Gentle?

In my transcripts there are a significant amount of numbers and dates mentioned.

In case of shorter numbers, I spell them out with their alphabetical representation ("three" as opposed to "3"), however in many cases it makes sense to leave them written as numerals, in terms of how I wish them to eventually be published.

I understand that longer strings of numbers might have a variety of ways of being pronounced (and because of this unconstrained nature it is not a low hanging fruit) but similar to #76 I would like to add the English and phoneme representation of particular instances of what would be written as numerals as multiple alternate pronunciations.

For example, "2007" might be pronounced as:
"two-thousand-seven" T UW2 TH AW1 Z AH0 N D S EH1 V AH0 N
"two thousand and seven" T UW2 TH AW1 Z AH0 N D AH0 N D S EH1 V AH0 N
"twenty o seven" T W EH1 N T IY0 OW S EH1 V AH0 N
etc.

Mapping Phonemes to Letters in the English Spelling of Words

I love that Gentle produces phoneme-specific time codes where possible.

Without taking too much of your time, I would dearly love to tie this timing information back to the actual letters used in the words' English spelling.

I do realize that English spelling is a peculiar and non-phonetic beast with all sorts of silent letters, "e"s on the end of words which modify vowels which occur before an intermediate consonant, etc. but I would very much like to slice the letters in the English spelling into the same number of clusters as there are phonemes for that word and map them to each other as exactly as possible.

I have had a dream of animating text to the cadence of an audio rendition of it for a very long time and Gentle's output is so very close to giving me what I need.

I appeal to you guys since I'm no programmer, as to what the best technical solution to this would be, but essentially my thought is that whereas the CMU Pronouncing Dictionary presents each word and a list of phonemes, that if someone were to go through that dictionary and provide an intermediate where the English spelling letters are grouped by the number of phonemes, then this is all that would be needed to tie those phoneme timings back to the input text.

A couple of examples would be:
CMU gives us: RENDEZVOUS R AA1 N D IH0 V UW2
and, presumably, all we need is: R E N D EZ V OUS

CMU gives us: KNOWLEDGE N AA1 L AH0 JH (or N AA1 L IH0 JH)
and, presumably, all we need is: KN OW L E DGE

From that point onward, the world is your oyster in terms of animating the text to the audio.

You could hide the phonemes under the words and :

simply highlight the part of the word being heard at the time while the entire transcript is visible,
make only the part of the transcript which has already been heard visible (so it is like watching the transcript being spoken into being),
(assuming you also know where syllable breaks in words are) you could produce the 'bouncing ball' animation like those old Disney sing-along videos,
you could display the text scrolling past in a single line like a ticker tape,
and the list goes on.

What are your recommendations for giving Gentle knowledge of syllables (or exposing that information from it, if it already possesses it) or giving Gentle knowledge of how the phoneme list for each word relates/correlates back to the original letters?

How can people like myself who are keen for this functionality contribute toward this goal?

kaldi-models-0.02.zip broken

max@echo ~/p/gentle> wget http://lowerquality.com/gentle/kaldi-models-0.02.zip
--2015-12-18 05:21:44--  http://lowerquality.com/gentle/kaldi-models-0.02.zip
Resolving lowerquality.com... 82.221.106.101
Connecting to lowerquality.com|82.221.106.101|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59662929 (57M) [application/zip]
Saving to: 'kaldi-models-0.02.zip.1'

kaldi-models-0.02.z   0%[                      ]  53.49K  36.6KB/s             
kaldi-models-0.02.z 100%[=====================>]  56.90M   648KB/s   in 91s    

2015-12-18 05:23:19 (638 KB/s) - 'kaldi-models-0.02.zip.1' saved [59662929/59662929]

max@echo ~/p/gentle> unzip kaldi-models-0.02.zip 
Archive:  kaldi-models-0.02.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of kaldi-models-0.02.zip or
        kaldi-models-0.02.zip.zip, and cannot find kaldi-models-0.02.zip.ZIP, period.

i could not execute serve.py , it gives an error in multipass.py file at line 15

i installed this aligner on linux
which at last shows this:

Fetched 5,296 kB in 1min 6s (80.2 kB/s)                                        
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package ffmpeg is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'ffmpeg' has no installation candidate

after that when i execute python serve.py it gives this error:

Traceback (most recent call last):
  File "serve.py", line 23, in <module>
    from gentle import multipass
  File "/home/ucertify/Desktop/gentle/gentle/multipass.py", line 15, in <module>
    with open(vocab_path) as f:
IOError: [Errno 2] No such file or directory: 'PROTO_LANGDIR/graphdir/words.txt'

Please help me...

Smartquotes

Smart quotes should be converted into ASCII quotes for normalization into kaldi's dictionary.

Can't get Gentle to process any files.

Hello, I installing Gentle from source but when I try to align text from http://localhost:8765 and error pops up when I click the align button. The error occurs within the terminal.

Unhandled error in Deferred:

Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(_self.__args, *_self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
task()
File "/usr/local/lib/python2.7/dist-packages/twisted/_threads/_team.py", line 190, in doWork
task()
--- ---
File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 246, in inContext
result = inContext.theWork()
File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 262, in
inContext.theWork = lambda: context.call(ctx, func, _args, *_kw)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, _args, *_kw)
File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
return func(_args,__kw)
File "serve.py", line 123, in transcribe
gen_hclg_filename = language_model.make_bigram_language_model(ks, proto_langdir, *_kwargs)
File "/home/kahless/gentle-master/gentle/language_model.py", line 122, in make_bigram_language_model
raise e
exceptions.OSError: [Errno 2] No such file or directory

Any Idea how I can resolve this issue based on the error message?

Bug Report: Déjà Vu in Auto Transcripts

I don't know if it's a Kaldi setting, bug, or something else, but since I have recently been testing giving Gentle a media file with no transcript, I've noticed that for a single segment of audio, there is often two sections of text which appear to try to describe that single segment on the timeline.

For example, on alignment 6afd2a94 toward the very beginning, we can hear:

Do we need to cite, like, uh retrogaming.tv or whatever?

And Kaldi/Gentle hands back:

do we need to buy like a preacher gaming got t._v. or a richer gaming got t._v. or whatever

In this example, the phrases:
preacher gaming got t._v. and
richer gaming got t._v.
both appear to be alternative attempts to transcribe retrogaming.tv.

I could cite (many) more examples, but perhaps the above is expected behavior?

It's just a bit jarring when watching Gentle's interactive transcript.

Also, I would think that Gentle would have trouble trying to align two consecutive alternate transcriptions of a single passage of audio to that single section of time.

Increasing accuracy on Local Build

It is reasonable that on my local machine (4GB RAM) the accuracy of alignment is somewhat jittered. There's an offset of 0.02 - 0.04 seconds between the gentle server and my local build.
Compare the CSV generated on gentle server with the CSV generated on my local build.

An example with 20 - 160 ms offset.

Gentle : because 37.74 37.88
Local : because 37.72 37.88

Another example with an offset of 2 seconds

Gentle : they're 22.26 22.46
Local : they're 20.66 20.88

I am sorry, for the naive requests that follow, I just started exploring Kaldi as a tool. I have no prior experience with ASR systems.

What can I do to increase the accuracy on my local build?

Now, I need these timestamps to do a research project. Specifically, I need to segment the audio on the basis of word boundaries. And gentle was the best available tool, from a developer's perspective. As I am not even a beginner in ASR and other such tools.

I believe that if I hire an amazon instance, this will not be a problem. But they are quite expensive.
Also, can anyone direct me if there is any other language model that might work better for English?
Meanwhile I will dive into the code, to understand it better.

Thanks

When should I use "conservative" and "Include disfluencies"?

The title pretty much says it all. Under what circumstances should I use these options.

Tests Broken

The tests fail for me using the most recent build. I think the golden master needs to be updated.

error on transcribe without graph/ dir

Attempting to transcribe (not align) when there is no graph/ model directory raises an exception rather than failing gracefully. People with the standard PROTO_LANGDIR configuration will see an error here.

Using existing inexact timestamps

A quick test to see how Gentle handles gaps in transcripts reveals the following.

Intact examples/data/lucier.txt showing end of the first sentence and start of the third:

now,now,13.68,14.14
...
What,what,56.42,56.660000000000004
you,you,56.660000000000004,56.800000000000004
will,will,56.800000000000004,56.970000000000006
hear,hear,56.970000000000006,57.32000000000001
then,then,57.32000000000001,57.86
the,the,59.63,60.36
natural,natural,60.7,61.660000000000004
resonant,resonant,62.17,63.38
frequencies,frequencies,63.38,64.10000000000001

Second sentence removed:

now,now,13.68,14.12
What,what,15.33,15.42
you,you,17.26,17.330000000000002
will,will,22.19,22.28
hear,hear,27.26,27.37
then,then,27.42,27.82
are,are,62.19,62.519999999999996
the,the,62.55,62.61
natural,natural,62.67,62.86
resonant,resonant,62.86,63.38
frequencies,frequencies,63.38,64.10000000000001

The alignment gradually recovers from the 40-second gap until the tenth word is perfect. Nine badly or imperfectly aligned words is the cost of the transition.

Intact lucier.txt showing end of the first and start of the fourth sentence:

now,now,13.68,14.12
...
I,i,73.06,73.27
regard,regard,73.27,74.02
this,this,75.93,76.21000000000001
activity,activity,76.21,77.1

Second and third sentences removed:

now,now,13.68,14.14
I,i,73.06,73.27
regard,regard,73.27,74.02
this,this,75.93,76.21000000000001
activity,activity,76.21,77.1

Perfect recovery.

So this is impressive and reassuring, yet there is room for improvement. With long gaps and a poor transcript, mistakes will be common.

We have transcripts with timestamps, but they are inexact -- late by 5 to 10 seconds. Could you point us to a way we can feed this information to gentle to help it handle gaps more robustly?

For any given word, there will be a temporal range of, say, twenty seconds; the search for a match should be limited to this range. The input file should be similar to the current align.csv output --

what,46,65
you,46,66
will,47,67

-- that is to say, each word is given a range within which the search should be performed. Maybe some of the logic for this is already present in the second pass?

Cheers,
David

Skip progressive alignment on async=false

When the transcription is run with async=false, the transcription should run without in-progress alignment previews. Currently it runs the full alignment every 20s which is inefficient.

Using Eesen as a base?

Hi there,
Is there a plan to using Eesen as the speech recognizer instead of Kaldi?
I would love to get rid of phonetics, and train a pure DNN models instead of the hybrid ones from Kaldi.
By the way, the software you wrote here is fantastic! I can't wait to test it out thoroughly with my own models.
Thanks!

Word skips (drops)

I've noticed that aligner drops some words from transcript, actually they are in audio. Its not big amount less then 0.5% (for example 7 words from 1500), is something can be done about it, maybe some approximation if user is 100% sure that words are in audio?

DMG crashes on boot

gentle0.03.dmg does not start; may be related to spurious homebrew packages on my system.

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   org.python.python               0x000000010324d060 PyTuple_New + 112
1   org.python.python               0x000000010325049e PyType_Ready + 199
2   org.python.python               0x0000000103250460 PyType_Ready + 137
3   org.python.python               0x000000010323d19a _Py_ReadyTypes + 16
4   org.python.python               0x00000001032a6475 Py_InitializeEx + 395
5   org.pythonmac.unspecified.gentle    0x00000001000024ac 0x100000000 + 9388
6   org.pythonmac.unspecified.gentle    0x000000010000117a main + 650
7   org.pythonmac.unspecified.gentle    0x0000000100000be4 start + 52

[safari] seeking fails after audio completion

One astute tester wrote:

One thing I noticed on the webpage was if the audio plays all the way through (it gets to the end and then stops), then you can no longer click on words to start playing from that point -- it will always start playing from the beginning.

Not sure why this was happening for them.

Audio element race condition

The <audio> element may be initialized before the wav file has finished encoding, leading to inconsistent behavior