
jiwer's Introduction

JiWER

JiWER is a simple and fast Python package for evaluating automatic speech recognition systems. It supports the following measures:

  1. word error rate (WER)
  2. match error rate (MER)
  3. word information lost (WIL)
  4. word information preserved (WIP)
  5. character error rate (CER)

These measures are computed using the minimum edit distance between one or more reference and hypothesis sentences. The minimum edit distance is calculated with RapidFuzz, which uses C++ under the hood and is therefore faster than a pure Python implementation.
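For reference, the usual literature definitions (the standard formulation, not necessarily jiwer's exact code; see the reference section below), with H hits, S substitutions, D deletions, and I insertions against N = H + S + D reference words:

WER = (S + D + I) / N
MER = (S + D + I) / (H + S + D + I)
WIP = (H / N) * (H / N_h), where N_h is the number of hypothesis words
WIL = 1 - WIP

CER is the same computation over characters instead of words.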

Documentation

For further info, see the documentation at jitsi.github.io/jiwer.

Installation

You should be able to install this package using poetry:

$ poetry add jiwer

Or, if you prefer old-fashioned pip and you're using Python >= 3.7:

$ pip install jiwer

Usage

The simplest use case is computing the word error rate between two strings:

from jiwer import wer

reference = "hello world"
hypothesis = "hello duck"

error = wer(reference, hypothesis)
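# error == 0.5: one substitution ("duck" for "world") out of two reference words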

Licence

The jiwer package is released under the Apache License, Version 2.0, by 8x8.

For further information, see LICENCE.

Reference

For a comparison between WER, MER and WIL, see:
Morris, A., Maier, V., & Green, P. (2004). From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.

jiwer's People

Contributors

bartvanandel, bgrozev, f4hy, fortin-alex, mpariente, nikvaessen, nmstoker


jiwer's Issues

Update Levenshtein dependency to maintained version

Currently, jiwer depends on python-Levenshtein, which is outdated and no longer maintained. This leads to compilation issues: for instance, I cannot compile it for Python 3.7 on either my Windows or Ubuntu machine without first installing or updating build tools. The newer, maintained Levenshtein package provides prebuilt wheels, which saves a lot of headaches.

Usage is otherwise identical, so no changes are needed beyond updating the dependency.

Regarding visualize_alignment() function.

The visualize_alignment() function takes two parameters: the output of process_words() and show_measures. How can I proceed if I want to show only those sentences with an error greater than zero?

I know I can loop through, but adding this as an argument would be good.

ReplaceWords transform is broken

There are a few issues with the ReplaceWords transform. Most importantly, due to the way it's implemented, it actually alters any matching substring:

ReplaceWords({'foo': 'bar'})('foobar') returns barbar

This isn't necessarily a problem (it could even be useful), but it is not consistent with the function name. I see two possible fixes:

  • Change the internals to re.sub(r'\b{}\b'.format(re.escape(key)), value, s), where \b is the word-boundary marker (a sketch follows after this list).
  • Rename the function to ReplaceRegexes (or keep both). The example should reflect the current behavior. Regex replaces can be nice: you can for instance use ReplaceWords({r'\b(foo|bar)baz\b': r'\1'}) to remove 'baz' from 'foobaz' and 'barbaz' but keep it in all other cases (artificial example, I know).

The above indicates that this function (or these functions, if you go for the second option) could use a unit test, which is currently missing.

Finally, the transform is not included in transforms.__all__, and as a result it isn't directly available on jiwer, but only as part of jiwer.transforms.
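
For illustration, a minimal standalone sketch of the word-boundary fix from the first bullet (not jiwer's actual implementation):

import re

def replace_words(mapping, s):
    # Replace only whole words by anchoring each key at word boundaries.
    for key, value in mapping.items():
        s = re.sub(r"\b{}\b".format(re.escape(key)), value, s)
    return s

print(replace_words({"foo": "bar"}, "foo foobar"))  # "bar foobar", not "bar barbar"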

Major performance regression in 2.5.0 for jiwer.transforms.RemovePunctuation

There is a major performance regression in jiwer.transforms.RemovePunctuation in 2.5.0:
%timeit jiwer.transforms.RemovePunctuation() gives 228 ms ± 4.34 ms per loop on 2.5.0,
but
%timeit jiwer.transforms.RemovePunctuation() gives 1.35 µs ± 8.61 ns per loop on 2.4.0.
This is roughly 168,888 times slower.

I had a pipeline where I was computing the WER both with and without removing punctuation, and I was recreating the transform list each time. This can be solved by building the transforms up front and caching them, but I was very surprised to find this to be the cause of my regression.
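
A workaround sketch (assuming, per the timings above, that the constructor is the slow step): build the transform once outside the scoring loop and reuse it.

import jiwer

# Construct the transform once; creating RemovePunctuation is the
# expensive step in 2.5.0.
transform = jiwer.Compose([
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

pairs = [("hello, world!", "hello duck")]  # placeholder data
for reference, hypothesis in pairs:
    score = jiwer.wer(
        reference,
        hypothesis,
        truth_transform=transform,
        hypothesis_transform=transform,
    )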

Update rapidfuzz version

Is it possible that you update the rapidfuzz version used here? Using rapidfuzz and jiwer in the same project is difficult at the moment due to rapidfuzz being at 3.x while jiwer pins 2.x without any leniency.

WER score bigger than 1.0

Literature states that WER is bounded between 0 and 1.

wer(truth='fries', hypothesis='no fry')

gives

> 2.0

Is this expected behaviour?
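
For context, a worked check against the standard formula (not jiwer internals): the reference "fries" has N = 1 word, and aligning it with "no fry" takes one substitution ("fries" -> "fry") plus one insertion ("no"), so WER = (S + D + I) / N = (1 + 0 + 1) / 1 = 2.0. Insertions count in the numerator but not in N, which is why WER is not bounded above by 1.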

This code is 100x faster:

import Levenshtein as Lev

def cer(s1, s2):
    """
    Computes the character-level edit distance (note: despite the name,
    this returns a raw count, not a rate).

    Arguments:
        s1 (string): space-separated sentence
        s2 (string): space-separated sentence
    """
    # Drop spaces so that only characters are compared.
    s1, s2 = s1.replace(' ', ''), s2.replace(' ', '')
    return Lev.distance(s1, s2)

def wer(s1, s2):
    """
    Computes the word-level edit distance between the two provided
    sentences after tokenizing to words (again, a raw count, not a rate).

    Arguments:
        s1 (string): space-separated sentence
        s2 (string): space-separated sentence
    """
    # Build a mapping of words to integers.
    b = set(s1.split() + s2.split())
    word2char = dict(zip(b, range(len(b))))

    # Map the words to a char array (the Levenshtein package only
    # accepts strings).
    w1 = [chr(word2char[w]) for w in s1.split()]
    w2 = [chr(word2char[w]) for w in s2.split()]

    return Lev.distance(''.join(w1), ''.join(w2))
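
For example, with the snippet above, wer("hello world", "hello duck") returns 1 (one substitution); dividing by the reference word count (2) would turn it into the 0.5 rate that jiwer reports.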

jiwer gives an error when passed a very long list of strings

Issue

When passing a very long list of strings (>350k strings) as the reference and hypothesis, jiwer gives the following error:

chr() arg not in range(0x110000)

What's been tried:

  • Calculating wer on individual list elements - this works successfully with no error
  • Splitting the large lists into smaller chunks - this works successfully with no error
  • Passing the entire list to another library such as fastwer - this works successfully with no error

The error only seems to happen when the entire long list is passed into jiwer.

Additional Context

It seems like the vocabulary in the _word2char function isn't built properly: after adding words from the first N sentences in the list, words from the remaining sentences do not appear to be part of the vocabulary. This results in the chr() arg not in range error when these lines are executed.
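
For illustration, a standalone sketch of the word-to-codepoint trick involved (a hypothetical helper, not jiwer's exact code): each unique word is mapped to one Unicode character so that a character-level edit distance becomes a word-level one, and chr() only accepts values below 0x110000.

def word2char(reference_words, hypothesis_words):
    # Map every unique word to a single Unicode character.
    vocab = {w: i for i, w in enumerate(set(reference_words + hypothesis_words))}
    # chr(i) raises "ValueError: chr() arg not in range(0x110000)"
    # once i >= 0x110000, i.e. with more than ~1.1 million unique words
    # (or whenever an index escapes the valid range).
    ref = "".join(chr(vocab[w]) for w in reference_words)
    hyp = "".join(chr(vocab[w]) for w in hypothesis_words)
    return ref, hyp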

Jiwer version - v3.0.3

Still running into MemoryError? Huggingface

Hello guys, I found several posts about WER running out of memory, and most of them suggest it is fixed.
I'm using the latest jiwer version in combination with Hugging Face, and I still run out of memory.

import numpy as np
from datasets import load_metric

wer_metric = load_metric("wer")

def compute_metrics(processor):
    def __call__(pred):
        pred_logits = pred.predictions
        pred_ids = np.argmax(pred_logits, axis=-1)

        # Replace ignored label ids (-100) with the pad token id before decoding.
        pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

        pred_str = processor.batch_decode(pred_ids)
        # we do not want to group tokens when computing the metrics
        label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

        wer = wer_metric.compute(predictions=pred_str, references=label_str)

        return {"wer": wer}

    return __call__
...

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
    train_seq_lengths=train_dataset.input_seq_lengths,
    compute_metrics=compute_metrics(processor),
)
***** Running Evaluation *****
  Num examples = 15588
  Batch size = 4
100%|███████████████████████████████████████| 3897/3897 [21:27<00:00,  3.07it/s]
Traceback (most recent call last):
  File "/tmp/pycharm_project_263/audioengine/model/finetuning/wav2vec2/finetune_parquet.py", line 151, in <module>
  File "/tmp/pycharm_project_263/audioengine/model/finetuning/wav2vec2/finetune_parquet.py", line 128, in main
    max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1757, in evaluate
    output = self.prediction_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1930, in prediction_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
  File "/tmp/pycharm_project_263/audioengine/model/finetuning/wav2vec2/wav2vec2_trainer.py", line 191, in __call__
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
  File "/usr/local/lib/python3.8/dist-packages/datasets/metric.py", line 403, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/warmachine/.cache/huggingface/modules/datasets_modules/metrics/wer/73b2d32b723b7fb8f204d785c00980ae4d937f12a65466f8fdf78706e2951281/wer.py", line 94, in _compute
    return wer(references, predictions)
  File "/usr/local/lib/python3.8/dist-packages/jiwer/measures.py", line 80, in wer
    measures = compute_measures(
  File "/usr/local/lib/python3.8/dist-packages/jiwer/measures.py", line 192, in compute_measures
    H, S, D, I = _get_operation_counts(truth, hypothesis)
  File "/usr/local/lib/python3.8/dist-packages/jiwer/measures.py", line 273, in _get_operation_counts
    editops = Levenshtein.editops(source_string, destination_string)
MemoryError
100%|███████████████████████████████████████| 3897/3897 [21:40<00:00,  3.00it/s]

Process finished with exit code 1

jiwer.visualize_measures doesn't work as in the docs

When I try to run

import jiwer

out = jiwer.compute_measures(
    ["short one here", "quite a bit of longer sentence"],
    ["shoe order one", "quite bit of an even longest sentence here"],
)

print(jiwer.visualize_measures(out))

I get AttributeError: module 'jiwer' has no attribute 'visualize_measures'

It seems the function is not implemented in jiwer 3.0.0, or perhaps it exists under another name?
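
A sketch of what appears to be the jiwer 3.x replacement (assuming the process_words()/visualize_alignment() pair from the 3.x docs):

import jiwer

out = jiwer.process_words(
    ["short one here", "quite a bit of longer sentence"],
    ["shoe order one", "quite bit of an even longest sentence here"],
)
# visualize_alignment() renders per-sentence alignments and measures.
print(jiwer.visualize_alignment(out))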

pip installation is failing

Collecting jiwer
  Downloading https://files.pythonhosted.org/packages/c7/fd/88639901195f2625941efdf2a1496c540b33901499a986fb271af28e4436/jiwer-1.3.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/sausharma/temp/pip-build-sMieMl/jiwer/setup.py", line 6, in <module>
        with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    TypeError: 'encoding' is an invalid keyword argument for this function

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /home/sausharma/temp/pip-build-sMieMl/jiwer/
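
For context (my reading of the traceback, not a confirmed diagnosis): Python 2's built-in open() does not accept an encoding keyword argument, so this is what installing under a Python 2 interpreter looks like; installing with Python 3 should avoid it.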

Version 3.0.0 can produce wrong results

Hi,

when I use jiwer version 3.0.0 with the following commands:

wer_transform = jiwer.Compose(
    [
        jiwer.ToLowerCase(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
        jiwer.ReduceToListOfListOfWords(),
    ]
)

wer = jiwer.wer(
    truth='This is a short Sentence with a few Words with upper and Lower cases',
    hypothesis='His is a short Sentence with a few Words with upper and Lower cases',
    truth_transform=wer_transform,
    hypothesis_transform=wer_transform,
)

I get wer=0.0714, which is the expected result based on the WER formula.

But when I use the non-deprecated keywords 'reference' and 'reference_transform' instead of 'truth' and 'truth_transform':

wer = jiwer.wer(
    reference='This is a short Sentence with a few Words with upper and Lower cases',
    hypothesis='His is a short Sentence with a few Words with upper and Lower cases',
    reference_transform=wer_transform,
    hypothesis_transform=wer_transform,
)

I get wer=0.2857, which is clearly wrong.
This does not happen if the jiwer.ToLowerCase() transform is removed from the transformation, so I suspect the bug is related to it.

Cheers,
Paul

jiwer.wer(outputs_true, outputs_pred, standardize=True)

Traceback (most recent call last):
  File "main.py", line 222, in
    main(0, args)
  File "main.py", line 121, in main
    model.fit(dataset_train,
  File "EfficientConformer-master/models/model.py", line 344, in fit
    raise e
  File "EfficientConformer-master/models/model.py", line 303, in fit
    wer, truths, preds, val_loss = self.evaluate(dataset, val_steps, verbose_val, eval_loss=True)
  File "EfficientConformer-master/models/model.py", line 425, in evaluate
    batch_wer = jiwer.wer(outputs_true, outputs_pred, standardize=True)
TypeError: wer() got an unexpected keyword argument 'standardize'

Avoid error when a string in the truth is empty after transformation

A suggestion: to avoid an error when transforming the truth yields an empty string, I added the following to _preprocess: pop the empty-string item from both the transformed truth and the transformed hypothesis. I found this useful for my purposes; perhaps others may too.

# Apply transforms. The transforms should collapse the input to a
# list of lists of words.
transformed_truth = truth_transform(truth)
transformed_hypothesis = hypothesis_transform(hypothesis)

# Drop pairs whose transformed truth is empty, keeping both lists aligned.
# (Popping by index while iterating skips elements, so filter instead.)
kept = [
    (t, h)
    for t, h in zip(transformed_truth, transformed_hypothesis)
    if len(t) > 0
]
transformed_truth = [t for t, _ in kept]
transformed_hypothesis = [h for _, h in kept]

# raise an error if the ground truth is empty or the output
# is not a list of lists of strings

WIL and MER

In addition to WER, WIL (word information lost) and MER (match error rate) can be good performance measures for ASR. I could not find any PyPI packages that support these measures, and I think they might be a good fit for jiwer.

See the Morris et al. (2004) paper cited in the reference section above.

I'd be happy to open a pull request with these measures added, if you think they could fit this package.

Can somebody explain to me what the `_preprocess()` function is doing?

I ran the standard example and walked through it in the debugger.

from jiwer import wer

ground_truth = "hello world"
hypothesis = "hello duck"

error = wer(ground_truth, hypothesis)

While stepping through, I saw that it uses the _preprocess() function to convert the words/tokens into integer representations.

"hello world" is converted to '\x00\x02' and "hello duck" is converted to '\x00\x01'. The Levenshtein distance is then calculated on these strings rather than on the original words. I am not sure how that maps to the original definition of word error rate.

Can somebody explain to me what is happening?
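
For illustration, a standalone sketch of the general trick (not jiwer's exact internals): mapping each distinct word to a single character preserves the word-level edit distance, because every character edit on the mapped strings corresponds to exactly one word edit.

import Levenshtein

ref, hyp = "hello world".split(), "hello duck".split()
# One unique character per unique word.
vocab = {w: chr(i) for i, w in enumerate(set(ref + hyp))}
distance = Levenshtein.distance("".join(vocab[w] for w in ref),
                                "".join(vocab[w] for w in hyp))
print(distance / len(ref))  # 0.5, the word error rate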

Apparent WER bug?

Love the package, thanks for all the hard work!

I came across what I think is a WER calculation bug:

import jiwer

jiwer.process_words('the cat went to the store', 'the car went to store front').wer
# 0.5

jiwer.process_words('the cat went to the store', 'the car went to green store').wer
# 0.3333333333333333

Given that WER = (S + D + I) / N, shouldn't the second example also have a WER of 0.5? I.e. one substitution (cat -> car), one deletion (the -> None), and one insertion (None -> green).
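
A possible resolution (my own reading of the minimum-edit alignment, not a confirmed answer): the cheapest alignment of the second pair uses two substitutions (cat -> car and the -> green) rather than one deletion plus one insertion, giving 2 / 6 = 0.3333. Since the measures are defined over the minimum-cost alignment, that lower value is the one reported.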

standardize parameter broken

MWE:

from jiwer import wer

e1 = wer("he's my neminis", "he is my <unk> [laughter]", standardize=True)
e2 = wer("he is my neminis", "he is my")

assert e1 == e2

According to the documentation, these should be equivalent, but they are not.

Version 1.3 from PyPI

Is it possible just to get the number of errors?

I know I could probably just get the WER and multiply by the number of words to get the number of errors, but I was hoping that would be unnecessary.

Edit: The reason I am asking is this: I want to roll up all of my sentence-level WERs into an overall document WER.
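
A sketch of reading the raw counts off directly (assuming the jiwer 3.x process_words() output, whose count fields I am inferring from its docs):

import jiwer

out = jiwer.process_words("hello world", "hello duck")
errors = out.substitutions + out.deletions + out.insertions
print(errors)  # 1

# Rolling up into a document-level WER is then
# total errors / total reference words across all sentences.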

Feature Request: Allow for equivalent/alternative words

When calculating WER, MER, and WIL it would be nice to allow for equivalent or alternative words. For example:

Hi, This is Ann.
Hi, This is Anne.

The cost is 21 dollars
The cost is $21.00 dollars
The cost is $21
The cost is twenty one dollars

I'm going to go for a walk
I'm gonna go for a walk

could all be equivalent

One way to implement this could be to allow a [<alternatives>] syntax in the ground truth, for example:

Hi this is [Anne|Ann] I'm [going to|gonna] go for a walk
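
A sketch of one possible implementation (a hypothetical wer_with_alternatives helper, not part of jiwer): expand every bracketed group in the ground truth and score the hypothesis against the best expansion.

import itertools
import re

import jiwer

def wer_with_alternatives(reference, hypothesis):
    # Split on "[a|b|...]" groups; odd-indexed parts hold the alternatives.
    parts = re.split(r"\[([^\]]+)\]", reference)
    options = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    candidates = ("".join(choice) for choice in itertools.product(*options))
    return min(jiwer.wer(c, hypothesis) for c in candidates)

print(wer_with_alternatives(
    "Hi this is [Anne|Ann] I'm [going to|gonna] go for a walk",
    "Hi this is Ann I'm gonna go for a walk",
))  # 0.0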

RemoveKaldiNonWords transformation not working as expected

Hello, when trying to use the RemoveKaldiNonWords transformation, I get different results when comparing these text pairs:

  • <unk> xx to xx -> 0.5 WER (0.33 when not using SentencesToListOfWords)
  • <unk>xx to xx -> 0.0 WER

I'd expect both to be zero when using RemoveKaldiNonWords. Is this an actual bug, or am I misunderstanding the usage? I've tried different orderings of the transformations in the code below; the results were always the same.

transformation = jiwer.Compose([
    jiwer.RemoveKaldiNonWords(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.RemoveWhiteSpace(replace_by_space=True),
    jiwer.SentencesToListOfWords(word_delimiter=' '),
])
wer = jiwer.wer(
    truth,
    hypothesis,
    truth_transform=transformation, 
    hypothesis_transform=transformation,
)

RemoveSpecificWords is not functioning as expected

Hi! As the title says, the RemoveSpecificWords transform does not work as I would expect it to. As an example, the following code

text = "he asked a helpful question"
    stop_words = ['a', 'he']
    print(jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
        jiwer.SentencesToListOfWords(),
        jiwer.RemoveEmptyStrings(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveSpecificWords(stop_words),
    ])(text))

returns

['', 'sked', '', 'lpful', 'question']

Is there a way to get this transform to recognize word boundaries? Thank you!

SentencesToListOfWords is removed after 2.2.0

The change appears to have been introduced in 187575c.
The README suggests using SentencesToListOfWords, but the code suggests ReduceToListOfListOfWords. It's unclear whether these are synonymous; they feel slightly different.

This feels like a breaking API change in a minor version update; our team has scripts depending on the prior API.
Could SentencesToListOfWords be re-introduced as a thin wrapper over ReduceToListOfListOfWords? (A sketch of such a shim follows below.)

Otherwise, the README should be updated with corrected examples.
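
A sketch of such a shim (a hypothetical helper; it assumes the old transform returned a flat word list, while ReduceToListOfListOfWords returns one list per sentence):

import jiwer

def sentences_to_list_of_words(sentences, word_delimiter=" "):
    # Flatten the per-sentence word lists back into the single flat
    # list that the old SentencesToListOfWords transform produced.
    nested = jiwer.ReduceToListOfListOfWords(word_delimiter)(sentences)
    return [word for sentence in nested for word in sentence]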

Batch vs Individual results are not same

I have tried calculating WER and CER with jiwer in two ways: in one case I run the jiwer.wer and jiwer.cer functions individually on single (target, prediction) pairs and average the results at the end; in the other, I pass the whole list of targets and predictions to jiwer.wer and jiwer.cer. With the same set of target and prediction pairs I get different WER and CER in the two cases, which ideally shouldn't happen.

Any idea why this is happening?
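
For context, a minimal example of why the two can legitimately disagree (my own illustration, not jiwer internals): averaging per-sentence WERs weights every sentence equally, while the batch call pools edits over all reference words.

import jiwer

refs = ["a", "b c d e"]
hyps = ["b", "b c d e"]

# Unweighted mean of per-sentence WERs: (1.0 + 0.0) / 2 = 0.5.
mean_wer = sum(jiwer.wer(r, h) for r, h in zip(refs, hyps)) / len(refs)
# Pooled over all 5 reference words: 1 error / 5 words = 0.2.
batch_wer = jiwer.wer(refs, hyps)
print(mean_wer, batch_wer)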

Permutations of the sentences give different results

Thanks for open-sourcing this library.

By default, the sentences are concatenated together using jiwer.SentencesToListOfWords(), so different permutations of the (ground_truth, hypothesis) pairs give different results.

ground_truth = ["I", "am I good?"]
hypothesis = ["I am", "I good?"]
print(wer(ground_truth, hypothesis))
# prints 0.0


ground_truth = ["am I good?", "I"]
hypothesis = ["I good?", "I am"]
print(wer(ground_truth, hypothesis))
# prints 0.5

So it seems that this formulation is not suitable for computing WER on a test set of sentences. Am I missing something?

Doesn't work on lists

Hi,

Applying jiwer.wer to reference ['a a'] and hypothesis ['a', 'a'] gives a WER of 1, whereas it should be 0?
