
Comments (7)

DimitriPapadopoulos commented on June 28, 2024

Exposing a function to check words or lines, one that is actually used by codespell itself, might affect performance. On the other hand:

  • the performance hit is just speculation without actual tests to measure performance,
  • exposing functions would probably reduce the complexity of parse_file(), a long-overdue target (see #3276).

Do you agree refactoring is a necessary step to expose functions that could be added to a public API later on? If so, could you have a look at #3276, see if it's a step in your direction, and create a new pull request from it? A test to measure performance would be a great addition. I am not the maintainer, but I think these steps might convince @larsoner.

I think you would need to refactor near the code that starts line processing:

for i, line in enumerate(lines):

and near the code that splits lines into words:
check_matches = extract_words_iter(line, word_regex, ignore_word_regex)

Alternatively, the lightweight API to access dictionaries indeed sounds like a good option.
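To make the discussion concrete, here is a minimal sketch of what a lightweight dictionary-access API might look like. The names (`Dictionary`, `check_word`) are hypothetical and do not exist in codespell today; the real API shape would fall out of the #3276 refactoring.

```python
# Hypothetical sketch of a per-word public API; the class and method
# names here are illustrative, not part of codespell's current code.
from __future__ import annotations

from typing import Optional


class Dictionary:
    """A typo -> candidate-corrections mapping (sketch only)."""

    def __init__(self, entries: dict[str, list[str]]) -> None:
        self._entries = entries

    def check_word(self, word: str) -> Optional[list[str]]:
        """Return candidate corrections for a misspelled word, else None."""
        return self._entries.get(word.lower())


# Usage sketch with a toy dictionary:
d = Dictionary({"teh": ["the"], "recieve": ["receive"]})
assert d.check_word("Teh") == ["the"]
assert d.check_word("the") is None
```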

from codespell.

nthykier commented on June 28, 2024

Thanks for the feedback and the suggestions.

I could be convinced to do a PR for this change, but I would prefer to have alignment up front on the expectations, so I can determine whether it is within my capacity and matches the amount of time I am willing to invest. I hope we can meet in the middle, obviously, but I would rather know upfront so there is no disappointment on either side down the line. For example, if there is going to be a performance test: what are the expectations on before-vs-after performance, and what kind of corpus would you consider a valid performance test (a micro-benchmark, real-life use cases, or both; which versions of Python count; etc.)?

For clarity, I read phrases like "would be nice" or "a great addition" as "optional and would not prevent merge if omitted".

I guess that means I am waiting for @larsoner to respond on the expectations. :)


larsoner commented on June 28, 2024

I think as long as the changes are reviewable and there isn't much performance hit we're okay. Something like a 10% performance hit would be kind of a bummer. But I would also be surprised if there were one based on briefly thinking about the code.

One approach would be to make a quick attempt at this with the necessary refactoring and some new Python API, and some basic tests in a new codespell_lib/tests/test_api.py or similar. Hopefully that's not too much work based on what @DimitriPapadopoulos outlined above? And then we can check to make sure performance isn't too bad, then iterate on the API as proposed in the PR.
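A basic test along those lines might look like the sketch below. The entry point `check_line` is a stand-in for whatever public API the refactoring produces, with a toy dictionary substituted for the real one.

```python
# Sketch of codespell_lib/tests/test_api.py, assuming a hypothetical
# public entry point `check_line(line) -> list of (word, candidates)`.
def check_line(line):
    # Stand-in for the future API; a trivial dictionary for illustration.
    dictionary = {"teh": ["the"], "abandonned": ["abandoned"]}
    hits = []
    for word in line.split():
        candidates = dictionary.get(word.lower())
        if candidates is not None:
            hits.append((word, candidates))
    return hits


def test_check_line_flags_known_typos():
    assert check_line("teh quick fox") == [("teh", ["the"])]


def test_check_line_clean_input():
    assert check_line("the quick fox") == []
```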


DimitriPapadopoulos commented on June 28, 2024

@larsoner I wonder whether this will eventually require some file/directory renaming. All files currently live under codespell_lib. However, Python projects usually put the code under codespell or src/codespell. I wonder why this is not the case here, and whether we should change that.


larsoner commented on June 28, 2024

Can't remember why it was codespell_lib. It could be a reasonable time to mkdir src && git mv codespell_lib src/codespell and adjust pyproject.toml. I think to change this we'd probably want the release to be 3.0, but that would be okay.

It can/should be done in a separate PR though
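For reference, the pyproject.toml adjustment for an src layout might look roughly like the fragment below. This is a hypothetical sketch assuming a setuptools backend; the actual keys and the entry-point target would have to match the project's existing configuration.

```toml
# Hypothetical fragment after `git mv codespell_lib src/codespell`;
# verify against the project's real build-backend configuration.
[tool.setuptools.packages.find]
where = ["src"]
```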


nthykier commented on June 28, 2024

Ok, thanks for the input so far. Up to a 10% performance hit is quite a bit of leeway indeed, and like @larsoner, I doubt we will hit it. Nevertheless, I will establish a baseline that I will use.

I did not see any references to how we decide the baseline. For now, I will be using https://sherlock-holm.es/stories/plain-text/cano.txt as the base text. Since what we are concerned with is the cost of refactoring toward a spellcheck_word function that is called per word (consider the name a description rather than a target at this point), I did not think it relevant to look at different file types. Let me know if you have reason to believe otherwise.

I am describing my method below. Ideally, none of this would be new to you if you have prior experience with performance measurements of this kind.

The setup:

mkdir performance-corpus/
cd performance-corpus
wget https://sherlock-holm.es/stories/plain-text/cano.txt
# Create some extra copies (a single file takes less than 0.5 second; adding a bit more makes it easier to test)
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

# Setup `_version.py`, so we can run directly from git checkout.
echo '__version__ = "do not crash, please"' > codespell_lib/_version.py

The resulting corpus is 79329 bytes. The wc command line tool reports that each file has 76764 lines (including many empty ones) and 657438 words (by wc's definition of a word). The codespell code flags about 175 words as possible typos in this text (that is, per file).

The python version I used for these numbers:

$ python3 --version
Python 3.11.9

Running the test:

# Test, repeated at least 3 times; fastest result is chosen.
# Slower times are usually caused by unrelated jitter - at least on
# laptops/desktops that also do other work.
time PYTHONPATH=. python3 -m codespell_lib performance-corpus/ >/dev/null

# One run with `-m profile` to see where the bottlenecks are.
# Takes considerably longer (x10-x20), so it is not done on the full corpus
PYTHONPATH=. python3 -m profile -o baseline.pstat  -m codespell_lib performance-corpus/cano.txt
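The repeat-and-take-fastest procedure can also be automated with a small harness; the command, repeat count, and paths below are illustrative.

```python
# Minimal sketch: run a command several times and keep the fastest
# wall-clock time, mirroring the manual "repeat 3 times" procedure.
import subprocess
import sys
import time


def fastest_run(cmd, repeats=3):
    """Return the best wall-clock duration (seconds) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
        best = min(best, time.perf_counter() - start)
    return best


# Usage (paths are illustrative):
# best = fastest_run([sys.executable, "-m", "codespell_lib", "performance-corpus/"])
# print(f"fastest of 3: {best:.3f}s")
```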

Baseline results

On my hardware, the performance test on the baseline is recorded as 5.6 seconds (0m5.600s).

The bottleneck profiling has the following key numbers as far as I can tell:

  • Total time 3.030s as baseline (2196160 function calls (2195259 primitive calls) in 3.030 seconds)
  • The code spent 0.503s (~1/6) in build_dict (split into 2 calls)
  • The code spent 2.472s (~5/6) in parse_file in one call [only one file was examined].
  • About half of parse_file's runtime is spent in parse_file itself (1.178/2.472). The other half is then assumed to come from its callees.
  • The time spent in "".lower() is 0.433s (~1/6) and another ~1/6 is spent in regex operations (group() + search() + finditer()).

These numbers were extracted by an interactive session with the following commands (as examples):

import pstats
s = pstats.Stats("baseline.pstat")
s.sort_stats(2).print_stats(20)
s.print_callees('parse_file')
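A before/after comparison can also be scripted with pstats; the filenames below are illustrative, and the total profiled time is exposed as the Stats attribute total_tt.

```python
# Sketch: compare total profiled time between two pstats dumps.
import pstats


def total_time(path):
    """Total profiled time (seconds) recorded in a pstats dump."""
    return pstats.Stats(path).total_tt


# Usage (after profiling both baseline and refactored runs):
# base = total_time("baseline.pstat")
# new = total_time("refactored.pstat")
# print(f"relative change: {100 * (new - base) / base:+.1f}%")
```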

Types of corrections

Per file, we get 175 suggestions. These are split into the following groups:

  • Cased (capitalized or all upper): 41 (~23%)
  • Non-cased: 134 (~77%)
  • Corrections with multiple candidates: 77 (44%)
    • Cased with multiple candidate corrections: 37 (21% of the 175 corrections)
    • Non-cased with multiple candidate corrections: 40 (~23% of the 175 corrections)
  • Corrections with reason: 0

Some ratios:

  • Cased / non-cased: ~30% (41/134).
    • Strong bias towards non-cased (seems reasonable since most words are lower-cased)
  • Multiple candidates cased / Multi candidate non-cased corrections: 93% (37/40).
    • This one is probably too close to 1:1. Given the strong bias towards non-cased, one would expect a similar bias here. This is a potential weakness with the chosen corpus.
  • Multiple candidates / Single candidate corrections: 79% (77/98).
    • This one is close to 1:1, so we get decent coverage of both cases.
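The bucketing above can be reproduced with a short script over codespell's output. This sketch assumes the familiar "file:line: word ==> fix1, fix2" output shape; the reason column is ignored here since this corpus produced no corrections with a reason.

```python
# Sketch: bucket codespell-style output lines into the categories above.
def classify(lines):
    stats = {"cased": 0, "non_cased": 0, "multi_candidate": 0}
    for line in lines:
        if "==>" not in line:
            continue
        left, right = line.split("==>", 1)
        word = left.rstrip().rsplit(None, 1)[-1]  # flagged word before "==>"
        candidates = [c.strip() for c in right.split(",")]
        if word != word.lower():
            stats["cased"] += 1
        else:
            stats["non_cased"] += 1
        if len(candidates) > 1:
            stats["multi_candidate"] += 1
    return stats


# Usage sketch:
sample = [
    "cano.txt:10: Teh ==> The",
    "cano.txt:42: recieve ==> receive, receives",
]
# classify(sample) -> {'cased': 1, 'non_cased': 1, 'multi_candidate': 1}
```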


DimitriPapadopoulos commented on June 28, 2024

Perhaps it would be worth checking with more files, so that application startup time remains small compared to individual file checking.

I guess 20 files ought to be enough:

for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done

