Comments (7)
Exposing a function to check words or lines, one that is actually used by codespell itself, might affect performance. On the other hand:
- the performance hit is just speculation without actual measurements,
- exposing functions would probably reduce the complexity of `parse_file()`, a long-overdue target (see #3276).
Do you agree refactoring is a necessary step to expose functions that could be added to a public API later on? If so, could you have a look at #3276, see if it's a step in your direction, and create a new pull request from it? A test to measure performance would be a great addition. I am not the maintainer, but I think these steps might convince @larsoner.
I think you would need to refactor near the code that starts line processing (`codespell_lib/_codespell.py`, line 967 at 0b09d75) and near the code that splits lines into words (`codespell_lib/_codespell.py`, line 989 at 0b09d75).
Alternatively, a lightweight API to access the dictionaries sounds like a good option indeed.
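To make that concrete, here is a rough sketch of what such a dictionary-access helper could look like. The `load_dictionary` name and the simplified parsing are made up for illustration, not existing codespell code:

```python
from pathlib import Path


def load_dictionary(path: str) -> dict[str, list[str]]:
    """Hypothetical helper: map each misspelling to its candidate
    corrections. Dictionary lines look like
    "typo->candidate1, candidate2"; reason annotations and other
    subtleties of the real format are ignored for brevity.
    """
    corrections: dict[str, list[str]] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        typo, sep, rest = line.partition("->")
        if not sep:
            continue  # skip blank/malformed lines
        corrections[typo.strip()] = [c.strip() for c in rest.split(",") if c.strip()]
    return corrections
```

Something like this would let callers query the typo/candidate mapping without going through `parse_file()` at all.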
Thanks for the feedback and the suggestions.
I could be convinced to do a PR for this change, but I would prefer to have alignment up front on what the expectations would be, to determine whether it is within my capacity and matches the amount of time I am willing to invest. I am hoping we can meet in the middle, obviously, but I would rather know upfront so there is no disappointment on either side down the line. For example, if there is going to be a performance test: what are the expectations on "before vs. after" performance, and what kind of corpus would you consider a "valid performance test" (are we talking micro-benchmarks, real-life use cases, or both; which versions of Python count; etc.)?
For clarity, I read phrases like "would be nice" or "a great addition" as "optional and would not prevent merge if omitted".
I guess that means I am waiting for @larsoner to respond on the expectations. :)
I think as long as the changes are reviewable and there isn't much of a performance hit we're okay. Something like a 10% performance hit would be kind of a bummer. But I would also be surprised if there were one, based on briefly thinking about the code.
One approach would be to make a quick attempt at this with the necessary refactoring, some new Python API, and some basic tests in a new `codespell_lib/tests/test_api.py` or similar. Hopefully that's not too much work based on what @DimitriPapadopoulos outlined above? And then we can check to make sure performance isn't too bad, then iterate on the API as proposed in the PR.
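For illustration, a first test there might look something like the sketch below, where `check_word` and its return shape stand in for whatever API the PR ends up proposing; nothing like it exists today:

```python
# Hypothetical sketch of codespell_lib/tests/test_api.py.
from codespell_lib import check_word  # assumed future API, does not exist yet


def test_flags_known_typo():
    # "abandonned" has long been an entry in the built-in dictionary
    result = check_word("abandonned")
    assert result is not None
    assert "abandoned" in result.candidates


def test_accepts_correct_word():
    assert check_word("abandoned") is None
```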
@larsoner I wonder whether this will eventually require some file/directory renaming. All files currently live under `codespell_lib`. However, Python modules usually put Python code under `codespell` or `src/codespell`. I wonder why this is not the case here, and whether we should change that.
Can't remember why it was `codespell_lib`. It could be a reasonable time to `mkdir src && git mv codespell src/` and adjust `pyproject.toml`. I think to change this we'd probably want the release to be 3.0, but that would be okay. It can/should be done in a separate PR though.
Ok, thanks for the input so far. Up to a 10% performance hit is quite a leeway indeed, and like @larsoner, I doubt we will hit it. Nevertheless, I will establish a baseline to measure against.
I did not see any references to how we decide the baseline. For now, I will be using https://sherlock-holm.es/stories/plain-text/cano.txt as the base text. Since what we are concerned with is the cost of refactoring towards a `spellcheck_word` function that is called per word (consider the name a description more than a target at this point), I did not think it relevant to look at different file types. Let me know if you have reason to believe otherwise.
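To make the concern concrete, the shape I have in mind is roughly the following; all names and signatures are placeholders rather than actual codespell code, and the cost being measured is the extra Python function call per word:

```python
import re
from typing import Iterator, Optional


def spellcheck_word(word: str, misspellings: dict) -> Optional[str]:
    """Placeholder sketch: return the dictionary entry if `word`
    looks misspelled, else None."""
    return misspellings.get(word.lower())


def spellcheck_line(line: str, word_regex: re.Pattern,
                    misspellings: dict) -> Iterator[tuple[str, str]]:
    """Yield (word, entry) pairs; note the extra call per word."""
    for match in word_regex.finditer(line):
        entry = spellcheck_word(match.group(), misspellings)
        if entry is not None:
            yield match.group(), entry
```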
I am describing my method below. Ideally, none of this would be new to you if you have prior experience with performance measurements of this kind.
The setup:
```
mkdir performance-corpus/
cd performance-corpus
wget https://sherlock-holm.es/stories/plain-text/cano.txt
# Create some extra copies (a single file takes less than 0.5 seconds;
# adding a few more makes it easier to measure)
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done
cd ..  # back to the repository root
# Set up `_version.py`, so we can run directly from a git checkout.
echo '__version__ = "do not crash, please"' > codespell_lib/_version.py
```
The resulting corpus is 79329 KB. The `wc` command line tool reports that each file has 76764 lines (including many empty ones) and 657438 words (by `wc`'s definition of a word). The `codespell` code flags about 175 words as possible typos in this text (that is, per file).
The Python version I used for these numbers:
```
$ python3 --version
Python 3.11.9
```
Running the test:
```
# Test, repeated at least 3 times, fastest result is chosen;
# slower times are usually caused by unrelated jitter - at least on
# laptops/desktops that also do other work.
time PYTHONPATH=. python3 -m codespell_lib performance-corpus/ >/dev/null

# One run with `-m profile` to see where the bottlenecks are.
# This takes considerably longer (x10-x20), so it is not done on the full corpus.
PYTHONPATH=. python3 -m profile -o baseline.pstat -m codespell_lib performance-corpus/cano.txt
```
Baseline results
On my hardware, the performance test on the baseline is recorded as 5.6 seconds (`0m5.600s`).
The bottleneck profiling has the following key numbers as far as I can tell:
- Total time is 3.030s as baseline (`2196160 function calls (2195259 primitive calls) in 3.030 seconds`).
- The code spends 0.503s (~1/6) in `build_dict` (split into 2 calls).
- The code spends 2.472s (~5/6) in `parse_file`, in one call [only one file was examined].
- About half of `parse_file`'s runtime is spent in `parse_file` itself (1.178/2.472). The other half is then assumed to come from its callees.
- The time spent in `"".lower()` is 0.433s (~1/6), and another ~1/6 is spent in regex operations (`group()` + `search()` + `finditer()`).
These numbers were extracted in an interactive session with the following commands (as examples):
```
import pstats

s = pstats.Stats("baseline.pstat")
s.sort_stats(2).print_stats(20)   # sort by cumulative time (legacy key 2), top 20
s.print_callees('parse_file')     # break down where parse_file's time goes
```
Types of corrections
Per file, we get 175 suggestions. These are split into the following groups:
- Cased (capitalized or all upper): 41 (~23%)
- Non-cased: 134 (~77%)
- Corrections with multiple candidates: 77 (44%)
- Cased with multiple candidate corrections: 37 (21% of the 175 corrections)
- Non-cased with multiple candidate corrections: 40 (~23% of the 175 corrections)
- Corrections with a `reason`: 0
Some ratios:
- Cased / non-cased: ~31% (41/134).
  - Strong bias towards non-cased (seems reasonable since most words are lower-cased).
- Multiple-candidate cased / multiple-candidate non-cased corrections: 93% (37/40).
  - This one is probably too close to 1:1. Given the strong bias towards non-cased, one would expect a similar bias here. This is a potential weakness of the chosen corpus.
- Multiple candidates / single candidate corrections: 79% (77/98).
  - This one is close to 1:1, so we get decent coverage of both cases.
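For the record, I derived these counts along the following lines. The snippet assumes codespell's default `path:line: word ==> candidate(s)` output format; it is a throwaway helper I wrote for this comment, not part of codespell:

```python
import os
import subprocess

# Run codespell from the git checkout root on a single corpus file.
out = subprocess.run(
    ["python3", "-m", "codespell_lib", "performance-corpus/cano.txt"],
    capture_output=True, text=True,
    env={**os.environ, "PYTHONPATH": "."},
).stdout

total = cased = multi = 0
for line in out.splitlines():
    left, sep, right = line.partition(" ==> ")
    if not sep:
        continue  # not a suggestion line
    total += 1
    word = left.rsplit(": ", 1)[-1]  # strip the "path:line: " prefix
    if word != word.lower():
        cased += 1   # capitalized or all-upper typo
    if "," in right:
        multi += 1   # multiple candidate corrections
print(f"total={total} cased={cased} multi-candidate={multi}")
```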
Perhaps it would be worth checking with more files, so that application startup time remains small compared to individual file checking. I guess 20 files ought to be enough:
```
for i in $(seq 1 20); do cp -a cano.txt cano-$i.txt; done
```