Comments (5)
Hi @sasi143, thanks for your interest in spaczz. I am very interested in improving the speed of the fuzzy matching process. However, for reasons I'll outline below, I unfortunately do not think this will happen in the near future without additional contributor(s).
I believe the performance bottleneck(s) in spaczz's fuzzy matching come from the amount of time the code spends in pure Python iterating through and processing text and potential matches. I do not believe the fuzzy comparisons themselves are a bottleneck because they are done with RapidFuzz which is already written in C++. I do need to do some profiling to confirm this though.
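As a rough illustration of the kind of profiling meant here, the sketch below uses the standard library's `cProfile` on a toy sliding-window scan. The `scan` and `compare` functions are hypothetical stand-ins, not spaczz code; the point is only to show how cumulative time can be split between the pure-Python loop and the comparison call.

```python
import cProfile
import io
import pstats


def compare(a: str, b: str) -> int:
    """Hypothetical stand-in for a fast, compiled string comparison."""
    return sum(x == y for x, y in zip(a, b))


def scan(tokens, query):
    """Slide a window over the tokens in pure Python and score each position."""
    n = len(query)
    target = " ".join(query)
    results = []
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i : i + n])
        results.append((i, compare(window, target)))
    return results


def profile_scan(tokens, query) -> str:
    """Run scan() under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(scan, tokens, query)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_scan(["the", "quick", "brown", "fox"] * 2500, ["brown", "fox"])
print(report)
```

If the hypothesis holds for the real code, the looping function would dominate the cumulative column while the comparison function accounts for a small share.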
The pattern that spaCy proper uses to achieve its rapid speed is dropping most of its internal text processing code to C, and that is the pattern I would eventually like to follow. Unfortunately, I have almost no experience with C/C++, so without help from additional contributor(s) with C/C++ experience, my progress will be quite slow. That being said, I do intend to work on it myself; I just can't make any promises about timelines. Also, while I think incorporating GPU support is an interesting idea, to be honest I have even less of an idea about how to implement that than dropping portions of the code down to C/C++.
In the shorter term, here are some possible actions:
- Raise the `min_r1` threshold from the current value of `25` for your patterns. `min_r1` is essentially a trade-off between speed and accuracy. Raising it means some potential matches will be missed because they won't be passed on to further match optimization, but it should result in modest speed increases. Look at the documentation for the `FuzzySearcher` for details on `min_r1`, and the documentation for the `FuzzyMatcher` and `SpaczzRuler` for how to incorporate this into your patterns or change the default value for all patterns in an instance of those classes.
- Think about rewriting some of your fuzzy patterns as fuzzy regex patterns; see the "Approximate “fuzzy” matching" section in the Regex package documentation. spaczz supports these kinds of patterns, and they will likely run faster because Regex is mostly implemented in lower-level code.
- I can run some profiling on spaczz to confirm my hypothesis of where the bottleneck(s) are. In addition, I may discover some "easy wins" where I have written something inefficiently even for Python.
- I can try to rethink/redesign aspects of the fuzzy matching algorithm to potentially reduce the number of comparisons it currently does.
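To make the `min_r1` trade-off concrete, here is a minimal sketch of a two-stage threshold filter. It is not spaczz's implementation: `search`, `ratio`, and `min_final` are hypothetical names, `difflib` stands in for RapidFuzz, and the second stage is only a stricter re-check standing in for the costlier match optimization.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]; difflib is a stand-in for RapidFuzz here."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(tokens, query, min_r1=25, min_final=75):
    """Return (matches, candidates): spans that matched and windows that passed stage 1."""
    n = len(query.split())
    candidates = 0
    matches = []
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i : i + n])
        score = ratio(window, query)
        # Stage 1: windows scoring below min_r1 are dropped immediately.
        if score < min_r1:
            continue
        candidates += 1
        # Stage 2 (hypothetical): stand-in for the more expensive optimization,
        # reduced here to a stricter threshold on the same window.
        if score >= min_final:
            matches.append((i, i + n, round(score)))
    return matches, candidates


tokens = "My name is Grint Anderson and I live in Nashville".split()
loose_matches, loose_candidates = search(tokens, "Grant Andersen", min_r1=25)
strict_matches, strict_candidates = search(tokens, "Grant Andersen", min_r1=90)
print(loose_matches, loose_candidates)
print(strict_matches, strict_candidates)
```

In this toy example a higher `min_r1` never sends more windows to the second stage, and setting it above the score of the best window discards that match entirely, which is the speed/accuracy trade-off described above.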
I'm sorry I can't give you a more definite solution or timeline right now. Hopefully as people continue to discover/use spaczz some more experienced programmers may become interested in contributing. As things stand, I will slowly be working on accomplishing these speed improvements myself.
I'll use this issue as a place to keep track of these updates as they come.
from spaczz.
@gandersen101 Really thankful for your clear explanation, and I appreciate your time. Keep doing good work and stay safe.
Issue #41 has turned into a performance discussion and I am planning to make some performance improvements very soon. I will provide a summary of those changes on this thread soon.
@gandersen101 Thanks for the inspiration. I started a low-level integration of rapidfuzz into spaCy to attempt to improve performance: explosion/spaCy#11359
Any thoughts/ideas welcome there.
@kwhumphreys very cool. Best of luck! Obviously I have not put much time into `spaczz` over the past couple of years, but the functionality is something people have been looking for. Hopefully you can get an official implementation into `spaCy` proper.
I'm going to add an announcement to the README. Essentially, I intend to address some issues and add some functionality with `spaczz` going forward, but I don't know if I'll ever have the time to "Cythonize" this library to the extent that it's fast enough for many use cases.