
spaczz: Fuzzy matching and more for spaCy

spaczz provides fuzzy matching and additional regex matching functionality for spaCy. spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.

Fuzzy matching is currently performed with functions from RapidFuzz's fuzz module, and regex matching currently relies on the regex library. spaczz also takes influence from other libraries and resources; for additional details see the references section.

Supports spaCy >= 3.0

spaczz has been tested on Ubuntu, MacOS, and Windows Server.

v0.6.0 Release Notes:

  • All matchers now return the matching pattern. This is a breaking change: matches are now tuples of length 5 instead of 4.
  • Regex and token matches now return match ratios.
  • Support for python<=3.11,>=3.7, along with rapidfuzz>=1.0.0.
  • Dropped support for spaCy v2. Sorry to do this without a deprecation cycle, but I stepped away from this project for a long time.
  • Removed support for "spaczz_"-prepended optional SpaczzRuler init arguments. Again, sorry to do this without a deprecation cycle.
  • Matcher.pipe methods, which were deprecated, are now removed.
  • spaczz_span custom attribute, which was deprecated, is now removed.

Please see the changelog for previous release notes. This will eventually be moved to the Read the Docs page.

Installation

Spaczz can be installed using pip.

pip install spaczz

Basic Usage

Spaczz's primary features are the FuzzyMatcher, RegexMatcher, and "fuzzy" TokenMatcher, which function similarly to spaCy's Matcher and PhraseMatcher, and the SpaczzRuler, which integrates the spaczz matchers into a spaCy pipeline component similar to spaCy's EntityRuler.

FuzzyMatcher

The basic usage of the fuzzy matcher is similar to spaCy's PhraseMatcher, except it returns the fuzzy ratio and matched pattern along with the match id, start, and end information, so make sure to include variables for the ratio and pattern when unpacking results.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matcher.add("GPE", [nlp("Nashville")])
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
NAME Grint M Anderson 80 Grant Andersen
GPE Nashv1le 82 Nashville

Unlike spaCy matchers, spaczz matchers are written in pure Python. While they require a spaCy vocab to be passed during initialization, this is purely for consistency, as the spaczz matchers do not currently use the spaCy vocab. This is why the match_id above is simply a string instead of an integer value like in spaCy matchers.

Spaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index, and the matches produced by the matcher.

import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)


def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "NAME" entities to doc."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio, _pattern = matches[i]
    entity = Span(doc, start, end, label="NAME")
    doc.ents += (entity,)


matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], on_match=add_name_ent)
matches = matcher(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
('Grint M Anderson', 0, 3, 'NAME')

The SpaczzRuler implements entity-updating logic very similar to spaCy's EntityRuler, and it also takes care of handling overlapping matches. It is discussed in a later section.

Unlike spaCy's matchers, rules added to spaczz matchers have optional keyword arguments that can modify the matching behavior. Take the below fuzzy matching examples:

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matches = matcher(doc)

# The default fuzzy matching settings will not find a match.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)

Next we change the fuzzy matching behavior for the pattern in the "NAME" rule.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], kwargs=[{"fuzzy_func": "token_sort"}])
matches = matcher(doc)

# The default fuzzy matching settings will not find a match.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
NAME Anderson, Grint 83 Grant Andersen
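To see why "token_sort" succeeds here: token-sorted comparison sorts each string's tokens alphabetically before computing the ratio, so word order stops mattering. The sketch below illustrates the idea using Python's stdlib difflib as a stand-in for rapidfuzz (the helper names are illustrative and the scores rapidfuzz actually produces will differ):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """Plain sequence ratio scaled to 0-100 (difflib stand-in for a fuzzy ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def token_sort_ratio(a: str, b: str) -> int:
    """Sort whitespace-delimited tokens before comparing, like "token_sort"."""
    sort_tokens = lambda s: " ".join(sorted(s.lower().split()))
    return ratio(sort_tokens(a), sort_tokens(b))

# Word order differs, so the plain ratio is low...
print(ratio("Grant Andersen", "Anderson Grint"))
# ...but sorting tokens first makes the reordered name comparable again.
print(token_sort_ratio("Grant Andersen", "Anderson Grint"))
```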

The full list of keyword arguments available for fuzzy matching settings includes:

  • ignore_case (bool): Whether to lower-case text before matching. Default is True.
  • min_r (int): Minimum match ratio required.
  • thresh (int): If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
  • fuzzy_func (str): Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Additional fuzzy matching functions can be registered by users. Default is "simple":
    • "simple" = ratio
    • "partial" = partial_ratio
    • "token" = token_ratio
    • "token_set" = token_set_ratio
    • "token_sort" = token_sort_ratio
    • "partial_token" = partial_token_ratio
    • "partial_token_set" = partial_token_set_ratio
    • "partial_token_sort" = partial_token_sort_ratio
    • "weighted" = WRatio
    • "quick" = QRatio
    • "partial_alignment" = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
  • flex (int|Literal['default', 'min', 'max']): Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). "max", "min", or "default" are also valid. Default is "default": len(pattern) // 2.
  • min_r1 (int|None): Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and "low" in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).
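The defaults for flex and min_r1 described above reduce to two tiny formulas. They are sketched here as plain Python so the derived values are easy to check (the helper names are hypothetical; spaczz computes these internally):

```python
def default_flex(pattern_length: int) -> int:
    """Default flex: half the pattern's token length, rounded down."""
    return pattern_length // 2

def default_min_r1(min_r: int) -> int:
    """Default min_r1: a "low" first-pass ratio derived from min_r."""
    return round(min_r / 1.5)

# A two-token pattern like nlp("Grant Andersen") gets flex = 1 by default.
print(default_flex(2))
# With min_r = 75, the initial (unflexed) scan only requires a ratio of 50.
print(default_min_r1(75))
```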

RegexMatcher

The basic usage of the regex matcher is also fairly similar to spaCy's PhraseMatcher. It accepts regex patterns as strings, so flags must be inline. Regexes are compiled with the regex package, so approximate "fuzzy" matching is supported. To provide access to these "fuzzy" match results, the matcher returns a calculated fuzzy ratio and matched pattern along with the match id, start, and end information, so make sure to include variables for the ratio and pattern when unpacking results.

import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = RegexMatcher(nlp.vocab)
# Use inline flags for regex strings as needed
matcher.add(
    "ZIP",
    [r"\b\d{5}(?:[-\s]\d{4})?\b"],
)
matcher.add("GPE", [r"(usa){d<=1}"])  # Fuzzy regex.
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
ZIP 55555-1234 100 \b\d{5}(?:[-\s]\d{4})?\b
GPE US 80 (usa){d<=1}

Spaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index, and the matches produced by the matcher. See the fuzzy matcher usage example above for details.

Like the fuzzy matcher, the regex matcher has optional keyword arguments that can modify matching behavior. Take the below regex matching example.

import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the USA."""  # Spelling errors intentional. Notice 'USA' here.
doc = nlp(text)

matcher = RegexMatcher(nlp.vocab)
# Use inline flags for regex strings as needed
matcher.add(
    "STREET", ["street_addresses"], kwargs=[{"predef": True}]
)  # Use predefined regex by key name.
# Below will not expand partial matches to span boundaries.
matcher.add("GPE", [r"(?i)[U](nited|\.?) ?[S](tates|\.?)"], kwargs=[{"partial": False}])
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(
        match_id, doc[start:end], ratio, pattern
    )  # comma in result isn't ideal - see "Roadmap"
STREET 555 Fake St, 100 street_addresses

The full list of keyword arguments available for regex matching settings includes:

  • ignore_case (bool): Whether to lower-case text before matching. Default is True.
  • min_r (int): Minimum match ratio required.
  • fuzzy_weights (str): Name of the weighting method for regex insertion, deletion, and substitution counts. Additional weighting methods can be registered by users. Default is "indel".
    • "indel" = (1, 1, 2)
    • "lev" = (1, 1, 1)
  • partial (bool): Whether partial matches should be extended to Token or Span boundaries in doc, e.g. when the regex only matches part of a Token or Span in doc. Default is True.
  • predef (bool): Whether the regex string should be interpreted as a key to a predefined regex pattern. Additional predefined regex patterns can be registered by users. Default is False.
    • "dates"
    • "times"
    • "phones"
    • "phones_with_exts"
    • "links"
    • "emails"
    • "ips"
    • "ipv6s"
    • "prices"
    • "hex_colors"
    • "credit_cards"
    • "btc_addresses"
    • "street_addresses"
    • "zip_codes"
    • "po_boxes"
    • "ssn_numbers"

SimilarityMatcher

The basic usage of the similarity matcher is similar to spaCy's PhraseMatcher, except it returns the vector similarity ratio and matched pattern along with the match id, start, and end information, so make sure to include variables for the ratio and pattern when unpacking results.

In order to produce meaningful results from the similarity matcher, a spaCy model with word vectors (e.g. the medium or large English models) must be used to initialize the matcher, process the target document, and process any patterns added.

import spacy
from spaczz.matcher import SimilarityMatcher

nlp = spacy.load("en_core_web_md")
text = "I like apples, grapes and bananas."
doc = nlp(text)

# lowering min_r2 from default of 75 to produce matches in this example
matcher = SimilarityMatcher(nlp.vocab, min_r2=65)
matcher.add("FRUIT", [nlp("fruit")])
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
FRUIT apples 70 fruit
FRUIT grapes 73 fruit
FRUIT bananas 70 fruit

Please note that, even for the mostly pure-Python spaczz, this process is currently extremely slow, so be mindful of the scope in which it is applied. Enabling GPU support in spaCy (see here) should improve the speed somewhat, but I believe the process will still be bottlenecked by the pure-Python search algorithm until I develop a better search algorithm and/or drop the search to lower-level code (e.g. C).

Also, as the similarity matcher is somewhat experimental, it is not currently part of the SpaczzRuler, nor does it have a separate ruler. If you need to add similarity matches to a Doc's entities, you will need to use an on-match callback for the time being. Please see the fuzzy matcher on-match callback example above for ideas. If there is enough interest in integrating/creating a ruler for the similarity matcher, this can be done.

The full list of keyword arguments available for similarity matching settings includes:

  • ignore_case (bool): Whether to lower-case text before fuzzy matching. Default is True.
  • min_r (int): Minimum match ratio required.
  • thresh (int): If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
  • flex (int|Literal['default', 'min', 'max']): Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). "max", "min", or "default" are also valid. Default is "default": len(pattern) // 2.
  • min_r1 (int|None): Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and "low" in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 being set to round(min_r / 1.5).
  • min_r2 (int|None): Optional granular control over the minimum match ratio required for selection during match optimization. Needs to be higher than min_r1 and "high" in general to ensure only quality matches are returned. Default is None, which will result in min_r2 being set to min_r.

TokenMatcher

Note: spaCy's Matcher now supports fuzzy matching, so unless you need a specific feature from spaczz's TokenMatcher, it is highly recommended to use spaCy's much faster Matcher.

The basic usage of the token matcher is similar to spaCy's Matcher. It accepts labeled patterns in the form of lists of dictionaries where each list describes an individual pattern and each dictionary describes an individual token.

The token matcher accepts all the same token attributes and pattern syntax as its spaCy counterpart but adds fuzzy and fuzzy-regex support.

"FUZZY" and "FREGEX" are the two additional spaCy token pattern options.

For example:

[
    {"TEXT": {"FREGEX": "(database){e<=1}"}},
    {"LOWER": {"FUZZY": "access", "MIN_R": 85, "FUZZY_FUNC": "quick_lev"}},
]

Make sure to use uppercase dictionary keys in patterns.
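At the individual token level, a "FUZZY" comparison is just a two-string ratio checked against MIN_R. The sketch below uses stdlib difflib as a stand-in for rapidfuzz (the helper name is illustrative and rapidfuzz's exact scores will differ) to show why a pattern like {"LOWER": {"FUZZY": "access", "MIN_R": 85}} would accept the misspelling "acess":

```python
from difflib import SequenceMatcher

def token_fuzzy_match(token_text, pattern_text, min_r=85):
    """Illustrative per-token fuzzy check: the ratio must meet the MIN_R threshold."""
    r = round(100 * SequenceMatcher(None, token_text.lower(), pattern_text.lower()).ratio())
    return r >= min_r, r

print(token_fuzzy_match("acess", "access"))   # a close misspelling clears the threshold
print(token_fuzzy_match("banana", "access"))  # an unrelated token does not
```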

The full list of keyword arguments available for token matching settings includes:

  • ignore_case (bool): Whether to lower-case text before matching. Can only be set at the pattern level. For "FUZZY" and "FREGEX" patterns. Default is True.
  • min_r (int): Minimum match ratio required. For "FUZZY" and "FREGEX" patterns.
  • fuzzy_func (str): Key name of the fuzzy matching function to use. Can only be set at the pattern level. For "FUZZY" patterns only. All rapidfuzz matching functions with default settings are available; however, token-based functions provide no utility at the individual token level. Additional fuzzy matching functions can be registered by users. Included, and useful, functions are (the default is "simple"):
    • "simple" = ratio
    • "partial" = partial_ratio
    • "quick" = QRatio
    • "partial_alignment" = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
  • fuzzy_weights (str): Name of the weighting method for regex insertion, deletion, and substitution counts. Additional weighting methods can be registered by users. Default is "indel".
    • "indel" = (1, 1, 2)
    • "lev" = (1, 1, 1)
  • predef (bool): Whether the regex should be interpreted as a key to a predefined regex pattern. Can only be set at the pattern level. For "FREGEX" patterns only. Default is False.

import spacy
from spaczz.matcher import TokenMatcher

# Using model results like POS tagging in token patterns requires model that provides these.
nlp = spacy.load("en_core_web_md")
text = """The manager gave me SQL databesE acess so now I can acces the Sequal DB.
My manager's name is Grfield"""
doc = nlp(text)

matcher = TokenMatcher(vocab=nlp.vocab)
matcher.add(
    "DATA",
    [
        [
            {"TEXT": "SQL"},
            {"LOWER": {"FREGEX": "(database){s<=1}"}},
            {"LOWER": {"FUZZY": "access"}},
        ],
        [{"TEXT": {"FUZZY": "Sequel"}, "POS": "PROPN"}, {"LOWER": "db"}],
    ],
)
matcher.add("NAME", [[{"TEXT": {"FUZZY": "Garfield"}}]])
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
DATA SQL databesE acess 91 [{"TEXT":"SQL"},{"LOWER":{"FREGEX":"(database){s<=1}"}},{"LOWER":{"FUZZY":"access"}}]
DATA Sequal DB 87 [{"TEXT":{"FUZZY":"Sequel"},"POS":"PROPN"},{"LOWER":"db"}]
NAME Grfield 93 [{"TEXT":{"FUZZY":"Garfield"}}]

Even though the token matcher can be a drop-in replacement for spaCy's Matcher, it is still recommended to use spaCy's Matcher if you do not need the spaczz token matcher's fuzzy capabilities - it will slow processing down unnecessarily.

Reminder: spaCy's Matcher now supports fuzzy matching, so unless you need a specific feature from spaczz's TokenMatcher, it is highly recommended to use spaCy's much faster Matcher.

SpaczzRuler

The spaczz ruler combines the fuzzy and regex phrase matchers, and the "fuzzy" token matcher, into one pipeline component that can update a Doc.ents similar to spaCy's EntityRuler.

Patterns must be added as an iterable of dictionaries with the keys: label (str), pattern (str or list), type (str), optional kwargs (dict), and optional id (str).

For example, a fuzzy phrase pattern:

{'label': 'ORG', 'pattern': 'Apple', 'kwargs': {'min_r2': 90}, 'type': 'fuzzy'}

Or, a token pattern:

{'label': 'ORG', 'pattern': [{'TEXT': {'FUZZY': 'Apple'}}], 'type': 'token'}

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the USA.
Some of his favorite bands are Converg and Protet the Zero."""  # Spelling errors intentional.
doc = nlp(text)

patterns = [
    {
        "label": "NAME",
        "pattern": "Grant Andersen",
        "type": "fuzzy",
        "kwargs": {"fuzzy_func": "token_sort"},
    },
    {
        "label": "STREET",
        "pattern": "street_addresses",
        "type": "regex",
        "kwargs": {"predef": True},
    },
    {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},
    {
        "label": "ZIP",
        "pattern": r"\b(?:55554){s<=1}(?:(?:[-\s])?\d{4}\b)",
        "type": "regex",
    },  # fuzzy regex
    {"label": "GPE", "pattern": "(?i)[U](nited|\.?) ?[S](tates|\.?)", "type": "regex"},
    {
        "label": "BAND",
        "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],
        "type": "token",
    },
    {
        "label": "BAND",
        "pattern": [
            {"TEXT": {"FUZZY": "Protest"}},
            {"IS_STOP": True},
            {"TEXT": {"FUZZY": "Hero"}},
        ],
        "type": "token",
    },
]

ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)


for ent in doc.ents:
    print(
        (
            ent.text,
            ent.start,
            ent.end,
            ent.label_,
            ent._.spaczz_ratio,
            ent._.spaczz_type,
            ent._.spaczz_pattern,
        )
    )
('Anderson, Grint', 0, 3, 'NAME', 83, 'fuzzy', 'Grant Andersen')
('555 Fake St,', 9, 13, 'STREET', 100, 'regex', 'street_addresses')
('Nashv1le', 17, 18, 'GPE', 82, 'fuzzy', 'Nashville')
('55555-1234', 20, 23, 'ZIP', 90, 'regex', '\\b(?:55554){s<=1}(?:(?:[-\\s])?\\d{4}\\b)')
('USA', 25, 26, 'GPE', 100, 'regex', '(?i)[U](nited|\\.?) ?[S](tates|\\.?)')
('Converg', 34, 35, 'BAND', 93, 'token', '[{"LOWER":{"FREGEX":"(converge){e<=1}"}}]')
('Protet the Zero', 36, 39, 'BAND', 89, 'token', '[{"TEXT":{"FUZZY":"Protest"}},{"IS_STOP":true},{"TEXT":{"FUZZY":"Hero"}}]')

We see in the example above that we are referencing some custom attributes, which are explained below.

For more SpaczzRuler examples see here. In particular this provides details about the ruler's sorting process and fuzzy matching parameters.

Custom Attributes

Spaczz initializes some custom attributes upon import. These live under spaCy's ._. attribute and are further prefixed with spaczz_, so there should be no conflicts with your own custom attributes. If there are, spaczz will forcibly overwrite them.

These custom attributes are only set via the spaczz ruler at the token level. Span and doc versions of these attributes are getters that reference the token level attributes.

The following Token attributes are available. All are mutable:

  • spaczz_token: default = False. Boolean that denotes if the token is part of an entity set by the spaczz ruler.
  • spaczz_type: default = None. String that shows which matcher produced an entity using the token.
  • spaczz_ratio: default = None. If the token is part of a matched entity, it will return fuzzy ratio.
  • spaczz_pattern: default = None. If the token is part of a matched entity, it will return the pattern as a string (JSON-formatted for token patterns) that produced the match.

The following Span attributes reference the token attributes included in the span. All are immutable:

  • spaczz_ent: default = False. Boolean that denotes if all tokens in the span are part of an entity set by the spaczz ruler.
  • spaczz_type: default = None. String that denotes which matcher produced an entity using the included tokens.
  • spaczz_types: default = set(). Set that shows which matchers produced entities using the included tokens. An entity span should only have one type, but this allows you to see the types included in any arbitrary span.
  • spaczz_ratio: default = None. If all the tokens in span are part of a matched entity, it will return the fuzzy ratio.
  • spaczz_pattern: default = None. If all the tokens in a span are part of a matched entity, it will return the pattern as a string (JSON-formatted for token patterns) that produced the match.

The following Doc attributes reference the token attributes included in the doc. All are immutable:

  • spaczz_doc: default = False. Boolean that denotes if any tokens in the doc are part of an entity set by the spaczz ruler.
  • spaczz_types: default = set(). Set that shows which matchers produced entities in the doc.
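The span- and doc-level getters described above simply aggregate the token-level values. The sketch below mimics that aggregation logic over plain dictionaries (not real spaCy Token objects) so the semantics are easy to see; the helper names are illustrative, not spaczz internals:

```python
def span_spaczz_ent(tokens):
    """True only if every token in the span was set by the spaczz ruler."""
    return all(t["spaczz_token"] for t in tokens)

def span_spaczz_types(tokens):
    """Set of matcher types that produced entities using these tokens."""
    return {t["spaczz_type"] for t in tokens if t["spaczz_type"] is not None}

# Tokens as plain dicts, standing in for spaCy Tokens with ._ attributes.
span = [
    {"spaczz_token": True, "spaczz_type": "fuzzy"},
    {"spaczz_token": True, "spaczz_type": "fuzzy"},
]
mixed = span + [{"spaczz_token": False, "spaczz_type": None}]

print(span_spaczz_ent(span))     # every token was set by spaczz
print(span_spaczz_ent(mixed))    # one plain token breaks the span-level flag
print(span_spaczz_types(mixed))  # but the types seen are still reported
```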

Saving/Loading

The SpaczzRuler has its own to/from disk/bytes methods and will accept config parameters passed to spacy.load(). It also has its own spaCy factory entry point, so spaCy is aware of the SpaczzRuler. Below is an example of saving and loading a spaCy pipeline with the medium English model, the EntityRuler, and the SpaczzRuler.

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.load("en_core_web_md")
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the USA.
Some of his favorite bands are Converg and Protet the Zero."""  # Spelling errors intentional.
doc = nlp(text)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
('Anderson, Grint', 0, 3, 'ORG')
('555', 9, 10, 'CARDINAL')
('Apt 5', 14, 16, 'PRODUCT')
('Nashv1le', 17, 18, 'GPE')
('TN 55555-1234', 19, 23, 'ORG')
('USA', 25, 26, 'GPE')
('Converg', 34, 35, 'PERSON')
('Zero', 38, 39, 'CARDINAL')

While spaCy does a decent job of identifying that named entities are present in this example, we can definitely improve the matches - particularly with the types of labels applied.

Let's add an entity ruler for some rules-based matches.

from spacy.pipeline import EntityRuler

entity_ruler = nlp.add_pipe("entity_ruler", before="ner") #spaCy v3 syntax
entity_ruler.add_patterns(
    [{"label": "GPE", "pattern": "Nashville"}, {"label": "GPE", "pattern": "TN"}]
)

doc = nlp(text)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
('Anderson, Grint', 0, 3, 'ORG')
('555', 9, 10, 'CARDINAL')
('Apt 5', 14, 16, 'PRODUCT')
('Nashv1le', 17, 18, 'GPE')
('TN', 19, 20, 'GPE')
('USA', 25, 26, 'GPE')
('Converg', 34, 35, 'PERSON')
('Zero', 38, 39, 'CARDINAL')

We're making progress, but Nashville is spelled wrong in the text so the entity ruler does not find it, and we still have other entities to fix/find.

Let's add a spaczz ruler to round this pipeline out. We will also include the spaczz_ent custom attribute in the results to denote which entities were set via spaczz.

spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner") #spaCy v3 syntax
spaczz_ruler.add_patterns(
    [
        {
            "label": "NAME",
            "pattern": "Grant Andersen",
            "type": "fuzzy",
            "kwargs": {"fuzzy_func": "token_sort"},
        },
        {
            "label": "STREET",
            "pattern": "street_addresses",
            "type": "regex",
            "kwargs": {"predef": True},
        },
        {"label": "GPE", "pattern": "Nashville", "type": "fuzzy"},
        {
            "label": "ZIP",
            "pattern": r"\b(?:55554){s<=1}(?:[-\s]\d{4})?\b",
            "type": "regex",
        },  # fuzzy regex
        {
            "label": "BAND",
            "pattern": [{"LOWER": {"FREGEX": "(converge){e<=1}"}}],
            "type": "token",
        },
        {
            "label": "BAND",
            "pattern": [
                {"TEXT": {"FUZZY": "Protest"}},
                {"IS_STOP": True},
                {"TEXT": {"FUZZY": "Hero"}},
            ],
            "type": "token",
        },
    ]
)

doc = nlp(text)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ent))
('Anderson, Grint', 0, 3, 'NAME', True)
('555 Fake St,', 9, 13, 'STREET', True)
('Apt 5', 14, 16, 'PRODUCT', False)
('Nashv1le', 17, 18, 'GPE', True)
('TN', 19, 20, 'GPE', False)
('55555-1234', 20, 23, 'ZIP', True)
('USA', 25, 26, 'GPE', False)
('Converg', 34, 35, 'BAND', True)
('Protet the Zero', 36, 39, 'BAND', True)

Awesome! The medium English model still makes a named entity recognition mistake ("Apt 5" as PRODUCT), but we're satisfied overall.

Let's save this pipeline to disk and make sure we can load it back correctly.

import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(f"{tmp_dir}/example_pipeline")
    nlp = spacy.load(f"{tmp_dir}/example_pipeline")

nlp.pipe_names
['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'entity_ruler',
 'spaczz_ruler',
 'ner']

We can even ensure all the spaczz ruler patterns are still present.

spaczz_ruler = nlp.get_pipe("spaczz_ruler")
spaczz_ruler.patterns
[{'label': 'NAME',
  'pattern': 'Grant Andersen',
  'type': 'fuzzy',
  'kwargs': {'fuzzy_func': 'token_sort'}},
 {'label': 'GPE', 'pattern': 'Nashville', 'type': 'fuzzy'},
 {'label': 'STREET',
  'pattern': 'street_addresses',
  'type': 'regex',
  'kwargs': {'predef': True}},
 {'label': 'ZIP',
  'pattern': '\\b(?:55554){s<=1}(?:[-\\s]\\d{4})?\\b',
  'type': 'regex'},
 {'label': 'BAND',
  'pattern': [{'LOWER': {'FREGEX': '(converge){e<=1}'}}],
  'type': 'token'},
 {'label': 'BAND',
  'pattern': [{'TEXT': {'FUZZY': 'Protest'}},
   {'IS_STOP': True},
   {'TEXT': {'FUZZY': 'Hero'}}],
  'type': 'token'}]

Known Issues

Performance

The main reason for spaczz's slower speed is that the c in its name is not capitalized like it is in spaCy. Spaczz is written in pure Python and its matchers do not currently utilize spaCy language vocabularies, which means following its logic should be easy for those familiar with Python. However, this also means spaczz components will run slower and likely consume more memory than their spaCy counterparts, especially as more patterns are added and documents get longer. It is therefore recommended to use spaCy components like the EntityRuler for entities with little uncertainty, like consistent spelling errors. Use spaczz components when there are no viable spaCy alternatives.

I am not currently working on performance optimizations to spaczz, but algorithmic and optimization suggestions are welcome.

The primary methods for speeding up the FuzzyMatcher and SimilarityMatcher are by decreasing the flex parameter towards 0, or if flex > 0, increasing the min_r1 parameter towards the value of min_r2 and/or lowering the thresh parameter towards min_r2. Be aware that all of these "speed-ups" come at the opportunity cost of potentially improved matches.

As mentioned in the SimilarityMatcher description, utilizing a GPU may also help speed up its matching process.

Roadmap

I am always open and receptive to feature requests, but be aware that, as a solo dev with a lot left to learn, development can move pretty slowly. The following is my roadmap for spaczz so you can see where issues raised might fit into my current priorities.

Note: while I want to keep spaczz functional, I am not actively developing it. I try to be responsive to issues and requests but this project is not currently a focus of mine.

High Priority

  1. Bug fixes - both breaking and behavioral. Hopefully these will be minimal.
  2. Ease of use and error/warning handling and messaging enhancements.
  3. Building out Read the Docs.
  4. Option to prioritize match quality over length and/or weighing options.
  5. Profiling - hopefully to find "easy" performance optimizations.

Enhancements

  1. Having spaczz matchers utilize spaCy vocabularies.
  2. Rewrite the phrase and token searching algorithms in Cython to utilize C speed.
    1. Try to integrate closer with spaCy.

Development

Pull requests and contributors are welcome.

spaczz is linted with Flake8, formatted with Black, type-checked with MyPy (although this could benefit from improved specificity), tested with Pytest, automated with Nox, and built/packaged with Poetry. There are a few other development tools detailed in the noxfile.py, along with Git pre-commit hooks.

To contribute to spaczz's development, fork the repository, then install spaczz and its dev dependencies with Poetry. If you're interested in being a regular contributor, please contact me directly.

poetry install # Within spaczz's root directory.

I keep Nox and pre-commit outside of my Poetry environment as part of my Python toolchain environments. With pre-commit installed, you may also need to run the command below to commit changes.

pre-commit install

The only other package that will not be installed via Poetry but is used for testing and in-documentation examples is the spaCy medium English model (en-core-web-md). This will need to be installed separately. The command below should do the trick:

poetry run python -m spacy download "en_core_web_md"

References

  • spaczz tries to stay as close to spaCy's API as possible. Whenever it made sense to use existing spaCy code within spaczz this was done.
  • Fuzzy matching is performed using RapidFuzz.
  • Regexes are performed using the regex library.
  • The search algorithm for phrase-based fuzzy and similarity matching was heavily influenced by Stack Overflow user Ulf Aslak's answer in this thread.
  • spaczz's predefined regex patterns were borrowed from the commonregex package.
  • spaczz's development and CI/CD patterns were inspired by Claudio Jolowicz's Hypermodern Python article series.

spaczz's People

Contributors

adinowi, brunobg, gandersen101, jonashablitzelavld, tomaarsen


spaczz's Issues

Plural is not chosen over similar word

How to reproduce: a fuzzy pattern with "Goldriesling" and "Riesling", default fuzzy_func. Search in a phrase like "They sell many Rieslings."

Found: Goldriesling. Expected: Riesling.

Return match quality details from the TokenMatcher

Need to develop a way to return match quality details (fuzzy ratios and fuzzy regex counts) from TokenMatcher matches. I currently only do the fuzzy matching of token patterns in spaczz before dropping the fuzzy-matched patterns into spaCy's Matcher. While utilizing spaCy's Matcher means less work on my end and better performance, I don't have an easy way to map the fuzzy details back to the Matcher matches.

Update rapidfuzz

Currently rapidfuzz is required in versions >=1.0.0 and <2.0.0 in pyproject.toml:

rapidfuzz = "^1"

however rapidfuzz is currently available in version 2.15.1.

This breaks other packages or installations which require a more recent version of rapidfuzz directly or indirectly.

After updating to:

rapidfuzz = ">=1.0.0"

all unit tests are still successful.

Is there a reason why versions 2.x are held back?

Refactoring Matchers

Matchers could benefit from inheriting from a base class and the searchers used in them should be composed rather than inherited.

Get original matched pattern back

Hi,
A very useful feature would be to have the original pattern matched by the SpaczzRuler, because when similar patterns are added there may be doubt about which pattern actually produced the match. I guess this issue connects to a potential link with spaCy knowledge base IDs.
Thank you

Phrase search considering synonyms

Would it be possible to use this matching library for a smarter phrase search that takes spaCy's word vectors into consideration?
For instance, if I create a matcher object like this:

import spacy
import spaczz

nlp = spacy.load('en_core_web_lg')
matcher = spaczz.matcher.FuzzyMatcher(nlp.vocab)
matcher.add("my_phrase", [nlp('humorous story')])

Then it would be maybe interesting to see also match for a sentence like in this example:

matcher(nlp('He told me a very funny story.'))

where there is a sub-phrase "funny story" which is a synonym to a phrase "humorous story" we added to the matcher.

How to restrict fuzzy search

Is it possible to restrict the fuzzy search because in my example it is returning unwanted entities.

Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options" # Spelling errors intentional.

patterns = [
{'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
{'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

('Dr Disrespect', 'PERSON')
('YouTube', 'PERSON')

The unwanted entity here is ('YouTube', 'PERSON'), is there some way to restrict the fuzzy search so that it does not identify YouTube in the text to be a person?

Full Code:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank("en")
text = """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options""" # Spelling errors intentional.
doc = nlp(text)

patterns = [
    {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
    {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)

data = [{
    "label": ent.label_,
    "name": ent.text,
} for ent in doc.ents]

for ent in doc.ents:
    print((ent.text, ent.label_))

EDIT:

I noticed the rapidfuzz library provides a score_cutoff parameter. I'm looking to set this to 95 so it's strict. I was hoping something like this could be exposed.
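For what it's worth, pattern dicts elsewhere in these issues show a per-pattern "kwargs" entry that tweaks the search; a hedged sketch of stricter patterns, assuming min_r2 (the final match-ratio cutoff described in the fuzzy-matching tweaks docs) is the relevant knob:

```python
# Sketch only: raise the match-quality floor per pattern via "kwargs".
# min_r2 is assumed here to be the final-match ratio cutoff; see the
# fuzzy_matching_tweaks example docs for the authoritative parameter names.
patterns = [
    {"label": "PERSON", "pattern": "DrDisrespect", "type": "fuzzy",
     "kwargs": {"min_r2": 95}},
    {"label": "PERSON", "pattern": "JZRyoutube", "type": "fuzzy",
     "kwargs": {"min_r2": 95}},
]
# ruler.add_patterns(patterns)  # then add to the SpaczzRuler as usual
```

With a cutoff of 95, weak candidates like "YouTube" for "DrDisrespect" would be expected to fall below the floor and be dropped.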

Switch fuzzywuzzy to rapidfuzz

Fuzzy matching currently provided by fuzzywuzzy should be switched to rapidfuzz. The latter has a more liberal license and runs faster.

UserWarning: [W036] The component 'matcher' does not have any patterns defined. matches = matcher(doc)

Thanks a lot for your fabulous package; it is really helpful. However, when I tried to reproduce your results, I ran into this error:

"UserWarning: [W036] The component 'matcher' does not have any patterns defined. matches = matcher(doc)"

Specifically, I used this code snippet:

import spacy
from spacy.pipeline import EntityRuler
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("en_core_web_sm")

entity_ruler = nlp.add_pipe("entity_ruler", before="ner")
entity_ruler.add_patterns(
    [{"label": "GPE", "pattern": "Nashville"}, 
     {"label": "GPE", "pattern": "TN"}]
)

spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner") #spaCy v3 syntax
spaczz_ruler.add_patterns(
    [
        {
            "label": "NAME",
            "pattern": "Grant Andersen",
            "type": "fuzzy",
            "kwargs": {"fuzzy_func": "token_sort"},
        },
    ]
)

When I add patterns to spaCy's EntityRuler it works OK; when I add patterns to the SpaczzRuler, it throws the error specified above.

I am using Python 3.9, spaCy 3.2.2 on Ubuntu 16.04.

Any help will be highly appreciated.

Add pattern after adding to spacy pipeline taking long time and memory

There are 1 million patterns I am trying to add. On adding them to a blank spaCy model:

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.blank('en')
spaczz_ruler = nlp.add_pipe("spaczz_ruler") # spaCy v3 syntax
spaczz_ruler.add_patterns(patterns)

it takes 8 GB of RAM and inference time is around 28 seconds.

If I try to add the SpaczzRuler to the current ner pipeline using

spaczz_ruler = nlp.add_pipe("spaczz_ruler", before="ner") # spaCy v3 syntax

it takes even more RAM and time; it fails even with 32 GB of RAM.

patterns = [{"label": "NAME", "pattern": "Grant Andersen", "type": "fuzzy", "kwargs": {"min_r2": 90}}]

Threshold fuzzy ratio using FuzzyMatcher

I have a custom component in a spaCy pipeline where I am using the FuzzyMatcher. The tutorial does a good job showing how to implement a threshold using the spaczz_ruler but it is less clear how to do this using the FuzzyMatcher. I am struggling to implement a threshold system where a user can configure a threshold. The following system is not effective. What would be a better way to implement a threshold in this design pattern?

from spacy.language import Language
from spacy.tokens import Span, Token
from spaczz.matcher import FuzzyMatcher


class TermPipeline(Component):

    def __init__(self, nlp, term_list, fuzzy_threshold=75):
        self.fuzzy_threshold = fuzzy_threshold
        self.term_list = term_list
        self.label_hash = nlp.vocab.strings['TERM_MATCH']
        self.matcher = FuzzyMatcher(nlp.vocab, attr="LEMMA")
        Token.set_extension("parent_term", force=True, default=None)
        if isinstance(self.term_list[0], dict):
            patterns = [nlp(text['term']) for text in term_list]
            # Creates the term word patterns and adds them to the matcher
        else:
            patterns = [nlp(text) for text in term_list]

        self.matcher.add('TerminologyList', on_match=self.add_event_ent, patterns=patterns)


    def __call__(self, doc):
        matches = self.matcher(doc)
        if isinstance(self.term_list[0], dict):
            for _, start, end, ratio in matches:
                if self.fuzzy_threshold <= ratio:
                    entity = Span(doc, start, end, label=self.label_hash)
                    for term in self.term_list:
                        if term['term'].lower() == entity.text.lower():
                            doc[start]._.set('parent_term', term['parent_term'])
        return doc


@Language.factory("term_component", default_config={})
def create_term_component(nlp, name, term_list, fuzzy_threshold):
    return TermPipeline(nlp, term_list, fuzzy_threshold)
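One possible alternative to post-filtering is to let the searcher apply the cutoff itself. This is a sketch, assuming FuzzyMatcher.add accepts a kwargs list parallel to patterns with a min_r2 cutoff, as the matcher docs suggest; verify against the current API before relying on it:

```python
# Hypothetical sketch: build a per-pattern kwargs list so the searcher
# drops matches below the threshold instead of filtering afterwards.
fuzzy_threshold = 75
terms = ["first term", "second term"]  # stand-ins for term_list entries
per_pattern_kwargs = [{"min_r2": fuzzy_threshold} for _ in terms]
# In TermPipeline.__init__ this would replace the plain add call:
# self.matcher.add('TerminologyList', patterns,
#                  kwargs=per_pattern_kwargs, on_match=self.add_event_ent)
```

This keeps the user-configurable threshold but moves the check into the matcher, so __call__ never sees sub-threshold matches at all.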


SpaczzRuler ID

is it possible to add an "id" to the pattern like you can with original spacy api? https://spacy.io/usage/rule-based-matching#entityruler-ent-ids

https://github.com/gandersen101/spaczz#spaczzruler

p = {
    "label": entity_type,
    "pattern": d[col],
    "id": d["id"],
    "type": "fuzzy"
}

Pattern: {'label': 'PERSON', 'pattern': 'DrDisrespect', 'id': '148265b0-847b-414c-9f8e-916561412c55', 'type': 'fuzzy'}
Text: "Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options"

This fails to find Dr Disrespect and I'm unable to print the id.


data = [{
    "label": ent.label_,
    "name": ent.text,
    "id": ent.ent_id_
} for ent in doc.ents]

print(data)

I need this feature because when I find the entity I need to match it to an id stored in my database.

Aside: is it also possible to limit the fuzzy search edit distance to 1 character?

Comparison method(s) for fuzzy ratios and fuzzy regex counts

In order to sort SpaczzRuler matches across the different matchers by quality, I need a method or methods for comparing fuzzy ratios (ints between 0 and 100) and fuzzy regex counts (tuples of insertion, deletion and substitution counts). Standardizing the fuzzy regex counts back to a ratio is probably the most effective method at first thought.
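A minimal sketch of the standardization idea: treat the summed edit counts as a fraction of the pattern length and map that to a 0-100 ratio. The function name and the exact normalization are illustrative, not spaczz API:

```python
def counts_to_ratio(counts, pattern_len):
    """Convert fuzzy-regex edit counts to a 0-100 ratio.

    counts: (insertions, deletions, substitutions) tuple.
    pattern_len: length of the pattern in characters.
    Clamped at 0 so heavy edits never produce a negative ratio.
    """
    if pattern_len == 0:
        return 0
    edits = sum(counts)
    return max(0, round(100 * (1 - edits / pattern_len)))

print(counts_to_ratio((0, 0, 0), 10))  # exact match -> 100
print(counts_to_ratio((1, 0, 1), 10))  # two edits   -> 80
```

With both match types expressed as ints on the same 0-100 scale, SpaczzRuler matches can be sorted by quality with a single key.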

Installing spaczz with successful RapidFuzz installed

Similar to another post, but spaczz still doesn't install even after successfully installing RapidFuzz separately.
This is what I get:
cl : Command line warning D9025 : overriding '/W3' with '/W4'
cpp_process.cpp
src/cpp_process.cpp(16): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.34.31933\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects

RegexMatcher: Match Captures?

I am able to get this regex working using below code.

import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = "Hello how are you? Proficiency in ETL tools like Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview"
doc = nlp(text)

matcher = RegexMatcher(nlp.vocab)
matcher.add(
    "SKILL",
    [
        r"""(?i)proficiency in ([\w\s]+) tools like (.*$)"""
    ],
)  
matches = matcher(doc)

for match_id, start, end, counts in matches:
    print(match_id, doc[start:end], counts)

And I get the matched sentence as output as expected.
However, I am unsure if there is a way I can get access to the match captures ([\w\s]+) and (.*$). Looking for any suggestions or advice.
Once I get the matched result/sentence, I would like to access the match captures: ETL and Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview.
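One workaround (a sketch, not a documented spaczz feature): since the matcher gives you back the matched span, you can re-apply the same pattern to the span's text to recover the groups. The stdlib re module is used here for illustration; spaczz itself uses the third-party regex library, which has the same search/group API for this purpose.

```python
import re

pattern = r"(?i)proficiency in ([\w\s]+) tools like (.*$)"
# Stand-in for doc[start:end].text from the matcher output:
span_text = ("Proficiency in ETL tools like Informatica, Talend, Alteryx "
             "and Visualization tools like PowerBi, Tableau and Qlikview")

# Re-run the pattern on the matched text to pull out the capture groups.
m = re.search(pattern, span_text)
if m:
    skill_area, tools = m.groups()
    print(skill_area)  # -> ETL
    print(tools)       # -> the full tools list after "tools like"
```

This is redundant work (the pattern runs twice) but keeps the spaczz matching loop unchanged.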

IndexError: [E201] Span index out of range.

The fuzzy matcher is unable to process matcher(doc). It was working 48 hours ago; it's not working now.

File "xxxxxxxxxxxxxxxxxxxxxxxxxx", line 46, in pattern_matcher
  matched_by_fuzzy_phrase = matcher_fuzzy(doc)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/matcher/fuzzymatcher.py", line 105, in __call__
  matches_wo_label = self.match(doc, pattern, **kwargs)
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 216, in match
  matches_w_nones = [
File "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 217, in <listcomp>
  self._adjust_left_right_positions(
File "/home/aravind/nlu_endpoint/NLUSQL_ENV3/lib/python3.8/site-packages/spaczz/fuzz/fuzzysearcher.py", line 326, in _adjust_left_right_positions
  r = self.compare(query.text, doc[bp_l:bp_r].text, fuzzy_func, ignore_case)
File "span.pyx", line 503, in spacy.tokens.span.Span.text.__get__
File "span.pyx", line 190, in spacy.tokens.span.Span.__getitem__

IndexError: [E201] Span index out of range.

Op + does only match 1 token

I'm using the SpaczzRuler pipeline as specified here to detect companies based on patterns. It is a very simple pipeline in which I'm trying to match uppercase tokens, but when using the + operator it matches only one uppercase token, not as many as possible. The documentation says:
+ | Require the pattern to match 1 or more times.

However, if using the * operator it indeed matches all possible times, as expected.

How to reproduce the behaviour

import spacy
from spaczz.pipeline import SpaczzRuler

model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}], 
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)

doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (MARMG S.L.,)

model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {"label": "COMPANY", 'pattern': [
        {"IS_UPPER": True, "OP": "*"}, {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": "S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"}], 
     "type": "token", "id": "COMPANY SL"}
])
model.add_pipe(spaczz_ruler)

doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (LARGO AND MARMG S.L.,)

Your Environment

Info about spaCy

  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.8.6
  • spaCy version: 2.3.5
  • spaczz version: 0.5.0

Compare strings stripping accents / case sensitivity

First of all, thanks for the library @gandersen101 . I'm starting using it and it's really powerful.

Using SpaczzRuler with fuzzy patterns, by default it compares strings in a case-insensitive way. Is there a way of changing this behaviour?

Similarly, is there a way of comparing strings without taking accents into account? That is, making "test" equivalent to "tést". It could be hacked by changing the string to an accent-stripped version of it (since that maintains the token structure), but maybe there is an easier way.

import sys
import spacy
import spaczz
from spaczz.pipeline import SpaczzRuler

print(f"{sys.version = }")
print(f"{spacy.__version__ = }")
print(f"{spaczz.__version__ = }")


nlp = spacy.blank("en")

fuzzy_ruler = SpaczzRuler(nlp, name="test_ruler")
fuzzy_ruler.add_patterns([{"label" : "TEST", 
            "pattern" : "test", 
            "type": "fuzzy",}])


doc = fuzzy_ruler(nlp("this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst"))
print(f"\nText:\n{doc}\n")
print("Fuzzy Matches:")
for ent in doc.ents:
    if ent._.spaczz_type == "fuzzy":
        print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))

Output

sys.version = '3.9.0 (default, Nov 15 2020, 06:25:35) \n[Clang 10.0.0 ]'
spacy.__version__ = '3.0.6'
spaczz.__version__ = '0.5.2'

Text:
this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst

Fuzzy Matches:
('test', 3, 4, 'TEST', 100)
('TEST', 9, 10, 'TEST', 100)
('tast', 13, 14, 'TEST', 75)
('TesT', 18, 19, 'TEST', 100)
('tést', 20, 21, 'TEST', 75)
('tëst', 22, 23, 'TEST', 75)
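Until/unless spaczz exposes accent handling directly, a common workaround (a stdlib-only sketch, not a spaczz feature) is the preprocessing hack the issue itself suggests: strip accents before matching, which preserves token boundaries because each accented character maps to one base character.

```python
import unicodedata

def strip_accents(text):
    """Replace accented characters with their unaccented base characters.

    NFD-decompose so accents become separate combining marks, drop the
    combining marks, then recompose. String length in tokens is unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("tést"))  # -> test
print(strip_accents("tëst"))  # -> test
```

Running the ruler over nlp(strip_accents(text)) would then score "tést" and "tëst" as exact matches for "test"; note the entity texts reported back are the stripped forms.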

Add spaCy 3 compatibility

Currently spaczz does not install alongside the new spaCy 3 release. Attempting to upgrade a project to spaCy 3 while also using spaczz gives the following Pip error message:

The conflict is caused by:
    The user requested spacy==3.0.3
    spaczz 0.4.1 depends on spacy<3.0.0 and >=2.2.0

Refactor Search

There is a lot of redundancy between fuzzy search and similarity search - they both essentially use the same search algorithm. They should inherit the same base and only make the small tweaks necessary.

Handling the same token in different categories

This is not a report, but a question. If I have the same token with two different labels, how will spaczz handle it? The question comes because spacy seems to pick the label unpredictably: explosion/spaCy#6752

Questions:

a) is it possible to get both matches somehow? I'm interested in getting a list of all matches of a LABEL sometimes, and the "best ones" in other cases, to some definition of BEST :)
b) if I can't get both, it is possible to get a callback to decide myself what to do?

Possible infinite loop

Running my tests with spaczz@master they seem to get into an infinite loop at the nlp() call. Stack dumps:

  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 217, in _optimize
    r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
  File "doc.pyx", line 308, in spacy.tokens.doc.Doc.__getitem__
  File "/usr/lib64/python3.8/site-packages/spacy/util.py", line 491, in normalize_slice
    if not (step is None or step == 1):

another ctrl-c during another run:

   self._doc = nlp(text)
  File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
    for fuzzy_match in self.fuzzy_matcher(doc):
  File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
    matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
    matches_w_nones = [
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
    self._optimize(
  File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 205, in _optimize
    rl = self.compare(query, doc[p_l : p_r - f], *args, **kwargs)
  File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 109, in compare
    return round(self._fuzzy_funcs.get(fuzzy_func)(a_text, b_text))

Adding POS tagging while building pattern for Spaczzruler

Hello, I am really liking Spaczz, to fuzzy match entity patterns.

Quick question: is there a way to add, for example, POS tagging constraints as well? For example: I want to extract only noun phrases of "AS", but the fuzzy match is also getting me "as" from "as above function".
Here i is each string from the list of vocab to match:
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}

SpaczzRuler configuration

Hi,
I could not find a way to set the various matching parameters using spacy 2, in the SpaczzRuler documentation https://github.com/gandersen101/spaczz/blob/master/examples/fuzzy_matching_tweaks.md there is only the spacy 3 syntax.

I also found this #18 and I tried to applied using the following:

ruler = SpaczzRuler(self.nlp, spaczz_fuzzy_defaults={'min_r2': 98, 'min_r1': 95, 'flex': 2})

but changing parameters didn't seem to change anything.

Is that correct, or am I missing something?

Thank you for the cool library!

Add Span Start/End Trimming Class Functionality

Add pipeline component to "clean" entities after setting (primarily intended for spaczz entities). I.e. if punctuation is included at the start/end of a fuzzy matched entity the span can be trimmed to remove punctuation.

Will also require registering a custom span attribute on entities created through spaczz.
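A rough sketch of the trimming logic on plain token strings (illustrative only; the real component would operate on spaCy Span objects and token.is_punct, and the helper name here is made up):

```python
# Characters treated as punctuation for this sketch.
PUNCT = {",", ".", ";", ":", "!", "?", "(", ")", "'", '"'}

def trim_span(tokens, start, end):
    """Shrink the [start, end) token range to exclude leading/trailing punctuation."""
    while start < end and tokens[start] in PUNCT:
        start += 1
    while end > start and tokens[end - 1] in PUNCT:
        end -= 1
    return start, end

tokens = ["(", "Grant", "Andersen", ")", "."]
print(trim_span(tokens, 0, 5))  # -> (1, 3), i.e. just "Grant Andersen"
```

The pipeline component would apply this to each spaczz-created entity and rebuild the Span from the trimmed indices.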

install spaczz

I tried pip install spaczz but got the following build error:

clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -Iextern/rapidfuzz-cpp/ -Icapi/ -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -c src/cpp_process.cpp -o build/temp.macosx-10.9-universal2-cpython-311/src/cpp_process.o -O3 -std=c++14 -Wextra -Wall -Wconversion -g0 -DNDEBUG -stdlib=libc++ -mmacosx-version-min=10.9 -DVERSION_INFO="1.9.1"
src/cpp_process.cpp:253:12: fatal error: 'longintrepr.h' file not found
#include "longintrepr.h"
^~~~~~~~~~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for rapidfuzz
Failed to build rapidfuzz
ERROR: Could not build wheels for rapidfuzz, which is required to install pyproject.toml-based projects

Speed up the detection process

First of all, really appreciate your work and time.

With small input data patterns it does a good job, but when the input patterns exceed 100,000 (1 lakh) it takes too much time. Is there any possibility to speed up the process (maybe using a GPU)?

Fuzzy Match of Term Combinations

My task is querying medical texts for institute names using a rule as below:
[{'ENT_TYPE': 'institute_name'}, {'TEXT': 'Hospital'}]
The rule will extract the hospital name only if it is bound by the word 'Hospital', including for example "Mount Sinai Hospital" but excluding "Mount Sinai".
spaczz works great for a single term or phrase, but I did not see an option to build multi-word rules as in the rule above.
Can I use spaczz to identify typos for this entity, for example, "Mount Sinai Mospital"?
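If I read the token-matching examples right, spaczz token patterns extend spaCy's with a "FUZZY" predicate, so a multi-token rule that also tolerates typos in "Hospital" might be sketched as below (the label and ENT_TYPE value are assumptions from the question, and the exact predicate name should be checked against the current docs):

```python
# Sketch: a spaczz "token"-type pattern combining an entity-type constraint
# with a fuzzy text constraint, so "Mount Sinai Mospital" could still
# satisfy the Hospital slot. FUZZY is spaczz's token-pattern extension.
pattern = {
    "label": "INSTITUTE",
    "pattern": [
        {"ENT_TYPE": "institute_name", "OP": "+"},
        {"TEXT": {"FUZZY": "Hospital"}},
    ],
    "type": "token",
}
# ruler.add_patterns([pattern])  # added to a SpaczzRuler as usual
```

The ENT_TYPE slot still requires an upstream component to assign the institute_name label, exactly as in the spaCy-only rule.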
