trex's Introduction

Efficient string matching with regular expressions

This package includes a pure Python function that represents a set of strings as a single regular expression. With this regular expression, you can perform various operations, such as replacing, extracting and matching keywords. The name of the package comes from the internal trie used to build the regular expression (TRie to REgeX).
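To make the trie-to-regex idea concrete, here is a minimal sketch using only the standard library. It is not trrex's actual implementation: the helper names `build_trie`, `trie_to_regex` and `make_pattern` are hypothetical, and the real package may produce a differently shaped (for example, character-class based) pattern.

```python
import re


def build_trie(words):
    """Insert each word into a nested-dict trie; the key '' marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment with shared prefixes."""
    is_end = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word ends here, everything that follows is optional.
    return body + "?" if is_end else body


def make_pattern(words):
    """Wrap the trie-derived fragment in word boundaries."""
    return r"\b" + trie_to_regex(build_trie(words)) + r"\b"


pattern = make_pattern(["baby", "bat", "bad"])
print(pattern)
print(re.findall(pattern, "The baby was scared by the bad bat."))
```

The shared prefix "ba" is written once, so the regex engine does not have to re-test it for every keyword, which is the intuition behind building the pattern from a trie.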

Install trrex

Use pip:

pip install trrex

Usage

import trrex as tx
import re

pattern = tx.make(['baby', 'bat', 'bad'])
hits = re.findall(pattern, 'The baby was scared by the bad bat.')
# hits = ['baby', 'bad', 'bat']  (matches are returned in order of appearance)

pandas

import trrex as tx
import pandas as pd

frame = pd.DataFrame({
    "txt": ["The baby", "The bat"]
})
pattern = tx.make(['baby', 'bat', 'bad'], prefix=r"\b(", suffix=r")\b") # need to specify capturing groups
frame["match"] = frame["txt"].str.extract(pattern)
hits = frame["match"].tolist()
print(hits)
# hits = ['baby', 'bat']

Why use trrex?

  • trrex builds a better regex pattern than a simple regex union, so searching (and replacing) strings is about 300 times faster than with a regex union pattern, and about 2.5 times faster than the FlashText algorithm. See below for a performance comparison:

[Figure: performance comparison]

  • Plays well with others: it integrates easily with pandas, spaCy and any regex engine. See the documentation for examples.
  • Pure Python, no other dependencies
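For readers who want to check the speed claim on their own data, the following is a rough measurement sketch, not the benchmark behind the figures quoted above. The helpers `union_pattern` and `time_findall` are hypothetical names, and the factored pattern is a hand-written stand-in for a trie-derived one; in practice you would pass the pattern returned by `tx.make`. With only three keywords the difference is likely negligible, so no timing result is claimed here.

```python
import re
import timeit


def union_pattern(words):
    """Plain regex union, longest words first so longer alternatives win."""
    escaped = sorted((re.escape(w) for w in words), key=len, reverse=True)
    return r"\b(?:" + "|".join(escaped) + r")\b"


def time_findall(pattern, text, repeat=20):
    """Seconds taken to run findall `repeat` times with a precompiled pattern."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.findall(text), number=repeat)


words = ["baby", "bat", "bad"]
text = "The baby was scared by the bad bat. " * 1000

union = union_pattern(words)
factored = r"\bba(?:by|[dt])\b"  # assumption: stand-in for a trie-derived pattern

# Both patterns must find exactly the same keywords before timing them.
same = re.findall(union, text) == re.findall(factored, text)
print("same matches:", same)
print("union   :", time_findall(union, text))
print("factored:", time_findall(factored, text))
```

To get meaningful numbers, use a keyword list and a text of realistic size for your workload.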

Issues

If you have any issues with this repository, please don't hesitate to raise them. It is actively maintained, and we will do our best to help you.

Acknowledgments

This project is based on the following resources:

Liked the work?

If you've found this repository helpful, why not give it a star? It's an easy way to show your appreciation and support for the project. Plus, it helps others discover it too!

trex's People

Contributors

mesejo

trex's Issues

Extra closing brackets are getting generated

Hello Developer,

The call below generates an extra ")".

code:
import trrex as tx
tx.make(['IBS(.|)TECH','IBS(.|)SOL','IBS(.|)TEHK'], prefix=r"\b(", suffix=r")")

output='\b(IBS(.(?:|)SOL|*|)TE(?:HK|CH)))'

Please let me know if any further information is required on the same.

Thanks,
Rangam
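A quick way to confirm that a generated pattern has unbalanced brackets is to try compiling it with the standard re module. This is a minimal sketch; `is_valid_regex` is a hypothetical helper, not part of trrex, and the two patterns are short illustrations rather than the exact output reported above.

```python
import re


def is_valid_regex(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(is_valid_regex(r"\b(IBS)"))   # balanced brackets
print(is_valid_regex(r"\b(IBS))"))  # one extra ")"
```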

Add Documentation

Add missing documentation detailing how to integrate trrex with different libraries. The documentation should use the pydata-sphinx-theme.

Remove compile function

The compile function is only a thin wrapper around Python's re module. It is better to remove it now and add it back later if needed.

Remove automatic escaping of characters of input words

What is happening?
Currently we escape every character of the input words even if they do not need escaping.

What should happen?
Remove this behavior because it prevents the end user from adding patterns and also it makes the processing slow.
Document how to escape regex characters.
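As a starting point for that documentation, literal words can be escaped with the standard library's `re.escape` before they are combined into a pattern. The sketch below uses a plain union to stay self-contained; passing the escaped words to `tx.make` instead is an assumption about how the package would be used once automatic escaping is removed.

```python
import re

words = ["C++", "a.b", "cost ($)"]

# Escape regex metacharacters so each word is matched literally.
escaped = [re.escape(w) for w in words]
pattern = re.compile("|".join(escaped))

print(pattern.findall("C++ and a.b but not axb; cost ($)"))

# Without escaping, "." matches any character, so "a.b" also matches "axb".
print(re.findall("a.b", "axb"))
```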

List of regex patterns possible?

First, great package. Love using it! Makes my life much easier, and the speed is phenomenal!

I find that it works with a list of words, such as ['Love', 'Hello', ' Book',...]

Does it work with a list of regex patterns?

For example

regex_patterns = [
    r'(?!^\d+)(?=.*)(\b\d+$\b)',              # Remove any numbers that end the string, e.g. "SHELL 545436"
    r'^\bSQ\b',                               # Remove "SQ" if it starts the string, e.g. "SQ NORDSTROM"
    r'(?!\b(ST|^\w{1,2})\b$)\b\w{1,2}\b$',    # Remove one- or two-character words at the end of the string, e.g. "Burger King CA", unless they equal "ST"
]

tx_maker = tx.make(regex_patterns, prefix=r"", suffix=r"")

# clean column
df['clean'] = df.STRING_COLUMN.str.replace(tx_maker, "", regex=True)
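Independently of whether trrex accepts full regex patterns, one way to apply several patterns in a single pass is to join them into one alternation with the standard re module. This is only a sketch under that assumption: `combined` and `clean` are hypothetical names, and the patterns are the ones from the question above.

```python
import re

regex_patterns = [
    r'(?!^\d+)(?=.*)(\b\d+$\b)',            # trailing numbers
    r'^\bSQ\b',                             # leading "SQ"
    r'(?!\b(ST|^\w{1,2})\b$)\b\w{1,2}\b$',  # trailing 1-2 character words, except "ST"
]

# Wrap each pattern in a non-capturing group so "|" cannot leak across patterns.
combined = re.compile("|".join(f"(?:{p})" for p in regex_patterns))


def clean(text):
    """Remove every match of the combined pattern and trim leftover spaces."""
    return combined.sub("", text).strip()


for sample in ["SHELL 545436", "SQ NORDSTROM", "Burger King CA", "Burger King ST"]:
    print(repr(clean(sample)))
```

With pandas, the same compiled pattern can be passed to `Series.str.replace(combined, "", regex=True)`.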
