Giter Club home page Giter Club logo

esmre's Introduction

esmre - Efficient String Matching Regular Expressions
=====================================================

esmre is a Python module that can be used to speed up the execution of a large
collection of regular expressions. It works by building a index of compulsory
substrings from a collection of regular expressions, which it uses to quickly
exclude those expressions which trivially do not match each input.

Here is some example code that uses esmre:

>>> import esmre
>>> index = esmre.Index()
>>> index.enter(r"Major-General\W*$", "savoy opera")
>>> index.enter(r"\bway\W+haye?\b", "sea shanty")
>>> index.query("I am the very model of a modern Major-General.")
['savoy opera']
>>> index.query("Way, hay up she rises,")
['sea shanty']
>>> 

The esmre module builds on the simpler string matching facilities of the esm
module, which wraps a C implementation some of the algorithms described in
Aho's and Corasick's paper on efficient string matching [Aho, A.V, and
Corasick, M. J. Efficient String Matching: An Aid to Bibliographic Search.
Comm. ACM 18:6 (June 1975), 333-340]. Some minor modifications have been made
to the algorithms in the paper and one algorithm is missing (for now), but
there is enough to implement a quick string matching index.

Here is some example code that uses esm directly:

>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>> 

You can see more usage examples in the tests.

esmre's People

esmre's Issues

Goto function lookup is inefficient

The goto function for each state in the automaton is stored as a linked
list. The list has to be scanned whenever the function is evaluated. This
is actually quite efficient for most states, as they typically have very
few transitions, but for the root state, where every possible transition is
populated, scanning a list is inefficient.

The goto function should use a better mechanism for evaluating to goto
function, at least for the root state.

Original issue reported on code.google.com by [email protected] on 28 Jan 2008 at 5:47

enter() method does not handle unicode

What steps will reproduce the problem?
1. index = esmre.Index()
2. index.enter(u'espa\u00f1ol','language')
3.

What is the expected output? What do you see instead?
I would expect to be able to use unicode strings as the regex. Instead, I am 
getting a UnicodeEncodeError.

What version of the product are you using? On what operating system?
version 0.3.1 on Mac OS X 10.9.2

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 12 Sep 2014 at 9:01

Source distribution does not include AUTHORS, COPYING, or INSTALL.

What steps will reproduce the problem?
1. Build a source distribution with python setup.py sdist
2. Look in the tarball in dist/
3. Note that AUTHORS, COPYING, and INSTALL are missing.

What is the expected output? What do you see instead?

Those file should be included in the tarball.

Original issue reported on code.google.com by [email protected] on 28 Jan 2008 at 3:02

Hints extraction doesn't understand repetition

What steps will reproduce the problem?

{{{
>>> import esmre
>>> i = esmre.Index()
>>> i.enter(r"Ar{1-10}, me harties!", "Hello")
>>> i.query("Arrrr, me harties!")
[]
>>> 
}}}

What is the expected output? What do you see instead?

{{{
>>> import esmre
>>> i = esmre.Index()
>>> i.enter(r"Ar{1-10}, me harties!", "Hello")
>>> i.query("Arrrr, me harties!")
['Hello']
>>>
}}}

Original issue reported on code.google.com by [email protected] on 29 Feb 2008 at 6:57

Attachments:

PKG-INFO lacks fields.

What steps will reproduce the problem?

1. Build a source distribution with python setup.py sdist.
2. Unpack the generated tarball and inspect its PKG-INFO.
3. Note that the 'Home-page', 'License', and 'Platform' fields are set to
UNKNOWN.

What is the expected output? What do you see instead?

Those fields should be as follows:

Home-page: http://code.google.com/p/esmre/
License: GNU LGPL
Platform: POSIX

Original issue reported on code.google.com by [email protected] on 28 Jan 2008 at 3:06

No acceleration when using alternation operator

The Index won't extract hints from expressions that use the | operator. It
would be better if the object was put into the underlying esm index once
for each term in the alternation. Eg. Index().enter(r'red|green|blue',
'primary') would enter 'primary' under 'red', 'green', and 'blue' in the
esm index.

What steps will reproduce the problem?

>>> import esmre
>>> i = esmre.Index()
>>> i.enter(r"egg|bacon|sausage|spam|tomato|baked beans|lobster thermidor",
...         "ingredients")
>>> i.hintless_objects
['ingredients']
>>> 


What is the expected output? What do you see instead?

>>> import esmre
>>> i = esmre.Index()
>>> i.enter(r"egg|bacon|sausage|spam|tomato|baked beans|lobster thermidor",
...         "ingredients")
>>> i.hintless_objects
[]
>>> 


Original issue reported on code.google.com by [email protected] on 3 Mar 2008 at 5:04

PKG-INFO description has one word per line.

What steps will reproduce the problem?

1. Build a source distribution with python setup.py sdist.
2. Unpack the generated tarball and inspect its PKG-INFO.
3. Note that the 'Description' field is split on each word.

What is the expected output? What do you see instead?

The description field should be formatted as one line, or be properly
justified.

Original issue reported on code.google.com by [email protected] on 28 Jan 2008 at 3:08

Named grouping not supported

Currently esmre doesn't support regular expressions with named groups.
For example hints(...) function works well with regexp :
[0-3][0-9]/[0-1][0-9]/[1-2][0-9]{3}
but returns nothing (?P<date>[0-3][0-9]/[0-1][0-9]/[1-2][0-9]{3})

Original issue reported on code.google.com by [email protected] on 26 Sep 2008 at 7:37

Invalid regex match

Esmre finds invalid regex matches that the vanilla re module shows do not exist.

The following unittest should pass, but does not due to the invalid match esmre 
finds.

import unittest
import re
import esmre

class Test(unittest.TestCase):

    def test_match(self):

        pattern1 = "what\s+cities\s+are\s+located\s+in\s+[a-zA-Z]+\?"
        pattern2 = "what\s+cities\s+are\s+near\s+[a-zA-Z]+\?"
        question1 = "what cities are located in China?"
        question2 = "what cities are near China?"

        # Build index.
        index = esmre.Index()
        index.enter(pattern1, 1)
        index.enter(pattern2, 2)

        # These pass
        self.assertEqual(re.findall(pattern1, question1), [question1])
        self.assertEqual(re.findall(pattern1, question2), [])
        self.assertEqual(re.findall(pattern2, question2), [question2])
        self.assertEqual(re.findall(pattern2, question1), [])

        # These fail?
        self.assertEqual(index.query(question1), [1])
        self.assertEqual(index.query(question2), [2])

if __name__ == '__main__':
    unittest.main()

Original issue reported on code.google.com by [email protected] on 20 Nov 2012 at 5:17

Tests not included in source distribution

What steps will reproduce the problem?

1. python setup.py sdist
2. tar ztf dist/esmre-0.2.1.tar.gz | grep test

What is the expected output?

esmre-0.2.1/test/
esmre-0.2.1/test/esm_tests.py
esmre-0.2.1/test/esmre_tests.py

What do you see instead?

No results.

Original issue reported on code.google.com by [email protected] on 20 Mar 2008 at 2:55

python-3 compatible package?

Hi Folks, in my effort to get the package in debian I noticed the package is 
almost ready for python3 but not fully ready.

I'm not a python programmer, I tried to make a patch that makes it build 
correctly 
--- python-esmre-0.3.1.orig/src/esm.c
+++ python-esmre-0.3.1/src/esm.c
@@ -21,6 +21,12 @@ Foundation, Inc., 51 Franklin Street, Fi

 #include "aho_corasick.h"

+
+// to support python 2.5 and earlier
+#ifndef Py_TYPE
+    #define Py_TYPE(ob) (((PyObject*)(ob))->ob_type)
+#endif
+
 typedef struct {
     PyObject_HEAD
     ac_index*   index;
@@ -35,7 +41,7 @@ decref_result_object(void* item, void* d
 static void
 esm_Index_dealloc(esm_IndexObject* self) {
     ac_index_free(self->index, decref_result_object);
-    self->ob_type->tp_free((PyObject*) self);
+    Py_TYPE(self)->tp_free ((PyObject*) self);
 }

 static PyObject*


something seems still needing to be addressed
src/esm.c:191:5: warning: missing braces around initializer [-Wmissing-braces]
     PyObject_HEAD_INIT(NULL)
     ^
src/esm.c:191:5: warning: (near initialization for 
'esm_IndexType.ob_base.ob_base') [-Wmissing-braces]
src/esm.c:193:5: warning: initialization makes integer from pointer without a 
cast
     "esm.Index",                        /*tp_name*/
     ^
src/esm.c:193:5: warning: (near initialization for 'esm_IndexType.tp_basicsize')
src/esm.c:196:5: warning: initialization from incompatible pointer type
     (destructor) esm_Index_dealloc,     /*tp_dealloc*/
     ^
src/esm.c:196:5: warning: (near initialization for 'esm_IndexType.tp_print')
src/esm.c:211:5: warning: initialization makes pointer from integer without a 
cast
     Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE, /*tp_flags*/
     ^
src/esm.c:211:5: warning: (near initialization for 'esm_IndexType.tp_doc')
src/esm.c:212:5: warning: initialization from incompatible pointer type
     "Index() -> new efficient string matching index",  /* tp_doc */
     ^
src/esm.c:212:5: warning: (near initialization for 'esm_IndexType.tp_traverse')
src/esm.c:219:5: warning: initialization from incompatible pointer type
     esm_Index_methods,             /* tp_methods */
     ^
src/esm.c:219:5: warning: (near initialization for 'esm_IndexType.tp_members')
src/esm.c:220:5: warning: initialization from incompatible pointer type
     esm_Index_members,             /* tp_members */
     ^
src/esm.c:220:5: warning: (near initialization for 'esm_IndexType.tp_getset')
src/esm.c:227:5: warning: initialization from incompatible pointer type
     (initproc) esm_Index_init,      /* tp_init */
     ^
src/esm.c:227:5: warning: (near initialization for 'esm_IndexType.tp_alloc')
src/esm.c:229:5: warning: initialization from incompatible pointer type
     esm_Index_new,                 /* tp_new */
     ^
src/esm.c:229:5: warning: (near initialization for 'esm_IndexType.tp_free')
src/esm.c: In function 'initesm':
src/esm.c:245:9: warning: 'return' with no value, in function returning 
non-void [-Wreturn-type]
         return;
         ^
src/esm.c:247:5: warning: implicit declaration of function 'Py_InitModule3' 
[-Wimplicit-function-declaration]
     m = Py_InitModule3("esm", esm_methods,
     ^
src/esm.c:247:7: warning: assignment makes pointer from integer without a cast
     m = Py_InitModule3("esm", esm_methods,
       ^
src/esm.c:251:9: warning: 'return' with no value, in function returning 
non-void [-Wreturn-type]
         return;
         ^
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes 
-g -O2 -fstack-protector-strong -Wformat -Werror=format-security 
-D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.4m -c src/aho_corasick.c -o 
build/temp.linux-x86_64-3.4/src/aho_corasick.o

--------------

running install_scripts
  File "/usr/lib/python3.4/dist-packages/esmre.py", line 249
    raise TypeError, "enter() cannot be called after query()"
                   ^
SyntaxError: invalid syntax


Original issue reported on code.google.com by [email protected] on 19 Feb 2015 at 10:16

Source distributions don't work.

What steps will reproduce the problem?
1. Download and unpack a source distribution.
2. Try to build with python setup.py build

What is the expected output?

The package builds.

What do you see instead?

Build errors like the following:
src/esm.c:22:26: error: aho_corasick.h: No such file or directory 

Original issue reported on code.google.com by [email protected] on 20 Mar 2008 at 11:53

Can't enter expressions after query

When I try to add a new expression to an esmre index, I get an exception:

>>> import esmre
>>> index = esmre.Index()
>>> index.enter("spam", "spam")
>>> index.query("100 tonnes of spam")
['spam']
>>> index.enter("eggs", "eggs")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/e
smre.py",
line 140, in enter
    raise TypeError, "enter() cannot be called after query()"
TypeError: enter() cannot be called after query()
>>> 

I'd like to be able to add expressions after I've queried:

>>> import esmre
>>> index = esmre.Index()
>>> index.enter("spam", "spam")
>>> index.query("100 tonnes of spam")
['spam']
>>> index.enter("eggs", "eggs")
>>> index.query("Spam, spam, spam, eggs, and spam")
['spam', 'spam', 'spam', 'eggs', 'spam']
>>> 


Original issue reported on code.google.com by [email protected] on 28 Jan 2008 at 2:34

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.