maxharlow / csvmatch

179.0 10.0 22.0 121 KB

🔎 Finds fuzzy matches between CSV files

License: Other

Python 100.00%

data-matching entity-resolution fuzzy-matching record-linkage csv

csvmatch's Issues

Parameter to parse names

Using Nameparser? https://pypi.python.org/pypi/nameparser

Move to setup.cfg

A la https://news.ycombinator.com/item?id=18613806

Investigate speed/memory improvements

eg:

generators and generator comprehensions
cStringIO, and other C-versions
streaming output? (probably required before #11)
write some performance tests
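
As a rough illustration of the generator idea in the list above, the sketch below streams rows from a CSV source one at a time instead of loading the whole file into a list. The function name `stream_rows` is hypothetical and is not part of csvmatch.

```python
import csv
import io

def stream_rows(handle):
    """Yield one row (as a dict) at a time, so only the current row is held in memory."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield row

# Usage with an in-memory file; a real file opened with open(path, newline='')
# can be passed in the same way.
sample = io.StringIO("name,age\nAndy,32\nMary Jane,43\n")
names = [row["name"] for row in stream_rows(sample)]
print(names)
```

Matching still needs at least one of the two files to be indexed or held in memory, so streaming would mostly help on the output side and for the larger of the two inputs.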

Parameter to save Bilenko training to a file

The way csvlink and csvmatch do by default.

invalid value encountered in double_scalars module 'time' has no attribute 'clock'

Hello, in Python 3.8 on Windows, when you run the Bilenko match and enter the y/n response, you get the following errors:
invalid value encountered in double_scalars
module 'time' has no attribute 'clock'
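
For context, `time.clock` was deprecated in Python 3.3 and removed in Python 3.8, which would explain the second error if some code in the dependency chain still calls it. A minimal sketch of the usual replacement, `time.perf_counter`, follows. The `timed` helper is hypothetical and only illustrates the substitution.

```python
import time

def timed(func, *args):
    """Run func and return (result, elapsed seconds) measured with time.perf_counter."""
    start = time.perf_counter()  # stands in for the removed time.clock()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = timed(sum, range(1000))
print(result, elapsed >= 0)
```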

Parameter to specify what columns are outputted

Should be able to specify what columns are in the output, separately from the specification of which columns are used for matching.

Installation on Mac -- scikit-learn Issue

During the install process on my Mac, I seem to be getting an error when it comes to scikit-learn. Any advice?
Collecting scikit-learn (from dedupe==2.0.21->csvmatch)
Using cached scikit-learn-1.2.2.tar.gz (7.3 MB)
Installing build dependencies ... error
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [74 lines of output]
Ignoring numpy: markers 'python_version == "3.10" and platform_system == "Windows" and platform_python_implementation != "PyPy"' don't match your environment
Collecting setuptools
Using cached setuptools-67.7.2-py3-none-any.whl (1.1 MB)
Collecting wheel
Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Collecting Cython>=0.29.24
Using cached Cython-0.29.34-py2.py3-none-any.whl (988 kB)
Collecting oldest-supported-numpy
Using cached oldest_supported_numpy-2022.11.19-py3-none-any.whl (4.9 kB)
Collecting scipy>=1.3.2
Using cached scipy-1.10.1.tar.gz (42.4 MB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Installing backend dependencies: started
Installing backend dependencies: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'error'
error: subprocess-exited-with-error

Better deal with non-unique column names

Currently the data gets coalesced into a single column -- not ideal
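
One possible approach, sketched here as an assumption rather than as csvmatch's behaviour, is to rename repeated headers before building row dictionaries so that no column overwrites another. The `unique_headers` helper and the `_2`, `_3` suffix scheme are hypothetical.

```python
import csv
import io
from collections import Counter

def unique_headers(headers):
    """Append a numeric suffix to repeated header names: name, name_2, name_3, ..."""
    seen = Counter()
    result = []
    for header in headers:
        seen[header] += 1
        result.append(header if seen[header] == 1 else f"{header}_{seen[header]}")
    return result

source = io.StringIO("name,city,name\nAndy,Rome,Andrew\n")
reader = csv.reader(source)
headers = unique_headers(next(reader))
rows = [dict(zip(headers, row)) for row in reader]
print(rows)
```

This sketch does not guard against a file that already contains a literal `name_2` column; a real implementation would need to check for that collision.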

Parameter to include matching number in output

A number between 0 and 1 indicating how strong the match was.

Parameter to specify match threshold

Would only be relevant to algorithms that give a match degree number
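
To make the idea concrete, here is a minimal sketch of a match degree between 0 and 1 derived from Levenshtein distance, with a threshold applied to it. The function names and the default threshold of 0.6 are illustrative assumptions, not csvmatch's actual implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def degree(a, b):
    """Scale the distance into a similarity where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def is_match(a, b, threshold=0.6):
    """Keep a pair only if its degree reaches the (arbitrary) threshold."""
    return degree(a, b) >= threshold

print(levenshtein("kitten", "sitting"), is_match("kitten", "sitting", threshold=0.5))
```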

Issues with dedupe dependency because of Levenshtein-search

Hi,

Thank you for your project!

I use csvmatch for a project but started getting build issues since today or a few days ago. It seems like Levenshtein-search==1.4.5, which dedupe depends on, is no longer available on pip. Others created issues in their repository, but there has been no solution so far: mattandahalfew/Levenshtein_search#29

The dedupe package released a hotfix to no longer depend on Levenshtein-search==1.4.5, but to use that hotfix the dedupe dependency in csvmatch would need to be updated from 2.0.11 to 2.0.21. Would it be possible to update this dependency and create a new release?

Segmentation Fault: 11 when running fuzzy levenshtein

Hey, when I try to run csvmatch test1.csv test2.csv --fuzzy levenshtein in the terminal, it crashes my terminal and gives segmentation fault 11. I am using Python 2.7.11. Sometimes it outputs results in the terminal, sometimes it doesn't.

Matching across multiple fields

Hello,
It appears as though you don't combine the degrees of multiple fields and instead look at them individually. I'm looking to combine the degree from matching name and address. Any idea on how to accomplish that?
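
One way to combine per-field degrees outside the tool, sketched under the assumption that a weighted average is acceptable, is shown below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. The field names and weights are hypothetical.

```python
from difflib import SequenceMatcher

def field_degree(a, b):
    """Similarity of two strings in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_degree(row1, row2, weights):
    """Weighted average of the per-field degrees for the fields listed in weights."""
    total = sum(weights.values())
    return sum(weight * field_degree(row1[field], row2[field])
               for field, weight in weights.items()) / total

row1 = {"name": "Mary Jane", "address": "abc"}
row2 = {"name": "mary jane", "address": "xyz"}
score = combined_degree(row1, row2, {"name": 2, "address": 1})
print(round(score, 3))
```

A weighted average lets a strong name match compensate for a weak address match; taking the minimum of the field degrees instead would require every field to match well.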

Parameter to include original line numbers in output

Ignore_letters option

For the purpose of matching strings with helpful numbers and unhelpful words (such as precinct names with codes and messy names), adding an "ignore_letters" option would be nice. This would only match numbers from two columns of interest.

I implemented something like this in my code with:

import re

def ignore_alpha(row):
    # Strip every non-digit character (\D already covers letters and underscores)
    regex = re.compile(r'[\D_]+')
    return [regex.sub('', value) for value in row]

Speed up with memory mapping?

Could use something like Apache Arrow so files don't have to be kept in memory?

descriptor 'union' of 'set' object needs an argument

I am receiving "descriptor 'union' of 'set' object needs an argument" when using --fuzzy. This is happening after I hit f to finish matching records for the machine learning. Below is my syntax:

csvmatch file1.csv file2.csv --fields1 'Employee#' 'Last Name' 'First Name' --fields2 'Employee#' 'Last Name' 'First Name' --fuzzy > newfile.csv

Unrecognized arguments on Ubuntu

Hi Max,
I watched your talk at NICAR21. I'm Reinaldo, from Brazil.
Congratulations on the project!
I'm using a computer with Ubuntu 20.04 and Python 3.8.

I tested some commands, but only the first one worked:
csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name > billionaires-in-both.csv

The others always show the error: csvmatch: error: unrecognized arguments

csvmatch cia-world-leaders.csv davos-delegates.csv --fields1 name --fields2 full_name --ignore-case > leaders-at-davos.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --ignore-case

csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name --output 1.name 1.rank 2.Rank > billionaires-in-both-ranked.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --output 1.name 1.rank 2.Rank

Please, is there any incompatibility with Ubuntu or do I need to install other libraries?
Or did I make some silly mistake?

Best

Output notation for all columns from each file

eg. 1* should be all columns from file one, and ditto for 2*
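
A small sketch of how such a wildcard could be expanded into the existing `1.name` style of output specification. The `expand_output` function is hypothetical and only illustrates the proposed notation.

```python
def expand_output(spec, headers1, headers2):
    """Replace '1*' and '2*' with every column of the corresponding file."""
    expanded = []
    for item in spec:
        if item == "1*":
            expanded.extend(f"1.{header}" for header in headers1)
        elif item == "2*":
            expanded.extend(f"2.{header}" for header in headers2)
        else:
            expanded.append(item)
    return expanded

columns = expand_output(["1*", "2.Rank"], ["name", "rank"], ["Name", "Rank"])
print(columns)
```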

Add more algorithms

Jaro-Winkler
Q-gram

Sources:

Functionality to upload known matches during training of fuzzy match?

Dedupe's prepare_training gives the option to start off with some known matches as training data. Is there a way to do the same in csvmatch, by passing in a training file?

Parameters for blocking

One option to specify fields to use to create the blocks, another (?) to set the method -- default to exact match, options for Metaphone etc

Interesting bit on Soundex plus Levenshtein here: https://info.crunchydata.com/blog/fuzzy-name-matching-in-postgresql
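
The blocking idea can be sketched as follows: rows are grouped by a cheap key, and only rows sharing a key are handed to the expensive comparison. The key used here (lowercased first character) is a deliberately simple placeholder; a phonetic key such as Metaphone could be swapped in through the `key` argument. All names are hypothetical.

```python
from collections import defaultdict

def block_key(value):
    """Placeholder blocking key: the lowercased first character."""
    return value[:1].lower()

def candidate_pairs(values1, values2, key=block_key):
    """Yield only the pairs whose blocking keys are equal."""
    blocks = defaultdict(list)
    for value in values2:
        blocks[key(value)].append(value)
    for value in values1:
        for candidate in blocks.get(key(value), []):
            yield value, candidate

pairs = list(candidate_pairs(["Andy", "Mary"], ["andrew", "Mark", "Bob"]))
print(pairs)
```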

bug: wrong number of carriage return in output

Hi,
I have these two example CSVs

cat input_01.csv

field1,field2
a,b
c,d

cat input_02.csv

field1,field3
a,12
c,13

If I run

csvmatch input_01.csv input_02.csv --fields1 "field1" --fields2 "field1" --output 1.field1 2.field3 >out.csv

I get two carriage returns (rather than one) at the end of the output file, which is an error because row 4 is completely blank.

field1,field3
a,12
c,13

Thank you, as always, for this great tool

-l option example or documentation

Hi,
for -l I read "filter out terms from a newline-separated file of regular expressions when comparing".

But how is it used? Could you add an example with two lines of regular expressions to apply to the comparison?

Thank you
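
Going only by the help text quoted above, the behaviour might resemble the sketch below: each line of the file is compiled as a regular expression and its matches are removed from a value before comparison. This is an assumption about the semantics, not csvmatch's code. The two example patterns, the whitespace clean-up, and the function names are hypothetical.

```python
import re

def load_patterns(text):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]

def filter_terms(value, patterns):
    """Remove every match of every pattern, then tidy leftover whitespace."""
    for pattern in patterns:
        value = pattern.sub("", value)
    return " ".join(value.split())

# Contents of a hypothetical two-line terms file
terms = r"""\bLtd\b
\bInc\b"""

patterns = load_patterns(terms)
print(filter_terms("Acme Ltd", patterns) == filter_terms("Acme Inc", patterns))
```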

Add other fuzzy matching algorithms

There's a list here: http://ntz-develop.blogspot.co.uk/2011/03/fuzzy-string-search.html

Suggestion: Add some topics to this repo

I found this project via another one from a ddg search, but I realised that there are no topics listed on the project. It might help this package be discovered more easily on GitHub if you added some topics, like python, csv, stuff like that.

Thanks for making open source software. 🧡

Fuzzy merge for integer values

Am I wrong, or can fuzzy_pandas not handle fuzzy matching on integer values? Let's say we have fuzzy coordinates (X,Y). Is the only way to convert them to strings?
Thank you
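
For numeric data such as coordinates, a string-distance measure is usually a poor fit, and a tolerance on the numeric distance may be more appropriate. The sketch below shows that alternative as a standalone helper. It is not a feature of fuzzy_pandas or csvmatch, and the function name and tolerance values are hypothetical.

```python
import math

def close_points(p, q, tolerance):
    """True when the Euclidean distance between two (x, y) points is within tolerance."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= tolerance

# Values read from a CSV arrive as strings, so convert them first.
point1 = (float("0"), float("0"))
point2 = (float("3"), float("4"))
print(close_points(point1, point2, 5), close_points(point1, point2, 4.9))
```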

Parameter to perform an outer match

feature request: add an option to ignore extra whitespace

Hi,
using these input files

Name,Age
Andy,32
Mary Jane,43


Name,City
Andy,Rome
Mary  Jane,New York

"Mary  Jane" (second file) does not match "Mary Jane" (first file) because the former contains two spaces. Probably I can use the -l option, but I do not know how to do it.
If it's not possible with -l, it would be great to have an option to ignore stray whitespace: strip leading and trailing whitespace, and replace multiple whitespace characters with a single one.

Thank you
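
The requested normalisation is small enough to sketch directly. The helper name is hypothetical, and this shows the transformation itself rather than an existing csvmatch option.

```python
def normalise_whitespace(value):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return " ".join(value.split())

print(normalise_whitespace("Mary  Jane") == normalise_whitespace(" Mary Jane "))
```

Applying this to both columns before comparison would let the two spellings of the name match exactly.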

How to output the score of the match

Hi,
especially for the fuzzy method, is there a way to add the score of each match to the output?
That would make it quick to find the weakest matches, which are the ones most likely to be wrong.

comparison question

Does it read the CSVs line by line for the comparison, or does it search the entire CSV for matches? I have two documents that are both alphabetized, but one has about 1000 more lines than the other, so I know the rows aren't aligned.
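
Assuming the tool behaves like a join, where every row of the first file is compared against every row of the second regardless of position, exact matching can be sketched with an index on the second file. The function below is an illustration of that assumption, not csvmatch's implementation.

```python
def exact_matches(rows1, rows2, field1, field2):
    """Return (index in rows1, index in rows2) for every pair with equal field values."""
    index = {}
    for position, row in enumerate(rows2):
        index.setdefault(row[field2], []).append(position)
    return [(i, j)
            for i, row in enumerate(rows1)
            for j in index.get(row[field1], [])]

rows1 = [{"name": "Andy"}, {"name": "Mary"}]
rows2 = [{"Name": "Zed"}, {"Name": "Mary"}, {"Name": "Andy"}]
matches = exact_matches(rows1, rows2, "name", "Name")
print(matches)
```

Under this model, row order and differing file lengths do not matter, because matches are found by value rather than by line number.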

dash character and -a option

Hi,
I have these two input files

Name,Age
Andy,32
Mary-Jane,43


Name,City
Andy,Rome
Mary Jane,New York

If I run

csvmatch -i -a -n input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I have

Name,Name
Andy,Andy

With -a, "Mary-Jane" should be equal to "Mary Jane". Is the dash character a non-alphanumeric character or not?

Thank you
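
One possible explanation, offered as an assumption rather than a statement about how -a is implemented, is that the dash is removed while the space is kept, so the two values still differ after filtering. The sketch below contrasts the two interpretations. Both helper names are hypothetical.

```python
import re

def strip_nonalpha_keep_spaces(value):
    """Remove everything except ASCII letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value)

def strip_nonalpha_and_spaces(value):
    """Remove everything except ASCII letters and digits, spaces included."""
    return re.sub(r"[^A-Za-z0-9]", "", value)

# Keeping spaces: "MaryJane" vs "Mary Jane" (no match)
print(strip_nonalpha_keep_spaces("Mary-Jane"), "|", strip_nonalpha_keep_spaces("Mary Jane"))
# Dropping spaces too: "MaryJane" vs "MaryJane" (match)
print(strip_nonalpha_and_spaces("Mary-Jane"), "|", strip_nonalpha_and_spaces("Mary Jane"))
```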

Upgrade Jellyfish

https://pypi.org/project/jellyfish/

0.7.2+ has Unicode support for Damerau Levenshtein: jamesturk/cjellyfish#5

So can remove hack: https://github.com/maxharlow/csvmatch/blob/master/fuzzylevenshtein.py#L10-L13

Rationalise ignore option names

They should all probably start --ignore-* -- will require releasing a new major version

CSVmatch great but install problem in Ubuntu Linux

Hi Max,
I think that csvmatch is clever and great!

But when trying to install with:
pip install csvmatch
on my PC with Ubuntu Linux 14.04 LTS 32-bit (it includes Python 2.7.6 [GCC 4.8.4] on linux2), I get the errors below.

I'm not familiar with Python (I'm an R person) and don't understand what these error messages mean.

What can I do to complete the csvmatch install on my PC?

I'd love to try it out. Help, Max!
SFd99
San Francisco
INSTALL MESSAGES:

~$ pip install csvmatch
Downloading/unpacking csvmatch
  Downloading csvmatch-1.13-py2.py3-none-any.whl
Downloading/unpacking doublemetaphone==0.1 (from csvmatch)
  Downloading DoubleMetaphone-0.1.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/doublemetaphone/setup.py) egg_info for package doublemetaphone
    
Downloading/unpacking colorama==0.3.9 (from csvmatch)
  Downloading colorama-0.3.9-py2.py3-none-any.whl
Downloading/unpacking chardet==3.0.4 (from csvmatch)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking jellyfish==0.5.6 (from csvmatch)
  Downloading jellyfish-0.5.6.tar.gz (132kB): 132kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/jellyfish/setup.py) egg_info for package jellyfish
    
    warning: no previously-included files matching '.git' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): unidecode==0.4.21 in /usr/local/lib/python2.7/dist-packages (from csvmatch)
Downloading/unpacking dedupe==1.8.1 (from csvmatch)
  Downloading dedupe-1.8.1.tar.gz (54kB): 54kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/dedupe/setup.py) egg_info for package dedupe
    
Downloading/unpacking fastcluster (from dedupe==1.8.1->csvmatch)
  Downloading fastcluster-1.1.24.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/fastcluster/setup.py) egg_info for package fastcluster
    Version: 1.1.24
    
Downloading/unpacking dedupe-hcluster (from dedupe==1.8.1->csvmatch)
  Downloading dedupe-hcluster-0.3.2.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/dedupe-hcluster/setup.py) egg_info for package dedupe-hcluster
    
Downloading/unpacking affinegap>=1.3 (from dedupe==1.8.1->csvmatch)
  Downloading affinegap-1.10.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/affinegap/setup.py) egg_info for package affinegap
    
Downloading/unpacking categorical-distance>=1.9 (from dedupe==1.8.1->csvmatch)
  Downloading categorical_distance-1.9-py2-none-any.whl
Downloading/unpacking dedupe-variable-datetime (from dedupe==1.8.1->csvmatch)
  Downloading dedupe_variable_datetime-0.1.5-py2-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): future>=0.14 in /usr/local/lib/python2.7/dist-packages (from dedupe==1.8.1->csvmatch)
Downloading/unpacking rlr>=2.4.3 (from dedupe==1.8.1->csvmatch)
  Downloading rlr-2.4.3-py2.py3-none-any.whl
Downloading/unpacking numpy>=1.12 (from dedupe==1.8.1->csvmatch)
  Downloading numpy-1.14.1.zip (4.9MB): 4.9MB downloaded
  Running setup.py (path:/tmp/pip_build_ray/numpy/setup.py) egg_info for package numpy
    Running from numpy source directory.
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
      warnings.warn(msg)
    
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '*.pyo' found anywhere in distribution
    warning: no previously-included files matching '*.pyd' found anywhere in distribution
    warning: no previously-included files matching '*.swp' found anywhere in distribution
    warning: no previously-included files matching '*.bak' found anywhere in distribution
    warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking highered>=0.2.0 (from dedupe==1.8.1->csvmatch)
  Downloading highered-0.2.1-py2.py3-none-any.whl
Downloading/unpacking simplecosine>=1.2 (from dedupe==1.8.1->csvmatch)
  Downloading simplecosine-1.2-py2.py3-none-any.whl
Downloading/unpacking haversine>=0.4.1 (from dedupe==1.8.1->csvmatch)
  Downloading haversine-0.4.5.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/haversine/setup.py) egg_info for package haversine
    
Downloading/unpacking BTrees>=4.1.4 (from dedupe==1.8.1->csvmatch)
  Downloading BTrees-4.4.1.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/BTrees/setup.py) egg_info for package BTrees
    warning: no previously-included files matching '*.dll' found anywhere in distribution
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '*.pyo' found anywhere in distribution
    warning: no previously-included files matching '*.so' found anywhere in distribution
    warning: no previously-included files matching 'coverage.xml' found anywhere in distribution
    no previously-included directories found matching 'docs/_build'
    no previously-included directories found matching 'persistent/__pycache__'
    In file included from persistent/cPersistence.h:18:0,
                     from persistent/cPersistence.c:19:
    persistent/_compat.h:18:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_ray/BTrees/setup.py", line 158, in <module>
        """
      File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 239, in __init__
        self.fetch_build_eggs(attrs.pop('setup_requires'))
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 264, in fetch_build_eggs
        replace_conflicting=True
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 620, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 858, in best_match
        return self.obtain(req, installer) # try and download/install
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 870, in obtain
        return installer(requirement)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 314, in fetch_build_egg
        return cmd.easy_install(req)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 616, in easy_install
        return self.install_item(spec, dist.location, tmpdir, deps)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 646, in install_item
        dists = self.install_eggs(spec, download, tmpdir)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 834, in install_eggs
        return self.build_and_install(setup_script, setup_base)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1040, in build_and_install
        self.run_setup(setup_script, setup_base, args)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1028, in run_setup
        raise DistutilsError("Setup script exited with %s" % (v.args[0],))
    distutils.errors.DistutilsError: Setup script exited with error: command 'i686-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ray/BTrees
Storing debug log for failure in /home/ray/.pip/pip.log
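
The fatal error `Python.h: No such file or directory` in the log above indicates that the C headers for Python are not installed, so the C extension in BTrees cannot be compiled. On Debian and Ubuntu these headers are usually provided by the `python-dev` package (`python3-dev` for Python 3). A minimal sketch for checking whether the headers are present for the running interpreter follows. The `has_python_headers` helper is hypothetical.

```python
import os
import sysconfig

def has_python_headers():
    """Report whether Python.h exists in this interpreter's include directory."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.isfile(os.path.join(include_dir, "Python.h"))

print(has_python_headers())
```

If this prints False, installing the development headers and rerunning the pip command is the usual next step.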
