maxharlow / csvmatch
Finds fuzzy matches between CSV files
License: Other
Using Nameparser? https://pypi.python.org/pypi/nameparser
eg: cStringIO, and other C-versions
A la how csvlink and csvmatch do by default.
Hello, in Python 3.8 on Windows, when you run the Bilenko match and enter the y/n response, you get the following errors:
invalid value encountered in double_scalars
module 'time' has no attribute 'clock'
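For context on the second error: `time.clock()` was removed from the standard library in Python 3.8, and `time.perf_counter()` is the documented replacement. A minimal workaround sketch follows, assuming the failing call sits in a dependency you cannot edit; whether this shim is the right fix for csvmatch itself is not confirmed here.

```python
import time

# time.clock() was removed in Python 3.8. Older libraries that still call it
# can be kept working by aliasing it to time.perf_counter() before they run.
if not hasattr(time, "clock"):
    time.clock = time.perf_counter  # compatibility alias (workaround only)

start = time.clock()
elapsed = time.clock() - start  # elapsed seconds as a float
```

The alias has to be installed before the library that calls `time.clock()` is used.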
Should be able to specify what columns are in the output, separately from the specification of which columns are used for matching.
During the install process on my Mac, I seem to be getting an error when it comes to scikit-learn. Any advice?
`Collecting scikit-learn (from dedupe==2.0.21->csvmatch)
Using cached scikit-learn-1.2.2.tar.gz (7.3 MB)
Installing build dependencies ... error
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [74 lines of output]
Ignoring numpy: markers 'python_version == "3.10" and platform_system == "Windows" and platform_python_implementation != "PyPy"' don't match your environment
Collecting setuptools
Using cached setuptools-67.7.2-py3-none-any.whl (1.1 MB)
Collecting wheel
Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Collecting Cython>=0.29.24
Using cached Cython-0.29.34-py2.py3-none-any.whl (988 kB)
Collecting oldest-supported-numpy
Using cached oldest_supported_numpy-2022.11.19-py3-none-any.whl (4.9 kB)
Collecting scipy>=1.3.2
Using cached scipy-1.10.1.tar.gz (42.4 MB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Installing backend dependencies: started
Installing backend dependencies: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'error'
error: subprocess-exited-with-error
`
Currently the data gets coalesced into a single column -- not ideal
A number between 0 and 1 indicating how strong the match was.
Would only be relevant to algorithms that give a match degree number
Hi,
Thank you for your project!
I use csvmatch for a project but started getting build issues since today or a few days ago. It seems like Levenshtein-search==1.4.5, which dedupe depends on, is no longer available on pip. Others created issues in their repository, but there has been no solution so far: mattandahalfew/Levenshtein_search#29
The dedupe package released a hotfix to no longer depend on Levenshtein-search==1.4.5, but to use that hotfix the dedupe dependency in csvmatch would need to be updated from 2.0.11 to 2.0.21. Would it be possible to update this dependency and create a new release?
Hey, when I try to run csvmatch test1.csv test2.csv --fuzzy levenshtein in the terminal, it crashes my terminal and gives segmentation fault 11. I am using Python 2.7.11. Sometimes it will output results in the terminal, sometimes it won't.
Hello,
It appears as though you don't combine the degrees of multiple fields and instead look at them individually. I'm looking to combine the degree from matching name and address. Any idea on how to accomplish that?
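One way to combine per-field degrees is a weighted average. The sketch below is not csvmatch's behaviour; it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the function names, field names, and weights are all hypothetical.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    # Similarity in [0, 1] between two strings, ignoring case.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def combined_score(row1, row2, weights):
    # Weighted average of per-field similarities.
    # `weights` maps a field name (present in both rows) to its weight.
    total = sum(weights.values())
    return sum(
        weight * field_similarity(row1[field], row2[field])
        for field, weight in weights.items()
    ) / total

a = {"name": "Mary Jane", "address": "1 Main Street"}
b = {"name": "Mary Jane", "address": "1 Main St"}
score = combined_score(a, b, {"name": 0.5, "address": 0.5})
```

A pair would then count as a match when `score` exceeds a threshold chosen for the data.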
For the purpose of matching strings with helpful numbers and unhelpful words (such as precinct names with codes and messy names), adding an "ignore_letters" option would be nice. This would only match numbers from two columns of interest.
I implemented something like this in my code with:
import re

def ignore_alpha(row):
    # Strip every non-digit character so only the numbers are compared.
    regex = re.compile(r'[\D_]+')
    return [regex.sub('', value) for value in row]
Could use something like Apache Arrow so files don't have to be kept in memory?
I am receiving "descriptor 'union' of 'set' object needs an argument" when using --fuzzy. This is happening after I hit f to finish matching records for the machine learning. Below is my syntax:
csvmatch file1.csv file2.csv --fields1 'Employee#' 'Last Name' 'First Name' --fields2 'Employee#' 'Last Name' 'First Name' --fuzzy > newfile.csv
Hi Max
I watched your talk at NICAR21. I'm Reinaldo, from Brazil
Congratulations on the project
I'm using a computer with Ubuntu 20.04 and python 3.8
I tested some commands, but only the first one worked:
csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name > billionaires-in-both.csv
The others always show the error: csvmatch: error: unrecognized arguments
csvmatch cia-world-leaders.csv davos-delegates.csv --fields1 name --fields2 full_name --ignore-case > leaders-at-davos.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --ignore-case
csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name --output 1.name 1.rank 2.Rank > billionaires-in-both-ranked.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --output 1.name 1.rank 2.Rank
Please, is there any incompatibility with Ubuntu or do I need to install other libraries?
Or did I make some silly mistake?
Best
eg. 1* should be all columns from file one, and ditto for 2*
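A rough sketch of how such wildcards could be expanded, assuming the `1.name` / `2.Rank` output syntax used elsewhere in these issues. The function name `expand_output` and its arguments are hypothetical, not part of csvmatch.

```python
def expand_output(specs, headers1, headers2):
    # Expand '1*' and '2*' into one '<file>.<column>' entry per header.
    # Any other spec (e.g. '2.Rank') is passed through unchanged.
    headers = {"1": headers1, "2": headers2}
    expanded = []
    for spec in specs:
        if spec in ("1*", "2*"):
            number = spec[0]
            expanded.extend(f"{number}.{header}" for header in headers[number])
        else:
            expanded.append(spec)
    return expanded

columns = expand_output(["1*", "2.City"], ["Name", "Age"], ["Name", "City"])
```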
In Dedupe's library we see the option to prepare_training and start off with some known matches for training data. However, is this functionality available in csvmatch, to pass in another training file as well?
One option to specify fields to use to create the blocks, another (?) to set the method -- default to exact match, options for Metaphone etc
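To illustrate the blocking idea: rows are grouped by a cheap key, and only rows sharing a key are compared. This is a sketch under assumptions, not an existing csvmatch feature; `block_key` (exact match on the lowercased first letter) and `candidate_pairs` are hypothetical names, and a phonetic encoding such as Metaphone could be swapped in as the key function.

```python
from collections import defaultdict
from itertools import product

def block_key(value):
    # Hypothetical default key: lowercased first character of the value.
    return value.strip().lower()[:1]

def candidate_pairs(rows1, rows2, field1, field2, key=block_key):
    # Group rows from both files by block key, then pair up only the rows
    # that fall into the same block.
    blocks = defaultdict(lambda: ([], []))
    for row in rows1:
        blocks[key(row[field1])][0].append(row)
    for row in rows2:
        blocks[key(row[field2])][1].append(row)
    for left, right in blocks.values():
        yield from product(left, right)

rows1 = [{"name": "Andy"}, {"name": "Mary"}]
rows2 = [{"Name": "andrew"}, {"Name": "Bob"}]
pairs = list(candidate_pairs(rows1, rows2, "name", "Name"))
```

Only the pairs produced here would then be passed to the slower fuzzy comparison.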
Interesting bit on Soundex plus Levenshtein here: https://info.crunchydata.com/blog/fuzzy-name-matching-in-postgresql
Hi,
I have these two example CSVs
cat input_01.csv
field1,field2
a,b
c,d
cat input_02.csv
field1,field3
a,12
c,13
If I run
csvmatch input_01.csv input_02.csv --fields1 "field1" --fields2 "field1" --output 1.field1 2.field3 >out.csv
I have two (and not only one) carriage return in the output file, and this is an error because row 4 is completely blank.
field1,field3
a,12
c,13
Thank you, as always, for this great tool
Hi,
for -l I read "filter out terms from a newline-separated file of regular expressions when comparing".
But how do I use it? Could you add an example with two lines of regular expressions to apply to the compare job?
Thank you
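As an illustration of what such a file could contain, the sketch below applies two regular expressions, one per line, to a value before comparison. It only mirrors the help text quoted above; the patterns, the function names, and the final whitespace tidy-up are assumptions, and csvmatch's real handling may differ.

```python
import re

def load_patterns(text):
    # One regular expression per non-empty line.
    return [re.compile(line) for line in text.splitlines() if line.strip()]

def filter_terms(value, patterns):
    # Remove every match of every pattern, then tidy leftover whitespace.
    for pattern in patterns:
        value = pattern.sub("", value)
    return " ".join(value.split())

# Hypothetical contents of the newline-separated file (two lines).
patterns_text = r"\b(Ltd|Limited)\b" + "\n" + r"^The\s+"
patterns = load_patterns(patterns_text)
cleaned = filter_terms("The Acme Ltd", patterns)
```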
There's a list here: http://ntz-develop.blogspot.co.uk/2011/03/fuzzy-string-search.html
I found this project via another one from a ddg search. But I realised that there are no topics listed on the project. It might help this package be discovered more easily on GitHub if you added some topics, like python, csv, stuff like that.
Thanks for making open source software.
Am I wrong, or can fuzzy_pandas not handle fuzzy logic on integer values? Let's say we have fuzzy coordinates (X, Y). Is the only way to convert them to strings?
Thank you
Hi,
using these input files
Name,Age
Andy,32
Mary  Jane,43
Name,City
Andy,Rome
Mary Jane,New York
"Mary  Jane" does not match "Mary Jane" because the first contains two spaces. I can probably use the -l option, but I do not know how.
If it's not possible with -l, it would be great to have an option to ignore stray whitespace: strip leading and trailing whitespace, and replace runs of whitespace with single spaces.
Thank you
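The normalisation requested above can be sketched in one line of Python. This is an illustration of the idea, not an existing csvmatch option, and the function name is hypothetical.

```python
def normalise_whitespace(value):
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace; joining with one space collapses it.
    return " ".join(value.split())

left = normalise_whitespace("Mary  Jane")
right = normalise_whitespace(" Mary Jane ")
```

Applied to both columns before comparison, the two values above become identical.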
Hi,
especially for the fuzzy method, is there a way to add the score of each match to the output?
That would make it possible to quickly find the weakest matches, which are the ones most likely to be wrong.
Does it read the CSV line by line for the comparison? Or does it search the entire CSV for matches? I have two documents that are both alphabetized, but one has about 1000 more lines than the other, so I know the rows aren't aligned.
Hi,
I have these two input files
Name,Age
Andy,32
Mary-Jane,43
Name,City
Andy,Rome
Mary Jane,New York
If I run
csvmatch -i -a -n input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"
I have
Name,Name
Andy,Andy
Using -a, "Mary-Jane" should be equal to "Mary Jane". Isn't the dash a non-alphanumeric character?
Thank you
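One possible explanation, offered only as a guess: the result depends on whether spaces count as non-alphanumeric. The two functions below are hypothetical and are not taken from csvmatch; they show how the same pair of names matches under one definition and not under the other.

```python
import re

def strip_keeping_spaces(value):
    # Removes punctuation but keeps spaces.
    return re.sub(r"[^A-Za-z0-9 ]", "", value)

def strip_including_spaces(value):
    # Removes everything that is not a letter or digit, spaces included.
    return re.sub(r"[^A-Za-z0-9]", "", value)

# Spaces kept: 'MaryJane' vs 'Mary Jane', so no match.
keeps = strip_keeping_spaces("Mary-Jane") == strip_keeping_spaces("Mary Jane")

# Spaces removed too: 'MaryJane' vs 'MaryJane', so they match.
drops = strip_including_spaces("Mary-Jane") == strip_including_spaces("Mary Jane")
```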
https://pypi.org/project/jellyfish/
0.7.2+ has Unicode support for Damerau Levenshtein: jamesturk/cjellyfish#5
So can remove hack: https://github.com/maxharlow/csvmatch/blob/master/fuzzylevenshtein.py#L10-L13
They should all probably start with --ignore-* (this will require releasing a new major version)
Hi Max,
I think that csvmatch is clever and great!
But when trying to install with:
pip install csvmatch
on my PC with Ubuntu Linux 14.04 LTS 32-bit
(it includes Python 2.7.6 [GCC 4.8.4] on linux2),
I get the errors below.
I'm not familiar with Python (I'm an R person) and don't understand what these error messages mean.
What can I do to complete the csvmatch install on my PC?
I'd love to try it out. Help, Max!
SFd99
San Francisco
INSTALL MESSAGES:
~$ pip install csvmatch
Downloading/unpacking csvmatch
Downloading csvmatch-1.13-py2.py3-none-any.whl
Downloading/unpacking doublemetaphone==0.1 (from csvmatch)
Downloading DoubleMetaphone-0.1.tar.gz
Running setup.py (path:/tmp/pip_build_ray/doublemetaphone/setup.py) egg_info for package doublemetaphone
Downloading/unpacking colorama==0.3.9 (from csvmatch)
Downloading colorama-0.3.9-py2.py3-none-any.whl
Downloading/unpacking chardet==3.0.4 (from csvmatch)
Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking jellyfish==0.5.6 (from csvmatch)
Downloading jellyfish-0.5.6.tar.gz (132kB): 132kB downloaded
Running setup.py (path:/tmp/pip_build_ray/jellyfish/setup.py) egg_info for package jellyfish
warning: no previously-included files matching '.git' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): unidecode==0.4.21 in /usr/local/lib/python2.7/dist-packages (from csvmatch)
Downloading/unpacking dedupe==1.8.1 (from csvmatch)
Downloading dedupe-1.8.1.tar.gz (54kB): 54kB downloaded
Running setup.py (path:/tmp/pip_build_ray/dedupe/setup.py) egg_info for package dedupe
Downloading/unpacking fastcluster (from dedupe==1.8.1->csvmatch)
Downloading fastcluster-1.1.24.tar.gz (166kB): 166kB downloaded
Running setup.py (path:/tmp/pip_build_ray/fastcluster/setup.py) egg_info for package fastcluster
Version: 1.1.24
Downloading/unpacking dedupe-hcluster (from dedupe==1.8.1->csvmatch)
Downloading dedupe-hcluster-0.3.2.tar.gz (166kB): 166kB downloaded
Running setup.py (path:/tmp/pip_build_ray/dedupe-hcluster/setup.py) egg_info for package dedupe-hcluster
Downloading/unpacking affinegap>=1.3 (from dedupe==1.8.1->csvmatch)
Downloading affinegap-1.10.tar.gz
Running setup.py (path:/tmp/pip_build_ray/affinegap/setup.py) egg_info for package affinegap
Downloading/unpacking categorical-distance>=1.9 (from dedupe==1.8.1->csvmatch)
Downloading categorical_distance-1.9-py2-none-any.whl
Downloading/unpacking dedupe-variable-datetime (from dedupe==1.8.1->csvmatch)
Downloading dedupe_variable_datetime-0.1.5-py2-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): future>=0.14 in /usr/local/lib/python2.7/dist-packages (from dedupe==1.8.1->csvmatch)
Downloading/unpacking rlr>=2.4.3 (from dedupe==1.8.1->csvmatch)
Downloading rlr-2.4.3-py2.py3-none-any.whl
Downloading/unpacking numpy>=1.12 (from dedupe==1.8.1->csvmatch)
Downloading numpy-1.14.1.zip (4.9MB): 4.9MB downloaded
Running setup.py (path:/tmp/pip_build_ray/numpy/setup.py) egg_info for package numpy
Running from numpy source directory.
/usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
warnings.warn(msg)
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.swp' found anywhere in distribution
warning: no previously-included files matching '*.bak' found anywhere in distribution
warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking highered>=0.2.0 (from dedupe==1.8.1->csvmatch)
Downloading highered-0.2.1-py2.py3-none-any.whl
Downloading/unpacking simplecosine>=1.2 (from dedupe==1.8.1->csvmatch)
Downloading simplecosine-1.2-py2.py3-none-any.whl
Downloading/unpacking haversine>=0.4.1 (from dedupe==1.8.1->csvmatch)
Downloading haversine-0.4.5.tar.gz
Running setup.py (path:/tmp/pip_build_ray/haversine/setup.py) egg_info for package haversine
Downloading/unpacking BTrees>=4.1.4 (from dedupe==1.8.1->csvmatch)
Downloading BTrees-4.4.1.tar.gz (166kB): 166kB downloaded
Running setup.py (path:/tmp/pip_build_ray/BTrees/setup.py) egg_info for package BTrees
warning: no previously-included files matching '*.dll' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching 'coverage.xml' found anywhere in distribution
no previously-included directories found matching 'docs/_build'
no previously-included directories found matching 'persistent/__pycache__'
In file included from persistent/cPersistence.h:18:0,
from persistent/cPersistence.c:19:
persistent/_compat.h:18:20: fatal error: Python.h: No such file or directory
#include "Python.h"
^
compilation terminated.
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_ray/BTrees/setup.py", line 158, in <module>
"""
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 239, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 264, in fetch_build_eggs
replace_conflicting=True
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 620, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 858, in best_match
return self.obtain(req, installer) # try and download/install
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 870, in obtain
return installer(requirement)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 314, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 616, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 646, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 834, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1040, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1028, in run_setup
raise DistutilsError("Setup script exited with %s" % (v.args[0],))
distutils.errors.DistutilsError: Setup script exited with error: command 'i686-linux-gnu-gcc' failed with exit status 1
Complete output from command python setup.py egg_info:
warning: no previously-included files matching '*.dll' found anywhere in distribution
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyo' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching 'coverage.xml' found anywhere in distribution
no previously-included directories found matching 'docs/_build'
no previously-included directories found matching 'persistent/__pycache__'
In file included from persistent/cPersistence.h:18:0,
from persistent/cPersistence.c:19:
persistent/_compat.h:18:20: fatal error: Python.h: No such file or directory
#include "Python.h"
^
compilation terminated.
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_ray/BTrees/setup.py", line 158, in <module>
"""
File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 239, in __init__
self.fetch_build_eggs(attrs.pop('setup_requires'))
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 264, in fetch_build_eggs
replace_conflicting=True
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 620, in resolve
dist = best[req.key] = env.best_match(req, ws, installer)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 858, in best_match
return self.obtain(req, installer) # try and download/install
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 870, in obtain
return installer(requirement)
File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 314, in fetch_build_egg
return cmd.easy_install(req)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 616, in easy_install
return self.install_item(spec, dist.location, tmpdir, deps)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 646, in install_item
dists = self.install_eggs(spec, download, tmpdir)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 834, in install_eggs
return self.build_and_install(setup_script, setup_base)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1040, in build_and_install
self.run_setup(setup_script, setup_base, args)
File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1028, in run_setup
raise DistutilsError("Setup script exited with %s" % (v.args[0],))
distutils.errors.DistutilsError: Setup script exited with error: command 'i686-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ray/BTrees
Storing debug log for failure in /home/ray/.pip/pip.log