csvmatch's People

Contributors

aborruso, gruberma, maxharlow

csvmatch's Issues

Installation on Mac -- scikit-learn Issue

During the install process on my Mac, I seem to be getting an error when it comes to scikit-learn. Any advice?
`Collecting scikit-learn (from dedupe==2.0.21->csvmatch)
Using cached scikit-learn-1.2.2.tar.gz (7.3 MB)
Installing build dependencies ... error
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [74 lines of output]
Ignoring numpy: markers 'python_version == "3.10" and platform_system == "Windows" and platform_python_implementation != "PyPy"' don't match your environment
Collecting setuptools
Using cached setuptools-67.7.2-py3-none-any.whl (1.1 MB)
Collecting wheel
Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
Collecting Cython>=0.29.24
Using cached Cython-0.29.34-py2.py3-none-any.whl (988 kB)
Collecting oldest-supported-numpy
Using cached oldest_supported_numpy-2022.11.19-py3-none-any.whl (4.9 kB)
Collecting scipy>=1.3.2
Using cached scipy-1.10.1.tar.gz (42.4 MB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Installing backend dependencies: started
Installing backend dependencies: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'error'
error: subprocess-exited-with-error

`

Issues with dedupe dependency because of Levenshtein-search

Hi,

Thank you for your project!

I use csvmatch for a project but started getting build issues since today or a few days ago. It seems like Levenshtein-search==1.4.5, which dedupe depends on, is no longer available on pip. Others created issues in their repository, but there has been no solution so far: mattandahalfew/Levenshtein_search#29

The dedupe package released a hotfix to no longer depend on Levenshtein-search==1.4.5, but to use that hotfix the dedupe dependency in csvmatch would need to be updated from 2.0.11 to 2.0.21. Would it be possible to update this dependency and create a new release?

Segmentation Fault: 11 when running fuzzy levenshtein

Hey, when I try to run csvmatch test1.csv test2.csv --fuzzy levenshtein in the terminal, it crashes my terminal and gives Segmentation fault: 11. I am using Python 2.7.11. Sometimes it outputs results in the terminal, sometimes it doesn't.

Matching across multiple fields

Hello,
It appears as though you don't combine the degrees of multiple fields and instead look at them individually. I'm looking to combine the degree from matching name and address. Any idea on how to accomplish that?
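
One way to get a single combined degree is a weighted average of the per-field similarities. The sketch below is not csvmatch's own logic: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and the function names, field names and weights are all hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity between two strings, from 0.0 to 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_degree(row1, row2, weights):
    """Weighted average of per-field similarities.

    `weights` maps a field name to its relative importance (hypothetical values).
    """
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(row1[field], row2[field])
        for field, weight in weights.items()
    )
    return score / total


left = {"name": "Jon Smith", "address": "12 High Street"}
right = {"name": "John Smith", "address": "12 High St"}

# Name counts twice as much as address in this illustration.
degree = combined_degree(left, right, {"name": 2, "address": 1})
print(round(degree, 2))
```

A pair could then be accepted when the combined degree clears a threshold of your choosing, rather than requiring each field to clear it separately.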

Ignore_letters option

For the purpose of matching strings with helpful numbers and unhelpful words (such as precinct names with codes and messy names), adding an "ignore_letters" option would be nice. This would only match numbers from two columns of interest.

I implemented something like this in my code with:

import re

def ignore_alpha(row):
    # Remove every non-digit character so that only the numbers in each value remain
    regex = re.compile(r'[\D_]+')
    return [regex.sub('', value) for value in row]

descriptor 'union' of 'set' object needs an argument

I am receiving "descriptor 'union' of 'set' object needs an argument" when using --fuzzy. This is happening after I hit f to finish matching records for the machine learning. Below is my syntax:

csvmatch file1.csv file2.csv --fields1 'Employee#' 'Last Name' 'First Name' --fields2 'Employee#' 'Last Name' 'First Name' --fuzzy > newfile.csv

Unrecognized arguments on Ubuntu

Hi Max,
I watched your talk at NICAR21. I'm Reinaldo, from Brazil.
Congratulations on the project.
I'm using a computer with Ubuntu 20.04 and Python 3.8.

I tested some commands, but only the first one worked:
csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name > billionaires-in-both.csv

The others always show the error: csvmatch: error: unrecognized arguments

csvmatch cia-world-leaders.csv davos-delegates.csv --fields1 name --fields2 full_name --ignore-case > leaders-at-davos.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --ignore-case

csvmatch forbes-billionaires.csv bloomberg-billionaires.csv --fields1 name --fields2 Name --output 1.name 1.rank 2.Rank > billionaires-in-both-ranked.csv
usage: csvmatch [-h] [-1 FIELDS1 [FIELDS1 ...]] [-2 FIELDS2 [FIELDS2 ...]]
[FILE1] [FILE2]
csvmatch: error: unrecognized arguments: --output 1.name 1.rank 2.Rank

Please, is there any incompatibility with Ubuntu or do I need to install other libraries?
Or did I make some silly mistake?

Best

bug: wrong number of carriage return in output

Hi,
I have these two example CSVs

cat input_01.csv

field1,field2
a,b
c,d

cat input_02.csv

field1,field3
a,12
c,13

If I run

csvmatch input_01.csv input_02.csv --fields1 "field1" --fields2 "field1" --output 1.field1 2.field3 >out.csv

The output file ends with two line breaks instead of one, which is an error because row 4 is completely blank.

field1,field3
a,12
c,13


Thank you, as always, for this great tool
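
A trailing blank row like this often appears when text that already ends with a line terminator is passed to `print`, which appends another newline. Whether that is the cause in csvmatch is an assumption; the sketch below only reproduces the general effect with the standard `csv` module, and `render_csv` is a hypothetical helper.

```python
import csv
import io
import sys


def render_csv(rows):
    """Serialise rows to CSV text; csv.writer ends every row with a line terminator."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


text = render_csv([["field1", "field3"], ["a", "12"], ["c", "13"]])

# print(text) would add one more newline after the final terminator,
# leaving a blank last line. Writing the text unchanged avoids that.
sys.stdout.write(text)
```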

-l option example or documentation

Hi,
for -l I read "filter out terms from a newline-separated file of regular expressions when comparing".

But how do I use it? Could you add an example with two lines of regular expressions to apply to the comparison?

Thank you
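
To illustrate what "filter out terms from a newline-separated file of regular expressions" could mean in practice, here is a minimal sketch. It is not csvmatch's implementation, which may differ in details such as case handling; the two patterns and the helper names `load_patterns` and `filter_terms` are hypothetical.

```python
import re


def load_patterns(text):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]


def filter_terms(value, patterns):
    """Remove every match of every pattern, then tidy leftover spaces."""
    for pattern in patterns:
        value = pattern.sub("", value)
    return " ".join(value.split())


# Contents of a hypothetical terms file: two regular expressions, one per line.
terms_file = r"\b(Ltd|Limited)\b" + "\n" + r"\bInc\b" + "\n"
patterns = load_patterns(terms_file)

print(filter_terms("Acme Limited", patterns))
print(filter_terms("Acme Inc", patterns))
```

With these two lines, "Acme Limited" and "Acme Inc" both reduce to "Acme" before comparison. If the option behaves as its help text describes, a file with the same two lines would be the argument to -l.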

Suggestion: Add some topics to this repo

I found this project via another one from a ddg search, but I realised that there are no topics listed on the project. Adding some topics, like python or csv, might help this package be discovered more easily on GitHub.

Thanks for making open source software. 🧡

Fuzzy merge for integer values

Am I wrong, or can fuzzy_pandas not handle fuzzy matching on integer values? Let's say we have fuzzy coordinates (X, Y). Is the only way to convert them to strings?
Thank you
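
For numeric data such as coordinates, one alternative to converting to strings is to compare the numbers directly with a tolerance. This sketch makes no claim about what fuzzy_pandas supports; `coordinates_match`, `match_points` and the tolerance value are hypothetical.

```python
import math


def coordinates_match(p, q, tolerance):
    """True when two (x, y) points lie within `tolerance` of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= tolerance


def match_points(left, right, tolerance):
    """Pair every point in `left` with every point in `right` that is close enough."""
    return [(p, q) for p in left for q in right if coordinates_match(p, q, tolerance)]


left = [(10.0, 20.0), (50.0, 60.0)]
right = [(10.3, 20.4), (13.0, 24.0)]

pairs = match_points(left, right, tolerance=1.0)
print(pairs)
```

Here only the first point of each list is paired, because the other candidates are several units apart.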

feature request: add wrong whitespaces option

Hi,
using these input files

Name,Age
Andy,32
Mary Jane,43


Name,City
Andy,Rome
Mary  Jane,New York

"Mary Jane" does not match "Mary  Jane" because the second has two spaces. I can probably use the -l option, but I do not know how.
If it's not possible with -l, it would be great to have an option to ignore bad whitespace: strip leading and trailing whitespace, and replace runs of whitespace with a single space.

Thank you
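
The requested normalisation is small enough to sketch. This shows the behaviour being asked for, not an existing csvmatch option; `normalise_whitespace` is a hypothetical name.

```python
def normalise_whitespace(value):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    # str.split() with no argument splits on any run of whitespace
    # and discards empty pieces, so joining with " " does both jobs.
    return " ".join(value.split())


print(normalise_whitespace("Mary  Jane") == normalise_whitespace("Mary Jane"))
```

Applied to both name columns before comparing, the two spellings of the name become identical.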

How to output the score of the match

Hi,
especially for the fuzzy method, is there a way to add the score of each match to the output?
That would make it quick to find the matches with the lowest scores, which are the ones most likely to be wrong.
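
As an illustration of the idea, the sketch below scores each matched pair with a normalised Levenshtein distance and lists the weakest matches first. It is a standalone example, not csvmatch output; the scoring formula and the sample pairs are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def score(a, b):
    """Similarity from 0.0 to 1.0, where 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


pairs = [("Andy", "Andy"), ("Mary Jane", "Mary-Jane"), ("Jon", "Joan")]

# Sort ascending so the least certain matches come first.
scored = sorted((score(a, b), a, b) for a, b in pairs)
for value, a, b in scored:
    print(f"{value:.2f},{a},{b}")
```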

comparison question

Does it read the CSVs line by line for the comparison, or does it search the entire CSV for matches? I have two documents that are both alphabetized, but one has about 1,000 more lines than the other, so I know the rows aren't aligned.
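
The difference between the two approaches can be shown with a small sketch. It matches every row of one file against all rows of the other through a lookup table, so row order and file length do not matter. This is an illustration of the technique only, and `match_rows` is a hypothetical function, not csvmatch's internals.

```python
import csv
import io


def match_rows(text1, text2, field1, field2):
    """Return pairs of rows whose key fields are equal, regardless of row order."""
    # Index the second file by key so each lookup is a dictionary access.
    rows2_by_key = {}
    for row in csv.DictReader(io.StringIO(text2)):
        rows2_by_key.setdefault(row[field2], []).append(row)

    matches = []
    for row in csv.DictReader(io.StringIO(text1)):
        for other in rows2_by_key.get(row[field1], []):
            matches.append((row, other))
    return matches


file1 = "name\nAlice\nBob\nCarol\n"
file2 = "name\nCarol\nZed\nAlice\n"

matches = match_rows(file1, file2, "name", "name")
print([left["name"] for left, _ in matches])
```

Alice and Carol are found even though they sit on different line numbers in the two inputs.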

dash character and -a option

Hi,
I have these two input files

Name,Age
Andy,32
Mary-Jane,43


Name,City
Andy,Rome
Mary Jane,New York

If I run

csvmatch -i -a -n input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I have

Name,Name
Andy,Andy

With -a, "Mary-Jane" should be equal to "Mary Jane". Is the dash a non-alphanumeric character or not?

Thank you
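
One possible explanation, offered as an assumption and not as a statement about csvmatch's code, is that the result depends on whether spaces are stripped along with the dash. The two hypothetical functions below show how the choice changes the outcome.

```python
import re


def strip_nonalpha(value):
    """Drop every character that is not an ASCII letter or digit, spaces included."""
    return re.sub(r"[^A-Za-z0-9]", "", value)


def strip_nonalpha_keep_spaces(value):
    """Drop characters that are not ASCII letters or digits, but keep spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", value)


# Removing spaces too makes both spellings identical.
print(strip_nonalpha("Mary-Jane") == strip_nonalpha("Mary Jane"))

# Keeping spaces leaves "MaryJane" on one side and "Mary Jane" on the other.
print(strip_nonalpha_keep_spaces("Mary-Jane") == strip_nonalpha_keep_spaces("Mary Jane"))
```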

CSVmatch great but install problem in Ubuntu Linux

Hi Max,
I think that csvmatch is clever and great!

But when trying to install it with:
pip install csvmatch
on my PC with Ubuntu Linux 14.04 LTS 32-bit
(it includes Python 2.7.6 [GCC 4.8.4] on linux2),
I get the errors below.

I'm not familiar with Python (I'm an R person)
and don't understand what these error messages mean.

What can I do to complete the csvmatch install on my PC?

I'd love to try it out. Help, Max!
SFd99
San Francisco

INSTALL MESSAGES:

~$ pip install csvmatch
Downloading/unpacking csvmatch
  Downloading csvmatch-1.13-py2.py3-none-any.whl
Downloading/unpacking doublemetaphone==0.1 (from csvmatch)
  Downloading DoubleMetaphone-0.1.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/doublemetaphone/setup.py) egg_info for package doublemetaphone
    
Downloading/unpacking colorama==0.3.9 (from csvmatch)
  Downloading colorama-0.3.9-py2.py3-none-any.whl
Downloading/unpacking chardet==3.0.4 (from csvmatch)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded
Downloading/unpacking jellyfish==0.5.6 (from csvmatch)
  Downloading jellyfish-0.5.6.tar.gz (132kB): 132kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/jellyfish/setup.py) egg_info for package jellyfish
    
    warning: no previously-included files matching '.git' found anywhere in distribution
Requirement already satisfied (use --upgrade to upgrade): unidecode==0.4.21 in /usr/local/lib/python2.7/dist-packages (from csvmatch)
Downloading/unpacking dedupe==1.8.1 (from csvmatch)
  Downloading dedupe-1.8.1.tar.gz (54kB): 54kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/dedupe/setup.py) egg_info for package dedupe
    
Downloading/unpacking fastcluster (from dedupe==1.8.1->csvmatch)
  Downloading fastcluster-1.1.24.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/fastcluster/setup.py) egg_info for package fastcluster
    Version: 1.1.24
    
Downloading/unpacking dedupe-hcluster (from dedupe==1.8.1->csvmatch)
  Downloading dedupe-hcluster-0.3.2.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/dedupe-hcluster/setup.py) egg_info for package dedupe-hcluster
    
Downloading/unpacking affinegap>=1.3 (from dedupe==1.8.1->csvmatch)
  Downloading affinegap-1.10.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/affinegap/setup.py) egg_info for package affinegap
    
Downloading/unpacking categorical-distance>=1.9 (from dedupe==1.8.1->csvmatch)
  Downloading categorical_distance-1.9-py2-none-any.whl
Downloading/unpacking dedupe-variable-datetime (from dedupe==1.8.1->csvmatch)
  Downloading dedupe_variable_datetime-0.1.5-py2-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): future>=0.14 in /usr/local/lib/python2.7/dist-packages (from dedupe==1.8.1->csvmatch)
Downloading/unpacking rlr>=2.4.3 (from dedupe==1.8.1->csvmatch)
  Downloading rlr-2.4.3-py2.py3-none-any.whl
Downloading/unpacking numpy>=1.12 (from dedupe==1.8.1->csvmatch)
  Downloading numpy-1.14.1.zip (4.9MB): 4.9MB downloaded
  Running setup.py (path:/tmp/pip_build_ray/numpy/setup.py) egg_info for package numpy
    Running from numpy source directory.
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'python_requires'
      warnings.warn(msg)
    
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '*.pyo' found anywhere in distribution
    warning: no previously-included files matching '*.pyd' found anywhere in distribution
    warning: no previously-included files matching '*.swp' found anywhere in distribution
    warning: no previously-included files matching '*.bak' found anywhere in distribution
    warning: no previously-included files matching '*~' found anywhere in distribution
Downloading/unpacking highered>=0.2.0 (from dedupe==1.8.1->csvmatch)
  Downloading highered-0.2.1-py2.py3-none-any.whl
Downloading/unpacking simplecosine>=1.2 (from dedupe==1.8.1->csvmatch)
  Downloading simplecosine-1.2-py2.py3-none-any.whl
Downloading/unpacking haversine>=0.4.1 (from dedupe==1.8.1->csvmatch)
  Downloading haversine-0.4.5.tar.gz
  Running setup.py (path:/tmp/pip_build_ray/haversine/setup.py) egg_info for package haversine
    
Downloading/unpacking BTrees>=4.1.4 (from dedupe==1.8.1->csvmatch)
  Downloading BTrees-4.4.1.tar.gz (166kB): 166kB downloaded
  Running setup.py (path:/tmp/pip_build_ray/BTrees/setup.py) egg_info for package BTrees
    warning: no previously-included files matching '*.dll' found anywhere in distribution
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '*.pyo' found anywhere in distribution
    warning: no previously-included files matching '*.so' found anywhere in distribution
    warning: no previously-included files matching 'coverage.xml' found anywhere in distribution
    no previously-included directories found matching 'docs/_build'
    no previously-included directories found matching 'persistent/__pycache__'
    In file included from persistent/cPersistence.h:18:0,
                     from persistent/cPersistence.c:19:
    persistent/_compat.h:18:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_ray/BTrees/setup.py", line 158, in <module>
        """
      File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 239, in __init__
        self.fetch_build_eggs(attrs.pop('setup_requires'))
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 264, in fetch_build_eggs
        replace_conflicting=True
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 620, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 858, in best_match
        return self.obtain(req, installer) # try and download/install
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 870, in obtain
        return installer(requirement)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 314, in fetch_build_egg
        return cmd.easy_install(req)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 616, in easy_install
        return self.install_item(spec, dist.location, tmpdir, deps)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 646, in install_item
        dists = self.install_eggs(spec, download, tmpdir)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 834, in install_eggs
        return self.build_and_install(setup_script, setup_base)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1040, in build_and_install
        self.run_setup(setup_script, setup_base, args)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1028, in run_setup
        raise DistutilsError("Setup script exited with %s" % (v.args[0],))
    distutils.errors.DistutilsError: Setup script exited with error: command 'i686-linux-gnu-gcc' failed with exit status 1
    Complete output from command python setup.py egg_info:
    warning: no previously-included files matching '*.dll' found anywhere in distribution
    warning: no previously-included files matching '*.pyc' found anywhere in distribution
    warning: no previously-included files matching '*.pyo' found anywhere in distribution
    warning: no previously-included files matching '*.so' found anywhere in distribution
    warning: no previously-included files matching 'coverage.xml' found anywhere in distribution
    no previously-included directories found matching 'docs/_build'
    no previously-included directories found matching 'persistent/__pycache__'
    In file included from persistent/cPersistence.h:18:0,
                     from persistent/cPersistence.c:19:
    persistent/_compat.h:18:20: fatal error: Python.h: No such file or directory
     #include "Python.h"
                        ^
    compilation terminated.
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_ray/BTrees/setup.py", line 158, in <module>
        """
      File "/usr/lib/python2.7/distutils/core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 239, in __init__
        self.fetch_build_eggs(attrs.pop('setup_requires'))
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 264, in fetch_build_eggs
        replace_conflicting=True
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 620, in resolve
        dist = best[req.key] = env.best_match(req, ws, installer)
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 858, in best_match
        return self.obtain(req, installer) # try and download/install
      File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 870, in obtain
        return installer(requirement)
      File "/usr/lib/python2.7/dist-packages/setuptools/dist.py", line 314, in fetch_build_egg
        return cmd.easy_install(req)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 616, in easy_install
        return self.install_item(spec, dist.location, tmpdir, deps)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 646, in install_item
        dists = self.install_eggs(spec, download, tmpdir)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 834, in install_eggs
        return self.build_and_install(setup_script, setup_base)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1040, in build_and_install
        self.run_setup(setup_script, setup_base, args)
      File "/usr/lib/python2.7/dist-packages/setuptools/command/easy_install.py", line 1028, in run_setup
        raise DistutilsError("Setup script exited with %s" % (v.args[0],))
    distutils.errors.DistutilsError: Setup script exited with error: command 'i686-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ray/BTrees
Storing debug log for failure in /home/ray/.pip/pip.log
