valentine's People

Contributors

andraionescu, archer6621, asteriosk, chrisk21, dependabot[bot], jorgesia, kpsarakis, thanostsiamis

valentine's Issues

Run the jobs on Slurm

Given a directory of configuration files, each describing a job, run the jobs in parallel using the Slurm workload manager.
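A rough sketch of one way to do this; the configs directory, the *.json pattern, and the run_job.sh batch script are assumptions for illustration, not part of the repository:

    import subprocess
    from pathlib import Path

    # Submit one Slurm job per configuration file; Slurm then schedules
    # the submitted jobs in parallel across the cluster.
    for config in sorted(Path("configs").glob("*.json")):
        subprocess.run(["sbatch", "run_job.sh", str(config)], check=True)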

Add CUPID to the framework

Add the CUPID implementation to the framework. See the wiki for instructions on how to integrate an algorithm into the framework.

API: Metrics object with functions

It would be nice to have a matches object and do something along these lines:

...
matches.get_one_to_one()

...
matches.metrics(ground_truth)

...
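A minimal sketch of what such a wrapper could look like; the class name MatcherResults and the returned metric names are assumptions for illustration, not the current API:

    class MatcherResults(dict):
        # Maps (source_column, target_column) pairs to similarity scores.

        def get_one_to_one(self):
            # Greedily keep the best-scoring match per column, so every
            # source and target column appears in at most one pair.
            taken_src, taken_tgt, kept = set(), set(), {}
            for (src, tgt), score in sorted(self.items(), key=lambda kv: -kv[1]):
                if src not in taken_src and tgt not in taken_tgt:
                    kept[(src, tgt)] = score
                    taken_src.add(src)
                    taken_tgt.add(tgt)
            return MatcherResults(kept)

        def metrics(self, ground_truth):
            # ground_truth: a set of (source_column, target_column) pairs.
            predicted = set(self.keys())
            tp = len(predicted & ground_truth)
            precision = tp / len(predicted) if predicted else 0.0
            recall = tp / len(ground_truth) if ground_truth else 0.0
            f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
            return {"precision": precision, "recall": recall, "f1": f1}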

Add batching

It would be nice to be able to compare two lists of datasets, reusing intermediate data structures to speed up the processing instead of restarting the computation for each unique pair of datasets.
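A rough sketch of the batching idea; preprocess and score_pair are hypothetical stand-ins for a matcher's per-dataset preparation and pairwise scoring steps:

    def match_all(sources, targets, preprocess, score_pair):
        # Build each dataset's intermediate structure exactly once...
        src_cache = [preprocess(s) for s in sources]
        tgt_cache = [preprocess(t) for t in targets]
        # ...then score every cross pair from the cached artifacts.
        return {(i, j): score_pair(s, t)
                for i, s in enumerate(src_cache)
                for j, t in enumerate(tgt_cache)}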

Cupid tests fail on master due to nltk resource download issue

Tested on MacOS 13.6 (Ventura)

This is the log generated by unittest upon running python3 -m unittest discover:

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading omw-1.4: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>
E..........
======================================================================
ERROR: test_cupid (tests.test_algorithms.TestAlgorithms.test_cupid)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 27, in normalization
    tokens = nltk.word_tokenize(element)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/wisguest/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.11/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/wisguest/Repositories/valentine/tests/test_algorithms.py", line 28, in test_cupid
    matches_cu_matcher = cu_matcher.get_matches(d1, d2)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 36, in get_matches
    self.__add_data("DB__"+source_input.name, source_input)
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 56, in __add_data
    self.__schemata[schema_name].add_node(table_name=table.name, table_guid=table.unique_identifier,
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/schema_tree.py", line 24, in add_node
    self.nodes[table_name].tokens = normalization(table_name).tokens
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 33, in normalization
    tokens = nltk.word_tokenize(element)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html
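A workaround commonly suggested for this macOS certificate problem (offered as an assumption, not a vetted fix) is to run the Install Certificates.command script that ships with the python.org installer, or to pre-download the resources the log shows with SSL verification disabled:

    import ssl
    import nltk

    # Use an unverified HTTPS context only for this one-off download.
    try:
        _unverified = ssl._create_unverified_context
    except AttributeError:
        pass  # very old Pythons do not verify certificates by default
    else:
        ssl._create_default_https_context = _unverified

    # The four resources the log above shows Cupid failing to load.
    for resource in ("punkt", "omw-1.4", "stopwords", "wordnet"):
        nltk.download(resource)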

Add embedding-based methods

Add methods that use column vector representations (embeddings) and the cosine similarity between them to determine matches.
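A minimal sketch of the idea; the embed function is an assumption, standing in for any model that maps a column's values to a fixed-size vector:

    import numpy as np

    def cosine_similarity(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def embedding_matches(cols1, cols2, embed, threshold=0.5):
        # cols1/cols2 map column names to their values; keep every cross
        # pair whose embedding similarity clears the threshold.
        vecs1 = {name: embed(values) for name, values in cols1.items()}
        vecs2 = {name: embed(values) for name, values in cols2.items()}
        return {(n1, n2): sim
                for n1, v1 in vecs1.items()
                for n2, v2 in vecs2.items()
                if (sim := cosine_similarity(v1, v2)) >= threshold}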

get_matches for distribution based matching fails with error "'charmap' codec can't encode character..."

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Geodata/location_cities_countries/cities.csv", encoding='utf-8')
df2 = pd.read_csv("data/Geodata/location_cities_countries/countries.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u0103' in position 4: character maps to <undefined>
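For context, an assumption about the likely trigger (not a confirmed diagnosis): on Windows, opening a file for writing without an explicit encoding defaults to the legacy charmap codec (cp1252), which cannot represent characters such as '\u0103' (ă):

    # Minimal reproduction of this class of error under a Windows locale:
    with open("out.txt", "w") as f:  # implicit charmap/cp1252 on Windows
        f.write("\u0103")            # raises UnicodeEncodeError there
    # Passing encoding="utf-8" to open() avoids the failure.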

The csv files come from here: https://www.kaggle.com/datasets/liewyousheng/geolocation

I'm on Windows 10; I've forked valentine and am running it locally, up to date with valentine:master as of today. I haven't made any changes to the DistributionBased code.

Does Valentine currently support SemProp and EmbDI?

I saw "we implement and integrate six schema matching algorithms [14]–[19] and our own baseline method, and adapt them to the needs of dataset discovery" in your paper. At present, Valentine does not seem to support these two algorithms.

JaccardLeven with process_num=10 has errors?

Hi folks, nice package. I tried the following and am curious whether you have any ideas on it:

  1. I tried it on two DFs with 200k rows and 10 columns. It didn't converge; I had to use df.sample(4000) instead to cut the processing down to 10 minutes on a Mac Mini with 32 GB RAM and a 3 GHz 6-core i5. How long should I expect such a run to take? The two files (13 MB and 2 MB) are in https://drive.google.com/drive/folders/1BIX240k6GEouT5SrjY9pWDaT7X6_QkY4?usp=sharing.

  2. I'd interpreted the comment in JaccardLeven to mean that this spawns 10 processes for a speedup, but it raised an error:

matcher = valentine.algorithms.JaccardLevenMatcher(0.2, 10)

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
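The usual fix for this RuntimeError on macOS and Windows, where multiprocessing starts children with spawn rather than fork, is to guard the entry point so child processes do not re-execute the matching code on import. A sketch under that assumption (file names are placeholders):

    import pandas as pd
    from valentine import valentine_match
    from valentine.algorithms import JaccardLevenMatcher

    if __name__ == '__main__':
        df1 = pd.read_csv("table1.csv")  # placeholder file names
        df2 = pd.read_csv("table2.csv")
        matcher = JaccardLevenMatcher(0.2, 10)  # threshold, process count
        matches = valentine_match(df1, df2, matcher)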

similarity_flooding case where e[1].long_name=None in __get_attribute_tuple

Hi Valentine authors!

I am having trouble with a bug that seems to be coming from Valentine, but I am unsure:

  • in similarity_flooding.py, is it expected that long_name may sometimes be None? (this is causing my experiments to crash)

  • a possibly naive question: should column_name be e[0].long_name instead?

    def __get_attribute_tuple(self, node):
        column_name = None
        if node in self.__graph1.nodes():
            for e in self.__graph1.out_edges(node):
                links = self.__graph1.get_edge_data(e[0], e[1])
                if links.get('label') == "name":
                    column_name = e[1].long_name  # <-- long_name is None here
        else:
            for e in self.__graph2.out_edges(node):
                links = self.__graph2.get_edge_data(e[0], e[1])
                if links.get('label') == "name":
                    column_name = e[1].long_name
        return column_name

Incorporating structural schema information

Dear valentine devs,
I was wondering about losing the structural information found in JSON / dictionaries when normalizing them into pandas DataFrames. To my understanding, COMA 3.0 would usually use this information in the matching process to improve the results. What do you think about supporting a nested (JSON) data source? I guess one would need to transform it to XML to be able to use it with COMA, etc.

All the best, and thanks for the great work!

Confusing Coma API

The Coma algorithm can use either only schema information or both schema and instance information.

Currently, we ask users to specify the strategy parameter to either COMA_OPT (schema) or COMA_OPT_INST (schema + instances) which can be difficult to understand.

It would be much easier if we replaced the strategy param with a use_instances boolean flag.
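A sketch of the proposed interface (hypothetical, not the current API), mapping the single boolean onto the two existing strategies:

    class Coma:
        def __init__(self, use_instances: bool = False):
            # Schema-only by default; schema + instances when requested.
            self.strategy = "COMA_OPT_INST" if use_instances else "COMA_OPT"

    matcher = Coma(use_instances=True)  # instead of Coma(strategy="COMA_OPT_INST")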

wrong column pkl filename with DistributionBased matching

Hello, and thank you for the library! 🍾

When using DistributionBased matching, I have the following use case:

  • I create an instance of the matcher. I have two source tables (source_1 and source_2), and one target table target.
  • I call matcher.get_matches(source_1, target). Pickle files for columns of source_1 and target tables are written to e.g., /tmp/tmpkpakbdjz, and the same files are read back with clustering_utils.get_column_from_store. Matches are generated.
  • I call matcher.get_matches(source_2, target). Pickle files for columns of source_2 and target tables are written to e.g., /tmp/tmp41gf90n2. HOWEVER, clustering_utils.get_column_from_store attempts to read pkl files created for columns of source_1 from directory /tmp/tmp41gf90n2
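A minimal reproduction of the sequence above, using valentine_match as in the other reports; the file names are placeholders, and reusing the single matcher instance across calls is the key ingredient:

    import pandas as pd
    from valentine import valentine_match
    from valentine.algorithms import DistributionBased

    source_1 = pd.read_csv("source_1.csv")  # placeholder file names
    source_2 = pd.read_csv("source_2.csv")
    target = pd.read_csv("target.csv")

    matcher = DistributionBased()
    first = valentine_match(source_1, target, matcher)   # works as expected
    second = valentine_match(source_2, target, matcher)  # reads source_1's
    # pickled columns from source_2's fresh temp directory and fails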

get_matches for distribution based matching fails with error "len(ranks) < 2"

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Geographie_00164.csv", encoding='utf-8')
df2 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Stationsname_00164.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:

RuntimeError: len(ranks) < 2

The csv files come from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/weather_phenomena/recent/. I converted the data to CSVs; my CSVs are attached:
Metadaten_Geographie_00164.csv
Metadaten_Stationsname_00164.csv
(Overview and explanation of data: https://www.dwd.de/EN/ourservices/cdc/cdc_ueberblick-klimadaten_en.html)

I'm on Windows 10; I've forked valentine and am running it locally, up to date with valentine:master as of today. I haven't made any changes to the DistributionBased code.

Failed installation on Windows

C:\Users\akatsifodimos>pip install valentine
Collecting valentine
  Using cached valentine-0.1.1.tar.gz (38.2 MB)
Collecting numpy<2.0,>=1.21
  Using cached numpy-1.21.2.zip (10.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing wheel metadata ... done
Collecting valentine
  Using cached valentine-0.1.0.tar.gz (38.2 MB)
ERROR: Cannot install valentine==0.1.0 and valentine==0.1.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    valentine 0.1.1 depends on scipy<1.8 and >=1.7
    valentine 0.1.0 depends on scipy<1.8 and >=1.7

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Run time of Cupid()

Hello,

I am trying to match the feature names in two example datasets with 100 rows and 300 columns each. It has been running for more than 20 minutes and still hasn't produced output. Is there anything wrong?

FileNotFoundError with matches command

Hello

I am trying to execute the following lines of code, but I get a FileNotFoundError:

import valentine
from valentine import valentine_match

matcher = valentine.algorithms.Coma(strategy="COMA_OPT")
matches = valentine_match(df1, df2, matcher)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\__init__.py", line 20, in valentine_match
    matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
  File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 32, in get_matches
    self.__run_coma_jar(s_f_name, t_f_name, coma_output_file, tmp_folder_path)
  File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 49, in __run_coma_jar
    subprocess.call(['java', f'-Xmx{self.__java_XmX}',
  File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 1311, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

How could this be resolved? Is it a Windows issue or an issue with the valentine library itself?
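Since the traceback shows the failure inside subprocess.call(['java', ...]), one quick diagnostic (a guess at the cause, not a confirmed answer) is to check whether a Java runtime is on the PATH, which Coma needs in order to run its jar:

    import shutil

    # None here means Windows cannot find a java executable, which would
    # produce exactly this [WinError 2] from subprocess.
    print(shutil.which("java"))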

Write output to file

Each run's output should be written to a JSON file with the following structure:

{
    "name": "a unique identifier for the specific run",
    "matches": "a dictionary that contains the output of the algorithm",
    "metrics": "a dictionary with the metrics and their values"
}
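A minimal sketch of writing such a file; the function name and arguments are illustrative, not part of the framework:

    import json

    def write_run_output(path, name, matches, metrics):
        # Note: non-string keys (e.g. (table, column) tuples) must be
        # converted before dumping, since JSON object keys are strings.
        run = {"name": name, "matches": matches, "metrics": metrics}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(run, f, indent=2)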

get_matches for distribution based matching fails due to pickled files not refound

Hello,

I ran Valentine with the DistributionBased strategy

import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match

df1 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/book_titles.csv", encoding='utf-8')
df2 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/collaborative_books_df.csv", encoding='utf-8')

matches = valentine_match(df1, df2, DistributionBased())

And I get this error:

[WinError 2] The system cannot find the file specified: 'C:\Users\xxx\AppData\Local\Temp\tmptv_b_0v6\table1title.pkl'

The pickled rank files cannot be found again; I can't find the files manually in my AppData folder either.
The csv files come from here: https://www.kaggle.com/datasets/thedevastator/book-recommender-system-itembased

I'm on Windows 10; I've forked valentine and am running it locally, up to date with valentine:master as of today. I haven't made any changes to the DistributionBased code.
