delftdata / valentine
A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.
License: Apache License 2.0
Given a directory of configuration files, each describing a job, run these jobs in parallel using the Slurm workload manager.
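A minimal sketch of such a runner, assuming sbatch is available and that the configs/ directory and run_job.sh script are hypothetical names:

import subprocess
from pathlib import Path

# Submit one Slurm job per configuration file in the directory.
for config in sorted(Path("configs").glob("*.json")):
    subprocess.run(["sbatch", "run_job.sh", str(config)], check=True)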
Add the CUPID implementation to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
Add COMA 3.0 to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
It would be nice to have a matches object and do something along these lines:
...
matches.get_one_to_one()
...
matches.metrics(ground_truth)
...
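A minimal sketch of what such a Matches object could look like (a hypothetical API built around the calls above; the key format is an assumption based on the valentine_match output shown elsewhere on this page):

class Matches:
    def __init__(self, matches):
        # matches: {((table_1, col_1), (table_2, col_2)): score}
        self._matches = matches

    def get_one_to_one(self):
        # Greedily keep the best-scoring, non-conflicting pairs.
        taken_src, taken_tgt, kept = set(), set(), {}
        for pair, score in sorted(self._matches.items(), key=lambda kv: -kv[1]):
            src, tgt = pair
            if src not in taken_src and tgt not in taken_tgt:
                kept[pair] = score
                taken_src.add(src)
                taken_tgt.add(tgt)
        return kept

    def metrics(self, ground_truth):
        # Precision/recall against ground-truth pairs in the same key format.
        predicted, truth = set(self._matches), set(ground_truth)
        tp = len(predicted & truth)
        return {"precision": tp / len(predicted) if predicted else 0.0,
                "recall": tp / len(truth) if truth else 0.0}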
It would be nice to have a single requirements.txt file for the entire framework
We want to measure the execution time of each algorithm
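One way this could be measured, as a sketch built on the valentine_match call shown elsewhere on this page:

import time
from valentine import valentine_match

def timed_match(matcher, df1, df2):
    # Run one matcher and report wall-clock time.
    start = time.perf_counter()
    matches = valentine_match(df1, df2, matcher)
    print(f"{matcher.__class__.__name__}: {time.perf_counter() - start:.2f}s")
    return matches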
Python 3.11 was released on 24/10/2022. A new release is needed to support it.
It would be nice to be able to compare two lists of datasets, reusing intermediate data structures to speed up the processing instead of restarting the computation for each unique pair of datasets.
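A sketch of the idea, with preprocess and compare as hypothetical stand-ins for a matcher's per-dataset and pairwise stages:

def compare_all(datasets_a, datasets_b, preprocess, compare):
    # Preprocess every dataset exactly once, then score all pairs
    # from the cached representations.
    cache_a = [preprocess(d) for d in datasets_a]
    cache_b = [preprocess(d) for d in datasets_b]
    return {(i, j): compare(pa, pb)
            for i, pa in enumerate(cache_a)
            for j, pb in enumerate(cache_b)}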
Tested on macOS 13.6 (Ventura).
This is the log generated by unittest upon running python3 -m unittest discover:
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading omw-1.4: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
E..........
======================================================================
ERROR: test_cupid (tests.test_algorithms.TestAlgorithms.test_cupid)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 27, in normalization
tokens = nltk.word_tokenize(element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/wisguest/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/share/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/wisguest/Repositories/valentine/tests/test_algorithms.py", line 28, in test_cupid
matches_cu_matcher = cu_matcher.get_matches(d1, d2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 36, in get_matches
self.__add_data("DB__"+source_input.name, source_input)
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 56, in __add_data
self.__schemata[schema_name].add_node(table_name=table.name, table_guid=table.unique_identifier,
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/schema_tree.py", line 24, in add_node
self.nodes[table_name].tokens = normalization(table_name).tokens
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 33, in normalization
tokens = nltk.word_tokenize(element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
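A plausible fix for the failures above: the SSL errors prevented NLTK from fetching its data, so once the certificate issue is addressed (with the python.org installer on macOS, e.g. by running its Install Certificates.command), downloading the four resources the log reports as missing should let the test run:

import nltk

# Fetch the resources the log reports as missing.
for resource in ("punkt", "omw-1.4", "stopwords", "wordnet"):
    nltk.download(resource)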
Hello,
I ran Valentine with the Coma matcher:
valentine_match(df1, df2, Coma(strategy="COMA_OPT"))
on a dataset that contains columns with special characters like:
1:00-2:00AM
(that is actually the name of the column)
and the code failed here:
It would be incredibly useful to return, for each column in df1, the top-n column matches in df2.
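A minimal sketch, assuming matches is the score dictionary returned by valentine_match, keyed by ((table_1, col_1), (table_2, col_2)) pairs:

from collections import defaultdict

def top_n_matches(matches, n=3):
    # Group scores by source column and keep the n best target columns.
    per_column = defaultdict(list)
    for ((_, col1), (_, col2)), score in matches.items():
        per_column[col1].append((col2, score))
    return {col: sorted(cands, key=lambda c: c[1], reverse=True)[:n]
            for col, cands in per_column.items()}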
Given a list of directories with different experiments and a set of algorithms, create a script that generates the configuration files required for running them as a job in the framework.
Add methods that utilize column vector representations and cosine similarity among them to determine matches.
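A sketch of the idea, where the vector representations (e.g. column embeddings) are assumed to be computed elsewhere:

import numpy as np

def cosine_match(vecs1, vecs2, threshold=0.5):
    # vecs1/vecs2 map column names to vector representations.
    matches = {}
    for c1, v1 in vecs1.items():
        for c2, v2 in vecs2.items():
            sim = float(np.dot(v1, v2) /
                        (np.linalg.norm(v1) * np.linalg.norm(v2)))
            if sim >= threshold:
                matches[(c1, c2)] = sim
    return matches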
Refactor configuration generation and output filenames to remove special characters from the file name
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Geodata/location_cities_countries/cities.csv", encoding='utf-8')
df2 = pd.read_csv("data/Geodata/location_cities_countries/countries.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
(UnicodeEncodeError: 'charmap' codec can't encode character '\u0103' in position 4: character maps to <undefined>)
The csv files come from here: https://www.kaggle.com/datasets/liewyousheng/geolocation
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
I saw "we implement and integrate six schema matching algorithms [14]–[19] and our own baseline method, and adapt them to the needs of dataset discovery" in your paper. At present, Valentine does not seem to support these two algorithms.
Hi folks, nice package! I tried the below and am curious if you have any ideas on this.
I tried it on two DFs with 200k rows and 10 columns. It didn't converge, so I had to use df.sample(4000) instead to cut the processing down to 10 minutes on a Mac Mini with 32GB RAM and a 3GHz 6-core i5. How long should I expect such a run to take? The two files (13MB and 2MB) are in https://drive.google.com/drive/folders/1BIX240k6GEouT5SrjY9pWDaT7X6_QkY4?usp=sharing.
I interpreted your comment in JaccardLevenMatcher to mean that this spawns 10 processes for a speedup, but it raised an error:
matcher = valentine.algorithms.JaccardLevenMatcher(0.2, 10)
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
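A likely workaround, following the idiom the error message itself suggests (the file paths here are hypothetical):

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import JaccardLevenMatcher

if __name__ == '__main__':
    # The guard keeps spawned child processes (macOS, Windows) from
    # re-executing the module body.
    df1 = pd.read_csv("first.csv")
    df2 = pd.read_csv("second.csv")
    matcher = JaccardLevenMatcher(0.2, 10)  # same arguments as in the report
    matches = valentine_match(df1, df2, matcher)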
Hi Valentine authors!
I am having trouble with a bug that seems to be coming from Valentine, but I am unsure: in similarity_flooding.py, is it expected that long_name may sometimes be None? (This is causing my experiments to crash.)
Dumbish question: is it possible that column_name should be = e[0].long_name?
def __get_attribute_tuple(self, node):
    column_name = None
    if node in self.__graph1.nodes():
        for e in self.__graph1.out_edges(node):
            links = self.__graph1.get_edge_data(e[0], e[1])
            if links.get('label') == "name":
                column_name = e[1].long_name  ##### LONG_NAME is None
    else:
        for e in self.__graph2.out_edges(node):
            links = self.__graph2.get_edge_data(e[0], e[1])
            if links.get('label') == "name":
                column_name = e[1].long_name
    return column_name
Dear valentine devs,
I was wondering about losing the structural information found in JSON / dictionaries when normalizing them into pandas data frames. To my understanding, COMA 3.0 would usually use this information in the matching process to improve the results. What do you think about supporting a nested (JSON) data source? I guess one would need to transform it to XML to be able to use it with COMA etc.
All the best, and thanks for the great work!
The Coma algorithm can use either only schema information or both schema and instance information.
Currently, we ask users to set the strategy parameter to either COMA_OPT (schema only) or COMA_OPT_INST (schema + instances), which can be difficult to understand. It would be much easier if we replaced the strategy param with a use_instances boolean flag.
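For illustration, the change in user-facing code (the use_instances flag is the proposal, not an existing parameter):

from valentine.algorithms import Coma

# Current API:
matcher = Coma(strategy="COMA_OPT_INST")

# Proposed API (hypothetical):
matcher = Coma(use_instances=True)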
Hello, and thank you for the library! 🍾
When using DistributionBased matching, I have the following use case: two source tables (source_1 and source_2) and one target table (target).
1. matcher.get_matches(source_1, target): pickle files for the columns of the source_1 and target tables are written to e.g. /tmp/tmpkpakbdjz, and the same files are read back with clustering_utils.get_column_from_store. Matches are generated.
2. matcher.get_matches(source_2, target): pickle files for the columns of the source_2 and target tables are written to e.g. /tmp/tmp41gf90n2. HOWEVER, clustering_utils.get_column_from_store attempts to read the pkl files created for the columns of source_1 from the directory /tmp/tmp41gf90n2.
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Geographie_00164.csv", encoding='utf-8')
df2 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Stationsname_00164.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
(RuntimeError: len(ranks) < 2)
The csv files come from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/weather_phenomena/recent/. I converted the data to csvs. I'll attach my csvs:
Metadaten_Geographie_00164.csv
Metadaten_Stationsname_00164.csv
(Overview and explanation of data: https://www.dwd.de/EN/ourservices/cdc/cdc_ueberblick-klimadaten_en.html)
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
Add seeping semantics to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
There is some duplicate code and there are naming issues in the similarity flooding implementation. If there is time, it would be nice to fix those.
I would like to be able to load a dataframe, and then add noise to specific columns of that dataset.
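A minimal sketch of such a helper (not part of Valentine; Gaussian noise on numeric columns is one possible noise model):

import numpy as np

def add_noise(df, columns, scale=0.1, seed=None):
    # Return a copy of df with Gaussian noise added to the given columns.
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    for col in columns:
        noisy[col] = noisy[col] + rng.normal(0.0, scale, size=len(df))
    return noisy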
Tested on macOS 13.6 (Ventura).
This also affects other parts of the test suite that may use COMA, causing assertions to fail.
Probably the right course of action would be to throw an exception if COMA cannot be run properly on the system, which can then be caught in the tests to make the user aware of this.
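One possible shape for this, as a sketch (the error class and the Java check are hypothetical, not Valentine's actual API):

import shutil
import unittest

class ComaUnavailableError(RuntimeError):
    # Raised when the COMA jar cannot be run, e.g. no Java on PATH.
    pass

def check_coma_runnable():
    if shutil.which("java") is None:
        raise ComaUnavailableError("Java executable not found; COMA cannot run.")

class TestComa(unittest.TestCase):
    def test_coma(self):
        try:
            check_coma_runnable()
        except ComaUnavailableError as e:
            self.skipTest(str(e))  # make the user aware instead of failing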
C:\Users\akatsifodimos>pip install valentine
Collecting valentine
Using cached valentine-0.1.1.tar.gz (38.2 MB)
Collecting numpy<2.0,>=1.21
Using cached numpy-1.21.2.zip (10.3 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting valentine
Using cached valentine-0.1.0.tar.gz (38.2 MB)
ERROR: Cannot install valentine==0.1.0 and valentine==0.1.1 because these package versions have conflicting dependencies.
The conflict is caused by:
valentine 0.1.1 depends on scipy<1.8 and >=1.7
valentine 0.1.0 depends on scipy<1.8 and >=1.7
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
Hello
I am trying to execute the following 2 lines of code but I get a FileNotFoundError:
import valentine
matcher = valentine.algorithms.Coma(strategy="COMA_OPT")
matches = valentine_match(df1, df2, matcher)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine_init_.py", line 20, in valentine_match
matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 32, in get_matches
self.__run_coma_jar(s_f_name, t_f_name, coma_output_file, tmp_folder_path)
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 49, in __run_coma_jar
subprocess.call(['java', f'-Xmx{self.__java_XmX}',
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 858, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
How could this be resolved? Is it a windows issue or an issue with the valentine library itself?
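Since the traceback shows subprocess.call(['java', ...]) failing, WinError 2 most likely means no Java executable is on PATH. A quick check:

import shutil

# None means Java is not on PATH; install a JRE or add java to PATH.
print(shutil.which("java"))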
Each run's output should be written to a JSON file with the following structure:
{
  "name": "a unique identifier for the specific run",
  "matches": "a dictionary that contains the output of the algorithm",
  "metrics": "a dictionary with the metrics and their values"
}
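For illustration, a minimal sketch (the run name and output path are hypothetical) that writes one run in this structure:

import json

run_output = {
    "name": "run-0001",  # unique identifier for this run
    "matches": {},       # the algorithm's output
    "metrics": {},       # metric name -> value
}
with open("run-0001.json", "w") as f:
    json.dump(run_output, f, indent=2)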
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/book_titles.csv", encoding='utf-8')
df2 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/collaborative_books_df.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
([WinError 2] The system cannot find the file specified: 'C:\Users\xxx\AppData\Local\Temp\tmptv_b_0v6\table1title.pkl')
The pickled rank files cannot be found again; I can't find the files manually in my AppData folder either.
The csv files come from here: https://www.kaggle.com/datasets/thedevastator/book-recommender-system-itembased
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
Enrich the JaccardLevenMatcher with more string similarity measures apart from Levenshtein distance. Rename the method.
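As one example of an additional measure such a matcher could offer, a sketch using the standard library's difflib ratio:

from difflib import SequenceMatcher

def ratio_similarity(a: str, b: str) -> float:
    # Similarity in [0, 1] from difflib's matching-blocks ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()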
If I want to run the SemProp algorithm, what do I need to do to prepare? And which script should I run for the experiment?
Is it required to use the Docker image provided by the SemProp team themselves?
Thanks for your patience!
Add the similarity flooding implementation to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.