delftdata / valentine
A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.
License: Apache License 2.0
Given a directory of configuration files, each describing a job, run these jobs in parallel using the Slurm workload manager.
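A minimal sketch of such a runner, assuming sbatch is available and that the configs/ directory and run_job.sh script are hypothetical names:

import subprocess
from pathlib import Path

# Submit one Slurm job per configuration file in the directory.
for config in sorted(Path("configs").glob("*.json")):
    subprocess.run(["sbatch", "run_job.sh", str(config)], check=True)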
Add the CUPID implementation to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
Add COMA 3.0 to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
It would be nice to have a matches object and do something along these lines:
...
matches.get_one_to_one()
...
matches.metrics(ground_truth)
...
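A minimal sketch of what such a Matches object could look like (a hypothetical API built around the calls above; the key format is an assumption based on the valentine_match output shown elsewhere on this page):

class Matches:
    def __init__(self, matches):
        # matches: {((table_1, col_1), (table_2, col_2)): score}
        self._matches = matches

    def get_one_to_one(self):
        # Greedily keep the best-scoring, non-conflicting pairs.
        taken_src, taken_tgt, kept = set(), set(), {}
        for pair, score in sorted(self._matches.items(), key=lambda kv: -kv[1]):
            src, tgt = pair
            if src not in taken_src and tgt not in taken_tgt:
                kept[pair] = score
                taken_src.add(src)
                taken_tgt.add(tgt)
        return kept

    def metrics(self, ground_truth):
        # Precision/recall against ground-truth pairs in the same key format.
        predicted, truth = set(self._matches), set(ground_truth)
        tp = len(predicted & truth)
        return {"precision": tp / len(predicted) if predicted else 0.0,
                "recall": tp / len(truth) if truth else 0.0}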
It would be nice to have a single requirements.txt file for the entire framework
We want to measure the execution time of each algorithm
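One way this could be measured, as a sketch built on the valentine_match call shown elsewhere on this page:

import time
from valentine import valentine_match

def timed_match(matcher, df1, df2):
    # Run one matcher and report wall-clock time.
    start = time.perf_counter()
    matches = valentine_match(df1, df2, matcher)
    print(f"{matcher.__class__.__name__}: {time.perf_counter() - start:.2f}s")
    return matches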
Python 3.11 was released on 24/10/2022. A new release is needed to support it.
It would be nice to be able to compare two lists of datasets, reusing intermediate data structures to speed up the processing instead of restarting the computation for each unique pair of datasets.
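A sketch of the idea, with preprocess and compare as hypothetical stand-ins for a matcher's per-dataset and pairwise stages:

def compare_all(datasets_a, datasets_b, preprocess, compare):
    # Preprocess every dataset exactly once, then score all pairs
    # from the cached representations.
    cache_a = [preprocess(d) for d in datasets_a]
    cache_b = [preprocess(d) for d in datasets_b]
    return {(i, j): compare(pa, pb)
            for i, pa in enumerate(cache_a)
            for j, pb in enumerate(cache_b)}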
Tested on macOS 13.6 (Ventura).
This is the log generated by unittest upon running python3 -m unittest discover:
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading omw-1.4: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
E..........
======================================================================
ERROR: test_cupid (tests.test_algorithms.TestAlgorithms.test_cupid)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 27, in normalization
tokens = nltk.word_tokenize(element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/wisguest/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/share/nltk_data'
- '/Library/Frameworks/Python.framework/Versions/3.11/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/wisguest/Repositories/valentine/tests/test_algorithms.py", line 28, in test_cupid
matches_cu_matcher = cu_matcher.get_matches(d1, d2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 36, in get_matches
self.__add_data("DB__"+source_input.name, source_input)
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/cupid_model.py", line 56, in __add_data
self.__schemata[schema_name].add_node(table_name=table.name, table_guid=table.unique_identifier,
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/schema_tree.py", line 24, in add_node
self.nodes[table_name].tokens = normalization(table_name).tokens
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/wisguest/Repositories/valentine/valentine/algorithms/cupid/linguistic_matching.py", line 33, in normalization
tokens = nltk.word_tokenize(element)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
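A plausible fix for the failures above: the SSL errors prevented NLTK from fetching its data, so once the certificate issue is addressed (with the python.org installer on macOS, e.g. by running its Install Certificates.command), downloading the four resources the log reports as missing should let the test run:

import nltk

# Fetch the resources the log reports as missing.
for resource in ("punkt", "omw-1.4", "stopwords", "wordnet"):
    nltk.download(resource)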
Hello,
I ran Valentine with the Coma matcher:
valentine_match(df1, df2, Coma(strategy="COMA_OPT"))
on a dataset that contains columns with special characters like:
1:00-2:00AM
(that is actually the name of the column)
and the code failed here:
It would be incredibly useful to return, for each column in df1, the top-n column matches in df2.
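A minimal sketch, assuming matches is the score dictionary returned by valentine_match, keyed by ((table_1, col_1), (table_2, col_2)) pairs:

from collections import defaultdict

def top_n_matches(matches, n=3):
    # Group scores by source column and keep the n best target columns.
    per_column = defaultdict(list)
    for ((_, col1), (_, col2)), score in matches.items():
        per_column[col1].append((col2, score))
    return {col: sorted(cands, key=lambda c: c[1], reverse=True)[:n]
            for col, cands in per_column.items()}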
Given a list of directories with different experiments and a set of algorithms, create a script that generates the configuration files required for running them as a job in the framework.
Add methods that utilize column vector representations and cosine similarity among them to determine matches.
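A sketch of the idea, where the vector representations (e.g. column embeddings) are assumed to be computed elsewhere:

import numpy as np

def cosine_match(vecs1, vecs2, threshold=0.5):
    # vecs1/vecs2 map column names to vector representations.
    matches = {}
    for c1, v1 in vecs1.items():
        for c2, v2 in vecs2.items():
            sim = float(np.dot(v1, v2) /
                        (np.linalg.norm(v1) * np.linalg.norm(v2)))
            if sim >= threshold:
                matches[(c1, c2)] = sim
    return matches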
Refactor configuration generation and output filenames to remove special characters from the file name
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Geodata/location_cities_countries/cities.csv", encoding='utf-8')
df2 = pd.read_csv("data/Geodata/location_cities_countries/countries.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
(UnicodeEncodeError: 'charmap' codec can't encode character '\u0103' in position 4: character maps to <undefined>)
The csv files come from here: https://www.kaggle.com/datasets/liewyousheng/geolocation
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
I saw "we implement and integrate six schema matching algorithms [14]–[19] and our own baseline method, and adapt them to the needs of dataset discovery" in your paper. At present, Valentine does not seem to support these two algorithms.
Hi folks, nice package! I tried the below and am curious if you have any ideas on this.
I tried it on two DFs with 200k rows and 10 columns. It didn't converge, so I had to use df.sample(4000) instead to cut the processing down to 10 minutes on a Mac Mini with 32GB RAM and a 3GHz 6-core i5. How long should I expect such a run to take? The two files (13MB and 2MB) are in https://drive.google.com/drive/folders/1BIX240k6GEouT5SrjY9pWDaT7X6_QkY4?usp=sharing.
I interpreted your comment in JaccardLevenMatcher to mean that this spawns 10 processes for a speedup, but it raised an error:
matcher = valentine.algorithms.JaccardLevenMatcher(0.2, 10)
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
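A likely workaround, following the idiom the error message itself suggests (the file paths here are hypothetical):

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import JaccardLevenMatcher

if __name__ == '__main__':
    # The guard keeps spawned child processes (macOS, Windows) from
    # re-executing the module body.
    df1 = pd.read_csv("first.csv")
    df2 = pd.read_csv("second.csv")
    matcher = JaccardLevenMatcher(0.2, 10)  # same arguments as in the report
    matches = valentine_match(df1, df2, matcher)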
Hi Valentine authors!
I am having trouble with a bug that seems to be coming from Valentine, but I am unsure: in similarity_flooding.py, is it expected that long_name may sometimes be None? (This is causing my experiments to crash.)
Dumbish question: is it possible that column_name should be = e[0].long_name?
def __get_attribute_tuple(self, node):
    column_name = None
    if node in self.__graph1.nodes():
        for e in self.__graph1.out_edges(node):
            links = self.__graph1.get_edge_data(e[0], e[1])
            if links.get('label') == "name":
                column_name = e[1].long_name  ##### LONG_NAME is None
    else:
        for e in self.__graph2.out_edges(node):
            links = self.__graph2.get_edge_data(e[0], e[1])
            if links.get('label') == "name":
                column_name = e[1].long_name
    return column_name
Dear valentine devs,
I was wondering about losing the structural information found in JSON / dictionaries when normalizing them into pandas data frames. To my understanding, COMA 3.0 would usually use this information in the matching process to improve the results. What do you think about supporting a nested (JSON) data source? I guess one would need to transform it to XML to be able to use it with COMA etc.
All the best, and thanks for the great work!
The Coma algorithm can use either only schema information or both schema and instance information.
Currently, we ask users to set the strategy parameter to either COMA_OPT (schema only) or COMA_OPT_INST (schema + instances), which can be difficult to understand. It would be much easier if we replaced the strategy param with a use_instances boolean flag.
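For illustration, the change in user-facing code (the use_instances flag is the proposal, not an existing parameter):

from valentine.algorithms import Coma

# Current API:
matcher = Coma(strategy="COMA_OPT_INST")

# Proposed API (hypothetical):
matcher = Coma(use_instances=True)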
Hello, and thank you for the library! 🍾
When using DistributionBased matching, I have the following use case: two source tables (source_1 and source_2) and one target table (target).
1. matcher.get_matches(source_1, target): pickle files for the columns of the source_1 and target tables are written to e.g. /tmp/tmpkpakbdjz, and the same files are read back with clustering_utils.get_column_from_store. Matches are generated.
2. matcher.get_matches(source_2, target): pickle files for the columns of the source_2 and target tables are written to e.g. /tmp/tmp41gf90n2. HOWEVER, clustering_utils.get_column_from_store attempts to read the pkl files created for the columns of source_1 from the directory /tmp/tmp41gf90n2.
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Geographie_00164.csv", encoding='utf-8')
df2 = pd.read_csv("data/Climate data/recent/wetter_tageswerte_00164_akt/Geographie_Stationsname/Metadaten_Stationsname_00164.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
(RuntimeError: len(ranks) < 2)
The csv files come from here: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/weather_phenomena/recent/. I converted the data to csvs. I'll attach my csvs:
Metadaten_Geographie_00164.csv
Metadaten_Stationsname_00164.csv
(Overview and explanation of data: https://www.dwd.de/EN/ourservices/cdc/cdc_ueberblick-klimadaten_en.html)
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
Add seeping semantics to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.
There is some duplicate code and there are naming issues in the similarity flooding implementation. If there is time, it would be nice to fix those.
I would like to be able to load a dataframe, and then add noise to specific columns of that dataset.
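A minimal sketch of such a helper (not part of Valentine; Gaussian noise on numeric columns is one possible noise model):

import numpy as np

def add_noise(df, columns, scale=0.1, seed=None):
    # Return a copy of df with Gaussian noise added to the given columns.
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    for col in columns:
        noisy[col] = noisy[col] + rng.normal(0.0, scale, size=len(df))
    return noisy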
Tested on macOS 13.6 (Ventura).
This also affects other parts of the test suite that may use COMA, causing assertions to fail.
Probably the right course of action would be to throw an exception if COMA cannot be run properly on the system, which can then be caught in the tests to make the user aware of this.
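One possible shape for this, as a sketch (the error class and the Java check are hypothetical, not Valentine's actual API):

import shutil
import unittest

class ComaUnavailableError(RuntimeError):
    # Raised when the COMA jar cannot be run, e.g. no Java on PATH.
    pass

def check_coma_runnable():
    if shutil.which("java") is None:
        raise ComaUnavailableError("Java executable not found; COMA cannot run.")

class TestComa(unittest.TestCase):
    def test_coma(self):
        try:
            check_coma_runnable()
        except ComaUnavailableError as e:
            self.skipTest(str(e))  # make the user aware instead of failing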
C:\Users\akatsifodimos>pip install valentine
Collecting valentine
Using cached valentine-0.1.1.tar.gz (38.2 MB)
Collecting numpy<2.0,>=1.21
Using cached numpy-1.21.2.zip (10.3 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting valentine
Using cached valentine-0.1.0.tar.gz (38.2 MB)
ERROR: Cannot install valentine==0.1.0 and valentine==0.1.1 because these package versions have conflicting dependencies.
The conflict is caused by:
valentine 0.1.1 depends on scipy<1.8 and >=1.7
valentine 0.1.0 depends on scipy<1.8 and >=1.7
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
Hello
I am trying to execute the following 2 lines of code but I get a FileNotFoundError:
import valentine
matcher = valentine.algorithms.Coma(strategy="COMA_OPT")
matches = valentine_match(df1, df2, matcher)
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine_init_.py", line 20, in valentine_match
matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 32, in get_matches
self.__run_coma_jar(s_f_name, t_f_name, coma_output_file, tmp_folder_path)
File "C:\Users\grbus\prog\venv\lib\site-packages\valentine\algorithms\coma\coma.py", line 49, in __run_coma_jar
subprocess.call(['java', f'-Xmx{self.__java_XmX}',
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 858, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "c:\users\grbus\appdata\local\programs\python\python38\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
How could this be resolved? Is it a windows issue or an issue with the valentine library itself?
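Since the traceback shows subprocess.call(['java', ...]) failing, WinError 2 most likely means no Java executable is on PATH. A quick check:

import shutil

# None means Java is not on PATH; install a JRE or add java to PATH.
print(shutil.which("java"))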
Each run's output should be written to a JSON file with the following structure:
{
  "name": "a unique identifier for the specific run",
  "matches": "a dictionary that contains the output of the algorithm",
  "metrics": "a dictionary with the metrics and their values"
}
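For illustration, a minimal sketch (the run name and output path are hypothetical) that writes one run in this structure:

import json

run_output = {
    "name": "run-0001",  # unique identifier for this run
    "matches": {},       # the algorithm's output
    "metrics": {},       # metric name -> value
}
with open("run-0001.json", "w") as f:
    json.dump(run_output, f, indent=2)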
Hello,
I ran Valentine with the DistributionBased strategy
import pandas as pd
from valentine.algorithms import DistributionBased
from valentine import valentine_match
df1 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/book_titles.csv", encoding='utf-8')
df2 = pd.read_csv("data/Books recommender/book_titles-collaborative_books_df/collaborative_books_df.csv", encoding='utf-8')
matches = valentine_match(df1, df2, DistributionBased())
And I get this error:
([WinError 2] The system cannot find the file specified: 'C:\Users\xxx\AppData\Local\Temp\tmptv_b_0v6\table1title.pkl')
The pickled rank files cannot be found again; I can't find the files manually in my AppData folder either.
The csv files come from here: https://www.kaggle.com/datasets/thedevastator/book-recommender-system-itembased
I'm on Windows 10, I've forked valentine and am running it locally. It is up to date with valentine:master today. I haven't made any changes to the DistributionBased code.
Enrich the JaccardLevenMatcher with more string similarity measures apart from Levenshtein distance. Rename the method.
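As one example of an additional measure such a matcher could offer, a sketch using the standard library's difflib ratio:

from difflib import SequenceMatcher

def ratio_similarity(a: str, b: str) -> float:
    # Similarity in [0, 1] from difflib's matching-blocks ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()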
If I want to run the SemProp algorithm, what do I need to do to prepare? And which script should I run for the experiment?
Is it required to use the Docker image provided by the SemProp team themselves?
Thanks for your patience!
Add the similarity flooding implementation to the framework. Look into the wiki for instructions on how to integrate an algorithm to the framework.