
This project forked from qcri/cmdl


Cross-Modal Data Discovery over Structured and Unstructured Data Lakes


CMDL


Set up:

environment.yml defines the conda environment; create it with: conda env create -f environment.yml

Entry points:

trainer/pretrain-text.ipynb: fine-tunes a language model on a text corpus to learn text embeddings
trainer/pretrain-tables.ipynb: fine-tunes a language model on a table collection to learn tuple embeddings
trainer/column_text_joint_training.ipynb: trains a baseline that connects text to table columns
compare_gt.py: measures the accuracy of search-based baselines and similarity sketches on text->table relation discovery, using the provided ground truth
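
As a rough illustration of what an accuracy measurement for text->table discovery can look like, the sketch below computes precision and recall at k against a ground truth. It is not taken from compare_gt.py: the function names, the dictionary layout, and the choice of precision/recall at k are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(
    ranked: List[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision and recall of the top-k retrieved tables for one text."""
    top_k = ranked[:k]
    hits = sum(1 for table in top_k if table in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_scores(
    predictions: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> Tuple[float, float]:
    """Average precision@k and recall@k over every text in the ground truth.

    `predictions` maps a text id to its ranked list of table names
    (hypothetical layout); texts with no prediction score zero.
    """
    if not ground_truth:
        return 0.0, 0.0
    total_p = total_r = 0.0
    for text_id, relevant in ground_truth.items():
        p, r = precision_recall_at_k(predictions.get(text_id, []), relevant, k)
        total_p += p
        total_r += r
    n = len(ground_truth)
    return total_p / n, total_r / n
```

Averaging per text rather than pooling all hits keeps texts with many related tables from dominating the score; whether the repository does the same is not stated in this README.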

Data Sets & Ground Truths:

All files and directories are inside the inputs directory

Pharma
- drugbank-tables: drugbank tables as csv files
- pubmed-targets: pubmed article abstracts as txt files
- DrugBank_Synthetic_dataset: synthetic drugbank tables as csv files
ChEBI
- ChEBI_tables_dataset: ChEBI tables as csv files
  Note: chebi-reference.csv.zip and chebi-structures.csv.zip are compressed due to GitHub file size limits
ChEMBL
- ChEMBL_tables_dataset: ChEMBL tables as csv files
  Note: chembl_27-activity_supp.csv.zip, chembl_27-chembl_id_lookup.csv.zip, chembl_27-compound_records.csv.zip, and chembl_27-molecule_dictionary.csv.zip are compressed due to GitHub file size limits
MLOpen
- MLOpen Data Source
- Our experiments use subsets of the data, found in these subdirectories:
  - mlopen_t2t_SS_dataset
  - mlopen_t2t_MS_dataset
  - mlopen_t2t_LS_dataset
UKOpen
- UKOpen Data Source

The ground truth files for each dataset are also in the inputs directory
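
Several of the tables listed above ship as .csv.zip archives. A minimal sketch for unpacking them in place, using only the Python standard library, is shown here. The helper name extract_zipped_csvs is hypothetical and is not part of the repository; it assumes each archive should be extracted next to itself.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_zipped_csvs(root: str) -> List[Path]:
    """Extract every *.csv.zip under `root` into the archive's own directory.

    Returns the paths of the extracted csv files.
    """
    extracted: List[Path] = []
    for archive in sorted(Path(root).rglob("*.csv.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                # Skip anything in the archive that is not a csv file.
                if member.endswith(".csv"):
                    zf.extract(member, path=archive.parent)
                    extracted.append(archive.parent / member)
    return extracted
```

Calling extract_zipped_csvs("inputs") from the repository root would then leave the plain csv files beside their archives, assuming the directory layout described above.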

Resources:

Paper manuscripts are provided in the docs folder

Prior baselines:

snorkel labeler.ipynb: must be run in its own separate environment, set up by following the instructions at: https://github.com/snorkel-team/snorkel
build_label_files.py: profiles the data, indexes the tables, and creates labels by probing the indexes with each text
build_features.py: featurizes the input data and saves the features to disk, to be read during training
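
The idea of creating labels by probing a table index with each text can be sketched as follows. This is an illustrative toy, not the logic of build_label_files.py: the token-level inverted index, the min_hits threshold, and all function names are assumptions.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(tables: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Inverted index: token -> names of the tables whose cells contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, cells in tables.items():
        for cell in cells:
            for token in tokenize(cell):
                index[token].add(name)
    return index


def probe(index: Dict[str, Set[str]], text: str, min_hits: int = 2) -> List[str]:
    """Label a text with every table sharing at least `min_hits` distinct tokens."""
    hits: Counter = Counter()
    for token in tokenize(text):
        for name in index.get(token, ()):
            hits[name] += 1
    return sorted(name for name, count in hits.items() if count >= min_hits)
```

With a hypothetical table "targets.csv" holding the cells "Cyclooxygenase 1" and "Aspirin", probing with the text "Aspirin inhibits cyclooxygenase" matches two distinct tokens and so labels the text with that table.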
