A tool for manually classifying tables from the Dresden Web Table Corpus (DWTC). The resulting labels can then be used as a training data set.
- Download as many DWTC dataset files from https://wwwdb.inf.tu-dresden.de/research-projects/dresden-web-table-corpus/ as you want
- Install the requirements and initialize the database with
pip install -r requirements.txt
export FLASK_APP=dwtc-table-manual-classificator
pip install --editable .
flask initDb pathToDwtcFiles/
to randomly extract 20 tables from each file, while saving at most 100 tables per domain in the SQLite database
- Run the program with
./start.sh
- Go to http://127.0.0.1:5000/
- Have fun classifying :)
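
The sampling rule from the database initialization step (20 random tables per file, at most 100 tables per domain) can be sketched roughly as follows. This is only an illustration and not the tool's actual code: the function name `sample_tables` and the `(domain, table)` pair layout are assumptions made for the example, and tables dropped by the domain cap are simply skipped rather than replaced.

```python
import random
from collections import Counter


def sample_tables(files, per_file=20, per_domain_cap=100, seed=None):
    """Pick up to `per_file` random tables from each file while keeping
    at most `per_domain_cap` tables per domain overall.

    `files` is an iterable of lists of (domain, table) pairs. This layout
    is a hypothetical stand-in for the parsed DWTC files.
    """
    rng = random.Random(seed)
    per_domain = Counter()  # number of tables kept so far for each domain
    kept = []
    for tables in files:
        # Draw without replacement; small files contribute all their tables.
        picked = rng.sample(tables, min(per_file, len(tables)))
        for domain, table in picked:
            if per_domain[domain] < per_domain_cap:
                per_domain[domain] += 1
                kept.append((domain, table))
    return kept
```

Passing a fixed `seed` makes the selection reproducible, which can help when several people are supposed to label the same sample.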
If you're interested in the underlying dataset, take a look at the WEKA-compatible ARFF files containing the features we used for classification. The most recent version of our features is the 2017 one. The corresponding raw tables (HTML code as well as JSON, plus some information about where to find the whole web page in the Common Crawl) can be found in the SQLite database data.db.
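
To get a first overview of what the SQLite database contains, something like the following sketch may help. It makes no assumption about the schema: it only reads the table names from SQLite's built-in `sqlite_master` catalog and counts the rows in each. The function name `summarize_database` is made up for this example.

```python
import sqlite3


def summarize_database(path="data.db"):
    """Return {table_name: row_count} for every table in the SQLite file."""
    connection = sqlite3.connect(path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come from sqlite_master itself, so double-quoting them suffices.
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

Calling `summarize_database("data.db")` from the directory that holds the database returns a dictionary mapping each table name to its number of rows.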
The features can be generated for a given SQLite database containing raw database tables using the code in featureClassificator.
Unless explicitly noted otherwise, the content of this package is released under the GNU Affero General Public License version 3 (AGPLv3).
Why the GNU Affero GPL (short answer: why not?)
Copyright © 2017 Julius Gonsior