dwtc-table-manual-classificator

License: GPL v3

A tool for manually classifying tables from the Dresden Web Table Corpus (DWTC). The resulting labels can then be used as a training data set.

Instructions

  1. Download as many DWTC dataset files as you want from https://wwwdb.inf.tu-dresden.de/research-projects/dresden-web-table-corpus/
  2. Install all requirements with pip install -r requirements.txt
  3. export FLASK_APP=dwtc-table-manual-classificator
  4. pip install --editable .
  5. Run flask initDb pathToDwtcFiles/ to randomly extract 20 tables from each file, saving a maximum of 100 tables per domain in the SQLite database
  6. Run the program with ./start.sh
  7. Go to http://127.0.0.1:5000/
  8. Have fun classifying :)
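
The sampling rule in step 5 (20 random tables per file, at most 100 tables per domain) can be illustrated with a minimal sketch. This is not the project's actual initDb code: the function name sample_tables, the dict layout with a "url" key, and the use of the URL host as the "domain" are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

TABLES_PER_FILE = 20          # from step 5: tables drawn at random per file
MAX_TABLES_PER_DOMAIN = 100   # from step 5: cap on stored tables per domain


def sample_tables(files, rng=None):
    """Pick tables file by file while enforcing the per-domain cap.

    `files` is an iterable of lists of table dicts. Each dict is assumed
    (hypothetically) to carry a "url" key naming the page it came from.
    """
    rng = rng or random.Random()
    stored_per_domain = defaultdict(int)
    selected = []
    for tables in files:
        tables = list(tables)
        # Draw up to 20 tables from this file without replacement.
        picked = rng.sample(tables, min(TABLES_PER_FILE, len(tables)))
        for table in picked:
            domain = urlparse(table["url"]).netloc
            # Skip tables from domains that already reached the cap.
            if stored_per_domain[domain] >= MAX_TABLES_PER_DOMAIN:
                continue
            stored_per_domain[domain] += 1
            selected.append(table)
    return selected
```

With this approach a domain that dominates the corpus cannot crowd out the rest of the training data, at the cost of sometimes storing fewer than 20 tables for a given file.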

Dataset

If you're interested in the underlying dataset, take a look at the WEKA-compatible ARFF files containing the features we used for classification. The most recent version of our features is the 2017 one. The corresponding raw tables (HTML code as well as JSON, plus some information about where to find the whole web page in the Common Crawl) can be found in the SQLite database data.db.
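
Since the schema of data.db is not documented here, a reasonable first step is to list its tables and columns. The sketch below uses only Python's standard sqlite3 module; the helper name describe_database is hypothetical and no table or column names of the real database are assumed.

```python
import sqlite3


def describe_database(db_path):
    """Return {table_name: [column_name, ...]} for every table in the file.

    Note: sqlite3.connect creates an empty database if db_path does not
    exist, so check the path first when pointing this at data.db.
    """
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is its name.
            columns = connection.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [column[1] for column in columns]
        return schema
    finally:
        connection.close()
```

Calling describe_database("data.db") from the repository root should then show where the HTML, JSON, and Common Crawl location fields are stored.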

The features can be generated for any SQLite database containing raw tables by using the code in featureClassificator.

License

Unless explicitly noted otherwise, the content of this package is released under the GNU Affero General Public License version 3 (AGPLv3).

Why the GNU Affero GPL (short answer: why not?)

Copyright © 2017 Julius Gonsior
