
SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.

Citation

@InProceedings{Yu&al.18.emnlp.syntax,
  author =  {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
  title =   {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
  year =    {2018},  
  booktitle =   {Proceedings of EMNLP},  
  publisher =   {Association for Computational Linguistics},
}

Environment Setup

  1. The code uses Python 2.7 and PyTorch 0.2.0 with GPU support.
  2. Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

  1. Download the dataset from the Spider task website (to be updated), and put tables.json, train.json, and dev.json under the data/ directory.
  2. Download the pretrained GloVe embeddings and save them as glove/glove.%dB.%dd.txt (the two %d placeholders stand for the corpus size in billions of tokens and the embedding dimension).
  3. Download evaluation.py and process_sql.py from the Spider GitHub page.
  4. Download the preprocessed train/dev datasets and pretrained models from here. The archive contains generated_datasets/, which holds:
    • generated_data for the original Spider training dataset; its pretrained models are in generated_data/saved_models
    • generated_data_augment for the original Spider + augmented training datasets; its pretrained models are in generated_data_augment/saved_models
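Before preprocessing or training, it can help to confirm that the downloaded files ended up where the steps above say they should. The following is a minimal sketch, not part of the repository: the helper names glove_path and missing_setup_files are hypothetical, and the check only covers the data/ files and the GloVe file named above.

```python
import os


def glove_path(tokens_billions, dim):
    # Fill the two %d placeholders of the path pattern given in the README.
    return "glove/glove.%dB.%dd.txt" % (tokens_billions, dim)


def missing_setup_files(root, tokens_billions, dim):
    """Return the expected files (relative to root) that do not exist yet.

    Hypothetical helper: the list of expected files is taken from the
    download steps in this README, nothing more.
    """
    expected = [
        os.path.join("data", "tables.json"),
        os.path.join("data", "train.json"),
        os.path.join("data", "dev.json"),
        glove_path(tokens_billions, dim),
    ]
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    # 42 and 300 are example values only; use the ones matching your download.
    for path in missing_setup_files(".", 42, 300):
        print("missing: %s" % path)
```

For example, glove_path(42, 300) yields glove/glove.42B.300d.txt, assuming that is the GloVe release you downloaded.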

Generating Train/dev Data for Modules

You can find the preprocessed train/dev data in generated_datasets/.

To generate them yourself, update the directories marked TODO in preprocess_train_dev_data.py, and run the following command once per split (pass either train or dev) to generate the files for each module:

python preprocess_train_dev_data.py train|dev
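The train|dev notation above means "pick one"; typed literally, a shell would treat | as a pipe. A small sketch of running the script once per split is shown below. The wrapper functions are hypothetical and assume the script is invoked from the repository root with a single positional argument, as in the command above.

```python
import subprocess

SPLITS = ("train", "dev")


def preprocess_command(split):
    # Build the argument list for one split; reject anything else early.
    if split not in SPLITS:
        raise ValueError("split must be one of %r, got %r" % (SPLITS, split))
    return ["python", "preprocess_train_dev_data.py", split]


def run_preprocessing(run=subprocess.check_call):
    # `run` is injectable so the commands can be inspected without executing.
    for split in SPLITS:
        run(preprocess_command(split))


if __name__ == "__main__":
    run_preprocessing()
```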

Folder/File Description

  • data/ contains raw train/dev/test data and table file
  • generated_datasets/ is described above.
  • models/ contains the code for each module.
  • evaluation.py is for evaluation. It uses process_sql.py.
  • train.py is the main file for training. Use train_all.sh to train all the modules (see below).
  • test.py is the main file for testing. It uses supermodel.sh to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
  • generate_wikisql_augment.py is for cross-domain data augmentation.

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
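As an illustration of how the flags above fit together, here is a sketch that assembles one train.py invocation and validates the two choice flags. The function build_train_command is hypothetical; the flag names and allowed values are the ones listed above, and the module name is whatever train_all.sh passes for --train_component.

```python
HISTORY_TYPES = ("full", "no")
TABLE_TYPES = ("std", "no")


def build_train_command(data_root, save_dir, history_type, table_type,
                        component, epochs):
    """Return the argument list for one train.py call (not executed here)."""
    if history_type not in HISTORY_TYPES:
        raise ValueError("history_type must be one of %r" % (HISTORY_TYPES,))
    if table_type not in TABLE_TYPES:
        raise ValueError("table_type must be one of %r" % (TABLE_TYPES,))
    return [
        "python", "train.py",
        "--data_root", data_root,
        "--save_dir", save_dir,
        "--history_type", history_type,
        "--table_type", table_type,
        "--train_component", component,
        "--epoch", str(epochs),
    ]
```

The returned list can be passed to subprocess.check_call, once per module, to reproduce what train_all.sh does in a loop.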

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no
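The SAVE_PATH line suggests that the trained-model directory name encodes the history and table settings. The sketch below assumes that naming pattern (saved_models_hs=<history>_tbl=<table>), inferred only from the single example above, and builds a matching test.py call. Both function names are hypothetical.

```python
def saved_models_dir(data_dir, history_type, table_type):
    # Assumed pattern, inferred from the SAVE_PATH example in test_gen.sh.
    return "%s/saved_models_hs=%s_tbl=%s" % (data_dir, history_type, table_type)


def build_test_command(test_data_path, output_path, data_dir,
                       history_type, table_type):
    """Return the argument list for one test.py call (not executed here)."""
    return [
        "python", "test.py",
        "--test_data_path", test_data_path,
        "--models", saved_models_dir(data_dir, history_type, table_type),
        "--output_path", output_path,
        "--history_type", history_type,
        "--table_type", table_type,
    ]
```

Deriving the model path from the same two values that are passed as flags keeps --models consistent with --history_type and --table_type.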

Evaluation

Follow the general evaluation process on the Spider GitHub page.

Cross-Domain Data Augmentation

You can find the preprocessed augmented data at generated_datasets/generated_data_augment.

If you would like to run data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, and then run python generate_wikisql_augment.py to generate more training data. Second, run get_data_wikisql.py to generate the WikiSQL augment JSON file. Finally, use merge_jsons.py to generate the final Spider + WikiSQL + WikiSQL augment dataset.

Acknowledgement

The implementation is based on SQLNet. Please cite it too if you use this code.
