Natural Language to SQL

This project is an implementation of the Seq2SQL model described in "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" (https://arxiv.org/pdf/1709.00103.pdf).

A baseline sequence-to-sequence model is also implemented for comparison.

Setup Instructions

  • Download the WikiSQL dataset from https://github.com/salesforce/WikiSQL, unzip it, and place it in the data directory
  • Install SQLite using the links at https://www.sqlite.org/download.html
  • Install the project requirements with pip install -r requirements.txt
  • Download the GloVe embeddings from http://nlp.stanford.edu/data/glove.6B.zip
  • Extract the archive into the glove folder
  • Run the pre-processing script: python preprocess.py. This creates the tokenized versions of the dataset
  • Run python main.py. This runs the baseline model followed by the target model.
  • A full run of main.py takes approximately 10 hours, so use a system with a capable GPU.
  • Running this project in an Anaconda environment is highly recommended, as it gives the interpreter access to common libraries that may be missing from requirements.txt
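Once the GloVe archive is extracted, the embeddings can be read into a word-to-vector map before training. A minimal sketch of that step — the `load_glove` helper and the `glove/glove.6B.300d.txt` path are illustrative, not functions or paths defined by this repository:

```python
def load_glove(path):
    """Parse a GloVe text file (word followed by space-separated floats)
    into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # First token is the word; the rest are the vector components.
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Example usage (assuming the 300-dimensional file was extracted into glove/):
# vectors = load_glove("glove/glove.6B.300d.txt")
```

Words absent from the GloVe vocabulary would still need a fallback (e.g. a zero or randomly initialized vector) when building the model's embedding matrix.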

Folder Structure

  • The data and glove directories hold the dataset and the embeddings
  • The library folder contains code provided by WikiSQL to perform basic data conversions and query execution
  • The util directory contains files related to common functionality such as plotting graphs, loading datasets, preparing parallel datasets in-memory for fast access, creating batch sequences for models, and checking model accuracy.
  • The baseline directory contains all code necessary for the baseline to run
  • The seq2sql directory contains all code pertaining to the target model
  • The saved_model directory is where the target model will save the best model after training
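The batching step handled in the util directory can be pictured with the sketch below. The `pad_batch` and `batches` helpers are hypothetical illustrations of padded mini-batching, not the repository's actual functions:

```python
def pad_batch(sequences, pad_token=0):
    """Pad variable-length token sequences to the length of the longest one,
    so a batch forms a rectangular tensor."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_token] * (max_len - len(s)) for s in sequences]

def batches(data, batch_size):
    """Yield consecutive padded batches from a list of token sequences."""
    for i in range(0, len(data), batch_size):
        yield pad_batch(data[i:i + batch_size])

# Example usage with toy token-id sequences:
# list(batches([[1, 2], [3, 4, 5], [6]], batch_size=2))
```

Padding to the longest sequence in each batch (rather than the whole dataset) keeps batches small, which is a common choice for sequence-to-sequence training.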

Important Files

The entry point to the project is the main.py file. From here it is possible to control which model(s) to run. preprocess.py is another essential file: it generates the tokenized dataset, and altering its tokenizing logic could significantly impact the results. constants.py contains the parameters used by the target model, such as batch size, learning rate, and number of epochs.
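The kind of hyperparameters kept in constants.py might look like the sketch below. The names and values here are illustrative placeholders only — consult the actual constants.py in the repository for the real settings:

```python
# Illustrative hyperparameter constants -- placeholder values, not the
# repository's actual configuration.
BATCH_SIZE = 64        # sequences per training batch
LEARNING_RATE = 1e-3   # optimizer step size
NUM_EPOCHS = 30        # full passes over the training set
EMBEDDING_DIM = 300    # would match 300-dimensional GloVe vectors
```

Centralizing these values in one module makes experiments reproducible: changing a single constant alters every component that imports it.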

Upon completion of the run, the code generates loss graphs and stores the results of the target model in a text file in the root directory of the project.

Contributors

kaartikeynatrajan, tiwarikajal
