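(Continuously updated...)
Resources for "Natural Language to Logical Form" Research
A collection of research resources on "natural language to logical form". The current phase focuses on NL2SQL and covers public benchmark datasets, related papers with some code implementations, and related blog posts or WeChat official-account articles.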
NL2SQL
I. Main benchmark datasets (dataset)
II. Main papers, methods and code implementations (papers&code)
1. WikiSQL
2. Spider
III. Extended resources (extend-resources)
1. RelatedWorks
2. SQL2Seq
3. Graph Neural Networks (GNN)
- Academic, Advising, ATIS, Geography, Restaurants, Scholar, IMDB, Yelp, etc.
Blog
http://jkk.name/text2sql-data/
GitHub
https://github.com/jkkummerfeld/text2sql-data
Paper
Improving Text-to-SQL Evaluation Methodology, Finegan-Dollak C, Kummerfeld J K, Zhang L, et al., ACL 2018
- WikiTableQuestions
- WikiSQL
WikiSQL dataset features:
- Single-table, single-column queries;
- Aggregation operations ('MAX', 'MIN', 'COUNT', 'SUM', 'AVG');
- Condition conjunction ('AND');
- Condition comparison ('=', '>', '<')
GitHub
https://github.com/salesforce/WikiSQL
Paper
Seq2sql: Generating structured queries from natural language using reinforcement learning, Zhong V, Xiong C, Socher R., 2017.
- Spider
Spider dataset features:
- Complex, Cross-domain and Zero-shot
- Multi-table, multi-column queries with complex subqueries;
- Aggregation operations ('MAX', 'MIN', 'COUNT', 'SUM', 'AVG', 'GROUP', 'HAVING', 'LIMIT');
- Join operations: ('join', 'on', 'as')
- WHERE connectors: ('AND', 'OR');
- WHERE operators: ('not', 'between', '=', '>', '<', '>=', '<=', '!=', 'in', 'like', 'is', 'exists')
- Ordering operations: ('order by', 'desc', 'asc')
- SQL set operations: ('Intersect', 'Union', 'Except')
Home
https://yale-lily.github.io/spider
GitHub
https://github.com/taoyds/spider
Paper
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task, Yu T, Zhang R, Yang K, et al., EMNLP 2018.
PPT
Statistical comparison of the Spider/WikiSQL/TableQA datasets, by gibbsxiong
- SParC
SParC dataset features:
- Context-dependent and Multi-turn version of the Spider task.
A context-dependent, multi-turn task that inherits the characteristics of Spider.
Home
https://yale-lily.github.io/sparc
Paper
SParC: Cross-Domain Semantic Parsing in Context, Yu T, Zhang R, Yasunaga M, et al., ACL 2019.
- CoSQL
CoSQL dataset features:
- Cross-domain Conversational, the Dialogue version of the Spider and SParC tasks.
A multi-turn dialogue task that inherits the characteristics of Spider and involves intent clarification.
Home
https://yale-lily.github.io/cosql
Paper
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases, Yu T, Zhang R, Er H Y, et al., EMNLP-IJCNLP 2019.
- Chinese Spider
The Chinese version of Spider
Home
https://taolusi.github.io/CSpider-explorer/
GitHub
https://github.com/taolusi/chisp
Paper
A Pilot Study for Chinese SQL Semantic Parsing, Qingkai Min, Yuefeng Shi and Yue Zhang, EMNLP-IJCNLP 2019.
- TableQA
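Data features of the first Chinese NL2SQL Challenge:
- An enhanced Chinese counterpart of WikiSQL, with data from finance and other general domains
- Single-table, multi-column (two-column) queries
- Aggregation operations ('MAX', 'MIN', 'COUNT', 'SUM', 'AVG');
- Condition connectors ('AND', 'OR');
- Condition comparison ('=', '>', '<', '!=')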
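Papers are mainly evaluated on WikiSQL and Spider; for the corresponding leaderboards, see each task's home page.
Representative methods are collected below, and the list is continuously updated...
Note: each Score table lists | model | Dev accuracy | Test accuracy |. WikiSQL scores are execution accuracy; Spider scores are logical-form accuracy and do not include value prediction.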
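The difference between the two metrics can be illustrated with a small sketch. This is not the official WikiSQL or Spider evaluation script: the helper names `execution_match` and `logical_form_match`, the toy `player` table, and the naive string normalization are all assumptions made for illustration.

```python
import sqlite3

def execution_match(conn, pred_sql, gold_sql):
    """True if the predicted query runs and returns the same rows as the gold query."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as order-insensitive multisets of rows.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

def logical_form_match(pred_sql, gold_sql):
    """Naive logical-form check: case- and whitespace-insensitive string equality."""
    def normalize(sql):
        return " ".join(sql.lower().split())
    return normalize(pred_sql) == normalize(gold_sql)

# Hypothetical toy database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?)", [("Ann", 30), ("Bob", 25)])

gold = "SELECT name FROM player WHERE age > 26"
pred = "SELECT name FROM player WHERE age >= 30"

print(execution_match(conn, pred, gold))   # same rows on this database
print(logical_form_match(pred, gold))      # different query text
```

Here the prediction is counted as correct under execution accuracy but wrong under logical-form accuracy, which is why scores on the two benchmarks are not directly comparable.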
Weakly Supervised
These methods are weakly supervised: they do not use the SQL logical form as the supervision signal.
Paper
- Min S, Chen D, Hajishirzi H, et al. A discrete hard em approach for weakly supervised question answering[C]. EMNLP-IJCNLP 2019.
- Agarwal R, Liang C, Schuurmans D, et al. Learning to Generalize from Sparse and Underspecified Rewards. 2019.
- Liang C, Norouzi M, Berant J, et al. Memory augmented policy optimization for program synthesis and semantic parsing[C].NeurIPS, 2018: 9994-10006.
Code
- https://github.com/shmsw25/qa-hard-em
- https://github.com/google-research/google-research/tree/master/meta_reward_learning
Score
model | Dev accuracy | Test accuracy |
---|---|---|
Hard-EM | 84.4 | 83.9 |
MeRL | 74.9 | 74.8 |
MAPO | 72.2 | 72.1 |
ExecutionGuided
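Execution Guided decoding uses execution errors at decoding time to correct parts of the generated SQL, filtering out SQL statements that cannot be valid in practice. Execution errors fall into three categories: 1) Parsing errors, i.e. the generated SQL is syntactically invalid. 2) Execution failures, i.e. common run-time errors such as applying SUM( ) to, or running comparisons on, string-typed data. 3) Empty results: assuming the execution result should not be empty, an empty result indicates a wrong condition. For example, the condition value may not actually exist in the predicted column, so beam search moves on to a column that actually contains the condition value.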
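A minimal sketch of this filtering idea, assuming the decoder has already produced a list of candidate queries sorted by model score. The function name `execution_guided_pick`, the toy `city` table, and the candidate list are hypothetical; the papers below apply the check to partial programs during beam search rather than to finished queries. Note also that SQLite is dynamically typed, so type errors such as SUM over text do not raise here, while syntax errors and unknown columns do.

```python
import sqlite3

def execution_guided_pick(conn, candidates):
    """Return the highest-ranked candidate that parses, runs, and returns rows."""
    for sql in candidates:  # assumed to be ordered from best to worst model score
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # categories 1 and 2: parsing error or execution failure
        if rows:
            return sql
        # category 3: empty result, so fall through to the next candidate
    return None

# Hypothetical toy database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Paris", "France", 2100000), ("Lyon", "France", 500000)],
)

candidates = [
    "SELECT name FORM city WHERE country = 'France'",   # syntax error
    "SELECT name FROM city WHERE region = 'France'",    # unknown column
    "SELECT name FROM city WHERE name = 'France'",      # runs, but empty result
    "SELECT name FROM city WHERE country = 'France'",   # runs and returns rows
]

print(execution_guided_pick(conn, candidates))
```

The third candidate shows the empty-result case described above: the value 'France' does not occur in the predicted column, so the search falls back to a column that contains it.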
Paper
- Wang C, Huang P S, Polozov A, et al. Robust Text-to-SQL Generation with Execution-Guided Decoding[J]. 2018.
- Wang C, Brockschmidt M, Singh R. Pointing out SQL queries from text[J]. 2018.
- Dong L, Lapata M. Coarse-to-fine decoding for neural semantic parsing[J]. 2018.
- Huang P S, Wang C, Singh R, et al. Natural language to structured query generation via meta-learning[J]. 2018.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
Coarse2Fine + EG | 84.0 | 83.8 |
Coarse2Fine | 79.0 | 78.5 |
Pointer-SQL + EG | 78.4 | 78.3 |
Pointer-SQL | 72.5 | 71.9 |
SQLNet Framework
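These methods design a sketch that conforms to SQL grammar; within this sketch, the model only needs to predict and fill the corresponding slots. The sketch is:
SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE (AND $COLUMN $OP $VALUE)*
On top of this, the following classification sub-tasks are predicted jointly:
- select-column, the selected column
- select-aggregation, the type of aggregation operation
- where-number, the number of WHERE conditions
- where-column, the columns used in the WHERE conditions
- where-operator, the operator type of each WHERE condition ('<', '=', '>')
- where-value, the value of each WHERE condition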
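Once the sub-tasks are predicted, assembling the query is plain template filling. The sketch below assumes the classifiers output indices into fixed operator lists; the `SlotPrediction` class, the `fill_sketch` function, the index ordering of `AGG_OPS` and `COND_OPS`, and the example columns are illustrative assumptions, not code from any of the papers. Values are quoted without escaping to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed index order; index 0 of AGG_OPS means "no aggregation".
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]

@dataclass
class SlotPrediction:
    select_column: int                      # select-column
    select_aggregation: int                 # select-aggregation
    # One (where-column, where-operator, where-value) triple per condition;
    # where-number is simply the length of this list.
    where_conditions: List[Tuple[int, int, str]] = field(default_factory=list)

def fill_sketch(pred: SlotPrediction, columns: List[str]) -> str:
    """Fill SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE (AND ...)* from slots."""
    column = columns[pred.select_column]
    agg = AGG_OPS[pred.select_aggregation]
    select_expr = f"{agg}({column})" if agg else column
    sql = f"SELECT {select_expr}"
    conditions = [
        f"{columns[col]} {COND_OPS[op]} '{value}'"
        for col, op, value in pred.where_conditions
    ]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

columns = ["player", "team", "points"]  # hypothetical table header
pred = SlotPrediction(
    select_column=2,
    select_aggregation=1,
    where_conditions=[(1, 0, "Lakers"), (2, 1, "20")],
)
print(fill_sketch(pred, columns))
# SELECT MAX(points) WHERE team = 'Lakers' AND points > '20'
```

Because every slot is a classification over a small, fixed set (or a span of the question for the value), any combination of predictions yields a query that fits the sketch.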
Paper
- Xu X, Liu C, Song D. SQLNet: Generating structured queries from natural language without reinforcement learning[J]. 2018.
- Hwang W, Yim J, Park S, et al. A Comprehensive Exploration on WikiSQL with Table-Aware Word Contextualization[J]. 2019.
- He P, Mao Y, Chakrabarti K, et al. X-SQL: reinforce schema representation with context[J]. 2019.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
BERT-XSQL-Attention + EG | 92.3 | 91.8 |
BERT-XSQL-Attention | 89.5 | 88.7 |
BERT-SQLova-LSTM | 87.2 | 86.2 |
BERT-SQLova-LSTM + EG | 90.2 | 89.6 |
GloVe-SQLNet-BiLSTM | 69.8 | 68.0 |
Model Interactive
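Semantic parsing based on user interaction, which leans more toward practical deployment. After the SQL is generated, natural language generation is used to ask the user to clarify their intent, and the SQL is corrected accordingly.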
Blog
Paper
- Yao Z, Su Y, Sun H, et al. Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study[C]. EMNLP-IJCNLP 2019.
Code
GNN Encoding Seq2Seq
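These methods use the relations among multiple tables to build a graph whose nodes are table names and column names and whose edges are intra-table and inter-table relations. A GNN computes the hidden state of every node (table item). In the encoding stage of the seq2seq model, each query word vector attends over the hidden vectors of all table items, and the attention weights serve as the graph representation of that query word. In the decoding stage, combined with grammar rules, if the output should be a table item, the output vector is scored against the hidden vectors of all table items through a fully connected layer to measure how strongly they are related.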
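The graph construction step can be sketched as follows. The function names `build_schema_graph` and `propagate`, the toy `singer`/`concert` schema, and the use of unweighted neighbour averaging are assumptions for illustration; the papers below use learned, gated updates with typed edges rather than a plain mean.

```python
from collections import defaultdict

def build_schema_graph(tables, foreign_keys):
    """Nodes are table names and 'table.column' names; edges are undirected.

    tables: {table_name: [column_name, ...]}
    foreign_keys: [((table, column), (table, column)), ...]
    """
    edges = defaultdict(set)

    def connect(a, b):
        edges[a].add(b)
        edges[b].add(a)

    for table, columns in tables.items():
        for column in columns:
            connect(table, f"{table}.{column}")       # intra-table relation
    for (t1, c1), (t2, c2) in foreign_keys:
        connect(f"{t1}.{c1}", f"{t2}.{c2}")           # inter-table column link
        connect(t1, t2)                               # inter-table table link
    return edges

def propagate(states, edges):
    """One round of message passing: average a node's state with its neighbours'."""
    new_states = {}
    for node, vector in states.items():
        group = [vector] + [states[n] for n in edges.get(node, ())]
        new_states[node] = [sum(values) / len(group) for values in zip(*group)]
    return new_states

# Hypothetical two-table schema, used only for this illustration.
tables = {"singer": ["id", "name"], "concert": ["id", "singer_id"]}
foreign_keys = [(("concert", "singer_id"), ("singer", "id"))]
graph = build_schema_graph(tables, foreign_keys)

# Toy one-dimensional states: only the "singer" node starts with a signal.
states = {node: [1.0 if node == "singer" else 0.0] for node in graph}
states = propagate(states, graph)
print(states["singer.name"], states["concert"])
```

After one round, information from the `singer` table node has reached its columns and the linked `concert` table, which is the kind of schema-aware hidden state the encoder then attends over.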
Paper
- Krishnamurthy J, Dasigi P, Gardner M. Neural semantic parsing with type constraints for semi-structured tables[C]. EMNLP 2017.
- Lin K, Bogin B, Neumann M, et al. Grammar-based Neural Text-to-SQL Generation. 2019.
- Bogin B, Gardner M, Berant J. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing[C]. ACL 2019.
- Bogin B, Gardner M, Berant J. Global Reasoning over Database Structures for Text-to-SQL Parsing[C]. EMNLP-IJCNLP 2019.
- Shaw P, Massey P, Chen A, et al. Generating Logical Forms from Graph Representations of Text and Entities[C]. ACL 2019.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
BERTRAND + GNN | 57.9 | 54.6 |
Global-GNN | 52.7 | 47.4 |
GNN | 40.7 | 39.4 |
GNN w/edge vectors | 32.1 | - |
RATSQL
Paper
- Wang B, Shin R, Liu X, et al. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers[C]. ICLR 2020. Under review at ACL 2020.
Score
model | Dev accuracy | Test accuracy |
---|---|---|
RATSQL v2 + BERT (DB content used) | 65.8 | 61.9 |
RAT-SQL + BERT | 60.8 | 55.7 |
RAT-SQL | 60.6 | 53.7 |
IRNet
MSRA work
Blog
Paper
- Guo J, Zhan Z, Gao Y, et al. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation[C]. ACL 2019.
- Dong Z, Sun S, Liu H, et al. Data-Anonymous Encoding for Text-to-SQL Generation[C]. EMNLP-IJCNLP 2019.
- Liu H, Fang L, Liu Q, et al. Leveraging Adjective-Noun Phrasing Knowledge for Comparison Relation Prediction in Text-to-SQL[C]. EMNLP-IJCNLP 2019.
- Liu Q, Chen B, Lou J G, et al. FANDA: A Novel Approach to Perform Follow-up Query Analysis[C]. AAAI 2019.
- Liu Q, Chen B, Liu H, et al. A Split-and-Recombine Approach for Follow-up Query Analysis[C]. EMNLP-IJCNLP 2019.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
IRNet-v2 + BERT | 63.9 | 55.0 |
IRNet + BERT-Base | 61.9 | 54.7 |
IRNet-v2 | 55.4 | 48.5 |
IRNet | 53.2 | 46.7 |
EditSQL
Paper
- Zhang R, Yu T, Er H Y, et al. Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions[C]. EMNLP-IJCNLP 2019.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
EditSQL + BERT | 57.6 | 53.4 |
EditSQL | 36.4 | 32.9 |
SQLNet Framework
Paper
- Liu H, Fang L, Liu Q, et al. Leveraging Adjective-Noun Phrasing Knowledge for Comparison Relation Prediction in Text-to-SQL[C]. EMNLP-IJCNLP 2019.
- Yu T, Yasunaga M, Yang K, et al. Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task[C]. EMNLP 2018.
- Dongjun Lee. Clause-Wise and Recursive Decoding for Complex and Cross-Domain Text-to-SQL Generation[C]. EMNLP 2019.
- Lin K, Bogin B, Neumann M, et al. Grammar-based Neural Text-to-SQL Generation. 2019.
Code
Score
model | Dev accuracy | Test accuracy |
---|---|---|
GrammarSQL | 34.8 | 33.8 |
SyntaxSQLNet + augment | 24.8 | 27.2 |
RCSQL | 28.5 | 24.3 |
SyntaxSQLNet | 18.9 | 19.7 |
SQLNet | 10.9 | 12.4 |
Blog
Paper
- Dhamdhere K, McCurley K S, Nahmias R, et al. Analyza: Exploring data with conversation[C]//Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 2017.
Paper
- Xu K, Wu L, Wang Z, et al. Graph2seq: Graph to sequence learning with attention-based neural networks. 2018.
- Xu K, Wu L, Wang Z, et al. SQL-to-text generation with graph-to-sequence model[C]. EMNLP 2018.
Code
- https://github.com/IBM/SQL-to-Text
- https://github.com/IBM/Graph2Seq
- https://github.com/RandolphVI/Graph2Seq
Blog
Paper
- https://github.com/thunlp/GNNPapers
- https://github.com/IndexFziQ/GNN4NLP-Papers
- https://github.com/nnzhan/Awesome-Graph-Neural-Networks
- https://github.com/naganandy/graph-based-deep-learning-literature
- https://github.com/svjan5/GNNs-for-NLP
Code