ruckbreasoning / resdsql

The PyTorch implementation of RESDSQL (AAAI 2023).

Home Page: https://arxiv.org/abs/2302.05965

License: MIT License

Shell 6.33% Python 93.67%

resdsql's Issues

Checkpoint download fails

Hello, how do I download the T5 checkpoints? The two Google Drive links in the table download as two folders:
text2sql_schema_item_classifier
text2natsql_schema_item_classifier

I don't see directories such as text2natsql-t5-3b or text2natsql-t5-base. Where do these come from? Did I download the wrong thing?

Running evaluate_robustness returns nothing

Hello, I've attached a screenshot below to better highlight this issue.

For some reason, running sh scripts/evaluate_robustness/evaluate_on_spider_realistic.sh writes nothing to the eval_results directory. The folders and the .txt file are created, but nothing is appended to the file. Note that I had already run the pre-processing script beforehand: sh scripts/evaluate_robustness/preprocess_spider_realistic.sh

Error in Running Inference script

 raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/text2sql-t5-base/checkpoint-39312'. Use `repo_type` argument if needed.

I am also attaching the entire output log:

RESDSQL.txt
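
A possible reading of this error, offered as an assumption rather than a confirmed diagnosis: when the path passed as the model name does not exist on disk, the Hugging Face loader may fall back to treating it as a Hub repo id, and a string such as ./models/text2sql-t5-base/checkpoint-39312 is not a valid repo id. The sketch below checks the directory before loading. The function name resolve_local_checkpoint is hypothetical, and the config.json check assumes the usual layout of a saved checkpoint.

```python
import os


def resolve_local_checkpoint(path: str) -> str:
    """Return an absolute path to a local checkpoint directory, or raise.

    Hypothetical helper: it only inspects the file system and loads nothing.
    """
    abs_path = os.path.abspath(path)
    if not os.path.isdir(abs_path):
        # The directory is missing, e.g. the checkpoint was never downloaded
        # or the script is not being run from the repository root.
        raise FileNotFoundError(f"checkpoint directory not found: {abs_path}")
    if not os.path.isfile(os.path.join(abs_path, "config.json")):
        # Assumption: a saved checkpoint directory contains config.json.
        raise FileNotFoundError(f"no config.json in: {abs_path}")
    return abs_path
```

Calling resolve_local_checkpoint("./models/text2sql-t5-base/checkpoint-39312") before the model is loaded turns the repo-id error into a plain message about the missing directory.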

How can I finetune on CSpider

Can this be adapted to other datasets?

Can this code be adapted to other NL2SQL datasets, such as DuSQL?

How do I add a new table to the test set? Do I also need to add extra table information?

Obtaining query_toks_no_value from query

I'm attempting to try training on another dataset by appending the dataset's train.json, dev.json, and tables.json to Spider's (and adding the database into Spider's too) with RESDSQL. I'm a bit stumped on how to generate the query_toks_no_value from a query. Is there a script for this, or do you have any advice on how to make one?
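
One way to approximate query_toks_no_value is sketched below, assuming the field is the lowercased token list of the query with every string or numeric literal replaced by the word value. This is not a script from the repository: the tokenizer is a simple regular expression, the function name is hypothetical, and quoted strings with escaped quotes are not handled. Comparing its output with a few existing Spider entries before relying on it would be prudent.

```python
import re
from typing import List

# Alternatives are tried in order at each position of the query string.
_TOKEN_RE = re.compile("|".join([
    r"'[^']*'",        # single-quoted string literal
    r'"[^"]*"',        # double-quoted string literal
    r"\d+(?:\.\d+)?",  # integer or decimal number
    r"[A-Za-z_]\w*",   # keyword, table name, column name, or alias
    r"!=|>=|<=|<>",    # two-character comparison operators
    r"[^\s\w]",        # any other single punctuation character
]))


def query_toks_no_value(query: str) -> List[str]:
    """Tokenize a SQL query, replacing literals with the token 'value'."""
    tokens = []
    for tok in _TOKEN_RE.findall(query):
        if tok[0] in "'\"" or tok[0].isdigit():
            tokens.append("value")      # string or numeric literal
        else:
            tokens.append(tok.lower())  # everything else is lowercased
    return tokens
```

For example, query_toks_no_value("SELECT name FROM singer WHERE age > 50") yields the tokens select, name, from, singer, where, age, >, value.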

How are the values in a SQL statement determined?

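Taking Spider and CSpider as examples, a generated SQL statement contains SQL syntax (the skeleton), schema item information, and value information. For example, in "count the male population older than 50", 50 is a value.
My question is how values are determined. Specifically, is recognizing which parts of the natural-language question are values (for example, identifying 50 as a value in "the male population older than 50") an ability the model acquires through training, or do Spider and CSpider already mark which tokens are values? If it is learned through training, could you briefly describe the approach?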

In text2sql and schema_item_classifier, is fine-tuning done on all parameters or only on a subset?

I read through the code and did not see anywhere that freezes part of the parameters.

Inference on CSpider with the 3B T5 model

OSError: Error no file named pytorch_model.bin found in directory ./models/text2natsql-mt5-xl-cspider/checkpoint-167433 but there is a file for Flax weights. Use from_flax=True to load this model from those weights.

This seems to be because transformers==4.17.0 does not support sharded models (https://discuss.huggingface.co/t/flan-t5-xl-model-does-not-appear-to-have-a-file-named-pytorch-model-bin/30395)
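
To confirm whether a checkpoint directory is sharded, one can look at its file names. The sketch below assumes the usual Hugging Face layout, in which a single-file checkpoint contains pytorch_model.bin and a sharded one contains an index file named pytorch_model.bin.index.json. The function name is hypothetical.

```python
import os


def describe_checkpoint(ckpt_dir: str) -> str:
    """Classify a checkpoint directory by the weight files it contains."""
    files = set(os.listdir(ckpt_dir))
    if "pytorch_model.bin" in files:
        return "single-file"
    if "pytorch_model.bin.index.json" in files:
        # The index maps parameter names to shard files; loading it needs a
        # transformers release that understands sharded checkpoints.
        return "sharded"
    return "unknown"
```

If the result is sharded, the options are to load with a newer transformers release or to re-save the weights as a single file with a release that can read them. Both are suggestions, not steps documented by the repository.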

Is the content downloaded by nltk_downloader actually used?

The download keeps failing, but I ran the CSpider experiments directly and they worked. Will skipping the download have any effect?

No matching distribution found for spacy==2.2.3

conda create -n your_env_name python=3.8.5
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt

When installing requirements.txt, the following error messages are printed:

Collecting spacy==2.2.3
  Using cached spacy-2.2.3.tar.gz (5.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/studio-lab-user/.conda/envs/studiolab/bin/python3.9 /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpnbsyo1tb
       cwd: /tmp/pip-install-5hq5ozgn/spacy_c569f2d7ab7a48d689e1bd1e3adaedf5
  Complete output (49 lines):
  Traceback (most recent call last):
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/requirements.py", line 35, in __init__
      parsed = _parse_requirement(requirement_string)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 64, in parse_requirement
      return _parse_requirement(Tokenizer(source, rules=DEFAULT_RULES))
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 82, in _parse_requirement
      url, specifier, marker = _parse_requirement_details(tokenizer)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 126, in _parse_requirement_details
      marker = _parse_requirement_marker(
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_parser.py", line 147, in _parse_requirement_marker
      tokenizer.raise_syntax_error(
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/_tokenizer.py", line 165, in raise_syntax_error
      raise ParserSyntaxError(
  setuptools.extern.packaging._tokenizer.ParserSyntaxError: Expected end or semicolon (after version specifier)
      spacy_lookups_data>=0.0.5<0.2.0
                        ~~~~~~~^
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
      main()
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 338, in run_setup
      exec(code, locals())
    File "<string>", line 200, in <module>
    File "<string>", line 190, in setup_package
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 106, in setup
      _install_setup_requires(attrs)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 77, in _install_setup_requires
      dist.parse_config_files(ignore_option_errors=True)
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 900, in parse_config_files
      self._finalize_requires()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 596, in _finalize_requires
      self._convert_extras_requirements()
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/dist.py", line 611, in _convert_extras_requirements
      for r in _reqs.parse(v):
    File "/tmp/pip-build-env-oijh_314/overlay/lib/python3.9/site-packages/setuptools/_vendor/packaging/requirements.py", line 37, in __init__
      raise InvalidRequirement(str(e)) from e
  setuptools.extern.packaging.requirements.InvalidRequirement: Expected end or semicolon (after version specifier)
      spacy_lookups_data>=0.0.5<0.2.0
                        ~~~~~~~^
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b7/f2/052bfe5861761599b5421916aba3eb0064d83145ff3072390ecdc5a836de/spacy-2.2.3.tar.gz#sha256=1d14c9e7d65b2cecd56c566d9ffac8adbcb9ce2cff2274cbfdcf5468cd940e6a (from https://pypi.org/simple/spacy/) (requires-python:!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,>=2.7). Command errored out with exit status 1: /home/studio-lab-user/.conda/envs/studiolab/bin/python3.9 /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpnbsyo1tb Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement spacy==2.2.3 (from versions: 0.31, 0.32, 0.33, 0.40, 0.51, 0.52, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.67, 0.68, 0.70, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.97, 0.98, 0.99, 0.100.0, 0.100.1, 0.100.2, 0.100.3, 0.100.4, 0.100.5, 0.100.6, 0.100.7, 0.101.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.7.2, 1.7.3, 1.7.5, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.10.0, 1.10.1, 2.0.0, 2.0.1.dev0, 2.0.1, 2.0.2.dev0, 2.0.2, 2.0.3.dev0, 2.0.3, 2.0.4.dev0, 2.0.4, 2.0.5.dev0, 2.0.5, 2.0.6.dev0, 2.0.6, 2.0.7, 2.0.8, 2.0.9, 2.0.10.dev0, 2.0.10, 2.0.11.dev0, 2.0.11, 2.0.12.dev0, 2.0.12.dev1, 2.0.12, 2.0.13.dev0, 2.0.13.dev1, 2.0.13.dev2, 2.0.13.dev4, 2.0.13, 2.0.14.dev0, 2.0.14.dev1, 2.0.15, 2.0.16.dev0, 2.0.16, 2.0.17.dev0, 2.0.17.dev1, 2.0.17, 2.0.18.dev0, 2.0.18.dev1, 2.0.18, 2.1.0, 2.1.1.dev0, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.1.5, 2.1.6, 2.1.7.dev0, 2.1.7, 2.1.8, 2.1.9, 2.2.0.dev10, 2.2.0.dev11, 2.2.0.dev13, 2.2.0.dev15, 2.2.0.dev17, 2.2.0.dev18, 2.2.0.dev19, 2.2.0, 2.2.1, 2.2.2.dev0, 2.2.2.dev4, 2.2.2, 2.2.3.dev0, 2.2.3, 2.2.4, 2.3.0.dev1, 2.3.0, 2.3.1, 2.3.2, 2.3.3.dev0, 2.3.3, 2.3.4, 2.3.5, 2.3.6, 2.3.7, 2.3.8, 2.3.9, 3.0.0, 3.0.1.dev0, 3.0.1, 3.0.2, 3.0.3, 3.0.4, 3.0.5, 3.0.6, 3.0.7, 3.0.8, 3.0.9, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.1.4, 3.1.5, 3.1.6, 3.1.7, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.2.5, 3.2.6, 3.3.0.dev0, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4, 3.6.0.dev0, 3.6.0.dev1, 3.6.0, 3.7.0.dev0, 4.0.0.dev0, 4.0.0.dev1)
ERROR: No matching distribution found for spacy==2.2.3

I first tried installing on Windows 10, then on AWS SageMaker Studio Lab (which is similar to Google Colab), and hit the same problem both times.

I then tried spacy 3.0.0, which installs successfully, but running the shell script infer_text2natsql.sh fails with a different problem:

(studiolab) studio-lab-user@default:~/sagemaker-studiolab-notebooks/RESDSQL$ sh scripts/inference/infer_text2natsql.sh 3b spider
/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py:715: UserWarning: [W094] Model 'en_core_web_sm' (2.2.0) specifies an under-constrained spaCy version requirement: >=2.2.0. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.0.0,<3.1.0
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/home/studio-lab-user/sagemaker-studiolab-notebooks/RESDSQL/NatSQL/table_transform.py", line 885, in <module>
    _tokenizer = get_spacy_tokenizer()
  File "/home/studio-lab-user/sagemaker-studiolab-notebooks/RESDSQL/NatSQL/natsql2sql/preprocess/TokenString.py", line 249, in get_spacy_tokenizer
    nlp = spacy.load("en_core_web_sm")
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 322, in load_model
    return load_model_from_package(name, **kwargs)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 355, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/en_core_web_sm/__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 514, in load_model_from_init_py
    return load_model_from_path(
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 388, in load_model_from_path
    config = load_config(config_path, overrides=dict_to_dot(config))
  File "/home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/spacy/util.py", line 545, in load_config
    raise IOError(Errors.E053.format(path=config_path, name="config.cfg"))
OSError: [E053] Could not read config.cfg from /home/studio-lab-user/.conda/envs/studiolab/lib/python3.9/site-packages/en_core_web_sm/en_core_web_sm-2.2.0/config.cfg

Has anyone else met the same problem?
Or is something wrong with my environment?
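
Two details in the logs above may be relevant. The paths show pip running under python3.9 in the studiolab environment, although the conda command requests Python 3.8.5, and the second traceback shows the en_core_web_sm 2.2.0 model being loaded by spaCy 3.0.0. As far as I know, spaCy model packages are versioned so that their first two version components match the spaCy release series they were built for. The sketch below encodes that rule of thumb. The function names are hypothetical and the rule is an assumption, not an official compatibility check.

```python
from typing import Tuple


def minor_series(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.2.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def model_matches_spacy(spacy_version: str, model_version: str) -> bool:
    """Rule of thumb: model and library should share the major.minor series."""
    return minor_series(spacy_version) == minor_series(model_version)
```

Under this rule, model_matches_spacy("3.0.0", "2.2.0") is False, which is consistent with the E053 error, while model_matches_spacy("2.2.3", "2.2.0") is True.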

How can I expose it as an API service?

Both the steps should be part of the service.
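
A minimal sketch of such a service, using only the Python standard library. It assumes that "both the steps" means schema item classification followed by SQL generation. The two functions rank_schema_items and generate_sql are hypothetical stand-ins that return dummy data; a real service would replace them with calls into the repository's models, loaded once at start-up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def rank_schema_items(question: str, db_id: str) -> dict:
    """Hypothetical stand-in for step 1 (schema item classification)."""
    return {"question": question, "db_id": db_id}


def generate_sql(ranked: dict) -> str:
    """Hypothetical stand-in for step 2 (SQL generation)."""
    return "sql placeholder"


def answer(payload: dict) -> dict:
    """Run both steps for one request payload."""
    ranked = rank_schema_items(payload["question"], payload["db_id"])
    return {"sql": generate_sql(ranked)}


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            body, status = answer(json.loads(self.rfile.read(length))), 200
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON or missing fields become a 400 response.
            body, status = {"error": str(exc)}, 400
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the blocking HTTP server (call this explicitly to run it)."""
    HTTPServer((host, port), Handler).serve_forever()
```

After calling serve(), a POST request with a JSON body such as {"question": "...", "db_id": "..."} returns a JSON object with an sql field. Keeping both steps behind one answer function means the models stay in memory between requests.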

Can't find the file nltk_downloader.py

Thanks for your nice work! I can't find the file nltk_downloader.py, which you mention in readme.md. Could you please provide it? Thank you.

If I want to prepare my own dataset for training or testing, are there any format requirements?

If I want to prepare my own dataset, are there any format requirements?

The SQL skeleton is too easy an objective

I trained a seq-to-seq model on Spider for 10 epochs. With the original SQL as the target, accuracy is about 60%+. After switching the target to skeleton+SQL, performance is very poor.

After a manual check, I found that the inference output contains only the skeleton and no SQL at all. Have you ever encountered this problem?
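
A quick way to quantify this symptom is to split each prediction into its skeleton and SQL parts and count how many predictions lack the second part. The sketch below assumes the skeleton+SQL target is one string in which a | character separates the skeleton from the SQL. That separator and the function name are assumptions for illustration.

```python
from typing import Optional, Tuple


def split_skeleton_and_sql(prediction: str, sep: str = "|") -> Tuple[str, Optional[str]]:
    """Split 'skeleton | sql' at the first separator.

    Returns (skeleton, sql), where sql is None if the separator or the
    SQL part is missing from the prediction.
    """
    skeleton, found, sql = prediction.partition(sep)
    sql = sql.strip()
    return skeleton.strip(), (sql if found and sql else None)
```

If most predictions come back with None for the SQL part, one thing worth checking (again an assumption, not a confirmed cause) is whether the maximum generation length is large enough for the skeleton and the SQL together.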

The CSpider training bash script seems to contain errors and is also incomplete

./scripts/train/cspider_text2natsql/generate_text2natsql_dataset.sh has the following two problems (the same applies to cspider_text2sql):

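On line 4, the input_dataset_path of text2sql_data_generator.py should be train_cspider_with_probs_natsql.json, which carries the column and table probabilities;
The step that runs schema_item_classifier.py on the training data is missing. The preprocessed_train_cspider_natsql.json written on line 4 should instead be the input to that model.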

Is there a distributed version of the code? I'd like to reproduce the T5 results

If there is a distributed version available, please kindly inform us on how to use it, as not all research centers have the same resources as Renmin University. Thank you.

How can we optimize the model inference time? A single NLQ takes more than 45 seconds.

Hello everyone,

I hope you're doing well. I ran into an issue while using a fine-tuned RESDSQL on my own (Spider-like) dataset to predict SQL: inference takes around one minute per question. While profiling the steps, I found that schema_item_classifier.py and text2sql.py take the majority of the time. I would greatly appreciate any suggestions or insights on reducing the prediction time.
Thank you in advance for your assistance!

How is NatSQL converted? How is the skeleton extracted?

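I noticed that the author mentioned in another issue that the sql-to-natsql code does not seem to be open source. Did the author directly use an already converted dataset? (Is there a ready-made one for CSpider as well?)
Also, for the skeleton-aware decoder, the first half of the output is the SQL skeleton and the second half is the SQL. For NatSQL, can this be understood as the first half being the NatSQL skeleton and the second half being the NatSQL? If so, how is the NatSQL skeleton produced?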

Support for Chinese

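The models and datasets used in the example code are all English, and my own tests confirm that Chinese is not yet well supported. To port this to a Chinese setting, do I need to replace the RoBERTa model, the T5 model, and the training dataset with Chinese ones? I searched online and found some matching models and datasets. Has the development team tried anything similar, and did you run into any difficulties or obstacles?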

The Chinese model and dataset resources I found:
https://github.com/brightmart/roberta_zh
https://github.com/SunnyGJing/t5-pegasus-chinese
https://taolusi.github.io/CSpider-explorer/

NatSQL-Parser

Hey,
I'm very interested in your work. I want to train RESDSQL+NatSQL on my own dataset. I had no problems to train only RESDSQL but I don't know how to create the NatSQL-JSON-file. Do you know if there is any script available for parsing SQL-queries into NatSQL?
Thanks in advance!

Timestamp Functionality

Hey, I would like to know how to train the model so that it supports timestamp functionality, as in the Druid SQL interface for example. Is there a way to add timestamp functionality as well?

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

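Why does the label list take only the first four entries each time (in the prepare_batch_inputs_and_labels method)? For example, my tables.json defines ten tables in total, but the labels of the first four are all 0 and only the last six are 1. After the loop, only the first four labels are appended to the list each time, so they are all 0 and this error is raised. After I reordered the tables in tables.json it worked, but then evaluating schema_item_classifier raised the same error again. Is this a class-imbalance problem? Or should tables.json contain only the tables used by the SQL queries in train.json, with unused tables left out? Has anyone run into this? I am using my own dataset, and the training set structure already matches the required format.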
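
Independently of why the labels end up in a single class, the metric call itself can be guarded so that evaluation does not crash. The sketch below wraps an arbitrary AUC function and returns None when only one class is present. The name safe_auc is hypothetical, and the guard does not address whatever truncates the label list.

```python
from typing import Callable, Optional, Sequence


def safe_auc(
    y_true: Sequence[int],
    y_score: Sequence[float],
    auc_fn: Callable[[Sequence[int], Sequence[float]], float],
) -> Optional[float]:
    """Compute AUC only when y_true contains both classes.

    ROC AUC is undefined with a single class, so return None instead of
    letting the metric function raise.
    """
    if len(set(y_true)) < 2:
        return None
    return auc_fn(y_true, y_score)
```

With scikit-learn installed, the call would look like safe_auc(labels, probs, sklearn.metrics.roc_auc_score), and a None result can be logged and skipped.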

Have you tried testing on the CSpider dataset?

Hello, have you tried testing on the CSpider dataset?
Can it handle multi-table queries?

Inference script

Hi!

I would like to try your repo with my own queries. I can see that there are inference scripts for the datasets you support, but I can't see a script that accepts queries from users. Is there something like this, or should I write it myself?

Thanks

ModuleNotFoundError: No module named 'third_party.spider'

Traceback (most recent call last):
File "/Users/piranavs/hay_test/resdsql/text2sql.py", line 16, in
from utils.spider_metric.evaluator import EvaluateTool
File "/Users/piranavs/hay_test/resdsql/utils/spider_metric/evaluator.py", line 4, in
from third_party.spider.preprocess.get_tables import dump_db_json_schema
ModuleNotFoundError: No module named 'third_party.spider'

I am getting this error during inference. Am I doing something wrong?
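
The import in the traceback, third_party.spider.preprocess.get_tables, implies that a file third_party/spider/preprocess/get_tables.py must exist under the directory the script is run from. The sketch below checks for it. The function name is hypothetical, and how the third_party directory is meant to be populated is described in the repository's README, not here.

```python
import os
from typing import Optional


def missing_spider_module(repo_root: str) -> Optional[str]:
    """Return the expected path of get_tables.py if it is missing, else None."""
    expected = os.path.join(
        repo_root, "third_party", "spider", "preprocess", "get_tables.py"
    )
    return None if os.path.isfile(expected) else expected
```

Running missing_spider_module(".") from the repository root returns None when the file is in place. A returned path points to either a missing third_party checkout or a script launched from a different working directory.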

sh scripts/inference/infer_text2natsql_cspider.sh 3b fails

The repository does not have a license

Thank you so much for your wonderful work.

Currently, the repository does not have a license. According to the GitHub documentation:

You're under no obligation to choose a license. However, without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license.

Do you think you could add an open source license to the repository, so that other people are legally allowed to reproduce, distribute, or create derivative works from it?

More discussion on this matter

Dataset used for finetuning mt5 model

Hi
First of all, thank you for your great work on this project. You've achieved some of the best results on the Spider benchmark, and your clear and complete README allowed me to run your code very easily.

I want to see whether I can fine-tune a text2natsql model on mT5 as you did for CSpider. I was wondering how much data I would have to create, as I want to build a dataset like CSpider but in Persian.

Was CSpider the only dataset used for fine-tuning the mT5 backbone, or were other datasets also used?

How large is the performance difference on CSpider with and without NatSQL?

On CSpider, how many points lower is mT5-base without NatSQL than with NatSQL?

inference scripts error

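Hello, I ran into some problems while trying to run model inference:
I am using the RESDSQL-base model. After completing the preparation steps, I ran sh scripts/inference/infer_text2sql.sh base spider for inference and got the following error: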

Traceback (most recent call last):
File "schema_item_classifier.py", line 463, in
total_table_pred_probs, total_column_pred_probs = _test(opt)
File "schema_item_classifier.py", line 428, in _test
batch_column_number_in_each_table
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 191, in forward
batch_column_number_in_each_table
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 134, in table_column_cls
output_t, (hidden_state_t, cell_state_t) = self.table_name_bilstm(table_name_embeddings)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 689, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 632, in check_forward_args
self.check_input(input, batch_sizes)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 203, in check_input
expected_input_dim, input.dim()))
RuntimeError: input must have 3 dimensions, got 2

So I added the following at line 134 of ./utils/classifier_model.py:

print(table_name_embeddings.size(),table_name_embeddings)
table_name_embeddings = table_name_embeddings.unsqueeze(0)
print(table_name_embeddings.size(),table_name_embeddings)

This led to a subsequent error:

torch.Size([1, 1024]) tensor([[-0.3795, -0.9529, 0.9007, ..., -0.6501, -2.1801, 0.9587]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.3795, -0.9529, 0.9007, ..., -0.6501, -2.1801, 0.9587]]],
device='cuda:0')
torch.Size([1, 1024]) tensor([[-0.7597, -0.5682, -0.4270, ..., 0.3219, 1.5417, 0.3518]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.7597, -0.5682, -0.4270, ..., 0.3219, 1.5417, 0.3518]]],
device='cuda:0')
torch.Size([1, 1024]) tensor([[-0.4921, -1.1286, 0.9307, ..., -0.5373, -2.0887, 0.9216]],
device='cuda:0')
torch.Size([1, 1, 1024]) tensor([[[-0.4921, -1.1286, 0.9307, ..., -0.5373, -2.0887, 0.9216]]],
device='cuda:0')
torch.Size([3, 1024]) tensor([[-0.5896, -1.3575, 1.1120, ..., -0.6104, -1.9414, 0.6679],
[-0.6831, -1.3711, 1.1447, ..., -0.5117, -2.0709, 0.8956],
[-0.6337, -1.3548, 1.2228, ..., -0.4896, -2.0505, 0.8417]],
device='cuda:0')
torch.Size([1, 3, 1024]) tensor([[[-0.5896, -1.3575, 1.1120, ..., -0.6104, -1.9414, 0.6679],
[-0.6831, -1.3711, 1.1447, ..., -0.5117, -2.0709, 0.8956],
[-0.6337, -1.3548, 1.2228, ..., -0.4896, -2.0505, 0.8417]]],
device='cuda:0')
0%| | 0/33 [00:01<?, ?it/s]
Traceback (most recent call last):
File "schema_item_classifier.py", line 462, in
total_table_pred_probs, total_column_pred_probs = _test(opt)
File "schema_item_classifier.py", line 427, in _test
batch_column_number_in_each_table
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 194, in forward
batch_column_number_in_each_table
File "/mnt/data/tt/RESDSQL/utils/classifier_model.py", line 138, in table_column_cls
table_name_embedding = hidden_state_t[-2:, :].view(1, 1024)
RuntimeError: shape '[1, 1024]' is invalid for input of size 3072

I need some help fixing these errors. Thank you 🙏
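
The numbers in the second traceback can be reproduced with simple shape arithmetic. With the default batch_first=False, nn.LSTM reads a 3-D input as (seq_len, batch, features), so unsqueeze(0) turns the [3, 1024] tensor into a sequence of length 1 with batch size 3. The slice hidden_state_t[-2:] then has shape (2, batch, hidden_size). The sketch below assumes hidden_size is 512, a value inferred from 3072 = 2 x 3 x 512 and not read from the code, and that batch_first is left at its default.

```python
from typing import Sequence


def last_layer_hidden_numel(
    input_shape: Sequence[int], hidden_size: int, batch_first: bool = False
) -> int:
    """Element count of h_n[-2:] for a bidirectional LSTM.

    h_n[-2:] holds the final forward and backward states of the last layer,
    with shape (2, batch, hidden_size) regardless of batch_first.
    """
    if len(input_shape) != 3:
        raise ValueError("expected a 3-D input shape")
    batch = input_shape[0] if batch_first else input_shape[1]
    return 2 * batch * hidden_size
```

Here last_layer_hidden_numel((1, 3, 1024), 512) gives 3072, the size in the error, while last_layer_hidden_numel((3, 1, 1024), 512) gives 1024, which fits view(1, 1024). Under these assumptions, unsqueeze(1) rather than unsqueeze(0) would keep the three rows as one sequence. As far as I know, PyTorch 1.11 and later also accept the original 2-D input directly, and 1.11.0 is the version pinned in the install command quoted in another issue on this page.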

Inference Run Killed

When I run inference, the process is killed, but I don't know why.

What stops the process, and what can I do to fix it?

Are there plans to add model parallelism to the mT5 training code?

When reproducing the mT5 text2sql training, we found that a single GPU with 40 GB of memory cannot run even one batch...

Comparison with GPT-4

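Judging by the evaluation metrics alone, your approach even outperforms GPT-4 without SQL-specific enhancement, with far fewer parameters than GPT. I would like to understand how performance and accuracy compare with GPT-4.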

Can it handle multi-table joins?

Can it handle join queries across multiple tables?

How can the XLM-RoBERTa-large classification model be run on multiple GPUs?

I tried running on multiple GPUs:
model = nn.DataParallel(model, device_ids=devices)
model.to(device)
but this raises an error:
Traceback (most recent call last):
File "schema_item_classifier_gpus.py", line 470, in
_train(opt)
File "schema_item_classifier_gpus.py", line 287, in _train
loss = encoder_loss_func.compute_loss(
File "/workspace/RESDSQL/utils/classifier_loss.py", line 60, in compute_loss
table_loss = self.compute_batch_loss(batch_table_name_cls_logits, batch_table_labels, batch_size)
File "/workspace/RESDSQL/utils/classifier_loss.py", line 47, in compute_batch_loss
loss += self.focal_loss(logits, labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/RESDSQL/utils/classifier_loss.py", line 16, in forward
assert input_tensor.shape[0] == target_tensor.shape[0]
Is multi-GPU execution impossible because of how this classification model is structured?

Requirements

When I install from the requirements file, I get this error:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for jarowinkler
Failed to build rapidfuzz tokenizers jarowinkler
ERROR: Could not build wheels for rapidfuzz, tokenizers, jarowinkler, which is required to install pyproject.toml-based projects

Can you please tell me how to install these packages from the requirements list?

Low accuracy in predicting SQL using RESDSQL on my dataset

Hello everyone,

I hope you're doing well. I encountered an issue while using RESDSQL for predicting SQL on my dataset. Despite following all the recommended steps, I'm observing an accuracy range of only 30-40%. I would greatly appreciate any suggestions or insights on increasing the predictions' accuracy.

Thank you in advance for your assistance!

Question about the definition of column_number_in_each_table in schema_item_classifier.py

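Hello. While reading your code, I noticed that lines 156-162 of schema_item_classifier.py update batch_column_number_in_each_table, but they rely on information from table_labels and column_labels. Later in the code, batch_column_number_in_each_table is passed to the model as an argument for inference. How should this argument be defined when no labels are available?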

Error in running Inference script

I am trying to run the inference script, but I get TypeError: expected str, bytes or os.PathLike object, not NoneType

I am also attaching the entire output log.
New Text Document (2).txt

How can I better understand the cross-encoder proposed in the paper?

Why does every example in the validation set produce ["sql placeholder"] at the text2sql.py stage?

sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
sql placeholder
near "sql": syntax error
100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.33s/it]
2023-03-01 14:17:33,138 INFO root 输出结果
['sql placeholder']

This is a single-example test on a 32 GB V100.

How can schema_item_classifier.py predict table and column names for a newly added database?

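schema_item_classifier.py is a model that selects tables and columns. How can it select tables and columns for a newly added database when the number of tables and columns in that database is not known in advance?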

TypeError: 'datetime.datetime' object is not subscriptable

File "F:\python_project\RESDSQL\NatSQL\natsql2sql\preprocess\db_match.py", line 194, in db_col_type_check
if skip_once and len(values) > 7 and not v[0][0].isdigit():
TypeError: 'datetime.datetime' object is not subscriptable

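In other words, when the rows returned by query = "select distinct "+col[1]+" from " + self.table_list[table_idx] + " order by "+col[1]+" limit 500" contain values of type datetime.datetime, this error is raised. Does this mean that only databases such as SQLite, which have no datetime type, are supported?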
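
The expression v[0][0] in the traceback indexes the first character of the first column value, which works only when that value is a string. One possible workaround, sketched below as an assumption and not as the repository's fix, is to convert the value to text before testing its first character. The helper name is hypothetical.

```python
import datetime


def starts_with_digit(value) -> bool:
    """True if the textual form of a database value begins with a digit.

    Works for str, numbers, datetime objects, and None, because non-string
    values are converted with str() before the first character is tested.
    """
    text = value if isinstance(value, str) else str(value)
    return bool(text) and text[0].isdigit()
```

In db_col_type_check, the test not v[0][0].isdigit() could then be written as not starts_with_digit(v[0]). Whether treating a datetime as a digit-leading value matches the intent of that check is something to verify against the surrounding code.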

CSpider training without NatSQL: the second step fails with RuntimeError: input must have 3 dimensions, got 2

The error output is below. I downloaded xlm-roberta-large and placed it under the base_models directory. Download link: https://huggingface.co/xlm-roberta-large/tree/main
Namespace(add_fk_info=False, alpha=0.75, batch_size=4, dev_filepath='./data/preprocessed_data/preprocessed_dev_cspider_natsql.json', device='0', epochs=128, gamma=2.0, gradient_descent_step=2, learning_rate=1e-05, mode='train', model_name_or_path='./base_models/xlm-roberta-large', output_filepath='data/pre-processing/dataset_with_pred_probs.json', patience=4, save_path='./models/xlm_roberta_text2natsql_schema_item_classifier', seed=42, tensorboard_save_path='./tensorboard_log/xlm_roberta_text2natsql_schema_item_classifier', train_filepath='./data/preprocessed_data/preprocessed_train_cspider_natsql.json', use_contents=True)
Some weights of the model checkpoint at ./base_models/xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']

This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
This is epoch 1.
Traceback (most recent call last):
File "schema_item_classifier.py", line 463, in
_train(opt)
File "schema_item_classifier.py", line 277, in _train
batch_column_number_in_each_table
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/RESDSQL/utils/classifier_model.py", line 191, in forward
batch_column_number_in_each_table
File "/workspace/RESDSQL/utils/classifier_model.py", line 134, in table_column_cls
output_t, (hidden_state_t, cell_state_t) = self.table_name_bilstm(table_name_embeddings)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 677, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 620, in check_forward_args
self.check_input(input, batch_sizes)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 203, in check_input
expected_input_dim, input.dim()))
RuntimeError: input must have 3 dimensions, got 2

Is the text2sql-t5-3b used in the code downloaded directly from Hugging Face? I could not see where you download it.

What post-processing is done during decoding?

I would like to know what post-processing is applied during decoding. Are there specific steps?
pred_natsql = fix_fatal_errors_in_natsql(pred_natsql, batch_tc_original[batch_id])
if old_pred_natsql != pred_natsql:
    print("Before fix:", old_pred_natsql)
    print("After fix:", pred_natsql)
    print("---------------")
pred_sql = natsql_to_sql(pred_natsql, db_id, db_file_path, table_dict[db_id]).strip()
I ask because I found that decoding directly, without post-processing, gives very poor results.

How I can do inference on the model only with a question?

Hi, I want to know whether it is possible to use the model to get the SQL statement directly after giving it a question in natural language. For example, when I use the attached dev.json (a modified version of the one from the Spider dataset), I get no result in pred.sql.
Thank you for your help.

Dev.json file

Hi,
I want to train the model using my own dataset and I saw in another thread that the dev.json file is required for this. Could you elaborate on how the dev.json file should be formatted, given some query and a database schema?

Best,
Adam
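
As a starting point, the sketch below builds one entry in the style of the public Spider dev.json, assuming the file is a JSON list of objects with db_id, question, and query fields plus whitespace-tokenized versions of the question and query. The public Spider files also carry a query_toks_no_value list and a parsed sql object, which are omitted here. Which fields the RESDSQL preprocessing actually reads should be checked against its code. The function name and the example values are hypothetical.

```python
import json
from typing import Dict, List, Union


def make_spider_style_entry(
    db_id: str, question: str, query: str
) -> Dict[str, Union[str, List[str]]]:
    """Build one Spider-style example from a question and its gold SQL."""
    return {
        "db_id": db_id,                       # must match an entry in tables.json
        "question": question,
        "question_toks": question.split(),    # naive whitespace tokenization
        "query": query,
        "query_toks": query.split(),
    }


# A dev.json file would then be a JSON list of such entries.
example = [make_spider_style_entry(
    "my_db", "How many users are there ?", "SELECT count(*) FROM users")]
dev_json_text = json.dumps(example, indent=2, ensure_ascii=False)
```

The db_id ties each example to a schema in tables.json and to a database file, so the three must stay consistent.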
