
Comments (20)

stprior avatar stprior commented on June 14, 2024 1

@Sandy26 yes, if you are just using the pretrained model to generate SQL, I've had good results for unseen tables and queries doing exactly that. If your new tables or queries are substantially different from those in the WikiSQL data set, you would get better results by providing training data and training a new model or modifying the existing one. Also note there was an error in the notebook which I've just fixed, and I intend to use the code from #4, which should improve results too.


donglixp avatar donglixp commented on June 14, 2024

Hi @Sandy26 ,

The WikiSQL data format can be found in https://github.com/salesforce/WikiSQL . Thanks!


stprior avatar stprior commented on June 14, 2024

@Sandy26 I've put together a notebook that runs through the steps of adding and annotating a new table and query: https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb . Note that I'm just getting familiar with this, so it may well have errors.
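
(For reference, the notebook adds tables in WikiSQL's *.tables.jsonl format; a minimal sketch of such an entry, with the id, header and rows made up for illustration:)

{"id": "1-my-table-1",
 "header": ["Item", "Priority", "In_stock"],
 "types": ["text", "text", "real"],
 "rows": [["apples", "low", 1], ["bananas", "high", 0]]}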


donglixp avatar donglixp commented on June 14, 2024

@stprior Solved in #4 .


Sandy26 avatar Sandy26 commented on June 14, 2024

@stprior Thank you so very much. This is very helpful!!! I will let you know how it goes :) because I believe I need to add one more step and actually train a model on my SQL tables. Thank you once again for providing a solid first step in that direction!
@donglixp Thank you for the piece of code mentioned above.


Sandy26 avatar Sandy26 commented on June 14, 2024

@stprior: Just want to make sure I understand this correctly. In your notebook (https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb) you write:

"Set up a question. The SQL field describes the expected query when training or testing. In this notebook it is not used, but it should still make sense for the table (e.g. conds should not specify a column number which is not in the table)."

In [57]:
question = {"phase": 1,
"table_id": "1-10753917-1",
"question": "How many wins did the ferrari team have after 1950 and before 1960?",
"sql": {"sel": 1, "conds": [[2, 0, "Williams"], [8, 0, "2"]], "agg": 3}
}
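
(For reference, the sql field uses WikiSQL's encoding, as in the WikiSQL lib's Query class; a rough decoding of the example above, assuming that encoding:)

agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']   # "agg": 3 -> COUNT
cond_ops = ['=', '>', '<', 'OP']                      # condition op 0 -> '='
# So the sql above corresponds roughly to:
#   SELECT COUNT(col1) FROM 1-10753917-1 WHERE col2 = 'Williams' AND col8 = '2'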

So ideally, in a "test" question I should not need the "sql" part, correct? But I should give some value (using only columns that are in the table) just so that the code doesn't break?
Thank you,
Sandy


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Stephen,
Your Python notebook was very helpful. I was able to follow the steps to add my own table and question.
But the code works or crashes depending upon my question. For some questions it works, that is, it gives me a query. But for other questions it crashes with this error:

Traceback (most recent call last):
  File "test_mytable.py", line 99, in <module>
    result_list=translator.translate(batch)
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Translator.py", line 89, in translate
    op_batch_list, self.fields['cond_op'].vocab.stoi[table.IO.PAD_WORD]).t())
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Utils.py", line 46, in add_pad
    return torch.LongTensor(r_list).cuda()
RuntimeError: given sequence has an invalid size of dimension 2: 0
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 20, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
    import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'

Original exception was:
Traceback (most recent call last):
  File "test_mytable.py", line 99, in <module>
    result_list=translator.translate(batch)
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Translator.py", line 89, in translate
    op_batch_list, self.fields['cond_op'].vocab.stoi[table.IO.PAD_WORD]).t())
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Utils.py", line 46, in add_pad
    return torch.LongTensor(r_list).cuda()
RuntimeError: given sequence has an invalid size of dimension 2: 0

And even for the questions that do get a query, it has incorrect column numbers. I think that is telling me that the current model is not working for me and I need to train a new one with my own table. Does that sound correct? Also, any thoughts on why I might get the above error for some questions and not for others? As of now, I have failed to find a pattern in the failing and succeeding questions.

Thank you once again,
Sandy


donglixp avatar donglixp commented on June 14, 2024

Hi @Sandy26 ,

The exception "RuntimeError: given sequence has an invalid size of dimension 2: 0" is caused by that all the queries in the current batch do not have WHERE clauses. So the "r_list" contains empty lists. A quick fix would be changing the https://github.com/donglixp/coarse2fine/blob/master/wikisql/table/Utils.py#L41 into:

max_len = max(1, max((len(b) for b in b_list)))

which would add padding tokens to "r_list" even when every query in the batch has an empty WHERE clause.
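
(A minimal sketch of how the fixed helper then behaves; the exact body of add_pad in wikisql/table/Utils.py may differ from this:)

import torch

def add_pad(b_list, pad_index):
    # Without the max(1, ...) guard, max_len is 0 when every b in b_list is
    # empty, and torch.LongTensor(r_list) then fails with
    # "given sequence has an invalid size of dimension 2: 0".
    max_len = max(1, max(len(b) for b in b_list))
    r_list = [b + [pad_index] * (max_len - len(b)) for b in b_list]
    return torch.LongTensor(r_list).cuda()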

Thanks!


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Li,
Thank you for that suggestion. It works, in that at least the code now runs to the end. What I realised is that if I ask a grammatically incorrect/incomplete question (as in a question that may be asked by someone for whom English is not the first language), my current model cannot find the WHERE clause, and hence the error.
For example, if the question is:
"How many low priority items?" (no verb phrase) => the model is unable to find "where priority=low"
But if the question is:
"How many items have low priority?" => I get "where col0=low"

The remaining problem is that col0 is not the "priority" column. But that, I think, should get better once I train the model on my own data, correct?

Just wanted to let you know about my findings.

Thank you,
Shruti


donglixp avatar donglixp commented on June 14, 2024

Hi @Sandy26 ,

It would be better to train the model on your own dataset if your questions follow different patterns. Thanks!


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Li,
Since I will be feeding the code a new table, I tried to find where the column names for the table are mapped to numerical values like 0, 1, 2, ... Or does the training data take care of it, i.e. when we give the "sql" part in the train.jsonl file?

Thank you,
Shruti


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Li, Stephen,

Any thoughts on how difficult it would be to add "DISTINCT" and "LIMIT" functionality to this model? As in, say the question is:
"Tell me any three types of fruits in stock."
then ideally my query would be:
SELECT DISTINCT Fruits FROM Table WHERE In_stock=1 LIMIT 3

Just curious!
Thank you,
Shruti


stprior avatar stprior commented on June 14, 2024

Hi Shruti,

Adding new clauses like DISTINCT and LIMIT would be possible, but difficult. It might be possible to treat it as a new aggregate category and layout, but the WikiSQL lib code would need to be changed to include this new operation, and the coarse2fine code would need to be rewritten too. You would also need plenty of training examples, and would need to train the model from scratch because the pre-trained model would not be usable.
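
(Purely as an illustration of that point, not the repo's actual code: the aggregate label set lives in the WikiSQL lib's Query class, and a new operation would have to be threaded through from there.)

# Roughly what WikiSQL's lib/query.py defines today:
agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
# A hypothetical extension, e.g. agg_ops + ['DISTINCT'], would also require
# updating annotation, the model's output layers, and evaluation, and LIMIT
# would additionally need a value slot rather than just a new label.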

Training to target the existing supported SQL syntax would be easier: training could start from the existing pre-trained model, and less training data would be required. The column numbers just come from the order in which the column descriptions appear in the header JSON entries of the table files.
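
(To make the column-number mapping concrete, a small illustration with a made-up header rather than the real one for table 1-10753917-1:)

header = ["Year", "Wins", "Team"]   # order as it appears in the table file
print({i: name for i, name in enumerate(header)})
# -> {0: 'Year', 1: 'Wins', 2: 'Team'}, so "sel": 1 would refer to "Wins"
#    and a condition on column 2 would refer to "Team".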
Regards,
Stephen


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Stephen,
Thank you for sharing your thoughts. I don't think I really understood when you said "Training to target the existing supported SQL syntax would be easier, training could start from the existing pre-trained model and less training data required".
Do you mean that I can use the existing pretrained model and somehow train it to include some examples from my table? So should I just append my annotated questions to train.jsonl and train the model? Or is there a better way to do it?
Thank you,
Shruti


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Stephen,
Just adding to my previous question: to start with the pretrained model and add my own training data to it, is it about line 177 in train.py?

"Load checkpoint if we resume from a previous training."?

But I don't know how to use it (I have the pretrained model and my new annotated questions and table, but I am not sure how to proceed).

many thanks,
Shruti


stprior avatar stprior commented on June 14, 2024

I haven't actually tried training the model yet, but I plan to in the next week or so - I'll put up some notes when I do. If you add data to the existing dev, train, and test data files and run annotate.py on them, you shouldn't need to make code changes; you should be able to follow the top-level run.sh script. I don't know how many training examples you would need to make a difference to the model, though. Alternatively, you could train using mostly or only your own examples, but the quality of the model for more general queries would probably drop.
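
(A minimal sketch of that workflow, with the file path and example values assumed rather than taken from the repo: append new WikiSQL-format examples to the data files, then re-run annotate.py and the training steps in run.sh.)

import json

# Hypothetical new example; table_id must match an entry in the corresponding
# *.tables.jsonl file, and the sql field uses WikiSQL's encoding
# (here "agg": 3 is COUNT and condition op 0 is '=').
new_example = {
    "phase": 1,
    "table_id": "1-my-table-1",
    "question": "How many items have low priority?",
    "sql": {"sel": 0, "conds": [[1, 0, "low"]], "agg": 3},
}

with open("train.jsonl", "a") as f:   # assumed path; adjust to what run.sh expects
    f.write(json.dumps(new_example) + "\n")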


donglixp avatar donglixp commented on June 14, 2024

Hi @Sandy26 ,

Supporting these queries needs some code modifications and new annotated examples. Then the new model can be trained from scratch.


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Stephen, Li,

I tried to train my model using my ~200 queries plus 300 from WikiSQL. I tried to keep them evenly matched so that my data has more influence. As expected, the accuracy is very low, but it at least does better on my queries than the pretrained model as-is. While annotating queries, though, I realised that for training queries there always has to be a WHERE clause? As in, conds cannot be something like conds: [[ , , "" ]]. So how can one train on very simple queries like:

What is the total number of transactions?
SQL: SELECT COUNT(transactions) FROM table X

Or can I tweak annotate.py to give conds: [] a harmless value?

Thank you,
Shruti


stprior avatar stprior commented on June 14, 2024

Hi Shruti, the WikiSQL training data includes examples like this which have "conds": [].
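
(For illustration, a made-up training line of that shape, using the same sql encoding as the other examples, where "agg": 3 is COUNT:)

{"phase": 1, "table_id": "1-my-table-1", "question": "What is the total number of transactions?", "sql": {"sel": 0, "conds": [], "agg": 3}}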


Sandy26 avatar Sandy26 commented on June 14, 2024

Hi Li, Stephen,

I have created a new model for my own dataset, and it does give me reasonable results. Thank you for all your help so far.
Now I really need to add GROUP BY, ORDER BY, and LIMIT functionality. I am going around in circles about what really needs to be done, but here are a few specific questions:

I plan to create sample data like:
{"phase": 1, "table_id": "1-1-1", "question":"Find baseball tickets by cities ?","sql":{"sel":[1,2] ,"conds":[],"agg":[0,3],"group":[1],"order":[],"limit":[]}}
so the query should be:
SELECT City,COUNT(Tickets) FROM table GROUP BY City
Another example could be:
{"phase": 1, "table_id": "1-1-1", "question":"Find top 5 baseball tickets by cost ?","sql":{"sel":[3,2] ,"conds":[],"agg":[0,0],"group":[],"order":[1],"limit":[5]}}

  1. How do I change sel from an integer to a list, and likewise change agg from an integer to a list? With that change, can I still use agg_classifier and sel_match (ModelConstructor.py lines 100-104), or will those need to change?

  2. As per my understanding, the "lay" field includes the list of operands in the conditions and hence keeps track of the number of conditions. Do I need a separate lay for each of "group", "order" and "limit", or should conditions, group, order and limit all be in one "lay" field?

  3. Should I use something like agg_classifier or matchscorer for modelling "group", "order" and "limit"?

I understand that these are quite involved questions, but any help/ideas about the starting point would be greatly appreciated.

Thank you very much,
Shruti

