
microsoft / codemixed-text-generator

This tool automates the generation of grammatically valid synthetic code-mixed data by applying linguistic theories such as the Equivalence Constraint Theory and the Matrix Language Theory.

License: MIT License

Languages: Python 15.75%, Shell 0.15%, Jupyter Notebook 74.59%, JavaScript 8.62%, CSS 0.16%, HTML 0.73%
Topics: code-mixing, natural-language-processing, code-switching, linguistics, python3, synthetic-data-generation, language-modeling, data-generation

codemixed-text-generator's People

Contributors

microsoft-github-operations[bot], microsoftopensource, mohdsanadzakirizvi, tanuja-ganu


codemixed-text-generator's Issues

Cannot compare float and str

In pre_gcm.py, line 49 reads:
if 3 < l1len <= 100 and 3 < l2len <= 100 and pfms_score <= max_pfms:

This raises a TypeError because it compares a float (pfms_score) with a str (max_pfms). max_pfms should be converted to a float so the comparison works.
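A minimal sketch of the suggested fix, assuming max_pfms is read in as a string (e.g. from a config value):

max_pfms = float(max_pfms)  # max_pfms arrives as a str; cast once before comparing

if 3 < l1len <= 100 and 3 < l2len <= 100 and pfms_score <= max_pfms:
    pass  # keep the sentence pair, as before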

The given example is not working

Hi all,

Thank you for the effort taken to create this tool; it's really helpful.

I cloned your repo and installed the dependencies, and everything looked fine. I tried the Web UI method using the provided example (Hindi-English), but it is not working. When I dug a little into the code, I found a number of problems. For example:

The generate button calls a function named run_gcm(), which runs a whole Python script called sequence_run.py, which in turn runs another script called pre_gcm.py. In pre_gcm.py, we can see that main() is invoked with an empty argument when it is supposed to receive a parsed sentence.
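A hypothetical reconstruction of that call chain (run_gcm(), sequence_run.py, and pre_gcm.py are named above; the exact call sites are an assumption):

import subprocess

def run_gcm(config_path):
    # The Web UI's generate button reportedly lands here; sequence_run.py
    # then chains the pipeline stages, ending with pre_gcm.py.
    subprocess.run(["python", "sequence_run.py", config_path], check=True)

# pre_gcm.py's __main__ block then reportedly calls main() with an empty
# argument, although main() expects a parsed sentence.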

Is this a bug, or did I miss something?
Also, is this tool stable enough for use, or is it still under development?

Either way, thank you for your efforts!

Output of GCM is empty

Hello,

I am able to run through the aligner and pre-GCM stages of the toolkit, but at the GCM stage the output is empty.

Below is a screenshot of the config file:
[screenshot: config file settings, 2021-08-17]

Also, it seems that the problem is connected to this function:

def run_in_try(func, pipe, params):
    try:
        # print(params)
        ret = func(params)
    except Exception as e:
        ret = "fail"
    pipe.send(ret)
    pipe.close()

It seems to return "fail" consistently. Is there something I am missing?

A file "out-cm-en-de.txt" is created, but nothing is in it.

Error in Pre GCM Stage

I am facing an error in the pre-GCM stage while running the script on a large set of Hindi-English parallel sentences. Please find the error in the log below. I am also facing an issue in the alignment stage, where some sentences are not being considered due to an error.

Error in line 300000
||| <sentence of length 300>
__main__: INFO: 2022-01-04 17:24:56,088: Parsing sentences: 0, 499
Traceback (most recent call last):
File "pre_gcm.py", line 204, in <module>
main()
File "pre_gcm.py", line 174, in main
output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
File "pre_gcm.py", line 174, in <listcomp>
output = ["(ROOT "+" ".join(str(berkeley_parser.parse(sentence)).split())+")\n" for sentence in target_s]
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 115, in parse
return list(self.parse_sents([sentence]))[0]
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 137, in parse_sents
for parse_raw, tags_raw, sentence in self._batched_parsed_raw(self._nltk_process_sents(sents)):
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/base_parser.py", line 342, in _batched_parsed_raw
for sentence, datum in sentence_data_pairs:
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/benepar/nltk_plugin.py", line 89, in _nltk_process_sents
sentence = nltk.word_tokenize(sentence, self._tokenizer_lang)
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 129, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
return tokenizer.tokenize(text)
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
for sentence in slices:
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
for sentence1, sentence2 in _pair_iter(slices):
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
prev = next(iterator)
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
for match, context in self._match_potential_end_contexts(text):
File "/home/anshul.padhi/miniconda3/envs/gcm/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
before_words[match] = split[-1]
IndexError: list index out of range
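The traceback shows the crash originates inside NLTK's Punkt tokenizer, not in pre_gcm.py itself. One defensive workaround, if dropping the offending sentences is acceptable, is to guard the parse call from line 174 of the traceback (safe_parse is a hypothetical helper; berkeley_parser and target_s are the objects named in the traceback):

def safe_parse(parser, sentence):
    # Wrap the parse so one bad sentence does not abort the whole 300k-line run.
    try:
        return "(ROOT " + " ".join(str(parser.parse(sentence)).split()) + ")\n"
    except Exception as e:
        print(f"Skipping unparseable sentence: {e}")
        return None

output = [p for p in (safe_parse(berkeley_parser, s) for s in target_s) if p is not None]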

Expected format of rcm_lang_tagged.txt

Hi,
I am exploring this project and would like to use SPF as the sampling method. I found that the rcm_lang_tagged.txt file is not present in the project. I understand that this file should contain a real code-mixed dataset for the language pair, and I would like to know its expected format. Please provide a sample rcm_lang_tagged.txt file for Hindi-English code-mixed data.
Thanks
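For illustration, language-ID-tagged corpora often list one token and its language tag per line; whether this tool expects that layout is exactly what this issue asks, so the sample below is purely hypothetical and not taken from the repository:

# hypothetical rcm_lang_tagged.txt layout: token<TAB>language-tag
mujhe	HI
ye	HI
song	EN
bahut	HI
pasand	HI
hai	HI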
