
stackoverflowner's Introduction

Dataset and Model for Fine-grained Software Entity Extraction

This repository contains all the code and data from the paper Code and Named Entity Recognition in StackOverflow (ACL 2020). [Paper PDF]

For the source code of our NER tagger, check the code/NER/ folder.

For our annotated data with software-domain named entities, check the resources/annotated_ner_data/ folder.

To cite the data or the code included in this repository, please use the following bibtex entry:

  @inproceedings{Tabassum20acl,
      title = {Code and Named Entity Recognition in StackOverflow},
      author = {Tabassum, Jeniya and Maddela, Mounica and Xu, Wei and Ritter, Alan},
      booktitle = {The Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2020}
  }

stackoverflowner's People

Contributors

jeniyat


stackoverflowner's Issues

tokenization never stops

The tokenization never stops for some specific instances, for example:
"it('should remove the elements domProps'), () => {"
It may be caused by catastrophic backtracking in the Func_Name_Recursive regular expression.
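
Until the pattern is fixed, a possible workaround (a sketch of mine, not from the repo: only the name Func_Name_Recursive comes from the code, and the pattern below is a hypothetical stand-in) is to run the match under a timeout with the third-party regex module, which can abort a backtracking search:

# Timeout guard for a backtracking-prone pattern, using the third-party
# `regex` module (pip install regex). The pattern here is a hypothetical
# stand-in, NOT the repo's actual Func_Name_Recursive.
import regex

func_name_like = regex.compile(r"(\w+(\(\w*\))?)*\(")

line = "it('should remove the elements domProps'), () => {"
try:
    match = func_name_like.search(line, timeout=0.5)  # give up after 0.5 s
    print(match)
except TimeoutError:
    print("regex timed out; skipping this token")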

KeyError in /code/BERT_NER/utils_fine_tune/labels_seg.txt

Hi,

I'm trying to run E2E_SoftNER.py. I think I have been able to resolve the paths to most of the models and files associated with the repo; however, I'm getting an error. Here's the traceback:

Exception has occurred: KeyError
8
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/softner_segmenter_preditct_from_file.py", line 298, in evaluate
    preds_list[i].append(label_map[preds[i][j]])
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/softner_segmenter_preditct_from_file.py", line 638, in predict_segments
    result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="", path=input_file)
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/E2E_SoftNER.py", line 186, in Extract_NER
    softner_segmenter_preditct_from_file.predict_segments(segmenter_input_file, segmenter_output_file)
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/E2E_SoftNER.py", line 206, in <module>
    Extract_NER(input_file)

It looks like there might be something off with what this code expects for the format of './utils_fine_tune/labels_seg.txt'. Looking at label_map here, it is just a dictionary that doesn't have a key for 8:

> label_map
{0: 'B-Name', 1: 'O', 2: 'CTC_PRED:0', 3: 'CTC_PRED:1', 4: 'md_label:O', 5: 'md_label:Name'}

whereas preds here is an array containing label ids well beyond that range, some as high as 13:

> preds
array([[ 0,  8, 13, ..., 10,  1,  0],
       [ 4, 13,  1, ...,  7,  3,  9],
       [ 9,  2,  0, ...,  9,  1,  9],
       ...,
       [ 0,  2, 13, ...,  0, 12,  0],
       [ 4,  2,  5, ..., 10,  5,  1],
       [ 4,  2,  6, ...,  9,  9,  9]])

Everything in the utils_fine_tune directory came from the Mega link you provided, so it is possible that there was some issue with either the archive or the data.

If you find the time to take a look at this issue, thanks very much for contributing this code to the community. Please let me know if there is anything else I can provide to help debug or further understand the issue. Hopefully it's just a misunderstanding on my end.
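
For anyone hitting the same thing, a hedged sanity check (assuming label_map is built by enumerating the lines of labels_seg.txt, which is what the traceback suggests) is to compare the file's label count with the ids the model emits:

# Rebuild label_map the way the traceback implies and probe the failing id.
labels_path = "./utils_fine_tune/labels_seg.txt"
with open(labels_path) as f:
    labels = [line.strip() for line in f if line.strip()]
label_map = {i: label for i, label in enumerate(labels)}

print(len(label_map), "labels defined in labels_seg.txt")
# The model predicted id 8; with only 6 labels this reproduces the KeyError:
print(label_map.get(8, "no label for id 8 -> labels file too short for this checkpoint"))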

Crash from an unknown key of the BERT NER Model

(screenshot of the exception)

We're encountering the exception shown in the screenshot above. Could anyone tell us how to fix it, please?

We suspect the cause is wrong versions of the two data files referenced below. There are two lines like

parameters_ctc['train_file']="/data/jeniya/STACKOVERFLOW_DATA/CTC/data/train_updated.tsv"
parameters_ctc['test_file']="/data/jeniya/STACKOVERFLOW_DATA/CTC/data/test_updated.tsv"

in the file "code/BERT_NET/utils_ctc/config_ctc.py", however these two updates files cannot be downloaded anywhere.

If we substitute them with "train.tsv" and "test_v2.tsv" from the "data_ctc" folder (which can be downloaded from the Google Drive URL in the README), the above exception is thrown.

Unable to download data

Hello! First of all, amazing work. I'm looking to play around with the BERT NER model, and following the prerequisite steps I tried downloading data_ctc.zip using Mega. However, I wasn't able to fully download it because I had exceeded my download quota. Is there an alternative way of hosting the data, like a google drive link? Or, is there a way of accessing a pretrained model that we can use directly for predictions on new data? Thank you in advance!

'bert-word-piece-softner/' not found in model shortcut name list

Hi there,

I'm trying to run the BERT_NER model but have a problem with the tokenizer being unable to find bert-word-piece-softner/, although I've already unzipped the attached file into the fine-tune folder. The added_tokens.json file also seems to be missing from the archive. Could you kindly help me with this?

Thanks!
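
In case it helps anyone with the same error, here is a minimal sketch (my assumption: the message means transformers is treating the string as a hub shortcut name because the local path cannot be found). Passing an explicit path to the unzipped folder should bypass the shortcut list, and added_tokens.json is optional for loading:

from transformers import AutoTokenizer

# "./bert-word-piece-softner/" is assumed to be the unzipped folder,
# relative to the working directory; adjust to wherever it actually lives.
tokenizer = AutoTokenizer.from_pretrained("./bert-word-piece-softner/")
print(tokenizer.tokenize("ArrayList<String> list = new ArrayList<>();"))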


Running NER on custom data

Hi, I am trying to run the NER on custom data, say any normal StackOverflow question. How can I use the model that I trained with this code to tag the words in the body of a question with the various custom classes? Any help would be really appreciated.
Thanks
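
Not an official answer, but one hedged way to do this with the transformers pipeline API, assuming the fine-tuned checkpoint was saved with save_pretrained (the path below is a placeholder):

from transformers import pipeline

# "path/to/finetuned_softner" is a placeholder for your saved checkpoint.
ner = pipeline("token-classification",
               model="path/to/finetuned_softner",
               aggregation_strategy="simple")

question_body = "How do I convert a pandas DataFrame to a numpy array?"
for entity in ner(question_body):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))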

Cannot download utils_fine_tune.tar.gz from the Google Drive link

Hi @jeniyat, thanks for your generous sharing.
In #3 you gave us a Drive link to download the resources, but when I open it I find that I cannot download utils_fine_tune.tar.gz because of the error "The download file will exceed the limit, so it cannot be downloaded at this time".
Could you please give me another share link? Thanks.
My email address is [email protected]. Thank you very much.

BERT pretraining details

Thanks for sharing your great work.

Some quick questions about the BERT pretraining:

  • What is the max_seq_length (e.g. 128 tokens) during pretraining?
  • How many training steps / examples used during pretraining?
  • How did you decide 64,000, the size of WordPiece vocabulary?
  • Have you tried continued pretraining from bert-base using the unlabeled data (152 million sentences from StackOverflow)?

Thank you,
Naoto

Question about "word_to_id.json"

Hi @jeniyat, when I followed the README, I first encountered this problem: the "word_to_id.json" file doesn't exist. I don't know whether this file is auto-generated or should be put in place in advance. Looking into it, I noticed that E2E_SoftNER.py contains the line ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features = train_ctc_model(train_file, test_file), where a word_to_id mapping is generated but never saved as a JSON file. I then tried to dump this mapping to JSON, but found it has no [CLS], [SEP], [UNK], or ***PADDING*** entries, even though utils_seg.py expects them via word_id_pad = word_to_id["***PADDING***"].
This is just my attempt, and I may have done something wrong. I have read the source code and still cannot find where word_to_id.json is generated.
Could you help me with this? Thank you very much.
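
A hedged workaround sketch (my own guess, not the repo's documented flow): after train_ctc_model returns, append the special entries that utils_seg.py expects and dump the mapping to word_to_id.json yourself:

import json

# word_to_id would come from:
#   ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features = \
#       train_ctc_model(train_file, test_file)
word_to_id = {"hello": 0, "world": 1}  # stand-in for the trained vocabulary

# Append the special tokens utils_seg.py looks up, e.g. word_to_id["***PADDING***"].
for special in ("[CLS]", "[SEP]", "[UNK]", "***PADDING***"):
    if special not in word_to_id:
        word_to_id[special] = len(word_to_id)

with open("word_to_id.json", "w") as f:
    json.dump(word_to_id, f)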

The results are random

I downloaded the pretrained model you provided (https://drive.google.com/drive/folders/1iEEMr2DYofulK2F5pSErOPf5ggrEqtJt?usp=sharing) and loaded it as shown below. Why are the results random?

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

pretrained_model = './utils_fine_tune/word_piece_ner/'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = AutoModelForTokenClassification.from_pretrained(pretrained_model)
model.eval()  # ensure eval mode (no dropout)

sequence = " Is there any mechanism to enforce a singleton policy without having to make the derived class' constructors private manually?"
print(tokenizer.decode(tokenizer.encode(sequence)))

# Re-tokenize the decoded string so the token list lines up with the model
# inputs, including the [CLS]/[SEP] special tokens.
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

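
One hedged way to check a likely cause (an assumption of mine, not confirmed by the authors): if the downloaded folder contains only the pretrained encoder and no fine-tuned classification head, transformers initializes the head randomly, which would produce different labels on every fresh load. The loading info exposes this:

from transformers import AutoModelForTokenClassification

model, info = AutoModelForTokenClassification.from_pretrained(
    "./utils_fine_tune/word_piece_ner/", output_loading_info=True)

# A non-empty missing_keys list that mentions "classifier" means the head
# was newly (randomly) initialized rather than loaded from the checkpoint.
print(info["missing_keys"])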
