
stackoverflowner's Introduction

Dataset and Model for Fine-grained Software Entity Extraction

This repository contains all the code and data from the paper Code and Named Entity Recognition in StackOverflow (ACL 2020). [Paper PDF]

For the source code of our NER tagger, check the code/NER/ folder.

For our annotated data with software-domain named entities, check the resources/annotated_ner_data/ folder.

To cite the data or the code included in this repository, please use the following bibtex entry:

  @inproceedings{Tabassum20acl,
      title = {Code and Named Entity Recognition in StackOverflow},
      author = {Tabassum, Jeniya and Maddela, Mounica and Xu, Wei and Ritter, Alan},
      booktitle = {The Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2020}
  }

stackoverflowner's People

Contributors

jeniyat


stackoverflowner's Issues

tokenization never stops

The tokenization never stops for some specific instances, for example:
"it('should remove the elements domProps'), () => {"
It may be caused by catastrophic backtracking in the Func_Name_Recursive regular expression.
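
Until the pattern is fixed, a possible workaround (a sketch of mine, not from the repo: only the name Func_Name_Recursive comes from the code, and the pattern below is a hypothetical stand-in) is to run the match under a timeout with the third-party regex module, which can abort a backtracking search:

# Timeout guard for a backtracking-prone pattern, using the third-party
# `regex` module (pip install regex). The pattern here is a hypothetical
# stand-in, NOT the repo's actual Func_Name_Recursive.
import regex

func_name_like = regex.compile(r"(\w+(\(\w*\))?)*\(")

line = "it('should remove the elements domProps'), () => {"
try:
    match = func_name_like.search(line, timeout=0.5)  # give up after 0.5 s
    print(match)
except TimeoutError:
    print("regex timed out; skipping this token")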

KeyError in /code/BERT_NER/utils_fine_tune/labels_seg.txt

Hi,

I'm trying to run E2E_SoftNER.py. I think I have been able to resolve the paths to most of the models and files associated with the repo; however, I'm getting an error. Here's the traceback:

Exception has occurred: KeyError
8
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/softner_segmenter_preditct_from_file.py", line 298, in evaluate
    preds_list[i].append(label_map[preds[i][j]])
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/softner_segmenter_preditct_from_file.py", line 638, in predict_segments
    result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="", path=input_file)
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/E2E_SoftNER.py", line 186, in Extract_NER
    softner_segmenter_preditct_from_file.predict_segments(segmenter_input_file, segmenter_output_file)
  File "/Users/clemente/src/python/github/StackOverflowNER/code/BERT_NER/E2E_SoftNER.py", line 206, in <module>
    Extract_NER(input_file)

It looks like there might be something off with what this code expects for the format of './utils_fine_tune/labels_seg.txt'. Looking at label_map here, it is just a dictionary that doesn't have a key for 8:

> label_map
{0: 'B-Name', 1: 'O', 2: 'CTC_PRED:0', 3: 'CTC_PRED:1', 4: 'md_label:O', 5: 'md_label:Name'}

whereas preds here is an array containing label ids well beyond that range, some as high as 13:

> preds
array([[ 0,  8, 13, ..., 10,  1,  0],
       [ 4, 13,  1, ...,  7,  3,  9],
       [ 9,  2,  0, ...,  9,  1,  9],
       ...,
       [ 0,  2, 13, ...,  0, 12,  0],
       [ 4,  2,  5, ..., 10,  5,  1],
       [ 4,  2,  6, ...,  9,  9,  9]])

Everything in the utils_fine_tune directory came from the Mega link you provided, so it is possible that there was some issue with either the archive or the data.

If you find the time to take a look at this issue, thanks very much for contributing this code to the community. Please let me know if there is anything else I can provide to help debug or further understand the issue. Hopefully it's just a misunderstanding on my end.
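
For anyone hitting the same thing, a hedged sanity check (assuming label_map is built by enumerating the lines of labels_seg.txt, which is what the traceback suggests) is to compare the file's label count with the ids the model emits:

# Rebuild label_map the way the traceback implies and probe the failing id.
labels_path = "./utils_fine_tune/labels_seg.txt"
with open(labels_path) as f:
    labels = [line.strip() for line in f if line.strip()]
label_map = {i: label for i, label in enumerate(labels)}

print(len(label_map), "labels defined in labels_seg.txt")
# The model predicted id 8; with only 6 labels this reproduces the KeyError:
print(label_map.get(8, "no label for id 8 -> labels file too short for this checkpoint"))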

Crash from an unknown key of the BERT NER Model

(screenshot of the exception)

We're encountering the exception shown in the screenshot above. Could anyone tell us how to fix it, please?

We suspect the cause is wrong versions of the two data files referenced below. There are two lines like

parameters_ctc['train_file']="/data/jeniya/STACKOVERFLOW_DATA/CTC/data/train_updated.tsv"
parameters_ctc['test_file']="/data/jeniya/STACKOVERFLOW_DATA/CTC/data/test_updated.tsv"

in the file "code/BERT_NET/utils_ctc/config_ctc.py", however these two updates files cannot be downloaded anywhere.

If we substitute them with "train.tsv" and "test_v2.tsv" from the "data_ctc" folder (which can be downloaded from the Google Drive URL in the README), the above exception is thrown.

Unable to download data

Hello! First of all, amazing work. I'm looking to play around with the BERT NER model, and following the prerequisite steps I tried downloading data_ctc.zip using Mega. However, I wasn't able to fully download it because I had exceeded my download quota. Is there an alternative way of hosting the data, like a google drive link? Or, is there a way of accessing a pretrained model that we can use directly for predictions on new data? Thank you in advance!

'bert-word-piece-softner/' not found in model shortcut name list

Hi there,

I'm trying to run the BERT_NER model but have a problem with the tokenizer being unable to find bert-word-piece-softner/, although I've already unzipped the attached file into the fine-tune folder. The added_tokens.json file also seems to be missing from the archive. Could you kindly help me with this?

Thanks!
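
In case it helps anyone with the same error, here is a minimal sketch (my assumption: the message means transformers is treating the string as a hub shortcut name because the local path cannot be found). Passing an explicit path to the unzipped folder should bypass the shortcut list, and added_tokens.json is optional for loading:

from transformers import AutoTokenizer

# "./bert-word-piece-softner/" is assumed to be the unzipped folder,
# relative to the working directory; adjust to wherever it actually lives.
tokenizer = AutoTokenizer.from_pretrained("./bert-word-piece-softner/")
print(tokenizer.tokenize("ArrayList<String> list = new ArrayList<>();"))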


Running NER on custom data

Hi, I am trying to run the NER on custom data, say any normal StackOverflow question. How can I use the model that I trained with this code to tag the words in the body of a question with the various custom classes? Any help would be really appreciated.
Thanks
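
Not an official answer, but one hedged way to do this with the transformers pipeline API, assuming the fine-tuned checkpoint was saved with save_pretrained (the path below is a placeholder):

from transformers import pipeline

# "path/to/finetuned_softner" is a placeholder for your saved checkpoint.
ner = pipeline("token-classification",
               model="path/to/finetuned_softner",
               aggregation_strategy="simple")

question_body = "How do I convert a pandas DataFrame to a numpy array?"
for entity in ner(question_body):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))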

Cannot download utils_fine_tune.tar.gz from the Google Drive link

Hi @jeniyat, thanks for your generous sharing.
In #3 you gave us a Drive link to download the resources, but when I open it I find that I cannot download utils_fine_tune.tar.gz because of the error "The download file will exceed the limit, so it cannot be downloaded at this time".
Could you please give me another share link? Thanks.
My email address is [email protected]. Thank you very much.

BERT pretraining details

Thanks for sharing your great work.

Some quick questions about the BERT pretraining:

  • What is the max_seq_length (e.g. 128 tokens) during pretraining?
  • How many training steps / examples used during pretraining?
  • How did you decide 64,000, the size of WordPiece vocabulary?
  • Have you tried continued pretraining from bert-base using the unlabeled data (152 million sentences from StackOverflow)?

Thank you,
Naoto

Question about "word_to_id.json"

Hi @jeniyat, when I followed the README, I first encountered this problem: the "word_to_id.json" file doesn't exist. I don't know whether this file is auto-generated or should be put in place in advance. Looking into it, I noticed that E2E_SoftNER.py contains the line ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features = train_ctc_model(train_file, test_file), where a word_to_id mapping is generated but never saved as a JSON file. I then tried to dump this mapping to JSON, but found it has no [CLS], [SEP], [UNK], or ***PADDING*** entries, even though utils_seg.py expects them via word_id_pad = word_to_id["***PADDING***"].
This is just my attempt, and I may have done something wrong. I have read the source code and still cannot find where word_to_id.json is generated.
Could you help me with this? Thank you very much.
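
A hedged workaround sketch (my own guess, not the repo's documented flow): after train_ctc_model returns, append the special entries that utils_seg.py expects and dump the mapping to word_to_id.json yourself:

import json

# word_to_id would come from:
#   ctc_classifier, vocab_size, word_to_id, id_to_word, word_to_vec, features = \
#       train_ctc_model(train_file, test_file)
word_to_id = {"hello": 0, "world": 1}  # stand-in for the trained vocabulary

# Append the special tokens utils_seg.py looks up, e.g. word_to_id["***PADDING***"].
for special in ("[CLS]", "[SEP]", "[UNK]", "***PADDING***"):
    if special not in word_to_id:
        word_to_id[special] = len(word_to_id)

with open("word_to_id.json", "w") as f:
    json.dump(word_to_id, f)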

The results are random

I downloaded the pretrained model you provided (https://drive.google.com/drive/folders/1iEEMr2DYofulK2F5pSErOPf5ggrEqtJt?usp=sharing) and loaded it as shown below. Why are the results random?

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

pretrained_model = './utils_fine_tune/word_piece_ner/'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = AutoModelForTokenClassification.from_pretrained(pretrained_model)
model.eval()  # ensure eval mode (no dropout)

sequence = " Is there any mechanism to enforce a singleton policy without having to make the derived class' constructors private manually?"
print(tokenizer.decode(tokenizer.encode(sequence)))

# Re-tokenize the decoded string so the token list lines up with the model
# inputs, including the [CLS]/[SEP] special tokens.
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

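
One hedged way to check a likely cause (an assumption of mine, not confirmed by the authors): if the downloaded folder contains only the pretrained encoder and no fine-tuned classification head, transformers initializes the head randomly, which would produce different labels on every fresh load. The loading info exposes this:

from transformers import AutoModelForTokenClassification

model, info = AutoModelForTokenClassification.from_pretrained(
    "./utils_fine_tune/word_piece_ner/", output_loading_info=True)

# A non-empty missing_keys list that mentions "classifier" means the head
# was newly (randomly) initialized rather than loaded from the checkpoint.
print(info["missing_keys"])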
