Giter Club home page Giter Club logo

topicans's Introduction

TopicAns: Topic-Informed Architecture for Answer Recommendation on Technical Q&A Site

TopicAns aims at learning high-quality representations for the posts in Q&A sites with a neural topic model and a pre-trained model.

This involves three main steps: (1) generating topic-aware representations of Q&A posts with the neural topic model, (2) incorporating the corpus-level knowledge from the neural topic model to enhance the deep representations generated by the pre-trained language model, and (3) determining the most suitable answer for a given query based on the topic-aware representation and the deep representation.

The Architecture of TopicAns

Requirement

Install all required python dependencies:

pip install -r requirements.txt

Dataset Construction and Pre-process

Our datasets are released at https://drive.google.com/drive/folders/1Pf8zyJGdufi_14qeAigPI1yyOkiL881J?usp=share_link.

Following is how we get them:

  1. Download stackoverflow.com-Posts.7z from https://archive.org/details/stackexchange

  2. Run dataset_scripts/split_qa.py to get:

    1. q_temp_output.csv: ["Id", "PostTypeId", "Score", "AnswerCount", "AcceptedAnswerId", "Title", "Body", "Tags"]
    2. a_temp_output.csv: ["Id", "PostTypeId", "Score", "ParentId", "Body"]
    3. python_question_info
  3. Run dataset_scripts/extract_python_qa.ipynb to get python_qa (This file can also be used to get java_qa pairs!). This file does:

    1. Get python_question_info: (q_id, 1, accepted_a_id, other_a_ids...)
    2. Get python_q_id_content.csv, python_a_id_content.csv
    3. Then process the above two files
    4. delete some trash questions from python_question_info, and save the new one as python_qa_info.dataset. Please read this file by datasets (huggingface).
  4. Then we need to construct candidate answer pools following Gao et al. by construct_candidate_pools.py.

  5. Finally, we combine all together and get our datasets by combine_py_related_question.ipynb

Instructions

Training

python -u main.py -d so_python -n 0 \
    --model_class QATopicMemoryModel \
    --model_save_prefix top10_topic25_pooler_ \
    --pretrained_bert_path huggingface/CodeBERTa-small-v1 \
    --composition pooler \
    --train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --text_max_len 512 \
    --latent_dim 25 \
    --two_stage \
    --no_initial_test

The best model is saved at ./model/ and the lastest model is saved at ./last_model/. If training is broken, you can add --restore at the last without other changing. We have supported logging, so you can choose append | tee this_is_log.log to the command.

Inference

python -u main.py -d so_python -n 0 \
    --model_class QATopicMemoryModel  \
    --model_save_prefix top10_topic25_pooler_ \
    --pretrained_bert_path huggingface/CodeBERTa-small-v1 \
    --composition pooler \
    --train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --text_max_len 512 \
    --latent_dim 25 \
    --two_stage \
    --no_initial_test \
    --measure_time \
    --no_train

Citation

If you find this project useful in your research, please consider cite:

@article{10.1145/3607189,
    author = {Yang, Yuanhang and He, Wei and Gao, Cuiyun and Xu, Zenglin and Xia, Xin and Liu, Chuanyi},
    title = {TopicAns: Topic-Informed Architecture for Answer Recommendation on Technical Q&A Site},
    year = {2023},
    url = {https://doi.org/10.1145/3607189},
    journal = {ACM Trans. Softw. Eng. Methodol.},
}

License

This project is licensed under the MIT License.

topicans's People

Contributors

ysngki avatar hewei2001 avatar

Stargazers

 avatar Dawn avatar WHMU avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.