bioreddit's Introduction

BioReddit Embeddings

This repository contains word embeddings trained on medical subreddits. We provide embeddings for GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), and Flair (Akbik et al., 2018).

The embeddings are trained on ~800,000 Reddit posts from over 60 medical-themed communities. We describe the training and evaluation process of the embeddings in Basaldella and Collier, BioReddit: Word Embeddings for User-Generated Biomedical NLP, presented at the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), co-located with the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019).

Embeddings

You can download the embeddings from the release section of this repository or through the links in the table below:

Embedding    Download Link
ELMo         options, weights
Flair        forward, backward
GloVe 50     txt, bin
GloVe 100    txt, bin
GloVe 200    txt, bin
FastText     See COMETA
BERT         See COMETA
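
As a rough illustration of how the GloVe txt downloads might be consumed, the sketch below parses vectors and compares two words. It assumes the txt files follow the usual GloVe text layout (one token per line, followed by space-separated floats), which this README does not state explicitly. The function names `load_glove_txt` and `cosine`, the file name in the comment, and the sample vectors are all hypothetical and used only for illustration.

```python
import io
import math


def load_glove_txt(handle):
    """Parse GloVe-style text: a token followed by its space-separated vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up three-dimensional sample standing in for a downloaded file.
# With a real download you would instead do something like (hypothetical name):
#   with open("bioreddit_glove_50.txt", encoding="utf-8") as f:
#       embeddings = load_glove_txt(f)
sample = io.StringIO("headache 0.1 0.2 0.3\nmigraine 0.1 0.2 0.25\n")
embeddings = load_glove_txt(sample)
print(cosine(embeddings["headache"], embeddings["migraine"]))
```

A plain dictionary keeps the sketch dependency-free; for the full vocabulary a NumPy array or a dedicated library would likely be more memory-efficient.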

Code

You can find the code used to download the subreddits here.

bioreddit's Issues

Textual data included

Hi, I am reading the papers on BioReddit and had a quick question - it was mentioned:

300 million tokens and a vocabulary size of 780,000 words

But I can't seem to find anything regarding the average word count or count of Reddit comments/posts included in the model.

Any information on this would be greatly appreciated! Thanks!
