keywords2vec

A simple and fast way to generate a word2vec model, with multi-word keywords instead of single words.

Example result

Finding similar keywords for "obesity"

index term
0 overweight
1 obese
2 physical inactivity
3 excess weight
4 obese adults
5 high bmi
6 obese adults
7 obese people
8 obesity-related outcomes
9 obesity among children
10 poor sleep quality
11 ssbs
12 obese populations
13 cardiometabolic risk
14 abdominal obesity

Install

pip install keywords2vec

How to use

Let's download some example data

data_filepath = "epistemonikos_data_sample.tsv.gz"

!wget "https://s3.amazonaws.com/episte-labs/epistemonikos_data_sample.tsv.gz" -O "{data_filepath}"

Import

from keywords2vec.main import similars_tree, get_similars

We create the model.

labels, tree = similars_tree(data_filepath)

For more information, take a look here

Then we can get the most similar keywords

get_similars(tree, labels, "obesity")
['obesity',
 'overweight',
 'obese',
 'physical inactivity',
 'excess weight',
 'high bmi',
 'obese adults',
 'obese people',
 'obesity-related outcomes',
 'obesity among children',
 'poor sleep quality',
 'ssbs',
 'obese populations',
 'cardiometabolic risk',
 'abdominal obesity']
get_similars(tree, labels, "heart failure")
['heart failure',
 'hf',
 'chf',
 'chronic heart failure',
 'reduced ejection fraction',
 'unstable angina',
 'peripheral vascular disease',
 'peripheral arterial disease',
 'angina',
 'congestive heart failure',
 'left ventricular systolic dysfunction',
 'acute coronary syndrome',
 'heart failure patients',
 'acute myocardial infarction',
 'left ventricular dysfunction']

Motivation

The idea started in the Epistemonikos database (www.epistemonikos.org), a database of scientific articles for people making decisions concerning clinical or health-policy questions. In this context, the scientific/health language used is complex. You can easily find keywords like:

  • asthma
  • heart failure
  • medial compartment knee osteoarthritis
  • preserved left ventricular systolic function
  • non-selective non-steroidal anti-inflammatory drugs

We tried several approaches to find those keywords, such as n-grams, n-grams + tf-idf, and entity identification, among others, but we didn't get really good results.

Our approach

We found that tokenizing using stopwords + non-word characters was really useful for "finding" the keywords. An example:

  • input: "Timing of replacement therapy for acute renal failure after cardiac surgery"
  • output: [ "timing", "replacement therapy", "acute renal failure", "cardiac surgery" ]

So we basically split the text when we find:

  • a stopword
  • a non-word character (/ , ! ? . etc.), except - and '

That's it.
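
The splitting rule above can be sketched in a few lines of Python. This is only an illustration, not the library's actual code: the function name `tokenize_keywords` and the tiny `STOPWORDS` set are assumptions made for the example, and a real stopword list would be much larger.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch;
# a real stopword list would be much larger).
STOPWORDS = {"of", "for", "after", "the", "a", "an", "in", "on", "and", "or", "with", "to"}

# Any run of characters that is not a word character, whitespace, hyphen or apostrophe.
NON_WORD = re.compile(r"[^\w\s\-']+")


def tokenize_keywords(text, stopwords=STOPWORDS):
    """Split text into multi-word keywords at stopwords and non-word characters."""
    keywords = []
    # First cut the lowercased text at non-word characters...
    for fragment in NON_WORD.split(text.lower()):
        current = []
        # ...then cut each fragment at every stopword.
        for word in fragment.split():
            if word in stopwords:
                if current:
                    keywords.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            keywords.append(" ".join(current))
    return keywords


title = "Timing of replacement therapy for acute renal failure after cardiac surgery"
print(tokenize_keywords(title))
# ['timing', 'replacement therapy', 'acute renal failure', 'cardiac surgery']
```

Hyphens and apostrophes are kept inside keywords, so a term like "high-bmi patients" stays in one piece.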

However, this approach has problems with keywords that contain stopwords, like:

  • Vitamin A
  • Hepatitis A
  • Web of Science

So we decided to add another method (nltk with some grammar definition) to cover most of these cases. To use it, add the parameter keywords_w_stopwords=True. This method is approximately 20x slower.

References

This seems to be an old idea (2004):

Mihalcea, Rada, and Paul Tarau. "Textrank: Bringing order into text." Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.

Reading an implementation of TextRank, I realized they used stopwords to separate the text and create the graph. Then I thought of using it as a tokenizer for word2vec.

As pointed out by @deliprao in this Twitter thread, it is also used by RAKE (2010):

Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. 10.1002/9780470689646.ch1.

As noted by @astent in the Twitter thread, this concept is called chinking (chunking by exclusion) https://www.nltk.org/book/ch07.html#Chinking

Multi-lingual

We worked on an implementation that could be used in multiple languages. Of course, not all languages are suitable for this approach. We have tried it with good results in English, Spanish and Portuguese.

Try it online

You can try it here (takes time to load, lowercase only, doesn't work on mobile yet). MVP :)

These embeddings were created using 827,341 titles/abstracts from the @epistemonikos database, with keywords that repeat at least 10 times. The total vocab is 349,080 keywords (a really manageable number).

Vocab size

One of the main benefits of this method is the size of the vocabulary. For example, using keywords that repeat at least 10 times in the Epistemonikos dataset (827,341 titles/abstracts), we got the following vocab sizes:

ngrams keywords comp
1 127,824 36%
1,2 1,360,550 388%
1-3 3,204,099 914%
1-4 4,461,930 1,272%
1-5 5,133,619 1,464%
stopword tokenizer 350,529 100%
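
The "comp" column appears to be each vocabulary size expressed as a percentage of the stopword tokenizer's vocabulary. A small sketch of that arithmetic follows, with the counts copied from the table above. The helper name `relative_size` is made up for this example, and truncating to a whole percent is an assumption that happens to reproduce the table's figures.

```python
# Vocabulary sizes copied from the table above.
STOPWORD_TOKENIZER_VOCAB = 350_529
NGRAM_VOCAB = {
    "1": 127_824,
    "1,2": 1_360_550,
    "1-3": 3_204_099,
    "1-4": 4_461_930,
    "1-5": 5_133_619,
}


def relative_size(vocab, baseline=STOPWORD_TOKENIZER_VOCAB):
    """Vocabulary size as a whole percentage of the baseline (truncated, not rounded)."""
    return 100 * vocab // baseline


for ngrams, vocab in NGRAM_VOCAB.items():
    print(f"{ngrams:>4}  {vocab:>10,}  {relative_size(vocab):,}%")
```

Read this way, using 1-grams to 5-grams yields a vocabulary more than 14 times the size of the one produced by the stopword tokenizer.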

For more information regarding the comparison, take a look at the analyze folder.

Credits

This project has been created using nbdev

keywords2vec's People

Contributors

dependabot[bot], dperezrada, slel

keywords2vec's Issues

how to run the try_vectors

python try_keywords2vec.py -v data/experiments/episte_30000/word2vec.vec -n episte_30000 -p "obesity"

There needs to be a space between.

I also had to adjust the movement of data.

I had copied the data to the top directory. This works great for the model building. Then I moved the resultant files into the nlp/keywords/data location in order to be able to run the above command successfully.

The data directory gets created in the nlp/keywords/data location.

Produces this.
python try_keywords2vec.py -v data/experiments/episte_30000/word2vec.vec -n episte_30000 -p "obesity"
term score counter
0 overweight 0.693945 297
1 obese 0.617680 152
2 obese individuals 0.614914 15
3 obese adults 0.606311 28
4 insulin resistance 0.589740 85
5 obese people 0.573590 13
6 glucose tolerance 0.570195 6
7 bariatric surgery 0.554315 83
8 obese subjects 0.552010 13
9 dyslipidemia 0.549108 28
10 weight loss 0.546550 364
11 dyslipidaemia 0.542751 18
12 metabolic syndrome 0.541280 85
13 intragastric balloon 0.536888 5
14 underweight 0.523899 14
15 body fat distribution 0.520858 3
16 childhood overweight 0.519322 10
17 dental caries 0.518739 61
18 dietary 0.518213 61
19 cardiovascular diseases 0.517509 100
20 metabolic disease 0.516839 6
21 white caucasians 0.510198 4
22 bmi 0.508312 247
23 childhood-onset type 0.508124 4
24 lifestyle management 0.501618 5
