spacycake's Introduction

spacycaKE: Keyphrase Extraction for spaCy

A spaCy v2.0 extension and pipeline component that adds keyphrase extraction metadata to Doc objects.

Installation

spacycaKE requires spaCy v2.0.0 or higher and spacybert v1.0.0 or higher.

Usage

import spacy
from spacycake import BertKeyphraseExtraction as bake
nlp = spacy.load('en')

Then add bake to the spaCy pipeline:

cake = bake(nlp, from_pretrained='bert-base-cased', top_k=3)
nlp.add_pipe(cake, last=True)

Extract the keyphrases.

doc = nlp("This is a test but obviously you need to place a bigger document here to extract meaningful keyphrases")
print(doc._.extracted_phrases)  # <-- List of 3 keyphrases

Available attributes

The extension sets attributes on the Doc object. You can change the attribute names when initializing the extension.

| name | type | description |
| --- | --- | --- |
| Doc._.bert_repr | torch.Tensor | Document BERT embedding |
| Doc._.noun_phrases | List[str] | List of the candidate phrases from the document |
| Doc._.extracted_phrases | List[str] | List of the final extracted keyphrases |

Settings

On initialization of bake, you can define the following:

| name | type | default | description |
| --- | --- | --- | --- |
| nlp | spacy.lang.(...) | - | Only used to get the language vocabulary to initialize the phrase matcher |
| from_pretrained | str | None | Path to a BERT model directory or name of HuggingFace transformers pre-trained BERT weights, e.g., bert-base-cased |
| attr_names | Tuple[str] | ('bert_repr', 'noun_phrases', 'extracted_phrases') | Names of the attributes set on the ._ property (in order) |
| force_extension | bool | True | Whether to create the same extension attributes again if the component is initialized more than once |
| top_k | int | 5 | Maximum number of extracted phrases |
| mmr_lambda | float | 0.5 | Maximum Marginal Relevance lambda parameter. Controls the diversity of the extracted keyphrases: the closer to 1, the more diverse the results; the closer to 0, the more similar the extracted phrases are to the source document. |
| kws | kwargs | - | Additional keyword arguments passed to spacybert.BertInference() |
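To make the role of top_k and mmr_lambda more concrete, here is a minimal pure-Python sketch of greedy Maximum Marginal Relevance selection. It is not taken from the spacycaKE source: the function names (cosine, mmr_select), the toy vectors, and the exact weighting of the two terms are assumptions chosen to match the description above, where values closer to 1 favor diversity. Textbook formulations of MMR often weight the terms the other way round.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    doc_vec: Sequence[float],
    phrase_vecs: Sequence[Sequence[float]],
    top_k: int = 5,
    mmr_lambda: float = 0.5,
) -> List[int]:
    """Greedily pick up to top_k phrase indices (hypothetical helper).

    Assumed scoring, following the settings description rather than the
    library source:
        score = (1 - mmr_lambda) * relevance - mmr_lambda * redundancy
    """
    remaining = list(range(len(phrase_vecs)))
    selected: List[int] = []

    def score(i: int) -> float:
        relevance = cosine(doc_vec, phrase_vecs[i])
        # default=0.0 covers the first pick, when nothing is selected yet.
        redundancy = max(
            (cosine(phrase_vecs[i], phrase_vecs[j]) for j in selected),
            default=0.0,
        )
        return (1 - mmr_lambda) * relevance - mmr_lambda * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: phrases 0 and 1 are near-duplicates close to the document,
# phrase 2 points in a different direction.
doc_vec = [1.0, 0.0]
phrase_vecs = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0]]

relevant_picks = mmr_select(doc_vec, phrase_vecs, top_k=2, mmr_lambda=0.1)
diverse_picks = mmr_select(doc_vec, phrase_vecs, top_k=2, mmr_lambda=0.9)
```

With a low mmr_lambda the second pick is the near-duplicate of the first one, because it is still close to the document; with a high mmr_lambda the second pick is the phrase that differs most from what was already selected.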

Roadmap

This extension is still experimental. Possible future updates include:

  • Adding other keyphrase extraction methods.

spacycake's People

Contributors

surajiyer

spacycake's Issues

CUDA error

Hi, I have tried to run the simple_keyphrase_extraction.ipynb notebook using a GPU. It works to some degree, but I think there is a problem with the memory handling. I get errors like:

RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at C:/cb/pytorch_1000000000000/work/aten/src/THC/THCTensorSort.cu:62

and:

RuntimeError: CUDA error: an illegal memory access was encountered

I am running Python 3.8.5 and PyTorch 1.6.

Error when trying to run on text

Hi, when I try to run the example code I get:

File "/usr/local/lib/python3.8/site-packages/spacy/language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/local/lib/python3.8/site-packages/spacycake/__init__.py", line 105, in __call__
    second_part = torch.matmul(
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

Not sure what this error means. I have checked PyTorch and spaCy and they are the correct versions.

error generated running the sample in your README file

I attempted to run your simple example from the project readme file, and it generated an error during the pipeline processing.

import spacy
from spacycake import BertKeyphraseExtraction as bake
nlp = spacy.load('en')
cake = bake(nlp, from_pretrained='bert-base-cased', top_k=3)
nlp.add_pipe(cake, last=True)
doc = nlp("This is a test but obviously you need to place a bigger document here to extract meaningful keyphrases")
print(doc._.extracted_phrases)  # <-- List of 3 keyphrases

Generated the error:


RuntimeError                              Traceback (most recent call last)
<ipython-input-3-a85540c14de5> in <module>
----> 1 doc = nlp("This is a test but obviously you need to place a bigger document here to extract meaningful keyphrases")
      2 print(doc._.extracted_phrases)  # <-- List of 3 keyphrases

/anaconda2/envs/hr-analysis/lib/python3.8/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    447             if not hasattr(proc, "__call__"):
    448                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 449             doc = proc(doc, **component_cfg.get(name, {}))
    450             if doc is None:
    451                 raise ValueError(Errors.E005.format(name=name))

/anaconda2/envs/hr-analysis/lib/python3.8/site-packages/spacycake/__init__.py in __call__(self, doc)
    103         while len(R) > 0:
    104             first_part = torch.matmul(doc_embedding, phrases_embeddings[R].transpose(0, 1))
--> 105             second_part = torch.matmul(
    106                 phrases_embeddings[R],
    107                 phrases_embeddings[S].transpose(0, 1)).max(dim=1).values

RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

This is the Python environment:

Running python 3.8.5
spaCy 2.3.2
spacybert 1.0.1
spacycaKE 1.0.0
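One possible reading of the traceback, which the thread does not confirm, is that the maximum is taken over a dimension with no elements, for example when the set of already-selected phrases is still empty. The pure-Python sketch below reproduces that kind of failure without PyTorch and shows one way to guard against it. The helper name row_max and the fallback value 0.0 are assumptions for illustration, not part of spacycaKE or spacybert.

```python
from typing import List


def row_max(matrix: List[List[float]], empty_value: float = 0.0) -> List[float]:
    """Row-wise maximum that tolerates rows with no columns.

    empty_value is an assumed fallback for empty rows; the library does
    not define one.
    """
    return [max(row, default=empty_value) for row in matrix]


# Three candidate phrases compared against zero already-selected phrases:
# every row of the similarity matrix is empty.
similarities: List[List[float]] = [[], [], []]

try:
    [max(row) for row in similarities]  # unguarded reduction
    unguarded_failed = False
except ValueError:
    # max() has no identity element, so an empty input is an error.
    unguarded_failed = True

guarded = row_max(similarities)
```

Whether 0.0 is the right fallback depends on how the score is used afterwards; here it simply means "no redundancy with anything selected so far".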
