SpacyDotNet

SpacyDotNet is a .NET wrapper for spaCy, the popular natural language processing library.

Project scope and limitations

This project is not meant to be a complete and exhaustive implementation of all spaCy features and APIs. Although it should be enough for basic tasks, I think of it as a starting point for users who need to build a complex project using spaCy in .NET.

Most of the basic features in the spaCy 101 section of the docs are available. All Container classes are present (Doc, DocBin, Token, Span and Lexeme) with their basic properties and methods working, along with Vocab and StringStore in a limited form.

Furthermore, any developer should be able to add missing properties or classes in a straightforward manner.

Requirements

This project relies on Python.NET to interoperate with spaCy, which is written in Python/Cython.

It's been tested under Windows 10 and Ubuntu Linux 20.04, using the following environment:

  • .NET Core 3.1 / .NET Standard 2.1
  • spaCy 3.0.5
  • Python 3.8
  • Python.NET: Latest official NuGet: 3.0.0-preview2021-04-03

Furthermore, it might work under different conditions:

  • .NET Core 3.0 and 2.1 should be fine. .NET 5.0 is a major release that I haven't tried so far. I haven't tried .NET Framework either
  • It should work with spaCy 2.3.5 and any other spaCy version that changes only its minor/patch version number

The current version of Python.NET has been compiled against Python 3.8, so the virtual environment must be created with this version. In general, we should use the CPython version that Python.NET was compiled against.
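A quick way to confirm that an interpreter matches is to run a small check with the Python binary that will back the virtual environment. This is only a sketch: the `REQUIRED` constant and the `matches_required` helper are names made up for illustration, and the value (3, 8) is taken from the environment listed above.

```python
import sys

# Assumption: 3.8 is the CPython version Python.NET was compiled against,
# as listed in the tested environment above.
REQUIRED = (3, 8)


def matches_required(version_info=sys.version_info, required=REQUIRED):
    """Return True when major.minor equals the required CPython version."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:2])
    status = "OK" if matches_required() else "MISMATCH"
    print(f"Python {found}: {status}")
```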

Setup

1) Create a Python virtual environment and install spaCy

It's advisable to create a virtual environment to install spaCy. How this is done depends on the host system.

The official spaCy installation guide is fine, but keep in mind the Python 3.8 restriction.

To run the examples, we'll also need to install the corresponding language package (es_core_news_sm) as shown in the guide.
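As a rough sketch of this step, the helper below only builds the commands we would run, so they can be reviewed before anything is installed. The `setup_commands` function and its parameters are hypothetical names, and it assumes the script itself is started with a Python 3.8 interpreter. To pin spaCy, replace "spacy" with something like "spacy==3.0.5".

```python
import sys
from pathlib import Path


def setup_commands(venv_dir, model="es_core_news_sm",
                   windows=sys.platform.startswith("win")):
    """Return the setup commands as argv lists, without running them."""
    venv = Path(venv_dir)
    # A virtual environment keeps its interpreter in Scripts\ on Windows
    # and in bin/ on other systems.
    python = venv / ("Scripts/python.exe" if windows else "bin/python")
    return [
        # Create the environment with the interpreter running this script
        # (assumed to be Python 3.8).
        [sys.executable, "-m", "venv", str(venv)],
        # Install spaCy inside the new environment.
        [str(python), "-m", "pip", "install", "spacy"],
        # Download the language package used by the examples.
        [str(python), "-m", "spacy", "download", model],
    ]
```

Each entry can then be passed to `subprocess.run(command, check=True)` in order.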

2) Check for Python shared library

Python.NET makes use of Python as a shared library. Sadly, it seems that the shared library is not copied by recent versions of virtualenv, and it's not even distributed in some flavours of Linux/Python >= 3.8.

While I don't understand the rationale behind those changes, we should check the following:

Windows

Check whether python38.dll is located under the <venv_root>\Scripts folder. Otherwise, go to your main Python folder and copy all DLLs from there. In my case: python3.dll, python38.dll and vcruntime140.dll.

Linux

Check whether a libpython shared object is located under the <venv_root>/bin folder.

If not, we first need to check whether the shared object is present on our system. find_libpython can help with this task.

If the library is nowhere to be found, installing the python-dev package with your distribution's package manager will likely place the file on your system.

Once we locate the library, copy it into the bin folder. In my case, the file is named libpython3.8.so.1.0
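If find_libpython is not at hand, a similar lookup can be approximated with the standard library alone. The sketch below asks `sysconfig` for the library name and directories recorded when the interpreter was built; `candidate_paths` and `find_libpython` are helper names invented here, not part of any package. On Windows these config variables are usually unset, so the result will be empty there.

```python
import os
import sysconfig


def candidate_paths(names, dirs):
    """Join every directory with every library name, skipping gaps."""
    paths = []
    for directory in dirs:
        for name in names:
            # Config variables may be None, and a static archive (.a)
            # cannot be loaded as a shared library.
            if directory and name and not name.endswith(".a"):
                path = os.path.join(directory, name)
                if path not in paths:
                    paths.append(path)
    return paths


def find_libpython():
    """Return the candidate paths that exist on this system."""
    names = [sysconfig.get_config_var(key) for key in ("INSTSONAME", "LDLIBRARY")]
    dirs = [sysconfig.get_config_var(key) for key in ("LIBDIR", "LIBPL")]
    return [path for path in candidate_paths(names, dirs) if os.path.isfile(path)]


if __name__ == "__main__":
    for path in find_libpython():
        print(path)
```

Run it with the system Python 3.8 interpreter, not the one inside the virtual environment, since the goal is to find the original library.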

Usage

SpacyDotNet is built to be used as a library. However, I provide an example project as a CLI program.

1) Compile and Build

If using the .NET CLI (for example on Linux), simply browse to the Test/cs folder and compile the project with dotnet build. Under Visual Studio, just load the Test.sln solution.

2) Run the project

The program expects two parameters:

  • interpreter: Name of Python shared library file. Usually python38.dll on Windows, libpython3.8.so on Linux and libpython3.8.dylib on Mac
  • venv: Location of the virtual environment created with Python 3.8 and a spaCy version
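The usual file names listed for the interpreter parameter follow a simple pattern, sketched below. `default_interpreter` is a hypothetical helper, not part of SpacyDotNet, and the real file on disk may carry an extra suffix (such as libpython3.8.so.1.0 on my Linux machine), so treat the result as a starting guess.

```python
import sys


def default_interpreter(platform=sys.platform, version=(3, 8)):
    """Guess the usual shared library file name for a platform."""
    major, minor = version
    if platform.startswith("win"):
        return f"python{major}{minor}.dll"
    if platform == "darwin":
        return f"libpython{major}.{minor}.dylib"
    # Anything else is treated as Linux-like.
    return f"libpython{major}.{minor}.so"
```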

Run the example with dotnet run --interpreter <name_of_interpreter> --venv <path_to_virtualenv> or, if using Visual Studio, set the command line in Project => Properties => Debug => Application arguments

In my case:

Linux

dotnet run --interpreter libpython3.8.so.1.0 --venv /home/user/Dev/venvSpaCyPy38

Windows

dotnet run --interpreter python38.dll --venv C:\Users\user\Dev\venvSpaCyPy38

Code comparison

I've tried to mimic the spaCy API as much as possible, given the differences between C# and Python.

C# SpacyDotNet code

var nlp = spacy.Load("en_core_web_sm");
var doc = nlp.GetDocument("Apple is looking at buying U.K. startup for $1 billion");

foreach (Token token in doc.Tokens)
    Console.WriteLine($"{token.Text} {token.Lemma} {token.PoS} {token.Tag} {token.Dep} {token.Shape} {token.IsAlpha} {token.IsStop}");

Console.WriteLine("");
foreach (Span ent in doc.Ents)
    Console.WriteLine($"{ent.Text} {ent.StartChar} {ent.EndChar} {ent.Label}");

nlp = spacy.Load("en_core_web_md");
var tokens = nlp.GetDocument("dog cat banana afskfsd");

Console.WriteLine("");
foreach (Token token in tokens.Tokens)
    Console.WriteLine($"{token.Text} {token.HasVector} {token.VectorNorm}, {token.IsOov}");

tokens = nlp.GetDocument("dog cat banana");
Console.WriteLine("");
foreach (Token token1 in tokens.Tokens)
{
    foreach (Token token2 in tokens.Tokens)
        Console.WriteLine($"{token1.Text} {token2.Text} {token1.Similarity(token2) }");
}

doc = nlp.GetDocument("I love coffee");
Console.WriteLine("");
Console.WriteLine(doc.Vocab.Strings["coffee"]);
Console.WriteLine(doc.Vocab.Strings[3197928453018144401]);

Console.WriteLine("");
foreach (Token word in doc.Tokens)
{
    var lexeme = doc.Vocab[word.Text];
    Console.WriteLine($@"{lexeme.Text} {lexeme.Orth} {lexeme.Shape} {lexeme.Prefix} {lexeme.Suffix} 
{lexeme.IsAlpha} {lexeme.IsDigit} {lexeme.IsTitle} {lexeme.Lang}");
}

Python spaCy code

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

print("")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

print("")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

tokens = nlp("dog cat banana")
print("")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

doc = nlp("I love coffee")
print("")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

print("")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

Output

[Image: Output]

spacydotnet's People

Contributors

amarostegui

spacydotnet's Issues

More Spacy Examples

Great initiative!

I am curious what the next milestones for the spaCy examples could be.

Examples:

  • Phrase matcher
  • Extracting entity relations
  • Navigating the parse tree and subtrees

Please share which of these could come next in your plans.

What license does this tool have?

I don't see any license. Can I take the code and redistribute it without constraints? If not, what constraints apply?

Is this going to be updated and developed further?

I would like to use SpacyDotNet, but I haven't seen any new development for a year. Are there any other alternatives for using spaCy from .NET that are properly maintained and up to date with the latest version and all its features?

Python.Runtime.PythonException. TypeError : object of type 'generator' has no len()

Hola Antonio,

I'm trying to run the example in the class 'LinguisticFeatures.cs' but I'm getting this exception:
Python.Runtime.PythonException. TypeError : object of type 'generator' has no len().

I think this happens when accessing the 'Children' list member of Token, because the same exception is raised if I call token.Children.Count.

foreach (var token in doc.Tokens)
{
    var childs = new List<string>();
    token.Children.ForEach(c => childs.Add(c.Text)); //<---Exception here
    Console.WriteLine($"{token.Text} {token.Dep} {token.Head.Text} [{string.Join(", ", childs)}]");
}  

I am running the Test project from Visual Studio 2009 and passing the arguments --interpreter python38.dll --venv 'C:\Users\Jorge\Documents\venv'
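For context, on the Python side spaCy's Token.children is a generator, and a generator cannot be passed to len(). The snippet below reproduces the same TypeError in plain Python, without spaCy, and shows that materialising the generator into a list avoids it. The `children` function is a stand-in invented for illustration. Whether the wrapper calls len() on the generator internally is an assumption, not something confirmed here.

```python
def children():
    """Stand-in for a generator such as spaCy's Token.children."""
    yield "looking"
    yield "startup"


try:
    len(children())  # generators do not support len()
    message = None
except TypeError as exc:
    message = str(exc)

# Materialising the generator first gives a list with a length.
items = list(children())

print(message)
print(len(items))
```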

Strange issue with the example.

Hola Antonio,

First of all, I should say that I have no Python experience, but with SpacyDotNet I found it really easy to use spaCy in my .NET project.

I built a .NET app (WinForms) using SpacyDotNet and I'm having a strange issue. If I run the app from VS (with F5), it works with no problem. But if I run the built .exe, it always fails with 'AttributeError : 'NoneType' object has no attribute 'flush''

I tried on Windows 7 and Windows 10 machines, both 64-bit, with Python 3.7.0 and 3.7.1.

To locate the problem, I took the test project from SpacyDotNet and found that it works fine with the project output type set to Console Application, but if I change the output type to Windows Application, I get the same error when loading the model ("var nlp = spacy.load(pyString);")

I believe this is related to the output of Python.NET. Could that be?

Any suggestion or trick?

thanks in advance.
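One possible explanation: the Python documentation notes that sys.stdout and sys.stderr can be None in Windows GUI applications that are not attached to a console, and any code that then calls sys.stdout.flush() fails with this AttributeError. A commonly used workaround is to point the missing streams at a null device before other Python code runs. The sketch below shows the idea in Python. `ensure_std_streams` is a made-up helper name, and running it inside the embedded interpreter before loading the model is an untested assumption.

```python
import os
import sys


def ensure_std_streams():
    """Replace missing stdout/stderr with a null sink; return what changed."""
    replaced = []
    for name in ("stdout", "stderr"):
        if getattr(sys, name) is None:
            # Writes and flushes now succeed and are simply discarded.
            setattr(sys, name, open(os.devnull, "w"))
            replaced.append(name)
    return replaced
```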
