
Natural Language Processing

Introduction

Natural language processing (NLP) can be used to answer a variety of questions about unstructured text, as well as to facilitate open-ended exploration. It can be applied to datasets such as emails, online articles and comments, tweets and novels. Although the source is text, transformations are applied to convert this data to vectors, dictionaries and symbols, which can be handled very effectively by q. Many operations, such as searching, clustering and keyword extraction, can be done using very simple data structures, such as feature vectors.
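To illustrate how far simple feature vectors go (a generic sketch, not code from this library; the names `features` and `cosine` are our own), a bag-of-words dictionary plus cosine similarity is already enough to support search and clustering:

```python
import math
from collections import Counter

def features(text):
    """Bag-of-words term-frequency vector, stored as a dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc1 = features("the cat sat on the mat")
doc2 = features("the cat lay on the rug")
print(cosine(doc1, doc2))  # → 0.75
```

Documents that share more vocabulary score closer to 1, which is the basis for nearest-neighbour search and centroid-based clustering over a corpus.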

Features

The NLP library allows users to parse datasets using the spaCy model from Python, which performs tokenisation, sentence detection, part-of-speech tagging and lemmatisation. In addition to parsing, users can cluster text documents together using different clustering algorithms such as MCL, k-means and radix. You can also run sentiment analysis, which indicates whether a piece of text has a positive or negative sentiment.
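As an aside on how lexicon-based sentiment scoring works in general (a toy sketch, not the library's implementation; the mini-lexicon and its valence values are invented for illustration):

```python
# Illustrative only: a tiny hand-made lexicon, not the library's actual word list.
LEXICON = {"good": 1.9, "great": 3.1, "happy": 2.7,
           "bad": -2.5, "terrible": -2.1, "sad": -2.1}

def sentiment(text):
    """Mean valence of the lexicon words found; >0 positive, <0 negative."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment("the coffee was good but the service was terrible"))
```

Real lexicons additionally handle negation, intensifiers and punctuation, but the core idea is the same: sum per-word valences and normalise.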

Requirements

  • kdb+ >= v3.5 64-bit
  • Anaconda Python 3.x
  • embedPy

Dependencies

The following python packages are required:

  1. numpy
  2. beautifulsoup4
  3. spacy
  • Tests were run using spacy version 2.2.1

To install these packages with

pip

pip install -r requirements.txt

or with conda

conda install -c conda-forge --file requirements.txt
  • Download the English model using python -m spacy download en

Other languages that spaCy supports can be found at https://spacy.io/usage/models#languages

To use languages that are in the alpha stage of development in spaCy, the following steps can be taken:

To download the Chinese model, jieba must be installed

pip

pip install jieba

To download the Japanese model, mecab must be installed

pip

pip install mecab-python3
  • spacy_hunspell is not a requirement to run these scripts, but can be installed using the following methods:

Linux

sudo apt-get install libhunspell-dev hunspell
pip install spacy_hunspell

macOS

wget https://iweb.dl.sourceforge.net/project/wordlist/speller/2019.10.06/hunspell-en_US-2019.10.06.zip;
unzip hunspell-en_US-2019.10.06; sudo mv en_US.dic en_US.aff /Library/Spelling/; 
brew install hunspell;
export C_INCLUDE_PATH=/usr/local/include/hunspell;
sudo ln -sf /usr/local/lib/libhunspell-1.7.a /usr/local/lib/libhunspell.a;
sudo ln -sf /usr/local/Cellar/hunspell/1.7.0_2/lib/libhunspell-1.7.dylib /usr/local/Cellar/hunspell/1.7.0_2/lib/libhunspell.dylib;
CFLAGS=$(pkg-config --cflags hunspell) LDFLAGS=$(pkg-config --libs hunspell) pip install hunspell==0.5.0

At the moment spacy_hunspell does not support installation on Windows. More information can be found at https://github.com/tokestermw/spacy_hunspell

Installation

Run tests with

q test.q

Place the library file in $QHOME and load it into a q instance using

q)\l nlp/nlp.q
q).nlp.loadfile`:init.q
Loading init.q
Loading code/utils.q
Loading code/regex.q
Loading code/sent.q
Loading code/parser.q
Loading code/time.q
Loading code/date.q
Loading code/email.q
Loading code/cluster.q
Loading code/nlp_code.q
q).nlp.findTimes"I went to work at 9:00am and had a coffee at 10:20"
09:00:00.000 "9:00am" 18 24
10:20:00.000 "10:20"  45 50
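For comparison, a rough pure-Python analogue of the example above, using the standard `re` module (`find_times` is our own name, and the pattern covers far fewer formats than `.nlp.findTimes`; it returns the same normalised-time, matched-text, start- and end-index fields):

```python
import re

# Simplified pattern: H:MM or HH:MM with an optional am/pm suffix.
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([ap]m)?\b", re.IGNORECASE)

def find_times(text):
    """Return (normalised 24h time, matched text, start, end) tuples."""
    out = []
    for m in TIME_RE.finditer(text):
        hour, minute = int(m.group(1)), int(m.group(2))
        suffix = (m.group(3) or "").lower()
        if suffix == "pm" and hour < 12:
            hour += 12
        if suffix == "am" and hour == 12:
            hour = 0
        out.append((f"{hour:02d}:{minute:02d}", m.group(0), m.start(), m.end()))
    return out

print(find_times("I went to work at 9:00am and had a coffee at 10:20"))
# → [('09:00', '9:00am', 18, 24), ('10:20', '10:20', 45, 50)]
```

The start/end indices match those in the q output above; the library itself handles many more time formats and returns native q time values.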

Docker

If you have Docker installed, you can alternatively run:

$ docker run -it --name mynlp kxsys/nlp
kdb+ on demand - Personal Edition

[snipped]

I agree to the terms of the license agreement for kdb+ on demand Personal Edition (N/y): y

If applicable please provide your company name (press enter for none): ACME Limited
Please provide your name: Bob Smith
Please provide your email (requires validation): [email protected]
KDB+ 3.5 2018.04.25 Copyright (C) 1993-2018 Kx Systems
l64/ 4()core 7905MB kx 0123456789ab 172.17.0.2 EXPIRE 2018.12.04 [email protected] KOD #0000000

Loading code/utils.q
Loading code/regex.q
Loading code/sent.q
Loading code/parser.q
Loading code/time.q
Loading code/date.q
Loading code/email.q
Loading code/cluster.q
Loading code/nlp_code.q
q).nlp.findTimes"I went to work at 9:00am and had a coffee at 10:20"
09:00:00.000 "9:00am" 18 24
10:20:00.000 "10:20"  45 50

N.B. instructions regarding headless/presets are available

N.B. build instructions for the image are available

Documentation

Documentation is available on the nlp homepage.

Status

The nlp library is still in development and is available here as a beta release.
If you have any issues, questions or suggestions, please write to [email protected].

Contributors

awilson-kx, cmccarthy1, dianeod, fionncarr, jhanna-kx

Issues

Incompatible with spaCy 2.1

I've just followed the directions for installing on OSX. Running the following:

\l p.q
\l nlp/init.q

parser:.nlp.newParser[`en;`text`tokens`lemmas`pennPOS`isStop`sentChars`starts`sentIndices`keywords]

Gives the following error:

call: "[E108] As of spaCy v2.1, the pipe name `sbd` has been deprecated in favor of the pipe name `sentencizer`, which does the same thing. For example, use `nlp.create_pipeline('sentencizer')`"

`length error running .nlp.findDates

.nlp.findDates throws `length errors on the following inputs
"5.24.01"
"December 18-January 2"
"memo of 10-5 1 page.doc"

Running it on the Enron emails, it is throwing `length on ~8% of the emails

Incompatible with spaCy 3.5.3

This is similar to issue #17.

evaluation error:

call: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got  (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

  [4]  /home/david/miniconda3/envs/nlp/q/p.q:38: .p.embedPy:
        '`NYI];
      wrap pyfunc[f]. x];
                    ^
    ":"~first a0:string x0;                                                / attr lookup and possible call

  [3]  /home/david/miniconda3/envs/nlp/q/p.q:46: .p.i.wf:{[f;x]embedPy[f;x]}
                                                               ^

  [2]  /home/david/miniconda3/envs/nlp/q/nlp/code/parser.q:86: .nlp.parser.i.newSubParser:
    pipe:$[`~checkLang;model[`:create_pipe;`sentencizer];.p.pyget`x_sbd];
    model[`:add_pipe]pipe;
    ^
    ];

  [1]  /home/david/miniconda3/envs/nlp/q/nlp/code/nlpCode.q:65: .nlp.newParser:
  disabled:`ner`tagger`parser except options;
  model:parser.i.newSubParser[spacyModel;options;disabled];
        ^
  tokenAttrs:parser.i.q2spacy key[parser.i.q2spacy]inter options;

  [0]  myparser:.nlp.newParser[`en_core_web_sm;`text`tokens`lemmas`pennPOS`isStop`sentChars`starts`sentIndices`keywords] 
                ^

Changing the following line in parser.q (in parser.i.newSubParser) resolved this on my end:
From:

if[`sbd in options;
    pipe:$[`~checkLang;model[`:create_pipe;`sentencizer];.p.pyget`x_sbd];
    model[`:add_pipe]pipe;
    ];

To:

if[`sbd in options;$[`~checkLang;model[`:add_pipe;`sentencizer];model[`:add_pipe].p.pyget`x_sbd]];
