
trunajod2.0's Introduction

TRUNAJOD: A text complexity library for text analysis built on spaCy


TRUNAJOD is a Python library for text complexity analysis built on the high-performance spaCy library. With all the basic NLP capabilities provided by spaCy (dependency parsing, POS tagging, tokenization), TRUNAJOD focuses on extracting measurements from texts that might be interesting for different applications and use cases. While most of the indices could be computed for different languages, we currently mostly support Spanish. We are happy if you contribute indices implemented for your language!

Features

  • Utilities for text processing such as lemmatization and POS checks.
  • Semantic measurements from text, such as average coherence between sentences and average synonym overlap.
  • Givenness measurements such as pronoun density and pronoun-noun ratio.
  • A built-in emotion lexicon to compute emotion calculations based on words in the text.
  • A lexico-semantic norm dataset to compute lexico-semantic variables from text.
  • Type-token ratio (TTR) based metrics, and tunable TTR metrics.
  • A built-in syllabizer (currently only for Spanish).
  • Discourse-marker-based measurements to obtain measures of connectivity inside the text.
  • Plenty of surface proxies of text readability that can be computed directly from text.
  • Measurements of parse tree similarity as an approximation to syntactic complexity.
  • Parse tree correction to add periphrases and heuristics for clause counting, all based on linguistic expertise.
  • Entity grid and entity graph model implementations as a measure of coherence.
  • An easy-to-use, user-friendly API.

Installation

TRUNAJOD can be installed by running pip install trunajod. It requires Python 3.6.2+ to run.

Getting Started

Using this package has some other prerequisites. It assumes that you already have your model set up in spaCy. If not, please first install or download a model (for Spanish users, a Spanish model). Then you can get started with the following code snippet.

You can download pre-built TRUNAJOD models from the repo, under the models directory.

Below is a small snippet of code that can help you get started with this library. Don't forget to take a look at the documentation.

The example below assumes you have the es_core_news_sm spaCy Spanish model installed. You can install the model by running python -m spacy download es_core_news_sm. For other models, please check the spaCy docs.

from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""
    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

# Load spaCy model
nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

example_text = (
    "El espectáculo del cielo nocturno cautiva la mirada y suscita preguntas "
    "sobre el universo, su origen y su funcionamiento. No es sorprendente que "
    "todas las civilizaciones y culturas hayan formado sus propias "
    "cosmologías. Unas relatan, por ejemplo, que el universo ha "
    "sido siempre tal como es, con ciclos que inmutablemente se repiten; "
    "otras explican que este universo ha tenido un principio, "
    "que ha aparecido por obra creadora de una divinidad."
)

doc = nlp(example_text)

# Lexico-semantic norms
lexico_semantic_norms = LexicoSemanticNorm(
    doc,
    model.spanish_lexicosemantic_norms,
    model.lemmatizer
)

# Frequency index
freq_index = surface_proxies.frequency_index(doc, model.crea_frequency)

# Clause count (heuristic estimate)
clause_count = surface_proxies.clause_count(doc, model.infinitive_map)

# Compute Entity Grid
egrid = EntityGrid(doc)

print("Concreteness: {}".format(lexico_semantic_norms.get_concreteness()))
print("Frequency Index: {}".format(freq_index))
print("Clause count: {}".format(clause_count))
print("Entity grid:")
print(egrid.get_egrid())

This should output:

Concreteness: 1.95
Frequency Index: -0.7684649336888104
Clause count: 10
Entity grid:
{'ESPECTÁCULO': ['S', '-', '-'], 'CIELO': ['X', '-', '-'], 'MIRADA': ['O', '-', '-'], 'UNIVERSO': ['O', '-', 'S'], 'ORIGEN': ['X', '-', '-'], 'FUNCIONAMIENTO': ['X', '-', '-'], 'CIVILIZACIONES': ['-', 'S', '-'], 'CULTURAS': ['-', 'X', '-'], 'COSMOLOGÍAS': ['-', 'O', '-'], 'EJEMPLO': ['-', '-', 'X'], 'TAL': ['-', '-', 'X'], 'CICLOS': ['-', '-', 'X'], 'QUE': ['-', '-', 'S'], 'SE': ['-', '-', 'O'], 'OTRAS': ['-', '-', 'S'], 'PRINCIPIO': ['-', '-', 'O'], 'OBRA': ['-', '-', 'X'], 'DIVINIDAD': ['-', '-', 'X']}

A real world example

The TRUNAJOD library was used to build the TRUNAJOD web app, an application that assesses text complexity and checks the adequacy of a text for a particular school level. To achieve this, several TRUNAJOD indices were analyzed for multiple Chilean school system texts (from textbooks), and latent features were created. Here is a snippet:

"""Example of TRUNAJOD usage."""
import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
import textract  # To read .docx files
import TRUNAJOD.givenness
import TRUNAJOD.ttr
from TRUNAJOD import surface_proxies
from TRUNAJOD.syllabizer import Syllabizer

plt.rcParams["figure.figsize"] = (11, 4)
plt.rcParams["figure.dpi"] = 200


nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

features = {
    "lexical_diversity_mltd": [],
    "lexical_density": [],
    "pos_dissimilarity": [],
    "connection_words_ratio": [],
    "grade": [],
}
for filename in glob.glob("corpus/*/*.docx"):
    text = textract.process(filename).decode("utf8")
    doc = nlp(text)
    features["lexical_diversity_mltd"].append(
        TRUNAJOD.ttr.lexical_diversity_mtld(doc)
    )
    features["lexical_density"].append(surface_proxies.lexical_density(doc))
    features["pos_dissimilarity"].append(
        surface_proxies.pos_dissimilarity(doc)
    )
    features["connection_words_ratio"].append(
        surface_proxies.connection_words_ratio(doc)
    )

    # In our case the corpus was organized as
    # corpus/5B/5_2_55.docx, where the folder
    # containing the doc encodes the school level,
    # in this example 5th grade.
    features["grade"].append(filename.split("/")[1][0])

df = pd.DataFrame(features)


fig, axes = plt.subplots(2, 2)

sns.boxplot(x="grade", y="lexical_diversity_mltd", data=df, ax=axes[0, 0])
sns.boxplot(x="grade", y="lexical_density", data=df, ax=axes[0, 1])
sns.boxplot(x="grade", y="pos_dissimilarity", data=df, ax=axes[1, 0])
sns.boxplot(x="grade", y="connection_words_ratio", data=df, ax=axes[1, 1])

Which yields boxplots of each feature by grade.

TRUNAJOD web app example

The TRUNAJOD web app backend was built using the TRUNAJOD library. A demo video is shown below (it is in Spanish):

TRUNAJOD demo

Contributing to TRUNAJOD

Bug reports and fixes are always welcome! Feel free to file issues or ask for a feature request; we use the GitHub issue tracker for this. If you'd like to contribute, feel free to submit a pull request. For more questions, you can contact me at dipalma (at) udec (dot) cl.

More details can be found in CONTRIBUTING.

References

If you find any of this useful, feel free to cite the following papers, on which much of this library is based:

  1. Palma, D., & Atkinson, J. (2018). Coherence-based automatic essay assessment. IEEE Intelligent Systems, 33(5), 26-36.
  2. Palma, D., Soto, C., Veliz, M., Riffo, B., & Gutiérrez, A. (2019, August). A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements. In International Conference on Human Interaction and Emerging Technologies (pp. 509-515). Springer, Cham.
@article{Palma2021,
  doi = {10.21105/joss.03153},
  url = {https://doi.org/10.21105/joss.03153},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {60},
  pages = {3153},
  author = {Diego A. Palma and Christian Soto and Mónica Veliz and Bruno Karelovic and Bernardo Riffo},
  title = {TRUNAJOD: A text complexity library to enhance natural language processing},
  journal = {Journal of Open Source Software}
}

@article{palma2018coherence,
  title={Coherence-based automatic essay assessment},
  author={Palma, Diego and Atkinson, John},
  journal={IEEE Intelligent Systems},
  volume={33},
  number={5},
  pages={26--36},
  year={2018},
  publisher={IEEE}
}

@inproceedings{palma2019data,
  title={A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements},
  author={Palma, Diego and Soto, Christian and Veliz, M{\'o}nica and Riffo, Bernardo and Guti{\'e}rrez, Antonio},
  booktitle={International Conference on Human Interaction and Emerging Technologies},
  pages={509--515},
  year={2019},
  organization={Springer}
}

trunajod2.0's People

Contributors

apiad, brandongoding, brucewlee, danielskatz, dpalmasan, sourvad, supersonic1999


trunajod2.0's Issues

Investigation: Add support for word vectors

The implementation should be generic enough to allow using any other model, for example:

  • We'd like to use vectors obtained from a vector space model (raw counts, tf-idf, LSA)
  • We also need to support other types of word embeddings, such as GloVe, Word2Vec, or embeddings obtained from BERT (for sentences), for example: https://huggingface.co/

Our representation should allow us to implement metrics such as:

  • Average sentence similarity (e.g. cosine distance, euclidean distance)
  • Other metrics based on sentence similarity (e.g. max distance between two sentences, average distance to the cluster center)
  • Givenness using semantic spaces
  • etc.

One design idea is having a callable that takes the text and returns the vectors as a numpy array, as sketched below. From the spaCy dependency we should already have numpy available, so no worries about that.
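
A minimal sketch of that callable-based design, assuming nothing about the final API (the names vectorizer, average_sentence_similarity, and bow_vectorizer are illustrative only): any callable mapping a list of sentences to a 2-D numpy array could back the metrics listed above.

from typing import Callable, List

import numpy as np


def average_sentence_similarity(
    sentences: List[str], vectorizer: Callable[[List[str]], np.ndarray]
) -> float:
    """Average cosine similarity between consecutive sentence vectors.

    Assumes at least two sentences.
    """
    vectors = vectorizer(sentences)  # shape: (n_sentences, dim)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = (unit[:-1] * unit[1:]).sum(axis=1)  # cosine of consecutive pairs
    return float(sims.mean())


# A toy bag-of-words vectorizer satisfying the callable contract
def bow_vectorizer(sentences: List[str]) -> np.ndarray:
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(sentences), len(vocab)))
    for row, sentence in enumerate(sentences):
        for w in sentence.lower().split():
            matrix[row, index[w]] += 1
    return matrix

Swapping in GloVe, Word2Vec, or BERT sentence embeddings would then only mean providing a different callable.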

Then, we will need to file issues to implement the different metrics.

Set Superset as a visualization tool

Connect the database with Superset to make visualizations and analyses that could help in getting ideas/insights about how well the complexity model and the app are performing.

Update pre-commit from yapf to black

Since we are using Python 3.6+ and not supporting 3.5 (deprecated), pre-commit hooks should use black instead of yapf, and all the files should be reformatted. The hooks configuration should also be updated.

Fix documentation on readthedocs

For some reason, the documentation is shown correctly when building locally. However, on readthedocs the TTR documentation is missing.

Add type hints to discourse markers

Add type hints to the discourse markers features. These can be found in the following source code:

https://github.com/dpalmasan/TRUNAJOD2.0/blob/master/src/TRUNAJOD/discourse_markers.py

For example, the function find_matches (which already has a docstring specifying types):

import re


def find_matches(text, list):
    """Return matches of words in list in a target text.

    Given a text and a list of possible matches (in this module, discourse
    markers list), returns the number of matches found in text. This ignores
    case.

    .. hint:: For non-Spanish users
       You could use this function with your custom list of discourse markers
       in case you need to compute this metric. In that case, the way to call
       the function would be: ``find_matches(YOUR_TEXT, ["dm1", "dm2", etc])``

    :param text: Text to be processed
    :type text: string
    :param list: list of discourse markers
    :type list: Python list of strings
    :return: Number of occurrences
    :rtype: int
    """
    counter = 0
    for w in list:
        results = re.findall(r"\b%s\b" % w, text, re.IGNORECASE)
        counter += len(results)
    return counter

Could be updated to (adding from typing import List to the imports):

def find_matches(text: str, list: List[str]) -> int:

Implement D estimate

This estimates lexical diversity using a non-linear model. It is computed by the following procedure:

  1. Take a random sample of words from the text.
  2. Calculate the TTR (type token ratio)
  3. Find the value of D that best fits the equation TTR = (D/N) * (sqrt(1 + 2N/D) - 1).

Using numpy is allowed, as spaCy (which already depends on numpy) is set as a dependency for TRUNAJOD.
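
As a rough sketch of the procedure, using only numpy (a grid search over D instead of a proper non-linear solver; estimate_d is a hypothetical name, not TRUNAJOD API):

import numpy as np


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=42):
    """Estimate D by fitting TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1).

    Assumes len(tokens) >= max(sizes).
    """
    rng = np.random.default_rng(seed)
    sizes = np.array(list(sizes))
    # Empirical mean TTR for each sample size, averaged over random samples
    empirical = np.array([
        np.mean([
            len(set(rng.choice(tokens, size=n, replace=False))) / n
            for _ in range(trials)
        ])
        for n in sizes
    ])
    # Grid search for the D minimizing squared error against the model curve
    candidates = np.linspace(1, 200, 2000)
    errors = [
        np.sum((empirical - (d / sizes) * (np.sqrt(1 + 2 * sizes / d) - 1)) ** 2)
        for d in candidates
    ]
    return float(candidates[int(np.argmin(errors))])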

Include additional examples on the documentation

In the documentation, there is only one example of usage. Since there are a lot of features, the authors should include at least two or three more examples using other features, to improve the understanding of the proposed package. Currently, the remaining functionality is described only at an API level, which seems insufficient.

cc openjournals/joss-reviews#3153

Make spaCy a dependency and fix buggy setup

TRUNAJOD uses spaCy, but the setup.py file does not add it as a dependency. On the other hand, long_description uses an extra, unnecessary rst file that does not render properly on the PyPI site. This issue aims to fix this.

Complete / expand contribution guidelines

Hey 👋 ! This issue is part of the JOSS review (openjournals/joss-reviews#3153) for this project.

First of all, kudos on the project! As a Spanish-speaking NLP researcher myself, I honestly cannot commend you enough on the quality and breadth of this work. We need this kind of research in our community, and I thank you on behalf of all my colleague researchers as well.

Checking over the CONTRIBUTING guidelines, it seems there are some TBD sections that could be completed, not necessarily in great detail, but at least to a point where they're useful. If you feel a section is unnecessary you can just remove it (e.g., how to create a pull request is a general GitHub-level topic that I don't think your guidelines need to spend much effort explaining).

My suggestion is to review the CONTRIBUTING guidelines and either add some content to those TBD sections that you consider relevant, or reformat the document to remove them, thus making the guidelines complete and welcoming to new contributors. Please let me know if you feel this is a big issue or if you want to discuss it further.

Add coherence measurements based on syntactic patterns

Implement and evaluate the following coherence model:

Louis, A., & Nenkova, A. (2012, July). A coherence model based on syntactic patterns. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 1157-1168).

State of the field

In terms of the state of the field, the paper includes a lot of discussion around text measurements. But what about competitor libraries? Is TRUNAJOD the only library/package (including in other programming languages) that serves this purpose? If it isn't the only one, how does it compare to those implementations?

cc openjournals/joss-reviews#3153

Create a mechanism for model installation

Currently TRUNAJOD models live in a folder called models, and they are loaded using a snippet as follows:

from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""
    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

However, other similar projects (e.g. spaCy) keep models in a different repo and install them via Python tooling (e.g. command-line options). This ticket is to track the effort of implementing such a mechanism.

Implement Guiraud’s Index

This is a lexical diversity measurement that penalizes the number of words. It is computed as:

G = V / sqrt(N)

where V is the number of distinct words in the text, and N is the total number of words in the text.
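
A minimal sketch of the computation over a spaCy Doc (the name guiraud_index is illustrative, not the final API):

import math


def guiraud_index(doc):
    """Guiraud's index: V / sqrt(N), over the alphabetic tokens of a Doc."""
    words = [token.lower_ for token in doc if token.is_alpha]
    return len(set(words)) / math.sqrt(len(words))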

Unit tests should be added as well.

Docs should be updated as well, adding the following reference:

@misc{herdan1961problemes,
  title={Probl{\`e}mes et m{\'e}thodes de la statistique linguistique},
  author={Herdan, Gustav},
  year={1961},
  publisher={JSTOR}
}

Add word variation index

Word variation index can be thought of as the "density of ideas" in a text. It can be computed (following the OVIX formulation) as:

WVI = log(N) / log(2 - log(V) / log(N))

where V is the number of distinct words and N is the total number of words in the text.
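
A minimal sketch, assuming the OVIX formulation above (illustrative helper, not the final API; note the formula is undefined when every word is distinct, since the denominator becomes log(1) = 0):

import math


def word_variation_index(n_words: int, n_types: int) -> float:
    """WVI (OVIX): log(N) / log(2 - log(V) / log(N)). Assumes V < N."""
    return math.log(n_words) / math.log(
        2 - math.log(n_types) / math.log(n_words)
    )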

Implement Salience model for Entity Grid

Currently, EntityGrid features only support a uniform model for entities; however, salience features might improve the performance of applications using the entity grid model. This issue is to extend the model to consider salience features with the approach stated in:

Barzilay, R., & Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1), 1-34.

Error while running example

I've successfully installed TRUNAJOD and its dependencies. However, when I tried to run the example presented in the README, the following error occurred:

Traceback (most recent call last):
  File "main.py", line 44, in <module>
    nlp = spacy.load("es", disable=["ner", "textcat"])
  File "/home/matheus/.local/lib/python3.8/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/home/matheus/.local/lib/python3.8/site-packages/spacy/util.py", line 328, in load_model
    raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name]))
OSError: [E941] Can't find model 'es'. It looks like you're trying to load a model from a shortcut, which is deprecated as of spaCy v3.0. To load the model, use its full name instead:

nlp = spacy.load("es_core_news_sm")

For more details on the available models, see the models directory: https://spacy.io/models. If you want to create a blank model, use spacy.blank: nlp = spacy.blank("es")

Missing community guidelines

Both the documentation page and the README file should contain clear guidelines for third parties who wish to contribute to the software, report issues, or seek support. Here is an example:

If you wish to contribute to the software, report issues or problems, or seek support, feel free to use the issue tracker of this repository or to contact us.

Author's 1 name - [email protected]
Author's 2 name - [email protected]

Fix typo in Yule's K

The constant was set as 10^-4, but it should be 10^4. Therefore, we need to change this and also update the docstring.
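
For reference, a sketch with the corrected constant (illustrative, not necessarily the exact TRUNAJOD implementation). Yule's K is K = 10^4 * (sum over i of i^2 * V_i - N) / N^2, where V_i is the number of word types occurring i times and N is the total number of tokens.

from collections import Counter


def yules_k(tokens):
    """Yule's K with the 10**4 constant."""
    n = len(tokens)
    # Map each frequency i to V_i, the number of types occurring i times
    freq_of_freqs = Counter(Counter(tokens).values())
    m2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (m2 - n) / (n * n)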

Implement universal POS tags ratio

spaCy tokens carry both universal POS tags and detailed POS tags, available as the pos_ and tag_ properties of a spaCy Token, respectively. Currently, the function pos_ratio is somewhat misleading: its docstring describes pos_ tags, but it computes the ratio using tag_. This ticket is to fix this and implement a similar function using universal POS tags.

https://github.com/dpalmasan/TRUNAJOD2.0/blob/master/src/TRUNAJOD/surface_proxies.py#L515

Acceptance Criteria

  • Docstring is updated properly
  • Functionality to compute pos_ratio based on universal POS tags is added (see the sketch below).
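
A minimal sketch of such a function over token.pos_ (the name universal_pos_ratio and the regex-based parameter are assumptions, not the final API):

import re


def universal_pos_ratio(doc, pos_types: str) -> float:
    """Ratio of tokens whose universal POS tag (token.pos_) matches a regex."""
    pattern = re.compile(pos_types)
    matches = sum(1 for token in doc if pattern.match(token.pos_))
    return matches / len(doc)

For example, universal_pos_ratio(doc, "NOUN|VERB") would return the fraction of tokens tagged as nouns or verbs.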

Create factor analysis from model 5B discussed

There was a statistical analysis of what the factors are and what their factor loadings are; in particular, we consider the following factors:

  • Lexical Similarity
  • Connectivity
  • Referential Cohesion
  • Concreteness
  • Narrativity

Since we can already compute these factors, we need to implement the computation bits. This issue is to track that.

Implement a Hapax Legomena Index

This is defined as the number of words occurring only once in the text. It might be a good estimate when comparing texts of similar length.
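
A minimal sketch (illustrative helper, not the final API):

from collections import Counter


def hapax_legomena_index(tokens):
    """Number of word types occurring exactly once in the text."""
    counts = Counter(token.lower() for token in tokens)
    return sum(1 for c in counts.values() if c == 1)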

Figure out a way to store model without relying on external files

Several functions of our complexity text assessor rely on external databases (such as givenness with the MCR (Multilingual Central Repository), stopwords, the lemmatizer, etc.). We would like to provide an easy API for testing the TRUNAJOD library, such as the one spaCy provides with the nlp object. This issue is about investigating an alternative to achieve this.

Add unit tests and `pre-commit-hooks`

Currently the codebase is a mess, with no unit tests and no documentation. This issue is to track efforts to add unit tests to the current codebase and to do some refactoring to improve maintainability.
