
trunajod2.0's Introduction

TRUNAJOD: A text complexity library for text analysis built on spaCy


TRUNAJOD is a Python library for text complexity analysis built on the high-performance spaCy library. With all the basic NLP capabilities provided by spaCy (dependency parsing, POS tagging, tokenization), TRUNAJOD focuses on extracting measurements from texts that might be interesting for different applications and use cases. While most of the indices could be computed for different languages, we currently mostly support Spanish. We are happy if you contribute indices implemented for your language!

Features

  • Utilities for text processing such as lemmatization and POS checks.
  • Semantic measurements from text, such as average coherence between sentences and average synonym overlap.
  • Givenness measurements such as pronoun density and pronoun-noun ratio.
  • A built-in emotion lexicon to compute emotion calculations based on words in the text.
  • A lexico-semantic norm dataset to compute lexico-semantic variables from text.
  • Type-token ratio (TTR) based metrics, and tunable TTR metrics.
  • A built-in syllabizer (currently only for Spanish).
  • Discourse-marker-based measurements to obtain measures of connectivity inside the text.
  • Plenty of surface proxies of text readability that can be computed directly from text.
  • Measurements of parse tree similarity as an approximation to syntactic complexity.
  • Parse tree correction to add periphrases and heuristics for clause counting, all based on linguistic expertise.
  • Entity grid and entity graph model implementations as a measure of coherence.
  • An easy-to-use, user-friendly API.

Installation

TRUNAJOD can be installed by running pip install trunajod. It requires Python 3.6.2+ to run.

Getting Started

Using this package has some other prerequisites. It assumes that you already have your model set up in spaCy. If not, please first install or download a model (for Spanish users, a Spanish model). Then you can get started with the following code snippet.

You can download pre-built TRUNAJOD models from the repo, under the models directory.

Below is a small snippet of code that can help you get started with this library. Don't forget to take a look at the documentation.

The example below assumes you have the es_core_news_sm spaCy Spanish model installed. You can install the model by running python -m spacy download es_core_news_sm. For other models, please check the spaCy docs.

from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""
    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

# Load spaCy model
nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

example_text = (
    "El espectáculo del cielo nocturno cautiva la mirada y suscita preguntas "
    "sobre el universo, su origen y su funcionamiento. No es sorprendente que "
    "todas las civilizaciones y culturas hayan formado sus propias "
    "cosmologías. Unas relatan, por ejemplo, que el universo ha "
    "sido siempre tal como es, con ciclos que inmutablemente se repiten; "
    "otras explican que este universo ha tenido un principio, "
    "que ha aparecido por obra creadora de una divinidad."
)

doc = nlp(example_text)

# Lexico-semantic norms
lexico_semantic_norms = LexicoSemanticNorm(
    doc,
    model.spanish_lexicosemantic_norms,
    model.lemmatizer
)

# Frequency index
freq_index = surface_proxies.frequency_index(doc, model.crea_frequency)

# Clause count (heuristic estimate)
clause_count = surface_proxies.clause_count(doc, model.infinitive_map)

# Compute Entity Grid
egrid = EntityGrid(doc)

print("Concreteness: {}".format(lexico_semantic_norms.get_concreteness()))
print("Frequency Index: {}".format(freq_index))
print("Clause count: {}".format(clause_count))
print("Entity grid:")
print(egrid.get_egrid())

This should output:

Concreteness: 1.95
Frequency Index: -0.7684649336888104
Clause count: 10
Entity grid:
{'ESPECTÁCULO': ['S', '-', '-'], 'CIELO': ['X', '-', '-'], 'MIRADA': ['O', '-', '-'], 'UNIVERSO': ['O', '-', 'S'], 'ORIGEN': ['X', '-', '-'], 'FUNCIONAMIENTO': ['X', '-', '-'], 'CIVILIZACIONES': ['-', 'S', '-'], 'CULTURAS': ['-', 'X', '-'], 'COSMOLOGÍAS': ['-', 'O', '-'], 'EJEMPLO': ['-', '-', 'X'], 'TAL': ['-', '-', 'X'], 'CICLOS': ['-', '-', 'X'], 'QUE': ['-', '-', 'S'], 'SE': ['-', '-', 'O'], 'OTRAS': ['-', '-', 'S'], 'PRINCIPIO': ['-', '-', 'O'], 'OBRA': ['-', '-', 'X'], 'DIVINIDAD': ['-', '-', 'X']}

A real world example

The TRUNAJOD library was used to build the TRUNAJOD web app, an application that assesses text complexity and checks the adequacy of a text for a particular school level. To achieve this, several TRUNAJOD indices were analyzed for multiple Chilean school system texts (from textbooks), and latent features were created. Here is a snippet:

"""Example of TRUNAJOD usage."""
import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
import textract  # To read .docx files
import TRUNAJOD.givenness
import TRUNAJOD.ttr
from TRUNAJOD import surface_proxies
from TRUNAJOD.syllabizer import Syllabizer

plt.rcParams["figure.figsize"] = (11, 4)
plt.rcParams["figure.dpi"] = 200


nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

features = {
    "lexical_diversity_mltd": [],
    "lexical_density": [],
    "pos_dissimilarity": [],
    "connection_words_ratio": [],
    "grade": [],
}
for filename in glob.glob("corpus/*/*.docx"):
    text = textract.process(filename).decode("utf8")
    doc = nlp(text)
    features["lexical_diversity_mltd"].append(
        TRUNAJOD.ttr.lexical_diversity_mtld(doc)
    )
    features["lexical_density"].append(surface_proxies.lexical_density(doc))
    features["pos_dissimilarity"].append(
        surface_proxies.pos_dissimilarity(doc)
    )
    features["connection_words_ratio"].append(
        surface_proxies.connection_words_ratio(doc)
    )

    # In our case the corpus was organized as
    # corpus/5B/5_2_55.docx, where the folder
    # containing the doc encodes the school level,
    # in this example 5th grade.
    features["grade"].append(filename.split("/")[1][0])

df = pd.DataFrame(features)


fig, axes = plt.subplots(2, 2)

sns.boxplot(x="grade", y="lexical_diversity_mltd", data=df, ax=axes[0, 0])
sns.boxplot(x="grade", y="lexical_density", data=df, ax=axes[0, 1])
sns.boxplot(x="grade", y="pos_dissimilarity", data=df, ax=axes[1, 0])
sns.boxplot(x="grade", y="connection_words_ratio", data=df, ax=axes[1, 1])

Which yields boxplots of each feature by grade.

TRUNAJOD web app example

The TRUNAJOD web app backend was built using the TRUNAJOD library. A demo video is shown below (it is in Spanish):

TRUNAJOD demo

Contributing to TRUNAJOD

Bug reports and fixes are always welcome! Feel free to file issues or ask for a feature request; we use the GitHub issue tracker for this. If you'd like to contribute, feel free to submit a pull request. For more questions, you can contact me at dipalma (at) udec (dot) cl.

More details can be found in CONTRIBUTING.

References

If you find any of this useful, feel free to cite the following papers, on which much of this library is based:

  1. Palma, D., & Atkinson, J. (2018). Coherence-based automatic essay assessment. IEEE Intelligent Systems, 33(5), 26-36.
  2. Palma, D., Soto, C., Veliz, M., Riffo, B., & Gutiérrez, A. (2019, August). A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements. In International Conference on Human Interaction and Emerging Technologies (pp. 509-515). Springer, Cham.
@article{Palma2021,
  doi = {10.21105/joss.03153},
  url = {https://doi.org/10.21105/joss.03153},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {60},
  pages = {3153},
  author = {Diego A. Palma and Christian Soto and Mónica Veliz and Bruno Karelovic and Bernardo Riffo},
  title = {TRUNAJOD: A text complexity library to enhance natural language processing},
  journal = {Journal of Open Source Software}
}

@article{palma2018coherence,
  title={Coherence-based automatic essay assessment},
  author={Palma, Diego and Atkinson, John},
  journal={IEEE Intelligent Systems},
  volume={33},
  number={5},
  pages={26--36},
  year={2018},
  publisher={IEEE}
}

@inproceedings{palma2019data,
  title={A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements},
  author={Palma, Diego and Soto, Christian and Veliz, M{\'o}nica and Riffo, Bernardo and Guti{\'e}rrez, Antonio},
  booktitle={International Conference on Human Interaction and Emerging Technologies},
  pages={509--515},
  year={2019},
  organization={Springer}
}

trunajod2.0's People

Contributors

apiad, brandongoding, brucewlee, danielskatz, dpalmasan, sourvad, supersonic1999


trunajod2.0's Issues

Investigation: Add support for word vectors

The implementation should be generic enough to allow using any other model, for example:

  • We'd like to use vectors obtained from a vector space model (raw counts, tf-idf, LSA)
  • We also need to support other types of word embeddings, such as GloVe, Word2Vec, or embeddings obtained from BERT (for sentences), for example: https://huggingface.co/

Our representation should allow us to implement metrics such as:

  • Average sentence similarity (e.g. cosine distance, euclidean distance)
  • Other metrics based on sentence similarity (e.g. max distance between two sentences, average distance to the cluster center)
  • Givenness using semantic spaces
  • etc.

One design idea is having a callable that takes the text and returns the vectors as a numpy array, as sketched below. From the spaCy dependency we should already have numpy available, so no worries about that.
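
A minimal sketch of that callable-based design, assuming nothing about the final API (the names vectorizer, average_sentence_similarity, and bow_vectorizer are illustrative only): any callable mapping a list of sentences to a 2-D numpy array could back the metrics listed above.

from typing import Callable, List

import numpy as np


def average_sentence_similarity(
    sentences: List[str], vectorizer: Callable[[List[str]], np.ndarray]
) -> float:
    """Average cosine similarity between consecutive sentence vectors.

    Assumes at least two sentences.
    """
    vectors = vectorizer(sentences)  # shape: (n_sentences, dim)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = (unit[:-1] * unit[1:]).sum(axis=1)  # cosine of consecutive pairs
    return float(sims.mean())


# A toy bag-of-words vectorizer satisfying the callable contract
def bow_vectorizer(sentences: List[str]) -> np.ndarray:
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(sentences), len(vocab)))
    for row, sentence in enumerate(sentences):
        for w in sentence.lower().split():
            matrix[row, index[w]] += 1
    return matrix

Swapping in GloVe, Word2Vec, or BERT sentence embeddings would then only mean providing a different callable.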

Then, we will need to file issues to implement the different metrics.

Set Superset as a visualization tool

Connect the database with Superset to make visualizations and analyses that could help in getting ideas/insights about how well the complexity model and the app are performing.

Update pre-commit from yapf to black

Since we are using Python 3.6+ and not supporting 3.5 (deprecated), pre-commit hooks should use black instead of yapf, and all the files should be reformatted. The hooks configuration should also be updated.

Fix documentation on readthedocs

For some reason, the documentation is shown correctly when building locally. However, on readthedocs the TTR documentation is missing.

Add type hints to discourse markers

Add type hints to the discourse markers features. These can be found in the following source code:

https://github.com/dpalmasan/TRUNAJOD2.0/blob/master/src/TRUNAJOD/discourse_markers.py

For example, the function find_matches (which already has a docstring specifying types):

import re


def find_matches(text, list):
    """Return matches of words in list in a target text.

    Given a text and a list of possible matches (in this module, discourse
    markers list), returns the number of matches found in text. This ignores
    case.

    .. hint:: For non-Spanish users
       You could use this function with your custom list of discourse markers
       in case you need to compute this metric. In that case, the way to call
       the function would be: ``find_matches(YOUR_TEXT, ["dm1", "dm2", etc])``

    :param text: Text to be processed
    :type text: string
    :param list: list of discourse markers
    :type list: Python list of strings
    :return: Number of occurrences
    :rtype: int
    """
    counter = 0
    for w in list:
        results = re.findall(r"\b%s\b" % w, text, re.IGNORECASE)
        counter += len(results)
    return counter

Could be updated to (adding from typing import List to the imports):

def find_matches(text: str, list: List[str]) -> int:

Implement D estimate

This estimates lexical diversity using a non-linear model. It is computed by the following procedure:

  1. Take a random sample of words from the text.
  2. Calculate the TTR (type token ratio)
  3. Find the value of D that best fits the equation TTR = (D/N) * (sqrt(1 + 2N/D) - 1).

Using numpy is allowed, as spaCy (which already depends on numpy) is set as a dependency for TRUNAJOD.
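
As a rough sketch of the procedure, using only numpy (a grid search over D instead of a proper non-linear solver; estimate_d is a hypothetical name, not TRUNAJOD API):

import numpy as np


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=42):
    """Estimate D by fitting TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1).

    Assumes len(tokens) >= max(sizes).
    """
    rng = np.random.default_rng(seed)
    sizes = np.array(list(sizes))
    # Empirical mean TTR for each sample size, averaged over random samples
    empirical = np.array([
        np.mean([
            len(set(rng.choice(tokens, size=n, replace=False))) / n
            for _ in range(trials)
        ])
        for n in sizes
    ])
    # Grid search for the D minimizing squared error against the model curve
    candidates = np.linspace(1, 200, 2000)
    errors = [
        np.sum((empirical - (d / sizes) * (np.sqrt(1 + 2 * sizes / d) - 1)) ** 2)
        for d in candidates
    ]
    return float(candidates[int(np.argmin(errors))])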

Include additional examples on the documentation

In the documentation, there is only one example of usage. Since there are a lot of features, the authors should include at least two or three more examples using other features, to improve the understanding of the proposed package. Currently, the remaining functionality is described only at an API level, which seems insufficient.

cc openjournals/joss-reviews#3153

Make spaCy a dependency and fix buggy setup

TRUNAJOD uses spaCy, but the setup.py file does not add it as a dependency. On the other hand, long_description uses an extra, unnecessary rst file that does not render properly on the PyPI site. This issue aims to fix this.

Complete / expand contribution guidelines

Hey 👋 ! This issue is part of the JOSS review (openjournals/joss-reviews#3153) for this project.

First of all, kudos on the project! As a Spanish-speaking NLP researcher myself, I honestly cannot commend you enough on the quality and breadth of this work. We need this kind of research in our community, and I thank you on behalf of all my colleague researchers as well.

Checking over the CONTRIBUTING guidelines, it seems there are some TBD sections that could be completed, not necessarily in great detail, but at least to a point where they're useful. If you feel a section is unnecessary you can just remove it (e.g., how to create a pull request is a general GitHub-level topic that I don't think your guidelines need to spend much effort explaining).

My suggestion is to review the CONTRIBUTING guidelines and either add some content to those TBD sections that you consider relevant, or reformat the document to remove them, thus making the guidelines complete and welcoming to new contributors. Please let me know if you feel this is a big issue or if you want to discuss it further.

Add coherence measurements based on syntactic patterns

Implement and evaluate the following coherence model:

Louis, A., & Nenkova, A. (2012, July). A coherence model based on syntactic patterns. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 1157-1168).

State of the field

In terms of the state of the field, the paper includes a lot of discussion around text measurements. But what about competitor libraries? Is TRUNAJOD the only library/package (including in other programming languages) that serves this purpose? If it isn't the only one, how does it compare to those implementations?

cc openjournals/joss-reviews#3153

Create a mechanism for model installation

Currently TRUNAJOD models live in a folder called models, and they are loaded using a snippet as follows:

from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""
    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

However, other similar projects (e.g. spaCy) keep models in a different repo and install them via Python tooling (e.g. command-line options). This ticket is to track the effort of implementing such a mechanism.

Implement Guiraud’s Index

This is a lexical diversity measurement that penalizes the number of words. It is computed as:

G = V / sqrt(N)

where V is the number of distinct words in the text, and N is the total number of words in the text.
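
A minimal sketch of the computation over a spaCy Doc (the name guiraud_index is illustrative, not the final API):

import math


def guiraud_index(doc):
    """Guiraud's index: V / sqrt(N), over the alphabetic tokens of a Doc."""
    words = [token.lower_ for token in doc if token.is_alpha]
    return len(set(words)) / math.sqrt(len(words))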

Unit tests should be added as well.

Docs should be updated as well, adding the following reference:

@misc{herdan1961problemes,
  title={Probl{\`e}mes et m{\'e}thodes de la statistique linguistique},
  author={Herdan, Gustav},
  year={1961},
  publisher={JSTOR}
}

Add word variation index

Word variation index can be thought of as the "density of ideas" in a text. It can be computed (following the OVIX formulation) as:

WVI = log(N) / log(2 - log(V) / log(N))

where V is the number of distinct words and N is the total number of words in the text.
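
A minimal sketch, assuming the OVIX formulation above (illustrative helper, not the final API; note the formula is undefined when every word is distinct, since the denominator becomes log(1) = 0):

import math


def word_variation_index(n_words: int, n_types: int) -> float:
    """WVI (OVIX): log(N) / log(2 - log(V) / log(N)). Assumes V < N."""
    return math.log(n_words) / math.log(
        2 - math.log(n_types) / math.log(n_words)
    )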

Implement Salience model for Entity Grid

Currently, EntityGrid features only support a uniform model for entities; however, salience features might improve the performance of applications using the entity grid model. This issue is to extend the model to consider salience features with the approach stated in:

Barzilay, R., & Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1), 1-34.

Error while running example

I've successfully installed TRUNAJOD and its dependencies. However, when I tried to run the example presented in the README, the following error occurred:

Traceback (most recent call last):
  File "main.py", line 44, in <module>
    nlp = spacy.load("es", disable=["ner", "textcat"])
  File "/home/matheus/.local/lib/python3.8/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/home/matheus/.local/lib/python3.8/site-packages/spacy/util.py", line 328, in load_model
    raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name]))
OSError: [E941] Can't find model 'es'. It looks like you're trying to load a model from a shortcut, which is deprecated as of spaCy v3.0. To load the model, use its full name instead:

nlp = spacy.load("es_core_news_sm")

For more details on the available models, see the models directory: https://spacy.io/models. If you want to create a blank model, use spacy.blank: nlp = spacy.blank("es")

Missing community guidelines

Both the documentation page and the README file should contain clear guidelines for third parties who wish to contribute to the software, report issues, or seek support. Here is an example:

If you wish to contribute to the software, report issues or problems, or seek support, feel free to use the issue tracker of this repository or to contact us.

Author's 1 name - [email protected]
Author's 2 name - [email protected]

Fix typo in Yule's K

The constant was set as 10^-4, but it should be 10^4. Therefore, we need to change this and also update the docstring.
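
For reference, a sketch with the corrected constant (illustrative, not necessarily the exact TRUNAJOD implementation). Yule's K is K = 10^4 * (sum over i of i^2 * V_i - N) / N^2, where V_i is the number of word types occurring i times and N is the total number of tokens.

from collections import Counter


def yules_k(tokens):
    """Yule's K with the 10**4 constant."""
    n = len(tokens)
    # Map each frequency i to V_i, the number of types occurring i times
    freq_of_freqs = Counter(Counter(tokens).values())
    m2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (m2 - n) / (n * n)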

Implement universal POS tags ratio

spaCy tokens carry both universal POS tags and detailed POS tags, available as the pos_ and tag_ properties of a spaCy Token, respectively. Currently, the function pos_ratio is somewhat misleading: its docstring describes pos_ tags, but it computes the ratio using tag_. This ticket is to fix this and implement a similar function using universal POS tags.

https://github.com/dpalmasan/TRUNAJOD2.0/blob/master/src/TRUNAJOD/surface_proxies.py#L515

Acceptance Criteria

  • Docstring is updated properly
  • Functionality to compute pos_ratio based on universal POS tags is added (see the sketch below).
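
A minimal sketch of such a function over token.pos_ (the name universal_pos_ratio and the regex-based parameter are assumptions, not the final API):

import re


def universal_pos_ratio(doc, pos_types: str) -> float:
    """Ratio of tokens whose universal POS tag (token.pos_) matches a regex."""
    pattern = re.compile(pos_types)
    matches = sum(1 for token in doc if pattern.match(token.pos_))
    return matches / len(doc)

For example, universal_pos_ratio(doc, "NOUN|VERB") would return the fraction of tokens tagged as nouns or verbs.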

Create factor analysis from model 5B discussed

There was a statistical analysis of what the factors are and what their factor loadings are; in particular, we consider the following factors:

  • Lexical Similarity
  • Connectivity
  • Referential Cohesion
  • Concreteness
  • Narrativity

Since we can already compute these factors, we need to implement the computation bits. This issue is to track that.

Implement a Hapax Legomena Index

This is defined as the number of words occurring only once in the text. It might be a good estimate when comparing texts of similar length.
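
A minimal sketch (illustrative helper, not the final API):

from collections import Counter


def hapax_legomena_index(tokens):
    """Number of word types occurring exactly once in the text."""
    counts = Counter(token.lower() for token in tokens)
    return sum(1 for c in counts.values() if c == 1)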

Figure out a way to store model without relying on external files

Several functions of our complexity text assessor rely on external databases (such as givenness with the MCR (Multilingual Central Repository), stopwords, the lemmatizer, etc.). We would like to provide an easy API for testing the TRUNAJOD library, such as the one spaCy provides with the nlp object. This issue is about investigating an alternative to achieve this.

Add unit tests and `pre-commit-hooks`

Currently the codebase is a mess, with no unit tests and no documentation. This issue is to track efforts to add unit tests to the current codebase and to do some refactoring to improve maintainability.
