brucewlee / lftk

[BEA @ ACL 2023] General-purpose tool for linguistic features extraction; Tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.

Home Page: https://lftk.rtfd.io/

License: Other

Python 100.00%
feature-extraction linguistic-features readability-scores spacy word-difficulty natural-language-processing text-analysis reading-time handcrafted-features python

lftk's Introduction

Student at the University of Pennsylvania.

I'm interested in language, intelligence, and measuring things.

See my open-source projects below and papers here Scholar.

lftk's People

Contributors

brucewlee, dangne, strickvl

lftk's Issues

Is this function correct for calculating NERF?

Hello!

I am trying to estimate text readability with the NERF formula. I ran into a dependency problem with the LingFeat library, as outlined in this GitHub issue, so I switched to the LFTK library, which worked.

However, the feature names differ between LingFeat and LFTK, so I am not sure that my mapping between them is correct. I also could not find the 'Constituency Parse Tree Height' variable in LFTK, so I computed that one feature with the LingFeat library, which worked.

Could you please confirm whether this correspondence is accurate, especially for the 'Constituency Parse Tree Height' variable?

Thank you in advance.

import spacy
import lftk
import math
import nltk
from supar import Parser


# load models
nlp = spacy.load('en_core_web_sm')
SuPar = Parser.load('crf-con-en')

def preprocess(doc, short=False, see_token=False, see_sent_token=True):
    n_token = 1
    n_sent = 1
    token_list = []
    #raw_token_list = []
    sent_token_list = []

    # sent_list is for raw string sentences
    sent_list = []

    # count tokens, sentence + make lists
    #for sent in self.NLP_doc.sents:
    for sent in doc.sents:
        n_sent += 1
        sent_list.append(sent.text)
        temp_list = []
        for token in sent:
            if token.text.isalpha():
                temp_list.append(token.text)
                if short == True:
                    n_token += 1
                    token_list.append(token.lemma_.lower())
                if short == False:
                    if len(token.text) >= 3:
                        n_token += 1
                        token_list.append(token.lemma_.lower())
        if len(temp_list) > 3:
            sent_token_list.append(temp_list)

    #self.n_token = n_token 
    #self.n_sent = n_sent
    #self.token_list = token_list
    #self.sent_token_list = sent_token_list
    
    result = {"n_token": n_token, 
                "n_sent":n_sent
                }

    if see_token == True:
        result["token"] = token_list
    if see_sent_token == True:
        result["sent_token"] = sent_token_list

    return result

def calculate_nerf(extracted_features):
    f = extracted_features
    # number of content words: nouns, verbs, numerals, adjectives, adverbs
    n_content = f['n_noun'] + f['n_verb'] + f['n_num'] + f['n_adj'] + f['n_adv']
    # first term: age-of-acquisition and word-frequency features, per sentence
    term_1 = (0.04876 * f['t_kup'] - 0.1145 * f['t_subtlex_us_zipf']) / f['t_sent']
    # second term: content words, nouns, and constituency tree height, per sentence
    term_2 = (0.3091 * n_content + 0.1866 * f['n_noun'] + 0.2645 * f['to_TreeH_C']) / f['t_sent']
    # third term: unique words over the square root of total words
    term_3 = (1.1017 * f['t_uword']) / math.sqrt(f['t_word'])
    return term_1 + term_2 + term_3 - 4.125


text = 'This is simple example sentence. This is another example sentence.'
doc = nlp(text)

# initiate LFTK extractor by passing in doc
LFTK = lftk.Extractor(docs = doc)
LFTK.customize(stop_words=True, punctuations=False, round_decimal=3)

preprocessed_features = preprocess(doc, short=False, see_token=False, see_sent_token=True)
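# NOTE: retrieve() is not defined or imported in this snippet; it appears to be the
# constituency-tree feature helper taken from LingFeat (assumption).
TrSF = retrieve(SuPar, preprocessed_features['sent_token'])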
feature_keys = ['t_kup', 't_subtlex_us_zipf', 't_sent', 'n_noun', 'n_verb', 'n_adj', 'n_adv', 'n_num', 't_uword', 't_word']

extracted_features = LFTK.extract(features = feature_keys)
extracted_features.update(TrSF)

# convert to float
extracted_features = {k: float(v) for k, v in extracted_features.items()}

print(calculate_nerf(extracted_features))
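
For readers without a working LingFeat install, the tree-height feature can be approximated directly from a bracketed constituency parse. The sketch below is a minimal illustration, not LingFeat's or LFTK's implementation: the helper name `bracketed_tree_height` is hypothetical, the example parse is hand-written, and it assumes the common convention in which a node that dominates only a word has height 2 (the convention used by NLTK's `Tree.height()`). Whether `to_TreeH_C` sums heights over sentences in exactly this way is not confirmed here.

```python
def bracketed_tree_height(parse: str) -> int:
    """Height of a non-empty bracketed constituency parse.

    Assumes Penn-Treebank-style brackets, with literal parentheses in the
    text already escaped (e.g. as -LRB- / -RRB-).
    """
    depth = 0
    max_depth = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse: too many ')'")
    if depth != 0:
        raise ValueError("unbalanced parse: missing ')'")
    # Each "(" opens one labelled node; the words sit one level below
    # the deepest labelled node, hence the +1.
    return max_depth + 1


# Hand-written example parse (an assumption, not parser output).
example = "(S (NP (DT This)) (VP (VBZ is) (NP (JJ simple) (NN example) (NN sentence))))"
height = bracketed_tree_height(example)
print(height)
```

Summing this value over all parsed sentences would give one candidate for the total that the formula divides by `t_sent`, but the result should be compared against LingFeat's output on the same text before relying on it.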

bug in the `total_number_of_unique_words_no_lemma` function

Hi, dear author, thank you for your awesome work! I want to bring to your attention a possible bug in the implementation of the total_number_of_unique_words_no_lemma function.

In lines 194-195 of lftk/foundation/wordsent.py, total_number_of_unique_words_no_lemma still lemmatizes the tokens, even though its name says that no lemmatization is applied. This makes it almost identical to total_number_of_unique_words.

As a result, corr_ttr returns the same value as corr_ttr_no_lem.
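
The expected difference between the two counts can be illustrated without spaCy. The sketch below is a standalone illustration, not LFTK's code: the `(surface form, lemma)` pairs are hypothetical stand-ins for spaCy tokens, the helper names are made up, lowercasing before counting is an assumption, and the corrected type-token ratio uses the standard definition of unique words divided by the square root of twice the total word count.

```python
import math

# Hypothetical (surface form, lemma) pairs standing in for spaCy tokens.
tokens = [
    ("The", "the"), ("cats", "cat"), ("chased", "chase"), ("the", "the"),
    ("cat", "cat"), ("chasing", "chase"), ("mice", "mouse"),
]


def count_unique_words(pairs, use_lemma):
    """Count distinct lowercased lemmas (use_lemma=True) or surface forms."""
    index = 1 if use_lemma else 0
    return len({pair[index].lower() for pair in pairs})


def corrected_ttr(n_unique, n_total):
    """Corrected type-token ratio: unique / sqrt(2 * total)."""
    return n_unique / math.sqrt(2 * n_total)


n_total = len(tokens)
n_unique_lemma = count_unique_words(tokens, use_lemma=True)
n_unique_no_lemma = count_unique_words(tokens, use_lemma=False)

print(n_unique_lemma, n_unique_no_lemma)
print(corrected_ttr(n_unique_lemma, n_total))
print(corrected_ttr(n_unique_no_lemma, n_total))
```

On text with inflected forms, the no-lemma count should be at least as large as the lemma count, so identical values for corr_ttr and corr_ttr_no_lem on such text would be consistent with the behavior reported above.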
