Student at the University of Pennsylvania.
I'm interested in language, intelligence, and measuring things.
See my open-source projects below and my papers on Scholar.
[BEA @ ACL 2023] General-purpose tool for linguistic feature extraction; tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.
Home Page: https://lftk.rtfd.io/
License: Other
Hello!
I want to estimate text readability with the NERF formula. I first tried the LingFeat library but ran into a dependency problem, as outlined in this GitHub issue.
I then switched to the LFTK library, which worked.
However, the feature names differ between LingFeat and LFTK, so I am not sure that my mapping between them is correct. I also could not find the 'Constituency Parse Tree Height' variable in LFTK, so for that feature I turned to the LingFeat library, which worked. Could you please confirm whether the correspondence in the code below is accurate, especially for 'Constituency Parse Tree Height'?
Thank you in advance.
import spacy
import lftk
import math
import nltk
from supar import Parser
# load models
nlp = spacy.load('en_core_web_sm')
SuPar = Parser.load('crf-con-en')
def preprocess(doc, short=False, see_token=False, see_sent_token=True):
    n_token = 1
    n_sent = 1
    token_list = []
    #raw_token_list = []
    sent_token_list = []
    # sent_list is for raw string sentences
    sent_list = []
    # count tokens, sentence + make lists
    #for sent in self.NLP_doc.sents:
    for sent in doc.sents:
        n_sent += 1
        sent_list.append(sent.text)
        temp_list = []
        for token in sent:
            if token.text.isalpha():
                temp_list.append(token.text)
                if short == True:
                    n_token += 1
                    token_list.append(token.lemma_.lower())
                if short == False:
                    if len(token.text) >= 3:
                        n_token += 1
                        token_list.append(token.lemma_.lower())
        if len(temp_list) > 3:
            sent_token_list.append(temp_list)
    #self.n_token = n_token
    #self.n_sent = n_sent
    #self.token_list = token_list
    #self.sent_token_list = sent_token_list
    result = {"n_token": n_token,
              "n_sent": n_sent
              }
    if see_token == True:
        result["token"] = token_list
    if see_sent_token == True:
        result["sent_token"] = sent_token_list
    return result
def calculate_nerf(extracted_features):
    f = extracted_features
    # content words: nouns, verbs, numerals, adjectives, adverbs
    content_words = f['n_noun'] + f['n_verb'] + f['n_num'] + f['n_adj'] + f['n_adv']
    return (
        (0.04876 * f['t_kup'] - 0.1145 * f['t_subtlex_us_zipf']) / f['t_sent']
        + (0.3091 * content_words + 0.1866 * f['n_noun'] + 0.2645 * f['to_TreeH_C']) / f['t_sent']
        + (1.1017 * f['t_uword']) / math.sqrt(f['t_word'])
        - 4.125
    )
text = 'This is simple example sentence. This is another example sentence.'
doc = nlp(text)
# initiate LFTK extractor by passing in doc
LFTK = lftk.Extractor(docs = doc)
LFTK.customize(stop_words=True, punctuations=False, round_decimal=3)
preprocessed_features = preprocess(doc, short=False, see_token=False, see_sent_token=True)
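# NOTE: retrieve() is not defined in this snippet; it is presumably adapted from
# LingFeat's tree-structure feature code and expected to return 'to_TreeH_C'.
TrSF = retrieve(SuPar, preprocessed_features['sent_token'])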
feature_keys = ['t_kup', 't_subtlex_us_zipf', 't_sent', 'n_noun', 'n_verb', 'n_adj', 'n_adv', 'n_num', 't_uword', 't_word']
extracted_features = LFTK.extract(features = feature_keys)
extracted_features.update(TrSF)
# convert to float
extracted_features = {k: float(v) for k, v in extracted_features.items()}
print(calculate_nerf(extracted_features))
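For the 'Constituency Parse Tree Height' part of the question, here is a minimal sketch of how such a value could be computed from bracketed parses. It assumes the feature is the sum of per-sentence tree heights and that height follows the nltk.Tree.height() convention (a preterminal over a single word has height 2). The function names and the hand-written parses are hypothetical; in the script above the parses would come from the SuPar model instead.

```python
def bracket_tree_height(parse: str) -> int:
    """Height of a bracketed constituency parse, counting the leaf level.

    Assumption: same convention as nltk.Tree.height(), i.e. the deepest
    bracket nesting plus one for the words themselves.
    """
    depth = 0
    max_depth = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse: too many ')'")
    if depth != 0:
        raise ValueError("unbalanced parse: missing ')'")
    if max_depth == 0:
        raise ValueError("no bracketed tree found")
    return max_depth + 1


def total_tree_height(parses: list[str]) -> int:
    """Sum of per-sentence heights (assumed meaning of a 'total' height feature)."""
    return sum(bracket_tree_height(p) for p in parses)


# Hand-written parses for illustration only (not SuPar output).
parses = [
    "(S (NP (DT This)) (VP (VBZ is) (NP (DT an) (NN example))))",
    "(S (NP (PRP It)) (VP (VBZ works)))",
]
heights = [bracket_tree_height(p) for p in parses]
print(heights, total_tree_height(parses))
```

Comparing a value computed this way against 'to_TreeH_C' on a few short sentences is one way to check whether the two libraries agree on what "height" means.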
Hello,
I was able to build the package and create the .whl file, but I could not import it.
Could you please check why, or provide a working file?
The .whl file is here:
https://drive.google.com/drive/folders/1SnDOXHBCpyBH1nSO72lQgqB1QDjiOxz2?usp=sharing
I need to install it without an internet connection and without the .tar.gz file. Could you please suggest how?
Thank you
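One possible way to install from a local wheel without network access is pip's --no-index and --find-links options. The sketch below only builds and prints the command; the folder name "./wheels" and the helper name are hypothetical, and the folder would also need wheels for every dependency the package declares. It does not address why the import fails.

```python
import subprocess
import sys
from pathlib import Path


def offline_install_command(wheel_dir: str, package: str = "lftk") -> list[str]:
    """Build a pip command that resolves packages only from a local folder."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                            # never contact a package index
        "--find-links", str(Path(wheel_dir)),    # look for wheels in this folder
        package,
    ]


cmd = offline_install_command("./wheels")  # "./wheels" is a hypothetical folder
print(" ".join(cmd))
# To actually run the installation, uncomment the next line:
# subprocess.run(cmd, check=True)
```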
Hi, dear author, thank you for your awesome work! I want to bring your attention to a possible bug in the implementation of the total_number_of_unique_words_no_lemma function.
In lines 194-195 of lftk/foundation/wordsent.py, the total_number_of_unique_words_no_lemma function still lemmatizes the tokens, even though its name says no lemmatization should be applied. This makes it almost identical to total_number_of_unique_words.
As a result, corr_ttr returns the same value as corr_ttr_no_lem.
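To illustrate why the two counts should differ, here is a small sketch with hand-written (surface form, lemma) pairs standing in for spaCy tokens. The helper names are hypothetical, and the corrected type-token ratio is assumed to be Carroll's formula (unique words divided by the square root of twice the total); LFTK's exact definition may differ.

```python
import math

# (surface form, lemma) pairs; hand-written stand-ins for spaCy tokens.
tokens = [
    ("cats", "cat"),
    ("cat", "cat"),
    ("ran", "run"),
    ("runs", "run"),
    ("fast", "fast"),
]


def unique_words(tokens, use_lemma: bool) -> int:
    """Count distinct words, keyed on the lemma or on the surface form."""
    index = 1 if use_lemma else 0
    return len({pair[index].lower() for pair in tokens})


def corrected_ttr(n_unique: int, n_total: int) -> float:
    """Carroll's corrected type-token ratio (assumed definition)."""
    return n_unique / math.sqrt(2 * n_total)


with_lemma = unique_words(tokens, use_lemma=True)
without_lemma = unique_words(tokens, use_lemma=False)

print(with_lemma, without_lemma)
print(corrected_ttr(with_lemma, len(tokens)),
      corrected_ttr(without_lemma, len(tokens)))
```

If the no-lemma variant keys on the lemma as well, both counts collapse to the same number and the two ratios become identical, which matches the behavior reported above.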