trinker / sentimentr
Dictionary based sentiment analysis that considers valence shifters
License: Other
See: trinker/textshape@9761b54
Handling of quotes and No.
Export update_polarity_table and update_valence_shifter_table, and add examples to as_key; see the stansent implementation.
Hi,
As part of preparing to teach sentimentr, I prepared examples showing the calculation process. However, I can't reconcile the difference between my interpretation and the computed sentiment value.
sample <- c("You're crazy, but I love you.")
sentiment(sample, n.before = 2, n.after = 2, amplifier.weight = .8, but.weight = .9)
##    element_id sentence_id word_count sentiment
## 1:          1           1          6 0.6205374
(-1 * (1 - .9) + 1 * (1.9)) / sqrt(6)
## [1] 0.7348469
Thanks
Rick
A parallel option that runs sentiment and sentiment_by on multiple cores.
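A minimal sketch of how this could look, assuming independent texts: chunk the input, score each chunk on its own core with parallel::mclapply, and row-bind the results (parallel_sentiment_by and the chunking scheme are hypothetical, not part of the package):

library(parallel)
library(data.table)
library(sentimentr)

parallel_sentiment_by <- function(texts, cores = max(1, detectCores() - 1), ...) {
    ## split the texts into roughly equal chunks, one per core
    chunks <- split(texts, cut(seq_along(texts), cores, labels = FALSE))
    ## mclapply forks, so on Windows this falls back to sequential
    out <- mclapply(chunks, function(ch) sentiment_by(ch, ...), mc.cores = cores)
    ## element_id restarts within each chunk; re-key after binding if needed
    rbindlist(out, idcol = "chunk")
}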
Mean Directional Accuracy looks at the signs of the predictions against the actuals, which is appropriate for sentiment: https://en.wikipedia.org/wiki/Mean_Directional_Accuracy_(MDA)
Reference this in the README
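A minimal sketch of the sign-agreement reading described above (mda is an illustrative name; the Wikipedia formulation compares directions of change between consecutive points rather than raw signs):

mda <- function(predicted, actual) {
    ## proportion of observations whose predicted sign matches the actual sign
    mean(sign(predicted) == sign(actual))
}
mda(c(.5, -.2, .1), c(1, -1, -1))  ## 0.6666667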
Hi,
First of all, thanks a lot for writing this R package! I'm currently experimenting with political texts and really like it so far.
Is it also possible to use sentimentr with languages other than English, maybe by enabling the use of custom dictionaries?
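For what it's worth, sentiment() accepts a custom polarity table via the polarity_dt argument, which in principle enables other languages; a hedged sketch (the German words and scores are illustrative only, not a vetted lexicon, and the valence shifters would need translating too):

library(sentimentr)
de_key <- as_key(data.frame(
    words = c("gut", "großartig", "schlecht", "furchtbar"),
    polarity = c(1, 1, -1, -1),
    stringsAsFactors = FALSE
))
sentiment("Die Rede war großartig.", polarity_dt = de_key)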
I am running some polarity computations through the sentiment() function. What I am experiencing, even for small pieces of text, is a huge amount of allocated RAM. Sometimes I also get the following error:
Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", :
  negative length vectors are not allowed
Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table
Execution halted
A character vector of 669 kB (computed through object_size() in the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This causes some problems, as you can imagine, when texts get longer.
I know you developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.
Do you have any hints, or are you aware of this issue?
I am not including a minimal example since this analysis can easily be performed through the profiling tool in RStudio.
Thanks
A better linguistic term is "intensifier" and it should be referenced as such: https://en.wikipedia.org/wiki/Intensifier
## > sentimentr::get_sentences2('I went to AU. Awesome school.')
## [[1]]
## [1] "I went to AU." "Awesome school."
## > sentimentr::get_sentences2('I went to AU. Awesome school.')
## [[1]]
## [1] "I went to AU. Awesome school."
http://www.davidbamman.com/?p=52
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr, readr, rvest)
pacman::p_load_current_gh(file.path('trinker', c('sentimentr', 'textclean', 'textshape')))
url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/rj_annotations.txt'
raj_url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/turk_emotion1.html'
## Scrape the scene text from each linked annotation page.
text_dat <- raj_url %>%
    read_html() %>%
    html_nodes(xpath = '//td/a') %>%
    html_attr("href") %>%
    lapply(function(x){
        x %>%
            read_html() %>%
            html_nodes(xpath = '//blockquote') %>%
            html_text() %>%
            replace_white() %>%
            trimws() %>%
            textshape::combine()
    }) %>%
    bind_list('scene_index', 'text') %>%
    tbl_df() %>%
    mutate(scene_index = as.integer(scene_index) - 1)

## Read the human annotations and average the ratings per scene.
rating_dat <- read_delim(url, "\t") %>%
    tbl_df() %>%
    filter(!is.na(rationale)) %>%
    group_by(scene_index) %>%
    summarize(rating = mean(rating))
## Join ratings to text and score with each method.
dat <- left_join(rating_dat, text_dat)
dat[['sentimentr']] <- dat %>%
    with(sentiment_by(text)[[4]] * 4)
dat[['sentimentr_jockers']] <- dat %>%
    with(sentiment_by(text, polarity_dt = lexicon::hash_sentiment_jockers)[[4]])
dat[['syuzhet']] <- syuzhet::get_sentiment(dat[['text']], method = "syuzhet")

## Compare the signs of the predictions against the human ratings.
dat %>%
    mutate(
        diff_sentimentr = abs(sign(rating) - sign(sentimentr)),
        diff_syuzhet = abs(sign(rating) - sign(syuzhet))
    ) %>%
    select(-text) %>%
    print(n = Inf)

dat %>%
    mutate(rating = sign(rating), sentimentr = sign(sentimentr)) %>%
    with(table(rating, sentimentr))

dat %>%
    mutate(rating = sign(rating), sentimentr_jockers = sign(sentimentr_jockers)) %>%
    with(table(rating, sentimentr_jockers))

dat %>%
    mutate(rating = sign(rating), syuzhet = sign(syuzhet)) %>%
    with(table(rating, syuzhet))
A difficult task for 100% accuracy, but there may be key features that are highly correlated with a sarcastic comment that would improve sentiment detection. The idea isn't super-accurate sarcasm detection... just adding accuracy to the sentiment detection by identifying very likely sarcastic phrases.
Possible leading n-grams to consider:
These functions need to lower-case terms and warn if needed.
Use a termco-style dictionary to replace an emoticon with an accompanying emotion using https://en.wikipedia.org/wiki/List_of_emoticons
Check whether a single multi-pattern regex replacement is faster or a separate string for each emoticon with fixed = TRUE is faster; see the benchmark sketch after the list below.
list(
    ` happy ` = '\\b(:-\\)|:\\)|:D|:o\\)|8\\)|=\\)|:\\}|:\\^\\))\\b',
    ` laugh ` = '\\b(:-D|8-D|8D|x-D|xD|X-D|XD|=-D|=D|B\\^D)\\b',
    ` very happy ` = '\\b:-\\)\\)\\b'
)
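A hedged benchmark sketch for the speed question above; the mapping and replacement tokens are hypothetical and only two emoticons are timed:

library(microbenchmark)
x <- rep("I am :) today but I was :-( yesterday", 1000)
fixed_map <- c(`:)` = " happy ", `:-(` = " sad ")
regex_map <- c(`:\\)` = " happy ", `:-\\(` = " sad ")
microbenchmark(
    fixed = {
        ## one literal gsub per emoticon
        out <- x
        for (i in seq_along(fixed_map))
            out <- gsub(names(fixed_map)[i], fixed_map[[i]], out, fixed = TRUE)
        out
    },
    regex = {
        ## same replacements with escaped regex patterns
        out <- x
        for (i in seq_along(regex_map))
            out <- gsub(names(regex_map)[i], regex_map[[i]], out)
        out
    }
)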
Hi, I am using the sentimentr and qdap packages for polarity scoring on text data. What I observed is that when I use bigrams or longer n-grams in the polarity table, which I created separately for polarity scoring, sentimentr does not give any polarity score, but qdap is able to score it.
Could you please help me figure out how to modify this for the sentimentr package? I want to use sentimentr since it is faster than qdap. I am also converting the polarity table with as_key for usage.
When using extract_sentiment_terms in the context of a data.table, it does not return positive and neutral columns, which causes the function to fail on the records below it.
datatable[, Positive.Terms :=
    extract_sentiment_terms(Comment)[, positive], by = "Comment"]
Error in `[.data.table`(extract_sentiment_terms("Attach files by dropping, Choose Files selecting them from the clipboard."), :
  Variable 'positive' is not found in calling scope. Looking in calling scope because either you used the .. prefix or set with=FALSE
This issue causes the records below the failed records to fail as well.
Is there a better way to use this function in the context of a data.table to overcome this problem?
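One hedged workaround sketch, assuming the datatable/Comment setup above and that extract_sentiment_terms may drop the positive column entirely when nothing positive is found: run the extraction once over the whole column instead of per group, then guard before attaching.

terms <- extract_sentiment_terms(datatable$Comment)  ## one row per sentence
if ("positive" %in% names(terms)) {
    ## assumes one sentence per Comment so rows line up with datatable
    datatable[, Positive.Terms := terms$positive]
} else {
    datatable[, Positive.Terms := list(list(character(0)))]
}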
A function that takes a dictionary and text and extracts the positive and negative terms as a list, with extraction at the sentence level. A plot method with all words, sentiment words colored red/green, and other words in grey. It may also do a bar chart of the top n sentiment terms.
Possibly store as a termco object and have termco do the heavy lifting.
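For what it's worth, the package's extract_sentiment_terms() appears to be the extraction half of this idea; a quick usage sketch (the output shape may vary by version):

library(sentimentr)
extract_sentiment_terms("I love pie but hate spinach. The crust is fine.")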
Sentence splitting will be moved outside of the sentiment function and handled by a re-import of textshape::split_sentence. The sentiment and sentiment_by functions will give a warning if the object is not a sentence-split object (a class may need to be added in the re-import).
This will:
Be careful to do this for all functions that use sentence splitting, such as highlighting and word extraction. This is a justifiable breaking change that will bump the major version. A hedged sketch of the intended flow is below.
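A hedged sketch of the intended usage after the change (the exact warning behavior is my assumption):

library(sentimentr)
sents <- get_sentences("I love it. It is bad.")  ## split once, up front
sentiment(sents)      ## operates on the pre-split object; no re-splitting
sentiment_by(sents)   ## same; passing raw text here would trigger the warning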
http://www.rblog.uni-freiburg.de/2017/02/21/sentiment-analysis-in-r/
Package: SentimentAnalysis
Halliday, in Cohesion in English, calls these adversative conjunctions. Add this to the documentation.
The documentation is missing any mention of conjunctions (4) in the description of the types of valence shifter words.
I'm getting an error when using the original.text parameter of the highlight() method.
highlight(temp_sentiment, file = file.path("highlighted", paste(prod, "_highlighted-text.html", sep = "")), original.text = filtered[['Description']], open = FALSE)
Error in `[[<-.data.frame`(`*tmp*`, "txt", value = c("Feedback: \nAR - Goal History/Diagnostic Report....", :
  replacement has 3649 rows, data has 3658
I've tried debugging and I think the problem may be around line 50 of Highlight.R
if (!is.null(original.text)){
    txt <- get_sentences2(original.text)
} else {
    txt <- get_sentences(x)
}
y[["txt"]] <- unlist(txt)
y[, txt := ifelse(polarity == "", txt, sprintf("<mark class = \"%s\">%s</mark>", polarity, txt))]
If I set original.text to NULL (as is the default), it works fine. I'm not quite sure why there's a size difference.
I also tried setting missing_value = NULL for the sentiment_by call, but that didn't change anything.
Do you have any suggestions? Thanks.
seems like
of like
bit like
sorta like
kinda like
is like
was like
it's like
he's like
she's like
we're like
they're like
vs. like as a positive polarity word.
Does it change accuracy?
I am wondering if someone has encountered the following error in the past:
Error in sprintf(paste("%-", Q, "s", sep = ""), col) :
  required resulting string length 31315 is greater than maximal 8192
A workaround would be appreciated.
a <- "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives). I think the pop quizzes were good."
> sentiment(a)
element_id sentence_id word_count sentiment
1: 1 1 7 0.6803361
2: 1 2 9 0.0000000
3: 1 3 7 0.3779645
> get_sentences2(a)
[[1]]
[1] "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives)." "I think the pop quizzes were good."
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'sentiment_by'
‘averaging.function’
Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
Running ‘testthat.R’
OK
* checking PDF version of manual ... OK
* DONE
Status: 1 WARNING
See
‘/home/travis/build/trinker/sentimentr/sentimentr.Rcheck/00check.log’
for details.
update_key(valence_shifters_table, x = data.frame(x = c("exceedingly", "remarkably", "especially"), y = c(2, 2, 2)))
update_key(valence_shifters_table, x = data.frame(x = c("excessively", 'overly', 'unduly', 'too much', 'too many', 'too often', 'I wish', 'too good', 'too high', 'to tough'), y = c(rep(-2, 8), rep(-1, 2))))
Maybe:
library(sentimentr)  ## for general_rescale()

average_absolute_rescaled_residual <- function (predicted, actual) {
    stopifnot(length(actual) == length(predicted))
    ## rescale predictions to [-1, 1], then halve the mean absolute
    ## residual so the result lies in [0, 1]
    mean(abs(actual - general_rescale(predicted))) / 2
}
average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(1, -1))
average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(-1, 1))
average_absolute_rescaled_residual(predicted = rep(0, 10), actual = c(rep(-1, 5), rep(1, 5)))
average_absolute_rescaled_residual(predicted = c(-1, -1, .2, -.5, 1, 1.4, .7, .9, -1, .2), actual = c(rep(-1, 5), rep(1, 5)))
swafford <- c(
"I haven't been sad in a long time.",
"I am extremely happy today.",
"It's a good day.",
"But suddenly I'm only a little bit happy.",
"Then I'm not happy at all.",
"In fact, I am now the least happy person on the planet.",
"There is no happiness left in me.",
"Wait, it's returned!",
"I don't feel so bad after all!"
)
library(dplyr)
out <- sentiment(swafford) %>%
mutate(
actual = c(.8, 1, .8, -.1, -.5, -1, -1, .5, .6),
swafford = swafford
)
out %>%
with(average_absolute_rescaled_residual(sentiment, actual))
Hi @trinker,
I want to make a sentiment analysis tool for Facebook comments, and your package seems very nice.
But can I analyse text in Portuguese if I have a dictionary with positive/negative words?
Thanks!!
sentiment_by gets by = NULL so that the by argument is optional.
uncombine to get the by-level info back from an environment in the class (geom_smooth style).
get_sentences from sentiment & sentiment_by to extract the sentences back.
get_sents used under the hood by sentiment & sentiment_by.
uncombine from sentiment_by to extract the sentence-level polarity back.

Often people will include kind sentences when speaking unkind sentences as a polite convention. Do the kind ones truly negate the negative ones? No. They may indicate a less hostile tone, but overall the tone is still hostile. Maybe a general weighting function is in order that up- and down-weights the group-by averaging according to this convention.
This is not the default but is recommended for short opinion texts like reviews or evaluations.
weighted_sentiment_average <- function (x, mixed.less.than.zero.weight = 4, na.rm = TRUE, ...) {
    if (any(x > 0) && any(x < 0)) {
        ## mixed polarity: up-weight the negative sentences
        numerator <- sum(x[x < 0 & !is.na(x)]) * mixed.less.than.zero.weight + sum(x[x > 0 & !is.na(x)])
    } else {
        numerator <- sum(x, na.rm = na.rm)
    }
    ## denominator down-weights zero-sentiment sentences
    numerator / (sum(x != 0, na.rm = na.rm) + sqrt(log(1 + sum(x == 0, na.rm = na.rm))))
}
weighted_sentiment_average(c(-1))
weighted_sentiment_average(c(-1, 1))
weighted_sentiment_average(c(-1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1, 1, 0))
weighted_sentiment_average(c(-1, -1, -1, -1, 1, 1, 1))
funny is typically positive and it's currently negative.
least should be a deamplifier.
understand(ing|s)* should probably be a positive word.
sentiment("Not very effective, hard to understand, just read from power point, couldn't understand.")
sentimentr::polarity_table["understand",]
sentimentr::polarity_table["understands",]
sentimentr::polarity_table["understanding",]
sentimentr::polarity_table["funny",]
sentimentr::polarity_table["hilarious",]
sentimentr::polarity_table["least",]
sentimentr::valence_shifters_table["least",] #deamplifier
> sentimentr::valence_shifters_table["least",]
x y
1: least NA
> sentimentr::polarity_table["understand",]
x y
1: understand NA
> sentimentr::polarity_table["understands",]
x y
1: understands NA
> sentimentr::polarity_table["understanding",]
x y
1: understanding NA
> sentimentr::polarity_table["funny",]
x y
1: funny -1
> sentimentr::polarity_table["hilarious",]
x y
1: hilarious 1
> sentimentr::polarity_table["least",]
x y
1: least NA
> sentimentr::valence_shifters_table["least",] #negator
x y
1: least NA
What is being polarized? Find Subject.
sentiment("sucked. most of the stuff does not work with my phone.")
# element_id sentence_id word_count sentiment
#1: 1 1 1 -1.0000000
#2: 1 2 10 -0.3162278
sentiment("sucked, most of the stuff does not work with my phone.")
# element_id sentence_id word_count sentiment
#1: 1 1 11 0
sentiment("sucked most of the stuff does not work with my phone.")
# element_id sentence_id word_count sentiment
#1: 1 1 11 -0.6030227
Hi,
The R version requirement says Depends: R (≥ 3.1.0). However, when I try to load the package after installing it, this is the error I get:
library(sentimentr)
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘textclean’
In addition: Warning message:
package ‘sentimentr’ was built under R version 3.4.1
Error: package or namespace load failed for ‘sentimentr’
Is this an error in the code, or has the page not been updated to say it only works for R > 3.4?
Regards,
Ren.
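A hedged note: the load failure itself points at the missing textclean dependency rather than the R version, so installing it directly should resolve the load error:

install.packages("textclean")
library(sentimentr)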
> sentiment('surprised and shocked.')
element_id sentence_id word_count sentiment
1: 1 1 3 0
Warning message:
In `[.data.table`(sent_dat, , `:=`("words", make_words(space_fill(sentences, :
Supplied 3 items to be assigned to 1 items of column 'words' (2 unused)
Finding it strange. The sentence "Crashing tv isn't showing" yields a sentiment score of 0.5.
Sentiment for "Crashing TV" yields -0.70.
Sentiment for "isn't showing" yields 0.
Sentiment for "isn't" yields 0, which is surprising because I have "isn't" as a negator in my valence table.
There were only a couple of additions to the valence table and the polarity table, and none of them should have any impact in the context of this sentence.
Any idea what is wrong?
sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table,
valence_shifters_dt = vs_table)
sentiment_by("Crashing tv", by = NULL, polarity_dt = pk_table,
valence_shifters_dt = vs_table)
sentiment_by("isn't showing", by = NULL, polarity_dt = pk_table,
valence_shifters_dt = vs_table)
valence_shifters_dt = vs_table)
A negative score
One used to be able to use :: as a way to add a package to yours if its data/functions were used as an argument. This appears to no longer be the case: https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/sentimentr-00check.html
checking dependencies in R code ... NOTE
Namespaces in Imports field not imported from:
‘lexicon’ ‘syuzhet’
All declared Imports should be used.
Maybe it's as simple as adding @importFrom in the roxygen, or maybe a newer version of roxygen does this, or maybe this needs to be brought up with roxygen...
A dummy example for posting if need be:
library(janeaustenr)
library(stringi)

#' Compare Number of Words
#'
#' A silly little function of no consequence.
#'
#' @param text A text string(s)
#' @param comparison The text you want to compare word counts against
#' @param \ldots ignored
#' @export
#' @examples
#' library(janeaustenr)
#' more_words_than(janeaustenr::northangerabbey)
more_words_than <- function(text, comparison = janeaustenr::emma, ...){
    sum(stringi::stri_count_words(text)) > sum(stringi::stri_count_words(comparison))
}
I believe that zero-sentiment sentences may have too much influence on averaged sentiment (over-smoothing) because a zero likely doesn't carry the same semantic/affective weight as a non-zero sentiment.
Here are some possibilities for dealing with the issue (down-weighting the influence of zeros in the denominator).
x <- c(1, 2, 0, 0, 0, -1)
sum(x) / length(x)                                   ## plain mean: zeros count fully
sum(x) / (sum(x != 0) + sum(x == 0)^(1/3))           ## cube-root down-weighting of zeros
sum(x) / (sum(x != 0) + sqrt(log(1 + sum(x == 0))))  ## sqrt-log down-weighting of zeros
stringi can likely do OS-independent conversion of (escape-sequence) bytes to Unicode (or something like that) that can then be used to look up the emoji name. Use the regex to pull them out.
Then use the Unicode package to go the rest of the way: u_char_name(as.u_char('U+2702')).
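A hedged sketch of that pipeline (the Any-Hex/Unicode transliterator is my guess at the right stringi incantation):

library(stringi)
library(Unicode)

emo <- "\u2702"                                   ## a scissors character
cp <- stri_trans_general(emo, "Any-Hex/Unicode")  ## "U+2702"
u_char_name(as.u_char(cp))                        ## "BLACK SCISSORS"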
sentiment_by.character <- function(text.var, by = NULL,
averaging.function = average_downweighted_zero