
sentimentr's Issues

Valence calculation for conjunction.

Hi,

As part of preparing to teach sentimentr, I worked up examples showing the calculation process. However, I can't reconcile the difference between my hand calculation and the value sentiment() computes.

sample <- c("You're crazy, but I love you.")
sentiment(sample, n.before = 2, n.after = 2, amplifier.weight = .8, but.weight = .9)
##    element_id sentence_id word_count sentiment
## 1:          1           1          6 0.6205374

(-1 * (1 - .9) + 1 * (1.9)) / sqrt(6)
## [1] 0.7348469
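To probe how but.weight enters the score, one quick sketch (reusing sample and the call above; this only shows how the computed value moves with the weight, not the internal formula):

# vary but.weight while holding everything else fixed and watch the score
sapply(c(0, .25, .5, .75, .9, 1), function(w) {
    sentiment(sample, n.before = 2, n.after = 2,
              amplifier.weight = .8, but.weight = w)[["sentiment"]]
})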

Thanks

Rick

Other languages

Hi,

First of all, thanks a lot for writing this R package! I'm currently experimenting with political texts and really like it so far.

Is it also possible to use sentimentr with languages other than English, perhaps by enabling the use of custom dictionaries?
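A minimal sketch of what the custom-dictionary route might look like (the two German entries are placeholders, and the built-in English valence shifters would still apply unless a translated valence_shifters_dt were supplied as well):

library(sentimentr)

# hypothetical two-word dictionary; as_key() builds the keyed data.table
# that sentiment() expects in polarity_dt
my_key <- as_key(data.frame(
    words    = c("gut", "schlecht"),
    polarity = c(1, -1),
    stringsAsFactors = FALSE
))

sentiment("Das Buch ist gut.", polarity_dt = my_key)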

High memory consumption when using sentiment() function

I am running some polarity computations through the sentiment() function. What I am experiencing is, even for small pieces of text, a huge amount of allocated RAM. Sometimes I also get the following error:

Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", : negative length vectors are not allowed Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table Execution halted

A character vector of 669 kB (computed through object_size() from the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This causes problems, as you can imagine, when texts get longer.

I know you have developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.

Do you have any hints, or are you aware of this issue?
I am not including a minimal example since this analysis can easily be reproduced with the profiling tool in RStudio.
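Not a fix, but a hedged workaround sketch while the allocation issue is open: score the vector in chunks so the intermediate data.table objects stay small, then bind the pieces back together (score_in_chunks and chunk_size are made-up names):

library(sentimentr)
library(data.table)

score_in_chunks <- function(txt, chunk_size = 500) {
    idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
    # note: element_id restarts within each chunk; the chunk column keeps the pieces apart
    rbindlist(lapply(idx, function(i) sentiment(txt[i])), idcol = "chunk")
}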

Thanks

Sentence splitting wonky for multicaps followed by a double space and new sentence

sentimentr::get_sentences2('I went to AU. Awesome school.')
sentimentr::get_sentences2('I went to AU.  Awesome school.')

## > sentimentr::get_sentences2('I went to AU. Awesome school.')
## [[1]]
## [1] "I went to AU."   "Awesome school."

## > sentimentr::get_sentences2('I went to AU.  Awesome school.')
## [[1]]
## [1] "I went to AU.  Awesome school."

add discussion of and testing from David Bamman

http://www.davidbamman.com/?p=52

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr, readr, rvest)
pacman::p_load_current_gh(file.path('trinker', c('sentimentr', 'textclean', 'textshape')))

url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/rj_annotations.txt'
raj_url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/turk_emotion1.html'

text_dat <- raj_url %>%
    read_html() %>%
    html_nodes(xpath='//td/a') %>% 
    html_attr("href") %>%
    lapply(function(x){
        x %>%
            read_html() %>%
            html_nodes(xpath='//blockquote') %>% 
            html_text() %>%
            replace_white() %>%
            trimws() %>%
            textshape::combine()
    }) %>%
    bind_list('scene_index', 'text') %>%
    tbl_df() %>%
    mutate(scene_index = as.integer(scene_index) - 1)



rating_dat <- read_delim(url, "\t") %>%
    tbl_df() %>%
    filter(!is.na(rationale)) %>%
    group_by(scene_index) %>%
    summarize(rating = mean(rating)) 

dat <- left_join(rating_dat, text_dat)




dat[['sentimentr']] <- dat %>%
    with(sentiment_by(text)[[4]] * 4)

dat[['sentimentr_jockers']] <- dat %>%
    with(sentiment_by(text, polarity_dt = lexicon::hash_sentiment_jockers)[[4]])


dat[['syuzhet']] <- syuzhet::get_sentiment(dat[['text']], method="syuzhet")


dat %>%
   mutate(
       diff_sentimentr = abs(sign(rating) - sign(sentimentr)),
       diff_syuzhet = abs(sign(rating) - sign(syuzhet))
   ) %>%
   select(-text) %>%
   print(n = Inf)


dat %>%
   mutate(rating=sign(rating), sentimentr=sign(sentimentr)) %>%
   with(table(rating, sentimentr))

dat %>%
   mutate(rating=sign(rating), sentimentr_jockers=sign(sentimentr_jockers)) %>%
   with(table(rating, sentimentr_jockers))


dat %>%
   mutate(rating=sign(rating), syuzhet=sign(syuzhet)) %>%
   with(table(rating, syuzhet))


Include sarcasm in the algorithm

Difficult task for 100% accuracy, but there may be key features that are highly correlated with sarcastic comments and that would improve sentiment detection. The idea isn't highly accurate sarcasm detection, just adding accuracy to the sentiment detection by identifying very likely sarcastic phrases.

Possible leading n-grams to consider (a rough regex sketch follows the list):

  1. I (love|like) how (negative situation)
  2. As if (pronoun/proper noun)
  3. It's (really|so)* (awesome|great) how (negative context)
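A hypothetical sketch of flagging these leading n-grams (the patterns and the sarcasm_flag() name are illustrative only, not a proposed implementation):

sarcasm_flag <- function(x) {
    patterns <- c(
        "\\bI (love|like) how\\b",
        "\\bas if (he|she|it|they|you|[A-Z][a-z]+)\\b",
        "\\bit'?s (really |so )?(awesome|great) how\\b"
    )
    # TRUE if any of the leading n-gram patterns appears in the sentence
    Reduce(`|`, lapply(patterns, function(p) grepl(p, x, ignore.case = TRUE)))
}

sarcasm_flag(c("I love how the printer never works.", "The printer works."))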

Incorrect number of word calculations

sentiment_attributes("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.")[1]
## [screenshot of output omitted]

str_count("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.", "\\S+")
## [screenshot of output omitted]
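For a quick third reference outside both of the calls above, stringi's word counter can be run on the same sentence (sketch):

stringi::stri_count_words("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.")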

Add emoticon replacement

Use a termco-style dictionary to replace an emoticon with an accompanying emotion, using https://en.wikipedia.org/wiki/List_of_emoticons

Check whether a multi-pattern regex replacement is faster or a separate string for each emoticon with fixed = TRUE is faster.

list(
    ` happy ` = '\\b(:-\\)|:\\)|:D|:o\\)|8\\)|=\\)|:\\}|:\\^\\))\\b', 
    ` laugh ` = '\\b(:-D|8-D|8D|x-D|xD|X-D|XD|=-D|=D|B\\^D)\\b',
    ` very happy `  = '\\b:-\\)\\)\\b'
)
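A hedged sketch of the timing comparison described above (the three-entry emoticon list and the test text are illustrative, not the full Wikipedia table):

library(stringi)
library(microbenchmark)

emoticons <- c(":-)" = " happy ", ":D" = " laugh ", ":-((" = " very sad ")
x <- rep("Great class :-) but the quiz :-(( was hard :D", 1000)

microbenchmark(
    regex = stri_replace_all_regex(x,
        pattern = c(":-\\)", ":D", ":-\\(\\("),
        replacement = unname(emoticons), vectorize_all = FALSE),
    fixed = stri_replace_all_fixed(x,
        pattern = names(emoticons),
        replacement = unname(emoticons), vectorize_all = FALSE),
    times = 10
)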

Issue with bi- gram or more in Polarity table

Hi, I am using the sentimentr and qdap packages for polarity scoring on text data. What I observed is that when I use a bigram (or longer n-gram) in the polarity table, which I created separately for polarity scoring, sentimentr does not give any polarity score, but qdap is able to score it.

Could you please help me figure out how to modify this for the sentimentr package? I want to use sentimentr since it's faster than qdap. I am also converting the polarity table with as_key() for use.
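One possible (untested here) workaround, since sentimentr appears to match single tokens, which is what this issue reports: collapse the multi-word entry to a single token in both the key and the text before scoring (the notbad token and example text are illustrative):

library(sentimentr)

key <- as_key(data.frame(
    words    = c("notbad", "terrible"),
    polarity = c(1, -1),
    stringsAsFactors = FALSE
))

txt <- gsub("not bad", "notbad", "The food was not bad, the service was terrible.")
sentiment(txt, polarity_dt = key)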

positive and negative columns are not returned by extract_sentiment_terms when there are no positive or negative terms

When using extract_sentiment_terms in the context of a data.table, extract_sentiment_terms does not return the positive and neutral columns, which causes the extract_sentiment_terms call to fail on the records below it.

datatable[, Positive.Terms :=
extract_sentiment_terms(Comment)[,positive], by = "Comment"]

Error in [.data.table(extract_sentiment_terms("Attach files by dropping, Choose Files selecting them from the clipboard."), :
Variable 'positive' is not found in calling scope. Looking in calling scope because either you used the .. prefix or set with=FALSE

This issue causes the records below the failed record to fail as well.
Is there a better way to use this function in the context of a data.table to overcome this problem?
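An untested sketch of a guard, mirroring the call above: fall back when the positive column is absent so the by-group assignment does not error (the list(NA_character_) fallback is just one choice):

datatable[, Positive.Terms := {
    terms <- extract_sentiment_terms(Comment)
    # guard: some comments yield no positive terms, so the column may be missing
    if ("positive" %in% names(terms)) terms[["positive"]] else list(NA_character_)
}, by = "Comment"]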

add extract_sentiment_terms

A function that takes a dictionary and text and extracts the positive and negative terms as a list, with extraction at the sentence level. Plot method with all words, sentiment words colored red/green, and other words in grey. May also do a bar chart of the top n sentiment terms.

Possibly store as termco object and have termco do the heavy lifting.

Move sentence splitting out of the function

Sentence splitting will be moved outside of the sentiment function and handled by a reimport of textshape::split_sentence. The sentiment and sentiment_by functions will give a warning if the object is not a sentence_split object (a class may need to be added in the reimport).

This will:

  1. Make it easier to maintain sentence splitting
  2. Speed up sentiment classifying
  3. Reduce redundant splitting for different functions (or the same function called twice) using the same text

Be careful to do this to all functions that use sentence splitting, such as highlighting and word extraction. This is a justifiable breaking change that will bump the major version.
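A sketch of what the post-change workflow might look like (names taken from the plan above, not a released API):

library(sentimentr)

raw_text <- c("I love it. It is not bad.", "I hate it.")

split_text <- get_sentences(raw_text)   # wraps textshape::split_sentence and adds the class

sentiment(split_text)      # no warning: already a sentence-split object
sentiment_by(split_text)   # reuses the same split instead of re-splitting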

Valence category 4 is missing

The documentation is missing any mention of conjunctions (4) in the description of the types of valence shifter words.

Error using highlight original.text parameter

I'm getting an error when using the original.text parameter of the highlight() function.

    highlight(temp_sentiment, file = file.path("highlighted", paste(prod, "_highlighted-text.html", sep = "")), original.text = filtered[['Description']], open = FALSE)
Error in `[[<-.data.frame`(`*tmp*`, "txt", value = c("Feedback: \nAR - Goal History/Diagnostic Report....",  : 
  replacement has 3649 rows, data has 3658

I've tried debugging and I think the problem may be around line 50 of Highlight.R

if (!is.null(original.text)){
        txt <- get_sentences2(original.text)
    } else {
        txt <- get_sentences(x)
    }

    y[["txt"]] <- unlist(txt)

    y[, txt := ifelse(polarity == "", txt, sprintf("<mark class = \"%s\">%s</mark>", polarity, txt))]

If I set original.text to NULL (as is the default), it works fine. I'm not quite sure why there's a size difference.

I also tried setting missing_value=NULL for the sentiment_by call but that didn't change anything.
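One way to see where the counts diverge (a hypothetical check, reusing the filtered object from the call above; since the error is a row-count mismatch, comparing the sentence counts the two splitters produce on the same text may point at the culprit):

length(unlist(sentimentr::get_sentences2(filtered[["Description"]])))   # splitter used for original.text
length(unlist(sentimentr::get_sentences(filtered[["Description"]])))    # get_sentences() on the same text, for comparison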

Do you have any suggestions? Thanks.

Explore as zero polarity

seems like
of like
bit like
sorta like
kinda like
is like
was like
it's like
he's like
she's like
we're like
they're like

vs like as positive

Does it change accuracy?
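One rough way to test the question (sketch only; like_hedges and neutralize_like() are illustrative helpers, not package functions):

library(sentimentr)

# blank out the hedging "like" phrases before scoring, then compare with the untouched text
like_hedges <- c("seems like", "sorta like", "kinda like", "is like", "was like",
                 "it's like", "he's like", "she's like", "we're like", "they're like")

neutralize_like <- function(x) {
    for (p in like_hedges) x <- gsub(p, "", x, fixed = TRUE)
    x
}

txt <- "It was like a punch in the gut, but I like the instructor."
sentiment(txt)
sentiment(neutralize_like(txt))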

String Length Limit

I am wondering if someone has encountered the following error in the past:

Error in sprintf(paste("%-", Q, "s", sep = ""), col) : 
  required resulting string length 31315 is greater than maximal 8192 

A workaround is appreciated.
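The error appears to come from base R's sprintf(), which caps a single formatted element at 8192 characters. As a crude stopgap, truncating the over-long string before it reaches that formatting step avoids the cap (txt here is a hypothetical stand-in for whatever string is being padded):

txt_short <- strtrim(txt, 8000)   # stay under the 8192-character sprintf cap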

Issue with splitting sentences

a <- "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives). I think the pop quizzes were good."
sentiment(a)
get_sentences2(a)
> sentiment(a)
   element_id sentence_id word_count sentiment
1:          1           1          7 0.6803361
2:          1           2          9 0.0000000
3:          1           3          7 0.3779645
> get_sentences2(a)
[[1]]
[1] "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives)." "I think the pop quizzes were good."   

Error on Travis

* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'sentiment_by'
  ‘averaging.function’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’
 OK
* checking PDF version of manual ... OK
* DONE

Status: 1 WARNING
See
  ‘/home/travis/build/trinker/sentimentr/sentimentr.Rcheck/00check.log’
for details.

Add terms

Amplifiers

update_key(valence_shifters_table, x = data.frame(x = c("exceedingly", "remarkably", "especially"), y = c(2, 2, 2)))

Negatives

update_key(valence_shifters_table, x = data.frame(x = c("excessively", 'overly', 'unduly', 'too much', 'too many', 'too often', 'I wish', 'too good', 'too high', 'to tough'), y = c(rep(-2, 8), rep(-1, 2))))

Add a measure of residual similar to log loss

Maybe:

average_absolute_rescaled_residual <- function (predicted, actual) {

    stopifnot(length(actual) == length(predicted))
    mean(abs(actual - sentimentr::general_rescale(predicted))) / 2

}

average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(1, -1))
average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(-1, 1))
average_absolute_rescaled_residual(predicted = rep(0, 10), actual = c(rep(-1, 5), rep(1, 5)))
average_absolute_rescaled_residual(predicted = c(-1, -1, .2, -.5, 1, 1.4, .7, .9, -1, .2), actual = c(rep(-1, 5), rep(1, 5)))


swafford <- c(
    "I haven't been sad in a long time.",
    "I am extremely happy today.",
    "It's a good day.",
    "But suddenly I'm only a little bit happy.",
    "Then I'm not happy at all.",
    "In fact, I am now the least happy person on the planet.",
    "There is no happiness left in me.",
    "Wait, it's returned!",
    "I don't feel so bad after all!"
)

library(sentimentr)
library(dplyr)
out <- sentiment(swafford) %>%
    mutate( 
        actual = c(.8, 1, .8, -.1, -.5, -1, -1, .5, .6), 
        swafford = swafford
    )

out %>%
    with(average_absolute_rescaled_residual(sentiment, actual))

Other languages

Hi @trinker,

I want to make a sentiment analysis tool for Facebook comments, and your package seems very nice.
But can I analyse text in Portuguese if I have a dictionary with positive/negative words?

Thanks!!

To Do

  • Fix the polarity formula
  • Add formula to README
  • Add sentiment_by
    • element level aggregation when by = NULL
    • Will take grouping variables for by
    • Retain ability to get sentence (uncombine) level info back from an environment in the class
  • Add class
    • sentiment
    • sentiment_by
  • Add plotting method with ggplot2 and geom_smooth
  • Add a get_sentences from sentiment & sentiment_by to extract the sentences back
    • A class that can also operate on character strings using get_sents under the hood
    • Add to sentiment
    • Add to sentiment_by
  • Add an uncombine from sentiment_by to extract the sentence-level polarity back

sentiment_by... is a mixed review really mixed?

Often people will include kind sentences when speaking unkind sentences, as a polite convention. Do the kind ones truly negate the negative ones? No. They may indicate a less hostile tone, but overall the tone is still hostile. Maybe a general weighting function is in order that up- and down-weights the grouped averaging according to this convention.

This would not be the default, but it is recommended for short opinion texts like reviews or evaluations.

weighted_sentiment_average <- function (x, mixed.less.than.zero.weight = 4, na.rm = TRUE, ...) {

    if (any(x > 0) && any(x < 0)) {

        numerator <- sum(x[x < 0 & !is.na(x)]) * mixed.less.than.zero.weight + sum(x[x > 0 & !is.na(x)])

    } else {

        numerator <- sum(x, na.rm = na.rm)
 
    }

    numerator/{sum(x != 0, na.rm = na.rm) + sqrt(log(1 + sum(x == 0, na.rm = na.rm)))}
} 




weighted_sentiment_average(c(-1))
weighted_sentiment_average(c(-1, 1))
weighted_sentiment_average(c(-1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1, 1, 0))
weighted_sentiment_average(c(-1, -1, -1, -1, 1, 1, 1))

Changed to dictionaries

  • funny is typically positive and it's currently negative.
  • least should be a deamplifier
  • understand(ing|s)* should probably be a positive word
sentiment("Not very effective, hard to understand, just read from power point, couldn't understand.")

sentimentr::polarity_table["understand",]
sentimentr::polarity_table["understands",]
sentimentr::polarity_table["understanding",]
sentimentr::polarity_table["funny",]
sentimentr::polarity_table["hilarious",]
sentimentr::polarity_table["least",]
sentimentr::valence_shifters_table["least",] #deamplifier


> sentimentr::valence_shifters_table["least",]
       x  y
1: least NA
> sentimentr::polarity_table["understand",]
            x  y
1: understand NA
> sentimentr::polarity_table["understands",]
             x  y
1: understands NA
> sentimentr::polarity_table["understanding",]
               x  y
1: understanding NA
> sentimentr::polarity_table["funny",]
       x  y
1: funny -1
> sentimentr::polarity_table["hilarious",]
           x y
1: hilarious 1
> sentimentr::polarity_table["least",]
       x  y
1: least NA
> sentimentr::valence_shifters_table["least",] #negator
       x  y
1: least NA

huh?? figure out why two negatives make a neutral

sentiment("sucked. most of the stuff does not work with my phone.")

#   element_id sentence_id word_count  sentiment
#1:          1           1          1 -1.0000000
#2:          1           2         10 -0.3162278

sentiment("sucked, most of the stuff does not work with my phone.")

#   element_id sentence_id word_count sentiment
#1:          1           1         11         0

sentiment("sucked most of the stuff does not work with my phone.")

#   element_id sentence_id word_count  sentiment
#1:          1           1         11 -0.6030227

Depends: R (≥ 3.1.0) Not working on 3.2.3

Hi,

The R version requirement for this package says Depends: R (≥ 3.1.0). However, when I try to load the package after installing it, this is the error I get.

library(sentimentr)
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘textclean’
In addition: Warning message:
package ‘sentimentr’ was built under R version 3.4.1
Error: package or namespace load failed for ‘sentimentr’

Is this an error in the code, or has the page not been updated to say it only works for R > 3.4?

Regards,
Ren.

Error for single sentence

sentiment('surprised and shocked.')
> sentiment('surprised and shocked.')
   element_id sentence_id word_count sentiment
1:          1           1          3         0
Warning message:
In `[.data.table`(sent_dat, , `:=`("words", make_words(space_fill(sentences,  :
  Supplied 3 items to be assigned to 1 items of column 'words' (2 unused)

Incorrect polarity calculation

I'm finding this strange. The sentence "Crashing tv isn't showing" yields a sentiment score of 0.5.

Sentiment for "Crashing TV" yields -0.70.
Sentiment for "isn't showing" yields 0.
Sentiment for "isn't" yields 0, which is surprising because I have "isn't" as a negator in my valence table.

There were only a couple of additions to the valence table and the polarity table, and none of them should have any impact in the context of this sentence.

Any idea what is wrong?

sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 4 NA 0.5

    sentiment_by("Crashing tv", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA -0.7071068

    sentiment_by("isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA 0
    sentiment_by("isn't", by = NULL, polarity_dt = pk_table,
  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 1 NA 0

Investigate use of data/funs from another package as arguments

One used to be able to use :: as a way to add a package to yours if data/fun was used as an argument. This appears to no longer be the case: https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/sentimentr-00check.html

checking dependencies in R code ... NOTE
Namespaces in Imports field not imported from:
  ‘lexicon’ ‘syuzhet’
  All declared Imports should be used.

Maybe it's as simple as adding @importFrom in the roxygen block, or maybe a newer version of roxygen does this, or maybe this needs to be brought up with roxygen...

Dummy example for posting if need be:

library(janeaustenr)
library(stringi)

#' Compare Number of Words
#' 
#' A silly little function of no consequence.
#' 
#' @param text A text string(s)
#' @param comparison The text you want to compare word counts against
#' @param \ldots ignored
#' @export
#' @examples 
#' library(janeaustenr)
#' more_words_than(janeaustenr::northangerabbey)
more_words_than <- function(text, comparison = janeaustenr::emma,...){

    sum(stringi::stri_count_words(text)) > sum(stringi::stri_count_words(comparison))
}

Zero averaging problem

I believe that zero-sentiment sentences may have too much influence when averaging sentiment (over-smoothing), because a zero likely doesn't carry the same semantic/affective weight as a non-zero sentiment.

Here are some possibilities for dealing with the issue (down-weighting the influence of zeros in the denominator).

x <- c(1, 2, 0, 0, 0, -1)

sum(x)/length(x)
sum(x)/{sum(x != 0) + (sum(x == 0)^(1/3))}
sum(x)/{sum(x != 0) + sqrt(log(1 + sum(x == 0)))}

Emoji support

stringi is likely able to do OS-independent conversion of (escape sequence) bytes to Unicode (or something like that) that can then be used to look up the emoji name. Use a regex to pull them out.

Then use the Unicode package to go the rest of the way: u_char_name(as.u_char('U+2702'))
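A rough sketch of that pipeline (the example string and the \p{So} heuristic are illustrative; name lookup depends on the Unicode package's UCD version):

library(stringi)
library(Unicode)

x <- "I love this \U0001F600 so much \u2702"

# rough heuristic: most emoji fall in the Symbol-Other (\p{So}) category
emo <- stri_extract_all_regex(x, "\\p{So}")[[1]]

# map each matched character to its code point, then to the official Unicode name
u_char_name(as.u_char(vapply(emo, utf8ToInt, integer(1))))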
