
sentimentr's Issues

Valence calculation for conjunction.

Hi,

As part of preparing to teach sentimentr, I worked up examples showing the calculation process. However, I can't reconcile the difference between my hand calculation and the value sentiment() computes.

sample <- c("You're crazy, but I love you.")
sentiment(sample, n.before = 2, n.after = 2, amplifier.weight = .8, but.weight = .9)
##    element_id sentence_id word_count sentiment
## 1:          1           1          6 0.6205374

(-1 * (1 - .9) + 1 * (1.9)) / sqrt(6)
## [1] 0.7348469
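To probe how but.weight enters the score, one quick sketch (reusing sample and the call above; this only shows how the computed value moves with the weight, not the internal formula):

# vary but.weight while holding everything else fixed and watch the score
sapply(c(0, .25, .5, .75, .9, 1), function(w) {
    sentiment(sample, n.before = 2, n.after = 2,
              amplifier.weight = .8, but.weight = w)[["sentiment"]]
})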

Thanks

Rick

Other languages

Hi,

First of all, thanks a lot for writing this R package! I'm currently experimenting with political texts and really like it so far.

Is it also possible to use sentimentr with languages other than English, perhaps by enabling the use of custom dictionaries?
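A minimal sketch of what the custom-dictionary route might look like (the two German entries are placeholders, and the built-in English valence shifters would still apply unless a translated valence_shifters_dt were supplied as well):

library(sentimentr)

# hypothetical two-word dictionary; as_key() builds the keyed data.table
# that sentiment() expects in polarity_dt
my_key <- as_key(data.frame(
    words    = c("gut", "schlecht"),
    polarity = c(1, -1),
    stringsAsFactors = FALSE
))

sentiment("Das Buch ist gut.", polarity_dt = my_key)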

High memory consumption when using sentiment() function

I am running some polarity computations through the sentiment() function. What I am experiencing is, even for small pieces of text, a huge amount of allocated RAM. Sometimes I also get the following error:

Error in `[.data.table`(word_dat, , .(non_pol = unlist(non_pol)), by = c("id", : negative length vectors are not allowed Calls: assign -> compute_tone -> sentiment -> [ -> [.data.table Execution halted

A character vector of 669 kB (computed through object_size() from the pryr package) leads to a peak allocation of 3.590 GB of RAM, which is impressive. This causes problems, as you can imagine, when texts get longer.

I know you have developed everything using the data.table package (I did the same for my own package), so this sounds strange to me.

Do you have any hints, or are you aware of this issue?
I am not including a minimal example since this analysis can easily be reproduced with the profiling tool in RStudio.
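Not a fix, but a hedged workaround sketch while the allocation issue is open: score the vector in chunks so the intermediate data.table objects stay small, then bind the pieces back together (score_in_chunks and chunk_size are made-up names):

library(sentimentr)
library(data.table)

score_in_chunks <- function(txt, chunk_size = 500) {
    idx <- split(seq_along(txt), ceiling(seq_along(txt) / chunk_size))
    # note: element_id restarts within each chunk; the chunk column keeps the pieces apart
    rbindlist(lapply(idx, function(i) sentiment(txt[i])), idcol = "chunk")
}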

Thanks

Sentence splitting wonky for multicaps followed by a double space and new sentence

sentimentr::get_sentences2('I went to AU. Awesome school.')
sentimentr::get_sentences2('I went to AU.  Awesome school.')

## > sentimentr::get_sentences2('I went to AU. Awesome school.')
## [[1]]
## [1] "I went to AU."   "Awesome school."

## > sentimentr::get_sentences2('I went to AU.  Awesome school.')
## [[1]]
## [1] "I went to AU.  Awesome school."

add discussion of and testing from David Bamman

http://www.davidbamman.com/?p=52

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyr, readr, rvest)
pacman::p_load_current_gh(file.path('trinker', c('sentimentr', 'textclean', 'textshape')))

url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/rj_annotations.txt'
raj_url <- 'http://www.davidbamman.com/wp-content/uploads/2015/03/turk_emotion1.html'

text_dat <- raj_url %>%
    read_html() %>%
    html_nodes(xpath='//td/a') %>% 
    html_attr("href") %>%
    lapply(function(x){
        x %>%
            read_html() %>%
            html_nodes(xpath='//blockquote') %>% 
            html_text() %>%
            replace_white() %>%
            trimws() %>%
            textshape::combine()
    }) %>%
    bind_list('scene_index', 'text') %>%
    tbl_df() %>%
    mutate(scene_index = as.integer(scene_index) - 1)



rating_dat <- read_delim(url, "\t") %>%
    tbl_df() %>%
    filter(!is.na(rationale)) %>%
    group_by(scene_index) %>%
    summarize(rating = mean(rating)) 

dat <- left_join(rating_dat, text_dat)




dat[['sentimentr']] <- dat %>%
    with(sentiment_by(text)[[4]] * 4)

dat[['sentimentr_jockers']] <- dat %>%
    with(sentiment_by(text, polarity_dt = lexicon::hash_sentiment_jockers)[[4]])


dat[['syuzhet']] <- syuzhet::get_sentiment(dat[['text']], method="syuzhet")


dat %>%
   mutate(
       diff_sentimentr = abs(sign(rating) - sign(sentimentr)),
       diff_syuzhet = abs(sign(rating) - sign(syuzhet))
   ) %>%
   select(-text) %>%
   print(n = Inf)


dat %>%
   mutate(rating=sign(rating), sentimentr=sign(sentimentr)) %>%
   with(table(rating, sentimentr))

dat %>%
   mutate(rating=sign(rating), sentimentr_jockers=sign(sentimentr_jockers)) %>%
   with(table(rating, sentimentr_jockers))


dat %>%
   mutate(rating=sign(rating), syuzhet=sign(syuzhet)) %>%
   with(table(rating, syuzhet))


Include sarcasm in the algorithm

Difficult task for 100% accuracy, but there may be key features that are highly correlated with sarcastic comments and that would improve sentiment detection. The idea isn't highly accurate sarcasm detection, just adding accuracy to the sentiment detection by identifying very likely sarcastic phrases.

Possible leading n-grams to consider (a rough regex sketch follows the list):

  1. I (love|like) how (negative situation)
  2. As if (pronoun/proper noun)
  3. It's (really|so)* (awesome|great) how (negative context)
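A hypothetical sketch of flagging these leading n-grams (the patterns and the sarcasm_flag() name are illustrative only, not a proposed implementation):

sarcasm_flag <- function(x) {
    patterns <- c(
        "\\bI (love|like) how\\b",
        "\\bas if (he|she|it|they|you|[A-Z][a-z]+)\\b",
        "\\bit'?s (really |so )?(awesome|great) how\\b"
    )
    # TRUE if any of the leading n-gram patterns appears in the sentence
    Reduce(`|`, lapply(patterns, function(p) grepl(p, x, ignore.case = TRUE)))
}

sarcasm_flag(c("I love how the printer never works.", "The printer works."))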

Incorrect number of word calculations

sentiment_attributes("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.")[1]
## [screenshot of output omitted]

str_count("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.", "\\S+")
## [screenshot of output omitted]
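For a quick third reference outside both of the calls above, stringi's word counter can be run on the same sentence (sketch):

stringi::stri_count_words("Assisted Local yy on resolving mm ll jj ll upgrade issues. He created a ww qq process for V. xx and tt gg.")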

Add emoticon replacement

Use a termco-style dictionary to replace an emoticon with an accompanying emotion, using https://en.wikipedia.org/wiki/List_of_emoticons

Check whether a multi-pattern regex replacement is faster or a separate string for each emoticon with fixed = TRUE is faster.

list(
    ` happy ` = '\\b(:-\\)|:\\)|:D|:o\\)|8\\)|=\\)|:\\}|:\\^\\))\\b', 
    ` laugh ` = '\\b(:-D|8-D|8D|x-D|xD|X-D|XD|=-D|=D|B\\^D)\\b',
    ` very happy `  = '\\b:-\\)\\)\\b'
)
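A hedged sketch of the timing comparison described above (the three-entry emoticon list and the test text are illustrative, not the full Wikipedia table):

library(stringi)
library(microbenchmark)

emoticons <- c(":-)" = " happy ", ":D" = " laugh ", ":-((" = " very sad ")
x <- rep("Great class :-) but the quiz :-(( was hard :D", 1000)

microbenchmark(
    regex = stri_replace_all_regex(x,
        pattern = c(":-\\)", ":D", ":-\\(\\("),
        replacement = unname(emoticons), vectorize_all = FALSE),
    fixed = stri_replace_all_fixed(x,
        pattern = names(emoticons),
        replacement = unname(emoticons), vectorize_all = FALSE),
    times = 10
)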

Issue with bi- gram or more in Polarity table

Hi, I am using the sentimentr and qdap packages for polarity scoring on text data. What I observed is that when I use a bigram (or longer n-gram) in the polarity table, which I created separately for polarity scoring, sentimentr does not give any polarity score, but qdap is able to score it.

Could you please help me figure out how to modify this for the sentimentr package? I want to use sentimentr since it's faster than qdap. I am also converting the polarity table with as_key() for use.
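One possible (untested here) workaround, since sentimentr appears to match single tokens, which is what this issue reports: collapse the multi-word entry to a single token in both the key and the text before scoring (the notbad token and example text are illustrative):

library(sentimentr)

key <- as_key(data.frame(
    words    = c("notbad", "terrible"),
    polarity = c(1, -1),
    stringsAsFactors = FALSE
))

txt <- gsub("not bad", "notbad", "The food was not bad, the service was terrible.")
sentiment(txt, polarity_dt = key)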

positive and negative columns are not returned by extract_sentiment_terms when there are no positive or negative terms

When using extract_sentiment_terms in the context of a data.table, extract_sentiment_terms does not return the positive and neutral columns, which causes the extract_sentiment_terms call to fail on the records below it.

datatable[, Positive.Terms :=
extract_sentiment_terms(Comment)[,positive], by = "Comment"]

Error in [.data.table(extract_sentiment_terms("Attach files by dropping, Choose Files selecting them from the clipboard."), :
Variable 'positive' is not found in calling scope. Looking in calling scope because either you used the .. prefix or set with=FALSE

This issue causes the records below the failed record to fail as well.
Is there a better way to use this function in the context of a data.table to overcome this problem?
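An untested sketch of a guard, mirroring the call above: fall back when the positive column is absent so the by-group assignment does not error (the list(NA_character_) fallback is just one choice):

datatable[, Positive.Terms := {
    terms <- extract_sentiment_terms(Comment)
    # guard: some comments yield no positive terms, so the column may be missing
    if ("positive" %in% names(terms)) terms[["positive"]] else list(NA_character_)
}, by = "Comment"]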

add extract_sentiment_terms

A function that takes a dictionary and text and extracts the positive and negative terms as a list, with extraction at the sentence level. Plot method with all words, sentiment words colored red/green, and other words in grey. May also do a bar chart of the top n sentiment terms.

Possibly store as termco object and have termco do the heavy lifting.

Move sentence splitting out of the function

Sentence splitting will be moved outside of the sentiment function and handled by a reimport of textshape::split_sentence. The sentiment and sentiment_by functions will give a warning if the object is not a sentence_split object (a class may need to be added in the reimport).

This will:

  1. Make it easier to maintain sentence splitting
  2. Speed up sentiment classifying
  3. Reduce redundant splitting for different functions (or the same function called twice) using the same text

Be careful to do this to all functions that use sentence splitting, such as highlighting and word extraction. This is a justifiable breaking change that will bump the major version.
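A sketch of what the post-change workflow might look like (names taken from the plan above, not a released API):

library(sentimentr)

raw_text <- c("I love it. It is not bad.", "I hate it.")

split_text <- get_sentences(raw_text)   # wraps textshape::split_sentence and adds the class

sentiment(split_text)      # no warning: already a sentence-split object
sentiment_by(split_text)   # reuses the same split instead of re-splitting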

Valence category 4 is missing

The documentation is missing any mention of conjunctions (4) in the description of the types of valence shifter words.

Error using highlight original.text parameter

I'm getting an error when using the original.text parameter of the highlight() function.

    highlight(temp_sentiment, file = file.path("highlighted", paste(prod, "_highlighted-text.html", sep = "")), original.text = filtered[['Description']], open = FALSE)
Error in `[[<-.data.frame`(`*tmp*`, "txt", value = c("Feedback: \nAR - Goal History/Diagnostic Report....",  : 
  replacement has 3649 rows, data has 3658

I've tried debugging and I think the problem may be around line 50 of Highlight.R

if (!is.null(original.text)){
        txt <- get_sentences2(original.text)
    } else {
        txt <- get_sentences(x)
    }

    y[["txt"]] <- unlist(txt)

    y[, txt := ifelse(polarity == "", txt, sprintf("<mark class = \"%s\">%s</mark>", polarity, txt))]

If I set original.text to NULL (as is the default), it works fine. I'm not quite sure why there's a size difference.

I also tried setting missing_value=NULL for the sentiment_by call but that didn't change anything.
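One way to see where the counts diverge (a hypothetical check, reusing the filtered object from the call above; since the error is a row-count mismatch, comparing the sentence counts the two splitters produce on the same text may point at the culprit):

length(unlist(sentimentr::get_sentences2(filtered[["Description"]])))   # splitter used for original.text
length(unlist(sentimentr::get_sentences(filtered[["Description"]])))    # get_sentences() on the same text, for comparison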

Do you have any suggestions? Thanks.

Explore as zero polarity

seems like
of like
bit like
sorta like
kinda like
is like
was like
it's like
he's like
she's like
we're like
they're like

vs like as positive

Does it change accuracy?
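One rough way to test the question (sketch only; like_hedges and neutralize_like() are illustrative helpers, not package functions):

library(sentimentr)

# blank out the hedging "like" phrases before scoring, then compare with the untouched text
like_hedges <- c("seems like", "sorta like", "kinda like", "is like", "was like",
                 "it's like", "he's like", "she's like", "we're like", "they're like")

neutralize_like <- function(x) {
    for (p in like_hedges) x <- gsub(p, "", x, fixed = TRUE)
    x
}

txt <- "It was like a punch in the gut, but I like the instructor."
sentiment(txt)
sentiment(neutralize_like(txt))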

String Length Limit

I am wondering if someone has encountered the following error in the past:

Error in sprintf(paste("%-", Q, "s", sep = ""), col) : 
  required resulting string length 31315 is greater than maximal 8192 

A workaround is appreciated.
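The error appears to come from base R's sprintf(), which caps a single formatted element at 8192 characters. As a crude stopgap, truncating the over-long string before it reaches that formatting step avoids the cap (txt here is a hypothetical stand-in for whatever string is being padded):

txt_short <- strtrim(txt, 8000)   # stay under the 8192-character sprintf cap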

Issue with splitting sentences

a <- "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives). I think the pop quizzes were good."
sentiment(a)
get_sentences2(a)
> sentiment(a)
   element_id sentence_id word_count sentiment
1:          1           1          7 0.6803361
2:          1           2          9 0.0000000
3:          1           3          7 0.3779645
> get_sentences2(a)
[[1]]
[1] "I really enjoyed how the time (e.g. his use of stuff, mini \"lessons\" he gives)." "I think the pop quizzes were good."   

Error on Travis

* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Undocumented arguments in documentation object 'sentiment_by'
  ‘averaging.function’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’
 OK
* checking PDF version of manual ... OK
* DONE

Status: 1 WARNING
See
  ‘/home/travis/build/trinker/sentimentr/sentimentr.Rcheck/00check.log’
for details.

Add terms

Amplifiers

update_key(valence_shifters_table, x = data.frame(x = c("exceedingly", "remarkably", "especially"), y = c(2, 2, 2)))

Negatives

update_key(valence_shifters_table, x = data.frame(x = c("excessively", 'overly', 'unduly', 'too much', 'too many', 'too often', 'I wish', 'too good', 'too high', 'to tough'), y = c(rep(-2, 8), rep(-1, 2))))

Add a measure of residual similar to log loss

Maybe:

average_absolute_rescaled_residual <- function (predicted, actual) {

    stopifnot(length(actual) == length(predicted))
    mean(abs(actual - sentimentr::general_rescale(predicted))) / 2

}

average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(1, -1))
average_absolute_rescaled_residual(predicted = c(1, -1), actual = c(-1, 1))
average_absolute_rescaled_residual(predicted = rep(0, 10), actual = c(rep(-1, 5), rep(1, 5)))
average_absolute_rescaled_residual(predicted = c(-1, -1, .2, -.5, 1, 1.4, .7, .9, -1, .2), actual = c(rep(-1, 5), rep(1, 5)))


swafford <- c(
    "I haven't been sad in a long time.",
    "I am extremely happy today.",
    "It's a good day.",
    "But suddenly I'm only a little bit happy.",
    "Then I'm not happy at all.",
    "In fact, I am now the least happy person on the planet.",
    "There is no happiness left in me.",
    "Wait, it's returned!",
    "I don't feel so bad after all!"
)

library(sentimentr)
library(dplyr)
out <- sentiment(swafford) %>%
    mutate( 
        actual = c(.8, 1, .8, -.1, -.5, -1, -1, .5, .6), 
        swafford = swafford
    )

out %>%
    with(average_absolute_rescaled_residual(sentiment, actual))

Other languages

Hi @trinker,

I want to make a sentiment analysis tool for Facebook comments, and your package seems very nice.
But can I analyse text in Portuguese if I have a dictionary with positive/negative words?

Thanks!!

To Do

  • Fix the polarity formula
  • Add formula to README
  • Add sentiment_by
    • element level aggregation when by = NULL
    • Will take grouping variables for by
    • Retain ability to get sentence (uncombine) level info back from an environment in the class
  • Add class
    • sentiment
    • sentiment_by
  • Add plotting method with ggplot2 and geom_smooth
  • Add a get_sentences from sentiment & sentiment_by to extract the sentences back
    • A class that can also operate on character strings using get_sents under the hood
    • Add to sentiment
    • Add to sentiment_by
  • Add an uncombine from sentiment_by to extract the sentence-level polarity back

sentiment_by... is a mixed review really mixed?

Often people will include kind sentences when speaking unkind sentences, as a polite convention. Do the kind ones truly negate the negative ones? No. They may indicate a less hostile tone, but overall the tone is still hostile. Maybe a general weighting function is in order that up- and down-weights the grouped averaging according to this convention.

This would not be the default, but it is recommended for short opinion texts like reviews or evaluations.

weighted_sentiment_average <- function (x, mixed.less.than.zero.weight = 4, na.rm = TRUE, ...) {

    if (any(x > 0) && any(x < 0)) {

        numerator <- sum(x[x < 0 & !is.na(x)]) * mixed.less.than.zero.weight + sum(x[x > 0 & !is.na(x)])

    } else {

        numerator <- sum(x, na.rm = na.rm)
 
    }

    numerator/{sum(x != 0, na.rm = na.rm) + sqrt(log(1 + sum(x == 0, na.rm = na.rm)))}
} 




weighted_sentiment_average(c(-1))
weighted_sentiment_average(c(-1, 1))
weighted_sentiment_average(c(-1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1))
weighted_sentiment_average(c(-1, 1, 1, 1, 1, 1, 0))
weighted_sentiment_average(c(-1, -1, -1, -1, 1, 1, 1))

Changed to dictionaries

  • funny is typically positive and it's currently negative.
  • least should be a deamplifier
  • understand(ing|s)* should probably be a positive word
sentiment("Not very effective, hard to understand, just read from power point, couldn't understand.")

sentimentr::polarity_table["understand",]
sentimentr::polarity_table["understands",]
sentimentr::polarity_table["understanding",]
sentimentr::polarity_table["funny",]
sentimentr::polarity_table["hilarious",]
sentimentr::polarity_table["least",]
sentimentr::valence_shifters_table["least",] #deamplifier


> sentimentr::valence_shifters_table["least",]
       x  y
1: least NA
> sentimentr::polarity_table["understand",]
            x  y
1: understand NA
> sentimentr::polarity_table["understands",]
             x  y
1: understands NA
> sentimentr::polarity_table["understanding",]
               x  y
1: understanding NA
> sentimentr::polarity_table["funny",]
       x  y
1: funny -1
> sentimentr::polarity_table["hilarious",]
           x y
1: hilarious 1
> sentimentr::polarity_table["least",]
       x  y
1: least NA
> sentimentr::valence_shifters_table["least",] #negator
       x  y
1: least NA

huh?? figure out why two negatives make a neutral

sentiment("sucked. most of the stuff does not work with my phone.")

#   element_id sentence_id word_count  sentiment
#1:          1           1          1 -1.0000000
#2:          1           2         10 -0.3162278

sentiment("sucked, most of the stuff does not work with my phone.")

#   element_id sentence_id word_count sentiment
#1:          1           1         11         0

sentiment("sucked most of the stuff does not work with my phone.")

#   element_id sentence_id word_count  sentiment
#1:          1           1         11 -0.6030227

Depends: R (≥ 3.1.0) Not working on 3.2.3

Hi,

The R version requirement for this package says Depends: R (≥ 3.1.0). However, when I try to load the package after installing it, this is the error I get.

library(sentimentr)
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘textclean’
In addition: Warning message:
package ‘sentimentr’ was built under R version 3.4.1
Error: package or namespace load failed for ‘sentimentr’

Is this an error in the code, or has the page not been updated to say it only works for R > 3.4?

Regards,
Ren.

Error for single sentence

sentiment('surprised and shocked.')
> sentiment('surprised and shocked.')
   element_id sentence_id word_count sentiment
1:          1           1          3         0
Warning message:
In `[.data.table`(sent_dat, , `:=`("words", make_words(space_fill(sentences,  :
  Supplied 3 items to be assigned to 1 items of column 'words' (2 unused)

Incorrect polarity calculation

I'm finding this strange. The sentence "Crashing tv isn't showing" yields a sentiment score of 0.5.

Sentiment for "Crashing TV" yields -0.70.
Sentiment for "isn't showing" yields 0.
Sentiment for "isn't" yields 0, which is surprising because I have "isn't" as a negator in my valence table.

There were only a couple of additions to the valence table and the polarity table, and none of them should have any impact in the context of this sentence.

Any idea what is wrong?

sentiment_by("Crashing tv isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 4 NA 0.5

    sentiment_by("Crashing tv", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA -0.7071068

    sentiment_by("isn't showing", by = NULL, polarity_dt = pk_table,

  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 2 NA 0
    sentiment_by("isn't", by = NULL, polarity_dt = pk_table,
  •          valence_shifters_dt = vs_table)
    
    element_id word_count sd ave_sentiment
    1: 1 1 NA 0

Investigate use of data/funs from another package as arguments

One used to be able to use :: as a way to add a package to yours if data/fun was used as an argument. This appears to no longer be the case: https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/sentimentr-00check.html

checking dependencies in R code ... NOTE
Namespaces in Imports field not imported from:
  ‘lexicon’ ‘syuzhet’
  All declared Imports should be used.

Maybe it's as simple as adding @importFrom in the roxygen block, or maybe a newer version of roxygen does this, or maybe this needs to be brought up with roxygen...

Dummy example for posting if need be:

library(janeaustenr)
library(stringi)

#' Compare Number of Words
#' 
#' A silly little function of no consequence.
#' 
#' @param text A text string(s)
#' @param comparison The text you want to compare word counts against
#' @param \ldots ignored
#' @export
#' @examples 
#' library(janeaustenr)
#' more_words_than(janeaustenr::northangerabbey)
more_words_than <- function(text, comparison = janeaustenr::emma,...){

    sum(stringi::stri_count_words(text)) > sum(stringi::stri_count_words(comparison))
}

Zero averaging problem

I believe that zero-sentiment sentences may have too much influence when averaging sentiment (over-smoothing), because a zero likely doesn't carry the same semantic/affective weight as a non-zero sentiment.

Here are some possibilities for dealing with the issue (down-weighting the influence of zeros in the denominator).

x <- c(1, 2, 0, 0, 0, -1)

sum(x)/length(x)
sum(x)/{sum(x != 0) + (sum(x == 0)^(1/3))}
sum(x)/{sum(x != 0) + sqrt(log(1 + sum(x == 0)))}

Emoji support

stringi is likely able to do OS-independent conversion of (escape sequence) bytes to Unicode (or something like that) that can then be used to look up the emoji name. Use a regex to pull them out.

Then use the Unicode package to go the rest of the way: u_char_name(as.u_char('U+2702'))
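A rough sketch of that pipeline (the example string and the \p{So} heuristic are illustrative; name lookup depends on the Unicode package's UCD version):

library(stringi)
library(Unicode)

x <- "I love this \U0001F600 so much \u2702"

# rough heuristic: most emoji fall in the Symbol-Other (\p{So}) category
emo <- stri_extract_all_regex(x, "\\p{So}")[[1]]

# map each matched character to its code point, then to the official Unicode name
u_char_name(as.u_char(vapply(emo, utf8ToInt, integer(1))))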
