Comments (5)

trinker commented on August 18, 2024

In some cases [:alpha:] may work. In others you may need to explicitly pass in \\p{L} instead, as I show in the demo below:

## Use With Non-ASCII
## Warning: sentimentr has not been tested with languages other than English.
## The example below is how one might use sentimentr if you believe the
## language you are working with is similar enough in grammar to English for
## sentimentr to be viable (likely Germanic languages)
## english_sents <- c(
##     "I hate bad people.",
##     "I like yummy cookie.",
##     "I don't love you anymore; sorry."
## )

library(sentimentr)  # provides as_key() and sentiment()

## Roughly equivalent to the above English
danish_sents <- stringi::stri_unescape_unicode(c(
    "Jeg hader d\\u00e5rlige mennesker.",
    "Jeg kan godt lide l\\u00e6kker is.",
    "Jeg elsker dig ikke mere; undskyld."
))

danish_sents
## > danish_sents
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."

## Polarity terms
polterms <- stringi::stri_unescape_unicode(
    c('hader', 'd\\u00e5rlige', 'undskyld', 'l\\u00e6kker', 'kan godt', 'elsker')
)

## Make polarity_dt
danish_polarity <- as_key(data.frame(
    x = polterms,  # already unescaped above
    y = c(-1, -1, -1, 1, 1, 1)
))

## Make valence_shifters_dt
danish_valence_shifters <- as_key(
    data.frame(x='ikke', y="1"),
    sentiment = FALSE,
    comparison = NULL
)

sentiment(
    danish_sents,
    polarity_dt = danish_polarity,
    valence_shifters_dt = danish_valence_shifters,
    retention_regex = "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"
)

## A way to test if you need [:alpha:] vs \\p{L}
## Does it wreck some of the non-ascii characters by default?
sentimentr:::make_sentence_df2(danish_sents)

## > sentimentr:::make_sentence_df2(danish_sents)
##    id                            sentences wc
## 1:  1         jeg hader d rlige mennesker   5
## 2:  2         jeg kan godt lide l kker is   7
## 3:  3 jeg elsker dig ikke mere ; undskyld   6

## Does this?
sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")

## > sentimentr:::make_sentence_df2(danish_sents, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
##    id                            sentences wc
## 1:  1         jeg hader dårlige mennesker   4
## 2:  2         jeg kan godt lide lækker is   6
## 3:  3 jeg elsker dig ikke mere ; undskyld   6

## If you answer yes to #1 but no to #2 you likely want \\p{L}
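
A related check that does not depend on sentimentr internals is to list which characters a given retention regex would strip. The sketch below uses only base R; `removed_chars` is a hypothetical helper written for illustration, not a sentimentr function, and the ASCII-only pattern is just a stand-in for a regex that mishandles non-ASCII letters.

```r
## Hypothetical helper (not part of sentimentr): return the distinct
## substrings that `regex` would remove from `x`.
## perl = TRUE is used so that the Unicode class \p{L} is understood.
removed_chars <- function(x, regex) {
    hits <- regmatches(x, gregexpr(regex, x, perl = TRUE))
    unique(unlist(hits))
}

danish <- c(
    "Jeg hader d\u00e5rlige mennesker.",
    "Jeg kan godt lide l\u00e6kker is."
)

## ASCII-only letter class: non-ASCII letters show up among the removed characters
removed_chars(danish, "[^a-zA-Z',;: ]")

## Unicode letter class: only non-letters (here the full stop) are removed
removed_chars(danish, "\\d:\\d|\\d\\s|[^\\p{L}',;: ]")
```

If letters from your language appear in the first result but not in the second, that suggests passing the \\p{L} version as retention_regex.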

from sentimentr.

trinker commented on August 18, 2024

Hello. As I discuss here (#74 (comment)), sentimentr is English based. I don't have the expertise in other languages to understand the ramifications of extending it beyond its current state. Your solution may work. Let me look into this more.

trinker commented on August 18, 2024

x <- c(
    "danish characteøs  sentåment æcores words correctly 456",
    "It works with probleme but not with problème 234"
)
gsub("[^[:alpha:]',;: ]|\\d:\\d|\\d ",  '', x)

##  "danish characteøs  sentåment æcores words correctly " "It works with probleme but not with problème "  

This may work. I need to look at [:alpha:].

dominiqueemmanuel commented on August 18, 2024

Thanks for working on this!

On my side, both [:alpha:] and \\p{L} seem OK:

txt <- c("première","dårlige","lækker")
stringi::stri_replace_all_regex(txt ,'[^a-zA-Z;:,\']', " ")==txt
#c(FALSE,FALSE,FALSE)
## >> KO  :(
stringi::stri_replace_all_regex(txt ,"[^[:alpha:];:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)
## >> OK  :)
stringi::stri_replace_all_regex(txt ,"[^\\p{L};:,\']", " ")==txt
# c(TRUE,TRUE,TRUE)

However, I would like to draw your attention to the fact that "[^\\p{L};:,\']" is compatible with stringi::stri_replace_all_regex but not with gsub (at least with its default regex engine), so I think you should modify this line:

text.var <- gsub(

Kind regards,
Dom
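
For readers who want to stay with gsub, one option is perl = TRUE, which switches gsub to the PCRE engine, where \\p{L} is recognised. The snippet below is only a sketch of that idea: `strip_non_letters` is a hypothetical helper, and it is not necessarily the change that was made in sentimentr.

```r
## Hypothetical helper (not part of sentimentr): remove everything matched
## by `regex`. perl = TRUE selects PCRE, which understands \p{L}.
strip_non_letters <- function(x, regex = "\\d:\\d|\\d\\s|[^\\p{L}',;: ]") {
    gsub(regex, "", x, perl = TRUE)
}

txt <- c("premi\u00e8re", "d\u00e5rlige", "l\u00e6kker")

## Accented letters are letters under \p{L}, so nothing is stripped
strip_non_letters(txt) == txt

## Digits are not letters, so they are removed while the words are kept
strip_non_letters("d\u00e5rlige 456")
```

The stringi route from the comment above does the same job with the ICU engine; the perl = TRUE route simply avoids an extra dependency.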

trinker commented on August 18, 2024

Thanks, I have changed this out!
