
Comments (8)

AugustT commented on August 18, 2024

Great work. Not that it matters, but I was on Linux and Windows. I think your benchmarking negates the need for me to test. Great stuff! FYI @GitTFJ

from sentimentr.

trinker commented on August 18, 2024

@AugustT What OS are you using?

Related: trinker/textclean#51


trinker commented on August 18, 2024

I have made changes to address this @AugustT. Could you try it out and give feedback?


trinker commented on August 18, 2024

On Windows I get the following using this code, but I am wondering what Mac users will get:

gsub_reg <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_perl <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)

library(microbenchmark)
library(sentimentr)

y <- hotel_reviews$text

r <- microbenchmark::microbenchmark(
    gsub_reg = gsub_reg(y),
    gsub_perl = gsub_perl(y),
    times = 100
)

plot(r)

[image: benchmark plot]


trinker commented on August 18, 2024

Similar results from Mac:

[image: benchmark plot]


trinker commented on August 18, 2024

I get similar results on Windows with \\p{L} swapped for [:alpha:]:

[image: benchmark plot]


trinker commented on August 18, 2024

And with \\p{L} swapped for [:alpha:] on Mac:

[image: benchmark plot]


trinker commented on August 18, 2024

One final related benchmark:

gsub_reg_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x)
gsub_reg_a <- function(x) gsub("\\d:\\d|\\d\\s|[^[:alpha:]',;: ]", '<<>>', x)
gsub_perl_p <- function(x) gsub("\\d:\\d|\\d\\s|[^\\p{L}',;: ]", '<<>>', x, perl = TRUE)

library(microbenchmark)
library(sentimentr)

y <- hotel_reviews$text

r <- microbenchmark::microbenchmark(
    gsub_reg_p = gsub_reg_p(y),
    gsub_reg_alpha = gsub_reg_a(y),
    gsub_perl_p = gsub_perl_p(y),
    times = 100
)

plot(r)

x <- c("Jeg hader dårlige mennesker.", "Jeg kan godt lide lækker is.",
    "Jeg elsker dig ikke mere; undskyld.")

[image: benchmark plot]

Inside make_sentence_df2 I do not use perl = TRUE (see d24e6af) because in this case [:alpha:] is faster than \\p{L}, and setting perl = TRUE with [:alpha:] does not retain the non-ASCII alphabetic characters we want:

gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x)
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."


gsub("\\d:\\d|\\d\\s|[^['alpha:]',;: ]", '<<>>', x, perl = TRUE)
## [1] "Jeg hader dårlige mennesker."        "Jeg kan godt lide lækker is."       
## [3] "Jeg elsker dig ikke mere; undskyld."
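
Note that `['alpha:]` in the two calls above is not the POSIX class `[:alpha:]`: the bracket expression closes early and is followed by a literal `',;: ]`, so neither call matches anything and both outputs come back unchanged. Below is a minimal sketch of the comparison with the class spelled `[:alpha:]`, assuming that is the intended pattern. The names `pat_alpha`, `pat_p`, and `keeps` are hypothetical helpers for illustration, not part of sentimentr.

```r
# Danish example strings from the thread, written with \u escapes so they
# are UTF-8 regardless of how this file is saved.
x <- c(
    "Jeg hader d\u00e5rlige mennesker.",
    "Jeg kan godt lide l\u00e6kker is.",
    "Jeg elsker dig ikke mere; undskyld."
)

# Assumed intended patterns: the same alternation, with the letter class
# spelled either as POSIX [:alpha:] or as the Unicode property \p{L}.
pat_alpha <- "\\d:\\d|\\d\\s|[^[:alpha:]',;: ]"
pat_p     <- "\\d:\\d|\\d\\s|[^\\p{L}',;: ]"

out_alpha_tre  <- gsub(pat_alpha, "<<>>", x)               # default (TRE) engine
out_alpha_perl <- gsub(pat_alpha, "<<>>", x, perl = TRUE)  # PCRE, POSIX class
out_p_perl     <- gsub(pat_p,     "<<>>", x, perl = TRUE)  # PCRE, Unicode property

# Hypothetical helper: does the first string still contain the letter "å"?
keeps <- function(out) grepl("\u00e5", out[1], fixed = TRUE)

print(c(
    alpha_tre  = keeps(out_alpha_tre),
    alpha_perl = keeps(out_alpha_perl),
    p_perl     = keeps(out_p_perl)
))
```

The results for the two `[:alpha:]` variants can depend on platform, locale, and string encoding, so no expected output is shown here. Note also that with the class spelled correctly, the trailing period in each string is replaced by `<<>>`, since `.` is not in the retained set.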

