Comments (5)

mlampros avatar mlampros commented on August 25, 2024

I'm sorry for the late reply.

As you already know, the fuzzywuzzyR package ports the fuzzywuzzy Python library, so I searched the issues page of the Python library and found the following three issues, which might also be related to Chinese characters,

seatgeek/fuzzywuzzy#20
seatgeek/fuzzywuzzy#104
seatgeek/fuzzywuzzy#82

The solution to your issue would be to use the force_ascii parameter (note that it doesn't apply to all functions). For instance,

library(fuzzywuzzyR)
word = "安广"
word1 = "安徽"

init_scor = FuzzMatcher$new()    # initialization of the scorer class

SCOR = init_scor$QRATIO(string1 = word, string2 = word1, force_ascii = FALSE)

which, however, returns an error,

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position ....

It seems to me that it's a decoding issue (ascii encoding).
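To make the kind of failure concrete, here is a minimal sketch in plain Python 3 (standard library only, not the fuzzywuzzyR call path): decoding the UTF-8 bytes of Chinese characters with the ASCII codec raises the same class of error, while decoding with the matching codec succeeds. The variable names are illustrative, and the exact byte and position in the message will differ from the error quoted above.

```python
# Python 3 sketch: reproduce an 'ascii' decoding failure with the standard library only.
raw = "安广".encode("utf-8")    # UTF-8 bytes of the two characters (3 bytes each)

try:
    raw.decode("ascii")         # ASCII only covers bytes 0x00-0x7f, so this fails
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

decoded = raw.decode("utf-8")   # decoding with the matching codec succeeds

print(error_message)
print(decoded == "安广")
```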

What worked for me after trial and error was the following code chunk, which uses the reticulate package directly (note that reticulate is a dependency of the fuzzywuzzyR package) and requires rudimentary Python knowledge,

# import the python builtin functions in R

BUILTINS = reticulate::import_builtins(convert = FALSE)

# first convert the Chinese characters to a 'python string' and decode it as 'utf-8' (the 'decode' method of 'str' exists in Python 2 only)

first_word = BUILTINS$str("安广")$decode('utf-8')

second_word = BUILTINS$str("安徽")$decode('utf-8')

third_word = BUILTINS$str("广徽")$decode('utf-8')

fourth_word = BUILTINS$str("")$decode('utf-8')


# import directly the python fuzzywuzzy library in R

fzr = reticulate::import("fuzzywuzzy")


# 'force_ascii' is set to FALSE as character strings are already decoded

fzr$fuzz$QRatio(first_word, second_word, force_ascii = FALSE)     

[1] 50


fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     

[1] 67


fzr$fuzz$QRatio(fourth_word, fourth_word, force_ascii = FALSE)     

[1] 100
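The scores above can be checked by hand. The sketch below assumes the score is the difflib-style ratio 2 * M / T (M = matched characters, T = total characters in both strings), scaled to 0..100 and rounded; `simple_ratio` is a hypothetical helper written for illustration, not a fuzzywuzzy function, and it skips QRatio's preprocessing. The one-character string "安" in the second comparison is also an assumption, chosen because a two-character string compared with a one-character string gives 67.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Similarity in 0..100: 2 * matched characters / total characters."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# one of two characters matches on each side: 2 * 1 / 4 = 0.50
print(simple_ratio("安广", "安徽"))   # 50

# one shared character, three characters in total: 2 * 1 / 3 = 0.67
print(simple_ratio("安广", "安"))     # 67

# identical strings always score 100
print(simple_ratio("安", "安"))       # 100
```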

Please let me know if it works (I'm not familiar with the Chinese language).

from fuzzywuzzyr.

ctfysh avatar ctfysh commented on August 25, 2024

Thank you very much! This works for me on Ubuntu 16.04.

In fact, each Chinese character can be treated like a letter in English. So if I set "a" = "安" and "b" = "徽", then we can say "ab" = "安徽". To test this hypothesis, I wrote the following code, appended to yours:

fzr$fuzz$QRatio("ab", "a", force_ascii = FALSE)
[1] 67

and the result is the same as

fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     
[1] 67
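The hypothesis can also be checked without fuzzywuzzy. The sketch below is plain Python 3 using difflib; `simple_ratio`, `relabel` and the letter mapping are hypothetical names introduced only for this illustration. It shows that relabelling each character with a Latin letter leaves the score unchanged, and that the score does change if the strings are compared as raw UTF-8 bytes instead of characters, which is why the decoding step matters.

```python
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Similarity in 0..100 based on difflib's 2 * M / T ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def relabel(text, mapping):
    """Replace every character by the letter assigned to it in `mapping`."""
    return "".join(mapping[ch] for ch in text)


mapping = {"安": "a", "徽": "b", "广": "c"}   # illustrative assignment

# compared as characters, the Chinese strings and their relabelled versions score the same
for left, right in [("安徽", "安"), ("安广", "安徽")]:
    original = simple_ratio(left, right)
    relabelled = simple_ratio(relabel(left, mapping), relabel(right, mapping))
    print(left, right, original, relabelled)

# compared as UTF-8 bytes, shared leading bytes inflate the score
char_score = simple_ratio("安广", "安徽")
byte_score = simple_ratio("安广".encode("utf-8"), "安徽".encode("utf-8"))
print(char_score, byte_score)
```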

So this is very helpful for me. Thank you. Are you planning to add this code to the fuzzywuzzyR package to make it more powerful?

mlampros avatar mlampros commented on August 25, 2024

I made a slight change to my previous comment concerning the force_ascii = FALSE parameter, which returns a decoding error (ascii encoding).

I'll try to add this functionality to the package, so I'll leave this issue open until I have some results.

mlampros avatar mlampros commented on August 25, 2024

I added the decoding parameter to the following classes : FuzzExtract, FuzzMatcher and FuzzUtils. The decoding parameter does not apply to the GetCloseMatches and SequenceMatcher classes, because there isn't any force_ascii parameter in the difflib python library.
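For comparison, here is a small sketch of difflib used on its own in Python 3, where `str` is already a sequence of Unicode characters, so no ASCII-related option is involved. This runs outside fuzzywuzzyR and only illustrates the underlying Python function; the `cutoff` values are chosen for the example.

```python
import difflib

choices = ["安徽", "安广"]

# default cutoff is 0.6, so only the exact match (ratio 1.0) is returned
default_matches = difflib.get_close_matches("安广", choices)

# with a lower cutoff the half-matching string (ratio 0.5) is returned as well,
# ordered from best to worst match
loose_matches = difflib.get_close_matches("安广", choices, cutoff=0.5)

print(default_matches)   # ['安广']
print(loose_matches)     # ['安广', '安徽']
```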

Using the initial example,

word = "安广"
choices = c("安徽","安广")


init_proc = fuzzywuzzyR::FuzzUtils$new()

# add some special characters
remove_special_chars = paste0(word, "%&$#!")                                      

print(remove_special_chars)

[1] "安广%&$#!"

# 'utf-8' decoding applies only to 'Full_process' method in the 'FuzzUtils' class
PROC = init_proc$Full_process(string = remove_special_chars, decoding = 'utf-8') 

print(PROC)      # special characters removed

[1] "安广"

# 'utf-8' decoding applies to all methods of the 'FuzzMatcher' class
init_scor = fuzzywuzzyR::FuzzMatcher$new(decoding = 'utf-8')                      

# normally the 'WRATIO' method is initialized with 'force_ascii = TRUE'; here, however, it is overridden by the 'utf-8' decoding
SCOR = init_scor$WRATIO                                                          

# 'utf-8' decoding applies to all methods of the 'FuzzExtract' class
init <- fuzzywuzzyR::FuzzExtract$new(decoding = 'utf-8')                          

fzextr = init$Extract(string = word, sequence_strings = choices, scorer = SCOR)

print(fzextr)

[[1]]
[[1]][[1]]
[1] "安广"

[[1]][[2]]
[1] 100


[[2]]
[[2]][[1]]
[1] "安徽"

[[2]][[2]]
[1] 50
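The two steps in this example (cleaning the string, then ranking the choices) can be sketched in plain Python 3. This is an illustration under stated assumptions, not the package's implementation: `clean`, `simple_ratio` and `extract` are hypothetical helpers, `simple_ratio` is a plain difflib ratio rather than WRATIO, and the `force_ascii` branch only shows why an ASCII-only filter is a problem for Chinese text, since it removes every character.

```python
import re
from difflib import SequenceMatcher


def clean(text, force_ascii=False):
    """Hypothetical cleanup step: drop special characters, lowercase, trim."""
    if force_ascii:
        # an ASCII-only filter discards every Chinese character
        text = "".join(ch for ch in text if ord(ch) < 128)
    text = re.sub(r"\W", " ", text)   # in Python 3, \w is Unicode-aware for str
    return text.lower().strip()


def simple_ratio(s1, s2):
    """Similarity in 0..100 based on difflib's 2 * M / T ratio (not WRATIO)."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def extract(query, choices):
    """Score every choice against the query and sort from best to worst."""
    cleaned_query = clean(query)
    scored = [(choice, simple_ratio(cleaned_query, clean(choice))) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(clean("安广%&$#!"))                      # 安广
print(repr(clean("安广%&$#!", force_ascii=True)))   # ''
print(extract("安广", ["安徽", "安广"]))        # [('安广', 100), ('安徽', 50)]
```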

I uploaded the updated version of the package to GitHub. To install it, use

devtools::install_github(repo = 'mlampros/fuzzywuzzyR')

Would you mind taking a look at the tests I added that are relevant to your case (beginning from line 1748) before I submit the newer version (1.0.2) to CRAN? That way I can be sure I didn't miss anything.

mlampros avatar mlampros commented on August 25, 2024

I'm closing this issue for now; feel free to reopen it in case of any errors or bugs.
