statastringutilities's Introduction

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

Stata String Utilities

This package contains two Stata programs that are wrappers for Java plugins: phoneticenc and strdist.

The phoneticenc command provides users with alternatives to the soundex and soundex_nara functions native to Stata 14.
These include the Beider-Morse, Caverphone 1, Caverphone 2, Daitch-Mokotoff, Double Metaphone, Kölner Phonetik, Match Rating Approach, Metaphone, and NYSIIS phonetic encoding algorithms.

The strdist command provides users with several different string similarity and distance metrics including: Cosine similarity/distance, Damerau distance, Jaccard similarity/distance, Jaro-Winkler similarity/distance, Jaro similarity/distance, Levenshtein edit distance, Longest Common Subsequence distance, Bakkelund Longest Common Subsequence distance, N-Gram distance, Normalized Levenshtein similarity/distance, Q-Gram distance, and the Sorensen Dice similarity/distance metrics.

Examples

Phonetic String Encoding

The example below shows how the phoneticenc command can be used to generate several different phonetic encodings of a given string.

. sysuse auto.dta, clear
. phoneticenc make, caverphone1(cav1) caverphone2(cav2) col(kolner) dms(daitch) dblm(dblmeta) metap(metaphone) nys(nysiis) beiderm(bmencode) matchrating(mrating)
. li make cav1 cav2 kolner daitch in 1/5

     +---------------------------------------------------------------------------------+
     | make              cav1         cav2                             kolner   daitch |
     |---------------------------------------------------------------------------------|
  1. | AMC Concord     AMKNKT   AMKNKTNNNN   06846472656565656565656565656565   064649 |
  2. | AMC Pacer       AMKPSN   AMKPSNNNNN     068187656565656565656565656565   064749 |
  3. | AMC Spirit      AMKSPR   AMKSPRTNNN    0688172656565656565656565656565   064793 |
  4. | Buick Century   PKSNTR   PKSNTRNNNN     148627656565656565656565656565   754639 |
  5. | Buick Electra   PKLKTR   PKLKTRNNNN     145827656565656565656565656565   758439 |
     +---------------------------------------------------------------------------------+

. li make dblmeta metaphone nysiis mrating in 1/5

     +-------------------------------------------------------+
     | make            dblmeta   metaph~e   nysiis   mrating |
     |-------------------------------------------------------|
  1. | AMC Concord        AMKN       AMKK   ANCANC    AMCLNL |
  2. | AMC Pacer          AMKP       AMKP   ANCPAC    AMCLNL |
  3. | AMC Spirit         AMKS       AMKS   ANCSPA    AMCLNL |
  4. | Buick Century      PKSN       BKSN   BACANT    BCKLNL |
  5. | Buick Electra      PKLK       BKLK   BACALA    BCKLNL |
     +-------------------------------------------------------+

. li make bmencode in 1/5

     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  1. |                                                                                                 make                                                                                                           |
     |                                                                                                 AMC Concord                                                                                                    |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                       |
     | amgzonkordnulnulnulnulnulnulnulnulnulnulnulnul|amgzonzordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkurdnulnulnulnulnulnulnulnulnulnulnulnul|amkontsordnulnulnu.. |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  2. |                                                                                                 make                                                                                                           |
     |                                                                                                 AMC Pacer                                                                                                      |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                       |
     | amkpakirnulnulnulnulnulnulnulnulnulnulnulnul|amkpasirnulnulnulnulnulnulnulnulnulnulnulnul|amkpatsirnulnulnulnulnulnulnulnulnulnulnulnul|amkpazirnulnulnulnulnulnulnulnulnulnulnulnul|amkpokirnulnulnulnulnul.. |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  3. |                                                                                                 make                                                                                                           |
     |                                                                                                 AMC Spirit                                                                                                     |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                       |
     | amkspirinulnulnulnulnulnulnulnulnulnulnulnul|amkspiritnulnulnulnulnulnulnulnulnulnulnulnul|amtspiritnulnulnulnulnulnulnulnulnulnulnulnul|amzspiritnulnulnulnulnulnulnulnulnulnulnulnul|ankspirinulnulnulnuln.. |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  4. |                                                                                                 make                                                                                                           |
     |                                                                                                 Buick Century                                                                                                  |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                       |
     | bDknturinulnulnulnulnulnulnulnulnulnulnulnul|bDksnturinulnulnulnulnulnulnulnulnulnulnulnul|bDktsnturinulnulnulnulnulnulnulnulnulnulnulnul|bDtsksnturinulnulnulnulnulnulnulnulnulnulnulnul|bDtsktsnturinulnul.. |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  5. |                                                                                                 make                                                                                                           |
     |                                                                                                 Buick Electra                                                                                                  |
     |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                       |
     | bDkiliktranulnulnulnulnulnulnulnulnulnulnulnul|bDkiliktronulnulnulnulnulnulnulnulnulnulnulnul|bDkilitstranulnulnulnulnulnulnulnulnulnulnulnul|bDkilitstronulnulnulnulnulnulnulnulnulnulnulnul|bDkliktranulnu.. |
     +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

String Distance

These examples are based on similar examples in the help file for the jarowinkler program developed by James Feigenbaum and available from the SSC archives.

. sysuse census, clear
(1980 Census data by state)

. keep state state2

. // Get all of the different distance and similarity metrics
. strdist state state2, coss(cosine_sim) cosd(cosine_dist) damerau(dam)            ///
> jaccards(jaccard_sim) jaccardd(jaccard_dist) lev(levenshtein)                    ///
> longsubstr(longsubstring) met(metriclcs) ngramd(ngram_distance) ngramc(4)        ///
> normlevs(normlev_similarity) normlevd(normlev_distance) qgramd(qgram_dist)       ///
> qgramc(4) dices(sorensen_similarity) diced(sorensen_distance)                    ///
> jarowinklers(jw_sim) jarowinklerd(jw_dist)

. // Get the Jaro only metrics
. strdist state state2, jarowinklers(jaro_sim) jarowinklerd(jaro_dist) jarowinklerc("-1")

. // Describe the data set
. desc

Contains data from C:\Program Files (x86)\Stata14\ado\base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            20                          6 Apr 2014 15:43
 size:         8,000
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
cosine_sim      double  %10.0g                Cosine String Similarity
cosine_dist     double  %10.0g                Cosine String Distance
dam             double  %10.0g                Damerau String Distance
jaccard_sim     double  %10.0g                Jaccard String Similarity
jaccard_dist    double  %10.0g                Jaccard String Distance
jw_sim          double  %10.0g                Jaro Winkler String Similarity
jw_dist         double  %10.0g                Jaro Winkler String Distance
levenshtein     double  %10.0g                Levenshtein String Distance
longsubstring   double  %10.0g                Longest Common Substring Distance
metriclcs       double  %10.0g                Bakkelund String Distance
ngram_distance  double  %10.0g                N-Gram String Distance
normlev_simil~y double  %10.0g                Normalized Levenshtein String Similarity
normlev_dista~e double  %10.0g                Normalized Levenshtein String Distance
qgram_dist      double  %10.0g                Q-Gram String Distance
sorensen_simi~y double  %10.0g                Sorensen Dice String Similarity
sorensen_dist~e double  %10.0g                Sorensen Dice String Distance
jaro_sim        double  %10.0g                Jaro String Similarity
jaro_dist       double  %10.0g                Jaro String Distance
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. // Display some of the metrics along side their respective strings
. li state state2 jw_dist jaro_dist jw_sim jaro_sim in 1/5, ab(40)

     +---------------------------------------------------------------------+
     | state        state2     jw_dist   jaro_dist      jw_sim    jaro_sim |
     |---------------------------------------------------------------------|
  1. | Alabama      AL       .19047624   .19047624   .80952376   .80952376 |
  2. | Alaska       AK       .44444442   .39999998   .55555558   .60000002 |
  3. | Arizona      AZ       .21428573   .21428573   .78571427   .78571427 |
  4. | Arkansas     AR       .19999999   .19999999   .80000001   .80000001 |
  5. | California   CA       .21333331   .21333331   .78666669   .78666669 |
     +---------------------------------------------------------------------+

. li state state2 dam jaccard* levenshtein in 1/5, ab(40)

     +----------------------------------------------------------------------+
     | state        state2   dam   jaccard_sim   jaccard_dist   levenshtein |
     |----------------------------------------------------------------------|
  1. | Alabama      AL         5             0              1             5 |
  2. | Alaska       AK         4             0              1             4 |
  3. | Arizona      AZ         5             0              1             5 |
  4. | Arkansas     AR         6             0              1             6 |
  5. | California   CA         8             0              1             8 |
     +----------------------------------------------------------------------+

. li state state2 longsubstring metriclcs norm*  in 1/5, ab(40)

     +-----------------------------------------------------------------------------------------+
     | state        state2   longsubstring   metriclcs   normlev_similarity   normlev_distance |
     |-----------------------------------------------------------------------------------------|
  1. | Alabama      AL                   5   .71428571            .28571429          .71428571 |
  2. | Alaska       AK                   4   .66666667            .33333333          .66666667 |
  3. | Arizona      AZ                   5   .71428571            .28571429          .71428571 |
  4. | Arkansas     AR                   6         .75                  .25                .75 |
  5. | California   CA                   8          .8                   .2                 .8 |
     +-----------------------------------------------------------------------------------------+

. li state state2 ngram* qgram* sorensen* in 1/5, ab(40)

     +---------------------------------------------------------------------------------------------+
     | state        state2   ngram_distance   qgram_dist   sorensen_similarity   sorensen_distance |
     |---------------------------------------------------------------------------------------------|
  1. | Alabama      AL             .2857143            4                     0                   1 |
  2. | Alaska       AK            .16666667            3                     0                   1 |
  3. | Arizona      AZ            .14285715            4                     0                   1 |
  4. | Arkansas     AR                  .25            5                     0                   1 |
  5. | California   CA                   .2            7                     0                   1 |
     +---------------------------------------------------------------------------------------------+

Additional Information

Requires JRE 1.8 or later

statastringutilities's Issues

strdist error message - java.lang.NoSuchMethodError

Hi. I am getting a java.lang.NoSuchMethodError when using the strdist command from the STRUTIL package.

At first I thought it was something I was doing. I did a complete reinstall, and the phonetic encoding commands continue to work fine. However, the strdist command returns this error. I've recreated it below with your example.

. sysuse census
(1980 Census data by state)

. keep state state2

. strdist state state2, jarowinklers(jaro_sim)
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.stata.Javacall.load(Javacall.java:130)
at com.stata.Javacall.load(Javacall.java:90)
Caused by: java.lang.NoSuchMethodError: org.paces.Stata.MetaData.Meta: method <init>()V not found
at org.paces.Stata.StringUtils.Similarity.DistanceMetrics.<init>(DistanceMetrics.java:169)
at org.paces.Stata.StringUtils.StringUtilities.distance(StringUtilities.java:53)
... 6 more

This error is occurring on a PC, after a recent upgrade of my work machine.
I've not had the error on my Mac: STRUTIL was installed there last year when I first discovered your program, and everything works on that installation.

I see from this exchange (wbuchanan/StataJSON#22) that a similar error occurred, perhaps due to a dependency issue with StataJavaUtilities. Perhaps this is another in the same vein.

I hope this is something you can easily address.

Stephen

phoneticenc - issues

Hi. I've been working with the strutil package on a dataset of nearly 3500 student records. I'm matching records to my master list of students and comparing names to be sure student results are encoded to the correct students.

Anyway, I'm using the phoneticenc command. The metaphone, double metaphone, and nysiis methods return only a single value for all 3500 records (metaphone and double metaphone return "NLNL"; nysiis returns "NALNAL"). The beidermorse method hangs and produces the error shown in the screenshot below:
[Screenshot: phoneticenc_beidermorse_lname_error]

Everything else seems to be working as expected.

Stephen

help strutil

Hi. Thanks for pulling this package together.

Typing "help phoneticenc" returns the help for strutil. Typing "help strutil" returns nothing to the help viewer. ADO package installed easily enough and I was able to figure out how to get the information on using the necessary commands. Might be confusing for some otherwise.

Also, the description for STRUTIL returned by ado describe starts with "plugins to make it easier to make better looking graphs in Stata." I think the next paragraph is a better match for the description.

[Screenshot: help_strutil]
[Screenshot: ado_describe_strutil]

I work in the Accountability Office for the San Bernardino City Unified School District. I am cleaning some data files and was excited to see that you had provided a wrapper for these. You can reach me at [email protected] if you have more questions.

Stephen

strdistance installation and StataJavaUtilities binary

I installed the package with net inst strutil and tried the demonstration program. However, I got the result below. I'm using Stata 15 over a remote server. W Buchanan has stated: "Looks like you need the StataJavaUtilities binary as well". More details about what I need to install would be helpful.

strdist state state2, coss(cosine_sim) cosd(cosine_dist) damerau(dam) jaccards(jaccard_sim) jaccardd(jaccard_dist)

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.stata.Javacall.load(Javacall.java:132)
at com.stata.Javacall.load(Javacall.java:92)
Caused by: java.lang.NoClassDefFoundError: org/paces/Stata/MetaData/Meta
at org.paces.Stata.StringUtils.Similarity.DistanceMetrics.<init>(DistanceMetrics.java:169)
at org.paces.Stata.StringUtils.StringUtilities.distance(StringUtilities.java:53)
... 6 more
r(5100);

unrecognized command: phoneticenc not defined by phoneticenc.ado

phoneticenc make, caverphone1(cav1) caverphone2(cav2) col(kolner) dms(daitch) dblm(dblmeta) metap(metaphone) nys(nysiis) beiderm(bmencode) matchrating(mrati
> ng)
unrecognized command:  phoneticenc not defined by phoneticenc.ado
r(199);

I'm following the Phonetic String Encoding example. When I tried to run the above command, it returned an error. Thanks.

cossim output

Hi William,

I have a question regarding the output of cossim. I use two string variables as the input: x = "111111111", and y = "011010001". The output from the "strdist x y, cossim(cossim)" is 0.75592895.
If I treat each of the string variables as a vector of nine digits, and use the cosine similarity formula (x . y) / (||x|| . ||y||), the value is 2 / 4.2426 = 0.4714
I am wondering what causes the discrepancy here.

Best,
Henry
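
As a worked check of the arithmetic in this question, the block below treats both strings as nine-element 0/1 vectors, exactly as the question proposes. It is only a sketch of the vector formula, not a description of how strdist computes its cosine metric. Under that reading the value comes out to 2/3, which matches neither figure quoted above. If strdist builds its vectors from character n-gram counts (an assumption, common in string-similarity libraries but not confirmed in this thread), its result would not be expected to match a per-digit calculation at all.

```latex
\begin{aligned}
x &= (1,1,1,1,1,1,1,1,1), \qquad y = (0,1,1,0,1,0,0,0,1) \\
x \cdot y &= 4, \qquad \lVert x \rVert = \sqrt{9} = 3, \qquad \lVert y \rVert = \sqrt{4} = 2 \\
\cos(x, y) &= \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{4}{3 \cdot 2} = \frac{2}{3} \approx 0.6667
\end{aligned}
```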

Problems with installation and solution

Hi,

Thanks for the program! I tried to install and run the package in Stata/SE 16.1 for Windows (64-bit x86-64). I installed the package using:
net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")

It had problems initially, and I tried adding to the adopath, but the program only started to run after I did the following:

Copy "StataJavaUtilities.jar" to C:\ado\plus/StataJavaUtilities.jar

I hope this information will be helpful to others!
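
The manual copy described above can also be done from within Stata. This is a minimal sketch that assumes StataJavaUtilities.jar has already been downloaded into the current working directory (where the file comes from is not stated in this thread), and that the PLUS directory reported by sysdir is where Stata should find it, as it was for the poster.

```stata
* Show where Stata looks for user-installed files
* (PLUS is typically C:\ado\plus/ on Windows)
sysdir

* Copy the jar from the working directory into the PLUS directory
* Assumption: the jar sits in the current working directory
copy "StataJavaUtilities.jar" "`c(sysdir_plus)'StataJavaUtilities.jar", replace

* Confirm the file is now visible along the adopath
findfile StataJavaUtilities.jar
```

If findfile reports the file as not found, the jar is not on the adopath, and the Java errors shown in the other reports here are likely to persist.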

phoneticenc ... dblm() with missingness fails in Stata 16

phoneticenc works in Stata 15 but fails in Stata 16. I am using Stata/MP for Windows (64-bit x86-64):

clear
input str15 last_name
"shadbolt"
""        
"ayres"   
"campbell"
"parmakli"
end

phoneticenc last_name, dblmetaphone(dblml)

The error in Stata 16 is

java.lang.reflect.InvocationTargetException
        at jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknow
> n Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at com.stata.Javacall.load(Javacall.java:130)
        at com.stata.Javacall.load(Javacall.java:90)
Caused by: java.lang.NullPointerException
r(5100);

The error is caused by the missingness. For example, this code works in both Stata 15 and 16:

clear
input str15 last_name
"shadbolt"
"hello"        
"ayres"   
"campbell"
"parmakli"
end

phoneticenc last_name, dblmetaphone(dblml)

I am unable to fix the problem under version control. Basically, I want to use Stata 16, but if I do, I can no longer use your program because of the error. Please advise. I can cross-post on Statalist if you would like.
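
Until the plugin handles empty strings, one possible workaround is to encode only the non-missing rows and merge the result back. This is a sketch under the assumption, taken from the report above, that the error is triggered only by missing values. The variable name row_id and the temporary file are introduced here for illustration and are not part of the package.

```stata
* Tag each observation so the encodings can be merged back
generate long row_id = _n

preserve
    * Keep only rows with a non-empty name before calling the plugin
    keep if !missing(last_name)
    phoneticenc last_name, dblmetaphone(dblml)
    keep row_id dblml
    tempfile encoded
    save `encoded'
restore

* Rows that were dropped keep a missing encoding
merge 1:1 row_id using `encoded', nogenerate
```

Because tempfile creates a local macro, these lines need to be run together from a do-file, not typed one at a time.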

strdist return values

Sometimes strdist returns numbers very close to 1, when it should return 1. For example:
set obs 1
gen a = "abcde"
gen b = a
strdist a b, coss(x)
format x %20.18f
list , clean noobs
gives:
abcde abcde 1.000000000000000200

strdist returns its scores as doubles. Maybe it should be returning singles (floats) instead? The underlying cause of the problem is probably a mix of singles and doubles somewhere in the depths of the code.
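
A result of this kind can arise from double-precision arithmetic alone, without any mixing of singles and doubles. The sketch below assumes, without confirmation from the plugin source, that the cosine score is a dot product divided by a product of two square roots. If the squared norm is 3, the product of the roots falls just below 3 in double precision, so the ratio lands just above 1. The last line shows one way to clean up such scores after the fact, using a tolerance chosen here only for illustration.

```stata
* sqrt(3)*sqrt(3) is not exactly 3 in double precision,
* so this ratio is not exactly 1
display %20.18f 3/(sqrt(3)*sqrt(3))

* Recreate the reported case
clear
set obs 1
generate a = "abcde"
generate b = a
strdist a b, coss(x)

* Snap scores within a small tolerance of 1 back to exactly 1
replace x = 1 if reldif(x, 1) < 1e-12
```

Storing the score as a float would hide the symptom by rounding, but it would also discard precision for every other score, so a tolerance-based comparison is the gentler option.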
