php-name-parser's Introduction

PHP-Name-Parser

PHP library to split names into their respective components. Besides detecting first and last names, this library attempts to handle prefixes, suffixes, initials and compound last names like "Von Fange". It also normalizes prefixes (Mister -> Mr.) and fixes capitalization (JOHN SMITH -> John Smith).

Usage:

include("parser.php");

$parser = new FullNameParser();
$name = $parser->parse_name("Mr Anthony R Von Fange III");
print_r($name);

Results:

Array (
    [nickname] =>
    [salutation] => Mr.
    [fname] => Anthony
    [initials] => R
    [lname] => Von Fange
    [suffix] => III
)

The algorithm:

We start by splitting the full name into separate words. We then do a dictionary lookup on the first and last words to see if they are a common prefix or suffix. Next, we take the middle portion of the string (everything minus the prefix and suffix) and look at everything except the last word of that string. We loop through those words, concatenating them to make up the first name. While doing that, we watch for any indication of a compound last name. It turns out that almost every compound last name starts with one of 16 prefixes (Von, Van, Vere, etc.). If we see one of those prefixes, we break out of the first-name loop and move on to concatenating the last name. We handle capitalization by checking for camel case before uppercasing the first letter of each word and lowercasing everything else. I wrote special cases for periods and dashes. We also have a couple of other special cases, like ignoring words in parentheses altogether.
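
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the library's actual code: the function name `sketch_parse_name` and the tiny dictionaries are hypothetical, and initials, nicknames, and capitalization are left out.

```php
<?php
// Hypothetical sketch of the described algorithm; names and dictionaries are assumptions.
function sketch_parse_name(string $full): array
{
    // Heavily trimmed example dictionaries (the real ones are larger).
    $prefixes  = ['mr' => 'Mr.', 'mister' => 'Mr.', 'mrs' => 'Mrs.', 'dr' => 'Dr.'];
    $suffixes  = ['jr', 'sr', 'ii', 'iii', 'iv', 'phd', 'md'];
    $compounds = ['von', 'van', 'vere', 'de', 'la'];

    // Split the full name into separate words.
    $words = preg_split('/\s+/', trim($full), -1, PREG_SPLIT_NO_EMPTY);
    $key = fn(string $w): string => strtolower(rtrim($w, '.'));

    // Dictionary lookup on the first word (prefix) ...
    $salutation = '';
    $first_key = $words ? $key($words[0]) : '';
    if (isset($prefixes[$first_key])) {
        $salutation = $prefixes[$first_key];
        array_shift($words);
    }

    // ... and on the last word (suffix).
    $suffix = '';
    if (count($words) > 1 && in_array($key($words[count($words) - 1]), $suffixes, true)) {
        $suffix = array_pop($words);
    }

    // Build the first name from everything except the last word,
    // stopping early when a compound last-name prefix shows up.
    $first = [];
    $i = 0;
    $last_index = count($words) - 1;
    while ($i < $last_index && !in_array($key($words[$i]), $compounds, true)) {
        $first[] = $words[$i];
        $i++;
    }

    return [
        'salutation' => $salutation,
        'fname'      => implode(' ', $first),
        'lname'      => implode(' ', array_slice($words, $i)),
        'suffix'     => $suffix,
    ];
}

print_r(sketch_parse_name('Mr Anthony Von Fange III'));
```

With this input the sketch yields `Mr.`, `Anthony`, `Von Fange`, and `III`; the compound prefix `Von` is what ends the first-name loop.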

Check examples.php for the test suite and examples of how various name formats are parsed.

Possible improvements

  • Handle the "Lname, Fname" format
  • Separate the parsing of the name from the normalization & capitalization & make those optional
  • Separate the dictionaries from the code to make it easier to do localization
  • Add common name libraries to allow for things like gender detection

Same logic, different languages

Credits & license:

php-name-parser's People

Contributors

anchepiece, arnidan, atla5, gh-o-st, jenky, jhoughtelin, joshfraser, krlnwll, luiz-brandao, squatto, toxaris-nl, waskosky

php-name-parser's Issues

Failed test. Name part starting with a matched common profession

This test case shows that any last name starting with a string that matches a common professional suffix is incorrectly classified, in its entirety, as a suffix.

Expected
OLD MACDONALD
'fname' => 'Old'
'lname' => 'Macdonald'

Got

  1. FullNameParserTest::testName with data set # 13 ('OLD MACDONALD', array('', 'Old', '', 'Macdonald', ''))
    Failed asserting that Array &0 (
    'salutation' => ''
    'fname' => 'Old'
    'initials' => ''
    'lname' => 'Macdonald'
    'suffix' => ''
    ) is identical to Array &0 (
    'salutation' => ''
    'fname' => 'Old'
    'initials' => ''
    'lname' => ''
    'suffix' => 'MACDONALD'
    ).

Notes
In this case the leading "MA" of "MACDONALD" is matching the suffix entry 'MA'. The same applies to other cases as well, and it seems like the regex is not respecting the word boundary \b metacharacter.
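
One way to avoid this kind of partial match is to require the whole word to match a suffix entry. The library's actual regex is not shown in this report, so the function name `sketch_is_suffix` and the suffix list below are assumptions made only for illustration.

```php
<?php
// Hypothetical check; 'MA' stands in for the dictionary entry that matched "MACDONALD".
function sketch_is_suffix(string $word, array $suffixes = ['MA', 'MD', 'PhD', 'Jr']): bool
{
    // Escape each entry so characters such as "." are treated literally.
    $alternatives = implode('|', array_map(fn($s) => preg_quote($s, '/'), $suffixes));

    // ^ and $ force the entire word to match, so a shared leading "MA" is not enough.
    return preg_match('/^(?:' . $alternatives . ')\.?$/i', $word) === 1;
}

var_dump(sketch_is_suffix('MA'));        // bool(true)
var_dump(sketch_is_suffix('MACDONALD')); // bool(false)
```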

parse_name to accept two parameters?

Hi,
Great job! Have you considered modifying the main function parse_name to accept two parameters, such as $firstname and $lastname? My first names and last names are stored separately, but when I feed them in directly the last name is added to the first name :)
Could this be a feature request?

Test cases failing

I was excited to find this open-source library of yours. Thanks for making it.

I'm trying to replace my own custom name parser with yours because I bet you've put more thought into yours.

However, these 4 test cases of mine failed when trying to use your parser:

  1. fname for 'Prof. Hilma Mraz, Ph.D.' should be 'Hilma'
  2. lname for 'Prof. Hilma Mraz, Ph.D.' should be 'Mraz'
  3. fname for 'Ashley Jones, Ph.D.' should be 'Ashley'
  4. lname for 'Ashley Jones, Ph.D.' should be 'Jones'

Right?

Strange inconsistency with latest update

I'm not sure what's going on here... Somehow there seem to be two instances of the name Anthony Von Fange III, PhD, and while one of them shows a "success" message, the other shows a failure.

I'm not even sure how this is possible?

here's an image highlighting the issue...

Last Name Comma First Name comes out wrong

Isn't LastName<comma><space>FirstName a very common way of writing out a person's full name?

print_r(FullNameParser::parse('LASTNAME, FIRSTNAME'));

Array
(
    [salutation] => 
    [fname] => Lastname
    [initials] => 
    [lname] => Firstname
    [lname_base] => Firstname
    [lname_compound] => 
    [suffix] => 
)
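
A possible workaround, sketched here under assumptions, is to reorder the string before handing it to the parser. The helper name `sketch_reorder_comma_name` and its suffix list are hypothetical and not part of the library; the suffix check is there so that input such as "Ashley Jones, Ph.D." is left alone.

```php
<?php
// Hypothetical pre-processing step: turn "Last, First" into "First Last".
function sketch_reorder_comma_name(
    string $name,
    array $suffixes = ['phd', 'md', 'jr', 'sr', 'ii', 'iii', 'iv']
): string {
    $parts = array_map('trim', explode(',', $name));

    // Only handle the simple shape with exactly one comma and two non-empty parts.
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        return $name;
    }

    // If the text after the comma is a known suffix, it is not a first name.
    $after = strtolower(str_replace('.', '', $parts[1]));
    if (in_array($after, $suffixes, true)) {
        return $name;
    }

    return $parts[1] . ' ' . $parts[0];
}

echo sketch_reorder_comma_name('LASTNAME, FIRSTNAME'), "\n"; // FIRSTNAME LASTNAME
echo sketch_reorder_comma_name('Ashley Jones, Ph.D.'), "\n"; // Ashley Jones, Ph.D.
```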

Middle Name

Are there any plans to add support for parsing of middle names, for example in a string like:

"Jonathan Randolf Jefferson"?

:)

split "lastname" into "surname prefix" and "surname"

Hi Josh, I see you have a nice name parser. One of the output fields is "lastname"; would it be possible to add two additional fields, "surname prefix" and "surname", in essence splitting "lastname" intelligently into two parts? Some applications require that separation, and it is difficult to work out how to split a last name into those two parts. Do you have a solution?

Build release

Hi. Maybe it's time to build a release? We want to use Composer with your project.

Is this repo still maintained?

Great work! Thanks. Your package looks like exactly what I need. Is this repo still maintained, and does it work with PHP 8.0 and/or 8.1?

Extending FullNameParser

While trying to extend the base FullNameParser class, I noticed a problem with the static entry point:

/**
 * Parse Static entry point.
 *
 * @param string $name the full name you wish to parse
 * @return array returns associative array of name parts
 */
public static function parse($name) {
    $parser = new static();
    return $parser->parse_name($name);
}

This static entry function should use new static() rather than new self(); with new self() it always creates the base class, which makes the function useless when extending. Thanks in advance!
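
The difference between the two keywords can be seen in a small standalone example. The class and method names below are hypothetical and exist only to show PHP's late static binding; they are not taken from the library.

```php
<?php
// Hypothetical classes illustrating new self() versus new static().
class SketchParser
{
    // Always instantiates the class in which this method is written.
    public static function viaSelf(): self
    {
        return new self();
    }

    // Instantiates the class the method was called on (late static binding).
    public static function viaStatic(): self
    {
        return new static();
    }
}

class SketchChildParser extends SketchParser
{
}

echo get_class(SketchChildParser::viaSelf()), "\n";   // SketchParser
echo get_class(SketchChildParser::viaStatic()), "\n"; // SketchChildParser
```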

Altered parsed name containing unicode character ł

Hi there,

While trying to parse a name containing a Unicode character (e.g. M. Test-ł), I saw that the result is altered (M. Test-▒).

After checking the code, I managed to trace the alteration to this line.
The strtolower function does not handle multi-byte characters, which explains the alteration. Replacing strtolower with mb_strtolower solves the problem.

So I'm wondering whether there is interest in replacing the simple string functions with their multi-byte versions to make the parser suitable for international names?
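
As a rough illustration of what a multi-byte-safe capitalization step could look like, here is a minimal sketch. The helper name `sketch_mb_ucfirst_lower` is hypothetical, the input is assumed to be UTF-8, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: uppercase the first character and lowercase the rest,
// using mbstring functions so multi-byte characters stay intact.
function sketch_mb_ucfirst_lower(string $word): string
{
    $first = mb_strtoupper(mb_substr($word, 0, 1, 'UTF-8'), 'UTF-8');
    $rest  = mb_strtolower(mb_substr($word, 1, null, 'UTF-8'), 'UTF-8');

    return $first . $rest;
}

echo sketch_mb_ucfirst_lower('ŁUKASZ'), "\n"; // Łukasz
echo sketch_mb_ucfirst_lower('TEST-Ł'), "\n"; // Test-ł
```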

Thanks.

Issue with is_compound on index 0

array_search returns the matching key, so it returns 0 when the match is the first entry in the array, and 0 evaluates as false in a boolean check. It should be changed to in_array.

Before:
protected function is_compound($word) {
    return array_search(mb_strtolower($word), $this->dict['compound']);
}

After:
protected function is_compound($word) {
    return in_array(mb_strtolower($word), $this->dict['compound']);
}
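
The pitfall is easy to reproduce in isolation. The array below is a hypothetical stand-in for the compound dictionary, with the searched word deliberately placed at index 0.

```php
<?php
// Hypothetical stand-in for the compound dictionary.
$compound = ['von', 'van', 'de'];

// array_search returns the key of the match, here int(0).
$index = array_search('von', $compound);

var_dump((bool) $index);                    // bool(false): a loose check reads this as "not found"
var_dump($index !== false);                 // bool(true): a strict comparison gets it right
var_dump(in_array('von', $compound, true)); // bool(true): in_array returns a boolean directly
```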

Parse wrong if last char is '.'

Try this:

$parser = new FullNameParser();
$split_name = $parser->parse_name('Fname Lname, Ph.D.');
var_dump($split_name);
/*
array(7) {
    ["salutation"]=>
    string(0) ""
    ["fname"]=>
    string(11) "Fname Lname"
    ["initials"]=>
    string(0) ""
    ["lname"]=>
    string(5) "Ph.D."
    ["lname_base"]=>
    string(5) "Ph.D."
    ["lname_compound"]=>
    string(0) ""
    ["suffix"]=>
    string(0) ""
  }
*/

Dutch Surname prefix

Are there any plans to implement Dutch notation for surnames?
For instance, Peter de Vries would be listed under V, not under d.
So I would very much like a name parser that splits the surname like this:
surname: Vries
surname prefix: de
Other examples:
Stijn van der Brekel
surname: Brekel
surname prefix: van der
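
A split like the one requested could be sketched as follows. The function name `sketch_split_surname` and the particle list are assumptions for illustration only; a real implementation would need a fuller, per-language list.

```php
<?php
// Hypothetical helper: separate leading surname particles from the base surname.
function sketch_split_surname(
    string $lname,
    array $particles = ['van', 'der', 'den', 'de', 'het', 'ten', 'ter', 'te']
): array {
    $words = preg_split('/\s+/', trim($lname), -1, PREG_SPLIT_NO_EMPTY);
    $prefix = [];

    // Peel off leading particles, always leaving at least one word as the surname.
    while (count($words) > 1 && in_array(strtolower($words[0]), $particles, true)) {
        $prefix[] = array_shift($words);
    }

    return [
        'prefix'  => implode(' ', $prefix),
        'surname' => implode(' ', $words),
    ];
}

print_r(sketch_split_surname('van der Brekel'));
```

For "van der Brekel" this gives the prefix "van der" and the surname "Brekel", which is the value a Dutch-style sort would use.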

Initialed first name assumed to be middle initial

I'm not sure whether this is an issue or not; it could be interpreted either way. I'm opening it up for discussion.

For example: "J. Edgar Hoover" or "M. Night Shyamalan" are currently parsed as:

Array
(
    [salutation] => 
    [fname] => Edgar
    [initials] => J.
    [lname] => Hoover
    [lname_base] => Hoover
    [lname_compound] => 
    [suffix] => 
)

If this name is re-assembled in another system it would be assumed to be "Edgar J. Hoover" which would be incorrect.

An alternative would be to make fname "J. Edgar" in this situation, with no initials.
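
That alternative could be detected with a check along these lines. This is only a sketch under assumptions: the function name `sketch_first_name_with_leading_initial` is hypothetical, and it treats any single leading letter followed by at least two more words as part of the first name.

```php
<?php
// Hypothetical check: fold a leading initial into the first name.
// Returns the combined first name, or null when the pattern does not apply.
function sketch_first_name_with_leading_initial(string $name): ?string
{
    $words = preg_split('/\s+/', trim($name), -1, PREG_SPLIT_NO_EMPTY);

    // A single letter (optionally followed by a period) in first position,
    // with at least two more words after it.
    if (count($words) >= 3 && preg_match('/^\p{L}\.?$/u', $words[0]) === 1) {
        return $words[0] . ' ' . $words[1];
    }

    return null;
}

var_dump(sketch_first_name_with_leading_initial('J. Edgar Hoover')); // string(8) "J. Edgar"
var_dump(sketch_first_name_with_leading_initial('John R Smith'));    // NULL
```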

I pulled a random sample of 1000 people from a large database and parsed their names; this script was 96.8% accurate. If this one issue were fixed, 13 additional splits would work, raising the accuracy to 98.1%.

Comma preceding suffix is retained and saved to the last name

Given a name like the following:
Jonathan Smith, MD

The comma on the last name is retained and saved to "lname". This is contrary to the documented example in examples.php.

I believe the expected behavior is that the comma is removed from the input except in the case of the comma being within the suffix, as in this example:
Jonathan Smith IV, PhD

Both names are borrowed from examples.php but they don't evaluate as shown.
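
The expected behavior described above could be approximated with a pre-processing step like the one below. The function name `sketch_split_comma_suffix` and the suffix list are hypothetical; this is a sketch of the idea, not the library's implementation.

```php
<?php
// Hypothetical helper: separate a comma-introduced suffix from the rest of the name.
function sketch_split_comma_suffix(
    string $name,
    array $suffixes = ['jr', 'sr', 'ii', 'iii', 'iv', 'md', 'phd']
): array {
    $norm = fn(string $w): string => strtolower(str_replace('.', '', $w));

    $pos = strpos($name, ',');
    if ($pos === false) {
        return ['name' => trim($name), 'suffix' => ''];
    }

    $before = preg_split('/\s+/', trim(substr($name, 0, $pos)), -1, PREG_SPLIT_NO_EMPTY);
    $after  = trim(substr($name, $pos + 1));

    // If the text after the comma is not a known suffix, leave the input untouched.
    if (!in_array($norm($after), $suffixes, true)) {
        return ['name' => trim($name), 'suffix' => ''];
    }

    // "Smith IV, PhD": the comma sits inside the suffix, so keep it there.
    if (count($before) > 1 && in_array($norm($before[count($before) - 1]), $suffixes, true)) {
        $after = array_pop($before) . ', ' . $after;
    }

    return ['name' => implode(' ', $before), 'suffix' => $after];
}

print_r(sketch_split_comma_suffix('Jonathan Smith, MD'));
print_r(sketch_split_comma_suffix('Jonathan Smith IV, PhD'));
```

In both calls the name part comes back as "Jonathan Smith" with no trailing comma; the suffix is "MD" in the first case and "IV, PhD" in the second.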

Non-US support?

The rules seem to be focussed on US-style names. Is there any support or possibility for optional (pluggable/configurable) non-US name support?

As an example, my native tongue (Dutch) has the untranslatable concept of "Tussenvoegsel" (https://en.wikipedia.org/wiki/Tussenvoegsel), which my own name happens to use. My full name is "Martijn van der Lee", my first name being "Martijn" and my last name being "Lee", with "van der" being the third ("Tussenvoegsel") part. Having my last name as "van der Lee" (or worse, "Van Der Lee") would be wrong and would cause sorting to be incorrect when used in the Netherlands.

I understand many languages (at least Irish, French, German) have similar rules for names, and it would be nice if there were a single name parser which could be configured for multiple cultures/languages or possibly even auto-detect them.
