
Comments (7)

slneufeld commented on May 18, 2024

Yes, ampersand is a good one, as is 'n or 'N. Dashes are also a challenge.

Regarding silentrob's comments, agreed, I'm not looking to bastardize a general purpose tool for my own selfish needs.

My use case is very specific to noun discovery, which in the case of brand names or trademarks, and to a large degree retailer names, can be very challenging. Our database alone has tens of thousands of such entities, many of them combining different parts of speech -- Big Red, Cap N Crunch -- or just difficult to logically group together, like Shop Rite, Stop & Shop, Wal*Mart and so on. And while all I really need are IN's, CC's and NN's, I need to know them really, really well for my solution to be successful, and my preference is to keep this on the client-side and use logic versus a brute force dictionary-style approach.

On top of nlp_compromise, I've implemented two hacks -- one for the double-quote capability, and the more important one is a hacky converter that looks at token tags, looks to the previous and/or next tag, and based on the three, converts the one I'm operating on into an IN, a CC, an NN, or basically an ignore. The sentences used in my solution are essentially grouped into "show me x1, x2 and x3 for y1, y2 and y3 during z1, z2 and z3", with x's being one group of like objects, y's being another, z's another, and so on. The complexity is in simply allowing the user more flexibility to be descriptive in their sentence, i.e. "show me both x1 and x2 in y1, y2 and y3 during the periods z1 and z2", and ensuring that I can add flexibility in the future for more complex statements. So I'm not inclined to create a "dumb parser" that looks for break words, commas and the like, because I think ultimately identifying parts of speech will afford me the flexibility and future-proofing I need.
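For readers following along, here is a minimal sketch of what such a context-window converter might look like. It is not slneufeld's actual code: the rule table, the `IGNORE` label, the function names, and the hand-written input tags are all assumptions made for illustration. It is written in Python for brevity; the same logic should port directly to client-side JavaScript.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)

# Hypothetical rules: (previous tag, current tag, next tag) -> replacement tag.
# None in a pattern position matches anything, including a missing neighbour.
RULES = [
    ((None, "JJ", "NN"), "NN"),  # adjective directly before a noun: treat as part of the name
    (("IN", "VB", "NN"), "NN"),  # "verb" between a preposition and a noun: probably a name
]

KEEP = {"IN", "CC", "NN"}  # the only tags the use case cares about


def _matches(pattern: Optional[str], actual: Optional[str]) -> bool:
    return pattern is None or pattern == actual


def coarsen(tokens: List[Token]) -> List[Token]:
    """Rewrite each tag using its neighbours, then collapse the rest to IGNORE."""
    out = []
    for i, (text, tag) in enumerate(tokens):
        # Neighbours are read from the original tags, not the rewritten ones.
        prev_tag = tokens[i - 1][1] if i > 0 else None
        next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        new_tag = tag
        for (p, c, n), replacement in RULES:
            if c == tag and _matches(p, prev_tag) and _matches(n, next_tag):
                new_tag = replacement
                break  # first matching rule wins
        out.append((text, new_tag if new_tag in KEEP else "IGNORE"))
    return out


# Tags below are written by hand to stand in for a tagger's output.
example = [("show", "VB"), ("me", "PRP"), ("Big", "JJ"), ("Red", "NN"),
           ("in", "IN"), ("Shop", "VB"), ("Rite", "NN")]
result = coarsen(example)
```

With these assumed rules, "Big" and "Shop" are retagged as NN because of their neighbours, while "show" and "me" collapse to `IGNORE`.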

from compromise.

spencermountain commented on May 18, 2024

hi! of course. That's a great idea. I agree that a quote with only a few words is a strong Noun signal.
In that case too, the ampersand also seems like a good signal for a noun. Do you have others?
I'm in the latter stages of a big v2 rewrite, and i'd be happy to include these rules there.


silentrob commented on May 18, 2024

That seems tricky. I could see "Stop & Shop" being translated to "Stop and Shop", which should get tagged as "Stop/NN and/CC Shop/NN". I suspect you want the phrase to be parsed as an NP, but this is done by a parser and not a tagger.
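To illustrate the tagger/parser distinction drawn above, here is a rough sketch of a separate chunking pass that runs over already-tagged tokens. It is an illustration only, not how compromise works: the function names and the `NP` label are assumptions, and the tags are supplied by hand rather than produced by a real tagger.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag)


def normalize_ampersand(words: List[str]) -> List[str]:
    """Rewrite a bare '&' as 'and' before tagging."""
    return ["and" if w == "&" else w for w in words]


def chunk_noun_phrases(tagged: List[Token]) -> List[Token]:
    """Merge every run of the shape NN (CC NN)+ into a single NP token."""
    chunks = []
    i = 0
    while i < len(tagged):
        text, tag = tagged[i]
        if tag == "NN":
            words = [text]
            j = i + 1
            # Extend the run over each following "CC NN" pair.
            while (j + 1 < len(tagged)
                   and tagged[j][1] == "CC"
                   and tagged[j + 1][1] == "NN"):
                words += [tagged[j][0], tagged[j + 1][0]]
                j += 2
            if len(words) > 1:
                chunks.append((" ".join(words), "NP"))
                i = j
                continue
        chunks.append((text, tag))
        i += 1
    return chunks


words = normalize_ampersand("Stop & Shop".split())
# Hand-written tags standing in for a tagger's output.
tagged = [("at", "IN"), ("Stop", "NN"), ("and", "CC"), ("Shop", "NN"), ("today", "RB")]
chunks = chunk_noun_phrases(tagged)
```

Note that this naive rule would also merge a genuine list such as "x1 and x2" into one phrase, which is part of why the problem is tricky: the tags alone do not say whether "and" joins two entities or sits inside one name.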


silentrob commented on May 18, 2024

@spencermountain I have explored pulling out common bigrams and trigrams using a lookup table to aid in entity recognition. For example, "fish and chips", "french fries" and "united states" each mean more as a group than the sum of their parts.
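A lookup-table approach like this could be sketched as a greedy longest-match pass over the word list. This is a guess at the general shape, not silentrob's implementation: the table contents come from the examples above, and the names `KNOWN_NGRAMS` and `lump_ngrams` are made up for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical lookup table, seeded with the examples from the comment.
KNOWN_NGRAMS: Set[Tuple[str, ...]] = {
    ("fish", "and", "chips"),
    ("french", "fries"),
    ("united", "states"),
}
MAX_N = max(len(gram) for gram in KNOWN_NGRAMS)


def lump_ngrams(words: List[str]) -> List[str]:
    """Greedily merge the longest known n-gram starting at each position."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest window first, down to bigrams.
        for n in range(min(MAX_N, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in KNOWN_NGRAMS:
                out.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No known n-gram starts here: keep the single word.
            out.append(words[i])
            i += 1
    return out


lumped = lump_ngrams("I ate fish and chips in the United States".split())
```

Matching is case-insensitive but the original casing is preserved in the output, so "United States" comes back as one token. Each lumped token can then be handed to the tagger as a single noun candidate.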


spencermountain commented on May 18, 2024

oh. very neat stuff.
can you help contribute to v2?

this is how i've been looking at that problem here.
You can see there's lots of work to do.

V1 has a tonne of lumper-splitter rules, like you mentioned. Ideally, we can think of a better way to articulate them. SilentRob's done loads of this already.
It's a great time to be rethinking this, and I welcome any ideas.


slneufeld commented on May 18, 2024

Happy to help contribute, if I can get away from the day job :)

Yes, that's an interesting approach to applying a "coarse filter" to POS tags. Better than what I was considering, I think.


spencermountain commented on May 18, 2024

hey, I think the problems you've mentioned have been addressed in the much-smarter lumping scheme. Let me know if you find any other doozies.
cheers

