Comments (7)
Yes, ampersand is a good one, as is 'n or 'N. Dashes are also a challenge.
Regarding silentrob's comments, agreed, I'm not looking to bastardize a general purpose tool for my own selfish needs.
My use case is very specific to noun discovery, which in the case of brand names or trademarks, and to a large degree retailer names, can be very challenging. Our database alone has tens of thousands of such entities, many of them combining different parts of speech -- Big Red, Cap N Crunch -- or just difficult to logically group together, like Shop Rite, Stop & Shop, Wal*Mart and so on. And while all I really need are IN's, CC's and NN's, I need to know them really, really well for my solution to be successful, and my preference is to keep this on the client-side and use logic versus a brute force dictionary-style approach.
On top of nlp_compromise, I've implemented two hacks -- one for the double-quote capability, and the more important one, a hacky converter that looks at a token's tag, looks at the previous and/or next tag, and based on the three, converts the one I'm operating on into an IN, a CC, an NN, or basically an ignore. The sentences used in my solution are essentially grouped into "show me x1, x2 and x3 for y1, y2 and y3 during z1, z2 and z3", with the x's being one group of like objects, the y's another, the z's another, and so on. The complexity is in simply allowing the user more flexibility to be descriptive in their sentence, i.e. "show me both x1 and x2 in y1, y2 and y3 during the periods z1 and z2", and ensuring that I can add flexibility in the future for more complex statements. So I'm not inclined to create a "dumb parser" that looks for break words, commas and the like, because I think ultimately identifying parts of speech will afford me the flexibility and future-proofing I need.
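A minimal sketch of what such a context-based converter might look like, not the poster's actual code. It assumes tokens arrive as `{ text, tag }` objects with Penn-style tags; the function names `coarseTag` and `retag` and the two rewrite rules are hypothetical and only illustrate the prev/current/next idea.

```javascript
// Hypothetical sketch: collapse a tagger's output to IN / CC / NN / IGNORE,
// using the previous and next tags as context. The rules are illustrative only.
const KEEP = new Set(['IN', 'CC', 'NN']);

function coarseTag(prev, current, next) {
  // Tags we already care about pass through unchanged.
  if (KEEP.has(current)) return current;
  // A token sandwiched between two nouns is treated as part of the name ("Cap N Crunch").
  if (prev === 'NN' && next === 'NN') return 'NN';
  // An adjective directly before a noun is folded into it ("Big Red").
  if (current === 'JJ' && next === 'NN') return 'NN';
  return 'IGNORE';
}

function retag(tokens) {
  // Context is read from the original tags, not from already-rewritten ones.
  return tokens.map((tok, i) => ({
    text: tok.text,
    tag: coarseTag(
      i > 0 ? tokens[i - 1].tag : null,
      tok.tag,
      i < tokens.length - 1 ? tokens[i + 1].tag : null
    ),
  }));
}

// Assumed input; a real tagger may assign different tags to these words.
const tagged = [
  { text: 'show', tag: 'VB' },
  { text: 'me', tag: 'PRP' },
  { text: 'Big', tag: 'JJ' },
  { text: 'Red', tag: 'NN' },
  { text: 'in', tag: 'IN' },
  { text: 'Shop', tag: 'NN' },
  { text: 'Rite', tag: 'NN' },
];
console.log(retag(tagged).map(t => `${t.text}/${t.tag}`).join(' '));
```

Keeping the rules in one small function makes it easy to add cases later without touching the tagger itself.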
from compromise.
Hi! Of course. That's a great idea. I agree that a quote with only a few words is a strong noun signal.
In that case too, the ampersand also seems like a good signal for a noun. Do you have others?
I'm in the latter stages of a big v2 rewrite, and I'd be happy to include these rules there.
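The two signals discussed here (a short quoted span, and an ampersand between capitalized words) could be sketched roughly as follows. This is a standalone illustration and not part of compromise; the function name `nounCandidates` and the four-word cutoff `MAX_QUOTED_WORDS` are assumptions.

```javascript
// Hypothetical heuristic: short double-quoted spans and "X & Y" patterns
// are returned as candidate nouns. The word cutoff is an assumed value.
const MAX_QUOTED_WORDS = 4;

function nounCandidates(sentence) {
  const found = [];
  // 1. Double-quoted spans containing only a few words.
  for (const m of sentence.matchAll(/"([^"]+)"/g)) {
    const text = m[1].trim();
    const count = text ? text.split(/\s+/).length : 0;
    if (count > 0 && count <= MAX_QUOTED_WORDS) found.push(text);
  }
  // 2. Capitalized word, ampersand, capitalized word ("Stop & Shop").
  for (const m of sentence.matchAll(/\b[A-Z]\w*\s*&\s*[A-Z]\w*/g)) {
    if (!found.includes(m[0])) found.push(m[0]);
  }
  return found;
}

console.log(nounCandidates('show me "Big Red" at Stop & Shop'));
// → [ 'Big Red', 'Stop & Shop' ]
```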
from compromise.
That seems tricky. I could see "Stop & Shop" being translated to "Stop and Shop", which should get tagged as "Stop/NN and/CC Shop/NN". I suspect you want the phrase to be parsed as an NP, but that is done by a parser and not a tagger.
from compromise.
@spencermountain I have explored pulling out common bigrams and trigrams using a lookup table to aid in entity recognition. For example, "fish and chips", "french fries" and "united states" all mean more as a group than the sum of their parts.
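A rough sketch of the lookup-table idea, assuming a small hand-built set of lowercased phrases and a greedy longest-match pass. The names `KNOWN_NGRAMS` and `lumpNgrams` are hypothetical and not taken from compromise.

```javascript
// Hypothetical lookup table of known multi-word units (lowercased).
const KNOWN_NGRAMS = new Set(['fish and chips', 'french fries', 'united states']);
const MAX_N = 3; // only bigrams and trigrams are considered

function lumpNgrams(words) {
  const out = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    // Prefer the longest match, so trigrams are tried before bigrams.
    for (let n = Math.min(MAX_N, words.length - i); n >= 2; n--) {
      const phrase = words.slice(i, i + n).join(' ');
      if (KNOWN_NGRAMS.has(phrase.toLowerCase())) {
        out.push(phrase); // emit the whole phrase as one token
        i += n;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(words[i]); // no known phrase starts here; keep the single word
      i += 1;
    }
  }
  return out;
}

console.log(lumpNgrams('I ate fish and chips in the United States'.split(' ')));
// → [ 'I', 'ate', 'fish and chips', 'in', 'the', 'United States' ]
```

A lumped phrase can then be handed to the tagger as a single noun token.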
from compromise.
oh. very neat stuff.
can you help contribute to v2?
this is how i've been looking at that problem here.
You can see there's lots of work to do.
V1 has a tonne of lumper-splitter rules, like you mentioned. Ideally, we can think of a better way to articulate them. SilentRob's done loads of this already.
It's a great time to be rethinking this, and I welcome any ideas.
from compromise.
Happy to help contribute, if I can get away from the day job :)
Yes, that's an interesting approach to applying a "coarse filter" to POS tags. Better than what I was considering, I think.
from compromise.
hey, I think the problems you've mentioned have been addressed in the much-smarter lumping scheme. Let me know if you find any other doozies.
cheers
from compromise.