Comments (2)
I've also encountered this issue.
The issue mainly stems from the _find_most_common_entity() method, which prioritises the entity with the highest count. Email addresses in test_structured.csv are also being identified as URLs, albeit with low confidence, and those URL detections win on count.
Observed behavior:
- Entity Count:
{'URL': 6, 'EMAIL_ADDRESS': 3}
- Confidence Scores:
{'EMAIL_ADDRESS': [1.0, 1.0, 1.0], 'URL': [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]}
The emails are recognised correctly, but the URL detections outnumber them (6 vs. 3) despite their lower confidence, so URL is selected for the column.
I would like to suggest two potential improvements:
- Adapting _find_most_common_entity() to consider confidence scores: It might be beneficial to adjust the method to account for the actual confidence scores provided by the recognizer results.
- Enhancing the URL recognizer: Improving the recognizer's ability to differentiate between URLs and email addresses could help reduce this type of misidentification.
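To make the first suggestion concrete, here is a minimal sketch of a confidence-aware selection using the scores observed above. It is plain Python, not the actual Presidio implementation; the function name `pick_entity_by_mean_confidence` and the dictionary shape are assumptions for illustration only. Note that simply summing the scores would not help here, since both entities sum to 3.0, so the sketch compares mean confidence instead.

```python
from statistics import mean


def pick_entity_by_mean_confidence(scores_by_entity):
    """Hypothetical helper: return the entity with the highest mean confidence.

    `scores_by_entity` maps an entity name to the list of confidence scores
    of its detections. Entities with no scores are ignored.
    """
    candidates = {e: s for e, s in scores_by_entity.items() if s}
    if not candidates:
        return None
    return max(candidates, key=lambda e: mean(candidates[e]))


# Confidence scores reported in this comment.
observed = {
    "EMAIL_ADDRESS": [1.0, 1.0, 1.0],
    "URL": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

selected = pick_entity_by_mean_confidence(observed)
print(selected)
```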
I'm keen to contribute to making these improvements and would love to work on refining the logic. Any thoughts or feedback on these suggestions would be greatly appreciated!
from presidio.
Thanks for the feedback! The URL recognizer detects parts of emails as well (e.g. microsoft.com is a URL inside [email protected]), which makes it detect more URLs than emails.
I think that a good way forward here would be to allow the user to decide on a strategy for selecting the entity. In some cases, we would want the entity with the majority of detections, in others we'd like the one that has the highest confidence, and in others we might want a mix of the two (e.g. most common entity, if confidence > 0.5).
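A rough sketch of what such a user-selectable strategy could look like is below. Everything here is hypothetical: the function `select_entity`, the strategy names, and the `min_confidence` parameter are not part of Presidio, and "confidence" for the mixed strategy is assumed to mean the mean score per entity.

```python
from statistics import mean

STRATEGIES = ("most_common", "highest_confidence", "mixed")


def select_entity(scores_by_entity, strategy="most_common", min_confidence=0.5):
    """Hypothetical selector; `scores_by_entity` maps entity -> list of scores."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")

    # Ignore entities that have no detections at all.
    candidates = {e: s for e, s in scores_by_entity.items() if s}

    if strategy == "mixed":
        # Keep only entities whose mean confidence exceeds the threshold,
        # then fall through to the most-common rule below.
        candidates = {
            e: s for e, s in candidates.items() if mean(s) > min_confidence
        }

    if not candidates:
        return None

    if strategy == "highest_confidence":
        return max(candidates, key=lambda e: mean(candidates[e]))

    # "most_common" and "mixed": pick the entity with the most detections.
    return max(candidates, key=lambda e: len(candidates[e]))


scores = {
    "EMAIL_ADDRESS": [1.0, 1.0, 1.0],
    "URL": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

for name in STRATEGIES:
    print(name, "->", select_entity(scores, strategy=name))
```

With the scores from this issue, only the plain most-common rule selects URL. The mixed rule drops URL because its mean confidence of 0.5 does not exceed the 0.5 threshold.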
A quick fix could be to update the structured analysis once finalized, in case the column's name is "email" but the detection is actually "URL".
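As an illustration of that quick fix, the sketch below post-processes a column-to-entity mapping. It assumes the finalized analysis can be reduced to a plain dictionary of column name to entity, and that a case-insensitive substring match on the column name is acceptable; the helper `override_by_column_name` and the `hints` structure are hypothetical, not Presidio API.

```python
def override_by_column_name(entity_mapping, hints=None):
    """Hypothetical post-processing step for a column -> entity mapping.

    `hints` maps a column-name keyword to a (detected, corrected) pair:
    if the keyword appears in the column name and the detected entity
    matches, the entity is replaced with the corrected one.
    """
    if hints is None:
        hints = {"email": ("URL", "EMAIL_ADDRESS")}

    corrected = dict(entity_mapping)  # leave the input untouched
    for column, entity in entity_mapping.items():
        for keyword, (detected, replacement) in hints.items():
            if keyword in column.lower() and entity == detected:
                corrected[column] = replacement
    return corrected


mapping = {"email": "URL", "name": "PERSON", "website": "URL"}
fixed = override_by_column_name(mapping)
print(fixed)
```

Only the column whose name contains "email" is changed; a genuine URL column such as "website" keeps its detection.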
If you're interested in creating a PR, I'd be happy to review it and discuss.