
presidio's Introduction

Presidio - Data Protection and De-identification SDK

Context aware, pluggable and customizable PII de-identification service for text and images.



  • Presidio Analyzer Pypi Downloads
  • Presidio Anonymizer Pypi Downloads
  • Presidio Image-Redactor Pypi Downloads

What is Presidio

Presidio (from the Latin praesidium, ‘protection, garrison’) helps ensure that sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.



💭 Demo


Are you using Presidio? We'd love to know how

Please help us improve by taking this short anonymous survey.


Goals

  • Allow organizations to preserve privacy in a simpler way by democratizing de-identification technologies and introducing transparency in decisions.
  • Embrace extensibility and customizability to a specific business need.
  • Facilitate both fully automated and semi-automated PII de-identification flows on multiple platforms.

Main features

  1. Predefined or custom PII recognizers leveraging Named Entity Recognition, regular expressions, rule-based logic and checksums, with relevant context, in multiple languages.
  2. Options for connecting to external PII detection models.
  3. Multiple usage options, from Python or PySpark workloads through Docker to Kubernetes.
  4. Customizability in PII identification and de-identification.
  5. Module for redacting PII text in images (standard image types and DICOM medical images).
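To make the first feature more concrete, here is a minimal standard-library sketch of how a regular expression, a checksum and nearby context words can be combined into one recognizer. It is not Presidio's implementation: the pattern, the context word list and the score values (0.5 base, 0.35 boost) are assumptions chosen only for illustration.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Hypothetical context words that raise confidence when they precede a match.
CONTEXT_WORDS = ("card", "credit", "visa")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text, base_score=0.5, context_boost=0.35, window=30):
    """Regex match, then checksum validation, then a context-based score boost."""
    results = []
    for m in CARD_PATTERN.finditer(text):
        if not luhn_valid(m.group()):
            continue  # the checksum rejects random digit runs
        prefix = text[max(0, m.start() - window):m.start()].lower()
        boosted = any(word in prefix for word in CONTEXT_WORDS)
        score = base_score + (context_boost if boosted else 0.0)
        results.append({
            "entity_type": "CREDIT_CARD",
            "start": m.start(),
            "end": m.end(),
            "score": round(min(score, 1.0), 2),
        })
    return results
```

The checksum step is what keeps arbitrary 16-digit numbers from being reported, and the context step is what separates a weak match from a confident one.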

⚠️ Presidio can help identify sensitive/PII data in un/structured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed.

Installing Presidio

  1. Using pip
  2. Using Docker
  3. From source
  4. Migrating from V1 to V2

Running Presidio

  1. Getting started
  2. Setting up a development environment
  3. PII de-identification in text
  4. PII de-identification in images
  5. Usage samples and example deployments

Support

Contributing

For details on contributing to this repository, see the contributing guide.

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Contributors

presidio's People

Contributors

ammills01, aroffe99, balteravishay, biancafbmd, bwagner, dependabot[bot], devopam, diwu1989, ebotiab, eladiw, erann1987, guybartal, hkarakose, idomingog, ilanak, itye-msft, melmatlis, miltonsim, navalev, niwilso, omri374, paulo-raca, rabee333, rakan41, sharonhart, shiranr, stevehaigh, tamirkamara, torosent, vmd7


presidio's Issues

How to create a project in presidio?

Describe the bug
How to create a project in presidio? Trying hard to follow the tutorial
https://github.com/Microsoft/presidio/blob/master/docs/tutorial_service.md

To Reproduce
Steps to reproduce the behavior:

$ curl -X POST localhost:3000/api/v1/projects/1
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

$ curl -X POST localhost:3000/api/v1/projects/my-projects
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

Expected behavior
Do I need to create a default project before I can analyze with a template?
$ echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991", "analyzeTemplate":{"fields":[]} }' | http /api/v1/projects//analyze

Screenshots
N/A

Additional context
Just try to run the example from https://github.com/Microsoft/presidio/blob/master/docs/tutorial_service.md

When request has both template and a template ID, the template is ignored

Set template:

echo -n '{"fields":[{"name": "PERSON"}]}' | http http://localhost:8080/api/v1/templates/123/analyze/my-template

Call analyzer:

echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC111921", "analyzeTemplateId":"my-template", "analyzeTemplate":{"fields": [{ "name": "PHONE" } ]}  }' | http localhost:8080/api/v1/projects/123/analyze

response is:

[
    {
        "field": {
            "name": "PERSON"
        },
        "location": {
            "end": 10,
            "length": 10
        },
        "score": 0.85
    }
]

The request should either return an error message, or this behavior should be documented.
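One way to surface the conflict is to validate the request body before it reaches the analyzer. The sketch below is only an illustration of that idea: the function name is hypothetical, and the two keys are taken from the request shown in this issue.

```python
def validate_analyze_request(body: dict) -> None:
    """Reject requests whose template source is ambiguous or missing.

    Hypothetical helper: `analyzeTemplateId` and `analyzeTemplate` are the
    keys used in the request in this issue; the rule itself is an assumption.
    """
    has_id = bool(body.get("analyzeTemplateId"))
    has_inline = body.get("analyzeTemplate") is not None
    if has_id and has_inline:
        raise ValueError(
            "Provide either analyzeTemplateId or analyzeTemplate, not both"
        )
    if not has_id and not has_inline:
        raise ValueError("One of analyzeTemplateId or analyzeTemplate is required")
```

Failing loudly here avoids the silent case above, where the stored PERSON template wins over the inline PHONE template.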

Build takes a lot of time and eventually crashes

Describe the bug
I am following the instructions to build the Docker images. My build process gets stuck at pipenv sync and eventually crashes.

To Reproduce
Steps to reproduce the behavior:

  1. RUN make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build-deps

It fails during pipenv sync. Below is the log:


Step 10/12 : RUN pipenv sync
 ---> Running in a0dbab69a607
Creating a virtualenv for this project…
Pipfile: /usr/bin/presidio-analyzer/Pipfile
Using /usr/local/bin/python (3.7.1) to create virtualenv…
⠼ Creating virtual environment...Already using interpreter /usr/local/bin/python
Using base prefix '/usr/local'
New python executable in /root/.local/share/virtualenvs/presidio-analyzer-p_bm8iFt/bin/python
Installing setuptools, pip, wheel...
done.

✔ Successfully created virtual environment!
Virtualenv location: /root/.local/share/virtualenvs/presidio-analyzer-p_bm8iFt
Installing dependencies from Pipfile.lock (6725d2)…
The command '/bin/sh -c pipenv sync' returned a non-zero code: 137
make: *** [docker-build-deps] Error 137

This is after about 2 hours.

Expected behavior
Build successfully completes.

Additional context
I have a pretty fast internet connection, and I know pipenv can be slow. But I didn't expect it to fail.
Looking at: durations
I see the builds are actually very slow (~80 mins).

What can I do to speed this up, or even just to get the build to succeed?
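A note on the exit status in the log: by shell convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). In Docker builds this often, though not always, means the kernel's out-of-memory killer stopped the process, so checking the memory available to Docker may be worthwhile. A small sketch for decoding such statuses (the helper name is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        signum = code - 128  # statuses above 128 encode a terminating signal
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```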

Analyzer - fails to handle entities that are part of a token. Reproduce: analyze 'my phone:(425) 803 8080', where there is no space between ':' and the phone number
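The suspected failure mode can be illustrated without the analyzer itself. In the sketch below, a matcher that only accepts whole whitespace-delimited tokens misses the number, because 'phone:(425)' is a single token, while a character-level search finds it. The pattern and both helpers are assumptions for illustration; they do not reproduce how the analyzer actually tokenizes.

```python
import re

# Hypothetical phone pattern matching the format used in the title.
PHONE = re.compile(r"\(\d{3}\) \d{3} \d{4}")


def find_by_tokens(text):
    """Accept only runs of whole whitespace-delimited tokens."""
    tokens = text.split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            candidate = " ".join(tokens[i:j])
            if PHONE.fullmatch(candidate):
                hits.append(candidate)
    return hits


def find_by_characters(text):
    """Search the raw string, ignoring token boundaries."""
    return [m.group() for m in PHONE.finditer(text)]


print(find_by_tokens("my phone:(425) 803 8080"))
print(find_by_characters("my phone:(425) 803 8080"))
```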


Using protocol buffers to add new parameters

Hi! I'm using and editing the Presidio analyzer software as a Python module to fit my requirements. One of the things I'd like to do is add a parameter to the process, for example passing the country to the AnalyzerEngine().analyze method (in addition to the existing parameters like text, language, entities, etc.).
I managed to change the code for this purpose, adding a new parameter to the analyze method of entity_recognizer, local_recognizer, pattern_recognizer and a few other scripts. I did this very carefully, but it doesn't work because some of the automatically generated _pb2_grpc scripts (generated with Google's protocol buffers) control the supported variables.
I was wondering if there's a way of regenerating these scripts from the existing (edited) code to add the supported parameters.
Thank you so much for the attention!
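Generated _pb2 and _pb2_grpc modules are normally not edited by hand. The usual route is to add the new field to the .proto definition and regenerate the modules with the grpcio-tools package. The sketch below only builds that command; the file and directory names in the usage comment are hypothetical, and where the .proto files live in this repository is an assumption left to the reader.

```python
import sys


def protoc_command(proto_file: str, proto_dir: str, out_dir: str) -> list:
    """Build the grpc_tools.protoc invocation that regenerates the
    *_pb2.py and *_pb2_grpc.py modules for one .proto file."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{proto_dir}",                # where imports in the .proto are resolved
        f"--python_out={out_dir}",       # message classes (*_pb2.py)
        f"--grpc_python_out={out_dir}",  # service stubs (*_pb2_grpc.py)
        proto_file,
    ]


# Usage, assuming grpcio-tools is installed and the paths exist
# (both paths here are placeholders):
#   import subprocess
#   subprocess.run(protoc_command("protos/analyze.proto", "protos", "."), check=True)
```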

Analyzer - detects part of a guid as phone number

Analyzer detected part of this guid as phone number:
110bcd25-a55d-453a-8046-1297901ea002

{
    "text": "046-1297901",
    "field": {
        "name": "PHONE_NUMBER"
    },
    "probability": 0.5,
    "location": {
        "start": 52,
        "end": 63,
        "length": 11
    }
},

Json file attached
test.txt
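One possible mitigation is to require that a phone match is not embedded in a longer run of letters, digits or dashes. The sketch below uses a deliberately simple, hypothetical phone pattern (it is not the analyzer's real one) to show the effect of such boundary guards on the GUID from this issue.

```python
import re

# Hypothetical pattern that reproduces the false positive: 3 digits, dash, 7 digits.
NAIVE_PHONE = re.compile(r"\d{3}-\d{7}")

# Same pattern, but the match may not touch a letter, digit or dash on either side.
GUARDED_PHONE = re.compile(r"(?<![0-9A-Za-z-])\d{3}-\d{7}(?![0-9A-Za-z-])")

guid = "110bcd25-a55d-453a-8046-1297901ea002"

print(NAIVE_PHONE.findall(guid))    # matches a fragment inside the GUID
print(GUARDED_PHONE.findall(guid))  # no match: the fragment is embedded
print(GUARDED_PHONE.findall("call 046-1297901 today"))
```

The trade-off is that guards like these also reject genuine numbers glued to other characters, so they would need checking against real inputs.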

Improve analyzer performance.

Our current implementation uses Python regex library and spacy word similarity.
We need to check performance and integration with re2 and possibly remove the word similarity functionality.
A good response time for our demo text file will be 200 ms per analyze request with all the field types enabled.
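To check a change against the 200 ms target, a small timing harness is enough. The sketch below is an assumption-laden stand-in: `analyze` is a toy function with two precompiled, hypothetical patterns, not the real analyzer, and would be replaced by the actual call being measured.

```python
import re
import statistics
import time

# Precompiled once at import time, not per request.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def analyze(text):
    """Toy stand-in for an analyze request: run every pattern over the text."""
    return [
        (name, m.start(), m.end())
        for name, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def median_latency_ms(func, text, runs=20):
    """Median wall-clock time of func(text) in milliseconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        started = time.perf_counter()
        func(text)
        samples.append((time.perf_counter() - started) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    demo = "I called him on (212) 555-1234 to verify the appointment. " * 200
    print(f"median: {median_latency_ms(analyze, demo):.2f} ms (target: 200 ms)")
```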

fieldTypes API should return all fields, including custom ones

Describe the bug
Currently, a call to the fieldTypes API returns the list of fields defined in the proto file. Since Presidio allows extending this list by adding custom recognizers and field types, this API should return all field types (entity types) supported by a Presidio deployment.

To Reproduce
Steps to reproduce the behavior:

  1. Create a new entity recognizer using the API
  2. Call the fieldTypes API (e.g. https://presidio-api.westeurope.cloudapp.azure.com/api/v1/fieldTypes)
  3. The newly added field type supported by the new recognizer is not in this list.

Expected behavior
New types supported by custom recognizers should be added to the list
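The expected behavior amounts to merging two sources. A minimal sketch, assuming custom recognizers are available as dictionaries with an `entities` list (that shape and the function name are hypothetical, and `ROCKET` is a made-up custom type):

```python
def supported_field_types(predefined, custom_recognizers):
    """Predefined field types first, then any new types contributed by
    custom recognizers, without duplicates and in a stable order."""
    merged = dict.fromkeys(predefined)  # a dict keeps insertion order
    for recognizer in custom_recognizers:
        for entity in recognizer.get("entities", []):
            merged.setdefault(entity)
    return list(merged)


predefined = ["PHONE_NUMBER", "US_SSN", "CREDIT_CARD", "DATE_TIME", "PERSON"]
custom = [{"name": "rocket-recognizer", "entities": ["ROCKET"]}]
print(supported_field_types(predefined, custom))
```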

Wrong results due to scoring

This bug is about a case where the same string is considered a PII entity by two different recognizers.
This is not as bad as completely missing it; however, there are scenarios (like the one described below) where the wrong entity type 'wins' due to incorrect scoring.

Consider the following analyze request:

{
  "text": "my social number is 078051120 and my name is Jon doe",
  "analyzeTemplate": {
    "fields": [
      {
        "name": "PHONE_NUMBER"
      },
      {
        "name": "US_SSN"
      },
      {
        "name": "CREDIT_CARD"
      },
      {
        "name": "DATE_TIME"
      }
    ]
  }
}

The result we get for 078051120 is DATE_TIME with a score of 0.85.
If we remove DATE_TIME from the template, we get the correct result of US_SSN.

Basically this is OK because the string will be matched anyway, although the anonymization methods might differ.
In this specific case the SSN is given a WEAK score of 0.3, and the context raises it to 0.6, which is still less than 0.85.
If the SSN had been in the correct SSN format XXX-XX-XXXX, it would have received a better score with context, 0.85, which is the same as DATE_TIME, so it might still be the one removed.

This bug should be discussed.
a) Do we want to give regex results priority over spaCy?
b) Maybe we want to give context more effect on the score.
c) Maybe we want to allow the user to configure the right behaviour.
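To make option (c) concrete, here is a minimal sketch of conflict resolution with a user-configurable preference, using the scores quoted above (0.6 for US_SSN after context, 0.85 for DATE_TIME). The `Result` type and `pick_winner` function are hypothetical names for illustration, not the analyzer's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    entity: str
    start: int
    end: int
    score: float


def pick_winner(candidates, preferred=()):
    """Choose one result among candidates covering the same span.

    Without a preference the highest score wins (the behaviour described
    above). Entity types listed in `preferred` win over any others,
    regardless of score.
    """
    favoured = [c for c in candidates if c.entity in preferred]
    pool = favoured or list(candidates)
    return max(pool, key=lambda c: c.score)


text = "my social number is 078051120 and my name is Jon doe"
start = text.index("078051120")
end = start + len("078051120")
candidates = [
    Result("US_SSN", start, end, 0.6),
    Result("DATE_TIME", start, end, 0.85),
]
print(pick_winner(candidates).entity)
print(pick_winner(candidates, preferred=("US_SSN",)).entity)
```

Option (a) would be the same mechanism with the preference fixed to pattern-based entity types instead of being supplied by the user.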

Usage Documentation

Would it be possible to provide some kind of quick-start documentation to help with using Presidio from a user's perspective?
What I mean is that I cannot find any API specification, nor an example of how to feed data from a database source into the Presidio analyzer/anonymizer.
Maybe this is a limitation on my side, but I believe it could help many people.
Thank you in advance.
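As a rough illustration of the database scenario, the sketch below reads a text column from a SQLite table, passes each value through an anonymizing callable, and writes the result back. It assumes the Presidio V2 Python packages (`presidio_analyzer`, `presidio_anonymizer`) for the real callable; the table name, column name and helper names are hypothetical, and the demo at the bottom uses a stand-in callable so it runs without Presidio installed.

```python
import sqlite3
from typing import Callable


def presidio_anonymizer() -> Callable[[str], str]:
    """Build a text -> anonymized-text callable (Presidio is imported lazily)."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def anonymize(text: str) -> str:
        findings = analyzer.analyze(text=text, language="en")
        return anonymizer.anonymize(text=text, analyzer_results=findings).text

    return anonymize


def anonymize_column(conn, table: str, column: str, anonymize) -> int:
    """Rewrite every non-NULL value of `column` in place; return rows changed.

    `table` and `column` are interpolated into SQL, so they must come from
    trusted code, never from user input.
    """
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    changed = 0
    for rowid, value in rows:
        if value is None:
            continue
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE rowid = ?",
            (anonymize(value), rowid),
        )
        changed += 1
    conn.commit()
    return changed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('John Smith lives in New York.')")
    # Stand-in callable; swap in presidio_anonymizer() when Presidio is installed.
    anonymize_column(conn, "notes", "body", lambda text: "<REDACTED>")
    print(conn.execute("SELECT body FROM notes").fetchall())
```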

Analyzer - NER context is limited

Describe the bug
NER (spacy) ability to increase score based on context is limited

To Reproduce

  1. Go to 'analyzer/tests/tes_person.py'
  2. Run test with "Mr. Tailor" text
  3. "Mr. Tailor" is not identified as a PERSON entity

Expected behavior
It should be identified as a person.
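One possible complement to the NER model is a rule that treats an honorific followed by a capitalized word as a person candidate. The sketch below is a hypothetical illustration of that idea, not existing analyzer behaviour; the title list and the 0.6 score are assumptions.

```python
import re

# Hypothetical rule: honorific, a dot, whitespace, then one capitalized word.
TITLED_NAME = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")


def find_titled_persons(text, score=0.6):
    """Return PERSON candidates that are introduced by an honorific."""
    return [
        {
            "entity_type": "PERSON",
            "text": m.group(),
            "start": m.start(),
            "end": m.end(),
            "score": score,
        }
        for m in TITLED_NAME.finditer(text)
    ]


print(find_titled_persons("I spoke with Mr. Tailor yesterday."))
```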

Standalone Analyzer On Spark

Hi,

Have you tried running the analyzer as a Spark job? How would you suggest handling the loading of the spaCy model on each worker, and how should serialization be handled?

I would appreciate some suggestions around this. Is there any plan to support this use case?

Thanks
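A commonly used pattern for this kind of problem is to never ship the model from the driver at all: each worker process loads it lazily on first use and caches it in a module-level variable, so only a small function needs to be serialized. The sketch below shows the pattern with a stand-in loader; it has not been tested on a cluster, and the names are hypothetical. In a real job the loader would build the spaCy model or the analyzer, and the function would typically live in a module shipped to the workers.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# One cached model per Python worker process; never pickled from the driver.
_MODEL: Optional[Callable[[str], object]] = None


def get_model(loader: Callable[[], Callable[[str], object]]):
    """Load the model on first use in this process, then reuse it."""
    global _MODEL
    if _MODEL is None:
        _MODEL = loader()
    return _MODEL


def analyze_partition(rows: Iterable[str], loader) -> Iterator[Tuple[str, object]]:
    """Process one partition of texts with the per-process model."""
    model = get_model(loader)
    for text in rows:
        yield text, model(text)


# With PySpark this would be used roughly as (load_analyzer is a placeholder):
#   rdd.mapPartitions(lambda rows: analyze_partition(rows, load_analyzer))
```

Using mapPartitions instead of map keeps the model lookup to once per partition, not once per row.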

Make error during installation using docker

During installation with Docker, make fails with an error:

make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build

make: *** [Makefile:181: go-test-style] Error 1
The command '/bin/sh -c make go-test' returned a non-zero code: 2
make: *** [docker-build-base] Error 2

It would be very helpful if someone could help with a solution.

Thank You


$ make DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL} docker-build
docker build --build-arg REGISTRY=presidio --build-arg PRESIDIO_DEPS_LABEL='latest' -t presidio/presidio-golang-base -f Dockerfile.golang.base .
Sending build context to Docker daemon 1.261MB
Step 1/7 : ARG REGISTRY=presidio.azurecr.io
Step 2/7 : ARG PRESIDIO_DEPS_LABEL=latest
Step 3/7 : FROM ${REGISTRY}/presidio-golang-deps:${PRESIDIO_DEPS_LABEL}
---> b2929920eb09
Step 4/7 : WORKDIR $GOPATH/src/github.com/Microsoft/presidio
---> Using cache
---> 04324948e601
Step 5/7 : ADD . $GOPATH/src/github.com/Microsoft/presidio
---> Using cache
---> ecf194963a5e
Step 6/7 : RUN dep ensure
---> Using cache
---> 4d1681ebd2c6
Step 7/7 : RUN make go-test
---> Running in aa4cea2b1b88
gometalinter --config ./gometalinter.json ./...
pkg/cache/cache.go:1::warning: file is not gofmted with -s (gofmt)
pkg/cache/mock/cache.go:1::warning: file is not gofmted with -s (gofmt)
pkg/cache/redis/redis.go:1::warning: file is not gofmted with -s (gofmt)
pkg/logger/logger.go:1::warning: file is not gofmted with -s (gofmt)
pkg/logger/logger_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/client.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/cronjob.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/cronjob_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/job.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/job_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/kube.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/secret.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/kube/secret_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/local/job.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/local/local.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/local/secret.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/local/secret_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/platform/platform.go:1::warning: file is not gofmted with -s (gofmt)
pkg/presidio/presidio.go:1::warning: file is not gofmted with -s (gofmt)
pkg/presidio/services/services.go:1::warning: file is not gofmted with -s (gofmt)
pkg/presidio/templates/templates.go:1::warning: file is not gofmted with -s (gofmt)
pkg/presidio/templates/templates_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/queue/queue.go:1::warning: file is not gofmted with -s (gofmt)
pkg/queue/rabbitmq/rabbitmq.go:1::warning: file is not gofmted with -s (gofmt)
pkg/queue/rabbitmq/rabbitmq_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/rpc/client.go:1::warning: file is not gofmted with -s (gofmt)
pkg/rpc/server.go:1::warning: file is not gofmted with -s (gofmt)
pkg/server/server.go:1::warning: file is not gofmted with -s (gofmt)
pkg/storage/storage.go:1::warning: file is not gofmted with -s (gofmt)
pkg/stream/eventhubs/eventhubs.go:1::warning: file is not gofmted with -s (gofmt)
pkg/stream/kafka/kafka.go:1::warning: file is not gofmted with -s (gofmt)
pkg/stream/mock/mock.go:1::warning: file is not gofmted with -s (gofmt)
pkg/stream/mock/mock_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/stream/stream.go:1::warning: file is not gofmted with -s (gofmt)
pkg/version/version.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer-image/cmd/presidio-anonymizer-image/anonymizer/anonymizer.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer-image/cmd/presidio-anonymizer-image/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/anonymizer.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/anonymizer_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/common.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/fpe_config.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/fpe_config_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/hash_config.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/hash_config_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/mask_config.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/mask_config_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/redact_config.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/replace_config.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/replace_config_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-anonymizer/cmd/presidio-anonymizer/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/analyze/analyze.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/analyze/analyze_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/anonymize-image/anonymize-image.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/anonymize-image/anonymize-image_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/anonymize/anonymize.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/anonymize/anonymize_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/api.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/mocks/mocks.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/recognizers/recognizers.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/recognizers/recognizers_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/scanner-cron-job/scanner-cron-job.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/stream-job/stream-job.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/api/templates/templates.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-api/cmd/presidio-api/methods.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/processor/processor.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/scanner/factory.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/scanner/scanner.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/scanner/storage-item.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/scanner/storage-scanner.go:1::warning: file is not gofmted with -s (gofmt)
presidio-collector/cmd/presidio-collector/streams/streams.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/cloudstorage/cloudstorage.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/cloudstorage/cloudstorage_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/database/database.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/database/database_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/datasink-factory.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/datasink/datasink.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/main_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/stream/stream.go:1::warning: file is not gofmted with -s (gofmt)
presidio-datasink/cmd/presidio-datasink/stream/stream_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-ocr/cmd/presidio-ocr/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-ocr/cmd/presidio-ocr/ocr/ocr.go:1::warning: file is not gofmted with -s (gofmt)
presidio-recognizers-store/cmd/presidio-recognizers-store/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-recognizers-store/cmd/presidio-recognizers-store/recognizers_store_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-scheduler/cmd/presidio-scheduler/main.go:1::warning: file is not gofmted with -s (gofmt)
presidio-scheduler/cmd/presidio-scheduler/scheduler_test.go:1::warning: file is not gofmted with -s (gofmt)
presidio-tester/cmd/presidio-tester/tester.go:1::warning: file is not gofmted with -s (gofmt)
tests/common/common.go:1::warning: file is not gofmted with -s (gofmt)
tests/functional_http_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/functional_recognizers_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/functional_storage_scanner_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/integration_anonymize_image_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/integration_eventhub_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/integration_kafka_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/integration_redis_test.go:1::warning: file is not gofmted with -s (gofmt)
tests/integration_tesseract_test.go:1::warning: file is not gofmted with -s (gofmt)
pkg/cache/cache.go:1::warning: file is not goimported (goimports)
pkg/cache/mock/cache.go:1::warning: file is not goimported (goimports)
pkg/cache/redis/redis.go:1::warning: file is not goimported (goimports)
pkg/logger/logger.go:1::warning: file is not goimported (goimports)
pkg/logger/logger_test.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/client.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/cronjob.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/cronjob_test.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/job.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/job_test.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/kube.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/secret.go:1::warning: file is not goimported (goimports)
pkg/platform/kube/secret_test.go:1::warning: file is not goimported (goimports)
pkg/platform/local/job.go:1::warning: file is not goimported (goimports)
pkg/platform/local/local.go:1::warning: file is not goimported (goimports)
pkg/platform/local/secret.go:1::warning: file is not goimported (goimports)
pkg/platform/local/secret_test.go:1::warning: file is not goimported (goimports)
pkg/platform/platform.go:1::warning: file is not goimported (goimports)
pkg/presidio/presidio.go:1::warning: file is not goimported (goimports)
pkg/presidio/services/services.go:1::warning: file is not goimported (goimports)
pkg/presidio/templates/templates.go:1::warning: file is not goimported (goimports)
pkg/presidio/templates/templates_test.go:1::warning: file is not goimported (goimports)
pkg/queue/queue.go:1::warning: file is not goimported (goimports)
pkg/queue/rabbitmq/rabbitmq.go:1::warning: file is not goimported (goimports)
pkg/queue/rabbitmq/rabbitmq_test.go:1::warning: file is not goimported (goimports)
pkg/rpc/client.go:1::warning: file is not goimported (goimports)
pkg/rpc/server.go:1::warning: file is not goimported (goimports)
pkg/server/server.go:1::warning: file is not goimported (goimports)
pkg/storage/storage.go:1::warning: file is not goimported (goimports)
pkg/stream/eventhubs/eventhubs.go:1::warning: file is not goimported (goimports)
pkg/stream/kafka/kafka.go:1::warning: file is not goimported (goimports)
pkg/stream/mock/mock.go:1::warning: file is not goimported (goimports)
pkg/stream/mock/mock_test.go:1::warning: file is not goimported (goimports)
pkg/stream/stream.go:1::warning: file is not goimported (goimports)
pkg/version/version.go:1::warning: file is not goimported (goimports)
presidio-anonymizer-image/cmd/presidio-anonymizer-image/anonymizer/anonymizer.go:1::warning: file is not goimported (goimports)
presidio-anonymizer-image/cmd/presidio-anonymizer-image/main.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/anonymizer.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/anonymizer_test.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/common.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/fpe_config.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/fpe_config_test.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/hash_config.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/hash_config_test.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/mask_config.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/mask_config_test.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/redact_config.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/replace_config.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/anonymizer/transformations/replace_config_test.go:1::warning: file is not goimported (goimports)
presidio-anonymizer/cmd/presidio-anonymizer/main.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/analyze/analyze.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/analyze/analyze_test.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/anonymize-image/anonymize-image.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/anonymize-image/anonymize-image_test.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/anonymize/anonymize.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/anonymize/anonymize_test.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/api.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/mocks/mocks.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/recognizers/recognizers.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/recognizers/recognizers_test.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/scanner-cron-job/scanner-cron-job.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/stream-job/stream-job.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/api/templates/templates.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/main.go:1::warning: file is not goimported (goimports)
presidio-api/cmd/presidio-api/methods.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/main.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/processor/processor.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/scanner/factory.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/scanner/scanner.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/scanner/storage-item.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/scanner/storage-scanner.go:1::warning: file is not goimported (goimports)
presidio-collector/cmd/presidio-collector/streams/streams.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/cloudstorage/cloudstorage.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/cloudstorage/cloudstorage_test.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/database/database.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/database/database_test.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/datasink-factory.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/datasink/datasink.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/main.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/main_test.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/stream/stream.go:1::warning: file is not goimported (goimports)
presidio-datasink/cmd/presidio-datasink/stream/stream_test.go:1::warning: file is not goimported (goimports)
presidio-ocr/cmd/presidio-ocr/main.go:1::warning: file is not goimported (goimports)
presidio-ocr/cmd/presidio-ocr/ocr/ocr.go:1::warning: file is not goimported (goimports)
presidio-recognizers-store/cmd/presidio-recognizers-store/main.go:1::warning: file is not goimported (goimports)
presidio-recognizers-store/cmd/presidio-recognizers-store/recognizers_store_test.go:1::warning: file is not goimported (goimports)
presidio-scheduler/cmd/presidio-scheduler/main.go:1::warning: file is not goimported (goimports)
presidio-scheduler/cmd/presidio-scheduler/scheduler_test.go:1::warning: file is not goimported (goimports)
presidio-tester/cmd/presidio-tester/tester.go:1::warning: file is not goimported (goimports)
tests/common/common.go:1::warning: file is not goimported (goimports)
tests/functional_http_test.go:1::warning: file is not goimported (goimports)
tests/functional_recognizers_test.go:1::warning: file is not goimported (goimports)
tests/functional_storage_scanner_test.go:1::warning: file is not goimported (goimports)
tests/integration_anonymize_image_test.go:1::warning: file is not goimported (goimports)
tests/integration_eventhub_test.go:1::warning: file is not goimported (goimports)
tests/integration_kafka_test.go:1::warning: file is not goimported (goimports)
tests/integration_redis_test.go:1::warning: file is not goimported (goimports)
tests/integration_tesseract_test.go:1::warning: file is not goimported (goimports)
make: *** [Makefile:181: go-test-style] Error 1
The command '/bin/sh -c make go-test' returned a non-zero code: 2
make: *** [docker-build-base] Error 2

Analyzer - extract language model related code to a LanguageModel class for extensibility and testability

Multiple false positives for credit card number

Describe the bug
Multiple false positives are returned for a credit card number. We should consider dismissing the other results if the checksum passes.

To Reproduce
analyze "My credit card number is 6011577631711174"
analyze "My credit card number is 6011-5776-3171-1174"

Expected behavior
credit card with score 1.0

Screenshot 1
image

Screenshot 2
image
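
The idea of dismissing other results once the checksum passes can be sketched as follows. This is a minimal illustration and not Presidio's implementation: the result dictionaries, the `keep_checksum_match` helper and the overlapping `PHONE_NUMBER` result are assumptions made for the example; only the Luhn checksum itself is standard.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def keep_checksum_match(text, results):
    """Drop results overlapping a checksum-validated CREDIT_CARD result."""
    validated = [
        r for r in results
        if r["entity"] == "CREDIT_CARD" and luhn_valid(text[r["start"]:r["end"]])
    ]
    validated_ids = {id(r) for r in validated}
    kept = []
    for r in results:
        if id(r) in validated_ids:
            kept.append({**r, "score": 1.0})  # checksum passed: full confidence
        elif not any(r["start"] < v["end"] and v["start"] < r["end"] for v in validated):
            kept.append(r)  # unrelated span, keep as-is
    return kept


text = "My credit card number is 6011577631711174"
start = text.index("6011")
results = [
    {"entity": "CREDIT_CARD", "start": start, "end": start + 16, "score": 0.5},
    {"entity": "PHONE_NUMBER", "start": start, "end": start + 10, "score": 0.4},
]
print(keep_checksum_match(text, results))
```

With these hypothetical inputs only the credit card result remains, with its score raised to 1.0.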

word boundaries

I have a lot of custom PII types that I want to add.
I am running into many issues where numbers result in overlapping classifications.

For example,
1234567890 9844412312312323 at https://presidio-demo.azurewebsites.net results in <DATE_TIME>UMBER>.

I would like to understand how this scenario occurs so that I can best resolve the issue.
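
One common cause of overlapping matches on numbers, though not necessarily the one behind the demo output above, is a pattern without word boundaries matching inside a longer digit run. A minimal sketch with a hypothetical 10-digit pattern:

```python
import re

text = "1234567890 9844412312312323"

# Hypothetical 10-digit pattern, with and without word boundaries.
loose = re.compile(r"\d{10}")
strict = re.compile(r"\b\d{10}\b")

loose_matches = [m.group() for m in loose.finditer(text)]
strict_matches = [m.group() for m in strict.finditer(text)]

print(loose_matches)   # ['1234567890', '9844412312']
print(strict_matches)  # ['1234567890']
```

The loose pattern also matches the first ten digits of the 16-digit number, which can then overlap with whatever another recognizer reports for the full number.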

Analyzer - improve performance of analyze_text due to context similarity

Analyzer - spaCy NER recognizes cities and countries as locations, but not addresses

Describe the bug
Only cities and countries are recognized as locations when using spaCy NER.

To Reproduce
Steps to reproduce the behavior:

  1. Analyze the text '5507 Ringgold Road, Chattanooga, TN 37412'
  2. 5507 is recognized as DATE_TIME, the address is not recognized, and Chattanooga is recognized as a location. TN is not recognized as a state.

Wrong HTTP status code returned on empty analyzer response

Describe the bug
If the analyze result is empty, a 400 HTTP status code is returned. It should be 200.

To Reproduce
Steps to reproduce the behavior:

  1. have a running presidio instance
  2. create a simple template
  3. use this template with some super simple text like "hello"
  4. you should receive the correct response with the wrong status code

Expected behavior
status code should be 200

Presidio demo - Anonymize text as default option

Describe the bug
The demo currently starts with none of the detected fields anonymized. Users do not see any of Presidio's functionality unless they change the action from None to something else.

To Reproduce
Start the demo

Expected behavior
Processed text should have detected fields anonymized instead of ignored.

presidio-ocr fails building if tesseract libs are missing

Describe the bug
Building binaries (make build) fails if tesseract libs are not available.

To Reproduce
Steps to reproduce the behavior:

  1. Build using makefile
$ make build
go build -ldflags ' -X github.com/Microsoft/presidio/pkg/version.Version=v0.3-21-g78be69b' -o bin/presidio-anonymizer ./presidio-anonymizer/cmd/presidio-anonymizer
go build -ldflags ' -X github.com/Microsoft/presidio/pkg/version.Version=v0.3-21-g78be69b' -o bin/presidio-ocr ./presidio-ocr/cmd/presidio-ocr
# github.com/Microsoft/presidio/vendor/github.com/otiai10/gosseract
tessbridge.cpp:5:10: fatal error: tesseract/baseapi.h: No such file or directory
    5 | #include <tesseract/baseapi.h>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:26: presidio-ocr] Error 2
  2. See error

Expected behavior

  1. State the dependency in the Development docs.
  2. Explicitly add this dependency, if possible.

Not sure if (2) is doable, though. For Fedora 30 I fixed this by installing the tesseract-devel package. For Ubuntu the libtesseract-dev package can be used. (Source: otiai10/gosseract#132 (comment)).
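
Until the dependency is documented, a pre-build check could report the missing header directly. The following is only a sketch: the `find_tesseract_header` helper and the list of include directories are assumptions, and the actual header location depends on the distribution and toolchain.

```python
import os

# Hypothetical list of include directories; adjust for the local toolchain.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_tesseract_header(include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the path to tesseract/baseapi.h, or None if it is not found."""
    for directory in include_dirs:
        candidate = os.path.join(directory, "tesseract", "baseapi.h")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    header = find_tesseract_header()
    if header is None:
        print("tesseract headers not found; install tesseract-devel (Fedora) "
              "or libtesseract-dev (Ubuntu) before running make build")
    else:
        print("found", header)
```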

Tutorial on how to add a custom field_type

Describe the bug
Please create a tutorial on how to add a custom field_type.

To Reproduce
N/A

Expected behavior
A good framework should allow the public to contribute.

Screenshots
N/A

Additional context
Give me a head start and I may be able to contribute.

Documentation request: Sink and Scanner Service

Describe the bug
Are the Sink Service and Scanner Service available? If so, can you create documentation on how to get started? Thanks.

To Reproduce
N/A

Expected behavior
N/A

Screenshots
N/A

Additional context
N/A

crypto_recognizer throws an exception

When calling the engine analyze API like

response = engine.analyze(correlation_id=0,
                          text=text_to_analyze,
                          language='en',
                          entities=[],
                          all_fields=True,
                          score_threshold=0.5)

and the value of 'text_to_analyze' is

"/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7/registrations/pass.apoc.wallet/"

The exception below is thrown

  File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/analyzer_engine.py", line 204, in analyze
    current_results = recognizer.analyze(text, entities, nlp_artifacts)
  File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 61, in analyze
    pattern_result = self.__analyze_patterns(text)
  File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/pattern_recognizer.py", line 144, in __analyze_patterns
    validation_result = self.validate_result(current_match)
  File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 23, in validate_result
    bcbytes = CryptoRecognizer.__decode_base58(pattern_text, 25)
  File "/home/folder_name/presidio_testing/my_venv/lib/python3.6/site-packages/analyzer/predefined_recognizers/crypto_recognizer.py", line 33, in __decode_base58
    n = n * 58 + digits58.index(char)

ValueError: substring not found
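
The traceback suggests that a candidate passed the recognizer's pattern but contained a character outside the Base58 alphabet (the device ID in the example contains `0`, which is not a Base58 character), so `digits58.index(char)` raised. A minimal sketch of a validation that treats such input as "not a valid address" instead of raising is shown below. The function names are hypothetical and this is not the project's actual fix.

```python
from hashlib import sha256

# Base58 alphabet as used by Bitcoin: no 0, O, I or l.
DIGITS58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def decode_base58(value, length):
    """Decode a Base58 string into `length` bytes; raises on invalid input."""
    n = 0
    for char in value:
        n = n * 58 + DIGITS58.index(char)  # ValueError for non-Base58 characters
    return n.to_bytes(length, "big")       # OverflowError if the value is too large


def is_valid_bitcoin_address(value):
    """Return True only if `value` decodes and its 4-byte checksum matches."""
    try:
        raw = decode_base58(value, 25)
    except (ValueError, OverflowError):
        return False
    return raw[-4:] == sha256(sha256(raw[:-4]).digest()).digest()[:4]


print(is_valid_bitcoin_address("34e7b5e1a0aa1d6f3d862b52a289cdb7"))  # False
```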

Analyzer - Add analyzer support for JSON files

Currently the analyzer is not trained to analyze JSON files.
Therefore, when analyzing these files, the fields are detected incorrectly.
Examples:

  1. {
       "text": "uuid:26c446d2-65a9-4eaa-9a70-53da2aded653\"}},{\"url\":\"http://synthetichealth.github.io",
       "field": {
         "name": "PERSON"
       },
       "probability": 0.85,
       "location": {
         "start": 1920,
         "end": 2005,
         "length": 85
       }
     },

  2. {
       "text": "oid:2.16.840.1.113883.4.3.25","value":"S99910353"},{"type":{"coding":[{"system":"http://hl7.org",
       "field": {
         "name": "NRP"
       },
       "probability": 0.85,
       "location": {
         "start": 2898,
         "end": 2993,
         "length": 95
       }
     }

JSON files attached:

before-anonymization2.txt
before-anonymization.txt
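
One possible approach, sketched here under assumptions and not taken from the project, is to parse the JSON first and pass only the string values to the analyzer, so that structural characters such as braces and quotes never reach the recognizers. The `iter_string_values` helper and the sample document are hypothetical.

```python
import json


def iter_string_values(node, path=""):
    """Yield (path, value) for every string leaf in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_values(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_string_values(value, f"{path}[{index}]")
    elif isinstance(node, str):
        yield path, node


# Hypothetical sample document; each value would be analyzed on its own.
document = json.loads(
    '{"identifier": [{"system": "http://hl7.org", "value": "S99910353"}],'
    ' "name": "John Smith"}'
)
for path, value in iter_string_values(document):
    print(path, "->", value)
```

Keeping the path next to each value makes it possible to write anonymized values back into the same position in the document.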

Analyzer doesn't return start values when the start is at 0

Send a request with an entity as the first token. The response object doesn't include a start value.
For example:

{"text":"Microsoft is the company name", "analyzeTemplate":{"allFields":true}  }

Response:

[
    {
        "field": {
            "name": "COMPANY_NAME"
        },
        "score": 1,
        "location": {
            "end": 9,
            "length": 9
        }
    }
]
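
Until the response includes the field, a client can reconstruct the missing value from `end` and `length`. This is a client-side workaround sketch with a hypothetical `with_start` helper, not a fix for the service itself.

```python
def with_start(location):
    """Return a copy of a location dict with "start" filled in if it is missing."""
    filled = dict(location)
    filled.setdefault("start", filled["end"] - filled["length"])
    return filled


print(with_start({"end": 9, "length": 9}))  # {'end': 9, 'length': 9, 'start': 0}
```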

Add readiness probe to analyzer

When a new pod is started, it could get a gRPC request even though it hasn't finished loading.
Requests sent to the pod during this time would fail.

Implement a readiness probe for the pods in K8s.

Can it be used in ETL flow

Hi,

We perform ETL operations to move data from blob storage into ADLS datalake and SQL Servers.
Can this software be used as a potential transformation tool in the ETL load process to identify and hash the PII data?

I followed the analyzer documentation and it gave some good results when passed a text file in a standalone Python executor. Do you have any documentation on how to start working with the anonymizer and with scanning pictures?

Thank You

How to install on simple unix box

Hi,
This is a question. As per the installation instructions, it can be installed with either Docker or Helm (Redis etc.).
Can't we simply install Presidio on a Windows or Unix box with something like pip or git clone?

Language field of template-entity is ignored when unknown or empty

Set template:

echo -n '{"fields":[{"name": "PERSON", "languageCode":"HEB" }]}' | http http://localhost:8080/api/v1/templates/123/analyze/my-template

Call analyzer:

echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC111921", "analyzeTemplateId":"my-template"  }' | http http://localhost:8080/api/v1/projects/123/analyze

result is:

[
    {
        "field": {
            "name": "PERSON"
        },
        "location": {
            "end": 10,
            "length": 10
        },
        "score": 0.85
    }
]

should be empty.

Same applies for empty language code.

Custom recognizers won't work

Hello.

For some reason custom recognizers don't work on my current presidio instance.

I'm following the documentation:

I post into api/v1/analyzer/recognizers/spaceship
{ "value": { "entity": "SPACESHIP", "language": "en", "patterns": [ { "name": "spaceship regex", "regex": "\\W*(spaceship)\\W*", "score": 0.9 } ] } }

And then I try to run the analyzer posting into /api/v1/projects/testing/analyze in two ways:
{ "text":"They went inside the spaceship", "analyzeTemplate": { "allFields": true } }
Response:
"No results"

{ "text":"They went inside the spaceship", "analyzeTemplate": { "fields": [ { "name": "SPACESHIP" } ] } }
Response
"rpc error: code = Unknown desc = Exception calling application: No matching recognizers were found to serve the request."

Any thoughts ?

Regards.
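
To narrow down where the problem lies, the pattern from the request can be checked in isolation. The sketch below uses Python's `re` module, which is an assumption about how the analyzer evaluates patterns. Under that assumption the regex itself matches the sample text, which would point to how the recognizer is stored or loaded rather than to the pattern.

```python
import re

# The JSON string "\\W*(spaceship)\\W*" decodes to this pattern.
pattern = re.compile(r"\W*(spaceship)\W*")
text = "They went inside the spaceship"

match = pattern.search(text)
print(match.group(1), match.span(1))  # spaceship (21, 30)
```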

Syntax Errors using the analyzer as Python package

Using the analyzer as a Python package, I encountered a few errors. Here I describe how to reproduce them and how I (temporarily) worked around each one in order to reach the next:
I first successfully followed the tutorial instructions for installing presidio-analyzer as a Python package by creating a wheel file. After running into some of the issues, I kept modifying the scripts (as described in full detail below) and running the test script (step 5 of the installation, https://github.com/microsoft/presidio/blob/master/docs/install.md) in the same directory as the analyzer folder (so that the script recognizes it as a module).

  1. In the file presidio-analyzer/analyzer/recognizer_result.py, __init__ is defined as def __init__(self, entity_type, start, end, score, analysis_explanation: AnalysisExplanation = None):
    So I get a SyntaxError in “analysis_explanation: AnalysisExplanation = None”.
    Exact error:
    File "/.../presidio-analyzer/analyzer/recognizer_result.py", line 7
    analysis_explanation: AnalysisExplanation = None):
    ^
    SyntaxError: invalid syntax
    I fixed it temporarily by deleting “: AnalysisExplanation = None” (I don’t believe this is the correct fix because it makes the “from . import AnalysisExplanation” line superfluous), so I got the next errors.
  2. In the file presidio-analyzer/analyzer/predefined_recognizers/iban_patterns.py I get a SyntaxError related to the patterns for the IBAN.
    Exact error:
    File "/.../presidio-analyzer/analyzer/predefined_recognizers/iban_patterns.py", line 39
    SyntaxError: Non-ASCII character '\xc2' in file /.../presidio-analyzer/analyzer/predefined_recognizers/iban_patterns.py on line 39, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
    I deleted line 7 of the predefined_recognizers __init__ that imports IbanRecognizer and deleted all references to IbanRecognizer in the file recognizer_registry.py (in the import and in the load_predefined_recognizers method). Basically, I dropped the use of iban_recognizer.py.
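
Both errors are characteristic of running the code with a Python 2 interpreter: parameter annotations and undeclared non-ASCII source bytes are rejected by Python 2 but accepted by Python 3. A small guard like the following would make that cause explicit. It is a sketch with a hypothetical `require_python3` helper that is not part of Presidio, and it deliberately avoids annotations and f-strings so that it also parses under Python 2.

```python
import sys


def require_python3(version_info=None):
    """Raise RuntimeError unless the interpreter is Python 3 or newer."""
    if version_info is None:
        version_info = sys.version_info
    if version_info[0] < 3:
        raise RuntimeError(
            "Python 3 is required, found %d.%d" % (version_info[0], version_info[1])
        )


# Call this before importing modules that use Python 3-only syntax.
require_python3()
```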

Tutorial method 2 error: core dump and corrupted size vs. prev_size

Describe the bug
https://microsoft.github.io/presidio/tutorial_framework.html

Try Method 2
Use the analyzer Python code by importing matcher.py from presidio-analyzer/analyzer

match = matcher.Matcher()
results = self.match.analyze_text(text, fields)

To Reproduce

  1. Create "a.py" in the subfolder "presidio-analyzer"
  2. Install dependencies
  3. Execute with Python
chenglim@chenglim-GL503VM:/amaris/code/presidio/presidio-analyzer$ python3 a.py 
unable to cache TLDs in file /usr/local/lib/python3.6/dist-packages/tldextract/.tld_set: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/tldextract/.tld_set'
<class 'analyzer.matcher.Matcher'>
[text: "(212) 555-1234"
field {
  name: "PHONE_NUMBER"
}
score: 1.0
location {
  start: 91
  end: 105
  length: 14
}
, text: "New York"
field {
  name: "LOCATION"
}
score: 0.8500000238418579
location {
  start: 21
  end: 29
  length: 8
}
, text: "Seattle"
field {
  name: "LOCATION"
}
score: 0.8500000238418579
location {
  start: 59
  end: 66
  length: 7
}
, text: "yesterday"
field {
  name: "DATE_TIME"
}
score: 0.8500000238418579
location {
  start: 38
  end: 47
  length: 9
}
, text: "morning"
field {
  name: "DATE_TIME"
}
score: 0.8500000238418579
location {
  start: 48
  end: 55
  length: 7
}
]
corrupted size vs. prev_size
Aborted (core dumped)
chenglim@chenglim-GL503VM:/amaris/code/presidio/presidio-analyzer$ cat a.py 
from analyzer import matcher

class DictAttr(dict):
    def __getattr__(self, key):
        if key not in self:
            raise AttributeError(key)
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value

    def __delattr__(self, key):
        del self[key]

text = """
John Smith lives in New York. We met yesterday morning in Seattle.
I called him before on (212) 555-1234 to verify the appointment.
He also told me that his drivers license is AC111921
"""

fields = []
field = ["PHONE_NUMBER", "LOCATION", "DATE_TIME"]

for f in field:
    a = DictAttr()
    a.name = f
    fields.append(a)


match = matcher.Matcher()
print(type(match))

results = match.analyze_text(text, fields)
print(results)

Expected behavior

  1. No error "corrupted size vs. prev_size"
  2. No "Aborted (core dumped)"
  3. No "unable to cache TLDs in file /usr/local/lib/python3.6/dist-packages/tldextract/.tld_set: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/tldextract/.tld_set'" warning

Screenshots
N/A

Additional context
I suspect this has to do with tldextract.

Feature request: analyzer should return the matching string and original text

Describe the bug
From https://github.com/Microsoft/presidio/blob/master/docs/tutorial_service.md, Sample 4: Custom anonymization.

Currently the result returned is:
{
  "field": {
    "name": "US_DRIVER_LICENSE"
  },
  "score": 0.65,
  "location": {
    "start": 176,
    "end": 184,
    "length": 8
  }
}

It would be good if the matching text were also returned, so that it is easier to debug:
{
  "field": {
    "name": "US_DRIVER_LICENSE"
  },
  "score": 0.65,
  "location": {
    "start": 176,
    "end": 184,
    "length": 8
  },
  "match_text": "AC333991"
}

It would also be good if the original "text" were returned:
{
.
.
.
"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991"
}

To Reproduce

$ echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991", "analyzeTemplate":{"allFields":true}  }' | http -F --verify=no https://192.168.1.44/api/v1/projects/1/analyze

Expected behavior
N/A

Screenshots
N/A

Additional context
It is very common for an API to also return the original text plus the matching string.
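
In the meantime, a client that still holds the original text can derive the matching string from the returned location. The `add_match_text` helper below is a hypothetical client-side sketch, and `match_text` is the field name proposed in this request rather than an existing response field.

```python
def add_match_text(text, results):
    """Return copies of the results with the matched substring attached."""
    enriched = []
    for result in results:
        location = result["location"]
        item = dict(result)
        item["match_text"] = text[location["start"]:location["end"]]
        enriched.append(item)
    return enriched


text = ("John Smith lives in New York. We met yesterday morning in Seattle. "
        "I called him before on (212) 555-1234 to verify the appointment. "
        "He also told me that his drivers license is AC333991")
results = [{"field": {"name": "US_DRIVER_LICENSE"}, "score": 0.65,
            "location": {"start": 176, "end": 184, "length": 8}}]

print(add_match_text(text, results)[0]["match_text"])  # AC333991
```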

Analyzer memory leak

Run Presidio under persistent load for a long time.
Notice that memory in the analyzer increases over time; eventually the pod restarts.

We ruled out recognizers, context and other factors by commenting out all code in the analyzer's main methods (Apply and analyze) except for nlp_engine.process_text(), and memory still increases.

This is aligned with the following GitHub issues on spaCy memory increasing over time due to internal cache usage:

explosion/spaCy#3618
explosion/spaCy#3013
explosion/spaCy#285

How is it different from NER?

This is not a bug but a question. The solution looks promising, but I am wondering how it differs from an NER system (or LUIS entity recognition). I can see that spaCy NER is being used. Is it an extension to an NER system exposed as a service, or is there some other intelligence added to the system?

Add SECURITY.MD

Add the new SECURITY.MD file for repos under the Microsoft organization.

Analyzer regex optimization and performance

In order to match the syntax supported by re2, we need to modify the regexes to the following values:
CREDIT_CARD: \b((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b (Also remove the Diners pattern, since this covers it.)

DOMAIN: \b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b

CRYPTO: \b[13][a-km-zA-HJ-NP-Z0-9]{26,33}\b

US_DRIVERS_LICENSE (WA): unknown at the moment
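
As a quick sanity check, the CREDIT_CARD and CRYPTO patterns above can be run against sample strings from the other reports. This sketch uses Python's built-in `re` module, so it only shows what the patterns match and does not verify RE2 compatibility.

```python
import re

CREDIT_CARD = re.compile(
    r"\b((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b"
)
CRYPTO = re.compile(r"\b[13][a-km-zA-HJ-NP-Z0-9]{26,33}\b")

samples = [
    "My credit card number is 6011577631711174",
    "My credit card number is 6011-5776-3171-1174",
]
for sample in samples:
    print(CREDIT_CARD.search(sample).group())

# The CRYPTO pattern alone also matches a hex device ID, so the
# checksum validation step is still needed to reject it.
path = ("/boardingPass/v1/devices/34e7b5e1a0aa1d6f3d862b52a289cdb7"
        "/registrations/pass.apoc.wallet/")
print(CRYPTO.search(path).group())
```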
