

anonympy 🕶️




With ❤️ by ArtLabs

Overview

A general data anonymization library for images, PDFs, and tabular data. See ArtLabs/projects for similar projects.


Main Features

Ease of use - this package was written to be as intuitive as possible.

Tabular

  • Efficient - based on pd.DataFrame
  • Numerous anonymization methods
    • Numeric data
      • Generalization - Binning
      • Perturbation
      • PCA Masking
      • Generalization - Rounding
    • Categorical data
      • Synthetic Data
      • Resampling
      • Tokenization
      • Partial Email Masking
    • Datetime data
      • Synthetic Date
      • Perturbation
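
The numeric techniques listed above can be illustrated with a minimal, library-independent sketch. It uses only the Python standard library. The function names (`bin_value`, `round_to_base`, `add_noise`) are hypothetical and are not part of anonympy's API, whose actual implementations may differ.

```python
import random


def bin_value(value, bin_width):
    """Generalization by binning: replace a value with its (low, high) interval."""
    low = (value // bin_width) * bin_width
    return (low, low + bin_width)


def round_to_base(value, base):
    """Generalization by rounding: snap a value to the nearest multiple of base."""
    return base * round(value / base)


def add_noise(value, max_delta, rng):
    """Perturbation: shift a value by a random integer in [-max_delta, max_delta]."""
    return value + rng.randint(-max_delta, max_delta)


ages = [33, 48]
rng = random.Random(0)  # seeded only to make the sketch reproducible

binned = [bin_value(a, 10) for a in ages]       # [(30, 40), (40, 50)]
rounded = [round_to_base(a, 10) for a in ages]  # [30, 50]
noisy = [add_noise(a, 5, rng) for a in ages]    # each within 5 of the original
```

Binning and rounding trade precision for privacy in a predictable way, while perturbation keeps the original scale but makes individual values unreliable.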

Images

  • Anonymization techniques
    • Personal Images (faces)
      • Blurring
      • Pixelated Face Blurring
      • Salt and Pepper Noise
    • General Images
      • Blurring
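
To make the salt-and-pepper technique concrete, here is a rough sketch on a grayscale image stored as nested lists. It is an assumption-laden illustration: `salt_and_pepper` is a hypothetical helper, it touches the whole image rather than only detected faces, and it is not how anonympy implements `face_SaP`.

```python
import random


def salt_and_pepper(image, amount, rng):
    """Return a copy of a grayscale image with some pixels set to 0 or 255.

    image  -- list of rows, each a list of ints in 0..255
    amount -- probability that any given pixel is replaced
    """
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # pepper (black) or salt (white)
    return noisy


image = [[128] * 8 for _ in range(8)]  # uniform gray 8x8 test image
noisy = salt_and_pepper(image, amount=0.3, rng=random.Random(42))
```

A higher `amount` hides more detail. In practice the noise would be restricted to the bounding box of each detected face.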

PDF

  • Find sensitive information and cover it with black boxes

Text, Sound

  • In Development

Installation

Dependencies

  1. Python (>= 3.7)
  2. cape-privacy
  3. faker
  4. pandas
  5. OpenCV
  6. pytesseract
  7. transformers
  8. . . . . .

Install with pip

The easiest way to install anonympy is with pip:

pip install anonympy

Because cape-privacy pins pandas/numpy versions that conflict with the other dependencies, it is recommended to install it separately, without its dependencies:

pip install cape-privacy==0.3.0 --no-deps 
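
After installing, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using the standard library's `importlib.metadata` (Python 3.8 or newer). The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages most likely to be involved in a version conflict.
for name in ("anonympy", "cape-privacy", "pandas", "numpy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```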

Install from source

You can also install the library from source:

git clone https://github.com/ArtLabss/open-data-anonimizer.git
cd open-data-anonimizer
pip install -r requirements.txt
make bootstrap
pip install cape-privacy==0.3.0 --no-deps 

Downloading Repository

Alternatively, download this repository from PyPI and run the following:

cd open-data-anonimizer
python setup.py install

Usage Example


More examples here

Tabular

>>> from anonympy.pandas import dfAnonymizer
>>> from anonympy.pandas.utils_pandas import load_dataset

>>> df = load_dataset() 
>>> print(df)
   name   age  birthdate   salary    web                                   email              ssn
0  Bruce  33   1915-04-17  59234.32  http://www.alandrosenburgcpapc.co.uk  [email protected]  343554334
1  Tony   48   1970-05-29  49324.53  http://www.capgeminiamerica.co.uk     [email protected]  656564664
# Calling the generic function
>>> anonym = dfAnonymizer(df)
>>> anonym.anonymize(inplace = False) # changes will be returned, not applied
   name             age  birthdate   salary   web         email              ssn
0  Stephanie Patel  30   1915-05-10  60000.0  5968b7880f  [email protected]  391-77-9210
1  Daniel Matthews  50   1971-01-21  50000.0  2ae31d40d4  [email protected]  872-80-9114
# Or applying a specific anonymization technique to a column
>>> from anonympy.pandas.utils_pandas import available_methods

>>> anonym.categorical_columns
['name', 'web', 'email', 'ssn']
>>> available_methods('categorical')
categorical_fake  categorical_fake_auto  categorical_resampling  categorical_tokenization  categorical_email_masking

>>> anonym.anonymize({'name': 'categorical_fake',  # {'column_name': 'method_name'}
                  'age': 'numeric_noise',
                  'birthdate': 'datetime_noise',
                  'salary': 'numeric_rounding',
                  'web': 'categorical_tokenization', 
                  'email':'categorical_email_masking', 
                  'ssn': 'column_suppression'})
>>> print(anonym.to_df())
   name               age  birthdate   salary   web         email
0  Paul Lang          31   1915-04-17  60000.0  8ee92fb1bd  j*****[email protected]
1  Michael Gillespie  42   1970-05-29  50000.0  51b615c92e  e*****[email protected]
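
The tokenized `web` values and masked `email` values in the output above can be approximated with a short standard-library sketch. Everything here is an assumption for illustration: `tokenize` and `mask_email` are hypothetical helpers, the keyed SHA-256 digest is one possible tokenization scheme, and the sample address is made up. The library's own methods may work differently.

```python
import hashlib
import hmac


def tokenize(value, key, length=10):
    """Replace a value with a truncated, keyed SHA-256 hex digest."""
    digest = hmac.new(key.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not an email address; leave unchanged
    return f"{local[0]}*****@{domain}"


token = tokenize("http://www.capgeminiamerica.co.uk", key="secret")  # 10 hex characters
masked = mask_email("jane.doe@example.com")  # 'j*****@example.com'
```

Using a key means the same input always maps to the same token within one dataset, so joins still work, while the token cannot be reproduced without the key.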

Images

# Passing an Image
>>> import cv2
>>> from anonympy.images import imAnonymizer

>>> img = cv2.imread('salty.jpg')
>>> anonym = imAnonymizer(img)

>>> blurred = anonym.face_blur((31, 31), shape='r', box = 'r')  # blurring shape and bounding box ('r' / 'c')
>>> pixel = anonym.face_pixel(blocks=20, box=None)
>>> sap = anonym.face_SaP(shape = 'c', box=None)
# Results (images not shown): blurred, pixel, sap
# Passing a Folder 
>>> path = 'C:/Users/shakhansho.sabzaliev/Downloads/Data' # images are inside `Data` folder
>>> dst = 'D:/' # destination folder
>>> anonym = imAnonymizer(path, dst)

>>> anonym.blur(method = 'median', kernel = 11) 

This creates a folder named Output inside the dst directory.

# The Data folder had the following structure

|   1.jpg
|   2.jpg
|   3.jpeg
|   
\---test
    |   4.png
    |   5.jpeg
    |   
    \---test2
            6.png

# The Output folder will have the same structure and file names but blurred images
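
The mirroring of the folder structure described above can be sketched with `pathlib`. This is only an illustration of the idea, not anonympy's code: `plan_outputs` and `IMAGE_SUFFIXES` are hypothetical names, and the demo builds a throwaway copy of part of the `Data` tree in a temporary directory.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_outputs(src, dst):
    """Map every image under src to the mirrored path under dst/Output."""
    src, out_root = Path(src), Path(dst) / "Output"
    plan = {}
    for path in sorted(src.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # relative_to keeps sub-folders such as test/test2 intact
            plan[path] = out_root / path.relative_to(src)
    return plan


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "Data"
    (data / "test" / "test2").mkdir(parents=True)
    for name in ("1.jpg", "test/4.png", "test/test2/6.png"):
        (data / name).touch()
    plan = plan_outputs(data, Path(tmp) / "dst")
    mirrored = sorted(t.relative_to(Path(tmp) / "dst").as_posix() for t in plan.values())
    # mirrored == ['Output/1.jpg', 'Output/test/4.png', 'Output/test/test2/6.png']
```

A real pipeline would create each target's parent directory and write the processed image there.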

PDF

To initialize a pdfAnonymizer object, install pytesseract and poppler, then either pass the paths to both binaries as arguments or add those paths to the system environment variables.

>>> from anonympy.pdf import pdfAnonymizer

# Paths are passed explicitly here because they are not set in the system variables
>>> anonym = pdfAnonymizer(path_to_pdf = "Downloads\\test.pdf",
                       pytesseract_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe",
                       poppler_path = r"C:\Users\shakhansho\Downloads\Release-22.01.0-0\poppler-22.01.0\Library\bin")

# Calling the generic function
>>> anonym.anonymize(output_path = 'output.pdf',
                     remove_metadata = True,
                     fill = 'black',
                     outline = 'black')
# Results (images not shown): test.pdf (input), output.pdf (anonymized)

If you only want to hide specific information, use the individual methods instead of anonymize:

>>> anonym = pdfAnonymizer(path_to_pdf = r"Downloads\test.pdf")
>>> anonym.pdf2images() #  images are stored in anonym.images variable 
>>> anonym.images2text(anonym.images) # texts are stored in anonym.texts

#  Entities of interest 
>>> locs: dict = anonym.find_LOC(anonym.texts[0])  # index refers to page number
>>> emails: dict = anonym.find_emails(anonym.texts[0])  # {page_number: [coords]}
>>> coords: list = locs['page_1'] + emails['page_1'] 

>>> anonym.cover_box(anonym.images[0], coords)
>>> display(anonym.images[0])
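
To show the kind of lookup a method like `find_emails` performs, here is a simplified sketch that searches per-page text with a regular expression. It is an assumption-heavy stand-in: `find_email_spans` and `EMAIL_RE` are hypothetical, the result holds character offsets rather than the image coordinates the library returns, and the sample text is invented.

```python
import re

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_spans(texts):
    """Return {'page_N': [(start, end, match), ...]} for each page's text."""
    result = {}
    for page_number, text in enumerate(texts, start=1):
        result[f"page_{page_number}"] = [
            (m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)
        ]
    return result


pages = ["Contact: jane.doe@example.com or call us.", "No addresses here."]
spans = find_email_spans(pages)
# spans["page_1"] == [(9, 29, 'jane.doe@example.com')]; spans["page_2"] == []
```

Mapping such matches back to pixel boxes requires the word positions reported by the OCR step, which this sketch leaves out.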

Development

Contributions

The Contributing Guide has detailed information about contributing code and documentation.

Important Links

License

BSD-3

Code of Conduct

Please see Code of Conduct. All community members are expected to follow it.
