Text-Analytics-Project1

VAMSI THOKALA

This Python script redacts sensitive information from text files. It uses the spaCy library to detect entities and regular expressions to redact specific keywords. It can redact names, gender-related words, dates, phone numbers, and addresses.

Parameters

--names: Redact names
--genders: Redact gender-related words
--dates: Redact dates
--phones: Redact phone numbers
--address: Redact addresses
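
The flags above, together with the --input, --output, and --stats options shown under Usage, could be declared with argparse roughly as follows. This is a minimal sketch, not the project's actual parser: the function name build_parser, the help strings, and the stderr default for --stats are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the redaction flags as simple on/off switches (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Redact sensitive information from text files."
    )
    parser.add_argument("--input", required=True, help="glob pattern for input files")
    parser.add_argument("--output", required=True, help="folder for redacted files")
    # Each redaction category is off unless its flag is present.
    parser.add_argument("--names", action="store_true", help="redact names")
    parser.add_argument("--genders", action="store_true", help="redact gender-related words")
    parser.add_argument("--dates", action="store_true", help="redact dates")
    parser.add_argument("--phones", action="store_true", help="redact phone numbers")
    parser.add_argument("--address", action="store_true", help="redact addresses")
    # Assumption: stderr is the default destination for statistics.
    parser.add_argument("--stats", default="stderr", help="where to write statistics")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--input", "*.txt", "--names", "--dates", "--output", "files/"]
    )
    print(args.names, args.genders, args.stats)
```

Using store_true keeps every category opt-in, so a run without any category flag would leave the text unchanged.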

Sample stats output file

The stats file (for example, Sample_text.txt.stats) has the following format:

names: 25
genders: 1
dates: 10
phones: 7
addresses: 10

Each line of the stats file gives the number of redacted items for one category.
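
A stats writer that produces this format might look like the sketch below. It follows the write_stats(stats, file_path) signature listed under Functions, but the body is an assumption: the helper format_stats and the special handling of the names stderr and stdout are hypothetical and may differ from what the script really does.

```python
import sys
from typing import Dict


def format_stats(stats: Dict[str, int]) -> str:
    """Render one 'category: count' line per category, in insertion order."""
    return "".join(f"{category}: {count}\n" for category, count in stats.items())


def write_stats(stats: Dict[str, int], file_path: str) -> None:
    """Write statistics to a file, or to a standard stream if named (assumed behavior)."""
    text = format_stats(stats)
    if file_path == "stderr":
        sys.stderr.write(text)
    elif file_path == "stdout":
        sys.stdout.write(text)
    else:
        with open(file_path, "w", encoding="utf-8") as handle:
            handle.write(text)


if __name__ == "__main__":
    write_stats({"names": 25, "genders": 1, "dates": 10}, "stderr")
```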

Usage

1. Install the required libraries:

pip install spacy
python -m spacy download en_core_web_md

2. Run the script with the appropriate flags and arguments:

python redactor.py --input '*.txt' \
    --names --dates --phones --genders --address \
    --output 'files/' \
    --stats stderr

Notes

1. The script reads input files with the .txt extension in the specified folder.
2. The output folder will contain redacted text files with the .redacted extension and stats files with the .stats extension. Each stats file contains the number of redacted items for each category.
3. The --stats flag controls where the redaction statistics are written. The example above passes stderr, so they go to standard error.
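
The mapping from input files to output files described in these notes can be sketched as below. The helper names output_path and plan_outputs are hypothetical, and the sketch assumes the .redacted extension is appended to the full input file name (as Sample_text.txt.stats suggests for stats files) rather than replacing .txt.

```python
import glob
import os
from typing import List, Tuple


def output_path(input_path: str, output_folder: str) -> str:
    """Map an input file to its redacted counterpart inside the output folder."""
    # Assumption: the extension is appended, e.g. notes.txt -> notes.txt.redacted
    return os.path.join(output_folder, os.path.basename(input_path) + ".redacted")


def plan_outputs(input_glob: str, output_folder: str) -> List[Tuple[str, str]]:
    """Pair every file matched by the glob pattern with its output path."""
    return [
        (path, output_path(path, output_folder))
        for path in sorted(glob.glob(input_glob))
    ]


if __name__ == "__main__":
    for source, target in plan_outputs("*.txt", "files/"):
        print(source, "->", target)
```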

Functions

The script contains the following functions:

redact_names(text: str) -> str: Redacts names from the given text.
redact_genders(text: str) -> str: Redacts gender-related words from the given text.
redact_dates(text: str) -> str: Redacts dates from the given text.
redact_phones(text: str) -> str: Redacts phone numbers from the given text.
redact_address(text: str) -> str: Redacts addresses from the given text.
count_redactions(text: str, redacted_text: str) -> Dict[str, int]: Counts the number of redactions made in the given text for each category.
write_stats(stats: Dict[str, int], file_path: str): Writes the redaction statistics to a file.
process_files(input_glob: str, output_folder: str, flags: List[str]): Processes the input files, applies redactions based on the flags, and saves the redacted files and statistics in the output folder.
main(): The main function that parses the command-line arguments and calls the process_files function.
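
As an illustration of the regular-expression side of these functions, here is a minimal sketch of what redact_phones could look like. The pattern (North American style numbers only) and the █ replacement character are assumptions for the example; the script's actual pattern and redaction character are not documented here.

```python
import re

# Assumed pattern: optional +1 prefix, 3-digit area code (optionally in
# parentheses), then 3 and 4 digits, separated by space, dot, or hyphen.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

REDACTION_CHAR = "\u2588"  # full block character, chosen for illustration


def redact_phones(text: str) -> str:
    """Replace each matched phone number with blocks of the same length."""
    return PHONE_RE.sub(lambda match: REDACTION_CHAR * len(match.group(0)), text)


if __name__ == "__main__":
    print(redact_phones("Call 405-555-0123 or (405) 555-0199 after noon."))
```

Replacing each match with a same-length run of blocks keeps the layout of the original text intact, which makes redacted and unredacted files easy to compare.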

Requirements

Python 3.6 or higher
spaCy library
en_core_web_md language model

Limitations

The script may not cover all possible cases for redaction.
Redaction quality depends on the accuracy of spaCy's entity recognition and on the regular expressions used for redaction.
The script currently only supports English text.

Customization

If you want to customize the script to redact additional information, you can do the following:

1. Add a new function to perform redaction for the new category, similar to the existing redaction functions.
2. Update the count_redactions function to count the redactions for the new category.
3. Update the process_files function to call the new redaction function based on a new flag.
4. Add the new flag to the command-line argument parser in the main function.
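
The first and third customization steps might be sketched as follows for a hypothetical emails category. Everything here is illustrative: redact_emails, the REDACTORS table, apply_redactions, the email pattern, and the █ character are assumptions, and the real process_files function may dispatch on flags differently.

```python
import re
from typing import Callable, Dict, List

# Assumed, deliberately simple email pattern for the example.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

REDACTION_CHAR = "\u2588"  # full block character, chosen for illustration


def redact_emails(text: str) -> str:
    """New redaction function, shaped like the existing redact_* functions."""
    return EMAIL_RE.sub(lambda match: REDACTION_CHAR * len(match.group(0)), text)


# Hypothetical flag-to-function table; a new category is one more entry.
REDACTORS: Dict[str, Callable[[str], str]] = {
    "emails": redact_emails,
}


def apply_redactions(text: str, flags: List[str]) -> str:
    """Run the redaction function registered for each requested flag, in order."""
    for flag in flags:
        text = REDACTORS[flag](text)
    return text


if __name__ == "__main__":
    print(apply_redactions("Mail bob@example.com today", ["emails"]))
```

A lookup table like this keeps the flag handling in one place, so adding a category does not require another branch in the processing loop.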

License

This script is provided under the MIT License. You can use, modify, and distribute the script, but you must include the license notice in all copies or substantial portions of the software.
