
metacrafter's Introduction

APICrafter

API wrapper for MongoDB databases

APICrafter creates a Python Eve wrapper over one or more MongoDB databases, builds an Eve schema for each collection, and generates OpenAPI (Swagger) documentation.

Commands

Discover

Creates an apicrafter.yml API description file from a database or collection and automatically generates data schemas from the original data.

Build the API definition as apicrafter.yml: `apicrafter discover -h 127.0.0.1 -p 27017 -d rusregions`

Run

Uses the API definition from the apicrafter.yml file and launches an API server over MongoDB.

Run the server: `apicrafter run`

Examples

Please see the /examples directory for data and usage examples.


metacrafter's Issues

Rule for URLs

Is it possible to add a rule for detecting URLs?
pyparsing seems to have a convenient rule, pyparsing.pyparsing_common.url, but it is unclear how to add it.
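
A minimal sketch of how such a check could work, assuming pyparsing 3.x (where pyparsing_common.url is available); the is_url helper is hypothetical, not part of metacrafter:

```python
# Sketch: validate a value as a URL with pyparsing 3.x.
from pyparsing import ParseException
from pyparsing import pyparsing_common as ppc

def is_url(value: str) -> bool:
    """Return True if the whole value parses as a URL (hypothetical helper)."""
    try:
        ppc.url.parse_string(value, parse_all=True)
        return True
    except ParseException:
        return False

print(is_url("https://example.com/path?q=1"))  # True
print(is_url("not a url"))                     # False
```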

Add XML support

Support XML files, with the following list of tasks:

  • Support XML files when the XML tag name is provided
  • Add examples to the documentation
  • Collect examples with different data types and encodings
  • Write tests
  • Support automatic detection of the XML records tag
  • Support huge XML files (SAX parser; see the sketch after this list)
  • Add support for XML files to the server
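
For the huge-files task, here is a minimal sketch using the standard library's incremental parser (xml.etree.ElementTree.iterparse), which serves the same streaming purpose as a SAX handler; the record tag name in the usage note is an assumption about the input file:

```python
# Sketch: stream records from a huge XML file without loading it into memory.
import xml.etree.ElementTree as ET

def iter_records(path: str, record_tag: str):
    """Yield one dict per record element, freeing memory as we go."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release the parsed subtree so memory use stays flat

# Usage (the tag name 'item' is an assumption about the input file):
# for rec in iter_records("data.xml", "item"):
#     print(rec)
```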

Add rule caching

Without a cache, the tool reloads all rules on every run, which makes it harder to process thousands of datasets from the command line.
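
A minimal sketch of one possible caching approach, keyed on the rules file's modification time; the cache location, pickle format, and parse_rules callable are all assumptions, not metacrafter's actual internals:

```python
# Sketch: cache parsed rules on disk so repeated CLI runs skip re-parsing.
import os
import pickle

CACHE_PATH = os.path.expanduser("~/.metacrafter_rules_cache.pkl")  # hypothetical location

def load_rules_cached(rules_path: str, parse_rules):
    """Return parsed rules, reusing the cache while rules_path is unchanged."""
    mtime = os.path.getmtime(rules_path)
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            cached_mtime, rules = pickle.load(f)
        if cached_mtime == mtime:
            return rules
    rules = parse_rules(rules_path)  # the expensive step we want to avoid
    with open(CACHE_PATH, "wb") as f:
        pickle.dump((mtime, rules), f)
    return rules
```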

Can I apply rules (e.g. pii) during scan-db?

I have successfully run scan-db against my database.

I want to run scan-db with the pii rule but cannot see how this is possible from the examples. Is there an option to do this?

Many thanks

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column

Error processing a SQLite database containing non-UTF-8 text values.
Example: 000012_world.zip

```
Traceback (most recent call last):
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError: Could not decode to UTF-8 column 'name' with text '\ufffdland Islands'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Program Files\Python310\Scripts\metacrafter-script.py", line 33, in <module>
    sys.exit(load_entry_point('metacrafter==0.0.2', 'console_scripts', 'metacrafter')())
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\__main__.py", line 12, in main
    exit_status = cli()
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 464, in scan_db
    acmd.scan_db(
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 359, in scan_db
    items = [dict(u) for u in queryres.fetchall()]
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1288, in fetchall
    self.connection._handle_dbapi_exception(
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\util\compat.py", line 182, in raise_
    raise exception
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
    return self.cursor.fetchall()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column 'name' with text '\ufffdland Islands'
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```

Add 'strip' option to process whitespace-padded data

Sometimes exported CSV files include whitespace before or after values, for clearer formatting or to fit fixed-width data fields.
Whitespace should be removed automatically using the strip() function. To enable or disable this, add a '--strip' option to the command line and a strip parameter to the Python library. By default, don't strip (a sketch follows the example below).

Example: in_libs2.csv
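
A minimal sketch of how the proposed strip parameter could behave when reading CSV rows; read_rows is a hypothetical helper, not metacrafter's actual reader:

```python
# Sketch: optionally strip surrounding whitespace from every CSV value.
import csv

def read_rows(path: str, strip: bool = False):
    """Yield CSV rows as dicts; strip values only when strip=True (default off)."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if strip:
                row = {k: (v.strip() if isinstance(v, str) else v)
                       for k, v in row.items()}
            yield row
```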

Add empty values detection and reporting

Automate detection of empty values and exclude them from data analysis.

Possible empty values: None, 'N/A', empty string, 'NaN', 'None', '-'

The following actions are required:

  • Add empty values detection during the analysis stage (see the sketch after this list)
  • Add the detected empty values and their count to the analyzer report as empty_values and num_empty
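
A minimal sketch of the detection step; the report keys follow the proposal above, everything else is an assumption:

```python
# Sketch: detect empty values in a column and report them as proposed above.
EMPTY_VALUES = {None, "", "N/A", "NaN", "None", "-"}

def empty_stats(values):
    """Return the empty_values / num_empty fields proposed for the report."""
    found = [v for v in values if v in EMPTY_VALUES]
    return {
        "empty_values": sorted({str(v) for v in found}),
        "num_empty": len(found),
    }

print(empty_stats(["Paris", "", "N/A", None, "Oslo"]))
# {'empty_values': ['', 'N/A', 'None'], 'num_empty': 3}
```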

Add support of NoSQL databases

Add support for the following NoSQL databases and search engines: MongoDB, ArangoDB, Milvus, ArcadeDB, ElasticSearch, OpenSearch, MeiliSearch, Apache Cassandra, StarGate (MongoDB-like API over NoSQL databases)

The current state of database support:

  • MongoDB
  • ArangoDB
  • ElasticSearch
  • Meilisearch
  • Milvus
  • OpenSearch
  • ArcadeDB

Other tasks:

  • Write a universal class for NoSQL document-based databases (see the sketch after this list)
  • Replace the command-line command 'scan-mongodb' with 'scan-nosql', or update the 'scan-db' command to accept NoSQL database connection strings
  • Write documentation with connection string examples and limitations
  • Write tests for each database type
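
A hedged sketch of what the universal class could look like: an engine-agnostic iterator interface with one backend per database. All class and method names are assumptions, shown here with a pymongo-backed example:

```python
# Sketch: a universal interface for NoSQL document databases.
from abc import ABC, abstractmethod
from typing import Iterator

class DocumentSource(ABC):
    """Engine-agnostic view of a document collection (hypothetical design)."""

    @abstractmethod
    def collections(self) -> list[str]:
        """List collection/index names available in the database."""

    @abstractmethod
    def iter_documents(self, collection: str, limit: int = 1000) -> Iterator[dict]:
        """Yield documents as plain dicts for the scanner to analyze."""

class MongoSource(DocumentSource):
    """MongoDB backend built on pymongo."""

    def __init__(self, client, dbname: str):
        self.db = client[dbname]

    def collections(self) -> list[str]:
        return self.db.list_collection_names()

    def iter_documents(self, collection: str, limit: int = 1000) -> Iterator[dict]:
        yield from self.db[collection].find().limit(limit)
```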

Add schema for report JSON and improve reporting

Right now the JSON file of the metadata scanning report is not structured well enough.
Improvements should include:

  • Add a Cerberus schema (more info: https://docs.python-cerberus.org; see the sketch after this list)
  • Add the scanning datetime
  • Add source info: source type, filename, connection string, etc. Make sure no secrets end up in the connection string
  • Move 'table' to a 'source' subtag
  • Add tests that validate reports with the Cerberus validator
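
A minimal sketch of the Cerberus schema and a validation check; the report field names are assumptions based on the list above, not the final schema:

```python
# Sketch: validate a scan report against a Cerberus schema.
from cerberus import Validator

REPORT_SCHEMA = {  # field names are assumptions, not the final schema
    "datetime": {"type": "string", "required": True},
    "source": {
        "type": "dict",
        "schema": {
            "type": {"type": "string"},
            "filename": {"type": "string"},
            "table": {"type": "string"},
        },
    },
    "fields": {"type": "list", "schema": {"type": "dict"}},
}

v = Validator(REPORT_SCHEMA)
report = {
    "datetime": "2023-01-01T00:00:00",
    "source": {"type": "file", "filename": "x.csv"},
    "fields": [],
}
assert v.validate(report), v.errors
```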

Consider adding named entity recognition

Named entity recognition (NER) helps to identify named objects inside texts.

Strengths

  • allows identifying objects inside text blobs
  • could allow supporting more named entities (identifiers)

Weaknesses

  • could be very slow
  • PII and identifier rules would need to be prepared for recognition

Possible implementation: Slovnet https://github.com/natasha/slovnet

Add extended reporting

Right now the report includes only: field name, data type, tags, semantic type id, and registry URL.
Sometimes additional information is required, and it is already collected during the matching process.

Consider adding the following (already collected) data to the report (a sketch computing these statistics follows these lists):

  • number of unique values
  • share of unique values
  • minimal length
  • max length
  • average length
  • minimal value
  • maximum value

Consider collecting and adding the following info:

  • has alphas
  • has digits
  • has special chars

If possible, add the following:

  • reconstructed regexp: a regular expression reconstructed from the data sample
  • named entities: named entities extracted by a named entity detection tool such as Microsoft Presidio or Slovnet
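
A minimal sketch computing the statistics listed above from a sample of field values; field_stats and its keys are proposed names, not the current report format:

```python
# Sketch: compute the per-field statistics proposed above from a value sample.
import re

def field_stats(values):
    """Return proposed extended-report fields for one column's values."""
    strings = [str(v) for v in values]
    lengths = [len(s) for s in strings]
    unique = set(strings)
    return {
        "num_unique": len(unique),
        "share_unique": len(unique) / len(strings),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "min_value": min(strings),   # lexicographic in this sketch
        "max_value": max(strings),
        "has_alphas": any(c.isalpha() for s in strings for c in s),
        "has_digits": any(c.isdigit() for s in strings for c in s),
        "has_special": any(re.search(r"[^A-Za-z0-9]", s) for s in strings),
    }

print(field_stats(["alpha-1", "beta-2", "beta-2"]))
```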

Add identification of database-generated primary and foreign table identifiers

Some database fields are just incremental unique identifiers generated by the database engine. They cannot be linked to any external identifier databases and are used only locally.

There is a need to detect such fields as database-generated integer IDs.
Common names for such IDs are id or <object name/table name>_id.

The way to implement:

  1. Identify whether the table field is a primary key with the auto_increment feature (see the sketch after this list)
  2. Mark the field as a database-generated ID key
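
A minimal sketch of step 1 using SQLAlchemy's inspector, assuming the scanned database is reachable through SQLAlchemy; note that whether 'autoincrement' is reported depends on the dialect:

```python
# Sketch: flag primary-key columns that look database-generated.
from sqlalchemy import create_engine, inspect

def generated_id_columns(db_url: str, table: str) -> list[str]:
    """Return names of primary-key columns reported as autoincrementing."""
    insp = inspect(create_engine(db_url))
    pk_cols = set(insp.get_pk_constraint(table)["constrained_columns"])
    flagged = []
    for col in insp.get_columns(table):
        # 'autoincrement' may be True, False, or 'auto' depending on the
        # dialect, and may be missing entirely for some backends.
        if col["name"] in pk_cols and col.get("autoincrement") in (True, "auto"):
            flagged.append(col["name"])
    return flagged

# Usage (connection string and table name are placeholders):
# print(generated_id_columns("sqlite:///example.db", "users"))
```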

Add analysis of schema structure: decomposition of field keys and subtypes

Flat table datasets (CSV files), database tables, and sometimes objects with nested objects often include elements that could be grouped.

For example, the CSV file Zaara_D.csv includes the following fields: title, text, date, place, placeURL, placeLocation, placeType, reviewScore, avgScore.

We can see that the prefix 'place' is a subtype identifier. It could be decomposed as
place:

  • Name
  • Location
  • URL
  • Type

And the postfix Score identifies the value type, whether integer or float.

Most data tables use case changes or the '_' symbol as dividers. Very rarely, the '-' symbol is also used.

Detection of field groups and decomposition of field names could help with:

  • additional rules to detect semantic data types
  • automatic context identification

Add group detection to the final report as a field_group property (see the sketch below).
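
A minimal sketch of prefix-based field grouping using the dividers described above (case change, '_', '-'); all function names are hypothetical:

```python
# Sketch: group flat field names by shared prefix using camelCase/underscore splits.
import re
from collections import defaultdict

def split_name(name: str) -> list[str]:
    """Split on '_', '-' and lowercase-to-uppercase case changes."""
    out = []
    for part in re.split(r"[_\-]", name):
        out.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part))
    return [t.lower() for t in out if t]

def field_groups(fields: list[str]) -> dict[str, list[str]]:
    """Collect fields sharing a first token, keeping groups of two or more."""
    groups = defaultdict(list)
    for f in fields:
        tokens = split_name(f)
        if len(tokens) > 1:
            groups[tokens[0]].append(f)
    return {k: v for k, v in groups.items() if len(v) > 1}

fields = ["title", "text", "date", "place", "placeURL",
          "placeLocation", "placeType", "reviewScore", "avgScore"]
print(field_groups(fields))
# {'place': ['placeURL', 'placeLocation', 'placeType']}
```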
