
metacrafter's Introduction

APICrafter

API wrapper for MongoDB databases

APICrafter creates a Python Eve wrapper over one or more MongoDB databases, builds an Eve schema for each collection, and generates OpenAPI (Swagger) documentation.

Commands

Discover

Creates an apicrafter.yml API description file from a database or collection and automatically generates data schemas from the original data.

Build the API definition as apicrafter.yml: `apicrafter discover -h 127.0.0.1 -p 27017 -d rusregions`

Run

Uses the API definition from the apicrafter.yml file and launches an API server over MongoDB.

Run the server: `apicrafter run`

Examples

Please see the /examples directory for data and usage examples.


metacrafter's Issues

Rule for URLs

Is it possible to add a rule for detecting URLs?
pyparsing seems to have a convenient rule, pyparsing.pyparsing_common.url, but it is unclear how to add it.
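
A minimal sketch of how such a check could work, assuming pyparsing 3.x (where pyparsing_common.url is available); the is_url helper is hypothetical, not part of metacrafter:

```python
# Sketch: validate a value as a URL with pyparsing 3.x.
from pyparsing import ParseException
from pyparsing import pyparsing_common as ppc

def is_url(value: str) -> bool:
    """Return True if the whole value parses as a URL (hypothetical helper)."""
    try:
        ppc.url.parse_string(value, parse_all=True)
        return True
    except ParseException:
        return False

print(is_url("https://example.com/path?q=1"))  # True
print(is_url("not a url"))                     # False
```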

Add XML support

Support XML files, with the following list of tasks:

  • Support XML files when the XML tag name is provided
  • Add examples to the documentation
  • Collect examples with different data types and encodings
  • Write tests
  • Support automatic detection of the XML records tag
  • Support huge XML files (SAX parser; see the sketch after this list)
  • Add support for XML files to the server
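
For the huge-files task, here is a minimal sketch using the standard library's incremental parser (xml.etree.ElementTree.iterparse), which serves the same streaming purpose as a SAX handler; the record tag name in the usage note is an assumption about the input file:

```python
# Sketch: stream records from a huge XML file without loading it into memory.
import xml.etree.ElementTree as ET

def iter_records(path: str, record_tag: str):
    """Yield one dict per record element, freeing memory as we go."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release the parsed subtree so memory use stays flat

# Usage (the tag name 'item' is an assumption about the input file):
# for rec in iter_records("data.xml", "item"):
#     print(rec)
```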

Add rule caching

Without a cache, the tool reloads all rules on every run, which makes it harder to process thousands of datasets from the command line.
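
A minimal sketch of one possible caching approach, keyed on the rules file's modification time; the cache location, pickle format, and parse_rules callable are all assumptions, not metacrafter's actual internals:

```python
# Sketch: cache parsed rules on disk so repeated CLI runs skip re-parsing.
import os
import pickle

CACHE_PATH = os.path.expanduser("~/.metacrafter_rules_cache.pkl")  # hypothetical location

def load_rules_cached(rules_path: str, parse_rules):
    """Return parsed rules, reusing the cache while rules_path is unchanged."""
    mtime = os.path.getmtime(rules_path)
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            cached_mtime, rules = pickle.load(f)
        if cached_mtime == mtime:
            return rules
    rules = parse_rules(rules_path)  # the expensive step we want to avoid
    with open(CACHE_PATH, "wb") as f:
        pickle.dump((mtime, rules), f)
    return rules
```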

Can I apply rules (e.g. pii) during scan-db?

I have successfully run scan-db against my database.

I want to run scan-db with the pii rule but cannot see how this is possible from the examples. Is there an option to do this?

Many thanks

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column

Error processing a SQLite database containing non-UTF-8 text values.
Example: 000012_world.zip

```
Traceback (most recent call last):
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
    return self.cursor.fetchall()
sqlite3.OperationalError: Could not decode to UTF-8 column 'name' with text '\ufffdland Islands'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Program Files\Python310\Scripts\metacrafter-script.py", line 33, in <module>
    sys.exit(load_entry_point('metacrafter==0.0.2', 'console_scripts', 'metacrafter')())
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\__main__.py", line 12, in main
    exit_status = cli()
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 464, in scan_db
    acmd.scan_db(
  File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 359, in scan_db
    items = [dict(u) for u in queryres.fetchall()]
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1288, in fetchall
    self.connection._handle_dbapi_exception(
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\util\compat.py", line 182, in raise_
    raise exception
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
    return self.cursor.fetchall()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column 'name' with text '\ufffdland Islands'
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```

Add 'strip' option to process whitespace-padded data

Sometimes exported CSV files include whitespace before or after values, for clearer formatting or to fit fixed-width data fields.
Whitespace should be removed automatically using the strip() function. To enable or disable this, add a '--strip' option to the command line and a strip parameter to the Python library. By default, don't strip (a sketch follows the example below).

Example: in_libs2.csv
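
A minimal sketch of how the proposed strip parameter could behave when reading CSV rows; read_rows is a hypothetical helper, not metacrafter's actual reader:

```python
# Sketch: optionally strip surrounding whitespace from every CSV value.
import csv

def read_rows(path: str, strip: bool = False):
    """Yield CSV rows as dicts; strip values only when strip=True (default off)."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if strip:
                row = {k: (v.strip() if isinstance(v, str) else v)
                       for k, v in row.items()}
            yield row
```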

Add empty values detection and reporting

Automate detection of empty values and exclude them from data analysis.

Possible empty values: None, 'N/A', empty string, 'NaN', 'None', '-'

The following actions are required:

  • Add empty values detection during the analysis stage (see the sketch after this list)
  • Add the detected empty values and their count to the analyzer report as empty_values and num_empty
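
A minimal sketch of the detection step; the report keys follow the proposal above, everything else is an assumption:

```python
# Sketch: detect empty values in a column and report them as proposed above.
EMPTY_VALUES = {None, "", "N/A", "NaN", "None", "-"}

def empty_stats(values):
    """Return the empty_values / num_empty fields proposed for the report."""
    found = [v for v in values if v in EMPTY_VALUES]
    return {
        "empty_values": sorted({str(v) for v in found}),
        "num_empty": len(found),
    }

print(empty_stats(["Paris", "", "N/A", None, "Oslo"]))
# {'empty_values': ['', 'N/A', 'None'], 'num_empty': 3}
```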

Add support of NoSQL databases

Add support for the following NoSQL databases and search engines: MongoDB, ArangoDB, Milvus, ArcadeDB, ElasticSearch, OpenSearch, MeiliSearch, Apache Cassandra, StarGate (MongoDB-like API over NoSQL databases)

The current state of database support:

  • MongoDB
  • ArangoDB
  • ElasticSearch
  • Meilisearch
  • Milvus
  • OpenSearch
  • ArcadeDB

Other tasks:

  • Write a universal class for NoSQL document-based databases (see the sketch after this list)
  • Replace the command-line command 'scan-mongodb' with 'scan-nosql', or update the 'scan-db' command to accept NoSQL database connection strings
  • Write documentation with connection string examples and limitations
  • Write tests for each database type
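
A hedged sketch of what the universal class could look like: an engine-agnostic iterator interface with one backend per database. All class and method names are assumptions, shown here with a pymongo-backed example:

```python
# Sketch: a universal interface for NoSQL document databases.
from abc import ABC, abstractmethod
from typing import Iterator

class DocumentSource(ABC):
    """Engine-agnostic view of a document collection (hypothetical design)."""

    @abstractmethod
    def collections(self) -> list[str]:
        """List collection/index names available in the database."""

    @abstractmethod
    def iter_documents(self, collection: str, limit: int = 1000) -> Iterator[dict]:
        """Yield documents as plain dicts for the scanner to analyze."""

class MongoSource(DocumentSource):
    """MongoDB backend built on pymongo."""

    def __init__(self, client, dbname: str):
        self.db = client[dbname]

    def collections(self) -> list[str]:
        return self.db.list_collection_names()

    def iter_documents(self, collection: str, limit: int = 1000) -> Iterator[dict]:
        yield from self.db[collection].find().limit(limit)
```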

Add schema for report JSON and improve reporting

Right now the JSON file of the metadata scanning report is not structured well enough.
Improvements should include:

  • Add a Cerberus schema (more info: https://docs.python-cerberus.org; see the sketch after this list)
  • Add the scanning datetime
  • Add source info: source type, filename, connection string, etc. Make sure no secrets end up in the connection string
  • Move 'table' to a 'source' subtag
  • Add tests that validate reports with the Cerberus validator
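
A minimal sketch of the Cerberus schema and a validation check; the report field names are assumptions based on the list above, not the final schema:

```python
# Sketch: validate a scan report against a Cerberus schema.
from cerberus import Validator

REPORT_SCHEMA = {  # field names are assumptions, not the final schema
    "datetime": {"type": "string", "required": True},
    "source": {
        "type": "dict",
        "schema": {
            "type": {"type": "string"},
            "filename": {"type": "string"},
            "table": {"type": "string"},
        },
    },
    "fields": {"type": "list", "schema": {"type": "dict"}},
}

v = Validator(REPORT_SCHEMA)
report = {
    "datetime": "2023-01-01T00:00:00",
    "source": {"type": "file", "filename": "x.csv"},
    "fields": [],
}
assert v.validate(report), v.errors
```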

Consider adding named entity recognition

Named entity recognition (NER) helps to identify named objects inside texts.

Strengths

  • allows identifying objects inside text blobs
  • could allow supporting more named entities (identifiers)

Weaknesses

  • could be very slow
  • PII and identifier rules would need to be prepared for recognition

Possible implementation: Slovnet https://github.com/natasha/slovnet

Add extended reporting

Right now the report includes only: field name, data type, tags, semantic type id, and registry URL.
Sometimes additional information is required, and it is already collected during the matching process.

Consider adding the following (already collected) data to the report (a sketch computing these statistics follows these lists):

  • number of unique values
  • share of unique values
  • minimal length
  • max length
  • average length
  • minimal value
  • maximum value

Consider collecting and adding the following info:

  • has alphas
  • has digits
  • has special chars

If possible, add the following:

  • reconstructed regexp: a regular expression reconstructed from the data sample
  • named entities: named entities extracted by a named entity detection tool such as Microsoft Presidio or Slovnet
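
A minimal sketch computing the statistics listed above from a sample of field values; field_stats and its keys are proposed names, not the current report format:

```python
# Sketch: compute the per-field statistics proposed above from a value sample.
import re

def field_stats(values):
    """Return proposed extended-report fields for one column's values."""
    strings = [str(v) for v in values]
    lengths = [len(s) for s in strings]
    unique = set(strings)
    return {
        "num_unique": len(unique),
        "share_unique": len(unique) / len(strings),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "min_value": min(strings),   # lexicographic in this sketch
        "max_value": max(strings),
        "has_alphas": any(c.isalpha() for s in strings for c in s),
        "has_digits": any(c.isdigit() for s in strings for c in s),
        "has_special": any(re.search(r"[^A-Za-z0-9]", s) for s in strings),
    }

print(field_stats(["alpha-1", "beta-2", "beta-2"]))
```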

Add identification of database-generated primary and foreign table identifiers

Some database fields are just incremental unique identifiers generated by the database engine. They cannot be linked to any external identifier databases and are used only locally.

There is a need to detect such fields as database-generated integer IDs.
Common names for such IDs are id or <object name/table name>_id.

The way to implement:

  1. Identify whether the table field is a primary key with the auto_increment feature (see the sketch after this list)
  2. Mark the field as a database-generated ID key
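
A minimal sketch of step 1 using SQLAlchemy's inspector, assuming the scanned database is reachable through SQLAlchemy; note that whether 'autoincrement' is reported depends on the dialect:

```python
# Sketch: flag primary-key columns that look database-generated.
from sqlalchemy import create_engine, inspect

def generated_id_columns(db_url: str, table: str) -> list[str]:
    """Return names of primary-key columns reported as autoincrementing."""
    insp = inspect(create_engine(db_url))
    pk_cols = set(insp.get_pk_constraint(table)["constrained_columns"])
    flagged = []
    for col in insp.get_columns(table):
        # 'autoincrement' may be True, False, or 'auto' depending on the
        # dialect, and may be missing entirely for some backends.
        if col["name"] in pk_cols and col.get("autoincrement") in (True, "auto"):
            flagged.append(col["name"])
    return flagged

# Usage (connection string and table name are placeholders):
# print(generated_id_columns("sqlite:///example.db", "users"))
```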

Add analysis of schema structure: decomposition of field keys and subtypes

Flat table datasets (CSV files), database tables, and sometimes objects with nested objects often include elements that could be grouped.

For example, the CSV file Zaara_D.csv includes the following fields: title, text, date, place, placeURL, placeLocation, placeType, reviewScore, avgScore.

We can see that the prefix 'place' is a subtype identifier. It could be decomposed as
place:

  • Name
  • Location
  • URL
  • Type

And the postfix Score identifies the value type, whether integer or float.

Most data tables use case changes or the '_' symbol as dividers. Very rarely, the '-' symbol is also used.

Detection of field groups and decomposition of field names could help with:

  • additional rules to detect semantic data types
  • automatic context identification

Add group detection to the final report as a field_group property (see the sketch below).
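
A minimal sketch of prefix-based field grouping using the dividers described above (case change, '_', '-'); all function names are hypothetical:

```python
# Sketch: group flat field names by shared prefix using camelCase/underscore splits.
import re
from collections import defaultdict

def split_name(name: str) -> list[str]:
    """Split on '_', '-' and lowercase-to-uppercase case changes."""
    out = []
    for part in re.split(r"[_\-]", name):
        out.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part))
    return [t.lower() for t in out if t]

def field_groups(fields: list[str]) -> dict[str, list[str]]:
    """Collect fields sharing a first token, keeping groups of two or more."""
    groups = defaultdict(list)
    for f in fields:
        tokens = split_name(f)
        if len(tokens) > 1:
            groups[tokens[0]].append(f)
    return {k: v for k, v in groups.items() if len(v) > 1}

fields = ["title", "text", "date", "place", "placeURL",
          "placeLocation", "placeType", "reviewScore", "avgScore"]
print(field_groups(fields))
# {'place': ['placeURL', 'placeLocation', 'placeType']}
```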
