piicatcher's People

Contributors

dependabot[bot], ehilfer, jhecking, mateuszboryn, mbrg, n2taylor, nicolepng, vrajat, zer0pool

piicatcher's Issues

Facing issue while using db

I am using the command below for a SQLite db:

piicatcher db -s '/db/data' -t sqlite --scan-type shallow
where '/db/' is the path to the data.sqlite file.

I am not sure what I am doing wrong. I get the error below.

image

If any detailed instructions are provided for db instances, it would help.
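
One way to narrow this down is to confirm that the path points at a readable SQLite file and to see which tables and columns it holds. The sketch below uses only Python's standard sqlite3 module. `list_columns` is a hypothetical helper, not part of piicatcher, and the in-memory database stands in for the real data.sqlite file.

```python
import sqlite3


def list_columns(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    columns = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns[table] = [row[1] for row in info]
    return columns


# Stand-in database; for a real file, pass its full path to sqlite3.connect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, full_name TEXT)")
result = list_columns(conn)
print(result)
```

If the real file yields an empty dictionary here, the path is probably not the database you expect, since sqlite3.connect creates a new empty file when the path does not exist.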

Thanks

Import piicatcher in python files

Can we import the piicatcher library and use it in a program? So far I have only found command-line access, but what if I want to use it from a Python program?

Installation fails due to conflicting botocore version

Hi, users are unable to run Piicatcher due to a dependency conflict with the botocore package. As shown in the following full dependency graph of Piicatcher, pyathena requires botocore >=1.5.52, while boto3 ==1.12.1 requires botocore >=1.15.1,<1.16.0.

According to pip's "first found wins" installation strategy, botocore 1.16.16 is the version actually installed. However, botocore 1.16.16 does not satisfy botocore >=1.15.1,<1.16.0.

Dependency tree-----------

piicatcher - 0.10.0
| +- boto3(install version:1.12.1 version range:==1.12.1)
| | +- botocore(install version:1.15.49 version range:>=1.15.1,<1.16.0)
| | +- jmespath(install version:0.10.0 version range:>=0.7.1,<1.0.0)
| | +- s3transfer(install version:0.3.3 version range:>=0.3.0,<0.4.0)
| | | +- botocore(install version:1.15.49 version range:>=1.12.36,<2.0a.0)
| +- click(install version:7.0 version range:==7.0)
| +- click-config-file(install version:0.5.0 version range:==0.5.0)
| | +- click(install version:7.0 version range:>=6.7)
| | +- configobj(install version:5.0.6 version range:>=5.0.6)
| | | +- six(install version:1.14.0 version range:*)
| +- commonregex(install version:1.5.4 version range:*)
| +- cryptography(install version:2.9 version range:*)
| | +- cffi(install version:1.14.0 version range:>=1.8)
| | +- six (install version:1.14.0 version range:>=1.4.1)
| +- cx-oracle(install version:7.3.0 version range:*)
| +- cx_Oracle(install version: version range:*)
| +- peewee(install version:3.13.2 version range:*)
| +- psycopg2-binary(install version:2.8.5 version range:*)
| +- pyathena(install version:1.10.7 version range:*)
| | +- boto3(install version:1.12.1 version range:>=1.4.4)
| | | +- botocore(install version:1.15.49 version range:>=1.15.1,<1.16.0)
| | | +- jmespath(install version:0.10.0 version range:>=0.7.1,<1.0.0)
| | | +- s3transfer(install version:0.3.3 version range:>=0.3.0,<0.4.0)
| | | | +- botocore(install version:1.15.49 version range:>=1.12.36,<2.0a.0)
| | +- botocore(install version:1.16.6 version range:>=1.5.52)
| | +- future(install version:0.18.2 version range:*)
| | +- futures(install version:3.3.0 version range:*)
| | +- tenacity(install version:6.1.0 version range:>=4.1.0)
| +- pymysql(install version:0.9.3 version range:*)
| +- pypi-publisher(install version:0.0.4 version range:*)
| | +- gitpython(install version:0.3.6 version range:==0.3.6)
| | | +- gitdb (install version:4.0.4 version range:>=0.6.4)
| | | | +- smmap(install version:3.0.4 version range:>=3.0.1,<4)
| +- python-magic(install version:0.4.15 version range:*)
| +- pyyaml(install version:5.3.1 version range:*)
| +- spacy(install version:2.2.4 version range:*)
| +- tableprint(install version:0.8.0 version range:*)
| | +- future(install version:0.18.2 version range:*)
| | +- six(install version:1.14.0 version range:*)
| | +- wcwidth(install version:0.1.9 version range:*) 
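
The conflict can be checked by hand. The sketch below is a simplified range check written for illustration, not pip's resolver: `parse_version` and `satisfies` are hypothetical helpers that only understand plain numeric versions. It tests the two botocore ranges quoted above against 1.15.49 (the version shown under boto3 in the tree) and 1.16.16 (the version reported as installed).

```python
import operator

# Comparison operators, longest first so ">=" is tried before ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn '1.15.49' into (1, 15, 49); handles plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated range such as '>=1.15.1,<1.16.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


boto3_range = ">=1.15.1,<1.16.0"   # botocore range required by boto3 ==1.12.1
pyathena_range = ">=1.5.52"        # botocore range required by pyathena

for candidate in ("1.15.49", "1.16.16"):
    print(candidate, satisfies(candidate, boto3_range), satisfies(candidate, pyathena_range))
```

Only a botocore version inside the narrower boto3 range satisfies both requirements, so pinning botocore to that range is one possible workaround.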

Thanks for your help.
Best,
Neolith

MySQL Syntax error raised when creating catalog table `dbcolumns`

Version:
0.9.5

Issue:
MySQL syntax error is raised for peewee generated SQL:

`CREATE TABLE IF NOT EXISTS `dbcolumns` (`id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, `name` VARCHAR(255) NOT NULL, `pii_type` varchar NOT NULL, `table_id` INTEGER NOT NULL, FOREIGN KEY (`table_id`) REFERENCES `dbtables` (`id`))`

Tested using MySQL 8.0.19 and MySQL 5.7.29 (docker image releases).

conf file:

log_level="DEBUG"
catalog_format="db"
catalog_host="localhost"
catalog_port="3306"
catalog_user="root"
catalog_password="example"

[db]
host="***.redshift.amazonaws.com"
port="5439"
user="***"
password="***"
database="***"
connection_type="redshift"
scan_type="shallow"
list_all=True
schema=('raw_hr',)

Command line command:
piicatcher --config config.conf db

Debug/Error output:

DEBUG:root:PiiTypes are 
DEBUG:piicatcher.log_mixin.Table:employee has PiiTypes.PERSON,PiiTypes.EMAIL,PiiTypes.GENDER
DEBUG:piicatcher.log_mixin.Schema:raw_hr has PiiTypes.PERSON,PiiTypes.EMAIL,PiiTypes.GENDER
/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (3090, "Changing sql mode 'NO_AUTO_CREATE_USER' is deprecated. It will be removed in a future release.")
  result = self._query(query)
DEBUG:peewee:('SELECT table_name FROM information_schema.tables WHERE table_schema = DATABASE() AND table_type != %s ORDER BY table_name', ('VIEW',))
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS `dbschemas` (`id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, `name` VARCHAR(255) NOT NULL)', [])
DEBUG:peewee:('SELECT table_name FROM information_schema.tables WHERE table_schema = DATABASE() AND table_type != %s ORDER BY table_name', ('VIEW',))
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS `dbtables` (`id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, `name` VARCHAR(255) NOT NULL, `schema_id` INTEGER NOT NULL, FOREIGN KEY (`schema_id`) REFERENCES `dbschemas` (`id`))', [])
DEBUG:peewee:('CREATE INDEX `dbtables_schema_id` ON `dbtables` (`schema_id`)', [])
DEBUG:peewee:('SELECT table_name FROM information_schema.tables WHERE table_schema = DATABASE() AND table_type != %s ORDER BY table_name', ('VIEW',))
DEBUG:peewee:('CREATE TABLE IF NOT EXISTS `dbcolumns` (`id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, `name` VARCHAR(255) NOT NULL, `pii_type` varchar NOT NULL, `table_id` INTEGER NOT NULL, FOREIGN KEY (`table_id`) REFERENCES `dbtables` (`id`))', [])
Traceback (most recent call last):
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/peewee.py", line 3057, in execute_sql
    cursor.execute(sql, params or ())
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/connections.py", line 684, in _read_packet
    packet.check_error()
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/protocol.py", line 220, in check_error
    err.raise_mysql_exception(self._data)
  File "/Users/piicatcher/venv/lib/python3.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NOT NULL, `table_id` INTEGER NOT NULL, FOREIGN KEY (`table_id`) REFERENCES `dbta' at line 1")
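
MySQL requires a length for VARCHAR columns, and the error position reported above starts right after `pii_type` varchar, which suggests the unsized column is what the server rejects. The sketch below is a small diagnostic, not part of piicatcher or peewee: `find_unsized_varchar` is a hypothetical helper that scans a DDL string for VARCHAR columns declared without a length.

```python
import re

# A backtick-quoted column whose type is VARCHAR with no "(length)" after it.
UNSIZED_VARCHAR = re.compile(r"`(\w+)`\s+varchar\b(?!\s*\()", re.IGNORECASE)


def find_unsized_varchar(ddl):
    """Return the names of columns declared as VARCHAR without a length."""
    return UNSIZED_VARCHAR.findall(ddl)


# The statement from the debug log above.
ddl = (
    "CREATE TABLE IF NOT EXISTS `dbcolumns` ("
    "`id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, "
    "`name` VARCHAR(255) NOT NULL, "
    "`pii_type` varchar NOT NULL, "
    "`table_id` INTEGER NOT NULL, "
    "FOREIGN KEY (`table_id`) REFERENCES `dbtables` (`id`))"
)
print(find_unsized_varchar(ddl))
```

If this is indeed the cause, declaring the `pii_type` field with an explicit maximum length in the catalog model would be the likely fix.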

Documentation links do not work

The links to the documentation in the README do not work. They end up displaying a page with the following message:

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
</Error>

Best Practices for Automated Scans using PIICatcher

@n2taylor @zer0pool

I want to check in to see if you are continuing to use PIICatcher.

I also want to find out if you have tried to automate or schedule piicatcher scan on your databases. I am getting questions on best practices to deploy and automate scans. Before I set best practices I want to find out from known users if they have tried to automate scans as well as what are the gaps and requirements.

Please tell me if you have any thoughts on this.

Deep scan should also scan column names for candidates

It is not surprising that deep and shallow scan show different results. Shallow scan only looks at column names. Deep scan looks at a sample of the data. I've even noticed that two different runs of deep scan show different results as sample rows are different. This is the challenge with not scanning all of the data. It's a trade-off between performance/cost and accuracy. There is no right answer.

With respect to the output in particular, my observations are:

  1. Shallow scan should recognize phone, credit card, person and location from column names
  2. Deep scan did not recognize PII in a few columns. I need to look at the data to figure out if that's a bug or the column did not have any relevant data.
  3. Deep scan should also scan column names for candidates
  4. Along with an array, PIICatcher should add confidence numbers.

Originally posted by @vrajat in #67 (comment)

Getting error while running piicatcher through file or DB having more records

I am trying to run piicatcher on a CSV file with 15K records and get the error below:

File "f:\pii\piicrawler.env\lib\site-packages\spacy\language.py", line 429, in __call__
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 2924092 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

Is there a limit on the amount of data?
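
The error message names two ways out: raise nlp.max_length, or feed the model shorter texts. The sketch below illustrates the second option with the standard library only. `chunk_text` is a hypothetical helper, not something piicatcher provides, and it assumes that splitting at line boundaries is acceptable for the scan.

```python
def chunk_text(text, max_length=1_000_000):
    """Split text into pieces of at most max_length characters.

    Each piece ends at a newline when one is available, so rows of a CSV
    file are not cut in half unless a single row exceeds max_length.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Back up to the last newline inside the window, if there is one.
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "row\n" * 1000          # 4000 characters of fake CSV rows
pieces = chunk_text(sample, max_length=10)
print(len(pieces), max(len(piece) for piece in pieces))
```

Each piece could then be passed to the NLP model separately and the detected PII types merged afterwards.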

unable to get pii columns from mysql

Tried running the piicatcher on mysql db but it did not list any columns

TABLE_SCHEMA TABLE_NAME COLUMN_NM DATA_TYPE
testpiicatcher cstmr fl_nm varchar
testpiicatcher cstmr fname varchar
testpiicatcher cstmr lname varchar

The output is blank. I expected two columns, fname and lname, to be listed as having PII,
because they follow the regex pattern mentioned in

Using Python 3.8.2 on macOS.
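
For reference, a shallow scan amounts to matching column names against regular expressions. The sketch below shows the idea with a made-up pattern: `PERSON_NAME` and `columns_with_person_names` are hypothetical and are not piicatcher's actual rules, which may differ.

```python
import re

# Hypothetical pattern for illustration; piicatcher's real patterns may differ.
PERSON_NAME = re.compile(r".*(first|last|full|f|l)_?name.*", re.IGNORECASE)


def columns_with_person_names(column_names):
    """Return the column names that look like they hold a person's name."""
    return [name for name in column_names if PERSON_NAME.fullmatch(name)]


columns = ["fl_nm", "fname", "lname"]   # the columns of the cstmr table above
print(columns_with_person_names(columns))
```

With this pattern fname and lname are flagged while fl_nm is not, which matches the result expected in this report.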

include schema filter is not working on redshift

I'm finding that piicatcher fails to apply the include-schema argument. I have tried this using -n, --schema via the CLI as well as schema via the config file.

I have the following database structure in redshift:

prod_database
│
├── schema_a
│   ├── table_1
│   ├── table_2
│   └── table_3
│
└── schema_b
    ├── table_x
    ├── table_y
    └── table_z

My config file looks like this:

[db]
host="***"
port="5439"
user="***"
password="***"
database="prod_database"
connection_type="redshift"
scan_type="shallow"
list_all=False
schema=("schema_a",)
exclude_schema=("schema_b",)

However the returned output shows that piicatcher scanned through both schemas.
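
The behaviour expected from the two settings can be written down in a few lines. This is a sketch of the intended semantics only, under the assumption that an empty include list means "all schemas". `select_schemas` is a hypothetical function, not piicatcher's implementation.

```python
def select_schemas(available, include=(), exclude=()):
    """Keep schemas named in include (or all, if include is empty), then drop excluded ones."""
    selected = [name for name in available if not include or name in include]
    return [name for name in selected if name not in exclude]


available = ["schema_a", "schema_b"]
print(select_schemas(available, include=("schema_a",), exclude=("schema_b",)))
```

With the config shown above, only schema_a should be scanned.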

setting development virtualenv

Commands I tried:

  • pipenv sync
  • pipenv shell
  • piicatcher -h
david@Melody:~/temp/pii/piicatcher-master$ 
david@Melody:~/temp/pii/piicatcher-master$ pipenv shell
Launching subshell in virtual environment…
 . /home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/bin/activate
N/A: version "N/A -> N/A" is not yet installed.

You need to run "nvm install N/A" to install it before using it.
david@Melody:~/temp/pii/piicatcher-master$  . /home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/bin/activate
(piicatcher-master) david@Melody:~/temp/pii/piicatcher-master$ piicatcher -h
Traceback (most recent call last):
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/bin/piicatcher", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3261, in <module>
    @_call_aside
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3245, in _call_aside
    f(*args, **kwargs)
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3274, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/david/.local/share/virtualenvs/piicatcher-master-2C6jlV6z/lib/python3.6/site-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'snowflake-connector-python==2.2.7' distribution was not found and is required by piicatcher
(piicatcher-master) david@Melody:~/temp/pii/piicatcher-master$ 

The instructions from the installation page ( https://tokern.io/docs/piicatcher/1-installation/ ) seem to work, but following the instructions from the development setup page ( https://tokern.io/docs/piicatcher/development/ ) gives me an error.
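
To see which pinned distributions are missing from the virtualenv, the installed metadata can be queried directly. The sketch below uses importlib.metadata from the standard library, which needs Python 3.8 or newer (the traceback above comes from a Python 3.6 environment, where it is not available). `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names taken from the error message above.
for name in ("snowflake-connector-python", "piicatcher"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```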

AWS config issue

I created an AWS config file as described in the docs and am running piicatcher --config awsconfig.ini aws, similar to the db command.

piicatcher --config dbconfig.ini db works fine.

Getting below error:

image

Below is my config file :

[aws]
access_key="XXXXXX"
secret_key="XXXXX"
staging_dir=""
region="XXXXX"
output_format="glue"
scan_type="deep"
output="XXXXXX"
list_all=True

Not able to see complete data

I followed the installation steps. I tried running on both files and a db, but get the same output:
image

Can anyone help with what I am missing?

Support Include/Exclude Lists

Athena is working for shallow type scan. For deep scan it throws below error:
HIVE_UNKNOWN_ERROR: All access to this object has been disabled (Service: Amazon S3; Status Code: 403; Error Code: AllAccessDisabled; Request ID:

Is there any parameter in config where we can pass specific DB to scan ?

Originally posted by @jayeshagwan1 in #55 (comment)

Exception when scanning oracle database

I got everything set up but I'm running into an error.

I'm running it with this command: piicatcher db -s 192.168.168.13 -p 1521 -u USERNAME -p PASSWORD -c deep -t oracle -o piicatcher_results.txt. I also tried using localhost instead of my private IP.

The stack trace shows that line 387 of explorer.py is throwing a TypeError: 'host' is an invalid keyword argument for this function. Am I doing something wrong or do you think it's a bug?
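
The TypeError indicates that the Oracle driver's connect call does not take a host keyword. cx_Oracle expects the server location as a single DSN string instead, for example in Oracle's Easy Connect form host:port/service_name. The sketch below only builds that string. `easy_connect_dsn` is a hypothetical helper, and the service name "orcl" is a placeholder assumption rather than a value from this report.

```python
def easy_connect_dsn(host, port, service_name):
    """Build an Easy Connect string of the form host:port/service_name."""
    return f"{host}:{port}/{service_name}"


dsn = easy_connect_dsn("192.168.168.13", 1521, "orcl")  # "orcl" is a placeholder
print(dsn)

# With cx_Oracle installed, the DSN would be passed in place of host and port:
# connection = cx_Oracle.connect(user, password, dsn)
```

Whether piicatcher should build the DSN this way is for the maintainers to decide; the sketch only shows the shape of the argument the driver accepts.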

Reddit Permalink: https://www.reddit.com/message/messages/ju1oie

CSV files are not supported

{
  "files": [
    {
      "Mime/Type": "application/csv",
      "path": "tests/samples/sample-data.csv",
      "pii": [
        {
          "enum": "PiiTypes.UNSUPPORTED"
        }
      ]
    }
  ]
}

Can we add config file for db ?

Can we add a config file for database connections and other parameters, so that on the CLI we only need to pass the type of database?

Getting error while scanning file system

With the latest piicatcher version, I get the error below:
piicatcher --config fileconfig.ini files

image

[files]
path="testdata.csv"
catalog_file="output.json"
output_format="json"

`pip install piicatcher` fails on Python 3.8 "pymssql install fails"

piicatcher fails to install on Python 3.8.1 (it also fails on 3.7). The failure occurs when installation reaches the pymssql package. In an attempt to get around it, I tried to install earlier versions of pymssql as listed below:

pip install pymssql==1.0.0
pip install pymssql==1.0.1
pip install pymssql==1.0.2
pip install pymssql==1.0.3
pip install pymssql==2.0.0
pip install pymssql==2.0.1
pip install pymssql==2.1.0
pip install pymssql==2.1.1
pip install pymssql==2.1.2
pip install pymssql==2.1.3
pip install pymssql==2.1.4rc1
pip install pymssql==2.1.4

But all led to the same error:

$ pip install piicatcher
...
Collecting pymssql<3.0 (from piicatcher)
  Using cached https://files.pythonhosted.org/packages/2e/81/99562b93d75f3fc5956fa65decfb35b38a4ee97cf93c1d0d3cb799fffb99/pymssql-2.1.4.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /Users/drivera017/dev/testing-grounds/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/nl/gjvm7nyx4szc8dt9d5fgdk_w0000gn/T/pip-install-7lfmffe3/pymssql/setup.py'"'"'; __file__='"'"'/private/var/folders/nl/gjvm7nyx4szc8dt9d5fgdk_w0000gn/T/pip-install-7lfmffe3/pymssql/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /private/var/folders/nl/gjvm7nyx4szc8dt9d5fgdk_w0000gn/T/pip-install-7lfmffe3/pymssql/
    Complete output (7 lines):
    /Users/drivera017/dev/testing-grounds/venv/lib/python3.8/site-packages/setuptools/dist.py:45: DistDeprecationWarning: Do not call this function
      warnings.warn("Do not call this function", DistDeprecationWarning)
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/nl/gjvm7nyx4szc8dt9d5fgdk_w0000gn/T/pip-install-7lfmffe3/pymssql/setup.py", line 88, in <module>
        from Cython.Distutils import build_ext as _build_ext
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

cx_Oracle_DatabaseError: ORA-00936: missing expression

Running this: piicatcher db -s localhost -p 1521 -d orcl -u USERNAME -p PASSWORD -t oracle -o file.txt I'm getting "cx_Oracle_DatabaseError: ORA-00936: missing expression" at line 160 of explorer.py. I normally wouldn't mind debugging it myself a little to see what I can find out, but I'm not very familiar with how Python projects run as a command instead of calling the file directly.

Reddit comment: https://www.reddit.com/message/messages/ju5gzz

Try to get the column name which is having PII data, but sqlite -list-all returns nothing

@vrajat, as discussed with you on live chat, I am logging this sqlite db scanning issue which returns nothing.

I have a SQLite db containing sample PII information. I scanned it with sqlite --list-all but it returns nothing. I also tried sqlite --list-all --scan-type deep, but nothing changed. Generating the catalog file also produces a blank file.

If I scan just the CSV file, then it only indicates that the file contains enums like Email, Person, etc., but does not tell me which column contains the PII information.

I want to identify which column contains which PII data, for example which column contains email data and which contains address information.
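
Column-level reporting for a CSV file could look roughly like the sketch below. It is an illustration with the standard csv and re modules, not piicatcher code: `EMAIL` is a deliberately simple pattern and `email_columns` is a hypothetical helper that handles only email addresses.

```python
import csv
import io
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def email_columns(csv_text):
    """Return the sorted names of columns in which at least one value is an email."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = set()
    for row in reader:
        for column, value in row.items():
            if isinstance(value, str) and EMAIL.fullmatch(value.strip()):
                found.add(column)
    return sorted(found)


sample = "id,contact,city\n1,ann@example.com,Pune\n2,bob@example.org,Oslo\n"
print(email_columns(sample))
```

The same loop could track a separate set of columns for each PII type to answer the "which column holds which data" question.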

Fix exception in logging

TypeError: not all arguments converted during string formatting
Call stack:
  File "F:\PII\pii_athena\piicatcher.env\Scripts\piicatcher-script.py", line 11, in <module>
    load_entry_point('piicatcher', 'console_scripts', 'piicatcher')()
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "F:\PII\pii_athena\piicatcher.env\lib\site-packages\click-7.0-py3.7.egg\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\aws.py", line 57, in cli
    AthenaExplorer.dispatch(args)
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\explorer.py", line 58, in dispatch
    explorer.scan()
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\explorer.py", line 86, in scan
    schema.scan(self._generate_rows)
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\metadata.py", line 53, in scan
    child.scan(generator)
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\metadata.py", line 105, in scan
    col.scan(val, scanners)
  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\metadata.py", line 140, in scan
    [self._pii.add(pii) for pii in scanner.scan(data)]
  File "f:\pii\pii_athena\piicatcher\piicatcher\scanner.py", line 47, in scan
    logging.debug("Processing '{}'", text)

  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\metadata.py", line 140, in scan
    [self._pii.add(pii) for pii in scanner.scan(data)]
  File "f:\pii\pii_athena\piicatcher\piicatcher\scanner.py", line 51, in scan
    logging.debug("Found {}", ent.label_)
Message: 'Found {}'
Arguments: ('DATE',)

  File "f:\pii\pii_athena\piicatcher\piicatcher\explorer\metadata.py", line 140, in scan
    [self._pii.add(pii) for pii in scanner.scan(data)]
  File "f:\pii\pii_athena\piicatcher\piicatcher\scanner.py", line 61, in scan
    logging.debug("PiiTypes are {}", list(types))
Message: 'PiiTypes are {}'
Arguments: ([<PiiTypes.BIRTH_DATE: 9>],)
DEBUG:root:{<PiiTypes.BIRTH_DATE: 9>}
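
The calls in the traceback pass str.format-style "{}" placeholders to logging.debug, but the logging module merges arguments with %-style formatting, which is why it reports that not all arguments were converted. The sketch below shows the %s form of the same messages. `ListHandler` and the logger name are made up for the demonstration so that the formatted output can be inspected.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the %-style arguments to the message template.
        self.messages.append(record.getMessage())


logger = logging.getLogger("piicatcher_logging_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

# %s placeholders instead of {}: the arguments are merged lazily by logging.
logger.debug("Found %s", "DATE")
logger.debug("PiiTypes are %s", ["BIRTH_DATE"])
print(handler.messages)
```

Passing the values as arguments, rather than pre-formatting the string, also keeps the formatting cost at zero when DEBUG logging is switched off.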

Snowflake Support

In a recent user survey, Snowflake support was the top feature request. This issue will track the addition of Snowflake support. One of the challenges for development is access to Snowflake.

Snowflake provides a 30 day trial which is sufficient for development. Maintenance & CI is an open question.

Snowflake information schema is available at: https://docs.snowflake.com/en/sql-reference/info-schema.html

Snowflake has a pure Python connector: https://docs.snowflake.com/en/user-guide/python-connector.html

Snowflake SQL Reference for deep search: https://docs.snowflake.com/en/sql-reference/constructs/sample.html

cc @jayeshagwan1 @hsdqa @food-spotter

Shallow scan should recognize phone, credit card, person and location from column names

It is not surprising that deep and shallow scan show different results. Shallow scan only looks at column names. Deep scan looks at a sample of the data. I've even noticed that two different runs of deep scan show different results as sample rows are different. This is the challenge with not scanning all of the data. It's a trade-off between performance/cost and accuracy. There is no right answer.

With respect to the output in particular, my observations are:

  1. Shallow scan should recognize phone, credit card, person and location from column names
  2. Deep scan did not recognize PII in a few columns. I need to look at the data to figure out if that's a bug or the column did not have any relevant data.
  3. Deep scan should also scan column names for candidates
  4. Along with an array, PIICatcher should add confidence numbers.

Originally posted by @vrajat in #67 (comment)

ascii_table format not working

Getting the error below when trying ascii_table as the output format:

\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-42: character maps to <undefined>
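
The cp1252 code page used by many Windows consoles has no box-drawing characters, so a table border made of them cannot be written to such a console. That the table border is what triggers the error here is an assumption; the sketch below only demonstrates the encoding failure and one possible fallback. `safe_for_encoding` is a hypothetical helper.

```python
# A short table border built from Unicode box-drawing characters.
border = "\u256d" + "\u2500" * 5 + "\u256e"


def safe_for_encoding(text, encoding):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


try:
    border.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
print(safe_for_encoding(border, "cp1252"))
```

Another option to try is setting the PYTHONIOENCODING environment variable to utf-8 before running the command, provided the terminal can display UTF-8.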
