Comments (31)
There hasn't been any progress on this feature. IIRC @jayeshagwan1 got stuck installing a test Hive cluster. @zer0pool will you be able to help out?
from piicatcher.
I am interested in contributing.
from piicatcher.
Thanks!
Here are some guidelines on how to get started:
Install a developer version of piicatcher
- Fork the repo.
- Instructions are here: https://tokern.io/docs/piicatcher/development
Hive installation
I am not sure about your tech setup. A web search should turn up plenty of websites with instructions for setting up Hive.
Load data into Hive
I use a couple of simple datasets:
- https://github.com/tokern/piicatcher/blob/master/tests/test_databases.py#L19
- https://github.com/tokern/piicatcher/blob/master/tests/samples/sample-data.csv
Add pyhive
Add pyhive as a requirement in requirements.txt
Rerun pipenv update to install pyhive.
Write an explorer
An explorer is the base class for supporting different types of technologies.
You can use AWS Explorer as an example.
You'll have to:
- Create a new Python file, for example hive.py.
- Implement a cli function.
- Implement a HiveExplorer class similar to AthenaExplorer.
- Change the code in each function to make it work with Hive. For example, all the queries have to be changed, pyhive has to be used instead of pyathena, and so on.
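The steps above can be sketched as follows. This is only an outline under stated assumptions: the class name `HiveExplorerSketch`, its methods, and the injected `connect` callable are hypothetical and do not reproduce piicatcher's actual Explorer or AthenaExplorer interface, which should be checked in the repository. It assumes a DB-API style connection such as the one `pyhive.hive.Connection` provides, and that `SHOW TABLES` and `DESCRIBE` return the table or column name in the first field of each row.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def pyhive_connect(**kwargs: Any) -> Any:
    """Open a HiveServer2 connection; pyhive is imported only when called."""
    from pyhive import hive  # requires the pyhive package to be installed
    return hive.Connection(**kwargs)


class HiveExplorerSketch:
    """Hypothetical outline of a Hive explorer (not piicatcher's real API)."""

    def __init__(self, host: str, port: int = 10000,
                 username: Optional[str] = None,
                 connect: Callable[..., Any] = pyhive_connect) -> None:
        self._connect_kwargs = {"host": host, "port": port, "username": username}
        self._connect = connect
        self._connection: Any = None

    def get_connection(self) -> Any:
        # Open the connection lazily and reuse it for later queries.
        if self._connection is None:
            self._connection = self._connect(**self._connect_kwargs)
        return self._connection

    def _fetch(self, query: str) -> List[Tuple[Any, ...]]:
        cursor = self.get_connection().cursor()
        try:
            cursor.execute(query)
            return cursor.fetchall()
        finally:
            cursor.close()

    def list_tables(self, database: str) -> List[str]:
        # Names are interpolated unquoted, so they must come from a trusted source.
        return [row[0] for row in self._fetch(f"SHOW TABLES IN {database}")]

    def list_columns(self, database: str, table: str) -> List[str]:
        columns: List[str] = []
        for row in self._fetch(f"DESCRIBE {database}.{table}"):
            name = (row[0] or "").strip()
            # Assumption: a blank or '#' row starts the partition section,
            # which repeats columns already listed above it.
            if not name or name.startswith("#"):
                break
            columns.append(name)
        return columns

    def catalog(self, database: str) -> Dict[str, List[str]]:
        """Return {table_name: [column_name, ...]} for one database."""
        return {t: self.list_columns(database, t)
                for t in self.list_tables(database)}
```

Passing the connection factory in as `connect` keeps the Hive-specific part in one place, so the catalog logic can be exercised with a fake connection before a real Hive cluster is available.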
I can answer any questions while you develop.
from piicatcher.
Thanks @vrajat. I will follow the steps above and let you know if I run into any issues.
from piicatcher.
I followed the steps above. After running the command piicatcher --config hiveconfig.ini hive
I get the error below:
It seems to be an issue with installing pyhive on Windows.
from piicatcher.
Thanks. I am now facing:
from piicatcher.
Hive2
from piicatcher.
yes
from piicatcher.
I am now able to connect to hiveserver2, but I get the error below:
raise ValueError("Password should be set if and only if in LDAP or CUSTOM mode; "
ValueError: Password should be set if and only if in LDAP or CUSTOM mode; Remove password or use one of those modes
Currently I am passing auth='NOSASL' in the connection. If I pass auth='Custom or none' then I get this error:
from piicatcher.
Can you confirm whether these errors occur when you connect to Hive from the Python console, with no PIICatcher involved?
Can you confirm that you can connect to Hive and run queries from the Python console?
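A minimal sketch of such a standalone check, with no PIICatcher involved. The helper names `hive_connect_kwargs`, `run_smoke_query` and `check_hive` are hypothetical. The password rule mirrors the ValueError quoted above: pyhive accepts a password only when auth is LDAP or CUSTOM. Upper-casing the auth value is an assumption that the mode names are compared case-sensitively.

```python
from typing import Any, Dict, List, Optional, Tuple

# Auth modes for which pyhive expects a password, per the error message above.
PASSWORD_AUTH_MODES = ("LDAP", "CUSTOM")


def hive_connect_kwargs(host: str, port: int = 10000,
                        username: Optional[str] = None,
                        password: Optional[str] = None,
                        auth: str = "NONE") -> Dict[str, Any]:
    """Build keyword arguments for pyhive.hive.Connection."""
    auth = auth.upper()
    kwargs: Dict[str, Any] = {"host": host, "port": port,
                              "username": username, "auth": auth}
    if auth in PASSWORD_AUTH_MODES:
        if password is None:
            raise ValueError(f"auth={auth} requires a password")
        kwargs["password"] = password
    # For any other mode the password is left out, because pyhive rejects it.
    return kwargs


def run_smoke_query(connection: Any, query: str = "SELECT 1") -> List[Tuple[Any, ...]]:
    """Run one query on any DB-API style connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()


def check_hive(host: str, **options: Any) -> List[Tuple[Any, ...]]:
    """Connect with pyhive and list databases (needs pyhive and a live server)."""
    from pyhive import hive  # imported lazily so the helpers work without pyhive
    connection = hive.Connection(**hive_connect_kwargs(host, **options))
    try:
        return run_smoke_query(connection, "SHOW DATABASES")
    finally:
        connection.close()
```

From the Python console, a call such as `check_hive("localhost", username="hive", auth="NOSASL")` (host and user name are placeholders) would show whether the failure comes from pyhive and the Hive configuration rather than from PIICatcher.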
from piicatcher.
Sure, I will confirm. I think there are similar open issues with pyhive as well. Are there any alternatives to pyhive?
from piicatcher.
- https://github.com/cloudera/impyla
- https://dwgeek.com/steps-to-connect-hiveserver2-from-python-using-hive-jdbc-drivers.html/
The first option (impyla) is probably the better one.
from piicatcher.
There is some issue with pyhive. I tried it from plain Python, but I still get the same error.
from piicatcher.
Is it specific to the OS? I haven't tried it on Linux (Ubuntu) yet.
from piicatcher.
I am not sure. I have used it on CentOS and it worked, but that was for a specific configuration of Hive. The OS or the configuration of Python or Hive could be the problem. I don't know how to help remotely without knowing more about the setup.
from piicatcher.
Can you try impyla?
from piicatcher.
Does this use Impala?
from piicatcher.
I am trying on CentOS, but I get this error:
[Errno 14] problem making ssl connection
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: bintray--sbt-rpm. Please verify its path and try again
So I could not install anything. I tried a couple of things for SSL, but it is still not working.
from piicatcher.
FWIW, Superset uses pyhive. https://github.com/apache/incubator-superset/blob/master/superset/db_engine_specs/hive.py#L71
There are Hive-related issues there too, but in general it works. I still think something about your installation is preventing pyhive from working.
from piicatcher.
I will start working on Hive from next week and keep you posted.
from piicatcher.
@jayeshagwan1 Hello. I am wondering how this implementation is going. It would be great if this feature could be added soon.
from piicatcher.
Closing this as there is not much demand for Hive. There is more interest in Redshift, Snowflake and Trino.
from piicatcher.