saucecat / pydqc Goto Github PK
View Code? Open in Web Editor NEWpython automatic data quality check toolkit
License: MIT License
python automatic data quality check toolkit
License: MIT License
I am researching a few options to monitor data quality inside a snowflake instance. One of those is to use python. Came across this package and looks very promising, but appears to be built with flat files in mind.
Any thoughts on upgrading it to analyze database data?
in def _compare_key(key, _df1, _df2, img_dir):
and def _compare_numeric(col, _df1, _df2, img_dir, date_flag=False):
the labels df1_name
& df2_name
are not defined and consequently it crashed during data comparision
Hi there,
Thanks for the great package.
Sorry I am new to Python and trying to pick up. Please see my issue below.
I am seeing there an error everytime I run the command
from pydqc import infer_schema, data_summary, data_compare, data_consist
ModuleNotFoundError: No module named 'dqc_utils'
If I make the anaconda3\Lib\site-packages\pydqc-0.1-py3.6.egg\pydqc as my working directory, then I am able run without error. But when I am in a different working directory I get this error.
My Path variable is set to anaconda3\Lib\site-packages.
I am able to load other packages like numpy, pandas, scipy etc.
Also, I notice that init.py file is blank for PyDQC package. Is that supposed to be the case?
Sorry again if this is a beginner's problem.
Hi There,
This looks like a great package and I was testing your package for my own automation.
So, I have created the schema of my dataframe with the infer_schema() function and it returns the .xlsx file.
When I tried to read the .xlsx schema file using pd.read_excel() function, it returns "AssertionError: "
I am not sure what is happening here.
Here is the complete error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-29-e35830cdc292> in <module>()
----> 1 test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
116 else:
117 kwargs[new_arg_name] = new_arg_value
--> 118 return func(*args, **kwargs)
119 return wrapper
120 return _deprecate_kwarg
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
228
229 if not isinstance(io, ExcelFile):
--> 230 io = ExcelFile(io, engine=engine)
231
232 return io._parse_excel(
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
292 self.book = xlrd.open_workbook(file_contents=data)
293 elif isinstance(self._io, compat.string_types):
--> 294 self.book = xlrd.open_workbook(self._io)
295 else:
296 raise ValueError('Must explicitly set engine if not passing in'
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
141 formatting_info=formatting_info,
142 on_demand=on_demand,
--> 143 ragged_rows=ragged_rows,
144 )
145 return bk
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
835 x12sheet = X12Sheet(sheet, logfile, verbosity)
836 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 837 x12sheet.process_stream(zflo, heading)
838 del zflo
839
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
546 for event, elem in ET.iterparse(stream):
547 if elem.tag == row_tag:
--> 548 self_do_row(elem)
549 elem.clear() # destroy all child elements (cells)
550 elif elem.tag == U_SSML12 + "dimension":
~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in do_row(self, row_elem)
633 self.rowx = int(row_number) - 1
634 explicit_row_number = 1
--> 635 assert 0 <= self.rowx < X12_MAX_ROWS
636 rowx = self.rowx
637 colx = -1
AssertionError:
Here is what I have installed on my python:
airflow==1.8.0
alembic==0.8.10
annoy==1.11.1
asn1crypto==0.24.0
attrs==17.4.0
Automat==0.6.0
awscli==1.14.32
basemap==1.0.7
bcolz==1.1.2
beautifulsoup4==4.6.0
bitarray==0.8.1
bleach==1.5.0
bokeh==0.12.14
boto==2.48.0
boto3==1.5.22
botocore==1.8.36
branca==0.2.0
bs4==0.0.1
bz2file==0.98
cachetools==2.0.1
Cartopy==0.15.1
catboost==0.6
certifi==2018.1.18
cffi==1.11.4
chardet==3.0.4
click==6.7
click-plugins==1.0.3
cliff==2.8.0
cligj==0.4.0
cloudpickle==0.5.2
cmd2==0.8.0
colorama==0.3.7
colorcet==1.0.0
configparser==3.5.0
constantly==15.1.0
croniter==0.3.20
cryptography==2.1.4
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
Cython==0.27.3
cytoolz==0.8.2
dask==0.17.0
datashader==0.6.5
datashape==0.5.4
deap==1.2.2
decorator==4.2.1
descartes==1.1.0
dill==0.2.7.1
distributed==1.21.0
Django==1.11.8
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
et-xmlfile==1.0.1
fastai==0.6
feather-format==0.4.0
Fiona==1.7.11.post1
Flask==0.11.1
Flask-Admin==1.4.1
Flask-Cache==0.13.1
Flask-Login==0.2.11
flask-swagger==0.2.13
Flask-WTF==0.12
folium==0.5.0
ftfy==4.4.3
funcsigs==1.0.0
future==0.15.2
gensim==3.2.0
geopandas==0.3.0
geoplot==0.1.2
gitdb2==2.0.3
GitPython==2.1.8
graphviz==0.8.2
gunicorn==19.3.0
h5py==2.7.1
heapdict==1.0.0
html5lib==0.9999999
hyperlink==17.3.1
idna==2.6
ijson==2.3
impyla==0.14.0
incremental==17.5.0
ipykernel==4.8.0
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.1.1
isoweek==1.3.3
itsdangerous==0.24
JayDeBeApi==1.1.1
jdcal==1.3
jedi==0.11.1
Jinja2==2.8.1
jmespath==0.9.3
JPype1==0.6.2
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.4.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.1.0
jupyter-latex-envs==1.4.1
jupyter-nbextensions-configurator==0.4.0
kaggle-cli==0.12.13
Keras==2.1.3
ktext==0.31
lightgbm==2.1.0
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
lxml==3.8.0
Mako==1.0.7
Markdown==2.6.9
MarkupSafe==1.0
matplotlib==2.1.2
MechanicalSoup==0.8.0
missingno==0.4.0
mistune==0.8.3
mizani==0.4.4
more-itertools==4.1.0
msgpack==0.5.1
msgpack-numpy==0.4.1
msgpack-python==0.5.1
multipledispatch==0.4.9
multiprocess==0.70.5
munch==2.2.0
murmurhash==0.28.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
nltk==3.2.5
nose==1.3.7
notebook==5.3.1
numba==0.36.2
numpy==1.14.0
olefile==0.45.1
opencv-python==3.4.0.12
openpyxl==2.5.0
ordereddict==1.1
packaging==16.8
palettable==3.1.0
pandas==0.22.0
pandas-summary==0.0.41
pandocfilters==1.4.2
param==1.5.1
parsel==1.3.1
parso==0.1.1
partd==0.3.8
pathlib==1.0.1
pathos==0.2.1
patsy==0.5.0
pbr==3.1.1
pexpect==4.3.1
pickleshare==0.7.4
Pillow==5.0.0
plac==0.9.6
plotly==2.3.0
plotnine==0.3.0
ply==3.10
pox==0.2.3
ppft==1.6.4.7.1
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prompt-toolkit==1.0.15
protobuf==3.4.1
psutil==5.4.3
ptyprocess==0.5.2
pyarrow==0.8.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.18
PyDispatcher==2.0.5
pydot==1.2.4
pydqc==0.1
pyemd==0.5.1
Pygments==2.2.0
pyodbc==4.0.22
pyOpenSSL==17.5.0
pyparsing==2.2.0
pyperclip==1.6.0
Pyphen==0.9.4
pyproj==1.9.5.1
pyshp==1.2.12
python-daemon==2.1.2
python-dateutil==2.6.1
python-editor==1.0.3
python-Levenshtein==0.12.0
python-nvd3==0.14.2
python-slugify==1.1.4
python-utils==2.2.0
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
queuelib==1.4.2
regex==2017.4.5
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.12
scikit-learn==0.19.1
scipy==1.0.0
Scrapy==1.5.0
seaborn==0.8.1
selenium==3.8.1
Send2Trash==1.4.2
service-identity==17.0.0
setproctitle==1.1.10
Shapely==1.6.4.post1
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.6.0
smart-open==1.5.6
smmap2==2.0.3
sortedcontainers==1.5.9
spacy==2.0.6
SQLAlchemy==1.2.2
statsmodels==0.8.0
stevedore==1.28.0
stopit==1.1.1
tabulate==0.7.7
tblib==1.3.2
tensorflow==1.4.1
tensorflow-tensorboard==0.1.5
teradata==15.10.0.21
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
textacy==0.5.0
Theano==1.0.1
thinc==6.10.2
thrift==0.9.3
thriftpy==0.3.9
toolz==0.9.0
torchtext==0.2.1
tornado==4.5.3
TPOT==0.9.2
tqdm==4.19.5
traitlets==4.3.2
Twisted==17.9.0
ujson==1.35
Unidecode==1.0.22
update-checker==0.16
urllib3==1.22
w3lib==1.19.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.1.0
wrapt==1.10.11
WTForms==2.1
xarray==0.10.0
xgboost==0.7
xlrd==1.1.0
XlsxWriter==1.0.2
zict==0.1.3
zope.deprecation==4.3.0
zope.interface==4.4.3
Let me know your thoughts and if you need more info.
Sugi
Because it would be nice to pip install pydqc
.
I'm getting this error: "TypeError: 'DataFrame' object is not callable"; I have no idea why as I can see the headers of the data and it's all fine.
Current implementation of data_compare assigns default values of 'table1' and 'table2' to the dataframes.
However, it would be good if the user can specify name of the dataframe, to be able to distinguish the dataframes better.
The signature can be as follows:
distribution_compare_pretty(_df1, _df2, _df1_name='table1',_df2_name='table2', col, figsize=None, date_flag=False)
This is a great tool! But I am facing some errors: I think the openpyxl is causing issues. How to fix this?
Details:
File "", line 1, in
data_summary.data_summary(table_schema=data_schema, table=df, fname='processed', sample_size=1.0, keep_images=False, n_jobs=2)
File "C:/Users/pydqc\pydqc\data_summary.py", line 468, in data_summary
_insert_numeric_results(numeric_results, ws, 35, img_dir)
File "C:/Users/pydqc\pydqc\dqc_utils.py", line 259, in _insert_numeric_results
_style_range(ws, 'A%d:C%d'%(head_row, head_row+result_df.shape[0]-1), border=border)
File "C:/Users/pydqc\pydqc\dqc_utils.py", line 50, in _style_range
cell.border += top
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\styles\proxy.py", line 46, in add
return self.__target + other
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 203, in add
vals[el] = a + b
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 203, in add
vals[el] = a + b
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 206, in add
return self.class(**vals)
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\styles\colors.py", line 86, in init
self.indexed = indexed
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 69, in set
value = _convert(self.expected_type, value)
File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 59, in _convert
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'int'>
Selected KDE bandwidth is 0. Cannot estimate density.
runtime error is thrown when trying to plot the kde of data, with the below characteristics:
default_df['overdue_accts'].value_counts()
0 43408
1 5382
2 1241
3 361
4 120
5 48
6 22
7 7
8 7
9 4
11 4
12 2
16 1
10 1
13 1
14 1
18 1
nondefault_df['overdue_accts'].value_counts()
0 163471
1 14588
2 3061
3 841
4 284
5 118
6 74
7 31
9 21
8 20
11 8
12 6
10 5
13 4
14 4
17 2
23 1
15 1
18 1
19 1
25 1
This behaviour is seen when the package statsmodels is installed in the same environment, as seaborn tries use statsmodels.
The following is the error stack trace :
ValueError Traceback (most recent call last)
~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
450 try:
--> 451 bw = float(bw)
452 except:ValueError: could not convert string to float: 'scott'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
in
6 for col in default_df.columns:
7 print(col)
----> 8 distribution_compare_pretty(default_df,nondefault_df,col=col,figsize=None)~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/pydqc/data_compare.py in distribution_compare_pretty(_df1, _df2, col, figsize, date_flag)
125 plt.legend(loc=1)
126 else:
--> 127 ax1 = sns.distplot(df1_draw_values, color=TABLE1_DARK, hist=False, label='table1')
128 ax2 = sns.distplot(df2_draw_values, color=TABLE2_DARK, hist=False, label='table2')
129 y_low_1, y_up_1 = ax1.get_ylim()~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
231 if kde:
232 kde_color = kde_kws.pop("color", color)
--> 233 kdeplot(a, vertical=vertical, ax=ax, color=kde_color, **kde_kws)
234 if kde_color != color:
235 kde_kws["color"] = kde_color~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
703 ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
704 gridsize, cut, clip, legend, ax,
--> 705 cumulative=cumulative, **kwargs)
706
707 return ax~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
293 x, y = _statsmodels_univariate_kde(data, kernel, bw,
294 gridsize, cut, clip,
--> 295 cumulative=cumulative)
296 else:
297 # Fall back to scipy if missing statsmodels~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in _statsmodels_univariate_kde(data, kernel, bw, gridsize, cut, clip, cumulative)
365 fft = kernel == "gau"
366 kde = smnp.KDEUnivariate(data)
--> 367 kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
368 if cumulative:
369 grid, y = kde.support, kde.cdf~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in fit(self, kernel, bw, fft, weights, gridsize, adjust, cut, clip)
138 density, grid, bw = kdensityfft(endog, kernel=kernel, bw=bw,
139 adjust=adjust, weights=weights, gridsize=gridsize,
--> 140 clip=clip, cut=cut)
141 else:
142 density, grid, bw = kdensity(endog, kernel=kernel, bw=bw,~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
451 bw = float(bw)
452 except:
--> 453 bw = bandwidths.select_bandwidth(X, bw, kern) # will cross-val fit this pattern?
454 bw *= adjust
455~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/bandwidths.py in select_bandwidth(x, bw, kernel)
172 # eventually this can fall back on another selection criterion.
173 err = "Selected KDE bandwidth is 0. Cannot estimate density."
--> 174 raise RuntimeError(err)
175 else:
176 return bandwidthRuntimeError: Selected KDE bandwidth is 0. Cannot estimate density.
Try "save and checkpoint" the notebook after running it, before commit,, then it will render with output.. It will make it easier for people like me to see if they want to try it out before they have to install it..
set n_jobs = min(n_jobs(pass_in), num_task)
I was able to run and modify the output of the infer_schema function. However, when running the data summary function I keep getting the error str indices must be int. Any thoughts about what I might be doing wrong?
When pandas reads a column that contains a 'mix' of integers and strings its dtype will be object. But the entries themselves will be cast to int/string individually. So infer_schema falls over at np.unique to a type error.
In line 113 of infer_schema.py, changing to
col_stat['sample_num_uni'] = len(np.unique(sample_data).astype(str))
might work?
Thanks
In dqc_utils if you change line 275-
img = openpyxl.drawing.image.Image(os.path.join(img_dir, '%s.png' %(graph_name))) ws.add_image(img, 'C%d' %(head_row+1)) except Exception as e: print(e) continue
It ask for PIL package to print images in XLS. Solve that installing Pillow
pip install Pillow
Here is an error log trying to extract data summary - seems like the spreadsheet writer is unable to fill float values
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-b07dddd4375d> in <module>()
1 get_ipython().magic(u'time')
2 data_summary.data_summary(table_schema=df_schema, table=df, fname='csv',
----> 3 sample_size=1.0, output_root='dq/', keep_images=True, n_jobs=2)
<path>/lib/python2.7/site-packages/pydqc/data_summary.pyc in data_summary(table_schema, table, fname, sample_size, sample_rows, output_root, keep_images, n_jobs)
466 ws = wb.create_sheet(title=u'numeric')
467 # write the final result to work sheet
--> 468 _insert_numeric_results(numeric_results, ws, 35, img_dir)
469
470 # string features
<path>/lib/python2.7/site-packages/pydqc/dqc_utils.pyc in _insert_numeric_results(numeric_results, ws, row_height, img_dir, date_flag)
253 result_df = result['result_df']
254 result_df = result_df[['feature', 'value', 'graph']]
--> 255 head_row = _insert_df(result_df, ws)
256
257 # merge cells for the graph
/<path>/lib/python2.7/site-packages/pydqc/dqc_utils.pyc in _insert_df(result_df, ws, header, head_color, bold_first_column, head_style)
202 max_col = result_df.shape[1]
203 for r_idx, r in enumerate(dataframe_to_rows(result_df, index=False, header=header)):
--> 204 ws.append(r)
205 for cell_idx, cell in enumerate(ws.iter_cols(max_col=max_col, min_row=ws.max_row, max_row=ws.max_row)):
206 cell = cell[0]
<path>/lib/python2.7/site-packages/openpyxl/worksheet/worksheet.pyc in append(self, iterable)
652 cell.row = row_idx
653 else:
--> 654 cell = Cell(self, row=row_idx, column=col_idx, value=content)
655 self._cells[(row_idx, col_idx)] = cell
656
<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in __init__(self, worksheet, row, column, value, style_array)
118 self.data_type = 'n'
119 if value is not None:
--> 120 self.value = value
121 self._comment = None
122
<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in value(self, value)
250 def value(self, value):
251 """Set the value and infer type and display options."""
--> 252 self._bind_value(value)
253
254 @property
<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in _bind_value(self, value)
216
217 elif value is not None:
--> 218 raise ValueError("Cannot convert {0!r} to Excel".format(value))
219
220 self._value = value
ValueError: Cannot convert -0.625969839 to Excel
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.