saucecat / pydqc Goto Github PK

View Code? Open in Web Editor NEW

283.0 283.0 57.0 9.69 MB

python automatic data quality check toolkit

License: MIT License

Python 42.15% Jupyter Notebook 57.85%

pydqc's People

Contributors

Stargazers

Watchers

pydqc's Issues

Can this be used this against Database Data, vs. flat files?

I am researching a few options to monitor data quality inside a snowflake instance. One of those is to use python. Came across this package and looks very promising, but appears to be built with flat files in mind.

Any thoughts on upgrading it to analyze database data?

Non-defined labels

in def _compare_key(key, _df1, _df2, img_dir): and def _compare_numeric(col, _df1, _df2, img_dir, date_flag=False): the labels df1_name & df2_name are not defined and consequently it crashed during data comparision

ModuleNotFoundError

Hi there,

Thanks for the great package.
Sorry I am new to Python and trying to pick up. Please see my issue below.

I am seeing there an error everytime I run the command
from pydqc import infer_schema, data_summary, data_compare, data_consist

ModuleNotFoundError: No module named 'dqc_utils'

If I make the anaconda3\Lib\site-packages\pydqc-0.1-py3.6.egg\pydqc as my working directory, then I am able run without error. But when I am in a different working directory I get this error.

My Path variable is set to anaconda3\Lib\site-packages.
I am able to load other packages like numpy, pandas, scipy etc.

Also, I notice that init.py file is blank for PyDQC package. Is that supposed to be the case?

Sorry again if this is a beginner's problem.

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError:

Hi There,

This looks like a great package and I was testing your package for my own automation.

So, I have created the schema of my dataframe with the infer_schema() function and it returns the .xlsx file.

When I tried to read the .xlsx schema file using pd.read_excel() function, it returns "AssertionError: "

I am not sure what is happening here.

Here is the complete error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-e35830cdc292> in <module>()
----> 1 test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    141                 formatting_info=formatting_info,
    142                 on_demand=on_demand,
--> 143                 ragged_rows=ragged_rows,
    144                 )
    145             return bk

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    835         x12sheet = X12Sheet(sheet, logfile, verbosity)
    836         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 837         x12sheet.process_stream(zflo, heading)
    838         del zflo
    839 

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in do_row(self, row_elem)
    633             self.rowx = int(row_number) - 1
    634             explicit_row_number = 1
--> 635         assert 0 <= self.rowx < X12_MAX_ROWS
    636         rowx = self.rowx
    637         colx = -1

AssertionError:

Here is what I have installed on my python:
airflow==1.8.0
alembic==0.8.10
annoy==1.11.1
asn1crypto==0.24.0
attrs==17.4.0
Automat==0.6.0
awscli==1.14.32
basemap==1.0.7
bcolz==1.1.2
beautifulsoup4==4.6.0
bitarray==0.8.1
bleach==1.5.0
bokeh==0.12.14
boto==2.48.0
boto3==1.5.22
botocore==1.8.36
branca==0.2.0
bs4==0.0.1
bz2file==0.98
cachetools==2.0.1
Cartopy==0.15.1
catboost==0.6
certifi==2018.1.18
cffi==1.11.4
chardet==3.0.4
click==6.7
click-plugins==1.0.3
cliff==2.8.0
cligj==0.4.0
cloudpickle==0.5.2
cmd2==0.8.0
colorama==0.3.7
colorcet==1.0.0
configparser==3.5.0
constantly==15.1.0
croniter==0.3.20
cryptography==2.1.4
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
Cython==0.27.3
cytoolz==0.8.2
dask==0.17.0
datashader==0.6.5
datashape==0.5.4
deap==1.2.2
decorator==4.2.1
descartes==1.1.0
dill==0.2.7.1
distributed==1.21.0
Django==1.11.8
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
et-xmlfile==1.0.1
fastai==0.6
feather-format==0.4.0
Fiona==1.7.11.post1
Flask==0.11.1
Flask-Admin==1.4.1
Flask-Cache==0.13.1
Flask-Login==0.2.11
flask-swagger==0.2.13
Flask-WTF==0.12
folium==0.5.0
ftfy==4.4.3
funcsigs==1.0.0
future==0.15.2
gensim==3.2.0
geopandas==0.3.0
geoplot==0.1.2
gitdb2==2.0.3
GitPython==2.1.8
graphviz==0.8.2
gunicorn==19.3.0
h5py==2.7.1
heapdict==1.0.0
html5lib==0.9999999
hyperlink==17.3.1
idna==2.6
ijson==2.3
impyla==0.14.0
incremental==17.5.0
ipykernel==4.8.0
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.1.1
isoweek==1.3.3
itsdangerous==0.24
JayDeBeApi==1.1.1
jdcal==1.3
jedi==0.11.1
Jinja2==2.8.1
jmespath==0.9.3
JPype1==0.6.2
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.4.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.1.0
jupyter-latex-envs==1.4.1
jupyter-nbextensions-configurator==0.4.0
kaggle-cli==0.12.13
Keras==2.1.3
ktext==0.31
lightgbm==2.1.0
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
lxml==3.8.0
Mako==1.0.7
Markdown==2.6.9
MarkupSafe==1.0
matplotlib==2.1.2
MechanicalSoup==0.8.0
missingno==0.4.0
mistune==0.8.3
mizani==0.4.4
more-itertools==4.1.0
msgpack==0.5.1
msgpack-numpy==0.4.1
msgpack-python==0.5.1
multipledispatch==0.4.9
multiprocess==0.70.5
munch==2.2.0
murmurhash==0.28.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.1
nltk==3.2.5
nose==1.3.7
notebook==5.3.1
numba==0.36.2
numpy==1.14.0
olefile==0.45.1
opencv-python==3.4.0.12
openpyxl==2.5.0
ordereddict==1.1
packaging==16.8
palettable==3.1.0
pandas==0.22.0
pandas-summary==0.0.41
pandocfilters==1.4.2
param==1.5.1
parsel==1.3.1
parso==0.1.1
partd==0.3.8
pathlib==1.0.1
pathos==0.2.1
patsy==0.5.0
pbr==3.1.1
pexpect==4.3.1
pickleshare==0.7.4
Pillow==5.0.0
plac==0.9.6
plotly==2.3.0
plotnine==0.3.0
ply==3.10
pox==0.2.3
ppft==1.6.4.7.1
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prompt-toolkit==1.0.15
protobuf==3.4.1
psutil==5.4.3
ptyprocess==0.5.2
pyarrow==0.8.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycparser==2.18
PyDispatcher==2.0.5
pydot==1.2.4
pydqc==0.1
pyemd==0.5.1
Pygments==2.2.0
pyodbc==4.0.22
pyOpenSSL==17.5.0
pyparsing==2.2.0
pyperclip==1.6.0
Pyphen==0.9.4
pyproj==1.9.5.1
pyshp==1.2.12
python-daemon==2.1.2
python-dateutil==2.6.1
python-editor==1.0.3
python-Levenshtein==0.12.0
python-nvd3==0.14.2
python-slugify==1.1.4
python-utils==2.2.0
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.3
qtconsole==4.3.1
queuelib==1.4.2
regex==2017.4.5
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.12
scikit-learn==0.19.1
scipy==1.0.0
Scrapy==1.5.0
seaborn==0.8.1
selenium==3.8.1
Send2Trash==1.4.2
service-identity==17.0.0
setproctitle==1.1.10
Shapely==1.6.4.post1
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.6.0
smart-open==1.5.6
smmap2==2.0.3
sortedcontainers==1.5.9
spacy==2.0.6
SQLAlchemy==1.2.2
statsmodels==0.8.0
stevedore==1.28.0
stopit==1.1.1
tabulate==0.7.7
tblib==1.3.2
tensorflow==1.4.1
tensorflow-tensorboard==0.1.5
teradata==15.10.0.21
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
textacy==0.5.0
Theano==1.0.1
thinc==6.10.2
thrift==0.9.3
thriftpy==0.3.9
toolz==0.9.0
torchtext==0.2.1
tornado==4.5.3
TPOT==0.9.2
tqdm==4.19.5
traitlets==4.3.2
Twisted==17.9.0
ujson==1.35
Unidecode==1.0.22
update-checker==0.16
urllib3==1.22
w3lib==1.19.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.1.0
wrapt==1.10.11
WTForms==2.1
xarray==0.10.0
xgboost==0.7
xlrd==1.1.0
XlsxWriter==1.0.2
zict==0.1.3
zope.deprecation==4.3.0
zope.interface==4.4.3

Let me know your thoughts and if you need more info.

Sugi

Make a pypi package

Because it would be nice to pip install pydqc.

'DataFrame' object is not callable

I'm getting this error: "TypeError: 'DataFrame' object is not callable"; I have no idea why as I can see the headers of the data and it's all fine.

Add parameters table1_name,table2_name in data_compare method.

Current implementation of data_compare assigns default values of 'table1' and 'table2' to the dataframes.

However, it would be good if the user can specify name of the dataframe, to be able to distinguish the dataframes better.

The signature can be as follows:
distribution_compare_pretty(_df1, _df2, _df1_name='table1',_df2_name='table2', col, figsize=None, date_flag=False)

Error on data_summary: expected <class 'int'>

This is a great tool! But I am facing some errors: I think the openpyxl is causing issues. How to fix this?

Details:

File "", line 1, in
data_summary.data_summary(table_schema=data_schema, table=df, fname='processed', sample_size=1.0, keep_images=False, n_jobs=2)

File "C:/Users/pydqc\pydqc\data_summary.py", line 468, in data_summary
_insert_numeric_results(numeric_results, ws, 35, img_dir)

File "C:/Users/pydqc\pydqc\dqc_utils.py", line 259, in _insert_numeric_results
_style_range(ws, 'A%d:C%d'%(head_row, head_row+result_df.shape[0]-1), border=border)

File "C:/Users/pydqc\pydqc\dqc_utils.py", line 50, in _style_range
cell.border += top

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\styles\proxy.py", line 46, in add
return self.__target + other

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 203, in add
vals[el] = a + b

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\serialisable.py", line 206, in add
return self.class(**vals)

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\styles\colors.py", line 86, in init
self.indexed = indexed

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 69, in set
value = _convert(self.expected_type, value)

File "C:\Users\6\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 59, in _convert
raise TypeError('expected ' + str(expected_type))

TypeError: expected <class 'int'>

Selected KDE bandwidth is 0. Cannot estimate density.

Selected KDE bandwidth is 0. Cannot estimate density. runtime error is thrown when trying to plot the kde of data, with the below characteristics:

default_df['overdue_accts'].value_counts()

0 43408
1 5382
2 1241
3 361
4 120
5 48
6 22
7 7
8 7
9 4
11 4
12 2
16 1
10 1
13 1
14 1
18 1

nondefault_df['overdue_accts'].value_counts()

0 163471
1 14588
2 3061
3 841
4 284
5 118
6 74
7 31
9 21
8 20
11 8
12 6
10 5
13 4
14 4
17 2
23 1
15 1
18 1
19 1
25 1

This behaviour is seen when the package statsmodels is installed in the same environment, as seaborn tries use statsmodels.

The following is the error stack trace :

ValueError Traceback (most recent call last)
~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
450 try:
--> 451 bw = float(bw)
452 except:

ValueError: could not convert string to float: 'scott'

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
in
6 for col in default_df.columns:
7 print(col)
----> 8 distribution_compare_pretty(default_df,nondefault_df,col=col,figsize=None)

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/pydqc/data_compare.py in distribution_compare_pretty(_df1, _df2, col, figsize, date_flag)
125 plt.legend(loc=1)
126 else:
--> 127 ax1 = sns.distplot(df1_draw_values, color=TABLE1_DARK, hist=False, label='table1')
128 ax2 = sns.distplot(df2_draw_values, color=TABLE2_DARK, hist=False, label='table2')
129 y_low_1, y_up_1 = ax1.get_ylim()

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
231 if kde:
232 kde_color = kde_kws.pop("color", color)
--> 233 kdeplot(a, vertical=vertical, ax=ax, color=kde_color, **kde_kws)
234 if kde_color != color:
235 kde_kws["color"] = kde_color

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
703 ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
704 gridsize, cut, clip, legend, ax,
--> 705 cumulative=cumulative, **kwargs)
706
707 return ax

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
293 x, y = _statsmodels_univariate_kde(data, kernel, bw,
294 gridsize, cut, clip,
--> 295 cumulative=cumulative)
296 else:
297 # Fall back to scipy if missing statsmodels

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/seaborn/distributions.py in _statsmodels_univariate_kde(data, kernel, bw, gridsize, cut, clip, cumulative)
365 fft = kernel == "gau"
366 kde = smnp.KDEUnivariate(data)
--> 367 kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
368 if cumulative:
369 grid, y = kde.support, kde.cdf

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in fit(self, kernel, bw, fft, weights, gridsize, adjust, cut, clip)
138 density, grid, bw = kdensityfft(endog, kernel=kernel, bw=bw,
139 adjust=adjust, weights=weights, gridsize=gridsize,
--> 140 clip=clip, cut=cut)
141 else:
142 density, grid, bw = kdensity(endog, kernel=kernel, bw=bw,

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
451 bw = float(bw)
452 except:
--> 453 bw = bandwidths.select_bandwidth(X, bw, kern) # will cross-val fit this pattern?
454 bw *= adjust
455

~/Developer/anaconda3/envs/gdd3.6/lib/python3.6/site-packages/statsmodels/nonparametric/bandwidths.py in select_bandwidth(x, bw, kernel)
172 # eventually this can fall back on another selection criterion.
173 err = "Selected KDE bandwidth is 0. Cannot estimate density."
--> 174 raise RuntimeError(err)
175 else:
176 return bandwidth

RuntimeError: Selected KDE bandwidth is 0. Cannot estimate density.

Can you also put a date column in example?

The package looks quite interesting, It would help if the sample notebooks were saved with output

Try "save and checkpoint" the notebook after running it, before commit,, then it will render with output.. It will make it easier for people like me to see if they want to try it out before they have to install it..

issues and enhancements list

set n_jobs = min(n_jobs(pass_in), num_task)

Str indices must be int

I was able to run and modify the output of the infer_schema function. However, when running the data summary function I keep getting the error str indices must be int. Any thoughts about what I might be doing wrong?

infer_schema TypeError: '<' not supported between instances of 'str' and 'int'

When pandas reads a column that contains a 'mix' of integers and strings its dtype will be object. But the entries themselves will be cast to int/string individually. So infer_schema falls over at np.unique to a type error.

In line 113 of infer_schema.py, changing to
col_stat['sample_num_uni'] = len(np.unique(sample_data).astype(str))

might work?

Thanks

You need to install package Pillow for supporting Images in openpyxl

In dqc_utils if you change line 275-

img = openpyxl.drawing.image.Image(os.path.join(img_dir, '%s.png' %(graph_name))) ws.add_image(img, 'C%d' %(head_row+1)) except Exception as e: print(e) continue

It ask for PIL package to print images in XLS. Solve that installing Pillow

pip install Pillow

ValueError: Cannot convert -#.##### to Excel

Cannot convert -#.##### to Excel

Here is an error log trying to extract data summary - seems like the spreadsheet writer is unable to fill float values

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-b07dddd4375d> in <module>()
      1 get_ipython().magic(u'time')
      2 data_summary.data_summary(table_schema=df_schema, table=df, fname='csv', 
----> 3                           sample_size=1.0, output_root='dq/', keep_images=True, n_jobs=2)

<path>/lib/python2.7/site-packages/pydqc/data_summary.pyc in data_summary(table_schema, table, fname, sample_size, sample_rows, output_root, keep_images, n_jobs)
    466         ws = wb.create_sheet(title=u'numeric')
    467         # write the final result to work sheet
--> 468         _insert_numeric_results(numeric_results, ws, 35, img_dir)
    469 
    470     # string features

<path>/lib/python2.7/site-packages/pydqc/dqc_utils.pyc in _insert_numeric_results(numeric_results, ws, row_height, img_dir, date_flag)
    253         result_df = result['result_df']
    254         result_df = result_df[['feature', 'value', 'graph']]
--> 255         head_row = _insert_df(result_df, ws)
    256 
    257         # merge cells for the graph

/<path>/lib/python2.7/site-packages/pydqc/dqc_utils.pyc in _insert_df(result_df, ws, header, head_color, bold_first_column, head_style)
    202     max_col = result_df.shape[1]
    203     for r_idx, r in enumerate(dataframe_to_rows(result_df, index=False, header=header)):
--> 204         ws.append(r)
    205         for cell_idx, cell in enumerate(ws.iter_cols(max_col=max_col, min_row=ws.max_row, max_row=ws.max_row)):
    206             cell = cell[0]

<path>/lib/python2.7/site-packages/openpyxl/worksheet/worksheet.pyc in append(self, iterable)
    652                     cell.row = row_idx
    653                 else:
--> 654                     cell = Cell(self, row=row_idx, column=col_idx, value=content)
    655                 self._cells[(row_idx, col_idx)] = cell
    656 

<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in __init__(self, worksheet, row, column, value, style_array)
    118         self.data_type = 'n'
    119         if value is not None:
--> 120             self.value = value
    121         self._comment = None
    122 

<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in value(self, value)
    250     def value(self, value):
    251         """Set the value and infer type and display options."""
--> 252         self._bind_value(value)
    253 
    254     @property

<path>/lib/python2.7/site-packages/openpyxl/cell/cell.pyc in _bind_value(self, value)
    216 
    217         elif value is not None:
--> 218             raise ValueError("Cannot convert {0!r} to Excel".format(value))
    219 
    220         self._value = value

ValueError: Cannot convert -0.625969839 to Excel

saucecat / pydqc Goto Github PK

pydqc's People

Contributors

Stargazers

Watchers

Forkers

pydqc's Issues

Cannot convert -#.##### to Excel

Recommend Projects

Recommend Topics

Recommend Org