capitalone / datacompy
Pandas, Polars, and Spark DataFrame comparison for humans and more!
Home Page: https://capitalone.github.io/datacompy/
License: Apache License 2.0
Seems like the unit tests are failing due to a potential regression in Pandas?
pandas-dev/pandas#21119
from decimal import Decimal

import pandas as pd
import datacompy

def test_decimal_with_joins():
    df1 = pd.DataFrame([{'a': Decimal('1'), 'b': 2}, {'a': Decimal('2'), 'b': 2}])
    df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])
    compare = datacompy.Compare(df1, df2, 'a')
    assert compare.matches()
    assert compare.all_columns_match()
    assert compare.all_rows_overlap()
    assert compare.intersect_rows_match()
The above test raises: ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
Would it make sense to pin the Pandas version?
@ryanshiroma Can you please sign the project CLA for the datacompy project? If you plan to submit a PR in the near future, the CLA-Assistant bot will prompt you to sign. If you aren't planning to submit any new PRs to the project, please use this link to sign: https://docs.google.com/forms/d/e/1FAIpQLSfwtl1s6KmpLhCY6CjiY8nFZshDwf_wrmNYx1ahpsNFXXmHKw/viewform
We appreciate your contributions to the project and your understanding of the CLA contribution requirement.
As probably expected, the following code will return a mismatch since 'B'<>'B ':
import pandas as pd
import datacompy
df1 = pd.DataFrame([
    {'id': 1234, 'column_value': 'A'},
    {'id': 2345, 'column_value': 'B'}])
df2 = pd.DataFrame([
    {'id': 1234, 'column_value': 'A'},
    {'id': 2345, 'column_value': 'B '}])
compare = datacompy.Compare(
    df1,
    df2,
    join_columns='id',
    abs_tol=0,
    rel_tol=0,
)
compare.matches(ignore_extra_columns=False)
# False
# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())
What I propose is an optional parameter to ignore differences where the only difference is leading or trailing spaces. In this example it is obvious that there is a trailing space. However, when we are dealing with extracts from different databases/source files, without real control over the ETL of these, sometimes we can't prevent these discrepancies. We may wish to ignore these types of mismatches to identify 'worse' mismatches more effectively.
Another candidate could be ignoring case sensitivity differences.
Of course these could both be easily handled with preprocessing the dataframes, but still could be some convenient enhancements!
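Both normalizations could be sketched as a single preprocessing helper like the one below (the helper name is mine, not part of datacompy; it assumes the example frames above):

```python
import pandas as pd

def normalize_strings(df):
    """Strip leading/trailing spaces and lowercase string cells.

    A preprocessing stopgap (hypothetical helper, not part of
    datacompy) until such options exist natively.
    """
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.strip().str.lower()
    return out

df2 = pd.DataFrame([
    {'id': 1234, 'column_value': 'A'},
    {'id': 2345, 'column_value': 'B '}])
print(normalize_strings(df2)['column_value'].tolist())  # ['a', 'b']
```

Running both inputs through this before constructing the Compare makes the trailing-space and case differences disappear without touching datacompy itself.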
Comparison of two empty numpy arrays currently returns False, which results in showing diffs where there shouldn't be any. This is due to the way numpy compares empty arrays. Running bool(np.array([]) == np.array([])) returns False and throws this warning:
The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
Reproduce this bug with:
import numpy as np
import pandas as pd
import datacompy

df1 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})
df2 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})
pdcompare = datacompy.Compare(df1, df2, join_columns="id")
print(pdcompare.report())
output:
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 df1 2 10
1 df2 2 10
Column Summary
--------------
Number of columns in common: 2
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 10
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0
Number of rows with some compared columns unequal: 10
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 10
Columns with Unequal Values or Types
------------------------------------
Column df1 dtype df2 dtype # Unequal Max Diff # Null Diff
0 some_col object object 10 0 0
Sample Rows with Unequal Values
-------------------------------
id some_col (df1) some_col (df2)
9 9 [] []
0 0 [] []
3 3 [] []
7 7 [] []
5 5 [] []
1 1 [] []
4 4 [] []
2 2 [] []
8 8 [] []
6 6 [] []
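Since two empty arrays are in fact equal, a shape-and-contents check such as numpy's array_equal would avoid the ambiguous truth value. A minimal sketch of the safer comparison:

```python
import numpy as np

a, b = np.array([]), np.array([])

# Elementwise `==` on empty arrays yields another empty array, whose
# truth value is ambiguous; array_equal compares shape and contents.
print((a == b).size)         # 0
print(np.array_equal(a, b))  # True
```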
I'd love to see SparkCompare produce "Sample Rows with Unequal Values". This is a really useful section from the pandas version of datacompy.
For example: when I run the report on my dataframes, I want only the output under the 'Columns with Unequal Values or Types' section, plus the output of 'Sample Rows with Unequal Values'. These are the only parts of compare.report I need. Is there any way to capture only these parts of the report?
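Since report() returns a plain string, one workaround is to slice sections out by their underlined headers. A rough sketch (the helper is mine, not a datacompy API; it assumes the plain-text layout shown in the sample report above):

```python
def extract_section(report, header):
    """Return one section of a datacompy-style plain-text report.

    Assumes each section starts with a header line underlined by
    dashes; hypothetical helper, not part of datacompy.
    """
    lines = report.splitlines()
    if header not in lines:
        return ""
    start = lines.index(header)
    end = len(lines)
    # Look for the next dashed underline; the line above it is the
    # next section's header, so stop just before it.
    for i in range(start + 2, len(lines)):
        if lines[i] and set(lines[i]) == {"-"}:
            end = i - 1
            break
    return "\n".join(lines[start:end]).rstrip()

report = "Row Summary\n-----------\nMatched on: id\nColumn Comparison\n-----------------\nTotal: 10"
print(extract_section(report, "Row Summary"))
```

Calling it once per wanted header ('Columns with Unequal Values or Types', 'Sample Rows with Unequal Values') would capture just those parts.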
YAML parser and emitter for Python
Library home page: https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz
Path to dependency file: datacompy/test-requirements.txt
Path to vulnerable library: datacompy/test-requirements.txt
Dependency Hierarchy:
Found in HEAD commit: 4f516c4de1453b3db34b7f14ea561dd05f7110c1
Found in base branch: develop
A vulnerability was discovered in the PyYAML library in all versions, where it is susceptible to arbitrary code execution when it processes untrusted YAML files through the full_load method or with the FullLoader loader. load() defaults to using FullLoader, and FullLoader is still vulnerable to RCE when run on untrusted input. Applications that use the library to process untrusted input may be vulnerable to this flaw. An attacker could use this flaw to execute arbitrary code on the system by abusing the python/object/new constructor.
The fix for CVE-2020-1747 was not enough to fix this issue.
Publish Date: 2020-07-21
URL: CVE-2020-14343
Base Score Metrics:
Step up your Open Source Security Game with WhiteSource here
Maybe Travis CI? Cloud Custodian uses Travis; hopefully I can poach their stuff.
I used this tool to compare two parquet files and surprisingly found the comparison result is completely wrong:
The two parquet files both have 40 fields with exactly the same names; the difference is that one parquet is 27MB and the other is 250MB.
Here is the summary:
Number of columns in common: 7
Number of columns in first_parquet but not in second_parquet: 33
Number of columns in second_parquet but not in first_parquet: 25
I understand it would help your troubleshooting if you had the actual parquet files, but unfortunately I can't share them; a screenshot of the schemas side-by-side is here:
https://snag.gy/rnpmBX.jpg
Not sure if it makes sense to go all the way to subclassing or ABCs, but the API calls between Compare and SparkCompare are quite different. I think they could be aligned somewhat before adding any new functionality.
You can get at duplicate rows like:
comp.df1_unq_rows[comp.df1_unq_rows['acct_id'].isin(comp.intersect_rows['acct_id'])]
The compare report just says "Any duplicates on match values: Yes". It could include some more detail about the duplicates.
https://github.com/malexer/pytest-spark
Could get spark tests up and running on Travis then maybe?
There is an error during import. When I run
import datacompy
this error arises:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\__init__.py", line 20, in <module>
    from datacompy.sparkcompare import SparkCompare, NUMERIC_SPARK_TYPES
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\sparkcompare.py", line 26, in <module>
    from pyspark.sql import functions as F
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
I have another version of this library running on two different PCs, but this error must be new.
Having a structured-format output would help in cases like the one below:
| | row_num | key [symbol:quotedate] | close (Original) | close (New) | numdiff | text (Original) | text (New) | textdiff |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | [AAN:01-05-2016] | 20.0 | 23.12 | 3.12 | | | |
| 1 | 4 | [AAN:01-05-2016] | 1030099.0 | 1030097.0 | -2 | | | |
| 2 | 7 | [AAP:01-05-2016] | | | | b | a | True |
'Sample Rows with Unequal Values' shows just 10 rows which are different.
I want to see what changed in all the rows in the 2 dataframes, not just 10 rows.
Is there a way I can see all the unequal rows?
Please advise.
Prior to the check for uniqueness, _validate_dataframe does dataframe.columns = [col.lower() for col in dataframe.columns]. If columns have unique mixed-case labels, these can collide when lower-cased, as in this example:
In [30]: df
Out[30]:
OSs Oss
0.0 -0.438774 -0.853128
0.5 0.919049 -0.795371
In [31]: x = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-69a8e3399fdd> in <module>
----> 1 x = datacompy.Compare(df, df, on_index=True)
lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
103
104 self._any_dupes = False
--> 105 self.df1 = df1
106 self.df2 = df2
107 self.df1_name = df1_name
lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
121 """Check that it is a dataframe and has the join columns"""
122 self._df1 = df1
--> 123 self._validate_dataframe("df1")
124
125 @property
lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
151
152 if len(set(dataframe.columns)) < len(dataframe.columns):
--> 153 raise ValueError("{} must have unique column names".format(index))
154
155 if self.on_index:
ValueError: df1 must have unique column names
For the individual column comparisons, the join_cols are always output, followed by the values in df1 and df2.
However, for the Sample Rows Only in df1/df2 sections, the first 10 columns are output.
Suggestion would be to force the output to show the join columns in this as well, as sometimes (for whatever reason) the join columns are not in the first 10.
Having it present would make downstream investigations of mismatches easier, instead of having to use the pandas df index to locate the join columns (assuming your join columns are also typically the lookup columns for investigations).
Hey @SanthiSridharan are you able to give me a hand with this? I've got the built docs in the /docs
subfolder, but I think I need someone with full repo rights to tell GH where to serve pages from, and maybe some help getting the capitalone.github.io connection set up?
If I am trying to compare the intersect rows for the following two dataframes:
import pandas as pd
import datacompy

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                    'value': [5, 6, 7]})
compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['key'],  # You can also specify a list of columns
    abs_tol=0,
    rel_tol=0,
    df1_name='df1',
    df2_name='df2')
compare.intersect_rows only returns the following rows:
key | value_df1 | value_df2 | _merge | value_match |
---|---|---|---|---|
foo | 1 | 5 | both | False |
bar | 2 | 6 | both | False |
baz | 3 | 7 | both | False |
While I am expecting an additional row:
key | value_df1 | value_df2 | _merge | value_match |
---|---|---|---|---|
foo | 5 | 5 | both | True |
In fact, when you use pandas merge on df1 and df2, you do see the last row in df1:
df1.merge(df2, how="outer", suffixes=("_df1", "_df2"), indicator=True, on='key')
key | value_df1 | value_df2 | _merge |
---|---|---|---|
foo | 1 | 5 | both |
foo | 5 | 5 | both |
bar | 2 | 6 | both |
baz | 3 | 7 | both |
Why is the last row from df1 missing from the intersect_rows dataframe?
If you are using camel case naming conventions and have several multi-word column names, then automatic lowercasing of names will make the final result less readable.
There is only a need to lowercase when you have case-insensitive duplicates. If you do not, there is no need to disambiguate. So why not add an option to turn this off? If there are duplicates, then simply throw - it already does this, anyhow.
Rather than make it print to stdout (or a file)? Compare.report returns a string by default. I think it's better to make these methods behave consistently.
In setup.py, the open(os.path.join(CURR_DIR, "README.rst"), encoding="utf-8") as file_open call should be changed to work with Python 2.7.
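On Python 2.7 the builtin open() has no encoding argument, while io.open accepts it on both major versions. A possible shape of the fix, sketched as a standalone helper rather than the actual setup.py patch:

```python
import io

def read_long_description(path):
    """Read a text file as UTF-8 on both Python 2.7 and 3.x.

    The Python 2 builtin open() rejects the encoding keyword;
    io.open accepts it on both major versions.
    """
    with io.open(path, encoding="utf-8") as file_open:
        return file_open.read()
```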
Seems like the latest pytest-spark has a bug or something. I've been tinkering around with it to get Spark up and running. I have an MVP repo here: https://github.com/fdosani/travis-pytest-spark
We should probably pin pytest-spark until this sorts itself out. Also in the repo is a .travis.yml showing how to set up the environment so a standalone Spark instance can be used for unit testing. Hope this helps with the upcoming changes!
Wondering if we should have some conversation about dropping Python 2 support, as it is coming to EOL soon. Right now we support both, but it might help simplify things if we only go with 3.6+ in the next major release?
@theianrobertson @jborchma @gforsyth @slott56 Would appreciate any feedback on this.
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/something/datacompy/setup.py", line 12, in <module>
LONG_DESCRIPTION = file_open.read()
File "/something/envs/py36/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10137: ordinal not in range(128)
Issue is that file_open should be explicitly opened as utf-8. If someone's default environment isn't utf-8 this read fails.
At present it does not seem the built-in comparison report accepts any parameters. For one, I would like to be able to set the row limit myself as far as how many differences are printed per field. By default this is 10 - I'd like to optionally change this.
Appreciate the use of smart defaults, but any hard-coded parameter - 10 or otherwise, I feel ought to be something that can be configured externally.
Not sure why this is, but I'm getting a warning message from the heart of Pandas here:
import logging
import sys
import pandas as pd
import numpy as np
import datacompy
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
expected = pd.DataFrame([{'acct': 1, 'c': 2}, {'acct': 3, 'c': 4}, {'acct': np.nan}])
actual = expected.copy()
compare = datacompy.Compare(actual, expected, 'acct')
DEBUG:datacompy.core:Checking equality
INFO:datacompy.core:df1 Pandas.DataFrame.equals df2
INFO:datacompy.core:Number of columns in common: 2
DEBUG:datacompy.core:Checking column overlap
INFO:datacompy.core:Number of columns in df1 and not in df2: 0
INFO:datacompy.core:Number of columns in df2 and not in df1: 0
DEBUG:datacompy.core:Merging dataframes
DEBUG:datacompy.core:Outer joining
DEBUG:datacompy.core:Selecting df1 unique rows
DEBUG:datacompy.core:Selecting df2 unique rows
INFO:datacompy.core:Number of rows in df1 and not in df2: 0
INFO:datacompy.core:Number of rows in df2 and not in df1: 0
DEBUG:datacompy.core:Selecting intersecting rows
INFO:datacompy.core:Number of rows in df1 and df2 (not necessarily equal): 3
DEBUG:datacompy.core:Comparing intersection
INFO:datacompy.core:acct: 3 / 3 (100.00%) match
/Users/myname/miniconda3/envs/t2/lib/python3.6/site-packages/pandas/core/generic.py:6515: RuntimeWarning: invalid value encountered in absolute
return np.abs(self)
INFO:datacompy.core:c: 3 / 3 (100.00%) match
INFO:datacompy.core:df1 matches df2
It doesn't seem to happen if (a) I only have two rows (take out the second row), or (b) I step through the code myself (?), or (c) I have four rows, ugh.
Having trouble isolating the issue, so 🤷
It would be useful to be able to analyze a dataset based on the comparison, for example:
Compare dataset A, B: approximately 2% do not match on my relative tolerance threshold for variable X. I would like to get a dataset that has the inner join of A and B and an indicator variable for (difference in X > tol) that I could use for analysis.
As it stands I have to cook this myself, and I wasn't able to get sample_mismatch() to return what I needed easily
~\Anaconda3\lib\site-packages\datacompy\sparkcompare.py in <listcomp>(.0)
521 def helper(c):
522 # Create a predicate for each match type, comparing column values to the match type value
--> 523 predicates = [F.col(c) == k.value for k in MatchType]
524 # Create a tuple(number of match types found for each match type in this column)
525 return F.struct([F.lit(F.sum(pred.cast("integer"))) for pred in predicates]).alias(c)
NameError: name 'F' is not defined
Please be sure to add trusted reviewers to your codeowners file and the project write team.
There's an issue with the logic that tries to handle duplicate data. This is a minimal example:
import numpy as np
import pandas as pd
import datacompy
df1 = pd.DataFrame({
    'fld_1': [1, 2, 2, 3, 3, 4, 5, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({
    'fld_1': [1, 2, 3, 4, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan]})
comp = datacompy.Compare(df1, df2, join_columns=['fld_1', 'fld_2'])
assert comp.subset()
df2 is a subset of df1, but it's not matching. The issue is here: the groupby.cumcount just lumps any rows with nulls in them into one big category. Need to make that more resilient to nulls.
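A minimal illustration of a null-safe alternative: fill nulls with a sentinel before grouping, so null-keyed rows still get proper occurrence indices (the sentinel value is my own placeholder, not datacompy's actual fix):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fld_1": [2, 2, 3, 3],
    "fld_2": [np.nan, np.nan, np.nan, np.nan]})

# NaN in group keys confuses groupby-based occurrence counting;
# replacing NaN with a sentinel first keeps each key distinct.
safe = df.fillna("DATACOMPY_NULL").groupby(["fld_1", "fld_2"]).cumcount()
print(safe.tolist())  # [0, 1, 0, 1]
```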
Although the tolerance values and overall matching seem to work fine when comparing object to float64 or int64, in the summary report the Max Diff is always 0.000000.
It could probably be fixed here with .astype(float):
Lines 295 to 297 in 4d7fc48
Seems like the TravisCI tests are failing. I think this is due to the version of various packages not being pinned. As the packages change and evolve so will testing behaviour.
The following test is failing:
from decimal import Decimal

import pandas as pd
from pytest import raises
import datacompy

def test_decimal_with_joins():
    df1 = pd.DataFrame([{"a": Decimal("1"), "b": 2}, {"a": Decimal("2"), "b": 2}])
    df2 = pd.DataFrame([{"a": 1, "b": 2}, {"a": 2, "b": 2}])
    with raises(ValueError):
        compare = datacompy.Compare(df1, df2, "a")
I think in the past Pandas might have had different behaviour. Testing with Pandas 0.19.0 and Python 3.5, it still seems to NOT throw any error.
Join columns can have whitespace as well, so ideally the section below should be enhanced to check ignore_spaces and strip() string columns when it is true.
This is causing a complete mismatch of two dataframes here when one join column has a space:
if on_index and join_columns is not None:
    raise Exception("Only provide on_index or join_columns")
elif on_index:
    self.on_index = True
    self.join_columns = []
elif isinstance(join_columns, str):
    self.join_columns = [join_columns.lower()]
    self.on_index = False
else:
    self.join_columns = [col.lower() for col in join_columns]
    self.on_index = False
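The enhancement could normalize join-column values before merging; a sketch as a standalone helper (hypothetical, not datacompy's actual implementation):

```python
import pandas as pd

def strip_join_columns(df, join_columns):
    """Strip leading/trailing whitespace from string join columns.

    Sketch of a pre-merge normalization for ignore_spaces=True;
    hypothetical helper, not part of datacompy.
    """
    out = df.copy()
    for col in join_columns:
        if out[col].dtype == object:
            out[col] = out[col].str.strip()
    return out

df = pd.DataFrame({"key": ["foo ", " bar"], "value": [1, 2]})
print(strip_join_columns(df, ["key"])["key"].tolist())  # ['foo', 'bar']
```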
Might be nice to have an optional ignore-case feature, so that STRING == string.
Should be relatively simple to add. Happy to take it on.
@theianrobertson thoughts?
Since we target Python 3.5+ in our CI testing, we can make this explicit and commit to only supporting that version and higher. This will enable us to use type hinting in our code, which would have a few benefits:
This should probably best be done in concert with something like #13 as long as we're messing with the code, but since this will involve committing to supporting certain versions of Python, it should be a conscious choice. Python 3.4 reached EOL in March 2019 so I don't see a problem with explicitly targeting 3.5+.
Interested in using this to clean up some style and enforce that the style is consistent across everything...
https://github.com/pre-commit/pre-commit
https://github.com/ambv/black
Hi author, I am trying to use the SparkCompare functionality with the sample code that you have provided but it is throwing me this error:
Traceback (most recent call last):
File ".\spark.py", line 69, in <module>
comparison.report()
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 862, in report
self._print_row_matches_by_column(file)
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 689, in _print_row_matches_by_column
self._populate_columns_match_dict()
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in _populate_columns_match_dict
match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in <listcomp>
match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in helper
predicates = [F.col(c) == k.value for k in MatchType]
File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in <listcomp>
predicates = [F.col(c) == k.value for k in MatchType]
NameError: name 'F' is not defined
Any idea why is this happening? Thanks! :)
Need to dig into this a bit more, but this would probably mean moving docs to the gh-pages branch, and rebuilding every time there's a commit to master.
JavaScript library for DOM operations
Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.js
Path to vulnerable library: /datacompy/_static/jquery-3.4.1.js
Dependency Hierarchy:
Found in HEAD commit: dd34b569a2407940480bfd5f3df171741533b660
In jQuery versions greater than or equal to 1.2 and before 3.5.0, passing HTML from untrusted sources - even after sanitizing it - to one of jQuery's DOM manipulation methods (i.e. .html(), .append(), and others) may execute untrusted code. This problem is patched in jQuery 3.5.0.
Publish Date: 2020-04-29
URL: CVE-2020-11022
Base Score Metrics:
Type: Upgrade version
Origin: https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/
Release Date: 2020-04-29
Fix Resolution: jQuery - 3.5.0
Currently the CSV that's being written out has all the columns written out on a failed comparison, making it harder to read with a larger dataset.
This is more of a feature request, but the ability to have only the columns that didn't match written to the mismatch CSV would really help and speed up the QA process on our end.
One section of the output report is called "Sample Rows with Unequal Values". It prints each column with differences and then 10 rows of differences. However, I've noticed that this output changes with multiple runs: multiple calls of .report() will give you output in a different order. As an example, if you have two dataframes with the columns A, B, and C, all of which have differences, then one run may give you results in the order of B, C, A, the second in C, A, B, etc.
This isn't a huge issue; it's just cosmetic. But I do think the output would be cleaner if the ordering were deterministic. Moreover, it should probably match the ordering of the columns of the first dataframe, which is referred to as df1 in the code.
This section of the report is generated by looping through self.column_stats and generating sample rows of differences. self.column_stats is generated from the method _intersect_compare(), which loops through self.intersect_columns(). And finally, the issue is in self.intersect_columns(), which uses set (an unordered data structure) and does an intersection.
Lines 212 to 214 in 8d96429
I'm thinking the solution to this problem is use an ordered set. There are packages available on PyPi for this.
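An even lighter option than a third-party ordered set: intersect while preserving df1's column order with a list comprehension (the function name is mine, for illustration):

```python
def intersect_columns_ordered(df1_columns, df2_columns):
    """Columns common to both frames, in df1's order.

    Deterministic alternative to set(df1) & set(df2), whose
    iteration order is not stable across runs.
    """
    df2_set = set(df2_columns)
    return [col for col in df1_columns if col in df2_set]

print(intersect_columns_ordered(["a", "b", "c"], ["c", "a", "x"]))  # ['a', 'c']
```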
It appears that _validate_dataframe assumes that columns have string labels (it unconditionally calls .lower() on them), which raises AttributeError if columns instead have other types of labels (e.g. numerical):
In [18]: df = pd.DataFrame(np.random.randn(2), index=np.arange(0.0, 1.0, 0.5),
...: columns=[0.01])
In [19]: df
Out[19]:
0.01
0.0 -0.283165
0.5 -0.388745
In [20]: compare = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-20-d3440cd3839e> in <module>
----> 1 compare = datacompy.Compare(df, df, on_index=True)
lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
103
104 self._any_dupes = False
--> 105 self.df1 = df1
106 self.df2 = df2
107 self.df1_name = df1_name
lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
121 """Check that it is a dataframe and has the join columns"""
122 self._df1 = df1
--> 123 self._validate_dataframe("df1")
124
125 @property
lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
145 raise TypeError("{} must be a pandas DataFrame".format(index))
146
--> 147 dataframe.columns = [col.lower() for col in dataframe.columns]
148 # Check if join_columns are present in the dataframe
149 if not set(self.join_columns).issubset(set(dataframe.columns)):
lib/python3.6/site-packages/datacompy/core.py in <listcomp>(.0)
145 raise TypeError("{} must be a pandas DataFrame".format(index))
146
--> 147 dataframe.columns = [col.lower() for col in dataframe.columns]
148 # Check if join_columns are present in the dataframe
149 if not set(self.join_columns).issubset(set(dataframe.columns)):
AttributeError: 'float' object has no attribute 'lower'
Powerful data structures for data analysis, time series, and statistics
Library home page: https://files.pythonhosted.org/packages/db/83/7d4008ffc2988066ff37f6a0bb6d7b60822367dcb36ba5e39aa7801fda54/pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl
Path to dependency file: datacompy/requirements.txt
Path to vulnerable library: datacompy/requirements.txt
Dependency Hierarchy:
Found in HEAD commit: d321c1b49544384bde8893327ea768552b3f175d
** DISPUTED ** pandas through 1.0.3 can unserialize and execute commands from an untrusted file that is passed to the read_pickle() function, if reduce makes an os.system call. NOTE: third parties dispute this issue because the read_pickle() function is documented as unsafe and it is the user's responsibility to use the function in a secure manner.
Publish Date: 2020-05-15
URL: CVE-2020-13091
Base Score Metrics:
master / develop branching style. gh-pages as a separate branch vs on master.
@jborchma: Thoughts on this?
The output of the report under the section 'Sample Rows with Unequal Values' shows the same values for df1 and df2:
no site_id net_revenue (df1) net_revenue (df2)
46 TRVED01 63.234000 63.234000
49 TRVEGH01 236.445673 236.445673
In actuality it is:
no site_id net_revenue (Source) net_revenue (Target)
46 TRVED01 63.234 63.234000
49 TRVEGH01 236.4456732 236.445673
When two tables have the same schema but different data, this throws an exception.
This is in the API reference /api/core.html#datacompy.core.Compare.report
but should be more explicit in the usage docs too.
Hey @SanthiSridharan I think I need your help on this one too. Hoping to add:
Repo description: Pandas and Spark DataFrame comparison for humans
Tags: python, pandas, spark, data
Link: (Whatever the link is from #4)