erdogant / benfordslaw

benfordslaw is about the frequency distribution of leading digits.

Home Page: https://erdogant.github.io/benfordslaw

License: Other

Python 74.77% Shell 3.40% Jupyter Notebook 21.83%

benfords-law fraud-detection anomaly-detection distribution chi-square kolmogorov-smirnov

benfordslaw's Introduction

benfordslaw

benfordslaw is a Python package to test whether an empirical (observed) distribution differs significantly from a theoretical (expected, Benford's) distribution. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. Use this method to test whether a set of numbers may be artificial (or manipulated). If a set of values follows Benford's Law, then a model's predicted values for it should also follow Benford's Law. Normal (unmanipulated) data tends to follow Benford's Law, whereas manipulated or fraudulent data tends not to.
Assumptions of the data:
1. The numbers need to be random and not assigned, with no imposed minimums or maximums.
2. The numbers should cover several orders of magnitude.
3. The dataset should preferably contain at least 1000 samples, though Benford's law has been shown to hold for datasets containing as few as 50 numbers.
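
To make "the leading significant digit is likely to be small" concrete, here is a minimal sketch of the expected first-digit distribution, P(d) = log10(1 + 1/d). The function name `benford_first_digit` is made up for this illustration and is not part of the benfordslaw API.

```python
import math


def benford_first_digit():
    """Expected probability of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


expected = benford_first_digit()
for digit, p in expected.items():
    # Probabilities fall steadily from digit 1 to digit 9.
    print(f"{digit}: {p:.3f}")
```

Under this distribution a leading 1 is expected in roughly 30% of the values and a leading 9 in fewer than 5%.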

⭐️ Star this repo if you like it ⭐️

Install benfordslaw from PyPI

pip install benfordslaw

Import benfordslaw package

from benfordslaw import benfordslaw

Documentation pages

On the documentation pages you can find detailed information about how benfordslaw works, with many examples.

Examples

Example: Analyze first digit-distribution

Example: Analyze second digit-distribution

Example: Analyze last digit-distribution

Example: Analyze second last digit-distribution

References

Citation

Please cite benfordslaw in your publications if it is useful for your research (see citation).

Maintainers

Erdogan Taskesen, github: erdogant

Contribute

All kinds of contributions are welcome!
If you wish to buy me a coffee for this work, it is much appreciated :)

Licence

See LICENSE for details.

benfordslaw's People

Forkers

johnjboren andrewlane bomburr reayrtnygit jeffreyhorn jeyakumar-iopex sstm2 theskallywag seanahmad nandevers rohankumardubey gfreynoso mrrobotv22

benfordslaw's Issues

About digital numbers smaller than 1

Hi Erdogan,

My array goes like this: [-0.89, 1.18, 0.28, 0.0032, ...]. Most numbers in my array are between -1 and 1.

If I use your Benford method directly, I get many NaN values. To obtain correct results, I have to multiply my array by a large number, e.g., 1000 or 1e6. How can I make the method recognize that the first digit of 0.0032 is 3 instead of 0? Or do I have to multiply by a large number myself?

Thank you!
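
One way to get the first significant digit without rescaling the array by hand is to format each value in scientific notation, where the leading significant digit is always the first character. This is only a sketch of the idea, not how benfordslaw handles it; `first_significant_digit` is a hypothetical helper name.

```python
import math


def first_significant_digit(x):
    """Return the first non-zero digit of x, or None for zero and non-finite input."""
    if x == 0 or not math.isfinite(x):
        return None
    # '.15e' renders e.g. 0.0032 as '3.200000000000000e-03',
    # so the first character is the leading significant digit.
    return int(format(abs(x), ".15e")[0])


values = [-0.89, 1.18, 0.28, 0.0032]
print([first_significant_digit(v) for v in values])
```

The sign is ignored and zeros are skipped, since zero has no leading significant digit.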

Feature: 2nd digit bl

Hey,
It would be cool if you could add the 2nd-digit Benford's law, since it is more useful for analysing election fraud.

AttributeError: 'FigureCanvasTkAgg' object has no attribute 'set_window_title'

I get

File "/home/user/anaconda3/envs/environment/lib/python3.10/site-packages/benfordslaw/benfordslaw.py", line 186, in plot
    fig.canvas.set_window_title('Percentage First Digits')
AttributeError: 'FigureCanvasTkAgg' object has no attribute 'set_window_title'

with this code

bl = benfordslaw(alpha=0.05)
results = bl.fit(amounts)
bl.plot(title='Amounts of transactions - 1st digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)

Using benfordslaw v 1.2.0 and Python 3.10.4
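
As far as I know, matplotlib deprecated the canvas-level `set_window_title` in 3.4 and removed it in 3.6; the title is now set through `fig.canvas.manager`. A version-tolerant helper could look like the sketch below. The name `set_window_title` for the helper is my own, and this is not the fix that was applied in benfordslaw.

```python
def set_window_title(fig, title):
    """Set a figure's window title on both newer and older matplotlib versions."""
    manager = getattr(fig.canvas, "manager", None)
    if manager is not None and hasattr(manager, "set_window_title"):
        # Current location of the method: the figure manager.
        manager.set_window_title(title)
    elif hasattr(fig.canvas, "set_window_title"):
        # Fallback for old releases that still expose it on the canvas.
        fig.canvas.set_window_title(title)
    # Otherwise there is no window (or no manager), so do nothing.
```

With matplotlib installed, it would be called as `set_window_title(plt.gcf(), 'Percentage First Digits')`.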

comment says "Get only non-zero values " but code says > 1

benfordslaw/benfordslaw/benfordslaw.py

Lines 238 to 239 in ea5de2a

# Get only non-zero values
data = data[data>1]

Should we use
data = data[data>0]

or should the comment say that only values greater than 1 are kept?
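
To illustrate the difference between the two filters, here is a small sketch that uses plain list comprehensions as a stand-in for the NumPy boolean mask. The sample values are invented for the illustration.

```python
data = [0, 0.5, 1, 1.0, 2, 17, 250]

# Equivalent of data[data > 1]: drops zeros, fractions, and the value 1 itself.
kept_gt1 = [x for x in data if x > 1]

# Equivalent of data[data > 0]: drops only zeros (and negatives, if any).
kept_gt0 = [x for x in data if x > 0]

print(kept_gt1)
print(kept_gt0)
```

With `> 1`, every value equal to exactly 1 is discarded, which lowers the observed count for the leading digit 1.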

Failed to reproduce the examples

Hi! Thanks for developing benfordslaw. I'm a Python rookie, and I found that the examples in benfordslaw are not reproducible, even after I fixed the bug "AttributeError: 'numpy.ndarray' object has no attribute 'fillna'".

import benfordslaw
benfordslaw.__version__

'1.1.1'

##First digit test
from benfordslaw import benfordslaw
import pandas as pd

# Initialize
bl = benfordslaw(alpha=0.05)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

print(X)
# array([ 5387, 23618,  1710, ...,    16,    21,     0], dtype=int64)

#Add for "AttributeError: 'numpy.ndarray' object has no attribute 'fillna'"
X = pd.DataFrame(X)


# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Donald Trump')
----------------------------------------------------------------------------------------------------------------------------------
D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py:306: RuntimeWarning: invalid value encountered in double_scalars
  empirical_percentage = [(i / total_count) * 100 for i in empirical_counts]
D:\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:6766: RuntimeWarning: invalid value encountered in divide
  terms = (f_obs_float - f_exp)**2 / f_exp
posx and posy should be finite values
[benfordslaw] >Import dataset [USA]
[ 5387 23618  1710 ...    16    21     0]
[benfordslaw] >Analyzing digit position: [1]
[benfordslaw] >[chi2] No anomaly detected. P=nan, Tstat=nan
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values
posx and posy should be finite values

##Second digit test
from benfordslaw import benfordslaw

# Initialize
bl = benfordslaw(pos=2)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

#AttributeError: 'numpy.ndarray' object has no attribute 'fillna'
X = pd.DataFrame(X)

# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Results of Donald Trump based on 2nd digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)
[benfordslaw] >Import dataset [USA]
[benfordslaw] >Analyzing digit position: [2]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12096\648746591.py in <cell line: 17>()
     15 
     16 # Make fit
---> 17 results = bl.fit(X)
     18 
     19 # Plot

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in fit(self, X)
    101         if self.verbose>=3:
    102             print("[benfordslaw] >Analyzing digit position: [%s]" %(self.pos))
--> 103         counts_emp, percentage_emp, total_count, digit = _count_digit(X, self.pos, self.digit_range)
    104         # Expected counts
    105         counts_exp = self._get_expected_counts(total_count)

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in _count_digit(data, d, digit_range)
    292     Iloc = data>=np.power(10, d)
    293     Iloc = Iloc.fillna(False).astype(bool)
--> 294     digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
    295 
    296     # Count occurences. Make sure every position is for [1-9]

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in <lambda>(x)
    292     Iloc = data>=np.power(10, d)
    293     Iloc = Iloc.fillna(False).astype(bool)
--> 294     digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
    295 
    296     # Count occurences. Make sure every position is for [1-9]

IndexError: string index out of range

##Second last digit test
from benfordslaw import benfordslaw

# Initialize
bl = benfordslaw(pos=-2)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

#AttributeError: 'numpy.ndarray' object has no attribute 'fillna'
X = pd.DataFrame(X)

# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Results of Donald Trump based on 2nd digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)
[benfordslaw] >Import dataset [USA]
[benfordslaw] >Analyzing digit position: [-2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12096\3481008079.py in <cell line: 17>()
     15 
     16 # Make fit
---> 17 results = bl.fit(X)
     18 
     19 # Plot

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in fit(self, X)
    101         if self.verbose>=3:
    102             print("[benfordslaw] >Analyzing digit position: [%s]" %(self.pos))
--> 103         counts_emp, percentage_emp, total_count, digit = _count_digit(X, self.pos, self.digit_range)
    104         # Expected counts
    105         counts_exp = self._get_expected_counts(total_count)

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in _count_digit(data, d, digit_range)
    281     # Reverse numbers if last digits is required
    282     if d < 0:
--> 283         data = list(map(lambda x: x[::-1], data.astype(str)))
    284         data = np.array(data).astype(int)
    285         d = d * -1

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in <lambda>(x)
    281     # Reverse numbers if last digits is required
    282     if d < 0:
--> 283         data = list(map(lambda x: x[::-1], data.astype(str)))
    284         data = np.array(data).astype(int)
    285         d = d * -1

TypeError: 'int' object is not subscriptable
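
Both tracebacks above come from indexing into a number's digits. A minimal sketch of position-based digit extraction that tolerates short numbers and plain integers is shown below. `digit_at` is a hypothetical helper, not the package's `_count_digit`, and the vote counts are the few values visible in the printed array.

```python
def digit_at(value, pos):
    """Return the digit of an integer at a 1-based position.

    Positive pos counts from the left (1 = first digit); negative pos
    counts from the right (-1 = last digit). Returns None when the
    number has too few digits for the requested position.
    """
    if pos == 0:
        raise ValueError("pos must be non-zero")
    text = str(abs(int(value)))  # work on the digit string, sign removed
    if abs(pos) > len(text):
        return None
    index = pos - 1 if pos > 0 else pos
    return int(text[index])


votes = [5387, 23618, 1710, 16, 21, 0]
print([digit_at(v, 2) for v in votes])    # second digit
print([digit_at(v, -2) for v in votes])   # second-to-last digit
```

Returning None for numbers that are too short lets the caller drop them before counting, instead of raising an IndexError.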

benfordslaw.fit doesn't work on pandas Series with Int64Dtype (nullable)

Hi Erdogan,

I was recently working with a pandas DataFrame that had a column with an Int64Dtype, which is nullable. The column didn't actually have any null values. This gave me the following error:

  File "/usr/local/lib/python3.8/dist-packages/benfordslaw/benfordslaw.py", line 293, in _count_digit
    digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid `indices`

I looked into it, and it's because the nullable int also produces a nullable boolean Series. So the variable Iloc was actually a nullable boolean, which I guess isn't supported by numpy. See below for a small reproducible example.

import pandas 
from benfordslaw import benfordslaw

bl = benfordslaw(alpha=0.05)

data = pandas.DataFrame({'value': [1,2,3,4,5]})
bl.fit(data['value'].astype(int)) # this works fine 
bl.fit(data['value'].astype(pandas.Int64Dtype())) #this throws an error

I feel like something like this would solve it (not tested):

# Get the ith digit
digits = np.zeros_like(data)
Iloc = data>=np.power(10, d)
# ignore nulls and cast to non-nullable dtype just in case
Iloc = Iloc.fillna(False).astype(bool)
digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))

I wouldn't mind making a pull request with some test cases, but I'll leave it up to you. I can also imagine this is not a high priority, since I think the nullable IntDtype is still pretty experimental.

Kind regards,
Thomas

Incorrect determination nth digit distribution

The nth digit distribution is calculated incorrectly.

The empirical_counts are offset due to the assumption that digit_range is 1-9. However, for digit positions greater than 1, 0 is a valid value. Hence, digit_range should be 0-9 for positions > 1. Additionally, the nth digits are determined correctly and stored in digits[Iloc]. However, digits[Iloc] is not used for the empirical_counts; digits is used instead.

To make _count_digit work for every digit, line 303 should be:
empirical_counts[i - digit_range[0]] = list(digits[Iloc]).count(i)
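
The offset can be sketched in isolation as follows. This assumes the digits have already been extracted for the requested position; `count_digits` is a hypothetical name and the input digits are invented.

```python
from collections import Counter


def count_digits(digits, pos):
    """Count digit occurrences, using 1-9 for the first position and 0-9 otherwise."""
    digit_range = range(1, 10) if pos == 1 else range(0, 10)
    tally = Counter(digits)
    # Index i - digit_range[0] lines each count up with its digit.
    counts = [0] * len(digit_range)
    for i in digit_range:
        counts[i - digit_range[0]] = tally[i]
    return counts, list(digit_range)


second_digits = [3, 3, 7, 6, 1, 0, 0]
counts, labels = count_digits(second_digits, pos=2)
print(dict(zip(labels, counts)))
```

For the second position the result has ten bins, so zeros are counted rather than silently shifted or lost.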

Pos < 0 results in incorrect probabilities

benfordslaw() returns incorrect expected probabilities when pos < 0. This shows up clearly in the Example info: https://erdogant.github.io/benfordslaw/pages/html/Examples.html#last-digit-test.

The example where pos=-1 shows that the last digit is expected to be '1' about 30% of the time. That's not true. It appears to be using the formula for the first significant digit, instead of using convolution to find the probability of the n-th digit.

Recommended solution would be to restrict pos > 0.
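
For reference, the expected distribution of the n-th digit counted from the left can be computed with the generalized Benford formula, P(d) = sum over k from 10^(n-2) to 10^(n-1) - 1 of log10(1 + 1/(10k + d)). The sketch below uses a hypothetical function name and is not the package's implementation; it shows how quickly later positions approach a uniform 10% per digit, which is why a 30% expectation for a last digit looks wrong.

```python
import math


def benford_digit_probabilities(pos):
    """Expected digit probabilities at 1-based position pos, counted from the left."""
    if pos < 1:
        raise ValueError("pos must be >= 1")
    if pos == 1:
        # First digit: 0 is not possible.
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    lo, hi = 10 ** (pos - 2), 10 ** (pos - 1)
    # Sum over every possible prefix k of the preceding pos-1 digits.
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))
        for d in range(10)
    }


for pos in (1, 2, 4):
    probs = benford_digit_probabilities(pos)
    print(pos, {d: round(p, 4) for d, p in probs.items()})
```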

ChiSquare error being thrown: can an explanation be provided instead?

I am trying to use this library more or less as either a binary indicator of "benford or not" or a probability indicator of same. So any distribution should be possible to send into it. If the distribution is weird - then say "sorry, nope."

Instead consider:

import numpy as np
from benfordslaw import benfordslaw

bl = benfordslaw(alpha=0.05)
x = np.linspace(0, 1000, 1001)
x = np.append(x, [1, 1, 1, 1, 1, 1])
isben2 = bl.fit(x)
print(f"isben2 {isben2}")

Instead of a "nope" we get:

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.0009950248756218905

Note that even using x without the extra np.append() results in the same error. So what does this mean? Should I add my own code to catch that exception and then say "nope"? The problem with that is that we don't get any probability, and it is also unclear whether the exception was due to some other unexplained data problem.

FYI, the entire stack trace is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-8d38f0d591c2>", line 4, in <module>
    isben2 = bl.fit(x)
  File "/usr/local/lib/python3.9/site-packages/benfordslaw/benfordslaw.py", line 109, in fit
    tstats, Praw = chisquare(counts_emp, f_exp=counts_exp)
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6852, in chisquare
    return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis,
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
    raise ValueError(msg)
ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.0009950248756218905
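
The error message says the observed and expected counts must have the same sum. One way to guarantee that, sketched below with the standard library only, is to derive the expected counts from the total of the observed counts. This is an illustration under my own assumptions (zero is left out because it has no leading digit) and not necessarily how benfordslaw resolves it.

```python
import math

# Integers 1..1000 plus six extra ones, mirroring the data in this issue.
values = list(range(1, 1001)) + [1] * 6

# Observed counts of leading digits 1..9.
observed = [0] * 9
for v in values:
    observed[int(str(v)[0]) - 1] += 1

# Expected counts are scaled by the observed total, so both sums agree.
total = sum(observed)
expected = [total * math.log10(1 + 1 / d) for d in range(1, 10)]

# Pearson chi-square statistic computed by hand.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(observed, round(statistic, 1))
```

Because these integers are spread almost evenly across leading digits, the statistic is large, which is the "nope" answer the issue asks for rather than an exception.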

Sample data appears to not be available from pip installation

How can I use the "bl.import_example()" feature? The sample/test code provided doesn't work if you pip install benfordslaw (I haven't tried cloning this repo to see if it works then).