
benfordslaw's Introduction

benfordslaw


  • benfordslaw is a Python package that tests whether an empirical (observed) distribution differs significantly from a theoretical (expected, Benford's) distribution. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. Use this method to test whether a set of numbers may be artificial (or manipulated). If a set of values follows Benford's law, then a model's predictions for those values should also follow Benford's law. Normal (unmanipulated) data tends to follow Benford's law, whereas manipulated or fraudulent data often does not.

  • Assumptions of the data:

    1. The numbers need to be random and not assigned, with no imposed minimums or maximums.
    2. The numbers should cover several orders of magnitude.
    3. The dataset should preferably contain at least 1000 samples, although Benford's law has been shown to hold for datasets containing as few as 50 numbers.
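
To make the expected distribution concrete: the first-digit form of Benford's law assigns probability log10(1 + 1/d) to leading digit d. The sketch below is plain Python, not the package's own code; the helper names `benford_first_digit_probs` and `orders_of_magnitude` and the sample values are illustrative assumptions and are not part of benfordslaw.

```python
import math


def benford_first_digit_probs():
    """Expected probability of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def orders_of_magnitude(values):
    """Rough span of the data in powers of ten, ignoring zeros and signs."""
    magnitudes = [abs(v) for v in values if v != 0]
    return math.log10(max(magnitudes) / min(magnitudes))


probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.3f}")

# Hypothetical sample, used only to illustrate assumption 2.
sample = [12, 340, 5600, 78000, 9]
print(f"orders of magnitude spanned: {orders_of_magnitude(sample):.1f}")
```

A span of only one or two orders of magnitude would suggest that assumption 2 is not met, in which case a Benford test is unlikely to be informative.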

⭐️ Star this repo if you like it ⭐️

Install benfordslaw from PyPI

pip install benfordslaw

Import benfordslaw package

from benfordslaw import benfordslaw

The documentation pages contain detailed information about how benfordslaw works, with many examples.


Examples

References

Citation

Please cite benfordslaw in your publications if it is useful for your research (see citation).

Maintainers

Contribute

  • All kinds of contributions are welcome!
  • If you wish to buy me a coffee for this work, it is much appreciated :)

Licence

See LICENSE for details.

benfordslaw's People

Contributors

andrewlane, erdogant, gfreynoso

benfordslaw's Issues

About decimal numbers smaller than 1

Hi Erdogan,

My array looks like this: [-0.89, 1.18, 0.28, 0.0032, ...]. Most numbers in my array are between -1 and 1.

If I use your Benford method directly, I get many NaN values. To obtain correct results, I have to multiply my array by a large number, e.g., 1000 or 1e6. How can I make the method recognize that the first digit of 0.0032 is 3 instead of 0? Or do I have to multiply by a large number myself?

Thank you!
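
One way to get the first significant digit of values between -1 and 1 without rescaling is to read it from the decimal representation. This is a minimal sketch in plain Python; `first_significant_digit` is a hypothetical helper, not a benfordslaw function, and whether the package handles such input natively may depend on the installed version.

```python
import math
from decimal import Decimal


def first_significant_digit(x):
    """Return the first non-zero digit of x, or None for 0, NaN and infinities."""
    x = float(x)
    if x == 0 or not math.isfinite(x):
        return None
    # Decimal(str(...)) exposes the decimal digits as Python prints them,
    # so 0.0032 yields the digits (3, 2) regardless of sign or scale.
    digits = Decimal(str(abs(x))).as_tuple().digits
    return next(d for d in digits if d != 0)


values = [-0.89, 1.18, 0.28, 0.0032]
print([first_significant_digit(v) for v in values])  # [8, 1, 2, 3]
```

Multiplying the whole array by a constant power of ten, as described in the question, leaves the first significant digit unchanged, so that workaround is equivalent.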

Feature: 2nd digit bl

Hey,
It would be cool if you could add the second-digit Benford's law, since it is more useful for analysing election fraud.
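
For reference, the expected second-digit distribution follows from the first-digit law by summing over every possible leading digit: P(d2) = sum over d1 = 1..9 of log10(1 + 1/(10*d1 + d2)). The sketch below is plain Python with a hypothetical helper name and does not show how benfordslaw implements it.

```python
import math


def benford_second_digit_probs():
    """Expected probability of each second digit 0-9 under Benford's law."""
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }


second = benford_second_digit_probs()
for digit, p in second.items():
    print(f"{digit}: {p:.4f}")
```

The second digit can be 0, and its distribution is much flatter than the first-digit one: it runs from roughly 12% for 0 down to roughly 8.5% for 9.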

AttributeError: 'FigureCanvasTkAgg' object has no attribute 'set_window_title'

I get

File "/home/user/anaconda3/envs/environment/lib/python3.10/site-packages/benfordslaw/benfordslaw.py", line 186, in plot
    fig.canvas.set_window_title('Percentage First Digits')
AttributeError: 'FigureCanvasTkAgg' object has no attribute 'set_window_title'

with this code

bl = benfordslaw(alpha=0.05)
results = bl.fit(amounts)
bl.plot(title='Amounts of transactions - 1st digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)

Using benfordslaw v 1.2.0 and Python 3.10.4
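
Matplotlib deprecated `fig.canvas.set_window_title` in version 3.4 in favour of `fig.canvas.manager.set_window_title`, and later releases removed the old method, which matches the error above. The following is a minimal sketch of a version-tolerant call; `set_figure_window_title` is a hypothetical helper, not part of benfordslaw, and the stand-in objects exist only so the example runs without a GUI backend.

```python
from types import SimpleNamespace


def set_figure_window_title(fig, title):
    """Set the window title via whichever API the installed Matplotlib offers."""
    manager = getattr(fig.canvas, "manager", None)
    if manager is not None and hasattr(manager, "set_window_title"):
        manager.set_window_title(title)  # newer Matplotlib
        return "manager"
    if hasattr(fig.canvas, "set_window_title"):
        fig.canvas.set_window_title(title)  # older Matplotlib
        return "canvas"
    return None  # nothing to set the title on


# Stand-in objects that mimic the two API shapes.
titles = []
new_style = SimpleNamespace(
    canvas=SimpleNamespace(manager=SimpleNamespace(set_window_title=titles.append))
)
old_style = SimpleNamespace(canvas=SimpleNamespace(set_window_title=titles.append))
print(set_figure_window_title(new_style, "Percentage First Digits"))
print(set_figure_window_title(old_style, "Percentage First Digits"))
```

With a real figure, the same helper would be called on the figure object; until the package is patched, pinning an older Matplotlib is another workaround.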

Failed to reproduce the examples

Hi! Thanks for developing benfordslaw. I'm a Python rookie, and I found that the examples in benfordslaw are not reproducible, even after working around the bug "AttributeError: 'numpy.ndarray' object has no attribute 'fillna'".

import benfordslaw
benfordslaw.__version__

'1.1.1'

##First digit test
from benfordslaw import benfordslaw
import pandas as pd

# Initialize
bl = benfordslaw(alpha=0.05)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

print(X)
# array([ 5387, 23618,  1710, ...,    16,    21,     0], dtype=int64)

#Add for "AttributeError: 'numpy.ndarray' object has no attribute 'fillna'"
X = pd.DataFrame(X)


# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Donald Trump')
----------------------------------------------------------------------------------------------------------------------------------
D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py:306: RuntimeWarning: invalid value encountered in double_scalars
  empirical_percentage = [(i / total_count) * 100 for i in empirical_counts]
D:\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:6766: RuntimeWarning: invalid value encountered in divide
  terms = (f_obs_float - f_exp)**2 / f_exp
posx and posy should be finite values
[benfordslaw] >Import dataset [USA]
[ 5387 23618  1710 ...    16    21     0]
[benfordslaw] >Analyzing digit position: [1]
[benfordslaw] >[chi2] No anomaly detected. P=nan, Tstat=nan
posx and posy should be finite values
[previous line repeated 16 more times]

[Image: Percentage_First_Digits]

##Second digit test
from benfordslaw import benfordslaw

# Initialize
bl = benfordslaw(pos=2)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

#AttributeError: 'numpy.ndarray' object has no attribute 'fillna'
X = pd.DataFrame(X)

# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Results of Donald Trump based on 2nd digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)
[benfordslaw] >Import dataset [USA]
[benfordslaw] >Analyzing digit position: [2]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12096\648746591.py in <cell line: 17>()
     15 
     16 # Make fit
---> 17 results = bl.fit(X)
     18 
     19 # Plot

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in fit(self, X)
    101         if self.verbose>=3:
    102             print("[benfordslaw] >Analyzing digit position: [%s]" %(self.pos))
--> 103         counts_emp, percentage_emp, total_count, digit = _count_digit(X, self.pos, self.digit_range)
    104         # Expected counts
    105         counts_exp = self._get_expected_counts(total_count)

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in _count_digit(data, d, digit_range)
    292     Iloc = data>=np.power(10, d)
    293     Iloc = Iloc.fillna(False).astype(bool)
--> 294     digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
    295 
    296     # Count occurences. Make sure every position is for [1-9]

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in <lambda>(x)
    292     Iloc = data>=np.power(10, d)
    293     Iloc = Iloc.fillna(False).astype(bool)
--> 294     digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
    295 
    296     # Count occurences. Make sure every position is for [1-9]

IndexError: string index out of range

##Second last digit test
from benfordslaw import benfordslaw

# Initialize
bl = benfordslaw(pos=-2)

# Load elections example
df = bl.import_example(data='USA')

# Extract election information.
X = df['votes'].loc[df['candidate']=='Donald Trump'].values

#AttributeError: 'numpy.ndarray' object has no attribute 'fillna'
X = pd.DataFrame(X)

# Make fit
results = bl.fit(X)

# Plot
bl.plot(title='Results of Donald Trump based on 2nd digit', barcolor=[0.5, 0.5, 0.5], fontsize=12, barwidth=0.4)
[benfordslaw] >Import dataset [USA]
[benfordslaw] >Analyzing digit position: [-2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12096\3481008079.py in <cell line: 17>()
     15 
     16 # Make fit
---> 17 results = bl.fit(X)
     18 
     19 # Plot

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in fit(self, X)
    101         if self.verbose>=3:
    102             print("[benfordslaw] >Analyzing digit position: [%s]" %(self.pos))
--> 103         counts_emp, percentage_emp, total_count, digit = _count_digit(X, self.pos, self.digit_range)
    104         # Expected counts
    105         counts_exp = self._get_expected_counts(total_count)

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in _count_digit(data, d, digit_range)
    281     # Reverse numbers if last digits is required
    282     if d < 0:
--> 283         data = list(map(lambda x: x[::-1], data.astype(str)))
    284         data = np.array(data).astype(int)
    285         d = d * -1

D:\anaconda3\lib\site-packages\benfordslaw\benfordslaw.py in <lambda>(x)
    281     # Reverse numbers if last digits is required
    282     if d < 0:
--> 283         data = list(map(lambda x: x[::-1], data.astype(str)))
    284         data = np.array(data).astype(int)
    285         d = d * -1

TypeError: 'int' object is not subscriptable

benfordslaw.fit doesn't work on pandas Series with Int64Dtype (nullable)

Hi Erdogan,

I was recently working with a pandas DataFrame that had a column with an Int64Dtype, which is nullable. The column didn't actually have any null values. This gave me the following error:

  File "/usr/local/lib/python3.8/dist-packages/benfordslaw/benfordslaw.py", line 293, in _count_digit
    digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid `indices`

I looked into it, and the cause is that the nullable int also produces a nullable boolean Series. The variable Iloc was therefore a nullable boolean, which I guess isn't supported by numpy. See below for a small reproducible example.

import pandas 
from benfordslaw import benfordslaw

bl = benfordslaw(alpha=0.05)

data = pandas.DataFrame({'value': [1,2,3,4,5]})
bl.fit(data['value'].astype(int)) # this works fine 
bl.fit(data['value'].astype(pandas.Int64Dtype())) #this throws an error

I feel like something like this would solve it (not tested):

# Get the ith digit
digits = np.zeros_like(data)
Iloc = data>=np.power(10, d)
# ignore nulls and cast to non-nullable dtype just in case
Iloc = Iloc.fillna(False).astype(bool)
digits[Iloc] = list(map(lambda x: int(str(x)[d]), data[Iloc]))

I wouldn't mind making a pull request with some test cases, but I'll leave it up to you. I can also imagine this is not a high priority, since I think the nullable Int64Dtype is still pretty experimental.

Kind regards,
Thomas

Incorrect determination nth digit distribution

The nth-digit distribution is calculated incorrectly.

empirical_counts is offset because digit_range is assumed to be 1-9. However, for digit positions beyond the first, 0 is a valid value, so digit_range should be 0-9 for those positions. Additionally, the nth digits are determined correctly and stored in digits[Iloc], but empirical_counts is computed from digits rather than from digits[Iloc].

To make _count_digit work for every digit, line 303 should be:
empirical_counts[i - digit_range[0]] = list(digits[Iloc]).count(i)
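
The counting rule described in this report can be sketched independently of the library as follows. This is plain Python, not the package's `_count_digit`; `count_nth_digit` is a hypothetical name, and the sample values echo the vote counts printed in the election example earlier on this page.

```python
def count_nth_digit(values, pos):
    """Count the digit at 1-based position `pos` for each integer in `values`."""
    if pos < 1:
        raise ValueError("pos must be 1 or greater")
    # The leading digit is never 0; every later position can be 0-9.
    digit_range = range(1, 10) if pos == 1 else range(0, 10)
    counts = {d: 0 for d in digit_range}
    for value in values:
        text = str(abs(int(value)))
        # Skip zeros and numbers that are too short to have this position.
        if text == "0" or len(text) < pos:
            continue
        counts[int(text[pos - 1])] += 1
    return counts


votes = [5387, 23618, 1710, 16, 21, 0]
print(count_nth_digit(votes, 1))
print(count_nth_digit(votes, 2))
```

Only values that actually have a digit at the requested position are counted, which also avoids indexing past the end of short numbers.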

Pos < 0 results in incorrect probabilities

benfordslaw() returns incorrect expected probabilities when pos < 0. This shows up clearly in the Example info: https://erdogant.github.io/benfordslaw/pages/html/Examples.html#last-digit-test.

The Example where pos=-1 shows that the last digit is expected to be '1' about 30% of the time. That's not true. It appears to be using the formula for the first significant digit, instead of using convolution to find the probability of the n-th digit.

The recommended solution is to restrict pos to values greater than 0.
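
To illustrate why 30% cannot be right for a late digit: under Benford's law the probability of digit d at position n (n >= 2) is obtained by summing log10(1 + 1/(10*k + d)) over every possible block k of n-1 leading digits, and the result flattens towards 10% per digit as n grows. The sketch below is plain Python with a hypothetical helper name, not the package's code.

```python
import math


def benford_nth_digit_probs(n):
    """Expected probability of each digit at 1-based position n under Benford's law."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    if n == 1:
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    # Every possible block of n-1 leading digits: 10**(n-2) .. 10**(n-1) - 1.
    prefixes = range(10 ** (n - 2), 10 ** (n - 1))
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in prefixes)
        for d in range(10)
    }


for n in (1, 2, 3, 4):
    probs = benford_nth_digit_probs(n)
    spread = max(probs.values()) - min(probs.values())
    print(f"position {n}: max - min probability = {spread:.4f}")
```

By the fourth position the spread between the most and least likely digit is already well under 0.1 percentage points, so for numbers with several digits the last digit is expected to be close to uniform.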

ChiSquare error being thrown: can an explanation be provided instead?

I am trying to use this library more or less as either a binary indicator of "Benford or not" or a probability indicator of the same, so it should be possible to pass in any distribution. If the distribution is weird, it should simply say "sorry, nope."

Instead consider:

bl = benfordslaw(alpha=0.05)
x = np.linspace(0,1000,1001)
x = np.append(x,[1,1,1,1,1,1,])
isben2 = bl.fit(x)
print(f"isben2 {isben2}")

Instead of a "nope" we get:

ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.0009950248756218905

Note that even using x without the extra np.append() results in the same error. So what does this mean? Should I add my own code to catch that exception and then say "nope"? The problem with that is that we don't get any probability, and it is also unclear whether the exception was due to some other unexplained data problem.

FYI, the entire stack trace is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-8d38f0d591c2>", line 4, in <module>
    isben2 = bl.fit(x)
  File "/usr/local/lib/python3.9/site-packages/benfordslaw/benfordslaw.py", line 109, in fit
    tstats, Praw = chisquare(counts_emp, f_exp=counts_exp)
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6852, in chisquare
    return power_divergence(f_obs, f_exp=f_exp, ddof=ddof, axis=axis,
  File "/usr/local/lib/python3.9/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
    raise ValueError(msg)
ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
0.0009950248756218905
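
The error message says that the observed and expected counts passed to the chi-square test do not sum to the same total. One plausible reading, which the traceback does not confirm, is that the expected counts were derived from a different total than the observed ones, for example because the value 0 has no leading digit. The sketch below is plain Python, not the package's code and not scipy: it scales the expected counts by the observed total so the sums always agree, and it uses the closed-form chi-square survival function for 8 degrees of freedom. `benford_chi2` is a hypothetical helper.

```python
import math


def benford_chi2(observed_counts):
    """Chi-square test of first-digit counts (digits 1-9) against Benford's law."""
    total = sum(observed_counts)
    # Scale the expected counts by the observed total so both sums agree.
    expected = [total * math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    # Survival function of chi-square with 8 degrees of freedom (closed form
    # for even degrees of freedom): exp(-x/2) * sum_{i<4} (x/2)**i / i!
    half = stat / 2
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))
    return stat, p_value


# First digits of 0..1000 plus six extra 1s, as in the report; 0 is skipped.
values = list(range(0, 1001)) + [1] * 6
counts = [0] * 9
for v in values:
    if v > 0:
        counts[int(str(v)[0]) - 1] += 1

stat, p = benford_chi2(counts)
print(f"chi2={stat:.1f}, p={p:.3g}, Benford-like={p >= 0.05}")
```

For this evenly spaced input the test gives a clear "nope" with a very small p-value, without raising an exception.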
