cjph8914 / 2020_benfords


Jupyter Notebook 100.00%


2020_benfords's Issues

2016 and 2012 comparisons

It would be interesting to see these same districts compared to the 2016 results and the 2012 results (where an incumbent was running).

I need a solution. I forgot my iCloud passcode, tried to guess it many times, and now the device is activation-locked. Help!

Starting and working on new repository

https://github.com/bd271828/2020_election

Milwaukee ward sizes are small and there is a highly preferred candidate

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between a 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See my simulations for further details: https://rpubs.com/frycast/687633

Edit: Not just too small, but too concentrated. They do not span many orders of magnitude.

Edit 2: The thread below becomes distracted by an effort to look into election data anomalies that are not directly related to this issue. My intention here is not to develop a fraud detection tool, but to highlight the major flaws in the one being used, which various news sources are currently touting as evidence of fraud. This issue is still open, and should be resolved by at least adding some comments to the README clarifying that the pattern observed in Milwaukee can arise in election data without any fraud. Hopefully the owner of this popular repository, and the people involved in this thread, are all interested in acting in good faith and will focus on resolving the issue.
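
The point about concentrated ward sizes can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not the simulation linked above. The ward-size ranges are made-up illustrations (not the Milwaukee data), `reachable_first_digits` is a hypothetical helper, and only the 60% to 80% preference range comes from the issue text.

```python
import math

def reachable_first_digits(min_size, max_size, lo_share, hi_share):
    """First digits a candidate's expected vote count can take.

    Ignores sampling noise: it only looks at expected counts, from the
    smallest ward at the lowest share to the largest ward at the highest.
    """
    lo = max(round(min_size * lo_share), 1)
    hi = round(max_size * hi_share)
    digits = sorted({int(str(v)[0]) for v in range(lo, hi + 1)})
    span = math.log10(hi / lo)  # orders of magnitude covered by the counts
    return digits, span

# Hypothetical concentrated wards (assumed sizes, not real data).
narrow_digits, narrow_span = reachable_first_digits(400, 1100, 0.60, 0.80)

# Hypothetical voting units spanning several orders of magnitude.
wide_digits, wide_span = reachable_first_digits(10, 100_000, 0.60, 0.80)

print("narrow:", narrow_digits, round(narrow_span, 2))
print("wide:  ", wide_digits, round(wide_span, 2))
```

Under these assumed sizes the expected counts cover less than one order of magnitude and never start with 1 or 9, so a Benford-shaped first-digit histogram cannot appear regardless of how honest the count is.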

Reach out to the voter integrity project

Hey, can you please reach out to me on this Discord server? This is very compelling, and we have a way to get things in front of the campaign quickly. This is very damning.

https://discord.gg/TYmxZkFq

I'm CoderKing.

Chicago data source

The data source URL in the Chicago_Wards_Precincts_Benfords_Data.ipynb (https://chicagoelections.gov/en/election-results-specifics.asp) actually led me to a website where the only data available for download is 2018 Primary - DEM 3/20/18 which didn't seem like the one used in the notebook. Am I missing anything here?

Strongest Evidence there was election fraud.

If this is too far off base from the data, delete it. I came here searching for the truth, and I'm convinced there was voter fraud, a claim I do not make lightly. I am not an attorney, and I recommend everyone read the entire docket below.

Source:
PA Courts - I recommend everyone read all of this.
https://www.courtlistener.com/docket/18618673/donald-j-trump-for-president-inc-v-boockvar/

Issue:
One of Trump's claims concerned the right to cure ballots, stating that curing happened illegally under the laws of the state of PA.

Proof:
The parties opposing the case have filed affidavits from DEMOCRATIC VOTERS CONFIRMING Trump's claim that votes were systematically interfered with at scale throughout PA. I had to read this 5 times. This is likely one of the dumbest things I have seen in a court of law.

Source:
https://www.courtlistener.com/docket/18618673/donald-j-trump-for-president-inc-v-boockvar/
Via re 30 MOTION to Intervene filed by Joseph Ayeni, Black Political Empowerment Project, Common Cause Pennsylvania, Lucia Gajda, Stephanie Higgins, Meril Lara, League of Women Voters of Pennsylvania, Ricardo Morales, NAACP Pennsylvania State Conference, Natalie Price, Tim Stevens, Taylor Stover. They have submitted affidavits against Trump's motion stating they actually broke the law and even named specific parties like the DNC.
See the exhibits on 31 Nov 10, 2020.

Allegheny, PA absentee votes, second digit

I made (and corrected) a quick analysis of second digits for absentee votes only in Allegheny, PA.

Looking at vote counts instead of first digits shows why this is not evidence of fraud

I suggest you plot histograms of the vote counts per precinct, along with the histograms of the first digits. You will see immediately that these are not evidence of voter fraud, or even examples of data that should obey Benford's Law.

Instead, what you will see is that counties like Allegheny were chosen, where Biden almost always got more than 200 votes per precinct, and Trump did not. So it superficially looks like Biden has a "shortage" of the digit 1. But, in fact, counts that are roughly normally distributed like this should not be expected to obey Benford's Law, even approximately.
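
A side-by-side view of the two histograms can be sketched in plain Python. The vote counts below are hypothetical, chosen only to mimic a candidate who almost always gets more than 200 votes per precinct; they are not taken from the Allegheny data, and the helper names are illustrative.

```python
from collections import Counter

# Hypothetical per-precinct vote counts for one candidate (illustrative only).
precinct_votes = [212, 245, 260, 301, 318, 333, 350, 367,
                  402, 415, 450, 489, 510, 530, 610]

def histogram(values, bin_width):
    """Count values per bin, keyed by the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

def first_digit_counts(values):
    """Count leading decimal digits, skipping zero counts."""
    return Counter(int(str(v)[0]) for v in values if v > 0)

by_hundreds = histogram(precinct_votes, 100)
by_digit = first_digit_counts(precinct_votes)

# Text histogram of the raw counts, in bins of 100 votes.
for edge in sorted(by_hundreds):
    print(f"{edge:>4}-{edge + 99:<4} {'#' * by_hundreds[edge]}")

# Text histogram of the first digits.
for digit in range(1, 10):
    print(f"digit {digit}: {'#' * by_digit[digit]}")
```

When every count has three digits, the first-digit histogram is just the count histogram in bins of 100, so a missing digit 1 only says that no precinct fell between 100 and 199 votes.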

Time series analysis?

Via Twitter: someone on 4chan, of all places, scraped time series from NYT and analyzed GOP/Dem vote share drift in vote count deltas over time. That is, with each update, what's the percentage of the update for Biden vs Trump (or at least that's how I understand it). Outside the turbulent period where in-person votes are counted, mail-in tended to drift slightly towards GOP over time in uncontested areas (both GOP and Dem), which they explain away as rural votes taking longer to arrive, and steeply towards Dem in contested ones.

Given where this comes from, take it with a massive grain of salt - I can't vouch for veracity of the data. I'm attaching the CSV if anyone wants to verify/take a look.

2020_election_time_series.zip

Not linking the Twitter thread here, as it launches into several conspiracies which we here can neither confirm nor deny.

Add a Way to Simulate an Election With Given Parameters

It would be nice to be able to run simulated elections and then see whether the results of those elections (which we know are fair) conform to Benford's Law, for either the 1st digit or the 2nd digit. I have a quick and dirty start here: https://github.com/snex/election_results_benford/blob/master/sim.rb
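
For readers who prefer Python to the Ruby script, here is a rough sketch of the same idea; it is not a port of `sim.rb`. The precinct-size ranges, vote shares, and seed are all assumptions picked for illustration, and every ballot is treated as an independent coin flip, which real elections are not.

```python
import random
from collections import Counter

def simulate_votes(precinct_sizes, share, rng):
    """Votes for one candidate: each ballot goes to them with probability `share`."""
    return [sum(rng.random() < share for _ in range(size)) for size in precinct_sizes]

def first_digit_freq(counts):
    """Relative frequency of leading digits 1-9, ignoring zero counts."""
    digits = [int(str(c)[0]) for c in counts if c > 0]
    tally = Counter(digits)
    return {d: tally[d] / len(digits) for d in range(1, 10)}

rng = random.Random(2020)  # fixed seed so the run is repeatable

# Assumption: precinct sizes spread log-uniformly over three orders of magnitude.
wide_sizes = [int(10 ** rng.uniform(1, 4)) for _ in range(1000)]
# Assumption: planned precincts of similar size.
narrow_sizes = [rng.randint(600, 900) for _ in range(1000)]

wide = first_digit_freq(simulate_votes(wide_sizes, 0.5, rng))
narrow = first_digit_freq(simulate_votes(narrow_sizes, 0.7, rng))

print("wide:  ", {d: round(f, 3) for d, f in wide.items()})
print("narrow:", {d: round(f, 3) for d, f in narrow.items()})
```

Both simulated elections are fair by construction, so any difference between the two first-digit profiles comes from the precinct sizes and vote share alone.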

Why is the Biden Election Day vote data nearly Gaussian?

The weirdest thing to me about all of this is that the Election Day vote distribution for Biden is almost perfectly Normal, with a slight right skew.

Whereas the Trump data, being heavy-tailed, just looks more like real-world data to me.

Any thoughts?

Data availability

Maybe I completely missed it, but what exactly are the source links for the data?

Paper suggests that even second-digit analysis cannot be used

Please refer to chapter 2 in the following paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.697.5592&rep=rep1&type=pdf

The paper suggests precinct results in previous elections in a number of countries do not seem to follow the second-digit Benford distribution.

Let me try to outline why this does not hold for second digits either. If you have precincts in cities designed so that the votes for a certain candidate follow a chi squared distribution with an expected value of 5000 and a certain deviation, then the most likely result is 5000 (2nd digit: 0). The second most likely results are 4999 and 5001 (2nd digits: 9 and 0). The third most likely results are 4998 and 5002 (2nd digits: 9 and 0). Etc. (Edit: I got this wrong the first time.)

On the other hand, for a Benford distribution, the most likely result is 1. The second most likely result is 2. The third most likely result is 3. Etc.

Hence, using second digits does not fix the problem with planned precinct sizes. We can perhaps see from the example how Benford's Law will only work if the expected value of the distribution is 0. With rational planning of precinct sizes inside cities, that won't happen. Countryside precincts are more likely to follow the Benford pattern, as the number of votes in each precinct will be more "organically" determined and less planned.

It thus seems that the methodology cannot be applied inside cities.
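
The concentration of second digits around a planned precinct size can be made concrete with a short Python sketch. It assumes a discretised normal curve centred on 5000 with a standard deviation of 40; both the shape and the spread are illustrative choices rather than anything stated in the issue, which names a different distribution. For reference, Benford's second-digit law expects each digit to appear between roughly 8.5% (digit 9) and 12% (digit 0) of the time.

```python
import math
from collections import defaultdict

def second_digit_mass(mean, sd, half_width):
    """Probability mass per second digit for counts near `mean`.

    Assumption: counts follow a discretised normal curve truncated at
    mean +/- half_width; any single peak around 5000 behaves similarly.
    """
    weights = {
        n: math.exp(-0.5 * ((n - mean) / sd) ** 2)
        for n in range(mean - half_width, mean + half_width + 1)
    }
    total = sum(weights.values())
    mass = defaultdict(float)
    for n, w in weights.items():
        mass[int(str(n)[1])] += w / total  # second character of a 4-digit count
    return dict(mass)

mass = second_digit_mass(mean=5000, sd=40, half_width=200)
for digit in sorted(mass):
    print(digit, round(mass[digit], 3))
```

Under these assumptions nearly all of the mass lands on the second digits 9 and 0, with the digits 3 to 7 never appearing at all.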

For numbers summed to 100%, there can be no such thing that only one member doesn't follow Benford's rule.

Keep calm and learn math.

P.S.: Don't use leading digits in these situations. It's misleading.

Plans for expansion?

Are there plans to expand this project to include every county/division/voting unit in the country? I think there's value in it, and it would hash out some of the concerns in Issue #5.

I'd also suggest including documentation about where the data can be obtained as well.

If there's interest in carrying out a similar analysis in R I could devote some time.

1,2,3 digit Benford plots for Texas (2020 to 1992 presidential elections) using county aggregated counts

https://github.com/bd271828/2020_election/tree/main/plot/tx

Analyzing the second leading digit makes all the results look non-conforming

This is really astonishing, and I want to make sure I didn't make some kind of simple mistake. I'm new to Pandas.

I took your notebook & data, and made the changes on Kaggle: https://www.kaggle.com/dogweather/allegheny-cty-benford-s

I suspect I'm not applying Benford's Law correctly. I.e., it doesn't apply to simply the second digit being a 2, but rather e.g. the number starting with 12.
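
The two readings can be told apart numerically. In the standard formulation, the second-digit test looks at the second digit on its own, and its expected probability is the sum over all nine possible first digits; "starts with 12" is a separate, much rarer event. A small Python sketch follows; the function names are illustrative and do not come from the notebook.

```python
import math

def second_digit(n):
    """Second decimal digit of an integer, or None if it has only one digit."""
    s = str(abs(int(n)))
    return int(s[1]) if len(s) >= 2 else None

def benford_second_digit():
    """Expected second-digit probabilities: sum over every possible first digit."""
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }

def benford_leading_pair(pair):
    """Probability that a number starts with the two digits `pair` (10-99)."""
    return math.log10(1 + 1 / pair)

expected = benford_second_digit()
print({d: round(p, 4) for d, p in expected.items()})
print(round(benford_leading_pair(12), 4))
```

The second-digit distribution is much flatter than the first-digit one, so visual deviations from it are harder to judge by eye.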

Understanding the plot

Can someone help me to understand the plots? I understand Benford law but what does the frequency mean in the plots? I know it's the frequency of some numbers/data of the vote but what exactly are these numbers? Where do they come from? Thanks!

You might want to quantify the deviation of the data from Benford's law for each graph
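
Two common ways to put a number on the deviation are Pearson's chi-square statistic and the mean absolute deviation between observed and expected proportions. The sketch below is one possible way to compute them, not something the notebooks do; the digit counts are hypothetical. With nine digits there are 8 degrees of freedom, for which the 5% critical value of chi-square is about 15.51.

```python
import math

# Benford's expected first-digit proportions for digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square(observed):
    """Pearson chi-square statistic of nine first-digit counts against Benford."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, BENFORD))

def mean_abs_deviation(observed):
    """Mean absolute gap between observed and Benford proportions."""
    n = sum(observed)
    return sum(abs(o / n - p) for o, p in zip(observed, BENFORD)) / 9

# Hypothetical counts for digits 1..9 (not taken from the notebooks).
close_to_benford = [301, 176, 125, 97, 79, 67, 58, 51, 46]
flat = [112, 111, 111, 111, 111, 111, 111, 111, 111]

print(chi_square(close_to_benford), mean_abs_deviation(close_to_benford))
print(chi_square(flat), mean_abs_deviation(flat))
```

A large statistic only says the digits do not follow Benford's law; as other issues here point out, that alone does not say why.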

Breakdown by vote type

I am doing my own similar analysis on this, and I've noticed that if you break results down by vote type, by far the most non-conforming dataset is the Joe Biden ELECTION DAY results, rather than absentee or mail-in results. Please add a breakdown by vote type as well as by total. I have uploaded my datasets and code here: https://github.com/snex/election_results_benford

What is the typical variance for applications of Benford's law?

My question is: does Benford's law have high variance when observing any individual dataset? This could explain any abnormalities or possibly strengthen a claim that this dataset appears irregular.
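
One rough way to gauge the expected noise: if the counts really were independent draws from Benford's law (an idealised assumption that election data may not satisfy), the observed share of each digit would have a binomial standard error of sqrt(p(1 - p) / n). A minimal Python sketch, with a hypothetical helper name:

```python
import math

def benford_standard_errors(n_units):
    """Expected proportion and sampling standard error per first digit.

    Assumes the n_units counts are independent draws from Benford's law.
    """
    result = {}
    for d in range(1, 10):
        p = math.log10(1 + 1 / d)
        result[d] = (p, math.sqrt(p * (1 - p) / n_units))
    return result

for n in (100, 1000):
    p, se = benford_standard_errors(n)[1]
    print(f"n={n}: digit 1 expected {p:.3f}, standard error {se:.3f}")
```

The standard error shrinks with the square root of the number of precincts, so small datasets can wander several percentage points from the Benford curve by chance alone.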

According to my research second-digit tests are more reliable for detecting election fraud

See: http://www-personal.umich.edu/~wmebane/pm06.ps, https://www.degruyter.com/view/journals/jbnst/231/5-6/article-p719.xml and https://www.researchgate.net/publication/275305550_Comment_on_Benford's_Law_and_the_Detection_of_Election_Fraud

Registered voters & number of ballots

Very Cool Project. Could you also plot the distribution of leading digits in the number of registered voters and number of ballots per voting district?

In cities where the votes are heavily skewed to one candidate or another, the distribution of leading digits of votes for that candidate should be highly correlated to the leading digit of the number of ballots. Would number of registered voters and number of ballots follow Benford's law?

Data from multiple providers?

Has there been any comparison of results for the race with data from different providers? E.g, compare NYT Edison vs Clarity Elections for Georgia or Fulton County

For what it's worth, I created a simple project to download data from NYT, perhaps it could one day be an option for this repo's analysis:

https://github.com/tomdotcash/election_data

Code to mass process the data

I combined the toolkit I've been building for the last few days with your data and some of the processing code. Not sure if this is something you'd be interested in, but I can put together a PR to add the script to the repo. You can take a look here, it's not as pretty as your code, but it is tested and works. Results have been compared with other benford py libraries. Here's the link: https://github.com/FraudAnalysis/Benford2020/blob/main/Analysis.ipynb

Some more ideas based on infamous examples from other countries

https://projecteuclid.org/euclid.aoas/1458909907
Paper describing various statistical fingerprints for fraudulent elections. Focus and proof of applicability based on several Russian elections.

Does anybody have a dataset known to be fraudulent?

I want to experiment with changing data to base 8 or base 16, multiplying, etc. I was wondering if anybody knows of a dataset that is known to have been manipulated, so I can tell whether the fraud gets obscured or holds.

Election Fraud Data Check of Dr. Shiva

We have made requests to get the data sets to check the following video. If and when we get the data, I will post it here.

See the video below.
https://www.pscp.tv/w/1BdGYYjgkgQGX

For comparison...

Here is someone doing this slightly more elaborately: https://probablydance.com/2020/11/08/looking-for-voter-fraud-in-old-elections-with-data-visualization/

2BL test more appropriate for election fraud

Just wanted to throw in my 2 cents, considering this GitHub repo is being used by some people on social media to advance their agenda. I'm probably gonna run 2BL tests unless someone already has. Just wanted to throw out this article also regarding the fallacy of using first-digit BL tests on election data:

http://www-personal.umich.edu/~wmebane/inapB.pdf

The average ward in Milwaukee has 750 votes, how would Biden have 100-200 in 30% of wards?

This repo's use of Benford's Law is so misleading that it discredits other claims of fraud more generally.

If you pull the data from Milwaukee city, the average ward has 755 votes. Biden wins an average of 595 votes per ward. Obviously if this is true, his first-digit distribution is going to be skewed towards the 4, 5, 6 range.
Only 20.5% of wards had over 1,000 votes, and 2.1% of wards had between 100-200 votes. These are the only wards where Biden would even have a chance to get to 1___ votes.
It's laughably easy to produce these kinds of anomalies with political data. 65.6% of main-party candidates in the 2018 House elections had a vote total starting with 1. Massive fraud? No, it's because the average congressional district had 264 thousand votes, and in most races one or both of the candidates had 100,000-something votes.

2020 Milwaukee Data
2018 House Election Data

If the size of the district you're looking at is the same across many different races, the results will skew towards something completely different from Benford's Law, absent any fraud whatsoever. Thus, these results from Milwaukee or any other place provide no evidence of election fraud.

Research suggests Benford is unreliable in Election Fraud Detection

Nice analysis. However, I wanted to point you to a few articles that may be of interest to you. Essentially the research suggests Benford's is unreliable when applied to election data:

https://repository.library.georgetown.edu/handle/10822/557850

https://www.jstor.org/stable/23011436?seq=1

https://courses.math.tufts.edu/math19/duchin/dmo.pdf

https://www.cambridge.org/core/journals/political-analysis/article/benfords-law-and-the-detection-of-election-fraud/3B1D64E822371C461AF3C61CE91AAF6D

Florida follows Benford's law

https://dos.myflorida.com/elections/data-statistics/elections-data/precinct-level-election-results/

[onishin@pump florida]$ grep -i Trump *.txt | awk '{print substr($NF,1,1)}' | sort | uniq -c | sort -n
257 9
280 8
301 0
308 7
359 6
429 5
614 4
754 3
1111 2
1739 1
[onishin@pump florida]$ grep -i Biden *.txt | awk '{print substr($NF,1,1)}' | sort | uniq -c | sort -n
191 9
203 8
246 7
261 6
281 5
283 0
474 4
808 3
1430 2
1975 1
[onishin@pump florida]$
Looks clean to me
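
To compare these tallies with Benford's law directly, the counts can be turned into proportions. The Python sketch below copies the numbers from the shell output above and leaves out the "0" rows, since Benford's law only covers leading digits 1 to 9 (presumably those rows are precincts with zero votes, but that is an assumption). The helper name is illustrative.

```python
import math

# First-character tallies copied from the shell output above ("0" rows omitted).
trump = {1: 1739, 2: 1111, 3: 754, 4: 614, 5: 429, 6: 359, 7: 308, 8: 280, 9: 257}
biden = {1: 1975, 2: 1430, 3: 808, 4: 474, 5: 281, 6: 261, 7: 246, 8: 203, 9: 191}

def compare_to_benford(tally):
    """Return {digit: (observed share, Benford share)} for digits 1-9."""
    total = sum(tally.values())
    return {d: (tally[d] / total, math.log10(1 + 1 / d)) for d in range(1, 10)}

for name, tally in (("Trump", trump), ("Biden", biden)):
    print(name)
    for d, (obs, exp) in compare_to_benford(tally).items():
        print(f"  {d}: observed {obs:.3f}  Benford {exp:.3f}")
```

Statewide precinct counts span more orders of magnitude than the wards of a single city, which is the setting where a first-digit comparison is more meaningful.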

Benford's law for election data needs to use second-digit analysis

Benford's first-digit analysis is intended to be used on data with several orders of magnitude, and hundreds of votes per precinct over hundreds of counties is not sufficient. For detecting voter fraud, you need to use the second-digit analysis.

The data presented in this project do not properly apply the law and are misleading.

https://repository.library.georgetown.edu/bitstream/handle/10822/557850/Brown_georgetown_0076M_11716.pdf
https://www.cambridge.org/core/journals/political-analysis/article/benfords-law-and-the-detection-of-election-fraud/3B1D64E822371C461AF3C61CE91AAF6D
https://en.wikipedia.org/wiki/Benford%27s_law#Election_data

Analyze same data with different base values

Given that, with Benford's law:

Clarity should (?) improve as the spread of values increases
The law still holds for any given base

it would be interesting to cross-reference the same input data sets using bases other than 10.

Something like

np.log(1 + 1/digit) / np.log(X) * N

where X ranges over the bases, say [2..16], and digit runs from 1 to X - 1.

So if you generated a range of graphs across the same data with different bases and eyeballed the result, that may give more confidence that the data is abnormal if it's also abnormal for the majority of numeric bases.

Alternatively, it may show that an apparent anomaly is not as bad as it looks. Maybe?
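
A minimal Python sketch of the base-changing idea, using only the standard library. The function names and the sample counts are hypothetical; the expected share for digit d in base b is log_b(1 + 1/d).

```python
import math

def leading_digit(n, base=10):
    """Most significant digit of positive integer n written in `base`."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    while n >= base:
        n //= base
    return n

def benford_expected(base=10):
    """Expected leading-digit probabilities in `base`: log_base(1 + 1/d)."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}

def leading_digit_freq(counts, base=10):
    """Observed leading-digit proportions of the positive counts in `base`."""
    positive = [c for c in counts if c > 0]
    freq = {d: 0.0 for d in range(1, base)}
    for c in positive:
        freq[leading_digit(c, base)] += 1 / len(positive)
    return freq

# Hypothetical vote counts, only to show the calls.
sample = [212, 245, 301, 367, 415, 489, 530, 610]
for base in (8, 10, 16):
    print(base, leading_digit_freq(sample, base), benford_expected(base))
```

Very small bases carry little information (in base 2 the leading digit is always 1), and changing the base does not widen the spread of the underlying counts, so data that is too concentrated for base 10 may stay problematic in other bases.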

This is super interesting, you got me reading up on stats all over again :) Thanks.

Failure to account for external factors

Covid caused the mail-in voting rate to rise.
Counties counted votes at different rates due to the surplus of mail-in ballots vs the standard rate using an electronic system.
Mail-in voting was discouraged by the Republican candidate. As a result, one side was more likely to vote in person using such a system at a polling location.

Don't get me wrong; this looks well written. However, it could do with a PR adding such notices. I'd be happy to contribute one to the README.