michelenuijten / statcheck
A spellchecker for statistics
It's reasonably common for HTML to write the degrees of freedom as a subscript without parentheses.
When reading the text, the formatting is lost, so it appears without the parentheses:
Make the parentheses optional. Simply adding a ? after the parentheses in the regexes is easy, but the code that determines the statistical test relies on the opening parenthesis.
test_that("t-tests without parentheses are retrieved from text", {
  txt1 <- "t 28 = 2.20, p = .03"
  txt2 <- "t28 = 2.20, p = .03"
  result <- statcheck(c(txt1, txt2), messages = FALSE)
  expect_equal(nrow(result), 2)
})
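A rough sketch of what optional parentheses could look like, in base R. The pattern rgx_t is hypothetical and is not one of statcheck's own regexes. Since the opening parenthesis can no longer be relied on to find the test type, this version anchors on the letter t at a word boundary and captures the df and the test value as groups.

```r
# Hypothetical pattern (not statcheck's regex): a t statistic whose degrees
# of freedom may or may not be wrapped in parentheses.
rgx_t <- "\\bt\\s*\\(?\\s*(\\d+\\.?\\d*)\\s*\\)?\\s*=\\s*(-?\\d*\\.?\\d+)"

txt <- c("t(28) = 2.20, p = .03",
         "t 28 = 2.20, p = .03",
         "t28 = 2.20, p = .03")

# regexec() returns the full match followed by the capture groups
m <- regmatches(txt, regexec(rgx_t, txt, perl = TRUE))

df    <- vapply(m, function(x) as.numeric(x[2]), numeric(1))
value <- vapply(m, function(x) as.numeric(x[3]), numeric(1))
```

All three spellings yield the same df and test value here, which is the behaviour the test above asks for.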
I would love to be able to consider all p-values from a paper and see for which ones tests could be extracted / which ones can be flagged as problematic. Currently, I can get the tests or AllPValues, but they are hard to match. Could there be an option to augment rather than replace the standard output?
Is the statcheck logo available for reuse? If so, could you add it to the repo (.svg?) and specify the license under which you make it available (CC 0 please? :-)).
I'd like to use it for statcheck-extension, which I am just starting on.
Should there be a comma separating the value and the "p"? Yes.
Do some authors use a semicolon instead? Also yes.
Look at final character here
Some languages use the comma as a decimal separator. For instance, in some journals written in Spanish, it is recommended that results should be written as... "F(1, 19) = 4,44, p = 0,048". I was not able to extract results from such papers using statcheck. It would be nice if this could be somehow considered.
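Until something like this is supported, one possible workaround is to convert decimal commas before passing the text to statcheck. The helper below is a sketch under an assumption: it only touches a comma that sits inside a number directly following =, < or >, so the comma between the two degrees of freedom is left alone. The function name is made up for illustration.

```r
# Hypothetical helper: turn a decimal comma into a decimal point, but only in
# a number that directly follows a comparison operator (=, <, >).
decimal_comma_to_point <- function(x) {
  gsub("([=<>]\\s*-?\\d*),(\\d+)", "\\1.\\2", x, perl = TRUE)
}

converted <- decimal_comma_to_point("F(1, 19) = 4,44, p = 0,048")
# converted is "F(1, 19) = 4.44, p = 0.048"
```

Text that uses the comma as a thousands separator after an operator (for example "N = 1,000") would be converted incorrectly, so this is only safe on sources known to use decimal commas.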
Thanks for this interesting project.
I'm wondering what the release plan is for version 1.4. There are prereleases for 1.4 https://github.com/MicheleNuijten/statcheck/releases, but no release yet.
Another thing I'm wondering about is the stale (?) develop branch. What is your plan with this branch? Merge it into master? https://github.com/MicheleNuijten/statcheck/network. There are interesting features in there regarding PDF parsing with pdftools. It might be better to abandon the develop branch after merging it into master.
see this test:
test_that("t-values with a weird minus sign and a space do not result in errors", {
txt1 <- " t(553) = − 4.46, p < .0001" # this is an em dash or something
expect_output(statcheck(txt1, messages = FALSE), "did not find any results")
})
Not sure why I decided that these cases should be ignored. It seems reasonable to include them.
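If these cases are to be included, one option is to normalise the text before extraction. The sketch below is an assumption about how that could be done, not existing statcheck code: it maps a few minus-like Unicode characters to the ASCII hyphen-minus and removes the space between the sign and the number. The function name and the choice of code points are illustrative.

```r
# Hypothetical pre-processing step: normalise minus-like characters and
# close the gap between the sign and the number.
normalize_minus <- function(x) {
  # U+2212 minus sign, U+2013 en dash, U+2014 em dash -> ASCII "-"
  x <- gsub("[\u2212\u2013\u2014]", "-", x, perl = TRUE)
  # "= - 4.46" -> "= -4.46" (only directly after a comparison operator)
  gsub("([=<>])\\s*-\\s+(?=[\\d.])", "\\1 -", x, perl = TRUE)
}

cleaned <- normalize_minus(" t(553) = \u2212 4.46, p < .0001")
# cleaned is " t(553) = -4.46, p < .0001"
```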
Hi Michele,
Statcheck doesn't seem to be able to recognise the strings below. Could you please advise what I might be doing wrong? Thanks!
statcheck("F(1,210) = 0, p = 1")
statcheck("F(1,210) = 1, p = 0")
When scanning an entire folder of html & pdf articles, allow for choosing the pdf reader
This PDF file
10.1111:apps.12362.pdf
fails with
Error in if (grepl(pattern = RGX_Q, x = test_raw)) { :
the condition has length > 1
This is because the chisq tests get read as follows:
a good model fit (2 (199) = 627.73, p < .001, CFI = .94, RMSEA = .07, SRMR = .05), and [...] loading on one factor (2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and the one-factor model with all items loading on one common factor (2 (209) = 4489.05, p < .001, CFI = .40, RMSEA = .20, SRMR = .17).
This is really odd xpdf behaviour, because I can copy-paste them from the PDF without trouble, so they seem to be embedded as characters rather than images.
So, two questions here:
could not process "(2 (199) = 627.73"
then trouble-shooting would be much easier?
a good model fit (χ 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07, SRMR = .05), and [...] loading on one factor (χ 2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and\nthe one-factor model with all items loading on one common factor (χ 2 (209) = 4489.05,\np < .001, CFI = .40, RMSEA = .20, SRMR = .17).
(Getting this to work requires two minor pre-processing steps:
pdftools::pdf_text(f) |> paste(collapse = "") |> gsub("\n", "", x = _) |> statcheck:::extract_stats("chisq")
)
I get an error that is new to me. I pasted it below. I also attached the text for which I got this. You can read it into R with read.table; I was using statcheck() when this error occurred.
ERROR
Extracting statistics...
|=========================================================================| 100%
Error in if (any(DecisionErrorAlphas)) { :
missing value where TRUE/FALSE needed
Alternative to xpdf which does not require installation of separate software: https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen
StatCheck returns p = 1 for a spearman rho test:
https://pubpeer.com/publications/482004022406F33A920A732DC12DCC#fb99015
Apologies if this was fixed in the StatCheck update.
One more wish: could messages = FALSE also suppress the "statcheck did not find any results" message? Alternatively, could this be delivered as a message() rather than with cat()? The cat() output is quite difficult to suppress in a loop ...
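As a stopgap on the caller's side, output written with cat() can be swallowed with capture.output(), while suppressMessages() takes care of message(). The wrapper below is a small sketch of that idea; the name quiet is made up, and it says nothing about how statcheck itself should emit the text.

```r
# Hypothetical wrapper: evaluate an expression, discard anything it writes
# via cat()/print() and any message(), and return its value.
quiet <- function(expr) {
  # capture.output() diverts stdout; the assignment runs in this function's frame
  invisible(capture.output(result <- suppressMessages(expr)))
  result
}

# Stand-in for a call that prints with cat() and also signals a message
noisy <- function() {
  cat("did not find any results\n")
  message("some progress message")
  42
}

printed <- capture.output(val <- quiet(noisy()))
# val is 42 and printed is empty
```

In a loop this would be used as res <- quiet(statcheck(txt)), assuming the value returned by statcheck() is what is needed.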
Line breaks in a (badly?) converted PDF file can cause a test result to be missed. It may be worthwhile to remove newlines (\n) in addition to the space removal already used in the statcheck function.
A reproducible example is:
statcheck("F(1, 45) = .12, p = .58 and F(2, 165)\n = .001, p = .96")
Michèle, if you agree I can do this sometime soon.
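For reference, the kind of pre-processing meant here can be sketched in one line of base R. This is an illustration, not the code in the statcheck function: it collapses every run of whitespace, including newlines and tabs, into a single space. The function name is hypothetical.

```r
# Hypothetical helper: collapse any run of whitespace (spaces, tabs,
# newlines) into a single space before extraction.
collapse_whitespace <- function(x) {
  gsub("\\s+", " ", x, perl = TRUE)
}

fixed <- collapse_whitespace("F(1, 45) = .12, p = .58 and F(2, 165)\n = .001, p = .96")
# fixed is "F(1, 45) = .12, p = .58 and F(2, 165) = .001, p = .96"
```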
After scanning a directory with 70 files in it, I got this message:
There were 50 or more warnings (use warnings() to see the first 50)
The output of warnings() is attached.
warnings.txt
If you run statcheck(NA), you get an error. It should return a warning or message instead.
Error message:
Error in readChar(file(fileName), file.info(fileName)$size, useBytes = TRUE) :
cannot open the connection
In addition: Warning message:
In readChar(file(fileName), file.info(fileName)$size, useBytes = TRUE) :
cannot open file 'C:/Users/Nick/Desktop/html/ML1.12 Math = male, me = female, therefore math ? me - ProQuest.html': Invalid argument
Filename is "ML1.12 Math = male, me = female, therefore math ≠ me - ProQuest.html"
The character causing a problem seems to be "≠".
Original article: https://www.ncbi.nlm.nih.gov/pubmed/12088131
For a CLI program, it would be nice to know which element of the input vector each statistic was extracted from (e.g. the line number).
I implemented an example program here (please forgive my poor R)
Example output format would be:
filename.org:27:23: info: F(1,132) = 5.59, p = 0.019
filename.org:28:24: info: F(1,132) = 8.96, p = 0.003
filename.org:38:8: info: F(1,130) = 4.86, p = 0.029
filename.org:39:9: error: The expected value is 0.043 (0.0426781658095173)
filename.org:40:2: info: F(1,130) = 7.41, p = 0.007
filename.org:54:2: error: The expected value is 0.019 (0.0189133318829514)
filename.org:54:42: error: The expected value is 0.007 (0.00737627921418102)
filename.org:56:26: error: The expected value is 0.011 (0.0112664797423938)
filename.org:60:16: info: F(1,132) = 5.59, p = 0.02
This can be used inside emacs with flycheck like this:
(flycheck-define-checker statscheck
  "A linter for statistics."
  :command ("statscheck" source)
  :error-patterns
  ((error line-start (file-name) ":" line ":" column ": error: "
          (message) line-end)
   (info line-start (file-name) ":" line ":" column ": info: "
         (message) line-end))
  :modes (text-mode markdown-mode org-mode))
(add-to-list 'flycheck-checkers 'statscheck)
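To illustrate where the line and column in that output format could come from, here is a base R sketch that does not call statcheck at all. The regex is a simplified, hypothetical stand-in for statcheck's patterns (it only covers t, F, r and z with parenthesised df), and locate_stats is a made-up name; the point is only that gregexpr() gives the match position within each input line.

```r
# Hypothetical locator: report file:line:column for each APA-style result.
# The pattern is a simplified stand-in, not statcheck's own regex.
locate_stats <- function(lines, filename = "input.txt") {
  rgx <- paste0("[tFrz]\\s*\\(\\s*\\d+(\\s*,\\s*\\d+)?\\s*\\)\\s*=\\s*-?\\d*\\.?\\d+",
                "\\s*,\\s*p\\s*[=<>]\\s*\\d*\\.?\\d+")
  hits  <- gregexpr(rgx, lines, perl = TRUE)
  found <- regmatches(lines, hits)
  out <- character(0)
  for (i in seq_along(lines)) {
    if (length(found[[i]]) == 0) next          # no result on this line
    cols <- as.integer(hits[[i]])              # 1-based start columns
    out <- c(out, sprintf("%s:%d:%d: info: %s", filename, i, cols, found[[i]]))
  }
  out
}

report <- locate_stats(c("No statistics on this line.",
                         "We found F(1,132) = 5.59, p = 0.019 overall."))
# report is "input.txt:2:10: info: F(1,132) = 5.59, p = 0.019"
```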
The library tcltk is sometimes problematic on Mac. Also, there doesn't seem to be a specific reason to choose this library over the base R functions.
If you scan a folder that has both a PDF version and an HTML version of the same file, they will get the same source name in the final result. This seems undesirable.
See https://dev.w3.org/html5/html-author/charref for a list. Some tags are already included in file-to-txt.R, but not all variations.
Based on the pubpeer reports a bug was found in v1.0.1 where some correlations are incorrectly extracted as Chi2. It is based on this paper and the Pubpeer comments are available here.
It might be worthwhile to make these a use case for testing purposes. But definitely something worth looking into. Attached is a csv of the results from statcheck for this paper.
issue.txt
Currently, devtools::install_github("MicheleNuijten/statcheck") fails due to a malformed package version. Apparently, 1.4.1-beta.1 is not acceptable there ...
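A quick way to see why: R package versions must consist only of integers separated by "." or "-", so a label such as "beta" cannot appear. The check below uses base R's package_version() with strict = FALSE, which returns NA for an invalid string instead of raising an error. The helper name is made up for illustration.

```r
# Hypothetical helper: TRUE if a string is a valid R package version.
is_valid_pkg_version <- function(v) {
  !is.na(package_version(v, strict = FALSE))
}

valid <- is_valid_pkg_version(c("1.4.1-beta.1", "1.4.1.9000", "1.4.1-1"))
# valid is FALSE TRUE TRUE
```

A development suffix such as .9000 is one conventional way to mark a pre-release while staying within the allowed format.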
I saw something in the statcheck scrape that might be a bug. I think generic tests may get scraped as t-tests.
For example, in this paper, the authors write "Friedman's test (15) = 62.92", and statcheck scrapes it as "t(15) = 62.92".
I don't know much about Friedman's test (or nonparametric tests in general), but it seems to use its own Q-statistic that is closer to a chi-square distribution.
I don't think this is a common situation, of course, but if the regexp could be tweaked to avoid mistaking "...test (df)" for "t(df)" it would improve the specificity of the statcheck program.
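One possible tweak, sketched with simplified patterns rather than statcheck's real ones: a negative lookbehind that refuses a "t" directly preceded by another letter. Both rgx_loose and rgx_strict are hypothetical.

```r
# Simplified, hypothetical patterns (not statcheck's own):
rgx_loose  <- "t\\s*\\(\\s*\\d+\\s*\\)\\s*="               # any "t" before "(df) ="
rgx_strict <- "(?<![A-Za-z])t\\s*\\(\\s*\\d+\\s*\\)\\s*="  # "t" not preceded by a letter

txt <- c("Friedman's test (15) = 62.92",
         "t(15) = 2.92",
         "a t (15) = 2.92")

loose  <- grepl(rgx_loose,  txt, perl = TRUE)  # TRUE  TRUE TRUE
strict <- grepl(rgx_strict, txt, perl = TRUE)  # FALSE TRUE TRUE
```

The strict version no longer picks up the final "t" of "test", while a free-standing t is still matched.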
e.g., χ2(5) = 231.24, p = 5.81 × 10−48
When a filename is very long, it can't be opened (for some reason). This causes statcheck to throw an error. It would be better to throw an informative message instead, so that when you scan an entire folder of papers and one has a file name that's too long, you just skip the long file and still scan the rest.
Error message:
Importing HTML files...
|== | 2%Error in readChar(con, file.info(fileName)$size, useBytes = TRUE) :
cannot open the connection
In addition: Warning message:
In readChar(con, file.info(fileName)$size, useBytes = TRUE) :
cannot open file 'C:/Users/mnuijten/surfdrive/UVT/Projects/EffectivenessStatcheck/effectiveness_statcheck/articles/PS/2013/A Longitudinal Cluster-Randomized Controlled Study on the Accumulating Effects of Individualized Literacy Instruction on Students’ Reading From First Through Third Grade.htm': No such file or directory
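The skip-and-continue behaviour could look roughly like the sketch below. It is not statcheck's import code: read_files_safely and its default reader are hypothetical, and the only point is that wrapping each read in tryCatch() turns a failed file into a message instead of aborting the whole folder scan.

```r
# Hypothetical sketch: read each file, report the ones that cannot be
# opened, and carry on with the rest.
read_files_safely <- function(paths,
                              reader = function(p) paste(readLines(p, warn = FALSE),
                                                         collapse = "\n")) {
  out <- lapply(paths, function(p) {
    tryCatch(
      suppressWarnings(reader(p)),
      error = function(e) {
        message("Skipping '", p, "': ", conditionMessage(e))
        NULL
      }
    )
  })
  names(out) <- paths
  Filter(Negate(is.null), out)   # keep only the files that were read
}

ok_file <- tempfile(fileext = ".html")
writeLines("t(10) = 3, p = .009", ok_file)
missing_file <- file.path(tempdir(), "this-file-does-not-exist.html")

texts <- read_files_safely(c(ok_file, missing_file))
# texts has one element, named after ok_file
```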
Greetings from the SIPS conference where we are having an error checking discussion - and all very appreciative of statcheck. Given that confidence intervals are usually produced from standard errors, they can be calculated based on p-value and sample size. Could statcheck add a test of those, given that they have become highly recommended parts of the APA guidelines?
Prepare for release:
git pull
usethis::use_github_links()
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
str(statcheck("t(10) = 3, p = .009", alpha = .01))
Classes ‘statcheck’ and 'data.frame': 1 obs. of 15 variables:
$ Source : Factor w/ 1 level "1": 1
$ Statistic : Factor w/ 1 level "t": 1
$ df1 : logi NA
$ df2 : num 10
$ Test.Comparison : Factor w/ 1 level "=": 1
$ Value : num 3
$ Reported.Comparison: Factor w/ 1 level "=": 1
$ Reported.P.Value : num 0.009
$ Computed : num 0.0133
$ Raw : Factor w/ 1 level "t(10) = 3, p = .009": 1
$ Error : logi FALSE
$ DecisionError : logi FALSE
$ OneTail : logi TRUE
$ OneTailedInTxt : logi FALSE
$ APAfactor : num 1
I think I found a bug in how statcheck() diagnoses errors in one-sided t-tests. Specifically, when the sample mean is in the "wrong" tail it is inappropriate to calculate the one-tailed p-value by halving the two-sided p-value, as is done by statcheck().
mean(sleep$extra[sleep$group == 1])
# [1] 0.75
mean(sleep$extra[sleep$group == 2])
# [1] 2.33
t.test(
sleep$extra[sleep$group == 1]
, sleep$extra[sleep$group == 2]
, var.equal = T
)
# Two Sample t-test
#
# data: sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
# t = -1.8608, df = 18, p-value = 0.07919
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -3.363874 0.203874
# sample estimates:
# mean of x mean of y
# 0.75 2.33
t.test(
sleep$extra[sleep$group == 1]
, sleep$extra[sleep$group == 2]
, alternative = "greater"
, var.equal = T
)
# Two Sample t-test
#
# data: sleep$extra[sleep$group == 1] and sleep$extra[sleep$group == 2]
# t = -1.8608, df = 18, p-value = 0.9604
# alternative hypothesis: true difference in means is greater than 0
# 95 percent confidence interval:
# -3.052378 Inf
# sample estimates:
# mean of x mean of y
# 0.75 2.33
Hence, statcheck incorrectly flags the correct p-value (.960) as an error, whereas the p-value that is incorrect in this case (.039) is deemed correct.
statcheck:::statcheck("t(18) = -1.86, p = 0.960", OneTailedTests = TRUE)[, c("Reported.P.Value", "Computed", "Error", "OneTail")]
# Reported.P.Value Computed Error OneTail
# 1 0.96 0.03965356 TRUE FALSE
statcheck:::statcheck("t(18) = -1.86, p = 0.039", OneTailedTests = TRUE)[, c("Reported.P.Value", "Computed", "Error", "OneTail")]
# Reported.P.Value Computed Error OneTail
# 1 0.039 0.03965356 FALSE FALSE
Since statcheck() can't know what the tested hypothesis is, it should probably always consider both possibilities and err on the side of caution?
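To make "both possibilities" concrete, the three candidate p-values for the example above can be computed directly from the t value and df with pt(). This is plain base R, not statcheck code.

```r
# Candidate p-values for t(18) = -1.86
t_val <- -1.86
df    <- 18

p_two     <- 2 * pt(abs(t_val), df, lower.tail = FALSE)  # two-sided
p_less    <- pt(t_val, df, lower.tail = TRUE)            # one-sided, H1: difference < 0
p_greater <- pt(t_val, df, lower.tail = FALSE)           # one-sided, H1: difference > 0

round(c(two_sided = p_two, less = p_less, greater = p_greater), 3)
```

Halving the two-sided value reproduces only the smaller of the two one-sided p-values (about .040 here); the other one is its complement (about .960), which is the value reported in the example.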
When converting some older PDFs, I've encountered a couple of character recognition errors that I think could be addressed with some updated regex:
UI/UX could use a quick boost by displaying ASCII characters for parsed Chi-Squared tests. Currently displays something along the lines of "2 (1) = 3.3, p = 0.07" in the output table. Thanks!
Errors are caused if there is a space between two numbers after a test statistic and before a decimal in a reported statistical test result. (Scanned several thousand papers and this only occurred once so it's unlikely to pop up too often!)
Examples:
statcheck::statcheck(" z = 1 1 .25, p = .806. ")
#> Extracting statistics...
#>
|
| | 0%
|
|=================================================================| 100%
#> Error in if (lower[i] < 0) {: missing value where TRUE/FALSE needed
statcheck::statcheck("t(123) = 1 0.25, p = .806")
#> Extracting statistics...
#>
|
| | 0%
|
|=================================================================| 100%
#> Error in if (lower[i] < 0) {: missing value where TRUE/FALSE needed
Created on 2019-11-28 by the reprex package (v0.2.1)
It would be nice to have a dataframe variable saying whether the parsed "formula" is in valid APA style.
A simple way to achieve this would be to match $Raw against the valid-APA regex.
We are currently doing an error-detection hackathon (related to the ERROR project) ... and were wondering whether you'd be interested in having statcheck extended to (HTML) tables ... or whether that would work better as a separate extension package? Would be great to hear your thoughts ...
Statcheck checks if functions like checkHTML() work by scanning "test articles". These are not synced with git, because of copyright issues. That means that if you download statcheck from GitHub, you will fail a lot of tests, because there are no test articles. Skip these tests if the articles are not there (maybe with a printed message).
Sorting is messed up where source 10 appears before source 2.
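This is the usual symptom of sorting numbers stored as text. A minimal illustration, assuming the source labels are character strings (or factor levels) that hold numbers:

```r
src <- c("1", "10", "2")

alphabetical <- sort(src)                    # "1" "10" "2"
numeric_sort <- src[order(as.numeric(src))]  # "1" "2" "10"
```

If the sources can also be file names, as.numeric() would return NA for them, so a real fix would need to handle that case separately.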
It seems as if p < .000 is always counted as an error, even when pZeroError == FALSE. This doesn't happen for p = .000.
example:
checkPDF() # chose a html file
PDF error: May not be a PDF file (continuing anyway)
PDF error (2): Illegal character <21> in hex string
PDF error (4): Illegal character <4f> in hex string
PDF error (6): Illegal character <54> in hex string
PDF error (7): Illegal character <59> in hex string
PDF error (8): Illegal character <50> in hex string
PDF error (11): Illegal character <68> in hex string
PDF error (12): Illegal character <74> in hex string
(etc.)
When I run the code below, the output says that OneTail is FALSE, indicating that the result would be incorrect if it were a one-tailed test (which it is, in the example). That doesn't seem right, though.
statcheck("this is a one-tailed test: t(40)=1.80,p<.04")
When I change p<.04 to p<.05, OneTail becomes TRUE, yet when I change it to p<.06 it becomes FALSE again. If I say p=.04, it changes back to TRUE. I'm not sure, but the issue seems to be line 1178 of statcheck.R, or am I missing something? Thanks.
Best,
Tom
Add a flag to the output for cases where statcheck "unrounded" the test stat.
Otherwise you get cases like this:
r(97) = .17, p = .084
recalculated p = .0925
consistent
This is confusing if people don't know that statcheck also counts the p-value belonging to r = .165-.175 as correct
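The range of p-values behind this example can be reproduced in base R. The sketch below assumes the usual conversion from a correlation to a t statistic, t = r * sqrt(df) / sqrt(1 - r^2), with a two-sided test; the function name is made up and this is not statcheck's internal code.

```r
# Two-sided p-value for a correlation r with the given df
p_from_r <- function(r, df) {
  t_val <- r * sqrt(df) / sqrt(1 - r^2)
  2 * pt(abs(t_val), df, lower.tail = FALSE)
}

df <- 97
p_point <- p_from_r(0.17, df)             # p for r = .17 exactly
p_range <- p_from_r(c(0.175, 0.165), df)  # smallest and largest p for r in [.165, .175]

reported_p <- 0.084
consistent <- reported_p >= p_range[1] && reported_p <= p_range[2]
# consistent is TRUE: .084 lies inside the range implied by the rounding of r
```

A flag in the output could simply record that the reported p-value matched somewhere in p_range rather than at p_point.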
Should authors use parentheses for df? Yes
Do some authors use square brackets instead? Also yes
Change RGX_OPEN_BRACKET to "(.+?(?=[\\(\\[]))"
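A quick check of what the proposed pattern captures, using base R with perl = TRUE (needed for the lookahead). How statcheck uses the match afterwards is not shown here; the example only confirms that the text before either bracket type is picked up.

```r
rgx_open_bracket <- "(.+?(?=[\\(\\[]))"

x <- c("t(28) = 2.20, p = .03",
       "F[1, 19] = 4.44, p = .048")

before_bracket <- regmatches(x, regexpr(rgx_open_bracket, x, perl = TRUE))
# before_bracket is "t" "F"
```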
update the progress bar(s) such that when the progress bar is completed, the result is in (now you still have to wait quite some time after the progress bar "Extracting stats" is full)