
jstor's Introduction

jstor: Import and Analyse Data from Scientific Articles

Author: Thomas Klebel
License: GPL v3.0


The tool Data for Research (DfR) by JSTOR is a valuable source for citation analysis and text mining. jstor provides functions and suggests workflows for importing datasets from DfR. It was developed to deal with very large datasets which require an agreement, but can be used with smaller ones as well.

Note: As of 2021, JSTOR has changed the way they provide data, moving to a new platform called Constellate. The package jstor has not been adapted to this change and can therefore only be used with legacy data obtained from the old DfR platform.

The most important set of functions is a group of jst_get_* functions:

  • jst_get_article
  • jst_get_authors
  • jst_get_references
  • jst_get_footnotes
  • jst_get_book
  • jst_get_chapters
  • jst_get_full_text
  • jst_get_ngram

All functions which are concerned with metadata (thus excluding jst_get_full_text and jst_get_ngram) operate along the same lines:

  1. The file is read with xml2::read_xml().
  2. The content of the file is extracted via XPath or CSS expressions.
  3. The resulting data is returned in a tibble.
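
For illustration, the three steps above can be reproduced by hand with xml2. This is only a minimal sketch, not the package's actual implementation; the XPath expressions are simplified and purely illustrative.

library(xml2)
library(tibble)

# 1. read one of jstor's bundled example files
meta <- read_xml(jstor::jst_example("article_with_references.xml"))

# 2. extract content via (simplified, illustrative) XPath expressions
# 3. return the result as a tibble
tibble(
  journal_title = xml_text(xml_find_first(meta, ".//journal-title")),
  article_title = xml_text(xml_find_first(meta, ".//article-title")),
  pub_year      = xml_text(xml_find_first(meta, ".//pub-date/year"))
)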

Installation

To install the package use:

install.packages("jstor")

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ropensci/jstor")

Usage

In order to use jstor, you first need to load it:

library(jstor)
library(magrittr)

The basic usage is simple: supply one of the jst_get_*-functions with a path and it will return a tibble with the extracted information.

jst_get_article(jst_example("article_with_references.xml")) %>% knitr::kable()
| file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |
jst_get_authors(jst_example("article_with_references.xml")) %>% knitr::kable()
| file_name | prefix | given_name | surname | string_name | suffix | author_number |
|---|---|---|---|---|---|---|
| article_with_references | NA | R. | Kudo | NA | NA | 1 |

Further explanations, especially on how to use jstor’s functions for importing many files, can be found in the vignettes.
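
As a quick preview of that workflow, the metadata functions can be mapped over a vector of file paths to import many files at once. The directory below is a placeholder for an unzipped DfR download.

library(purrr)

# placeholder path to the metadata folder of a DfR download
meta_files <- list.files("path/to/metadata", pattern = "\\.xml$", full.names = TRUE)

# one row per article, bound into a single tibble
imported_articles <- map_df(meta_files, jst_get_article)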

Getting started

In order to use jstor, you need some data from DfR. From the main page you can create a dataset by searching for terms and restricting the search by time, subject and content type. After you have created an account, you can download your selection. Alternatively, you can download sample datasets with documents from before 1923 for the US, and before 1870 for all other countries.

Supported Elements

In their technical specifications, DfR lists fields which should be reliably present in all articles and books.

The following tables give an overview of which elements are supported by jstor.

Articles

| xml-field | reliably present | supported in jstor |
|---|---|---|
| journal-id (type=“jstor”) | x | x |
| journal-id (type=“publisher-id”) | x | x |
| journal-id (type=“doi”) |  | x |
| issn | x |  |
| journal-title | x | x |
| publisher-name | x |  |
| article-id (type=“doi”) | x | x |
| article-id (type=“jstor”) | x | x |
| article-id (type=“publisher-id”) |  | x |
| article-type |  | x |
| volume |  | x |
| issue |  | x |
| article-categories | x |  |
| article-title | x | x |
| contrib-group | x | x |
| pub-date | x | x |
| fpage | x | x |
| lpage |  | x |
| page-range |  | x |
| product | x |  |
| self-uri | x |  |
| kwd-group | x |  |
| custom-meta-group | x | x |
| fn-group (footnotes) |  | x |
| ref-list (references) |  | x |

Books

| xml-field | reliably present | supported in jstor |
|---|---|---|
| book-id (type=“jstor”) | x | x |
| discipline | x | x |
| call-number | x |  |
| lcsh | x |  |
| book-title | x | x |
| book-subtitle |  | x |
| contrib-group | x | x |
| pub-date | x | x |
| isbn | x | x |
| publisher-name | x | x |
| publisher-loc | x | x |
| permissions | x |  |
| self-uri | x |  |
| counts | x | x |
| custom-meta-group | x | x |

Book Chapters

| xml-field | reliably present | supported in jstor |
|---|---|---|
| book-id (type=“jstor”) | x | x |
| part_id | x | x |
| part_label | x | x |
| part-title | x | x |
| part-subtitle |  | x |
| contrib-group | x | x |
| fpage | x | x |
| abstract | x | x |

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Citation

To cite jstor, please refer to citation(package = "jstor"):

Klebel (2018). jstor: Import and Analyse Data from Scientific Texts. Journal of 
Open Source Software, 3(28), 883, https://doi.org/10.21105/joss.00883

Acknowledgements

Work on jstor benefited from financial support for the project “Academic Super-Elites in Sociology and Economics” by the Austrian Science Fund (FWF), project number “P 29211 Einzelprojekte”.

Some internal functions regarding file paths and example files were adapted from the package readr.


jstor's People

Contributors

bklebel, jeroen, jimhester, starship9, tklebel


jstor's Issues

add case study as pre-built vignette

  • should show how to deal with different types of input, i.e. journal articles and book chapters
articles <- list.files("...", pattern = "article", full.names = TRUE)

res <- articles %>%
  purrr::map_df(find_metadata)

Problem with vignette: Error in gzfile(file, "rb") : cannot open the connection

First, thank you for this jstor package. It seems perfectly suited to working with the DfR files. I am having an issue, though, and thought I'd ask you about it. I'm very much a beginner, but in following your example I get the following error when I try to access the bigrams_files:

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'bigram_paths.rds', probable reason 'No such file or directory'

I'm not sure what's happening here. I'm working in RStudio and it appears that the value is there:

chr [1:1059] "./receipt-id-631571-part-001/ngram2/journal-article-10.2307_23271615-ngram2.txt" ...

Any help would be greatly appreciated and thank you again for putting together a very useful tool.

Billy

find_references fails silently

References for articles from "Gènese" are currently not being extracted. Example file: journal-article-10.2307_26197863.xml

This is the responsible function:

extract_ref_content <- function(x) {
  if (identical(xml2::xml_attr(x, "content-type"), "parsed-citations")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      map_chr(collapse_text)

  } else if (is.na(xml2::xml_attr(x, "content-type"))) {
    x %>%
      xml_find_all("title|ref/mixed-citation/node()[not(self::*)]") %>%
      xml_text() %>%
      purrr::keep(str_detect, "[a-z]") %>%
      str_replace("^\\\n", "") # remove "\n" at beginning of strings

  } else if (identical(xml2::xml_attr(x, "content-type"), "unparsed")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      xml_text()
  }
}

The content-type of the references is "unparsed-citations" and it therefore fails silently.
Solutions:

  • Change the last else if to include "unparsed-citations" (a minimal sketch follows after this list)
  • make the last case more general so that it simply applies to all other cases
  • add another case. This case could either be the same as the third, or it could emit a message along the lines of "Type of reference not recognized. Please alert package maintainer at 'url to GitHub'"
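
A minimal, self-contained sketch of the first option combined with the message from the third; the helper name is hypothetical, and the real logic lives in extract_ref_content above.

extract_unparsed_refs <- function(ref_list) {
  ref_type <- xml2::xml_attr(ref_list, "content-type")

  # treat "unparsed-citations" the same way as "unparsed"
  if (ref_type %in% c("unparsed", "unparsed-citations")) {
    xml2::xml_text(xml2::xml_find_all(ref_list, "title|ref/mixed-citation"))
  } else {
    # instead of failing silently, alert the user
    stop("Type of reference not recognized. Please alert the package maintainer.")
  }
}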

implement new fields

  • article-type
  • for article-id, journal id and book-id, differentiate between "doi", "jstor" and "publisher-id"

update case study about n-grams

Comments from Matthias

Flagship journals:
The text says "Journal of Sociology", but you mean "American Journal of Sociology"? The former is the name of the journal of the Australian sociological association; only the latter is really "leading".

Importing bigrams
Regarding "6729813 bigrams": numbers this long are usually shown with separators (e.g. 6,729,813), even though I find the plain number aesthetically pleasing.

At the end, where you discuss "labor market", "labor force", "income inequality", and so on, you make the argument that these bigrams fit the topic of inequality well. That is correct, but not particularly interesting.

Instead, I would discuss the problem that, from a substantive point of view, these are not bigrams at all but single concepts consisting of two words. With very few exceptions, where conceptual opposites (black-white) or (quasi-)synonyms (race-ethnicity) appear, this applies to the entire output: world polity, affirmative action, gender gap, etc.

Identifying the most frequent terms certainly has its purpose. It seems to me, however, that (possibly surprising) pairings of terms are the really interesting result of such an automated analysis. As a user, I would therefore like to be able to define such "two-word single concepts" in order to see which other terms they are associated with. With what we see here, we have not yet reached the level of associations between terms.

I would exclude "University of Chicago" and the like from the outset, for the reasons you mention, since it contributes nothing to the interpretation of the output.

Improve documentation regarding endnotes

Thanks for this, I'm so happy to have easily extracted most of the XML data. However, I noticed that some of the articles have endnotes rather than footnotes, and I was wondering if you would consider adding a function to read those.

Directly import from .zip file

TODOs:

  • add tests for get_basename (and fix it for zip_archives)
  • rename import spec and capture spec to something different
  • add checks for input in capture spec (possible name: jst_define_import; see the sketch after this list)
  • fix namespaces
  • document new functions
  • add tests
  • implement research reports and pamphlets
  • implement ngrams
  • show_progress and col_names should be passed down to jstor_convert_to_file
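
A rough sketch of how the interface could look once these items are done. The function jst_import_zip, its arguments, and the file names below are assumptions based on the possible names above, not a finished API.

# define which jst_get_* functions to apply to which content type
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors))

# read metadata directly from the downloaded .zip and write results to disk
jst_import_zip(
  "receipt-id-123456.zip",  # placeholder name of a DfR download
  import_spec = import_spec,
  out_file = "my_import"    # basename for the output files
)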

extract page range

Example File:

This article is an example of a file with an erratum. fpage and lpage are misleading in this case. The content of page-range is: 375, 345-364. This means that the erratum is on page 375 and the original article runs from 345 to 364. Note that this specification could also be the other way round, like 345-364, 375.

It is probably best to extract the content of page-range and fix weird (negative) total pages with a helper.
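
A hypothetical helper along those lines, which derives the total number of pages from a page-range string like the one above:

# derive total pages from a page-range string such as "375, 345-364"
total_pages_from_range <- function(page_range) {
  parts <- strsplit(page_range, ",\\s*")[[1]]

  sum(vapply(parts, function(part) {
    pages <- as.integer(strsplit(part, "-")[[1]])
    if (length(pages) == 2) abs(diff(pages)) + 1L else 1L
  }, integer(1)))
}

total_pages_from_range("375, 345-364")
# returns 21 (1 page for the erratum plus 20 pages for the article)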

export group of cleaning and helper functions

  • total_pages
  • unify_journal_id
  • clean_pages: use a heuristic, i.e. either simply extract digits or use some regex (see the sketch after this list)
  • wrapper like augment to apply all cleaning functions. possible name: jstor_clean
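
A sketch of what clean_pages could look like with the simple digit-extraction heuristic; page ranges like "59-76" would still need the regex variant.

# hypothetical helper: keep only the digits of a page specifier
clean_pages <- function(pages) {
  as.integer(gsub("[^0-9]", "", pages))
}

clean_pages(c("p. 59", "M77"))
# returns 59 77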

rename functions

Think about a unified naming scheme, to avoid having to remember the function names.

possibly:

  • jst_article
  • jst_book
  • jst_authors
  • jst_import
  • etc.

I would prefer "jst_" over "js_", since the latter is short for JavaScript, and the default for autocomplete in RStudio is three characters.

Improve function documentation

  • add examples

proofread documentation for spelling and clarity

  • find_authors
  • jstor_example
  • find_footnotes
  • find_references
  • find_metadata
  • jstor_import
  • read_fulltext

find_references lumps authors together

The new format does not fare well with the current implementation.

<ref id="ref6">
            <mixed-citation publication-type="book">
               <person-group person-group-type="author">
                  <string-name>Aulette, J.</string-name>
               </person-group>and<person-group person-group-type="author">
                  <string-name>Michalowski, R.</string-name>
               </person-group>(<year>1993</year>)<article-title>“Fire in Hamlet: A Case Study of a State-Corporate Crime”</article-title>, in<person-group person-group-type="editor">
                  <string-name>K. Tunnell</string-name>
               </person-group>, ed.,<source>
                  <italic>Political Crime in Contemporary America: A Critical Approach</italic>
               </source>.:<publisher-name>Garland Publishing</publisher-name>, pp.<fpage>171</fpage>–<lpage>206</lpage>.</mixed-citation>
</ref>

gets turned into:

Aulette, J.andMichalowski, R.(1993)“Fire in Hamlet: A Case Study of a State-Corporate Crime”, inK. Tunnell, ed.,Political Crime in Contemporary America: A Critical Approach.:Garland Publishing, pp.171–206.

Either we could explicitly parse the separate fields, or we could still lump everything together but somehow put spaces in between. The general problem here is that the format is most likely not uniform across all articles from DfR.

Maybe we can distinguish the two formats from the following?

<ref-list content-type="parsed-citations">
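
A small sketch of the "put spaces in between" variant, collapsing the child nodes of a mixed-citation with a separating space; the helper name is hypothetical.

library(xml2)

# collapse all child nodes (elements and text) of a <mixed-citation>,
# separated by spaces instead of pasted together directly
collapse_citation <- function(mixed_citation) {
  parts <- xml_text(xml_contents(mixed_citation), trim = TRUE)
  paste(parts[parts != ""], collapse = " ")
}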

Address rOpenSci review requests

Review at: ropensci/software-review#189

Review by jsonbecker

  • change basename_id to file_name.
  • explain why file_name is a good identifier (seconded by elinw)
  • think about export option to sqlite
  • add import function to read csvs from jstor_import #40
  • clarify whether find_authors works on books or not
  • check if find_footnotes and find_references could be used for books too
  • fix vignette formatting
  • improve error messages for find_references to say that references are not available for books

Review by elinw

  • add vignette with "known quirks" and how to handle them
  • check aspect with repeated endnotes

use bind_cols for different depths

Instead of

  out <- list(
    book_id = extract_child(xml_file, ".//book-id"),
    basename_id = extract_basename(file_path, "xml"),
    list(purrr::map_df(parts, find_part, authors))
  )
  
  out %>% 
    data.frame(stringsAsFactors = FALSE) %>% 
    tibble::new_tibble()

do

  base <- list(
    book_id = extract_child(xml_file, ".//book-id"),
    basename_id = extract_basename(file_path, "xml")
  )

  parts <- purrr::map(parts, find_part, authors)

  dplyr::bind_cols(base, parts)

parse references if information is available

References of type "parsed" could be parsed themselves, to extract the information in a more useful format than simply a single string.

TODO:

  • Test for ref-titles
  • benchmark times: how long does parsing take? -> currently no way to speed up.
  • Is extracting references without parsing them slower than before? -> NO
  • Search for more possible fields within the raw data (extract references, find articles with parsed references from that, look into the files to see if there are more fields).
  • Fix intro vignette.
