
jstor's Introduction

jstor: Import and Analyse Data from Scientific Articles

Author: Thomas Klebel
License: GPL v3.0


The tool Data for Research (DfR) by JSTOR is a valuable source for citation analysis and text mining. jstor provides functions and suggests workflows for importing datasets from DfR. It was developed to deal with very large datasets which require an agreement, but can be used with smaller ones as well.

Note: As of 2021, JSTOR has changed the way they provide data, moving to a new platform called Constellate. The package jstor has not been adapted to this change and can therefore only be used with legacy data obtained from the old DfR platform.

The most important set of functions is a group of jst_get_* functions:

  • jst_get_article
  • jst_get_authors
  • jst_get_references
  • jst_get_footnotes
  • jst_get_book
  • jst_get_chapters
  • jst_get_full_text
  • jst_get_ngram

All functions which are concerned with metadata (thus excluding jst_get_full_text and jst_get_ngram) operate along the same lines:

  1. The file is read with xml2::read_xml().
  2. The content of the file is extracted via XPath or CSS expressions.
  3. The resulting data is returned in a tibble.
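
For illustration, the three steps above can be reproduced by hand with xml2. This is only a minimal sketch, not the package's actual implementation; the XPath expressions are simplified and purely illustrative.

library(xml2)
library(tibble)

# 1. read one of jstor's bundled example files
meta <- read_xml(jstor::jst_example("article_with_references.xml"))

# 2. extract content via (simplified, illustrative) XPath expressions
# 3. return the result as a tibble
tibble(
  journal_title = xml_text(xml_find_first(meta, ".//journal-title")),
  article_title = xml_text(xml_find_first(meta, ".//article-title")),
  pub_year      = xml_text(xml_find_first(meta, ".//pub-date/year"))
)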

Installation

To install the package use:

install.packages("jstor")

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ropensci/jstor")

Usage

In order to use jstor, you first need to load it:

library(jstor)
library(magrittr)

The basic usage is simple: supply one of the jst_get_*-functions with a path and it will return a tibble with the extracted information.

jst_get_article(jst_example("article_with_references.xml")) %>% knitr::kable()
| file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |
jst_get_authors(jst_example("article_with_references.xml")) %>% knitr::kable()
| file_name | prefix | given_name | surname | string_name | suffix | author_number |
|---|---|---|---|---|---|---|
| article_with_references | NA | R. | Kudo | NA | NA | 1 |

Further explanations, especially on how to use jstor’s functions for importing many files, can be found in the vignettes.
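
As a quick preview of that workflow, the metadata functions can be mapped over a vector of file paths to import many files at once. The directory below is a placeholder for an unzipped DfR download.

library(purrr)

# placeholder path to the metadata folder of a DfR download
meta_files <- list.files("path/to/metadata", pattern = "\\.xml$", full.names = TRUE)

# one row per article, bound into a single tibble
imported_articles <- map_df(meta_files, jst_get_article)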

Getting started

In order to use jstor, you need some data from DfR. From the main page you can create a dataset by searching for terms and restricting the search by time, subject and content type. After you have created an account, you can download your selection. Alternatively, you can download sample datasets with documents from before 1923 for the US, and before 1870 for all other countries.

Supported Elements

In their technical specifications, DfR lists fields which should be reliably present in all articles and books.

The following tables give an overview of which elements are supported by jstor.

Articles

| xml-field | reliably present | supported in jstor |
|---|---|---|
| journal-id (type=“jstor”) | x | x |
| journal-id (type=“publisher-id”) | x | x |
| journal-id (type=“doi”) |  | x |
| issn | x |  |
| journal-title | x | x |
| publisher-name | x |  |
| article-id (type=“doi”) | x | x |
| article-id (type=“jstor”) | x | x |
| article-id (type=“publisher-id”) |  | x |
| article-type |  | x |
| volume |  | x |
| issue |  | x |
| article-categories | x |  |
| article-title | x | x |
| contrib-group | x | x |
| pub-date | x | x |
| fpage | x | x |
| lpage |  | x |
| page-range |  | x |
| product | x |  |
| self-uri | x |  |
| kwd-group | x |  |
| custom-meta-group | x | x |
| fn-group (footnotes) |  | x |
| ref-list (references) |  | x |

Books

| xml-field | reliably present | supported in jstor |
|---|---|---|
| book-id (type=“jstor”) | x | x |
| discipline | x | x |
| call-number | x |  |
| lcsh | x |  |
| book-title | x | x |
| book-subtitle |  | x |
| contrib-group | x | x |
| pub-date | x | x |
| isbn | x | x |
| publisher-name | x | x |
| publisher-loc | x | x |
| permissions | x |  |
| self-uri | x |  |
| counts | x | x |
| custom-meta-group | x | x |

Book Chapters

| xml-field | reliably present | supported in jstor |
|---|---|---|
| book-id (type=“jstor”) | x | x |
| part_id | x | x |
| part_label | x | x |
| part-title | x | x |
| part-subtitle |  | x |
| contrib-group | x | x |
| fpage | x | x |
| abstract | x | x |

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Citation

To cite jstor, please refer to citation(package = "jstor"):

Klebel (2018). jstor: Import and Analyse Data from Scientific Texts. Journal of 
Open Source Software, 3(28), 883, https://doi.org/10.21105/joss.00883

Acknowledgements

Work on jstor benefited from financial support for the project “Academic Super-Elites in Sociology and Economics” by the Austrian Science Fund (FWF), project number “P 29211 Einzelprojekte”.

Some internal functions regarding file paths and example files were adapted from the package readr.


jstor's People

Contributors

bklebel, jeroen, jimhester, starship9, tklebel


jstor's Issues

add case study as pre-built vignette

  • should show how to deal with different types of input, i.e. journal articles and book chapters
articles <- list.files("...", pattern = "article", full.names = TRUE)

res <- articles %>%
  purrr::map_df(find_metadata)

Problem with vignette: Error in gzfile(file, "rb") : cannot open the connection

First, thank you for this jstor package. It seems perfectly suited to working with the DfR files. I am having an issue, though, and thought I'd ask you about it. I'm very much a beginner, but in following your example I get the following error when I try to access the bigrams_files:

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'bigram_paths.rds', probable reason 'No such file or directory'

I'm not sure what's happening here. I'm working in RStudio and it appears that the value is there:

chr [1:1059] "./receipt-id-631571-part-001/ngram2/journal-article-10.2307_23271615-ngram2.txt" ...

Any help would be greatly appreciated and thank you again for putting together a very useful tool.

Billy

find_references fails silently

References for articles from "Gènese" are currently not being extracted. Example file: journal-article-10.2307_26197863.xml

This is the responsible function:

extract_ref_content <- function(x) {
  if (identical(xml2::xml_attr(x, "content-type"), "parsed-citations")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      map_chr(collapse_text)

  } else if (is.na(xml2::xml_attr(x, "content-type"))) {
    x %>%
      xml_find_all("title|ref/mixed-citation/node()[not(self::*)]") %>%
      xml_text() %>%
      purrr::keep(str_detect, "[a-z]") %>%
      str_replace("^\\\n", "") # remove "\n" at beginning of strings

  } else if (identical(xml2::xml_attr(x, "content-type"), "unparsed")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      xml_text()
  }
}

The content-type of the references is "unparsed-citations" and it therefore fails silently.
Solutions:

  • Change the last else if to include "unparsed-citations" (a minimal sketch follows after this list)
  • make the last case more general so that it simply applies to all other cases
  • add another case. This case could either be the same as the third, or it could emit a message along the lines of "Type of reference not recognized. Please alert package maintainer at 'url to GitHub'"
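
A minimal, self-contained sketch of the first option combined with the message from the third; the helper name is hypothetical, and the real logic lives in extract_ref_content above.

extract_unparsed_refs <- function(ref_list) {
  ref_type <- xml2::xml_attr(ref_list, "content-type")

  # treat "unparsed-citations" the same way as "unparsed"
  if (ref_type %in% c("unparsed", "unparsed-citations")) {
    xml2::xml_text(xml2::xml_find_all(ref_list, "title|ref/mixed-citation"))
  } else {
    # instead of failing silently, alert the user
    stop("Type of reference not recognized. Please alert the package maintainer.")
  }
}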

implement new fields

  • article-type
  • for article-id, journal id and book-id, differentiate between "doi", "jstor" and "publisher-id"

update case study about n-grams

Comments from Matthias

Flagship journals:
The text says "Journal of Sociology", but you mean "American Journal of Sociology"? The former is the name of the journal of the Australian sociological association; only the latter is really "leading".

Importing bigrams
Regarding "6729813 bigrams": numbers this long are usually shown with separators (e.g. 6,729,813), even though I find the plain number aesthetically pleasing.

At the end, where you discuss "labor market", "labor force", "income inequality", and so on, you make the argument that these bigrams fit the topic of inequality well. That is correct, but not particularly interesting.

Instead, I would discuss the problem that, from a substantive point of view, these are not bigrams at all but single concepts consisting of two words. With very few exceptions, where conceptual opposites (black-white) or (quasi-)synonyms (race-ethnicity) appear, this applies to the entire output: world polity, affirmative action, gender gap, etc.

Identifying the most frequent terms certainly has its purpose. It seems to me, however, that (possibly surprising) pairings of terms are the really interesting result of such an automated analysis. As a user, I would therefore like to be able to define such "two-word single concepts" in order to see which other terms they are associated with. With what we see here, we have not yet reached the level of associations between terms.

I would exclude "University of Chicago" and the like from the outset, for the reasons you mention, since it contributes nothing to the interpretation of the output.

Improve documentation regarding endnotes

Thanks for this, I'm so happy to have easily extracted most of the XML data. However, I noticed that some of the articles have endnotes rather than footnotes, and I was wondering if you would consider adding a function to read those.

Directly import from .zip file

TODOs:

  • add tests for get_basename (and fix it for zip_archives)
  • rename import spec and capture spec to something different
  • add checks for input in capture spec (possible name: jst_define_import; see the sketch after this list)
  • fix namespaces
  • document new functions
  • add tests
  • implement research reports and pamphlets
  • implement ngrams
  • show_progress and col_names should be passed down to jstor_convert_to_file
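
A rough sketch of how the interface could look once these items are done. The function jst_import_zip, its arguments, and the file names below are assumptions based on the possible names above, not a finished API.

# define which jst_get_* functions to apply to which content type
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors))

# read metadata directly from the downloaded .zip and write results to disk
jst_import_zip(
  "receipt-id-123456.zip",  # placeholder name of a DfR download
  import_spec = import_spec,
  out_file = "my_import"    # basename for the output files
)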

extract page range

Example File:

This article is an example of a file with an erratum. fpage and lpage are misleading in this case. The content of page-range is: 375, 345-364. This means that the erratum is on page 375 and the original article runs from 345 to 364. Note that this specification could also be the other way round, like 345-364, 375.

It is probably best to extract the content of page-range and fix weird (negative) total pages with a helper.
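
A hypothetical helper along those lines, which derives the total number of pages from a page-range string like the one above:

# derive total pages from a page-range string such as "375, 345-364"
total_pages_from_range <- function(page_range) {
  parts <- strsplit(page_range, ",\\s*")[[1]]

  sum(vapply(parts, function(part) {
    pages <- as.integer(strsplit(part, "-")[[1]])
    if (length(pages) == 2) abs(diff(pages)) + 1L else 1L
  }, integer(1)))
}

total_pages_from_range("375, 345-364")
# returns 21 (1 page for the erratum plus 20 pages for the article)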

export group of cleaning and helper functions

  • total_pages
  • unify_journal_id
  • clean_pages: use a heuristic, i.e. either simply extract digits or use some regex (see the sketch after this list)
  • wrapper like augment to apply all cleaning functions. possible name: jstor_clean
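
A sketch of what clean_pages could look like with the simple digit-extraction heuristic; page ranges like "59-76" would still need the regex variant.

# hypothetical helper: keep only the digits of a page specifier
clean_pages <- function(pages) {
  as.integer(gsub("[^0-9]", "", pages))
}

clean_pages(c("p. 59", "M77"))
# returns 59 77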

rename functions

Think about a unified naming scheme, to avoid having to remember the function names.

possibly:

  • jst_article
  • jst_book
  • jst_authors
  • jst_import
  • etc.

I would prefer "jst_" over "js_", since the latter is short for JavaScript, and the default for autocomplete in RStudio is three characters.

Improve function documentation

  • add examples

proofread documentation for spelling and clarity

  • find_authors
  • jstor_example
  • find_footnotes
  • find_references
  • find_metadata
  • jstor_import
  • read_fulltext

find_references lumps authors together

The new format does not fare well with the current implementation.

<ref id="ref6">
            <mixed-citation publication-type="book">
               <person-group person-group-type="author">
                  <string-name>Aulette, J.</string-name>
               </person-group>and<person-group person-group-type="author">
                  <string-name>Michalowski, R.</string-name>
               </person-group>(<year>1993</year>)<article-title>“Fire in Hamlet: A Case Study of a State-Corporate Crime”</article-title>, in<person-group person-group-type="editor">
                  <string-name>K. Tunnell</string-name>
               </person-group>, ed.,<source>
                  <italic>Political Crime in Contemporary America: A Critical Approach</italic>
               </source>.:<publisher-name>Garland Publishing</publisher-name>, pp.<fpage>171</fpage>–<lpage>206</lpage>.</mixed-citation>
</ref>

gets turned into:

Aulette, J.andMichalowski, R.(1993)“Fire in Hamlet: A Case Study of a State-Corporate Crime”, inK. Tunnell, ed.,Political Crime in Contemporary America: A Critical Approach.:Garland Publishing, pp.171–206.

Either we could explicitly parse the separate fields, or we could still lump everything together but somehow put spaces in between. The general problem here is that the format is most likely not uniform across all articles from DfR.

Maybe we can distinguish the two formats from the following?

<ref-list content-type="parsed-citations">
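
A small sketch of the "put spaces in between" variant, collapsing the child nodes of a mixed-citation with a separating space; the helper name is hypothetical.

library(xml2)

# collapse all child nodes (elements and text) of a <mixed-citation>,
# separated by spaces instead of pasted together directly
collapse_citation <- function(mixed_citation) {
  parts <- xml_text(xml_contents(mixed_citation), trim = TRUE)
  paste(parts[parts != ""], collapse = " ")
}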

Address rOpenSci review requests

Review at: ropensci/software-review#189

Review by jsonbecker

  • change basename_id to file_name.
  • explain why file_name is a good identifier (seconded by elinw)
  • think about export option to sqlite
  • add import function to read csvs from jstor_import #40
  • clarify whether find_authors works on books or not
  • check if find_footnotes and find_references could be used for books too
  • fix vignette formatting
  • improve error messages for find_references to say that references are not available for books

Review by elinw

  • add vignette with "known quirks" and how to handle them
  • check aspect with repeated endnotes

use bind_cols for different depths

Instead of

  out <- list(
    book_id = extract_child(xml_file, ".//book-id"),
    basename_id = extract_basename(file_path, "xml"),
    list(purrr::map_df(parts, find_part, authors))
  )
  
  out %>% 
    data.frame(stringsAsFactors = FALSE) %>% 
    tibble::new_tibble()

do

  base <- list(
    book_id = extract_child(xml_file, ".//book-id"),
    basename_id = extract_basename(file_path, "xml")
  )

  parts <- purrr::map(parts, find_part, authors)

  dplyr::bind_cols(base, parts)

parse references if information is available

References of type "parsed" could be parsed themselves, to extract the information in a more useful format than simply a single string.

TODO:

  • Test for ref-titles
  • benchmark times: how long does parsing take? -> currently no way to speed up.
  • Is extracting references without parsing them slower than before? -> NO
  • Search for more possible fields within the raw data (extract references, find articles with parsed references from that, look into the files to see if there are more fields).
  • Fix intro vignette.
