
gutenbergr's Introduction

gutenbergr: R package to search and download public domain texts from Project Gutenberg

Authors: David Robinson
License: GPL-2


Download and process public domain works from the Project Gutenberg collection. Includes

  • A function gutenberg_download() that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
  • Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
    • gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.
    • gutenberg_authors contains information about each author, such as aliases and birth/death year
    • gutenberg_subjects contains pairings of works with Library of Congress subjects and topics
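Because these are ordinary data frames, they can be combined with dplyr. A minimal sketch (the subject string is just an illustrative Library of Congress heading):

```r
library(dplyr)
library(gutenbergr)

# Find IDs of works tagged with a given subject, then look up
# their titles and authors in the work-level metadata
gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories") %>%
  inner_join(gutenberg_metadata, by = "gutenberg_id") %>%
  select(gutenberg_id, title, author)
```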

Installation

Install the package with:

install.packages("gutenbergr")

Or install the development version using devtools with:

devtools::install_github("ropensci/gutenbergr")

Examples

The gutenberg_works() function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered).

Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We could find the book’s ID by filtering:

library(dplyr)
library(gutenbergr)

gutenberg_works() %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books Public domain in the USA. TRUE

# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books Public domain in the USA. TRUE

Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:

wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#>    gutenberg_id text               
#>           <int> <chr>              
#>  1          768 "Wuthering Heights"
#>  2          768 ""                 
#>  3          768 "by Emily Brontë"  
#>  4          768 ""                 
#>  5          768 ""                 
#>  6          768 ""                 
#>  7          768 ""                 
#>  8          768 "CHAPTER I"        
#>  9          768 ""                 
#> 10          768 ""                 
#> # ℹ 12,332 more rows

gutenberg_download() can download multiple books when given multiple IDs. It also takes a meta_fields argument that adds columns from the metadata.

# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#>    gutenberg_id text                title            
#>           <int> <chr>               <chr>            
#>  1          768 "Wuthering Heights" Wuthering Heights
#>  2          768 ""                  Wuthering Heights
#>  3          768 "by Emily Brontë"   Wuthering Heights
#>  4          768 ""                  Wuthering Heights
#>  5          768 ""                  Wuthering Heights
#>  6          768 ""                  Wuthering Heights
#>  7          768 ""                  Wuthering Heights
#>  8          768 "CHAPTER I"         Wuthering Heights
#>  9          768 ""                  Wuthering Heights
#> 10          768 ""                  Wuthering Heights
#> # ℹ 33,333 more rows

books %>%
  count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights           12342

It can also take the output of gutenberg_works directly. For example, we could get the text of all Aristotle’s works, each annotated with both gutenberg_id and title, using:

aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")

aristotle_books
#> # A tibble: 17,147 × 3
#>    gutenberg_id text                                                                    
#>           <int> <chr>                                                                   
#>  1         1974 "THE POETICS OF ARISTOTLE"                                              
#>  2         1974 ""                                                                      
#>  3         1974 "By Aristotle"                                                          
#>  4         1974 ""                                                                      
#>  5         1974 "A Translation By S. H. Butcher"                                        
#>  6         1974 ""                                                                      
#>  7         1974 ""                                                                      
#>  8         1974 "[Transcriber's Annotations and Conventions: the translator left"       
#>  9         1974 "intact some Greek words to illustrate a specific point of the original"
#> 10         1974 "discourse. In this transcription, in order to retain the accuracy of"  
#>    title                   
#>    <chr>                   
#>  1 The Poetics of Aristotle
#>  2 The Poetics of Aristotle
#>  3 The Poetics of Aristotle
#>  4 The Poetics of Aristotle
#>  5 The Poetics of Aristotle
#>  6 The Poetics of Aristotle
#>  7 The Poetics of Aristotle
#>  8 The Poetics of Aristotle
#>  9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 17,137 more rows

FAQ

What do I do with the text once I have it?

  • The Natural Language Processing CRAN View suggests many R packages related to text mining, especially around the tm package.
  • The tidytext package is useful for tokenization and analysis, especially since gutenbergr downloads books as a data frame already.
  • You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
  • If you’re considering an analysis based on author name, you may find the humaniformat (for extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing “Last, First” names).
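For instance, a minimal tidytext word count looks like this (a sketch, assuming the tidytext package is installed):

```r
library(dplyr)
library(tidytext)
library(gutenbergr)

wuthering_heights <- gutenberg_download(768)

# Tokenize into words, drop common stop words, and count
wuthering_heights %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```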

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these datasets. As of now, these were generated from the Project Gutenberg catalog on 19 December 2022.

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies with them to the best of our ability. Namely:

  • Project Gutenberg allows wget to harvest Project Gutenberg using this list of links. The gutenbergr package visits that page once to find the recommended mirror for the user’s location.
  • We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from https://www.gutenberg.lib.md.us/8/84/84.zip.
  • We retrieve the .zip file rather than txt to minimize bandwidth on the mirror.

Still, this package is not the right way to download the entire Project Gutenberg corpus (or all from a particular language). For that, follow their recommendation to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic.
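A minimal mirror-aware session might look like this (a sketch; the mirror URL is just an example, and gutenberg_get_mirror() is the package's helper for finding the recommended one):

```r
library(gutenbergr)

# See which mirror the package selected (determined once, then reused)
gutenberg_get_mirror()

# Or point a download at a specific mirror explicitly
frankenstein <- gutenberg_download(84, mirror = "https://gutenberg.pglaf.org")
```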

Code of Conduct

Please note that the gutenbergr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


gutenbergr's People

Contributors

dgrtwo, evanodell, jimhester, jonthegeek, maelle, msperlin, myfanwy, sckott


gutenbergr's Issues

Error in filter(., books == "Wuthering Heights") : 'filter' is longer than time series

The example in readme.md

> gutenberg_works() %>% 
    filter(books == "Wuthering Heights")

returns the following error:

Error in filter(., books == "Wuthering Heights") : 
  'filter' is longer than time series
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion
3: In data.matrix(data) : NAs introduced by coercion
4: In data.matrix(data) : NAs introduced by coercion
5: In data.matrix(data) : NAs introduced by coercion

Removed from CRAN - New Maintainer Needed

A book club we're about to start at R4DS uses this package, which led me to find that it's archived on CRAN. Is there anything we can do to get it back up? I think this is used for examples in a lot of things, so I want to help revive it if I can!

Looking at the notes, I think this is one of the things that can be auto-fixed via the latest version of roxygen2.

misses a production note starting with "Special thanks"

Here's what happens when you try to download Sense and Sensibility (ID 161):

> library("gutenbergr")
> head(gutenberg_download(161), 4)
# A tibble: 4 x 2
  gutenberg_id                                                     text
         <int>                                                    <chr>
1          161 Special thanks are due to Sharon Partridge for extensive
2          161               proofreading and correction of this etext.
3          161                                                         
4          161  

You can fix this by adding ^special thanks to start_paragraph_regex in gutenberg_strip.
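Since gutenberg_strip() is exported, the behavior can be reproduced directly (a sketch of how one might inspect what the stripping removes):

```r
library(gutenbergr)

# Download without stripping, then strip manually to compare
raw <- gutenberg_download(161, strip = FALSE)
stripped <- gutenberg_strip(raw$text)
head(stripped, 4)
```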

Using gutenbergr in vignettes -- HTTP error 403

This package is so useful and put together well, but I do have a long-running, intermittent issue with it that I am not sure what to do about. When I use gutenbergr functions in a package vignette, building the package on Travis/AppVeyor fails sometimes because of a 403 error from Project Gutenberg, like so:

Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Quitting from lines 37-46 (topic_modeling.Rmd) 
Error: processing vignette 'topic_modeling.Rmd' failed with diagnostics:
HTTP error 403.
Execution halted

This doesn't happen every time, and I have not been able to really predict when would be more likely to succeed vs. fail. If it does fail, generally it will fail again and again for the next several hours, up to a whole day. On the other hand, sometimes I can go many hours/commits with all successes.

I know Project Gutenberg is picky about automated traffic. Is there something I can do about this as a user or vignette writer? Any thoughts?

gutenbergr sometimes returns results with the wrong declared encoding

See, for example, line 15222 of Mansfield Park, which is wrongly declared to be UTF-8:

> library("gutenbergr")
> book <- gutenberg_download(141) # Mansfield Park
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
> line <- book$text[15222]
> line
[1] "the command of her beauty, and her \xa320,000, any one who could satisfy the"
> Encoding(line)
[1] "UTF-8"

Note the \xa3. You can fix this by either setting the correct encoding:

> line2 <- line
> Encoding(line2) <- "latin1"
> line2
[1] "the command of her beauty, and her £20,000, any one who could satisfy the"

Or, by converting from the actual encoding ("latin1") to UTF-8:

> iconv(line, "latin1", "UTF-8")
[1] "the command of her beauty, and her £20,000, any one who could satisfy the"

Alternatively, just download the UTF-8 version of the text ( http://www.gutenberg.org/files/141/141-0.txt ) instead of the version that gutenberg.org wrongly claims to be ASCII (http://www.gutenberg.org/files/141/141.txt )

More discussion at juliasilge/janeaustenr#4
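Applied to a whole downloaded book rather than a single line, the same base-R fix might look like this (a sketch, assuming the text really is latin1 as in this report):

```r
library(gutenbergr)

book <- gutenberg_download(141)  # Mansfield Park
# Re-interpret the bytes as latin1 and convert them to UTF-8
book$text <- iconv(book$text, from = "latin1", to = "UTF-8")
```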

Too drastic strip function on a French book

Hello,

I noticed that on a French book, the gutenberg_strip function removes all the content. I think it's a collection of poems; the ID of that book is 4688.

Code:
gutenbergr::gutenberg_download(4688, strip = TRUE)

The output is this:

# A tibble: 1 x 2
  gutenberg_id text 
         <int> <chr>
1         4688 "" 

I know from the documentation that 'This is based on some formatting guesses so it may not be perfect'. I'm just raising the issue so that it is known.
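A possible workaround while the stripping heuristics are imperfect (a sketch): disable stripping and trim the Project Gutenberg header and footer by hand.

```r
library(gutenbergr)

# Keep the full text, including the Project Gutenberg header and footer,
# then locate the real start of the work yourself
raw <- gutenberg_download(4688, strip = FALSE)
head(raw$text, 20)
```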

Move actual download tests to GHA or manual only

Most of the time, we should assume that the gutenberg side isn't changing, and focus on making sure our code works. Get rid of the test-download.R version of tests, at least as far as the main test suite goes. Add a GHA that checks that side of things from time to time, and maybe an easy way to check it manually (such as a switch in the mock).

Update data-raw to R

Right now the initial steps in data-raw are in Python, and are thus opaque to the maintainers. That's a recipe for eventual disaster.

We should rewrite the full process to use R, so we can maintain it.

Connection timeout

I keep getting connection timeout although my internet is active. Specifically, this error message:

Error in open.connection(con, "rb") : Timeout was reached

I only downloaded the package yesterday.

Upgraded to 0.2.4 and gutenberg_works cannot find 140

I was having students use gutenberg_works to find the ID for "The Jungle" by Upton Sinclair. My code worked fine and their code did as well. However, when we upgraded from 0.2.3 to 0.2.4, the book no longer appears in the list of works.

gutenberg_works() %>%
filter(gutenberg_id == 140) # no work found with ID 140

Empty tibble

I can download it with gutenberg_downloads fine. Is there a reason that it no longer shows in the list of works but can still be downloaded?

https://www.gutenberg.org/ebooks/140

Thanks,
Richard
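One way to diagnose cases like this (a sketch): gutenberg_works() applies filters such as language, rights, and has_text, so checking the unfiltered gutenberg_metadata shows whether a record was dropped by one of those filters rather than removed entirely.

```r
library(dplyr)
library(gutenbergr)

# Inspect the unfiltered record for ID 140 and the columns
# that gutenberg_works() filters on
gutenberg_metadata %>%
  filter(gutenberg_id == 140) %>%
  select(gutenberg_id, title, language, rights, has_text)
```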

Problem with use of gutenberg_download function

I'm receiving the following error message when trying to download. Can someone give advice?

library(gutenbergr)
wuthering_heights <- gutenberg_download(768)
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://www.gutenberg.lib.md.us
Warning messages:
1: In .f(.x[[i]], ...) :
Could not download a book at http://www.gutenberg.lib.md.us/7/6/768/768.zip
2: Unknown column 'text'
3: In is.na(text) : is.na() applied to non-(list or vector) of type 'NULL'

GHA-only tests of API

Add a test or tests that run purely on a GHA basis to check that the download API still works. This is in contrast to #53, which will remove the (real) download tests from the package tests.

languages doesn't work

> library("gutenbergr")
> gutenberg_works(languages == "es")
Error in filter_impl(.data, quo) : 
  Evaluation error: object 'languages' not found.
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2     gutenbergr_0.1.3

loaded via a namespace (and not attached):
 [1] compiler_3.4.1   lazyeval_0.2.0   magrittr_1.5     assertthat_0.2.0
 [5] R6_2.2.2         glue_1.1.1       dplyr_0.7.2      tibble_1.3.3    
 [9] Rcpp_0.12.12     pkgconfig_2.0.1  rlang_0.1.2      bindr_0.1     

Unable to download text with id 5001

I am trying to download Einstein's book on Relativity from the website: it has id '5001'.

Trying to download it using the command:

gutenberg_download('5001')

Warning in .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/5/0/0/5001/5001.zip
Warning: Unknown or uninitialised column: 'text'.

From looking around the site, it looks like the file has moved to 5001-h.zip, but I am not sure how to modify the URL to do this properly.

Addition of function to retrieve all PG mirrors

I've been looking over this repo and others at rOpenSci and am interested in contributing. In working with {gutenbergr} I found that it would be helpful for me to get all the PG mirrors programmatically, perhaps something like the following that parses the available markdown table into a tbl_df:

gutenberg_get_all_mirrors <- function() {
  mirrors_url <- "https://www.gutenberg.org/MIRRORS.ALL"
  mirrors <- suppressWarnings(
    readr::read_delim(
      mirrors_url,
      delim = "|",
      trim_ws = TRUE
    ) %>%
      dplyr::slice(2:(n() - 1))
  )

  return(mirrors)
}

If there's interest, I can put in a PR.

Separate catalog from package

The catalog is updated daily, but we only update data-raw when we think about it. We should separate the data into a separate data package with more frequent updates, so we can keep the "functions for using Project Gutenberg" package history cleanly separated from the "catalog data" history.

We could update the data package daily on GitHub via GitHub actions. We could then set up a schedule or rules for how often to update that other package on CRAN (perhaps once X titles have changed or something along those lines). We could also include instructions in this package for downloading the dev version of the data package.

This project will have multiple moving parts, and we'll need to talk to someone about the policies of ropensci for splitting off a dependency like that (I suspect we might have to put it through the full review process).

There's no rush on this, I just wanted to log my thoughts.

Include more metadata

We throw away some of the metadata (namely at least Category, and "Release Date", plus the list of formats; possibly other things that are only in the XML but not shared on the website). Consider adding that data in a way that won't break existing code.

`gutenberg_download()` not working: "Could not download a book at http://aleph.gutenberg.org/x/x/x/x.zip"

I haven't been able to get the gutenberg_download() function to work at all. A couple of examples, both drawn from the examples provided in the readme:

> gutenberg_download(768)
# A tibble: 0 x 2
# … with 2 variables: gutenberg_id <int>, text <chr>
Warning messages:
1: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/7/6/768/768.zip
2: Unknown or uninitialised column: `text`. 
> aristotle_books <- gutenberg_works(author == "Aristotle") %>%
+     gutenberg_download(meta_fields = "title")
Error: Problem with `mutate()` column `gutenberg_id`.
ℹ `gutenberg_id = as.integer(gutenberg_id)`.
✖ `gutenberg_id` must be size 0 or 1, not 7.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/1/9/7/1974/1974.zip
2: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/2/4/1/2412/2412.zip
3: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/6/7/6/6762/6762.zip
4: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/6/7/6/6763/6763.zip
5: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/8/4/3/8438/8438.zip
6: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/1/2/6/9/12699/12699.zip
7: In .f(.x[[i]], ...) :
  Could not download a book at http://aleph.gutenberg.org/2/6/0/9/26095/26095.zip

I assume that the issue is to do with the aleph.gutenberg.org mirror, since none of the addresses it generates give me anything when pasted into a web browser. FWIW I have been able to use gutenberg_works() fine on its own to return queried metadata, just not gutenberg_download().
I'm running gutenbergr 0.2.0, downloaded from CRAN.


Session info:
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Release gutenbergr 0.2.2

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • git push

Submit to CRAN:

  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()

Add year field to metadata?

I'm thinking of using gutenbergr to create some sort of ad-hoc English word frequency table and would either filter or down-weight older usage, but there's no direct way to do so now. The gutenberg_authors data gives data about the authors' lifespan, which can give a crude approximation, but the year of authorship of the work itself would be more helpful.

Is this something that can be included readily?

License mismatch

Cool package, noticed a minor problem: license in README.md (MIT) doesn't match the license in DESCRIPTION (GPL-2).

Function read_zip_url is missing from my installation

Part of what is returned from
> library(help = "gutenbergr")
is this index of functions:
Index:

gutenberg_authors Metadata about Project Gutenberg authors
gutenberg_download Download one or more works using a Project Gutenberg ID
gutenberg_get_mirror Get the recommended mirror for Gutenberg files
gutenberg_metadata Gutenberg metadata about each work
gutenberg_strip Strip header and footer content from a Project Gutenberg book
gutenberg_subjects Gutenberg metadata about the subject of each work
gutenberg_works Get a filtered table of Gutenberg work metadata
read_zip_url Read a file from a .zip URL

But if I list what is in that part of the search path, one is missing.
> ls(pos = 5)
[1] "gutenberg_authors"    "gutenberg_download"   "gutenberg_get_mirror"
[4] "gutenberg_metadata"   "gutenberg_strip"      "gutenberg_subjects"
[7] "gutenberg_works"

What could be an explanation for read_zip_url being missing? Without it, I'm very limited.

Investigate using Category for `has_text`

Project Gutenberg entries have a "Category". The ones I have seen recently are "Text" and "Audio". See if we can use that to more safely sort things for the has_text field.

Release gutenbergr 0.2.4

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)

Parse Catalog Faster

The all_metadata step of parse_rdfs.R is very, very slow. This makes debugging tedious. Some of this slowness might be unavoidable (we're parsing a lot of data), but try to optimize if possible.

The Project Gutenberg docs imply that there's a single XML/RDF file available, but I don't see it. That would presumably be much faster to parse.

Update metadata / refactor metadata code

The package metadata, which is currently derived from scraping the Gutenberg website, is out of date. We'd like this to be completed so we can submit it as part of getting this back on CRAN with a maintainer transfer (#30).

@Myfanwy has helpfully investigated and describes the issue as follows:

The package depends on some python libraries that are broken...I was unable to run the code that would bring the metadata up to date.

Separately, I investigated the Project Gutenberg API, and they have quite a few machine-readable options that (I think) might be more straightforward than web-scraping the metadata, which seems to be the package's current approach. However, I couldn't immediately reproduce all of the intermediate fields that gutenbergr creates in the script that tidies the web-scraping results in the time that I had to work on it, so I stopped.

Seems like there are a few options, here in increasing order of commitment/time:

  1. we update the documentation to fix the CRAN issue (#31) , but do not update the metadata
  2. someone who is more experienced with python libraries can take a stab at updating the metadata with the existing script, and then I can update the documentation for CRAN
  3. We re-factor the package for a new major release that will be more stable going forward, but the updates to the core metadata may have breaking changes to previous code.

Transfer to ropensci

@dgrtwo We now transfer all onboarded repos to ropensci. Would it be ok if I transferred your repo soon? Do you want me to update all links after transfer (CI links, installation instructions) or do you prefer to do it yourself?

Add tidytext examples to tests and/or vignettes

It looks like a lot of issues (both open and closed) come from people working through examples from Text Mining with R. Let's bake those examples into our tests and/or vignettes, so we can spot issues before someone has to bring them to us.

CRAN version later than dev version

Hi @dgrtwo, @maelle, @noamross: I was hoping to update gutenberg_metadata and make a pull request, but noted that the CRAN version is later than the GitHub version. Is there somewhere else the package is being developed?

Filing this here at Noam's suggestion (he also mentioned something called 'commitment escalation', whatever that is... probably nothing ;-) )

Example returns 404

Hi,

Amazing project. However, one of your examples is returning a 404. Perhaps Gutenberg has moved things around.

aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")


Error in utils::download.file(url, tmp, quiet = TRUE) : 
  cannot open URL 'http://www.gutenberg.lib.md.us/8/4/3/8438/8438.zip'
In addition: Warning message:
In utils::download.file(url, tmp, quiet = TRUE) :
  cannot open: HTTP status was '404 Not Found'

What are the correct meta_fields IDs?

Hi! I've been trying to download certain books by ID, and I want the following metadata: title, author, author's year of birth, and author's year of death. Title and author work fine, but whenever I add birthdate and deathdate, which is what both the handbook and other websites say the IDs are for author's year of birth and death, RStudio returns the complaint that those columns do not exist.

This is the exact code I'm trying to use:
gutenberg_download(
  c(56404, 52993, 44019, 16299),
  meta_fields = c("title", "author", "birthdate", "deathdate")
)

Anyone know what I'm doing wrong? Thanks!
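One likely answer (a sketch): meta_fields draws on columns of gutenberg_metadata, while birthdate and deathdate live in the gutenberg_authors dataset, so request gutenberg_author_id and join:

```r
library(dplyr)
library(gutenbergr)

books <- gutenberg_download(
  c(56404, 52993, 44019, 16299),
  meta_fields = c("title", "author", "gutenberg_author_id")
)

# Add the authors' birth/death years from the gutenberg_authors dataset
books <- books %>%
  left_join(
    gutenberg_authors %>% select(gutenberg_author_id, birthdate, deathdate),
    by = "gutenberg_author_id"
  )
```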

Many common titles cannot be found on any mirror

For some reason, many books that worked fine a few months ago have stopped working with gutenberg_download(), regardless of mirror settings.

For example, here are 4 common Shakespearean tragedies:

library(gutenbergr)

tragedy_ids <- c(
  1524,  # Hamlet
  1532,  # King Lear
  1533,  # Macbeth
  1513   # Romeo and Juliet
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title"
)
#> Error in `dplyr::mutate()`:
#> ℹ In argument: `gutenberg_id = as.integer(gutenberg_id)`.
#> Caused by error:
#> ! `gutenberg_id` must be size 0 or 1, not 4.
#> Run `rlang::last_trace()` to see where the error occurred.
#> Warning messages:
#> 1: ! Could not download a book at http://aleph.gutenberg.org/1/5/2/1524/1524.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 2: ! Could not download a book at http://aleph.gutenberg.org/1/5/3/1532/1532.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 3: ! Could not download a book at http://aleph.gutenberg.org/1/5/3/1533/1533.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 4: ! Could not download a book at http://aleph.gutenberg.org/1/5/1/1513/1513.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 

That's typically a sign that there are issues with the mirror (see #28), so we can specify a different mirror. Every mirror, however, leads to the same error:

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://mirrors.xmission.com/gutenberg"
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://gutenberg.pglaf.org"
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://gutenberg.nabasny.com"
)

#> 1: ! Could not download a book at https://mirrors.xmission.com/gutenberg/1/5/2/1524/1524.zip.
#> 1: ! Could not download a book at https://gutenberg.pglaf.org/1/5/2/1524/1524.zip.
#> 1: ! Could not download a book at https://gutenberg.nabasny.com/1/5/2/1524/1524.zip.

Visiting the mirror site in a browser and hunting through the file system shows that the corresponding .zip files don't exist there either:

[Screenshot: the mirror's directory listing, confirming the corresponding .zip files are missing.]

That page was last edited on June 27, 2023, so I wonder if something changed on Project Gutenberg's end?

Some of these books have alternative IDs (found with gutenberg_works()) that do work, but not all. Romeo and Juliet (1513), for instance, does not, which makes it inaccessible.

# Hamlet, King Lear, and Macbeth have alternative versions that work:
# 2265 - Hamlet
# 2266 - King Lear
# 2264 - Macbeth

# This works!
some_tragedies <- gutenberg_download(
  c(2265, 2266, 2264),
  meta_fields = "title"
)

# Romeo and Juliet doesn't have an alternative version, so it doesn't work, regardless of the mirror
romeo_juliet <- gutenberg_download(
  1513,
  meta_fields = "title",
  mirror = "https://gutenberg.pglaf.org"
)
#> Warning messages:
#> 1: ! Could not download a book at https://gutenberg.pglaf.org/1/5/1/1513/1513.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 2: Unknown or uninitialised column: `text`. 

Rename default branch to main

Per more recent conventions, it would be good to update the default branch of this repo from master to main. Someone with admin rights can do that with usethis::git_default_branch_rename(from = "master", to = "main").

Update (and require) GHA

Something about the GHA is at least somewhat non-standard right now, and for reasons I haven't figured out yet that appears to be causing me to be unable to require those checks. Clean up the GHA so I can be a little safer.

problem with spanish encodings

I'm trying to download a book in Spanish:
books <- gutenberg_download(2000, meta_fields = c("title", "author"))

The text downloads, but the accent marks and the ñ do not display correctly.
Sorry if it's not the correct place for this question.
Thanks in advance

querying metadata is not returning all needed metadata

When I use gutenberg_metadata or gutenberg_works, the query is not returning all of the metadata, including, most importantly, the Gutenberg ID, or quite a number of other fields listed in the docs. For example, if I do this:

library(gutenbergr)
library(stringr)
some_metadata <- gutenberg_works(author == "Shakespeare, William", !str_detect(title, "Works"))

I get a data frame that is 66 x 2 that includes only columns for title and gutenberg_author_id (all 65 in this case) but nothing else, no gutenberg_id which is what I would need to go on to do gutenberg_download. Have you seen this? Any ideas?

Here is my session info:

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_1.0.0        gutenbergr_0.1.1     tidytext_0.1.0.9000  Matrix_1.2-6         janeaustenr_0.1.0    topicmodels_0.2-4   
 [7] dplyr_0.4.3.9001     tm_0.6-2             NLP_0.1-9            testthat_1.0.2       devtools_1.11.1.9000
