
getpapers's Introduction


Get metadata, fulltexts, or fulltext URLs of papers matching a search query, using any of the following APIs:

  • EuropePMC
  • IEEE
  • ArXiv
  • Crossref (metadata, no fulltext)

getpapers can fetch article metadata, fulltexts (PDF or XML), and supplementary materials. It's designed for use in content mining, but you may find it useful for quickly acquiring large numbers of papers for reading, or for bibliometrics.

Installation

Installing Node.js

Please follow these cross-platform instructions

Installing getpapers

$ npm install --global getpapers

Usage

Use getpapers --help to see the command-line help:

    -h, --help                output usage information
    -V, --version             output the version number
    -q, --query <query>       search query (required)
    -o, --outdir <path>       output directory (required - will be created if not found)
    --api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
    -x, --xml                 download fulltext XMLs if available
    -p, --pdf                 download fulltext PDFs if available
    -s, --supp                download supplementary files if available
    -t, --minedterms          download text-mined terms if available
    -l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
    -a, --all                 search all papers, not just open access
    -n, --noexecute           report how many results match the query, but don't actually download anything
    -f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
    -k, --limit <int>         limit the number of hits and downloads
    --filter <filter object>  filter by key value pair, passed straight to the crossref api only
    -r, --restart             restart file downloads after failure

By default, getpapers uses the EuropePMC API.
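The debug lines in the issues below show the REST URL getpapers builds for this API. Here is a minimal sketch of that construction; that `OPEN_ACCESS:y` is appended unless `--all` is given is an assumption inferred from those logs, and `buildEupmcUrl` is a hypothetical name:

```javascript
// Sketch: build a EuropePMC search URL like the one seen in debug logs.
// Assumption: getpapers appends " OPEN_ACCESS:y" unless --all is set.
function buildEupmcUrl(query, all) {
  var base = 'http://www.ebi.ac.uk/europepmc/webservices/rest/search/';
  var q = all ? query : query + ' OPEN_ACCESS:y';
  return base + 'query=' + encodeURIComponent(q) + '&resulttype=core';
}

console.log(buildEupmcUrl('Gasteria', false));
// http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria%20OPEN_ACCESS%3Ay&resulttype=core
```

The encoded form matches the `debug:` URL that appears in the EuropePMC issue reports below.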

Screenshot

screenshot

Query formats

Each API has its own query format. Usage guides are provided on our wiki.

License

Copyright (c) 2014 Shuttleworth Foundation. Licensed under the MIT license.

Caveats

  • The remote site may time out or hang (we have found that if EPMC receives a query with no results, it times out).
  • Be careful not to download the whole site. Use the -k option to limit downloads (this should be a default).

getpapers's People

Contributors

blahah, chartgerink, chreman, jkbcm, katrinleinweber, matthewgthomas, petermr, tarrow


getpapers's Issues

eupmc: -p stumbles on BMC OA papers (?)

Success:
getpapers -q Gasteria --api eupmc -s -l verbose --outdir ./blah

Success:
getpapers -q Gasteria --api eupmc -x -l verbose --outdir ./blah

Fail:

$ getpapers -q Gasteria --api eupmc -p   -l verbose  --outdir ./blah
info: Searching using eupmc API
debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria%20OPEN_ACCESS%3Ay&resulttype=core
info: Found 13 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC3978243" had no fulltext HTML url
warn: Article with pmcid "PMC3605904" had no fulltext HTML url
warn: Article with pmcid "PMC3371435" had no fulltext HTML url
warn: Article with pmcid "PMC3364152" had no fulltext HTML url
warn: Article with pmcid "PMC3066391" had no fulltext HTML url
warn: Article with pmcid "PMC2871514" had no fulltext HTML url
warn: Article with pmcid "PMC2602195" had no fulltext HTML url
info: Fulltext HTML URL list written to fulltext_html_urls.txt
warn: Article with pmcid "PMC3371435" had no fulltext PDF url
info: Downloading fulltext PDF files
debug: Creating directory: PMC4377467/
debug: Downloading PDF: http://europepmc.org/articles/PMC4377467?pdf=render
debug: Creating directory: PMC3978243/
debug: Downloading PDF: http://europepmc.org/articles/PMC3978243?pdf=render
debug: Creating directory: PMC4152747/
debug: Downloading PDF: http://europepmc.org/articles/PMC4152747?pdf=render
debug: Creating directory: PMC3729011/
debug: Downloading PDF: http://europepmc.org/articles/PMC3729011?pdf=render
debug: Creating directory: PMC3605904/
debug: Downloading PDF: http://europepmc.org/articles/PMC3605904?pdf=render
debug: Creating directory: PMC3371435/
debug: Downloading PDF: http://europepmc.org/articles/PMC3364152?pdf=render
debug: Creating directory: PMC3364152/
debug: Downloading PDF: http://europepmc.org/articles/PMC3305877?pdf=render
debug: Creating directory: PMC3305877/
debug: Downloading PDF: http://europepmc.org/articles/PMC3066391?pdf=render
debug: Creating directory: PMC3066391/
debug: Downloading PDF: http://europepmc.org/articles/PMC2141413?pdf=render
debug: Creating directory: PMC2141413/
debug: Downloading PDF: http://www.biomedcentral.com/content/pdf/1478-5854-9-18.pdf
debug: Creating directory: PMC2871514/
debug: Downloading PDF: http://europepmc.org/articles/PMC2602195?pdf=render
debug: Creating directory: PMC2602195/
debug: Downloading PDF: http://www.biomedcentral.com/content/pdf/1471-2229-10-32.pdf
Downloading files [===---------------------------] 8% (eta 0.0s)
/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at endWritable (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:509:3)
    at BufferStream.Writable.end (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:474:5)
    at Unzip.onend (_stream_readable.js:502:10)
    at Unzip.g (events.js:180:16)
    at Unzip.emit (events.js:117:20)

PS Gasteria is one of my favourite genera of succulent plants :)

If using the IEEE API, getpapers throws an error if 0 results are found

Example Query

$ getpapers -q 'cs:"Syracuse University"' -o Output10 --api 'ieee'

Output

info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:449:9
    at Parser.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.EventEmitter.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

Note: I am not sure if the query format is correct. It may have to be

 -q 'cs="Syracuse University"'

The XML returned by IEEE is unhelpful and is copied below. Why not just keep the same format as when you get results, with a total count of 0?

<Error>Cannot go to record 1 since query  only returned 0 records</Error>
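A guard on the parsed response would avoid this crash. In this sketch, `totalFound` and the `data.root.totalfound` path are assumptions based on the stack trace and the `<Error>` payload above:

```javascript
// Sketch: guard the IEEE response before reading totalfound.
// Assumption: xml2js parses a normal response into data.root with a
// totalfound field, and the zero-hit case into an Error payload instead.
function totalFound(data) {
  if (!data || !data.root || typeof data.root.totalfound === 'undefined') {
    return 0; // treat the <Error> payload as zero results
  }
  return parseInt(data.root.totalfound, 10);
}

console.log(totalFound({ Error: 'Cannot go to record 1 ...' })); // 0
console.log(totalFound({ root: { totalfound: '42' } }));         // 42
```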

--api ieee -x the warning shows, but not a graceful exit

When I saw the graceful exit displayed by --api arxiv -x:

$ getpapers -q 'Gasteria' --api arxiv -x  --outdir ./car
info: Searching using arxiv API
warn: The ArXiv API does not provide fulltext XML, so the --xml flag will be ignored
info: Found 0 results

I realised that even though the warning is clearly displayed, this should still be flagged as a bug because of the non-graceful exit:

$ getpapers -q 'Gasteria' --api ieee -x  --outdir ./car
info: Searching using ieee API
warn: The IEEE API does not provide fulltext XML, so the --xml flag will be ignored

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

ieee -a fails when search hits == 0

I saw the no-XML warning when using ieee with -x, but there's no similar warning here:

getpapers -q 'Gasteria' --api ieee -a  --outdir ./car
info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

arxiv: q-bio not recognised as a distinct subject class

$ getpapers -q 'dinosaurs' --api arxiv -p  --outdir ./car
# it all worked but ...
$ cd car
$ ls
0704.1912v4  1302.3267v1  arxiv_results.json  cond-mat
1209.5439v1  1302.5142v1  astro-ph            hep-ph

The subfolders by subject class are cool, e.g. astro-ph, hep-ph & cond-mat
... but it looks like q-bio isn't similarly recognised!
http://arxiv.org/abs/1209.5439 (1209.5439v1) should be in a q-bio folder, following that convention.
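Assuming the subject folder is derived from the article identifier, the gap is that new-style arXiv ids like 1209.5439 carry no class prefix, so the primary category has to come from the result metadata instead. A sketch of that distinction (`subjectFolder` and the `primaryCategory` argument are hypothetical):

```javascript
// Sketch: choose a subject-class folder for an arXiv result.
// Old-style ids like "cond-mat/0703772" embed the class; new-style
// ids like "1209.5439" do not, so the primary category (e.g. "q-bio.PE")
// must be taken from the result metadata when available.
function subjectFolder(id, primaryCategory) {
  var slash = id.indexOf('/');
  if (slash > -1) return id.slice(0, slash);                 // old-style id
  if (primaryCategory) return primaryCategory.split('.')[0]; // e.g. q-bio
  return '.';                                                // no class known
}

console.log(subjectFolder('cond-mat/0703772', null)); // cond-mat
console.log(subjectFolder('1209.5439', 'q-bio.PE'));  // q-bio
```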

friendlier saving of log info to file [enhancement request]

I know when I asked about this last time (for quickscrape) you suggested this to save the log output:

command blah blah... 2>&1 | tee log.txt 

but that doesn't seem very friendly or intuitive for shell newbies; a lot of shell knowledge is required.
Compare with wget's inbuilt functionality: wget URL -o log.log

I for one would love to have an -o log-to-file option in both getpapers and quickscrape please :)

empty folders created

In cases where neither PDF nor XML is found, folders are created anyway. This may be irritating when interpreting results and working with e.g. norma and other tools.

When downloading fullTextXML for a particular pmcid, if eupmc returns a 404, getpapers crashes

The url http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693416/fullTextXML returns a 404 and this crashes getpapers.

Command executed

$ getpapers -q 'dinosaurs' -l 'debug' -x -s -p -a -o dinosaursOutput4 >> dinosaursOutput4.log

The output at the location it crashed

debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693416/fullTextXML
Downloading files [------------------------------] 0% (eta 0.0s)
/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.EventEmitter.emit (events.js:117:20)
    at finishMaybe (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:502:14)
    at endWritable (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:512:3)
    at BufferStream.Writable.end (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:477:5)
    at IncomingMessage.onend (_stream_readable.js:483:10)
    at IncomingMessage.g (events.js:180:16)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)

Version of getpapers: 0.3.0
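The crash above comes from eupmc.js invoking a `fourohfour` handler that is undefined in this code path. A defensive sketch that logs and skips instead of crashing (`handleResponse` and its argument names are hypothetical, echoing the trace):

```javascript
// Sketch: guard an optional 404 handler instead of calling it blindly.
// Returns false when the download should be skipped, true otherwise.
function handleResponse(response, fourohfour) {
  if (response.statusCode === 404) {
    if (typeof fourohfour === 'function') {
      fourohfour();
    } else {
      console.warn('warn: download returned 404, skipping');
    }
    return false;
  }
  return true;
}

console.log(handleResponse({ statusCode: 404 }, undefined)); // false
console.log(handleResponse({ statusCode: 200 }, undefined)); // true
```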

PLOS ONE dinosaurs search inconsistency

Why can't a metadata-only getpapers run supply the user with a list of dinosaur-related fulltext URLs from PLOS ONE?

(edit: same for PeerJ & eLife. Even when doing metadata-only searches, I would like/expect getpapers to output a fulltext_urls.txt file.)

$ getpapers -q 'dinosaurs JOURNAL:"PLOS ONE"' --api eupmc -o plos_test_eupmc
info: Searching using eupmc API
info: Found 350 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 325 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4439161" had no fulltext HTML url
warn: Article with pmcid "PMC4373905" had no fulltext HTML url
... snip 322 similar warnings snip ...
$ ls
eupmc_results.json

The JSON file from the above metadata only query returns 325 items.

Compare this with the same search with -p added, where the JSON file contains 750 records, the URL file contains 18 entries, and ~33 PDFs were downloaded. Super inconsistent!

getpapers -q 'dinosaurs' --api eupmc -p -o pdf_test_eupmc
wc pdf_test_eupmc/fulltext_html_urls.txt 
 18  19 778 fulltext_html_urls.txt

Inconsistent type of URL returned for simple EUPMC search

getpapers -q extremophiles --outdir ./extremophiles

The returned fulltext_html_urls.txt file contains a list of 836 URLs. Initially these are 100% DOIs; however, from about the 67th entry to the end, the URLs mysteriously switch from DOIs to mostly being of the form http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=24961215
(and each of these does have a DOI; it's not because they are DOI-less papers)

It doesn't appear to be associated with particular journals. Some journals, e.g. PLOS ONE, appear as DOIs if they were among the first ~50 results returned, and as PMID-based links if they were later in the list (e.g. results 100-836).

Odd behaviour.

eupmc: -a AND -x fails

Either switch on its own is fine, but combining -a AND -x crashes it.
Expected behaviour: search all papers (not just OA), and download the fulltext XML of the OA ones.

$ getpapers -q 'Gasteria' --api eupmc -a -x   -l verbose  --outdir ./zar
info: Searching using eupmc API
debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core
info: Found 61 results
Retrieving results [============------------------] 41% (eta 0.0s)debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core&page=1
Retrieving results [=========================-----] 82% (eta 0.1s)debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core&page=2
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 50 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC3978243" had no fulltext HTML url
warn: Article with pmcid "PMC3690993" had no fulltext HTML url
warn: Article with pmcid "PMC3605904" had no fulltext HTML url
warn: Article with pmcid "PMC3371435" had no fulltext HTML url
warn: Article with pmcid "PMC3364152" had no fulltext HTML url
warn: Article with pmcid "PMC3286288" had no fulltext HTML url
warn: Article with pmcid "PMC3066391" had no fulltext HTML url
warn: Article with pmcid "PMC3043933" had no fulltext HTML url
warn: Article with pmcid "PMC3002468" had no fulltext HTML url
warn: Article with pmcid "PMC4242389" had no fulltext HTML url
warn: Article with pmcid "PMC4233834" had no fulltext HTML url
warn: Article with pmcid "PMC2871514" had no fulltext HTML url
warn: Article with pmcid "PMC1460952" had no fulltext HTML url
warn: Article with pmcid "PMC2602195" had no fulltext HTML url
warn: Article with pmcid "PMC1693204" had no fulltext HTML url
warn: Article with pmcid "PMC1203420" had no fulltext HTML url
warn: Article with pmcid "PMC1208741" had no fulltext HTML url
warn: Article with pmcid "PMC1209182" had no fulltext HTML url
warn: Article with pmcid "PMC1208723" had no fulltext HTML url
info: Fulltext HTML URL list written to fulltext_html_urls.txt
warn: Article with title "Gasteria plant named 'WT10' did not have a PMCID (therefore no XML)
warn: Article with pmid "8835456 did not have a PMCID (therefore no XML)
warn: Article with pmid "9087376 did not have a PMCID (therefore no XML)
warn: Article with pmid "24240951 did not have a PMCID (therefore no XML)
warn: Article with pmid "18775771 did not have a PMCID (therefore no XML)
warn: Article with pmid "13033248 did not have a PMCID (therefore no XML)
warn: Article with pmid "5594485 did not have a PMCID (therefore no XML)
warn: Article with pmid "18098797 did not have a PMCID (therefore no XML)
info: Downloading fulltext XML files
debug: Creating directory: PMC4377467/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4377467/fullTextXML
debug: Creating directory: PMC4202640/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4202640/fullTextXML
debug: Creating directory: PMC4202400/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4202400/fullTextXML
debug: Creating directory: PMC3978243/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3978243/fullTextXML
debug: Creating directory: PMC4152747/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4152747/fullTextXML
debug: Creating directory: PMC3690993/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3690993/fullTextXML
debug: Creating directory: PMC3729011/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3729011/fullTextXML
debug: Creating directory: PMC3605904/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3605904/fullTextXML
debug: Creating directory: PMC3371435/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3371435/fullTextXML
debug: Creating directory: PMC3364152/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3364152/fullTextXML
debug: Creating directory: PMC3305877/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3305877/fullTextXML
debug: Creating directory: PMC3286288/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3286288/fullTextXML
debug: Creating directory: Gasteria plant named 'WT10'/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3066391/fullTextXML
debug: Creating directory: PMC3066391/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3043933/fullTextXML
debug: Creating directory: PMC3043933/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3002468/fullTextXML
debug: Creating directory: PMC3002468/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2141413/fullTextXML
debug: Creating directory: 8835456/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4242389/fullTextXML
debug: Creating directory: 9087376/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4233834/fullTextXML
debug: Creating directory: 24240951/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4257511/fullTextXML
debug: Creating directory: 18775771/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2871514/fullTextXML
debug: Creating directory: 13033248/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1460952/fullTextXML
debug: Creating directory: PMC2141413/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4243674/fullTextXML
debug: Creating directory: PMC4242389/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2602195/fullTextXML
debug: Creating directory: PMC4233834/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1462085/fullTextXML
debug: Creating directory: PMC4257511/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693204/fullTextXML
debug: Creating directory: PMC2871514/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2707325/fullTextXML
debug: Creating directory: PMC1460952/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2844068/fullTextXML
debug: Creating directory: PMC4243674/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2707867/fullTextXML
debug: Creating directory: PMC2602195/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1203420/fullTextXML
debug: Creating directory: PMC1462085/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2803647/fullTextXML
debug: Creating directory: PMC1693204/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1078655/fullTextXML
debug: Creating directory: PMC2707325/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076571/fullTextXML
debug: Creating directory: 5594485/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1077958/fullTextXML
debug: Creating directory: 18098797/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209227/fullTextXML
debug: Creating directory: PMC2844068/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076789/fullTextXML
debug: Creating directory: PMC2707867/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1208741/fullTextXML
debug: Creating directory: PMC1203420/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076941/fullTextXML
debug: Creating directory: PMC2803647/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209172/fullTextXML
debug: Creating directory: PMC1078655/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC438925/fullTextXML
debug: Creating directory: PMC1076571/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209182/fullTextXML
debug: Creating directory: PMC1077958/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1200927/fullTextXML
debug: Creating directory: PMC1209227/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1208723/fullTextXML
Downloading files [=-----------------------------] 2% (eta 0.0s)
/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at endWritable (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:509:3)
    at BufferStream.Writable.end (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:474:5)
    at IncomingMessage.onend (_stream_readable.js:502:10)
    at IncomingMessage.g (events.js:180:16)
    at IncomingMessage.emit (events.js:117:20)
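One way to get the expected behaviour is to filter the combined result set down to entries with a PMCID before queueing XML downloads; the warn lines above show pmid- and title-named directories being created for articles that can have no XML at EuropePMC. A sketch (`xmlCandidates` is a hypothetical helper; the `pmcid` field name follows the warnings above):

```javascript
// Sketch: with --all, keep only results that have a PMCID before
// queueing fulltext XML downloads (others have no XML at EuropePMC).
function xmlCandidates(results) {
  return results.filter(function (r) {
    return typeof r.pmcid === 'string' && r.pmcid.indexOf('PMC') === 0;
  });
}

var sample = [{ pmcid: 'PMC4377467' }, { pmid: '8835456' }];
console.log(xmlCandidates(sample).length); // 1
```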

Biology Direct paper 'had no fulltext HTML url' -- really?

getpapers -q extremophiles --outdir ./extremophiles (again)

many warning lines such as:

warn: Article with pmcid "PMC1586193" had no fulltext HTML url

...so I checked to see what PMC1586193 is and it turns out it's a Biology Direct paper (Rooting the tree of life by transition analyses):
http://europepmc.org/articles/PMC1586193

Looking at the above EUPMC URL in a web browser, it appears EUPMC does have a copy of the full text of the paper. I don't know whether this is an issue with getpapers or EUPMC, but it seems odd.

Incidentally, I got 283 of those warnings. For a search that returns 836 results, that's quite a high proportion!

Nested Boolean searches

Hi,

I am trying to do nested Boolean searches, but I immediately receive an error that an operator was unexpected. More specifically, it is the operator combining the two Boolean clauses that creates the error. These kinds of searches do work in EuropePMC directly, btw.

The search I used is getpapers --query '(TITLE: "QRP" AND TITLE:"misconduct") OR (PUB_TYPE:"retraction of publication")' --outdir test, see also the attached image.

If I remove the Boolean part from OR onward I also receive an error, so it seems the parentheses might also be the cause of the problem. Any help on why this creates an error, and whether it is solvable, would be appreciated.

Kind regards,
Chris Hartgerink
[screenshot: nested boolean query error]
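One quick check is whether the parentheses and quotes survive URL encoding on the client side. A sketch, assuming getpapers percent-encodes the whole query string before sending it:

```javascript
// Sketch: inspect what a nested boolean query looks like once encoded,
// to check whether parentheses reach the EuropePMC API intact.
var query = '(TITLE:"QRP" AND TITLE:"misconduct") OR (PUB_TYPE:"retraction of publication")';
var encoded = encodeURIComponent(query);
console.log(encoded.indexOf('%22') > -1); // quotes are percent-encoded
console.log(encoded.indexOf('(') > -1);   // parentheses pass through unencoded
```

If the parentheses arrive intact, the error is more likely in how getpapers pre-parses the query than in the transport.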

Hangs when Network Drops During Search

Currently getpapers seems to hang silently if connectivity drops during either the search or the download stage. My workaround is to Ctrl-C and restart repeatedly. If the network stays stable for the whole cycle, it runs perfectly. This was using a CM-FTDM VM.

"no fulltext HTML url" could be located on NCBI site

warn: Article with pmcid "PMC4327751" had no fulltext HTML url
warn: Article with pmcid "PMC4015397" had no fulltext HTML url
warn: Article with pmcid "PMC4210678" had no fulltext HTML url
warn: Article with pmcid "PMC3260561" had no fulltext HTML url
warn: Article with pmcid "PMC3337047" had no fulltext HTML url
warn: Article with pmcid "PMC3026713" had no fulltext HTML url
warn: Article with pmcid "PMC3109237" had no fulltext HTML url
warn: Article with pmcid "PMC3023303" had no fulltext HTML url

Following http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015397/ manually, I could read the whole HTML version of the paper.

Search multiple APIs in one query?

I'm sure this is on the roadmap, but just to flag it up anyway:

It would be nice to search EUPMC + arxiv + IEEE with just one query; if people are really thirsty for knowledge they want it from anywhere!

No PNAS fulltext (PDF or XML) via getpapers

Very strange: it appears one can't get PNAS fulltext as either PDF or XML via getpapers!
Yet via the EuropePMC website there are clearly a lot of freely available fulltext articles with PDFs (not so sure about the availability of fulltext XML).

Absolutely zero fulltext downloads appear to be possible for PNAS or Science:

getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' -x --outdir pnas
info: Searching using eupmc API
info: Found 0 open access results
#include closed papers
 getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' --all --outdir pnas
info: Searching using eupmc API
info: Found 57575 results

Take Busch et al as the test case: http://europepmc.org/articles/PMC4321246
It is clearly available as free full text via EPMC, as HTML & a downloadable PDF.

#finds the paper because of --all switch
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]' --all --outdir busch
#DOES NOT find the paper
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]'  --outdir openbusch

404 for single open Elsevier PDF because it doesn't exist?

404 error when trying to get PDF of this one open article from the journal Academic Radiology:

I suspect it's because there is no PDF for this article at EuropePMC: http://europepmc.org/articles/PMC4234081 . Does getpapers just assume there's a PDF if XML is available?

getpapers --query 'Journal:"Academic Radiology" AND FIRST_PDATE:[2010-01-01 TO 2015-07-01]' -p  --outdir elseacrad
info: Searching using eupmc API
info: Found 1 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4234081" had no fulltext HTML url
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (eta 0.0s)

/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at afterWrite (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:378:3)
    at afterTick (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/node_modules/process-nextick-args/index.js:11:8)
    at process._tickCallback (node.js:448:13)

Downloading the XML for the same article works fine, no problem:

getpapers --query 'Journal:"Academic Radiology" AND FIRST_PDATE:[2010-01-01 TO 2015-07-01]' -x  --outdir elseacrad
info: Searching using eupmc API
info: Found 1 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4234081" had no fulltext HTML url
info: Downloading fulltext XML files
Downloading files [==============================] 100% (eta 0.0s)
info: All XML downloads succeeded!
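The crash itself looks like a bug in the 404 handling rather than in the missing PDF: eupmc.js:333 calls a `fourohfour()` that is undefined. A minimal defensive sketch (hypothetical names, not the actual getpapers code):

```javascript
// Sketch of a defensive 404 handler (hypothetical; the real crash is at
// eupmc.js:333, where `fourohfour()` is called but undefined): only call
// the handler if one was supplied, otherwise warn and carry on.
function handleMissingPdf(pmcid, fourohfour) {
  if (typeof fourohfour === 'function') return fourohfour(pmcid);
  return 'warn: no fulltext PDF for ' + pmcid;
}
```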

Display identity of timed out downloads

So I'm on wifi at the moment, trying to download a year's worth of Nature supp data. (The flaky wifi isn't the problem; it's to be expected.) I get this message:

warn: 20 downloads timed out. Retrying.

I'm not confident the retrying worked. Which 20 downloads out of a possible 2000 timed out?
e.g. the PDF for PMC123456, the supp data file for PMC654321, etc.

I think it would be better to print the exact identity of failed downloads to screen and/or file. Related to #43 in some ways.
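A minimal sketch of the kind of bookkeeping that would make the retry message useful (all names here are hypothetical, not getpapers' actual code):

```javascript
// Sketch: record the identity of each failed download so the retry
// warning can name them, instead of just reporting a count.
var failures = [];

function recordFailure(pmcid, filetype) {
  failures.push({ pmcid: pmcid, filetype: filetype });
}

function failureReport() {
  if (failures.length === 0) return 'all downloads succeeded';
  var lines = failures.map(function (f) {
    return '  ' + f.filetype + ' for ' + f.pmcid;
  });
  return failures.length + ' downloads timed out. Retrying:\n' + lines.join('\n');
}
```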

What is the purpose of the file fulltext_html_urls.txt

What is the purpose of the file fulltext_html_urls.txt available as part of the output?

Purpose: Search open access papers in eupmc for the query dinosaurs and download fulltext XMLs, supplementary files and fulltext PDFs if available

Query used

$ getpapers -q 'dinosaurs' -x -s -p -o dinosaursOutput2 >> dinosaursOutput2.log

This generated a fulltext_html_urls.txt file with 22 urls

Not all pmcids listed in fulltext_html_urls.txt had a corresponding fulltext.xml or fulltext.html file downloaded. Of the 22 urls with pmcids listed in the file, the breakdown of what I found was as follows:

  • 20 of the pmcids had an empty dir
  • 2 of the pmcids had a dir with a fulltext.xml file but an empty fulltext.html file
  • For each of the pmcids in the fulltext_html_urls.txt file, the output produced a message similar to the following one
    warn: Article with pmcid "PMC3381548" had no fulltext PDF url

fulltext_html_urls.txt naming issue

PMR and I think that the output fulltext_html_urls.txt should instead be named

APImethod_fulltext_html_urls.txt

e.g. eupmc_fulltext_html_urls.txt , ieee_fulltext_html_urls.txt , arxiv_fulltext_html_urls.txt

This would better follow the convention set by the .json results files, which are named similarly.
It's useful when you run a search for 'dinosaurs' in eupmc, then arxiv, then ieee, with all output going into the same outdir.
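The rename itself would be a one-liner; a sketch (hypothetical helper, not existing getpapers code):

```javascript
// Sketch: derive the URL-list filename from the API name, mirroring the
// convention already used for the <api>_results.json files.
function urlListFilename(api) {
  return api + '_fulltext_html_urls.txt';
}
```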

No warning that --api ieee -s won't work

ditto for ieee + -s

$ getpapers -q 'Gasteria' --api ieee -s  --outdir ./car
info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)
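Rather than crashing on the malformed response, getpapers could validate the download flags against the chosen API up front. A sketch, where the per-API capability lists are illustrative assumptions (crossref is metadata-only per the README; the others are guesses), not documented facts:

```javascript
// Sketch of an up-front capability check (hypothetical; getpapers does
// not currently do this): warn before searching instead of crashing.
// The per-API lists below are illustrative assumptions.
var SUPPORTED = {
  eupmc: ['xml', 'pdf', 'supp', 'minedterms'],
  ieee: [],
  arxiv: ['pdf'],
  crossref: []
};

function unsupportedFlags(api, flags) {
  var ok = SUPPORTED[api] || [];
  return flags.filter(function (f) { return ok.indexOf(f) === -1; });
}
```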

need a manifest file to preserve provenance info [important]

PMR and I strongly think we need a manifest.json of some sort to document each search:

  • the date and time of the getpapers search, the API used, and the search parameters
  • a full listing of all files downloaded from the getpapers query

Hence the suggested name of either 'manifest' or 'metadata' for this new JSON file.

If I do a search today, then just by looking at the output I will have no idea in 7 days' time what search I ran to get those results. PMR also thinks it's very important for downstream tools to have a manifest of all the files in the cmdir.
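A sketch of what such a manifest could contain (all field names are suggestions, not an existing getpapers format):

```javascript
// Sketch of the proposed manifest.json builder: record when the search
// ran, which API and query were used, and every file downloaded.
// Field names here are suggestions only.
function buildManifest(api, query, files) {
  return {
    tool: 'getpapers',
    date: new Date().toISOString(),
    api: api,
    query: query,
    files: files
  };
}
```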

no-execute (dry-run) mode, -n

My first test query using getpapers returned an unexpectedly large number of results (836), which it then proceeded to try to download...

I was wondering if getpapers could have a 'no-execute' mode, -n, similar to:

mmv -n
rsync -n

whereby getpapers would simply report the number of papers found for that query and NOT download anything. This is useful when you don't quite know how many matching papers you're going to get served.
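A sketch of the requested control flow (hypothetical structure, not getpapers' actual code path):

```javascript
// Sketch of a no-execute branch: report the hit count, then stop
// before any downloads happen.
function handleResults(hitCount, noexecute, download) {
  console.log('info: Found ' + hitCount + ' results');
  if (noexecute) return 'dry-run: nothing downloaded';
  return download();
}
```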

PLOS ONE 'supp materials' are mostly just figures NOT SI

Single-paper demonstration of the issue: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0079155 I expected getpapers to download the supp info: File S1 (9 MB PDF). EuropePMC does hold this supp file: http://europepmc.org/articles/PMC3838368/bin/pone.0079155.s001.pdf

But getpapers unexpectedly returns all 9 main-paper figure images as supp info, NOT the real supp info.

getpapers --query 'JOURNAL:"PLOS ONE" TITLE:"Monogenean lost clamps"' -s  --outdir plosmono
cd plosmono/PMC3838368/
unzip supplementaryFiles.zip
ls
pone.0079155.g001.jpg  pone.0079155.g005.jpg  pone.0079155.g009.jpg
pone.0079155.g002.jpg  pone.0079155.g006.jpg  supplementaryFiles.zip
pone.0079155.g003.jpg  pone.0079155.g007.jpg
pone.0079155.g004.jpg  pone.0079155.g008.jpg

A multiple-PLOS-ONE example, where only 3 out of 25 hits return real supplementary information files.
The three that do return real supp info are: PMC3665537, PMC3669350, PMC3692442.
All others, as is apparent from file names such as g001, contain just the figures from the main paper.

#Downloads 25 supplementary materials zip files
getpapers --query 'JOURNAL:"PLOS ONE" METHODS:"NHMUK"' -s --outdir plosbmnh
#unzip all
tree
.
├── eupmc_results.json
├── PMC3648582
├── PMC3665537
│   ├── pone.0065295.e001.jpg
│   ├── pone.0065295.e002.jpg
│   ├── pone.0065295.g001.jpg
│   ├── pone.0065295.g002.jpg
│   ├── pone.0065295.g003.jpg
│   ├── pone.0065295.g004.jpg
│   ├── pone.0065295.g005.jpg
│   ├── pone.0065295.g006.jpg
│   ├── pone.0065295.g007.jpg
│   ├── pone.0065295.s001.doc
│   ├── pone.0065295.s002.doc
│   ├── pone.0065295.s003.doc
│   ├── pone.0065295.s004.doc
│   ├── pone.0065295.s005.wmv
│   ├── pone.0065295.s006.wmv
│   ├── pone.0065295.s007.wmv
│   ├── pone.0065295.s008.wmv
│   └── supplementaryFiles.zip
├── PMC3669350
│   ├── pone.0064203.g001.jpg
│   ├── pone.0064203.g002.jpg
│   ├── pone.0064203.g003.jpg
│   ├── pone.0064203.g004.jpg
│   ├── pone.0064203.g005.jpg
│   ├── pone.0064203.g006.jpg
│   ├── pone.0064203.g007.jpg
│   ├── pone.0064203.g008.jpg
│   ├── pone.0064203.g009.jpg
│   ├── pone.0064203.g010.jpg
│   ├── pone.0064203.g011.jpg
│   ├── pone.0064203.s001.doc
│   ├── pone.0064203.s002.doc
│   ├── pone.0064203.s003.doc
│   ├── pone.0064203.s004.nex
│   └── supplementaryFiles.zip
├── PMC3692442
│   ├── pone.0067176.g001.jpg
│   ├── pone.0067176.g002.jpg
│   ├── pone.0067176.g003.jpg
│   ├── pone.0067176.g004.jpg
│   ├── pone.0067176.g005.jpg
│   ├── pone.0067176.s001.xls
│   ├── pone.0067176.s002.pdf
│   └── supplementaryFiles.zip
├── PMC3789696
│   ├── pone.0077457.g001.jpg
│   ├── pone.0077457.g002.jpg
│   ├── pone.0077457.g003.jpg
│   ├── pone.0077457.g004.jpg
│   └── supplementaryFiles.zip
├── PMC3838368
│   ├── pone.0079155.g001.jpg
│   ├── pone.0079155.g002.jpg
│   ├── pone.0079155.g003.jpg
│   ├── pone.0079155.g004.jpg
│   ├── pone.0079155.g005.jpg
│   ├── pone.0079155.g006.jpg
│   ├── pone.0079155.g007.jpg
│   ├── pone.0079155.g008.jpg
│   ├── pone.0079155.g009.jpg
│   └── supplementaryFiles.zip
├── PMC3847141
│   ├── pone.0080405.g001.jpg
│   ├── pone.0080405.g002.jpg
│   ├── pone.0080405.g003.jpg
│   ├── pone.0080405.g004.jpg
│   ├── pone.0080405.g005.jpg
│   ├── pone.0080405.g006.jpg
│   ├── pone.0080405.g007.jpg
│   ├── pone.0080405.g008.jpg
│   ├── pone.0080405.g009.jpg
│   ├── pone.0080405.g010.jpg
│   ├── pone.0080405.g011.jpg
│   ├── pone.0080405.g012.jpg
│   ├── pone.0080405.g013.jpg
│   ├── pone.0080405.g014.jpg
│   ├── pone.0080405.g015.jpg
│   ├── pone.0080405.g016.jpg
│   ├── pone.0080405.g017.jpg
│   ├── pone.0080405.g018.jpg
│   ├── pone.0080405.g019.jpg
│   ├── pone.0080405.g020.jpg
│   ├── pone.0080405.g021.jpg
│   ├── pone.0080405.g022.jpg
│   ├── pone.0080405.g023.jpg
│   ├── pone.0080405.g024.jpg
│   ├── pone.0080405.g025.jpg
│   ├── pone.0080405.g026.jpg
│   ├── pone.0080405.g027.jpg
│   ├── pone.0080405.g028.jpg
│   ├── pone.0080405.g029.jpg
│   ├── pone.0080405.g030.jpg
│   ├── pone.0080405.g031.jpg
│   ├── pone.0080405.g032.jpg
│   ├── pone.0080405.g033.jpg
│   ├── pone.0080405.g034.jpg
│   └── supplementaryFiles.zip
├── PMC3852158
│   ├── pone.0080974.g001.jpg
│   ├── pone.0080974.g002.jpg
│   ├── pone.0080974.g003.jpg
│   ├── pone.0080974.g004.jpg
│   ├── pone.0080974.g005.jpg
│   ├── pone.0080974.g006.jpg
│   ├── pone.0080974.g007.jpg
│   ├── pone.0080974.g008.jpg
│   ├── pone.0080974.g009.jpg
│   ├── pone.0080974.g010.jpg
│   ├── pone.0080974.g011.jpg
│   ├── pone.0080974.g012.jpg
│   ├── pone.0080974.g013.jpg
│   ├── pone.0080974.g014.jpg
│   ├── pone.0080974.g015.jpg
│   ├── pone.0080974.g016.jpg
│   ├── pone.0080974.g017.jpg
│   └── supplementaryFiles.zip
├── PMC3859474
│   ├── pone.0066075.g001.jpg
│   ├── pone.0066075.g002.jpg
│   ├── pone.0066075.g003.jpg
│   ├── pone.0066075.g004.jpg
│   ├── pone.0066075.g005.jpg
│   ├── pone.0066075.g006.jpg
│   └── supplementaryFiles.zip
├── PMC3897400
│   ├── pone.0084709.g001.jpg
│   ├── pone.0084709.g002.jpg
│   ├── pone.0084709.g003.jpg
│   ├── pone.0084709.g004.jpg
│   ├── pone.0084709.g005.jpg
│   ├── pone.0084709.g006.jpg
│   ├── pone.0084709.g007.jpg
│   ├── pone.0084709.g008.jpg
│   ├── pone.0084709.g009.jpg
│   ├── pone.0084709.g010.jpg
│   ├── pone.0084709.g011.jpg
│   ├── pone.0084709.g012.jpg
│   ├── pone.0084709.g013.jpg
│   ├── pone.0084709.g014.jpg
│   ├── pone.0084709.g015.jpg
│   ├── pone.0084709.g016.jpg
│   ├── pone.0084709.g017.jpg
│   ├── pone.0084709.g018.jpg
│   ├── pone.0084709.g019.jpg
│   ├── pone.0084709.g020.jpg
│   └── supplementaryFiles.zip
├── PMC3907582
│   ├── pone.0086864.g001.jpg
│   ├── pone.0086864.g002.jpg
│   ├── pone.0086864.g003.jpg
│   ├── pone.0086864.g004.jpg
│   ├── pone.0086864.g005.jpg
│   ├── pone.0086864.g006.jpg
│   ├── pone.0086864.g007.jpg
│   ├── pone.0086864.g008.jpg
│   ├── pone.0086864.g009.jpg
│   ├── pone.0086864.g010.jpg
│   ├── pone.0086864.g011.jpg
│   ├── pone.0086864.g012.jpg
│   ├── pone.0086864.g013.jpg
│   ├── pone.0086864.g014.jpg
│   ├── pone.0086864.g015.jpg
│   ├── pone.0086864.g016.jpg
│   ├── pone.0086864.g017.jpg
│   ├── pone.0086864.g018.jpg
│   ├── pone.0086864.g019.jpg
│   └── supplementaryFiles.zip
├── PMC3914794
│   ├── pone.0087048.g001.jpg
│   ├── pone.0087048.g002.jpg
│   ├── pone.0087048.g003.jpg
│   ├── pone.0087048.g004.jpg
│   ├── pone.0087048.g005.jpg
│   └── supplementaryFiles.zip
├── PMC3937355
│   ├── pone.0089165.g001.jpg
│   ├── pone.0089165.g002.jpg
│   ├── pone.0089165.g003.jpg
│   ├── pone.0089165.g004.jpg
│   ├── pone.0089165.g005.jpg
│   ├── pone.0089165.g006.jpg
│   ├── pone.0089165.g007.jpg
│   ├── pone.0089165.g008.jpg
│   ├── pone.0089165.g009.jpg
│   ├── pone.0089165.g010.jpg
│   ├── pone.0089165.g011.jpg
│   ├── pone.0089165.g012.jpg
│   ├── pone.0089165.g013.jpg
│   ├── pone.0089165.g014.jpg
│   ├── pone.0089165.g015.jpg
│   ├── pone.0089165.g016.jpg
│   ├── pone.0089165.g017.jpg
│   ├── pone.0089165.g018.jpg
│   ├── pone.0089165.g019.jpg
│   ├── pone.0089165.g020.jpg
│   ├── pone.0089165.g021.jpg
│   └── supplementaryFiles.zip
├── PMC3991637
│   ├── pone.0095296.g001.jpg
│   ├── pone.0095296.g002.jpg
│   ├── pone.0095296.g003.jpg
│   ├── pone.0095296.g004.jpg
│   ├── pone.0095296.g005.jpg
│   ├── pone.0095296.g006.jpg
│   ├── pone.0095296.g007.jpg
│   ├── pone.0095296.g008.jpg
│   ├── pone.0095296.g009.jpg
│   ├── pone.0095296.g010.jpg
│   ├── pone.0095296.g011.jpg
│   ├── pone.0095296.g012.jpg
│   ├── pone.0095296.g013.jpg
│   ├── pone.0095296.g014.jpg
│   ├── pone.0095296.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4118863
│   ├── pone.0103152.g001.jpg
│   ├── pone.0103152.g002.jpg
│   ├── pone.0103152.g003.jpg
│   ├── pone.0103152.g004.jpg
│   ├── pone.0103152.g005.jpg
│   ├── pone.0103152.g006.jpg
│   ├── pone.0103152.g007.jpg
│   ├── pone.0103152.g008.jpg
│   ├── pone.0103152.g009.jpg
│   ├── pone.0103152.g010.jpg
│   ├── pone.0103152.g011.jpg
│   ├── pone.0103152.g012.jpg
│   ├── pone.0103152.g013.jpg
│   ├── pone.0103152.g014.jpg
│   ├── pone.0103152.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4131922
│   ├── pone.0104551.g001.jpg
│   ├── pone.0104551.g002.jpg
│   ├── pone.0104551.g003.jpg
│   ├── pone.0104551.g004.jpg
│   ├── pone.0104551.g005.jpg
│   ├── pone.0104551.g006.jpg
│   ├── pone.0104551.g007.jpg
│   ├── pone.0104551.g008.jpg
│   ├── pone.0104551.g009.jpg
│   ├── pone.0104551.g010.jpg
│   ├── pone.0104551.g011.jpg
│   └── supplementaryFiles.zip
├── PMC4192354
│   ├── pone.0109785.g001.jpg
│   ├── pone.0109785.g002.jpg
│   ├── pone.0109785.g003.jpg
│   ├── pone.0109785.g004.jpg
│   ├── pone.0109785.g005.jpg
│   └── supplementaryFiles.zip
├── PMC4206445
│   ├── pone.0110646.e001.jpg
│   ├── pone.0110646.e002.jpg
│   ├── pone.0110646.g001.jpg
│   ├── pone.0110646.g002.jpg
│   ├── pone.0110646.g003.jpg
│   ├── pone.0110646.g004.jpg
│   ├── pone.0110646.g005.jpg
│   ├── pone.0110646.g006.jpg
│   ├── pone.0110646.g007.jpg
│   ├── pone.0110646.g008.jpg
│   ├── pone.0110646.g009.jpg
│   ├── pone.0110646.g010.jpg
│   ├── pone.0110646.g011.jpg
│   ├── pone.0110646.g012.jpg
│   ├── pone.0110646.g013.jpg
│   ├── pone.0110646.g014.jpg
│   ├── pone.0110646.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4269487
│   ├── pone.0113911.g001.jpg
│   ├── pone.0113911.g002.jpg
│   ├── pone.0113911.g003.jpg
│   ├── pone.0113911.g004.jpg
│   ├── pone.0113911.g005.jpg
│   ├── pone.0113911.g006.jpg
│   ├── pone.0113911.g007.jpg
│   ├── pone.0113911.g008.jpg
│   ├── pone.0113911.g009.jpg
│   ├── pone.0113911.g010.jpg
│   ├── pone.0113911.g011.jpg
│   ├── pone.0113911.g012.jpg
│   ├── pone.0113911.g013.jpg
│   ├── pone.0113911.g014.jpg
│   ├── pone.0113911.g015.jpg
│   ├── pone.0113911.g016.jpg
│   ├── pone.0113911.g017.jpg
│   ├── pone.0113911.g018.jpg
│   ├── pone.0113911.g019.jpg
│   ├── pone.0113911.g020.jpg
│   ├── pone.0113911.g021.jpg
│   ├── pone.0113911.g022.jpg
│   ├── pone.0113911.g023.jpg
│   ├── pone.0113911.g024.jpg
│   ├── pone.0113911.g025.jpg
│   └── supplementaryFiles.zip
├── PMC4382297
│   ├── pone.0120924.g001.jpg
│   ├── pone.0120924.g002.jpg
│   ├── pone.0120924.g003.jpg
│   ├── pone.0120924.g004.jpg
│   ├── pone.0120924.g005.jpg
│   ├── pone.0120924.g006.jpg
│   ├── pone.0120924.g007.jpg
│   ├── pone.0120924.g008.jpg
│   ├── pone.0120924.g009.jpg
│   ├── pone.0120924.g010.jpg
│   ├── pone.0120924.g011.jpg
│   ├── pone.0120924.g012.jpg
│   ├── pone.0120924.g013.jpg
│   ├── pone.0120924.g014.jpg
│   ├── pone.0120924.g015.jpg
│   ├── pone.0120924.g016.jpg
│   ├── pone.0120924.g017.jpg
│   ├── pone.0120924.g018.jpg
│   ├── pone.0120924.g019.jpg
│   ├── pone.0120924.g020.jpg
│   ├── pone.0120924.g021.jpg
│   ├── pone.0120924.g022.jpg
│   ├── pone.0120924.g023.jpg
│   ├── pone.0120924.g024.jpg
│   ├── pone.0120924.g025.jpg
│   ├── pone.0120924.g026.jpg
│   ├── pone.0120924.g027.jpg
│   ├── pone.0120924.g028.jpg
│   ├── pone.0120924.g029.jpg
│   ├── pone.0120924.g030.jpg
│   ├── pone.0120924.g031.jpg
│   └── supplementaryFiles.zip
├── PMC4406738
│   ├── pone.0123503.g001.jpg
│   ├── pone.0123503.g002.jpg
│   ├── pone.0123503.g003.jpg
│   ├── pone.0123503.g004.jpg
│   ├── pone.0123503.g005.jpg
│   └── supplementaryFiles.zip
├── PMC4454574
│   ├── pone.0125819.g001.jpg
│   ├── pone.0125819.g002.jpg
│   ├── pone.0125819.g003.jpg
│   ├── pone.0125819.g004.jpg
│   ├── pone.0125819.g005.jpg
│   ├── pone.0125819.g006.jpg
│   ├── pone.0125819.g007.jpg
│   ├── pone.0125819.g008.jpg
│   ├── pone.0125819.g009.jpg
│   ├── pone.0125819.g010.jpg
│   ├── pone.0125819.g011.jpg
│   ├── pone.0125819.g012.jpg
│   ├── pone.0125819.g013.jpg
│   ├── pone.0125819.g014.jpg
│   ├── pone.0125819.g015.jpg
│   ├── pone.0125819.g016.jpg
│   ├── pone.0125819.g017.jpg
│   ├── pone.0125819.g018.jpg
│   └── supplementaryFiles.zip
├── PMC4465186
│   ├── pone.0127727.g001.jpg
│   ├── pone.0127727.g002.jpg
│   └── supplementaryFiles.zip
├── PMC4480851
│   ├── pone.0129193.g001.jpg
│   ├── pone.0129193.g002.jpg
│   ├── pone.0129193.g003.jpg
│   ├── pone.0129193.g004.jpg
│   ├── pone.0129193.g005.jpg
│   ├── pone.0129193.g006.jpg
│   ├── pone.0129193.g007.jpg
│   ├── pone.0129193.g008.jpg
│   ├── pone.0129193.g009.jpg
│   └── supplementaryFiles.zip
└── PMC4480985
    ├── pone.0127621.g001.jpg
    ├── pone.0127621.g002.jpg
    ├── pone.0127621.g003.jpg
    ├── pone.0127621.g004.jpg
    ├── pone.0127621.g005.jpg
    ├── pone.0127621.g006.jpg
    ├── pone.0127621.g007.jpg
    ├── pone.0127621.g008.jpg
    └── supplementaryFiles.zip

25 directories, 360 files
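As the listing shows, genuine supplementary files carry an sNNN component (e.g. pone.0065295.s001.doc) while main-paper figures carry gNNN and inline equations eNNN. Until the upstream zips are fixed, a client-side filter could separate them; a sketch based only on this filename pattern:

```javascript
// Sketch: separate genuine supplementary files (sNNN) from main-paper
// figures (gNNN) and equation images (eNNN) by filename pattern.
function isRealSuppFile(name) {
  return /\.s\d{3}\./.test(name);
}

function onlySuppFiles(names) {
  return names.filter(isRealSuppFile);
}
```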

split up results.json and move to ctree

quickscrape creates a result.json in each ctree, while getpapers creates a single apiname_results.json, which gets overwritten with each search. It would make more sense to follow the quickscrape procedure here. This would also solve issue 45.

Can we have optional .csv output? Not everyone groks JSON

Another picky non-urgent issue but...

For a simple arxiv query, only JSON data is returned at the moment, as 'arxiv_results.json',
from e.g. getpapers -q dinosaurs --api arxiv --outdir ./dinos

If I want to open the results in a text editor just to check what I've got, JSON isn't nice for that, but .csv would be more universally understood, right? It would make the tool friendlier.
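A sketch of such a converter; the field names passed in are illustrative, not the exact EuropePMC or arXiv result schema:

```javascript
// Sketch: flatten chosen metadata fields from the results JSON into CSV,
// quoting values that contain commas, quotes, or newlines.
function toCsv(results, fields) {
  var esc = function (v) {
    var s = v == null ? '' : String(v);
    return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  var header = fields.join(',');
  var rows = results.map(function (r) {
    return fields.map(function (f) { return esc(r[f]); }).join(',');
  });
  return [header].concat(rows).join('\n');
}
```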

TypeError: Cannot read property 'hitCount' of undefined

getpapers --query '(JOURNAL:"bmc ecol") AND ((FIRST_PDATE:[2015-03-01 TO 2015-05-29]))' --outdir dirxxxxx

returned following error:

TypeError: Cannot read property 'hitCount' of undefined
at EuPmc.completeCallback (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:61:35)
at Request.EventEmitter.emit (events.js:98:17)
at Request.mixin._fireSuccess (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
at IncomingMessage.parsers.auto (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:394:7)
at Request.mixin._encode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:195:29)
at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:154:16
at Request.mixin._decode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:170:7)
at IncomingMessage.&lt;anonymous&gt; (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:147:14)
at IncomingMessage.EventEmitter.emit (events.js:117:20)
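The fix is presumably a defensive check before dereferencing the response at eupmc.js:61. A sketch (the response shape here is assumed from the error message, not the documented EuropePMC schema):

```javascript
// Defensive guard sketch: fail with a readable message instead of a
// TypeError when the API response lacks the expected hitCount field.
function getHitCount(data) {
  if (!data || data.hitCount == null) {
    throw new Error('Unexpected EuropePMC response: no hitCount found');
  }
  return Number(data.hitCount);
}
```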

Allow dumping of specific journal(s) and date range(s)

Rather than querying an API, this would perform direct mass downloads from a bulk source like the PubMed FTP, the arXiv FTP, CORE, etc.

This issue will track creation of the general interface - separate issues will track each specific data source.
