
getpapers's Introduction


Get metadata, fulltexts, or fulltext URLs of papers matching a search query, using any of the following APIs:

  • EuropePMC
  • IEEE
  • ArXiv
  • Crossref (metadata, no fulltext)

getpapers can fetch article metadata, fulltexts (PDF or XML), and supplementary materials. It's designed for use in content mining, but you may find it useful for quickly acquiring large numbers of papers for reading, or for bibliometrics.

Installation

Installing Node.js

Please follow these cross-platform instructions

Installing getpapers

$ npm install --global getpapers

Usage

Use getpapers --help to see the command-line help:

    -h, --help                output usage information
    -V, --version             output the version number
    -q, --query <query>       search query (required)
    -o, --outdir <path>       output directory (required - will be created if not found)
    --api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
    -x, --xml                 download fulltext XMLs if available
    -p, --pdf                 download fulltext PDFs if available
    -s, --supp                download supplementary files if available
    -t, --minedterms          download text-mined terms if available
    -l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)
    -a, --all                 search all papers, not just open access
    -n, --noexecute           report how many results match the query, but don't actually download anything
    -f, --logfile <filename>  save log to specified file in output directory as well as printing to terminal
    -k, --limit <int>         limit the number of hits and downloads
    --filter <filter object>  filter by key value pair, passed straight to the crossref api only
    -r, --restart             restart file downloads after failure

By default, getpapers uses the EuropePMC API.
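The debug lines in the issues below show the REST URL getpapers builds for this API. Here is a minimal sketch of that construction; that `OPEN_ACCESS:y` is appended unless `--all` is given is an assumption inferred from those logs, and `buildEupmcUrl` is a hypothetical name:

```javascript
// Sketch: build a EuropePMC search URL like the one seen in debug logs.
// Assumption: getpapers appends " OPEN_ACCESS:y" unless --all is set.
function buildEupmcUrl(query, all) {
  var base = 'http://www.ebi.ac.uk/europepmc/webservices/rest/search/';
  var q = all ? query : query + ' OPEN_ACCESS:y';
  return base + 'query=' + encodeURIComponent(q) + '&resulttype=core';
}

console.log(buildEupmcUrl('Gasteria', false));
// http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria%20OPEN_ACCESS%3Ay&resulttype=core
```

The encoded form matches the `debug:` URL that appears in the EuropePMC issue reports below.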

Screenshot

screenshot

Query formats

Each API has its own query format. Usage guides are provided on our wiki.

License

Copyright (c) 2014 Shuttleworth Foundation. Licensed under the MIT license.

Caveats

  • The remote site may time out or hang (we have found that if EPMC receives a query with no results, it times out).
  • Be careful not to download the whole site. Use the -k option to limit downloads (this should be a default).

getpapers's People

Contributors

blahah, chartgerink, chreman, jkbcm, katrinleinweber, matthewgthomas, petermr, tarrow


getpapers's Issues

eupmc: -p stumbles on BMC OA papers (?)

Success:
getpapers -q Gasteria --api eupmc -s -l verbose --outdir ./blah

Success:
getpapers -q Gasteria --api eupmc -x -l verbose --outdir ./blah

Fail:

$ getpapers -q Gasteria --api eupmc -p   -l verbose  --outdir ./blah
info: Searching using eupmc API
debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria%20OPEN_ACCESS%3Ay&resulttype=core
info: Found 13 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC3978243" had no fulltext HTML url
warn: Article with pmcid "PMC3605904" had no fulltext HTML url
warn: Article with pmcid "PMC3371435" had no fulltext HTML url
warn: Article with pmcid "PMC3364152" had no fulltext HTML url
warn: Article with pmcid "PMC3066391" had no fulltext HTML url
warn: Article with pmcid "PMC2871514" had no fulltext HTML url
warn: Article with pmcid "PMC2602195" had no fulltext HTML url
info: Fulltext HTML URL list written to fulltext_html_urls.txt
warn: Article with pmcid "PMC3371435" had no fulltext PDF url
info: Downloading fulltext PDF files
debug: Creating directory: PMC4377467/
debug: Downloading PDF: http://europepmc.org/articles/PMC4377467?pdf=render
debug: Creating directory: PMC3978243/
debug: Downloading PDF: http://europepmc.org/articles/PMC3978243?pdf=render
debug: Creating directory: PMC4152747/
debug: Downloading PDF: http://europepmc.org/articles/PMC4152747?pdf=render
debug: Creating directory: PMC3729011/
debug: Downloading PDF: http://europepmc.org/articles/PMC3729011?pdf=render
debug: Creating directory: PMC3605904/
debug: Downloading PDF: http://europepmc.org/articles/PMC3605904?pdf=render
debug: Creating directory: PMC3371435/
debug: Downloading PDF: http://europepmc.org/articles/PMC3364152?pdf=render
debug: Creating directory: PMC3364152/
debug: Downloading PDF: http://europepmc.org/articles/PMC3305877?pdf=render
debug: Creating directory: PMC3305877/
debug: Downloading PDF: http://europepmc.org/articles/PMC3066391?pdf=render
debug: Creating directory: PMC3066391/
debug: Downloading PDF: http://europepmc.org/articles/PMC2141413?pdf=render
debug: Creating directory: PMC2141413/
debug: Downloading PDF: http://www.biomedcentral.com/content/pdf/1478-5854-9-18.pdf
debug: Creating directory: PMC2871514/
debug: Downloading PDF: http://europepmc.org/articles/PMC2602195?pdf=render
debug: Creating directory: PMC2602195/
debug: Downloading PDF: http://www.biomedcentral.com/content/pdf/1471-2229-10-32.pdf
Downloading files [===---------------------------] 8% (eta 0.0s)
/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at endWritable (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:509:3)
    at BufferStream.Writable.end (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:474:5)
    at Unzip.onend (_stream_readable.js:502:10)
    at Unzip.g (events.js:180:16)
    at Unzip.emit (events.js:117:20)

PS Gasteria is one of my favourite genera of succulent plants :)

If using the IEEE API, getpapers throws an error if 0 results are found

Example Query

$ getpapers -q 'cs:"Syracuse University"' -o Output10 --api 'ieee'

Output

info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.EventEmitter.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:449:9
    at Parser.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.EventEmitter.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

Note: I am not sure if the query format is correct. It may have to be

 -q 'cs="Syracuse University"'

The XML returned by IEEE is unhelpful and is copied below. Why not just keep the same format as when you get results, with a total count of 0?

<Error>Cannot go to record 1 since query  only returned 0 records</Error>
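A guard on the parsed response would avoid this crash. In this sketch, `totalFound` and the `data.root.totalfound` path are assumptions based on the stack trace and the `<Error>` payload above:

```javascript
// Sketch: guard the IEEE response before reading totalfound.
// Assumption: xml2js parses a normal response into data.root with a
// totalfound field, and the zero-hit case into an Error payload instead.
function totalFound(data) {
  if (!data || !data.root || typeof data.root.totalfound === 'undefined') {
    return 0; // treat the <Error> payload as zero results
  }
  return parseInt(data.root.totalfound, 10);
}

console.log(totalFound({ Error: 'Cannot go to record 1 ...' })); // 0
console.log(totalFound({ root: { totalfound: '42' } }));         // 42
```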

--api ieee -x the warning shows, but not a graceful exit

When I saw the graceful exit displayed by --api arxiv -x:

$ getpapers -q 'Gasteria' --api arxiv -x  --outdir ./car
info: Searching using arxiv API
warn: The ArXiv API does not provide fulltext XML, so the --xml flag will be ignored
info: Found 0 results

I realised that even though the warning is clearly displayed, this should still be flagged as a bug because of the non-graceful exit:

$ getpapers -q 'Gasteria' --api ieee -x  --outdir ./car
info: Searching using ieee API
warn: The IEEE API does not provide fulltext XML, so the --xml flag will be ignored

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

ieee -a fails when search hits == 0

I saw the no-XML warning when using ieee with -x, but there's no similar warning here:

getpapers -q 'Gasteria' --api ieee -a  --outdir ./car
info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)

arxiv: q-bio not recognised as a distinct subject class

$ getpapers -q 'dinosaurs' --api arxiv -p  --outdir ./car
# it all worked but ...
$ cd car
$ ls
0704.1912v4  1302.3267v1  arxiv_results.json  cond-mat
1209.5439v1  1302.5142v1  astro-ph            hep-ph

The subfolders by subject class are cool, e.g. astro-ph, hep-ph & cond-mat
... but it looks like q-bio isn't similarly recognised!
http://arxiv.org/abs/1209.5439 (1209.5439v1) should be in a q-bio folder, following that convention.
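Assuming the subject folder is derived from the article identifier, the gap is that new-style arXiv ids like 1209.5439 carry no class prefix, so the primary category has to come from the result metadata instead. A sketch of that distinction (`subjectFolder` and the `primaryCategory` argument are hypothetical):

```javascript
// Sketch: choose a subject-class folder for an arXiv result.
// Old-style ids like "cond-mat/0703772" embed the class; new-style
// ids like "1209.5439" do not, so the primary category (e.g. "q-bio.PE")
// must be taken from the result metadata when available.
function subjectFolder(id, primaryCategory) {
  var slash = id.indexOf('/');
  if (slash > -1) return id.slice(0, slash);                 // old-style id
  if (primaryCategory) return primaryCategory.split('.')[0]; // e.g. q-bio
  return '.';                                                // no class known
}

console.log(subjectFolder('cond-mat/0703772', null)); // cond-mat
console.log(subjectFolder('1209.5439', 'q-bio.PE'));  // q-bio
```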

friendlier saving of log info to file [enhancement request]

I know when I asked about this last time (for quickscrape) you suggested this to save the log output:

command blah blah... 2>&1 | tee log.txt 

but that doesn't seem very friendly or intuitive for shell newbies; a lot of shell knowledge is required.
Compare with wget's inbuilt functionality: wget URL -o log.log

I for one would love to have an -o log-to-file option in both getpapers and quickscrape please :)

empty folders created

In cases where neither PDF nor XML is found, folders are created anyway. This may be irritating when interpreting results and working with e.g. norma and other tools.

When downloading fullTextXML for a particular pmcid, if eupmc returns a 404, getpapers crashes

The url http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693416/fullTextXML returns a 404 and this crashes getpapers.

Command executed

$ getpapers -q 'dinosaurs' -l 'debug' -x -s -p -a -o dinosaursOutput4 >> dinosaursOutput4.log

The output at the location it crashed

debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693416/fullTextXML
Downloading files [------------------------------] 0% (eta 0.0s)
/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.EventEmitter.emit (events.js:117:20)
    at finishMaybe (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:502:14)
    at endWritable (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:512:3)
    at BufferStream.Writable.end (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:477:5)
    at IncomingMessage.onend (_stream_readable.js:483:10)
    at IncomingMessage.g (events.js:180:16)
    at IncomingMessage.EventEmitter.emit (events.js:117:20)

Version of getpapers: 0.3.0
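The crash above comes from eupmc.js invoking a `fourohfour` handler that is undefined in this code path. A defensive sketch that logs and skips instead of crashing (`handleResponse` and its argument names are hypothetical, echoing the trace):

```javascript
// Sketch: guard an optional 404 handler instead of calling it blindly.
// Returns false when the download should be skipped, true otherwise.
function handleResponse(response, fourohfour) {
  if (response.statusCode === 404) {
    if (typeof fourohfour === 'function') {
      fourohfour();
    } else {
      console.warn('warn: download returned 404, skipping');
    }
    return false;
  }
  return true;
}

console.log(handleResponse({ statusCode: 404 }, undefined)); // false
console.log(handleResponse({ statusCode: 200 }, undefined)); // true
```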

PLOS ONE dinosaurs search inconsistency

Why can't a metadata-only getpapers run supply the user with a list of dinosaur-related fulltext URLs from PLOS ONE?

(edit: same for PeerJ & eLife. Even when doing metadata-only searches, I would like/expect getpapers to output a fulltext_urls.txt file.)

$ getpapers -q 'dinosaurs JOURNAL:"PLOS ONE"' --api eupmc -o plos_test_eupmc
info: Searching using eupmc API
info: Found 350 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 325 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4439161" had no fulltext HTML url
warn: Article with pmcid "PMC4373905" had no fulltext HTML url
... snip 322 similar warnings snip ...
$ ls
eupmc_results.json

The JSON file from the above metadata only query returns 325 items.

Compare this with the same search with -p added, where the JSON file contains 750 records, the URL file contains 18 entries, and ~33 PDFs were downloaded. Super inconsistent!

getpapers -q 'dinosaurs' --api eupmc -p -o pdf_test_eupmc
wc pdf_test_eupmc/fulltext_html_urls.txt 
 18  19 778 fulltext_html_urls.txt

Inconsistent type of URL returned for simple EUPMC search

getpapers -q extremophiles --outdir ./extremophiles

The returned fulltext_html_urls.txt file contains a list of 836 URLs. Initially these are 100% DOIs; however, from about the 67th entry to the end, the URLs mysteriously switch from DOIs to mostly being of the form http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=EBI&pubmedid=24961215
(and each of these does have a DOI; it's not because they are DOI-less papers)

It doesn't appear to be associated with particular journals. Some journals, e.g. PLOS ONE, appear as DOIs if they were among the first ~50 results returned, and as PMID-based links if they were later in the list (e.g. results 100-836).

Odd behaviour.

eupmc: -a AND -x fails

Either switch on its own is fine, but combining -a AND -x crashes it.
Expected behaviour: search all papers (not just OA), and download the fulltext XML of the OA ones.

$ getpapers -q 'Gasteria' --api eupmc -a -x   -l verbose  --outdir ./zar
info: Searching using eupmc API
debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core
info: Found 61 results
Retrieving results [============------------------] 41% (eta 0.0s)debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core&page=1
Retrieving results [=========================-----] 82% (eta 0.1s)debug: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=Gasteria&resulttype=core&page=2
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 50 unique results identified
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC3978243" had no fulltext HTML url
warn: Article with pmcid "PMC3690993" had no fulltext HTML url
warn: Article with pmcid "PMC3605904" had no fulltext HTML url
warn: Article with pmcid "PMC3371435" had no fulltext HTML url
warn: Article with pmcid "PMC3364152" had no fulltext HTML url
warn: Article with pmcid "PMC3286288" had no fulltext HTML url
warn: Article with pmcid "PMC3066391" had no fulltext HTML url
warn: Article with pmcid "PMC3043933" had no fulltext HTML url
warn: Article with pmcid "PMC3002468" had no fulltext HTML url
warn: Article with pmcid "PMC4242389" had no fulltext HTML url
warn: Article with pmcid "PMC4233834" had no fulltext HTML url
warn: Article with pmcid "PMC2871514" had no fulltext HTML url
warn: Article with pmcid "PMC1460952" had no fulltext HTML url
warn: Article with pmcid "PMC2602195" had no fulltext HTML url
warn: Article with pmcid "PMC1693204" had no fulltext HTML url
warn: Article with pmcid "PMC1203420" had no fulltext HTML url
warn: Article with pmcid "PMC1208741" had no fulltext HTML url
warn: Article with pmcid "PMC1209182" had no fulltext HTML url
warn: Article with pmcid "PMC1208723" had no fulltext HTML url
info: Fulltext HTML URL list written to fulltext_html_urls.txt
warn: Article with title "Gasteria plant named 'WT10' did not have a PMCID (therefore no XML)
warn: Article with pmid "8835456 did not have a PMCID (therefore no XML)
warn: Article with pmid "9087376 did not have a PMCID (therefore no XML)
warn: Article with pmid "24240951 did not have a PMCID (therefore no XML)
warn: Article with pmid "18775771 did not have a PMCID (therefore no XML)
warn: Article with pmid "13033248 did not have a PMCID (therefore no XML)
warn: Article with pmid "5594485 did not have a PMCID (therefore no XML)
warn: Article with pmid "18098797 did not have a PMCID (therefore no XML)
info: Downloading fulltext XML files
debug: Creating directory: PMC4377467/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4377467/fullTextXML
debug: Creating directory: PMC4202640/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4202640/fullTextXML
debug: Creating directory: PMC4202400/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4202400/fullTextXML
debug: Creating directory: PMC3978243/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3978243/fullTextXML
debug: Creating directory: PMC4152747/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4152747/fullTextXML
debug: Creating directory: PMC3690993/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3690993/fullTextXML
debug: Creating directory: PMC3729011/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3729011/fullTextXML
debug: Creating directory: PMC3605904/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3605904/fullTextXML
debug: Creating directory: PMC3371435/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3371435/fullTextXML
debug: Creating directory: PMC3364152/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3364152/fullTextXML
debug: Creating directory: PMC3305877/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3305877/fullTextXML
debug: Creating directory: PMC3286288/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3286288/fullTextXML
debug: Creating directory: Gasteria plant named 'WT10'/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3066391/fullTextXML
debug: Creating directory: PMC3066391/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3043933/fullTextXML
debug: Creating directory: PMC3043933/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC3002468/fullTextXML
debug: Creating directory: PMC3002468/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2141413/fullTextXML
debug: Creating directory: 8835456/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4242389/fullTextXML
debug: Creating directory: 9087376/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4233834/fullTextXML
debug: Creating directory: 24240951/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4257511/fullTextXML
debug: Creating directory: 18775771/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2871514/fullTextXML
debug: Creating directory: 13033248/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1460952/fullTextXML
debug: Creating directory: PMC2141413/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4243674/fullTextXML
debug: Creating directory: PMC4242389/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2602195/fullTextXML
debug: Creating directory: PMC4233834/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1462085/fullTextXML
debug: Creating directory: PMC4257511/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1693204/fullTextXML
debug: Creating directory: PMC2871514/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2707325/fullTextXML
debug: Creating directory: PMC1460952/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2844068/fullTextXML
debug: Creating directory: PMC4243674/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2707867/fullTextXML
debug: Creating directory: PMC2602195/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1203420/fullTextXML
debug: Creating directory: PMC1462085/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC2803647/fullTextXML
debug: Creating directory: PMC1693204/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1078655/fullTextXML
debug: Creating directory: PMC2707325/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076571/fullTextXML
debug: Creating directory: 5594485/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1077958/fullTextXML
debug: Creating directory: 18098797/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209227/fullTextXML
debug: Creating directory: PMC2844068/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076789/fullTextXML
debug: Creating directory: PMC2707867/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1208741/fullTextXML
debug: Creating directory: PMC1203420/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1076941/fullTextXML
debug: Creating directory: PMC2803647/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209172/fullTextXML
debug: Creating directory: PMC1078655/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC438925/fullTextXML
debug: Creating directory: PMC1076571/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1209182/fullTextXML
debug: Creating directory: PMC1077958/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1200927/fullTextXML
debug: Creating directory: PMC1209227/
debug: Downloading XML: http://www.ebi.ac.uk/europepmc/webservices/rest/PMC1208723/fullTextXML
Downloading files [=-----------------------------] 2% (eta 0.0s)
/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at endWritable (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:509:3)
    at BufferStream.Writable.end (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:474:5)
    at IncomingMessage.onend (_stream_readable.js:502:10)
    at IncomingMessage.g (events.js:180:16)
    at IncomingMessage.emit (events.js:117:20)
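One way to get the expected behaviour is to filter the combined result set down to entries with a PMCID before queueing XML downloads; the warn lines above show pmid- and title-named directories being created for articles that can have no XML at EuropePMC. A sketch (`xmlCandidates` is a hypothetical helper; the `pmcid` field name follows the warnings above):

```javascript
// Sketch: with --all, keep only results that have a PMCID before
// queueing fulltext XML downloads (others have no XML at EuropePMC).
function xmlCandidates(results) {
  return results.filter(function (r) {
    return typeof r.pmcid === 'string' && r.pmcid.indexOf('PMC') === 0;
  });
}

var sample = [{ pmcid: 'PMC4377467' }, { pmid: '8835456' }];
console.log(xmlCandidates(sample).length); // 1
```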

Biology Direct paper 'had no fulltext HTML url' -- really?

getpapers -q extremophiles --outdir ./extremophiles (again)

many warning lines such as:

warn: Article with pmcid "PMC1586193" had no fulltext HTML url

...so I checked to see what PMC1586193 is and it turns out it's a Biology Direct paper (Rooting the tree of life by transition analyses):
http://europepmc.org/articles/PMC1586193

Looking at the above EUPMC URL in a web browser, it appears EUPMC does have a copy of the full text of the paper. I don't know whether this is an issue with getpapers or EUPMC, but it seems odd.

Incidentally, I got 283 of those warnings. For a search that returns 836 results, that's quite a high proportion!

Nested Boolean searches

Hi,

I am trying to do nested Boolean searches, but I immediately receive an error that an operator was unexpected. More specifically, it is the operator combining the two Boolean clauses that creates the error. These kinds of searches do work in EuropePMC directly, btw.

The search I used is getpapers --query '(TITLE: "QRP" AND TITLE:"misconduct") OR (PUB_TYPE:"retraction of publication")' --outdir test, see also the attached image.

If I remove the Boolean part from OR onward I also receive an error, so it seems the parentheses might also be the cause of the problem. Any help on why this creates an error, and whether it is solvable, would be appreciated.

Kind regards,
Chris Hartgerink
[screenshot: nested boolean query error]
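One quick check is whether the parentheses and quotes survive URL encoding on the client side. A sketch, assuming getpapers percent-encodes the whole query string before sending it:

```javascript
// Sketch: inspect what a nested boolean query looks like once encoded,
// to check whether parentheses reach the EuropePMC API intact.
var query = '(TITLE:"QRP" AND TITLE:"misconduct") OR (PUB_TYPE:"retraction of publication")';
var encoded = encodeURIComponent(query);
console.log(encoded.indexOf('%22') > -1); // quotes are percent-encoded
console.log(encoded.indexOf('(') > -1);   // parentheses pass through unencoded
```

If the parentheses arrive intact, the error is more likely in how getpapers pre-parses the query than in the transport.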

Hangs when Network Drops During Search

Currently getpapers seems to hang silently if connectivity drops during either the search or the download stage. My workaround is to Ctrl-C and restart repeatedly. If the network stays stable for the whole cycle, it runs perfectly. This was using a CM-FTDM VM.

"no fulltext HTML url" could be located on NCBI site

warn: Article with pmcid "PMC4327751" had no fulltext HTML url
warn: Article with pmcid "PMC4015397" had no fulltext HTML url
warn: Article with pmcid "PMC4210678" had no fulltext HTML url
warn: Article with pmcid "PMC3260561" had no fulltext HTML url
warn: Article with pmcid "PMC3337047" had no fulltext HTML url
warn: Article with pmcid "PMC3026713" had no fulltext HTML url
warn: Article with pmcid "PMC3109237" had no fulltext HTML url
warn: Article with pmcid "PMC3023303" had no fulltext HTML url

Following http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015397/ manually, I could read the whole HTML version of the paper.

Search multiple APIs in one query?

I'm sure this is on the roadmap, but just to flag it up anyway:

It would be nice to search EUPMC + arxiv + IEEE with just one query; if people are really thirsty for knowledge they want it from anywhere!

No PNAS fulltext (PDF or XML) via getpapers

Very strange: it appears one can't get PNAS fulltext as either PDF or XML via getpapers!
Yet via the EuropePMC website there are clearly a lot of freely available fulltext articles with PDFs (not so sure about the availability of fulltext XML).

Absolutely zero fulltext downloads appear to be possible for PNAS or Science:

getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' -x --outdir pnas
info: Searching using eupmc API
info: Found 0 open access results
#include closed papers
 getpapers --query 'Journal:"PNAS" AND FIRST_PDATE:[2000-01-01 TO 2015-05-01]' --all --outdir pnas
info: Searching using eupmc API
info: Found 57575 results

Take Busch et al as the test case: http://europepmc.org/articles/PMC4321246
It is clearly available as free full text via EPMC, as HTML & a downloadable PDF.

#finds the paper because of --all switch
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]' --all --outdir busch
#DOES NOT find the paper
getpapers --query 'Author:"Busch" AND FIRST_PDATE:[2015-01-01 TO 2015-02-01]'  --outdir openbusch

404 for single open Elsevier PDF because it doesn't exist?

404 error when trying to get PDF of this one open article from the journal Academic Radiology:

I suspect it's because there is no PDF for this article at EuropePMC: http://europepmc.org/articles/PMC4234081 . Does getpapers just assume there's a PDF if XML is available?

getpapers --query 'Journal:"Academic Radiology" AND FIRST_PDATE:[2010-01-01 TO 2015-07-01]' -p  --outdir elseacrad
info: Searching using eupmc API
info: Found 1 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4234081" had no fulltext HTML url
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (eta 0.0s)

/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333
          fourohfour();
          ^
TypeError: undefined is not a function
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/eupmc.js:333:11
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/index.js:152:6
    at BufferStream.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/index.js:52:3)
    at BufferStream.emit (events.js:117:20)
    at finishMaybe (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:499:14)
    at afterWrite (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/lib/_stream_writable.js:378:3)
    at afterTick (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/got/node_modules/read-all-stream/node_modules/readable-stream/node_modules/process-nextick-args/index.js:11:8)
    at process._tickCallback (node.js:448:13)

Downloading the XML for the same article works fine, no problem:

getpapers --query 'Journal:"Academic Radiology" AND FIRST_PDATE:[2010-01-01 TO 2015-07-01]' -x  --outdir elseacrad
info: Searching using eupmc API
info: Found 1 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metdata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
warn: Article with pmcid "PMC4234081" had no fulltext HTML url
info: Downloading fulltext XML files
Downloading files [==============================] 100% (eta 0.0s)
info: All XML downloads succeeded!
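The crash itself looks like a bug in the 404 handling rather than in the missing PDF: eupmc.js:333 calls a `fourohfour()` that is undefined. A minimal defensive sketch (hypothetical names, not the actual getpapers code):

```javascript
// Sketch of a defensive 404 handler (hypothetical; the real crash is at
// eupmc.js:333, where `fourohfour()` is called but undefined): only call
// the handler if one was supplied, otherwise warn and carry on.
function handleMissingPdf(pmcid, fourohfour) {
  if (typeof fourohfour === 'function') return fourohfour(pmcid);
  return 'warn: no fulltext PDF for ' + pmcid;
}
```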

Display identity of timed out downloads

So I'm on wifi at the moment, trying to download a year's worth of Nature supp data. (The flaky wifi isn't the problem; it's to be expected.) I get this message:

warn: 20 downloads timed out. Retrying.

I'm not confident the retrying worked. Which 20 downloads out of a possible 2000 timed out?
e.g. the PDF for PMC123456, the supp data file for PMC654321, etc.

I think it would be better to print the exact identity of failed downloads to screen and/or file. Related to #43 in some ways.
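A minimal sketch of the kind of bookkeeping that would make the retry message useful (all names here are hypothetical, not getpapers' actual code):

```javascript
// Sketch: record the identity of each failed download so the retry
// warning can name them, instead of just reporting a count.
var failures = [];

function recordFailure(pmcid, filetype) {
  failures.push({ pmcid: pmcid, filetype: filetype });
}

function failureReport() {
  if (failures.length === 0) return 'all downloads succeeded';
  var lines = failures.map(function (f) {
    return '  ' + f.filetype + ' for ' + f.pmcid;
  });
  return failures.length + ' downloads timed out. Retrying:\n' + lines.join('\n');
}
```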

What is the purpose of the file fulltext_html_urls.txt

What is the purpose of the file fulltext_html_urls.txt available as part of the output?

Purpose: Search open access papers in eupmc for the query dinosaurs and download fulltext XMLs, supplementary files and fulltext PDFs if available

Query used

$ getpapers -q 'dinosaurs' -x -s -p -o dinosaursOutput2 >> dinosaursOutput2.log

This generated a fulltext_html_urls.txt file with 22 urls

Not all pmcids listed in fulltext_html_urls.txt had a corresponding fulltext.xml or fulltext.html file downloaded. Of the 22 urls with pmcids listed in the file, the breakdown of what I found was as follows:

  • 20 of the pmcids had an empty dir
  • 2 of the pmcids had a dir with a fulltext.xml file but an empty fulltext.html file
  • For each of the pmcids in the fulltext_html_urls.txt file, the output produced a message similar to the following one
    warn: Article with pmcid "PMC3381548" had no fulltext PDF url

fulltext_html_urls.txt naming issue

PMR and I think that the output fulltext_html_urls.txt should instead be named

APImethod_fulltext_html_urls.txt

e.g. eupmc_fulltext_html_urls.txt , ieee_fulltext_html_urls.txt , arxiv_fulltext_html_urls.txt

This would better follow the convention set by the .json results files, which are named similarly.
It's useful when you run a search for 'dinosaurs' in eupmc, then arxiv, then ieee, with all output going into the same outdir.
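The rename itself would be a one-liner; a sketch (hypothetical helper, not existing getpapers code):

```javascript
// Sketch: derive the URL-list filename from the API name, mirroring the
// convention already used for the <api>_results.json files.
function urlListFilename(api) {
  return api + '_fulltext_html_urls.txt';
}
```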

No warning that --api ieee -s won't work

ditto for ieee + -s

$ getpapers -q 'Gasteria' --api ieee -s  --outdir ./car
info: Searching using ieee API

TypeError: Cannot read property 'totalfound' of undefined
    at IEEE.completeCallback (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/lib/ieee.js:72:38)
    at Request.emit (events.js:98:17)
    at Request.mixin._fireSuccess (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
    at /home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:448:9
    at Parser.<anonymous> (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:344:20)
    at Parser.emit (events.js:95:17)
    at Object.saxParser.onclosetag (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/lib/xml2js.js:314:24)
    at emit (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:615:33)
    at emitNode (/home/ross/.nvm/v0.10.38/lib/node_modules/getpapers/node_modules/restler/node_modules/xml2js/node_modules/sax/lib/sax.js:620:3)
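Rather than crashing on the malformed response, getpapers could validate the download flags against the chosen API up front. A sketch, where the per-API capability lists are illustrative assumptions (crossref is metadata-only per the README; the others are guesses), not documented facts:

```javascript
// Sketch of an up-front capability check (hypothetical; getpapers does
// not currently do this): warn before searching instead of crashing.
// The per-API lists below are illustrative assumptions.
var SUPPORTED = {
  eupmc: ['xml', 'pdf', 'supp', 'minedterms'],
  ieee: [],
  arxiv: ['pdf'],
  crossref: []
};

function unsupportedFlags(api, flags) {
  var ok = SUPPORTED[api] || [];
  return flags.filter(function (f) { return ok.indexOf(f) === -1; });
}
```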

need a manifest file to preserve provenance info [important]

PMR and I strongly think we need a manifest.json of some sort to document each search:

  • the date and time of the getpapers search, the API used, and the search parameters
  • a full listing of all files downloaded from the getpapers query

Hence the suggested name of either 'manifest' or 'metadata' for this new JSON file.

If I do a search today, then just by looking at the output I will have no idea in 7 days' time what search I ran to get those results. PMR also thinks it's very important for downstream tools to have a manifest of all the files in the cmdir.
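A sketch of what such a manifest could contain (all field names are suggestions, not an existing getpapers format):

```javascript
// Sketch of the proposed manifest.json builder: record when the search
// ran, which API and query were used, and every file downloaded.
// Field names here are suggestions only.
function buildManifest(api, query, files) {
  return {
    tool: 'getpapers',
    date: new Date().toISOString(),
    api: api,
    query: query,
    files: files
  };
}
```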

no-execute (dry-run) mode, -n

My first test query using getpapers returned an unexpectedly large number of results (836), which it then proceeded to try to download...

I was wondering if getpapers could have a 'no-execute' mode, -n, similar to:

mmv -n
rsync -n

whereby getpapers would simply report the number of papers found for that query and NOT download anything. This is useful when you don't quite know how many matching papers you're going to get served.
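A sketch of the requested control flow (hypothetical structure, not getpapers' actual code path):

```javascript
// Sketch of a no-execute branch: report the hit count, then stop
// before any downloads happen.
function handleResults(hitCount, noexecute, download) {
  console.log('info: Found ' + hitCount + ' results');
  if (noexecute) return 'dry-run: nothing downloaded';
  return download();
}
```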

PLOS ONE 'supp materials' are mostly just figures NOT SI

Single-paper demonstration of the issue: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0079155 I expected getpapers to download the supp info: File S1 (9 MB PDF). EuropePMC does hold this supp file: http://europepmc.org/articles/PMC3838368/bin/pone.0079155.s001.pdf

But getpapers unexpectedly returns all 9 main-paper figure images as supp info, NOT the real supp info.

getpapers --query 'JOURNAL:"PLOS ONE" TITLE:"Monogenean lost clamps"' -s  --outdir plosmono
cd plosmono/PMC3838368/
unzip supplementaryFiles.zip
ls
pone.0079155.g001.jpg  pone.0079155.g005.jpg  pone.0079155.g009.jpg
pone.0079155.g002.jpg  pone.0079155.g006.jpg  supplementaryFiles.zip
pone.0079155.g003.jpg  pone.0079155.g007.jpg
pone.0079155.g004.jpg  pone.0079155.g008.jpg

A multiple-PLOS-ONE example, where only 3 out of 25 hits return real supplementary information files.
The three that do return real supp info are: PMC3665537, PMC3669350, PMC3692442.
All others, as is apparent from file names such as g001, contain just the figures from the main paper.

#Downloads 25 supplementary materials zip files
getpapers --query 'JOURNAL:"PLOS ONE" METHODS:"NHMUK"' -s --outdir plosbmnh
#unzip all
tree
.
├── eupmc_results.json
├── PMC3648582
├── PMC3665537
│   ├── pone.0065295.e001.jpg
│   ├── pone.0065295.e002.jpg
│   ├── pone.0065295.g001.jpg
│   ├── pone.0065295.g002.jpg
│   ├── pone.0065295.g003.jpg
│   ├── pone.0065295.g004.jpg
│   ├── pone.0065295.g005.jpg
│   ├── pone.0065295.g006.jpg
│   ├── pone.0065295.g007.jpg
│   ├── pone.0065295.s001.doc
│   ├── pone.0065295.s002.doc
│   ├── pone.0065295.s003.doc
│   ├── pone.0065295.s004.doc
│   ├── pone.0065295.s005.wmv
│   ├── pone.0065295.s006.wmv
│   ├── pone.0065295.s007.wmv
│   ├── pone.0065295.s008.wmv
│   └── supplementaryFiles.zip
├── PMC3669350
│   ├── pone.0064203.g001.jpg
│   ├── pone.0064203.g002.jpg
│   ├── pone.0064203.g003.jpg
│   ├── pone.0064203.g004.jpg
│   ├── pone.0064203.g005.jpg
│   ├── pone.0064203.g006.jpg
│   ├── pone.0064203.g007.jpg
│   ├── pone.0064203.g008.jpg
│   ├── pone.0064203.g009.jpg
│   ├── pone.0064203.g010.jpg
│   ├── pone.0064203.g011.jpg
│   ├── pone.0064203.s001.doc
│   ├── pone.0064203.s002.doc
│   ├── pone.0064203.s003.doc
│   ├── pone.0064203.s004.nex
│   └── supplementaryFiles.zip
├── PMC3692442
│   ├── pone.0067176.g001.jpg
│   ├── pone.0067176.g002.jpg
│   ├── pone.0067176.g003.jpg
│   ├── pone.0067176.g004.jpg
│   ├── pone.0067176.g005.jpg
│   ├── pone.0067176.s001.xls
│   ├── pone.0067176.s002.pdf
│   └── supplementaryFiles.zip
├── PMC3789696
│   ├── pone.0077457.g001.jpg
│   ├── pone.0077457.g002.jpg
│   ├── pone.0077457.g003.jpg
│   ├── pone.0077457.g004.jpg
│   └── supplementaryFiles.zip
├── PMC3838368
│   ├── pone.0079155.g001.jpg
│   ├── pone.0079155.g002.jpg
│   ├── pone.0079155.g003.jpg
│   ├── pone.0079155.g004.jpg
│   ├── pone.0079155.g005.jpg
│   ├── pone.0079155.g006.jpg
│   ├── pone.0079155.g007.jpg
│   ├── pone.0079155.g008.jpg
│   ├── pone.0079155.g009.jpg
│   └── supplementaryFiles.zip
├── PMC3847141
│   ├── pone.0080405.g001.jpg
│   ├── pone.0080405.g002.jpg
│   ├── pone.0080405.g003.jpg
│   ├── pone.0080405.g004.jpg
│   ├── pone.0080405.g005.jpg
│   ├── pone.0080405.g006.jpg
│   ├── pone.0080405.g007.jpg
│   ├── pone.0080405.g008.jpg
│   ├── pone.0080405.g009.jpg
│   ├── pone.0080405.g010.jpg
│   ├── pone.0080405.g011.jpg
│   ├── pone.0080405.g012.jpg
│   ├── pone.0080405.g013.jpg
│   ├── pone.0080405.g014.jpg
│   ├── pone.0080405.g015.jpg
│   ├── pone.0080405.g016.jpg
│   ├── pone.0080405.g017.jpg
│   ├── pone.0080405.g018.jpg
│   ├── pone.0080405.g019.jpg
│   ├── pone.0080405.g020.jpg
│   ├── pone.0080405.g021.jpg
│   ├── pone.0080405.g022.jpg
│   ├── pone.0080405.g023.jpg
│   ├── pone.0080405.g024.jpg
│   ├── pone.0080405.g025.jpg
│   ├── pone.0080405.g026.jpg
│   ├── pone.0080405.g027.jpg
│   ├── pone.0080405.g028.jpg
│   ├── pone.0080405.g029.jpg
│   ├── pone.0080405.g030.jpg
│   ├── pone.0080405.g031.jpg
│   ├── pone.0080405.g032.jpg
│   ├── pone.0080405.g033.jpg
│   ├── pone.0080405.g034.jpg
│   └── supplementaryFiles.zip
├── PMC3852158
│   ├── pone.0080974.g001.jpg
│   ├── pone.0080974.g002.jpg
│   ├── pone.0080974.g003.jpg
│   ├── pone.0080974.g004.jpg
│   ├── pone.0080974.g005.jpg
│   ├── pone.0080974.g006.jpg
│   ├── pone.0080974.g007.jpg
│   ├── pone.0080974.g008.jpg
│   ├── pone.0080974.g009.jpg
│   ├── pone.0080974.g010.jpg
│   ├── pone.0080974.g011.jpg
│   ├── pone.0080974.g012.jpg
│   ├── pone.0080974.g013.jpg
│   ├── pone.0080974.g014.jpg
│   ├── pone.0080974.g015.jpg
│   ├── pone.0080974.g016.jpg
│   ├── pone.0080974.g017.jpg
│   └── supplementaryFiles.zip
├── PMC3859474
│   ├── pone.0066075.g001.jpg
│   ├── pone.0066075.g002.jpg
│   ├── pone.0066075.g003.jpg
│   ├── pone.0066075.g004.jpg
│   ├── pone.0066075.g005.jpg
│   ├── pone.0066075.g006.jpg
│   └── supplementaryFiles.zip
├── PMC3897400
│   ├── pone.0084709.g001.jpg
│   ├── pone.0084709.g002.jpg
│   ├── pone.0084709.g003.jpg
│   ├── pone.0084709.g004.jpg
│   ├── pone.0084709.g005.jpg
│   ├── pone.0084709.g006.jpg
│   ├── pone.0084709.g007.jpg
│   ├── pone.0084709.g008.jpg
│   ├── pone.0084709.g009.jpg
│   ├── pone.0084709.g010.jpg
│   ├── pone.0084709.g011.jpg
│   ├── pone.0084709.g012.jpg
│   ├── pone.0084709.g013.jpg
│   ├── pone.0084709.g014.jpg
│   ├── pone.0084709.g015.jpg
│   ├── pone.0084709.g016.jpg
│   ├── pone.0084709.g017.jpg
│   ├── pone.0084709.g018.jpg
│   ├── pone.0084709.g019.jpg
│   ├── pone.0084709.g020.jpg
│   └── supplementaryFiles.zip
├── PMC3907582
│   ├── pone.0086864.g001.jpg
│   ├── pone.0086864.g002.jpg
│   ├── pone.0086864.g003.jpg
│   ├── pone.0086864.g004.jpg
│   ├── pone.0086864.g005.jpg
│   ├── pone.0086864.g006.jpg
│   ├── pone.0086864.g007.jpg
│   ├── pone.0086864.g008.jpg
│   ├── pone.0086864.g009.jpg
│   ├── pone.0086864.g010.jpg
│   ├── pone.0086864.g011.jpg
│   ├── pone.0086864.g012.jpg
│   ├── pone.0086864.g013.jpg
│   ├── pone.0086864.g014.jpg
│   ├── pone.0086864.g015.jpg
│   ├── pone.0086864.g016.jpg
│   ├── pone.0086864.g017.jpg
│   ├── pone.0086864.g018.jpg
│   ├── pone.0086864.g019.jpg
│   └── supplementaryFiles.zip
├── PMC3914794
│   ├── pone.0087048.g001.jpg
│   ├── pone.0087048.g002.jpg
│   ├── pone.0087048.g003.jpg
│   ├── pone.0087048.g004.jpg
│   ├── pone.0087048.g005.jpg
│   └── supplementaryFiles.zip
├── PMC3937355
│   ├── pone.0089165.g001.jpg
│   ├── pone.0089165.g002.jpg
│   ├── pone.0089165.g003.jpg
│   ├── pone.0089165.g004.jpg
│   ├── pone.0089165.g005.jpg
│   ├── pone.0089165.g006.jpg
│   ├── pone.0089165.g007.jpg
│   ├── pone.0089165.g008.jpg
│   ├── pone.0089165.g009.jpg
│   ├── pone.0089165.g010.jpg
│   ├── pone.0089165.g011.jpg
│   ├── pone.0089165.g012.jpg
│   ├── pone.0089165.g013.jpg
│   ├── pone.0089165.g014.jpg
│   ├── pone.0089165.g015.jpg
│   ├── pone.0089165.g016.jpg
│   ├── pone.0089165.g017.jpg
│   ├── pone.0089165.g018.jpg
│   ├── pone.0089165.g019.jpg
│   ├── pone.0089165.g020.jpg
│   ├── pone.0089165.g021.jpg
│   └── supplementaryFiles.zip
├── PMC3991637
│   ├── pone.0095296.g001.jpg
│   ├── pone.0095296.g002.jpg
│   ├── pone.0095296.g003.jpg
│   ├── pone.0095296.g004.jpg
│   ├── pone.0095296.g005.jpg
│   ├── pone.0095296.g006.jpg
│   ├── pone.0095296.g007.jpg
│   ├── pone.0095296.g008.jpg
│   ├── pone.0095296.g009.jpg
│   ├── pone.0095296.g010.jpg
│   ├── pone.0095296.g011.jpg
│   ├── pone.0095296.g012.jpg
│   ├── pone.0095296.g013.jpg
│   ├── pone.0095296.g014.jpg
│   ├── pone.0095296.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4118863
│   ├── pone.0103152.g001.jpg
│   ├── pone.0103152.g002.jpg
│   ├── pone.0103152.g003.jpg
│   ├── pone.0103152.g004.jpg
│   ├── pone.0103152.g005.jpg
│   ├── pone.0103152.g006.jpg
│   ├── pone.0103152.g007.jpg
│   ├── pone.0103152.g008.jpg
│   ├── pone.0103152.g009.jpg
│   ├── pone.0103152.g010.jpg
│   ├── pone.0103152.g011.jpg
│   ├── pone.0103152.g012.jpg
│   ├── pone.0103152.g013.jpg
│   ├── pone.0103152.g014.jpg
│   ├── pone.0103152.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4131922
│   ├── pone.0104551.g001.jpg
│   ├── pone.0104551.g002.jpg
│   ├── pone.0104551.g003.jpg
│   ├── pone.0104551.g004.jpg
│   ├── pone.0104551.g005.jpg
│   ├── pone.0104551.g006.jpg
│   ├── pone.0104551.g007.jpg
│   ├── pone.0104551.g008.jpg
│   ├── pone.0104551.g009.jpg
│   ├── pone.0104551.g010.jpg
│   ├── pone.0104551.g011.jpg
│   └── supplementaryFiles.zip
├── PMC4192354
│   ├── pone.0109785.g001.jpg
│   ├── pone.0109785.g002.jpg
│   ├── pone.0109785.g003.jpg
│   ├── pone.0109785.g004.jpg
│   ├── pone.0109785.g005.jpg
│   └── supplementaryFiles.zip
├── PMC4206445
│   ├── pone.0110646.e001.jpg
│   ├── pone.0110646.e002.jpg
│   ├── pone.0110646.g001.jpg
│   ├── pone.0110646.g002.jpg
│   ├── pone.0110646.g003.jpg
│   ├── pone.0110646.g004.jpg
│   ├── pone.0110646.g005.jpg
│   ├── pone.0110646.g006.jpg
│   ├── pone.0110646.g007.jpg
│   ├── pone.0110646.g008.jpg
│   ├── pone.0110646.g009.jpg
│   ├── pone.0110646.g010.jpg
│   ├── pone.0110646.g011.jpg
│   ├── pone.0110646.g012.jpg
│   ├── pone.0110646.g013.jpg
│   ├── pone.0110646.g014.jpg
│   ├── pone.0110646.g015.jpg
│   └── supplementaryFiles.zip
├── PMC4269487
│   ├── pone.0113911.g001.jpg
│   ├── pone.0113911.g002.jpg
│   ├── pone.0113911.g003.jpg
│   ├── pone.0113911.g004.jpg
│   ├── pone.0113911.g005.jpg
│   ├── pone.0113911.g006.jpg
│   ├── pone.0113911.g007.jpg
│   ├── pone.0113911.g008.jpg
│   ├── pone.0113911.g009.jpg
│   ├── pone.0113911.g010.jpg
│   ├── pone.0113911.g011.jpg
│   ├── pone.0113911.g012.jpg
│   ├── pone.0113911.g013.jpg
│   ├── pone.0113911.g014.jpg
│   ├── pone.0113911.g015.jpg
│   ├── pone.0113911.g016.jpg
│   ├── pone.0113911.g017.jpg
│   ├── pone.0113911.g018.jpg
│   ├── pone.0113911.g019.jpg
│   ├── pone.0113911.g020.jpg
│   ├── pone.0113911.g021.jpg
│   ├── pone.0113911.g022.jpg
│   ├── pone.0113911.g023.jpg
│   ├── pone.0113911.g024.jpg
│   ├── pone.0113911.g025.jpg
│   └── supplementaryFiles.zip
├── PMC4382297
│   ├── pone.0120924.g001.jpg
│   ├── pone.0120924.g002.jpg
│   ├── pone.0120924.g003.jpg
│   ├── pone.0120924.g004.jpg
│   ├── pone.0120924.g005.jpg
│   ├── pone.0120924.g006.jpg
│   ├── pone.0120924.g007.jpg
│   ├── pone.0120924.g008.jpg
│   ├── pone.0120924.g009.jpg
│   ├── pone.0120924.g010.jpg
│   ├── pone.0120924.g011.jpg
│   ├── pone.0120924.g012.jpg
│   ├── pone.0120924.g013.jpg
│   ├── pone.0120924.g014.jpg
│   ├── pone.0120924.g015.jpg
│   ├── pone.0120924.g016.jpg
│   ├── pone.0120924.g017.jpg
│   ├── pone.0120924.g018.jpg
│   ├── pone.0120924.g019.jpg
│   ├── pone.0120924.g020.jpg
│   ├── pone.0120924.g021.jpg
│   ├── pone.0120924.g022.jpg
│   ├── pone.0120924.g023.jpg
│   ├── pone.0120924.g024.jpg
│   ├── pone.0120924.g025.jpg
│   ├── pone.0120924.g026.jpg
│   ├── pone.0120924.g027.jpg
│   ├── pone.0120924.g028.jpg
│   ├── pone.0120924.g029.jpg
│   ├── pone.0120924.g030.jpg
│   ├── pone.0120924.g031.jpg
│   └── supplementaryFiles.zip
├── PMC4406738
│   ├── pone.0123503.g001.jpg
│   ├── pone.0123503.g002.jpg
│   ├── pone.0123503.g003.jpg
│   ├── pone.0123503.g004.jpg
│   ├── pone.0123503.g005.jpg
│   └── supplementaryFiles.zip
├── PMC4454574
│   ├── pone.0125819.g001.jpg
│   ├── pone.0125819.g002.jpg
│   ├── pone.0125819.g003.jpg
│   ├── pone.0125819.g004.jpg
│   ├── pone.0125819.g005.jpg
│   ├── pone.0125819.g006.jpg
│   ├── pone.0125819.g007.jpg
│   ├── pone.0125819.g008.jpg
│   ├── pone.0125819.g009.jpg
│   ├── pone.0125819.g010.jpg
│   ├── pone.0125819.g011.jpg
│   ├── pone.0125819.g012.jpg
│   ├── pone.0125819.g013.jpg
│   ├── pone.0125819.g014.jpg
│   ├── pone.0125819.g015.jpg
│   ├── pone.0125819.g016.jpg
│   ├── pone.0125819.g017.jpg
│   ├── pone.0125819.g018.jpg
│   └── supplementaryFiles.zip
├── PMC4465186
│   ├── pone.0127727.g001.jpg
│   ├── pone.0127727.g002.jpg
│   └── supplementaryFiles.zip
├── PMC4480851
│   ├── pone.0129193.g001.jpg
│   ├── pone.0129193.g002.jpg
│   ├── pone.0129193.g003.jpg
│   ├── pone.0129193.g004.jpg
│   ├── pone.0129193.g005.jpg
│   ├── pone.0129193.g006.jpg
│   ├── pone.0129193.g007.jpg
│   ├── pone.0129193.g008.jpg
│   ├── pone.0129193.g009.jpg
│   └── supplementaryFiles.zip
└── PMC4480985
    ├── pone.0127621.g001.jpg
    ├── pone.0127621.g002.jpg
    ├── pone.0127621.g003.jpg
    ├── pone.0127621.g004.jpg
    ├── pone.0127621.g005.jpg
    ├── pone.0127621.g006.jpg
    ├── pone.0127621.g007.jpg
    ├── pone.0127621.g008.jpg
    └── supplementaryFiles.zip

25 directories, 360 files
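As the listing shows, genuine supplementary files carry an sNNN component (e.g. pone.0065295.s001.doc) while main-paper figures carry gNNN and inline equations eNNN. Until the upstream zips are fixed, a client-side filter could separate them; a sketch based only on this filename pattern:

```javascript
// Sketch: separate genuine supplementary files (sNNN) from main-paper
// figures (gNNN) and equation images (eNNN) by filename pattern.
function isRealSuppFile(name) {
  return /\.s\d{3}\./.test(name);
}

function onlySuppFiles(names) {
  return names.filter(isRealSuppFile);
}
```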

split up results.json and move to ctree

quickscrape creates a result.json in each ctree, while getpapers creates a single apiname_results.json, which gets overwritten with each search. It would make more sense to follow the quickscrape procedure here. This would also solve issue 45.

Can we have optional .csv output? Not everyone groks JSON

Another picky non-urgent issue but...

For a simple arxiv query, only JSON data is returned at the moment, as 'arxiv_results.json',
from e.g. getpapers -q dinosaurs --api arxiv --outdir ./dinos

If I want to open the results in a text editor just to check what I've got, JSON isn't nice for that, but .csv would be more universally understood, right? It would make the tool friendlier.
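A sketch of such a converter; the field names passed in are illustrative, not the exact EuropePMC or arXiv result schema:

```javascript
// Sketch: flatten chosen metadata fields from the results JSON into CSV,
// quoting values that contain commas, quotes, or newlines.
function toCsv(results, fields) {
  var esc = function (v) {
    var s = v == null ? '' : String(v);
    return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  };
  var header = fields.join(',');
  var rows = results.map(function (r) {
    return fields.map(function (f) { return esc(r[f]); }).join(',');
  });
  return [header].concat(rows).join('\n');
}
```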

TypeError: Cannot read property 'hitCount' of undefined

getpapers --query '(JOURNAL:"bmc ecol") AND ((FIRST_PDATE:[2015-03-01 TO 2015-05-29]))' --outdir dirxxxxx

returned following error:

TypeError: Cannot read property 'hitCount' of undefined
at EuPmc.completeCallback (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/lib/eupmc.js:61:35)
at Request.EventEmitter.emit (events.js:98:17)
at Request.mixin._fireSuccess (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:226:10)
at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:158:20
at IncomingMessage.parsers.auto (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:394:7)
at Request.mixin._encode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:195:29)
at /home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:154:16
at Request.mixin._decode (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:170:7)
at IncomingMessage.&lt;anonymous&gt; (/home/workshop/.nvm/v0.10.24/lib/node_modules/getpapers/node_modules/restler/lib/restler.js:147:14)
at IncomingMessage.EventEmitter.emit (events.js:117:20)
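The fix is presumably a defensive check before dereferencing the response at eupmc.js:61. A sketch (the response shape here is assumed from the error message, not the documented EuropePMC schema):

```javascript
// Defensive guard sketch: fail with a readable message instead of a
// TypeError when the API response lacks the expected hitCount field.
function getHitCount(data) {
  if (!data || data.hitCount == null) {
    throw new Error('Unexpected EuropePMC response: no hitCount found');
  }
  return Number(data.hitCount);
}
```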

Allow dumping of specific journal(s) and date range(s)

Rather than querying an API, this would perform direct mass downloads from a bulk source like the PubMed FTP, the arXiv FTP, CORE, etc.

This issue will track creation of the general interface - separate issues will track each specific data source.
