
paperscraper's Issues

import error

Hi!

Sorry to take up your time with this, but I have a small issue when trying to use the package on chemrxiv, biorxiv, and medrxiv.
The import fails with a "module not found" error (I have attached a snapshot).
[Screenshot: 2021-03-03 at 3:48:51 PM]

I was wondering if I had missed something?

Thank you very much for making this package open source, I look forward to using it!
Best regards,

Claire

Remote disconnected and didn't download files

Hi,
Very cool project! It looks like I installed it correctly, and I ran this code in a Jupyter notebook:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv()  # Takes ~30min and should result in ~35 MB file
biorxiv()  # Takes ~1h and should result in ~350 MB file
chemrxiv()  # Takes ~45min and should result in ~20 MB file

I get this response:

61032it [20:29, 49.63it/s]
106700it [1:45:02, 16.93it/s]

And then I get the mess below. Any ideas on what I can do? Thank you!!

Sincerely,

tom

RemoteDisconnected                        Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\requests\adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    783     e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
    786     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    787 )
    788 retries.sleep()

File ~\anaconda3\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File ~\anaconda3\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:71, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     70 while do_loop:
---> 71     json_response = requests.get(
     72         self.get_papers_url.format(
     73             begin_date=begin_date, end_date=end_date, cursor=cursor
     74         )
     75     ).json()
     76     do_loop = json_response["messages"][0]["status"] == "ok"

File ~\anaconda3\lib\site-packages\requests\api.py:73, in get(url, params, **kwargs)
     63 r"""Sends a GET request.
     64 
     65 :param url: URL for the new :class:`Request` object.
   (...)
     70 :rtype: requests.Response
     71 """
---> 73 return request("get", url, params=params, **kwargs)

File ~\anaconda3\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~\anaconda3\lib\site-packages\requests\sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~\anaconda3\lib\site-packages\requests\sessions.py:703, in Session.send(self, request, **kwargs)
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)

File ~\anaconda3\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    500 except (ProtocolError, OSError) as err:
--> 501     raise ConnectionError(err, request=request)
    503 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
      2 medrxiv()  #  Takes ~30min and should result in ~35 MB file
----> 3 biorxiv()  # Takes ~1h and should result in ~350 MB file
      4 chemrxiv()

File ~\anaconda3\lib\site-packages\paperscraper\get_dumps\biorxiv.py:42, in biorxiv(begin_date, end_date, save_path)
     40 # dump all papers
     41 with open(save_path, "w") as fp:
---> 42     for index, paper in enumerate(
     43         tqdm(api.get_papers(begin_date=begin_date, end_date=end_date))
     44     ):
     45         if index > 0:
     46             fp.write(os.linesep)

File ~\anaconda3\lib\site-packages\tqdm\std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:85, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     83                 yield processed_paper
     84 except Exception as exc:
---> 85     raise RuntimeError(
     86         "Failed getting papers: {} - {}".format(exc.__class__.__name__, exc)
     87     )

RuntimeError: Failed getting papers: ConnectionError - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
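Editorial note: below is a minimal retry sketch, not part of paperscraper, that simply restarts the dump a few times when the bioRxiv API drops the connection. As the traceback shows, paperscraper wraps the ConnectionError in a RuntimeError, so that is what gets caught; each attempt restarts the download from scratch.

import logging
import time

from paperscraper.get_dumps import biorxiv

for attempt in range(3):
    try:
        biorxiv()  # takes ~1h; every attempt restarts the download from scratch
        break
    except RuntimeError as err:  # paperscraper re-raises the ConnectionError as RuntimeError
        logging.warning("bioRxiv dump failed on attempt %d: %s", attempt + 1, err)
        time.sleep(60)  # crude back-off before the next try
else:
    raise SystemExit("bioRxiv dump kept failing after 3 attempts")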

Scrape X-rxiv via API

Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.

Ideally, these dumps should be stored on a server and updated regularly (via a cron job). Users would then just send requests to the server API. That would become the new default behaviour, but local download should still be supported too.
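As a rough illustration of the proposed setup (nothing like this exists in paperscraper yet; the framework, dump path, and field names below are assumptions), the server side could be as small as:

from fastapi import FastAPI
import pandas as pd

app = FastAPI()

# Hypothetical dump file, kept fresh by a cron job on the server.
DUMP_PATH = "server_dumps/biorxiv_latest.jsonl"

@app.get("/search")
def search(keyword: str):
    # Load the pre-built dump and return papers whose title matches the keyword.
    df = pd.read_json(DUMP_PATH, lines=True)
    hits = df[df["title"].str.contains(keyword, case=False, na=False)]
    return hits[["title", "doi", "date"]].to_dict(orient="records")

Clients would then hit /search?keyword=... instead of downloading the full dump, while the existing local-download path stays available.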

get_dumps.chemrxiv does nothing

I got chem_token from figshare.com.

from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv
...
chemrxiv(save_path=chem_save_path, token=chem_token)

Running:

WARNING:paperscraper.load_dumps: No dump found for chemrxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dump found for medrxiv. Skipping entry.
0it [00:00, ?it/s]
INFO:paperscraper.get_dumps.utils.chemrxiv.utils:Done, shutting down

The file chemrxiv_2021-10-07.jsonl is created but empty.

Meanwhile med and bio seem to work fine!

ImportError: attempted relative import beyond top-level package

Probably my fault, but I pasted this code:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file

into a get_dumps.py in paperscraper/paperscraper and tried running it using python3 paperscraper, and I got this error:

Traceback (most recent call last):
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/get_dumps.py", line 1, in <module>
    from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/__init__.py", line 10, in <module>
    from .load_dumps import QUERY_FN_DICT
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/load_dumps.py", line 8, in <module>
    from .arxiv import get_and_dump_arxiv_papers
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/__init__.py", line 1, in <module>
    from .arxiv import *  # noqa
  File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/arxiv.py", line 5, in <module>
    import arxiv
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/__init__.py", line 1, in <module>
    from .arxiv import *  # noqa
  File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/arxiv.py", line 7, in <module>
    from ..utils import dump_papers
ImportError: attempted relative import beyond top-level package

I'm likely doing something wrong, and I was hoping you could help me figure out what it is in particular.

Thanks!

-Morgan

Edit: I am using a conda environment to run this, if that makes a difference.
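A likely explanation (an assumption, not confirmed in the thread): running a script from inside paperscraper/paperscraper puts that folder on sys.path, so import arxiv resolves to the local arxiv/ subpackage instead of the arxiv library, and its relative import then fails. A minimal sketch of a workaround is to put the snippet in its own script outside the cloned source tree, e.g.:

# run_dumps.py -- hypothetical script saved outside the paperscraper source tree,
# so that "paperscraper" and "arxiv" resolve to the pip-installed packages.
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv

if __name__ == "__main__":
    medrxiv()   # Takes ~30min and should result in ~35 MB file
    biorxiv()   # Takes ~1h and should result in ~350 MB file
    chemrxiv()  # Takes ~45min and should result in ~20 MB file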

Error when importing any of chemrxiv, biorxiv, medrxiv from paperscraper.get_dumps

I just installed paperscraper from pip today. However, I got the error below when doing from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv.

WARNING:paperscraper.load_dumps: No dump found for biorxiv. Skipping entry.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()

TypeError: must be real number, not str

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 'date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_529/3049928879.py in <module>
----> 1 from paperscraper.get_dumps import medrxiv
      2 # chemrxiv(token=api_token)
      3 # medrxiv()
      4 # biorxiv()

~/.local/lib/python3.7/site-packages/paperscraper/__init__.py in <module>
      8 from typing import List, Union
      9 
---> 10 from .load_dumps import QUERY_FN_DICT
     11 from .utils import get_filename_from_query
     12 

~/.local/lib/python3.7/site-packages/paperscraper/load_dumps.py in <module>
     29         logger.info(f' Multiple dumps found for {db}, taking most recent one')
     30     path = sorted(dump_paths, reverse=True)[0]
---> 31     querier = XRXivQuery(path)
     32     QUERY_FN_DICT.update({db: querier.search_keywords})
     33 

~/.local/lib/python3.7/site-packages/paperscraper/xrxiv/xrxiv_query.py in __init__(self, dump_filepath, fields)
     23         self.fields = fields
     24         self.df = pd.read_json(self.dump_filepath, lines=True)
---> 25         self.df['date'] = [date.strftime('%Y-%m-%d') for date in self.df['date']]
     26 
     27     def search_keywords(

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'date'

Edit 1: The same error also occurs when doing from paperscraper.arxiv import get_and_dump_arxiv_papers. However, if I first run from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv (even though that one errors as well), the second error does not occur.
Edit 2: The version is 0.1.0.
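For context, the KeyError comes from paperscraper parsing an already-existing local dump at import time (load_dumps.py builds an XRXivQuery for every dump it finds), so a dump file without a 'date' field, e.g. an empty one, breaks the import. A quick diagnostic sketch (the filename below is hypothetical):

import pandas as pd

# Hypothetical path to the local dump that load_dumps picks up at import time.
dump = "chemrxiv_2021-10-07.jsonl"
df = pd.read_json(dump, lines=True)
print(len(df), list(df.columns))  # an empty or malformed dump has no 'date' column

Deleting or regenerating such a dump should make the import succeed again (the "No dump found ... Skipping entry" warnings are harmless).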

UnexpectedEmptyPageError and associated errors

Please excuse me if I do this incorrectly; I'm a noob. I am using Python 3.11 on Windows 11 and Ubuntu 22.04.2. I have run into an error like this on arxiv as well as medrxiv:

arxiv.arxiv.UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=%28all%3Apyschological+flow+state%29&id_list=&sortBy=relevance&sortOrder=descending&start=29500&max_results=100)

This seems to be an issue in the original code and was patched here: lukasschwab/arxiv.py#43

I did not see that and took a similar path. My code checks whether a URL is malformed or empty, handles it, and logs it. If it runs into a URL that is not responding or hangs, it waits a user-defined amount of time and moves on. You can also make it create smaller jsonl files for various reasons. I was also going to implement querying by date. Right now it's all hardcoded variables, but I was thinking of making the options available from the command line or a config file. I am also considering multi-threading, throttling calls to the service, and/or a back-off algorithm. I don't know what I am supposed to do: do I provide my fixes, if needed, and how, or do I go to the arxiv team? I also think these issues lurk in other libraries, but I have not done any extensive testing. Thank you, I appreciate your time and paperscraper.
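As an editorial illustration of the kind of guard being described (a sketch against the arxiv library directly, not the commenter's actual code; the query string and result limit are hypothetical):

import logging

import arxiv  # lukasschwab/arxiv.py

search = arxiv.Search(query="all:psychological flow state", max_results=2000)

papers = []
try:
    for result in search.results():
        papers.append({"title": result.title, "doi": result.doi})
except arxiv.UnexpectedEmptyPageError as err:
    # Deep pagination sometimes returns an empty page; keep what was
    # collected so far instead of losing the whole run.
    logging.warning("Stopped early on an unexpectedly empty page: %s", err)

print(f"Collected {len(papers)} papers")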

HTTPError for paperscraper.get_dumps.chemrxiv()

Hi, I was trying this library out and it worked for biorxiv() and medrxiv(). However, for chemrxiv() I kept getting this error:

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    xxv_dumps(3)
  File "C:\Users\User\Documents\98_Notes\Data Analytics\use_paperscraper.py", line 37, in xxv_dumps
    chemrxiv()  #  Takes ~45min and should result in ~20 MB file
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\chemrxiv.py", line 42, in chemrxiv
    download_full(save_folder, api)
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\utils.py", line 132, in download_full
    for preprint in tqdm(api.all_preprints()):
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\chemrxiv_api.py", line 103, in query_generator
    r.raise_for_status()
  File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: 	https://chemrxiv.org/engage/chemrxiv/public-api/v1%5Citems?limit=50&skip=0&searchDateFrom=2017-01-01&searchDateTo=2023-05-08

After reading through past issues and PRs, I noticed that https://chemrxiv.org/engage/chemrxiv/public-api/v1 is a valid address, and found that '%5C' is the percent-encoding of '\' (os.path.join uses the Windows path separator, which then ends up percent-encoded in the URL). So I manually stitched together a valid URL in get_dumps\utils\chemrxiv\chemrxiv_api.query_generator by replacing

r = self.request(os.path.join(self.base, query), method, params=params)

with

r = self.request(self.base + "/" + query, method, params=params)

Result: A bunch of JSONs were dumped in server_dumps, each corresponding to a single paper.
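For reference, a platform-independent sketch of the join (the base URL and "items" endpoint are taken from the traceback above), showing why the Windows separator breaks the request:

import os

base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1"
query = "items"

print(os.path.join(base, query))       # on Windows this yields "...v1\items", i.e. %5Citems once encoded
print(base.rstrip("/") + "/" + query)  # always "...v1/items"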

No DOI given in saved dumps of recent arxiv papers

Ran

from paperscraper.arxiv import get_and_dump_arxiv_papers

prompt = ['prompt engineering llm', 'prompt injection llm']
ai = ['Artificial intelligence', 'Large Language Models', 'OpenAI', 'LLM']
mi = ['ChatGPT']
query = [prompt, ai, mi]

get_and_dump_arxiv_papers(query, output_filepath='pro_inject.jsonl')

Example: the jsonl contains title, authors, and abstract for the page https://arxiv.org/abs/2302.11382, but journal is always blank and doi is null. This pattern repeats for all results.
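An editorial note on the doi field: for recent preprints the arXiv feed only carries a journal DOI, which usually does not exist yet, so the value comes back as null; arXiv's own DataCite DOI can be reconstructed from the identifier. A small sketch using the arxiv library directly (the identifier is the one from the example above):

import arxiv

result = next(arxiv.Search(id_list=["2302.11382"]).results())
print(result.doi)  # often None for recent preprints (no journal DOI yet)

# arXiv registers DataCite DOIs of the form 10.48550/arXiv.<id>
arxiv_doi = "10.48550/arXiv." + result.get_short_id().split("v")[0]
print(arxiv_doi)   # 10.48550/arXiv.2302.11382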

Scraper Killed

If a "Killed" message is displayed and paper scraping stops, can it be assumed that the IP is blocked?

How can I solve this issue?
