You can also reach me here:
Google Scholar: Jannis Born
ORCID: 0000-0001-8307-5670
@jannisborn (still, but for how long?)
Tools to scrape publication metadata from PubMed, arXiv, medRxiv, and chemRxiv.
License: MIT
Turns out, OpenEngage does provide an API :)
Base url: https://chemrxiv.org/engage/chemrxiv/public-api/v1/items
See code here: chemrxiv-dashboard/chemrxiv-dashboard.github.io@d3816f6#diff-d34f2e1442f7c9783f9229f7808dd7cbd276b7229ddea80b65146e4bed283ef7
Will try to integrate this as soon as possible
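For reference, a minimal sketch of paging through that endpoint (the limit/skip parameters appear in the 404 URL further down this page; the "itemHits" response key is an assumption, not confirmed here):

import requests

BASE = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/items"

def iter_items(page_size=50):
    # Walk the paginated item list until an empty page comes back.
    skip = 0
    while True:
        resp = requests.get(BASE, params={"limit": page_size, "skip": skip})
        resp.raise_for_status()
        hits = resp.json().get("itemHits", [])  # response key assumed
        if not hits:
            break
        yield from hits
        skip += page_size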
The underlying arxiv package had an issue with unreliable results (#43). Fortunately, this was fixed in the recent 1.0.0 release, but we still depend on the old 0.5.3 here.
Task: Bump the dependency to 1.0.1 and refactor the arxiv-related code.
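A rough sketch of what the refactored call site could look like against the 1.x interface (illustrative only, not the final paperscraper code):

import arxiv

# arxiv 1.x replaces the old arxiv.query() call with a Search object.
search = arxiv.Search(query="autoregressive language models", max_results=100)
for result in search.results():
    print(result.title, result.doi)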
Hi!
Sorry to take up your time with this, but I have a small issue when trying to use the package on chemrxiv, biorxiv, and medrxiv.
The import fails with a 'No module found' error (I have attached a screenshot).
I was wondering if I had missed something?
Thank you very much for making this package open source, I look forward to using it!
Best regards,
Claire
Hi,
Very cool project! It looks like I installed it correctly, and I ran this code in a Jupyter notebook:
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file
I get this response:
61032it [20:29, 49.63it/s]
106700it [1:45:02, 16.93it/s]
And then I get the mess below. Any ideas on what I can do? Thank you!!
Sincerely,
tom
RemoteDisconnected Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
704 conn,
705 method,
706 url,
707 timeout=timeout_obj,
708 body=body,
709 headers=headers,
710 chunked=chunked,
711 )
713 # If we're going to release the connection in ``finally:``, then
714 # the response doesn't need to know about the connection. Otherwise
715 # it will also try to release it and we'll have a double-release
716 # mess.
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
--> 449 six.raise_from(e, None)
450 except (SocketTimeout, BaseSSLError, SocketError) as e:
File <string>:3, in raise_from(value, from_value)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
443 try:
--> 444 httplib_response = conn.getresponse()
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
1376 try:
-> 1377 response.begin()
1378 except ConnectionError:
File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
319 while True:
--> 320 version, status, reason = self._read_status()
321 if status != CONTINUE:
File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
286 if not line:
287 # Presumably, the server closed the connection before
288 # sending a valid response.
--> 289 raise RemoteDisconnected("Remote end closed connection without"
290 " response")
291 try:
RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
ProtocolError Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\requests\adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
485 try:
--> 486 resp = conn.urlopen(
487 method=request.method,
488 url=url,
489 body=request.body,
490 headers=request.headers,
491 redirect=False,
492 assert_same_host=False,
493 preload_content=False,
494 decode_content=False,
495 retries=self.max_retries,
496 timeout=timeout,
497 chunked=chunked,
498 )
500 except (ProtocolError, OSError) as err:
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
783 e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
786 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
787 )
788 retries.sleep()
File ~\anaconda3\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
549 if read is False or not self._is_method_retryable(method):
--> 550 raise six.reraise(type(error), error, _stacktrace)
551 elif read is not None:
File ~\anaconda3\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
768 if value.__traceback__ is not tb:
--> 769 raise value.with_traceback(tb)
770 raise value
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
704 conn,
705 method,
706 url,
707 timeout=timeout_obj,
708 body=body,
709 headers=headers,
710 chunked=chunked,
711 )
713 # If we're going to release the connection in ``finally:``, then
714 # the response doesn't need to know about the connection. Otherwise
715 # it will also try to release it and we'll have a double-release
716 # mess.
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
--> 449 six.raise_from(e, None)
450 except (SocketTimeout, BaseSSLError, SocketError) as e:
File <string>:3, in raise_from(value, from_value)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
443 try:
--> 444 httplib_response = conn.getresponse()
445 except BaseException as e:
446 # Remove the TypeError from the exception chain in
447 # Python 3 (including for exceptions like SystemExit).
448 # Otherwise it looks like a bug in the code.
File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
1376 try:
-> 1377 response.begin()
1378 except ConnectionError:
File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
319 while True:
--> 320 version, status, reason = self._read_status()
321 if status != CONTINUE:
File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
286 if not line:
287 # Presumably, the server closed the connection before
288 # sending a valid response.
--> 289 raise RemoteDisconnected("Remote end closed connection without"
290 " response")
291 try:
ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:71, in XRXivApi.get_papers(self, begin_date, end_date, fields)
70 while do_loop:
---> 71 json_response = requests.get(
72 self.get_papers_url.format(
73 begin_date=begin_date, end_date=end_date, cursor=cursor
74 )
75 ).json()
76 do_loop = json_response["messages"][0]["status"] == "ok"
File ~\anaconda3\lib\site-packages\requests\api.py:73, in get(url, params, **kwargs)
63 r"""Sends a GET request.
64
65 :param url: URL for the new :class:`Request` object.
(...)
70 :rtype: requests.Response
71 """
---> 73 return request("get", url, params=params, **kwargs)
File ~\anaconda3\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
58 with sessions.Session() as session:
---> 59 return session.request(method=method, url=url, **kwargs)
File ~\anaconda3\lib\site-packages\requests\sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
591 return resp
File ~\anaconda3\lib\site-packages\requests\sessions.py:703, in Session.send(self, request, **kwargs)
702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
705 # Total elapsed time of the request (approximately)
File ~\anaconda3\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
500 except (ProtocolError, OSError) as err:
--> 501 raise ConnectionError(err, request=request)
503 except MaxRetryError as e:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
Input In [2], in <cell line: 3>()
1 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
2 medrxiv() # Takes ~30min and should result in ~35 MB file
----> 3 biorxiv() # Takes ~1h and should result in ~350 MB file
4 chemrxiv()
File ~\anaconda3\lib\site-packages\paperscraper\get_dumps\biorxiv.py:42, in biorxiv(begin_date, end_date, save_path)
40 # dump all papers
41 with open(save_path, "w") as fp:
---> 42 for index, paper in enumerate(
43 tqdm(api.get_papers(begin_date=begin_date, end_date=end_date))
44 ):
45 if index > 0:
46 fp.write(os.linesep)
File ~\anaconda3\lib\site-packages\tqdm\std.py:1195, in tqdm.__iter__(self)
1192 time = self._time
1194 try:
-> 1195 for obj in iterable:
1196 yield obj
1197 # Update and possibly print the progressbar.
1198 # Note: does not call self.update(1) for speed optimisation.
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:85, in XRXivApi.get_papers(self, begin_date, end_date, fields)
83 yield processed_paper
84 except Exception as exc:
---> 85 raise RuntimeError(
86 "Failed getting papers: {} - {}".format(exc.__class__.__name__, exc)
87 )
RuntimeError: Failed getting papers: ConnectionError - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.
Ideally, these dumps would be stored on a server and updated regularly (cron job); users would just send requests to the server API. That would become the new default behaviour, but local download should still be supported, too.
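A hypothetical sketch of the server side (the endpoint name, file layout, and refresh script are assumptions, not an existing API):

# refresh the dumps nightly via cron, e.g.:
#   0 3 * * * /usr/bin/python /opt/paperscraper/refresh_dumps.py
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/dumps/<db>")
def latest_dump(db: str):
    # db is one of "biorxiv", "medrxiv", "chemrxiv"; serve the newest dump file.
    return send_file(f"server_dumps/{db}_latest.jsonl")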
I got chem_token from figshare.com.
from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv
...
chemrxiv(save_path=chem_save_path, token=chem_token)
Running:
WARNING:paperscraper.load_dumps: No dump found for chemrxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dump found for medrxiv. Skipping entry.
0it [00:00, ?it/s]
INFO:paperscraper.get_dumps.utils.chemrxiv.utils:Done, shutting down
The file chemrxiv_2021-10-07.jsonl is created but empty.
Meanwhile, med and bio seem to work fine!
Probably my fault, but I pasted this code:
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
medrxiv() # Takes ~30min and should result in ~35 MB file
biorxiv() # Takes ~1h and should result in ~350 MB file
chemrxiv() # Takes ~45min and should result in ~20 MB file
into a get_dumps.py in paperscraper/paperscraper and tried running it using python3 paperscraper, and I got this error:
Traceback (most recent call last):
File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/get_dumps.py", line 1, in <module>
from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/__init__.py", line 10, in <module>
from .load_dumps import QUERY_FN_DICT
File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/load_dumps.py", line 8, in <module>
from .arxiv import get_and_dump_arxiv_papers
File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/__init__.py", line 1, in <module>
from .arxiv import * # noqa
File "/home/morgan/.local/lib/python3.10/site-packages/paperscraper/arxiv/arxiv.py", line 5, in <module>
import arxiv
File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/__init__.py", line 1, in <module>
from .arxiv import * # noqa
File "/home/morgan/anaconda3/envs/scraper/paperscraper/paperscraper/arxiv/arxiv.py", line 7, in <module>
from ..utils import dump_papers
ImportError: attempted relative import beyond top-level package
I'm likely doing something wrong, and I was hoping you could help me figure out what it is in particular.
Thanks!
-Morgan
Edit: I am using a conda environment to run this, if that makes a difference.
There are too many log printouts. How do I turn off the DEBUG/INFO messages?
Thanks!
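The log lines elsewhere on this page show the loggers are named after the modules (e.g. paperscraper.load_dumps), so raising the level on the package logger should silence them; this is a standard-library approach, not a paperscraper-specific switch:

import logging

# Suppress DEBUG/INFO from all paperscraper.* loggers; keep warnings and errors.
logging.getLogger("paperscraper").setLevel(logging.WARNING)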
I just installed paperscraper from pip today. However, I got this error when doing from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv.
WARNING:paperscraper.load_dumps: No dump found for biorxiv. Skipping entry.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()
TypeError: must be real number, not str
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
KeyError: 'date'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_529/3049928879.py in <module>
----> 1 from paperscraper.get_dumps import medrxiv
2 # chemrxiv(token=api_token)
3 # medrxiv()
4 # biorxiv()
~/.local/lib/python3.7/site-packages/paperscraper/__init__.py in <module>
8 from typing import List, Union
9
---> 10 from .load_dumps import QUERY_FN_DICT
11 from .utils import get_filename_from_query
12
~/.local/lib/python3.7/site-packages/paperscraper/load_dumps.py in <module>
29 logger.info(f' Multiple dumps found for {db}, taking most recent one')
30 path = sorted(dump_paths, reverse=True)[0]
---> 31 querier = XRXivQuery(path)
32 QUERY_FN_DICT.update({db: querier.search_keywords})
33
~/.local/lib/python3.7/site-packages/paperscraper/xrxiv/xrxiv_query.py in __init__(self, dump_filepath, fields)
23 self.fields = fields
24 self.df = pd.read_json(self.dump_filepath, lines=True)
---> 25 self.df['date'] = [date.strftime('%Y-%m-%d') for date in self.df['date']]
26
27 def search_keywords(
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
/opt/conda/envs/python3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'date'
Edit 1: The same error also occurs when doing from paperscraper.arxiv import get_and_dump_arxiv_papers. However, if I run from paperscraper.get_dumps import chemrxiv, biorxiv, medrxiv first (even though that one errors too), the second error does not occur.
Edit 2: The version is 0.1.0.
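The traceback points at the date conversion in XRXivQuery.__init__; a guard like the following sketch, assuming the crash comes from an empty or partial dump file without a 'date' column, illustrates the failure mode:

import pandas as pd

dump_filepath = "server_dumps/chemrxiv_2021-10-07.jsonl"  # example path
df = pd.read_json(dump_filepath, lines=True)
if "date" in df.columns:  # skip the conversion for empty/partial dumps
    df["date"] = [d.strftime("%Y-%m-%d") for d in df["date"]]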
Useful to postprocess/filter scraped results.
This could be achieved with fuzzy search combined with the impact_factor package: https://github.com/suqingdong/impact_factor?tab=readme-ov-file#use-in-python
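A sketch of the idea, using stdlib difflib for the fuzzy part; the impact-factor table below is a stand-in for whatever the impact_factor package returns (its Python API is not reproduced here):

import difflib

IMPACT = {"Nature": 64.8, "Science": 56.9, "Nature Communications": 16.6}  # example values

def journal_impact(journal_name, cutoff=0.8):
    # Fuzzy-match the scraped journal string against the known journal names.
    match = difflib.get_close_matches(journal_name, IMPACT, n=1, cutoff=cutoff)
    return IMPACT[match[0]] if match else None

papers = [{"title": "...", "journal": "Nature Communicatons"}]  # note the typo
high_impact = [p for p in papers if (journal_impact(p["journal"]) or 0) > 10]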
Please excuse me if I do this incorrectly; I am a noob. I am using Python 3.11 on Windows 11 and Ubuntu 22.04.2. I have run into an error like this on arxiv as well as medrxiv:
arxiv.arxiv.UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=%28all%3Apyschological+flow+state%29&id_list=&sortBy=relevance&sortOrder=descending&start=29500&max_results=100)
This seems to be an issue in the original code and was patched here: lukasschwab/arxiv.py#43
I did not see that, and I took a similar path. My code checks whether a URL is malformed or empty, handles it, and logs it. If it runs into a URL that is not responding or hangs, it waits some user-defined amount of time and moves on. You can also make it create smaller jsonl files for various reasons. I was also going to implement querying by date. Right now it is all hardcoded variables, but I was thinking I should make the options configurable from the command line or a config file. I am also thinking about multi-threading, throttling calls to the service, and/or a back-off algorithm. I don't know what I am supposed to do: do I provide my fixes, if needed, and how, or do I go to the arxiv team? I also suspect these issues lurk in other libraries, but I have not done anything like extensive testing. Thank you, I appreciate your time and paperscraper.
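To illustrate the back-off idea (my own sketch, not the poster's actual code):

import time
import requests

def fetch_with_backoff(url, retries=5, base_delay=2.0):
    # Retry transient failures with exponential back-off before giving up.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")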
Hi, I was trying this library out, and it worked for biorxiv() and medrxiv(). However, for chemrxiv() I kept getting this error:
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
xxv_dumps(3)
File "C:\Users\User\Documents\98_Notes\Data Analytics\use_paperscraper.py", line 37, in xxv_dumps
chemrxiv() # Takes ~45min and should result in ~20 MB file
File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\chemrxiv.py", line 42, in chemrxiv
download_full(save_folder, api)
File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\utils.py", line 132, in download_full
for preprint in tqdm(api.all_preprints()):
File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\tqdm\std.py", line 1178, in __iter__
for obj in iterable:
File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\paperscraper\get_dumps\utils\chemrxiv\chemrxiv_api.py", line 103, in query_generator
r.raise_for_status()
File "C:\Users\User\Desktop\Programs\Winpython64_3.11\WPy64-31110\python-3.11.1.amd64\Lib\site-packages\requests\models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://chemrxiv.org/engage/chemrxiv/public-api/v1%5Citems?limit=50&skip=0&searchDateFrom=2017-01-01&searchDateTo=2023-05-08
After reading through past issues and PRs, I noticed that https://chemrxiv.org/engage/chemrxiv/public-api/v1 was a valid address, and found that '%5C' is the percent-encoding of '\'. So I manually stitched together a valid URL in get_dumps\utils\chemrxiv\chemrxiv_api.query_generator by replacing
r = self.request(os.path.join(self.base, query), method, params=params)
with
r = self.request(self.base + "/" + query, method, params=params)
Result: a bunch of JSONs were dumped in server_dumps, each corresponding to a single paper.
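The underlying cause is that os.path.join uses the OS path separator, so on Windows it splices in a '\' that requests then percent-encodes as %5C. A separator-safe alternative to the string concatenation above (note that urljoin needs the trailing slash on the base):

from urllib.parse import urljoin

base = "https://chemrxiv.org/engage/chemrxiv/public-api/v1/"
url = urljoin(base, "items")  # same result on every OS: .../public-api/v1/items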
When I tried to download papers from PubMed, I got this error:
JSONDecodeError: Invalid control character at: line 1 column 105 (char 104)
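One thing to try when parsing the raw payload yourself (assumption: the offending character sits inside a string field of a PubMed record): json.loads accepts strict=False, which tolerates control characters that the default parser rejects.

import json

raw = '{"title": "bad\x01control"}'  # stand-in for the offending record
record = json.loads(raw, strict=False)  # succeeds where json.loads(raw) raises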
Ran:
from paperscraper.arxiv import get_and_dump_arxiv_papers
prompt = ['prompt engineering llm', 'prompt injection llm']
ai = ['Artificial intelligence', 'Large Language Models', 'OpenAI','LLM']
mi = ['ChatGPT']
query = [prompt, ai, mi]
get_and_dump_arxiv_papers(query, output_filepath='pro_inject.jsonl')
Example: the jsonl contains title, authors, and abstract for the page https://arxiv.org/abs/2302.11382, but journal is always blank and doi is null. This pattern repeats for all results.
If a 'Killed' error is displayed and paper scraping stops, can it be assumed that the IP is blocked?
How can I solve this issue?