
scrappaper's Introduction

ScrapPaper


About this project

ScrapPaper is a web scraping method to extract journal information from PubMed and Google Scholar using a Python script. Users need to install Python 3 and the required modules, then run the scrappaper.py script. Refer to the published paper for detailed instructions. This side project was completed on March 8, 2022 by @rafsanlab. Follow me on Twitter: https://twitter.com/rafsanlab

Paper to cite:

Rafsanjani, M. R. (2022). ScrapPaper: A web scrapping method to extract journal information from PubMed and Google Scholar search result using Python. In bioRxiv (p. 2022.03.08.483427). https://doi.org/10.1101/2022.03.08.483427

System Requirement

  • Python (version 3 or above)
  • The following Python modules: requests, csv, re, time, random, pandas, sys, bs4 (see the import check below)
  • An operating system (the current code was tested on Windows 10)
  • Command prompt (if using Windows) / terminal
  • The URL of the first page of search results from PubMed or Google Scholar
  • A text editor or spreadsheet software to open the results
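
Of the modules listed above, csv, re, time, random, and sys ship with the Python standard library; only requests, pandas, and bs4 need to be installed separately (bs4 is provided by the beautifulsoup4 package on PyPI). A minimal sketch to confirm the third-party modules are importable before running the script:

    # Sketch: verify the third-party dependencies of scrappaper.py are installed.
    # The standard-library modules (csv, re, time, random, sys) need no installation.
    import importlib

    PIP_NAMES = {"requests": "requests", "pandas": "pandas", "bs4": "beautifulsoup4"}

    for module, pip_name in PIP_NAMES.items():
        try:
            importlib.import_module(module)
            print(f"{module}: OK")
        except ImportError:
            print(f"{module}: missing -- install it with 'pip install {pip_name}'")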

Simplified instructions

  1. Download the scrappaper.py script and cd into its directory from the terminal.
  2. Copy the URL of the first page of search results from PubMed or Google Scholar.
  3. Run the script and paste the URL when prompted.
  4. When finished, open the results using a text editor or spreadsheet (or with pandas, as sketched below).
  5. Refer to the published paper for detailed instructions.
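
As a sketch of step 4, the scraped results can also be inspected programmatically with pandas. The file name used here is an assumption; substitute the CSV that scrappaper.py actually writes to your working directory.

    # Sketch: preview the scraped results without opening a spreadsheet.
    # "results.csv" is a placeholder for the CSV file written by scrappaper.py.
    import pandas as pd

    df = pd.read_csv("results.csv")
    print(df.shape)   # number of scraped entries and columns
    print(df.head())  # first few rows of the results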

Disclaimer

Web scraping might get you blocked by the server; run at your own risk. So far, we have scraped 28 pages of Google Scholar results with no issues.
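
The usual mitigation is to pause between page requests. The script lists time and random among its required modules, presumably for this kind of delay; a generic sketch of the idea (not necessarily the script's exact implementation) looks like this:

    # Sketch: a randomized pause between successive page requests
    # to reduce the chance of being blocked or served a CAPTCHA.
    import random
    import time

    def polite_pause(min_s: float = 2.0, max_s: float = 7.0) -> None:
        """Sleep for a random interval before fetching the next results page."""
        time.sleep(random.uniform(min_s, max_s))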

scrappaper's People

Contributors

rafsanlab


scrappaper's Issues

Unable to retrieve the CSV file of Google Scholar search results.

The PubMed search results are output successfully, but the Google Scholar search results are not.

I was searching for the human gene symbol "USPL1" in PubMed and Google Scholar, and I got the following error for the Google Scholar search:

% python3 scrappaper.py

Initiating... please wait.

Please paste search URL and press Enter:https://scholar.google.com/scholar?as_ylo=2023&q=%22USPL1%22&hl=en&as_sdt=0,5
Input is from: Google Scholar.

Waiting for a few secs...
Waiting done. Continuing...

Traceback (most recent call last):
  File "/Users/file/to/path/scrappaper.py", line 167, in <module>
    search_results = soup.find_all("div", class_="gs_ab_mdw")[1].text
IndexError: list index out of range
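
For context, this IndexError means the page returned by Google Scholar did not contain the expected results-count header (most often because a CAPTCHA or block page was served instead of results). A hedged guard around line 167 of scrappaper.py, assuming soup and sys are already in scope there, might look like this:

    # Sketch of a defensive check; this is not part of the original script.
    blocks = soup.find_all("div", class_="gs_ab_mdw")
    if len(blocks) < 2:
        sys.exit("Google Scholar did not return a normal results page "
                 "(possibly a CAPTCHA or block page). Retry later or from another IP.")
    search_results = blocks[1].text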

Following the description in the article, I searched as follows:


I cloned and ran the script repeatedly; it succeeded only once, and after that I get the error above.

How should I respond?

Thank you!

Small Feature Request: Ability to define the output file name by the search keywords used.

Thank you for developing such a great Python script. This truly stands out over the other 10 I tried: finally, a script that actually searches for direct keywords within the search results.

There are a couple of small things I ran into that, if addressed, would add considerable power to the script. Currently, it is nearly impossible to finish running the script on any search that returns more than 49 pages of results, because of CAPTCHA. And if I notice that the result pages are starting to contain irrelevant studies and abort, the CSV output process breaks.


  1. A simple solution: a parameter that pre-defines how many search pages to parse.
    Example, to parse only the first 25 search pages (-s = search pages): python scrappaper.py -s 25

  2. Allow the script to accept a URL that starts on, for example, the 50th results page, so I can switch my VPN IP and continue from the point where it stopped due to CAPTCHA. Currently it restarts from the 1st results page even if the URL points to the 50th.

  3. Have the script automatically name the output CSV file with the keywords used in the Google Scholar/PubMed search plus a date and time stamp. That way it is clear which CSV file contains which information, and the script does not keep overwriting the same output file.

One other suggestion, but less important:

  1. To prevent losing the CSV output if a user aborts the process, create another parameter (-w = write_to_file) that makes the script write and append to the CSV file after each completed search page. This would allow the user to abort the process when the current pages contain irrelevant data, without breaking the CSV output for the articles already parsed (see the sketch below).

Thanks in advance; I'm sure many users would benefit greatly from these small updates.
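
None of the above exists in the current script. As a hedged sketch of how requests 1 and 3 and the append-per-page suggestion could be wired together (the flag names and helper functions here are hypothetical, not part of scrappaper.py):

    # Hypothetical sketch of the requested options; not part of scrappaper.py.
    import argparse
    import csv
    from datetime import datetime

    parser = argparse.ArgumentParser(description="ScrapPaper feature-request sketch")
    parser.add_argument("-s", "--search-pages", type=int, default=None,
                        help="maximum number of result pages to parse")
    parser.add_argument("-w", "--write-to-file", action="store_true",
                        help="append results to the CSV after each page")
    args = parser.parse_args()

    def output_name(keywords: str) -> str:
        """Request 3: build a CSV name from the search keywords plus a timestamp."""
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        safe = "".join(c if c.isalnum() else "_" for c in keywords)
        return f"{safe}_{stamp}.csv"

    def append_page(path: str, rows: list) -> None:
        """Suggestion: append each parsed page immediately so an abort loses nothing."""
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)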
