
scrappaper's Introduction

ScrapPaper


About this project

ScrapPaper is a web scraping method to extract journal information from PubMed and Google Scholar using a Python script. Users need to install Python 3 and the required modules, then run the scrappaper.py script. Refer to the published paper for detailed instructions. This side project was completed on March 8, 2022 by @rafsanlab. Follow me on Twitter: https://twitter.com/rafsanlab

Paper to cite:

Rafsanjani, M. R. (2022). ScrapPaper: A web scrapping method to extract journal information from PubMed and Google Scholar search result using Python. In bioRxiv (p. 2022.03.08.483427). https://doi.org/10.1101/2022.03.08.483427

System Requirement

  • Python (version 3 or above)
  • The following Python modules: requests, csv, re, time, random, pandas, sys, bs4 (see the import check below)
  • An operating system (the current code was tested on Windows 10)
  • Command prompt (if using Windows) / terminal
  • The URL of the first page of search results from PubMed or Google Scholar
  • A text editor or spreadsheet software to open the results
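
Of the modules listed above, csv, re, time, random, and sys ship with the Python standard library; only requests, pandas, and bs4 need to be installed separately (bs4 is provided by the beautifulsoup4 package on PyPI). A minimal sketch to confirm the third-party modules are importable before running the script:

    # Sketch: verify the third-party dependencies of scrappaper.py are installed.
    # The standard-library modules (csv, re, time, random, sys) need no installation.
    import importlib

    PIP_NAMES = {"requests": "requests", "pandas": "pandas", "bs4": "beautifulsoup4"}

    for module, pip_name in PIP_NAMES.items():
        try:
            importlib.import_module(module)
            print(f"{module}: OK")
        except ImportError:
            print(f"{module}: missing -- install it with 'pip install {pip_name}'")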

Simplified instructions

  1. Download the scrappaper.py script and cd into its directory from the terminal.
  2. Copy the URL of the first page of search results from PubMed or Google Scholar.
  3. Run the script and paste the URL when prompted.
  4. When finished, open the results using a text editor or spreadsheet (or with pandas, as sketched below).
  5. Refer to the published paper for detailed instructions.
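
As a sketch of step 4, the scraped results can also be inspected programmatically with pandas. The file name used here is an assumption; substitute the CSV that scrappaper.py actually writes to your working directory.

    # Sketch: preview the scraped results without opening a spreadsheet.
    # "results.csv" is a placeholder for the CSV file written by scrappaper.py.
    import pandas as pd

    df = pd.read_csv("results.csv")
    print(df.shape)   # number of scraped entries and columns
    print(df.head())  # first few rows of the results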

Disclaimer

Web scraping might get you blocked by the server; run at your own risk. So far, we have scraped 28 pages of Google Scholar results with no issues.
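
The usual mitigation is to pause between page requests. The script lists time and random among its required modules, presumably for this kind of delay; a generic sketch of the idea (not necessarily the script's exact implementation) looks like this:

    # Sketch: a randomized pause between successive page requests
    # to reduce the chance of being blocked or served a CAPTCHA.
    import random
    import time

    def polite_pause(min_s: float = 2.0, max_s: float = 7.0) -> None:
        """Sleep for a random interval before fetching the next results page."""
        time.sleep(random.uniform(min_s, max_s))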

scrappaper's People

Contributors

rafsanlab


scrappaper's Issues

Unable to retrieve the CSV file of Google Scholar search results.

The PubMed search results are output successfully, but the Google Scholar search results are not.

I was searching for the human gene symbol "USPL1" in PubMed and Google Scholar, and I got the following error for the Google Scholar search:

% python3 scrappaper.py

Initiating... please wait.

Please paste search URL and press Enter:https://scholar.google.com/scholar?as_ylo=2023&q=%22USPL1%22&hl=en&as_sdt=0,5
Input is from: Google Scholar.

Waiting for a few secs...
Waiting done. Continuing...

Traceback (most recent call last):
  File "/Users/file/to/path/scrappaper.py", line 167, in <module>
    search_results = soup.find_all("div", class_="gs_ab_mdw")[1].text
IndexError: list index out of range
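
For context, this IndexError means the page returned by Google Scholar did not contain the expected results-count header (most often because a CAPTCHA or block page was served instead of results). A hedged guard around line 167 of scrappaper.py, assuming soup and sys are already in scope there, might look like this:

    # Sketch of a defensive check; this is not part of the original script.
    blocks = soup.find_all("div", class_="gs_ab_mdw")
    if len(blocks) < 2:
        sys.exit("Google Scholar did not return a normal results page "
                 "(possibly a CAPTCHA or block page). Retry later or from another IP.")
    search_results = blocks[1].text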

Following the description in the article, I searched as follows:


I cloned and ran the script repeatedly; it succeeded only once, and after that I get the error above.

How should I respond?

Thank you!

Small Feature Request: Ability to define the output file name by the search keywords used.

Thank you for developing such a great Python script. This truly stands out over the other 10 I tried: finally, a script that actually searches for direct keywords within the search results.

There are a couple of small things I ran into that, if addressed, would add considerable power to the script. Currently, it is nearly impossible to finish running the script on any search that returns more than 49 pages of results, because of CAPTCHA. And if I notice that the result pages are starting to contain irrelevant studies and abort, the CSV output process breaks.


  1. A simple solution: a parameter that pre-defines how many search pages to parse.
    Example, to parse only the first 25 search pages (-s = search pages): python scrappaper.py -s 25

  2. Allow the script to accept a URL that starts on, for example, the 50th results page, so I can switch my VPN IP and continue from the point where it stopped due to CAPTCHA. Currently it restarts from the 1st results page even if the URL points to the 50th.

  3. Have the script automatically name the output CSV file with the keywords used in the Google Scholar/PubMed search plus a date and time stamp. That way it is clear which CSV file contains which information, and the script does not keep overwriting the same output file.

One other suggestion, but less important:

  1. To prevent losing the CSV output if a user aborts the process, create another parameter (-w = write_to_file) that makes the script write and append to the CSV file after each completed search page. This would allow the user to abort the process when the current pages contain irrelevant data, without breaking the CSV output for the articles already parsed (see the sketch below).

Thanks in advance; I'm sure many users would benefit greatly from these small updates.
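
None of the above exists in the current script. As a hedged sketch of how requests 1 and 3 and the append-per-page suggestion could be wired together (the flag names and helper functions here are hypothetical, not part of scrappaper.py):

    # Hypothetical sketch of the requested options; not part of scrappaper.py.
    import argparse
    import csv
    from datetime import datetime

    parser = argparse.ArgumentParser(description="ScrapPaper feature-request sketch")
    parser.add_argument("-s", "--search-pages", type=int, default=None,
                        help="maximum number of result pages to parse")
    parser.add_argument("-w", "--write-to-file", action="store_true",
                        help="append results to the CSV after each page")
    args = parser.parse_args()

    def output_name(keywords: str) -> str:
        """Request 3: build a CSV name from the search keywords plus a timestamp."""
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        safe = "".join(c if c.isalnum() else "_" for c in keywords)
        return f"{safe}_{stamp}.csv"

    def append_page(path: str, rows: list) -> None:
        """Suggestion: append each parsed page immediately so an abort loses nothing."""
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)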
