A web scraping and visualization project on SJR and WoS journal indexes.
This project scrapes, analyzes, and visualizes data on research journals. It has two parts: web scraping, done with Selenium and Python, and data analysis and visualization, done with Tableau. It was completed as the first capstone project of the MasterCourse Data Science Cohort 2 program.
The data is scraped from the following websites:
An external dataset is also used in this project:
From these 3 sources, the following information is scraped:
- Journal Name or Title
- Subject Area
- Open Access Status
- Publisher
- Country
- Coverage Year
- Journal Rank
- SJR Index
- Quartile
- H-Index
- CiteScore
- References Count
- Citations Count
- Documents Count ...
The scraped data is then cleaned and analyzed using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn. The cleaned data is then visualized using Tableau. The final dataset can be found on Kaggle.
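As a rough illustration of the cleaning and merging step, the sketch below joins two small in-memory tables on a normalized journal title. The column names (`Title`, `SJR`, `CiteScore`), the sample rows, and the helpers `normalize_title` and `combine_sources` are hypothetical and are not taken from the project's notebook, which may clean and combine the sources differently.

```python
import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so titles from different sources match."""
    return " ".join(title.lower().split())


def combine_sources(sjr: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Clean two source tables and join them on a normalized title key."""
    sjr = sjr.copy()
    other = other.copy()
    sjr["title_key"] = sjr["Title"].map(normalize_title)
    other["title_key"] = other["Title"].map(normalize_title)
    # Drop duplicate journals within each source before joining.
    sjr = sjr.drop_duplicates(subset="title_key")
    other = other.drop_duplicates(subset="title_key")
    # A left join keeps every row of the first table, even without a match.
    merged = sjr.merge(other.drop(columns="Title"), on="title_key", how="left")
    return merged.drop(columns="title_key")


# Made-up sample rows, for illustration only.
sjr_df = pd.DataFrame(
    {"Title": ["Journal A ", "Journal B", "journal a"], "SJR": [1.5, 0.7, 1.5]}
)
other_df = pd.DataFrame({"Title": ["journal  a"], "CiteScore": [4.2]})

combined = combine_sources(sjr_df, other_df)
print(combined)
```

Joining on a normalized key rather than the raw title is one way to tolerate differences in case and spacing between sources; matching on an identifier such as ISSN, where available, would likely be more reliable.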
Python libraries and software used in this project:
This project was developed with Python 3.11.0. Please install the latest version of Python before running the project.
Below are the steps to run the project:
- Clone the repo
git clone https://github.com/abir0/SJR-Journal-Ranking.git
- Initialize and activate a virtual environment
virtualenv venv
source venv/bin/activate
- Install dependencies
pip install -r requirements.txt
- Download ChromeDriver from https://chromedriver.chromium.org/downloads and add the path to the chromedriver.exe file to the PATH environment variable.
- Run the scraper scripts
python src/sjr_scraper.py
python src/wos_scraper.py
- Run all the cells in the data transformation notebook in Google Colab, or download the notebook and run it in Jupyter.
- You will get a file named combined_journal_ranking_data.csv. This is the final data.
- Open the SJR Journal Ranking Analysis.twb file in Tableau (or open the public Tableau link) and connect the combined_journal_ranking_data.csv file to the workbook.
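Before running the scrapers, it can help to confirm that the setup steps above took effect. The following is a minimal pre-flight sketch using only the standard library; the function name `check_environment` and the minimum version constant are illustrative assumptions and are not part of the repository.

```python
import shutil
import sys

# Assumption: treat the Python version named in this README as the minimum.
MIN_PYTHON = (3, 11)


def check_environment(driver_name: str = "chromedriver") -> list[str]:
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ expected, found {found}"
        )
    # shutil.which searches the PATH environment variable for the executable.
    if shutil.which(driver_name) is None:
        problems.append(f"{driver_name} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

If the script prints nothing, both checks passed. It does not verify that the ChromeDriver version matches the installed Chrome version, which would still need to be checked separately.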
The final dashboard can be found here.
Here are the two dashboards:
Key findings from the analysis:
- The correlation analysis shows a positive correlation between the SJR Index and CiteScore, H-index, and Cites per Doc, which suggests that these metrics are better indicators than simple counts of citations, references, and documents.
- For lower-ranking journals, however, these metrics are less informative because the relationship is noisier (note that the correlation plots become more scattered toward the right).
- Open Access journals have a higher average of Citations per Document than non-Open Access journals.
- One interesting observation: based on the number of documents, citations, and references, MDPI is among the top 5 publishers. MDPI publishes a large number of journals, but its poor CiteScore suggests that their quality is lower than that of the top 5 publishers by CiteScore.
- Based on CiteScore, the top 5 publishers are: Wiley, Elsevier, Springer, Nature Portfolio, and Routledge.
- The top 5 countries with the highest number of journals are: United States, United Kingdom, Netherlands, Germany, and Switzerland.
- Medicine and Social Sciences are the top 2 subject areas by number of documents, references, and combined H-index.