aba-cli-scrapper is a Python package that provides a dedicated command-line interface (CLI) for scraping data from Alibaba.com. The project extracts products and their related supplier information from Alibaba.com and stores them in a local database (SQLite or MySQL). It uses asynchronous requests to handle large numbers of requests efficiently, and lets users run the scraper and manage the database from the CLI.
- Asynchronous API: Uses the asynchronous API of Playwright together with Bright Data proxies to handle many result pages efficiently.
- Database Integration: Stores scraped data in a database (SQLite or MySQL) for structured persistence.
- User-Friendly CLI: Provides easy-to-use commands for running the scraper and managing the database.
pipx is recommended over pip for end-user applications written in Python. pipx installs the package in an isolated environment and exposes its CLI entry points everywhere, which avoids dependency conflicts and allows a clean uninstall. If you prefer pip, replace pipx with pip in the command below, but create and activate a virtual environment first so that aba-cli-scrapper does not conflict with other packages. To install aba-cli-scrapper using pipx:
pipx install aba-cli-scrapper
Need help? Run any command followed by --help for detailed information about its usage and options. For example, aba-run --help shows all available sub-commands and how to use them.
Warning: aba-run is the base command. All other commands introduced below are sub-commands and must always be preceded by aba-run.
Practice makes perfect, so let's get started with a use case. Assume that you want to scrape data about electric bikes from Alibaba.com.
Scraper Demo
scraper sub-command: Starts scraping Alibaba.com for the given keywords. This command takes two required arguments and one optional argument:
- key_words (required): The search term(s) for finding products on Alibaba. Enclose multiple keywords in quotes.
- --page-results or -pr (required): Keywords usually match many pages of results, so you must indicate how many pages to pull. If no value is provided, 10 is used by default.
- --html-folder or -hf (optional): The directory in which to store the raw HTML files. If omitted, a folder named after the sanitized keywords is created automatically; in this case, electric_bikes would be used as the results folder name.
Example:
aba-run scraper "electric bikes" -hf "bike_results" -pr 15
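The default folder name described above (electric bikes becoming electric_bikes) can be pictured with a small sketch. The helper name sanitize_keywords and its exact rules (lowercasing, replacing every run of non-alphanumeric characters with an underscore) are assumptions for illustration; the package's own sanitization may differ.

```python
import re


def sanitize_keywords(key_words: str) -> str:
    """Hypothetical helper: turn a search phrase into a folder-friendly name."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_".
    cleaned = re.sub(r"[^a-z0-9]+", "_", key_words.lower())
    # Trim underscores left over from leading or trailing spaces and punctuation.
    return cleaned.strip("_")


print(sanitize_keywords("electric bikes"))  # electric_bikes
```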
By default, scraper uses async mode, which is supported by the Bright Data API, so you need to provide your API key to use it. Set it with:
aba-run set-api-key your_api_key
Then run the scraper sub-command without the --sync-api flag to use async mode.
If you want to use sync mode instead, run:
aba-run scraper "electric bikes" -hf "bike_results" -pr 15 --sync-api
And voilà! The bike_results directory (the name you provided) has now been created and should contain all the HTML files from Alibaba.com that match your keywords.
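To give a feel for why async mode helps when many result pages are requested, here is a minimal sketch of concurrent page fetching with asyncio. It does not use Playwright or Bright Data: fetch_page is a stand-in that only sleeps, and the names fetch_page, fetch_all, and max_concurrency are assumptions, not part of aba-cli-scrapper.

```python
import asyncio


async def fetch_page(page_number: int, limit: asyncio.Semaphore) -> str:
    """Stand-in for loading one results page; a real run would drive a browser."""
    async with limit:  # cap how many pages are in flight at once
        await asyncio.sleep(0.01)  # simulated network wait
        return f"<html>page {page_number}</html>"


async def fetch_all(page_results: int, max_concurrency: int = 5) -> list[str]:
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_page(n, limit) for n in range(1, page_results + 1)]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


pages = asyncio.run(fetch_all(15))
print(len(pages))  # 15
```

Because the waits overlap, fetching 15 pages takes roughly three simulated round trips here rather than fifteen, which is the general benefit the async mode is after.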
db-init Demo with sqlite
db-init sub-command: Creates a new MySQL or SQLite database. This command takes one required argument and six optional arguments (depending on the engine you choose):
- engine (required): Choose either sqlite or mysql.
- --sqlite-file or -f (optional, SQLite only): The name of your SQLite database file (without any extension).
- --host or -h, --port or -p, --user or -u, --password or -pw, --db-name or -db (required for MySQL): Your MySQL database connection details.
- --only-with or -ow (optional, MySQL only): Use this if you want to update only some of the credentials saved in the db_credentials.json file before initializing a brand new database.
NB: --host and --port default to localhost and 3306 respectively.
MySQL Use case:
aba-run db-init mysql -u "mysql_username" -pw "mysql_password" -db "alibaba_products"
Assuming you have already initialized a database and want to create a new one under a different database name without entering the username and password again, simply run:
aba-run db-init mysql --only-with -db "alibaba_products"
NB: When you initialize with mysql as the engine, the db-init sub-command saves your credentials in the db_credentials.json file. When you later need to update your database, simply run aba-run db-update mysql --kw-results bike_results\ to update it automatically using your saved credentials.
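The idea behind --only-with and the saved credentials file can be sketched as a partial update of a JSON document: only the supplied keys change, and everything else is kept. The helper name update_credentials and the key names (host, port, user, password, db_name) are assumptions; the real layout of db_credentials.json is not documented here.

```python
import json
import tempfile
from pathlib import Path


def update_credentials(path: Path, **changes) -> dict:
    """Hypothetical helper: overwrite only the supplied keys, keep the rest."""
    saved = json.loads(path.read_text()) if path.exists() else {}
    # Ignore options that were not supplied (None) so saved values survive.
    saved.update({key: value for key, value in changes.items() if value is not None})
    path.write_text(json.dumps(saved, indent=2))
    return saved


with tempfile.TemporaryDirectory() as tmp:
    cred_file = Path(tmp) / "db_credentials.json"
    # Assumed contents after a first db-init run.
    cred_file.write_text(json.dumps({
        "host": "localhost",
        "port": 3306,
        "user": "mysql_username",
        "password": "mysql_password",
        "db_name": "alibaba_products",
    }))
    merged = update_credentials(cred_file, db_name="another_database_name", user=None)

print(merged["db_name"], merged["user"])
```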
SQLite Use case:
aba-run db-init sqlite --sqlite-file alibaba_data
The db-init sub-command uses the sqlite engine by default, so if you plan to use SQLite you can omit the engine, as below:
SQLite Use case V2:
aba-run db-init -f alibaba_data
As soon as your database has been initialized, you can update it with the scraped data.
db-update Demo
db-update sub-command: Adds the scraped data from the HTML files to your database. You can't run this command twice with the same database credentials, as this would cause a UNIQUE CONSTRAINT ERROR. This command takes two required arguments and two optional arguments:
- --db-engine (required): Your database engine: sqlite or mysql.
- --kw-results (required): The path to the folder containing the HTML files generated by the scraper sub-command.
- --filename (required for SQLite): If you're using SQLite, the filename of your database, without any extension.
- --db-name (optional for MySQL): If you're using MySQL and want to push the data to a different database, the desired database name.
MySQL Use case:
aba-run db-update mysql --kw-results bike_results\
NB: What if you want to change something while updating the database? Suppose you have run another scraping command and want to save its data under another database name, without updating the credentials file or retyping all the parameters. In that case, simply run aba-run db-update mysql --kw-results another_keyword_folder_result\ --db-name "another_database_name".
SQLite Use case:
aba-run db-update sqlite --kw-results bike_results\ --filename alibaba_data
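The UNIQUE CONSTRAINT ERROR mentioned for db-update can be reproduced in isolation with Python's built-in sqlite3 module. This is only an illustration of the mechanism: the products table and its name column are assumed, not taken from the package's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one table whose "name" column must be unique.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")


def load_products(names):
    # "with conn" commits on success and rolls back if an insert fails.
    with conn:
        conn.executemany(
            "INSERT INTO products (name) VALUES (?)", [(name,) for name in names]
        )


load_products(["electric bike A", "electric bike B"])
try:
    # Loading the same rows again violates the UNIQUE constraint.
    load_products(["electric bike A", "electric bike B"])
    second_run = "ok"
except sqlite3.IntegrityError as exc:
    second_run = f"rejected: {exc}"

row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(second_run, row_count)
```

The second load is rejected and the table still holds the two original rows, which is why the same results folder should not be pushed twice into the same database.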
export-as-csv sub-command: Exports the scraped data from your SQLite database to a CSV file. This CSV file will contain a FULL OUTER JOIN of the products and suppliers tables. This command takes one required argument and one optional argument:
- --sqlite_file (required): The name of your SQLite database file, with its extension.
- --to or -t (required): The name of your CSV file, with its extension.
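As a rough picture of what a full outer join of products and suppliers written to CSV looks like, here is a self-contained sketch using sqlite3 and csv. The table columns are assumptions, and because SQLite only gained native FULL OUTER JOIN in version 3.39, the sketch emulates it with two LEFT JOINs combined by UNION ALL; this is not necessarily how export-as-csv is implemented.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema and sample rows, for illustration only.
conn.executescript("""
CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, supplier_id INTEGER);
INSERT INTO suppliers VALUES (1, 'Supplier One'), (2, 'Supplier Two');
INSERT INTO products VALUES (10, 'City e-bike', 1), (11, 'Folding e-bike', NULL);
""")

# Emulated full outer join: every product with its supplier (if any),
# plus every supplier that has no product at all.
FULL_JOIN = """
SELECT p.title, s.name
FROM products AS p LEFT JOIN suppliers AS s ON s.id = p.supplier_id
UNION ALL
SELECT p.title, s.name
FROM suppliers AS s LEFT JOIN products AS p ON p.supplier_id = s.id
WHERE p.id IS NULL
"""
rows = conn.execute(FULL_JOIN).fetchall()

# Written to an in-memory buffer here; a real export would open a .csv file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["product_title", "supplier_name"])
writer.writerows(rows)
print(buffer.getvalue())
```

Unmatched rows on either side survive with an empty cell, which is the property that distinguishes a full outer join from an inner join.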
This project has a lot of potential for growth! Here are some exciting features I'm considering for the future:
- Retrieval Augmented Generation (RAG): Integrate a RAG system that allows users to ask natural language questions about the scraped data, making it even more powerful for insights.
I believe in the power of open source! If you'd like to contribute to this project, feel free to fork the repository, make your changes, and submit a pull request. I'm always open to new ideas and improvements.
This project is licensed under the GNU General Public License Version 3.