Accompanying code for my talk at XtremeJS 2022.
Slides for this talk are also available here. A video of me running the demo is available here, as well as a brief discussion of the tests here.
Designing a scraper is a daunting task, yet a useful one. Often, scraping is the most viable way of programmatically retrieving data when no public API is accessible. In this talk, we'll design a loosely coordinated system of microservices that scrape information from Glassdoor. You'll learn how to start from a high-level state machine design to then dive deep into the implementation with Playwright, where we'll discuss its capabilities and gotchas. Finally, we'll see a demo of the open-source implementation, written using TypeScript and Kafka.js.
This project is a `pnpm` monorepo written in TypeScript and tested on Node.js 18.
The packages are the following:

- `@jkomyno/glassdoor-scraper-fsm`: A state machine library for scraping users' data on Glassdoor, written with `playwright` and `xstate`.
- `@jkomyno/glassdoor-scraper`: The main application. It runs a long-running consume-and-produce loop in which, for each incoming message, it scrapes the data of the user identified by the message, and produces a new message with the scraped data upon completion. It uses `kafkajs`.
- `@jkomyno/common-entities`: Library that contains common `zod` validation entities and types.
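To give a rough idea of how the main application's consume-and-produce loop fits together, here is a minimal `kafkajs` sketch. The broker address, the output topic name, and the `scrapeUser` helper are assumptions for illustration, not the actual implementation:

```ts
import { Kafka } from 'kafkajs';

// Hypothetical stand-in for the Playwright-driven scraping logic
// exposed by @jkomyno/glassdoor-scraper-fsm.
declare function scrapeUser(input: unknown): Promise<unknown>;

const kafka = new Kafka({
  clientId: 'glassdoor-scraper',
  brokers: ['localhost:9092'], // assumption: the broker started via docker-compose
});

const consumer = kafka.consumer({ groupId: '@jkomyno/glassdoor-scraper' });
const producer = kafka.producer();

async function main() {
  await Promise.all([consumer.connect(), producer.connect()]);

  // kafkajs v2 subscribe API
  await consumer.subscribe({ topics: ['input-glassdoor'] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      // Scrape the data of the user identified by the incoming message...
      const scraped = await scrapeUser(JSON.parse(message.value!.toString()));

      // ...and produce a new message with the scraped data upon completion.
      await producer.send({
        topic: 'output-glassdoor', // hypothetical output topic name
        messages: [{ key: message.key, value: JSON.stringify(scraped) }],
      });
    },
  });
}

main().catch(console.error);
```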
- Install dependencies

  We recommend using `pnpm` for this project (although you may use `npm` as well):

  ```sh
  pnpm i
  ```
- Build the project

  Run:

  ```sh
  pnpm build
  ```
- Run tests

  You can run unit tests with:

  ```sh
  pnpm test:unit
  ```

  or integration tests with:

  ```sh
  pnpm test:integration
  ```

  If you want to run both, you can just use:

  ```sh
  pnpm test
  ```
- Set up Kafka

  - Install Docker, if you don't have it already.
  - Run:

    ```sh
    docker-compose -f docker/docker-compose.yml up
    ```

    This step will start a Kafka and a Zookeeper instance, as well as a Kafka UI tool. (Depending on your internet connection, this step might take a while.)
  - Open http://localhost:8080 to access the Kafka UI tool. You can log in with the default credentials (user: `[email protected]`; password: `admin`).
- Set up environment variables
Too busy to run the demo yourself? You can watch a video of it here.
- Start the Glassdoor scraper

  - Run:

    ```sh
    pnpm --filter glassdoor-scraper start
    ```

  - Wait for the scraper to connect to Kafka. You should see a message similar to this in your terminal:

    ```
    [02:10:01.330] INFO (57014): Starting server @jkomyno/glassdoor-scraper
    [02:10:01.330] INFO (57014): subscribing to input topic...
        inputTopic: "input-glassdoor"
    {"level":"INFO","timestamp":"2022-12-12T01:10:25.022Z","logger":"kafkajs","message":"[ConsumerGroup] Consumer has joined the group","groupId":"@jkomyno/glassdoor-scraper","memberId":"@jkomyno/glassdoor-scraper-6a4dbc47-5e9b-48bc-b0a8-c2901a1d32cc","leaderId":"@jkomyno/glassdoor-scraper-6a4dbc47-5e9b-48bc-b0a8-c2901a1d32cc","isLeader":true,"memberAssignment":{"input-glassdoor":[0]},"groupProtocol":"RoundRobinAssigner","duration":23687}
    ```
- Open the Kafka UI

  - Navigate to http://localhost:8080
  - Log in with the default credentials (user: `[email protected]`; password: `admin`).
- Create a new Kafka topic (`input-glassdoor`)

  - Navigate to the console at http://localhost:8080/console
  - Click on the `Create Topic` button in the top-right corner
  - Create a new topic named `input-glassdoor`
- Produce a new Kafka message (in `input-glassdoor`)

  - Navigate to the console at http://localhost:8080/console
  - Click on the `input-glassdoor` topic
  - Click on the `Produce` tab in the top bar
  - In the `Key` field, type `test`
  - In the `Value` field, type the authentication credentials for your Glassdoor account, e.g.:

    ```json
    {
      "auth": {
        "email": "[email protected]",
        "password": "[email protected]"
      }
    }
    ```

  - Click on the `Produce` button
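If you're curious how a message like this could be validated before the scraping starts, here is a minimal `zod` sketch. The schema below is an assumed shape; the actual schema lives in `@jkomyno/common-entities` and may be defined differently:

```ts
import { z } from 'zod';

// Assumed shape of the message value produced above.
const authMessageSchema = z.object({
  auth: z.object({
    email: z.string().email(),
    password: z.string().min(1),
  }),
});

type AuthMessage = z.infer<typeof authMessageSchema>;

// Example: validate the raw Kafka message value before scraping.
function parseAuthMessage(rawValue: string): AuthMessage {
  return authMessageSchema.parse(JSON.parse(rawValue));
}
```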
- Extract the headless browser setup with `Playwright` out of the `xstate` machine configuration

  - This will allow us to reuse the same browser instance across multiple pages (see the first sketch after this list).
  - What would happen to the browser session when you submit scraping jobs with different authentication credentials?
  - How would that impact the resource disposal strategy of `@jkomyno/glassdoor-scraper`?
- Implement a "retry" strategy for the whole scraping pipeline in `@jkomyno/glassdoor-scraper`

  - You can use a standard bounded exponential backoff (see the second sketch after this list).
  - What impact does it have on the overall throughput of the system?
  - Does it make sense to retry the whole pipeline, or is retrying just a subset of the FSM transitions preferable?
- Implement a "retry" strategy for the `authenticated.scrape-resumes.store-resumes` state only in `@jkomyno/glassdoor-scraper-fsm`

  - You can again use a bounded exponential backoff (see the third sketch after this list).
  - What impact does it have on the overall throughput of the system?
  - How would you reconsider your implementation if you had to implement a "retry" strategy for every possible state in the pipeline?
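For the first challenge, here is a minimal sketch of what the extracted browser setup could look like (all names below are hypothetical): a single shared `Browser`, with one isolated `BrowserContext` per job so that sessions authenticated with different credentials never share cookies.

```ts
import { chromium, type Browser, type BrowserContext } from 'playwright';

let browser: Browser | undefined;

// Lazily launch a single headless browser, shared across scraping jobs.
async function getBrowser(): Promise<Browser> {
  browser ??= await chromium.launch({ headless: true });
  return browser;
}

// Each scraping job gets its own context (cookies, storage), so sessions
// authenticated with different credentials stay isolated.
async function newSession(): Promise<BrowserContext> {
  const sharedBrowser = await getBrowser();
  return sharedBrowser.newContext();
}

// Disposing the shared browser also closes every context created from it.
async function dispose(): Promise<void> {
  await browser?.close();
  browser = undefined;
}
```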
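For the second challenge, a standard bounded exponential backoff could be expressed as a generic wrapper like the one below; the function name and the default bounds are assumptions:

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `fn` with a bounded exponential backoff: 1s, 2s, 4s, ..., capped at 30s.
async function withRetry<T>(
  fn: () => Promise<T>,
  { maxRetries = 5, baseDelayMs = 1_000, maxDelayMs = 30_000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      await sleep(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));
    }
  }
}
```

You would then wrap the handling of each incoming message, e.g. `withRetry(() => scrapeUser(input))`, which retries the whole pipeline rather than a single FSM transition.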
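For the third challenge, one way to express a per-state retry in `xstate` v4 is a backoff sub-state that counts attempts in the machine context. The state and service names below are illustrative, not the ones used in `@jkomyno/glassdoor-scraper-fsm`:

```ts
import { createMachine, assign } from 'xstate';

interface Context {
  retries: number;
}

const storeResumesMachine = createMachine<Context>(
  {
    id: 'store-resumes-with-retry',
    initial: 'storing',
    context: { retries: 0 },
    states: {
      storing: {
        invoke: {
          src: 'storeResumes', // illustrative service performing the actual work
          onDone: 'succeeded',
          onError: [
            { target: 'backingOff', cond: (ctx) => ctx.retries < 5 },
            { target: 'failed' },
          ],
        },
      },
      backingOff: {
        entry: assign({ retries: (ctx) => ctx.retries + 1 }),
        // Delayed transition back to `storing` once the backoff delay elapses.
        after: { BACKOFF: 'storing' },
      },
      succeeded: { type: 'final' },
      failed: { type: 'final' },
    },
  },
  {
    delays: {
      // Bounded exponential backoff: 1s, 2s, 4s, ..., capped at 30s.
      BACKOFF: (ctx) => Math.min(1_000 * 2 ** (ctx.retries - 1), 30_000),
    },
  },
);
```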
Hi, I'm Alberto Schiabel.
Give a ⭐️ if this project helped or inspired you!
Built with ❤️ by Alberto Schiabel.
This project is MIT licensed.