const Thang = {
  code: ["Typescript", "Solidity", "PHP"],
  askMeAbout: ["web-dev", "blockchain", "ops", "translating"],
  technologies: {
    backEnd: {
      js: ["nodeJs", "nestJs"],
      php: ["laravel"]
    },
    frontEnd: {
      js: ["reactJs", "jQuery"],
      css: ["bootstrap"]
    },
    smartContract: "Solidity",
    libs: ["OpenZeppelin", "Hardhat", "redux toolkit", "coreUI"],
    databases: ["mongoDb", "mySql"],
    devOps: ["docker", "swarm"]
  },
  architecture: ["Single page applications", "Event driven", "Upgradeable smart contracts"],
  currentFocus: "AWS",
  interests: ["tech", "language", "cryptocurrency"]
};
nimble-scraper's Issues
[Feature] Missing test
Issue
While some tests were added (e.g. for the Scraper), the core business logic of the domains in services is not fully tested: the file upload, keyword parsing, firing the event, etc.
The goal is not always to reach 100% test coverage (though it's a nice and motivating goal for a team), and I also value the approach of focusing on critical tests first, but in this case the test coverage is not sufficient. It is unsatisfying precisely because these parts are the hard ones to test, and covering them would have demonstrated your unit-testing skills better.
May I ask the reason: did you not have enough time, or did you run into difficulties when writing the tests?
Expected
The core business logic is fully tested.
[Chore] Improve source code organization
Issue
While the application code is organized into several services, some of them violate the single-responsibility principle. For example:
nimble-scraper/backend/src/services/file.service.ts
Lines 41 to 65 in f0673eb
This function's name (`saveBatch`) is misleading, as it does many things:
- initialize a new Batch Entity
- save the batch
- save the keywords in the batch
- emit the event
- query for the updated batch
Wouldn't it be better to separate initializing the Batch entity (and persisting it to the DB) and fetching the updated Batch into their own functions?
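The separation suggested above could be sketched roughly like this. The entity shapes, store, and function names here are simplified stand-ins (not the actual code in file.service.ts); the point is only that each responsibility becomes its own independently testable step:

```typescript
// Simplified stand-ins for the real Batch/Keyword entities and repository.
interface Keyword { text: string }
interface Batch { id: number; keywords: Keyword[] }

class BatchStore {
  private batches = new Map<number, Batch>();
  private nextId = 1;

  // Responsibility 1: create and persist the Batch entity only.
  createBatch(keywords: string[]): Batch {
    const batch: Batch = {
      id: this.nextId++,
      keywords: keywords.map((text) => ({ text })),
    };
    this.batches.set(batch.id, batch);
    return batch;
  }

  // Responsibility 2: fetch the (possibly updated) batch only.
  getBatch(id: number): Batch | undefined {
    return this.batches.get(id);
  }
}

// The orchestration (including the event emission) becomes a thin wrapper
// over the two focused steps, so each can be unit-tested in isolation.
function saveBatch(
  store: BatchStore,
  keywords: string[],
  emit: (batch: Batch) => void,
): Batch {
  const batch = store.createBatch(keywords);
  emit(batch);
  return store.getBatch(batch.id)!;
}
```

With this shape, a test can exercise `createBatch` and `getBatch` without wiring up the event system at all.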
Similarly, in scraper.utils.ts, putting the pickLeastUsedProxy() function and the SinglePageManipulator class in the same file isn't appropriate:
nimble-scraper/backend/src/utils/scraper.utils.ts
Lines 21 to 40 in f0673eb
The SinglePageManipulator class itself isn't a utility; it has a few functions to extract the search results from a page.
Could you share some insight into why you decided to organize the function and the class in the same file?
[My mistake] My apologies, I opened this issue in the wrong repo since the names are similar.
[Question] How was the application deployed?
Issue
Since the application is deployed at http://skrape.site but there is no CI/CD workflow in the codebase, I'm unsure how you deployed the application and where it is being hosted now (DigitalOcean, Heroku, AWS, ...).
[Question] Why not use a background job for the scraping process?
Issue
Delegating the scraping of search results to another service, so it runs asynchronously, is a good technical decision and provides a good user experience:
nimble-scraper/backend/src/services/file.service.ts
Lines 51 to 56 in f0673eb
nimble-scraper/backend/src/services/scraper.service.ts
Lines 28 to 29 in f0673eb
But I would like to know why you decided to handle it with an event (1):
- fire an event
- use an event listener (ScraperService) to listen for the event and process the scraping
instead of using a background job (2):
- enqueue the keywords as a job
- have a worker pick up the job and process it
The 1st solution has some limitations:
- Since it relies on in-process events, if the current process crashes, all pending events are gone. While you have a cron job (CronService) to handle this, it feels like a workaround.
- sleep had to be used to delay the processing of keywords -> this is a code smell and should be avoided. With solution (2), the job execution itself can be delayed, achieving the same purpose in a cleaner way.
- It is hard to improve performance, whereas with solution (2) several workers can process multiple jobs at a time.
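Solution (2) could be sketched with a minimal in-memory queue. In production this would be a durable queue (e.g. BullMQ backed by Redis, which NestJS integrates with) so jobs survive a process crash; the names here (JobQueue, enqueue, drain) are illustrative only, not from the repo:

```typescript
// A job is plain data, not a callback, so a durable backend could persist it.
type Job = { keyword: string; delayMs: number };

class JobQueue {
  private jobs: Job[] = [];

  // Enqueue replaces firing an event: the keyword survives as queued data,
  // and the delay is a job property rather than a sleep() in the service.
  enqueue(keyword: string, delayMs = 0): void {
    this.jobs.push({ keyword, delayMs });
  }

  // A worker drains the queue one job at a time; with a shared durable
  // queue, several workers could run in parallel to scale throughput.
  async drain(process: (keyword: string) => Promise<void>): Promise<number> {
    let done = 0;
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      if (job.delayMs > 0) {
        await new Promise((resolve) => setTimeout(resolve, job.delayMs));
      }
      await process(job.keyword);
      done++;
    }
    return done;
  }
}
```

The crash-safety and multi-worker benefits come from swapping this in-memory array for a persistent store, without changing the enqueue/worker shape.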
[Feature] Improve the efficiency of the scraping process
Issue
The scraping service has a creative way of using proxies to overcome Google's mass-search detection. However, it uses Puppeteer, which requires Chromium running in headless mode:
nimble-scraper/backend/src/services/scraper.service.ts
Lines 29 to 34 in f0673eb
This requires more resources to run, as pointed out in the Readme:
Currently a 2-CPU 4GB Ubuntu server with 22 proxies can handle up to 7 concurrent uploads before showing signs of scraping failures (Captcha-ed, Timeout, etc).
I'm curious why you don't use an HTTP library (e.g. axios) to send the search requests and parse the results (e.g. with a library like cheerio) instead? It would be far more efficient.
Also, as mentioned in #35, instead of using sleep to slow the code down to evade Google's detection, there should be a better way, e.g. using the proxies and rotating the user agent in each request.
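The rotation idea could look roughly like this. Each request cycles through a pool of proxies and user-agent strings so consecutive searches don't present the same fingerprint; the actual request would then be a plain axios.get() with these options, and the HTML parsed with cheerio. The pool contents and names (RotatingPool, searchRequestOptions) are illustrative assumptions, not code from the repo:

```typescript
// A small pool of user-agent strings to rotate through (illustrative values).
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "Mozilla/5.0 (X11; Linux x86_64)",
];

// Round-robin over any pool of items (proxies, user agents, ...).
class RotatingPool<T> {
  private index = 0;
  constructor(private readonly items: T[]) {}

  next(): T {
    const item = this.items[this.index];
    this.index = (this.index + 1) % this.items.length;
    return item;
  }
}

// Build the per-request options that would be passed to the HTTP client:
// a fresh User-Agent header and a fresh proxy on every call.
function searchRequestOptions(
  agents: RotatingPool<string>,
  proxies: RotatingPool<{ host: string; port: number }>,
) {
  return {
    headers: { "User-Agent": agents.next() },
    proxy: proxies.next(),
  };
}
```

Because the rotation is deterministic, no two consecutive requests reuse the same user-agent/proxy pair (as long as the pools have more than one entry), which removes the need for sleep-based pacing as the primary evasion mechanism.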
Expected
The scraping process is handled in a more efficient way.