
nimble-scraper's Introduction

Hello there 👋

const Thang = {
    code: ["Typescript", "Solidity", "PHP"],
    askMeAbout: [ "web-dev","blockchain", "ops", "translating"],
    technologies: {
        backEnd: {
            js: ["nodeJs", "nestJs"],
            php: ["laravel"]
        },
        frontEnd: {
            js: ["reactJs", "jQuery"],
            css: ["bootstrap"],
        },
        smartContract: "Solidity",
        libs: ["OpenZeppelin", "Hardhat", "redux toolkit", "coreUI"],
        databases: ["mongoDb", "mySql"],
        devOps: ["docker", "swarm"]
    },
    architecture: ["Single page applications", "Event driven", "Upgradeable smart contracts"],
    currentFocus: "AWS",
    interests: ["tech", "language", "cryptocurrency"]
};

nimble-scraper's People

Contributors

21jake


nimble-scraper's Issues

[Feature] Missing tests

Issue

While some tests were added (e.g. for the Scraper), the core business logic of the services is not fully tested: the file upload, keyword parsing, firing the event, etc.
The goal is not always to reach 100% test coverage (though it is a nice, motivating target for a team), and I also agree with focusing on critical tests first, but in this case the coverage is not sufficient. It is unsatisfying precisely because these parts are the hard-to-test ones, and they would have been a good opportunity to demonstrate your unit-testing skills.

May I know the reason: did you not have enough time, or did you run into difficulties writing the tests?

Expected

The core business logic is fully tested.
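
For reference, a minimal sketch of how the upload flow could be unit tested with Jest and NestJS's testing utilities. The import paths, provider tokens, and mock shapes below are assumptions about the project layout, not code taken from the repository:

// file.service.spec.ts — a minimal sketch, assuming FileService is injected with the Batch
// repository and an EventEmitter2; adjust the imports and providers to the real constructor.
import { Test } from '@nestjs/testing';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { getRepositoryToken } from '@nestjs/typeorm';
import { FileService } from './file.service';
import { Batch } from './batch.entity';
import { EmittedEvent } from './constants';

describe('FileService.saveBatch', () => {
  const batchRepository = { save: jest.fn(), createQueryBuilder: jest.fn() };
  const eventEmitter = { emit: jest.fn() };
  let service: FileService;

  beforeEach(async () => {
    jest.clearAllMocks();
    const moduleRef = await Test.createTestingModule({
      providers: [
        FileService,
        { provide: getRepositoryToken(Batch), useValue: batchRepository },
        { provide: EventEmitter2, useValue: eventEmitter },
      ],
    }).compile();
    service = moduleRef.get(FileService);
  });

  it('saves the batch, saves keywords and emits NEW_BATCH', async () => {
    const savedBatch = { id: 1 };
    batchRepository.save.mockResolvedValue(savedBatch);
    // Keyword parsing and the keyword-count query are stubbed; only the orchestration is under test.
    jest.spyOn(service as any, 'saveKeywords').mockResolvedValue(undefined);
    jest.spyOn(service as any, 'addKeywordCountQb').mockImplementation((qb: any) => qb);
    batchRepository.createQueryBuilder.mockReturnValue({
      where: jest.fn().mockReturnThis(),
      getOne: jest.fn().mockResolvedValue(savedBatch),
    });

    await service.saveBatch({ originalname: 'kw.csv', filename: 'abc.csv' } as any, {} as any);

    expect(eventEmitter.emit).toHaveBeenCalledWith(EmittedEvent.NEW_BATCH, savedBatch);
  });
});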

[Chore] Improve source code organization

Issue

While the application code is organized into several services, some of them violate the single responsibility principle. For example:

public async saveBatch(file: Express.Multer.File, user: User) {
  if (this.concurrentUploadCount >= appEnv.MAX_CONCURRENT_UPLOAD) {
    throw new ServiceUnavailableException(ErrorResponses.MAX_CONCURRENT_UPLOAD);
  }
  const batch = new Batch();
  batch.uploader = user;
  batch.originalName = file.originalname;
  batch.fileName = file.filename;
  const newBatch = await this.batchRepository.save(batch);
  await this.saveKeywords(newBatch);
  this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
  const entityName = 'batch';
  let query = this.batchRepository
    .createQueryBuilder(entityName)
    .where(`${entityName}.id = :id`, { id: newBatch.id });
  query = this.addKeywordCountQb(query, entityName);
  const newBatchwithKeywordCount = await query.getOne();
  return newBatchwithKeywordCount;
}

This function's name (`saveBatch`) is misleading, as it does many things:

  • initialize a new Batch Entity
  • save the batch
  • save the keywords in the batch
  • emit the event
  • query for the updated batch

Wouldn't it be better to separate initializing the Batch entity (and persisting it to the DB) and fetching the updated Batch into their own functions?
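
Purely as an illustration (the helper method names below are hypothetical, not code from the repository), the responsibilities could be split along those lines:

// Hypothetical split of saveBatch; names and signatures are assumptions.
public async saveBatch(file: Express.Multer.File, user: User) {
  this.assertUploadCapacity();
  const newBatch = await this.createBatch(file, user);
  await this.saveKeywords(newBatch);
  this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
  return this.findBatchWithKeywordCount(newBatch.id);
}

private assertUploadCapacity() {
  if (this.concurrentUploadCount >= appEnv.MAX_CONCURRENT_UPLOAD) {
    throw new ServiceUnavailableException(ErrorResponses.MAX_CONCURRENT_UPLOAD);
  }
}

private createBatch(file: Express.Multer.File, user: User) {
  const batch = new Batch();
  batch.uploader = user;
  batch.originalName = file.originalname;
  batch.fileName = file.filename;
  return this.batchRepository.save(batch);
}

private findBatchWithKeywordCount(id: number) {
  const entityName = 'batch';
  const query = this.addKeywordCountQb(
    this.batchRepository
      .createQueryBuilder(entityName)
      .where(`${entityName}.id = :id`, { id }),
    entityName,
  );
  return query.getOne();
}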


Or, in scraper.utils.ts, putting the pickLeastUsedProxy() function and the SinglePageManipulator class in the same file doesn't seem appropriate:

export const pickLeastUsedProxy = async () => {
  const proxies: IProxy[] = JSON.parse(readFileSync(appEnv.PROXY_FILE_PATH, { encoding: 'utf-8' }));
  const leastUsed = first(proxies);
  const sortedProxies = proxies
    .map((proxy) => {
      if (proxy === leastUsed) {
        proxy.count += 1;
      }
      return proxy;
    })
    .sort((a, b) => a.count - b.count);
  await writeFileSync(appEnv.PROXY_FILE_PATH, JSON.stringify(sortedProxies, null, 2));
  return leastUsed;
};

export class SinglePageManipulator {
  page: Page;

The SinglePageManipulator class itself isn't a utility; it has a few functions that extract the search results from a page.

Could you share some insight into why you decided to organize the function and the class in the same file?
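
One possible layout, purely illustrative (the file names are assumptions), would keep each concern in its own module:

src/scraper/
  proxy.utils.ts              // pickLeastUsedProxy() and other stateless proxy helpers
  single-page-manipulator.ts  // the SinglePageManipulator class and its page-extraction methods
  scraper.service.ts          // orchestration that consumes both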

[Question] Why not use a background job for the scraping process?

Issue

Delegating the scraping of the search results to another service so it runs asynchronously is a good technical decision and provides a good user experience:

const newBatch = await this.batchRepository.save(batch);
await this.saveKeywords(newBatch);
this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
const entityName = 'batch';

@OnEvent(EmittedEvent.NEW_BATCH)
async scrape(payload: Batch) {

But I would like to know why you decided to use an event to handle it (1):

  • fire an event
  • use an event listener (ScraperService) to listen to the event and process the scraping part

Instead of using a background job (2) by:

  • enqueue the keywords as a job
  • a worker to pick up the job to process

Because the first solution has some limitations:

  • Since it relies on in-process events, if the current process crashes, all pending events are gone. While you have a cron job (CronService) to handle that, it seems like a workaround.
  • sleep had to be used to delay the processing of keywords, which is a code smell and should be avoided. With solution (2), job execution can be delayed natively, achieving the same purpose in a cleaner way.
  • It is hard to scale the performance, whereas with solution (2) we can have several workers processing multiple jobs at a time.
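
For comparison, a minimal sketch of solution (2) using @nestjs/bull (the queue name, job name, and payload shape are assumptions; BullMQ or any other queue library would work similarly):

// file.service.ts — enqueue a job instead of emitting an in-process event
import { InjectQueue } from '@nestjs/bull';
import { Queue } from 'bull';

export class FileService {
  constructor(@InjectQueue('scrape') private readonly scrapeQueue: Queue) {}

  async dispatchBatch(batchId: number) {
    // The job is persisted in Redis, so it survives a process crash,
    // and a delay/backoff replaces the explicit sleep calls.
    await this.scrapeQueue.add('scrape-batch', { batchId }, { attempts: 3, backoff: 5000 });
  }
}

// scraper.processor.ts — a worker picks the job up; several workers can run in parallel
import { Process, Processor } from '@nestjs/bull';
import { Job } from 'bull';

@Processor('scrape')
export class ScraperProcessor {
  @Process('scrape-batch')
  async handle(job: Job<{ batchId: number }>) {
    // scrape the keywords of job.data.batchId here
  }
}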

[Feature] Improve the efficiency of the scraping process

Issue

The Scraping Service has a creative way of using proxies to overcome Google's mass-searching detection. However, it uses Puppeteer, which requires Chromium running in headless mode:

async scrape(payload: Batch) {
  this.fileService.concurrentUploadCount++;
  const args = appEnv.IS_PROD ? ['--no-sandbox', '--disable-setuid-sandbox'] : undefined;
  const browser = await puppeteer.launch({ args });

It requires more resources to work, as pointed out in the README:

Currently a 2-CPU 4GB Ubuntu server with 22 proxies can handle up to 7 concurrent uploads before showing signs of scraping failures (Captcha-ed, Timeout, etc).

I'm curious why you don't use an HTTP library (e.g. axios) to send the search requests and parse the results (e.g. with a library like cheerio) instead? It would be far more efficient.

Also, as mentioned in #35, instead of using sleep as a way to get around Google's detection, there should be a better approach, e.g. using the proxies and rotating the User-Agent in each request.
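
A minimal sketch of that approach, assuming a hypothetical searchKeyword() helper; the selectors, User-Agent list, and proxy shape are illustrative, not the project's actual code:

import axios from 'axios';
import { load } from 'cheerio';

// Hypothetical User-Agent pool; in practice this could be rotated alongside pickLeastUsedProxy().
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];
const randomUserAgent = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

export const searchKeyword = async (keyword: string, proxy: { host: string; port: number }) => {
  const { data: html } = await axios.get('https://www.google.com/search', {
    params: { q: keyword },
    headers: { 'User-Agent': randomUserAgent() },
    proxy, // axios accepts { host, port } (and optional auth) here
    timeout: 10_000,
  });
  const $ = load(html);
  // The result markup changes often; these selectors are only illustrative.
  const totalLinks = $('a').length;
  const adCount = $('[data-text-ad]').length;
  return { keyword, totalLinks, adCount };
};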

Expected

The scraping process is handled in a more efficient way.
