
nimble-scraper's Introduction

Hello there 👋

const Thang = {
    code: ["Typescript", "Solidity", "PHP"],
    askMeAbout: [ "web-dev","blockchain", "ops", "translating"],
    technologies: {
        backEnd: {
            js: ["nodeJs", "nestJs"],
            php: ["laravel"]
        },
        frontEnd: {
            js: ["reactJs", "jQuery"],
            css: ["bootstrap"],
        },
        smartContract: "Solidity",
        libs: ["OpenZeppelin", "Hardhat", "redux toolkit", "coreUI"],
        databases: ["mongoDb", "mySql"],
        devOps: ["docker", "swarm"]
    },
    architecture: ["Single page applications", "Event driven", "Upgradeable smart contracts"],
    currentFocus: "AWS",
    interests: ["tech", "language", "cryptocurrency"]
};

nimble-scraper's People

Contributors

21jake


nimble-scraper's Issues

[Feature] Missing tests

Issue

While some tests were added (e.g. for the Scraper), the core business logic of the services is not fully tested: the file upload, keyword parsing, firing the event, etc.
The goal is not always to reach 100% test coverage (though it is a nice, motivating target for a team), and I also agree with focusing on critical tests first, but in this case the coverage is not sufficient. It is unsatisfying precisely because these parts are the hard-to-test ones, and they would have been a good opportunity to demonstrate your unit-testing skills.

May I know the reason: did you not have enough time, or did you run into difficulties writing the tests?

Expected

The core business logic is fully tested.
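
For reference, a minimal sketch of how the upload flow could be unit tested with Jest and NestJS's testing utilities. The import paths, provider tokens, and mock shapes below are assumptions about the project layout, not code taken from the repository:

// file.service.spec.ts — a minimal sketch, assuming FileService is injected with the Batch
// repository and an EventEmitter2; adjust the imports and providers to the real constructor.
import { Test } from '@nestjs/testing';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { getRepositoryToken } from '@nestjs/typeorm';
import { FileService } from './file.service';
import { Batch } from './batch.entity';
import { EmittedEvent } from './constants';

describe('FileService.saveBatch', () => {
  const batchRepository = { save: jest.fn(), createQueryBuilder: jest.fn() };
  const eventEmitter = { emit: jest.fn() };
  let service: FileService;

  beforeEach(async () => {
    jest.clearAllMocks();
    const moduleRef = await Test.createTestingModule({
      providers: [
        FileService,
        { provide: getRepositoryToken(Batch), useValue: batchRepository },
        { provide: EventEmitter2, useValue: eventEmitter },
      ],
    }).compile();
    service = moduleRef.get(FileService);
  });

  it('saves the batch, saves keywords and emits NEW_BATCH', async () => {
    const savedBatch = { id: 1 };
    batchRepository.save.mockResolvedValue(savedBatch);
    // Keyword parsing and the keyword-count query are stubbed; only the orchestration is under test.
    jest.spyOn(service as any, 'saveKeywords').mockResolvedValue(undefined);
    jest.spyOn(service as any, 'addKeywordCountQb').mockImplementation((qb: any) => qb);
    batchRepository.createQueryBuilder.mockReturnValue({
      where: jest.fn().mockReturnThis(),
      getOne: jest.fn().mockResolvedValue(savedBatch),
    });

    await service.saveBatch({ originalname: 'kw.csv', filename: 'abc.csv' } as any, {} as any);

    expect(eventEmitter.emit).toHaveBeenCalledWith(EmittedEvent.NEW_BATCH, savedBatch);
  });
});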

[Chore] Improve source code organization

Issue

While the application code is organized into several services, some of them violate the single responsibility principle. For example:

public async saveBatch(file: Express.Multer.File, user: User) {
  if (this.concurrentUploadCount >= appEnv.MAX_CONCURRENT_UPLOAD) {
    throw new ServiceUnavailableException(ErrorResponses.MAX_CONCURRENT_UPLOAD);
  }
  const batch = new Batch();
  batch.uploader = user;
  batch.originalName = file.originalname;
  batch.fileName = file.filename;
  const newBatch = await this.batchRepository.save(batch);
  await this.saveKeywords(newBatch);
  this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
  const entityName = 'batch';
  let query = this.batchRepository
    .createQueryBuilder(entityName)
    .where(`${entityName}.id = :id`, { id: newBatch.id });
  query = this.addKeywordCountQb(query, entityName);
  const newBatchwithKeywordCount = await query.getOne();
  return newBatchwithKeywordCount;
}

This function's name (`saveBatch`) is misleading, as it does many things:

  • initialize a new Batch Entity
  • save the batch
  • save the keywords in the batch
  • emit the event
  • query for the updated batch

Wouldn't it be better to separate initializing the Batch entity (and persisting it to the DB) and fetching the updated Batch into their own functions?
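
Purely as an illustration (the helper method names below are hypothetical, not code from the repository), the responsibilities could be split along those lines:

// Hypothetical split of saveBatch; names and signatures are assumptions.
public async saveBatch(file: Express.Multer.File, user: User) {
  this.assertUploadCapacity();
  const newBatch = await this.createBatch(file, user);
  await this.saveKeywords(newBatch);
  this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
  return this.findBatchWithKeywordCount(newBatch.id);
}

private assertUploadCapacity() {
  if (this.concurrentUploadCount >= appEnv.MAX_CONCURRENT_UPLOAD) {
    throw new ServiceUnavailableException(ErrorResponses.MAX_CONCURRENT_UPLOAD);
  }
}

private createBatch(file: Express.Multer.File, user: User) {
  const batch = new Batch();
  batch.uploader = user;
  batch.originalName = file.originalname;
  batch.fileName = file.filename;
  return this.batchRepository.save(batch);
}

private findBatchWithKeywordCount(id: number) {
  const entityName = 'batch';
  const query = this.addKeywordCountQb(
    this.batchRepository
      .createQueryBuilder(entityName)
      .where(`${entityName}.id = :id`, { id }),
    entityName,
  );
  return query.getOne();
}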


Or, in scraper.utils.ts, putting the pickLeastUsedProxy() function and the SinglePageManipulator class in the same file doesn't seem appropriate:

export const pickLeastUsedProxy = async () => {
  const proxies: IProxy[] = JSON.parse(readFileSync(appEnv.PROXY_FILE_PATH, { encoding: 'utf-8' }));
  const leastUsed = first(proxies);
  const sortedProxies = proxies
    .map((proxy) => {
      if (proxy === leastUsed) {
        proxy.count += 1;
      }
      return proxy;
    })
    .sort((a, b) => a.count - b.count);
  await writeFileSync(appEnv.PROXY_FILE_PATH, JSON.stringify(sortedProxies, null, 2));
  return leastUsed;
};

export class SinglePageManipulator {
  page: Page;

The SinglePageManipulator class itself isn't a utility; it has a few functions that extract the search results from a page.

Could you share some insight into why you decided to organize the function and the class in the same file?
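
One possible layout, purely illustrative (the file names are assumptions), would keep each concern in its own module:

src/scraper/
  proxy.utils.ts              // pickLeastUsedProxy() and other stateless proxy helpers
  single-page-manipulator.ts  // the SinglePageManipulator class and its page-extraction methods
  scraper.service.ts          // orchestration that consumes both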

[Question] Why not use a background job for the scraping process?

Issue

Delegating the scraping of the search results to another service so it runs asynchronously is a good technical decision and provides a good user experience:

const newBatch = await this.batchRepository.save(batch);
await this.saveKeywords(newBatch);
this.eventEmitter.emit(EmittedEvent.NEW_BATCH, newBatch);
const entityName = 'batch';

@OnEvent(EmittedEvent.NEW_BATCH)
async scrape(payload: Batch) {

But I would like to know why you decided to use an event to handle it (1):

  • fire an event
  • use an event listener (ScraperService) to listen to the event and process the scraping part

Instead of using a background job (2) by:

  • enqueue the keywords as a job
  • a worker to pick up the job to process

Because the first solution has some limitations:

  • Since it relies on in-process events, if the current process crashes, all pending events are gone. While you have a cron job (CronService) to handle that, it seems like a workaround.
  • sleep had to be used to delay the processing of keywords, which is a code smell and should be avoided. With solution (2), job execution can be delayed natively, achieving the same purpose in a cleaner way.
  • It is hard to scale the performance, whereas with solution (2) we can have several workers processing multiple jobs at a time.
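
For comparison, a minimal sketch of solution (2) using @nestjs/bull (the queue name, job name, and payload shape are assumptions; BullMQ or any other queue library would work similarly):

// file.service.ts — enqueue a job instead of emitting an in-process event
import { InjectQueue } from '@nestjs/bull';
import { Queue } from 'bull';

export class FileService {
  constructor(@InjectQueue('scrape') private readonly scrapeQueue: Queue) {}

  async dispatchBatch(batchId: number) {
    // The job is persisted in Redis, so it survives a process crash,
    // and a delay/backoff replaces the explicit sleep calls.
    await this.scrapeQueue.add('scrape-batch', { batchId }, { attempts: 3, backoff: 5000 });
  }
}

// scraper.processor.ts — a worker picks the job up; several workers can run in parallel
import { Process, Processor } from '@nestjs/bull';
import { Job } from 'bull';

@Processor('scrape')
export class ScraperProcessor {
  @Process('scrape-batch')
  async handle(job: Job<{ batchId: number }>) {
    // scrape the keywords of job.data.batchId here
  }
}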

[Feature] Improve the efficiency of the scraping process

Issue

The Scraping Service has a creative way of using proxies to overcome Google's mass-searching detection. However, it uses Puppeteer, which requires Chromium running in headless mode:

async scrape(payload: Batch) {
  this.fileService.concurrentUploadCount++;
  const args = appEnv.IS_PROD ? ['--no-sandbox', '--disable-setuid-sandbox'] : undefined;
  const browser = await puppeteer.launch({ args });

It requires more resources to work, as pointed out in the README:

Currently a 2-CPU 4GB Ubuntu server with 22 proxies can handle up to 7 concurrent uploads before showing signs of scraping failures (Captcha-ed, Timeout, etc).

I'm curious why you don't use an HTTP library (e.g. axios) to send the search requests and parse the results (e.g. with a library like cheerio) instead? It would be far more efficient.

Also, as mentioned in #35, instead of using sleep as a way to get around Google's detection, there should be a better approach, e.g. using the proxies and rotating the User-Agent in each request.
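
A minimal sketch of that approach, assuming a hypothetical searchKeyword() helper; the selectors, User-Agent list, and proxy shape are illustrative, not the project's actual code:

import axios from 'axios';
import { load } from 'cheerio';

// Hypothetical User-Agent pool; in practice this could be rotated alongside pickLeastUsedProxy().
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];
const randomUserAgent = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

export const searchKeyword = async (keyword: string, proxy: { host: string; port: number }) => {
  const { data: html } = await axios.get('https://www.google.com/search', {
    params: { q: keyword },
    headers: { 'User-Agent': randomUserAgent() },
    proxy, // axios accepts { host, port } (and optional auth) here
    timeout: 10_000,
  });
  const $ = load(html);
  // The result markup changes often; these selectors are only illustrative.
  const totalLinks = $('a').length;
  const adCount = $('[data-text-ad]').length;
  return { keyword, totalLinks, adCount };
};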

Expected

The scraping process is handled in a more efficient way.
