Crawler

Motivation

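Almost every project I planned required crawling some data first.
Rather than building a new crawling structure each time, I decided it would be more efficient to extract the common parts and settle on one reusable structure.
To separate the role of collecting detail URLs from the role of collecting data, I adopted a producer-consumer (Prod, Cons) design, and used a Queue and Promises to process work in parallel.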

Tech Stack

  • Node.js
  • babel
  • cheerio
  • puppeteer
  • puppeteer-extra-plugin-stealth
  • winston
  • eslint

Concept

Producer and consumer (Prod and Cons) connected by a Queue
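
The concept can be illustrated with a minimal sketch. It is not the project's actual code: `AsyncQueue`, `produce`, `consume`, `crawl`, `listUrls`, and `fetchData` are hypothetical names chosen for illustration, and the real implementation may structure its queue differently. The sketch assumes one producer that enqueues detail URLs and several consumers that run concurrently through `Promise.all`.

```javascript
// Minimal producer/consumer sketch. All names here are hypothetical.

// A tiny async queue: consumers await items; close() signals no more work.
class AsyncQueue {
  constructor() {
    this.items = [];
    this.waiters = [];
    this.closed = false;
  }

  push(item) {
    if (this.closed) throw new Error('queue is closed');
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: item, done: false });
    else this.items.push(item);
  }

  close() {
    this.closed = true;
    // Wake every consumer that is still waiting so it can exit.
    for (const waiter of this.waiters.splice(0)) {
      waiter({ value: undefined, done: true });
    }
  }

  pop() {
    if (this.items.length > 0) {
      return Promise.resolve({ value: this.items.shift(), done: false });
    }
    if (this.closed) return Promise.resolve({ value: undefined, done: true });
    return new Promise((resolve) => this.waiters.push(resolve));
  }
}

// Producer: collects detail URLs, enqueues them, then closes the queue.
async function produce(queue, listUrls) {
  const urls = await listUrls();
  for (const url of urls) queue.push(url);
  queue.close();
}

// Consumer: pulls URLs until the queue is drained and closed.
async function consume(queue, fetchData, results) {
  for (;;) {
    const { value: url, done } = await queue.pop();
    if (done) return;
    results.push(await fetchData(url));
  }
}

// Runs one producer and `workerCount` consumers concurrently.
async function crawl({ listUrls, fetchData, workerCount = 3 }) {
  const queue = new AsyncQueue();
  const results = [];
  const consumers = Array.from({ length: workerCount }, () =>
    consume(queue, fetchData, results)
  );
  await Promise.all([produce(queue, listUrls), ...consumers]);
  return results;
}
```

Because consumers finish at different times, the order of `results` is not guaranteed to match the order in which URLs were enqueued.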

Run

  • npm must be installed
npm i
npm run crawl {sourceName}
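
How the `crawl` script consumes `{sourceName}` is not shown here. As a rough sketch, an entry point could read the name from the command-line arguments and look it up in a registry of sources. The `sources` object, the `exampleShop` entry, its `startUrl`, and `resolveSource` are all assumptions made for illustration. The sketch also assumes the source name arrives as the first argument after the script path.

```javascript
// Hypothetical registry mapping a source name to its crawl settings.
const sources = {
  exampleShop: { startUrl: 'https://example.com/products' },
};

// Looks up the source named by the first CLI argument after the script path.
function resolveSource(argv, registry) {
  const sourceName = argv[2];
  if (!sourceName) {
    throw new Error('Usage: npm run crawl {sourceName}');
  }
  if (!Object.prototype.hasOwnProperty.call(registry, sourceName)) {
    const known = Object.keys(registry).join(', ');
    throw new Error(`Unknown source "${sourceName}". Known sources: ${known}`);
  }
  return { name: sourceName, ...registry[sourceName] };
}

// Example call from an entry point (not executed here):
//   const source = resolveSource(process.argv, sources);
```

Failing early with the list of known sources makes a mistyped name easier to spot than a crawl that silently does nothing.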

Flow

  1. The producer visits the main page and puts the URL of each product that has data to crawl into the Queue.

  2. The consumer takes a product URL from the Queue, crawls the desired data, and hands it to the Pusher.

  3. The Pusher sends each piece of data to the endpoint.

  4. Scheduling runs the crawl periodically.
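
The Pusher and scheduling steps above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `createPusher`, `postJson`, and `schedule` are hypothetical names, the transport is assumed to be an HTTP POST with a JSON body, and the default `send` relies on the global `fetch` available in Node.js 18 and later. The project may well use a dedicated scheduling library instead of `setInterval`.

```javascript
// Default transport (an assumption): POST the record as JSON.
async function postJson(url, body) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`POST ${url} failed with status ${res.status}`);
}

// Pusher: forwards one record to every configured endpoint.
// `send` is injectable so the transport can be swapped or stubbed.
function createPusher(endpoints, send = postJson) {
  return {
    // Resolves to the number of endpoints that failed for this record.
    async push(record) {
      const outcomes = await Promise.allSettled(
        endpoints.map((url) => send(url, record))
      );
      return outcomes.filter((o) => o.status === 'rejected').length;
    },
  };
}

// Scheduler: runs `task` every `intervalMs` and skips a tick when the
// previous run has not finished. Returns a function that stops the timer.
function schedule(task, intervalMs) {
  let running = false;
  const timer = setInterval(async () => {
    if (running) return;
    running = true;
    try {
      await task();
    } catch (err) {
      console.error('crawl run failed:', err);
    } finally {
      running = false;
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Using `Promise.allSettled` means one failing endpoint does not prevent delivery to the others, and the overlap guard keeps a slow crawl from stacking up behind itself.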

Implemented

  • Logging
  • Stealth mode in puppeteer
  • Asynchronous processing
  • Configurable endpoints
  • Scheduling

To Do

  1. Send an alert when a URL changes or when a selector used by an existing bot changes.

  2. Separate local and prod settings through config.

  3. Add a Dockerfile.
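
One possible approach to the alerting item above is to treat a record with empty required fields as a sign that a URL or selector has changed. The sketch below is only an illustration of that idea. `findMissingFields`, `checkRecord`, and the `notify` callback are hypothetical, and the choice of which fields count as required is an assumption left to each bot.

```javascript
// Returns the names of required fields that came back empty.
function findMissingFields(record, requiredFields) {
  return requiredFields.filter((field) => {
    const value = record[field];
    return value === undefined || value === null || String(value).trim() === '';
  });
}

// Calls `notify` with a message when a crawled record looks broken.
// Resolves to true when the record is complete, false otherwise.
async function checkRecord(url, record, requiredFields, notify) {
  const missing = findMissingFields(record, requiredFields);
  if (missing.length > 0) {
    await notify(
      `Possible selector or URL change at ${url}: missing ${missing.join(', ')}`
    );
    return false;
  }
  return true;
}
```

Here `notify` can be any async function, for example one that writes to the log or posts to a chat webhook.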

Contributors

ryucm
