MagicToe

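MagicToe is a hands-on Java crawler project built on the WebMagic crawler framework. It covers the full pipeline, from fetching data through persistence and visual analysis to building a simple proxy pool, and is intended as a reference tutorial and a complete working solution for programmers who are new to Java web crawling.

Repository layout

  • hupu-spider: the crawler module. It is built on WebMagic + SpringBoot + MyBatis, uses Ansj Chinese word segmentation as its NLP toolkit, applies custom extraction logic, and persists the crawled data to a MySQL database. The sample code in this repository crawls Hupu's Buxingjie (虎扑步行街) forum.
  • data-analysis: the data analysis and visualization module. It is built on Spring + SpringMVC + MyBatis, and the charts are rendered on the front end with JSP + ECharts.
  • ip-spider (optional): the proxy-site crawler module. It uses the same stack as hupu-spider and crawls free proxy addresses from proxy-listing websites into a local database, forming a simple IP pool that hupu-spider can use as its proxy source.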
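
To make the idea of a "simple IP pool" concrete, here is a minimal in-memory sketch in plain Java. It is not the code from ip-spider: the class names SimpleProxyPool and ProxyEntry, the round-robin selection, and the eviction-after-N-failures rule are all assumptions chosen for illustration, and the addresses are documentation-range placeholders. The real module stores proxies in a database rather than in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory proxy pool: round-robin selection plus eviction after repeated failures. */
final class SimpleProxyPool {

    /** One proxy endpoint together with its consecutive failure count. */
    static final class ProxyEntry {
        final String host;
        final int port;
        int failures;

        ProxyEntry(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    private final List<ProxyEntry> entries = new ArrayList<>();
    private final int maxFailures;
    private int cursor;

    SimpleProxyPool(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    synchronized void add(String host, int port) {
        entries.add(new ProxyEntry(host, port));
    }

    /** Returns the next proxy in round-robin order, or null when the pool is empty. */
    synchronized ProxyEntry next() {
        if (entries.isEmpty()) {
            return null;
        }
        cursor = cursor % entries.size(); // wrap around, also after evictions
        return entries.get(cursor++);
    }

    /** Records a failed request; the proxy is evicted once it reaches maxFailures in a row. */
    synchronized void reportFailure(ProxyEntry entry) {
        if (++entry.failures >= maxFailures) {
            entries.remove(entry);
        }
    }

    /** Records a successful request and resets the failure counter. */
    synchronized void reportSuccess(ProxyEntry entry) {
        entry.failures = 0;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        SimpleProxyPool pool = new SimpleProxyPool(2);
        pool.add("192.0.2.10", 8080); // placeholder addresses, not real proxies
        pool.add("192.0.2.11", 3128);

        ProxyEntry first = pool.next();
        pool.reportFailure(first);
        pool.reportFailure(first); // second failure in a row evicts the proxy
        System.out.println(pool.size() + " proxy left, next: " + pool.next());
    }
}
```

Free proxies tend to go stale quickly, which is why the sketch evicts on consecutive failures instead of keeping every crawled address forever.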

QuickStart

Environment for the crawler module:

  • JDK 1.8+
  • maven 4.0.0+
  • webmagic 0.7.3+
  • ansj_seg 5.1.1+
  • springboot 1.5.7+
  • mybatis 1.3.1+
  • mysql 5.1.21+

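Running the crawler, using posts, users, and comments from Hupu's Buxingjie forum as the example:

  1. Initialize the database. Create your own schema in a local MySQL instance, run the initialization script hupu-spider/src/main/resources/db.sql, and update the data source settings in hupu-spider/src/main/resources/application.yml to match your database.
  2. Start the crawler. hupu-spider is triggered by a URL request: open localhost:8080/ in a browser and the crawler starts running. The default port is 8080; if it conflicts with another service, change it in hupu-spider/src/main/resources/application.yml.
  3. Run the data visualization module. Once the data has been crawled into the database, run the data-analysis module in Tomcat. Different URLs in the browser produce different charts; see data-analysis/src/main/java/com/crow/web/EchartsController.java for details.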

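Results

For example, the geographic distribution of Hupu users:

For a more detailed analysis, see my blog post 《数据不说谎:用网络爬虫探秘虎扑步行街》 ("Data Doesn't Lie: Exploring Hupu's Buxingjie with a Web Crawler").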
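
As a rough illustration of the aggregation behind a chart like the regional distribution, the sketch below counts users per region and formats the result as the name/value pairs that an ECharts pie series accepts. It is not taken from data-analysis: the class RegionDistribution, its method names, and the sample regions are hypothetical, and in the real module the grouping is more likely done in SQL through MyBatis.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: aggregates user regions into chart-ready data. */
final class RegionDistribution {

    /** Counts users per region, skipping blanks, sorted by count (largest first), then by name. */
    static Map<String, Long> countByRegion(List<String> regions) {
        Map<String, Long> counts = regions.stream()
                .filter(r -> r != null && !r.trim().isEmpty())
                .collect(Collectors.groupingBy(String::trim, Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new)); // LinkedHashMap keeps the sorted order
    }

    /** Renders the counts as a JSON array of {"name": ..., "value": ...} objects. */
    static String toSeriesData(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(e -> "{\"name\":\"" + escape(e.getKey()) + "\",\"value\":" + e.getValue() + "}")
                .collect(Collectors.joining(",", "[", "]"));
    }

    /** Minimal escaping for backslashes and quotes only; a JSON library is the safer choice. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Sample values only; real data would come from the crawled user table.
        List<String> regions = Arrays.asList(
                "Guangdong", "Beijing", "Guangdong", " ", null, "Shanghai", "Beijing", "Guangdong");
        System.out.println(toSeriesData(countByRegion(regions)));
    }
}
```

A controller method could return this string as the response body for the page's ECharts script to load into the series data.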

TODO

  • Implement distributed crawling with a Redis-based distributed queue.
  • Implement scheduled data updates with Quartz.

Contact the author

magictoe's Issues

The provided proxy doesn't work

        .addHeader("Proxy-Authorization", ProxyGeneratedUtil.authHeader(ORDER_NUM, SECRET, (int) (System.currentTimeMillis() / 1000))) // set the proxy authorization header
