social-media-data-extracter

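  • Social media data collection
  • Data source
  • Topic selection
  • Architecture design

Social media data collection

This project is primarily for learning purposes. It is the upstream part of a larger social media sentiment analysis project: it collects Weibo data, cleans it, and stores it in MongoDB, providing the basis for later data analysis. It is not for commercial use. It picks currently trending events and analyzes and visualizes their comments, in order to deepen the understanding and practical use of data analysis techniques.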

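Data source

This project collects data on a recent trending event: the "sleeping with a fan" controversy surrounding the entertainment celebrity Cai Xukun (蔡徐坤). Comments are collected mainly from fans under his Weibo posts, about 570,000 comments in total (as of 2023-07-06). The comments are cleaned (emoji, punctuation, and special characters are removed). The collected fields include commenter ID, commenter nickname, gender, region, and comment content.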
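The cleaning step described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `clean_comment`, the bracketed-emoticon pattern (the assumption that emoticons appear in the text as short bracketed names such as `[doge]`), and the choice to replace removed characters with a space are all illustrative assumptions.

```python
import re
import unicodedata

# Assumption: built-in emoticons show up as short bracketed names, e.g. "[doge]".
BRACKET_EMOTICON = re.compile(r"\[[^\[\]]{1,10}\]")


def clean_comment(text: str) -> str:
    """Strip emoticons, emoji, punctuation, and special characters from a comment."""
    text = BRACKET_EMOTICON.sub(" ", text)
    kept = []
    for ch in text:
        # Unicode categories starting with "L" are letters (including CJK
        # ideographs) and those starting with "N" are numbers. Everything else
        # (punctuation, symbols, emoji, control characters) becomes a space.
        if unicodedata.category(ch)[0] in ("L", "N"):
            kept.append(ch)
        else:
            kept.append(" ")
    # Collapse the runs of spaces left behind by the removed characters.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(clean_comment("太好了!!!😂[doge] hello, world"))
```

Replacing removed characters with a space rather than deleting them keeps words that were separated only by punctuation from being glued together, which may matter for later tokenization.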

The main data structures are as follows:

  • Comment information
{
    "user_id": "user_id",
    "coment_id": "comment_id",
    "nick_name": "nick_name",
    "reply_count": 1000,
    "like_count": 1000,
    "root_comment_id": "root_comment_id",
    "comment_time": "2023-07-09 11:45:55",
    "content": "content",
    "created_at": "2023-07-09 11:45:55"
}
  • User information
{
    "verified": false, // whether the account is verified
    "user_id": 123123,
    "nick_name": "nick_name",
    "gender": 0, // 0 female, 1 male
    "province": "河南",
    "city": "开封",
    "followers_count": 100, // number of followers
    "friends_count": 100, // number of friends
    "source": "河南", // IP location the comment was posted from
    "created_at": "2023-07-09 11:45:55"
}
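To make the comment structure above concrete, here is a hedged sketch of how a record might be modeled before it is written to MongoDB. The class name `CommentRecord`, the `to_document` method, and the idea of reusing the comment ID as the MongoDB `_id` (so that re-crawled comments overwrite rather than duplicate) are assumptions for illustration, not something the project states. Field names are copied from the sample document as written.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict

# Matches the timestamp format used in the sample documents.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def _now() -> str:
    return datetime.now().strftime(TIME_FORMAT)


@dataclass
class CommentRecord:
    user_id: str
    coment_id: str  # key spelled exactly as in the sample document
    nick_name: str
    reply_count: int
    like_count: int
    root_comment_id: str
    comment_time: str
    content: str
    created_at: str = field(default_factory=_now)

    def to_document(self) -> Dict[str, Any]:
        """Return a plain dict suitable for a document store."""
        doc = asdict(self)
        # Assumption: use the comment id as the primary key for deduplication.
        doc["_id"] = self.coment_id
        return doc


if __name__ == "__main__":
    record = CommentRecord(
        user_id="user_id",
        coment_id="comment_id",
        nick_name="nick_name",
        reply_count=1000,
        like_count=1000,
        root_comment_id="root_comment_id",
        comment_time="2023-07-09 11:45:55",
        content="content",
    )
    print(record.to_document())
```

With pymongo, such a dict could then be written with something like `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)`, so that crawling the same comment twice does not create a duplicate.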

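Topic selection

The main reason for choosing this topic is to understand how young people today really react to an entertainment celebrity sleeping with a fan, and to see whether anything interesting turns up in the data. The project also aims to analyze the makeup and geographic distribution of the people who comment. Another major reason is that Cai Xukun is a top celebrity in mainland Chinese entertainment who generates a great deal of discussion, so the volume of comments is very large, which makes the topic well suited to data analysis.

Architecture design

This project uses Python Scrapy to crawl the comments on a single Weibo post. A crawler that runs stably in practice would probably also need a cookie pool and proxies. Because of cost and time constraints, this crawler is given its cookie manually. If it runs into access restrictions, the cookie may need to be replaced by hand and the network switched.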
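As a small illustration of the manual-cookie approach, the sketch below converts a `Cookie` header string copied from the browser's developer tools into a dict. The function name `parse_cookie_header` and the cookie names in the example are hypothetical placeholders, not the project's code or real Weibo cookie values.

```python
from typing import Dict


def parse_cookie_header(raw: str) -> Dict[str, str]:
    """Split a 'name=value; name2=value2' cookie string into a dict."""
    cookies: Dict[str, str] = {}
    for pair in raw.split(";"):
        # partition splits on the first "=" only, so values that themselves
        # contain "=" are preserved intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Placeholder cookie string; paste the real one from the browser here.
    raw_cookie = "SESSION=abc123; TOKEN=x=y"
    print(parse_cookie_header(raw_cookie))
```

In a Scrapy spider, the resulting dict can be passed through the `cookies` argument of `scrapy.Request`. When the crawler starts getting restricted, only the pasted string needs to be swapped out.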
