Giter Club home page Giter Club logo

senti-weibo's Introduction

S2AP

[S2AP]: A Sequential Senti-weibo Analysis Platform.

S2AP

Introduction

Open Source

Corpus/Test-Dataset/Trained-model

Download link Desc
Senti-weibo Google Drive 671053 weibos, 407058 positive, 263995 negative
Test-Dataset Google Drive 1790 weibos, 1010 negative, 780 positive
Trained-Model by fastText Google Drive 1.74 GB
Public-Weibo-Dataset Google Drive 955MB, 35.2 million weibos

Training and test scripts can be available at training_and_test.py

Dictionary and Seed words

Details at README

Topic Data

We open-source the crawled weibos about the topic of Huawei and China-US Trade.

Topic Download Link Time
Huawei Google Drive 2019-03-12 — 2019-07-01
China-US Trade Google Drive 2019-04-20 — 2019-07-01

Weibo-preprocess-toolkit

Open-sourced on Github. In order to prove the significance of consistent text pre-processing rules in the training and online environment, we compare six segmentation tools with this script: segmentation_tools_compare.py.

Demo

$ pip install weibo-preprocess-toolkit
from weibo_preprocess_toolkit import WeiboPreprocess

preprocess = WeiboPreprocess()

from weibo_preprocess_toolkit import WeiboPreprocess

toolkit = WeiboPreprocess()
test_weibo = "所 以 我 都 不 喝 蒙 牛, 一 直 不 喜 歡 蒙 牛。 #南 京· 大 行 宫[地 点]# 赞[122]转发[11] [超 话] 收 藏09月11日 18:57"
cleaned_weibo = toolkit.clean(test_weibo)
print(cleaned_weibo)
# ’所 以 我 都 不 喝 蒙 牛 一 直 不 喜 欢 蒙 牛’
print(toolkit.cut(cleaned_weibo))
# [’所 以’, ’我’, ’都’, ’不 喝’, ’蒙 牛’, ’一 直’, ’不 喜 欢’, ’蒙 牛 ’]
print(toolkit.preprocess(test_weibo , simplified=True , keep_stop_word=True))
# ’所 以 我 都 不 喝 蒙 牛 一 直 不 喜 欢 蒙 牛’

Details of S2AP

Corpus Iteration

In fact, the iteration can be interpreted as a process of sifting sands out of flour. Given a sieve and some flour to be sifted, our goal is to sift out the sand with sieve. The difficulty is that the holes on the sieve are not all small holes, there will always be sands mixed into flour. Purify Function plays an important role in this part. Classifier trained by initialized corpus will lead to over-fitting iff we apply it to purify the corpus. So we divide the corpus into training set and verification set, and sample subset of training dataset to train the classifier many times. Multiple classifiers are used to verify the sentiment label of the verification dataset, and then sift out the samples whose classification results are not unified. Query Function can help to continuously recall new high-quality data to guarantee the robustness of the model.

Corpus Iteration

UML of S2AP

UML of S2AP

Weibo Topic Spider

Architecture of Spider

Snapshot of S2AP

Details at Website-snapshot

senti-weibo's People

Contributors

wansho avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.