jxnl / fbms

Mark Sweep is a collection of services that power automated Facebook moderation.

fbms's Introduction

Facebook Garbage Collection : Mark Sweep

We are currently developing a text classification tool to improve the quality of discussions within very large Facebook groups. Often the volume of conversation is so great that moderation becomes difficult without a dedicated team.

This was built at PennApps Fall 2014 by Jason Liu, Taylor Blau, and Henry Boldizsar.

The plan is to detect three core properties of a post:

How spam-like is it?
How hateful is the comment?
Is the post on topic?

Our current goal is an overall classification accuracy of 80%, with a preference for low false-positive rates.

Services

The goal is to develop a series of RESTful services that are independent of each other, perhaps with Flask. By keeping them loosely coupled, we allow other clients to use our REST API and to build on top of the platform.

POST : classify/ontopic/<int:groupid>/
POST : classify/spam/
POST : classify/troll/
POST : search/

GET  : classify/ontopic/<int:groupid>/<int:postid>/ 
GET  : classify/spam/<int:postid>/
GET  : classify/spam/<int:groupid>/<int:userid>/
GET  : classify/troll/<int:postid>/
GET  : classify/troll/<int:groupid>/<int:userid>/
GET  : search/<int:groupid>/<int:limit>
GET  : search/<int:postid>/comments/
GET  : search/<int:postid>/likes/
GET  : search/<int:userid>/<int:limit>
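
Two of the POST endpoints above could be wired up in Flask roughly as follows. This is a minimal sketch, not the project's implementation: the JSON body shape ({"message": ...}), the response fields, and the score_spam placeholder are assumptions made for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def score_spam(text):
    """Placeholder scorer (hypothetical): fraction of tokens that look like URLs."""
    tokens = text.split()
    if not tokens:
        return 0.0
    urls = sum(1 for t in tokens if t.startswith(("http://", "https://")))
    return urls / len(tokens)


@app.route("/classify/spam/", methods=["POST"])
def classify_spam():
    # Assumed request body: {"message": "<post text>"}
    payload = request.get_json(silent=True) or {}
    return jsonify({"spam": score_spam(payload.get("message", ""))})


@app.route("/classify/ontopic/<int:groupid>/", methods=["POST"])
def classify_ontopic(groupid):
    # A real service would load the classifier trained for this group;
    # here the verdict is left as None to mark it as not implemented.
    payload = request.get_json(silent=True) or {}
    return jsonify({
        "groupid": groupid,
        "message": payload.get("message", ""),
        "ontopic": None,
    })
```

Because each route only depends on its own scorer, the spam, troll, and on-topic services could be deployed separately, which is the loose coupling described above.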

Topic Classification

The plan is to crawl all the HH subgroups and then persist everything in MongoDB. Training on the data can be done on a daily basis.

To allow Mark Sweep to adapt to shifting conversation themes within a group, we will resample some posts with a bias towards recent posts and posts with more Facebook likes. To do this, we will use exponential and weighted reservoir sampling to resample posts up to three times.
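
One way to sketch the weighted reservoir step is the Efraimidis-Spirakis A-Res scheme shown below. The weighting formula (exponential decay in post age multiplied by one plus the like count), the 30-day decay constant, and the age_days and likes field names are all assumptions for illustration. This version samples without replacement in a single pass, so it does not cover resampling a post up to three times.

```python
import heapq
import math
import random


def post_weight(age_days, likes, decay_days=30.0):
    """Assumed weighting: exponential decay in age, scaled by like count."""
    return (1 + likes) * math.exp(-age_days / decay_days)


def weighted_reservoir_sample(posts, k, rng=random):
    """Pick up to k posts in one pass; heavier posts are more likely kept.

    Each post gets the key u ** (1 / w) with u uniform in (0, 1]; the k
    largest keys form the sample (Efraimidis-Spirakis A-Res).
    """
    heap = []  # min-heap of (key, index, post); index breaks ties
    for index, post in enumerate(posts):
        w = post_weight(post["age_days"], post["likes"])
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # shifts [0, 1) to (0, 1] so u is never 0
        key = u ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, post))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, index, post))
    return [post for _, _, post in heap]
```

Since the sampler only needs one pass and keeps k items in memory, it could run directly over a database cursor during the daily retraining.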

With this bootstrapped dataset we will train a one-vs-all (OVA) classifier for each group, after running a grid search over the available classifiers to find the ideal hyperparameters. We have had great success with an SVM model with weighted classes using a (1,2)-gram bag-of-words representation of the post data.
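
To make the (1,2)-gram bag-of-words representation concrete, here is a small standard-library sketch that counts every unigram and bigram in a post. The tokenizer (lowercased runs of letters, digits, and apostrophes) is an assumption; in practice a library vectorizer such as scikit-learn's CountVectorizer with ngram_range=(1, 2) produces a comparable representation.

```python
import re
from collections import Counter


def ngram_bag(text, max_n=2):
    """Count every word n-gram of length 1..max_n in the text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


bag = ngram_bag("Free tickets, free food")
# Unigrams: free (x2), tickets, food
# Bigrams: "free tickets", "tickets free", "free food"
```

Bigrams let the classifier tell "free food" apart from "free tickets", which a unigram-only model cannot do.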

Spam Classification

Instead of using bag-of-words for spam classification, we decided to classify by behaviour rather than content, using the following features:

Number of URLs
Number of keywords (hateful and spammy); a full list can be found in the spam directory
Number of all-caps words
Number of times an identical post has been posted
Number of times a single user has posted identical content in various groups (potentially using MongoDB's map-reduce)
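
The first four features above could be extracted along these lines. This is a sketch under stated assumptions: the KEYWORDS set is a made-up stand-in for the project's real keyword list, the feature names are hypothetical, and duplicate detection is done with an in-memory counter rather than the data store. The per-user, cross-group count is not covered here.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
# Hypothetical keyword list; the project keeps its real list elsewhere.
KEYWORDS = {"free", "click", "winner"}


def behaviour_features(message, seen_counts):
    """Return count-based features for one post.

    seen_counts is a Counter mapping normalized post text to how many
    times that text has been seen so far; it is updated in place.
    """
    without_urls = URL_RE.sub(" ", message)
    words = re.findall(r"[A-Za-z']+", without_urls)
    normalized = " ".join(message.lower().split())
    seen_counts[normalized] += 1
    return {
        "n_urls": len(URL_RE.findall(message)),
        "n_keywords": sum(1 for w in words if w.lower() in KEYWORDS),
        "n_all_caps": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "n_identical": seen_counts[normalized],
    }
```

Single-letter words such as "I" and "A" are skipped in the all-caps count so ordinary sentences are not penalized.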

Troll Classification

Thanks to an app called Hater News, we found a data set of troll posts, so we will use that to detect troll posts.

Deletion

To prevent false positives, Mark Sweep will only delete posts that have a spam confidence of 95% or higher and are off-topic. Our goal is not to discourage groups from sharing things they may be passionate about (which might sound spammy), but to let admins moderate more important conversations instead of wasting time deleting obvious offenders.
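
The deletion rule can be expressed as a single predicate. The function and constant names are hypothetical, and the sketch assumes the spam classifier reports a confidence between 0 and 1 and the topic classifier reports a boolean.

```python
SPAM_THRESHOLD = 0.95  # the 95% spam-confidence cut-off described above


def should_delete(spam_confidence, on_topic):
    """Delete only when a post is both confidently spam and off-topic."""
    return spam_confidence >= SPAM_THRESHOLD and not on_topic
```

Requiring both conditions means an enthusiastic but on-topic post is kept even if it scores high on spam.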

Setup

To make Mark Sweep as easy as possible to set up, we intend to automate the process of scraping, training, testing, and deploying it. Just adding Mark Sweep to your group should be enough for it to do everything it needs to get started. We are still uncertain how to configure Mark Sweep on a group-by-group basis, and we would love to hear your ideas on how that can be done.

fbms's Issues

Bulk Loading Scripts

Now that MongoDB is set up, we'll need a class that can fetch everything and then transform it for our training algorithms.

Access Tokens exposed!

Just an FYI: a bunch of your MongoDB and Facebook secrets are exposed. I wouldn't want you guys to lose all this stuff, so maybe hide them? Or just change the secrets!

Otherwise: Great work! It looks very promising :)

Overall required tasks

I'm going to slowly try to refactor this thing. I would love some help.

Here are some core components of the Mark Sweep project. Each will have its own main issue, and I would love it if someone could champion a component.

Facebook Access Layer.
MongoDB Access Layer.
Queuing systems (async? multiprocessing? Celery? gevent?).
Server-side shenanigans.
Management.
Machine Learning.
Admin console for group moderators.

The first milestone is to have a version 0.1.0 release that can automatically retrain daily and moderate all groups.

Queuing System

A lot of discussion needs to be had about how we want to do all the processing.

Will we be using a message broker? What technologies will be required?

Some research will need to be done on:

celery
gevent
multiprocessing

Classification

We will need some automated ways to train and test our algorithms and serialize them into our data store.

We will need to think of relevant features and how we can effectively extract them. More to come later.

Facebook Layer

To automate the moderation of Facebook groups, the first thing we need is a consistent way of accessing Facebook.

Essentially, this is a class that wraps the Graph API and gives us well-defined access to groups, posts, and actions on those posts.

It will need to do the following:

get a list of groupids associated with the group
access the post content of a group
access the comments of a post
access the userids of posts and comments
comment on a post
bulk read the content of a group when invited to the group
tag a user in a comment
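
The requirements above suggest an interface along the following lines. Everything here is hypothetical: the class and method names are illustrative, and the in-memory implementation makes no Graph API calls. It exists so the rest of the system can be built and tested before the real wrapper is written.

```python
from abc import ABC, abstractmethod


class GroupAccess(ABC):
    """Hypothetical interface; method names are illustrative only."""

    @abstractmethod
    def group_ids(self):
        """Return the IDs of all groups being moderated."""

    @abstractmethod
    def posts(self, group_id):
        """Return the posts (with user IDs) of one group."""

    @abstractmethod
    def comments(self, post_id):
        """Return the comments (with user IDs) on one post."""

    @abstractmethod
    def comment_on(self, post_id, message, tag_user_id=None):
        """Comment on a post, optionally tagging a user."""


class InMemoryGroupAccess(GroupAccess):
    """Test double that keeps everything in dictionaries (no network)."""

    def __init__(self, groups):
        # groups: {group_id: [{"id": ..., "user_id": ..., "message": ...}]}
        self._groups = groups
        self._comments = {}

    def group_ids(self):
        return list(self._groups)

    def posts(self, group_id):
        return list(self._groups.get(group_id, []))

    def comments(self, post_id):
        return list(self._comments.get(post_id, []))

    def comment_on(self, post_id, message, tag_user_id=None):
        comment = {"post_id": post_id, "message": message, "tagged": tag_user_id}
        self._comments.setdefault(post_id, []).append(comment)
        return comment
```

A Graph API backed class would implement the same four methods, so classifiers and the admin dashboard never need to know which one they are talking to.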

More will be added as I come up with more requirements.

Admin dashboards

We will also need a way for admins to access and interact with the data we provide.

Systems

This is a bit hand-wavy, but the guys working on the datastore and systems should talk to each other and think of smart ways to automate the systems involved and to connect all the pieces.

Data store

We should have a way to write to our datastore and retain group ID information.

We will need databases for groups, group content, and group users. Schemas are still up in the air; I would love to discuss some potential approaches.

write to datastore.
read from datastore for admin dashboard.
schedule reads to datastore.
properly bootstrap from datastore.
properly prepare data for ML api.

(subject to change)

What we want to do is persist all the group data so it is searchable. Not only will it be a data store, but it will also be used for spam detection and flagging users.

We can consider both SQL and NoSQL technologies. I personally think MongoDB may be the way to go due to its synergy with Python dictionaries.
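
As an illustration of that synergy, a post can be modelled as a plain dictionary that keeps its group ID. The field names below are a hypothetical schema, not a settled one, and the lookup runs over an in-memory list rather than a real database. With pymongo, dictionaries of this shape could be passed to a collection's insert_one method.

```python
def post_document(post_id, group_id, user_id, message, likes=0):
    """Build the dict that would be stored for one post (hypothetical schema)."""
    return {
        "_id": post_id,
        "group_id": group_id,  # retained so records stay traceable to a group
        "user_id": user_id,
        "message": message,
        "likes": likes,
        "labels": {},  # classifier outputs, e.g. {"spam": 0.1}
    }


def posts_for_group(documents, group_id):
    """Filter stored documents by group, as a database query would."""
    return [doc for doc in documents if doc["group_id"] == group_id]


store = [
    post_document(1, 100, 7, "Anyone going to the hackathon?", likes=3),
    post_document(2, 200, 8, "Selling tickets, message me"),
]
hackathon_posts = posts_for_group(store, 100)
```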