clustertwit

Clustertwit is a quick hack/implementation for grouping twits based on a special concept of similarity.

A twit joins a group when more than N of its words are identical to words already in the group. When a twit is accepted into a group, its words are added to the group. Before grouping, twits are "cleaned":

  • Delete all words with fewer than 5 letters. Yes, you read that right. (The original idea is to get rid of articles. If you know a better approach, tell me.)
  • Delete all trailing spaces, if there are any.
  • Convert all letters to lowercase.
  • Delete all the chars ,'!. This way, words that differ only by these chars at the end are considered the same.
  • Apply a simple blacklist of words, to stop things like 'https' from forming groups.
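The cleaning steps and the grouping rule above can be sketched roughly as follows. This is a minimal illustration, not the code in clustertwit.py: the names clean_twit, group_twits, BLACKLIST and MIN_WORD_LENGTH are hypothetical, the blacklist holds only the one example the text mentions, the period is assumed to be one of the deleted chars, and assigning each twit to the first matching group is an assumption.

```python
import re

# Hypothetical values: the text only says a blacklist exists and names 'https'.
BLACKLIST = {"https"}
MIN_WORD_LENGTH = 5  # words with fewer letters are dropped


def clean_twit(text):
    """Return the set of cleaned words for one twit."""
    # Strip trailing spaces and lowercase everything.
    text = text.rstrip().lower()
    # Delete the chars , ' ! and . (the period is an assumption here).
    text = re.sub(r"[,'!.]", "", text)
    words = set()
    for word in text.split():
        if len(word) < MIN_WORD_LENGTH:
            continue  # too short, probably an article or similar
        if word in BLACKLIST:
            continue  # blacklisted word
        words.add(word)
    return words


def group_twits(twits, min_shared_words=2):
    """Put each twit in the first group that shares more than
    min_shared_words words with it, or start a new group."""
    groups = []  # each group is {"words": set, "twits": list}
    for twit in twits:
        words = clean_twit(twit)
        for group in groups:
            if len(words & group["words"]) > min_shared_words:
                group["twits"].append(twit)
                group["words"] |= words  # the group learns the twit's words
                break
        else:
            # No group matched, so this twit starts its own group.
            groups.append({"words": set(words), "twits": [twit]})
    return groups
```

Because an accepted twit adds its words to the group, a group's vocabulary grows over time, so the order in which twits arrive can change the final groups.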

Tuning the thresholds

When using this software, it is very useful to tune two thresholds depending on what you want.

  1. The number of words in each twit that must be equal to words in a group. This is the parameter -w.
    We found that 2 is OK for a small number of twits. You may need a higher value.
  2. Which groups to show in the final output. This is decided by counting the number of twits in each group, and is controlled by the parameter -t. This is really an interface issue, so set it depending on what you want to show to the user.
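As an illustration of the second threshold, the sketch below filters groups by how many twits they contain. The function name groups_to_show is hypothetical, and both the "at least" comparison and the largest-first ordering are assumptions; the actual behavior of -t in clustertwit.py may differ.

```python
def groups_to_show(groups, min_twits):
    """Keep only groups with at least min_twits twits, largest first.

    groups is a list of groups, each one a list of twits.
    """
    visible = [group for group in groups if len(group) >= min_twits]
    return sorted(visible, key=len, reverse=True)
```

A low value shows many small groups, which helps when exploring a dataset. A higher value keeps only the topics that many twits share.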

Also use the verbose parameter -v to control what you see and to get some debugging information.

Twitter API

If you run the program with -T (after putting your API keys in the code), you can download your own tweets and store them on disk. Then you can use that file as your tweets file.

python ./clustertwit.py -T
cat twitter-text-cache.tmp | python ./clustertwit.py

Usage

To use it, just cat a file with twits and pipe it to this program as stdin. Example:

cat TestDataset | python ./clustertwit.py
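On the Python side, reading piped input could look like the sketch below. It assumes the input holds one twit per line, which the text above does not state, and read_twits is a hypothetical name rather than a function from clustertwit.py.

```python
import sys


def read_twits(stream):
    """Yield one twit per non-empty line of the given text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield line


# Typical use when the program is fed through a pipe:
#     twits = list(read_twits(sys.stdin))
```

Taking the stream as an argument, instead of reading sys.stdin directly, makes the function easy to try with a file object or an in-memory string.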

It works on Linux and OSX at least.

Authors

This program was written by Sebastian Garcia, [email protected] (@eldracote). Thanks to @verovaleros for the thinking sessions and the creation of the dataset.
