Giter Club home page Giter Club logo

text-preprocessing-techniques's Introduction

text-preprocessing-techniques

16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis.

These techniques were used in comparison in our paper "A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis". If you use this material please cite the paper. An extended paper for this work can be found here, with the title "A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis". Please cite.

Most of these techniques are generic and can be used in various applications except Sentiment Analysis. They are the following:

0. Remove Unicode Strings and Noise

1. Replace URLs, User Mentions and Hashtags

2. Replcae Slang and Abbreviations

3. Replace Contractions

4. Remove Numbers

5. Replace Repetitions of Punctuation

6. Replace Negations with Antonyms

7. Remove Punctuation

8. Handling Capitalized Words

9. Lowercase

10. Remove Stopwords

11. Replace Elongated Words

12. Spelling Correction

13. Part of Speech Tagging

14. Lemmatizing

15. Stemming

This scripts also prints some statistics for the text file like:

  • Total Sentences
  • Total Words before and after preprocess
  • Total Unique words before and after preprocess
  • Average Words per Sentence before and after preprocess
  • Total Run time
  • Total Emoticons found
  • Total Slangs and Abbreviations found
  • 20 Most Commong Sland and Abbreviations and plots them
  • Total Elongated words
  • Total multi Exclamation
  • question and stop marks
  • Total All Capitalized words
  • 100 Most Common words and plots them and most common bigram and trigram collocations

The text file that we included here is a sample (2000 tweets) of the SS-Twitter dataset.

The file "preprocess.py" includes many comments and in order to use a technique you have to uncomment the appropriate line/lines. The initial script uses all techniques. So if you want to use only specific techniques, comment out the others.

text-preprocessing-techniques's People

Contributors

deffro avatar

Watchers

James Cloos avatar Hung Le avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.