spam_filtering's Introduction

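Spam comment filtering system. A comment counts as spam if it is: (1) an advertisement; (2) profanity, or text containing sensitive words; (3) a comment unrelated to the topic.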

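Classification problem: this is a short-text classification problem, which is difficult for two reasons. 1. Short texts contain few words, so they carry limited useful information. 2. The term-frequency or feature matrix built from the word segmentation results is usually very sparse, and many algorithms perform poorly on sparse matrices. Common approaches to short text fall into two groups: one improves the classification process with rules and refines the model; the other enriches the short text with external semantic information to improve classification results.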
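The sparsity point can be seen even on a toy corpus. The sketch below is only an illustration: the four comments are made-up samples, assumed to be already segmented into space-separated tokens, and the count matrix is built by hand rather than with any particular library.

```python
# Hypothetical, already-segmented short comments (tokens separated by spaces).
comments = [
    "加 微信 领取 优惠券",
    "这 部 电影 很 好看",
    "点击 链接 免费 领取",
    "楼主 说 得 对",
]

# Vocabulary: every distinct token seen in the corpus.
vocab = sorted({tok for c in comments for tok in c.split()})

# Dense document-term count matrix, one row per comment.
matrix = [[c.split().count(tok) for tok in vocab] for c in comments]

# Fraction of cells that are non-zero.
nonzero = sum(1 for row in matrix for v in row if v)
density = nonzero / (len(matrix) * len(vocab))
print(f"{len(vocab)} features, density = {density:.2f}")
```

Each comment here uses at most 5 of the 16 features, and the vocabulary grows much faster than the length of any single comment as more comments are added.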

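  1. Prepare the dataset
     (1) De-duplicate
     (2) Remove dirty data
     (3) Delete digits and separator symbols, keeping only Chinese and English
     (4) After segmentation, join the words of each sentence with spaces
     (5) Word segmentation: jieba
     (6) Normalize the data
     (7) Size: about 500,000 negative samples after de-duplication
  2. Sensitive-word filtering
     (1) Select the sensitive words
     (2)
  3. Build the dictionary
  4. Feature extraction
     (1) Compute term frequencies. The custom token_pattern keeps single-character tokens, which scikit-learn's default pattern would drop:
         count_vect = CountVectorizer(stop_words=stop_words_list, token_pattern=r"(?u)\b\w+\b")
     (2) tf-idf
     (3) Feature selection: CHI (chi-square)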
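The dataset preparation and feature extraction steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample comments, the labels, the `clean` helper, the stop-word list and `k=5` are all assumptions made up for the example. The comments are assumed to be already segmented (for instance by jieba) and joined with spaces, and CHI is read here as scikit-learn's chi-square test (`chi2`).

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2


def clean(text):
    """Keep only Chinese characters and English letters (hypothetical helper)."""
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z]+", " ", text).strip()


# Hypothetical pre-segmented comments with labels: 1 = spam, 0 = normal.
raw = [
    ("加 微信 123456 领取 优惠券 !!!", 1),
    ("点击 链接 免费 领取 优惠券", 1),
    ("点击 链接 免费 领取 优惠券", 1),  # exact duplicate, dropped below
    ("这 部 电影 很 好看", 0),
    ("楼主 说 得 对 , 支持", 0),
]

# Clean each comment, then de-duplicate while preserving order.
seen, texts, labels = set(), [], []
for text, label in raw:
    text = clean(text)
    if text and text not in seen:
        seen.add(text)
        texts.append(text)
        labels.append(label)

stop_words_list = ["得", "这", "部"]  # assumed stop words, for illustration only

# (1) Term frequencies; the pattern keeps single-character tokens.
count_vect = CountVectorizer(stop_words=stop_words_list,
                             token_pattern=r"(?u)\b\w+\b")
X_counts = count_vect.fit_transform(texts)

# (2) Re-weight the counts with tf-idf.
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# (3) Keep the k features with the highest chi-square score.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X_tfidf, labels)

print(X_counts.shape, X_selected.shape)
```

The chi-square test needs non-negative features, which both raw counts and tf-idf weights satisfy. On a real corpus `k` would be tuned on held-out data rather than fixed in advance.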

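  5. Train the classifier
     svclf_tfidf = SVC(kernel='linear')

     A support vector machine (SVM) is a supervised binary classifier that works well when there are many features. It looks for the hyperplane that separates the two classes with the largest margin; the training points that lie on the margin boundary are called support vectors. The decision function used to predict the class of test data depends only on the support vectors and can make use of the kernel trick.

  6. Save and load the model
     import pickle

     # Serialize the trained classifier to disk
     with open('svm.pickle', 'wb') as fw:
         pickle.dump(svclf_tfidf, fw)

     # Load it back and predict on the first sample
     with open('svm.pickle', 'rb') as fr:
         new_svm = pickle.load(fr)
     print(new_svm.predict(X[0:1]))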
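Training the classifier and saving and reloading it can be combined into one round trip. The sketch below is an assumption-laden example rather than the project's pipeline: the training comments and labels are invented, `TfidfVectorizer` stands in for the separate count and tf-idf steps, and the file is written to a temporary directory. It also pickles the vectorizer together with the classifier, which the original snippet does not do, because the fitted vocabulary is needed to transform new comments at prediction time.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical pre-segmented training comments: 1 = spam, 0 = normal.
texts = [
    "加 微信 领取 优惠券",
    "点击 链接 免费 领取 优惠券",
    "免费 代理 加 微信",
    "这 部 电影 很 好看",
    "楼主 说 得 对 支持",
    "剧情 很 精彩 推荐",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn the comments into tf-idf features and fit a linear SVM.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)
svclf_tfidf = SVC(kernel="linear")
svclf_tfidf.fit(X, labels)

# Save the vectorizer and the classifier together (an assumption of this sketch).
path = os.path.join(tempfile.mkdtemp(), "svm.pickle")
with open(path, "wb") as fw:
    pickle.dump((vectorizer, svclf_tfidf), fw)

# Reload both and classify a new, already-segmented comment.
with open(path, "rb") as fr:
    new_vectorizer, new_svm = pickle.load(fr)

new_comment = ["免费 领取 优惠券"]
prediction = new_svm.predict(new_vectorizer.transform(new_comment))
print(prediction)
```

The reloaded model should reproduce the original model's predictions exactly, which is a quick sanity check after loading.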

  7. Evaluate on the test set: accuracy, recall
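The evaluation metrics can be computed directly from the predicted and true labels. The sketch below uses invented labels and a hypothetical `evaluate` helper. Because the metric named alongside recall can be read as either accuracy or precision, it reports both.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    # Precision: of the comments flagged as spam, how many really are spam.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real spam comments, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical test-set labels: 1 = spam, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a spam filter, low recall means spam gets through, while low precision means legitimate comments are blocked, so the two are usually reported together.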

spam_filtering's People

Contributors

jansonkong
