
chinese-article-classification-based-on-own-corpus-via-textcnn-and-gbdt's Introduction

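Article-Classification-based-on-own-corpus-via-TextCNN-and-GBDT Description (translated from the Chinese version)

Created: 2018-05-24

Starting from a personal corpus (which you can replace with your own), this project builds two minimal baseline models for text classification.
Data Source contains a small, manually compiled toy corpus; download it and run the model files directly.

1. TF-IDF values computed over the corpus as word vectors + GBDT (Gradient Boosting Classifier) / SVM (Support Vector Classifier)
2. Word2Vector embeddings of the corpus as word vectors + a standard single-layer TextCNN (Keras version)

Main third-party dependencies: sklearn, gensim, jieba, pandas, scipy, Keras, TensorFlow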
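
As a rough illustration of the first approach, the sketch below computes TF-IDF vectors whose length equals the vocabulary size. It is plain Python and not the repository's code: the function name `tfidf_vectors`, the toy documents, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` are assumptions made for this example, and the documents are assumed to be already segmented into words (for example with jieba).

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocab, vectors): one TF-IDF vector of length len(vocab) per document."""
    # A sorted vocabulary gives every word a fixed column index.
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))

    # Smoothed inverse document frequency (assumed variant, see the lead-in).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequency; absent words count as 0
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vocab, vectors


# Toy, pre-segmented documents (hypothetical data, not taken from Data Source).
docs = [
    ["股票", "市场", "上涨"],
    ["球队", "赢得", "比赛"],
    ["股票", "基金", "市场"],
]
vocab, vectors = tfidf_vectors(docs)
```

Rows built this way could serve as the feature matrix for a classifier such as scikit-learn's `GradientBoostingClassifier` or `SVC`.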

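langconv.py converts between Simplified and Traditional Chinese; zh_wiki.py is the conversion dictionary (Mainland/** usage, Traditional/Simplified). Thanks to skydark@github. Source files:
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/langconv.py
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/zh_wiki.py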

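File descriptions:

1. Articles_Classifer_TF-IDF_GBDT.py: the decision-tree-based model (GBDT), using TF-IDF values as word vectors
2. Articles_Classifier_Word2Vector_TextCNN_Keras.py: a minimal convolutional neural network classifier
   Architecture: 1D Convolutional Layer + 1-Max Pooling Layer + Flatten Layer + Dropout Layer + Dense Layer + Softmax Layer
3. Extract_Articles_zhwiki_latest_pages.py: extracts the latest Chinese Wikipedia corpus; just change the paths to your own. It converts the JSON format into a txt file with one article per line
4. Word2Vector_Training.py: trains word vectors on your own corpus or on Wiki_Zh; adjust it as needed, and switch the commented-out code in or out as required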
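
To make the TextCNN layer stack listed above more concrete, here is a shape-level NumPy sketch of the inference-time forward pass for a single sentence. It is not the Keras code from this repository: the function name `textcnn_forward`, the random weights, all layer sizes, and the ReLU activation after the convolution are assumptions chosen for illustration.

```python
import numpy as np


def textcnn_forward(embedded, conv_w, conv_b, dense_w, dense_b):
    """Inference-time forward pass for one sentence.

    embedded: (seq_len, embed_dim) word vectors of the sentence
    conv_w:   (kernel_size, embed_dim, n_filters) convolution filters
    conv_b:   (n_filters,)
    dense_w:  (n_filters, n_classes)
    dense_b:  (n_classes,)
    """
    kernel_size = conv_w.shape[0]
    n_windows = embedded.shape[0] - kernel_size + 1

    # 1D convolution with "valid" padding: every filter scores each
    # window of kernel_size consecutive words.
    feature_maps = np.stack([
        np.tensordot(embedded[i:i + kernel_size], conv_w, axes=([0, 1], [0, 1]))
        for i in range(n_windows)
    ]) + conv_b
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU (assumed), (n_windows, n_filters)

    # 1-max pooling keeps the strongest response of each filter.
    # The result is already one-dimensional, so Flatten changes nothing here.
    pooled = feature_maps.max(axis=0)  # (n_filters,)

    # Dropout is only active during training, so it is skipped at inference.
    logits = pooled @ dense_w + dense_b  # (n_classes,)

    # Numerically stable softmax over the class scores.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Hypothetical sizes and random weights, only to show the shapes involved.
rng = np.random.default_rng(0)
seq_len, embed_dim, kernel_size, n_filters, n_classes = 10, 8, 3, 4, 5
probs = textcnn_forward(
    rng.normal(size=(seq_len, embed_dim)),
    rng.normal(size=(kernel_size, embed_dim, n_filters)),
    np.zeros(n_filters),
    rng.normal(size=(n_filters, n_classes)),
    np.zeros(n_classes),
)
```

The point of 1-max pooling is that the output size depends only on the number of filters, so sentences of different lengths map to vectors of the same size before the dense layer.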


English Version Description

Created: 2018-05-24

You can start this project with your own corpus; a sample will be provided in Data_Source/sample.csv (to be added later).
Two simple text classifier models were constructed.

Project Description

  1. Feature: TF-IDF value of each word, size = len(corpus vocabulary). Model: Gradient Boosting Classifier / Support Vector Classifier
  2. Feature: Wiki_zh Word2Vector values; words missing from the pretrained vectors are initialized with np.random.uniform(-0.25, 0.25). Model: TextCNN
    TextCNN structure: 1D Convolutional Layer + 1-Max Pooling Layer + Flatten Layer + Dropout Layer + Dense Layer + Softmax Layer

Third-party dependencies: sklearn, gensim, jieba, pandas, scipy, Keras (backend: TensorFlow)
[langconv.py, zh_wiki.py] are used to convert Traditional Chinese characters to Simplified Chinese characters; thanks to the original author, skydark@github.
Resource links:
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/langconv.py
https://raw.githubusercontent.com/skydark/nstools/master/zhtools/zh_wiki.py
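
The handling of words missing from the pretrained vectors can be sketched as follows. This is a minimal example under stated assumptions, not the repository's implementation: `build_embedding_matrix` is a hypothetical helper, and a plain dict with made-up 4-dimensional vectors stands in for the pretrained Wiki_zh Word2Vector model.

```python
import numpy as np


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Stack one vector per vocabulary word into a (len(vocab), dim) matrix."""
    rng = np.random.default_rng(seed)
    matrix = np.empty((len(vocab), dim), dtype=np.float32)
    missing = []
    for row, word in enumerate(vocab):
        if word in pretrained:
            # Known word: copy its pretrained vector.
            matrix[row] = pretrained[word]
        else:
            # Missing word: draw each component from U(-0.25, 0.25).
            matrix[row] = rng.uniform(-0.25, 0.25, size=dim)
            missing.append(word)
    return matrix, missing


# Hypothetical stand-in for the pretrained vectors (word -> vector).
pretrained = {
    "股票": np.full(4, 0.5, dtype=np.float32),
    "市场": np.full(4, -0.5, dtype=np.float32),
}
vocab = ["股票", "市场", "基金"]
matrix, missing = build_embedding_matrix(vocab, pretrained, dim=4)
```

Here `missing` lists the words that received random vectors, which is a quick way to check how much of the corpus vocabulary the pretrained model covers.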

If you have any questions, please feel free to contact me: [email protected]

chinese-article-classification-based-on-own-corpus-via-textcnn-and-gbdt's People

Contributors

jialiangshen
