Giter Club home page Giter Club logo

pymining's Introduction

Objective

This is a platform writing in Python that can use variant data-mining algorithms to get results from a source (like matrix, text documents). Algorithms can using xml configuration to make them run one-by-one.

E.g. at first, we may run PCA(principle components analysis) for feature selection, then we may run random forest for classification. Now, algorithms are mainly design for tasks can be done in a single computer, good scalability of the architecture allows you in a very short period of time to complete the algorithm you want, and use it in your project (believe me, it's faster, better, and easier than Weka).

The another important feature is this platfrom can support text classification or clustering operation very good.

Get start

Just write code like this, you will get amazing result (a naive-bayes training and testing),

# load configuratuon from xml file
config = Configuration.FromFile("conf/test.xml")
GlobalInfo.Init(config, "__global__")

# init module that can create matrix from text file
txt2mat = Text2Matrix(config, "__matrix__")

# create matrix for training (with tag) from a text file "train.txt"
[trainx, trainy] = txt2mat.CreateTrainMatrix("data/train.txt")

# create a chisquare filter from config file
chiFilter = ChiSquareFilter(config, "__filter__")

# get filter model from training matrix
chiFilter.TrainFilter(trainx, trainy)

# filter training matrix
[trainx, trainy] = chiFilter.MatrixFilter(trainx, trainy)

# train naive bayes model
nbModel = NaiveBayes(config, "naive_bayes")
nbModel.Train(trainx, trainy)

# create matrix for test
[testx, testy] = txt2mat.CreatePredictMatrix("data/test.txt")

# using chisquare filter do filtering
[testx, testy] = chiFilter.MatrixFilter(testx, testy)

# test matrix and get result (save in resultY) and precision
[resultY, precision] = nbModel.Test(testx, testy)

print precision

And you need define some paramters in a configuration file (It will be loaded by Configuration.FromFile(...xml)).

<config>
    <__segmenter__>
        <main_dict>dict/dict.main</main_dict>
    </__segmenter__>

    <__matrix__>
    </__matrix__>

    <__global__>
        <term_to_id>mining/term_to_id</term_to_id>
        <id_to_term>mining/id_to_term</id_to_term>
        <id_to_doc_count>mining/id_to_doc_count</id_to_doc_count>
        <class_to_doc_count>mining/class_to_doc_count</class_to_doc_count>
        <id_to_idf>mining/id_to_idf</id_to_idf>
        <newid_to_id>mining/newid_to_id</newid_to_id>
    </__global__>

    <__filter__>
        <rate>0.2</rate>
        <method>max</method>
        <log_path>mining/filter.log</log_path>
        <model_path>mining/filter.model</model_path>
    </__filter__>

    <naive_bayes>
        <model_path>mining/naive_bayes.model</model_path>
        <log_path>mining/naive_bayes.log</log_path>
    </naive_bayes>

    <twc_naive_bayes>
        <model_path>mining/naive_bayes.model</model_path>
        <log_path>mining/naive_bayes.log</log_path>
    </twc_naive_bayes>
    
    <smo_csvc>
      <model_path>mining/smo_csvc.model</model_path>
      <log_path>mining/smo_csvc.log</log_path>
      <c>100</c>
      <eps>0.001</eps>
      <tolerance>0.000000000001</tolerance>
      <times>50</times>
      <kernel>
        <name>RBF</name>
        <parameters>10</parameters>
      </kernel>
      <cachesize>300</cachesize>
    </smo_csvc>
</config>

Features

Clustering algorithm

  • KMeans

Classification algorithm

  • Random forest
  • Naive Bayes
  • TWC Naive Bayes
  • SVM

Feature selector

  • Chisquare
  • PCA

Mathematic

  • Basic operations (like bagging, transpose, etc.)

Data source support

  • Matrix (with csv format)
  • Raw text (now only support Chinese, English to be added)

Benchmark

To be add

pymining's People

Contributors

leftnoteasy avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.