
dapy's Introduction

This open source framework fluently implements your ideas for data mining.

DaPy - Enjoy the Tour in Data Mining

Chinese version

Overview

DaPy is a data analysis library designed with ease of use in mind. It lets you implement your ideas smoothly by providing well-designed data structures and a rich set of professional ML models. Many well-known data manipulation modules already exist, such as Pandas, but none of them

  • supports writing code in a chain-programming style;
  • provides thread-safe data containers;
  • exposes feature engineering methods through simple APIs;
  • handles data as easily as Excel does (no need to pay attention to data structures);
  • shows a log of each step on the console, as MySQL does.

Thus, DaPy is better suited to data analysts, statistics professors and people who work with big data but have limited computing knowledge than to engineers. DaPy's data structure offers 70 APIs for data mining, including 40+ data operation functions, 10+ feature engineering functions and 15+ data exploration functions.
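To make the chain-programming and step-log ideas concrete, here is a minimal sketch in plain Python. The `TinySheet` class and its methods are hypothetical stand-ins invented for illustration; they are not DaPy's actual classes or API. The only point is that each method returns the object itself, so calls can be chained, and each step writes one log line.

```python
class TinySheet:
    """Hypothetical stand-in for a data sheet; not DaPy's actual class."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]
        self.log = []  # one entry per processing step

    def _record(self, message):
        self.log.append(message)
        print(" - " + message)
        return self  # returning self is what makes chaining possible

    def drop_missing(self):
        before = len(self.rows)
        self.rows = [r for r in self.rows if None not in r.values()]
        return self._record("dropped %d incomplete records" % (before - len(self.rows)))

    def sort(self, column, descending=False):
        self.rows.sort(key=lambda r: r[column], reverse=descending)
        return self._record("sorted by %s" % column)

    def head(self, n):
        self.rows = self.rows[:n]
        return self._record("kept first %d records" % n)


data = [
    {"name": "A", "age": 31},
    {"name": "B", "age": None},
    {"name": "C", "age": 45},
    {"name": "D", "age": 22},
]

# Three steps written as one chained expression.
oldest_two = TinySheet(data).drop_missing().sort("age", descending=True).head(2)
```

Because every step returns the same object, a whole cleaning pipeline reads top to bottom as a single expression, and the log shows what each step did.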

Example

This example briefly shows three characteristics of DaPy: chain programming, the working log and simple feature engineering methods. Our goal is to train a classifier for the Iris classification task. More details can be found here.

Features of DaPy

We already have plenty of great libraries for data science, so why do we need DaPy?

The answer is that DaPy is designed for data analysts, not for coders. In DaPy, users only need to focus on how they want to handle the data and can pay less attention to coding tricks. For example, in contrast to Pandas, DaPy lets you manipulate data by rows, much as you would in SQL. Here are just a few of the things that make DaPy simple:

  • A variety of ways to visualize data in CMD
  • 2D data sheet structures following Python syntax habits
  • SQL-like APIs to process data
  • Thread-safe data containers
  • A variety of functions for preprocessing and feature engineering
  • Flexible IO tools for loading and saving data (e.g. Website, Excel, Sqlite3, SPSS, Text)
  • Built-in basic models (e.g. Decision Tree, Multilayer Perceptron, Linear Regression, ...)

DaPy is also efficient enough to support real-world work. The following diagram shows a test result indicating that DaPy's efficiency is comparable to that of some existing libraries written in C. The details of the test can be found here.

Performance Test

Install

The latest version, 1.11.1, has been uploaded to PyPI.

pip install DaPy

Some functions in DaPy depend on the following requirements.

  • xlrd: loading data from .xls files [Necessary]
  • xlwt: exporting data to .xls files [Necessary]
  • repoze.lru: speeding up loading data from .csv files [Necessary]
  • savReaderWriter: loading data from .sav files [Optional]
  • bs4.BeautifulSoup: automatically downloading data from a website [Optional]
  • numpy: dramatically increases the efficiency of ML models [Recommended]

Usages

  • Load & Explore Data
    • Load data from a local csv, sav, sqlite3, mysql server, mysql dump file or xls file: sheet = DaPy.read(file_addr)
    • Display the first five and the last five records: sheet.show(lines=5)
    • Summarize the statistical information of each column: sheet.info
    • Count the distribution of a categorical variable: sheet.count_values('gender')
    • Find differences between the labels of a categorical variable: sheet.groupby('city')
    • Calculate the correlation between continuous variables: sheet.corr(['age', 'income'])
  • Preprocessing & Clean Up Data
    • Remove duplicate records: sheet.drop_duplicates(col, keep='first')
    • Use linear interpolation to fill in NaN values: sheet.fillna(method='linear')
    • Remove records in which more than 50% of the variables are NaN: sheet.dropna(axis=0, how=0.5)
    • Remove some meaningless columns (e.g. ID): sheet.drop('ID', axis=1)
    • Sort records by some columns: sheet = sheet.sort('Age', 'DESC')
    • Merge external features from another table: sheet.merge(sheet2, left_key='ID', other_key='ID', keep_key='self', keep_same=False)
    • Merge external records from another table: sheet.join(sheet2)
    • Append records one by one: sheet.append_row(new_row)
    • Append new variables one by one: sheet.append_col(new_col)
    • Get parts of records by index: sheet[:10, 20: 30, 50: 100]
    • Get parts of columns by column name: sheet['age', 'income', 'name']
  • Feature Engineering
    • Convert a date-time variable into categorical variables: sheet.get_date_label('birth')
    • Convert numerical variables into categorical variables: sheet.get_categories(cols='age', cutpoints=[18, 30, 50], group_name=['Juveniles', 'Adults', 'Wrinkly', 'Old'])
    • Convert categorical variables into dummy variables: sheet.get_dummies(['city', 'education'])
    • Create higher-order interaction terms between your selected variables: sheet.get_interactions(n_power=3, col=['income', 'age', 'gender', 'education'])
    • Add the rank of each record: sheet.get_ranks(cols='income', duplicate='mean')
    • Standardize ordinary continuous variables: sheet.normalized(col='age')
    • Apply special processing to special variables: sheet.normalized('log', col='salary')
    • Create new variables from business logic formulas: sheet.apply(func=tax_rate, col=['salary', 'income'])
    • Apply differencing to make a time series stationary: DaPy.diff(sheet.income)
  • Developing Models
    • Choose a model and initialize it: m = MLP(), m = LinearRegression(), m = DecisionTree() or m = DiscriminantAnalysis()
    • Train the model parameters: m.fit(X_train, Y_train)
  • Model Evaluation
    • Evaluate the model with parameter tests: m.report.show()
    • Evaluate the model with visualization: m.plot_error() or DecisionTree.export_graphviz()
    • Evaluate the model on a test set: DaPy.methods.Performance(m, X_test, Y_test, mode)
  • Saving Result
    • Save the model: m.save(addr)
    • Save the final dataset: sheet.save(addr)
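As an illustration of what the cut-point example in the Feature Engineering list means, the sketch below maps ages to the four group names using only the standard library. It is not DaPy's implementation: the helper `to_category` is hypothetical, and the rule that a value equal to a cut point falls into the upper group is an assumption, since the text above does not say which side of a boundary is inclusive.

```python
from bisect import bisect_right


def to_category(value, cutpoints, group_names):
    """Map a number to a group name; cutpoints must be sorted ascending.

    Assumption: a value equal to a cut point belongs to the upper group.
    """
    if len(group_names) != len(cutpoints) + 1:
        raise ValueError("need exactly one more group name than cut points")
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the matching group.
    return group_names[bisect_right(cutpoints, value)]


cutpoints = [18, 30, 50]
group_names = ['Juveniles', 'Adults', 'Wrinkly', 'Old']
ages = [12, 18, 29, 30, 64]
labels = [to_category(age, cutpoints, group_names) for age in ages]
print(labels)
```

Three cut points split the number line into four intervals, which is why the example passes four group names for three cut points.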

Related

The following projects are also great data analysis and manipulation frameworks in Python:

  • Agate: Data analysis library optimized for humans
  • Numpy: Fundamental package for scientific computing with Python
  • Pandas: Python Data Analysis Library
  • Scikit-Learn: Machine Learning in Python

dapy's Issues

Code formatting in README

The code examples in README would be best formatted as code and not as plain text, to stand out.

I also believe you need a README.rst not README.md for pypi.

No module named 'multiprocess'

Describe the bug
Sorry, this has already been resolved.


The pip version DaPy-1.10.10 cannot print a Sheet

Dapy.core.base.Sheet.py

There is a print val statement on line 863, which raises an error and makes it impossible to print the sheet.

        pattern = [_ for _ in pattern.split(string) if _]
        if len(pattern) == 3:
            val = clear_pattern(pattern[2])
            val = auto_str2value(val)
            print val
            if pattern[1] == symbol:
                if i == 0:
                    return set(index.unequal(val))
                return set(func(val))
            try:
                return set(func(val, False))
            except TypeError:
                pass
            try:
                return set(func(val))
            except TypeError:
                return set()

AssertionError:Sn is not a title in current dataset

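I used DaPy.read() to load an xlsx file that contains three sheets. The first two sheets contain data; the last one, Sheet3, is the default sheet created with the file and is empty.

After loading, I called DaPy.core.DataSet.DataSet.info. The two sheets with data print normal results, but Sheet3 raises an error with the message shown in the title.

Also, I suggest adding the link to the guide book to the project's readme.md.
lol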

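Problem with time.clock()

Hello, jrs! I recently downloaded DaPy to try it out, but ran into a problem:
ImportError: cannot import name 'clock' from 'time' (unknown location)
I looked into it: since Python 3.8, clock has been removed from the time module, and process_time() or perf_counter() should be used instead. I hope you can take this into account in a future update.
The affected code is mainly in
DaPy\core\base\Series.py
DaPy\core\DataSet.py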
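One possible way to handle this is a small compatibility shim, sketched below. It is not a patch taken from DaPy; the names `timer` and `timed` are hypothetical. It prefers `time.perf_counter()` and falls back to `time.clock()` only on interpreters that lack it (Python 2).

```python
import time

try:
    timer = time.perf_counter  # available on Python 3.3 and later
except AttributeError:
    timer = time.clock  # fallback for Python 2 only


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    return result, timer() - start


total, elapsed = timed(sum, range(100000))
print("sum=%d computed in %.6f s" % (total, elapsed))
```

Resolving the timer once at import time keeps the call sites unchanged: code that used to call `clock()` can call `timer()` instead.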

Latest DaPy fails to load pageranker

I just did a clean reinstall of DaPy on Python 3.7.4.

When I tried to load methods (from DaPy import methods) I got the following error:

print pageranker(initial, weight)
^
SyntaxError: invalid syntax
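The error above is what Python 3 reports for a Python 2 style print statement. The sketch below checks both spellings with the built-in `compile()`, without importing DaPy; the helper name `compiles` is hypothetical. On a Python 3 interpreter the statement form is rejected and the function form is accepted. In code that must still run on Python 2, placing `from __future__ import print_function` at the top of the module makes the function form valid there as well.

```python
def compiles(source):
    """Return True if the source text is valid syntax for this interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


statement_form = "print pageranker(initial, weight)"   # Python 2 only
function_form = "print(pageranker(initial, weight))"   # valid on Python 3

for label, source in [("statement", statement_form), ("function", function_form)]:
    print("%-9s form compiles: %s" % (label, compiles(source)))
```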

A typo in the Instruction part

DataPy.dataset(addr='data.csv', title=True, split='AUTO', db=None, name=Data, firstline=1, miss_value=None)

Should be

DataPy.dataset(addr='data.csv', title=True, split='AUTO', db=None, name='Data', firstline=1, miss_value=None)

When reading an Excel file and then exporting it to sqlite3, the first row of the Excel file cannot be read and the export fails.

