type_infer's Introduction

MindsDB Type Infer

Automated type inference for Machine Learning pipelines.

In the context of tabular data, type_infer aims for the optimal interpretation of each column's data type for ML use cases. For example, strings in a date or time format are classified as timestamps, and integer columns are classified as categorical when they contain a sufficiently small set of unique values.
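The heuristics described above can be illustrated with a minimal sketch. This is not type_infer's implementation or API: the function name `guess_column_type`, the `max_categories` threshold, the returned labels, and the single `%Y-%m-%d` date format are all assumptions made for illustration.

```python
import pandas as pd


def guess_column_type(column: pd.Series, max_categories: int = 10) -> str:
    """Hypothetical helper: guess a coarse ML-oriented type for one column."""
    values = column.dropna()
    if values.empty:
        return "empty"

    # Integer columns with few distinct values are treated as categorical.
    if pd.api.types.is_integer_dtype(values):
        return "categorical" if values.nunique() <= max_categories else "integer"

    if pd.api.types.is_float_dtype(values):
        return "float"

    # Assumption: only ISO-style dates (YYYY-MM-DD) are recognised here.
    parsed = pd.to_datetime(values.astype(str), format="%Y-%m-%d", errors="coerce")
    if parsed.notna().all():
        return "datetime"

    return "categorical" if values.nunique() <= max_categories else "text"


df = pd.DataFrame({
    "month": ["2017-07-01", "2017-08-01", "2017-09-01"],
    "rating": [1, 2, 1],
    "user_id": [101, 102, 103],
})
guesses = {name: guess_column_type(df[name], max_categories=2) for name in df.columns}
```

With a threshold of 2, `rating` has two unique values and falls under the categorical rule, while `user_id` has three and stays an integer column.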

Documentation

Documentation link

type_infer's People

Contributors

ea-rus, lucas-koontz, mcbianconi, mrandri19, pafluxa, paxcema, pedrofluxa, stpmax, tomhuds, vanshv, zoranpandovski

type_infer's Issues

type_check_date function doesn't work

Steps to replicate

import pandas as pd
import type_infer

df = pd.DataFrame([{'month': '2017-07-01'}, {'month': '2017-08-01'}, {'month': '2017-09-01'}])
type_infer.infer.type_check_date(df['month'])

Raises:

  File "/work/mindsdb/mindsdb_main/venv/lib/python3.10/site-packages/type_infer/infer.py", line 135, in type_check_date
    if dt.hour == 0 and dt.minute == 0 and dt.second == 0 and len(str(element)) <= 16:
  File "/work/mindsdb/mindsdb_main/venv/lib/python3.10/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'hour'
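The traceback shows `.hour` being read from a whole `Series` rather than from a single parsed datetime. As a possible workaround, here is a minimal sketch that applies the same condition one element at a time. The helper name `looks_like_date_only` is hypothetical and is not part of type_infer; the condition itself is copied from the traceback line.

```python
import pandas as pd


def looks_like_date_only(element) -> bool:
    """Hypothetical helper: True if `element` parses to a date with no time part."""
    try:
        dt = pd.to_datetime(element)  # a scalar input yields a single Timestamp
    except (ValueError, TypeError):
        return False
    if pd.isna(dt):
        return False
    # Same condition as in the traceback, applied to one element.
    return dt.hour == 0 and dt.minute == 0 and dt.second == 0 and len(str(element)) <= 16


column = pd.Series(['2017-07-01', '2017-08-01', '2017-09-01'])
all_dates = all(looks_like_date_only(value) for value in column)
```

Iterating over the column keeps each `dt` a scalar `Timestamp`, which does have `hour`, `minute`, and `second` attributes.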

Are there any good first issues? (also, HacktoberFest2022)

Hi! Last year I had a great experience contributing to MindsDB's repos during the HacktoberFest.
This year I am not competing in MindsDB's own competition as I have less free time, but I am still looking for some cool, small issues I could help you with (and get that 2022 shirt :) ).

I was wondering if type_infer had any good first issues that could be labelled with the hacktoberfest label.

I have been writing a small typechecker for pandas + numpy + sklearn on my own, mainly to avoid losing my mind while refactoring some pipelines for Kaggle competitions, so this project looks right up my alley.

Fix warning in division by log

Stack trace:

RuntimeWarning: invalid value encountered in double_scalars
[45](...)
  randomness_per_index.append(S / np.log(N))
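One way this warning can arise is when `N` is 1: `np.log(1)` is 0, and dividing a zero `S` by it gives an invalid (NaN) result. The sketch below guards that case. It assumes `S` is an entropy-like score and `N` a count of distinct symbols, which the stack trace does not confirm, and the function name `normalized_randomness` is hypothetical.

```python
import numpy as np


def normalized_randomness(S: float, N: int) -> float:
    """Hypothetical guard around S / np.log(N)."""
    # With at most one distinct symbol there is no randomness to normalise,
    # and np.log(N) would be 0 (or undefined), so return 0.0 directly.
    if N <= 1:
        return 0.0
    return float(S / np.log(N))


randomness_per_index = [
    normalized_randomness(0.0, 1),        # degenerate case, no warning raised
    normalized_randomness(np.log(4), 4),  # maximal entropy over 4 symbols
]
```

Returning 0.0 for the degenerate case is a design choice made for this sketch; skipping the index or returning NaN explicitly would be equally defensible.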

Missing coverage report

This issue is related to this one: mindsdb/mindsdb#3504

I wanted to improve the tests but wasn't sure what was and wasn't being tested, so I added a step to the CI that shows the report, so we can start to measure and improve the code coverage.

Improve tag detection system

The current rule-based approach is failing to recognize a tag-filled column when it contains a large number of unique tags.

Additionally, it should be robust to additional separators or formatting, like brackets and quotes.
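As an illustration of the robustness asked for here, the following sketch splits a cell into tags while tolerating surrounding brackets, quotes, and several separators. It is not the current rule-based detector: the function name `split_tags` and the default separator set are assumptions.

```python
import re
from typing import List


def split_tags(cell: str, separators: str = ",;|") -> List[str]:
    """Hypothetical helper: split one cell into clean tag strings."""
    # Drop enclosing brackets such as [...], (...) or {...}.
    stripped = cell.strip().strip("[](){}")
    # Split on any single character from the separator set.
    parts = re.split("[" + re.escape(separators) + "]", stripped)
    # Remove whitespace and surrounding quotes from each tag.
    tags = [part.strip().strip("'\"").strip() for part in parts]
    return [tag for tag in tags if tag]


list_style = split_tags("['red', 'blue', \"green\"]")
mixed_style = split_tags("a; b | c")
```

A detector could then look at how often tags repeat across rows rather than at the number of unique tags alone, though that part is left out of this sketch.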

Improve docs

Right now, the documentation is in a very preliminary and rough state.

Any help towards making the codebase and the purpose of the package easier to understand is greatly appreciated.

Migrate to py3langid

py3langid should be quite a bit faster, and hopefully faster to install too. Still, we need to check that its behavior is equal (or very similar) to decide whether the move is worth it.

Issue handling extremely small values or NaN

Getting the following error when loading context data for the Writer Handler demo. My hypothesis is that this is happening because some columns contain many nulls, which pandas handles as NaN.

python3.9/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
python3.9/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
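NumPy emits both of these warnings when a mean is taken over an empty array, which would be consistent with a column that is entirely null once NaN values are dropped. The sketch below checks for that case first. The helper name `safe_mean` is hypothetical, and whether type_infer reaches `np.mean` this way is an assumption.

```python
import numpy as np
import pandas as pd


def safe_mean(column: pd.Series) -> float:
    """Hypothetical helper: mean of the numeric values, NaN if there are none."""
    values = pd.to_numeric(column, errors="coerce").dropna().to_numpy()
    # An all-null column leaves an empty array; return NaN explicitly
    # instead of letting the mean computation warn about an empty slice.
    if values.size == 0:
        return float(np.nan)
    return float(values.mean())


all_null = safe_mean(pd.Series([None, None]))
partly_null = safe_mean(pd.Series([1.0, None, 3.0]))
```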
