risenw / datasist



A Python library for easy data analysis, visualization, exploration and modeling

Home Page: https://risingodegua.gitbook.io/datasist-doc/

License: MIT License

Python 9.46% Jupyter Notebook 90.54%

data-analysis data-science data-visualization feature-engineering machine-learning python-3

datasist's Introduction

Hi there 👋

My name is Rising Odegua. I am a Full Stack Software Engineer. I combine my knowledge of software and data science to build data-driven products that solve problems.

Strong Languages: JavaScript, Python, and TypeScript

🔭 I’m currently building and maintaining open source data tools like Danfojs, Dnotebook, and Datasist.
👯 I’m looking to collaborate on open source tools for data science and machine learning.
💬 Ask me about OSS, Software Engineering, Machine Learning and Data Science.
📫 Connect with me: LinkedIn.

Latest Updates:

Launched a new product https://www.heyy.ink
Released https://www.randomabcs.com/
I co-authored a book titled "Building Data Driven Applications With Danfo.js"
I became a GitHub Star
I became a Google Developer Expert in Machine Learning

My latest writings are:

See more of my technical articles on Medium
Learn more about me on my website

datasist's People

Contributors

Stargazers

Watchers

Forkers

vishwaak nelsonchris1 datadeus abdulkereem dehbaiyor nwaobasianthony marquisvictor mikkybang drseyi ifedave olawale0254 codebrain001 aminuisrael jatojoseph justus-coded popoolaio jimohafeezco olaniyiajayi2 jamiu-tijani lilytechie emmarex tosi-n stobasa lpmatrix deewourne aighana basifrank abofficial444 aligapaul fiyinfoluwa6 ifyokoh hoanganhngo610 theoyinbooke emekaborisama sridhar0605 mihael147work thedynamicfemi josh4324 e-stat obasoro shakedzy abdulazeezade paschmaria ibrahimygana lasafabiola rexsimiloluwah adeyinka-hub kennyrich opeyemibami cemeiq julisam zalihat ezekielolugbami iretex abv-hub cumeadi olawaleibrahim poarnold priyankadiddi jamessandy sakshi-agarwal8 wywise colima thayeylolu fagan2888 ayobamiakomolafe sherlocked-blaire olamide100 invest41 ormigi anand268 indianvalantine 1stscience eskayml sarahazzabi sailfish009

datasist's Issues

Visualization

Plotting a confusion matrix
Plotting scatter plots with up to 5 legends
Plotting the ROC curve

Add a function to timeseries.py, which displays crypto coin market data as a Candlestick Chart

Utilizing Plotly to make a "Coin-USD" market data Candlestick plot

Required Packages

✅ - signifies packages currently in requirements.txt

  Default function - get_crypto_visuals
  —————————––––––––––––––––

def get_crypto_visuals(coin, period='5d', interval='15m', MA=False, days=[7, 25, 99], boll=False, boll_sma=25, save_fig=False, img_format='png'):
    """
    ----------------------SKIPPING PARAMETER DESCRIPTION TO END OF DOCSTRING--------------------

    Examples of valid use
    ----------------

    >>> get_crypto_visuals("ETH")

    >>> get_crypto_visuals("ETH", MA=True)

    >>> get_crypto_visuals("ETH", period="5d", interval="15m", MA=True, days=[5, 20], save_fig=True)

    >>> get_crypto_visuals("ETH", period="5d", interval="15m", boll=True, boll_sma=26, save_fig=True, img_format='jpeg')
    """

    -------Main Code goes here-------
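
The MA, days, boll and boll_sma parameters suggest moving-average and Bollinger-band overlays on the candlestick chart. A minimal sketch of how those series might be computed with pandas is given here. The helper name add_overlays, the output column names, and the band width of two standard deviations are assumptions for illustration and are not taken from the proposal.

```python
import pandas as pd

def add_overlays(close, days=(7, 25, 99), boll_sma=25, num_std=2.0):
    """Return a DataFrame with moving averages and Bollinger bands for a price Series."""
    out = pd.DataFrame({"close": close})
    # One simple moving average per requested window length.
    for d in days:
        out[f"ma_{d}"] = close.rolling(window=d).mean()
    # Bollinger bands: rolling mean plus/minus num_std rolling standard deviations.
    # pandas uses the sample standard deviation (ddof=1) by default.
    mid = close.rolling(window=boll_sma).mean()
    std = close.rolling(window=boll_sma).std()
    out["boll_upper"] = mid + num_std * std
    out["boll_lower"] = mid - num_std * std
    return out
```

The first window - 1 rows of each overlay are NaN because a full window is not yet available.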

End Result Snippet

ValueError: Cannot take a larger sample than population when 'replace=False'

The to_date function in the feature_engineering module raises the above error when a dataset has fewer than 20 rows. to_date calls get_date_cols in the structdata module, which in turn calls the private function _match_date (also in structdata). _match_date returns the list of columns that match the DateTime expression, and its default sample size is 20. Although users rarely work with datasets of fewer than 20 rows, it would be nice if sample_size were set to the number of rows when the dataset has fewer than 20 rows, and left at the default otherwise.
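
The proposed fix amounts to clamping the sample size to the number of rows. A minimal sketch of that idea follows; safe_sample is a hypothetical helper, not the actual _match_date implementation.

```python
import pandas as pd

def safe_sample(df, sample_size=20, random_state=None):
    """Sample at most sample_size rows without replacement."""
    # Clamp to the row count so small frames do not raise ValueError.
    n = min(sample_size, len(df))
    return df.sample(n=n, random_state=random_state)
```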

Add a log transformation feature

A feature that helps normalize data using a log transformation and also visualizes the dataset to show its skewness.
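
One possible shape for such a feature is sketched here, without the plotting part. The function name log_transform and the choice of log1p (which handles zeros) are assumptions.

```python
import numpy as np
import pandas as pd

def log_transform(series):
    """Return the log1p-transformed Series plus skewness before and after."""
    # This sketch only accepts non-negative values.
    if (series < 0).any():
        raise ValueError("log_transform expects non-negative values")
    transformed = np.log1p(series)
    return transformed, series.skew(), transformed.skew()
```

Comparing the two skewness values shows how much the transformation reduced the asymmetry of the distribution.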

Function to train and return popular performance metrics of a regression model.

A function like train_classifier in the modeling module that accepts a regression model and displays some of the most popular regression metrics. Helps in model comparison.
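
A minimal sketch of what such a function could look like. The name train_regressor, its signature, and the chosen metrics are assumptions; it relies only on the model having scikit-learn style fit and predict methods.

```python
import math

def train_regressor(model, x_train, y_train, x_val, y_val):
    """Fit a model with fit/predict methods and return common regression metrics."""
    model.fit(x_train, y_train)
    preds = list(model.predict(x_val))
    actual = list(y_val)
    n = len(actual)
    errors = [a - p for a, p in zip(actual, preds)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(actual) / n
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    # R^2 is undefined when the validation targets are constant.
    r2 = 1 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}
```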
@Olaniyiajayi

Function that compares multiple algorithms

A function that accepts a list of models, then trains and compares them according to a given metric. The result should be displayed as a plot, for easy reading.
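
The comparison step could be sketched as follows, leaving the plot out. The name compare_models, the use of a name-to-model dict instead of a plain list, and the scorer callable are assumptions.

```python
def compare_models(models, x_train, y_train, x_val, y_val, scorer):
    """Fit each named model and return (name, score) pairs, best first."""
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = scorer(y_val, model.predict(x_val))
    # Higher is assumed to be better; reverse the sort for error metrics.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The returned pairs could then be passed to a bar chart with the metric name on the y-axis.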

Label the y-axis in the compare_model function

The y-axis in the plot produced by compare_model does not display the metric being plotted.

Add a function that splits a data set into test and train subsets
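
A minimal sketch of such a function using pandas only. The name split_train_test and its parameters are assumptions, and the sketch expects the DataFrame index to be unique.

```python
import pandas as pd

def split_train_test(df, test_size=0.2, random_state=None):
    """Randomly split a DataFrame into train and test subsets."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    test = df.sample(frac=test_size, random_state=random_state)
    # Everything not drawn for the test set goes to training.
    train = df.drop(index=test.index)
    return train, test
```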

Text Preprocessing

Create a function to clean up a text dataset.
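
As a rough illustration, a cleaning function for a single string could look like this. The name clean_text and the specific cleaning steps are assumptions about what "clean up" might cover.

```python
import re
import string

def clean_text(text):
    """Lowercase, strip URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    # Remove URLs before punctuation so they disappear as whole tokens.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```

For a DataFrame column of strings, the function could be applied with df[col].apply(clean_text).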

Move Get_date_info to feature engineering

Get_date_info is a feature engineering step, so I propose moving it to the feature engineering module.

Unable to extract categorical features from dataframe using datasist

I once used datasist to extract both numerical and categorical features successfully. When I re-ran my code, only numerical features were extracted and the categorical features were blank: it reported all 69 features in the dataframe as numerical. Before the re-run it worked correctly, showing numerical and categorical features separately, which is how I was able to do feature engineering for my test dataset in the first place. Now that I want to apply the same feature engineering to my train dataset, it extracts only numerical features and leaves the categorical features blank. Checking with "dtype" confirms that there are many object (i.e. categorical) features in both the train and test datasets. In fact, only 25 of the 69 features are numerical; the others are categorical.
What could be the problem? I ran my code on both a Kaggle notebook and Google Colab, and the problem persists.
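
One way to check the split independently of datasist is to derive both lists directly from the pandas dtypes. This is a diagnostic sketch with a hypothetical helper name, not datasist's own logic.

```python
import pandas as pd

def split_feature_types(df):
    """Return (numerical, categorical) column name lists based on dtypes."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    # Every non-numeric column (object, category, bool, datetime) is treated as categorical here.
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical
```

If this reports the expected 25 numerical columns, the data itself is fine and the difference lies in how the features were extracted.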

Exploratory Data Analysis in one line of code

In regard to the vision behind Datasist, I dislike always having to run EDA functions in pandas one after the other, like df.describe, df.shape, df.dtypes and so on. I would like to have it all in one line of code.
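
A minimal sketch of such a one-call summary. The name quick_eda and the set of summaries it collects are assumptions.

```python
import pandas as pd

def quick_eda(df):
    """Collect common first-look summaries of a DataFrame into one dict."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes,
        "missing": df.isna().sum(),
        "describe": df.describe(include="all"),
    }
```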

ModuleNotFoundError: No module named "Joblib"

The Python kernel keeps raising the error No module named "Joblib" after datasist is installed from the Anaconda command prompt with "pip install datasist".

Describe function in structdata throws an error if there are no categorical features in the dataset

The describe function should first check whether a DataFrame has categorical features before trying to describe them, to avoid an error during execution.
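
The guard could be sketched like this. describe_safely is a hypothetical stand-in, not the structdata implementation.

```python
import pandas as pd

def describe_safely(df):
    """Describe numerical columns, and categorical ones only if any exist."""
    summaries = {}
    numerical = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    if numerical.shape[1] > 0:
        summaries["numerical"] = numerical.describe()
    # Skip the categorical summary instead of raising when there is nothing to describe.
    if categorical.shape[1] > 0:
        summaries["categorical"] = categorical.describe()
    return summaries
```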

Add Tests

Tests should be added for all functionality to prevent breakage in the future. A CI pipeline can be used to run PRs against the tests to ensure they do not break any existing functionality.

Add function to get the top two rows of a dataframe

Dropping missing values

----> 4 df=ds.drop_missing(data=data, percent=80)

AttributeError: module 'datasist' has no attribute 'drop_missing'

So is this function not available?
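
The error only says that drop_missing is not an attribute of the top-level datasist module. Where the function is exposed is not stated here. As a stopgap, the same effect can be sketched directly in pandas; drop_missing_columns is a hypothetical name, and dropping columns (not rows) is an assumption about the intended behavior.

```python
import pandas as pd

def drop_missing_columns(df, percent=80):
    """Drop columns whose share of missing values is at least `percent`."""
    missing_pct = df.isna().mean() * 100
    keep = missing_pct[missing_pct < percent].index
    return df[keep]
```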

Create an NLP module

An NLP module that helps with data preprocessing

Add a method that detects rows with outliers
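
One common way to do this is the interquartile range (IQR) rule, sketched here. The issue does not name a detection method, so the IQR rule, the function name, and the 1.5 factor are assumptions.

```python
import pandas as pd

def rows_with_outliers(df, factor=1.5):
    """Return rows where any numerical value lies outside the IQR fences."""
    numerical = df.select_dtypes(include="number")
    q1 = numerical.quantile(0.25)
    q3 = numerical.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it falls below q1 - factor*iqr or above q3 + factor*iqr.
    mask = (numerical < q1 - factor * iqr) | (numerical > q3 + factor * iqr)
    return df[mask.any(axis=1)]
```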

Add function in nlp.py, for extracting keywords from text.
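
A minimal frequency-based sketch of keyword extraction. The function name, the tiny stop-word list, and the use of raw term counts are assumptions; the issue does not specify a method.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "on"}

def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```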

Add a function that plots features against the target to visualize their relationship, using a dropdown to select the particular feature

This function enables visualization of the relationship between numerical features and the target, mainly for linear regression, by providing a dropdown that lets the user select a particular feature.

An example usage will be

def features_plot(df, target):
    # code here
    return widgets

Structdata's "describe" function throws an error when there are no categorical features in the DataFrame

This can be fixed by first testing for object types in the DataFrame

Add function to calculate mean squared distance

This function will calculate the MAE of a Series and return a single value

An example usage will be

def calculate_mae(Series):
    # calculate stuff here
    return 20