kaggle / learntools
Tools and tests used in Kaggle Learn exercises
License: Apache License 2.0
The instructions for competition submission (specifically steps 3 and 4) in https://www.kaggle.com/ryanholbrook/feature-engineering-for-house-prices are out of date: the submission process has been simplified, and the ellipsis menu now lets you select Submit to Competition directly.
This will need to be updated in the learntools macro as well.
Description:
In the Data Visualization: from Non-Coder to Coder tutorial, in the first exercise (Exercise: Hello, Seaborn), step_3.check() for the third step (Review the data) does not work.
step_4.a.check() and step_4.b.check() also do not work correctly; if they are changed to step_3.a.check() and step_3.b.check(), they work. step_3 still points to the fourth step, Plot the data.
I think the problem was introduced by this commit, which as far as I can see adds the "Review the data" step.
What should be done:
Checking, hinting, and solving should work correctly in every step of the tutorial exercises.
Hi,
I have a question about cross-validation. As I understand it, cross-validation evaluates machine learning models on randomly drawn validation sets, so I think it cannot be applied when a competition provides a static test set, because submitted models must be evaluated on the same test data to be comparable. Please correct me if I am wrong.
Thank you.
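For what it's worth, the two serve different purposes: cross-validation partitions your training data to estimate generalization (for model selection and tuning), while the competition's fixed test set is still used once, identically for every submission, for the final score. A minimal sketch with scikit-learn on synthetic stand-in data (the dataset and model here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a competition's *training* data.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Cross-validation repeatedly holds out one fold of the training data,
# fits on the rest, and scores on the held-out fold.  The competition's
# fixed test set is untouched by this and still scores every submission
# on exactly the same data.
model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(len(scores), scores.mean())
```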
learntools/learntools/python/blackjack.py
Line 58 in 3d21e2e
should be:
if (tot + 9) <= 21:
because 1 point has already been added to tot for aces.
Example: I already have 11 points without an ace and I hit an ace. tot must gain 10 points (1 + 9), not 11 (1 + 10).
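For reference, a common way to score a blackjack hand is to count every ace as 1 while summing and then add the extra 10 for one ace only if the hand stays at 21 or under; whether the guard at line 58 should add 9 or 10 depends on what tot already includes at that point. A minimal sketch with illustrative names, not the lesson's actual code:

```python
def hand_total(cards):
    """Total a blackjack hand; number cards are ints, aces are 'A'."""
    tot = 0
    aces = 0
    for card in cards:
        if card == 'A':
            tot += 1          # every ace starts out counted as 1 point
            aces += 1
        else:
            tot += card
    # Upgrading one ace from 1 to 11 adds 10 more points; this guard
    # compares against a total that already includes that ace's 1.
    if aces and tot + 10 <= 21:
        tot += 10
    return tot

print(hand_total([6, 5, 'A']))  # 12: upgrading would bust, ace stays 1
print(hand_total([5, 'A']))     # 16: ace upgrades to 11
```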
Hey there! The schema of the hacker news dataset seems to have changed and the “by” field is no longer first. The tutorial may need to be updated to reflect this.
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  yield yaml.load(f)
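This warning comes from PyYAML 5.1+; the usual fix is to pass an explicit loader, either yaml.safe_load or Loader=yaml.SafeLoader. A small sketch:

```python
import yaml

doc = "name: learntools\nversion: 2"

# Explicit loader: SafeLoader only constructs plain Python objects
# (dicts, lists, scalars), avoiding the unsafe default.
data = yaml.load(doc, Loader=yaml.SafeLoader)
# Equivalent shorthand
data2 = yaml.safe_load(doc)
print(data)
```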
After implementing the function select_features_l1, the error below pops up in the 'train_model' section. I'm not too familiar with Kaggle's Python notebooks and not sure whether there's a way to browse defined variables as in VS Code. I explored the source a little, and I see the selected variable IS defined in the check function, but in the current notebook it is failing.
If you run step_5.solution() you get:
def evaluate(model, texts, labels):
    # Get predictions from textcat model
    predicted_class = predict(model, texts)
    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(each['cats']['POSITIVE']) for each in labels]
    # A boolean or int array indicating correct predictions
    correct_predictions = predicted_class == true_class
    # The accuracy, number of correct predictions divided by all predictions
    accuracy = correct_predictions.mean()
    return accuracy
but if you use that and run step_5.check(), it says incorrect. I can't find an answer to this question that passes step_5.check(), even though many solutions seem correct and run correctly in the next step.
The hint says: "Hint: use a map and the argmax function."
However, the documentation and the current answer use idxmax.
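For context, on a pandas Series idxmax returns the index label of the maximum value (NumPy's argmax returns a position instead), which is presumably what the exercise expects. A small illustration on invented data:

```python
import pandas as pd

points = pd.Series({'apple': 3, 'banana': 9, 'cherry': 5})

# idxmax returns the index *label* of the row holding the maximum value
best = points.idxmax()
print(best)  # banana

# the map-then-idxmax pattern the hint gestures at:
doubled = points.map(lambda p: p * 2)
print(doubled.idxmax())  # still banana
```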
Is it possible to use this tool for a learning course that isn't deployed on Kaggle Learn?
Imagine I want to create my own Python course and share it with some friends. Is that possible? I would still run the notebooks on Kaggle.
When I ran the first cell I got the error:
│ exit code: 128 ╰─> See above for output.
Later, q_3.check() gave:
TypeError: only size-1 arrays can be converted to Python scalars
and q_4.check() gave:
IndexError: list index out of range
check() for 3b and 4b is absent.
I am working through the Deep Learning course on Kaggle and ran into a problem with the Transfer Learning exercise (raw notebook here). It instructs:
Your training data is in the directory ../input/dogs-gone-sideways/train. The validation data is in ../input/dogs-gone-sideways/val. Use that information when setting up train_generator and validation_generator.
But using these directories didn't work for me. I wonder whether the way the data is stored on Kaggle has changed since this lesson was written. Paths that do work seem to be:
../input/dogs-gone-sideways/images/train
../input/dogs-gone-sideways/image/val
With these directories the code runs fine, but the checking code complains that I am using the wrong directories:
git clone https://github.com/Kaggle/learntools.git
D:\Projects-intellij\machine-learning-course\kaggle\learntools>python ex1.py
Traceback (most recent call last):
File "ex1.py", line 4, in <module>
from learntools.pandas.creating_reading_and_writing import *
File "D:\Projects-intellij\machine-learning-course\kaggle\learntools\learntools\pandas\creating_reading_and_writing.py", line 49, in <module>
class ReadWineCsv(EqualityCheckProblem):
File "D:\Projects-intellij\machine-learning-course\kaggle\learntools\learntools\pandas\creating_reading_and_writing.py", line 54, in ReadWineCsv
_expected = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv', index_col=0)
File "C:\Users\OEM\Miniconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\OEM\Miniconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\OEM\Miniconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "C:\Users\OEM\Miniconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\OEM\Miniconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'../input/wine-reviews/winemag-data_first150k.csv' does not exist: b'../input/wine-reviews/winemag-data_first150k.csv'
Where can I find instructions on how to setup datasets for the course?
build\bdist.win-amd64\egg\learntools\computer_vision\ex5.py:81: SyntaxWarning: "is" with a literal. Did you mean "=="?
  assert (activations[0] is 'relu' and activations[1] is 'relu'),
byte-compiling build\bdist.win-amd64\egg\learntools\computer_vision\ex6.py to ex6.cpython-39.pyc
byte-compiling build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py to visiontools.cpython-39.pyc
build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py:40: SyntaxWarning: "is" with a literal. Did you mean "=="?
if type is 'binary':
build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py:42: SyntaxWarning: "is" with a literal. Did you mean "=="?
elif type is 'sparse':
build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py:302: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if layer.__class__.__name__ is 'Conv2D']
build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py:460: SyntaxWarning: "is" with a literal. Did you mean "=="?
if fill_method is 'replicate':
build\bdist.win-amd64\egg\learntools\computer_vision\visiontools.py:464: SyntaxWarning: "is" with a literal. Did you mean "=="?
elif fill_method is 'reflect':
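These warnings point at real latent bugs: is tests object identity, and two equal strings are only sometimes the same object (a CPython interning detail), so all of these comparisons should use ==. A quick demonstration:

```python
a = 'relu'
b = ''.join(['re', 'lu'])   # an equal string built at runtime

print(a == b)   # True: value equality, what the asserts intend
print(a is b)   # False in CPython: b is a distinct, non-interned object
# so a check like `activations[0] is 'relu'` should be `activations[0] == 'relu'`
```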
While taking the Pandas course I hit an error:
import pandas as pd
pd.set_option('max_rows', 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.creating_reading_and_writing import *
print("Setup complete.")
WARNING:root:Ignoring repeated attempt to bind to globals
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-7-1cf6d6f127c2> in <module>
2 pd.set_option('max_rows', 5)
3 from learntools.core import binder; binder.bind(globals())
----> 4 from learntools.pandas.creating_reading_and_writing import *
5 print("Setup complete.")
ModuleNotFoundError: No module named 'learntools.pandas'
Hi,
I'm not sure whether this is the right place to give feedback; I couldn't find a suitable way over on Kaggle.
Anyway, I started working through the Data Visualization: From Non-Coder to Coder micro-course on Kaggle and get the following error in the first exercise:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-15ba42341748> in <module>()
2 brazil_rank = 3.0
3 # Check your answer
----> 4 step_3.check()
AttributeError: 'MultipartProblem' object has no attribute 'check'
I think there is a mismatch between the version hosted on Kaggle and the current version of this repository (step 3 on GitHub is plotting the data; on Kaggle it's reviewing the data).
By the way, I really like the learntools idea :)
price_extremes = reviews.groupby('variety').price.agg([min, max])
TypeError Traceback (most recent call last)
<ipython-input-81-1ee08b4f09ca> in <module>
1 #q3.hint()
2 q3.solution()
----> 3 price_extremes = reviews.groupby('variety').price.agg([min, max])
/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
849 # but not the class list / tuple itself.
850 func_or_funcs = _maybe_mangle_lambdas(func_or_funcs)
--> 851 ret = self._aggregate_multiple_funcs(func_or_funcs, (_level or 0) + 1)
852 if relabeling:
853 ret.columns = columns
/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/generic.py in _aggregate_multiple_funcs(self, arg, _level)
916 for name, func in arg:
917 obj = self
--> 918 if name in results:
919 raise SpecificationError(
920 "Function names must be unique, found multiple named "
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __hash__(self)
1884 raise TypeError(
1885 "{0!r} objects are mutable, thus they cannot be"
-> 1886 " hashed".format(self.__class__.__name__)
1887 )
1888
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Exercise: Grouping and Sorting - point 3
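One likely cause is that an earlier notebook cell rebound min or max to a Series; passing the aggregation names as strings sidesteps the shadowing and works across pandas versions. A sketch on invented data:

```python
import pandas as pd

reviews = pd.DataFrame({
    'variety': ['Pinot Noir', 'Pinot Noir', 'Riesling'],
    'price': [20.0, 35.0, 15.0],
})

# String names can't be shadowed by notebook variables the way the
# builtins min/max can.
price_extremes = reviews.groupby('variety').price.agg(['min', 'max'])
print(price_extremes)
```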
notebooks/deep_learning/raw/ex3_programming_tf_and_keras.ipynb
2) Run an Example Model
from IPython.display import Image, display
from learntools.deep_learning.decode_predictions import decode_predictions
import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.preprocessing.image import load_img, img_to_array
image_size = 224
def read_and_prep_images(img_paths, img_height=image_size, img_width=image_size):
    imgs = [load_img(img_path, target_size=(img_height, img_width)) for img_path in img_paths]
    img_array = np.array([img_to_array(img) for img in imgs])
    output = preprocess_input(img_array)
    return output
my_model = ResNet50(weights='../input/resnet50/resnet50_weights_tf_dim_ordering_tf_kernels.h5')
test_data = read_and_prep_images(img_paths)
preds = my_model.predict(test_data)
most_likely_labels = decode_predictions(preds, top=3)
ValueError: Shapes (1, 1, 256, 512) and (512, 128, 1, 1) are incompatible
PR
While following along with the example locally, I got the same error. After some googling and trial and error, I finally got it to work with the following import while working with the dog files and weights:
from keras.applications.resnet50 import ResNet50
my_model = ResNet50(weights='./pre-trained/resnet50/resnet50_weights_tf_dim_ordering_tf_kernels.h5')
Please note that I'm just a beginner with python and tensorflow, so if there is a better fix, please let me know!
Hey, I was just going through the pandas tutorial and wasn't sure whether this is a typo or whether I'm misunderstanding something.
learntools/notebooks/pandas/raw/tut_1.ipynb
Line 279 in 886f5c2
it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples', 'Potatoet] (t coming after s in the alphabet)
Should the second code snippet be df.iloc['Apples':'Potatoes']? It was just explained that df.iloc[0:10] gives you indices 0, ..., 9 but df.loc[0:10] gives you indices 0, ..., 10; and then I wasn't sure how they got df.loc['Apples', 'Potatoet].
The setup imports are written to import hacker_news, while the dataset is actually hacker-news, which may be the cause of this.
When I did my work, the check says incorrect even when I paste the solution.
Please remove.
In the fourier_features() example algorithm in the Seasonality lesson of the Time Series course, the variable name freq is given to a parameter that takes units of days/cycle. This confused me at first because, unless I'm mistaken, frequency typically refers to measurements in inverse units (cycles/day), whereas period refers to time per cycle.
Whenever I try to run the first cell of the Time Series as Features exercise, I get this error:
Collecting git+https://github.com/Kaggle/learntools.git
Cloning https://github.com/Kaggle/learntools.git to /tmp/pip-req-build-65_z7vlm
Running command git clone --filter=blob:none -q https://github.com/Kaggle/learntools.git /tmp/pip-req-build-65_z7vlm
fatal: unable to access 'https://github.com/Kaggle/learntools.git/': Could not resolve host: github.com
WARNING: Discarding git+https://github.com/Kaggle/learntools.git. Command errored out with exit status 128: git clone --filter=blob:none -q https://github.com/Kaggle/learntools.git /tmp/pip-req-build-65_z7vlm Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone --filter=blob:none -q https://github.com/Kaggle/learntools.git /tmp/pip-req-build-65_z7vlm Check the logs for full command output.
https://www.kaggle.com/alexisbcook/distributions
This is a great course, but the distplot examples should be replaced with displot or histplot, since distplot now raises deprecation warnings.
Hi,
I found incorrect assert messages.
In this line, "reduced_X_train" should be replaced with "reduced_X_valid".
learntools/learntools/ml_intermediate/ex2.py
Line 113 in 1c59223
Also, "imputed_X_train" should be replaced with "imputed_X_valid".
For Exercise: Machine Learning Competitions, the train data file path should be
iowa_file_path = '../input/train.csv'
The current path is not working:
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
I'm not making a PR because I'm not sure whether the path should be corrected or the files should be placed at that path.
In the last code cell
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
This shows
MAE from Approach 3 (One-Hot Encoding):
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
17525.345719178084
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
Adding these lines to Step 4 fixes it:
OH_cols_train.columns = list(map(str, OH_cols_train.columns))
OH_cols_valid.columns = list(map(str, OH_cols_valid.columns))
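The warning arises because pd.concat mixes the one-hot frame's integer column names with the original string names; a minimal demonstration of the fix (data invented):

```python
import pandas as pd

num_cols = pd.DataFrame({'LotArea': [8450, 9600]})
oh_cols = pd.DataFrame([[1, 0], [0, 1]])      # integer column names 0 and 1

X = pd.concat([num_cols, oh_cols], axis=1)    # mixed str/int names trip sklearn
X.columns = list(map(str, X.columns))         # cast every name to str
print(list(X.columns))  # ['LotArea', '0', '1']
```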
Bug in the following file: https://github.com/Kaggle/learntools/blob/master/learntools/time_series/utils.py
The function doesn't create the leads correctly. For example, with leads set to 1, it just copies the existing columns under a new column named {name}_lead_0. See the attached screenshot for an example.
def make_leads(ts, leads, name='y'):
    return pd.concat(
        {f'{name}_lead_{i}': ts.shift(-i)
         for i in reversed(range(leads))},
        axis=1)
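A plausible fix, not an official patch, is to start the range at 1 so the smallest lead actually looks one step ahead instead of duplicating the series:

```python
import pandas as pd

def make_leads(ts, leads, name='y'):
    # range(1, leads + 1): the smallest lead looks one step ahead,
    # so no column is just a copy of the input series.
    return pd.concat(
        {f'{name}_lead_{i}': ts.shift(-i)
         for i in reversed(range(1, leads + 1))},
        axis=1)

ts = pd.Series([1, 2, 3, 4], name='y')
out = make_leads(ts, leads=2)
print(list(out.columns))  # ['y_lead_2', 'y_lead_1']
```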