kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Home Page: https://kedro.org

License: Apache License 2.0

Makefile 0.15% Gherkin 0.91% Python 98.29% Shell 0.65%
experiment-tracking hacktoberfest kedro machine-learning machine-learning-engineering mlops pipeline python

kedro's People

Contributors

921kiyo, ahdrameraliqb, andrii-ivaniuk, ankatiyar, antonymilne, astrojuanlu, carolinemlynch, datajoely, deepyaman, dependabot[bot], dimeds, dmitriideriabinqb, idanov, ignacioparicio, jiriklein, jmholzer, limdauto, lorenabalan, lrcouto, merelcht, mzjp2, nakhan98, noklam, sajidalamqb, stichbury, tolomea, tsanikgr, tynandebold, waylonwalker, yetudada


kedro's Issues

[KED-921] Dataset class to save matplotlib figures to images locally

Description

A dataset class, MatplotlibWriter, to save matplotlib figures/plt objects to image files locally. I currently cannot think of a use case for reading images back into matplotlib, so I have not included any support for load.

Context

In the process of documenting features, programmatically generating plots of all features in a standardised way and embedding links to them in .md files has been a big time-saver.

Possible Implementation

# Copyright 2018-2019 QuantumBlack Visual Analytics Limited. Licensed under the Apache License, Version 2.0.


"""
``AbstractDataSet`` implementation to save matplotlib objects as image files.
"""

import os.path
from typing import Any, Dict, Optional

from kedro.io import AbstractDataSet, DataSetError, ExistsMixin


class MatplotlibWriter(AbstractDataSet, ExistsMixin):
    """
        ``MatplotlibWriter`` saves matplotlib objects as image files.

        Example:
        ::

            >>> import matplotlib.pyplot as plt
            >>> from kedro.contrib.io.matplotlib import MatplotlibWriter
            >>>
            >>> plt.plot([1,2,3],[4,5,6])
            >>>
            >>> single_plot_writer = MatplotlibWriter(filepath="docs/new_plot.png")
            >>> single_plot_writer.save(plt)
            >>>
            >>> plt.close()
            >>>
            >>> plots = dict()
            >>>
            >>> for colour in ['blue', 'green', 'red']:
            >>>     plots[colour] = plt.figure()
            >>>     plt.plot([1,2,3],[4,5,6], color=colour)
            >>>     plt.close()
            >>>
            >>> multi_plot_writer = MatplotlibWriter(filepath="docs/",
            >>>                                      save_args={'multiFile': True})
            >>> multi_plot_writer.save(plots)

    """

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            load_args=self._load_args,
            save_args=self._save_args,
        )

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        """Creates a new instance of ``MatplotlibWriter``.

        Args:
            filepath: path to an image file, or to a directory when
                ``multiFile`` is True.
            load_args: Currently ignored as loading is not supported.
            save_args: ``multiFile``: allows for multiple plot objects
                to be saved. Additional save arguments can be found at
                https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html
        """
        default_save_args = {"multiFile": False}
        default_load_args = {}

        self._filepath = filepath
        self._load_args = self._handle_default_args(load_args, default_load_args)
        self._save_args = self._handle_default_args(save_args, default_save_args)
        self._multifile_mode = self._save_args.get("multiFile")
        self._save_args.pop("multiFile")

    @staticmethod
    def _handle_default_args(user_args: dict, default_args: dict) -> dict:
        return {**default_args, **user_args} if user_args else default_args

    def _load(self) -> str:
        raise DataSetError("Loading not supported for MatplotlibWriter")

    def _save(self, data) -> None:

        if self._multifile_mode:

            if not os.path.isdir(self._filepath):
                os.makedirs(self._filepath)

            if isinstance(data, list):
                for index, plot in enumerate(data):
                    plot.savefig(
                        os.path.join(self._filepath, str(index)), **self._save_args
                    )

            elif isinstance(data, dict):
                for plot_name, plot in data.items():
                    plot.savefig(
                        os.path.join(self._filepath, plot_name), **self._save_args
                    )

            else:
                plot_type = type(data)
                raise DataSetError(
                    (
                        "multiFile is True but data type "
                        "not dict or list. Rather, {}".format(plot_type)
                    )
                )

        else:
            data.savefig(self._filepath, **self._save_args)

    def _exists(self) -> bool:
        return os.path.isfile(self._filepath)

Possible Alternatives

I thought about writing figures from inside pipeline nodes, but this seemed hacky. Not including the multi-file option would be technically viable, but given how the class would be used in practice, it seems like a required feature.

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Dependency conflict with commonly used AWS libraries

Description

Dependency conflict with PyYAML.
Kedro requires PyYAML>=5.1, <6.0 while various AWS libraries require in the range PyYAML<4.3,>=3.10.

Context

I'm unable to run any kedro project commands due to the dependency conflict.

Steps to Reproduce

  1. pip install kedro sagemaker

Actual Result

(screenshot of the resulting dependency conflict error)

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.14.3
  • Python version used (python -V): 3.7.3
  • Operating system and version: macOS Mojave 10.14.5

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Docker image for Kedro build tools

Description

Not all developers have access to Linux environments.

Context

A Docker image that contains all the required tools would enable any developer to contribute and test in the same environment as the core development team.

Possible Implementation

The addition of a Dockerfile/docker-compose.yml to contrib.

I currently have a bare-bones implementation working that runs the tests (here), but it would need more work to fit it into the current kedro structure.

[KED-922] IO Dataset class to load and save feather files

Description

A dataset class to load and save feather files.

Context

Feather is a useful format for storing large data with quick access: it is backed by C++ bindings, which give very good write and read speeds. The Feather format would be a good choice for intermediate storage files.

Possible Implementation

# Copyright 2018-2019 QuantumBlack Visual Analytics Limited. Licensed under the Apache License, Version 2.0.

"""FeatherLocalDataSet loads and saves data to a local feather format file
(https://github.com/wesm/feather). The
underlying functionality is supported by pandas, so it supports all operations the pandas supports.
"""
from pathlib import Path
from typing import Any, Dict

import pandas as pd

from kedro.io.core import AbstractDataSet, DataSetError, FilepathVersionMixIn, Version

class FeatherLocalDataSet(AbstractDataSet, FilepathVersionMixIn):
"""FeatherLocalDataSet loads and saves data to a local feather file. The
underlying functionality is supported by pandas, so it supports all
allowed pandas options for loading and saving csv files.

Example:
::

    >>> from kedro.contrib.io import FeatherLocalDataSet
    >>> import pandas as pd
    >>>
    >>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
    >>>                      'col3': [5, 6]})
    >>> data_set = FeatherLocalDataSet(filepath="test.feather",
    >>>                                  load_args=None,
    >>>                                  save_args={"index": False})
    >>> data_set.save(data)
    >>> reloaded = data_set.load()
    >>>
    >>> assert data.equals(reloaded)

"""

def _describe(self) -> Dict[str, Any]:
    return dict(
        filepath=self._filepath,
        load_args=self._load_args,
        save_args=self._save_args,
        version=self._version,
    )

def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        version: Version = None,
) -> None:
    """Creates a new instance of ``FeatherLocalDataSet`` pointing to a concrete
    filepath.

    Args:
        filepath: path to a feather file.
        load_args: feather options for loading feather files.
            Here you can find all available arguments:
            https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_feather.html#pandas.read_feather
            All defaults are preserved.
        save_args: None
        version: If specified, should be an instance of
            ``kedro.io.core.Version``. If its ``load`` attribute is
            None, the latest version will be loaded. If its ``save``
            attribute is None, save version will be autogenerated.
    """
    default_save_args = {"index": False}
    default_load_args = {}
    self._filepath = filepath
    self._load_args = (
        {**default_load_args, **load_args}
        if load_args is not None
        else default_load_args
    )
    self._save_args = (
        {**default_save_args, **save_args}
        if save_args is not None
        else default_save_args
    )
    self._version = version

def _load(self) -> pd.DataFrame:
    load_path = self._get_load_path(self._filepath, self._version)
    return pd.read_feather(load_path, **self._load_args)

def _save(self, data: pd.DataFrame) -> None:
    save_path = Path(self._get_save_path(self._filepath, self._version))
    save_path.parent.mkdir(parents=True, exist_ok=True)
    data.to_feather(str(save_path))

    load_path = Path(self._get_load_path(self._filepath, self._version))
    self._check_paths_consistency(
        str(load_path.absolute()), str(save_path.absolute())
    )

def _exists(self) -> bool:
    try:
        path = self._get_load_path(self._filepath, self._version)
    except DataSetError:
        return False
    return Path(path).is_file()


User home directory is not expanded for TextLocalDataSet

Description

User home directory ~ is not automatically expanded for TextLocalDataSet, but it is automatically expanded for ParquetLocalDataSet and CSVLocalDataSet .

Context

Trying to specify file paths relative to the user home directory to simplify interoperability and hand-off of Kedro pipelines between teammates: instead of manually replacing hardcoded absolute paths on each machine or for each user, every user automatically recreates the same directory structure relative to their home directory.

Steps to Reproduce

from kedro.io import TextLocalDataSet
import os


string_to_write = "This will go in a file."

data_set = TextLocalDataSet(filepath="~/code/tmp/new_documentation.md")
data_set.save(string_to_write)

os.path.abspath(data_set._filepath)

Expected Result

~ should be replaced by the user home directory, and the file should be saved relative to it in the subdirectory code/tmp/.

Actual Result

A literal directory ~/code/tmp/ is created inside the current working directory instead.
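A possible fix, sketched below under the assumption that the expansion would happen where TextLocalDataSet stores its filepath (the helper name is illustrative):

import os.path

def _expand_user_path(filepath: str) -> str:
    # Expand a leading "~" (or "~user") to the user's home directory, matching
    # the behaviour already seen with ParquetLocalDataSet and CSVLocalDataSet.
    return os.path.expanduser(filepath)

# e.g. inside the dataset constructor:
# self._filepath = _expand_user_path(filepath)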

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): v0.14.2
  • Python version used (python -V): Python 3.6.8 :: Anaconda, Inc.
  • Operating system and version: MacOS Mojave 10.14.3

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Add complete Kedro plugin example

Description

I'm trying to create a Kedro plugin, and it took me a long time to figure out that a plugin is a separate package that exists outside of any Kedro project. I kept modifying setup.py within my Kedro project, following the JSON example in the documentation, only to find that the kedro to_json command didn't do anything. In retrospect it makes sense that Kedro plugins are separate Python packages, but this is not obvious to new users from the documentation.

Context

New users would benefit from a more complete Kedro plugin example.

Possible Implementation

Add more detail to the Kedro plugin tutorial. Also add a complete plugin example as its own repository, similar to the kedro-examples repo.

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Documentation enhancements (extend FAQs)

Description

Since open-sourcing, a few questions have been raised that an official FAQ could answer neatly, both for re-use and to provide information/education.

Also add some minor tweaks to section headings in the Data Catalog content of the Advanced User Guide.

Context

Adding this change improves the docs

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

[KED-974] Running the pipeline from the last failed node

When you run a pipeline with 100+ nodes and it fails partway through, it is very difficult to identify the nodes that didn't run, and you have to run from the start, which can be very time-consuming. Example: a pipeline needs to complete about 250 tasks and it fails on task 201; there is currently no functionality to resume the pipeline by picking up the remaining and failed tasks from the last run without manually identifying what hasn't run.

Context

This could save a lot of time when a large number of nodes is involved.

Possible Implementation

For the last run, capture the nodes that are left and the one that failed, and provide an option to run from the failed node of the last run.
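As an illustration only (not an existing feature), a sketch of what resuming could look like if the failed node's name were captured from the last run; it assumes the existing Pipeline.from_nodes and runner APIs, and failed_node_name is a hypothetical value recorded by the previous run:

from kedro.runner import SequentialRunner

def resume_from_failure(pipeline, catalog, failed_node_name):
    # Build a sub-pipeline containing the failed node and everything downstream
    # of it, then run only that part. Upstream outputs are assumed to have been
    # persisted by the previous, partially successful run.
    remaining = pipeline.from_nodes(failed_node_name)
    return SequentialRunner().run(remaining, catalog)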

Possible Alternatives


  1. Run the pipeline from the start again after fixing the code in that node.

[KED-944] Request for NetworkXDataSet

Description

When building a pipeline to operate on graph data, users must create their own DataSet implementation for graphs. NetworkX is a popular library for doing graph analytics/manipulation in Python, and I believe built-in support for NetworkX graphs would benefit the community.

Context

My colleagues and I focus on experimenting with different DL/ML techniques applied to graphs. We're looking for a testing harness to support this experimentation (along with a path to production), and I believe Kedro is the ideal candidate. I'm optimistic that out-of-the-box support for NetworkX would aid in Kedro's adoption by academics/researchers in the graph analytics space.

Possible Implementation

I've got a NetworkXDataSet implementation living in the src folder of my projects; I'm currently working on adhering to the Kedro style guidelines, and writing unit tests.

Any/all feedback is welcome, especially pertaining to code standards and design decisions (e.g. what load_args and save_args are available to users etc.)
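For discussion, a minimal sketch of the shape such a dataset could take, using NetworkX's node-link JSON representation (this is not the implementation mentioned above; load_args/save_args and versioning are omitted):

import json

import networkx as nx
from networkx.readwrite import json_graph

from kedro.io import AbstractDataSet


class NetworkXLocalDataSet(AbstractDataSet):
    """Loads and saves a ``networkx`` graph as node-link JSON on local disk."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def _describe(self):
        return dict(filepath=self._filepath)

    def _load(self) -> nx.Graph:
        with open(self._filepath) as f:
            return json_graph.node_link_graph(json.load(f))

    def _save(self, data: nx.Graph) -> None:
        with open(self._filepath, "w") as f:
            json.dump(json_graph.node_link_data(data), f)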

[KED-929] Libcloud Support

Description

At present I make use of libcloud for a consistent interface. It would be great if Kedro also made use of libcloud, so that a consistent interface is guaranteed across all clouds.

Context

Possible Implementation

Make use of libcloud for all cloud automation.

Possible Alternatives

Keep current implementation

Checklist

Include labels so that we can categorise your issue:

  • [Interface standardization ] Add a "Component" label to the issue
  • [ Low ] Add a "Priority" label to the issue

[KED-923] Request for YAMLLocalDataset

Description

YAML file format is apparently not supported by kedro.io.

Context

It is common to load parameters from a YAML file, and Kedro supports this; but parameters are often modified/imputed during initialization, and it is useful to save the modified parameters as part of reporting.

Unfortunately, YAML is apparently not supported by kedro.io according to the lists at https://kedro.readthedocs.io/en/latest/kedro.io.html#data-sets and https://kedro.readthedocs.io/en/latest/kedro.contrib.io.html

Possible Implementation

Add a subclass of kedro.io.AbstractDataSet called YAMLLocalDataset.
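A minimal sketch of what this could look like (a rough idea only, using PyYAML's safe loader/dumper; versioning and load/save args are left out):

from typing import Any

import yaml

from kedro.io import AbstractDataSet


class YAMLLocalDataset(AbstractDataSet):
    """Loads and saves Python objects as local YAML files."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def _describe(self):
        return dict(filepath=self._filepath)

    def _load(self) -> Any:
        with open(self._filepath) as f:
            return yaml.safe_load(f)

    def _save(self, data: Any) -> None:
        with open(self._filepath, "w") as f:
            yaml.safe_dump(data, f, default_flow_style=False)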

[KED-927] IO Dask Dataset class to load and save Parquet files on S3

Description

A dataset class to load and save parquet files on S3 using Dask. Currently, the only options use Spark or pandas, not Dask.

Context

If we are just using Dask for smaller data sizes to deal with parquet files on S3, it would be useful to have an IO class that can read and write parquet files on S3.

Possible Implementation

# Copyright 2018-2019 QuantumBlack Visual Analytics Limited. Licensed under the Apache License, Version 2.0.

"""
``ParquetS3DaskDataSet`` is a data set used to load and save
data using dask to parquet files on S3
"""

from typing import Any, Dict, Optional

import dask.dataframe as dd
from s3fs.core import S3FileSystem

from kedro.io.core import (
    AbstractDataSet,
    DataSetError,
    S3PathVersionMixIn,
)


class ParquetS3DaskDataSet(AbstractDataSet, S3PathVersionMixIn):
    """``ParquetS3DaskDataSet`` loads and saves data to file(s) in S3. It uses s3fs
    to read and write from S3 and pandas to handle the parquet file.

    Example:
    ::
        >>> from kedro.contrib.io.dask_parquet_s3 import ParquetS3DaskDataSet
        >>> import pandas as pd
        >>> import dask.dataframe as dd
        >>>
        >>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
        >>>                      'col3': [5, 6]})
        >>> ddf = dd.from_pandas(data, npartitions=2)
        >>>
        >>> data_set = ParquetS3DaskDataSet(
        >>>                         filepath="temp_folder",
        >>>                         bucket_name="test_bucket",
        >>>                         credentials={
        >>>                             'aws_access_key_id': 'YOUR_KEY',
        >>>                             'aws_secret_access_key': 'YOUR SECRET'},
        >>>                         save_args={"compression": "GZIP"})
        >>> data_set.save(ddf)
        >>> reloaded = data_set.load()
        >>>
        >>> assert ddf.compute().equals(reloaded.compute())
    """

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            bucket_name=self._bucket_name,
            load_args=self._load_args,
            save_args=self._save_args,
        )

    # pylint: disable=too-many-arguments
    def __init__(
        self,
        filepath: str,
        bucket_name: str,
        engine: str = "auto",
        credentials: Optional[Dict[str, Any]] = None,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        """Creates a new instance of ``ParquetS3DaskDataSet`` pointing to concrete
        parquet file(s) on S3.
        Args:
            filepath: Path to parquet file(s)
                parquet collection or the directory of a multipart parquet.

            bucket_name: S3 bucket name.

            credentials: Credentials to access the S3 bucket, such as
                ``aws_access_key_id``, ``aws_secret_access_key``.

            engine: The engine to use, one of: `auto`, `fastparquet`,
                `pyarrow`. If `auto`, then the default behavior is to try
                `pyarrow`, falling back to `fastparquet` if `pyarrow` is
                unavailable.

            load_args: Additional loading options `pyarrow`:
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
                or `fastparquet`:
                https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.ParquetFile.to_pandas

            save_args: Additional saving options for `pyarrow`:
                https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
                or `fastparquet`:
                https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
        """
        default_save_args = {"compression": None}
        default_load_args = {}

        self._filepath = filepath
        self._engine = engine

        self._load_args = (
            {**default_load_args, **load_args}
            if load_args is not None
            else default_load_args
        )
        self._save_args = (
            {**default_save_args, **save_args}
            if save_args is not None
            else default_save_args
        )

        self._bucket_name = bucket_name
        self._credentials = credentials or {}
        self._s3 = S3FileSystem(client_kwargs=self._credentials)

    @property
    def _client(self):
        return self._s3.s3

    def _load(self) -> dd.DataFrame:
        s3_file = f"s3://{self._bucket_name}/{self._filepath}"
        return dd.read_parquet(s3_file)

    def _save(self, data: dd.DataFrame) -> None:
        save_key = self._get_save_path(
            self._client, self._bucket_name, self._filepath
        )

        output_file = f"s3://{self._bucket_name}/{self._filepath}"
        data.to_parquet(output_file)

        load_key = self._get_load_path(
            self._client, self._bucket_name, self._filepath
        )
        self._check_paths_consistency(load_key, save_key)

    def _exists(self) -> bool:
        try:
            load_key = self._get_load_path(
                self._client, self._bucket_name, self._filepath
            )
        except DataSetError:
            return False
        args = (self._client, self._bucket_name, load_key)
        return any(key == load_key for key in self._list_objects(*args))

Possible Alternatives

#38

[KED-973] HDFS3DataSet won't save a dataframe in table mode (Update)

Description

I have a dataframe with categorical variables.

_save() doesn't work with this dataset, since the dataframe is "put" into the store rather than appended to it.

(see this issue in Pandas: pandas-dev/pandas#11843)

I recommend using this:

store.append(key=self._key, value=data, append=False)

rather than:

store[self._key] = data

source: kedro/io/hdf_s3.py - line 166:
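For background, a standalone pandas example of why this matters (assumes PyTables is installed): plain assignment uses the fixed format, which cannot store categorical columns, whereas append uses the table format, which can.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df["cat"] = pd.Categorical(["x", "y", "x"])

with pd.HDFStore("example.h5") as store:
    # store["key"] = df  # fixed format -> raises for categorical columns
    store.append(key="key", value=df, append=False)  # table format -> works
    roundtrip = store["key"]

print(roundtrip.dtypes)  # "cat" comes back as a categorical column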

Update

Subsequent to applying this patch, I found that attempting to save H5 files of size = 24 MB caused a new error (note: small files are OK):

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\kedro\io\core.py", line 232, in save
    self._save(data)
  File "C:\Users\Me\PycharmProjects\MyProj\datascience_kedro\src\my_proj\nodes\custom_datasets.py", line 120, in _save
    store.append(key=self._key, value=data, append=False)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\pandas\io\pytables.py", line 987, in append
    **kwargs)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\pandas\io\pytables.py", line 1370, in _write_to_group
    s.create_index(columns=index)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\pandas\io\pytables.py", line 3365, in create_index
    v.create_index(**kw)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\table.py", line 3609, in create_index
    tmp_dir, _blocksizes, _verbose)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\table.py", line 319, in _column__create_index
    blocksizes=blocksizes)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\index.py", line 400, in __init__
    super(Index, self).__init__(parentnode, name, title, new, filters)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\group.py", line 238, in __init__
    super(Group, self).__init__(parentnode, name, _log)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\node.py", line 275, in __init__
    self._g_post_init_hook()
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\index.py", line 554, in _g_post_init_hook
    self.create_temp()
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\site-packages\tables\index.py", line 1013, in create_temp
    ".tmp", "pytables-", self.tmp_dir)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\tempfile.py", line 344, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "C:\Users\Me\Miniconda3\envs\datasci\lib\tempfile.py", line 262, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: 'data/\\pytables-_0u2doys.tmp'

This seems to occur while attempting to create an index on the H5 file. The intermediate tempfile in the S3 bucket is not created (note this error happens in both Win10 and Ubuntu 18.04.3 LTS, so I don't think it's an improperly formed path).

My quick hack for this is to go into /site-packages/pandas/io/pytables.py, in the create_index method, and set optlevel=False -> optlevel=0. This prevents index optimisation, so no attempt is made to create a tempfile.


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used 0.14.3
  • Python version used 3.6.6:
  • Operating system and version Win10 and Ubuntu 18.04.3 LTS

[KED-933] Passing node parameters from `create_pipeline`

Description

I have been through the tutorial and the docs and have found something which I think could be included as a new feature. It's possible that there may be an alternative way to do this so please enlighten me if there is.

I wanted to define a reusable node that could be configured through function arguments and used in multiple stages of the pipeline. To my understanding, the parameters config doesn't fit what I am trying to do because there will be multiple instances of the same node in the pipeline but with different arguments. The example I have at hand is a node which vectorizes some text based on an input column. I want to be able to define a node like:

def vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer, column: str):
    documents = data[column]
    return vectorizer.transform(documents)

so that in create_pipeline I can do:

    pipeline = Pipeline(
        [
            ...
            node(vectorize,
                 ["question_pairs", "vectorizer", "question1"],
                 "matrix1"),
            node(vectorize,
                 ["question_pairs", "vectorizer", "question2"],
                 "matrix2"),
            ...
        ]
    )

where the third item in the input list is the column I wish to vectorize

Context

Allowing for such parameterisation of nodes would mean that nodes can be reused throughout the pipeline.

Possible Implementation

If I was to run this code, it would fail saying that "question1" or "question2" cannot be found in the DataCatalog. A possible implementation would be to allow any surplus parameters that are not defined in the DataCatalog to be passed in as function arguments. This would allow for the type of behaviour I am looking to implement with a reusable node.

Possible Alternatives

A possible alternative is to wrap the vectorize function with functools.partial and specify the column parameter to create a new partial object, but this fails because __name__ is undefined for a functools.partial object.
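One way the partial-object limitation could be sidestepped today, as a sketch only (a small factory returning a named closure bound to the column; the helper name is made up):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def make_vectorize_node(column: str):
    # Return a function pre-bound to `column`, with a proper __name__ so that
    # Kedro can use it as a node function.
    def _vectorize(data: pd.DataFrame, vectorizer: TfidfVectorizer):
        return vectorizer.transform(data[column])

    _vectorize.__name__ = "vectorize_{}".format(column)
    return _vectorize

# e.g. node(make_vectorize_node("question1"), ["question_pairs", "vectorizer"], "matrix1")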

Max Workers for Parallel Runner

Description

I am working with a very slow database connection to get data migrated to S3. The parallel runner allows me to run as many connections in parallel as there are CPU cores.

Context

I would like to be able to kick it up a notch while waiting on slow sql queries.

Possible Implementation

Parallel Runner Module

As shown in the ProcessPoolExecutor docs, max_workers can be passed into the ProcessPoolExecutor.

Change Line 34 in the parallel runner module to bring in Optional.

- from typing import Any, Dict
+ from typing import Any, Dict, Optional

Change Line 59-68 in the parallel runner module

-    def __init__(self):
+    def __init__(self, max_workers: Optional[int] = None):
        """Instantiates the runner by creating a Manager.
+
+       Args:
+           max_workers
        """
        self._manager = ParallelRunnerManager()
        self._manager.start()
+       self.max_workers = max_workers

Change line 170 in the parallel runner module to the following

-         with ProcessPoolExecutor() as pool:
+        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:

kedro_cli in template

add MAX_WORKERS_HELP to kedro_cli line 91

+
+MAX_WORKERS_HELP = """Maximum number of processes to run in parallel. If
+max_workers is None or not given, it will default to the number of processors
+on the machine. If max_workers is lower than or equal to 0, a ValueError will
+be raised."""

In the default template, add a flag to the cli in [kedro_cli lines 102-147](https://github.com/quantumblacklabs/kedro/blob/861baa5ee8dd1c8ffce3ef83ae598fa38ecb55e6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/kedro_cli.py#L102-L147)

``` diff
@click.group(context_settings=CONTEXT_SETTINGS, name=__file__)
def cli():
    """Command line tools for manipulating a Kedro project."""


@cli.command()
@click.option("--from-nodes", type=str, default="", help=FROM_NODES_HELP)
@click.option("--to-nodes", type=str, default="", help=TO_NODES_HELP)
@click.option(
    "--node",
    "-n",
    "node_names",
    type=str,
    default=None,
    multiple=True,
    help=NODE_ARG_HELP,
)
@click.option(
    "--runner", "-r", type=str, default=None, multiple=False, help=RUNNER_ARG_HELP
)
@click.option("--parallel", "-p", is_flag=True, multiple=False, help=PARALLEL_ARG_HELP)
+ @click.option("--max-workers", type=int, default=None, multiple=False, help=MAX_WORKERS_HELP)
@click.option("--env", "-e", type=str, default=None, multiple=False, help=ENV_ARG_HELP)
@click.option("--tag", "-t", type=str, default=None, multiple=True, help=TAG_ARG_HELP)
- def run(tag, env, parallel, runner, node_names, to_nodes, from_nodes):
+ def run(tag, env, parallel, runner, max_workers, node_names, to_nodes, from_nodes):
    """Run the pipeline."""
    from {{cookiecutter.python_package}}.run import main
    from_nodes = [n for n in from_nodes.split(",") if n]
    to_nodes = [n for n in to_nodes.split(",") if n]

    if parallel and runner:
        raise KedroCliError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
+
+    if runner and max_workers is not None:
+        raise KedroCliError(
+            "Both --runner and --max-workers options cannot be used together. "
+            "Please use either --parallel with --max-workers or --runner."
+        )

    if parallel:
        runner = "ParallelRunner"
    runner_class = load_obj(runner, "kedro.runner") if runner else SequentialRunner
+
+    runner_kwargs = {}
+    if parallel and max_workers is not None:
+        runner_kwargs["max_workers"] = max_workers

    main(
        tags=tag,
        env=env,
-        runner=runner_class(),
+        runner=runner_class(**runner_kwargs),  # I am not familiar with load_obj, so I am not completely sure this is correct
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
    )
```

Docs

04_create_pipelines.md line 727

- * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores.
+ * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores as set by `max_workers`.

05_nodes_and_pipelines.md line 526

+ > *Note: * the parallel runner can be called with --max-workers to set the maximum number of worker processes.

004_create_pipelines.md line 737

+ > *Note: * the parallel runner can be called with --max-workers to set the maximum number of worker processes.

[KED-1054] Add option to kedro build-docs which also opens the documentation

Description

I find myself writing 'kedro build-docs; open docs/build/html/index.html', so adding a simple --open argument would be appreciated. I do plan to PR this myself; I'm just not sure when I'll have time.

Context

  • Saves time
  • Is almost always the next action after running the command

Possible Implementation

  • Add CLI argument

Possible Alternatives

N/A

[KED-862] Implement dataset versioning on CSVBlobDataSet

Description

CSVBlobDataSet class in kedro.contrib.io.azure.CSVBlobDataSet does not have versioning implemented and it would be great if we could also enable it in this dataset.

Context

N/A

Possible Implementation

  • The documentation for how to enable versioning can be found here
  • You could also find actual examples of versioning implementations in the datasets in kedro.io

IO Dataset class to load and save Parquet files on S3

Description

A dataset class to load and save parquet files on S3 using pandas. Currently, the only way is to use Spark, not pandas.

Context

If we are just using pandas for smaller data sizes to deal with parquet files on S3, it would be useful to have an IO class that can read and write parquet files on S3.

Possible Implementation

# Copyright 2018-2019 QuantumBlack Visual Analytics Limited. Licensed under the Apache License, Version 2.0.

"""
``ParquetS3DataSet`` is a data set used to load and save
data to parquet files on S3
"""

from typing import Any, Dict, Optional

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs.core import S3FileSystem

from kedro.io.core import (
    AbstractDataSet,
    DataSetError,
    ExistsMixin,
    S3PathVersionMixIn,
)


class ParquetS3DataSet(AbstractDataSet, ExistsMixin, S3PathVersionMixIn):
    """``ParquetS3DataSet`` loads and saves data to a file in S3. It uses s3fs
    to read and write from S3 and pandas to handle the parquet file.

    Example:
    ::
        >>> from kedro.contrib.io.parquet_s3 import ParquetS3DataSet
        >>> import pandas as pd
        >>>
        >>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
        >>>                      'col3': [5, 6]})
        >>>
        >>> data_set = ParquetS3DataSet(
        >>>                         filepath="temp3.parquet",
        >>>                         bucket_name="test_bucket",
        >>>                         credentials={
        >>>                             'aws_access_key_id': 'YOUR_KEY',
        >>>                             'aws_secret_access_key': 'YOUR SECRET'},
        >>>                         save_args={"compression": "GZIP"})
        >>> data_set.save(data)
        >>> reloaded = data_set.load()
        >>>
        >>> assert data.equals(reloaded)
    """

    def _describe(self) -> Dict[str, Any]:
        return dict(
            filepath=self._filepath,
            bucket_name=self._bucket_name,
            load_args=self._load_args,
            save_args=self._save_args,
        )

    # pylint: disable=too-many-arguments
    def __init__(
        self,
        filepath: str,
        bucket_name: str,
        engine: str = "auto",
        credentials: Optional[Dict[str, Any]] = None,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        """Creates a new instance of ``ParquetS3DataSet`` pointing to a concrete
        parquet file on S3.
        Args:
            filepath: Path to a parquet file
                parquet collection or the directory of a multipart parquet.

            bucket_name: S3 bucket name.

            credentials: Credentials to access the S3 bucket, such as
                ``aws_access_key_id``, ``aws_secret_access_key``.

            engine: The engine to use, one of: `auto`, `fastparquet`,
                `pyarrow`. If `auto`, then the default behavior is to try
                `pyarrow`, falling back to `fastparquet` if `pyarrow` is
                unavailable.

            load_args: Additional loading options `pyarrow`:
                https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
                or `fastparquet`:
                https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.ParquetFile.to_pandas

            save_args: Additional saving options for `pyarrow`:
                https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
                or `fastparquet`:
                https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
        """
        default_save_args = {"compression": None}
        default_load_args = {}

        self._filepath = filepath
        self._engine = engine

        self._load_args = (
            {**default_load_args, **load_args}
            if load_args is not None
            else default_load_args
        )
        self._save_args = (
            {**default_save_args, **save_args}
            if save_args is not None
            else default_save_args
        )

        self._bucket_name = bucket_name
        self._credentials = credentials or {}
        self._s3 = S3FileSystem(client_kwargs=self._credentials)

    @property
    def _client(self):
        return self._s3.s3

    def _load(self) -> pd.DataFrame:
        load_key = self._get_load_path(
            self._client, self._bucket_name, self._filepath
        )

        with self._s3.open(
            "{}/{}".format(self._bucket_name, load_key), mode="rb"
        ) as s3_file:
            return pd.read_parquet(s3_file, engine=self._engine, **self._load_args)

    def _save(self, data: pd.DataFrame) -> None:
        save_key = self._get_save_path(
            self._client, self._bucket_name, self._filepath
        )

        output_file = f"s3://{self._bucket_name}/{self._filepath}"
        pq.write_table(pa.Table.from_pandas(data), output_file, filesystem=self._s3)

        load_key = self._get_load_path(
            self._client, self._bucket_name, self._filepath
        )
        self._check_paths_consistency(load_key, save_key)

    def _exists(self) -> bool:
        try:
            load_key = self._get_load_path(
                self._client, self._bucket_name, self._filepath
            )
        except DataSetError:
            return False
        args = (self._client, self._bucket_name, load_key)
        return any(key == load_key for key in self._list_objects(*args))

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Happy to create a PR if the team agrees to this.

[KED-861] Implement dataset versioning on SparkDataSet

Description

SparkDataSet class in kedro.contrib.io.pyspark.SparkDataSet does not have versioning implemented and it would be great if we could also enable it in this dataset.

Context

N/A

Possible Implementation

  • The documentation for how to enable versioning can be found here
  • You could also find actual examples of versioning implementations in the datasets in kedro.io

[KED-928] Full Google Cloud Platform Support

Description

In the Kedro docs, GCP is mentioned, but I cannot find any references to data connectors.
Add data connectors for Google Cloud Storage and Google BigQuery.

Context

Allow the use of Google Cloud products for me and my team of data scientists. GCP is a large cloud provider and a very popular IaaS used by many people. GCS is the Google equivalent of AWS S3, and BigQuery is Google's hosted database system.

Possible Implementation

Using the Google Python client library, create the following:
kedro.io.gcs_csv
kedro.io.gcs_parquet
kedro.io.gcs_hdfs
kedro.io.gcs_pickle
kedro.io.gcs_json
kedro.io.gbq [load df from GBQ]
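As a rough illustration only (not a proposed final design), a minimal GCS-backed CSV dataset could look roughly like the sketch below, using gcsfs and pandas; the class name and constructor arguments are hypothetical:

from typing import Any, Dict, Optional

import gcsfs
import pandas as pd

from kedro.io import AbstractDataSet


class CSVGCSDataSet(AbstractDataSet):
    """Loads and saves a pandas DataFrame as a CSV file on Google Cloud Storage."""

    def __init__(self, filepath: str, bucket_name: str, project: Optional[str] = None,
                 load_args: Optional[Dict[str, Any]] = None,
                 save_args: Optional[Dict[str, Any]] = None) -> None:
        self._path = "{}/{}".format(bucket_name, filepath)
        # Uses application default credentials or a service account, depending on setup.
        self._fs = gcsfs.GCSFileSystem(project=project)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _describe(self):
        return dict(path=self._path, load_args=self._load_args, save_args=self._save_args)

    def _load(self) -> pd.DataFrame:
        with self._fs.open(self._path, mode="rb") as f:
            return pd.read_csv(f, **self._load_args)

    def _save(self, data: pd.DataFrame) -> None:
        csv_bytes = data.to_csv(**self._save_args).encode("utf-8")
        with self._fs.open(self._path, mode="wb") as f:
            f.write(csv_bytes)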

Possible Alternatives

In the S3 implementations, replace s3fs with boto3 to allow access to both S3 and GCS with the same code. See the GCP simple migration method outlined here; however, this does not allow for full access to the GCS product (i.e. service accounts etc.) and could break some functions in the S3 implementations.

Todo

  • Set up authentication via service accounts and default credentials
  • Duplicate functionality in io.s3 connectors for GCS
  • Duplicate functionality in io.sql connector for GBQ
  • Test connectors against public datasets
  • Test connectors on private datasets

Checklist

  • [Component: I/O]
  • [Type: Enhancement]
  • [Priority: Medium]

'kwargs' is not defined in `kedro_cli.py` generated by `kedro new` command

Description

'kwargs' is not defined in kedro_cli.py generated by kedro new command


Context

The code generated by kedro new is apparently not tested.

Steps to Reproduce

  1. Run kedro new to create a kedro project with iris dataset.
  2. Run kedro jupyter convert --all

Expected Result

No error.

Actual Result

NameError: name 'kwargs' is not defined


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): the latest develop build (commit 39ca35e)
  • Python version used (python -V): Python 3.6.8
  • Operating system and version: Ubuntu 18.04.3 LTS (Bionic Beaver)

Supporting conda install in kedro install command

Description

Frequently in data science projects we need to use conda install to handle packages that require configuration / building steps after installation.

Context

I am currently doing a data science project in Kedro which uses deep learning. In this situation using conda install is important as it allows me to handle CUDA/cuDNN dependencies within the ecosystem.

This enhancement fits naturally as most Kedro installations use conda to create virtual environments anyway.

Possible Implementation

If someone creates a file called requirements_conda.txt in the src folder, the following code should work to install via both pip and conda.

# In the project's kedro_cli.py (`import os` is needed at the top of the file):
@cli.command()
def install():
    """Install project dependencies from requirements.txt and requirements_conda.txt."""
    python_call("pip", ["install", "-U", "-r", "src/requirements.txt"])
    os.system("conda install --file src/requirements_conda.txt --yes")

I'd be happy to contribute the code myself, but I was wondering if this is something that you all envisioned supporting?

Option to include visualized pipelines in the generated document

Description

An option to include an image of the visualized pipelines in the Sphinx documentation generated by the kedro build-docs command.

Context

kedro-viz offers the kedro viz command, which can generate an interactive visualization of the pipelines.

This visualization is very useful for explaining the pipeline to stakeholders, and it would be even nicer to automate the manual steps of running the kedro viz command, accessing the URL, taking a screenshot, and pasting it into the document.

![image](https://user-images.githubusercontent.com/33908456/61184262-542f8000-a67e-11e9-872e-3fa40e31b81c.png)

Possible Implementation

Programmatically communicate with the kedro_viz.server.

Possible Alternatives

Use a graph visualization tool such as graphviz.

[KED-519] Support template variables in configuration files

Description

As a user of Kedro, I would like to be able to parameterize/template my config files, in order to be able to change the configuration of lots of datasets, across config files, by changing only one variable in a centralized place.

This would allow me to for instance:

  • Create flexible folderpaths for all (or a large subset of) Datasets, allowing me to
    • Quickly switch root directories from a dev/test environment to a prod environment
    • Work with data iterations
    • Have pipelines for different countries run using the same config files
  • Reduce the size of my config files by templating common parts (for instance I could parameterize the 'save_args' or 'load_args' of a Spark Dataset)

Context

User stories/ideas from current users have mainly focused on templating the folder path.

Some more advanced ideas have included:

  • Allowing for boolean logic inside the template and the values to allow for easier switching (e.g. one could have a parameter dev=True, and when this gets switched to False, all the config that depends on this boolean switches. This would be useful if, for instance, both the folderpath and the load_args change when running in dev vs prod; in that case only one parameter would have to be changed)
  • Allowing for templating inside the template values. This would reduce the chance of errors in complex config folders that all have a similar root but might have multiple parameters in them. Example:
	raw_employees:
	  type: kedro.contrib.io.pyspark.SparkDataSet
	  filepath: s3a://${s3_bucket}/${country}/${raw_folder}/employees.csv
	  file_format: csv
	  load_args:
	    delimiter: ";"
	    header: true

with values.yml:

	s3_bucket: 'bucketname'
	country: 'uk'
	raw_folder: '00_raw'

you could have:

	raw_employees:
	  type: kedro.contrib.io.pyspark.SparkDataSet
	  filepath: ${raw_root}/employees.csv
	  file_format: csv
	  load_args:
	    delimiter: ";"
	    header: true

with values.yml:

	s3_bucket: 'bucketname'
	country: 'uk'
	raw_folder: '00_raw'
	raw_root: s3a://${s3_bucket}/${country}/${raw_folder}

Lots of people I have spoken to have created their own version of this on Kedro projects (mostly the simple one, without templating inside values.yml). It would be good to build a solution for it into the Kedro core, incorporating best practices and unit tests.

Possible Implementation

You can make a class that inherits from ConfigLoader with an extra method called resolve, which is similar to the .get method, the only difference being that it applies the default value dict to the config before returning it.

One way this could be done is using the string.Template class in Python. Otherwise, some logic needs to be written to replace template variables with defaults.
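A minimal sketch of this idea, assuming kedro.config.ConfigLoader and Python's string.Template (the class name and substitution strategy are illustrative only):

from string import Template

from kedro.config import ConfigLoader


class TemplatedConfigLoader(ConfigLoader):
    def __init__(self, conf_paths, values):
        super().__init__(conf_paths)
        self._values = values  # e.g. the parsed contents of values.yml

    def resolve(self, *patterns):
        # Like .get(), but with ${...} placeholders substituted from values.
        return self._substitute(self.get(*patterns))

    def _substitute(self, obj):
        if isinstance(obj, dict):
            return {key: self._substitute(value) for key, value in obj.items()}
        if isinstance(obj, list):
            return [self._substitute(value) for value in obj]
        if isinstance(obj, str):
            return Template(obj).safe_substitute(self._values)
        return obj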

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Separate data storage from data format in IO classes

I'm worried about a potential proliferation of inconsistent IO classes, but I think we could avoid this by making a change to how IO classes are structured.

Right now, an IO class reads a specific file format from a specific type of storage. For example, `CSVS3DataSet` reads a specific format (CSV) from a specific storage type (S3). This means that if you want to read a CSV from a different storage type, you need to write a whole new class (e.g. `CSVLocalDataSet`).

There are at least two problems with this approach:

  1. If there are N formats and M storage types, then we'd need N*M classes, which would slow down development and lead to inconsistencies in how the same file formats are read.
  2. If we want to add support for a new file type, we can't just do this once. It would have to be repeated M times. Which is bad.

We're not feeling the pain of (1) and (2) yet because the project is still new. But as the number of use-cases grows, we'll feel an exponentially increasing level of pain.

Positive suggestion

We can lay a foundation for avoiding these issues by separating the IO classes into two types:

  1. Classes for reading raw data, each of which inherits from a mix-in class. Let's call this KedroFileReader.
  2. Classes for parsing data, each of which also inherits from a mix-in class. Let's call this KedroFileParser.

We stipulate that every KedroFileReader class must provide a file-like object supporting the usual read methods, open, close, and contexts (with foo as f:). KedroFileReader classes would not know anything about how to parse the data -- they just provide a uniform set of methods for reading data.

KedroFileParser classes would provide a parse method that takes a file-like object as input and returns whatever data structure is appropriate (e.g. a DataFrame).

Finally, when we want to (e.g.) read a CSV from S3, the class's entire definition is:

class CSVS3FileReader(KedroS3Reader, KedroCSVParser):
    def __init__(self, **kwargs):
        ...here we would call the __init__ methods for the two mix-in classes...

If something very tricky were necessary for a specific KedroFileReader and KedroFileParser combination, methods could be overridden and/or a new __init__ method could do custom stuff and then call the class's super. So we would not lose any flexibility.
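To make the shape of the proposal concrete, a rough sketch of the two mix-in interfaces (everything here is illustrative and only uses the names introduced above):

import abc


class KedroFileReader(abc.ABC):
    """Knows how to fetch raw data from some storage; knows nothing about parsing."""

    @abc.abstractmethod
    def open(self):
        """Return a file-like object supporting read(), close() and context management."""


class KedroFileParser(abc.ABC):
    """Knows how to turn a file-like object into a data structure; knows nothing about storage."""

    @abc.abstractmethod
    def parse(self, file_obj):
        """Parse a file-like object and return an appropriate data structure, e.g. a DataFrame."""


class KedroCSVParser(KedroFileParser):
    def parse(self, file_obj):
        import pandas as pd
        return pd.read_csv(file_obj)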

Happy to work on this myself and open a PR.

"kedro ipython" doesn't leverage ScientificView in PyCharm

Description

I'm working my way through the Kedro tutorial. When I run df = io.load("foo.csv") for the iris data, I get the interactive mode in the terminal as expected; however, I can't see the contents of the dataframe in PyCharm's data viewer.

Is there a way to tell Kedro to run its IPython mode using the native IPython setup in PyCharm?


Context

I'm trying to use the data viewer in PyCharm's ScientificView.

Steps to Reproduce

  1. Install Pycharm Pro
  2. Install sample project
  3. run kedro ipython
  4. Load a sample dataset df = io.load("foo.csv")

Expected Result

I would like to have ScientificView populated with the contents of the Dataframe

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used 0.14.2
  • Python version used 3.6.8
  • Operating system and version: Win10

Change the default behavior of `run` to `run_only_missing`

Description

It would be more intuitive for run to avoid re-computation of nodes by default, leaving the current behavior available as a force_rerun option.

Context

A major motivation for saving intermediate files, even though it consumes disk space and requires additional computation time, is to avoid re-computation of nodes. Thus, it is more intuitive to use the saved files by default, as run_only_missing already does in the current version of kedro.

This suggestion is related to #30 and #25 .

Possible Implementation

Modify run

Allow defining credentials from variable environment

Description

It is not clear how one can best deploy Kedro with credentials for production. A good practice in traditional applications is to provide sensitive credentials via environment variables.

In Kedro, it looks like it relies only on credentials.yml, which for deployment has to be located somewhere other than the conf/local/ folder.

(PS: If I'm missing something with this, the problem is probably bad or missing documentation since I'm not able to find this information in the doc)

Context

Using environment variables would help standardize the deployment of Kedro like any other app, thus reducing the learning curve for developers.

Possible Implementation

A possible change could be to look for environment variables first, before looking at the content of credentials.yml. Changes would mainly be located in the ConfigLoader class.
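
For illustration, a minimal sketch of that idea (the environment-variable naming convention and the class name are assumptions):

import os

from kedro.config import ConfigLoader


class EnvAwareConfigLoader(ConfigLoader):
    ENV_PREFIX = "KEDRO_CREDENTIAL__"  # e.g. KEDRO_CREDENTIAL__dev_s3__key

    def get_credentials(self, *patterns):
        credentials = self.get(*patterns)
        for name, value in os.environ.items():
            if not name.startswith(self.ENV_PREFIX):
                continue
            # Split "dev_s3__key" into the credential entry and the field to set,
            # so environment variables override values from credentials.yml.
            entry, _, field = name[len(self.ENV_PREFIX):].partition("__")
            credentials.setdefault(entry, {})[field] = value
        return credentials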

Possible Alternatives

I create the credentials.yml file in the config/base/ folder, because this folder is ignored by Git but is still packaged in the Kedro Docker image. This is still not good, because the credentials now end up in the Docker image repository: anyone who can pull that image gets the prod credentials!


If this is an interesting implementation (which I think it is), I would be happy to contribute by implementing it. I first want to open the discussion about the best implementation given Kedro's orientation, before making any pull request.

Regards,

What is the idiom for multiple S3 connections in parallel tasks

Description

I have a pipeline which takes CSV files in an S3 bucket, processes them and outputs HDF5 files to a new S3 bucket. These are computationally intensive processes, so I wish to run them on a large EC2 instance. Do I need to manually manage some sort of connection pool? Will the threads interfere with each other? I am writing my own dataset specifically for the task, based on the CSVS3DataSet example.

Many thanks


Simple workflow won't parallelize

Description

I take daily (gzipped) CSV files, process them and return HDF5 files. When I try to do 2 or more days using the ParallelRunner, it doesn't work (see the error below).

Context

The processing function just renames some variables, calls another (non-lambda) function to do some date processing, then uses the LambdaDataSet trick to save the result as an HDF5 file. I'm using a factory for the LambdaDataSet:

import pandas as pd

from kedro.io import LambdaDataSet


def lambda_data_set_factory(output_dir_file, frame_name="dat") -> LambdaDataSet:
    """Build a LambdaDataSet that saves a DataFrame to HDF5 and loads it back."""
    def save_df_to_hdf(value):
        with pd.HDFStore(
            path=output_dir_file, mode="w", complib="blosc:blosclz"
        ) as store:
            store.append(
                key=frame_name, value=value, append=False, data_columns=["ts"]
            )

    def load_hdf():
        with pd.HDFStore(output_dir_file, mode="r") as store:
            nrows = store.get_storer("dat").nrows
            df = store.select(frame_name, start=0, stop=nrows)

        # Replace TS index with integers
        # df.reset_index(inplace=True, drop=False)

        return df

    return LambdaDataSet(save=save_df_to_hdf, load=load_hdf)


Expected Result

Work

Actual Result

Doesn't Work

AttributeError: The following data_sets cannot be serialized: [Mon, Tue]
In order to utilize multiprocessing you need to make sure all data sets are serializable, i.e. data sets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly using functools.wraps().
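
Not part of the original report, but for reference: the error comes from the nested save_df_to_hdf/load_hdf closures, which cannot be pickled for multiprocessing. A hedged sketch of a picklable alternative, using a module-level custom dataset instead of LambdaDataSet (the class name is an assumption):

from typing import Any, Dict

import pandas as pd

from kedro.io import AbstractDataSet


class HDFStoreDataSet(AbstractDataSet):
    """Module-level dataset with no closures, so the ParallelRunner can pickle it."""

    def __init__(self, filepath: str, key: str = "dat") -> None:
        self._filepath = filepath
        self._key = key

    def _save(self, data: pd.DataFrame) -> None:
        with pd.HDFStore(self._filepath, mode="w", complib="blosc:blosclz") as store:
            store.append(key=self._key, value=data, append=False, data_columns=["ts"])

    def _load(self) -> pd.DataFrame:
        with pd.HDFStore(self._filepath, mode="r") as store:
            return store.select(self._key)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, key=self._key)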

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.14.3
  • Python version used (python -V): 3.6.6
  • Operating system and version: Win10

SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab

Description

SparkDataSet with a relative local file path doesn't work on jupyter notebook/lab

Context

We can have a SparkDataSet entry in catalog.yml whose filepath is a relative local file path, which looks like this:

something:
  type: kedro.contrib.io.pyspark.SparkDataSet
  file_format: parquet
  filepath: data/01_intermediate/something.parquet
  save_args:
    mode: overwrite

And when something.parquet is placed under <project_directory>/data/01_intermediate/something.parquet properly, kedro run can successfully load this parquet as long as the pipeline uses the 'something' data.

But in a jupyter notebook invoked by the kedro jupyter notebook command, the following script doesn't load 'something' as expected.

io.load('something')

Instead, it raises an exception that looks like the following:

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Reading spark_data_set.py (of kedro) and readwriter.py (of pyspark), I think it is caused by the spark.read.load implementation. Apparently spark.read.load mistakenly tries to read the data located at /Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet; somehow, it resolves a given relative filepath relative to the notebook's directory.

Steps to Reproduce

As I showed above,

  1. put some spark readable data on your local file system
  2. put a corresponding data entry using a relative file path on catalog.yml
  3. invoke jupyter notebook by using kedro jupyter notebook command at the root directory of the kedro project
  4. execute io.load('something') on a jupyter notebook

Expected Result

load a 'something' dataframe

Actual Result

raises exceptions

Py4JJavaError: An error occurred while calling o25.load.
: org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/go_kojima/sample_kedro_project/notebooks/data/01_intermediate/something.parquet

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V):
    kedro, version 0.14.1
    (anaconda3-2019.03)

  • Python version used (python -V):
    Python 3.7.3
    (anaconda3-2019.03)

  • Operating system and version:
    macOS Mojave version 10.14.5

My personal solution

Modify the filepath into an absolute file path when a relative local file path is given. The SparkDataSet __init__ code would look like the following:

class SparkDataSet(AbstractDataSet, ExistsMixin):

    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        import re
        from os.path import abspath, curdir

        ### original version
        # self._filepath = filepath

        ### modified version
        def is_relative_path(path):
            def is_url():
                url_pattern = r'^\S+://'
                return bool(re.match(url_pattern, path))
            def is_abspath():
                abspath_pattern = r'^/'
                return bool(re.match(abspath_pattern, path))
            return not (is_url() or is_abspath())

        def file_url(rpath):
            return 'file://%s/%s' % (abspath(curdir), rpath)

        self._filepath = file_url(filepath) if is_relative_path(filepath) else filepath

        self._file_format = file_format
        self._load_args = load_args if load_args is not None else {}
        self._save_args = save_args if save_args is not None else {}

[KED-882] Jupyterlab naming collision using text file tooltips

Description

When editing .py text files with an attached console using kedro jupyter lab, the kernel used by the console will continually restart when trying to use autocompletion and tooltips during the spaceflight tutorial. This is because the tutorial requires creating an io module, which collides with Python's built-in io module.

Context

I was trying to edit python text files and was wondering why the console attached to the file was constantly dying when I was trying to make use of tooltips (shift + tab). Having read up on the issue, I believe it's because the project's io module shadows the standard library's io module.

Steps to Reproduce

  1. Run kedro jupyter lab (jupyterlab==1.0.2)
  2. Open a .py text file
  3. Open console by right-clicking
  4. Try getting a tooltip for any arbitrary function like print() using shift+tab

Expected Result

The tooltip should be shown and the console's kernel should stay alive.

Actual Result

[I 19:34:54.854 LabApp] KernelRestarter: restarting kernel (4/5), new random ports
Fatal Python error: init_sys_streams: can't initialize sys standard streams
AttributeError: module 'io' has no attribute 'OpenWrapper'

Current thread 0x00007fd301fbe740 (most recent call first):
[W 19:34:57.874 LabApp] KernelRestarter: restart failed
[W 19:34:57.875 LabApp] Kernel c1a79a1c-6ec1-44fb-b5c4-f18e6156ee19 died, removing from map.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): v0.14.3
  • Python version used (python -V): 3.7.3
  • Operating system and version: Ubuntu 18.04
  • conda version: conda version : 4.6.14
  • jupyterlab version: 1.0.2 installed from conda-forge (conda install -c conda-forge jupyterlab==1.0.2)

[KED-988] Using functools.update_wrapper for partial functions raises TypeError

Description

Nodes composed of functions created by functools.partial raise a warning, suggesting the use of functools.update_wrapper, but using update_wrapper on the partial function raises a TypeError.

I think this could be fixed by adding follow_wrapped=False to inspect.signature, which is called in Node's _validate_inputs method. I'm happy to open a PR if this seems a reasonable fix.
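
A small self-contained demonstration of the difference (not taken from the Kedro codebase):

import inspect
from functools import partial, update_wrapper


def partial_foo(bar, df):
    return df


wrapped = update_wrapper(partial(partial_foo, "hello"), partial_foo)

# The default follows __wrapped__ back to partial_foo, reporting (bar, df).
print(inspect.signature(wrapped))
# follow_wrapped=False reports the signature of the partial that is actually called: (df).
print(inspect.signature(wrapped, follow_wrapped=False))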

Context

I'm running pipelines with node functions that return partial functions. I want to avoid the warning about update_wrapper.

Steps to Reproduce

Run the following minimal code to get the error:

from functools import partial, update_wrapper
from kedro.pipeline import node
import pandas as pd

def partial_foo(bar, df):
    return df.assign(baz=bar)
    
def foo(bar):
    return update_wrapper(partial(partial_foo, bar), partial_foo)

df = pd.DataFrame([{'foo': 'foo', 'bar': 'bar'}, {'foo': 'baz', 'bar': 'brr'}])
my_node = node(foo('hello'), 'data_in', 'data_out')

Expected Result

I should be able to create a node with a partial function that is passed to update_wrapper without an error.

Actual Result

When I call update_wrapper on the node's partial function, I get the following error:

/usr/local/lib/python3.6/site-packages/kedro/pipeline/node.py in _validate_inputs(self, func, inputs)
    515                         str(list(func_args)), str(inputs)
    516                     )
--> 517                 ) from exc
    518 
    519     def _validate_unique_outputs(self):

TypeError: Inputs of function expected ['bar', 'df'], but got data_in

This is due to inspect.signature getting the signature of the wrapped function, which does not match the wrapper function that is actually called within Node.

Your Environment

  • Kedro: 0.15.0
  • Python: 3.6.8
  • Running in the python:3.6 Docker image

The latest Click version 7.0 is not supported

Description

Kedro requires click==6.7 and raises a conflict error if the installed Click version is 7.0.

Context

Support for Click 7.0 is in demand, since version 7.0 was released on Sep 26, 2018 and is commonly used now. The latest Anaconda ships Click 7.0 as well.

Reference: https://docs.anaconda.com/anaconda/packages/py3.7_linux-64/

Steps to Reproduce

  1. Follow Kedro's getting-started guide: https://kedro.readthedocs.io/en/latest/02_getting_started/04_hello_world.html

Expected Result

kedro run does not return any errors.

Actual Result

kedro run returns a conflict error as follows.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/kedro/cli/cli.py", line 519, in _get_plugin_command_groups
    groups.append(entry_point.load())
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2433, in load
    self.require(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2456, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 791, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (Click 7.0 (/usr/local/lib/python3.6/dist-packages), Requirement.parse('click==6.7'), {'kedro'})
Error: Loading project commands from kedro-viz = kedro_viz.server:commands


Your Environment

Kedro 0.14.3
Python 3.6.8
Ubuntu 18.04.2 LTS (Bionic Beaver)

Modularize default argument handling for datasets

Description

Near-identical code to handle default arguments is replicated in almost every dataset implementation. Worse still, functionality across said datasets is the same, but implementation is inconsistent.

Context

When I want to implement a new dataset, I look at existing datasets as a baseline for implementing my own. However, there are inconsistencies between these datasets, from the more minor (save_args handled after load_args for some datasets), to the slightly more significant (special casing where there are no default arguments on some datasets but not others) and worse (one case where arguments are evaluated for truthiness instead of is not None) (see https://github.com/quantumblacklabs/kedro/blob/0.14.1/kedro/contrib/io/azure/csv_blob.py#L109-L113 as an example representing several of the above). I don't know which one to follow to maintain consistency across the codebase.

Possible Implementation

#15

By having DEFAULT_LOAD_ARGS/DEFAULT_SAVE_ARGS attributes, users can also see the defaults programmatically (with the caveat that this is a drawback if you consider the few cases where such arguments don't apply, like no save on SqlQueryDataSet or in general on LambdaDataSet/MemoryDataSet).
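
For illustration, a minimal sketch of the shape this could take (the mixin name and method name are assumptions):

import copy
from typing import Any, Dict, Optional


class DefaultArgsMixin:
    # Concrete datasets override these class attributes with their own defaults,
    # which also makes the defaults visible programmatically.
    DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def _init_io_args(
        self,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Merge user-supplied arguments over the defaults, consistently
        # checking `is not None` rather than truthiness.
        self._load_args = copy.deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = copy.deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)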

Possible Alternatives

  • Create an intermediate abstract dataset class (or mixin?) so as to not modify AbstractDataSet and thereby only apply to those with load_args/save_args
  • Move default argument handling into a utility function and call it from each individual __init__ method (not preferred)

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Jupyter Notebook and iPython launch issues

Description

When running kedro ipython or kedro jupyter notebook from my project root, I get FileNotFound errors.

It seems both ipython and Jupyter must be pip installed manually, despite them appearing to be fundamental to using Kedro.

This may be a misunderstanding on my part, since I assumed they would be available, given that libraries like pandas were available for writing the data_engineering.py ETL functionality; but a note in the installation docs reminding users to install them manually would prevent other users from making similar mistakes.

Context

I'm walking through the Spaceflight tutorial, and want to be able to follow the "develop in notebooks then copy/paste to src" philosophy.

Steps to Reproduce

  1. cd kedro-tutorial
  2. conda activate kedro_env
  3. kedro ipython or kedro jupyter notebook

Expected Result

A new ipython session should start, or the Jupyter notebook server should be available for access through my browser.

Actual Result

I receive a FileNotFoundError indicating No such file or directory: 'ipython'. It seems that the newly created conda environment for this project has neither ipython nor jupyter installed.

Jupyter Notebook

-------------------------------------------------------------------------------
Starting a Kedro session with the following variables in scope
proj_dir, proj_name, io, startup_error
Use the line magic %reload_kedro to refresh them
or to see the error message if they are undefined
-------------------------------------------------------------------------------
jupyter-notebook --ip=127.0.0.1
Traceback (most recent call last):
  File "/anaconda3/envs/kedro_env/bin/kedro", line 10, in <module>
    sys.exit(main())
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/kedro/cli/cli.py", line 557, in main
    ("Project specific commands", project_groups),
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/josephhaaga/Documents/TylorData/kedro_test/spaceflights_demo/kedro-tutorial/kedro_cli.py", line 207, in jupyter_notebook
    call(["jupyter-notebook", "--ip=" + ip] + list(args))
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/kedro/cli/utils.py", line 48, in call
    res = subprocess.run(cmd, **kwargs).returncode
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'jupyter-notebook': 'jupyter-notebook'

ipython

-------------------------------------------------------------------------------
Starting a Kedro session with the following variables in scope
proj_dir, proj_name, io, startup_error
Use the line magic %reload_kedro to refresh them
or to see the error message if they are undefined
-------------------------------------------------------------------------------
ipython
Traceback (most recent call last):
  File "/anaconda3/envs/kedro_env/bin/kedro", line 10, in <module>
    sys.exit(main())
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/kedro/cli/cli.py", line 557, in main
    ("Project specific commands", project_groups),
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/josephhaaga/Documents/TylorData/kedro_test/spaceflights_demo/kedro-tutorial/kedro_cli.py", line 134, in ipython
    call(["ipython"] + list(args))
  File "/anaconda3/envs/kedro_env/lib/python3.6/site-packages/kedro/cli/utils.py", line 48, in call
    res = subprocess.run(cmd, **kwargs).returncode
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/anaconda3/envs/kedro_env/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ipython': 'ipython'

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.14.2
  • Python version used (python -V): Python 3.6.8 :: Anaconda, Inc.
  • Operating system and version: OS X Mojave 10.14.5

Checklist

Include labels so that we can categorise your issue:

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Creating custom dataset for image segmentation.

I am moving my image segmentation projects to Kedro, but Kedro does not support this data type. Could you please suggest how I can incorporate this kind of data in catalog.yml?
FYI, my data is in the format below:
train

  • image1
  • image2
  • image3

test

  • image1
  • image2
  • image3

validate

  • image1
  • image2
  • image3

Thanks in advance.
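
Not part of the original question, but a minimal sketch of one possible approach (the class name and the use of Pillow are assumptions): a custom dataset that loads every image in a folder, which could then be referenced from catalog.yml via its import path.

from pathlib import Path
from typing import Any, Dict, List

import numpy as np
from PIL import Image

from kedro.io import AbstractDataSet


class ImageDirectoryDataSet(AbstractDataSet):
    """Loads all images in a folder (e.g. data/01_raw/train) as numpy arrays."""

    def __init__(self, folder: str) -> None:
        self._folder = Path(folder)

    def _load(self) -> List[np.ndarray]:
        return [np.asarray(Image.open(path)) for path in sorted(self._folder.iterdir())]

    def _save(self, data: List[np.ndarray]) -> None:
        raise NotImplementedError("Saving is not covered by this sketch.")

    def _describe(self) -> Dict[str, Any]:
        return dict(folder=str(self._folder))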

Documentation of `Running pipelines with IO` using KedroContext is missing

Description

The document at
https://github.com/quantumblacklabs/kedro/blob/develop/docs/source/04_user_guide/05_nodes_and_pipelines.md#running-pipelines-with-io has not been updated to the upcoming version 0.15.0 and the documentation of Running pipelines with IO using KedroContext is missing.

Context

KedroContext class was introduced by [KED-826] Use KedroContext class in project template (#111) (571bb80) 2 days ago.

Using this, I'm trying to feed an in-memory input dataset to the first node without saving a file, and found that the document has not been updated.

Related pull request: #60

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): Latest commit c9c49e0 in develop branch (version between 0.14.3 and 0.15.0)
  • Python version used (python -V): 3.6.8
  • Operating system and version: Ubuntu 18.04.2 LTS (Bionic Beaver)

[KED-925] Add Apache Sqoop IO Class

Description

I would like to use Kedro not only for data science pipelines but also for data ingestion. For large data ingestion, I usually use Apache Sqoop to speed up the job. Would it be possible to implement an operator in Kedro for an end-to-end pipeline?

Context

I will use it as source of my pipeline, which will help me (and surely others) to migrate all my current data pipelines to Kedro

Possible Implementation

Not sure about this, but I assume that wrapping the pysqoop class (https://pypi.org/project/pysqoop/) in an IO class should be enough.
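
For illustration only (this sketch shells out to the sqoop CLI rather than using the real pysqoop API, which I have not verified; the class name is an assumption):

import subprocess
from typing import Any, Dict

from kedro.io import AbstractDataSet


class SqoopImportDataSet(AbstractDataSet):
    """Runs a `sqoop import` so ingestion can be the first step of a Kedro pipeline."""

    def __init__(self, connect: str, table: str, target_dir: str) -> None:
        self._connect = connect
        self._table = table
        self._target_dir = target_dir

    def _load(self) -> str:
        # Run the import and hand the target directory on to downstream nodes
        # (e.g. a SparkDataSet pointed at the same path).
        subprocess.run(
            [
                "sqoop", "import",
                "--connect", self._connect,
                "--table", self._table,
                "--target-dir", self._target_dir,
            ],
            check=True,
        )
        return self._target_dir

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This sketch only supports loading (ingestion).")

    def _describe(self) -> Dict[str, Any]:
        return dict(connect=self._connect, table=self._table, target_dir=self._target_dir)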

pyarrow 0.14.0 in Google Colab is not supported

Description

Kedro requires pyarrow==0.12.0 and does not support pyarrow 0.14.0.

Context

Google Colaboratory is a free cloud-based Jupyter-like environment, but it comes with pyarrow 0.14.0, which is not supported by Kedro.


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): the latest build in develop branch (pre-0.15.0)
  • Python version used (python -V): 3.6.8
  • Operating system and version: Ubuntu 18.04.2 LTS in Google Colab

kedro.io.CSVS3DataSet does not use load_args

Description

When loading a dataframe from S3, the default arguments are used instead of those configured under load_args in catalog.yml.

Context

Trying to load CSV from S3 with custom load_args

Steps to Reproduce

  1. Change catalog.yml to include:

XXX.csv.s3:
  type: CSVS3DataSet  # https://kedro.readthedocs.io/en/latest/kedro.io.CSVS3DataSet.html
  load_args:  # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    sep: '\t'
    encoding: "ISO-8859-1"

  2. Create a pipeline to display the data in XXX.csv.s3, for example:

node(
    read_display,
    ["XXX.csv.s3"],
    None,
)

  3. Run kedro run.
  4. Check if the data is being displayed correctly.

Expected Result

Data should be split by tab

Actual Result

Data is loaded without custom sep.


Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: v0.14
  • Python version used: 3.6.8
  • Operating system and version: MacOSX 10.14.5

DOES NOT HAPPEN in KEDRO v0.13.1.dev53

Checklist

Include labels so that we can categorise your issue

  • Add a "Component" label to the issue
  • Add a "Priority" label to the issue

Smart quotes in the legal disclaimers

Description

There are smart quotes at several spots in the legal disclaimers. Here is one example but all examples can be found by searching the repo:
https://github.com/quantumblacklabs/kedro/blob/d3218bd89ce8d1148b1f79dfe589065f47037be6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/__init__.py#L17

Context

I discovered this while building the Sphinx docs for the Hello World example and getting a "Non-ASCII character" error.

[KED-867] Using `s3:GetObjectAcl` instead of `s3:GetObject` for the CSVS3DataSet

Description

This is a user query; I am creating the issue here for visibility and input from the user. The user was struggling with using the CSVS3DataSet and AWS. He thought the problem was related to the policy of the AWS environment and access to a specific bucket.

Query:

In CSVS3DataSet, at line 124, should the separator be : instead of /? The line in question is:

with self._s3.open("{}/{}".format(self._bucket_name, load_key), mode="rb") as s3_file:
    return pd.read_csv(s3_file, **self._load_args)

Method to fix this error

I changed s3:GetObject to s3:GetObjectAcl and it worked. I suspect the problem was related to access to a specific bucket.

A possible fix is simply flagging in the documentation that you might need to use s3:GetObjectAcl.

What we need help understanding

  • Why "the separator must be ':' instead of '/'" if it's basically a directory path (we are using the s3fs inside https://s3fs.readthedocs.io/en/latest/#examples)
  • Can we get the steps to reproduce this error? And what was the error message?
  • More details about your environment:
    • Kedro version used (pip show kedro or kedro -V)
    • Python version used (python -V)
    • Operating system and version

[KED-949] Kedro-Glue Plugin

Description

Create a plugin to allow for easy conversion and deployment of Kedro pipelines to AWS Glue.

Context

Kedro is great for localized development but currently lacks a way to deploy pipelines on AWS. Kedro-Docker and Kedro-Airflow are great solutions to this problem but it would be great to be able to use AWS native ETL tools.

Possible Implementation

Implementation should be fairly similar to Kedro-Airflow, but with obvious alterations tailored towards the AWS API. The general idea would be to convert Kedro pipeline nodes to CodeGenNode structures, as well as capture the relationships between nodes in CodeGenEdge structures.

For more information see:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-etl-script-generation.html

Template `run.py` uses `typing` but fails `mypy`

Description

There are typing issues in template run.py; e.g. runner is typed as a str but later used as a Runner.

Some of the issues may just require bypassing (e.g. Pathlib / operator).

Context

I'm trying to check typing in my project.

Steps to Reproduce

  1. Create a new kedro project.
  2. Install mypy.
  3. Run mypy. My mypy.ini:
[mypy]
new_semantic_analyzer = True

[mypy-kedro.*,nbstripout,numpy,pandas,pytest,scipy.*]
ignore_missing_imports = True

Expected Result

No errors.

Actual Result

I have to add:

[mypy-cotopaxi.run]
ignore_errors = True

to mypy.ini, else:

(project-cotopaxi) BOS-178551-C02X31K9JHD4:project-cotopaxi deepyaman$ mypy src/cotopaxi src/tests kedro_cli.py 
src/cotopaxi/run.py:80: error: Incompatible types in assignment (expression has type "Path", variable has type "str")
src/cotopaxi/run.py:83: error: Unsupported left operand type for / ("str")
src/cotopaxi/run.py:84: error: Unsupported left operand type for / ("str")
src/cotopaxi/run.py:147: error: "str" not callable
src/cotopaxi/run.py:147: error: "None" not callable

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.13.1.dev53 🤣
  • Python version used (python -V): 3.7.3
  • Operating system and version:

Link locations for example datasets don't point to raw data

Description

The links in the "Adding your datasets to data" section of this page do not point straight at the CSV / xlsx. So, they cannot be downloaded using wget or similar. They could be replaced with links to the raw content, e.g.:

https://raw.githubusercontent.com/quantumblacklabs/kedro/develop/docs/source/03_tutorial/data/reviews.csv

Happy to submit a PR with these changes.

Context

I was trying to download the data for the spaceflights tutorial using wget.

Steps to Reproduce

  1. Do: wget https://github.com/quantumblacklabs/kedro/tree/develop/docs/source/03_tutorial/data/reviews.csv
  2. Inspect reviews.csv, it is html.

Expected Result

I expected the actual CSV to be returned.

Actual Result

I got html instead.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.14.2
  • Python version used (python -V): 3.7.2
  • Operating system and version: macOS High Sierra

Add only-missing option to kedro run command

Description

Runner has the run_only_missing method, which is very useful, especially when a pipeline is being developed incrementally, but the kedro run command doesn't have an option for it. An option that triggers Runner#run_only_missing would be useful.

Context

If kedro run had an option for doing Runner#run_only_missing, we could skip steps that have already been done, which would make pipeline development more productive.

Possible Implementation

Add an argument for the only_missing behavior to the run command in kedro_cli.py, and add a corresponding argument to the run/main function in run.py as well.
(This implementation has already been tried in my local environment, and it's working well.)
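
A minimal sketch of the idea (the flag name and the wiring are assumptions):

import click

from kedro.runner import SequentialRunner


def run_project(pipeline, catalog, only_missing=False):
    # Pick the runner method based on the flag.
    runner = SequentialRunner()
    if only_missing:
        return runner.run_only_missing(pipeline, catalog)
    return runner.run(pipeline, catalog)


@click.command()
@click.option("--only-missing", is_flag=True, help="Only run nodes whose outputs are missing.")
def run(only_missing):
    """Hypothetical replacement for the template's run command body."""
    # In the real template this would first build the pipeline and catalog,
    # then delegate to run_project above.
    ...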

Checklist

Include labels so that we can categorise your issue:

  • [Type: Enhancement]
  • [Priority: Medium]
