
deep-learning-pyspark's Introduction

Deep Learning with Pyspark

Hi everyone and welcome back to learning :). In this article I’ll continue the discussion on Deep Learning with Apache Spark. You can see the preparation part for this repo and code here.

Here I will focus entirely on the DL pipelines library and how to use it from scratch.

Apache Spark Timeline

The continuous improvements to Apache Spark have led us to this discussion on how to do Deep Learning with it. I created a detailed timeline of the development of Apache Spark up to now to show how we got here.

Deep Learning Pipelines



Deep Learning Pipelines is an open source library created by Databricks that provides high-level APIs for scalable deep learning in Python with Apache Spark.

It is an awesome effort and it won’t be long until it is merged into the official API, so it’s worth taking a look at it.

Some of the advantages of this library compared to other libraries that join Spark with DL are:

  • In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
  • It focuses on ease of use and integration, without sacrificing performance.
  • It’s built by the creators of Apache Spark (who are also its main contributors), so it’s more likely than others to be merged as an official API.
  • It is written in Python, so it integrates with all of Python’s famous libraries; right now it uses the power of TensorFlow and Keras, the two main libraries of the moment for DL.

Deep Learning Pipelines builds on Apache Spark’s ML Pipelines for training, and on Spark DataFrames and SQL for deploying models. It includes high-level APIs for common aspects of deep learning, so they can be done efficiently in a few lines of code:

  • Image loading
  • Applying pre-trained models as transformers in a Spark ML pipeline
  • Transfer learning
  • Applying Deep Learning models at scale
  • Distributed hyperparameter tuning (next part)
  • Deploying models in DataFrames and SQL

I will describe each of these features in detail with examples. These examples come from the official notebook by Databricks.
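As a quick taste of the first two items, here is a minimal sketch of loading images and applying a pre-trained model as a transformer with sparkdl (the flower_photos/sample/ path is just an example folder of images; adjust it to your environment):

from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

# Image loading: read a folder of images into a DataFrame with an "image" column
image_df = ImageSchema.readImages("flower_photos/sample/")

# Pre-trained model as a transformer: InceptionV3, keeping the 10 most likely labels
predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                               modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)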

Apache Spark on Deep Cognition

To run and test the code in this article you will need to create an account on Deep Cognition.

It’s very easy, and then you can access all of their features. When you log in, this is what you should see:

Now just click on the left part, the Notebook button:

And you will be in the Jupyter Notebook with all the packages installed :). Oh! A note here: the Spark Notebook (DLS SPARK) is an upcoming feature that will be released to the public sometime next month, and I should tell you that it is still in private beta (I got access just for this post).

You can see the full code in the Notebook here:

https://github.com/FavioVazquez/deep-learning-pyspark/blob/master/SparkDL.ipynb
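To give you an idea of what’s in there, the transfer learning part combines a sparkdl featurizer with a plain Spark ML classifier. A minimal sketch, assuming a train_df DataFrame of images with a numeric "label" column (both names are placeholders for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer

# Transfer learning: InceptionV3 acts as a fixed feature extractor,
# and a simple logistic regression is trained on top of its features
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train_df)

The notebook also shows deploying a model to SQL; roughly, you register a Keras model as a UDF and then call it from a query (the UDF name here is an arbitrary choice):

from keras.applications import InceptionV3
from sparkdl.udf.keras_image_model import registerKerasImageUDF

# Deploying in SQL: expose the model as a UDF callable from Spark SQL
registerKerasImageUDF("inceptionV3_udf", InceptionV3(weights="imagenet"))
# e.g. SELECT image, inceptionV3_udf(image) AS predictions FROM my_image_table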

Soon I’ll discuss Distributed Hyperparameter Tuning with Spark, and will try new models and examples :). So stay tuned!

deep-learning-pyspark's People

Contributors

faviovazquez


deep-learning-pyspark's Issues

as_list() is not defined on an unknown TensorShape.

Hello, I run this code:

from pyspark.ml.image import ImageSchema  # import needed for readImages (Spark 2.3+)
from sparkdl import DeepImagePredictor

image_df = ImageSchema.readImages("flower_photos/sample/")

predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)

and I get this error:

ValueError Traceback (most recent call last)
in ()
4
5 predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
----> 6 predictions_df = predictor.transform(image_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params)
171 return self.copy(params)._transform(dataset)
172 else:
--> 173 return self._transform(dataset)
174 else:
175 raise ValueError("Params must be a param map but got %s." % type(params))

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/named_image.py in _transform(self, dataset)
94 modelName=self.getModelName(),
95 featurize=False)
---> 96 transformed = transformer.transform(dataset)
97 if self.getOrDefault(self.decodePredictions):
98 return self._decodeOutputAsPredictions(transformed)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params)
171 return self.copy(params)._transform(dataset)
172 else:
--> 173 return self._transform(dataset)
174 else:
175 raise ValueError("Params must be a param map but got %s." % type(params))

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/named_image.py in _transform(self, dataset)
327 outputMode=modelGraphSpec["outputMode"])
328 resizeUdf = createResizeImageUDF(modelGraphSpec["inputTensorSize"])
--> 329 result = tfTransformer.transform(dataset.withColumn(resizedCol, resizeUdf(inputCol)))
330 return result.drop(resizedCol)
331

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in transform(self, dataset, params)
171 return self.copy(params)._transform(dataset)
172 else:
--> 173 return self._transform(dataset)
174 else:
175 raise ValueError("Params must be a param map but got %s." % type(params))

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/tf_image.py in _transform(self, dataset)
145 "width": "__sdl_image_width",
146 "num_channels": "__sdl_image_nchannels",
--> 147 "image_buffer": "__sdl_image_data"})
148 .drop("__sdl_image_height", "__sdl_image_width", "__sdl_image_nchannels",
149 "__sdl_image_data")

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_tensorframes-0.3.0-s_2.11.jar/tensorframes/core.py in map_rows(fetches, dframe, feed_dict, initial_variables)
262 if isinstance(dframe, pd.DataFrame):
263 return _map_pd(fetches, dframe, feed_dict, block=False, trim=None, initial_variables=initial_variables)
--> 264 return _map(fetches, dframe, feed_dict, block=False, trim=None, initial_variables=initial_variables)
265
266 def map_blocks(fetches, dframe, feed_dict=None, trim=False, initial_variables=_initial_variables_default):

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_tensorframes-0.3.0-s_2.11.jar/tensorframes/core.py in _map(fetches, dframe, feed_dict, block, trim, initial_variables)
150 builder = _java_api().map_rows(dframe._jdf)
151 _add_graph(graph, builder)
--> 152 ph_names = _add_shapes(graph, builder, fetches)
153 _add_inputs(builder, feed_dict, ph_names)
154 jdf = builder.buildDF()

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_tensorframes-0.3.0-s_2.11.jar/tensorframes/core.py in _add_shapes(graph, builder, fetches)
83 t = graph.get_tensor_by_name(op_name + ":0")
84 ph_names.append(t.name)
---> 85 ph_shapes.append(_get_shape(t))
86 logger.info("fetches: %s %s", str(names), str(shapes))
87 logger.info("inputs: %s %s", str(ph_names), str(ph_shapes))

/tmp/spark-ad4dff5c-2eda-49cc-b5a8-b81ede3fac4b/userFiles-d4fb97fa-84ed-4aaf-8036-d6d38363201e/databricks_tensorframes-0.3.0-s_2.11.jar/tensorframes/core.py in _get_shape(node)
36
37 def _get_shape(node):
---> 38 l = node.get_shape().as_list()
39 return [-1 if x is None else x for x in l]
40

~/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
898 """
899 if self._dims is None:
--> 900 raise ValueError("as_list() is not defined on an unknown TensorShape.")
901 return [dim.value for dim in self._dims]
902

ValueError: as_list() is not defined on an unknown TensorShape.

Any help please! Thanks.

AttributeError: 'NoneType' object has no attribute 'mode'

Hey Favio,

I'm replicating your Deep Learning with Apache Spark setup, currently on EMR; I made sure all the requirements are on the same package versions. I'm currently frozen on the section about deploying via KerasImageFileTransformer.

from pyspark.sql.types import StringType  # import needed for createDataFrame below

fs = !ls flower_photos/sample/*.jpg
uri_df = spark.createDataFrame(fs, StringType()).toDF("uri")
keras_pred_df = transformer.transform(uri_df)  # transformer: the KerasImageFileTransformer built earlier

Here's the stack trace:

/home/hadoop/conda/envs/sparky3/lib/python3.5/site-packages/keras/engine/saving.py:269: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '

INFO:tensorflow:Froze 378 variables.
Converted 378 variables to const ops.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-29a3a2a6e7dd> in <module>()
----> 1 keras_pred_df = transformer.transform(uri_df)

/usr/lib/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    171                 return self.copy(params)._transform(dataset)
    172             else:
--> 173                 return self._transform(dataset)
    174         else:
    175             raise ValueError("Params must be a param map but got %s." % type(params))

/mnt/tmp/spark-81145cc6-c291-4b6c-8bf5-88fb669b6370/userFiles-7db57568-a33b-4cd9-b761-5e2d4ae5f580/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/keras_image.py in _transform(self, dataset)
     67                                              outputTensor=outputTensorName,
     68                                              outputMode=self.getOrDefault(self.outputMode))
---> 69             return transformer.transform(image_df).drop(self._loadedImageCol())

/usr/lib/spark/python/pyspark/ml/base.py in transform(self, dataset, params)
    171                 return self.copy(params)._transform(dataset)
    172             else:
--> 173                 return self._transform(dataset)
    174         else:
    175             raise ValueError("Params must be a param map but got %s." % type(params))

/mnt/tmp/spark-81145cc6-c291-4b6c-8bf5-88fb669b6370/userFiles-7db57568-a33b-4cd9-b761-5e2d4ae5f580/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/tf_image.py in _transform(self, dataset)
    126     def _transform(self, dataset):
    127         graph = self.getGraph()
--> 128         composed_graph = self._addReshapeLayers(graph, self._getImageDtype(dataset))
    129         final_graph = self._stripGraph(composed_graph)
    130         with final_graph.as_default():  # pylint: disable=not-context-manager

/mnt/tmp/spark-81145cc6-c291-4b6c-8bf5-88fb669b6370/userFiles-7db57568-a33b-4cd9-b761-5e2d4ae5f580/databricks_spark-deep-learning-1.0.0-spark2.3-s_2.11.jar/sparkdl/transformers/tf_image.py in _getImageDtype(self, dataset)
    166         pdf = dataset.select(self.getInputCol()).take(1)
    167         img = pdf[0][self.getInputCol()]
--> 168         img_type = imageIO.imageTypeByOrdinal(img.mode)
    169         return img_type.dtype
    170 

AttributeError: 'NoneType' object has no attribute 'mode'

I'm aware that running the compile() method is an optimization step prior to saving this model, so I haven't dug around there.

Stumped on how to debug further, any pointers would be appreciated.

How can I load a trained model .pt file in pyspark

Could you please help me with how to load my trained model's .pt file in PySpark for job scheduling? I have not seen any resources regarding this, and I don't have a CSV file to work with. Could you share any resources or code showing how to actually load a .pt file in PySpark and investigate further? Thank you.
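One common pattern, not specific to this repo, is to load the PyTorch model once on the driver, broadcast it, and score with a pandas UDF. A minimal sketch, assuming Spark 3.x-style pandas UDFs, a model saved with torch.save(model, "model.pt"), and a single numeric feature column (all names here are placeholders):

import pandas as pd
import torch
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()

model = torch.load("model.pt", map_location="cpu")  # load once on the driver
model.eval()
bc_model = spark.sparkContext.broadcast(model)      # make it available on the executors

@pandas_udf(FloatType())
def predict(features: pd.Series) -> pd.Series:
    m = bc_model.value
    with torch.no_grad():
        # shape (batch, 1); adapt the reshaping to your model's expected input
        x = torch.tensor(features.to_numpy(), dtype=torch.float32).unsqueeze(1)
        return pd.Series(m(x).squeeze(1).numpy())

# df.withColumn("prediction", predict("feature_col")) then scores the DataFrame at scale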
