This PySpark tutorial covers different queries you can run in a notebook using PySpark. It is a very useful tool for analyzing large amounts of data.
PySpark is the Python API for Apache Spark, used to process large datasets across a distributed cluster. It lets you write Python applications that use Apache Spark's capabilities.
PySparkSQL is a PySpark library for applying SQL-like analysis to huge amounts of structured or semi-structured data. We can also run SQL queries directly with PySparkSQL. It can connect to Apache Hive, and HiveQL can be applied as well. PySparkSQL is a wrapper over the PySpark core; it introduced the DataFrame, a tabular representation of structured data similar to a table in a relational database management system.
Batch processing
PySpark RDDs and DataFrames are used to build batch pipelines that require high throughput.
Real-time processing
PySpark Streaming is used for real-time processing.
Machine Learning
PySpark ML and MLlib are used for machine learning.
Graph processing
GraphFrames is used for graph processing in PySpark (Spark's GraphX API is available only from Scala and Java; GraphFrames provides DataFrame-based graph processing for Python).