This PySpark tutorial covers different queries you can run in a notebook using PySpark. It is a very useful tool for analyzing large amounts of data.
PySpark is the Python API for Apache Spark, used to process large datasets across a distributed cluster. It lets you write Python applications that use Apache Spark's capabilities.
PySparkSQL is a PySpark library for applying SQL-like analysis to huge amounts of structured or semi-structured data. We can also run SQL queries directly with PySparkSQL. It can connect to Apache Hive, and HiveQL can be applied as well. PySparkSQL is a wrapper over the PySpark core; it introduced the DataFrame, a tabular representation of structured data similar to a table in a relational database management system.
Batch processing
PySpark RDDs and DataFrames are used to build batch pipelines that require high throughput.
Real-time processing
PySpark Streaming is used for real-time processing.
Machine Learning
PySpark ML and MLlib are used for machine learning.
Graph processing
GraphFrames is used for graph processing in PySpark (Spark's GraphX API is available only from Scala and Java; GraphFrames provides DataFrame-based graph processing for Python).