All About Spark AI Summit 2020
Introduction
The Spark + AI Summit is the world's largest conference for data and machine learning. Collaborating at the intersection of data and ML makes it a unique experience for developers, data engineers, data scientists, and decision-makers.
Participants will hear about new developments in Apache Spark and ML technologies such as TensorFlow, MLflow, and PyTorch, as well as best practices for applying AI to real-world business problems.
Spark 3.0 Optimizations for Spark SQL
In the opening keynote, Databricks CTO Matei Zaharia noted that 90 percent of Spark API calls run through the Spark SQL engine, which is why 46 percent of the Apache Spark community's patches go into enhancing Spark SQL.
Spark 3.0 is around 2x faster than Spark 2.4 (measured on TPC-DS), enabled by adaptive query execution, dynamic partition pruning, and other enhancements.
Spark SQL Adaptive Query Execution
Spark 2.2 added cost-based optimization to the SQL optimizer, which had originally been rule-based. Spark 3.0 now adds runtime Adaptive Query Execution (AQE).
AQE uses runtime statistics collected from completed stages of a query to re-optimize the execution plan for the remaining stages. In Databricks tests using AQE, speed-ups ranged from 1.1x to 8x.
Source: itnext
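As a minimal sketch in PySpark: AQE ships with Spark 3.0 but is off by default, so it has to be switched on explicitly. The configuration keys below are Spark 3.0's documented AQE settings; the app name is illustrative.

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable Adaptive Query Execution in Spark 3.0,
# where it is off by default.
spark = (
    SparkSession.builder
    .appName("aqe-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    # Optional AQE features: coalesce small shuffle partitions and
    # mitigate skewed joins using runtime statistics.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```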
Spark SQL Dynamic Partition Pruning
Static predicate pushdown and partition pruning in Spark 2.x are performance enhancements that limit the number of files and partitions Spark reads when querying.
Once records are partitioned, queries that filter on the partition columns gain efficiency because Spark reads only a subset of the directories and files.
Source: Databricks
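A hedged sketch of what dynamic partition pruning does at runtime: a filter derived from the dimension side of a join is used to skip partitions of a partitioned fact table. The table names, column names, and paths below are hypothetical; the feature itself is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, which defaults to true in Spark 3.0.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-sketch").getOrCreate()

# Hypothetical tables: `sales` is a fact table partitioned by `day`;
# `dates` is a small dimension table. Spark 3.0 derives a runtime filter
# from the `dates` side of the join and prunes `sales` partitions with it.
sales = spark.read.parquet("/warehouse/sales")  # hypothetical path, partitioned by `day`
dates = spark.read.parquet("/warehouse/dates")  # hypothetical path

holiday_sales = (
    sales.join(dates, sales.day == dates.day)
         .where(dates.is_holiday == True)
)
holiday_sales.explain()  # the `sales` scan should show a dynamic pruning expression
```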
Spark 3.0 GPU Acceleration
In the Deep Dive into GPU Support in Apache Spark 3.x session, Robert Evans and Jason Lowe gave an overview of accelerator-aware scheduling and the RAPIDS Accelerator for Apache Spark, which enable GPU-accelerated SQL/DataFrame operations and Spark shuffles without code changes.
Source: Medium.com
Accelerator-aware scheduling
Accelerator-aware scheduling allows Spark to schedule executors with a specified number of GPUs, and users can specify the number of GPUs each task requires. Spark passes these resource requests on to the underlying cluster manager: Kubernetes, YARN, or Standalone.
Users can also supply a discovery script that detects which GPUs the cluster manager has allocated.
Source: pixabay
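A minimal sketch of these resource requests in PySpark, assuming a GPU-equipped cluster. The configuration keys are Spark 3.0's resource-scheduling settings; the discovery-script path is hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch: request one GPU per executor and one GPU per task,
# and point Spark at a discovery script that reports which GPUs the
# cluster manager allocated. The script path here is hypothetical.
spark = (
    SparkSession.builder
    .appName("gpu-scheduling-sketch")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")
    .getOrCreate()
)

# Within a running task, the assigned GPU addresses can be inspected via:
#   from pyspark import TaskContext
#   TaskContext.get().resources()["gpu"].addresses
```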
Accelerated SQL/DataFrame
Spark 3.0 supports SQL optimizer plugins that process data in columnar batches rather than rows. Columnar data is GPU-friendly, and the RAPIDS Accelerator plugs into this functionality to accelerate SQL and DataFrame operators.
Source: pixabay
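A hedged sketch of enabling the plugin, assuming the RAPIDS Accelerator and cuDF jars are already on the classpath and the executors have GPUs; the configuration keys follow NVIDIA's documentation for the accelerator.

```python
from pyspark.sql import SparkSession

# Minimal sketch: load the RAPIDS Accelerator as a Spark plugin so that
# supported SQL/DataFrame operators run on the GPU in columnar batches.
spark = (
    SparkSession.builder
    .appName("rapids-sql-sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Unmodified DataFrame code; supported operators are translated to GPU ops.
spark.range(10_000_000).selectExpr("id % 7 AS k").groupBy("k").count().show()
```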
Accelerated Shuffle
Spark operations that sort, group, or join data by value must move data between partitions when creating a new DataFrame from an existing one between stages. This process, called a shuffle, involves disk I/O, data serialization, and network I/O.
Source: spark.ai (2018)
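For illustration, a small PySpark sketch of an operation that forces a shuffle:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

# groupBy repartitions rows by key between stages: a shuffle involving
# disk I/O, data serialization, and network I/O.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 10)
counts = df.groupBy("key").count()
counts.explain()  # the plan shows an Exchange node where the shuffle happens
```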
Accelerated End-to-End ML and DL
Horovod's Spark integration allows TensorFlow and PyTorch models to be trained directly on Spark DataFrames, exploiting Horovod's ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed processing.
With the latest Apache Spark 3.0 accelerator-aware scheduling and columnar processing APIs, an ETL job can hand data within the same pipeline to Horovod, which runs distributed deep learning training on GPUs.
Source: humancentered.ai
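A hedged sketch of the Horovod-on-Spark entry point, assuming horovod[spark] and TensorFlow are installed on the cluster; the training body is deliberately elided and the worker count is illustrative.

```python
import horovod.spark

def train():
    # Runs on each Spark executor; one Horovod worker per process.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # ... build a Keras model, scale the learning rate by hvd.size(),
    # wrap the optimizer with hvd.DistributedOptimizer(), and fit ...

horovod.spark.run(train, num_proc=4)  # num_proc: number of parallel workers
```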