All About Spark AI Summit 2020

Introduction

The Spark + AI Summit is the world's largest conference on data and machine learning. It is a unique opportunity for developers, data engineers, data scientists, and decision-makers to collaborate at the intersection of data and ML. Attendees hear about new developments in Apache Spark and in ML technologies such as TensorFlow, MLflow, and PyTorch, as well as best practices for real-world business AI.

Spark 3.0 Optimizations for Spark SQL

In the opening keynote, Databricks CTO Matei Zaharia noted that 90 percent of Spark API calls run through the Spark SQL engine, and that 46 percent of the patches contributed by the Apache Spark community go into enhancing Spark SQL. Spark 3.0 is around 2x faster than Spark 2.4 (measured on TPC-DS), enabled by Adaptive Query Execution, dynamic partition pruning, and other enhancements.

Spark SQL Adaptive Query Execution

Spark 2.2 added cost-based optimization to the existing rule-based SQL optimizer. Spark 3.0 adds runtime Adaptive Query Execution (AQE): runtime statistics collected from completed stages of a query are used to re-optimize the execution plan for the remaining stages. With AQE enabled, Databricks tests showed speed-ups ranging from 1.1x to 8x. (A configuration sketch appears in the appendix below.)

Spark SQL Dynamic Partition Pruning

Static predicate pushdown and partition pruning in Spark 2.x are performance enhancements that limit the number of files and partitions Spark reads when querying. Once records are partitioned, queries whose filters match the partition columns become more efficient because Spark reads only a subset of the directories and files. Spark 3.0's dynamic partition pruning extends this to filters that are only known at runtime, such as those derived from the other side of a join. (A worked example appears in the appendix below.)

Spark 3.0 GPU Acceleration

In the Deep Dive into GPU Support in Apache Spark 3.x session, Robert Evans and Jason Lowe gave an overview of accelerator-aware scheduling and the RAPIDS Accelerator for Apache Spark, which enables GPU-accelerated SQL/DataFrame operations and Spark shuffles without code changes.

Accelerator-aware scheduling

Spark can now schedule executors with a specified number of GPUs, and users can specify the number of GPUs each task requires. Spark passes these resource requests on to the underlying cluster manager: Kubernetes, YARN, or Standalone. Users can also supply a discovery script that detects which GPUs the cluster manager has allocated. (A configuration sketch appears in the appendix below.)

Accelerated SQL/DataFrame

Spark 3.0 supports SQL optimizer plugins that process data in columnar batches rather than rows. Columnar data is GPU-friendly, and the RAPIDS Accelerator hooks into this functionality to accelerate SQL and DataFrame operators. (A configuration sketch appears in the appendix below.)

Accelerated Shuffle

Spark operations that sort, group, or join data by value move data between partitions when creating a new DataFrame between stages, in a process called a shuffle that involves disk I/O, data serialization, and network I/O. The RAPIDS Accelerator also provides an accelerated shuffle implementation built on UCX that can keep data on the GPU and exploit fast interconnects such as NVLink and RDMA. (A sketch appears in the appendix below.)

Accelerated end-to-end ML and DL

Horovod allows TensorFlow and PyTorch models to be trained directly on Spark DataFrames, exploiting Horovod's ability to scale training in parallel to hundreds of GPUs without any specialized distributed-processing code. With Apache Spark 3.0's accelerator-aware scheduling and columnar processing APIs, a production ETL job can hand off data within the same pipeline to Horovod running distributed deep learning training on GPUs. (A sketch using Horovod's Spark Estimator API appears in the appendix below.)
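Appendix: Configuration and Code Sketches

The following PySpark sketches illustrate the features above. They are illustrative examples, not code from the summit sessions.

AQE sketch. A minimal session with Adaptive Query Execution turned on (AQE is off by default in Spark 3.0); the two extra flags are real Spark 3.0 settings for runtime partition coalescing and skew-join handling:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Re-optimize the remaining query stages using runtime statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce many small shuffle partitions after a stage completes.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions detected at runtime before joining.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```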
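Dynamic partition pruning sketch. A hedged example assuming a fact table partitioned by date and a small dimension table; the table names, paths, and columns are hypothetical. The filter on the dimension side is translated at runtime into a partition filter on the fact table, so only the matching date partitions are read:

```python
# Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical data: `sales` is partitioned on disk by `date`.
sales = spark.read.parquet("/data/sales")
dates = spark.read.parquet("/data/dates")

result = (
    sales.join(dates, "date")
         .where(dates["is_holiday"] == True)
         .groupBy("date")
         .count()
)
# The physical plan should show a dynamic pruning subquery on the scan of `sales`.
result.explain()
```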
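Accelerator-aware scheduling sketch. The resource configs below are the Spark 3.0 keys for requesting GPUs from the cluster manager; the discovery-script path is an assumption (Spark ships a sample getGpusResources.sh in its examples directory):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-scheduling-demo")
    # Ask the cluster manager (YARN, Kubernetes, or Standalone) for 1 GPU per executor.
    .config("spark.executor.resource.gpu.amount", "1")
    # Each task requires 1 GPU.
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports the GPU addresses allocated to the executor
    # (path is an assumption; adjust to your deployment).
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/examples/src/main/scripts/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPUs can be inspected:
def use_gpus(partition):
    gpus = TaskContext.get().resources()["gpu"].addresses  # e.g. ["0"]
    for row in partition:
        yield row  # placeholder: run GPU work against `gpus` here

spark.sparkContext.parallelize(range(4), 4).mapPartitions(use_gpus).collect()
```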
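RAPIDS Accelerator sketch. Loading the plugin via spark.plugins is the documented mechanism; the jar setup is an assumption (the rapids-4-spark and cudf jars must be on the driver and executor classpath, e.g. via --jars at submit time):

```python
spark = (
    SparkSession.builder
    .appName("rapids-sql-demo")
    # Load NVIDIA's RAPIDS Accelerator plugin.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Toggle GPU execution of SQL/DataFrame operations.
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Supported operators now execute on the GPU in columnar batches;
# unsupported ones transparently fall back to the CPU.
spark.range(0, 10_000_000).selectExpr("id % 100 AS k", "id AS v") \
     .groupBy("k").sum("v").explain()  # look for Gpu* operators in the plan
```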
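Accelerated shuffle sketch. The operations below each force a shuffle boundary between stages; the RapidsShuffleManager class name in the comment follows the pattern of the early RAPIDS releases for Spark 3.0.0 and should be treated as a version-specific assumption:

```python
# Any of these creates a shuffle (disk I/O, serialization, network I/O):
df = spark.range(0, 1_000_000).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").count()                   # group by value
df.orderBy("v")                           # global sort
df.join(df.select("k").distinct(), "k")   # join by key

# The RAPIDS UCX-based accelerated shuffle is enabled via the shuffle manager,
# e.g. (version-specific class name, an assumption):
# spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager
```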
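Horovod Spark Estimator sketch. A minimal Keras example using Horovod's Spark Estimator API, assuming a train_df DataFrame with "features" and "label" columns; the column names, model shape, and store path are hypothetical:

```python
import tensorflow as tf
import horovod.spark.keras as hvd_keras
from horovod.spark.common.store import Store

# Shared storage for intermediate training data and checkpoints (path assumed).
store = Store.create("/tmp/horovod-store")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

estimator = hvd_keras.KerasEstimator(
    num_proc=4,                      # parallel Horovod workers, e.g. one per GPU
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss="mse",
    feature_cols=["features"],
    label_cols=["label"],
    batch_size=32,
    epochs=5,
)

# fit() distributes training over the Spark executors and returns a
# Spark ML model usable for inference with transform().
keras_model = estimator.fit(train_df)
predictions = keras_model.transform(test_df)
```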