VP of Open Source and Founding Member @Alluxio
Migrating Presto workloads from a fully on-premise environment to cloud infrastructure has numerous benefits, including alleviating resource contention and reducing costs by paying for computation resources on an on-demand basis. In the case of Presto running on data stored in HDFS, the separation of compute in the cloud and storage on-premises is apparent since Presto’s architecture enables the storage and compute components to operate independently. The critical issue in this hybrid environment of Presto in the cloud retrieving HDFS data from an on-premise environment is the network latency between the two clusters.
This crucial bottleneck severely limits performance of any workload since a significant portion of its time is spent transferring the requested data between networks that could be residing in geographically disparate locations. As a result, most companies copy their data into a cloud environment and maintain that duplicate data, also known as Lift and Shift. Companies with compliance and data sovereignty requirements may even prevent organizations from copying data into the cloud. This approach is not scalable and requires introducing a lot of manual effort to achieve reasonable results.
This article introduces Alluxio to serve as a data orchestration layer to help serve data to Presto efficiently, as opposed to either directly querying the distant HDFS cluster or manually providing a localized copy of the data to Presto in a cloud cluster.
In the following architecture diagram, both Presto and Alluxio processes are co-located in the cloud cluster. As far as Presto is concerned, it is querying for and writing data to Alluxio as if it were a co-located HDFS cluster. When Alluxio receives a request for data, it fetches the data from the remote HDFS cluster initially, but subsequent requests will be served directly from its cache.
When Presto sends data to be persisted into storage, Alluxio asynchronously writes data to HDFS, freeing the Presto workload from needing to wait for the remote write to complete. In both read and write scenarios, with the exception of the initial read, a Presto workload is able to run at the same, if not faster, performance as if it were in the same network as the HDFS cluster. Note that besides the deployment and configuration of Alluxio and establishing the connection between Presto and Alluxio, there is no additional configuration or other manual efforts needed to maintain the hybrid environment.
For benchmarking, we will use Presto as the compute framework to run SQL queries on data in a geographically separated Hive and HDFS cluster.
The hybrid cloud environment used for experimentation in this section includes two Amazon EMR clusters in different AWS regions. Because the two clusters are geographically dispersed, there is noticeable network latency between the clusters. VPC peering is used to create VPC connections to allow traffic between the two VPCs over the global AWS backbone with no bandwidth bottleneck. Readers can follow the tutorial in the whitepaper to reproduce the benchmark results if using AWS as the cloud provider.
We use data and queries from the industry standard TPC-DS benchmark for decision support systems that examine large amounts of data and answer business questions.
We categorize a subset of the TPC-DS queries into the following classes (according to visualizations in this repository): Reporting, Interactive, and Deep Analytics.
With Alluxio, we collect two numbers for all TPC-DS queries; denoted by Cold and Warm.
With HDFS, we collect two numbers as well; Local and Remote.
We compare the performance of Alluxio (Cold and Warm) with HDFS (Local and Remote). Benchmarking shows an average of 3x improvement in performance with Alluxio when the cache is warm over accessing HDFS data remotely.
Out of the 104 queries available in this repository, q72 did not finish when accessing data remotely and took over 5 hours to finish when the data is local. With that exception, the results in the whitepaper include all of the remaining 103 queries.
The following table summarizes the results by class. Overall the maximum improvement seen with Alluxio was for q9 (7.1x) and the minimum was for q39a (1x - no difference).
With a 10 node compute cluster, the peak bandwidth utilization throughout running all the queries remained under 2Gbps when accessing data from the geographically separated cluster. Bandwidth was not the bottleneck with the AWS backbone network. As the utilization scales with the size of the compute cluster, a bandwidth bottleneck could be expected for larger clusters when not using Alluxio since the bandwidth available with Direct Connect may be limited.
Most of the performance gain seen with Alluxio is explained by the latency difference for both metadata and data, when cached seamlessly into the localized Alluxio cluster.
The hybrid cloud architecture allows cloud computing resources to be used for data analytics, even if the data resides in a completely different network. In addition to achieving significantly better performance, the execution plan outlined does not require any significant reconfiguration of the on-premise infrastructure.
Since users can harness the compute power of a public cloud, this opens up more opportunities for Presto to be utilized as a scalable and performant compute framework for analytics using data stored on-premises.
This article was originally published on PrestoDB's Engineering Blog written by Adit Madan on our team.
An in-depth whitepaper, “Zero-Copy” Hybrid Cloud for Data Analytics - Strategy, Architecture, and Benchmark Report, was originally published by Alluxio on the Alluxio Engineering Blog on April 6, 2020.
Level up your reading game by joining Hacker Noon now!