In a prior blog, we pointed out that warehouses, known for high-performance data processing for business intelligence, can quickly become expensive for new data and evolving workloads. We also made the case that query and reporting, provided by big data engines such as Presto, need to work with the Spark infrastructure framework to support advanced analytics and complex enterprise data decision-making. To do so, Presto and Spark need to readily work with existing and modern data warehouse infrastructures. Now, let’s chat about why data warehouse optimization is a key value of a data lakehouse strategy.
Value of data warehouse optimization
Since its introduction over a century ago, the gasoline-powered engine has remained largely unchanged. It’s simply been adapted over time to accommodate modern demands such as pollution controls, air conditioning and power steering.
Similarly, the relational database has been the foundation for data warehousing for as long as data warehousing has been around. Relational databases were adapted to accommodate the demands of new workloads, such as the data engineering tasks associated with structured and semi-structured data, and for building machine learning models.
Returning to the analogy, there have been significant changes to how we power cars. We now have gasoline-powered engines, battery electric vehicles (BEVs), and hybrid vehicles. An August 2021 Forbes article referenced a 2021 Department of Energy Argonne National Laboratory publication indicating, “Hybrid electric vehicles (think: Prius) had the lowest total 15-year per-mile cost of driving in the Small SUV category beating BEVs”.
Just as hybrid vehicles help their owners balance the initial purchase price and cost over time, enterprises are attempting to find a balance between high performance and cost-effectiveness for their data and analytics ecosystem. Essentially, they want to run the right workloads in the right environment without having to copy datasets excessively.
Optimizing your data lakehouse architecture
Fortunately, the IT landscape is changing thanks to a mix of cloud platforms, open source and traditional software vendors. The rise of cloud object storage has driven the cost of data storage down. Open-data file formats have evolved to support data sharing across multiple data engines, like Presto, Spark and others. Intelligent data caching is improving the performance of data lakehouse infrastructures.
All these innovations are being adapted by software vendors and accepted by their customers. So, what does this mean from a practical perspective? What can enterprises do different from what they are already doing today? Some use case examples will help. To effectively use raw data, it often needs to be curated within a data warehouse. Semi-structured data needs to be reformatted and transformed to be loaded into tables. And ML processes consume an abundance of capacity to build models.
Organizations running these workloads in their data warehouse environment today are paying a high run rate for engineering tasks that add no additional value or insight. Only the outputs from these data-driven models allow an organization to derive additional value. If organizations could execute these engineering tasks at a lower run rate in a data lakehouse while making the transformed data available to both the lakehouse and warehouse via open formats, they could deliver the same output value with low-cost processing.
Benefits of optimizing across your data warehouse and data lakehouse
Optimizing workloads across a data warehouse and a data lakehouse by sharing data using open formats can reduce costs and complexity. This helps organizations drive a better return on their data strategy and analytics investments while also helping to deliver better data governance and security.
And just as a hybrid car allows car owners to get greater value from their car investment, optimizing workloads across a data warehouse and data lakehouse will allow organizations to get greater value from their data analytics ecosystem.
Discover how you can optimize your data warehouse to scale analytics and artificial intelligence (AI) workloads with a data lakehouse strategy.