Building a Data Lake and Data Warehouse in AWS
Neal Achord, Red Pill Consultant
Across any of the platforms, the first thing you need is an object store.
The challenge from a data warehouse perspective is to get things flowing to the data lake first and then into the warehouse. There’s lots of tension there.
Amazon has a strong foothold in the data lake space.
There’s an excellent object store in S3 (Amazon Simple Storage Service) that can handle various types of data as an object. It’s robust and can handle ample capacity. You can make logical tables out of folders and create hierarchies, such as “this is my logical table entity X” and I have it down to year, month and day folder under it, so you have the ability to partition data at volume.
Many people use Redshift to import that data from S3.
You also have the option for Amazon Redshift with Spectrum added, which allows you to write relational queries inside of Redshift SQL against S3 data. Redshift Spectrum can hit both Redshift relational data, as well as writing ANSI SQL statements against S3 objects. You also have the option to query S3 data using AWS Athena. Using Athena, the S3 data is registered in the AWS Glue catalog. The Glue catalog contains the metadata about the S3 data (logical S3 tables, schema, schema versions, etc.). Athena references these catalog objects in its SQL queries. So, you have many options to be able to bulk load data into S3 and query it.
If you’re building a data lake, you will want to structure your S3 areas into logical tables. You’ll also need something to catalog your S3 data such as the AWS Glue catalog.
AWS glue has a catalog of its own. There are other options for the catalog such as a hive metastore. Data crawlers can help detect schema versioning(changes). You want to logically organize your data sources in S3 by source — by entity — using basically folder structures. That’s all your S3 objects. If you want SQL access for ease of access, you’re going to need something like Athena, Redshift or Redshift Spectrum to be able to query those. Athena and Redshift Spectrum for S3 object queries. Redshift for traditional SQL queries.
Stewart Bryson, CEO of Red Pill Analytics
The real magic comes in (for Amazon and Google) with the concept of catalog. In all the years we’ve been building data platforms on-prem, the catalog has never been that important; mostly because everything was built in a single database and you relied on that database’s capabilities to provide a platform.
When you look at a modern analytics platform data lake, you need something that covers everything — from data lake to various use cases — and gives you an idea of what data is available and where it’s been loaded. AWS Glue is really good at that and the concept of crawling. When you go into a tool to design something, it’s good for those tools to have an opinion, not a blank slate. Crawling your resources is something that Amazon and Glue do very well. Is it flawless? No. We have projects that reside primarily in that set of tools and it’s not perfect. But it’s great when you go in to build a data process that you don’t have to use that reverse-engineer step that we’ve had to with legacy tools.
If you’re firmly planted in Amazon, consider Snowflake. Snowflake is an answer for almost any cloud situation.
I wouldn’t invest too heavily in Redshift at this point. If you’re running Redshift workloads, don’t abandon them, but if you’re thinking about new workloads I wouldn’t go Redshift. If I wanted to go pure AWS, I’d think about Athena and Redshift Spectrum. You’ll probably see that displace Redshift over the years. Pure Athena has an architecture that mimics Snowflake. Today, if you’re going into Amazon with no real preference and you’re trying to migrate from an on-prem data warehouse, you’re going to be much happier with Snowflake. It’s going to look and feel a lot more like what you’re used to – running in the cloud with better workloads. From an Amazon perspective, definitely consider Snowflake.
With that, you have two options when you start talking about a Snowflake data lake/data warehouse:
1. Keeping your data lake in Amazon and using all of the tools that Neal just talked about, but then caching that data into Snowflake.
2. Push your entire data lake and everything into Snowflake.
There are definitely pros and cons. If this is your first data lake, the pros that Snowflake offers for giving you that entire consolidation is probably the right approach. If you’re thinking about Snowflake being one target of your data lake, and you want to build multiple targets, other data stores for different workloads, you’d probably want to keep your data lake in S3 and then just caching it into Snowflake.
Stewart is the CEO and Co-founder of Red Pill Analytics and has been working with data and analytics since 1996. He is an internationally recognized writer and speaker, an Oracle Groundbreaker Ambassador, a Snowflake Data Hero, and an Oracle ACE Director.
Neal has 12 years of Technical Architecture and Business Intelligence analytics experience specializing in Data Warehousing, Business Intelligence and Analytics Capabilities. Neal is a consultant at Red Pill Analytics.
Prefer to listen? Watch the Stump The Experts webinar on our Youtube channel.
Is your organization adapting to remain competitive? In the Architect Room, we design a roadmap for the future of your digital organization, while still capitalizing on current technology investments. Our knowledge spans a variety of offerings across all of the major public cloud providers. Visit Red Pill Analytics or feel free to reach out on Twitter, Facebook, and LinkedIn.