Loading Specific Columns from DynamoDB into Redshift
The easy way
Before I learned how to load only the columns I needed from DynamoDB into Redshift, I created an entire glue job to load a single table. I’d like to prevent you from making the same mistake.
But to understand how much easier the easy way is, you will first need to understand how I went about selecting the specific DynamoDB columns the hard way.
The hard way
This is the process that is needed to load dynamo columns into Redshift using glue:
- Creating and configuring a glue crawler
- Populating the glue catalog with the DynamoDB table
- Creating a glue job with the proper connections.
Writing a spark-SQL script that:
- Transforms datatypes
- Removes the nonessential columns
- Maps source columns to target columns
- Copies the DynamoDB table from the glue catalog into the Redshift table.
The easy way
Create a Redshift table with the columns that you want from DynamoDB; make sure the column names match the DynamoDB names exactly.
If the Dynamo table has 4 columns named SerialNumber, DateModified, Values and Array, the Redshift table you create should have 3 columns with matching names : SerialNumber, DateModified, and Values. (leaving out the array column)
Once the Redshift table with your desired columns is created, load the data using a copy command similar to the one below.
copy redshift_table from 'dynamodb://dynamo_table' iam_role 'redshift_cluster_role' readratio 50;
The Copy command loads data directly into the Redshift table from the Amazon DynamoDB table. The iam_role is an identity that you can create in your AWS account that has specific permissions that allows you to access services and databases. Readratio is the percentage of the DynamoDB table’s provisioned throughput to use for the data load. The readratio is required for copy from DynamoDB, I highly recommend setting the ratio to a value less than the average unused provisioned throughput. Valid values are integers 1–200.
The DynamoDB columns that are left out in the Redshift table (array) will be omitted from the load, leaving you with only the columns that you want.
If the copy command seems like it is hanging, check the amount of data that is in the DynamoDB table and the provisioned read and write capacity units. There is a chance the data might just be moving at a leisurely speed.
Red Pill Analytics is a unique analytics intelligence firm, experienced in advising organizations on overall data strategy. Our knowledge spans a variety of offerings across all of the major public cloud providers. From proof-of-concept to implementation to training your users, we can help. If you are interested in guidance while working with any of your data projects, feel free to reach out to us on our website or find us on Twitter, Facebook, and LinkedIn.
One Response to Loading Specific Columns from DynamoDB into Redshift