This is a quick technical post for anyone who needs full CRUD capabilities: pull data down from a Redshift table, manipulate it in #rstats, and send it all back up again.
Dependencies
Load libraries
library(tidyverse)
library(RPostgreSQL) # INTERACT WITH REDSHIFT DATABASE
library(glue) # FORMAT AND INTERPOLATE STRINGS
Amazon S3
For this data pipeline to work you’ll also need the AWS command line interface installed.
# RUN THESE COMMANDS INSIDE TERMINAL
brew install awscli
aws configure
# ANSWER THE PROMPTS: access key ID, secret access key,
# default region name, default output format
Read data
Set connection
Replace the placeholder values below with your own database credentials:
conn <- dbConnect(dbDriver("PostgreSQL"),
                  dbname = "your_dbname",
                  host = "your_host",
                  port = 12345,
                  user = "your_username",
                  password = "your_password")
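If you would rather keep credentials out of your scripts, one option is to read them from environment variables instead. This is just a sketch; the variable names REDSHIFT_USER and REDSHIFT_PWD are placeholders you would set yourself (for example in ~/.Renviron):
# Assumes REDSHIFT_USER and REDSHIFT_PWD are set as environment variables
conn <- dbConnect(dbDriver("PostgreSQL"),
                  dbname = "your_dbname",
                  host = "your_host",
                  port = 12345,
                  user = Sys.getenv("REDSHIFT_USER"),
                  password = Sys.getenv("REDSHIFT_PWD"))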
Pull data
This section is where you run your SQL queries and drop the results into the tidyverse tibble format:
df <- as_tibble(
  dbGetQuery(
    conn,
    "SELECT * FROM table WHERE id = 12345"
  ))
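Since glue is already loaded, you can also interpolate values into the query rather than pasting them in by hand. A minimal sketch, assuming the same placeholder table and an example id stored in target_id:
# glue_sql() quotes interpolated values appropriately for the connection
target_id <- 12345

query <- glue_sql("SELECT * FROM table WHERE id = {target_id}", .con = conn)

df <- as_tibble(dbGetQuery(conn, query))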
Perform magic
Do your data science stuff.
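As a placeholder example only: a light clean-up that keeps the same columns, so the data can go back into the same Redshift table later. The column value is hypothetical; id comes from the query above.
# Placeholder transformation: drop rows with missing ids, round a numeric column
df <- df %>%
  filter(!is.na(id)) %>%
  mutate(value = round(value, 2))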
Optional: create table
For a brand new table you can run the following command. I like to write a small sample first for faster processing:
df_sample <- df %>%
  head()

dbWriteTable(conn, name = c("schema", "table_name"),
             value = df_sample,
             row.names = FALSE,
             append = TRUE)
You might get an error message here even when the write succeeds, so check your database before assuming it failed… the table should be there.
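One way to double-check from R rather than a SQL client is dbExistsTable(), using the same schema and table name as above:
# Returns TRUE if the table was created despite the error message
dbExistsTable(conn, c("schema", "table_name"))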
Upload data
It’s all downhill from here.
Write data to CSV
df %>%
  write_csv("df_results.csv")
Migrate CSV file to S3 bucket
system("aws s3 cp df_results.csv s3://file_path/df_results.csv")
Fun fact: system() tells R to run the command in your system's shell, so you can call your terminal tools from inside a script.
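If you want the command's output back in R rather than just printed to the console, system() can capture it. A small sketch, reusing the same placeholder bucket path:
# Capture the shell output as a character vector instead of printing it
bucket_contents <- system("aws s3 ls s3://file_path/", intern = TRUE)
bucket_contents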
Copy from S3 to Redshift database
load_s3_redshift <- glue_sql("COPY schema.table_name
                              FROM 's3://file_path/df_results.csv'
                              access_key_id 'ABCDEFGH'
                              secret_access_key '12345'
                              csv IGNOREHEADER 1",
                             .con = conn)

dbSendStatement(conn, load_s3_redshift)
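To confirm the COPY landed, a quick sanity check is to compare the number of rows you uploaded with the count now in Redshift (same placeholder table name as above):
# Rows in the data frame you exported
nrow(df)

# Rows now in the Redshift table
dbGetQuery(conn, "SELECT COUNT(*) FROM schema.table_name")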
Optional: delete data
dbSendStatement(conn,
                "DELETE FROM schema.table_name WHERE id = 12345")
Cut connection
Don’t forget!
dbDisconnect(conn)
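If you wrap this pipeline in a function that might fail partway through, you can make the disconnect automatic. A sketch, assuming a hypothetical run_pipeline() wrapper:
# Register the disconnect as soon as the connection opens,
# so it runs even if a later step throws an error
run_pipeline <- function() {
  conn <- dbConnect(dbDriver("PostgreSQL"),
                    dbname = "your_dbname", host = "your_host",
                    port = 12345, user = "your_username",
                    password = "your_password")
  on.exit(dbDisconnect(conn), add = TRUE)

  # ... pull, transform, and upload steps go here ...
}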