This is a quick technical post for anyone who needs full CRUD capabilities: pull data down from a Redshift table, manipulate it in #rstats, and send it all back up again.
Dependencies
Load libraries
library(tidyverse)
library(RPostgreSQL) # INTERACT WITH REDSHIFT DATABASE
library(glue) # FORMAT AND INTERPOLATE STRINGS
Amazon S3
For this data pipeline to work you’ll also need the AWS command line interface installed.
# RUN THESE COMMANDS INSIDE TERMINAL
brew install awscli
aws configure
# ANSWER THE PROMPTS: access key ID, secret access key,
# default region name, default output format
Read data
Set connection
Replace the placeholder values below with your own database credentials:
conn <- dbConnect(dbDriver("PostgreSQL"),
                  dbname = "your_dbname",
                  host = "your_host",
                  port = 12345,
                  user = "your_username",
                  password = "your_password")
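If you would rather keep credentials out of your scripts, one option is to read them from environment variables instead. This is just a sketch; the variable names REDSHIFT_USER and REDSHIFT_PWD are placeholders you would set yourself (for example in ~/.Renviron):
# Assumes REDSHIFT_USER and REDSHIFT_PWD are set as environment variables
conn <- dbConnect(dbDriver("PostgreSQL"),
                  dbname = "your_dbname",
                  host = "your_host",
                  port = 12345,
                  user = Sys.getenv("REDSHIFT_USER"),
                  password = Sys.getenv("REDSHIFT_PWD"))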
Pull data
This section is where you run your SQL queries and drop the results into the tidyverse tibble format:
df <- as_tibble(
  dbGetQuery(
    conn,
    "SELECT * FROM table WHERE id = 12345"
  ))
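Since glue is already loaded, you can also interpolate values into the query rather than pasting them in by hand. A minimal sketch, assuming the same placeholder table and an example id stored in target_id:
# glue_sql() quotes interpolated values appropriately for the connection
target_id <- 12345

query <- glue_sql("SELECT * FROM table WHERE id = {target_id}", .con = conn)

df <- as_tibble(dbGetQuery(conn, query))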
Perform magic
Do your data science stuff.
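As a placeholder example only: a light clean-up that keeps the same columns, so the data can go back into the same Redshift table later. The column value is hypothetical; id comes from the query above.
# Placeholder transformation: drop rows with missing ids, round a numeric column
df <- df %>%
  filter(!is.na(id)) %>%
  mutate(value = round(value, 2))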
Optional: create table
For a brand new table you can run the following command. I like to write a small sample first for faster processing:
df_sample <- df %>%
  head()

dbWriteTable(conn, name = c("schema", "table_name"),
             value = df_sample,
             row.names = FALSE,
             append = TRUE)
You might get an error message here even when the write succeeds, so check your database before assuming it failed… the table should be there.
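One way to double-check from R rather than a SQL client is dbExistsTable(), using the same schema and table name as above:
# Returns TRUE if the table was created despite the error message
dbExistsTable(conn, c("schema", "table_name"))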
Upload data
It’s all downhill from here.
Write data to CSV
df %>%
  write_csv("df_results.csv")
Migrate CSV file to S3 bucket
system("aws s3 cp df_results.csv s3://file_path/df_results.csv")
Fun fact: system() tells R to run the command in your system's shell, so you can call your terminal tools from inside a script.
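If you want the command's output back in R rather than just printed to the console, system() can capture it. A small sketch, reusing the same placeholder bucket path:
# Capture the shell output as a character vector instead of printing it
bucket_contents <- system("aws s3 ls s3://file_path/", intern = TRUE)
bucket_contents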
Copy from S3 to Redshift database
load_s3_redshift <- glue_sql("COPY schema.table_name
                              FROM 's3://file_path/df_results.csv'
                              access_key_id 'ABCDEFGH'
                              secret_access_key '12345'
                              csv IGNOREHEADER 1",
                             .con = conn)

dbSendStatement(conn, load_s3_redshift)
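To confirm the COPY landed, a quick sanity check is to compare the number of rows you uploaded with the count now in Redshift (same placeholder table name as above):
# Rows in the data frame you exported
nrow(df)

# Rows now in the Redshift table
dbGetQuery(conn, "SELECT COUNT(*) FROM schema.table_name")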
Optional: delete data
dbSendStatement(conn,
                "DELETE FROM schema.table_name WHERE id = 12345")
Cut connection
Don’t forget!
dbDisconnect(conn)
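If you wrap this pipeline in a function that might fail partway through, you can make the disconnect automatic. A sketch, assuming a hypothetical run_pipeline() wrapper:
# Register the disconnect as soon as the connection opens,
# so it runs even if a later step throws an error
run_pipeline <- function() {
  conn <- dbConnect(dbDriver("PostgreSQL"),
                    dbname = "your_dbname", host = "your_host",
                    port = 12345, user = "your_username",
                    password = "your_password")
  on.exit(dbDisconnect(conn), add = TRUE)

  # ... pull, transform, and upload steps go here ...
}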