Chinook, a (fictional) boutique online music store, is preparing for a new stage of growth. Using R and SQL, I’ll be analyzing their proprietary database to answer important questions about sales, employee performance, and customer purchasing habits to help them learn more about their customers, expand their audience, and develop future marketing strategies.

Preparing the Data

First, I’ll load the necessary packages to manipulate data in R, and create SQL queries from the comfort of RStudio. Then I’ll make two “helper functions” – one to list the tables in the Chinook database to make sure the connection is working, and one I’ll use to connect to the database, retrieve the queried data, and securely close the connection.

library(tidyverse)
library(RSQLite)
library(DBI)
library(RColorBrewer)


db <- 'chinook.db'
run_query <- function(que)
{
  conn <- dbConnect(SQLite(), db)
  result <- dbGetQuery(conn, que)
  dbDisconnect(conn)
  return(result)
}
show_tables <- function() {
  q = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view')"
  return(run_query(q))
}
show_tables()

Comparing Genre Popularity

Chinook has recently acquired four new albums of different genres; a Hip-Hop album by the group Regal, a Punk album by the band Red Tone, a Pop album from the artist Meteor and the Girls, and a Blues album by Slim Jim Bites. We’ll analyze how well various genres have sold in the past to assess where Chinook should put its marketing money.

#this tells us the total tracks_sold is 1051 - the number we use to calculate percentage
#later on
percentage_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
ORDER BY tracks_sold DESC"

run_query(percentage_query)

#assessing genres by sales in the US
genre_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name',
(CAST(COUNT(track.genre_id) AS FLOAT)/1051)*100 'percent_share'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
GROUP BY genre.name
ORDER BY tracks_sold DESC"

gq <- run_query(genre_query)
gq

gq %>%
  mutate(genre_name = fct_reorder(genre_name, tracks_sold)) %>%
  ggplot(aes(x=genre_name, y=tracks_sold)) +
  geom_col(fill="#f68060", alpha=.8, width=.8) +
  coord_flip() +
  theme_bw() +
  theme(axis.title.y = element_blank()) +
  ylab("# of tracks sold")

Rock is by far the most popular genre for Chinook customers, so none of these album acquisitions are really perfect for the Chinook audience. The data suggests it’s best to focus marketing money on the Punk album (the next best-selling genre), and the sales strategy should involve playing up any rock-inspired tracks the other artists’ work offers.

Assessing Sales Support Staff Success

Next, we’ll compare the amount in sales each Chinook sales support agent brings in, plus other data relevant to their performance.

employee_query <- 
  "SELECT
  COUNT(customer.support_rep_id) 'rep_customers',
  employee.first_name || ' ' || employee.last_name 'employee_name',
  SUM(invoice.total) 'rep_total',
  employee.hire_date
  FROM customer
INNER JOIN employee ON employee.employee_id = customer.support_rep_id
INNER JOIN invoice ON invoice.customer_id = customer.customer_id
GROUP BY employee_name"

eq <- run_query(employee_query)
eq

eq_pal <- c("#f68060", "#C5C392", "#ADA7C9")

eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_total, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  xlab("sales support agent") +
  ylab("$ in sales") +
  theme(legend.position = "none")


eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_customers, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  xlab("sales support agent") +
  ylab("# of customers served") +
  theme(legend.position = "none")

Jane Peacock has made the most money in sales so far, while Margaret Park has handled the most individual clients. Steve lags in both categories, but he’s also the newest sales agent at the company. Overall, there are no weak links here.

Analyzing Sales Data By Country

Now we’ll analyze sales data by country through a variety of summary statistics.

new_country <-

"
 /* categorizing countries with 1 purchase as Other */ 
WITH country_fix AS
    (
    SELECT
    CASE
      WHEN 
      COUNT(DISTINCT customer.customer_id) = 1 THEN 'Other'
      ELSE customer.country
      END AS 'country',
    COUNT(DISTINCT customer.customer_id) 'total_customers',
    COUNT(DISTINCT invoice.invoice_id) 'total_orders',
    SUM(invoice.total) 'total_sales'
    FROM customer
    INNER JOIN invoice ON invoice.customer_id = customer.customer_id
    GROUP BY country
    ),
/* organizing statistics */ 
results AS
    (
    SELECT
    SUM(total_customers) 'number_customers',
    SUM(total_sales) 'total_sales',
    SUM(total_sales) / SUM(total_customers) 'value_per_customer',
    SUM(total_sales) / SUM(total_orders) 'avg_order_value',
    country
    FROM country_fix
    GROUP BY country
    ORDER BY number_customers DESC
    )
/* making the table, plus pushing 'Other' to the bottom */ 
SELECT
number_customers,
total_sales,
value_per_customer,
avg_order_value,
country
FROM 
  (SELECT
  results.*,
  CASE
    WHEN results.country = 'Other' THEN 1
    ELSE 0
    END AS 'outer_sort'
  FROM results)
ORDER BY outer_sort
"

cq <- run_query(new_country)
cq

cq_pal <- c(Other = "#EC4E20", USA = "#f68060", Canada = "#F3CA40", Brazil = "#C5C392",
            France = "#285238", Germany = "#ADA7C9", "Czech Republic" = "#586994", 
            "United Kingdom" = "#ECDCB0", Portugal = "#D38B5D", India = "#F194B4")

cq %>%
  mutate(country = fct_reorder(country, total_sales)) %>%
  ggplot(aes(x=country, y=total_sales, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  ylab("total $ in sales") +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()


cq %>%
  mutate(country = fct_reorder(country,value_per_customer)) %>%
  ggplot(aes(x=country, y=value_per_customer, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  ylab("lifetime value per customer ($)") +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()