At the request of the (fictional) Chinook online music store, I’ll be using R and SQL to analyze their company’s proprietary database and answer important questions about sales, employee performance, and future marketing strategies. I’ll use the {RSQLite} package to make SQL queries to the Chinook database.

The code & queries used to generate each section of this report are folded into collapsible sections. For more details about how I designed a particular query, just hit “Code”.

First, I’ll load the necessary packages to manipulate data in R, and create SQL queries from the comfort of RStudio. Then I’ll make two “helper functions” - one to list the tables in the Chinook database to make sure the connection is working, and one I’ll use to connect to the database, retrieve the queried data, and securely close the connection.


Preparing the Database:

db <- 'chinook.db'
run_query <- function(que)
{
  conn <- dbConnect(SQLite(), db)
  result <- dbGetQuery(conn, que)
  dbDisconnect(conn)
  return(result)
}
show_tables <- function() {
  q = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view')"
  return(run_query(q))
}
show_tables()

Comparing Genre Popularity

Chinook has recently acquired four new albums of different genres; a Hip-Hop album by the group Regal, a Punk album by the band Red Tone, a Pop album from the artist Meteor and the Girls, and a Blues album by Slim Jim Bites. We’ll analyze how well various genres have sold in the past to assess where Chinook should put its marketing money.

#this tells us the total tracks_sold is 1051 - the number we use to calculate percentage
#later on
percentage_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
ORDER BY tracks_sold DESC"

run_query(percentage_query)
FALSE   tracks_sold genre_name
FALSE 1        1051       Rock
#assessing genres by sales in the US
genre_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name',
(CAST(COUNT(track.genre_id) AS FLOAT)/1051)*100 'percent_share'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
GROUP BY genre.name
ORDER BY tracks_sold DESC"

gq <- run_query(genre_query)
gq
FALSE    tracks_sold         genre_name percent_share
FALSE 1          561               Rock   53.37773549
FALSE 2          130 Alternative & Punk   12.36917222
FALSE 3          124              Metal   11.79828735
FALSE 4           53           R&B/Soul    5.04281637
FALSE 5           36              Blues    3.42530923
FALSE 6           35        Alternative    3.33016175
FALSE 7           22                Pop    2.09324453
FALSE 8           22              Latin    2.09324453
FALSE 9           20        Hip Hop/Rap    1.90294957
FALSE 10          14               Jazz    1.33206470
FALSE 11          13     Easy Listening    1.23691722
FALSE 12           6             Reggae    0.57088487
FALSE 13           5  Electronica/Dance    0.47573739
FALSE 14           4          Classical    0.38058991
FALSE 15           3        Heavy Metal    0.28544244
FALSE 16           2         Soundtrack    0.19029496
FALSE 17           1           TV Shows    0.09514748
gq %>%
  mutate(genre_name = fct_reorder(genre_name, tracks_sold)) %>%
  ggplot(aes(x=genre_name, y=tracks_sold)) +
  geom_col(fill="#f68060", alpha=.8, width=.8) +
  coord_flip() +
  theme_bw()


Rock is by far the most popular genre for Chinook customers, so none of these album acquisitions are really perfect for the Chinook audience. The data suggests it’s best to focus marketing money on the Punk album (the next best-selling genre), and the sales strategy should involve playing up any rock-inspired sounds the other artists’ work offers.


Assessing Sales Support Staff

Next, we’ll compare the amount in sales each Chinook sales support agent brings in, plus other data relevant to their performance.

employee_query <- 
  "SELECT
  COUNT(customer.support_rep_id) 'rep_customers',
  employee.first_name || ' ' || employee.last_name 'employee_name',
  SUM(invoice.total) 'rep_total',
  employee.hire_date
  FROM customer
INNER JOIN employee ON employee.employee_id = customer.support_rep_id
INNER JOIN invoice ON invoice.customer_id = customer.customer_id
GROUP BY employee_name"

eq <- run_query(employee_query)
eq
FALSE   rep_customers employee_name rep_total           hire_date
FALSE 1           212  Jane Peacock   1731.51 2017-04-01 00:00:00
FALSE 2           214 Margaret Park   1584.00 2017-05-03 00:00:00
FALSE 3           188 Steve Johnson   1393.92 2017-10-17 00:00:00
eq_pal <- c("#f68060", "#C5C392", "#ADA7C9")

eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_total, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  theme(legend.position = "none")

eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_customers, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  theme(legend.position = "none")


Jane Peacock has made the most money in sales so far, while Margaret Park has handled the most individual clients. Steve lags in both categories, but he’s also the newest sales agent at the company. Overall, there are no weak links here.


Analyzing Sales Data By Country

Now we’ll analyze sales data by country through a variety of summary statistics.

new_country <-

"
WITH country_fix AS
    (
    SELECT
    CASE
      WHEN 
      COUNT(DISTINCT customer.customer_id) = 1 THEN 'Other'
      ELSE customer.country
      END AS 'country',
    COUNT(DISTINCT customer.customer_id) 'total_customers',
    COUNT(DISTINCT invoice.invoice_id) 'total_orders',
    SUM(invoice.total) 'total_sales'
    FROM customer
    INNER JOIN invoice ON invoice.customer_id = customer.customer_id
    GROUP BY country
    ),
results AS
    (
    SELECT
    SUM(total_customers) 'number_customers',
    SUM(total_sales) 'total_sales',
    SUM(total_sales) / SUM(total_customers) 'value_per_customer',
    SUM(total_sales) / SUM(total_orders) 'avg_order_value',
    country
    FROM country_fix
    GROUP BY country
    ORDER BY number_customers DESC
    )

SELECT
number_customers,
total_sales,
value_per_customer,
avg_order_value,
country
FROM 
  (SELECT
  results.*,
  CASE
    WHEN results.country = 'Other' THEN 1
    ELSE 0
    END AS 'outer_sort'
  FROM results)
ORDER BY outer_sort
"

cq <- run_query(new_country)
cq
FALSE    number_customers total_sales value_per_customer avg_order_value
FALSE 1                13     1040.49           80.03769        7.942672
FALSE 2                 8      535.59           66.94875        7.047237
FALSE 3                 5      389.07           77.81400        7.781400
FALSE 4                 5      427.68           85.53600        7.011148
FALSE 5                 4      334.62           83.65500        8.161463
FALSE 6                 3      245.52           81.84000        8.768571
FALSE 7                 2      185.13           92.56500        6.383793
FALSE 8                 2      183.15           91.57500        8.721429
FALSE 9                 2      273.24          136.62000        9.108000
FALSE 10               15     1094.94           72.99600        7.448571
FALSE           country
FALSE 1             USA
FALSE 2          Canada
FALSE 3          France
FALSE 4          Brazil
FALSE 5         Germany
FALSE 6  United Kingdom
FALSE 7        Portugal
FALSE 8           India
FALSE 9  Czech Republic
FALSE 10          Other
cq_pal <- c(Other = "#EC4E20", USA = "#f68060", Canada = "#F3CA40", Brazil = "#C5C392",
            France = "#285238", Germany = "#ADA7C9", "Czech Republic" = "#586994", 
            "United Kingdom" = "#ECDCB0", Portugal = "#D38B5D", India = "#F194B4")

cq %>%
  mutate(country = fct_reorder(country, total_sales)) %>%
  ggplot(aes(x=country, y=total_sales, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()

cq %>%
  mutate(country = fct_reorder(country,value_per_customer)) %>%
  ggplot(aes(x=country, y=value_per_customer, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()