Analyzing Music Sales with SQL

At the request of the (fictional) Chinook online music store, I’ll be using R and SQL to analyze their company’s proprietary database and answer important questions about sales, employee performance, and future marketing strategies. I’ll use the {RSQLite} package to make SQL queries to the Chinook database.

The code & queries used to generate each section of this report are folded into collapsible sections. For more details about how I designed a particular query, just hit “Code”.

First, I’ll load the necessary packages to manipulate data in R, and create SQL queries from the comfort of RStudio. Then I’ll make two “helper functions” - one to list the tables in the Chinook database to make sure the connection is working, and one I’ll use to connect to the database, retrieve the queried data, and securely close the connection.

Preparing the Database:

db <- 'chinook.db'
run_query <- function(que)
{
  conn <- dbConnect(SQLite(), db)
  result <- dbGetQuery(conn, que)
  dbDisconnect(conn)
  return(result)
}
show_tables <- function() {
  q = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view')"
  return(run_query(q))
}
show_tables()

Comparing Genre Popularity

Chinook has recently acquired four new albums of different genres; a Hip-Hop album by the group Regal, a Punk album by the band Red Tone, a Pop album from the artist Meteor and the Girls, and a Blues album by Slim Jim Bites. We’ll analyze how well various genres have sold in the past to assess where Chinook should put its marketing money.

#this tells us the total tracks_sold is 1051 - the number we use to calculate percentage
#later on
percentage_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
ORDER BY tracks_sold DESC"

run_query(percentage_query)

FALSE   tracks_sold genre_name
FALSE 1        1051       Rock

#assessing genres by sales in the US
genre_query <-
  "SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name',
(CAST(COUNT(track.genre_id) AS FLOAT)/1051)*100 'percent_share'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
GROUP BY genre.name
ORDER BY tracks_sold DESC"

gq <- run_query(genre_query)
gq

FALSE    tracks_sold         genre_name percent_share
FALSE 1          561               Rock   53.37773549
FALSE 2          130 Alternative & Punk   12.36917222
FALSE 3          124              Metal   11.79828735
FALSE 4           53           R&B/Soul    5.04281637
FALSE 5           36              Blues    3.42530923
FALSE 6           35        Alternative    3.33016175
FALSE 7           22                Pop    2.09324453
FALSE 8           22              Latin    2.09324453
FALSE 9           20        Hip Hop/Rap    1.90294957
FALSE 10          14               Jazz    1.33206470
FALSE 11          13     Easy Listening    1.23691722
FALSE 12           6             Reggae    0.57088487
FALSE 13           5  Electronica/Dance    0.47573739
FALSE 14           4          Classical    0.38058991
FALSE 15           3        Heavy Metal    0.28544244
FALSE 16           2         Soundtrack    0.19029496
FALSE 17           1           TV Shows    0.09514748

gq %>%
  mutate(genre_name = fct_reorder(genre_name, tracks_sold)) %>%
  ggplot(aes(x=genre_name, y=tracks_sold)) +
  geom_col(fill="#f68060", alpha=.8, width=.8) +
  coord_flip() +
  theme_bw()

Rock is by far the most popular genre for Chinook customers, so none of these album acquisitions are really perfect for the Chinook audience. The data suggests it’s best to focus marketing money on the Punk album (the next best-selling genre), and the sales strategy should involve playing up any rock-inspired sounds the other artists’ work offers.

Assessing Sales Support Staff

Next, we’ll compare the amount in sales each Chinook sales support agent brings in, plus other data relevant to their performance.

employee_query <- 
  "SELECT
  COUNT(customer.support_rep_id) 'rep_customers',
  employee.first_name || ' ' || employee.last_name 'employee_name',
  SUM(invoice.total) 'rep_total',
  employee.hire_date
  FROM customer
INNER JOIN employee ON employee.employee_id = customer.support_rep_id
INNER JOIN invoice ON invoice.customer_id = customer.customer_id
GROUP BY employee_name"

eq <- run_query(employee_query)
eq

FALSE   rep_customers employee_name rep_total           hire_date
FALSE 1           212  Jane Peacock   1731.51 2017-04-01 00:00:00
FALSE 2           214 Margaret Park   1584.00 2017-05-03 00:00:00
FALSE 3           188 Steve Johnson   1393.92 2017-10-17 00:00:00

eq_pal <- c("#f68060", "#C5C392", "#ADA7C9")

eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_total, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  theme(legend.position = "none")

eq %>%
  ggplot(mapping =aes(x=employee_name, y=rep_customers, fill=employee_name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=eq_pal) +
  theme_bw() +
  theme(legend.position = "none")

Jane Peacock has made the most money in sales so far, while Margaret Park has handled the most individual clients. Steve lags in both categories, but he’s also the newest sales agent at the company. Overall, there are no weak links here.

Analyzing Sales Data By Country

Now we’ll analyze sales data by country through a variety of summary statistics.

new_country <-

"
WITH country_fix AS
    (
    SELECT
    CASE
      WHEN 
      COUNT(DISTINCT customer.customer_id) = 1 THEN 'Other'
      ELSE customer.country
      END AS 'country',
    COUNT(DISTINCT customer.customer_id) 'total_customers',
    COUNT(DISTINCT invoice.invoice_id) 'total_orders',
    SUM(invoice.total) 'total_sales'
    FROM customer
    INNER JOIN invoice ON invoice.customer_id = customer.customer_id
    GROUP BY country
    ),
results AS
    (
    SELECT
    SUM(total_customers) 'number_customers',
    SUM(total_sales) 'total_sales',
    SUM(total_sales) / SUM(total_customers) 'value_per_customer',
    SUM(total_sales) / SUM(total_orders) 'avg_order_value',
    country
    FROM country_fix
    GROUP BY country
    ORDER BY number_customers DESC
    )

SELECT
number_customers,
total_sales,
value_per_customer,
avg_order_value,
country
FROM 
  (SELECT
  results.*,
  CASE
    WHEN results.country = 'Other' THEN 1
    ELSE 0
    END AS 'outer_sort'
  FROM results)
ORDER BY outer_sort
"

cq <- run_query(new_country)
cq

FALSE    number_customers total_sales value_per_customer avg_order_value
FALSE 1                13     1040.49           80.03769        7.942672
FALSE 2                 8      535.59           66.94875        7.047237
FALSE 3                 5      389.07           77.81400        7.781400
FALSE 4                 5      427.68           85.53600        7.011148
FALSE 5                 4      334.62           83.65500        8.161463
FALSE 6                 3      245.52           81.84000        8.768571
FALSE 7                 2      185.13           92.56500        6.383793
FALSE 8                 2      183.15           91.57500        8.721429
FALSE 9                 2      273.24          136.62000        9.108000
FALSE 10               15     1094.94           72.99600        7.448571
FALSE           country
FALSE 1             USA
FALSE 2          Canada
FALSE 3          France
FALSE 4          Brazil
FALSE 5         Germany
FALSE 6  United Kingdom
FALSE 7        Portugal
FALSE 8           India
FALSE 9  Czech Republic
FALSE 10          Other

cq_pal <- c(Other = "#EC4E20", USA = "#f68060", Canada = "#F3CA40", Brazil = "#C5C392",
            France = "#285238", Germany = "#ADA7C9", "Czech Republic" = "#586994", 
            "United Kingdom" = "#ECDCB0", Portugal = "#D38B5D", India = "#F194B4")

cq %>%
  mutate(country = fct_reorder(country, total_sales)) %>%
  ggplot(aes(x=country, y=total_sales, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()

cq %>%
  mutate(country = fct_reorder(country,value_per_customer)) %>%
  ggplot(aes(x=country, y=value_per_customer, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()

cq %>%
  mutate(country = fct_reorder(country, avg_order_value)) %>%
  ggplot(aes(x=country, y=avg_order_value, fill=country)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values = cq_pal) +
  theme_bw() +
  theme(legend.position = "none") +
  coord_flip()

___

The US makes up by far the most value in sales, while the Czech Republic has the biggest spenders in terms of lifetime value per customer and value per order. Chinook should focus their expansion on their loyal Czech customers, as well as emerging markets in Canada and Brazil. The ‘Other’ category, which collates all countries with less than one sale, is prominent in all three categories - so continuing to offer their services to a global audience should be an important part of their strategy moving forward. ___

Breaking Down Purchases

Chinook is considering switching to selling invididual tracks only rather than albums as a package, and want to know what share of their purchases are from sales of full albums versus sales of collections of individual tracks.

rl_query <- 
  "
WITH 
album_count AS
(
  SELECT
  album_id,
  COUNT(track_id) 'album_tracks'
  FROM track
  GROUP BY album_id
  ),
  
album_lookup AS
(
  SELECT
  invoice_line.invoice_id,
  invoice_line.track_id,
  track.album_id
  FROM invoice_line
  LEFT JOIN track ON track.track_id = invoice_line.track_id
  ),
  
purchase_count AS
  (
  SELECT invoice_id,
  album_id,
  COUNT(DISTINCT track_id) 'count_purchased_tracks'
  FROM album_lookup
  GROUP BY invoice_id, album_id
  ),
  
compare_counts AS
  (
  SELECT purchase_count.*,
  album_count.album_tracks,
  (CASE
    WHEN purchase_count.count_purchased_tracks = album_count.album_tracks THEN 1
    WHEN purchase_count.count_purchased_tracks <> album_count.album_tracks THEN 0
    ELSE 9999
  END)
  as 'full_album'
  FROM purchase_count
  INNER JOIN album_count ON album_count.album_id = purchase_count.album_id),
  
classify_inv AS
  (
  SELECT invoice_id,
  MAX(full_album) 'album_purchase'
  FROM compare_counts
  GROUP BY invoice_id
  )
  
SELECT
album_purchase,
COUNT(invoice_id) 'number_of_invoices',
100*CAST(COUNT(invoice_id) AS FLOAT) / 
  (SELECT
  COUNT(invoice_id)
  FROM classify_inv) 'percent_purchases'

FROM classify_inv

GROUP BY album_purchase
  "

run_query(rl_query)

FALSE   album_purchase number_of_invoices percent_purchases
FALSE 1              0                413          67.26384
FALSE 2              1                201          32.73616

With 1 representing orders that included a full album and 0 representing purchases of collections of individual tracks, we can see that sales of albums makes up as much as a third of Chinook’s sales - switching to selling individual tracks doesn’t seem compatible with the purchasing style of Chinook’s customer base.

Exploring Playlist Composition

Chinook lets customers build playlists with their purchased music, and want to know which artists and genres are best represented in the playlist system.

playlist_query <-
  "
SELECT
COUNT(playlist_track.track_id) 'playlist_inclusions',
track.composer,
genre.name
FROM
playlist_track
INNER JOIN track ON track.track_id = playlist_track.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
GROUP BY track.composer
ORDER BY playlist_inclusions DESC
LIMIT 10;
"

pq <- run_query(playlist_query)
# eliminating the #NA tracks - copies? imports? artistless songs?
pq %>%
  filter(playlist_inclusions < 200)

FALSE   playlist_inclusions        composer               name
FALSE 1                 193    Steve Harris              Metal
FALSE 2                 122              U2               Rock
FALSE 3                  96 Jagger/Richards               Rock
FALSE 4                  84     Kurt Cobain               Rock
FALSE 5                  77    Billy Corgan Alternative & Punk
FALSE 6                  71   The Tea Party Alternative & Punk
FALSE 7                  69   Chico Science              Latin
FALSE 8                  66           Titãs Alternative & Punk
FALSE 9                  65    Gilberto Gil              Latin

#graphing
pq_pal <- c("#EC4E20", "#f68060", "#F3CA40", "#C5C392",
            "#285238", "#ADA7C9", "#586994", "#ECDCB0", "#D38B5D", "#F194B4")

pq %>%
  filter(playlist_inclusions < 200) %>%
  mutate(composer = fct_reorder(composer,playlist_inclusions)) %>%
  ggplot(aes(x=composer, y=playlist_inclusions, fill= composer)) +
  scale_fill_manual(values=pq_pal) +
  geom_col(alpha=.8, width=.8) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none")

g_query <-
  "SELECT
  COUNT(playlist_track.track_id) 'playlist_inclusions',
  genre.name
  FROM playlist_track
  INNER JOIN track ON track.track_id = playlist_track.track_id
  INNER JOIN genre ON genre.genre_id = track.genre_id
  GROUP BY genre.name
  ORDER BY playlist_inclusions DESC
  LIMIT 10;
"

ggq <- run_query(g_query)
ggq

FALSE    playlist_inclusions               name
FALSE 1                 3238               Rock
FALSE 2                 1454              Latin
FALSE 3                  927              Metal
FALSE 4                  857 Alternative & Punk
FALSE 5                  334          Classical
FALSE 6                  286               Jazz
FALSE 7                  194              Blues
FALSE 8                  186           TV Shows
FALSE 9                  153           R&B/Soul
FALSE 10                 144             Reggae

ggq %>%
  mutate(name = fct_reorder(name,playlist_inclusions)) %>%
  ggplot(aes(x=name, y=playlist_inclusions, fill=name)) +
  geom_col(alpha=.8, width=.8) +
  scale_fill_manual(values=pq_pal) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none")

Compared to the most popular genres in terms of purchases, this analysis of playlists has some surprises - even though Alternative/Punk is Chinook’s second most popular genre in sales, Latin and Metal make up the second and third most frequently used genres in Chinook playlists, perhaps thanks to Iron Maiden superfans and customers who listen to Latin acts Chico Science, Titas and Gilberto Gil. Iron Maiden frontman Steve Harris is by far the best represented artist in terms of sales, though U2 and the Rolling Stones are also very popular. A big portion of the database’s playlist tracks don’t have an assigned artist - a deeper exploration of why this is would add to the value of this analysis.

Conclusions

A globally accessible online purchasing system that supports sales of both individual tracks and full albums, as well as a high-performing team of sales support agents, are two of Chinook’s biggest strengths as a company, and should continue to be a focus moving forward. While Chinook has successfully built an international customer base of rock fans, the company should orient their future expansion towards their growing audiences in Canada, Brazil, and the Czech Republic. Users of the Chinook playlist system gravitate toward Metal and Latin genres, while Alternative & Punk is Chinook’s second most sold genre, so future album acquisitions and marketing should cater to these genres.