At the request of the (fictional) Chinook online music store, I’ll be using R and SQL to analyze their company’s proprietary database and answer important questions about sales, employee performance, and future marketing strategies. I’ll use the {RSQLite} package to make SQL queries to the Chinook database.
The code & queries used to generate each section of this report are folded into collapsible sections. For more details about how I designed a particular query, just hit “Code”.
First, I’ll load the necessary packages to manipulate data in R, and create SQL queries from the comfort of RStudio. Then I’ll make two “helper functions” - one to list the tables in the Chinook database to make sure the connection is working, and one I’ll use to connect to the database, retrieve the queried data, and securely close the connection.
db <- 'chinook.db'
run_query <- function(que)
{
conn <- dbConnect(SQLite(), db)
result <- dbGetQuery(conn, que)
dbDisconnect(conn)
return(result)
}
show_tables <- function() {
q = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view')"
return(run_query(q))
}
show_tables()
Chinook has recently acquired four new albums of different genres; a Hip-Hop album by the group Regal, a Punk album by the band Red Tone, a Pop album from the artist Meteor and the Girls, and a Blues album by Slim Jim Bites. We’ll analyze how well various genres have sold in the past to assess where Chinook should put its marketing money.
#this tells us the total tracks_sold is 1051 - the number we use to calculate percentage
#later on
percentage_query <-
"SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
ORDER BY tracks_sold DESC"
run_query(percentage_query)
FALSE tracks_sold genre_name
FALSE 1 1051 Rock
#assessing genres by sales in the US
genre_query <-
"SELECT
count(track.genre_id) 'tracks_sold',
genre.name 'genre_name',
(CAST(COUNT(track.genre_id) AS FLOAT)/1051)*100 'percent_share'
from invoice_line
INNER JOIN track ON track.track_id = invoice_line.track_id
INNER JOIN genre ON genre.genre_id = track.genre_id
INNER JOIN invoice ON invoice.invoice_id = invoice_line.invoice_id
INNER JOIN customer ON customer.customer_id = invoice.customer_id
WHERE customer.country == 'USA'
GROUP BY genre.name
ORDER BY tracks_sold DESC"
gq <- run_query(genre_query)
gq
FALSE tracks_sold genre_name percent_share
FALSE 1 561 Rock 53.37773549
FALSE 2 130 Alternative & Punk 12.36917222
FALSE 3 124 Metal 11.79828735
FALSE 4 53 R&B/Soul 5.04281637
FALSE 5 36 Blues 3.42530923
FALSE 6 35 Alternative 3.33016175
FALSE 7 22 Pop 2.09324453
FALSE 8 22 Latin 2.09324453
FALSE 9 20 Hip Hop/Rap 1.90294957
FALSE 10 14 Jazz 1.33206470
FALSE 11 13 Easy Listening 1.23691722
FALSE 12 6 Reggae 0.57088487
FALSE 13 5 Electronica/Dance 0.47573739
FALSE 14 4 Classical 0.38058991
FALSE 15 3 Heavy Metal 0.28544244
FALSE 16 2 Soundtrack 0.19029496
FALSE 17 1 TV Shows 0.09514748
gq %>%
mutate(genre_name = fct_reorder(genre_name, tracks_sold)) %>%
ggplot(aes(x=genre_name, y=tracks_sold)) +
geom_col(fill="#f68060", alpha=.8, width=.8) +
coord_flip() +
theme_bw()
Rock is by far the most popular genre for Chinook customers, so none of these album acquisitions are really perfect for the Chinook audience. The data suggests it’s best to focus marketing money on the Punk album (the next best-selling genre), and the sales strategy should involve playing up any rock-inspired sounds the other artists’ work offers.
Next, we’ll compare the amount in sales each Chinook sales support agent brings in, plus other data relevant to their performance.
employee_query <-
"SELECT
COUNT(customer.support_rep_id) 'rep_customers',
employee.first_name || ' ' || employee.last_name 'employee_name',
SUM(invoice.total) 'rep_total',
employee.hire_date
FROM customer
INNER JOIN employee ON employee.employee_id = customer.support_rep_id
INNER JOIN invoice ON invoice.customer_id = customer.customer_id
GROUP BY employee_name"
eq <- run_query(employee_query)
eq
FALSE rep_customers employee_name rep_total hire_date
FALSE 1 212 Jane Peacock 1731.51 2017-04-01 00:00:00
FALSE 2 214 Margaret Park 1584.00 2017-05-03 00:00:00
FALSE 3 188 Steve Johnson 1393.92 2017-10-17 00:00:00
eq_pal <- c("#f68060", "#C5C392", "#ADA7C9")
eq %>%
ggplot(mapping =aes(x=employee_name, y=rep_total, fill=employee_name)) +
geom_col(alpha=.8, width=.8) +
scale_fill_manual(values=eq_pal) +
theme_bw() +
theme(legend.position = "none")
eq %>%
ggplot(mapping =aes(x=employee_name, y=rep_customers, fill=employee_name)) +
geom_col(alpha=.8, width=.8) +
scale_fill_manual(values=eq_pal) +
theme_bw() +
theme(legend.position = "none")
Jane Peacock has made the most money in sales so far, while Margaret Park has handled the most individual clients. Steve lags in both categories, but he’s also the newest sales agent at the company. Overall, there are no weak links here.
Now we’ll analyze sales data by country through a variety of summary statistics.
new_country <-
"
WITH country_fix AS
(
SELECT
CASE
WHEN
COUNT(DISTINCT customer.customer_id) = 1 THEN 'Other'
ELSE customer.country
END AS 'country',
COUNT(DISTINCT customer.customer_id) 'total_customers',
COUNT(DISTINCT invoice.invoice_id) 'total_orders',
SUM(invoice.total) 'total_sales'
FROM customer
INNER JOIN invoice ON invoice.customer_id = customer.customer_id
GROUP BY country
),
results AS
(
SELECT
SUM(total_customers) 'number_customers',
SUM(total_sales) 'total_sales',
SUM(total_sales) / SUM(total_customers) 'value_per_customer',
SUM(total_sales) / SUM(total_orders) 'avg_order_value',
country
FROM country_fix
GROUP BY country
ORDER BY number_customers DESC
)
SELECT
number_customers,
total_sales,
value_per_customer,
avg_order_value,
country
FROM
(SELECT
results.*,
CASE
WHEN results.country = 'Other' THEN 1
ELSE 0
END AS 'outer_sort'
FROM results)
ORDER BY outer_sort
"
cq <- run_query(new_country)
cq
FALSE number_customers total_sales value_per_customer avg_order_value
FALSE 1 13 1040.49 80.03769 7.942672
FALSE 2 8 535.59 66.94875 7.047237
FALSE 3 5 389.07 77.81400 7.781400
FALSE 4 5 427.68 85.53600 7.011148
FALSE 5 4 334.62 83.65500 8.161463
FALSE 6 3 245.52 81.84000 8.768571
FALSE 7 2 185.13 92.56500 6.383793
FALSE 8 2 183.15 91.57500 8.721429
FALSE 9 2 273.24 136.62000 9.108000
FALSE 10 15 1094.94 72.99600 7.448571
FALSE country
FALSE 1 USA
FALSE 2 Canada
FALSE 3 France
FALSE 4 Brazil
FALSE 5 Germany
FALSE 6 United Kingdom
FALSE 7 Portugal
FALSE 8 India
FALSE 9 Czech Republic
FALSE 10 Other
cq_pal <- c(Other = "#EC4E20", USA = "#f68060", Canada = "#F3CA40", Brazil = "#C5C392",
France = "#285238", Germany = "#ADA7C9", "Czech Republic" = "#586994",
"United Kingdom" = "#ECDCB0", Portugal = "#D38B5D", India = "#F194B4")
cq %>%
mutate(country = fct_reorder(country, total_sales)) %>%
ggplot(aes(x=country, y=total_sales, fill=country)) +
geom_col(alpha=.8, width=.8) +
scale_fill_manual(values = cq_pal) +
theme_bw() +
theme(legend.position = "none") +
coord_flip()
cq %>%
mutate(country = fct_reorder(country,value_per_customer)) %>%
ggplot(aes(x=country, y=value_per_customer, fill=country)) +
geom_col(alpha=.8, width=.8) +
scale_fill_manual(values = cq_pal) +
theme_bw() +
theme(legend.position = "none") +
coord_flip()