Introduction


Let’s compare the series Game of Thrones and Xena the Warrior Princess to answer:

  • Which one is better rated by users?
    • Is the difference maintained across seasons?
    • Does an episode’s position within its season (beginning/end or middle) have an effect?

Dataset

This is an exploratory data analysis of IMDB data on the series Game of Thrones and Xena the Warrior Princess. The original data and variables come from this repository, which explains how the data were generated and what each variable means.

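library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(here)       # here() builds paths relative to the project root

# Read every column as numeric except the text columns listed explicitly,
# then keep only the two series under comparison.
episodes <- read_csv(here("evidences/series_from_imdb.csv"), 
                    progress = FALSE,
                    col_types = cols(.default = col_double(), 
                                     series_name = col_character(), 
                                     episode = col_character(), 
                                     url = col_character(),
                                     season = col_character())) %>% 
    filter(series_name %in% c("Game of Thrones","Xena a Princesa Guerreira")) 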

episodes %>% 
    glimpse()
## Observations: 194
## Variables: 18
## $ series_name <chr> "Xena a Princesa Guerreira", "Xena a Princesa Guerre…
## $ episode     <chr> "Sins of the Past", "Chariots of War", "Dreamworker"…
## $ series_ep   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ season      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ season_ep   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ url         <chr> "http://www.imdb.com/title/tt0394990/", "http://www.…
## $ user_rating <dbl> 7.9, 7.4, 7.7, 7.4, 7.5, 7.7, 7.5, 8.0, 7.8, 7.6, 7.…
## $ user_votes  <dbl> 440, 339, 318, 297, 288, 282, 270, 303, 278, 287, 27…
## $ r1          <dbl> 0.003623188, 0.025641026, 0.064516129, 0.023474178, …
## $ r2          <dbl> 0.04347826, 0.03846154, 0.03548387, 0.02347418, 0.02…
## $ r3          <dbl> 0.010869565, 0.038461538, 0.029032258, 0.004694836, …
## $ r4          <dbl> 0.007246377, 0.034188034, 0.022580645, 0.023474178, …
## $ r5          <dbl> 0.018115942, 0.042735043, 0.019354839, 0.046948357, …
## $ r6          <dbl> 0.02536232, 0.12393162, 0.03870968, 0.06103286, 0.12…
## $ r7          <dbl> 0.08695652, 0.16239316, 0.05161290, 0.11267606, 0.13…
## $ r8          <dbl> 0.1086957, 0.1666667, 0.1354839, 0.2112676, 0.180451…
## $ r9          <dbl> 0.15579710, 0.09829060, 0.17096774, 0.18309859, 0.13…
## $ r10         <dbl> 0.5398551, 0.2692308, 0.4322581, 0.3098592, 0.338345…

Episodes from the middle of the season

To make the discussion a little more interesting, let’s derive a new variable, middle_eps, which answers the question “Does this episode belong to the middle of its season?” An episode belongs to the middle of a season when it falls within the central 60% of that season’s episodes, that is, strictly between the 20th and 80th percentiles of the episode positions.

# For each series and season, count the episodes (n) and compute the
# 20th and 80th percentiles of the episode positions 1..n.
sumario_simples <- 
    episodes %>% 
    select(season_ep, season, series_name) %>%
    group_by(series_name, season) %>% 
    summarise(n = n(),
              p20 = quantile(seq_len(n), 0.20),
              p80 = quantile(seq_len(n), 0.80))

episodes <- left_join(episodes, sumario_simples,
                      by = c("series_name", "season")) %>% 
    group_by(series_name, season) %>%
    mutate(middle_eps = (season_ep > p20) &
               (season_ep < p80)) %>% 
    ungroup()
episodes %>% 
    select(series_name, series_ep, middle_eps)
## # A tibble: 194 x 3
##    series_name               series_ep middle_eps
##    <chr>                         <dbl> <lgl>     
##  1 Xena a Princesa Guerreira         1 FALSE     
##  2 Xena a Princesa Guerreira         2 FALSE     
##  3 Xena a Princesa Guerreira         3 FALSE     
##  4 Xena a Princesa Guerreira         4 FALSE     
##  5 Xena a Princesa Guerreira         5 FALSE     
##  6 Xena a Princesa Guerreira         6 TRUE      
##  7 Xena a Princesa Guerreira         7 TRUE      
##  8 Xena a Princesa Guerreira         8 TRUE      
##  9 Xena a Princesa Guerreira         9 TRUE      
## 10 Xena a Princesa Guerreira        10 TRUE      
## # … with 184 more rows
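
To make the cutoff concrete, the sketch below recomputes which episode positions fall strictly between the 20th and 80th percentiles for two season lengths. The lengths 24 and 10 are assumptions chosen for illustration (the first is consistent with the output above, where episode 6 is the first middle episode), and middle_range is a hypothetical helper that is not part of the original analysis.

```r
# Hypothetical helper: episode numbers that count as "middle" in a season
# of n episodes, using the same strict p20/p80 comparison as above.
middle_range <- function(n) {
  eps <- seq_len(n)
  p20 <- quantile(eps, 0.20, names = FALSE)
  p80 <- quantile(eps, 0.80, names = FALSE)
  eps[eps > p20 & eps < p80]
}

# First and last middle episode for each assumed season length
range(middle_range(24))
range(middle_range(10))
```

With R’s default quantile type, a 24-episode season has cutoffs of 5.6 and 19.4, so episodes 6 to 19 are flagged; a 10-episode season has cutoffs of 2.8 and 8.2, so episodes 3 to 8 are flagged. Because the comparison is strict and the cutoffs are interpolated, the flagged share can be slightly below 60% (14 of 24 episodes in the first case).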

So, who fared better?



Let’s compare the episode ratings of both series across their six seasons:



library(plotly)  # ggplotly() and layout()

# Plot margins in pixels (bottom, right, top)
m <- list(
  b = 100,
  r = 185,
  t = 75
  )

p <- episodes %>% 
      ggplot(aes(x = series_name, y = user_rating, 
                 color = middle_eps,
                 group = episode, 
                 # Text shown in the plotly tooltip on hover
                 text = paste(
                    "Series:", series_name,
                    "\nEpisode:", episode,
                    "\nRating:", user_rating
                     ))) + 
        geom_jitter(width = 0.3, alpha = 0.7) +
        facet_wrap(~ season) +
        xlab("") +
        ylab("User Rating\n.") +
        scale_x_discrete(labels = c("GoT", "Xena")) +
        labs(color = 'Middle of the season?') +
        ggtitle("GoT x Xena (Season by Season)") +
        # theme() is used instead of adding theme_update(), which changes the
        # global theme and returns the previous one.
        theme(axis.text.x = element_text(angle = 90, hjust = 1),
              plot.title = element_text(hjust = -1))
ggplotly(p, 
         autosize = TRUE,
         tooltip = "text") %>%
  layout(margin = m)
[Figure: “GoT x Xena (Season by Season)”. Jittered points of User Rating for GoT and Xena, one panel per season (1 to 6), colored by “Middle of the season?” (FALSE/TRUE).]

Across all six seasons, Game of Thrones (GoT) is rated higher than Xena the Warrior Princess (Xena), both for episodes at the beginning or end of a season and for episodes in the middle. The gap is less clear-cut in seasons 5 and 6.
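
The visual comparison can be backed by a small numeric summary. The sketch below is one possible way to do it, not part of the original analysis: median_ratings is a hypothetical helper, and the choice of the median as the summary statistic is an assumption. It expects a data frame with the columns series_name, season, middle_eps and user_rating, as built earlier.

```r
library(dplyr)

# Hypothetical helper: median rating and episode count for each
# combination of series, season and position within the season.
median_ratings <- function(episodes) {
  episodes %>%
    group_by(series_name, season, middle_eps) %>%
    summarise(median_rating = median(user_rating),
              n_episodes = n(),
              .groups = "drop") %>%
    arrange(season, middle_eps, series_name)
}
```

Calling median_ratings(episodes) returns one row per series, season and middle_eps value, so the two series can be read side by side within each season.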


Best-rated episodes (a curiosity)

Among the contenders for the highest rating (9.9) we have:

  • The Rains of Castamere: An episode that broke the hearts of many fans. Without spoilers, we can say that the heartbreak was so great that some considered abandoning the series.

  • Battle of the Bastards and The Winds of Winter: The first has the biggest battle staged in the whole series as of this writing. The second has suicides, explosions and the most anticipated revelation of the whole series: R+L=J.

  • Hardhome: This episode from season 5 had special effects of such high quality that rumors spread that the visual quality of the rest of the season went downhill for lack of budget.