EDA about series ratings on IMDB

Introduction

Analyzed Dataset

Exploratory Data Analysis on IMDB data about TV and streaming series. The original data and variables come from this repository. There you can find an explanation of how the data was retrieved and the meaning of each variable.

library(tidyverse)
library(here)

episodes <- read_csv(here("evidences/series_from_imdb.csv"),
                    progress = FALSE,
                    col_types = cols(.default = col_double(), 
                                     series_name = col_character(), 
                                     episode = col_character(), 
                                     url = col_character(),
                                     season = col_character())) 
episodes %>% 
    glimpse()

## Observations: 32,070
## Variables: 18
## $ series_name <chr> "13 Reasons Why", "13 Reasons Why", "13 Reasons Why"…
## $ episode     <chr> "Tape 1, Side A", "Tape 1, Side B", "Tape 2, Side A"…
## $ series_ep   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, …
## $ season      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ season_ep   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, …
## $ url         <chr> "http://www.imdb.com/title/tt5174246/", "http://www.…
## $ user_rating <dbl> 8.5, 8.2, 8.1, 8.3, 8.5, 8.3, 8.6, 8.4, 8.9, 8.8, 9.…
## $ user_votes  <dbl> 3661, 3009, 2784, 2658, 2617, 2491, 2548, 2436, 2507…
## $ r1          <dbl> 0.04143948, 0.04176334, 0.04446038, 0.05065666, 0.05…
## $ r2          <dbl> 0.003816794, 0.003646006, 0.003226963, 0.002251407, …
## $ r3          <dbl> 0.0032715376, 0.0046403712, 0.0046611689, 0.00300187…
## $ r4          <dbl> 0.004634678, 0.006297647, 0.008246683, 0.005253283, …
## $ r5          <dbl> 0.011177754, 0.013258204, 0.019361778, 0.016510319, …
## $ r6          <dbl> 0.031079607, 0.036460060, 0.043743277, 0.038273921, …
## $ r7          <dbl> 0.09133043, 0.13059330, 0.13302259, 0.11031895, 0.09…
## $ r8          <dbl> 0.20692475, 0.27842227, 0.28002868, 0.25628518, 0.20…
## $ r9          <dbl> 0.2764449, 0.2031820, 0.1724632, 0.2112570, 0.243614…
## $ r10         <dbl> 0.3298800, 0.2817368, 0.2907852, 0.3061914, 0.345787…

Brief explanation of the estimators used

- Mean: the sum of the values of the elements of a group divided by the number of elements in that group.
- Mode: the most frequent element in a group of values.
- Median: the value that splits a group of sorted values in half, with 50% of the values above it and 50% below it.
- IQR (interquartile range): the span between the first and third quartiles of a group of sorted values, which contains the central 50% of the values, e.g. for a sequence going from 1 to 100 it goes from roughly 25 to 75 (the 50 central values).
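As a minimal sketch of the four estimators above, the base R snippet below applies them to a small made-up vector and to the sequence from 1 to 100. The helper `stat_mode` is hypothetical (base R has no built-in function for the statistical mode); the other calls are standard base R.

```r
# Hypothetical helper: base R has no function for the statistical mode.
stat_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

values <- c(2, 3, 3, 5, 7, 10)      # made-up example values

mean_value   <- mean(values)        # 30 / 6 = 5
median_value <- median(values)      # midpoint of 3 and 5 = 4
mode_value   <- stat_mode(values)   # 3 appears twice

# For the sequence 1..100 the central 50% of values lies between the quartiles.
quartiles <- quantile(1:100, probs = c(0.25, 0.75))  # 25.75 and 75.25
iqr_width <- IQR(1:100)                              # 75.25 - 25.75 = 49.5
```

R's default quantile interpolation gives 25.75 and 75.25 here rather than exactly 25 and 75, which is why the interval is best read as approximate.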

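# Draw 100,000 values from a normal distribution and plot them against the theoretical quantiles
x <- rnorm(100000, mean = 10, sd = 2)
qqnorm(x, pch = 1, frame = FALSE)
qqline(x, col = "red", lwd = 2)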

The Q-Q plot (quantile-quantile plot) will be used to describe the metrics in this analysis. In the Q-Q plot of a well behaved distribution (Gaussian, symmetric...) the points are well aligned with the reference line, as in the example chart.
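To make "well aligned" more concrete, the sketch below compares a Gaussian sample with a right-skewed (exponential) one. Using the correlation between theoretical and sample quantiles as a rough straightness score is an assumption made here for illustration; it is not a method used elsewhere in this analysis, and `qq_straightness` is a hypothetical helper.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

gaussian_sample <- rnorm(1000, mean = 10, sd = 2)
skewed_sample   <- rexp(1000, rate = 1)   # right-skewed sample

# qqnorm(plot.it = FALSE) returns the theoretical (x) and sample (y) quantiles
# without drawing; their correlation is close to 1 when the points follow a line.
qq_straightness <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}

straight_gaussian <- qq_straightness(gaussian_sample)
straight_skewed   <- qq_straightness(skewed_sample)
```

The Gaussian sample should score very close to 1, while the skewed sample scores lower because its upper tail bends away from the line.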

Data Overview

Number of votes per episode

episodes %>%
    group_by(episode) %>%
    filter(user_votes != max(user_votes)) %>%
    ggplot(aes(user_votes,y=..density..)) +
    geom_histogram(binwidth = 50,
                   fill="grey",
                   color="black") +
    scale_x_continuous(breaks=seq(0,9750,250)) +
    labs(x = "User votes",
         y = "Frequency density") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

We have an obvious mode between 0 and 250 votes.
The highest values are very far away from the mode; this pulls the mean and the median of the distribution above the mode.
There are no values that violate the nature of the quantity (e.g. negative values) and would therefore invalidate our analysis.

Let’s look at the 10 most voted episodes:

episodes %>%
    select(series_name,episode,user_votes) %>%
    top_n(10, user_votes) %>% 
    arrange(desc(user_votes))

## # A tibble: 10 x 3
##    series_name      episode                             user_votes
##    <chr>            <chr>                                    <dbl>
##  1 Game of Thrones  Battle of the Bastards                  138353
##  2 Game of Thrones  The Winds of Winter                      93680
##  3 Breaking Bad     Ozymandias                               87991
##  4 Game of Thrones  Hardhome                                 59404
##  5 Breaking Bad     Felina                                   59235
##  6 Game of Thrones  The Rains of Castamere                   55077
##  7 Game of Thrones  The Door                                 42617
##  8 Game of Thrones  The Lion and the Rose                    31235
##  9 The Walking Dead The Day Will Come When You Won't Be      30624
## 10 Game of Thrones  Home                                     29785

The top 3 most voted episodes have a noticeably higher number of votes than the rest. The most voted episode in particular has roughly 44,700 more votes than the second most voted episode.

episodes %>% 
    group_by(episode) %>%
    ggplot(aes(sample=user_votes)) + 
        stat_qq()

The quantile-quantile plot clearly shows the leap at the highest values, which are very far away from the rest of the distribution.

Episode rating

episodes %>% 
    group_by(episode) %>%
    ggplot(aes(user_rating, ..density..)) +
    geom_histogram(binwidth = 0.1,
                   fill="grey",
                   color="black") +
    labs(x = "Episode Rating",
         y = "Frequency density") +
        scale_x_continuous(breaks=seq(0,10,1))

The most frequent rating is clearly around 8.
The distribution is relatively well behaved (symmetric); there are, however, occurrences of lower ratings (between 3 and 6).
There are no values that violate the nature of the quantity (e.g. negative values) and would therefore invalidate our analysis.

episodes %>% 
    group_by(episode) %>%
    ggplot(aes(sample=user_rating)) + 
        stat_qq()

The lower ratings mentioned previously affect the curvature and indicate that the distribution of episode ratings is not fully symmetric (well behaved).

Season length

sumario_n_seasons <- episodes %>%
    group_by(series_name, season) %>%
    summarise(season_length = n()) 

episodes <- left_join(episodes, sumario_n_seasons, by= c("series_name","season")) 

episodes %>% 
    select(series_name, season, season_length) %>% 
    unique() %>%
    sample_n(10)

## # A tibble: 10 x 3
##    series_name                       season season_length
##    <chr>                             <chr>          <int>
##  1 Lip Service                       2                  6
##  2 DCs Legends of Tomorrow           1                 16
##  3 Dallas                            2                 24
##  4 The Middle                        5                 24
##  5 Gotham                            3                 19
##  6 Crazy Ex-Girlfriend               1                 18
##  7 Beowulf Return to the Shieldlands 1                 12
##  8 Mike e Molly                      5                 22
##  9 Merlin                            1                 13
## 10 Elementary                        2                 24

episodes %>%
    select(series_name,
           season,
           season_length) %>% 
    unique() %>%
    ggplot(aes(season_length,..density..)) + 
    geom_density(aes(y= 2.5 *..density..),
             color="lightpink") +
    geom_histogram(binwidth = 1, 
                   fill="grey",
                   color="black") +
    scale_x_continuous(breaks=seq(0,120,5)) +
    labs(y = "Frequency density", x = "Season length")

The most frequent season lengths are around 10 and 20 episodes (two modes, i.e. bimodal).
A few seasons are curiously long.
There are no values that violate the nature of the quantity (e.g. negative values) and would therefore invalidate our analysis.

Let’s look at the seasons that caught our attention for being unusually long:

episodes %>%
    select(series_name,
           season,
           season_length) %>% 
    unique() %>% 
    top_n(10, season_length) %>% 
    arrange(desc(season_length))

## # A tibble: 10 x 3
##    series_name              season season_length
##    <chr>                    <chr>          <int>
##  1 Yu Yu Hakusho            1                112
##  2 Anger Management         2                 90
##  3 Os Cavaleiros do Zodíaco 1                 73
##  4 One Piece                2                 68
##  5 Thundercats              1                 65
##  6 One Piece                15                62
##  7 One Piece                18                60
##  8 One Piece                17                58
##  9 One Piece                5                 57
## 10 One Piece                3                 53

The series Yu Yu Hakusho, which has 4 seasons, is wrongly classified as a single season. For this reason, we will not include Yu Yu Hakusho in our analysis regarding seasons.

episodes %>% 
    filter(series_name != "Yu Yu Hakusho") %>%
    ggplot(aes(sample=season_length)) +
    stat_qq()

In the Q-Q plot there’s a step (discontinuity), which matches the two modes (the two most frequent values) in the respective histogram.

Median season length of a series

To represent the overall season length of a series we will use the median. Among the reasons to choose the median over other estimators such as the mean or the trimmed mean:

- The median less frequently yields values that aren’t integers (e.g. 4.6), which would make no sense for something like season length.
- The median is a robust measure; in other words, it’s less easily affected by outliers.
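The robustness argument can be illustrated with a made-up series whose season lengths include one very long, possibly mislabeled season. The numbers below are hypothetical and not taken from the dataset.

```r
# Hypothetical season lengths for one series; 112 plays the role of an outlier.
season_lengths <- c(12, 13, 13, 14, 112)

mean_length   <- mean(season_lengths)    # 164 / 5 = 32.8, pulled up by the outlier
median_length <- median(season_lengths)  # 13, unaffected by the outlier
```

The mean lands on a non-integer value that matches none of the seasons, while the median stays at a typical, integer season length.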

summary_season_length <- episodes %>%
    group_by(series_name) %>%
    summarize(median_season_length = median(season_length)) %>%
    ungroup()

episodes <- left_join(episodes, summary_season_length,
                      by = c("series_name")) 

episodes %>% 
    filter(series_name != "Yu Yu Hakusho") %>%
    select(series_name,
           season_ep,
           median_season_length) %>%
    sample_n(10)

## # A tibble: 10 x 3
##    series_name           season_ep median_season_length
##    <chr>                     <dbl>                <dbl>
##  1 Nip/Tuck                     14                   16
##  2 One Tree Hill                19                   22
##  3 Awkward                      20                   21
##  4 How I Met Your Mother        17                   24
##  5 Suits                        11                   16
##  6 Survivor                     15                   15
##  7 American Idol                 8                   40
##  8 The Simpsons                  9                   22
##  9 Csi Las Vegas                17                   23
## 10 Alias                         9                   22

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    select(series_name, median_season_length) %>% 
    unique() %>%
    ggplot(aes(median_season_length,..density..)) +
    geom_density(aes(y= 2.5 *..density..),
                 color="red") +
    geom_histogram(binwidth = 1,
                   fill="grey",
                   color="black") +
    scale_x_continuous(breaks=seq(0,120,5)) +
        labs(x = "Median season length",
         y = "Frequency density")

The median season length is distributed similarly to the season length metric.

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    select(series_name, median_season_length) %>% 
    unique() %>%
    ggplot(aes(sample=median_season_length)) + 
        stat_qq()

Once again, the median season length behaves similarly to the season length metric.

Consensus about an episode rating

We define the consensus metric as the relative frequency of the most frequent rating (from 1 to 10, columns r1 to r10) that an episode was given.

episodes <- episodes %>%
    group_by(series_name,episode) %>%
    mutate(consensus = max(r1,r2,r3,r4,r5,r6,r7,r8,r9,r10)) %>%
    ungroup()

episodes %>% 
    select(episode,consensus) %>%
    sample_n(10)

## # A tibble: 10 x 2
##    episode                   consensus
##    <chr>                         <dbl>
##  1 New Directions                0.344
##  2 Episode #2.6                  0.290
##  3 Auditions #7                  0.235
##  4 Jr. Sells His Car             0.417
##  5 Lethal Combination            0.288
##  6 Six Chair Challenge 2         0.286
##  7 NS                            0.364
##  8 For the Sake of the Child     0.318
##  9 Madison Berg                  0.285
## 10 The One Hundredth             0.549
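As a row-wise sketch of the same metric, base R's `pmax()` takes the element-wise maximum across columns, so no grouping is needed and episodes that happen to share a title within a series would not be pooled together. The toy data frame below reuses the r1 to r10 values of the first two rows shown in the `glimpse()` output; the name `toy` and the `rating_cols` helper variable are assumptions made for this illustration.

```r
# First two episodes from the glimpse() output above (r1..r10 proportions).
toy <- data.frame(
  r1  = c(0.04143948,   0.04176334),
  r2  = c(0.003816794,  0.003646006),
  r3  = c(0.0032715376, 0.0046403712),
  r4  = c(0.004634678,  0.006297647),
  r5  = c(0.011177754,  0.013258204),
  r6  = c(0.031079607,  0.036460060),
  r7  = c(0.09133043,   0.13059330),
  r8  = c(0.20692475,   0.27842227),
  r9  = c(0.2764449,    0.2031820),
  r10 = c(0.3298800,    0.2817368)
)

rating_cols <- paste0("r", 1:10)

# Element-wise (per-row) maximum across the ten rating columns.
toy$consensus <- do.call(pmax, unname(as.list(toy[rating_cols])))
```

For both rows the largest share is the one for rating 10, so the consensus values are 0.3298800 and 0.2817368.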

episodes %>%
    select(series_name, consensus) %>% 
    unique() %>%
    ggplot(aes(consensus,..density..)) +
    geom_histogram(binwidth = 0.01,
                   fill="grey",
                   color="black") +
    scale_x_continuous(breaks=seq(0,1,0.125)) +
    labs(x = "Level of Consensus",
         y = "Frequency density")

The mode of the level of consensus is around 0.3125; in other words, most commonly about 31.25% of voters gave an episode the same rating.
Higher levels of consensus occur, going up to around 87.50%.
There are no values that violate the nature of the quantity (e.g. negative values) and would therefore invalidate our analysis.

episodes %>%
    select(series_name, consensus) %>% 
    unique() %>%
    ggplot(aes(sample=consensus)) + 
        stat_qq()

The higher values, which are far away from the mode, affect the curvature and indicate that the distribution of the level of consensus is not symmetric (well behaved).

Are shorter seasons better rated?

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    ggplot(aes(user_rating,
               season_length)) +
    geom_point(alpha=0.08,
               color="#ff6666") +
        scale_y_continuous(breaks=seq(0,100,5)) +
        scale_x_continuous(breaks=seq(0,10,1)) +
          labs(x = "User rating",
         y = "Season length")

Seasons of between 1 and 25 episodes are visibly dispersed, with ratings ranging from 3 to 10, but most of the occurrences have ratings around 8 (roughly between 7 and 9). This pattern holds overall regardless of season length and matches the mode of the ratings distribution, which is 8.

There are, however, two exceptions:

- For seasons of around 40 episodes there’s a sizable drop in ratings.
- Seasons above 40 episodes seem to experience an increase in user ratings; its relevance doesn’t seem clear or strong, however.

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    ggplot(aes(season_length, 
               user_rating, 
               group=season_length)) +
    geom_boxplot(position = "dodge",
                 outlier.shape = NA,
                 color="#ff6666") +
    scale_x_continuous(breaks=seq(0,90,5)) +
    scale_y_continuous(breaks=seq(0,10,1)) +
    labs(y = "User rating",
         x = "Season length")

Let’s now look at boxplots, focusing on the median. To get a clearer view of the chart we disregard the outliers. In terms of the median, what was seen in the scatter plot repeats itself, with the drop in the ratings of seasons of around 40 episodes.

The lower and upper limits of the boxplots (the whiskers) show that the increase in the ratings of seasons above 40 episodes doesn’t seem very relevant.
- The boxplots of these seasons lie between the upper and lower limits of the boxplots of shorter seasons (there’s a strong overlap of values).

Are series with shorter seasons better rated?

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    group_by(series_name) %>%
    unique() %>%
    ggplot(aes(user_rating,
               median_season_length)) +
    geom_point(alpha = 0.1, color="#A58AFF") +
    scale_y_continuous(breaks=seq(0,100,5)) +
    scale_x_continuous(breaks=seq(0,10,1)) +
        labs(x = "User rating",
             y = "Median season length")

Looking at the ratings by median season length, we see that the distribution of ratings behaves similarly to the ratings by season length. The noticeable effect of focusing on the median season length is an even sharper drop in user ratings around 40 episodes.

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    group_by(series_name) %>%
    unique() %>%
    ggplot(aes(median_season_length,
           user_rating,
           group=median_season_length)) +
    geom_boxplot(position = "dodge",
                 outlier.shape = NA,
                 coef = 0,
                 fill="#A58AFF",
                 width=0.5) +
    scale_x_continuous(breaks=seq(0,90,5)) +
    scale_y_continuous(breaks=seq(0,10,1)) +
    labs(y = "User rating",
         x = "Median season length")

The boxplot shows similar results to those provided by the analysis of ratings by season length.

Does season length affect consensus?

episodes %>%
    filter(series_name != "Yu Yu Hakusho") %>%
    group_by(series_name,season) %>%
    ggplot(aes(season_length,consensus)) +
    geom_point(alpha=0.3) +
    scale_y_continuous(breaks=seq(0,1,0.1)) +
    scale_x_continuous(breaks=seq(0,100,5)) +
    labs(y = "Level of consensus",
         x = "Season length")

For seasons over 30 episodes, the level of consensus overall remains between 0.2 and 0.5 (20% to 50%).
There’s considerable dispersion in the level of consensus for seasons of up to 30 episodes. This suggests that season length alone does not explain the level of consensus for seasons in that range.

Conclusion

Overall, season length doesn’t have a meaningful effect on a series’ rating. Season length does not entail a better or worse rated season, whether we look at the length of each season or at the median season length of a series.
- The exception is that for seasons (and series with a median season length) of around 40 episodes there’s a sizable drop in user ratings.

For seasons of up to 30 episodes, season length does not seem to provide enough information to explain the level of consensus.
- For seasons over 30 episodes, season length seems to have a limiting effect on the level of consensus, which in general does not go above 0.5 (50%). Especially long seasons could mean episodes of irregular quality, such as filler episodes.

Are the most voted episodes better rated?

episodes %>% 
    ggplot(aes(user_votes,user_rating,
               size = user_votes)) +
    geom_point(alpha=.3,
               position = position_jitterdodge()) +
    scale_y_continuous(breaks=seq(0,10,1)) +
    scale_x_continuous(breaks=seq(0,1000000,15000)) +
    labs(y = "User rating",
         x = "Number of votes")

There’s a visible increase in the number of votes for ratings above 7. In the range of ratings between 8 and 10 the increase in the number of votes is noticeably large.

episodes %>% 
    group_by(episode) %>%
    ggplot(aes(user_rating)) +
        geom_density(aes(y= 0.1 * ..density..),
                 color="darkblue") +
    geom_vline(xintercept = mean(episodes$user_rating), color = "darkred") + 
    geom_vline(xintercept = median(episodes$user_rating), color = "darkgreen") +
    scale_x_continuous(breaks=seq(0,10,0.5)) +
    labs(y = "Frequency density",
         x = "User rating")

As noted before, the rating 8 is a particularly interesting point (the mode) in the distribution of ratings. The mean (red line) and the median (green line) also fall close to that same rating. For this reason we will compare the consensus of the episodes rated below 8 (not well_rated) with the consensus of the episodes rated 8 or above (well_rated).

episodes <- 
    episodes %>% 
    mutate(well_rated = user_rating >= 8)

episodes %>% 
    select(episode,
           user_rating,
           well_rated) %>%
    sample_n(10)

## # A tibble: 10 x 3
##    episode                               user_rating well_rated
##    <chr>                                       <dbl> <lgl>     
##  1 Bloodlines                                    7.8 FALSE     
##  2 Toxicity                                      8.2 TRUE      
##  3 Inside Out                                    8.1 TRUE      
##  4 The Invitations                               8.9 TRUE      
##  5 Episode #4.7                                  8.1 TRUE      
##  6 There Is No Other Way                         7.9 FALSE     
##  7 Brother in Arms                               8.5 TRUE      
##  8 Urban Matrimony and the Sandwich Arts         7.9 FALSE     
##  9 Skip Day                                      6.7 FALSE     
## 10 Blink                                         7.6 FALSE

episodes %>% 
    ggplot(aes(user_votes,well_rated,
               color = well_rated)) +
    geom_jitter(alpha = .8, width = 0.6) +
    scale_x_continuous(breaks=seq(0,1000000,15000)) +
            labs(x = "Number of votes")

The results are similar to the scatter plot, with no additional visible impact of the well_rated characteristic. The connection between higher ratings and number of votes is even clearer.

Do better rated episodes imply higher levels of consensus?

episodes %>% 
    ggplot(aes(consensus,well_rated,
               color = well_rated)) +
    geom_jitter(alpha=0.08) +
    scale_x_continuous(breaks=seq(0,1,0.1)) +
    labs(x = "Level of consensus")

Curiously, the areas of highest density in the chart (the darker areas) of the two groups are relatively close to each other. Even so, it’s still possible to notice that the mass of occurrences of the well_rated group is the rightmost one (higher level of consensus).

Better rated episodes generate a higher level of consensus: occurrences above 0.5 (50%) are more frequent among the well_rated episodes, and overall their mass of occurrences lies further to the right (higher level of consensus) than that of the episodes that were not well rated.
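A numeric summary could complement the jitter plot. The sketch below uses base R's `tapply()` on a tiny made-up data frame, so the values are hypothetical and only show the shape of the computation; applying the same two calls to the real `episodes` columns is left as an assumption about how one might check the visual impression.

```r
# Made-up ratings and consensus values, for illustration only.
toy <- data.frame(
  user_rating = c(6.7, 7.6, 7.9, 8.1, 8.5, 8.9),
  consensus   = c(0.25, 0.28, 0.30, 0.31, 0.36, 0.55)
)
toy$well_rated <- toy$user_rating >= 8   # same threshold as in the analysis

# Median consensus within each group (names are "FALSE" and "TRUE").
median_consensus <- tapply(toy$consensus, toy$well_rated, median)

# Share of episodes in each group whose consensus exceeds 0.5.
share_above_half <- tapply(toy$consensus > 0.5, toy$well_rated, mean)
```

With these toy values the median consensus is 0.28 for the episodes that are not well rated and 0.36 for the well rated ones, and only the well rated group has any episode above 0.5.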

Conclusion

There was a visible increase in the number of votes at higher ratings (especially above 8).
Better rated episodes seem to imply higher levels of consensus.

Benardi Nunes