Do underdogs win in the Game of Thrones book series? We use data to find out

Everyone loves a good underdog story. A relatable, disadvantaged, and beloved character fights against insurmountable odds and ultimately comes out on top. But what about Game of Thrones, a book series (also known as A Song of Ice and Fire) and television show notorious for killing its main characters? Do the underdogs still prevail? We can answer this question using data science.

Collecting data from Game of Thrones battles

First, we need data on our underdogs. We will use the “battles” dataset from Kaggle, which contains information on battles throughout the Game of Thrones book series. For each battle, we know the attackers and the defenders, the size of their armies, their associated kings, commanders, and noble families, and the region where the battle takes place. We can also derive variables that might be useful, such as the number of attacking houses and the number of defending houses. Finally, we want to use only the battles whose outcome is known.


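library(dplyr)    # select(), filter(), mutate(), count(), %>%
library(tidyr)    # drop_na(), pivot_longer()
library(ggplot2)  # plotting

# Read the Kaggle battles file; empty strings become NA and 0/1 columns become logical
battles <- data.table::fread("GoT_battles.csv", 
                             stringsAsFactors = TRUE, 
                             logical01 = TRUE, 
                             na.strings = "", 
                             drop = c("defender_3", "defender_4", "battle_number")) # all NA columns, row numbers

# Count how many houses fought on each side (non-missing attacker_* / defender_* columns)
battles$num_attackers <- rowSums(!is.na(select(battles, attacker_1, attacker_2, attacker_3, attacker_4)))
battles$num_defenders <- rowSums(!is.na(select(battles, defender_1, defender_2)))

# Keep only battles with a known outcome
battles <- drop_na(battles, attacker_outcome)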

Defining an underdog

So, what does it mean to be an underdog? We can explore two different definitions.

  1. The first is to assume that the attackers are the “bad guys” and the defenders are the underdogs that everyone wants to support.
  2. The second is to compare the size of the armies, and assume the team with the smaller army is the underdog.

Exploratory data analysis

Let’s perform exploratory data analysis to see what we can uncover.

First to attack vs. Underdog

First let’s look at the distribution of the variable attacker_outcome. This is the overall number of times that the attacker has won versus lost.


battles %>%
  count(attacker_outcome, name = "count") %>%
  ggplot(aes(x = attacker_outcome, y = count)) +
  geom_col(fill = "#006094") +
  geom_text(aes(label = count), vjust = -0.3, size = 5) +
  scale_y_continuous(limits = c(0,35), breaks = seq.int(0, 35, 5), minor_breaks = NULL) +
  labs(title = "The attackers won 86% of the time") +
  theme_minimal()

Bar chart of attacker results

If we define the attackers as the “bad guys” and the defenders as the underdogs, the underdogs clearly lose most of the time in the Game of Thrones series.
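The percentage in the chart title above is typed in by hand. As a minimal sketch of how it could be computed instead, the snippet below uses a small made-up vector, `toy_outcomes` (an assumption for illustration, not the real Kaggle data), and builds the title from the computed share.

```r
# Hypothetical toy outcomes, NOT the real battles data
toy_outcomes <- factor(c("win", "win", "win", "loss", "win", "win", "loss", "win"))

# Comparing against "win" gives a logical vector; its mean is the share of TRUE values
attacker_win_share <- mean(toy_outcomes == "win")

# Build the chart title from the computed share instead of hardcoding it
title_text <- sprintf("The attackers won %.0f%% of the time", 100 * attacker_win_share)
print(title_text)
```

With the real data, the same two lines applied to `battles$attacker_outcome` would keep the title in sync with the chart if the dataset ever changes.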


# battles %>% filter(attacker_outcome == 'win')

Smaller vs. Larger army

The other proposed definition compares the size of the armies to determine which side is the underdog. Unfortunately, we do not have army sizes for some battles, so we can look only at the subset with values for both the attacker army size and the defender army size. We also consider only the cases where the army sizes differ: if the armies are the same size, neither is an underdog by this definition.


battles %>%
  filter(!is.na(attacker_size), !is.na(defender_size), attacker_size != defender_size) %>%
  mutate("winner" = ifelse(attacker_outcome == 'win', "attacker", "defender"),
         "larger_army" = ifelse(attacker_size > defender_size, "attacker", "defender"),
         "larger_wins" = (winner == larger_army)) %>%
  count(larger_wins, name = "count") %>%
  ggplot(aes(x = larger_wins, y = count)) +
  geom_col(fill = "#006094") +
  geom_text(aes(label = count), vjust = -0.3, size = 5) +
  scale_y_continuous(limits = c(0,10), breaks = seq.int(0, 10, 2), minor_breaks = NULL) +
  labs(title = "The smaller army won 69% of the time") +
  theme_minimal()

bar chart example of smaller vs. larger army results

By this definition, the underdogs won 69% of the time. We can also break these totals down by attacker and defender.


battles %>%
  filter(!is.na(attacker_size), !is.na(defender_size), attacker_size != defender_size) %>%
  mutate("winner" = ifelse(attacker_outcome == 'win', "attacker", "defender"),
         "larger_army" = ifelse(attacker_size > defender_size, "attacker", "defender"),
         "larger_wins" = (winner == larger_army)) %>%
  group_by(winner) %>%
  count(larger_wins, name = "count") %>%
  ggplot(aes(x = winner, y = count, group = larger_wins)) +
  geom_col(aes(fill = larger_wins), position = position_dodge()) +
  scale_fill_manual(values = c("TRUE" = "#006094", "FALSE" = "#6441bd")) +
  geom_text(aes(label = count), vjust = -0.3, size = 5, position = position_dodge(width = 0.9)) +
  scale_y_continuous(limits = c(0,7), breaks = seq.int(0, 7, 1), minor_breaks = NULL) +
  labs(title = "For both attackers and defenders, \nthe smaller army won the majority of the time") +
  theme_minimal()

bar chart for attackers and defenders by army size

The winner has the smaller army the majority of the time, regardless of whether that winner is the attacker or the defender.

Results

By the first definition, the Game of Thrones series has no sympathy for the underdogs. After all, in the game of thrones, you win or you die. Attacking first provides much better odds.

By the second definition, based on army size, underdogs have a better chance. However, we must keep in mind that we are missing a lot of information about the size of the armies. In fact, it is possible that the size of the army is explicitly stated only to make the smaller army’s win that much more impressive and dramatic. With the army-size data this patchy, the safer conclusion is, unfortunately, that the underdogs lose.

Underdogs and direwolves

But there is one more definition that we haven’t considered. We have general definitions for “underdog”, but what if we take the narrative itself into account? Who do the books want us to rally behind? Arguably the Stark family! So, in one last chance for the underdogs, which king wins the most battles?


battles %>%
  select(attacker_outcome, attacker_king, defender_king) %>%
  filter(!is.na(attacker_king), !is.na(defender_king)) %>%
  pivot_longer(!attacker_outcome, names_to = "role", values_to = "king") %>%
  mutate(battle_outcome = ifelse((attacker_outcome == "win" & role == "attacker_king") | 
                                   (attacker_outcome == "loss" & role == "defender_king"), "win", "loss")) %>%
  group_by(king) %>%
  count(battle_outcome, name = "count") %>%
  mutate("count_proportion" = count / sum(count),
         "modified_count" = ifelse(battle_outcome == "win", count, -count)) %>%
  ggplot(aes(x = king, y = modified_count, group = battle_outcome)) +
  geom_col(aes(fill = battle_outcome), position = position_stack()) +
  scale_fill_manual(values = c("win" = "#006094", "loss" = "#6441bd")) +
  geom_text(aes(y = ifelse(modified_count <0, modified_count - 3, modified_count + 3), 
                label = paste0(round(count_proportion*100, 1), "%")), size = 4) +
  scale_y_continuous(limits = c(-20,20), breaks = seq.int(-20, 20, 5), minor_breaks = NULL, labels = abs(seq.int(-20, 20, 5))) +
  labs(title = "The Stark family lost the majority of the time",
       y = "Battle Count") +
  coord_flip() +
  theme_minimal()

 

chart showing game of thrones battle results by family

Unfortunately for the Stark family, or at least Robb Stark, they lose many more battles than they win. They have a win percentage of only 37.5% despite having the second-highest total number of battles. So, if the Stark family are the underdogs, then in the Game of Thrones world the underdogs do not prevail. Or perhaps, if we are being optimistic, this is only setting the stage for them to come back in force and win in the end.

What could help the Starks win in the Game of Thrones?

So, what would it take for the Stark family to win? What are the traits that make a winner?

Battle type

One element we can look at is the type of battle. For example, we have ambushes, pitched battles (that is, battles fought at a predetermined time and location), and sieges. How does Robb Stark fare in these?


battles %>%
  filter(attacker_king == "Robb Stark" | defender_king == "Robb Stark") %>%
  mutate("is_stark_win" = ifelse(((attacker_outcome == 'win' & attacker_king == "Robb Stark") | 
                                   (attacker_outcome == 'loss' & defender_king == "Robb Stark")), 
                                 TRUE, FALSE)) %>%
  group_by(is_stark_win) %>%
  count(battle_type, name = "count") %>%
  
  ggplot(aes(x = battle_type, y = count, group = is_stark_win)) +
  geom_col(aes(fill = is_stark_win), position = position_dodge()) + 
  geom_text(aes(label = count), vjust = -0.3, size = 5, position = position_dodge(width = 0.9)) +
  scale_y_continuous(limits = c(0,8), breaks = seq.int(0,8, 1), minor_breaks = NULL) +
  scale_fill_manual(values = c("TRUE" = "#006094", "FALSE" = "#6441bd")) +
  labs(title = "Robb Stark lost pitched battles 78% of the time") +
  theme_minimal()

bar chart results of robb stark wins by battle type

Robb Stark lost 77.8% of the pitched battles he fought. Pitched battles also make up 46.7% of all the battles that he lost. He would be better off with ambushes, where he at least has a 50-50 track record.

Season of the battle

Another factor that may help the northern Starks is the season. Perhaps they win more battles in the winter than in the summer?


battles %>%
  filter(attacker_king == "Robb Stark" | defender_king == "Robb Stark") %>%
  mutate("is_stark_win" = ifelse(((attacker_outcome == 'win' & attacker_king == "Robb Stark") | 
                                   (attacker_outcome == 'loss' & defender_king == "Robb Stark")), 
                                 TRUE, FALSE)) %>%
  group_by(is_stark_win) %>%
  count(summer, name = "count") %>%
  
  ggplot(aes(x = summer, y = count, group = is_stark_win)) +
  geom_col(aes(fill = is_stark_win), position = position_dodge()) + 
  geom_text(aes(label = count), vjust = -0.3, size = 5, position = position_dodge(width = 0.9)) +
  scale_y_continuous(limits = c(0,13), breaks = seq.int(0, 13, 2), minor_breaks = NULL) +
  scale_fill_manual(values = c("TRUE" = "#006094", "FALSE" = "#6441bd")) +
  labs(title = "Robb Stark lost the only two winter battles he participated in") +
  theme_minimal()

bar chart for robb stark's battle wins by season

Upon investigation, the season does not help him. There were only two battles in the winter, and Robb Stark lost both of them. Of the battles that we know took place in the summer, he won only 42.9%. Thus, there isn’t a particular season that would lead to a Stark win.

Region of the battle

One last factor that we will consider is the region of the battle. If the Starks do not have a winter advantage, perhaps they at least have an advantage fighting in the north?

All the battles are in four regions: North, Riverlands, Westerlands, Crownlands.

  • Torrhen’s Square = North
  • Stony Shore = North
  • Moat Cailin = North
  • Deepwood Motte = North
  • Whispering Wood = Riverlands
  • The Twins = Riverlands
  • Seagard = Riverlands
  • Ruby Ford = Riverlands
  • Riverrun = Riverlands
  • Red Fork = Riverlands
  • Raventree = Riverlands
  • Mummer’s Ford = Riverlands
  • Harrenhal = Riverlands
  • Green Fork = Riverlands
  • Darry = Riverlands
  • Oxcross = Westerlands
  • Golden Tooth = Westerlands
  • Crag = Westerlands
  • Duskendale = Crownlands


battles %>%
  filter(attacker_king == "Robb Stark" | defender_king == "Robb Stark") %>%
  mutate("is_stark_win" = ifelse(((attacker_outcome == 'win' & attacker_king == "Robb Stark") | 
                                   (attacker_outcome == 'loss' & defender_king == "Robb Stark")), 
                                 TRUE, FALSE)) %>%
  group_by(is_stark_win) %>%
  count(region, name = "count") %>%
  
  ggplot(aes(x = region, y = count, group = is_stark_win)) +
  geom_col(aes(fill = is_stark_win), position = position_dodge()) + 
  geom_text(aes(label = count), vjust = -0.3, size = 5, position = position_dodge(width = 0.9)) +
  scale_fill_manual(values = c("TRUE" = "#006094", "FALSE" = "#6441bd")) +
  scale_y_continuous(limits = c(0,9), breaks = seq.int(0, 9, 1), minor_breaks = NULL) +
  labs(title = "Robb Stark lost 5 of the 6 battles fought in the North") +
  theme_minimal()

result chart of robb stark battles by region

It seems that for Robb Stark, not even a fight in home territory was a guaranteed win. He lost the one and only battle that he fought in the Crownlands, and he lost all but one of the battles fought in the North. He won the most battles in the Riverlands, yet even there he lost 57% of the time. Finally, of the three battles that he fought in the Westerlands, he won two. This is the only region where he won more battles than he lost.

Taken together, these patterns suggest that Robb Stark should favor ambushes in the Westerlands, in either the summer or the winter.

Process and helpful techniques

One problem with this sort of question is our lack of data. We have only 37 rows in our dataset, and some of the values are missing. Although we can slice and dice and look at historical percentages, we do not have enough data to create a model to truly predict the outcome. In these cases, we have to rely on exploratory data analysis and feature engineering.
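To get a feel for how little a percentage from so few battles pins down, here is a rough sketch using base R's `binom.test()` to compute an exact 95% confidence interval for a win rate. The counts are an assumption for illustration: 9 wins in 24 battles is one combination that produces the 37.5% win rate quoted earlier, but the article does not state the underlying counts.

```r
# Assumed counts for illustration only: 9 / 24 = 37.5%
wins <- 9
battles_fought <- 24

# binom.test() returns an exact (Clopper-Pearson) confidence interval for a proportion
ci <- binom.test(wins, battles_fought)$conf.int

# Lower and upper bounds of the 95% interval for the underlying win rate
print(round(ci, 2))
```

A wide interval here is a reminder that the percentages in this article describe what happened in the books rather than reliably predicting what would happen next.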

We can get more information from the data using a wide variety of techniques. Here are just a few:

1. You can add new columns to the data based on other columns

We did this for one of our charts above. Our data had only a win/loss column to indicate whether the attacker won. We could use this information to create a “who won” column, where the value is either the attacker or the defender. The same is true of determining who had the larger army. The dataset had columns for the army sizes, and from them we were able to add a new column that says whether the winner had the larger army.
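As a minimal sketch of this idea, the snippet below derives the same three columns on a tiny made-up table. The column names mirror the Kaggle data, but `toy_battles` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy battles; values are made up for illustration
toy_battles <- tibble(
  attacker_outcome = c("win", "loss", "win", "win"),
  attacker_size    = c(1000, 5000, 2000, 800),
  defender_size    = c(4000, 3000, 6000, 500)
)

toy_derived <- toy_battles %>%
  mutate(
    # Who won, expressed as a side rather than as the attacker's outcome
    winner      = if_else(attacker_outcome == "win", "attacker", "defender"),
    # Which side brought the larger army
    larger_army = if_else(attacker_size > defender_size, "attacker", "defender"),
    # TRUE when the winning side is also the larger side
    larger_wins = winner == larger_army
  )

print(toy_derived)
```

Later columns in the same `mutate()` call can refer to earlier ones, which is why `larger_wins` can be built from `winner` and `larger_army` in one step.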

2. You can group by existing columns

We could calculate the win/loss percentage for the dataset overall. However, if we grouped by battle type, we could calculate the win/loss percentage for each battle type. Grouping by columns is a great way to view your data from another perspective. We could group by any categorical column (those with specific discrete values).
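A small sketch of grouping, assuming a made-up table `toy_types` with one row per battle and a logical `stark_won` column (both names are hypothetical, not columns of the Kaggle file):

```r
library(dplyr)

# Hypothetical toy data: one row per battle
toy_types <- tibble(
  battle_type = c("ambush", "ambush", "pitched battle",
                  "pitched battle", "pitched battle", "siege"),
  stark_won   = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

win_rates <- toy_types %>%
  group_by(battle_type) %>%
  summarise(
    battles  = n(),              # number of battles of this type
    win_rate = mean(stark_won),  # share of those battles that were won
    .groups  = "drop"
  )

print(win_rates)
```

Swapping `battle_type` for any other categorical column gives the same summary from a different angle.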

3. You can collapse categorical variables

In the case of location, we have way too many discrete values. You could group by all nineteen locations, but that might not be the most useful. What you can do is use external knowledge about the relationship between columns or within a column. We know that each location is within a region. Thus, we can create a new column that collapses the nineteen location values into the four regions. We then can group by and explore various metrics by our newly created region column.
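One way to sketch this collapse is with a named lookup vector. The location-to-region pairs below are a subset of the list given earlier in this article; the `toy_locations` rows are made up for illustration.

```r
library(dplyr)

# Lookup from location to region (a subset of the article's list)
region_lookup <- c(
  "Moat Cailin"     = "North",
  "Deepwood Motte"  = "North",
  "Whispering Wood" = "Riverlands",
  "Riverrun"        = "Riverlands",
  "Oxcross"         = "Westerlands",
  "Duskendale"      = "Crownlands"
)

# Hypothetical toy rows; only the location names come from the article
toy_locations <- tibble(
  location = c("Riverrun", "Oxcross", "Moat Cailin", "Riverrun", "Duskendale")
)

toy_regions <- toy_locations %>%
  # Index the lookup by location name to get each row's region
  mutate(region = unname(region_lookup[location])) %>%
  # Count battles per collapsed region
  count(region, name = "count")

print(toy_regions)
```

Any location missing from the lookup would come back as `NA`, which is a convenient signal that the mapping needs another entry.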

4. You can pivot the dataset

Pivoting can be challenging to understand, but it boils down to changing the shape of the data. Data is said to be wide when it has many columns relative to its rows. Long data is the opposite: it has few columns and many rows. Pivoting can shift your data into a format that is more useful for your particular use case, which is often plotting.

Example: Pivoting Wide Data to Long

We have one column for favorite color and a column for each age group surveyed. For example, in this fictional data nine people over the age of 100 years old said that their favorite color was green.

example of favorite color data

However, we might want our data in long form. We can pivot this table so that every cell value moves into a single column, “count”. We want favorite color to remain as it is, but we want to turn the age bracket columns into a single column, “age”.

pivot wide to long data example table

Now we have a table with 40 rows, one for each former cell in the wide table.

If the case were reversed and we had a really long table, we could pivot to make it wider instead. The values of one column become new column names, and the values of a second column fill them in. Here, each age group in “age” would become its own column, filled with the values from the “count” column.
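Both directions can be sketched with `pivot_longer()` and `pivot_wider()` from tidyr. The table below is a made-up miniature of the survey: only the "nine people over 100 chose green" cell comes from the example above, and the column names and other values are assumptions.

```r
library(tidyr)

# Hypothetical wide table: one row per color, one column per age bracket
wide_colors <- data.frame(
  favorite_color = c("green", "blue"),
  under_18       = c(12, 20),
  age_18_to_99   = c(30, 25),
  over_100       = c(9, 4)
)

# Wide to long: every age bracket column becomes a row in "age",
# and the cell values move into "count"
long_colors <- pivot_longer(
  wide_colors,
  cols      = !favorite_color,
  names_to  = "age",
  values_to = "count"
)

# Long to wide: the reverse, one column per value of "age"
wide_again <- pivot_wider(long_colors, names_from = age, values_from = count)

print(long_colors)
print(wide_again)
```

The long table has one row per former cell (here 2 colors times 3 age brackets), which is the shape ggplot2 generally expects.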

Putting it all together (the Data Scientist as a storyteller)

This is just a small sample of techniques that are very useful in exploring and manipulating your data. These techniques can be combined to shape the data to your needs, all without the use of modeling. In fact, this stage is just as important as modeling.

In order to determine the most powerful features, you have to understand and really dig into your data. Set your curiosity free and think about new ways to view the data, whether that means pivoting, adding columns, grouping by columns, or collapsing columns. The data has a story to tell, and it’s up to you as the Data Scientist to find and tell it. You must act as storyteller and translator. It is only by translating the data that data-driven decisions can be made. For it is not the data itself that allows decision making, but the translation of it.

Although the outlook for the underdog Robb Stark is poor, there is one more character who is arguably even more of an underdog. Across the sea is Daenerys Targaryen, whose quest to reclaim the Iron Throne has led her to learn the languages of High Valyrian, Common, and Dothraki, not to mention being able to communicate with three dragons… Imagine what would happen if she added the language of data to the list.
