Predicting the English Premier League Standings
I’ve had this post up for a while on my blog and on LinkedIn but I have found the beauty that is Medium and decided to share the post here as well. Hope you enjoy it!
Before I begin this post, I would like to point out that I am the most disgruntled Arsenal fan you'd ever meet. Whatever subliminal messages or shade I may throw at your team, it's (mostly) not meant to hurt you. We Arsenal fans have to find joy in other places, seeing as there's a good chance we might not make the Top 4 this season. Take solace knowing that all I say, I say for the love of the game. Happy reading!
Football is a beautiful sport. The adrenaline rush we get from watching our team score in injury time or the embarrassment we feel when our team not only loses the match, but decides to concede 8 goals in the process (Yes, I am looking at you Arsenal) is part of what makes us addicted to the game. What makes football, especially in England, even more interesting is the uncertainty. You can get all the top players from all the top leagues in Europe or even sign a world class manager who almost won his country the World Cup and still not finish Top Four.
So first off, just how uncertain is the Premier League? The answer might be: not as much as you think. Aside from the occasional anomaly (like Leicester City winning the league, or Chelsea shocking us all by coming 10th a year after lifting the BPL trophy), the Premier League is fairly consistent. Take, for instance, Tottenham. They had a stellar campaign and were favorites to win the league last season, but somehow managed to suffer a 5–1 loss to Newcastle on the last day of the season, dropping them to third on the table and showing that, for them, consistency means being forever below Arsenal. Did I mention that, at this point, Newcastle was 18th on the table? Heh.
The question we need to answer here is: Can we predict the points/rankings for a club?
Well of course the answer is yes. You can predict pretty much anything. The accuracy, however, is another story. I took a swing at this and found some interesting stuff.
- The number of Shots a team makes per game is not as important as we think in winning games and gaining points. In fact, depending on the overall performance of the team year on year, it could actually hurt a team's chances of gaining more points.
- The best predictors of how well a team will do are mostly offensive statistics like Goals Scored, Shots per game, Penalties Scored, Open Play goals and Goal Difference (which is a balance between a team’s Defensive and Offensive ability). These stats can predict your standings with up to 70% accuracy.
- If these stats are used to predict that a team is going to be in the Top Four, on average that team has a 62.5% chance of being there.
- If they are used to predict that a team is going to be relegated at the end of the season, on average that team has an 83.5% chance of being in the relegation zone. And last but definitely not least:
- Tottenham *could* win the 2016/2017 English Premier League. *Gasps*
Now, before I am lynched, let me take you through how I arrived at these conclusions.
Without boring you with a lot of details, I’d tell you about the approach I chose. I decided to use a team’s performance and statistics of the previous year to predict their position on the table in the current year. There are many other (and probably better) approaches out there but I think this would be a fun one to explore.
The first step is to get the data. I got the data from WhoScored and Sky Sports. WhoScored is a JavaScript-rendered site (those really pretty sites) and, you know, the pretty ones always play hard to get (Get it? Hard to get? It's hard to get data from pretty/JS-rendered sites? Ah, never mind). I used the RSelenium package in R to deal with this, as it lets you get data from JS-rendered websites. The other major packages I used are rvest and XML, both for web scraping (getting data from the web). I then cleaned (and I mean CLEANED) the data, made sure all the columns were in the right format and removed unnecessary columns. I ended up with about 40 different variables like Offsides per game, Interceptions per game, Dribbles per game etc. The code used to get the data is here.
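The scraping and cleaning were done in R, but to give a flavour of the cleaning step, here's a small Python/pandas sketch with made-up column names and values: numeric stats often arrive as strings and need converting, and some scraped columns are just noise.

```python
import pandas as pd

# Hypothetical raw scrape output (the real pipeline used RSelenium/rvest in R):
# numeric stats arrive as text, and some columns aren't needed for modeling.
raw = pd.DataFrame({
    "Team": ["Arsenal", "Chelsea"],
    "ShotsPerGame": ["16.1", "14.8"],       # numeric stats scraped as strings
    "GoalDiff": ["+36", "+40"],             # signs need stripping
    "TeamUrl": ["/teams/13", "/teams/15"],  # unnecessary column
})

clean = (
    raw.drop(columns=["TeamUrl"])           # remove unneeded columns
       .assign(
           ShotsPerGame=lambda d: d["ShotsPerGame"].astype(float),
           GoalDiff=lambda d: d["GoalDiff"].str.replace("+", "", regex=False).astype(int),
       )
)
```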
Using the DataCombine package, I created a lead (opposite of lag) variable of the number of points based on the year and the team. This means that for every row of data, we had a new column that told us the points that team had in the next season.
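In pandas terms (the actual work was done in R with DataCombine), creating that lead variable looks roughly like this, with made-up numbers:

```python
import pandas as pd

# One row per team per season; the lead is computed within each team,
# ordered by season, so each row gets that team's points NEXT season.
df = pd.DataFrame({
    "Team":   ["Arsenal", "Arsenal", "Chelsea", "Chelsea"],
    "Season": [2014, 2015, 2014, 2015],
    "Points": [75, 71, 87, 50],
})

df = df.sort_values(["Team", "Season"])
df["leadPoints"] = df.groupby("Team")["Points"].shift(-1)  # next season's points
# The latest season has no "next season", so its leadPoints is missing (NaN).
```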
Now we are done with the hard part! The next (and my favorite) thing to do is to understand the variables in the data set. Questions like: how does the number of Red Cards a team gets per game relate to the number of points that team would get the next season? (Note that I said relate and not affect. Correlation is not causation.) What sort of relationship do these variables have? Is it a linear interaction (a straight-line relationship), a polynomial interaction (e.g. a quadratic relationship) or something else? Understanding the variables and answering these sorts of questions is done by exploratory data analysis (a fancy term for creating charts). I used Tableau for this part because I like a bit of interactivity in my plots (sorry, ggplot2 lovers. I'm lazy, I love Tableau). I put up a sample of the sort of stuff I did for data exploration so you can get a feel of what it looks like. You can view the visualization here.
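Exploration like this doesn't need Tableau, of course. The analysis itself was done in R, but here's a rough Python/pandas sketch (made-up numbers, and just a subset of the columns) of how you'd rank candidate variables by their correlation with leadPoints:

```python
import pandas as pd

# Toy season-level stats, invented for illustration: rank each candidate
# variable by its (Pearson) correlation with next season's points.
stats = pd.DataFrame({
    "GoalDiff":      [40, 25, 5, -10, -25],
    "YellowCards":   [1.8, 2.0, 1.9, 2.1, 1.7],
    "Interceptions": [16, 18, 17, 19, 15],
    "leadPoints":    [81, 70, 52, 40, 31],
})

corr = (
    stats.corr()["leadPoints"]
         .drop("leadPoints")               # don't report leadPoints vs itself
         .sort_values(ascending=False)     # strongest positive first
)
```

With these invented numbers, Goal Difference comes out on top, mirroring the finding below; Yellow Cards and Interceptions barely correlate at all.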
Just by looking at the visualization, we can see that some variables, like Number of Yellow Cards, have a weak relationship with leadPoints; some, like Goal Difference, have a strong relationship with leadPoints; and some, like Interceptions and Dribbles per game, have almost no relationship with the number of points the team would get in the next year (I can't stop thinking of Coquelin and Sanchez right now. Wonder why). Some pretty amazing things I discovered from visualizing my data:
- As a standalone stat, a team's Goal Difference at the end of a season has the highest correlation with how well that team will do in the next season. This was particularly interesting, as it outperformed other stats like the number of points the team got in the previous year, the number of goals scored, the number of wins and so on. This suggests that how balanced a team is may be the best determinant of how well it will do in the long run.
- At the beginning of this post, we talked about how Shots per game is not what we think it is. Let’s look at that.
If you take a look at the scatter plot above for Shots per Game vs. leadPoints, what do you notice? The number of shots a team makes per game is INVERSELY correlated with how well they do in the next year. This means that the more shots a team makes, the fewer points that team is likely to get in the BPL next year. I was so blown away by this that I had to dig further. I checked how Shots per game interacts with the number of points a team had at the end of the same year (e.g. Shots per game in 2009 vs. points in 2009) and it was STILL negatively correlated.
Again, before I get lynched, let’s take a closer look.
If you look at the chart more closely, you'll see that at a threshold of about 60 points there's a subtle change in the trend. From 60 points and above, the relationship is somewhat positive but mostly random, while below that threshold you can see a clear negative correlation.
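To make the threshold idea concrete, here's a small synthetic sketch (the numbers are invented, not from the actual data set) of how a relationship can be positive above the 60-point cutoff, negative below it, and still come out negative when everything is pooled together:

```python
import pandas as pd

# Synthetic illustration of the split-at-60 effect. Within the top group
# the trend is positive, within the bottom group it's negative, and the
# pooled correlation is dragged negative by the weaker, shot-happy teams.
df = pd.DataFrame({
    "ShotsPerGame": [13, 14, 15, 16, 16, 17, 18, 19],
    "leadPoints":   [70, 75, 80, 85, 55, 50, 45, 40],
})

top    = df[df["leadPoints"] >= 60]
bottom = df[df["leadPoints"] < 60]

pooled_r = df["ShotsPerGame"].corr(df["leadPoints"])
top_r    = top["ShotsPerGame"].corr(top["leadPoints"])
bottom_r = bottom["ShotsPerGame"].corr(bottom["leadPoints"])
```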
If you think about it, teams with fewer than 60 leadPoints would probably not be very good at converting their shots to goals. I checked this out and found that the highest conversion rate achieved by a team with fewer than 60 leadPoints was 17.6%: Chelsea in 2014. That means they had above 17% conversion in 2014 (when they were champions) and had only 50 points at the end of the 2015 season (remember that their position on the table was an anomaly). Aside from them, the average conversion rate for a team with fewer than 60 leadPoints was about 10%.
With this in mind, here's one fact: every shot a team takes that is neither a goal nor deflected by the other team invariably hands possession over to the opponent. This means that for the number of minutes (or seconds) the opponent is able to retain that possession, they are probably going to dominate the game. Now back to the gist on conversion rate. Here's what I think is happening: when a team is not very good at converting shots to goals, the more shots they take, the more likely they are to be handing possession back to the opponent and invariably giving their opponents some upper hand in the game. This is probably why we see some negative correlation below 60 points. We could even visualize it differently.
For the Consistent Top Four Teams (Manchester Utd, Man City, Arsenal and Chelsea) …
Looks positive, eh? (Remember that this is 7 years of data, which is why you see 28 points on the chart.)
Adding the consistent Up and Coming Teams (Tottenham, Liverpool)
Looks a bit random but not negative at least.
Now let’s add just ONE ‘under performing team’ like Sunderland…
WOAH! It becomes negatively correlated!
Now that we are done exploring our data and seeing how our different variables interact with leadPoints, I split the data I had into four parts: one each for building models, tuning parameters, testing the models and predicting for the end of the 2016/2017 season. I used the Boruta package, which iteratively tests the predictive power of each variable using randomForest and discards unnecessary variables. At the end of this iterative exercise, I ended up with 13 of the final 37 variables.

I tried several modeling techniques, from Linear Regression (including Ridge and Lasso Regression) to Support Vector Machines to Tree Modelling and Neural Networks. I found that good ol' Linear Regression (with some orthogonal polynomials and variable interactions) outperformed all the other modelling techniques. On a basic level, linear regression can be explained with x–y graphs where we have points and try to draw the line of best fit (this is called, in technical terms, Univariate Linear Regression). I then experimented with the 13 variables I had left to see which combination yielded the best accuracy on the test set. I found that the best predictors of how well a team will perform based on last year's data are: the team's Goal Difference, Goals Scored, Shots per game, number of Penalties Scored and the number of Open Play goals the team had in the previous season (don't worry, all ye statistics nerds out there, the variables were checked for collinearity).

Let's test how well the model does on data it has not seen before. The test data will be the 2013/2014 and 2014/2015 seasons. We will predict how well each team will perform based on the previous year's data and see how well the model does.
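The actual model was fit in R (with the orthogonal polynomials and interactions mentioned above); as a stripped-down sketch of the train-then-predict idea, here's an ordinary least squares fit in Python with invented numbers and just two of the five predictors:

```python
import numpy as np

# Invented training data: rows are teams, columns are LAST season's
# GoalDiff and GoalsScored; y is that team's points the FOLLOWING season.
X_train = np.array([
    [ 40.0, 80.0],
    [ 25.0, 70.0],
    [  5.0, 50.0],
    [-20.0, 35.0],
])
y_train = np.array([82.0, 70.0, 52.0, 35.0])

# Ordinary least squares with an intercept column (no polynomials or
# interaction terms here, unlike the final model in the post).
A = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(goal_diff, goals_scored):
    """Predict next season's points from last season's stats."""
    return coef[0] + coef[1] * goal_diff + coef[2] * goals_scored

# Predict next season's points for an unseen team
pred = predict(10.0, 55.0)
```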
For 2014:
And for 2015:
Fairly good predictions, if I do say so myself.
Disclaimer: As we all know, football is a dynamic sport. It's hard to say whether, some years from now, Goal Difference or Penalties Scored will still be good predictors of how well a team will do in the next year; the best predictors could become more defensive stats. Until this is tested over a long period of time, we cannot say that these variables will always predict points with this level of accuracy.
Some challenges I faced:
- The teams which were not consistently in the Premier League were a bit of a challenge to model. To tackle this, I updated my approach a teeny bit: instead of using last year's stats, I used stats from the last year the team played in the Premier League. This tweak was of tremendous help because for the consistent teams it still meant last year, while for previously relegated teams it made predicting their standings possible.
- That does not completely solve it, though. Some teams, like Watford, had not played in the Premier League at all within the time window of this data. What I did was scale down their last achievement and use that to predict. This means that if they were in Division One last year and got 70 points, I scale that number, along with their other stats, down so that none exceeds the lowest corresponding value achieved in the BPL, and then use that to predict. Fishy, I know, but a man's gotta do what a man's gotta do.
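For the curious, here's one way to read that scaling trick in code. This is a sketch of my interpretation, not the exact implementation: cap each of the promoted team's lower-division stats at the minimum value observed for that stat in the Premier League data.

```python
# One reading of the trick described above (a sketch, not the author's
# exact code): cap each lower-division stat at the minimum value ever
# observed for that stat in the Premier League data set.
def scale_promoted(lower_div_stats, pl_minimums):
    """Return a copy of the stats, none exceeding the PL minimum for that stat."""
    return {stat: min(value, pl_minimums[stat])
            for stat, value in lower_div_stats.items()}

# Watford-style example with made-up numbers: 70 points in the lower
# division, while the lowest PL totals in the data set are 25 and 27.
watford = scale_promoted(
    {"Points": 70, "GoalsScored": 85},
    {"Points": 25, "GoalsScored": 27},
)
```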
Based on this model, the rankings for the 2016/2017 season would be:
- Tottenham
- Manchester City
- Arsenal
- Leicester
- Southampton
- Liverpool
- Manchester United
- West Ham
- Chelsea
- Everton
- Watford
- Swansea
- Stoke
- Middlesbrough
- Crystal Palace
- West Bromwich Albion
- Hull
- Sunderland
- Bournemouth
- Burnley
Tottenham being right up there is why I find solace in the fact that there's about a 37% chance they won't end up in that position. This is the first time I've been excited to see a model fail.
Going forward, I see two interesting approaches to increase the accuracy of the model:
- I think it would be interesting to see how manager rankings at the time can improve the accuracy of the model.
- Another interesting approach would be; instead of using just last year’s data, we could use an average of all (or some of their data, like a moving average model) their past data to predict this year’s data.
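To make the second idea concrete, here's a small pandas sketch (made-up numbers) of a per-team moving average over past seasons:

```python
import pandas as pd

# Instead of only last season's stats, compute a moving average over each
# team's past seasons (here: up to the last 3) to feed into the model.
df = pd.DataFrame({
    "Team":     ["Arsenal"] * 4,
    "Season":   [2013, 2014, 2015, 2016],
    "GoalDiff": [31, 27, 36, 32],
})

df = df.sort_values(["Team", "Season"])
df["GoalDiff_ma3"] = (
    df.groupby("Team")["GoalDiff"]
      .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
```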
I'd probably try the second approach later on (I'm a bit too lazy for the first one), though it might be infeasible for teams that aren't in the league every year. There could be a way around that, however.
Hope you enjoyed this post! If you have any comments or suggestions, feel free to add a comment.
Thank you for reading!