Predicting the Brownlow results in the 2016 season
The aim is to predict for all AFL matches (games) in the 2016 season, which 3 players will get the 3, 2 and 1 vote(s) in the Brownlow Medal count. Here is the background information you need to know:
- Australian rules football is the country’s national sport. The Australian Football League, known as the AFL, is the sport’s professional league.
- This sport is like no other. It requires a range of abilities: elite skills, speed, strength, agility, courage and strategy just to name a few.
- Since 2012, the AFL has comprised 18 teams across Australia.
- Every team plays 22 matches in the regular season (home & away).
- On the field (pitch, oval, ground) 18 players from each team play at any time, from a total of 22 per match.
- The best, second best, and third best players in each match (from either team) are awarded 3, 2 and 1 votes respectively towards the most valuable player award at season's end.
- This count is known as the Brownlow Medal (or the Brownlow). The votes are deliberated upon and decided by the adjudicating umpires for each match. They remain unknown until the count.
I’ve been a football enthusiast for as long as I can remember. As mentioned on my about page, I’d considered pursuing a career as a professional player. This project might not be the most pressing challenge in the world to solve, but what it will hopefully do is show what is possible with data science. My inspiration, Nate Silver, did exactly this when he predicted 49 out of 50 states in the 2008 US presidential election. Another motivation is that it had a deadline (Brownlow Medal night), which helped me prioritise it. When I decided to do it, it was less than a month before the 2016 Brownlow Medal results came out. At that point, I was travelling in Eastern Europe and didn’t even have any data. I’d have my work cut out for me.
I correctly predicted the winner, and 8 of the top 12 players. See below the actual results vs my predictions for the top players of the 2016 vote tally:
For a first go, I’m fairly happy with these results. This is also without using off-the-shelf machine learning algorithms (which I hope to do next time). There were 594 individual predictions made: 22 full rounds, 9 matches per round, 3 players receiving votes per match. See the full list of predictions here.
As is typically the case, the first step of a data science project is to find some data. The data needed was the player and match data from previous years’ matches, where Brownlow votes had been given. Looking across the web, there were a few sites with AFL data. While some of this data was more easily attainable than others, no one was offering csv or excel files with all of it. The quality of the data also varied, and some sources did not include all of the most valuable player statistics, nor the Brownlow votes. After scraping and experimenting with the available sources and settling on one, I undertook the following stages using Python’s pandas data science framework (and related tools):
- Scraping all data from various webpages
- Cleaning and merging the data
- Data visualisation to find features
- Calculating new statistics
- Creating predictions
Scraping data from various webpages
The scraping and cleaning of the data was a large part of this project. So much so that I placed this code in a separate iPython notebook. I decided to opt for the last 4 years’ worth of previous data (2012-2015). The reasons for this were:
- Intuitively, 4 years seemed like a good-sized sample (2376 rows, as explained below). It is also the most recent data, which matters most for these predictions.
- If I went back much further, several rule changes would come into effect (e.g. the number of players allowed on the bench). These rule changes would reduce the effectiveness of the model (discussed later).
- 4 years of historical data covers all of the matches since the league has had 18 teams. Going back further, I’d need to work years with fewer teams into the model.
I used the website footywire for the data, as it contained all of the match and player info. Each match was on a different URL, which also had player info. I analysed the URL structure and found that each match actually had two URLs: one ending with the format ‘mid=9306’ and one ending in the format ‘mid=9306&advv=Y’. I would need both, as the former gave the basic player stats and the latter the advanced player stats. I discovered the numbers in the URL endings corresponded to consecutive matches in that season; the site’s match ids. There were also some difficulties finding the different years (seasons), as they would not always follow incremental match ids from one year to the next. Similarly, there were difficulties dealing with cancelled matches and removing finals series, which shared the same match id number space.
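The two-URL pattern described above can be sketched with a small helper. The base URL here is an assumption for illustration; only the ‘mid=…’ endings come from my notes above.

```python
# Hypothetical sketch of the two-URL structure per match. The base URL is an
# assumption for illustration; the 'mid=...' endings match the pattern above.
def match_urls(match_id):
    """Return the (basic stats, advanced stats) URL pair for one match id."""
    base = "https://www.footywire.com/afl/footy/ft_match_statistics"
    return (f"{base}?mid={match_id}", f"{base}?mid={match_id}&advv=Y")

# Consecutive match ids within a season correspond to consecutive matches
urls = [match_urls(mid) for mid in range(9306, 9309)]
```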
Parsing data from HTML
With four years of historical data, my training set would be 2376 rows deep: 22 rounds per season, 9 matches per round, 3 vote receivers per match, over 4 seasons. With 198 matches per season, and the need to also scrape 2016 (whose data would be the prediction set), there were 5 seasons of data to collect. Each match also has 2 URLs (as explained above), so in total 1980 pages would need to be scraped. The HTML of each web page was extracted with the Python library Beautiful Soup. This required a bit of experimentation to choose the best HTML reader. It was also challenging to figure out how to parse the text by creating my own rules for the structure of the HTML tags, and similarly for the text between or within the tags. There were various times where it looked like the text was parsed correctly, but another page did not fall under the same rule. This was due to the sheer size of the data set to be scraped. This challenge of understanding a third party’s data formatting nuances increases the larger the data set.
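As a minimal sketch of the parsing step, here is how a stats table might be pulled out with Beautiful Soup. The HTML snippet and column names are invented for illustration, not footywire’s actual markup.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for one of the scraped stats tables
html = """
<table>
  <tr><th>Player</th><th>K</th><th>M</th></tr>
  <tr><td>A. Smith</td><td>25</td><td>8</td></tr>
  <tr><td>B. Jones</td><td>18</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"player": cells[0], "kicks": int(cells[1]), "marks": int(cells[2])})
```

The real pages needed many more rules than this, since one page’s structure did not always generalise to the next.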
Cleaning and merging the data
There were several anomalies in the data once I had parsed the HTML into tables. I had to find all of these through experimentation and create rules to fix them. To find some of these anomalies, I used Python’s difflib module, e.g. to find close matches for player names. Some examples of difficulties were:
- Team names were not consistent in various places across the page, e.g. ‘Adelaide’ instead of ‘Adelaide Crows’, or ‘Greater Western Sydney Giants’ instead of ‘GWS’
- Player names were genuinely duplicated. There is a Scott Thompson at both Adelaide and North Melbourne. To differentiate these players, sometimes the team name was added for context, sometimes one player was called Scott Thompson 1, and sometimes a middle initial was given. Similarly, there is a Josh Kennedy at both West Coast and Sydney.
- There were dozens of other issues with player names that needed to be corrected. Players switch teams year on year, and some players change their names.
- I had to verify that each match had 3 vote receivers since player names were inconsistent (as above).
- Removing extraneous data from tables
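The name-repair step can be sketched with difflib’s get_close_matches. The canonical name list and similarity cutoff here are assumptions for illustration; the real cleaning needed many hand-written rules on top of this.

```python
import difflib

# Hypothetical canonical names; scraped variants get snapped to the closest one
canonical = ["Adelaide Crows", "GWS Giants", "North Melbourne"]

def fix_team_name(scraped, cutoff=0.6):
    """Return the closest canonical name, or the input if nothing is close."""
    matches = difflib.get_close_matches(scraped, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else scraped

fix_team_name("Adelaide")  # snaps to "Adelaide Crows"
```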
After all of the cleaning, I saved the player and match tables to excel files for use in the later prediction model, with the files titled by match id. Next, I had to join all player data with match data, noting that each match had 2 tables: basic stats and advanced stats. Each team has roughly 30-40 players per season, and there were 594 vote receivers per season (3 per match) from roughly 600 players. Joining all this information correctly into one table was important, as I wanted to visualise which player statistics would be most useful for the predictions. The player statistics would be the prediction model’s features (variables), and this table would be the input for the prediction model.
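The join of the basic and advanced stats tables can be sketched as a pandas merge on match id and player name. The column names and values here are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for one match's basic and advanced stats tables
basic = pd.DataFrame({"match_id": [9306, 9306],
                      "player": ["A. Smith", "B. Jones"],
                      "AF": [112, 98]})
advanced = pd.DataFrame({"match_id": [9306, 9306],
                         "player": ["A. Smith", "B. Jones"],
                         "SC": [121, 104]})

# One row per player per match, with both stat sets side by side
players = basic.merge(advanced, on=["match_id", "player"], how="inner")
```

An inner join like this also surfaces name mismatches quickly: any player whose name differs between the two tables simply drops out of the result.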
Data visualisation to find features
I began the prediction modelling phase with some data visualisations, so I could get a feel for which player statistics (features/variables) most strongly correlated with receiving votes. The histograms below show what the data looked like for each player statistic for the 3-vote receivers. The y-axis is the frequency of 3-vote receivers, and the x-axis is the percentile for that statistic. So if a player scored 130 for the AF statistic and that was the highest in the match, this would contribute to the 100 bar on the x-axis. In another match, the highest AF might only be 105, but it would produce the same result. Similar results were seen for the 2- and 1-vote receivers.
As you will see, the strongest correlations to 3-vote receivers were AF (top row, second from left) and SC (fifth row, middle). You can see for both of these that as the score increases, so does the frequency of 3-vote receivers, and it rises sharply at the maximum end. The vast majority, 80-90%, of the 3-vote receivers are in the top 10% for these statistics. According to the legend on the website, AF corresponds to AFL Fantasy score, and SC corresponds to Super Coach score. Both of these are scores produced by amalgamating the key match statistics above (e.g. kicks, marks, tackles, handballs) using separate, uniquely weighted formulas. They also incorporate, numerically, a player’s effectiveness with the ball and impact on the match. Given how rigorous these statistics already were, and how strongly they correlated to maximum votes, I used these as the only 2 features when making predictions. I had thought about using more features, but realised that time would not allow it in this first model.
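The within-match percentile used on the histograms’ x-axis can be sketched with a grouped rank, so that the top score in any match maps to 100 regardless of its raw value. Column names here are assumptions.

```python
import pandas as pd

# Two toy matches: the top AF score differs (130 vs 105) but both map to 100
df = pd.DataFrame({"match_id": [1, 1, 1, 2, 2, 2],
                   "AF": [130, 105, 90, 105, 100, 80]})

# Percentile of each player's AF score within their own match
df["AF_pct"] = df.groupby("match_id")["AF"].rank(pct=True) * 100
```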
Creating new statistics
How I would use these key statistics, AF and SC, in the prediction model would be the next challenge. What I realised early on was that I would need to standardise the statistics across matches. As touched upon above, it’s not how high a player scored numerically, but whether they were at the top, or near the top, of the pack. For example, if in one match the top AF scores were 85, 90 and 100, they don’t look nearly as good as 100, 120 and 125 in another match. So it’s the rank of your statistic within the match that matters most.
I also took some inspiration from an article I read about the best sportspeople of all time. It mentioned that the very best players are ahead of the next best by a considerable buffer in their key statistic. For example, in test cricket, Donald Bradman has a batting average of 99.94; the next best is Graeme Pollock with a considerably lower 60.97! Bradman is hence better than the next best by a buffer of 39% of his own score. I extended this principle to the AF and SC statistics within each match, to get each player’s rank and then their buffer. The buffers were then put into bins (margins) of <=5%, <=10%, <=15%, or whatever size was needed; I made the bin size dynamically assignable to the table at run time. If the top AF scores were 100, 90 and 80, then the top player would have rank 1, buffer 10%, and bin <=10%. This rank and buffer algorithm was applied to all players in every match.
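A sketch of the rank-and-buffer calculation, reproducing the worked example above (top AF scores of 100, 90 and 80). The column names and exact bin edges are assumptions; the buffer is each player’s gap to the next-ranked player as a share of their own score, mirroring the Bradman example.

```python
import pandas as pd

# One toy match with the AF scores from the worked example above
df = pd.DataFrame({"match_id": [1, 1, 1], "AF": [100, 90, 80]})
df = df.sort_values(["match_id", "AF"], ascending=[True, False])

# Rank within the match, 1 = best
df["AF_rank"] = df.groupby("match_id").cumcount() + 1

# Buffer: gap to the next-ranked player, as a percentage of own score
next_score = df.groupby("match_id")["AF"].shift(-1)
df["AF_buffer"] = (df["AF"] - next_score) / df["AF"] * 100

# Bin the buffers; the bin edges are assignable at run time
df["AF_bin"] = pd.cut(df["AF_buffer"], bins=[0, 5, 10, 15, 100],
                      labels=["<=5%", "<=10%", "<=15%", ">15%"])
```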
The machine learning algorithm I had envisaged working well for this problem was the random forest. This is essentially the simplicity of the decision tree, but with greater accuracy (for free) using current state-of-the-art frameworks like scikit-learn. After a bit of experimentation with this algorithm, I had to abandon it due to time constraints and opt for a simpler solution. I shall return to it in the future, however, where I will also use the ‘buffer’ feature and add more features (discussed later). Instead, I looked at the ranks of the AF and SC scores for each player. I figured that if a player is ranked 1 for AF and 1 for SC, there is a high likelihood of them getting 3 votes. But what about (AF, SC) combinations of (1, 2) or (1, 3), etc.? So I then looked at every single combination and worked out how many 3 votes, 2 votes, and 1 votes were awarded for each combination. Once I knew how many (AF, SC) combinations corresponded to 3, 2 or 1 votes, I could work out the likelihood that a given combination would be 3 votes, or similarly 2 or 1. For example, (AF, SC) of (1, 1) is most likely a 3, but (1, 3) might most likely be 2 votes.
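The combination-counting step can be sketched as a groupby over historical (AF rank, SC rank) pairs. The toy training rows here are invented for illustration; the real table held four seasons of vote receivers.

```python
import pandas as pd

# Invented historical rows: each is one vote receiver's rank pair and votes
train = pd.DataFrame({"AF_rank": [1, 1, 1, 2, 1],
                      "SC_rank": [1, 1, 3, 1, 1],
                      "votes":   [3, 3, 2, 2, 3]})

# For each (AF rank, SC rank) pair, the vote value it most often received
most_likely = train.groupby(["AF_rank", "SC_rank"])["votes"].agg(
    lambda v: v.value_counts().idxmax())
```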
Not every match has every combination, however. For example, some might have (1, 2) and others (1, 1), etc. So to find the vote receivers, I started by sorting the 2016 data for each match by SC rank, then by AF rank. Then, starting from the top of the list and working down, I took the first (AF, SC) combination for which 3 votes was the most likely outcome. After that, I repeated the process for 2 votes, and then 1 vote. This created the final table of predictions for each match. Unfortunately there was no time for cross-validation to test the accuracy of the model, but the list looked reasonable when compared with other prediction sites. Time was running out on the last day (it was 4am on a ‘school night’), so I pushed the code to GitHub, posted on Facebook and took a well-earned rest. As mentioned, I was happy with the results, but since comparatively little time was spent on the prediction model, there is a lot of room for improvement.
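The greedy per-match assignment can be sketched as follows. The combo_votes lookup and rank columns here are assumptions, standing in for the combination table built from the historical data.

```python
import pandas as pd

# Hypothetical lookup from (AF rank, SC rank) to the most likely vote value
combo_votes = {(1, 1): 3, (1, 2): 3, (2, 1): 2, (2, 2): 2, (3, 2): 1, (3, 3): 1}

def predict_votes(match_df):
    """Greedily assign 3, 2 and 1 votes within one match."""
    ordered = match_df.sort_values(["SC_rank", "AF_rank"])
    predictions, taken = {}, set()
    for target in (3, 2, 1):          # fill the 3-vote slot first, then 2, then 1
        for _, row in ordered.iterrows():
            key = (row["AF_rank"], row["SC_rank"])
            if row["player"] not in taken and combo_votes.get(key) == target:
                predictions[row["player"]] = target
                taken.add(row["player"])
                break
    return predictions

match = pd.DataFrame({"player": ["A", "B", "C"],
                      "AF_rank": [1, 2, 3],
                      "SC_rank": [1, 2, 2]})
predict_votes(match)  # player A gets 3 votes, B gets 2, C gets 1
```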
Lessons from prediction results
There were some howlers in the results table that I noticed. Based on my understanding of the AFL and the Brownlow Medal, I can quickly see why. Here are the examples:
- Max Gawn – His position on the field is ruckman. Ruckmen are much less likely to poll highly in votes, as the umpires’ attention is dominated by the midfielders. Also, his team did not win many matches compared with the other top players’ teams. A player is less likely to poll highly if their team did not win the match (especially if they lost by a lot).
- Scott Pendlebury – He is a midfielder, but his team performed even more poorly than Melbourne. Pendlebury is also known as a ‘player’s player’, as he never gives up. I imagine he has spent a lot of time playing well in matches that his team, Collingwood, lost.
- Heath Shaw – He is a backman. I can’t recall a backman ever winning a Brownlow Medal, or even coming close.
New features to add
In the future, I will hopefully rectify these howlers by adding more features, so that the set of features would be:
- AF rank
- AF bin of the buffer
- SC rank
- SC bin of the buffer
- player position
- winner bin of buffer
After this, I will spend more time trying to get a random forest model working. I will also get data for more years, and perhaps add features to accommodate rule changes and how they would affect the votes. One example is the sub rule, which meant there was a bench of only 3 players for most of the match, instead of the 4 there is now. This meant that players needed to run more to share the workload, so players with stamina would shine. Some more features would be:
- After sub rule introduced
- After sub rule removed
- After rule 3 introduced
- After rule 4 introduced
The next features to try will be the umpires themselves. Many of them have long careers, and some are probably biased towards certain players or certain types of players. There are not a huge number of umpires within the sport, so this should be doable in theory. Getting the data on who umpired which match will be tricky, however; I’ll need to start from scratch: finding, extracting and cleaning the data.
- Umpire 1 for match
- Umpire 2 for match
- Umpire 3 for match
Scoring system and cross validation
Another way to improve this project in the future will be cross-validation, so that I can be comfortable that my model is likely to work well on unseen data. Before doing this, however, I need a way of measuring how well my model has performed in terms of accuracy. One method I have thought of implementing is the following. For each match:
- Assign a score of +3 if I correctly predicted the 3 vote receiver, similarly 2 & 1 for predicting 2 & 1 votes respectively.
- For those players who did receive votes, but a different amount to my prediction, assign a negative score equal to the difference. e.g. if I predicted 3 and they received 2, assign a score of -1.
- For those players who did receive votes but weren’t in my predictions for that match, assign a negative score of that number of votes.
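The proposed scoring rules above can be sketched as a small function; the player names are hypothetical. A perfect match scores +6 (3 + 2 + 1).

```python
def match_score(predicted, actual):
    """predicted/actual: dicts of player -> votes (3, 2 or 1) for one match."""
    score = 0
    for player, votes in actual.items():
        if player in predicted:
            if predicted[player] == votes:
                score += votes                           # exact hit: +3, +2 or +1
            else:
                score -= abs(predicted[player] - votes)  # right player, wrong amount
        else:
            score -= votes                               # vote receiver missed entirely
    return score

# A exact (+3), B predicted 2 but got 1 (-1), D missed entirely (-2): total 0
match_score({"A": 3, "B": 2, "C": 1}, {"A": 3, "B": 1, "D": 2})
```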
Once I’ve implemented this scoring system, it will also be used to tune the model. I can look at the matches that scored poorly and see why that might be the case, similar to how I described the howlers in my top list above, but this time looking at the outliers at a lower, per-match level.