Monday, December 28, 2020

Simulation of scheduled football matches

In this post I try to answer a few readers' questions about the simulation system I use. At the same time I describe my research into possible improvements to the probability distribution functions used in that system.

Why simulate?

The purpose of simulating football matches is to provide, for each team participating in a set of scheduled matches, the probability of reaching certain feats. A feat can be anything from winning a single match to winning a group in a tournament or winning the whole tournament. Also, because each scheduled match gets a result assigned to it during the simulation, dependent quantities can be computed, such as the resulting FIFA points (and thus the resulting ranking) of all participating teams after the scheduled matches are 'played'.


How does a simulation work?

A simulation consists of scheduled matches where both teams in a match are known or where one or both teams can be derived from earlier simulated match results. A quarter final where the number 1 of group A plays against the number 2 of group B can be perfectly simulated as long as the final standings for groups A and B, resulting from earlier simulated group matches, can be computed first. Furthermore, both elo and FIFA update the ratings of a team directly after a match is played, and the updated rating is the input for the next match. This updating behaviour needs to be implemented in the simulation model to be able to calculate the correct ratings after the scheduled matches are 'played'.


Now, of course you can assign one match result to each scheduled match once and determine the things you are interested in, but that will not give you the probabilities you are looking for. For that you need Monte Carlo simulation: a large number of random samples from a probability distribution for your main random variable, which in this problem is the match result of each scheduled match. The larger that number of samples (N), the more reliable the resulting probabilities will be.


So, technically, the simulation process is as follows:

In one simulation the results of all scheduled matches (included in the simulation) are determined, and everything you are interested in is registered after that single simulation. That can for instance be the final ranking of teams in a group, the composition of knock-out matches in a tournament, or the resulting position of all teams in the FIFA ranking. You then perform that single simulation 10,000 times, varying the match result of each scheduled match in each simulation, and after each one you register those same things. After the 10,000 simulations you simply count in the resulting registration, say, the number of times a team finishes first in a group, wins the tournament, or ends at spot 4 in the FIFA ranking. Dividing that count by 10,000 gives the probability of that event.
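The counting procedure described above can be sketched in a few lines of Python. This is only an illustration, not the actual simulation system: `estimate_probabilities` and `toy_group_winner` are hypothetical names, and the toy model simply picks a group winner from made-up weights instead of simulating real matches.

```python
import random
from collections import Counter


def estimate_probabilities(simulate_once, n_runs=10_000, seed=None):
    """Run `simulate_once` n_runs times and turn outcome counts into probabilities."""
    rng = random.Random(seed)
    counts = Counter(simulate_once(rng) for _ in range(n_runs))
    return {outcome: count / n_runs for outcome, count in counts.items()}


def toy_group_winner(rng):
    """Hypothetical stand-in for one full simulation: returns the group winner.

    The weights are invented for illustration; a real run would simulate
    every scheduled match and compute the group standing from the results.
    """
    teams = ["A", "B", "C"]
    weights = [0.5, 0.3, 0.2]
    return rng.choices(teams, weights=weights, k=1)[0]


probs = estimate_probabilities(toy_group_winner, n_runs=10_000, seed=1)
for team, p in sorted(probs.items()):
    print(f"{team}: {p:.3f}")
```

Any event of interest (group winner, tournament winner, ranking position) can be returned by the single-run function; the counting and dividing stays the same.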


Predicting the match result

To be able to vary the match result in a meaningful way in each simulation we are looking for a good descriptive probability distribution for the match result, given both opponents in the scheduled match and, more specifically, their relative strengths. A probability distribution is the mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment. It describes the probability of something happening as a function of an independent variable. The independent variable must be computable with a non-random formula.


For this problem we could do with one computable, independent measure that expresses the win expectancy of a team in a match against an opponent, based on the difference in strength between both teams and possibly some other factors that contribute meaningfully to the match result. Think for instance of the current 'form' of a team, the line-up, the system of play and the qualities of the players the coach selected to execute his/her ideas. But these factors are all difficult to quantify, because you would need to gather and maintain a lot of detailed extra information about each match in your dataset. We need simple but meaningful factors or no other factors.


Elo win expectancy

Elo (http://www.eloratings.net) has defined such a win expectancy and it can be easily computed, because it depends only on the difference in strength between both teams and one other factor which has a proven meaningful contribution to the match result and is very easy to determine: the Home Field Advantage (HFA). When a team plays the match on home ground its win expectancy rises significantly, by an amount that depends on the difference in strength between both teams. When a home team is much weaker than the visiting team the HFA will contribute very little to the win expectancy of the home team (less than 2 percentage points). However, when both teams are about equal in strength the HFA will contribute up to 14 percentage points to the win expectancy.


The win expectancy (We) according to elo is calculated as follows:

We = 1 / (10^(-dr/400) + 1)

where dr equals the difference in elo ratings of both teams in the match plus 100 points if the match is not played at a neutral venue. If a match is played at a neutral venue the 100 HFA-points are ignored.
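As a small sketch of this calculation, assuming the commonly published elo form We = 1 / (10^(-dr/400) + 1): the function name `win_expectancy` and its parameters are my own choice for illustration, not part of any library.

```python
def win_expectancy(rating_home, rating_away, neutral=False, hfa=100.0):
    """Elo win expectancy for the home team.

    dr is the rating difference, plus the home field advantage (assumed
    100 points, as in the text) unless the match is on neutral ground.
    """
    dr = rating_home - rating_away + (0.0 if neutral else hfa)
    return 1.0 / (10.0 ** (-dr / 400.0) + 1.0)


# Two equally rated teams: 0.5 on neutral ground, about 0.64 at home,
# which is the contribution of roughly 14 percentage points mentioned above.
print(win_expectancy(1800, 1800, neutral=True))
print(round(win_expectancy(1800, 1800), 3))
```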


But the elo win expectancy is still not the probability distribution from which we can sample, it is the independent variable. Besides, to be able to properly calculate the final group standing from some scheduled group matches we need information about the exact number of goals scored by each team in each match, because goal difference and scored goals over all group matches are also important tie breakers (when match points are equal) to determine the final group standing. So we need a probability distribution that describes the number of goals scored by a team as dependent on the elo win expectancy in a match.


ClubElo

Lars Schiefler from ClubElo (http://clubelo.com) estimated such a probability distribution function, based on a very large dataset of football match results. The only small drawback was that the dataset contained only club matches, played in both domestic club competitions and international club competitions like the European Cups.

Until now I nonetheless used his function in my simulations because it was available and tested and others used this function as well, so I was able to compare my simulation results with them. But over the years I came to the conclusion that the characteristics of NT-football might be a little bit different from club football. To research this I decided to estimate my own probability distribution function based on a very large dataset of NT football match results.


Probability to score goals

Some theory first. The probability that a team scores a certain number of goals in a match is modelled by the Poisson distribution. This distribution has one parameter, which is equal to the expected value of the distribution. The assumption is that the expected number of goals depends on the elo win expectancy.


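The Poisson distribution:

P(k) = G^k * e^(-G) / k!

where:

P(k) is the probability that the team scores exactly k goals, G is the average expected number of goals and e is Euler's number.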

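For example:

in case the average expected number of goals is 0.5:

P(0 goals) = 0.607
P(1 goal) = 0.303
P(2 goals) = 0.076
etc.

in case the average expected number of goals is 1.9:

P(0 goals) = 0.150
P(1 goal) = 0.284
P(2 goals) = 0.270
etc.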
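Such Poisson probabilities are easy to compute yourself. A minimal sketch using only the Python standard library; `poisson_pmf` is a name chosen here for illustration:

```python
import math


def poisson_pmf(k, g):
    """Probability of exactly k goals when the expected number of goals is g."""
    return g ** k * math.exp(-g) / math.factorial(k)


# Print the first few probabilities for the two example averages.
for g in (0.5, 1.9):
    row = ", ".join(f"P({k})={poisson_pmf(k, g):.3f}" for k in range(4))
    print(f"G={g}: {row}")
```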


The average expected number of goals G is what we are trying to determine. So we need to determine G as a function of the independent variable 'elo win expectancy': G = f(We).

This function is estimated from almost 40,000 NT-matches in my match database using the least squares method, which minimizes the sum of squared vertical distances between function value and data point over all elo win expectancies. I apply polynomial regression, so the function is modelled as a 4th-degree polynomial in the independent variable We.
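To illustrate what such a fit looks like in practice, here is a sketch using NumPy's `polyfit`. It is not the actual estimation: the data points below are synthetic, generated from an invented curve plus noise, purely to show the mechanics of a 4th-degree least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: win expectancy bins and "observed" average goals.
# The underlying curve is made up and has nothing to do with the real dataset.
we = np.linspace(0.05, 0.95, 19)
avg_goals = 0.3 + 2.5 * we ** 2 + rng.normal(0.0, 0.05, size=we.size)

# Least-squares fit of a 4th-degree polynomial G = f(We).
coeffs = np.polyfit(we, avg_goals, deg=4)  # highest power first
expected_goals = np.poly1d(coeffs)

print(coeffs)
print(expected_goals(0.5))
```

`np.poly1d` wraps the five coefficients in a callable, so the fitted function can be evaluated for any win expectancy.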


Data research and results

In the graphs below you see the data points I derived from my database. One graph for matches on non-neutral or home ground; one graph for matches on neutral ground. 

On the X-axis of each graph the elo win expectancy is plotted ascending from 0 to 1, on the Y-axis the average number of goals scored is plotted. In the 'home ground' graph you see two lines: red is for the average number of goals scored by the home team in a match with the given win expectancy, the blue line depicts the average number of goals scored by the away team in a match with the given win expectancy (for the home team). In the 'neutral ground' graph only one line is plotted, there is no home advantage in these matches.



The estimated functions for non-neutral or home ground matches are:

for the average expected number of goals for the home team (Gh):

for the average expected number of goals for the away team (Ga):


The estimated functions for matches on neutral ground are:

for the average expected number of goals (G):

The win expectancy We for the 'away' team in a match on neutral ground is equal to 1 minus the win expectancy for the 'home' team in the match.
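Once the expected number of goals for each team is known, one simulated match result can be drawn from the two Poisson distributions. The sketch below assumes both goal counts are drawn independently and uses Knuth's multiplication method for the Poisson draw; the function names and the expected-goal values in the usage lines are invented for illustration and are not the estimated functions.

```python
import math
import random


def sample_poisson(g, rng):
    """Draw one Poisson-distributed goal count with mean g (Knuth's method)."""
    threshold = math.exp(-g)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_match(g_home, g_away, rng):
    """Draw one score, assuming both teams' goal counts are independent."""
    return sample_poisson(g_home, rng), sample_poisson(g_away, rng)


rng = random.Random(42)
# Hypothetical expected goals for one match.
home_goals, away_goals = sample_match(1.6, 0.9, rng)
print(f"{home_goals}-{away_goals}")
```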


In the graph below these equations are compared with ClubElo's probability distributions, for non-neutral matches only as ClubElo has no separate functions for neutral matches.





Win, draw, lose probability

Based on these equations we can determine for a match with a certain We what the win probability is for the home team, what the draw probability is and what the lose probability is for the home team. For the win probability for instance: simply determine the probability for the exact match result where the home team scores more than the away team, so 1-0, 2-0, 2-1, 3-0, 3-1, 3-2, 4-0, 4-1, 4-2, 4-3 etc. Sum all these probabilities and you have your win probability for the home team for this match.
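The summation over exact results can be sketched as follows, assuming the goals of both teams are independent Poisson variables and cutting the sum off at a maximum number of goals per team (the cut-off and the function names are my own choices for this illustration):

```python
import math


def poisson_pmf(k, g):
    """Probability of exactly k goals when the expected number of goals is g."""
    return g ** k * math.exp(-g) / math.factorial(k)


def outcome_probabilities(g_home, g_away, max_goals=20):
    """Return (win, draw, lose) probabilities for the home team."""
    p_home = [poisson_pmf(k, g_home) for k in range(max_goals + 1)]
    p_away = [poisson_pmf(k, g_away) for k in range(max_goals + 1)]
    win = draw = lose = 0.0
    for h, ph in enumerate(p_home):
        for a, pa in enumerate(p_away):
            p = ph * pa  # probability of the exact result h-a
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                lose += p
    return win, draw, lose


# Hypothetical expected goals for home and away team.
win, draw, lose = outcome_probabilities(1.5, 1.0)
print(f"win {win:.3f}, draw {draw:.3f}, lose {lose:.3f}")
```

With a cut-off of 20 goals per team the neglected probability mass is negligible for realistic goal averages.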

In the graphs below the win, draw and lose probabilities are plotted for both the NT probability distributions and ClubElo's functions (for non-neutral matches). You see that for the NT-equations the probability to win is a little bit higher for all matches with We higher than approximately 0.4. For the NT-equations the probability to draw is a little bit lower for almost all We. The lose probability is a little bit lower for all matches with We higher than approximately 0.4.




We see that the win probability in non-neutral matches in NT is slightly higher than in ClubElo for the same win expectancy when both teams are of equal strength or the home team is stronger. This confirms the notion that home advantage in NT-football plays a slightly bigger role in determining the match result than in club football.

I intend to use the derived NT probability distribution functions from now on in my simulations.



About me:

Software engineer, happily unmarried and non-religious. You won't find me on Twitter or other so called social media. Dutchman, joined the blog in March 2018.

8 comments:

  1. Ed- this is probably a dumb question, but for the non-neutral matches, the We for the away team refers to the away team's We, correct? (That is, 1 minus the home team's We)?

  2. Alex, no that's not correct. I should have emphasized this more in the paragraph Data research and results.
    To be absolutely clear: for non-neutral matches the We is always the win expectancy for the home playing team, the team that plays at a venue in their home country (and you understand that's not always the same as the first mentioned team in a match).
    Only for neutral matches you calculate for each team in a match their own win expectancy, of course not including the HFA factor in this case.

  3. hi. great work. thanks

  4. I've just now seen this - amazing work!

  5. Next stage would be to calculate your own ELO ratings, that maximize predictability given your dataset :)

  6. Question - how are 30-minute extra time and penalty shootouts considered?

  7. Assume you have a one-off play-off match in the simulation. The result of the match is drawn from the probability distribution functions: the number of goals scored by the home team and the number of goals scored by the away team is determined.
    If those numbers are the same then the end result of the match is a draw. So this draw is reached after extra time (if that's played in the match) or after regular time (if no extra time is played in the match). So 'extra time' does not really occur in the simulation.
    However, this match as a one-off play-off needs a winner. In that case a penalty shoot-out is also simulated. That's done rather simply and depends completely on the elo win expectancy for the home team in the match: if that's higher than or equal to 0.5 then the home team is the winner of the PSO, otherwise the away team is the PSO-winner.
    So then the, in terms of elo-rating combined with home advantage, strongest team always wins the PSO.

    One could argue that the result of a PSO should be drawn randomly from a uniform distribution, but analysis of all PSO-matches in my match database showed that in some 60% of the cases the team with the higher elo win expectancy won the PSO. That's why I decided to let the PSO result be dependent on the win expectancy, although I realize that now each simulated PSO always results in the 'elo strongest' team winning it.
