Python for Fantasy Football – Introduction

Welcome to the first article in the ‘Python for Fantasy Football’ series! Regular readers will be aware that I am a big advocate of using data to help better understand sports, and daily fantasy football lends itself particularly well to this type of analysis. Many of you are probably already familiar with spreadsheet software like Excel, and whilst that is very powerful it often lacks the flexibility and functionality that you can achieve by writing your own code. In this series I’ll be teaching you how to use the Python programming language to draw insights from data, make projections and use machine learning techniques to create your own xG models. Hopefully that sounds interesting already, but if you aren’t convinced that you should learn to code, here are just a few quotes that might help change your mind:

  • “Whether you want to uncover the secrets of the universe, or you just want to pursue a career in the 21st century, basic computer programming is an essential skill to learn.” Stephen Hawking, theoretical physicist.
  • “Learning to write programs stretches your mind, and helps you think better, creates a way of thinking about things that I think is helpful in all domains.” Bill Gates, co-founder of Microsoft.
  • “I quickly came to understand that code is a superpower.” Karlie Kloss, fashion model.
  • “Coding is very important when you think about the future, where everything is going. With more phones and tablets and computers being made, and more people having access to everything and information being shared, I think it’s very important to be able to learn the language of coding and programming.” Chris Bosh, 11-time NBA all-star.
  • “Whether we’re fighting climate change or going to space, everything is moved forward by computers, and we don’t have enough people who can code.” Richard Branson, founder of Virgin Group.
  • “I think that great programming is not all that dissimilar to great art. Once you start thinking in concepts of programming it makes you a better person – as does learning a foreign language, as does learning math, as does learning how to read.” Jack Dorsey, creator of Twitter.
  • “Learning to code gives you a completely new perspective when you look at a computer. Before, you think of it as an appliance, like a fridge, accepting what it can do. After, you know that you can code that computer to do anything you can imagine it doing.” Tim Berners-Lee, inventor of the World Wide Web.

Now that you want to learn to code (I hope!), you need to choose a language. The most useful discipline for sports analytics is data science, with the most popular languages being Python and R. I have used both and personally prefer Python, so I will be focusing on that throughout this series. You shouldn’t need any previous programming experience to follow along, but you will need to install Python before we go any further.

To do so, I recommend downloading Anaconda for Python 3, which already comes pre-loaded with the most popular data science libraries and tools you will need. You can find a step-by-step guide on how to install Anaconda on Windows here and Mac here, as well as in the Anaconda documentation itself here. Once installed, open up a Jupyter notebook either via the shortcut created during the Anaconda installation process, or by typing jupyter notebook into the command prompt. Jupyter notebooks provide a nice environment to allow you to write, test and edit code. You can either create a new notebook file to follow along with this article, or download the one I made earlier via Github.

I’m going to focus on examples here rather than explaining everything in loads of detail, so if you do want to go more in-depth I suggest heading to some of the links in the ‘Conclusion’ section. Keep in mind that a huge part of being a good programmer is just learning how to adapt existing code to your particular needs, so if you get stuck at any point just Google it! The vast majority of the time you will find out that someone has already asked the same question on

Your first Python notebook

For the first part of the series, we’re going to look at some expected goals data to see if we can identify which teams have been over or under-performing xG. The data we will be using is for the first 8 games of the 2018-2019 EPL season; for now we will be using existing xG numbers, but later in the series I will show you how you can create your own xG models. Getting data in the first place can be tricky at the best of times, so I have provided a csv file containing the data that can be downloaded from my Github account here. This exercise will teach you the basics of the pandas library for dataframes, as well as some basic plotting with matplotlib and seaborn.

Reading the data

First, we need to import the pandas library and load the data. By convention, the pandas library is imported as ‘pd’, so any pandas methods (e.g. ‘read_csv()’) can be called via the pd prefix. However, if you fancy doing the whole ‘Hello, World!’ thing before that, go ahead! To run a code cell in a Jupyter notebook, either press the triangular ‘play’ icon or press ‘shift + enter’ on your keyboard.

# Optional hello world
print('Hello, World!')
# Import the pandas library
import pandas as pd

# Read the data from a csv file and save it as a pandas dataframe named 'xg_data'
# Replace the file path with the location on your computer where the csv file is saved (in my case it's in D:/Tom/Downloads/)
xg_data = pd.read_csv('D:/Tom/Downloads/epl_xg.csv')

# Take a look at the data
xg_data

Right now the data looks pretty much how it would do in a spreadsheet, with the exception of the numbered ‘index’ column on the far left. With a small dataset like this we can see every row pretty easily, but with larger datasets you are likely to want to use the head and tail methods to get an idea of what the data looks like:

# Show the first 3 rows of the data
xg_data.head(3)

# Show the last 7 rows of the data
xg_data.tail(7)
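
If you don’t have the csv file to hand yet, here is a rough sketch of the same steps that you can run as-is. The team names and numbers are made up purely for illustration, and the data is read from a block of text rather than a file on disk:

```python
import io

import pandas as pd

# Made-up stand-in for the csv file (not real EPL data)
csv_text = """Team,G,GA,xG,xGA
Team A,10,5,9.1,6.2
Team B,7,7,8.4,6.9
Team C,4,12,6.0,10.3
Team D,12,3,11.5,4.8
"""

# read_csv accepts a file path or any file-like object, such as this text buffer
demo = pd.read_csv(io.StringIO(csv_text))

# shape gives (number of rows, number of columns)
print(demo.shape)

# head and tail return the first and last n rows respectively
print(demo.head(2))
print(demo.tail(1))
```

Swapping the text buffer for a file path gives you the same kind of dataframe as xg_data.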

Filtering and summarising

You can also filter and summarise the data using more specific queries:

# Show the data for Leicester (note that 'is equal to' is written as '==' instead of '=')
# The code below essentially reads as "show the xg_data dataframe where the 'Team' column is equal to 'Leicester'"
# Because Leicester is a string, you need to write it using either single or double quotes
xg_data[xg_data['Team'] == 'Leicester']

# Filter the rows where goals scored is greater than or equal to 15, and save the result in a new dataframe
# Note that in this case because 15 is an integer, we don't need to use quotes
# For more information about data types, see the official Python documentation
high_scorers = xg_data[xg_data['G'] >= 15]

# Print a list of the teams that have scored at least 15 goals
print(high_scorers['Team'].tolist())

['Arsenal', 'Bournemouth', 'Chelsea', 'Liverpool', 'Manchester City', 'Tottenham']

# Show some summary statistics for each column in the original dataframe
xg_data.describe()
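
Filters can also be combined. The sketch below uses a small made-up dataframe (the teams and numbers are placeholders, not the real data) to show two conditions joined with ‘&’, plus the tolist and describe methods:

```python
import pandas as pd

# Made-up numbers for illustration only
demo = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C', 'Team D'],
    'G': [16, 15, 9, 20],
    'GA': [4, 12, 10, 3],
})

# Combine conditions with & (and) or | (or); each condition needs its own brackets
strong = demo[(demo['G'] >= 15) & (demo['GA'] < 10)]

# Pull a single column out as a plain Python list
print(strong['Team'].tolist())

# describe() summarises every numeric column (count, mean, min, max and so on)
summary = demo.describe()
print(summary)
```

Here only Team A and Team D satisfy both conditions, so those are the two names printed.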

Adding columns

It’s likely that you’re going to want to add extra columns to your data containing additional information. Let’s add columns for goal difference, expected goal difference and non-penalty expected goal difference, and then sort the data by NPxGD:

# Add new columns for goal difference, expected goal difference and non-penalty expected goal difference
xg_data['GD'] = xg_data['G'] - xg_data['GA']
xg_data['xGD'] = xg_data['xG'] - xg_data['xGA']
xg_data['NPxGD'] = xg_data['NPxG'] - xg_data['NPxGA']

# Order the teams by NPxGD to help give an idea of who the good and bad teams are currently
xg_data = xg_data.sort_values(by=['NPxGD'], ascending=False)
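
The same pattern works for any calculation between columns. As a minimal sketch, assuming a hypothetical ‘Games’ column that is not in the csv file, here is how you might add a per-game figure and sort by it (all numbers are made up):

```python
import pandas as pd

# Made-up numbers; the 'Games' column is a hypothetical addition for this example
demo = pd.DataFrame({
    'Team': ['Team C', 'Team A', 'Team B'],
    'G': [4, 10, 7],
    'GA': [12, 5, 7],
    'Games': [8, 8, 8],
})

# Arithmetic between columns is applied row by row
demo['GD'] = demo['G'] - demo['GA']
demo['GD_per_game'] = demo['GD'] / demo['Games']

# Sort from best to worst and renumber the index so it matches the new order
demo = demo.sort_values(by='GD_per_game', ascending=False).reset_index(drop=True)
print(demo)
```

Resetting the index is optional; without it each row keeps the index number it had before sorting.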

Manchester City and Liverpool unsurprisingly lead the way so far, with Wolves and Bournemouth keeping pace with Chelsea and Spurs in the early going. The teams at the bottom of this dataframe have been poor, but there are a couple of teams in the middle that stand out for different reasons. Let’s take a look at who has been over or under-performing xG to make it easier to spot these teams.


Plotting

To get a much clearer picture of what the data is telling us, it’s a good idea to generate a plot or two. In this case we will create a horizontal bar plot (barh) using matplotlib to look at goal difference vs expected goal difference:

# Take a look at who has been overperforming or underperforming so far
xg_data['GD_vs_xGD'] = xg_data['GD'] - xg_data['xGD']
xg_data = xg_data.sort_values(by=['GD_vs_xGD'], ascending=False)

# Import the matplotlib library to use for plotting
from matplotlib import pyplot as plt

# Create a horizontal bar chart to help visualise the teams that have been overperforming or underperforming in terms of GD vs xGD
plt.barh(xg_data['Team'], xg_data['GD_vs_xGD'])

# Show the plot
plt.show()

The plot shows that Arsenal have been significantly over-performing in terms of xGD, whereas Cardiff have been under-performing. However, this isn’t exactly a very good looking plot to say the least… Whilst Arsenal and Cardiff clearly stand out, it’s hard to make comparisons between most of the other teams due to the ordering, and it’s also quite small. Fortunately there is a different plotting library, seaborn, which allows us to easily create much more aesthetically pleasing plots.

# Import seaborn to help create more visually appealing plots; see the seaborn documentation for more information
import seaborn as sns

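# Set the plot style and colour palette to use (remember dodgy spelling if you're from the UK!)
# The original settings are not shown here; 'whitegrid' and 'pastel' are assumed values
sns.set_style('whitegrid')
sns.set_color_codes('pastel')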

# Initialize the matplotlib figure (f) and axes (ax), and set width and height of the plot
f, ax = plt.subplots(figsize=(12, 10))

# Create the plot, choosing the variables for each axis, the data source and the colour (b = blue)
sns.barplot(x='GD_vs_xGD', y='Team', data=xg_data, color='b')

# Rename the axes, setting y axis label to be blank
ax.set(ylabel='', xlabel='Difference in GD vs xGD')

# Remove the borders from the plot
sns.despine(left=True, bottom=True)

Much better! xG isn’t perfect, but based on this graph it also looks like Chelsea and of course Burnley have been fortunate so far, whereas Southampton and Huddersfield have perhaps been a bit unlucky. Newcastle fans probably won’t be pleased to see that their goal difference is pretty much bang on with expected, although they have had a very tough schedule to begin the season. Feel free to continue playing around with the data on your own to see what other interesting bits of information you can find (e.g. are Arsenal running good in attack, defense, or both?).
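
One way to approach the attack vs defence question is to split the over-performance into two parts: goals scored minus xG, and xGA minus goals conceded. This is only a rough sketch with made-up numbers, and the new column names are my own invention, but the same three lines should work on xg_data:

```python
import pandas as pd

# Made-up numbers for illustration only
demo = pd.DataFrame({
    'Team': ['Team A', 'Team B'],
    'G': [12, 6],
    'xG': [9.0, 8.0],
    'GA': [5, 10],
    'xGA': [7.0, 8.5],
})

# Positive = scoring more goals than expected
demo['attack_diff'] = demo['G'] - demo['xG']

# Positive = conceding fewer goals than expected
demo['defence_diff'] = demo['xGA'] - demo['GA']

# The two parts add up to GD minus xGD
demo['total_diff'] = demo['attack_diff'] + demo['defence_diff']
print(demo[['Team', 'attack_diff', 'defence_diff', 'total_diff']])
```

In this toy example Team A is over-performing at both ends of the pitch, whereas Team B is under-performing at both.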


Conclusion

We barely scratched the surface in this article, but hopefully you are starting to get an idea of the Python syntax, as well as a sense of how powerful it can be. I wanted to keep this fairly brief, but if you have time I highly recommend taking a free introductory course to get even more familiar with the basics, for example the one by DataCamp here. In the next part, I’ll go through how to adjust stats for a specific matchup by creating functions in Python, which will give you a starting point to create your own projections. If you enjoyed this article, please share it on social media and be sure to look out for the next part!


17 thoughts on “Python for Fantasy Football – Introduction”

  1. This is brilliant stuff. Personally, Football and Python are literally what I’m into these days. Hope this sparks some interest for others to get going. Keep up the good stuff. Just one small piece of advice: You can make this more elaborate from here on. If someone has come this far, they probably would want to dive deep and see some decent stuff which even they can create.
    Cheers man!

  2. Hello, I am creating a fantasy football program for my final year project at university. I wanted some information regarding how you would go about picking the best players for your fantasy football team based on last year’s data and how you would start this project and what software would be used.
    Any help is appreciated

    Thank you

    1. Hi,

      There are a few ways to go about this. If you just want to know the ‘optimal’ team for last season, that should be pretty easy to work out. You need to just look up the stats for all players from last year and calculate fantasy points based on the site’s scoring system, then use salary and no. of players as constraints in an optimization algorithm. There are quite a few of these around; see for a Python one, or you can use Excel’s solver function if you’re more comfortable with that. If you want something more predictive for next season, I would look at stats like xG instead of goals and just repeat the same process. E.g. player A has 0.6 xG per 90, expect them to play 3000 mins next season, so you can project 20 goals for them on average. An advanced option would be to look at each fixture of the season individually and project performance for each match, which will give you a way to get an even better ‘optimal’ team by allowing you to incorporate player transfers. E.g. you are projecting player A to score 1.3 goals over the next 3 matches, but you could instead transfer him to a player that will score 1.6 goals on average. You would either have to project stats using the method in part 2, or if using last year’s data you can get historical bookmakers odds from GL!
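
      The arithmetic and the ‘optimal team’ idea above can be sketched roughly as follows. The player names, salaries and projected points are made up, and the brute-force search is just a simple stand-in for a proper optimization algorithm, so it is only practical for a handful of players:

```python
from itertools import combinations


def project_goals(xg_per_90, expected_minutes):
    """Scale a per-90 rate up to the minutes a player is expected to play."""
    return xg_per_90 * expected_minutes / 90


# 0.6 xG per 90 over 3000 minutes works out at 20 goals
print(round(project_goals(0.6, 3000), 1))

# Hypothetical players as (name, salary, projected points)
players = [
    ('Player A', 9.0, 20.0),
    ('Player B', 7.5, 14.0),
    ('Player C', 6.0, 12.5),
    ('Player D', 5.0, 9.0),
    ('Player E', 4.5, 8.5),
]


def best_squad(players, squad_size, budget):
    """Try every possible squad and keep the highest scoring one within budget."""
    best = None
    for squad in combinations(players, squad_size):
        cost = sum(salary for _, salary, _ in squad)
        points = sum(pts for _, _, pts in squad)
        if cost <= budget and (best is None or points > best[1]):
            best = (squad, points)
    return best


squad, points = best_squad(players, squad_size=3, budget=20.0)
names = [name for name, _, _ in squad]
print(names, points)
```

      With a real player pool the number of combinations gets far too large for this approach, which is where a solver comes in.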

  3. Great Article. Thank you for sharing it.
    When I was doing this code, the code wouldn’t show the graph, I had to add the line of code:

  4. Hi
    This is good stuff, perfect FPL distraction on international breaks 🙂
    Not sure if it matters at all, but the first table in the last section (Plotting) does not plot the same for me as it does in the article.
    Mostly because you actually sort it in the code but your table is not sorted. Might throw someone off that is very new to programming.
    On the last table, to add on the missing “”, you need to add another line of code before that:
    xg_data = xg_data.sort_values(by=['GD_vs_xGD'], ascending=True)
    This will sort the data with the overperforming teams showing up on top.
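
    For matplotlib’s barh specifically, bars are drawn from the bottom of the chart upwards in the order of the rows, which is why an ascending sort leaves the largest value at the top. A minimal sketch with made-up numbers (the ‘Agg’ backend is only used so that it runs without opening a window):

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs without a display

import pandas as pd
from matplotlib import pyplot as plt

# Made-up numbers for illustration only
demo = pd.DataFrame({
    'Team': ['Team A', 'Team B', 'Team C'],
    'GD_vs_xGD': [2.5, -1.0, 0.5],
})

# Ascending sort: the smallest value is the first row, the largest is the last
demo = demo.sort_values(by='GD_vs_xGD', ascending=True)

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(demo['Team'], demo['GD_vs_xGD'])

# The vertical position of each bar increases with the row order,
# so the last row (the largest value) ends up at the top of the chart
positions = [bar.get_y() for bar in bars]
print(demo['Team'].tolist())
print(positions)

plt.close(fig)
```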

  5. Just wanted to say thanks for this site. I’m a fantasy football player and a beginner at learning Python and this site is going to be really useful for both. I’ve just finished the adding columns section and really pleased to see the same results as you’ve laid out.

    There were just the 2 things that had me stumped. The first was when I was setting the location of the csv file on my computer I was getting all kinds of errors so I googled one of them and found that by placing the letter r next to the drive letter like so, I got it to work.
    xg_data = pd.read_csv(r'D:/Tom/Downloads/epl_xg.csv')
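
    For anyone else who hits this: the r goes before the opening quote and creates a ‘raw’ string, which tells Python to treat backslashes as ordinary characters. A minimal sketch, using a hypothetical Windows-style path:

```python
# Hypothetical path for illustration; replace it with wherever your file is saved
raw_path = r'C:\Users\tom\new_folder\epl_xg.csv'

# Without the r prefix, each backslash has to be doubled up instead
escaped_path = 'C:\\Users\\tom\\new_folder\\epl_xg.csv'

# Forward slashes avoid the problem altogether and also work on Windows
forward_path = 'C:/Users/tom/new_folder/epl_xg.csv'

# The raw and escaped versions are exactly the same string
print(raw_path == escaped_path)

# In a normal string '\n' is a single newline character; in a raw string it is two characters
print(len('\n'), len(r'\n'))
```

    Any of the three path styles can be passed to pd.read_csv.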

    The 2nd was when I typed out this line
    high_scorers = xg_data[xg_data['G'] &gt;= 15] and then ran the code, I kept getting syntax errors, and then I had the idea of changing the code to this
    high_scorers = xg_data[xg_data['G'] >= 15] and it worked.

    Looking forward to working my way through the rest of the page.

      1. It also gave me some experience in problem solving rather than just blindly copying and pasting code.

        Just one other thing I made a mistake when I was typing out the code for the 2nd problem. It should be &gt;= 15 for the first line of code not >= 15. I didn’t realise until I posted and I couldn’t edit it.

  6. Awesome stuff! I’m also getting into python to learn some new skills and thought it would be cool to learn off of sports data (to help keep me interested). I’m finding so much out there and have actually started having fun – total nerd moment 🙂 Keep it up mate, great work!

  7. high_scorers = xg_data[xg_data['G'] &gt;= 15]

    should be

    high_scorers = xg_data[xg_data['G'] >= 15]
