Python – Learn Pandas with SQL Examples – Football Analytics Example

When working with data, you will often move between SQL databases and Pandas DataFrames. SQL is excellent for storing and retrieving data, while Pandas is ideal for analysis inside Python.

In this article, we show how both can be used together, using a football (soccer) mini-league dataset. We build a small SQLite database in memory, read the data into Pandas, and then solve real analytics questions.

There are neither pythons or pandas in Bulgaria. Just software.
  • Setup – SQLite and Pandas

We start by importing the libraries and creating three tables – [teams, players, matches]  inside an SQLite in-memory database.

import sqlite3
import pandas as pd
import numpy as np

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
DROP TABLE IF EXISTS teams;
DROP TABLE IF EXISTS players;
DROP TABLE IF EXISTS matches;

CREATE TABLE teams (
  team    TEXT PRIMARY KEY,
  city    TEXT NOT NULL,
  founded INTEGER NOT NULL
);

CREATE TABLE players (
  player_id INTEGER PRIMARY KEY AUTOINCREMENT,
  name   TEXT NOT NULL,
  team   TEXT NOT NULL REFERENCES teams(team),
  pos    TEXT NOT NULL,
  age    INTEGER NOT NULL,
  goals  INTEGER NOT NULL,
  assists INTEGER,
  minutes INTEGER NOT NULL
);

CREATE TABLE matches (
  match_id   INTEGER PRIMARY KEY AUTOINCREMENT,
  date       TEXT NOT NULL,
  home       TEXT NOT NULL,
  away       TEXT NOT NULL,
  home_goals INTEGER NOT NULL,
  away_goals INTEGER NOT NULL
);

INSERT INTO teams(team, city, founded) VALUES
 ('Lions','Sofia', 2015),
 ('Wolves','Plovdiv',1914),
 ('Eagles','Varna',1930);

INSERT INTO players(name, team, pos, age, goals, assists, minutes) VALUES
 ('Ivan Petrov','Lions','FW',24,11,3,1350),
 ('Martin Kolev','Lions','MF',29,4,NULL,1490),
 ('Rui Costa','Lions','DF',31,1,2,1600),
 ('Georgi Iliev','Wolves','FW',27,7,5,1410),
 ('Joe Jackson','Wolves','FW',27,17,5,410),
 ('Peter Marin','Eagles','FW',20,5,1,870);

INSERT INTO matches(date,home,away,home_goals,away_goals) VALUES
 ('2024-08-03','Lions','Wolves',2,1),
 ('2024-08-10','Eagles','Lions',1,3),
 ('2024-08-17','Wolves','Eagles',2,2);
""")
conn.commit()

Now, we have three tables.

  • Loading SQL Data into Pandas

pd.read_sql  does the magic to load either a table or a custom query directly.

teams = pd.read_sql("SELECT * FROM teams", conn)
players = pd.read_sql("SELECT * FROM players", conn)
matches = pd.read_sql("SELECT * FROM matches", conn, parse_dates = ["date"])

print(teams)
print(players.head())
print(matches)

At this point, the SQL data is ready for analysis with Pandas.

  • SQL vs Pandas – Filtering Rows

Task: Find forwards (FW) with more than 1200 minutes on the field:

SQL:

sql1 = pd.read_sql("""
SELECT name, team, goals
FROM players
WHERE pos = 'FW' AND minutes > 1200;
""", conn)

Pandas:

pd1 = players.loc[(players['pos']=='FW')&(players["minutes"]>1200),["name", "team", "goals"]]

As expected, both return the same subset, one written in SQL and the other in Pandas.

  • Aggregating Goals

Task: Total goals per team:

SQL:

sql2 = pd.read_sql("""
SELECT team, SUM(goals)
FROM players
GROUP BY team
ORDER BY 2 DESC;
""", conn)

Pandas:

pd2 = players.groupby("team")["goals"].sum().reset_index()
pd2.sort_values("goals", ascending = False).reset_index(drop=True)

Both results show which team has scored more goals overall.

  • Joining Tables

Task: Add the city of each team to the players table.

SQL:

sql3 = pd.read_sql("""
SELECT p.name, t.city
FROM players p
JOIN teams t on t.team = p.team;
""", conn)

Pandas:

pd3 = players.merge(teams, on="team", how="left")
pd3[["name", "city"]]
  • Building a League Table

The fun part: calculating points (3 for a win, 1 for a draw) and goal difference. Only with SQL this time.

m["home_points"] = np.where(m["home_goals"]>m["away_goals"],3,
                     np.where(m["home_goals"]==m["away_goals"],1,0))
m["away_points"] = np.where(m["away_goals"]>m["home_goals"],3,
                     np.where(m["away_goals"]==m["home_goals"],1,0))

home_tbl = m[["home","home_points","home_goals","away_goals"]] \
              .rename(columns={"home":"team","home_points":"points","home_goals":"gf","away_goals":"ga"})
away_tbl = m[["away","away_points","away_goals","home_goals"]] \
              .rename(columns={"away":"team","away_points":"points","away_goals":"gf","home_goals":"ga"})

total_points = pd.concat([home_tbl,away_tbl])
league = total_points.groupby("team").agg(points=("points","sum"), GF=("gf","sum"), GA=("ga","sum"))
league["GD"] = league["GF"] - league["GA"]
league.sort_values(["points","GD"], ascending=[False,False])

This produces a proper football league ranking – teams sorted by points and then goal difference:

  • Quick Pandas Tricks

    • Top scorers with nlargest:
pd4 = players.nlargest(3, "goals")
pd4 = pd4[["name", "team", "goals"]]
pd4
    • Age bands with pd.cut:
bins = [0, 22, 26, 30, np.inf]
labels = ["<=22", "23-26", "27-30", "31+"]
players["age_band"] = pd.cut(players["age"], bins = bins, labels = labels)
  • YouTube Video

Python - Learn Pandas with SQL Examples

  • GitHub link

https://github.com/Vitosh/Python_personal/tree/master/YouTube/041_Python-Learn-Pandas-with-Football-Analytics