Overview
This application predicts the distribution of total goals scored in upcoming English Premier League fixtures. Rather than limiting forecasts to common betting lines (such as over/under 2.5 goals), the model treats the total goal count as a discrete class and outputs a probability for each outcome: 0, 1, 2, and so on, up to a combined 7+ category.
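Because the output is a full distribution, any half-goal betting line can be recovered from it by summing class probabilities. The sketch below illustrates this; the function name and the example probabilities are made up for illustration and are not model output.

```python
# Class order assumed here: index i means i total goals, last index is the 7+ bucket.
GOAL_LABELS = ["0", "1", "2", "3", "4", "5", "6", "7+"]


def over_probability(probs, line):
    """Probability that total goals exceed a half-goal line such as 2.5."""
    if len(probs) != len(GOAL_LABELS):
        raise ValueError("expected one probability per goal class")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    if line > 6.5:
        # Everything from 7 goals upward is merged, so higher lines cannot be resolved.
        raise ValueError("lines above 6.5 fall inside the 7+ bucket")
    return sum(p for goals, p in enumerate(probs) if goals > line)


# Illustrative numbers only.
example = [0.07, 0.17, 0.24, 0.22, 0.15, 0.09, 0.04, 0.02]
over_2_5 = over_probability(example, 2.5)
under_2_5 = 1.0 - over_2_5
```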
Data Pipeline
Data Acquisition:
Match results are collected by scraping historical data from site X using Selenium, due to dynamic JavaScript content and standard request-blocking mechanisms. The scraper navigates season by season, extracting structured data tables spanning at least two full seasons.
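A rough sketch of this step, under stated assumptions: the page is rendered in headless Chrome, the scraper waits for a table element to appear, and the rendered HTML is then parsed. The function names, the bare "table" selector, and the timeout are hypothetical, and the real site's markup may call for different selectors.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_results_table(html):
    """Return table rows as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def fetch_season_html(url, timeout=15):
    """Load one season page in headless Chrome and return the rendered HTML."""
    # Imported here so the parsing helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the results appear in a <table> once the JavaScript has run.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Looping fetch_season_html over one URL per season and passing each result to parse_results_table would give the season-by-season extraction described above.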
Data Preparation:
The data is cleaned to remove inconsistencies and outliers, with exploratory analysis performed to confirm data integrity and stability across seasons.
Feature Engineering:
Over 40 features are engineered, covering team form, goal averages, recent trends, home/away adjustments, rolling metrics, and head-to-head performance. These features are designed to maximize predictive strength while minimizing redundancy.
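As an illustration of one rolling metric, the sketch below computes each team's average goals scored over its previous few matches. The field names and window size are assumptions. The key design point is that a team's history is updated only after the current match's features are computed, so no feature sees the result it is meant to predict.

```python
from collections import defaultdict, deque


def add_rolling_goal_features(matches, window=5):
    """Add each side's average goals scored over its previous `window` matches.

    `matches` must be sorted by date; each item is a dict with the
    (hypothetical) keys home, away, home_goals, away_goals.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for match in matches:
        row = dict(match)
        for side in ("home", "away"):
            past = history[match[side]]
            # None marks a team with no prior matches in the data.
            row[f"{side}_avg_scored_last{window}"] = (
                sum(past) / len(past) if past else None
            )
        rows.append(row)
        # Update histories only after the features are computed (no leakage).
        history[match["home"]].append(match["home_goals"])
        history[match["away"]].append(match["away_goals"])
    return rows
```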
Feature Selection & Correlation:
Pearson and Spearman correlation analysis, along with feature importance scoring, are used to refine and optimize the feature set. Low-impact or overlapping features are removed to maintain model efficiency and stability.
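The correlation part of this step could look roughly like the sketch below, which drops a feature when its absolute Pearson or Spearman correlation with an already-kept feature exceeds a threshold. The 0.9 threshold and the keep-first rule are assumptions, and importance scoring is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_redundant(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept.

    `features` maps a feature name to its list of values.
    """
    kept = []
    for name, column in features.items():
        redundant = any(
            max(abs(pearson(column, features[k])), abs(spearman(column, features[k])))
            > threshold
            for k in kept
        )
        if not redundant:
            kept.append(name)
    return kept
```

Checking both coefficients catches monotonic but non-linear overlap that Pearson alone would understate.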
Model Architecture
Model Choice:
A Random Forest Classifier (RFC) was selected for its ability to handle non-linear relationships, feature interactions, and class imbalances, while reducing overfitting through ensemble averaging.
Target Structuring:
The target variable is structured as discrete goal categories, allowing for richer, probability-based forecasting rather than simple binary outcomes.
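A minimal sketch of how such a target might be built from a match result, assuming totals of seven or more are merged into a single top class as the overview describes. The function names are illustrative.

```python
MAX_CLASS = 7  # totals of 7 or more share one class


def goal_class(home_goals, away_goals, cap=MAX_CLASS):
    """Map a final score to its total-goals class index."""
    if home_goals < 0 or away_goals < 0:
        raise ValueError("goal counts cannot be negative")
    return min(home_goals + away_goals, cap)


def class_label(class_index, cap=MAX_CLASS):
    """Human-readable label for a class index."""
    return f"{cap}+" if class_index == cap else str(class_index)
```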
Hyperparameter Optimization:
Key hyperparameters, including n_estimators, max_depth, min_samples_leaf, and max_features, were tuned via randomized search and cross-validation to achieve optimal performance and generalization.
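A search of this kind can be sketched with scikit-learn's RandomizedSearchCV. The search ranges, the number of iterations, the log-loss scoring, and the synthetic data below are all assumptions for illustration; the values actually used are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 240 fixtures, 6 features, 4 goal classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 6))
y = rng.integers(0, 4, size=240)

# Hypothetical search space for the four hyperparameters named above.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                # number of random configurations to try
    scoring="neg_log_loss",  # rewards good probabilities, not just the top class
    cv=3,                    # stratified 3-fold for classifiers
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

For match data ordered in time, a chronological split such as TimeSeriesSplit may be a safer choice for cv than shuffled or stratified folds, since it avoids training on fixtures played after the ones being scored.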
Testing & Validation:
The model has been evaluated on unseen fixtures, using metrics such as log loss, accuracy, and probability calibration checks to ensure consistent predictive reliability.
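To make these metrics concrete, the sketch below computes multiclass log loss and a simple per-class calibration table that compares the mean predicted probability in each bin with the observed frequency. The bin count and function names are assumptions.

```python
from math import log


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= log(p)
    return total / len(y_true)


def calibration_table(y_true, probs, target_class, n_bins=5):
    """Rows of (bin index, count, mean predicted prob, observed frequency)."""
    bins = [[] for _ in range(n_bins)]
    for label, row in zip(y_true, probs):
        p = row[target_class]
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, 1 if label == target_class else 0))
    table = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(hit for _, hit in items) / len(items)
        table.append((i, len(items), mean_pred, observed))
    return table
```

For a well-calibrated model, the mean predicted probability and the observed frequency in each row should be close, given enough fixtures per bin.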
Model Evaluation & Success
The model is tested on fixtures it did not see during training and is re-validated as new results arrive. Predicting the exact number of goals every time is impossible, so the aim is well-calibrated probabilities that help users make better-informed decisions.
We do not display historical backtests, as past statistics can mislead. This is a predictive tool driven by machine learning and statistical models, not an exact science.
If we could predict every result perfectly, we’d already be on our private beach, having bought Luton Town FC and brought in Ronaldo as a super sub! Until then, this tool helps you approach predictions with sharper data and smarter probabilities.
Deployment
The model, feature engineering pipelines, and dynamic fixture engine are deployed to Streamlit Cloud. Data and features are updated daily, ensuring predictions remain current and relevant.
Scalability
While this version focuses on the English Premier League, the framework is designed for easy adaptation to other leagues, with league-specific feature engineering and model retraining to reflect unique scoring patterns.