Essays [Amir Sohil]

There's a version of this project that's boring. You load a CSV into pandas, compute cosine similarity, and return the top ten results. It works. It's fine. It's also exactly what everyone else does.

The Brief

Starting Point

Most data science portfolios look the same. A regression here, a dashboard there, maybe a neural network trained on MNIST. I wanted to build something where the interesting problem wasn't just the model. What I was truly looking for was a project that would force me to make real architectural decisions and come out the other side having genuinely learned something.

The brief I gave myself was to build something visually impressive, technically differentiated, and useful enough that a non-technical person could immediately understand the value. Enter football player replacement. Think about it. The domain is universally relatable, the problem ("who can replace this player?") is immediately understandable, and the data structure, as I'll explain, is a genuinely natural fit for a graph database. Oh, and most importantly, I love football.

The result is Transferium Player Intelligence: a player replacement finder powered by Neo4j, a weighted cosine similarity engine, and a live Streamlit app. It took several wrong turns to get there, and I think those wrong turns are worth writing about too.

18,405 Players modelled

276,075 Similarity edges written

£0 Total cost

Architecture

Why a Graph Database?

This was the first major decision, and the one I want to defend properly, because it's easy to dismiss as over-engineering.

In a relational database, relationships between entities are implicit. For example, if you want to find all players in the Premier League, you join players to clubs to leagues at query time. The database doesn't inherently know that Liverpool FC is connected to the Premier League. It finds out by scanning and matching rows. In a graph database, relationships are first-class citizens. They're stored explicitly with their own properties. A path like (Player)→(Club)→(League) is a literal edge in the graph, not something computed at query time.

For this project, the key payoff is the SIMILAR_TO edge. Once I compute player similarity scores and store them as graph edges — (Kylian Mbappé Lottin)-[:SIMILAR_TO {score: 0.82}]->(Omar Khaled Mohamed Marmoush) — finding replacements becomes a single hop:

MATCH (p:Player {name: "Kylian Mbappé Lottin"})-[r:SIMILAR_TO]->(s:Player)
RETURN s.name, r.score ORDER BY r.score DESC LIMIT 15

Zero recomputing or scanning. Just traversal. At around 18,000 players the performance difference is modest, but the architecture scales, and, more importantly, it forced me to think about my data as a network of relationships rather than a table of rows. I believe that shift in thinking is the real value.

Graph Schema

Player ── [:PLAYS_FOR] ──▶ Club ── [:IN_LEAGUE] ──▶ League

Player ── [:PLAYS_AT] ──▶ Position

Player ── [:SIMILAR_TO {score, pos_match}] ──▶ Player

4 node types · 4 relationship types · each edge carries its own properties

Data Quality

The Dataset Problem

I initially started with a SoFIFA dataset from Kaggle. On the surface, it looked comprehensive. It wasn't. Every individual skill attribute (crossing, finishing, dribbling, stamina, etc.) was null for every single player. The scraper had captured biographical and overall rating data, but not the granular attributes I needed to build a meaningful similarity engine.

Lesson: Always audit your data before designing around it. Column names do not imply populated data.

So that was on me for not checking first. I switched to an FC 26 dataset that included the full attribute breakdown across prefixed column groups: attacking_, skill_, movement_, power_, mentality_, defending_, and goalkeeping_. Zero nulls across all 29 outfield attributes and 7 goalkeeper attributes. 18,405 players. This was the foundation.
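A minimal sketch of that kind of audit, assuming pandas and a dataset whose attribute columns use the prefixes listed above. The function name and the sample column names are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Prefixes named in the text; the sample columns further down are made up.
ATTRIBUTE_PREFIXES = (
    "attacking_", "skill_", "movement_", "power_",
    "mentality_", "defending_", "goalkeeping_",
)

def null_share_by_group(df: pd.DataFrame) -> dict[str, float]:
    """Return the fraction of missing cells in each prefixed column group."""
    report = {}
    for prefix in ATTRIBUTE_PREFIXES:
        cols = [c for c in df.columns if c.startswith(prefix)]
        if cols:  # skip groups the file does not contain at all
            report[prefix] = float(df[cols].isna().to_numpy().mean())
    return report

# Tiny illustrative frame: the attacking_ column exists but is entirely empty,
# which is the failure mode described for the first dataset.
sample = pd.DataFrame({
    "name": ["A", "B"],
    "attacking_finishing": [None, None],
    "movement_acceleration": [80, 75],
})
audit = null_share_by_group(sample)
```

A share of 1.0 for any group means the column names are present but the data is not, which is worth knowing before any modelling starts.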

The Engine

Building the Similarity Model

Before any similarity computation, I made one structural decision that shaped everything downstream: goalkeepers and outfield players are modelled entirely separately. A goalkeeper's reflexes score means something completely different to a striker's finishing score, even if the numbers look similar in magnitude. Mixing them in a single vector space would produce statistically coherent but footballingly meaningless results. The two groups live in separate vector spaces. Similarity is only ever computed within each group.

Weighted Attribute Vectors

The naive approach treats all 29 attributes equally. But a defensive midfielder's interceptions matter far more than their finishing. A striker's positioning is more important than their sliding tackle. Equal weighting ignores what positions actually demand. I assigned weights across four groups:

Group	Weight	Key attributes
Technical	35%	Ball control, dribbling, passing, vision
Tactical	30%	Positioning, interceptions, marking, reactions
Physical	25%	Acceleration, sprint speed, stamina, strength
League / Region	10%	Contextual adaptation bonus

These group weights are informed judgements, not empirically optimised values. Each individual attribute's contribution is its raw score multiplied by its group weight divided by the number of attributes in that group. A 90-rated crossing doesn't dominate the vector just because it's high; it's weighted relative to its group.
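The per-attribute scaling described above can be sketched as follows. The group weights come from the table; which attributes sit in which group, and the attribute names themselves, are assumptions made for illustration. The league/region share is left out here because it is applied as a bonus rather than as an attribute:

```python
import numpy as np

# Weights from the table above; the group membership is an illustrative guess.
GROUPS = {
    "technical": (0.35, ["ball_control", "dribbling", "short_passing", "vision"]),
    "tactical": (0.30, ["positioning", "interceptions", "reactions"]),
    "physical": (0.25, ["acceleration", "sprint_speed", "stamina", "strength"]),
}

def weighted_vector(player: dict[str, float]) -> np.ndarray:
    """Scale each raw score by group_weight / number_of_attributes_in_group."""
    parts = []
    for weight, attrs in GROUPS.values():
        per_attribute = weight / len(attrs)
        parts.extend(player[name] * per_attribute for name in attrs)
    return np.array(parts)

# A made-up player rated 80 in every attribute.
player = {name: 80.0 for _, attrs in GROUPS.values() for name in attrs}
vec = weighted_vector(player)
```

Because each weight is divided by the group size, a group with many attributes does not outweigh a smaller group simply by having more columns.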

Cosine Similarity, Not Euclidean Distance

Cosine similarity measures the angle between two vectors, not the absolute distance between them. This means a 70-rated player with a balanced profile is more "similar" to an 85-rated balanced player than to a 90-rated specialist, because their skill shape is the same, even if the magnitude differs. For player replacement, shape matters more than magnitude. You're looking for someone who plays the same way, not someone of identical quality.

Consider Mohamed Salah and Francisco Trincão, a 91-rated Premier League superstar and an 82-rated winger at Sporting CP. Their Euclidean distance is significant: Salah's raw attribute numbers are simply higher across the board. But the model scores them at 76% similarity, because the shape of their profiles points in the same direction. The ratios between attributes are nearly identical: the same pattern, expressed at different levels. Euclidean distance would tell you they're far apart. Cosine similarity correctly identifies them as the same type of player, separated by level and context rather than style. That distinction is exactly what a replacement finder should care about. A club that can't afford Salah but needs his profile (a left-footed, right-sided attacker who dribbles, finishes, and doesn't defend too much) gets a credible starting point. That's what the engine is built to surface.
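To make the shape-versus-magnitude point concrete, here is a small sketch with three made-up attribute vectors (not real player data). One is the star scaled down to 80%, the other is numerically closer but has different ratios:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented three-attribute profiles, purely for illustration.
star = np.array([90.0, 85.0, 40.0])
scaled_down = star * 0.8                    # same ratios, lower level
different_mix = np.array([80.0, 85.0, 60.0])  # closer numbers, other shape

cos_scaled = cosine_similarity(star, scaled_down)
cos_other = cosine_similarity(star, different_mix)
dist_scaled = float(np.linalg.norm(star - scaled_down))
dist_other = float(np.linalg.norm(star - different_mix))
```

In this toy case Euclidean distance ranks `different_mix` as the nearer player, while cosine similarity ranks `scaled_down` first because its profile points in exactly the same direction.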

The Overall Rating Penalty

My first results exposed a flaw. A 73-rated player appeared as the top replacement for a 90-rated Bellingham because their attribute ratios were nearly identical. Mathematically correct. Practically not useful. I added a penalty term:

overall_diff = abs(player_a_overall - player_b_overall)
overall_penalty = min(overall_diff * 0.015, 0.30)
final_score = base_similarity * 0.90 + league_bonus - overall_penalty

A 17-point gap, such as 90 versus 73, now subtracts 0.255 from the score (25.5 percentage points). The penalty is capped at 0.30, which it reaches at a 20-point gap. The cap ensures a large gap is penalised without completely ruling out a stylistically similar player.
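The penalty arithmetic can be checked with a short sketch that wraps the three lines above in functions. The function names and the sample base similarity of 0.98 are assumptions for illustration:

```python
def overall_penalty(overall_a: int, overall_b: int) -> float:
    """0.015 per rating point of difference, capped at 0.30."""
    return min(abs(overall_a - overall_b) * 0.015, 0.30)

def final_score(base_similarity: float, overall_a: int, overall_b: int,
                league_bonus: float = 0.0) -> float:
    """Combine the terms exactly as in the formula shown above."""
    return base_similarity * 0.90 + league_bonus - overall_penalty(overall_a, overall_b)

penalty_17 = overall_penalty(90, 73)  # 17-point gap, below the cap
penalty_20 = overall_penalty(90, 70)  # 20-point gap, where the cap is reached
penalty_30 = overall_penalty(90, 60)  # larger gaps cost no more than the cap

# A near-identical profile (hypothetical base similarity 0.98), 17 points lower.
score = final_score(0.98, 90, 73)
```

With these inputs the near-twin drops from 0.882 before the penalty to roughly 0.627 after it, which is enough to push it below candidates of comparable quality.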

The League / Region Bonus

Real scouting isn't purely statistical. Clubs prefer players already adapted to similar footballing cultures. Why? Because there's a meaningful contextual advantage to finding someone already adapted to a similar environment: tactical styles, physical demands, officiating standards, etc. I encoded this as a 10% contextual bonus: same league gets +10%, same regional group (e.g. both in Iberian leagues) gets +5%, different region gets nothing. A perfect statistical twin in a different league still scores very high. The nudge here is small and deliberate.
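One way to express the three tiers is a small lookup, sketched below. The bonus values come from the text; the league-to-region mapping and its labels are hypothetical, since the actual grouping is not listed here:

```python
# Hypothetical league-to-region mapping, for illustration only.
REGION_OF = {
    "Premier League": "England",
    "La Liga": "Iberia",
    "Liga Portugal": "Iberia",
}

def league_bonus(league_a: str, league_b: str) -> float:
    """Same league: +0.10. Same regional group: +0.05. Otherwise: 0."""
    if league_a == league_b:
        return 0.10
    region_a = REGION_OF.get(league_a)
    region_b = REGION_OF.get(league_b)
    if region_a is not None and region_a == region_b:
        return 0.05
    return 0.0

same_league = league_bonus("La Liga", "La Liga")
same_region = league_bonus("La Liga", "Liga Portugal")
no_overlap = league_bonus("La Liga", "Premier League")
```

Leagues missing from the mapping fall through to zero rather than matching each other by accident, which keeps the bonus conservative.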

Position Matching

A central midfielder is not a valid replacement for a striker, even if the underlying stats look similar. I only create SIMILAR_TO edges between players who share at least one position. Primary position match earns an additional 3% boost. Because FC 26 players carry multiple positions, a player listed as CAM, CM can surface as a replacement for either role. This reflects real-world flexibility without forcing false matches.
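The eligibility rule can be sketched as a set intersection. This assumes positions arrive as a comma-separated string with the primary position listed first, and that "primary position match" means both players share the same primary position. Both are assumptions, and how the 3% boost is folded into the final score is not specified here, so the sketch only returns it:

```python
def parse_positions(raw: str) -> list[str]:
    """Split a string like 'CAM, CM' into ['CAM', 'CM']."""
    return [p.strip() for p in raw.split(",") if p.strip()]

def position_match(raw_a: str, raw_b: str) -> tuple[bool, float]:
    """Return (eligible_for_edge, primary_position_boost)."""
    a, b = parse_positions(raw_a), parse_positions(raw_b)
    if not set(a) & set(b):
        return False, 0.0  # no shared position: no SIMILAR_TO edge
    # Assumption: the first listed position is the primary one.
    return True, (0.03 if a[0] == b[0] else 0.0)

shared_secondary = position_match("CAM, CM", "CM")
no_shared = position_match("ST", "CM")
shared_primary = position_match("CAM, CM", "CAM, RW")
```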

Engineering

Loading 276,075 Edges Into AuraDB

The production database is Neo4j AuraDB Free: cloud-hosted, zero cost, with a 200k node / 400k relationship limit. The dataset sits comfortably within those limits.

The first loading attempt failed. I was sending 18,000 individual write queries over the internet, one per player. The connection timed out mid-load. The fix was switching to UNWIND batch writes, sending 250 players per round trip instead of one at a time:

UNWIND $batch AS props
MERGE (p:Player {id: props.id})
SET p += props

Lesson: Always batch writes to remote databases. This reduced round trips from 18,000 to 74, and the connection held for the full load.
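The batching itself is independent of the database client. The sketch below chunks the rows and hands each chunk to a caller-supplied function. `write_in_batches` and `run_query` are hypothetical names, and the lambda is a stand-in that records batch sizes instead of talking to a database:

```python
from typing import Callable, Iterator

# The Cypher statement from the text, sent once per batch.
UPSERT_PLAYERS = """
UNWIND $batch AS props
MERGE (p:Player {id: props.id})
SET p += props
"""

def chunked(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def write_in_batches(run_query: Callable[[str, list[dict]], None],
                     rows: list[dict], size: int = 250) -> int:
    """Send rows in batches and return the number of round trips made."""
    round_trips = 0
    for batch in chunked(rows, size):
        run_query(UPSERT_PLAYERS, batch)
        round_trips += 1
    return round_trips

# Stand-in for a real database call: it only records each batch size.
sent: list[int] = []
players = [{"id": i} for i in range(18_405)]
trips = write_in_batches(lambda query, batch: sent.append(len(batch)), players)
```

In a real load, `run_query` would wrap whatever the database client provides for running a parameterised query, passing the chunk as the `$batch` parameter.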

The similarity computation runs as two separate matrices: cosine similarity across 16,343 outfield players produces a 16,343 × 16,343 matrix (roughly 267 million comparisons), and a separate 2,062 × 2,062 matrix handles goalkeepers in their own attribute space. Both are computed in seconds by NumPy's vectorised operations. The top 15 position-compatible candidates per player were then written as SIMILAR_TO edges to AuraDB in batches of 1,000. That comes to 276,075 edges in total (18,405 players × 15).
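A minimal NumPy sketch of the all-pairs step, under some simplifying assumptions: it covers a single group, assumes no all-zero rows, and leaves out the position filter, bonus, and penalty. `top_k_similar` is a hypothetical helper name:

```python
import numpy as np

def top_k_similar(vectors: np.ndarray, k: int) -> np.ndarray:
    """Return, for each row, the indices of its k most cosine-similar rows."""
    # Normalising rows first turns the dot product into cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                 # full n x n similarity matrix
    np.fill_diagonal(sims, -np.inf)      # a player is never their own match
    return np.argsort(-sims, axis=1)[:, :k]

# Four invented two-attribute players: rows 0 and 1 share one shape,
# rows 2 and 3 share another.
vectors = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 2.0]])
neighbours = top_k_similar(vectors, k=1)
```

At 16,343 rows the full matrix holds about 267 million values, which is roughly 2.1 GB in float64, so casting to float32 or processing rows in blocks may be worth considering on a memory-constrained machine.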

Product Decisions

Four Choices That Shaped the App

The technical layer was only half the design surface. Several product decisions had as much impact on the final experience as any model choice.

Design · Branding

Dark and Minimal Visual Language

Black background, Bebas Neue display type, JetBrains Mono for data. The colour accent is a sharp yellow-green.

The design language I chose is dark and minimal: black background, Bebas Neue for display type, DM Sans for body, JetBrains Mono for data values. The colour accent is #C9F31D, a sharp yellow-green that reads as "data" rather than "football kit." It's legible, distinctive, and avoids every visual cliché in the football analytics space.

This was also built to align with my personal brand. If you compare my website with this project, you'll see what I mean.

Outcome: Visually differentiated from every spreadsheet-style football tool I found in the space.

Data · Longevity

No Club, League, or Value Data in the UI

Displaying dynamic data would make the app feel outdated within weeks.

I made a deliberate choice to not show club, league, market value, or wage data. All of those change. A player's club next season is unknown; their value fluctuates weekly. Displaying them would anchor the app to a moment in time and make it feel stale.

What I do show (overall rating, position, nationality, age, similarity score) is stable within the FC 26 snapshot, which is clearly labelled in the interface. I included age because it's genuinely useful context when evaluating a replacement. The league/region bonus is used in the model but isn't surfaced as a visible field, because the logic is internal and the data would drift.

Outcome: The app remains accurate and honest about what it knows and when.

Naming

Why Transferium?

It took a few iterations. Earlier candidates included ScoutIQ and Squadrant.

ScoutIQ was too AI-tool-adjacent and generic. It could describe a hundred different products. Squadrant was too hard to say and the wordplay didn't land on first read.

Transferium stuck because it sounds vaguely scientific, and maps immediately to the transfer market without being literal about it. Naming was one of the first UX decisions. And to be honest, it feels inevitable in retrospect.

Outcome: Instantly identifiable, memorable, and domain-appropriate.

Infrastructure · Cost

Free End to End

Neo4j AuraDB Free, Streamlit Community Cloud, Kaggle data. Total cost: £0.

Every component in the stack was deliberately chosen to be free without compromising on technical quality. Neo4j Desktop for local development, AuraDB Free for cloud hosting. Streamlit for the frontend and Streamlit Community Cloud for hosting. Python, pandas, NumPy, and scikit-learn for processing. GitHub for version control. FC 26 dataset from Kaggle.

The constraint was generative. Being forced to work within the AuraDB free tier meant thinking carefully about schema design and edge counts from the start. I believe this made the architecture tighter overall.

Outcome: Production-grade stack with £0 ongoing cost.

Retrospective

What I Would Do Differently

Better base data. FC 26 ratings are community-maintained perceptions, not tracking data. Real clubs use Opta, StatsBomb, GPS telemetry. The limitation is real and worth naming. This makes the project a game-data approximation of a scouting tool. I believe this could be useful on its own.

Player embeddings instead of raw attributes. Rather than vectors built from raw attribute scores, I'd experiment with training embeddings on match performance data: representing players by how they actually play, not how a video game rates them. That's a meaningfully harder problem to solve.

Age cohort filtering. Replacing a 32-year-old with a 19-year-old and replacing him with a 35-year-old are different scouting problems. A future version would let users filter by age range explicitly.

Explainability. The score ring tells you how similar a player is. It doesn't tell you why. A radar chart overlay comparing the queried player's attribute profile against the replacement would make results far more actionable. Once again, that is for a future version.

The Stack

Everything Used, Free End to End

Graph database (local)	Neo4j Desktop	Free
Graph database (cloud)	Neo4j AuraDB Free	Free
Data processing	Python · pandas · NumPy · scikit-learn	Free
App framework	Streamlit	Free
Hosting	Streamlit Community Cloud	Free
Version control	GitHub	Free
Data	FC 26 · Kaggle	Free

Mohamed Amir Sohil Bishrul Hafi

My Football Player Replacement Finder
