There's a version of this project that's boring. You load a CSV into pandas, compute cosine similarity, and return the top ten results.
It works. It's fine. It's also exactly what everyone else does.
The Brief
Starting Point
Most data science portfolios look the same. A regression here, a dashboard there, maybe a neural network trained on MNIST.
I wanted to build something where the interesting problem wasn't just the model: a project that would
force me to make real architectural decisions and leave me having learned something.
The brief I gave myself was to build something visually impressive, technically differentiated, and useful enough that a non-technical
person could immediately understand the value. Enter football player replacement. The domain is universally relatable,
the problem ("who can replace this player?") is immediately understandable, and the data structure, as I'll explain, is a genuinely
natural fit for a graph database. Oh, and most importantly, I love football.
The result is Transferium Player Intelligence: a player replacement finder powered by Neo4j, a weighted cosine similarity engine,
and a live Streamlit app. It took several wrong turns to get there, and I think those wrong turns are worth writing about too.
18,405 players modelled · 276,075 similarity edges written · £0 total cost
Architecture
Why a Graph Database?
This was the first major decision, and the one I want to defend properly, because it's easy to dismiss as over-engineering.
In a relational database, relationships between entities are implicit. For example, if you want to find all players in the
Premier League, you join players to clubs to leagues at query time. The database doesn't inherently know that
Liverpool FC is connected to the Premier League. It finds out by scanning and matching rows. In a graph database,
relationships are first-class citizens. They're stored explicitly with their own properties. A path like
(Player)→(Club)→(League) is a chain of stored edges that the database follows directly, not a set of joins worked out at query time.
For this project, the key payoff is the SIMILAR_TO edge. Once I compute player similarity scores and store
them as graph edges — (Kylian Mbappé Lottin)-[:SIMILAR_TO {score: 0.82}]->(Omar Khaled Mohamed Marmoush) — finding replacements becomes a single hop:
MATCH (p:Player {name: "Kylian Mbappé Lottin"})-[r:SIMILAR_TO]->(s:Player)
RETURN s.name, r.score ORDER BY r.score DESC LIMIT 15
Zero recomputing or scanning. Just traversal. At around 18,000 players the performance difference is modest, but the architecture
scales, and, more importantly, it forced me to think about my data as a network of relationships rather than a table of rows.
I believe that shift in thinking is the real value.
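To make the "single hop" concrete without a database, here is a minimal in-memory sketch of the same lookup. The `similar_to` dictionary, the player names, and the scores are made-up stand-ins for stored SIMILAR_TO edges, not data from the project.

```python
# Precomputed SIMILAR_TO edges, keyed by source player.
# Names and scores are illustrative placeholders only.
similar_to = {
    "Player A": [("Player B", 0.82), ("Player C", 0.77), ("Player D", 0.91)],
    "Player B": [("Player A", 0.82)],
}


def top_replacements(edges, name, limit=15):
    """Mirror the Cypher query: follow stored edges, sort by score, cut off."""
    neighbours = edges.get(name, [])
    return sorted(neighbours, key=lambda pair: pair[1], reverse=True)[:limit]


print(top_replacements(similar_to, "Player A", limit=2))
```

Nothing is recomputed at read time here. The similarity work happened when the edges were written, which is the same trade the graph makes.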
Graph Schema
(Player)-[:PLAYS_FOR]->(Club)-[:IN_LEAGUE]->(League)
(Player)-[:PLAYS_AT]->(Position)
(Player)-[:SIMILAR_TO {score, pos_match}]->(Player)
4 node types · 4 relationship types · each edge carries its own properties
Data Quality
The Dataset Problem
I initially started with a SoFIFA dataset from Kaggle. On the surface, it looked
comprehensive. It wasn't. Every individual skill attribute (crossing, finishing, dribbling, stamina, etc.) was null
for every single player. The scraper had captured biographical and overall rating data, but not the granular attributes
I needed to build a meaningful similarity engine.
Lesson
Always audit your data before designing around it. Column names do not imply populated data.
So that was on me for not checking first. I switched to an FC 26 dataset that included the full attribute breakdown across prefixed column groups:
attacking_, skill_, movement_, power_, mentality_,
defending_, and goalkeeping_. Zero nulls across all 29 outfield attributes and 7 goalkeeper
attributes. 18,405 players. This was the foundation.
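The audit that would have caught the SoFIFA problem is only a few lines. This is a sketch using the standard library and a tiny made-up CSV. The column names and rows in `SAMPLE` are placeholders, not the real dataset.

```python
import csv
import io

# Tiny stand-in for a scraped export: headers exist, skill columns are empty.
SAMPLE = """name,overall,crossing,finishing
Player A,88,,
Player B,75,,
Player C,81,,
"""


def null_fraction(rows, columns):
    """Return the share of empty values per column (0.0 = full, 1.0 = all missing)."""
    counts = {col: 0 for col in columns}
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            if (row.get(col) or "").strip() == "":
                counts[col] += 1
    return {col: (counts[col] / total if total else 0.0) for col in columns}


reader = csv.DictReader(io.StringIO(SAMPLE))
report = null_fraction(reader, reader.fieldnames)
for column, fraction in report.items():
    print(f"{column:<10} {fraction:.0%} missing")
```

With the data already in a pandas DataFrame, `df.isna().mean()` gives the same per-column picture in one call.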
The Engine
Building the Similarity Model
Before any similarity computation, I made one structural decision that shaped everything downstream: goalkeepers and
outfield players are modelled entirely separately. A goalkeeper's reflexes score means something completely different
to a striker's finishing score, even if the numbers look similar in magnitude. Mixing them in a single vector space would
produce statistically coherent but footballingly meaningless results. The two groups live in separate vector spaces.
Similarity is only ever computed within each group.
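A minimal sketch of that split, assuming each player record carries a comma-separated `positions` string and that goalkeepers are labelled `GK`. Both the field name and the sample records are assumptions for illustration.

```python
def split_by_role(players):
    """Partition players into (goalkeepers, outfield) so each gets its own vector space."""
    goalkeepers, outfield = [], []
    for player in players:
        positions = [p.strip() for p in player["positions"].split(",")]
        if "GK" in positions:
            goalkeepers.append(player)
        else:
            outfield.append(player)
    return goalkeepers, outfield


sample = [
    {"name": "Keeper One", "positions": "GK"},
    {"name": "Winger One", "positions": "RW, ST"},
    {"name": "Mid One", "positions": "CAM, CM"},
]
keepers, outfielders = split_by_role(sample)
print(len(keepers), len(outfielders))
```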
Weighted Attribute Vectors
The naive approach treats all 29 attributes equally. But a defensive midfielder's interceptions matter far more than their finishing.
A striker's positioning is more important than their sliding tackle. Equal weighting ignores what positions actually demand.
I assigned weights across four groups:
| Group | Weight | Key attributes |
| Technical | 35% | Ball control, dribbling, passing, vision |
| Tactical | 30% | Positioning, interceptions, marking, reactions |
| Physical | 25% | Acceleration, sprint speed, stamina, strength |
| League / Region | 10% | Contextual adaptation bonus |
These group weights are informed judgements, not empirically optimised values. Each individual attribute's contribution
is its raw score multiplied by its group weight divided by the number of attributes in that group.
A 90-rated crossing doesn't dominate the vector just because it's high; it's weighted relative to its group.
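The per-attribute rule above can be sketched as follows. This version uses only the twelve key attributes named in the table rather than the full 29, and the dictionary keys are hypothetical names, not the dataset's actual columns. The League / Region share is left out of the vector because, per the scoring formula later on, it is added as a bonus on the final score.

```python
# Group weight and member attributes (a subset, for illustration only).
GROUPS = {
    "technical": (0.35, ["ball_control", "dribbling", "passing", "vision"]),
    "tactical": (0.30, ["positioning", "interceptions", "marking", "reactions"]),
    "physical": (0.25, ["acceleration", "sprint_speed", "stamina", "strength"]),
}


def weighted_vector(player):
    """Scale each raw score by group_weight / number_of_attributes_in_group."""
    vector = []
    for weight, attributes in GROUPS.values():
        per_attribute = weight / len(attributes)
        vector.extend(player[name] * per_attribute for name in attributes)
    return vector


# A made-up player rated 80 in every attribute.
flat_player = {name: 80 for _, names in GROUPS.values() for name in names}
print(weighted_vector(flat_player))
```

With four attributes per group, a technical attribute is scaled by 0.35 / 4, a tactical one by 0.30 / 4, and a physical one by 0.25 / 4, so the same raw score contributes slightly more in the technical group.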
Cosine Similarity, Not Euclidean Distance
Cosine similarity measures the angle between two vectors, not the absolute distance between them.
This means a 70-rated player with a balanced profile is more "similar" to an 85-rated balanced player
than to a 90-rated specialist, because their skill shape is the same, even if the magnitude differs.
For player replacement, shape matters more than magnitude. You're looking for someone who plays the same way,
not someone of identical quality.
Consider Mohamed Salah and Francisco Trincão, a 91-rated Premier League superstar and an 82-rated winger at Sporting CP.
Their Euclidean distance is significant: Salah's raw attribute numbers are simply higher across the board. But the model scores them
at 76% similarity, because the shape of their profiles points in the same direction. The ratios between attributes are nearly identical,
the same pattern, expressed at different levels. Euclidean distance would tell you they're far apart. Cosine similarity correctly
identifies them as the same type of player, separated by level and context rather than style. That distinction is exactly what a
replacement finder should care about. A club that can't afford Salah but needs his profile (a left-footed, right-sided attacker
who dribbles, finishes, and doesn't defend too much) gets a credible starting point. That's what the engine is built to surface.
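A small sketch makes the shape-versus-magnitude point checkable. The three vectors below are invented numbers, not the real attributes of any player: `budget` has exactly the same ratios as `star` at 70% of the level, while `peer` is closer in level but has a different profile.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up scores for (dribbling, finishing, defending).
star = [90.0, 88.0, 45.0]
budget = [63.0, 61.6, 31.5]  # same ratios as star, lower level
peer = [88.0, 70.0, 70.0]    # similar level, different shape

for label, other in (("budget", budget), ("peer", peer)):
    print(label, cosine_similarity(star, other), euclidean_distance(star, other))
```

Euclidean distance ranks `peer` as the closer match, because its raw numbers are nearer. Cosine similarity ranks `budget` first, because it points in the same direction as `star`.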
The Overall Rating Penalty
My first results exposed a flaw. A 73-rated player appeared as the top replacement for a 90-rated Bellingham
because their attribute ratios were nearly identical. Mathematically correct. Practically not useful.
I added a penalty term:
overall_diff = abs(player_a_overall - player_b_overall)
overall_penalty = min(overall_diff * 0.015, 0.30)
final_score = base_similarity * 0.90 + league_bonus - overall_penalty
A 17-point gap, like the one in the Bellingham case, now reduces the score by 0.255 (17 × 0.015), or 25.5 percentage points.
The 0.30 ceiling, reached at a 20-point gap, ensures a large gap is penalised but never completely eliminates a stylistically similar player from appearing.
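The same formula wrapped in functions, so the arithmetic can be checked directly. The constants come from the snippet above. The function names and the example base similarity of 0.95 are illustrative.

```python
def overall_penalty(overall_a, overall_b, rate=0.015, cap=0.30):
    """Penalty grows by 0.015 per rating point of difference, capped at 0.30."""
    return min(abs(overall_a - overall_b) * rate, cap)


def final_score(base_similarity, league_bonus, overall_a, overall_b):
    """Combine base similarity (scaled to 90%), league bonus, and rating penalty."""
    return base_similarity * 0.90 + league_bonus - overall_penalty(overall_a, overall_b)


# The Bellingham case: a 90-rated player against a 73-rated stylistic twin.
print(overall_penalty(90, 73))          # 17 points * 0.015
print(overall_penalty(90, 60))          # large gap, hits the 0.30 cap
print(final_score(0.95, 0.0, 90, 73))   # hypothetical base similarity of 0.95
```

The cap takes over at a 20-point gap, since 20 × 0.015 = 0.30. Any larger difference costs the same.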
The League / Region Bonus
Real scouting isn't purely statistical. Clubs prefer players already adapted to a similar footballing environment:
tactical styles, physical demands, officiating standards, and so on. I encoded this as a contextual bonus worth up to 10%:
same league gets +10%, same regional group (e.g. both in Iberian leagues) gets +5%, different region gets nothing.
A perfect statistical twin in a different league still scores very high. The nudge here is small and deliberate.
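The three tiers translate into a short lookup. The `REGION_OF` mapping below is a hypothetical example of a regional grouping. The actual groups used in the project are not listed here.

```python
# Hypothetical league-to-region mapping, for illustration only.
REGION_OF = {
    "Premier League": "British Isles",
    "La Liga": "Iberia",
    "Liga Portugal": "Iberia",
}


def league_bonus(league_a, league_b, regions=REGION_OF):
    """+0.10 for the same league, +0.05 for the same region, otherwise 0."""
    if league_a == league_b:
        return 0.10
    region_a = regions.get(league_a)
    region_b = regions.get(league_b)
    if region_a is not None and region_a == region_b:
        return 0.05
    return 0.0


print(league_bonus("La Liga", "La Liga"))
print(league_bonus("La Liga", "Liga Portugal"))
print(league_bonus("La Liga", "Premier League"))
```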
Position Matching
A central midfielder is not a valid replacement for a striker, even if the underlying stats look similar.
I only create SIMILAR_TO edges between players who share at least one position. Primary position match earns
an additional 3% boost. Because FC 26 players carry multiple positions, a player listed as CAM, CM
can surface as a replacement for either role. This reflects real-world flexibility without forcing false matches.
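A sketch of the position rule, assuming positions arrive as a comma-separated string and that the first listed position is the primary one. That ordering convention is an assumption, as are the function names.

```python
def parse_positions(raw):
    """Turn 'CAM, CM' into ['CAM', 'CM']."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def position_boost(positions_a, positions_b, primary_boost=0.03):
    """Return None when no position is shared (no edge is created);
    otherwise 0.03 if the primary positions match, else 0.0."""
    a = parse_positions(positions_a)
    b = parse_positions(positions_b)
    if not set(a) & set(b):
        return None
    return primary_boost if a[0] == b[0] else 0.0


print(position_boost("CAM, CM", "CM, CDM"))  # shared CM, different primary
print(position_boost("CAM, CM", "CAM"))      # same primary
print(position_boost("ST", "CM"))            # nothing shared
```

Returning `None` instead of a negative score keeps "not comparable" distinct from "comparable but weak", which matters when deciding whether to write an edge at all.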
Engineering
Loading 276,075 Edges Into AuraDB
The production database is Neo4j AuraDB Free: cloud-hosted, zero cost, with a 200k node / 400k relationship limit.
The dataset sits comfortably within those limits.
The first loading attempt failed. I was sending 18,000 individual write queries over the internet, one per player.
The connection timed out mid-load. The fix was switching to UNWIND batch writes, sending 250 players
per round trip instead of one at a time:
UNWIND $batch AS props
MERGE (p:Player {id: props.id})
SET p += props
Lesson
Always batch writes to remote databases. This reduced round trips from roughly 18,000 to 74, and the connection held for the full load.
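The batching logic itself is small. In this sketch, `load_players` accepts any object with a `run(query, **params)` method, which is the shape of a session in the Neo4j Python driver. `RecordingSession` is a hypothetical stand-in so the example runs without a database, and the player records are placeholders.

```python
PLAYER_QUERY = """
UNWIND $batch AS props
MERGE (p:Player {id: props.id})
SET p += props
"""


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_players(session, players, batch_size=250):
    """Send one UNWIND query per batch and return the number of round trips."""
    round_trips = 0
    for batch in chunked(players, batch_size):
        session.run(PLAYER_QUERY, batch=batch)
        round_trips += 1
    return round_trips


class RecordingSession:
    """Offline stand-in for a driver session; it only records what it was sent."""

    def __init__(self):
        self.calls = []

    def run(self, query, **params):
        self.calls.append(params)


players = [{"id": i} for i in range(18_405)]
session = RecordingSession()
print(load_players(session, players))  # 74 round trips for 18,405 players
```

The last batch is a partial one (155 players), which is why the count is 74 rather than an even division.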
The similarity computation runs as two separate matrices: cosine similarity across 16,343 outfield players produces a 16,343 ×
16,343 matrix (roughly 267 million comparisons), and a separate 2,062 × 2,062 matrix handles goalkeepers in their own attribute space.
Both are computed in seconds by NumPy's vectorised operations. The top 15 position-compatible candidates per player were then
written as SIMILAR_TO edges to AuraDB in batches of 1,000, for a total of 276,075 edges (18,405 players × 15).
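The step from a full similarity matrix to a fixed number of edges per player can be sketched in plain Python. This is a stand-in for the vectorised NumPy version, not the project's code: the `compatible` callback is a hypothetical hook for the position check, and the 3 × 3 matrix is made up.

```python
import heapq


def top_k_edges(names, similarity, compatible, k=15):
    """For each player, keep the k highest-scoring compatible candidates."""
    edges = []
    for i, source in enumerate(names):
        candidates = (
            (similarity[i][j], names[j])
            for j in range(len(names))
            if j != i and compatible(i, j)
        )
        for score, target in heapq.nlargest(k, candidates):
            edges.append((source, target, score))
    return edges


names = ["A", "B", "C"]
similarity = [
    [1.0, 0.9, 0.4],
    [0.9, 1.0, 0.7],
    [0.4, 0.7, 1.0],
]
edges = top_k_edges(names, similarity, compatible=lambda i, j: True, k=1)
print(edges)
```

Capping at k per player is what keeps the edge count linear in the number of players instead of quadratic: 267 million comparisons collapse to 18,405 × 15 stored relationships.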
Product Decisions
Four Choices That Shaped the App
The technical layer was only half the design surface. Several product decisions had as much impact on the
final experience as any model choice.
Design · Branding
Dark and Minimal Visual Language
Black background, Bebas Neue display type, JetBrains Mono for data. The colour accent is a sharp
yellow-green.
The design language I chose is dark and minimal: black background, Bebas Neue for display type, DM Sans for body,
JetBrains Mono for data values. The colour accent is #C9F31D, a sharp yellow-green
that reads as "data" rather than "football kit." It's legible, distinctive, and avoids every visual cliché in the football analytics space.
This was also built to align with my personal brand. If you look at my website alongside this project, you'll see what I mean.
Outcome: Visually differentiated from every spreadsheet-style football tool I found in the space.
Data · Longevity
No Club, League, or Value Data in the UI
Displaying dynamic data would make the app feel outdated within weeks.
I made a deliberate choice to not show club, league, market value, or wage data.
All of those change. A player's club next season is unknown; their value fluctuates weekly.
Displaying them would anchor the app to a moment in time and make it feel stale.
What I do show (overall rating, position, nationality, age, similarity score) is stable within the FC 26 snapshot,
which is clearly labelled in the interface. I included age because it's useful context when
evaluating a replacement. The league/region bonus is used in the model but isn't surfaced as a visible field, because
the logic is internal and the data would drift.
Outcome: The app remains accurate and honest about what it knows and when.
Naming
Why Transferium?
It took a few iterations. Earlier candidates included ScoutIQ and Squadrant.
ScoutIQ was too AI-tool-adjacent and generic. It could describe a hundred different products.
Squadrant was too hard to say and the wordplay didn't land on first read.
Transferium stuck because it sounds vaguely scientific and maps immediately to the transfer market without
being literal about it. Naming was one of the first UX decisions, and in retrospect the choice feels inevitable.
Outcome: Instantly identifiable, memorable, and domain-appropriate.
Infrastructure · Cost
Free End to End
Neo4j AuraDB Free, Streamlit Community Cloud, Kaggle data. Total cost: £0.
Every component in the stack was deliberately chosen to be free without compromising on technical quality.
Neo4j Desktop for local development, AuraDB Free for cloud hosting. Streamlit for the frontend
and Streamlit Community Cloud for hosting. Python, pandas, NumPy, and scikit-learn
for processing. GitHub for version control. FC 26 dataset from Kaggle.
The constraint was generative. Being forced to work within the AuraDB free tier meant thinking carefully about
schema design and edge counts from the start. I believe this made the architecture tighter overall.
Outcome: Production-grade stack with £0 ongoing cost.
Retrospective
What I Would Do Differently
Better base data. FC 26 ratings are community-maintained perceptions, not tracking data.
Real clubs use Opta, StatsBomb, GPS telemetry. The limitation is real and worth naming. This makes the project a
game-data approximation of a scouting tool. I believe this could be useful on its own.
Player embeddings instead of raw attributes. Rather than vectors built from raw attribute scores,
I'd experiment with training embeddings on match performance data: representing players by how they actually play,
not how a video game rates them. That's a meaningfully harder problem to solve.
Age cohort filtering. Replacing a 32-year-old with a 19-year-old is a different scouting problem from replacing
him with a 35-year-old. A future version would let users filter by age range explicitly.
Explainability. The score ring tells you how similar a player is.
It doesn't tell you why. A radar chart overlay comparing the queried player's attribute
profile against the replacement would make results far more actionable. Once again, that is for a future version.
The Stack
Everything Used, Free End to End
| Graph database (local) | Neo4j Desktop | Free |
| Graph database (cloud) | Neo4j AuraDB Free | Free |
| Data processing | Python · pandas · NumPy · scikit-learn | Free |
| App framework | Streamlit | Free |
| Hosting | Streamlit Community Cloud | Free |
| Version control | GitHub | Free |
| Data | FC 26 · Kaggle | Free |
Try It
Transferium Player Intelligence is live. Search any player, find their replacements.
The code is on GitHub. If you're thinking about graph databases for a similar problem,
or if you have questions about any of the decisions above, I'd genuinely like to hear from you.
About
Built by Mohamed Amir Sohil Bishrul Hafi.
Data: FC 26 · SoFIFA · Graph: Neo4j · App: Streamlit