Case Study

ScoreFit

AI Job-Fit Scoring for Employed Professionals

I built and shipped a working AI product end to end as a solo founder -- not by writing every line myself, but by operating as the product and architecture layer directing a team of AI coding agents.

Role: Founder, Product & Architecture
Solo-built | B2C SaaS | Active Beta

The Product

What it does

Employed professionals don't have time to read hundreds of listings to find the few worth pursuing. Existing tools either spray applications indiscriminately or bury the user in an unfiltered feed.

ScoreFit casts a wide net, scores sharply against the individual's real background, and lets the person decide. It recommends which resume version to use, tracks the application pipeline, and surfaces the roles worth pursuing -- without automating the decision itself.

ScoreFit pipeline -- 79 active roles scored and organized across pipeline stages

Score a job -- paste a job description and get a fit score in seconds

Scoring result -- fit score, resume recommendation, and capability analysis

Detailed scoring breakdown -- strengths, gaps, and resume guidance

Job detail -- Ask about this role and cover letter generation

Daily email digest -- top matches scored and ranked, delivered every morning

How I Worked

The operating model

This is the part most relevant to how I'd operate on a team. I positioned myself as the planning and architecture layer. Day to day, I made the product decisions, defined the architecture, and wrote precise specs -- then delegated implementation to AI coding agents running in parallel, each scoped to a non-overlapping part of the codebase. I reviewed and integrated everything myself.

That sounds like a force multiplier, and it is. But it only works with three disciplines I had to build deliberately.

Lock the decision before writing the spec

Agents are eager to build. If I handed off an ambiguous problem, I got a confident, wrong implementation -- fast. So I forced every product and design decision to be settled in plain language before any implementation began. The spec is where the thinking happens. The build is mechanical.

Treat "done" and "verified" as untrustworthy by default

An agent reporting "verified" means only "I checked what I thought to check." Repeatedly, the only reliable signal was live data -- the actual database state, the actual API response, the actual network request -- not the agent's description of it. I built a habit of confirming outcomes against ground truth before believing anything was fixed.

Watch outcomes, not just errors

The most dangerous failures were not crashes. They were processes that completed successfully while silently doing nothing useful. A job that "ran fine" but returned no results is worse than one that errors, because nothing alerts you. I learned to measure what users actually received, not whether the code executed.
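One way to picture the difference between "the code executed" and "the user received something" is the rough sketch below. It is illustrative only and not ScoreFit's actual code: the `RunReport` shape, the `classify_run` helper, and the `min_roles` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    """Summary of one pipeline run for one user (hypothetical shape)."""
    user_id: str
    completed: bool       # the process finished without raising
    roles_delivered: int  # what the user actually received


def classify_run(report: RunReport, min_roles: int = 1) -> str:
    """Label a run by its outcome rather than by exit status alone."""
    if not report.completed:
        return "failed"          # loud failure: the error itself is the signal
    if report.roles_delivered < min_roles:
        return "silent_failure"  # ran fine, delivered nothing useful
    return "ok"


# Made-up reports, for illustration only.
reports = [
    RunReport("user_a", completed=True, roles_delivered=12),
    RunReport("user_b", completed=True, roles_delivered=0),
    RunReport("user_c", completed=False, roles_delivered=0),
]
flagged = [r.user_id for r in reports if classify_run(r) == "silent_failure"]
print(flagged)
```

The point of the separate `silent_failure` label is that a run which completed with zero delivered roles gets surfaced for review instead of being counted as a success.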

A Representative Problem

The thin-feed investigation

This is the clearest example of how the operating model pays off. Several beta testers were getting sparse job feeds despite the product sourcing from an index covering tens of thousands of company career sites. The obvious assumption was a coverage gap -- the data source simply didn't have the roles.

Rather than accept that, I ran a structured investigation, splitting it across agents along clean boundaries.

Investigation Design

One agent traced how the search query was constructed. One traced what happened to results after they came back. One made a live call to see what the source actually returned. Keeping the work disjoint meant the agents could not collide, and each finding was independently verifiable.

The result overturned my own initial theory. A simple external reality-check -- manually searching a major job board for the same titles -- returned over a hundred roles. The source had the inventory. The bug was ours.

The matching logic required job titles to appear in an exact, rigid word order. Real-world titles like "Director, Customer Support Systems" or "VP of Client Success" silently failed to match. A controlled test confirmed it: loosening the matching recovered the missing roles cleanly, with no increase in irrelevant results.
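The failure mode can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the production query logic: `strict_match` assumes "rigid word order" means the target words must appear contiguously and in order, `loose_match` assumes the loosened rule only requires every target word to be present, and the target titles ("Customer Support Director", "VP Client Success") are hypothetical examples of what a user might be searching for.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strict_match(target: str, title: str) -> bool:
    """Rigid matching: the target words must appear contiguously, in order."""
    want, have = tokens(target), tokens(title)
    n = len(want)
    return any(have[i:i + n] == want for i in range(len(have) - n + 1))


def loose_match(target: str, title: str) -> bool:
    """Loosened matching: every target word must appear, in any order."""
    return set(tokens(target)) <= set(tokens(title))


# Real-world titles quoted in the text, paired with hypothetical targets.
cases = [
    ("Customer Support Director", "Director, Customer Support Systems"),
    ("VP Client Success", "VP of Client Success"),
]
for target, title in cases:
    print(title, strict_match(target, title), loose_match(target, title))
```

Under these assumptions both titles fail the strict check and pass the loose one, while an unrelated title such as "Director of Engineering" still fails the loose check for "Customer Support Director" because two of the three target words are missing.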

Outcome

The fix was a single change to the query logic. It improved feeds across the entire user base at once -- and it dissolved a much larger, more expensive plan I had been considering (integrating a second data source) that would have solved the wrong problem at permanent cost.

The win was not the code change. It was the discipline to distrust a plausible conclusion, design an investigation that could disprove my own assumption, and let cheap evidence redirect an expensive decision before I committed to it.

Building the Instrument

Measure first, then change

Once sourcing was working for most users, a harder question remained: a few profiles -- very senior roles, niche markets -- were still thin. I needed to know whether that was a fixable query problem or a genuine limit of the data source, before deciding whether to take on a major architecture change.

So rather than guess, I built a set of deliberate test profiles modeling the specific edge cases real users had revealed, including a known-good control. Any future change could be measured against a clean before/after baseline rather than anecdote.

Instrument first, then change, then measure against the instrument. That sequencing is the difference between "the number moved" and "the change worked."
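The baseline-then-compare sequence could look something like the sketch below. Everything in it is an assumption made for illustration: the profile names, the per-profile targets, and the feed counts are invented, and `measure` takes any function that returns a feed size for a profile rather than calling a real sourcing API.

```python
from typing import Callable

# Hypothetical test profiles: name -> minimum feed size considered healthy.
PROFILES = {
    "control_known_good": 20,
    "very_senior_role": 10,
    "niche_market": 10,
}


def measure(fetch_count: Callable[[str], int]) -> dict[str, int]:
    """Run every test profile through a sourcing function; record feed sizes."""
    return {name: fetch_count(name) for name in PROFILES}


def compare(before: dict[str, int], after: dict[str, int]) -> dict[str, dict]:
    """Report, per profile, both the change and whether the target is met."""
    return {
        name: {
            "before": before[name],
            "after": after[name],
            "delta": after[name] - before[name],
            "meets_target": after[name] >= PROFILES[name],
        }
        for name in PROFILES
    }


# Stand-in sourcing results with made-up counts, for illustration only.
old_counts = {"control_known_good": 25, "very_senior_role": 2, "niche_market": 1}
new_counts = {"control_known_good": 25, "very_senior_role": 12, "niche_market": 4}

baseline = measure(old_counts.__getitem__)           # instrument first
result = compare(baseline, measure(new_counts.__getitem__))  # then change, then compare
for name, row in result.items():
    print(name, row)
```

Reporting `delta` and `meets_target` side by side keeps the two questions separate: in this invented data the niche profile improves by three roles yet still misses its target, and the unchanged control shows the change did not disturb a profile that was already healthy.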

What I'd Carry Into a Product Role

Five things this project made permanent

Decisions before deliverables

The quality of what gets built is set at the spec, not the review. An ambiguous brief produces a confident, wrong output every time.

Verify against ground truth, not reports

Confident summaries -- from people or tools -- are not evidence. Live data is. I confirmed outcomes against the actual database state, the actual API response, the actual user's feed before calling anything done.

Distrust improvement that is only relative

A number that moved is not a number that is right. I repeatedly caught the pattern of declaring victory on a metric that had improved but was still far from where it needed to be.

Cheap reality-checks beat expensive investigations

A two-minute manual search on a job board redirected a multi-week architecture build. The fastest path to the right answer is often the most direct one.

Sequence the work to be measurable

Baseline, then change, then compare. Without the instrument in place first, you cannot tell whether the change worked or whether you got lucky.

For the Curious

The stack

Architected so the data layer is the single integration point between services -- no fragile runtime coupling between the AI scoring, the sourcing pipeline, and the frontend.

Frontend: Next.js / React on Vercel

Auth & Data: Supabase / PostgreSQL

AI Scoring: Multi-pass Anthropic pipeline

Resume Parsing: Python service

Job Sourcing: Third-party API, daily pipeline

Model Selection: Validated against ground-truth test cases

Architecture note: Because the data layer is the only integration point, the AI scoring pipeline, the job sourcing feed, and the frontend have no runtime dependency on one another. Each can be changed or replaced without touching the others.
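The decoupling described here can be illustrated with a toy example. It is a sketch under heavy assumptions, not the real system: an in-memory SQLite database stands in for the production database, the `jobs` table and its columns are hypothetical, and the "score" is a placeholder calculation rather than an AI fit score. What it shows is only the shape of the idea, with each step reading or writing the shared table and never calling another step directly.

```python
import sqlite3

# In-memory SQLite stands in for the real database, purely for illustration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT NOT NULL, fit_score REAL)"
)


def sourcing_step(conn: sqlite3.Connection, titles: list[str]) -> None:
    """Sourcing side: writes new rows and knows nothing about scoring."""
    conn.executemany("INSERT INTO jobs (title) VALUES (?)", [(t,) for t in titles])
    conn.commit()


def scoring_step(conn: sqlite3.Connection) -> int:
    """Scoring side: picks up unscored rows; never calls the sourcing code."""
    rows = conn.execute(
        "SELECT id, title FROM jobs WHERE fit_score IS NULL"
    ).fetchall()
    for job_id, title in rows:
        score = float(len(title) % 10)  # placeholder, not a real fit score
        conn.execute("UPDATE jobs SET fit_score = ? WHERE id = ?", (score, job_id))
    conn.commit()
    return len(rows)


def frontend_read(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Read side: returns only rows that have already been scored."""
    return conn.execute(
        "SELECT title, fit_score FROM jobs WHERE fit_score IS NOT NULL ORDER BY id"
    ).fetchall()


sourcing_step(db, ["VP of Client Success", "Director, Customer Support Systems"])
scored = scoring_step(db)
print(scored, frontend_read(db))
```

Because the three functions share only the table, any one of them could be rewritten or swapped for a different service without the other two changing, as long as the table contract holds.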