AI Job-Fit Scoring for Employed Professionals
I built and shipped a working AI product end to end as a solo founder -- not by writing every line myself, but by operating as the product and architecture layer directing a team of AI coding agents.
Employed professionals don't have time to read hundreds of listings to find the few worth pursuing. Existing tools either spray applications indiscriminately or bury the user in an unfiltered feed.
ScoreFit casts a wide net, scores sharply against the individual's real background, and lets the person decide. It recommends which resume version to use, tracks the application pipeline, and surfaces the roles worth pursuing -- without automating the decision itself.
This is the part most relevant to how I'd operate on a team. I positioned myself as the planning and architecture layer. Day to day, I made the product decisions, defined the architecture, and wrote precise specs -- then delegated implementation to AI coding agents running in parallel, each scoped to a non-overlapping part of the codebase. I reviewed and integrated everything myself.
That sounds like a force multiplier, and it is. But it only works with three disciplines I had to build deliberately.
Agents are eager to build. If I handed off an ambiguous problem, I got a confident, wrong implementation -- fast. So I forced every product and design decision to be settled in plain language before any implementation began. The spec is where the thinking happens. The build is mechanical.
An agent reporting "verified" means only "I checked what I thought to check." Repeatedly, the only reliable signal was live data -- the actual database state, the actual API response, the actual network request -- not the agent's description of it. I built a habit of confirming outcomes against ground truth before believing anything was fixed.
The most dangerous failures were not crashes. They were processes that completed successfully while silently doing nothing useful. A job that "ran fine" but returned no results is worse than one that errors, because nothing alerts you. I learned to measure what users actually received, not whether the code executed.
The clearest example of how this operating model pays off. Several beta testers were getting sparse job feeds despite the product sourcing from an index covering tens of thousands of company career sites. The obvious assumption was a coverage gap -- the data source simply didn't have the roles.
Rather than accept that, I ran a structured investigation, splitting it across agents along clean boundaries.
One agent traced how the search query was constructed. One traced what happened to results after they came back. One made a live call to see what the source actually returned. Keeping the work disjoint meant the agents could not collide, and each finding was independently verifiable.
The result overturned my own initial theory. A simple external reality-check -- manually searching a major job board for the same titles -- returned over a hundred roles. The source had the inventory. The bug was ours.
The matching logic required job titles to appear in an exact, rigid word order. Real-world titles like "Director, Customer Support Systems" or "VP of Client Success" silently failed to match. A controlled test confirmed it: loosening the matching recovered the missing roles cleanly, with no increase in irrelevant results.
The fix was a single change to the query logic. It improved feeds across the entire user base at once -- and it dissolved a much larger, more expensive plan I had been considering (integrating a second data source) that would have solved the wrong problem at permanent cost.
The win was not the code change. It was the discipline to distrust a plausible conclusion, design an investigation that could disprove my own assumption, and let cheap evidence redirect an expensive decision before I committed to it.
Once sourcing was working for most users, a harder question remained: a few profiles -- very senior roles, niche markets -- were still thin. I needed to know whether that was a fixable query problem or a genuine limit of the data source, before deciding whether to take on a major architecture change.
So rather than guess, I built a set of deliberate test profiles modeling the specific edge cases real users had revealed, including a known-good control. Any future change could be measured against a clean before/after baseline rather than anecdote.
Instrument first, then change, then measure against the instrument. That sequencing is the difference between "the number moved" and "the change worked."
Architected so the data layer is the single integration point between services -- no fragile runtime coupling between the AI scoring, the sourcing pipeline, and the frontend.