

How I built a local AI QA system for a contact center

By Bracken Fields

The problem that started this

I am the CTO at a telephone answering service. We handle inbound calls for hundreds of small businesses: medical practices, home-service companies, legal, veterinary, the whole mid-market. Our agents take thousands of calls a day, and every one has to be reviewed for quality. Did the agent greet the client? Did they capture the caller's name and callback number? Was the message clear? Did they handle urgency right?

The old process was exactly what you would expect. A team of QA reviewers listened to recordings and filled out scorecards in a spreadsheet. It was slow. It was expensive. Worst of all, it covered maybe one or two percent of calls. The agents who got reviewed might get real feedback. Everyone else was flying blind.

When we started talking about using AI for this, my first instinct was to send the transcripts to GPT-4. That would have worked. It would also have meant paying per-call API costs on six-figure monthly call volumes and shipping our clients' caller data to a third-party API. Our callers' names, phone numbers, medical symptoms, addresses. For a company whose entire business is being trusted with private calls, that was a non-starter.

So I built it locally. Here is how.

The architecture in one paragraph

Every ten minutes, a pool of five worker processes on a Mac mini pulls the newest un-reviewed calls from the database, grabs the recording from our phone platform, runs it through a local Whisper transcription model, sends the transcript and the agent's typed message to a local quantized LLM with a structured rubric, parses the JSON response, and writes a score (0 to 100), a tier of pass, needs coaching, or fail, a written note, and any flags back to the database. The hardware costs less than one month of the equivalent cloud bill. No data leaves the building.

That is the whole system. The rest is details, and the details are where the interesting tradeoffs live.

Why a Mac mini

When people hear "AI inference on a Mac mini" they assume hobby project. For this workload, they are wrong. Apple Silicon's unified memory gives the GPU high-bandwidth access to the same RAM the CPU uses, and Apple's MLX framework runs transformer inference fast. For a model in the 7 to 9 billion parameter range at 4-bit quantization, a modern Mac mini comfortably sustains 25 to 30 tokens per second. You can score a call in seconds.

I could have used a dedicated GPU server. I did not need to. The Mac mini was on the network, the power draw is tiny, and I did not have to explain a new line item to anyone. Starting with hardware you already own is almost always the right first move on projects like this.

The transcription step

For transcription I use OpenAI's Whisper model. Specifically, `mlx_whisper`, the MLX-optimized port. Whisper is open source, Apache 2.0 licensed, and runs offline. I use the medium English model, which is the sweet spot for call-center audio: accurate enough to catch names and numbers, fast enough to process a minute of audio in about five seconds.
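As a sketch, the transcription call is close to a one-liner. The model repo name below is the kind of MLX-community conversion you would pull from Hugging Face; treat it as an assumption, not the exact artifact I use:

```python
import mlx_whisper

# Transcribe one recording with the MLX port of Whisper.
# The repo name is illustrative; substitute whichever medium.en
# conversion you actually download.
result = mlx_whisper.transcribe(
    "recordings/call_1234.wav",
    path_or_hf_repo="mlx-community/whisper-medium.en-mlx",
)
print(result["text"])  # full transcript as one string
```

The result also carries per-segment timestamps, which is what makes dead-air detection possible downstream.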

Telephone audio is noisy. Landlines, cell connections, speakerphone, hold music, background dogs barking. Whisper handles this well, but I had to deal with a few realities. Some recordings were ambient silence at the start because the agent answered mid-dial, and Whisper would try to hallucinate words into the silence. A small amount of audio pre-processing and a minimum duration filter solved most of it.
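The minimum duration filter is trivial but earns its keep. A minimal sketch using the standard library's `wave` module; the three-second floor is illustrative, so tune it against your own recordings:

```python
import wave

MIN_SECONDS = 3.0  # illustrative threshold; tune against real recordings


def worth_transcribing(path_or_file):
    """Skip clips too short to contain a real call.

    Whisper will happily hallucinate words into a few seconds of
    ambient silence, so never hand it audio below the floor.
    """
    with wave.open(path_or_file, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return duration >= MIN_SECONDS
```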

The scoring rubric is where most of the value lives

This is the part people underestimate. They assume "AI call scoring" means you paste a transcript into a model and ask it to grade the call. You can do that. The output will look reasonable. It will also be wildly inconsistent, because the model makes up its own criteria every time.

The value comes from writing a real rubric, the same one a human QA reviewer would use, and making the model fill it out like a form. Mine has seven categories:

  1. Greeting (15 points). Did the agent answer professionally and identify the client by name?
  2. Caller ID capture (20 points). Did they get the name and the callback number?
  3. Message quality (20 points). Is the message clear and complete?
  4. Urgency handling (15 points). Did they recognize and escalate anything urgent?
  5. Professionalism (15 points). Tone, clarity, confidence throughout?
  6. Accuracy (10 points). Does the typed message match the call?
  7. Closure (5 points). Did they thank the caller and close cleanly?

That adds up to 100 points. 75 and above passes, 55 to 74 needs coaching, under 55 fails.
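In code, the rubric is just a table of weights plus a tier function. The category names below mirror the list above; the exact schema is whatever your prompt asks the model to fill in:

```python
# Category weights mirror the rubric; they must total exactly 100.
RUBRIC = {
    "greeting": 15,
    "caller_id_capture": 20,
    "message_quality": 20,
    "urgency_handling": 15,
    "professionalism": 15,
    "accuracy": 10,
    "closure": 5,
}
assert sum(RUBRIC.values()) == 100


def tier(score):
    """Map a 0-100 total onto the three tiers the dashboard shows."""
    if score >= 75:
        return "pass"
    if score >= 55:
        return "needs coaching"
    return "fail"
```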

Here is the thing that took me the longest to get right. The rubric has to tell the model what NOT to penalize. When I first wrote it, the model was failing half the calls because it docked points for things that were not problems. Calls where the caller volunteered their name without being asked? The model deducted points for "not asking." Routine appointment calls with no emergency? Deducted points for "not handling urgency." Natural pauses while the agent typed the message? Flagged as "dead air."

So I added defaults to every category. Urgency gets full credit unless the caller states an emergency. Accuracy gets full credit unless there is a visible mismatch. Dead air gets flagged only on extended unexplained silence, not normal note-taking. After that change, the scores matched what a reasonable human reviewer would say.

That iteration loop is the whole game. Compare the model's scores to human scores on the same calls, find the false positives, add a guardrail. The model does not need to be smarter. The rubric needs to be more honest.
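Concretely, the guardrails live in the prompt itself. This is a compressed sketch of the shape, not my production prompt:

```python
def build_prompt(transcript, typed_message):
    """Assemble the scoring prompt.

    The 'Defaults' section is the guardrail layer: it tells the model
    what NOT to penalize, which is what stops invented deductions.
    """
    return (
        "You are a QA reviewer for a telephone answering service.\n"
        "Score the call against the rubric and reply with JSON only:\n"
        '{"total_score": 0-100, "category_scores": {}, '
        '"note": "one sentence", "flags": []}\n\n'
        "Defaults - do NOT deduct unless the transcript shows the problem:\n"
        "- Urgency: full credit unless the caller states an emergency.\n"
        "- Caller ID: full credit if the name was volunteered unprompted.\n"
        "- Accuracy: full credit unless the typed message contradicts the call.\n"
        "- Dead air: flag only extended unexplained silence, not typing pauses.\n\n"
        f"TRANSCRIPT:\n{transcript}\n\n"
        f"AGENT'S TYPED MESSAGE:\n{typed_message}\n"
    )
```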

The checkout pattern

I run five workers in parallel to keep up with the inbound call volume. The naive way to split work is to give worker 1 calls 1 to 100, worker 2 calls 101 to 200, and so on. That breaks the moment a worker crashes, because now you have orphaned calls nobody will ever process.

I use a checkout pattern instead. Each worker generates a random process ID on startup. To grab a call, it runs an atomic SQL update: set `procId` to my-id and `checkoutTime` to now WHERE `procId` IS NULL AND the selection criteria match, LIMIT 1. Whichever worker wins the race gets the call. The others try again on the next one. If a worker crashes while holding a call, a sweep job looks for any checkout older than 15 minutes and releases it back to the pool. No coordination, no lost work, no duplicates.
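Here is the pattern in miniature, using SQLite so the sketch is self-contained. The column names match the article; everything else (table name, ordering, the score column as the "unreviewed" marker) is illustrative:

```python
import sqlite3
import time
import uuid

STALE_SECONDS = 15 * 60  # checkouts older than this go back to the pool


def checkout_call(conn, proc_id):
    """Atomically claim one unreviewed call; return its id, or None."""
    now = time.time()
    # Sweep: release anything a crashed worker left checked out but unscored.
    conn.execute(
        "UPDATE calls SET procId = NULL, checkoutTime = NULL "
        "WHERE procId IS NOT NULL AND score IS NULL AND checkoutTime < ?",
        (now - STALE_SECONDS,),
    )
    # Claim: this single UPDATE is the whole race. SQLite serializes
    # writers, so at most one worker wins each row.
    conn.execute(
        "UPDATE calls SET procId = ?, checkoutTime = ? "
        "WHERE id = (SELECT id FROM calls "
        "            WHERE procId IS NULL AND score IS NULL "
        "            ORDER BY receivedAt DESC LIMIT 1) "
        "AND procId IS NULL",
        (proc_id, now),
    )
    row = conn.execute(
        "SELECT id FROM calls WHERE procId = ? AND checkoutTime = ?",
        (proc_id, now),
    ).fetchone()
    conn.commit()
    return row[0] if row else None


worker_id = uuid.uuid4().hex  # each worker generates a random ID on startup
```

On a server database you would use `UPDATE ... RETURNING` or `SELECT ... FOR UPDATE SKIP LOCKED` for the same effect; the shape of the pattern is identical.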

This is a boring pattern. It is also the same pattern every durable job queue uses, for the same reasons. Contact center QA is a job queue whether you realize it or not.

What the system writes back

For every call it processes, the system writes a total score (0 to 100), a tier of pass, needs coaching, or fail, a one-sentence summary note, a list of flags like `missing_callback_number` or `dead_air` or `abrupt_closure`, the individual scores for each rubric category, and the full transcript.
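Parsing the model's reply deserves some paranoia: local models occasionally wrap the JSON in chatter or hand back an out-of-range score. A defensive sketch of the writeback parsing:

```python
import json
import re


def parse_review(raw):
    """Extract and sanity-check the model's JSON scorecard.

    Local models sometimes wrap the JSON in prose, so pull the object
    out of the reply instead of trusting the raw output, and clamp the
    score into 0-100 before computing the tier.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    data = json.loads(match.group(0))
    score = max(0, min(100, int(data["total_score"])))
    if score >= 75:
        tier = "pass"
    elif score >= 55:
        tier = "needs coaching"
    else:
        tier = "fail"
    return {
        "score": score,
        "tier": tier,
        "note": str(data.get("note", "")),
        "flags": list(data.get("flags", [])),
    }
```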

That data feeds an ops dashboard the client team uses every day. Managers see their agents' scores by week, sort a leaderboard, click into a specific call to read the transcript and the AI's notes, and route anything in the "needs coaching" tier to a queue for human review. The humans stay in the loop. They stop spending their days on the 90 percent of calls that were fine.

What this cost

I am not going to put a specific dollar figure on this because every business is different and I do not want to promise outcomes I cannot guarantee. Directionally:

  • Cloud API equivalent: several thousand dollars a month, scaling with call volume, plus the privacy risk of sending caller data to a third party.
  • This build: the hardware was already owned. Whisper is free. The LLM is free. Electricity is a rounding error. Engineering time was a few weeks of focused work plus ongoing tuning.
  • Payback: happened quickly. The reviewer hours we recovered went toward coaching the agents who needed it instead of listening to thousands of routine calls.

The biggest wins were not on the P&L. They were quality-of-work wins. QA reviewers stopped doing mind-numbing sampling and started doing real coaching. Agents started getting feedback within a day or two instead of once a month. Managers stopped flying blind on performance.

Things I would do differently

Start with the rubric, not the model. I wasted a week in the early going fiddling with prompt formats before I went back and wrote a proper rubric. The rubric is 90 percent of the system's value. The model is a commodity.

Test on real calls from day one. Synthetic examples lie to you. I threw out my first few iterations because they worked on fake test data and fell apart on the noisy, weird, half-captured reality of telephone audio.

Run the model locally if you can afford the engineering time. The privacy story alone paid for the complexity. Running locally also forces you to think about scale and batching from day one, which is healthy.

Keep the human review queue. Even with a well-tuned rubric, the model will sometimes be wrong in ways that matter. Queue the borderline cases for a human. It is a small amount of work and it builds the trust you need to scale the rest.

Who should build something like this

This approach makes sense if you have repeatable structured work being done by humans at scale, a rubric or quality standard that could be written down (even if nobody has written it down yet), data privacy or cost concerns that make cloud APIs painful, and a data pipeline you can read from and write back to.

That describes a lot of operations-heavy businesses. Contact centers are an obvious fit. So are intake teams at law firms, claims reviewers at insurance brokerages, QC on document-heavy workflows, and compliance review in regulated industries. Wherever humans review human output against a known standard, there is a version of this that could work for you.

Does that sound like your business? Or have you been told "AI can do that" and you want somebody who has built it to tell you what it takes? <a href="/contact">Let's talk.</a> I live in Indianapolis and I would rather have an honest conversation about your workflow than sell you a transformation deck.

FAQ

Does this replace human QA reviewers?

No, and it should not. It replaces the boring sampling work so reviewers can spend their time on things humans are good at: coaching agents, handling edge cases, and catching what the model missed. The goal is to reshape the job.

What model are you using?

A 7 to 9 billion parameter open-weight model quantized to 4 bits, running on MLX. The specific model matters less than the rubric you wrap around it. I have tested several and they all work when the rubric is solid.

How accurate is it compared to a human reviewer?

Comparable on straightforward calls, which are the vast majority. On borderline calls it is roughly as reliable as a junior reviewer on a tired day. The structured rubric and the flag system surface the ambiguous cases for human review, which is the right place for them.

Could you do this with cloud APIs instead?

Yes, and it would be simpler to stand up. The tradeoffs are cost (scales linearly with volume), privacy (your callers' data leaves the building), and dependency (the day the API has an outage, so does your QA pipeline). For this client, local was the right call. For a smaller operation, cloud might be fine. It is a business decision, not a technical one.

What happens when the model gives a score that is wrong?

Every score writes back with its rubric breakdown and transcript, so it is easy to audit. The dashboard lets reviewers override any score a human disagrees with, and we track those overrides. They feed the next round of rubric tuning. The system improves over time because the humans keep pushing back on it.
