How to Choose an AI Implementation Consultant Without Getting Burned

Practical criteria and provable artifacts to demand from AI consultants, so you hire proven production experience instead of promises you can’t verify.

Most AI projects don’t fail in the model. They fail in the boring stuff: deploys, retries, auth, logging, the 2 AM page. By then the AI implementation consultant is gone and you own a prototype that nobody can keep alive.

I’ve shipped code into electronic health records, DoD weapons systems, and cyber security monitoring systems for a fintech company. AI is the newest substrate, not a new discipline. The teams that ship reliable AI are the ones who already knew how to ship reliable anything: they write runbooks, they own incidents, they argue about latency budgets, they don’t hand you a demo and call it done. The teams that fail are the ones treating AI as a science project that happens to have users.

That’s the lens behind this checklist. The questions don’t test whether someone is an “AI expert.” They test whether someone has ever actually been on the hook for a system in production.

Production AI Readiness Scorecard

Use this before you sign anything. If they fail two rows, walk away.

| Category | What to demand | Walk away if… |
| --- | --- | --- |
| Shipped Systems: evidence of production | Live URL, repo, owner contact. Architecture diagram matching the repo. Reference still running 12+ months. | Only demos, pilots, “POCs.” No verifiable customer contact. |
| Operations: can it survive 2 AM? | Runbooks, on-call rosters, postmortems. CI/CD logs and rollback plan. Real incident walkthrough. | No postmortem they can show. Manual deploys, no rollback. |
| Metrics: numbers, not adjectives | Baseline + target for accuracy, P95, cost. Error budgets and SLOs in writing. 90-day actuals committed in contract. | Vague answers, no baselines. Refuses to commit numbers. |
| Security & Auth: where most consultants fold | OAuth / mTLS / short-lived tokens. RBAC, least privilege, key rotation. PII handling in prompts and logs. | Freezes on auth whiteboard. No PII or secrets policy. |
| Edge Cases: who owns the broken path? | Named owners for incidents and drift. Escalation paths and SLAs. Past tickets where ownership stopped repeats. | “The team handles it.” No written ownership matrix. |
| Contract: where leverage lives or dies | Milestone-based payment with acceptance tests. KPIs with remedies for misses. 60–90 day warranty with named support hours. | Lump sum on signature. No warranty period. |

Key Takeaways

  • Ask for one production system shipped 12+ months ago, with a live URL and an owner you can call. No demos, no POCs, no Fortune 500 name-drops.
  • Demand the artifacts: repo, architecture diagram, runbook, CI/CD logs, monitoring dashboards, and at least one real postmortem.
  • Make them commit to numbers in writing — baseline, target, and 90-day actuals for accuracy, latency, cost, error rate, and adoption.
  • Push hard on security: auth, secrets, PII handling, and which compliance regimes apply. Most AI consultants fold here.
  • Tie payment to milestones, KPIs to remedies, and require a 60–90 day warranty period. If they push back, you’re hiring a sales org.

The One Question That Filters Out 80% of Consultants

Ask this first: *”Show me a system you shipped 12+ months ago that’s still running, and put me on the phone with the person who owns it today.”*

Most candidates can’t answer it. They have demos, pilots, and “we built a POC for a Fortune 500.” None of that tells you whether they can keep a system alive once real traffic hits it.

If they can produce a live URL, an owner’s phone number, and a story about something that broke at 2 AM and how they fixed it, keep talking. If they pivot to slideware, end the call.

What the Handoff Should Look Like

The contract gets you to launch. The handoff is what keeps the system alive after.
Ask for a written handoff plan covering three things: who owns the system day one after launch, how it gets maintained, and how your team gets trained to run it.

Demand channels for user feedback and measurable SLAs for system scalability.

  1. Handover checklist with metrics, runbooks, and target RTO/RPO.
  2. 90/180-day post-delivery evaluation milestones and remediation plans.
  3. Maintenance strategies: patching, dependency updates, security reviews.
  4. Ongoing training cadence, documentation ownership, and user feedback loop.

Shipped Artifacts List

Shipping isn’t a deploy. Shipping is everything that has to be true for someone other than the original author to keep the system running. At minimum, demand:

  • A repo you can read
  • An architecture diagram that matches the repo
  • A runbook a stranger could follow at 3 AM
  • CI/CD logs and a documented rollback
  • Monitoring dashboards with real numbers on them
  • At least one postmortem from a real incident

That last one matters more than the others combined. A consultant who can’t show you a postmortem has either never had an incident (impossible) or never written one up (worse).

The Metrics Conversation

Vague answers here are the loudest red flag in the entire process. Good AI implementation consultants talk in numbers without being prompted. Bad ones talk in adjectives.

Ask for before-and-after baselines on whatever the system is supposed to improve. Here’s the format I use:

| Metric | Baseline | Target |
| --- | --- | --- |
| Accuracy / F1 | 0.62 | 0.85 |
| Latency P95 | 480 ms | 120 ms |
| Cost per transaction | $0.45 | $0.15 |
| Error rate | 4.0% | 0.5% |
| Adoption (DAU) | 120 | 600 |

If they can’t fill in the left two columns during the sales call, they don’t know your problem yet. If they refuse to commit to the third column in writing, they don’t believe their own numbers.
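
If the targets make it into the contract, they should also live in an automated acceptance check, not just in prose. Here’s a minimal sketch of what that could look like, assuming your monitoring stack can export 90-day actuals; the metric names and thresholds mirror the sample table above and are illustrative, not universal:

```python
# Hypothetical acceptance check: compare 90-day production actuals against
# the targets committed in the contract. Names and numbers mirror the
# sample table above; your real metrics pipeline supplies `actuals`.

CONTRACT_TARGETS = {
    "accuracy_f1":    {"target": 0.85,  "direction": "min"},  # must be >= target
    "latency_p95_ms": {"target": 120,   "direction": "max"},  # must be <= target
    "cost_per_txn":   {"target": 0.15,  "direction": "max"},
    "error_rate":     {"target": 0.005, "direction": "max"},
    "adoption_dau":   {"target": 600,   "direction": "min"},
}

def kpi_misses(actuals: dict) -> list[str]:
    """Return human-readable misses; an empty list means every KPI was met."""
    misses = []
    for name, spec in CONTRACT_TARGETS.items():
        value = actuals[name]
        met = value >= spec["target"] if spec["direction"] == "min" else value <= spec["target"]
        if not met:
            misses.append(f"{name}: actual {value} vs. committed {spec['target']}")
    return misses

# Example run with made-up 90-day actuals: one miss, which should map
# directly to a remedy clause in the contract.
actuals = {"accuracy_f1": 0.87, "latency_p95_ms": 140, "cost_per_txn": 0.12,
           "error_rate": 0.004, "adoption_dau": 640}
for miss in kpi_misses(actuals):
    print("KPI MISS:", miss)
```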

How the AI Implementation Consultant Handles Failures and Edge Cases

The happy path is the easy 20% of the work. Ask:

  1. Who owns edge cases after launch?
  2. What’s your escalation path when the model starts drifting?
  3. Walk me through your last incident, start to finish.

The Four Failure Modes Worth Asking About By Name

Ask the consultant how they’d handle each of these, by name. If they can’t, they haven’t run AI in production.

  1. Data drift: metric, tool, alert, playbook, SLA owner (see the sketch after this list).
  2. Latency spikes: benchmark, threshold, pager, rollback steps.
  3. Model regressions: test suite, alert, mitigation, retrain owner.
  4. Downstream errors: synthetic tests, alert routing, recovery runbook, postmortem owner.
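
To make the first item concrete, here’s a minimal sketch of one common drift check: the Population Stability Index (PSI) between a baseline scoring window and the current one. The 0.2 threshold is a widely used rule of thumb, and the data wiring is a stand-in; a real deployment would read stored production scores and page the named owner:

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.2 worth watching, > 0.2 investigate."""
    # Bin edges come from the baseline so both windows share one grid.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    curr_pct = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    # Clamp away zeros so the log term stays finite on empty bins.
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Stand-in data: launch-week model scores vs. this week's scores.
rng = np.random.default_rng(0)
launch_scores = rng.beta(2, 5, 10_000)
current_scores = rng.beta(3, 4, 10_000)

score = psi(launch_scores, current_scores)
if score > 0.2:  # assumed alert threshold; tune per system
    print(f"PSI {score:.2f} over threshold: alert the named drift owner")
```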

Data, Security, and Auth Requirements for Production

I’m CySA+ certified, a former patent attorney, and I’ve shipped systems into regulated environments for decades, so this section is where I get picky. Most AI consultants are not security people, and it shows the moment you push on auth.

Ask them to describe, on a whiteboard:

  • How identities are proven (OAuth, mTLS, short-lived tokens)
  • How permissions are enforced (RBAC, ABAC, least privilege)
  • Where secrets live and how they rotate
  • What happens to PII in prompts and logs
  • Which compliance regimes apply (SOC2, HIPAA, GDPR) and how they’re satisfied
  • How user management works: onboarding, offboarding, audit trails, and periodic access reviews

If they freeze on any of these, that’s your answer. You don’t need them to be a security firm. You need them to know enough not to leak your customer data into a vendor’s training set.
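
On the PII point specifically, the right pattern is to redact before the prompt ever leaves your network. Here’s a minimal sketch of the shape of it; the regex rules are for illustration only, since production systems typically pair patterns like these with a dedicated NER or DLP service:

```python
import re

# Illustrative patterns only; real deployments add NER-based detection
# and a maintained deny-list on top of simple rules like these.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace PII with typed placeholders before text leaves your network."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def safe_prompt(user_input: str) -> str:
    # Redaction happens here, before any vendor API call; the redacted
    # string is also the only version that should land in your logs.
    return redact(user_input)

print(safe_prompt("Reach me at jane@example.com or 555-123-4567."))
# -> "Reach me at [EMAIL] or [PHONE]."
```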

Contract Clauses and KPIs That Lock In Outcomes

This is where most buyers give up the leverage they spent the whole interview earning. Don’t.

Three things have to be in the contract or you don’t have one:

  1. Milestone-based payment. No lump sums on signature. Tie each payment to a deliverable with an acceptance test you wrote, not them. Require weekly progress dashboards with variance against timelines.
  2. KPIs with teeth. Latency, uptime, error rate, MTTR. Pick numbers, write them down, attach a remedy if they’re missed.
  3. A warranty period. 60 to 90 days minimum, with named support hours, after the system goes live. This is the difference between a vendor and a contractor who vanishes the moment the invoice clears.

If a consultant pushes back on any of these, ask why. The answer tells you whether you’re hiring a partner or a sales org.

The 20-Minute AI Consultant Interview

Seven questions. If they can’t answer them, walk away.

  1. Show me a production system you shipped 12+ months ago. Good answer: live URL, repo, and an owner you can call today.
  2. Walk me through your last 2 AM incident. Good answer: timeline, owner, fix, and the postmortem they wrote.
  3. What’s your P95 latency target and how do you hit it? Good answer: a number, and the architecture decisions that defend it.
  4. Who owns the system after handoff? Good answer: a name, a runbook, and a written maintenance plan.
  5. Show me a postmortem you wrote. Good answer: a real one. If they don’t have any, they haven’t shipped.
  6. How do you handle PII in prompts and logs? Good answer: redaction before the prompt leaves your network.
  7. What does your warranty period look like? Good answer: 60–90 days, named support hours, in writing.

20 minutes. You’ll know. If they dodge two questions, end the call. Production AI fails in deploys, retries, auth, and the 2 AM page — not in the model. Hire someone who has been on the hook for both.

A Short List of Questions That Work

Memorize these. Use them in order. Screenshot the card above and bring it to your next vendor call.

Seven questions. Twenty minutes. You’ll know.

Frequently Asked Questions

What should I ask an AI implementation consultant about long-term model maintenance?

AI systems aren’t like a CRM rollout. They decay. The model that hit 92% accuracy at launch will quietly slide to 78% six months later as your data shifts, your customers’ language shifts, and the underlying foundation model gets silently updated by the vendor. Somebody has to notice that, and somebody has to fix it.

You want a name, not a role. Ask: who on your team is monitoring drift after handoff? How often do they retrain or re-evaluate? What’s the threshold that triggers action? If the answer is “we’ll set up a dashboard and your team can watch it,” that’s not an answer — you don’t have anyone on staff who knows what they’re looking at. A serious consultant either stays on a retainer to own drift, or trains a specific person on your team to own it, by name, with a written runbook.

Should I pay an AI implementation consultant for a pilot before signing a full contract?

Demos use the consultant’s data, where the model has already been tuned to look brilliant. That tells you nothing. What you need is a paid pilot — small, scoped, two to four weeks — where they run their approach against a slice of *your* actual data and you measure the result against a baseline you defined.

A serious consultant will say yes and quote you a fixed price for it. A consultant who insists on a full engagement before showing you anything on your data is asking you to buy a car you’ve never driven. If the pilot fails, you’ve spent a fraction of the project budget and learned something valuable. If it succeeds, you have real numbers to put in the contract.

How do I keep an AI implementation consultant from locking me into one model vendor?

This is the question your CFO is right to ask, because the model layer is the most volatile cost in the stack. GPT-4 prices have dropped 80% in eighteen months. Claude, Gemini, and open-source models leapfrog each other every quarter. If switching models in your system requires a rewrite, you’re locked in by accident.

A good answer covers three things. First, the application code talks to a model abstraction layer, not directly to one vendor’s SDK — swapping providers is a config change, not a refactor. Second, your prompts, your evaluation suite, and your fine-tuning data live in *your* repo under *your* control, not buried in a vendor’s playground. Third, your embeddings and vector store are portable: if you decide to move from Pinecone to pgvector next year, the data comes with you. If any of those is fuzzy, you’re not buying an AI system — you’re renting a wrapper around someone else’s API and paying full price.
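
Here’s a minimal sketch of what “swapping providers is a config change” can look like in practice. The class names, provider keys, and model strings are illustrative, and the vendor SDK calls are stubbed; many teams use an existing routing library rather than hand-rolling this:

```python
from dataclasses import dataclass
from typing import Protocol

class ChatModel(Protocol):
    """The only model interface application code is allowed to import."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class OpenAIChat:
    model: str = "gpt-4o"  # illustrative model name
    def complete(self, prompt: str) -> str:
        # The OpenAI SDK call lives here and nowhere else (stubbed for the sketch).
        return f"[{self.model} reply to {prompt!r}]"

@dataclass
class AnthropicChat:
    model: str = "claude-3-5-sonnet"  # illustrative model name
    def complete(self, prompt: str) -> str:
        # The Anthropic SDK call lives here and nowhere else (stubbed for the sketch).
        return f"[{self.model} reply to {prompt!r}]"

PROVIDERS = {"openai": OpenAIChat, "anthropic": AnthropicChat}

def load_model(config: dict) -> ChatModel:
    """Swapping vendors is a config edit, not a refactor."""
    return PROVIDERS[config["provider"]]()

model = load_model({"provider": "anthropic"})
print(model.complete("summarize this ticket"))
```

The pattern itself matters less than where it lives: if this routing layer, the prompts, and the eval suite are in your repo, the vendor binding really is one line of config.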

What should an AI implementation consultant’s plan be when OpenAI or Anthropic goes down?

Every AI system you buy is actually a stack of other people’s systems, and the model vendor is the most fragile link. OpenAI has had multi-hour outages. Anthropic has deprecated model versions on 90 days’ notice. The major vendors quietly change tokenization, rate limits, and safety filters in ways that break production behavior overnight.

A good consultant hands you a one-page dependency map: every model and service the system touches, what its published SLA is, and what your system does when it’s unreachable or when its behavior changes. The right answer to “what happens when GPT-4 goes down at 10 AM on a Tuesday” isn’t “we wait.” It’s “we fall back to a secondary model with a documented quality delta,” or “we queue the request and retry with exponential backoff,” or at minimum, “we return a clean error to the user and page the on-call.” Any of those is a real answer. “That hasn’t happened to us yet” means it will happen to you first.
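
A minimal sketch of the two non-waiting behaviors described above: exponential backoff against the primary, then a documented fallback to a secondary model. The `primary` and `secondary` callables and the retry budget are assumptions to be tuned per system:

```python
import random
import time
from typing import Callable

def complete_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    secondary: Callable[[str], str],
    max_retries: int = 3,
) -> str:
    """Retry the primary model with exponential backoff, then fall back."""
    for attempt in range(max_retries):
        try:
            return primary(prompt)
        except Exception:
            # Backoff with jitter: roughly 1s, 2s, 4s between attempts.
            time.sleep(2 ** attempt + random.random())
    try:
        # Secondary model with a documented quality delta; tag the response
        # so downstream metrics know it came from the fallback path.
        return secondary(prompt)
    except Exception:
        # Both providers unreachable: surface a clean error and page on-call.
        raise RuntimeError("all model providers unavailable; page on-call")
```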

How do I make sure an AI implementation consultant doesn’t expose my customer data to model vendors?

This is the question that keeps your legal and compliance people awake, and it should. Every prompt you send to a model vendor is, by default, data you’ve handed to a third party. Without the right contracts and the right architecture, that data can end up in training sets, in logs you can’t see, and in jurisdictions you didn’t agree to.

A good consultant can tell you, on a whiteboard, exactly where your data goes from the moment a user types something to the moment a response comes back. They can name which vendor terms apply (most enterprise tiers contractually exclude your data from training — most free tiers do not). They’ve thought about PII redaction *before* the prompt leaves your network, not after. They can speak to the regulatory regimes that apply to you — HIPAA if you’re in healthcare, GLBA if you’re in financial services, GDPR if you have any EU customers at all — and how their architecture satisfies each one. If the answer is “we trust the vendor’s defaults,” walk away. The defaults are written by the vendor’s lawyers, not yours.

What To Do Next

If you’re about to hire an AI implementation consultant, run this checklist before the next call. If you’ve already hired someone and the answers are making you nervous, that’s useful too. The cost of catching a bad fit in week two is a fraction of the cost of catching it in month six.

I write about this kind of thing because I keep seeing the same projects fail the same way. If you want a second set of eyes on a vendor pitch or a contract before you sign it, get in touch.