Most AI projects don’t fail in the model. They fail in the boring stuff: deploys, retries, auth, logging, the 2 AM page. By then the AI implementation consultant is gone and you own a prototype that nobody can keep alive.
I’ve shipped code into electronic health records, DoD weapons systems, and cybersecurity monitoring systems for a fintech company. AI is the newest substrate, not a new discipline. The teams that ship reliable AI are the ones who already knew how to ship reliable anything: they write runbooks, they own incidents, they argue about latency budgets, they don’t hand you a demo and call it done. The teams that fail are the ones treating AI as a science project that happens to have users.
That’s the lens behind this checklist. The questions don’t test whether someone is an “AI expert.” They test whether someone has ever actually been on the hook for a system in production.
Key Takeaways
- Ask for one production system shipped 12+ months ago, with a live URL and an owner you can call. No demos, no POCs, no Fortune 500 name-drops.
- Demand the artifacts: repo, architecture diagram, runbook, CI/CD logs, monitoring dashboards, and at least one real postmortem.
- Make them commit to numbers in writing — baseline, target, and 90-day actuals for accuracy, latency, cost, error rate, and adoption.
- Push hard on security: auth, secrets, PII handling, and which compliance regimes apply. Most AI consultants fold here.
- Tie payment to milestones, KPIs to remedies, and require a 60–90 day warranty period. If they push back, you’re hiring a sales org.
The One Question That Filters Out 80% of Consultants

Ask this first: *“Show me a system you shipped 12+ months ago that’s still running, and put me on the phone with the person who owns it today.”*
Most candidates can’t answer it. They have demos, pilots, and “we built a POC for a Fortune 500.” None of that tells you whether they can keep a system alive once real traffic hits it.
If they can produce a live URL, an owner’s phone number, and a story about something that broke at 2 AM and how they fixed it, keep talking. If they pivot to slideware, end the call.
What Handoff Should Look Like
The contract gets you to launch. The handoff is what keeps the system alive after.
Ask for a written handoff plan covering three things: who owns the system day one after launch, how it gets maintained, and how your team gets trained to run it.
Demand defined channels for user feedback and measurable, written SLAs for scalability.
- Handover checklist with metrics, runbooks, and target RTO/RPO.
- 90- and 180-day post-delivery evaluation milestones and remediation plans.
- Maintenance strategies: patching, dependency updates, security reviews.
- Ongoing training cadence, documentation ownership, and user feedback loop.
Shipped Artifacts List

Shipping isn’t a deploy. Shipping is everything that has to be true for someone other than the original author to keep the system running. At minimum, demand:
- A repo you can read
- An architecture diagram that matches the repo
- A runbook a stranger could follow at 3 AM
- CI/CD logs and a documented rollback
- Monitoring dashboards with real numbers on them
- At least one postmortem from a real incident
That last one matters more than the others combined. A consultant who can’t show you a postmortem has either never had an incident (impossible) or never wrote one up (worse).
The Metrics Conversation
Vague answers here are the loudest red flag in the entire process. Good AI implementation consultants talk in numbers without being prompted. Bad ones talk in adjectives.
Ask for before-and-after baselines on whatever the system is supposed to improve. Here’s the format I use:
| Metric | Baseline | Target |
|---|---|---|
| Accuracy / F1 | 0.62 | 0.85 |
| Latency P95 | 480 ms | 120 ms |
| Cost per transaction | $0.45 | $0.15 |
| Error Rate | 4.0% | 0.5% |
| Adoption (DAU) | 120 | 600 |
If they can’t fill in the left two columns during the sales call, they don’t know your problem yet. If they refuse to commit to the third column in writing, they don’t believe their own numbers.
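You can walk into that call with the baseline column already filled in from your own request logs. Here is a minimal sketch; the log field names (`latency_ms`, `ok`) are assumptions, so map them to whatever your log schema actually uses:

```python
def baseline_metrics(requests):
    """Compute baseline latency P95 and error rate from request logs.

    requests: list of dicts like {"latency_ms": 480, "ok": True}
    (field names are illustrative, not a standard schema).
    """
    latencies = sorted(r["latency_ms"] for r in requests)
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    errors = sum(1 for r in requests if not r["ok"])
    return {
        "latency_p95_ms": latencies[p95_index],
        "error_rate": errors / len(requests),
    }

# Synthetic sample: 100 requests, 4 failures
logs = [{"latency_ms": 100 + i * 5, "ok": i % 25 != 0} for i in range(100)]
```

Run something like this over a representative week of traffic and you own the left two columns before the consultant opens their deck.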
How the AI Implementation Consultant Handles Failures and Edge Cases

The happy path is the easy 20% of the work. Ask:
- Who owns edge cases after launch?
- What’s your escalation path when the model starts drifting?
- Walk me through your last incident, start to finish.
The Four Failure Modes Worth Asking About By Name
Ask the consultant how they’d handle each of these by name. If they can’t, they haven’t run AI in production:
- Data drift: metric, tool, alert, playbook, SLA owner.
- Latency spikes: benchmark, threshold, pager, rollback steps.
- Model regressions: test suite, alert, mitigation, retrain owner.
- Downstream errors: synthetic tests, alert routing, recovery runbook, postmortem owner.
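A concrete drift check makes the first bullet testable. Below is a minimal sketch of the Population Stability Index (PSI), a common drift metric; the 0.2 alert threshold is a widely used rule of thumb, not necessarily what the consultant's stack uses:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            i = min(max(int((v - lo) / step), 0), bins - 1)  # clamp outliers
            counts[i] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    return sum((a - e) * math.log(a / e)
               for e, a in zip(fractions(expected), fractions(actual)))
```

Wire a check like this to a scheduled job, alert when it crosses the threshold, and put a name on the alert route; that is the "metric, tool, alert, playbook, SLA owner" line made concrete.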
Data, Security, and Auth Requirements for Production

I hold a CySA+ certification, I’m a former patent attorney, and I’ve shipped systems into regulated environments for decades, so this section is where I get picky. Most AI consultants are not security people, and it shows the moment you push on auth.
Ask them to describe, on a whiteboard:
- How identities are proven (OAuth, mTLS, short-lived tokens)
- How permissions are enforced (RBAC, ABAC, least privilege)
- Where secrets live and how they rotate
- What happens to PII in prompts and logs
- Which compliance regimes apply (SOC2, HIPAA, GDPR) and how they’re satisfied
- User management workflows: onboarding, offboarding, audit trails, and periodic access reviews
If they freeze on any of these, that’s your answer. You don’t need them to be a security firm. You need them to know enough not to leak your customer data into a vendor’s training set.
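The permissions question in particular has a simple litmus test: ask them to sketch enforcement in code. A minimal RBAC sketch follows; every role and permission name here is hypothetical:

```python
from functools import wraps

# Hypothetical role -> permission mapping; real systems load this from policy
PERMISSIONS = {
    "analyst": {"read_reports"},
    "admin": {"read_reports", "manage_users", "rotate_secrets"},
}

class Forbidden(Exception):
    pass

def requires(permission):
    """Deny by default: an unknown role or missing permission raises, never passes."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(user, *args, **kwargs):
            if permission not in PERMISSIONS.get(user["role"], set()):
                raise Forbidden(f"{user['name']} lacks {permission}")
            return fn(user, *args, **kwargs)
        return wrapper
    return decorator

@requires("rotate_secrets")
def rotate_api_key(user):
    return "rotated"
```

The point isn't the fifteen lines; it's whether the consultant reaches for deny-by-default instinctively or starts talking about adding checks later.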
Contract Clauses and KPIs That Lock In Outcomes

This is where most buyers give up the leverage they spent the whole interview earning. Don’t.
Three things have to be in the contract or you don’t have one:
- Milestone-based payment. No lump sums on signature. Tie each payment to a deliverable with an acceptance test you wrote, not them. Require weekly progress dashboards with variance against timelines.
- KPIs with teeth. Latency, uptime, error rate, MTTR. Pick numbers, write them down, attach a remedy if they’re missed.
- A warranty period. 60 to 90 days minimum, with named support hours, after the system goes live. This is the difference between a vendor and a contractor who vanishes the moment the invoice clears.
If a consultant pushes back on any of these, ask why. The answer tells you whether you’re hiring a partner or a sales org.
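An acceptance test you wrote can be as small as a gate each milestone payment runs through. A hypothetical sketch, with KPI numbers borrowed from the sample metrics table earlier; substitute your contract's figures:

```python
# Contract KPI thresholds (illustrative; take these from the signed contract)
CONTRACT_KPIS = {"latency_p95_ms": 120, "error_rate": 0.005, "uptime": 0.999}

def acceptance_check(measured):
    """Return the list of KPI breaches; an empty list releases the milestone."""
    breaches = []
    if measured["latency_p95_ms"] > CONTRACT_KPIS["latency_p95_ms"]:
        breaches.append("latency_p95_ms")
    if measured["error_rate"] > CONTRACT_KPIS["error_rate"]:
        breaches.append("error_rate")
    if measured["uptime"] < CONTRACT_KPIS["uptime"]:
        breaches.append("uptime")
    return breaches
```

The remedy clause then attaches to the breach list, not to an argument about adjectives.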
A Short List of Questions That Work
Memorize these and use them in order on your next vendor call.
Seven questions. Twenty minutes. You’ll know.
Frequently Asked Questions
What should I ask an AI implementation consultant about long-term model maintenance?
AI systems aren’t like a CRM rollout. They decay. The model that hit 92% accuracy at launch will quietly slide to 78% six months later as your data shifts, your customers’ language shifts, and the underlying foundation model gets silently updated by the vendor. Somebody has to notice that, and somebody has to fix it.
You want a name, not a role. Ask: who on your team is monitoring drift after handoff? How often do they retrain or re-evaluate? What’s the threshold that triggers action? If the answer is “we’ll set up a dashboard and your team can watch it,” that’s not an answer — you don’t have anyone on staff who knows what they’re looking at. A serious consultant either stays on a retainer to own drift, or trains a specific person on your team to own it, by name, with a written runbook.
Should I pay an AI implementation consultant for a pilot before signing a full contract?
Demos use the consultant’s data, where the model has already been tuned to look brilliant. That tells you nothing. What you need is a paid pilot — small, scoped, two to four weeks — where they run their approach against a slice of *your* actual data and you measure the result against a baseline you defined.
A serious consultant will say yes and quote you a fixed price for it. A consultant who insists on a full engagement before showing you anything on your data is asking you to buy a car you’ve never driven. If the pilot fails, you’ve spent a fraction of the project budget and learned something valuable. If it succeeds, you have real numbers to put in the contract.
How do I keep an AI implementation consultant from locking me into one model vendor?
This is the question your CFO is right to ask, because the model layer is the most volatile cost in the stack. GPT-4 prices have dropped 80% in eighteen months. Claude, Gemini, and open-source models leapfrog each other every quarter. If switching models in your system requires a rewrite, you’re locked in by accident.
A good answer covers three things. First, the application code talks to a model abstraction layer, not directly to one vendor’s SDK — swapping providers is a config change, not a refactor. Second, your prompts, your evaluation suite, and your fine-tuning data live in *your* repo under *your* control, not buried in a vendor’s playground. Third, your embeddings and vector store are portable: if you decide to move from Pinecone to pgvector next year, the data comes with you. If any of those is fuzzy, you’re not buying an AI system — you’re renting a wrapper around someone else’s API and paying full price.
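The first point, the abstraction layer, is easy to verify in the repo. The shape to look for is something like this sketch; the provider names and client classes are stand-ins, not real vendor SDKs:

```python
from typing import Protocol

class ModelProvider(Protocol):
    """The only model interface application code is allowed to import."""
    def complete(self, prompt: str) -> str: ...

class PrimaryClient:    # stand-in for vendor A's SDK wrapper
    def complete(self, prompt: str) -> str:
        return f"primary: {prompt}"

class SecondaryClient:  # stand-in for vendor B's SDK wrapper
    def complete(self, prompt: str) -> str:
        return f"secondary: {prompt}"

PROVIDERS = {"primary": PrimaryClient, "secondary": SecondaryClient}

def get_provider(name: str) -> ModelProvider:
    # Swapping vendors means flipping this name in config, not editing call sites.
    return PROVIDERS[name]()
```

If grepping for the vendor SDK's import returns hits all over the application code instead of inside one wrapper module, you're looking at accidental lock-in.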
What should an AI implementation consultant’s plan be when OpenAI or Anthropic goes down?
Every AI system you buy is actually a stack of other people’s systems, and the model vendor is the most fragile link. OpenAI has had multi-hour outages. Anthropic has deprecated model versions on 90 days’ notice. The major vendors quietly change tokenization, rate limits, and safety filters in ways that break production behavior overnight.
A good consultant hands you a one-page dependency map: every model and service the system touches, what its published SLA is, and what your system does when it’s unreachable or when its behavior changes. The right answer to “what happens when GPT-4 goes down at 10 AM on a Tuesday” isn’t “we wait.” It’s “we fall back to a secondary model with a documented quality delta,” or “we queue the request and retry with exponential backoff,” or at minimum, “we return a clean error to the user and page the on-call.” Any of those is a real answer. “That hasn’t happened to us yet” means it will happen to you first.
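The "retry with exponential backoff, then fall back" answer has a small, checkable shape. A minimal sketch, with illustrative retry counts and timings:

```python
import time

def call_with_fallback(primary, fallback, prompt, retries=3, base_delay=0.5):
    """Try the primary model with exponential backoff, then fall back.

    primary/fallback are callables taking a prompt and returning a string;
    the fallback's documented quality delta applies to whatever it returns.
    """
    for attempt in range(retries):
        try:
            return primary(prompt)
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return fallback(prompt)
```

In production you would also page the on-call on the first fallback and cap the total wait; the sketch shows only the control flow you should expect to find in the repo.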
How do I make sure an AI implementation consultant doesn’t expose my customer data to model vendors?
This is the question that keeps your legal and compliance people awake, and it should. Every prompt you send to a model vendor is, by default, data you’ve handed to a third party. Without the right contracts and the right architecture, that data can end up in training sets, in logs you can’t see, and in jurisdictions you didn’t agree to.
A good consultant can tell you, on a whiteboard, exactly where your data goes from the moment a user types something to the moment a response comes back. They can name which vendor terms apply (most enterprise tiers contractually exclude your data from training — most free tiers do not). They’ve thought about PII redaction *before* the prompt leaves your network, not after. They can speak to the regulatory regimes that apply to you — HIPAA if you’re in healthcare, GLBA if you’re in financial services, GDPR if you have any EU customers at all — and how their architecture satisfies each one. If the answer is “we trust the vendor’s defaults,” walk away. The defaults are written by the vendor’s lawyers, not yours.
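Redaction before the prompt leaves your network is also something you can ask to see. A deliberately minimal sketch; real deployments use far more robust detectors, and these regexes are illustrative, not exhaustive:

```python
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),       # US SSN shape
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),   # 13-16 digit card number
]

def redact(text: str) -> str:
    """Replace PII shapes with placeholder tokens before the prompt is sent."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

The interesting question for the consultant isn't the regexes; it's where this runs (before the network boundary) and whether your logs see the redacted version too.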
What To Do Next
If you’re about to hire an AI implementation consultant, run this checklist before the next call. If you’ve already hired someone and the answers are making you nervous, that’s useful too. The cost of catching a bad fit in week two is a fraction of the cost of catching it in month six.
I write about this kind of thing because I keep seeing the same projects fail the same way. If you want a second set of eyes on a vendor pitch or a contract before you sign it, get in touch.
