Which AI models can I run in my own tenant?

Whisper-class models and NVIDIA Parakeet for speech-to-text, and open-weight LLMs such as Llama 3.x, Mistral and Qwen for sentiment, intent, QA scoring and summaries. You can also bring your own fine-tuned model or plug in your existing inference stack, so there is no lock-in to a model vendor.

Can Arkivo's AI run fully air-gapped?

Yes. The entire enrichment pipeline can run with zero outbound calls inside a private or air-gapped environment, on either CPU or GPU. This is a common fit for government, defense and highly regulated operations that require full network isolation.

What can Arkivo's private AI actually do?

Speech-to-text with speaker diarization, per-call sentiment and emotion, automated QA scoring against your scorecards, rating and ranking of recordings to surface coaching and compliance risk, automatic PII/PHI redaction, topic and intent detection, and call summaries — all in your tenant, across 100% of calls rather than a small sample.

Do I need GPUs to run it?

Not necessarily. Smaller ASR and LLM configurations run on CPU, while single-GPU and multi-GPU setups raise throughput for large historical backfills or near-real-time enrichment. The model menu lets you size hardware to your call volume and accuracy needs.

Private AI

Private, in-tenant AI — your data never leaves

Does Arkivo send my recordings or PHI to an external AI model?

Never. All AI — transcription, sentiment, automated QA scoring, redaction and summaries — runs on models deployed inside your own environment. No recording, transcript or prompt is ever transmitted to a third-party or proprietary model provider for inference, logging or training, so your PHI and PII stay entirely within your compliance boundary.

Transcription, sentiment, automated QA scoring, and call rating and ranking run on your own models inside your environment. No PHI or PII egress, no data sniffing, and nothing is ever used to train a model vendor's system.

Private AI · your models, your data

AI that never sees the outside world.

Transcription, sentiment, and automated QA run on models deployed inside your own environment. PHI and PII never leave your tenant and are never sent to a proprietary model provider — no data sniffing, no training on your conversations, no exposure.

Your data never leaves your boundary. Every AI step — speech-to-text, sentiment, QA scoring, redaction — executes on your own infrastructure. No recording, transcript, or prompt is ever transmitted to an external LLM for inference, logging, or training.

Private transcription

Speech-to-text with speaker diarization, running on open ASR models (Whisper-class) or your own — entirely inside your environment.

Sentiment & emotion

Per-turn and per-call sentiment scoring to power QA, escalation, and trend analysis — the audio never leaves your tenant.

Automated QA scoring

Score every interaction against your scorecards and rubrics automatically — consistent evaluation across 100% of calls, not a 2% sample.
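As a rough illustration of scoring a call against a scorecard, the sketch below computes a weighted pass rate. It assumes the per-criterion pass/fail judgments have already been produced by the in-tenant model; the `Criterion` type, the criterion names, and the weights are hypothetical and not part of Arkivo's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance within the scorecard (hypothetical)


def score_call(criteria: list[Criterion], results: dict[str, bool]) -> float:
    """Return a 0-100 score: the weighted share of criteria the call passed."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("scorecard needs a positive total weight")
    # A criterion missing from the results counts as not passed.
    earned = sum(c.weight for c in criteria if results.get(c.name, False))
    return round(100.0 * earned / total, 1)


# Hypothetical scorecard and model judgments for one call.
scorecard = [
    Criterion("greeting", 1.0),
    Criterion("identity_verified", 3.0),
    Criterion("disclosure_read", 2.0),
]
passed = {"greeting": True, "identity_verified": True, "disclosure_read": False}
print(score_call(scorecard, passed))  # 66.7
```

Because the same rubric and weights apply to every call, two calls with the same judgments always receive the same score, which is the consistency the paragraph above describes.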

Rating, ranking & prioritization

Auto-rank the archive to surface best and worst calls, coaching moments, and compliance risk — so reviewers see what matters first.
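One way to picture the ranking step is a single review-priority value per call, sorted in descending order. The blend below is an assumption made for illustration: the field names, the weights, and the cap on compliance flags are hypothetical, not Arkivo's ranking formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallScore:
    call_id: str
    qa_score: float        # 0-100, higher is better
    sentiment: float       # -1.0 (negative) to 1.0 (positive)
    compliance_flags: int  # number of flagged compliance issues


def review_priority(c: CallScore) -> float:
    """Higher value means a reviewer should see the call sooner."""
    qa_risk = (100.0 - c.qa_score) / 100.0          # scaled to 0..1
    mood_risk = (1.0 - c.sentiment) / 2.0           # scaled to 0..1
    flag_risk = min(c.compliance_flags, 3) / 3.0    # capped, scaled to 0..1
    # Hypothetical weights; tune them to your own review policy.
    return 0.5 * qa_risk + 0.2 * mood_risk + 0.3 * flag_risk


def rank_for_review(calls: list[CallScore]) -> list[CallScore]:
    return sorted(calls, key=review_priority, reverse=True)


calls = [
    CallScore("a", qa_score=95, sentiment=0.8, compliance_flags=0),
    CallScore("b", qa_score=40, sentiment=-0.6, compliance_flags=2),
    CallScore("c", qa_score=70, sentiment=0.1, compliance_flags=0),
]
print([c.call_id for c in rank_for_review(calls)])  # ['b', 'c', 'a']
```

Reversing the sort order surfaces the best calls instead, which is useful for collecting coaching examples.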

Automatic PII / PHI redaction

Detect and mask sensitive spans — card numbers, member IDs, health data — before storage or playback. Privacy by default.
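To make "detect and mask sensitive spans" concrete, here is a minimal sketch for one span type, payment card numbers. It uses a regular expression plus the Luhn checksum as a stand-in for the detection step; this is an assumption for illustration, not a description of how Arkivo detects spans, and the mask string is hypothetical.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to discard digit runs that are not card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str, mask: str = "[REDACTED-CARD]") -> str:
    """Replace Luhn-valid card numbers with a mask; leave other digit runs alone."""
    def _sub(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return mask if luhn_ok(digits) else match.group()

    return CARD_RE.sub(_sub, text)


print(redact_cards("My card is 4111 1111 1111 1111, order 1234567890123."))
# My card is [REDACTED-CARD], order 1234567890123.
```

Running the mask before the transcript is written means the stored text never contains the raw number. Member IDs and health data have no checksum to lean on, so they would need a different detector.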

Topic & intent detection

Classify interactions, detect intents and categories, and auto-tag the archive for faster, smarter search.

Summaries & coaching insights

Generate call summaries and agent-coaching feedback to shorten QA reviews and accelerate improvement.

Bring your own model

Run open-weight LLMs (Llama, Mistral, and more), your fine-tuned models, or your existing inference stack. No lock-in to a model vendor.

Proprietary cloud AI

Your PHI/PII is sent to a third-party model provider
Conversations may be retained or used to improve their models
Per-token data egress on every transcript and prompt
Compliance scope expands to every vendor in the path

Arkivo private AI

Models run inside your VPC, on-prem, or fully air-gapped
Recordings, transcripts, and prompts never leave your tenant
Never used to train anyone's model — yours or a provider's
One compliance boundary: your environment

Deploys alongside Arkivo in your cloud, on-prem, or air-gapped · GPU-accelerated or CPU · scales with your volume

Inference architecture

Every step runs inside your boundary

Audio, transcripts, and prompts move only between services you operate — from object storage to the search index, without a single outbound call.

Audio in object storage

Recordings sit in your own bucket or container — the same tenant-owned storage Arkivo migrates and indexes into.

In-tenant ASR

Whisper-class, NVIDIA Parakeet, or your own speech model transcribes with diarization — inside your environment.

In-tenant LLM

Llama 3.x, Mistral, Qwen, or your fine-tuned model reads each transcript on infrastructure you control.

Enrichment

Sentiment, intent, redaction spans, QA scores, and summaries are generated per call — no transcript leaves the boundary.

Metadata & search index

Structured results land in the Arkivo index, making the whole archive searchable, filterable, and reportable.
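The stages above can be sketched as a small pipeline in which every stage is a callable you supply. This is an illustrative outline only: the `Pipeline` and `EnrichedCall` names, the field layout, and the in-memory dictionary standing in for the search index are all assumptions, and the lambdas are stubs in place of real storage, ASR, and LLM services.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EnrichedCall:
    call_id: str
    transcript: str
    sentiment: str
    summary: str


@dataclass
class Pipeline:
    fetch_audio: Callable[[str], bytes]  # reads from tenant-owned object storage
    transcribe: Callable[[bytes], str]   # in-tenant ASR
    enrich: Callable[[str], dict]        # in-tenant LLM
    index: dict = field(default_factory=dict)  # stand-in for the search index

    def process(self, call_id: str) -> EnrichedCall:
        audio = self.fetch_audio(call_id)
        transcript = self.transcribe(audio)
        fields = self.enrich(transcript)
        record = EnrichedCall(call_id, transcript, fields["sentiment"], fields["summary"])
        self.index[call_id] = record  # structured result lands in the index
        return record


# Stub stages so the sketch runs without any model or storage service.
pipeline = Pipeline(
    fetch_audio=lambda call_id: b"fake-audio",
    transcribe=lambda audio: "agent: hello. caller: my bill is wrong.",
    enrich=lambda text: {"sentiment": "negative", "summary": text[:20]},
)
record = pipeline.process("call-001")
print(record.sentiment, "call-001" in pipeline.index)  # negative True
```

Because each stage is just a callable, swapping in a different speech model or a fine-tuned LLM changes one argument and leaves the rest of the flow untouched.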

Air-gapped path: zero outbound calls. The entire pipeline can run with no internet route at all. Every step above stays inside your network; nothing dials home for licensing, telemetry, or inference.
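A simple pre-flight check can help back up a "zero outbound calls" claim during deployment review. The sketch below flags any configured endpoint whose host is not a literal private or loopback IP address. The configuration keys and URLs are hypothetical, and a check like this complements network-level egress controls rather than replacing them.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal(url: str) -> bool:
    """True only if the URL's host is a literal private or loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # DNS names cannot be verified offline, so this sketch treats them as external.
        return False
    return addr.is_private or addr.is_loopback


def external_endpoints(config: dict[str, str]) -> list[str]:
    """Return the names of endpoints that are not provably inside the network."""
    return sorted(name for name, url in config.items() if not is_internal(url))


# Hypothetical service configuration.
config = {
    "asr": "http://10.0.0.12:9000",
    "llm": "http://192.168.1.20:8000/v1",
    "telemetry": "https://api.example.com/v1",
}
print(external_endpoints(config))  # ['telemetry']
```

An empty list is the result you would want to see before enabling the pipeline in an isolated environment.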

GPU and CPU paths both supported. Run on GPUs for maximum throughput, or on CPU-only nodes where accelerators aren't available, using the model configurations in the menu below that support CPU. The boundary is the same either way, and the deployment is sized to your hardware.

Model menu & sizing

Pick the models that fit your hardware

Mix and match ASR and LLM families across CPU, single-GPU, or multi-GPU nodes — or drop in your own fine-tuned weights. Throughput figures are indicative and vary with audio length, hardware, and batching.

Model family	Task	Hardware	Throughput (indicative)	Languages	Accuracy tier
Whisper large-v3	ASR	CPU / single-GPU	~120 calls/hr	Multilingual (99+)	High
NVIDIA Parakeet	ASR	Single-GPU	~600 calls/hr	English-focused	Highest (EN)
Llama 3.x 8B	LLM	CPU / single-GPU	~400 calls/hr	Multilingual	Strong
Llama 3.x 70B	LLM	Multi-GPU	~90 calls/hr	Multilingual	Highest
Mistral	LLM	CPU / single-GPU	~450 calls/hr	Multilingual (EU)	Strong
Qwen	LLM	Single-GPU	~350 calls/hr	Multilingual (CJK)	Strong
Your fine-tuned model	LLM	CPU / single-GPU / multi-GPU	Depends on size	Your domain & languages	Tuned to you

Throughput is indicative only — measured in calls per hour and dependent on average call duration, node specification, and concurrency. Bring your own model to set your own profile.
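The indicative calls-per-hour figures can be turned into a first-pass node count for a backfill. The sketch below assumes throughput scales linearly with the number of nodes, which real deployments only approximate; the archive size of 1,000,000 calls and the 30-day window are hypothetical inputs, while 120 and 600 calls/hr are the indicative figures from the table.

```python
import math


def nodes_needed(total_calls: int, calls_per_hour: float, window_hours: float) -> int:
    """Smallest node count that finishes the backfill inside the window."""
    if calls_per_hour <= 0 or window_hours <= 0:
        raise ValueError("throughput and window must be positive")
    return math.ceil(total_calls / (calls_per_hour * window_hours))


window = 30 * 24  # a 30-day window, in hours (hypothetical)
print(nodes_needed(1_000_000, 120, window))  # 12 nodes at ~120 calls/hr
print(nodes_needed(1_000_000, 600, window))  # 3 nodes at ~600 calls/hr
```

Treat the result as a starting point for a pilot run on your own audio, since average call duration and concurrency shift the real figure.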

No token egress · no training on your data.

Audio, transcripts, and prompts never cross the tenant boundary, and your conversations are never used to train any vendor model — ours or a third party's. This is backed by a short, signed technical commitment and data-handling appendix you can attach to your contract and hand to your auditors.

Processing modes

Backfill the past, keep up with the present

The same in-tenant models run in two modes — a full historical backfill of the migrated archive, and continuous enrichment of new recordings as they arrive.

Batch

Backfill the whole archive

Run enrichment across 100% of the historical recordings you migrate — not a sampled slice — so every legacy call carries transcripts, sentiment, and QA scores.

Processes the full migrated corpus on your schedule
Scales across GPU or CPU workers to fit your window
Resumable, checkpointed jobs with live progress
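The idea behind resumable, checkpointed jobs can be shown with a small sketch: record each finished call in a checkpoint file, and skip recorded calls on the next run. The JSON file format and the `run_backfill` name are assumptions for illustration, not Arkivo's job format, and rewriting the file after every call is chosen here for simplicity rather than speed.

```python
import json
from pathlib import Path
from typing import Callable


def run_backfill(call_ids: list[str], process: Callable[[str], None], checkpoint: Path) -> int:
    """Process each call once; calls already in the checkpoint are skipped."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed = 0
    for call_id in call_ids:
        if call_id in done:
            continue
        process(call_id)  # if this raises, the checkpoint still lists earlier calls
        done.add(call_id)
        checkpoint.write_text(json.dumps(sorted(done)))
        processed += 1
        print(f"{len(done)}/{len(call_ids)} calls enriched")  # simple live progress
    return processed
```

If a worker stops partway through, calling `run_backfill` again with the same checkpoint path continues from the first unprocessed call instead of starting over.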

Near-real-time

Enrich new syncs as they land

As fresh recordings sync from your connectors, they flow through the same in-tenant pipeline within minutes — keeping the index current without a manual rerun.

Triggered automatically on each new sync
Same models, same boundary as the batch path
New calls are searchable and scored shortly after capture

Pay for compute you own, not per-minute AI SaaS fees.

Running models in your own tenant turns AI from a metered line item into fixed infrastructure — so enriching a decade of recordings doesn't come with a usage-based bill.

Capex, not metered SaaS

Spend on GPUs or CPU capacity you own and amortize — no per-minute or per-token invoice that scales with every transcript.

Reuse idle capacity

Schedule batch enrichment on hardware you already run, filling spare cycles instead of renting someone else's inference.

Predictable at archive scale

Reprocessing millions of historical calls costs compute time, not a usage bill — so a full backfill never triggers a surprise overage.

Questions, answered

Private AI FAQ

Your cloud · Your keys · Your data

Own your recordings. Keep the experience.

See the control plane live in minutes, or talk to us about migrating off NICE or Genesys into the cloud you already trust. No rip-and-replace, no lost calls.

Launch the live app

No data migration required to evaluate · Your cloud, your keys, your data