An AI agent is an autonomous program that executes tasks without constant human involvement: it analyzes data, makes decisions, calls APIs, and runs other tools. To run, it needs infrastructure. What kind depends on what the agent does and how intensively it does it.
Quick Answer: What Infrastructure You Need
| Your agent / scenario | Minimum infrastructure | Approx. cost/mo |
| --- | --- | --- |
| Orchestration agent (LangChain, AutoGPT) without local model | VPS 2-4 vCPU / 4-8 GB RAM | $20-60 |
| Agent with local model up to 7B (llama.cpp, Ollama) | VPS 4-8 vCPU / 16-32 GB RAM or 1x RTX 4090 | $60-450 |
| Agent with local 13B-70B model | Dedicated GPU: 1-4x A100 | $600-5,000+ |
| RAG agent (document search + LLM API) | VPS 4 vCPU / 8 GB RAM + vector DB | $30-100 |
| Multi-agent pipeline (several agents in parallel) | VPS 8-16 vCPU / 16-32 GB RAM | $80-200 |
| Browser automation agent (Playwright, Selenium) | VPS 4 vCPU / 8 GB RAM + headless Chromium | $30-80 |
| Production AI agent with 1,000+ tasks/day | Dedicated server or AI hosting | $200-2,000+ |
The key split: if the agent calls external LLM APIs (OpenAI, Anthropic, Gemini) – it only needs CPU and RAM for orchestration. If the agent runs a model locally – it needs a GPU or a powerful CPU with large RAM.
What Is AI Agent Hosting
AI agent hosting is providing server infrastructure to run AI agents continuously or on demand. Unlike standard web hosting, AI agents have specific requirements: long-running processes (an agent can work on a task for hours), large RAM for model context, the ability to call external APIs, and the ability to maintain state between runs.
What distinguishes an AI agent from a regular application:
- Execution duration – an agent task can take from seconds to hours, unlike an HTTP request that must respond in 100-500ms
- State between runs – the agent stores memory, conversation context, and results from previous steps
- Dynamic resource consumption – during inference, CPU/RAM peak is far higher than during idle waiting
- Tool calls – the agent invokes external APIs, databases, browsers, and code interpreters
- Parallelism – multi-agent systems run multiple agents simultaneously
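These properties boil down to a simple loop: load persisted state, decide, act, save state. A minimal sketch in Python (the `llm` callable and the `agent_state.json` path are illustrative stand-ins, not any specific framework's API):

```python
import json
import os

STATE_FILE = "agent_state.json"  # hypothetical path for persisted memory

def load_state():
    # Unlike a stateless HTTP handler, an agent keeps memory between runs.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"history": []}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def run_task(task, llm=None):
    """One agent iteration: recall context, decide, act, persist."""
    state = load_state()
    context = state["history"][-5:]  # short-term memory window
    # `llm` stands in for a model call; echo keeps the sketch runnable.
    decision = llm(task, context) if llm else f"echo:{task}"
    state["history"].append({"task": task, "result": decision})
    save_state(state)
    return decision
```

Because the loop may run for minutes or hours per task, the hosting requirement is a process that survives that long, not a fast request/response cycle.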
How It Works
A typical AI agent consists of several layers, each with its own infrastructure requirements.
Model layer (LLM)
The brain of the agent is a language model. There are two options: an API call (OpenAI GPT-4, Anthropic Claude, Google Gemini) or a local model (Llama, Mistral, Qwen). An API call only requires a network connection and doesn’t load the server. A local model requires GPU or a powerful CPU plus large RAM. The choice is a tradeoff between cost (APIs become expensive at high volume), privacy (local model means data never leaves the server), and performance.
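One practical consequence of this tradeoff: both options can sit behind the same OpenAI-compatible HTTP interface (Ollama exposes one locally), so switching between a hosted API and a local model is largely a matter of changing the base URL. A stdlib-only sketch (the URLs, model names, and API key are placeholders):

```python
import json
import urllib.request

def build_payload(prompt, model):
    """Chat-completions request body shared by hosted and local endpoints."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base_url, api_key="", model="gpt-4o"):
    """Send one chat request to any OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Hosted API: needs only network access, no GPU on the server.
# chat("Summarize this ticket", "https://api.openai.com/v1", api_key="sk-...")

# Local model via Ollama's OpenAI-compatible endpoint: data stays on the server.
# chat("Summarize this ticket", "http://localhost:11434/v1", model="llama3.1")
```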
Orchestration layer
The agent framework (LangChain, LlamaIndex, AutoGen, CrewAI, n8n) coordinates model calls, tool invocations, and state storage. The orchestrator is a relatively lightweight Python/Node.js process. Its main requirement is stable 24/7 operation or on-demand launch without cold start delays. VPS is sufficient for most orchestrators.
Memory and storage layer
The agent stores state in several places: a vector database (Chroma, Qdrant, Weaviate, Pinecone) for semantic document search; a relational database (PostgreSQL) for structured data and metadata; Redis for short-term memory and caching; and file storage for artifacts (documents, images, outputs).
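Under the hood, a vector database answers one core query: which stored embeddings are closest to the query embedding. A toy pure-Python illustration of that operation (real deployments use Qdrant, Chroma, etc., with 384-3,072-dimensional embeddings and approximate-nearest-neighbor indexes, which is where the RAM and disk budgets go):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """Return the k most similar stored items -- the query a vector DB serves."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in scored[:k]]

# Toy 3-dimensional "embeddings"; real ones are hundreds of dimensions wide.
store = [
    {"text": "refund policy", "vec": [0.9, 0.1, 0.0]},
    {"text": "shipping times", "vec": [0.1, 0.9, 0.0]},
    {"text": "return window", "vec": [0.8, 0.2, 0.1]},
]
```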
Tools layer
The agent can run: a browser (Playwright, Selenium) for web browsing and scraping; a code interpreter (Python sandbox) for computation; external APIs (calendar, email, CRM, databases); and shell commands for system automation. Each tool has its own resource requirements – especially headless browsers (100-500 MB RAM per session).
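Of these tools, the code interpreter is the simplest to sketch: untrusted code runs in a separate process with a hard timeout. This is a minimal sketch only; production sandboxes add container-level memory/CPU limits, network isolation, and a read-only filesystem on top:

```python
import subprocess
import sys

def run_sandboxed(code, timeout=5):
    """Run untrusted Python in a separate process with a hard timeout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env/site
        capture_output=True,
        text=True,
        timeout=timeout,  # raises TimeoutExpired if the code hangs
    )
    return proc.stdout, proc.returncode
```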
Infrastructure Requirements
| Agent component | CPU | RAM | GPU | Disk |
| --- | --- | --- | --- | --- |
| Orchestrator (no local model) | 2-4 vCPU | 2-4 GB | Not needed | 10-50 GB SSD |
| Local 7B model (CPU inference) | 8-16 vCPU | 16-32 GB | Not needed | 20 GB NVMe |
| Local 7B model (GPU inference) | 4-8 vCPU | 16 GB | 1x RTX 4090 (24 GB) | 20 GB NVMe |
| Vector DB (Qdrant/Chroma) | 2-4 vCPU | 4-16 GB | Not needed | 50-500 GB NVMe |
| Headless browser (Playwright) | 2-4 vCPU / browser | 1-2 GB / browser | Not needed | 10 GB SSD |
| Python sandbox (code interpreter) | 2-4 vCPU | 2-8 GB | Not needed | 10 GB SSD |
| Full stack (orchestrator + RAG + browser) | 8-16 vCPU | 16-32 GB | Optional | 100+ GB NVMe |
Practical advice: start with a minimal configuration and monitor real consumption. AI agents have very uneven load – a peak during inference and near-zero consumption while waiting. Vertical VPS scaling after launch is a simpler strategy than over-provisioning from the start.
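One stdlib-only way to capture real consumption is to log the process's peak memory after each task and size the VPS from observed peaks rather than guesses (this sketch assumes a Linux VPS, where `ru_maxrss` is reported in kilobytes; on macOS it is bytes):

```python
import resource

def peak_rss_mb():
    """Peak resident memory of this process so far, in MB (Linux semantics)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Typical use: run a task, then log peak_rss_mb() to see the real inference
# spike, not the idle baseline, before deciding whether to scale the VPS.
```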
VPS vs Dedicated for AI Agents
Scenario: startup launching a first AI agent
Situation: a dev team is building an agent for customer support automation. The agent uses OpenAI API for generating responses and Playwright for checking order status. Expected volume: 100-500 tasks per day.
VPS 4 vCPU / 8 GB RAM is the optimal starting point. LangChain orchestrator + Playwright fits within 4-6 GB RAM under load. Cost: ~$30-60/month. A dedicated server is overkill here – CPU and RAM aren’t the bottleneck; the bottleneck is OpenAI API latency (~200-500ms per request).
Scenario: agent with local LLM for enterprise
Situation: a financial company is building an agent for document analysis. Data cannot leave the corporate network – local model only. They’ve chosen Llama 3.1 70B.
Llama 3.1 70B in FP16 requires ~140 GB VRAM. Minimum: 2x A100 80GB (160 GB VRAM). VPS doesn’t work at all here – a dedicated GPU server is required. Cost: from $2,000/month. Alternative for smaller requirements: Llama 3.1 8B in INT4 (~5 GB VRAM) fits on an RTX 4090, at ~$350-450/month.
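The VRAM figures follow from simple arithmetic: weights alone take parameter count times bytes per weight, and the KV cache plus runtime overhead add more on top (which is why a quantized 8B model lands nearer 5 GB than the raw 4 GB). A quick helper for the estimate:

```python
def weight_vram_gb(params_billion, bytes_per_weight):
    """Approximate VRAM for model weights alone.

    FP16 = 2 bytes/weight, INT8 = 1, INT4 = 0.5. KV cache and activations
    add roughly 10-30% on top, depending on context length and batch size.
    """
    return params_billion * bytes_per_weight

# 70B in FP16: 70 * 2 = 140 GB -> needs 2x A100 80GB.
# 8B in INT4: 8 * 0.5 = 4 GB -> ~5 GB in practice, fits an RTX 4090.
```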
Scenario: platform for multi-agent automation
Situation: a SaaS product where each client gets their own AI agent for workflow automation. 50 clients, each agent runs 10-50 tasks per day. Agents use OpenAI API and have their own vector databases.
A dedicated server with 16-32 cores and 64-128 GB RAM can pack all agent processes onto a single node. Or – several smaller VPS instances plus a load balancer for client isolation. The second option gives better isolation (one VPS going down doesn’t affect others), the first is simpler to manage.
| Criterion | VPS | Dedicated server / AI hosting |
| --- | --- | --- |
| Local LLM (7B+ models) | CPU inference: slow; GPU: needs dedicated | Optimal with GPU |
| API-based agent (OpenAI, Anthropic) | Optimal | Overkill for a single agent |
| RAG with large vector DB (100+ GB) | Limited by RAM | Optimal |
| Multi-agent platform (50+ agents) | Multiple VPS or large VPS | Dedicated server |
| Privacy (data stays on server) | Works with proper configuration | Maximum isolation |
| Cost (single API-based agent) | Lowest ($20-80/mo) | Overkill |
| Scaling as you grow | Vertical or horizontal | Vertical or cluster |
Use Cases
Customer support automation. The agent processes incoming tickets: classifies them, answers common questions, escalates complex cases. Requires: LLM API or local model, vector database with product documentation, helpdesk integration via API. Infrastructure: VPS 4-8 vCPU / 8-16 GB RAM is sufficient for 500-2,000 tickets/day with an API-based approach.
Research agent (web research). The agent searches the web, analyzes pages, and compiles reports. Playwright for browser access, LLM for analysis and synthesis. Headless browser is the most resource-intensive component: each parallel session consumes 200-500 MB RAM. 10 parallel browsers = 2-5 GB just for them. A VPS with 8 GB RAM fills up quickly under active web scraping.
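A common way to keep browser memory bounded is a semaphore that caps concurrent sessions. A sketch with `asyncio` (the `fetch` callable stands in for real Playwright page navigation and extraction, which is not shown):

```python
import asyncio

MAX_BROWSERS = 4  # at 200-500 MB per session, 4 sessions stay within ~2 GB

sem = asyncio.Semaphore(MAX_BROWSERS)

async def scrape(url, fetch):
    """Acquire a browser slot, then run the (stand-in) fetch for one URL."""
    async with sem:  # never more than MAX_BROWSERS sessions alive at once
        return await fetch(url)

async def scrape_all(urls, fetch):
    """Scrape many URLs concurrently, but with bounded parallelism."""
    return await asyncio.gather(*(scrape(u, fetch) for u in urls))
```

The queue absorbs bursts: 100 URLs still only ever hold 4 browser sessions' worth of RAM, at the cost of throughput.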
Code generation and review agent. The agent analyzes code in a repository, writes tests, performs code review, and suggests refactoring. GitHub/GitLab integrations via webhook, code execution in a sandbox. Requires: LLM API (or a local code-specialized model), a sandbox environment for safe code execution. VPS with Docker and resource limits for the sandbox is the standard setup.
Data analysis agent. The agent pulls data from various sources, cleans it, analyzes it, and builds reports. Python interpreter for computation, database connections, possibly ML libraries (pandas, scikit-learn). Requirements: enough RAM for in-memory dataset processing (for large datasets – 32+ GB), powerful CPU for computation without GPU.
For AI agent hosting on optimized AI infrastructure: Unihost AI hosting. For API-based agents and orchestrators, a standard VPS is a suitable option.
FAQ
How to host AI agent?
Depends on agent architecture. If the agent uses external LLM APIs (OpenAI, Anthropic) – a VPS with a Python/Node.js environment, an agent framework (LangChain, AutoGen), and internet access is sufficient. If the agent runs a model locally – you need either a powerful CPU with large RAM (for small quantized models) or a GPU server (for 7B+ models).
What server is needed for AI agent?
For an orchestration agent without a local model – VPS 2-4 vCPU / 4-8 GB RAM. For an agent with a local 7B model on CPU – 16-32 GB RAM, 8+ cores. For a local 7B+ GPU inference – minimum RTX 4090 (24 GB VRAM). For a 70B model – from 2x A100 80GB. Add RAM for the vector database, browser, and other tools depending on the agent’s task set.
Can AI run on VPS?
Yes, with caveats. An API-based agent (no local model) runs well on a standard VPS. A local model up to 7B in quantized format (INT4/INT8 via llama.cpp or Ollama) can also run on a VPS with 16-32 GB RAM – slower than GPU but functional. For larger models or production-scale load, a dedicated GPU server is required.
Cost of AI hosting?
API-based agent on VPS: $20-80/month for hosting plus API call costs, which depend heavily on the model tier. GPT-4-class models run ~$30/1M input and ~$60/1M output tokens; budget models cost an order of magnitude less. At 100k tasks/month with a 2k-token average context, GPT-4-class pricing works out to roughly $6,000/month in input costs alone, while a budget-tier model brings the same volume down to tens of dollars. Agent with a local model: $350-450/month (1x RTX 4090), with no per-call costs. Against budget-tier API pricing, a local model becomes cheaper from roughly 500k-1M calls per month; against GPT-4-class pricing the breakeven arrives at far lower volumes.
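The breakeven point can be computed directly; the per-task token counts below are illustrative assumptions, not measured figures:

```python
def monthly_api_cost(tasks, in_tokens, out_tokens, in_price, out_price):
    """Monthly API spend in USD; prices are USD per 1M tokens."""
    return tasks * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def breakeven_tasks(gpu_monthly, in_tokens, out_tokens, in_price, out_price):
    """Monthly task volume at which a local GPU matches API spend."""
    per_task = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return gpu_monthly / per_task
```

At GPT-4-class pricing ($30/$60 per 1M tokens), a task with 2k input and 500 output tokens costs $0.09, so a $450/month GPU pays for itself at ~5,000 tasks/month; at budget-tier pricing (~$0.15/$0.60), the breakeven moves out to ~750k tasks/month.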
Next Step
Define your agent architecture (API or local model) and choose your infrastructure. AI hosting for production agents: Unihost AI hosting.