An AI agent is an autonomous program that executes tasks without constant human involvement: it analyzes data, makes decisions, calls APIs, and runs other tools. To run, it needs infrastructure. What kind depends on what the agent does and how intensively it does it.
Quick Answer: What Infrastructure You Need
| Your agent / scenario | Minimum infrastructure | Approx. cost/mo |
| --- | --- | --- |
| Orchestration agent (LangChain, AutoGPT) without local model | VPS 2-4 vCPU / 4-8 GB RAM | $20-60 |
| Agent with local model up to 7B (llama.cpp, Ollama) | VPS 4-8 vCPU / 16-32 GB RAM or 1x RTX 4090 | $60-450 |
| Agent with local 13B-70B model | Dedicated GPU: 1-4x A100 | $600-5,000+ |
| RAG agent (document search + LLM API) | VPS 4 vCPU / 8 GB RAM + vector DB | $30-100 |
| Multi-agent pipeline (several agents in parallel) | VPS 8-16 vCPU / 16-32 GB RAM | $80-200 |
| Browser automation agent (Playwright, Selenium) | VPS 4 vCPU / 8 GB RAM + headless Chromium | $30-80 |
| Production AI agent with 1,000+ tasks/day | Dedicated server or AI hosting | $200-2,000+ |
The key split: if the agent calls external LLM APIs (OpenAI, Anthropic, Gemini) – it only needs CPU and RAM for orchestration. If the agent runs a model locally – it needs a GPU or a powerful CPU with large RAM.
What Is AI Agent Hosting
AI agent hosting is providing server infrastructure to run AI agents continuously or on demand. Unlike standard web hosting, AI agents have specific requirements: long-running processes (an agent can work on a task for hours), large RAM for model context, the ability to call external APIs, and the ability to maintain state between runs.
What distinguishes an AI agent from a regular application:
- Execution duration – an agent task can take from seconds to hours, unlike an HTTP request that must respond in 100-500ms
- State between runs – the agent stores memory, conversation context, and results from previous steps
- Dynamic resource consumption – during inference, CPU/RAM peak is far higher than during idle waiting
- Tool calls – the agent invokes external APIs, databases, browsers, and code interpreters
- Parallelism – multi-agent systems run multiple agents simultaneously
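These properties boil down to a simple loop: load persisted state, decide, act, save state. A minimal sketch in Python (the `llm` callable and the `agent_state.json` path are illustrative stand-ins, not any specific framework's API):

```python
import json
import os

STATE_FILE = "agent_state.json"  # hypothetical path for persisted memory

def load_state():
    # Unlike a stateless HTTP handler, an agent keeps memory between runs.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"history": []}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def run_task(task, llm=None):
    """One agent iteration: recall context, decide, act, persist."""
    state = load_state()
    context = state["history"][-5:]  # short-term memory window
    # `llm` stands in for a model call; echo keeps the sketch runnable.
    decision = llm(task, context) if llm else f"echo:{task}"
    state["history"].append({"task": task, "result": decision})
    save_state(state)
    return decision
```

Because the loop may run for minutes or hours per task, the hosting requirement is a process that survives that long, not a fast request/response cycle.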
How It Works
A typical AI agent consists of several layers, each with its own infrastructure requirements.
Model layer (LLM)
The brain of the agent is a language model. There are two options: an API call (OpenAI GPT-4, Anthropic Claude, Google Gemini) or a local model (Llama, Mistral, Qwen). An API call only requires a network connection and doesn’t load the server. A local model requires GPU or a powerful CPU plus large RAM. The choice is a tradeoff between cost (APIs become expensive at high volume), privacy (local model means data never leaves the server), and performance.
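One practical consequence of this tradeoff: both options can sit behind the same OpenAI-compatible HTTP interface (Ollama exposes one locally), so switching between a hosted API and a local model is largely a matter of changing the base URL. A stdlib-only sketch (the URLs, model names, and API key are placeholders):

```python
import json
import urllib.request

def build_payload(prompt, model):
    """Chat-completions request body shared by hosted and local endpoints."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base_url, api_key="", model="gpt-4o"):
    """Send one chat request to any OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Hosted API: needs only network access, no GPU on the server.
# chat("Summarize this ticket", "https://api.openai.com/v1", api_key="sk-...")

# Local model via Ollama's OpenAI-compatible endpoint: data stays on the server.
# chat("Summarize this ticket", "http://localhost:11434/v1", model="llama3.1")
```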
Orchestration layer
The agent framework (LangChain, LlamaIndex, AutoGen, CrewAI, n8n) coordinates model calls, tool invocations, and state storage. The orchestrator is a relatively lightweight Python/Node.js process. Its main requirement is stable 24/7 operation or on-demand launch without cold start delays. VPS is sufficient for most orchestrators.
Memory and storage layer
The agent stores state in several places: a vector database (Chroma, Qdrant, Weaviate, Pinecone) for semantic document search; a relational database (PostgreSQL) for structured data and metadata; Redis for short-term memory and caching; and file storage for artifacts (documents, images, outputs).
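Under the hood, a vector database answers one core query: which stored embeddings are closest to the query embedding. A toy pure-Python illustration of that operation (real deployments use Qdrant, Chroma, etc., with 384-3,072-dimensional embeddings and approximate-nearest-neighbor indexes, which is where the RAM and disk budgets go):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """Return the k most similar stored items -- the query a vector DB serves."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in scored[:k]]

# Toy 3-dimensional "embeddings"; real ones are hundreds of dimensions wide.
store = [
    {"text": "refund policy", "vec": [0.9, 0.1, 0.0]},
    {"text": "shipping times", "vec": [0.1, 0.9, 0.0]},
    {"text": "return window", "vec": [0.8, 0.2, 0.1]},
]
```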
Tools layer
The agent can run: a browser (Playwright, Selenium) for web browsing and scraping; a code interpreter (Python sandbox) for computation; external APIs (calendar, email, CRM, databases); and shell commands for system automation. Each tool has its own resource requirements – especially headless browsers (100-500 MB RAM per session).
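Of these tools, the code interpreter is the simplest to sketch: untrusted code runs in a separate process with a hard timeout. This is a minimal sketch only; production sandboxes add container-level memory/CPU limits, network isolation, and a read-only filesystem on top:

```python
import subprocess
import sys

def run_sandboxed(code, timeout=5):
    """Run untrusted Python in a separate process with a hard timeout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env/site
        capture_output=True,
        text=True,
        timeout=timeout,  # raises TimeoutExpired if the code hangs
    )
    return proc.stdout, proc.returncode
```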
Infrastructure Requirements
| Agent component | CPU | RAM | GPU | Disk |
| --- | --- | --- | --- | --- |
| Orchestrator (no local model) | 2-4 vCPU | 2-4 GB | Not needed | 10-50 GB SSD |
| Local 7B model (CPU inference) | 8-16 vCPU | 16-32 GB | Not needed | 20 GB NVMe |
| Local 7B model (GPU inference) | 4-8 vCPU | 16 GB | 1x RTX 4090 (24 GB) | 20 GB NVMe |
| Vector DB (Qdrant/Chroma) | 2-4 vCPU | 4-16 GB | Not needed | 50-500 GB NVMe |
| Headless browser (Playwright) | 2-4 vCPU / browser | 1-2 GB / browser | Not needed | 10 GB SSD |
| Python sandbox (code interpreter) | 2-4 vCPU | 2-8 GB | Not needed | 10 GB SSD |
| Full stack (orchestrator + RAG + browser) | 8-16 vCPU | 16-32 GB | Optional | 100+ GB NVMe |
Practical advice: start with a minimal configuration and monitor real consumption. AI agents have very uneven load – a peak during inference and near-zero consumption while waiting. Vertical VPS scaling after launch is a simpler strategy than over-provisioning from the start.
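One stdlib-only way to capture real consumption is to log the process's peak memory after each task and size the VPS from observed peaks rather than guesses (this sketch assumes a Linux VPS, where `ru_maxrss` is reported in kilobytes; on macOS it is bytes):

```python
import resource

def peak_rss_mb():
    """Peak resident memory of this process so far, in MB (Linux semantics)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Typical use: run a task, then log peak_rss_mb() to see the real inference
# spike, not the idle baseline, before deciding whether to scale the VPS.
```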
VPS vs Dedicated for AI Agents
Scenario: startup launching a first AI agent
Situation: a dev team is building an agent for customer support automation. The agent uses OpenAI API for generating responses and Playwright for checking order status. Expected volume: 100-500 tasks per day.
VPS 4 vCPU / 8 GB RAM is the optimal starting point. LangChain orchestrator + Playwright fits within 4-6 GB RAM under load. Cost: ~$30-60/month. A dedicated server is overkill here – CPU and RAM aren’t the bottleneck; the bottleneck is OpenAI API latency (~200-500ms per request).
Scenario: agent with local LLM for enterprise
Situation: a financial company is building an agent for document analysis. Data cannot leave the corporate network – local model only. They’ve chosen Llama 3.1 70B.
Llama 3.1 70B in FP16 requires ~140 GB VRAM. Minimum: 2x A100 80GB (160 GB VRAM). VPS doesn’t work at all here – a dedicated GPU server is required. Cost: from $2,000/month. Alternative for smaller requirements: Llama 3.1 8B in INT4 (~5 GB VRAM) fits on an RTX 4090, at ~$350-450/month.
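The VRAM figures follow from simple arithmetic: weights alone take parameter count times bytes per weight, and the KV cache plus runtime overhead add more on top (which is why a quantized 8B model lands nearer 5 GB than the raw 4 GB). A quick helper for the estimate:

```python
def weight_vram_gb(params_billion, bytes_per_weight):
    """Approximate VRAM for model weights alone.

    FP16 = 2 bytes/weight, INT8 = 1, INT4 = 0.5. KV cache and activations
    add roughly 10-30% on top, depending on context length and batch size.
    """
    return params_billion * bytes_per_weight

# 70B in FP16: 70 * 2 = 140 GB -> needs 2x A100 80GB.
# 8B in INT4: 8 * 0.5 = 4 GB -> ~5 GB in practice, fits an RTX 4090.
```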
Scenario: platform for multi-agent automation
Situation: a SaaS product where each client gets their own AI agent for workflow automation. 50 clients, each agent runs 10-50 tasks per day. Agents use OpenAI API and have their own vector databases.
A dedicated server with 16-32 cores and 64-128 GB RAM can pack all agent processes onto a single node. Or – several smaller VPS instances plus a load balancer for client isolation. The second option gives better isolation (one VPS going down doesn’t affect others), the first is simpler to manage.
| Criterion | VPS | Dedicated server / AI hosting |
| --- | --- | --- |
| Local LLM (7B+ models) | CPU inference: slow; GPU: needs dedicated | Optimal with GPU |
| API-based agent (OpenAI, Anthropic) | Optimal | Overkill for a single agent |
| RAG with large vector DB (100+ GB) | Limited by RAM | Optimal |
| Multi-agent platform (50+ agents) | Multiple VPS or large VPS | Dedicated server |
| Privacy (data stays on server) | Works with proper configuration | Maximum isolation |
| Cost (single API-based agent) | Lowest ($20-80/mo) | Overkill |
| Scaling as you grow | Vertical or horizontal | Vertical or cluster |
Use Cases
Customer support automation. The agent processes incoming tickets: classifies them, answers common questions, escalates complex cases. Requires: LLM API or local model, vector database with product documentation, helpdesk integration via API. Infrastructure: VPS 4-8 vCPU / 8-16 GB RAM is sufficient for 500-2,000 tickets/day with an API-based approach.
Research agent (web research). The agent searches the web, analyzes pages, and compiles reports. Playwright for browser access, LLM for analysis and synthesis. Headless browser is the most resource-intensive component: each parallel session consumes 200-500 MB RAM. 10 parallel browsers = 2-5 GB just for them. A VPS with 8 GB RAM fills up quickly under active web scraping.
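A common way to keep browser memory bounded is a semaphore that caps concurrent sessions. A sketch with `asyncio` (the `fetch` callable stands in for real Playwright page navigation and extraction, which is not shown):

```python
import asyncio

MAX_BROWSERS = 4  # at 200-500 MB per session, 4 sessions stay within ~2 GB

sem = asyncio.Semaphore(MAX_BROWSERS)

async def scrape(url, fetch):
    """Acquire a browser slot, then run the (stand-in) fetch for one URL."""
    async with sem:  # never more than MAX_BROWSERS sessions alive at once
        return await fetch(url)

async def scrape_all(urls, fetch):
    """Scrape many URLs concurrently, but with bounded parallelism."""
    return await asyncio.gather(*(scrape(u, fetch) for u in urls))
```

The queue absorbs bursts: 100 URLs still only ever hold 4 browser sessions' worth of RAM, at the cost of throughput.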
Code generation and review agent. The agent analyzes code in a repository, writes tests, performs code review, and suggests refactoring. GitHub/GitLab integrations via webhook, code execution in a sandbox. Requires: LLM API (or a local code-specialized model), a sandbox environment for safe code execution. VPS with Docker and resource limits for the sandbox is the standard setup.
Data analysis agent. The agent pulls data from various sources, cleans it, analyzes it, and builds reports. Python interpreter for computation, database connections, possibly ML libraries (pandas, scikit-learn). Requirements: enough RAM for in-memory dataset processing (for large datasets – 32+ GB), powerful CPU for computation without GPU.
For AI agent hosting on optimized AI infrastructure: Unihost AI hosting. For API-based agents and orchestrators, a standard VPS is a suitable option.
FAQ
How to host AI agent?
Depends on agent architecture. If the agent uses external LLM APIs (OpenAI, Anthropic) – a VPS with a Python/Node.js environment, an agent framework (LangChain, AutoGen), and internet access is sufficient. If the agent runs a model locally – you need either a powerful CPU with large RAM (for small quantized models) or a GPU server (for 7B+ models).
What server is needed for AI agent?
For an orchestration agent without a local model – VPS 2-4 vCPU / 4-8 GB RAM. For an agent with a local 7B model on CPU – 16-32 GB RAM, 8+ cores. For a local 7B+ GPU inference – minimum RTX 4090 (24 GB VRAM). For a 70B model – from 2x A100 80GB. Add RAM for the vector database, browser, and other tools depending on the agent’s task set.
Can AI run on VPS?
Yes, with caveats. An API-based agent (no local model) runs well on a standard VPS. A local model up to 7B in quantized format (INT4/INT8 via llama.cpp or Ollama) can also run on a VPS with 16-32 GB RAM – slower than GPU but functional. For larger models or production-scale load, a dedicated GPU server is required.
Cost of AI hosting?
API-based agent on VPS: $20-80/month for hosting plus API call costs, which depend heavily on the model tier. GPT-4-class models run ~$30/1M input and ~$60/1M output tokens; budget models cost an order of magnitude less. At 100k tasks/month with a 2k-token average context, GPT-4-class pricing works out to roughly $6,000/month in input costs alone, while a budget-tier model brings the same volume down to tens of dollars. Agent with a local model: $350-450/month (1x RTX 4090), with no per-call costs. Against budget-tier API pricing, a local model becomes cheaper from roughly 500k-1M calls per month; against GPT-4-class pricing the breakeven arrives at far lower volumes.
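The breakeven point can be computed directly; the per-task token counts below are illustrative assumptions, not measured figures:

```python
def monthly_api_cost(tasks, in_tokens, out_tokens, in_price, out_price):
    """Monthly API spend in USD; prices are USD per 1M tokens."""
    return tasks * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def breakeven_tasks(gpu_monthly, in_tokens, out_tokens, in_price, out_price):
    """Monthly task volume at which a local GPU matches API spend."""
    per_task = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return gpu_monthly / per_task
```

At GPT-4-class pricing ($30/$60 per 1M tokens), a task with 2k input and 500 output tokens costs $0.09, so a $450/month GPU pays for itself at ~5,000 tasks/month; at budget-tier pricing (~$0.15/$0.60), the breakeven moves out to ~750k tasks/month.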
Next Step
Define your agent architecture (API or local model) and choose your infrastructure. AI hosting for production agents: Unihost AI hosting.