If you already know you need a GPU server for ML – start with the table below. If you’re still deciding between CPU and GPU, or unsure which configuration fits – read on.
Quick Decision: Which Config You Need
| Your task | Minimum configuration | Optimal configuration |
| --- | --- | --- |
| Prototyping, training on small datasets | 1x RTX 4090 (24 GB) | 2x RTX 4090 (48 GB) |
| Fine-tuning 7B-13B models (LoRA/QLoRA) | 1x A100 40GB | 2x A100 80GB |
| Fine-tuning 30B-70B models | 4x A100 80GB | 4x H100 80GB |
| Training 7B-30B from scratch | 4x A100 80GB + NVLink | 8x A100 80GB + NVLink |
| Training 70B+ / foundation models | 8x H100 80GB + InfiniBand | 8x H200 141GB + InfiniBand |
| Production LLM inference | 2x A100 40GB | 4x A100 80GB or 2x H100 |
| Computer vision (real-time) | 1x RTX 4090 | 2-4x A100 40GB |
| Embedding generation (high volume) | 1x A100 40GB | 2x A100 80GB |
If your task is in the table, the configuration is determined. If not – read the scenarios below; they cover less common cases.
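For teams that script their provisioning, the table above can be encoded as a small lookup. A hypothetical sketch — the task keys and helper name are illustrative, not any provider's API:

```python
# Hypothetical lookup encoding the quick-decision table above.
# Task keys and configuration strings are illustrative only.
CONFIGS = {
    "prototyping":      ("1x RTX 4090 (24 GB)", "2x RTX 4090 (48 GB)"),
    "finetune-7b-13b":  ("1x A100 40GB", "2x A100 80GB"),
    "finetune-30b-70b": ("4x A100 80GB", "4x H100 80GB"),
    "train-7b-30b":     ("4x A100 80GB + NVLink", "8x A100 80GB + NVLink"),
    "train-70b-plus":   ("8x H100 80GB + InfiniBand", "8x H200 141GB + InfiniBand"),
    "llm-inference":    ("2x A100 40GB", "4x A100 80GB or 2x H100"),
    "cv-realtime":      ("1x RTX 4090", "2-4x A100 40GB"),
    "embeddings":       ("1x A100 40GB", "2x A100 80GB"),
}

def pick_config(task: str, optimal: bool = False) -> str:
    """Return the minimum (default) or optimal configuration for a task."""
    minimum, best = CONFIGS[task]
    return best if optimal else minimum

print(pick_config("finetune-7b-13b"))                # 1x A100 40GB
print(pick_config("finetune-7b-13b", optimal=True))  # 2x A100 80GB
```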
Why GPUs Are Important for ML
Training a neural network means billions of matrix multiplication operations repeated across epochs. A CPU has 8-128 powerful cores built for sequential tasks. A GPU has 6,000-18,000+ simple CUDA cores that run those operations in parallel. For ML workloads, the difference is 10x to 100x in favor of the GPU.
Concretely: training BERT-large (340M parameters) on a single CPU (32-core Xeon) takes ~72 hours. On a single A100 80GB – ~4 hours. On 4x A100 – under an hour. CPU isn’t just slower – it makes large model training practically infeasible.
| Task | CPU | GPU (A100) | Speedup |
| --- | --- | --- | --- |
| BERT-large training (1 epoch) | ~72 hrs | ~4 hrs | ~18x |
| GPT-2 (1.5B) inference, 1 request | ~8 sec | ~0.1 sec | ~80x |
| ResNet-50 training (ImageNet) | ~10 days | ~12 hrs | ~20x |
| Embedding generation (1M vectors) | ~2 hrs | ~3 min | ~40x |
What Is a GPU Server for ML
A GPU server for machine learning is a dedicated bare-metal server with one or more GPUs, optimized for compute-intensive ML workloads. What distinguishes it from a generic GPU server is the full stack: enough VRAM for the model, NVLink or NVSwitch for inter-chip communication, fast NVMe storage for dataset streaming, and sufficient system RAM for preprocessing.
Key components that determine performance:
- VRAM (GPU memory) – the most common bottleneck. A 70B model in FP16 requires ~140 GB. If the model doesn’t fit in VRAM, the options are quantization (INT8/INT4) or more GPUs.
- GPU interconnect – NVLink lets GPUs on the same node share memory and communicate at up to 900 GB/s (H100, NVLink 4; 600 GB/s on A100). Without NVLink, communication goes through PCIe, which is 5-10x slower for distributed training.
- NVMe storage – during training, the server continuously streams batches. A single NVMe at 3.5 GB/s can’t keep up with 8xA100. Minimum: a RAID of multiple NVMe drives or a separate storage node.
- System RAM – should be at least equal to total VRAM. With 8x H100 (640 GB total VRAM) – at least 768 GB RAM for comfortable preprocessing.
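The storage point above is easy to sanity-check with arithmetic. A minimal sketch — all workload numbers (batch size, sample size, step time) are illustrative assumptions:

```python
def required_stream_gbs(samples_per_step: int, bytes_per_sample: int,
                        step_seconds: float) -> float:
    """Sustained read throughput (GB/s) needed to keep the GPUs fed
    with fresh training batches."""
    return samples_per_step * bytes_per_sample / step_seconds / 1e9

# Illustrative: 8 GPUs x 256 images each, ~600 KB per decoded sample,
# one optimizer step every 0.3 s.
rate = required_stream_gbs(8 * 256, 600_000, 0.3)
print(f"{rate:.1f} GB/s")  # ~4.1 GB/s -> already above a single 3.5 GB/s NVMe
```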
Scenarios: Who Chooses What
Scenario 1 – ML engineer at a startup, first experiments
Situation: a 2-3 person ML team, a product idea to test, need to validate hypotheses on small datasets. Budget is constrained, configuration may change month to month.
What happens without GPU: training a simple classifier on 100k examples takes an hour instead of a minute. Iteration speed drops 20-50x. The team spends time waiting, not building.
Solution: 1-2x RTX 4090 (24 GB each). For models up to 13B with quantization – sufficient. Cost: $300-700/month. If flexibility matters – cloud GPU on-demand at the start, dedicated server when utilization exceeds 60% of the month.
Scenario 2 – Fine-tuning an LLM for a product
Situation: base model is available (Llama 3, Mistral, Gemma), need to adapt it for a specific domain (legal text, medical documentation, code). Dataset: 10k-500k examples. Training runs regularly – weekly or monthly.
Fine-tuning 7B via LoRA on a single A100 40GB takes 2-8 hours depending on dataset size. For 70B via QLoRA on 4x A100 80GB – 12-24 hours. Both fit a realistic production cadence.
Solution: for 7B-13B – 1-2x A100 40GB or RTX 4090. For 30B-70B – 4x A100 80GB with NVLink. Dedicated bare-metal is justified for regular training runs – cheaper than cloud from ~3 runs per month.
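To see why LoRA makes 7B fine-tuning fit on a single A100 40GB, it helps to count trainable parameters. A hedged sketch — the helper and the Llama-like shapes are illustrative assumptions:

```python
def lora_trainable_params(hidden: int, layers: int, rank: int,
                          targets_per_layer: int = 4) -> int:
    """Trainable parameters when LoRA adapters of rank `rank` are attached
    to `targets_per_layer` square (hidden x hidden) projection matrices per
    transformer layer. Each adapter adds two small matrices:
    (hidden x rank) and (rank x hidden)."""
    return layers * targets_per_layer * 2 * hidden * rank

# Roughly Llama-7B-like shapes (illustrative): hidden=4096, 32 layers, r=16
p = lora_trainable_params(4096, 32, 16)
print(f"{p / 1e6:.0f}M trainable params")  # ~17M vs ~7,000M for full fine-tuning
```

Only the adapters need gradients and optimizer states, which is why the 4-6x training multiplier applies to millions of parameters instead of billions.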
Scenario 3 – Production LLM inference
Situation: model is trained, need to serve an API for 1,000+ users. Requirements: latency < 200ms to first token, throughput 50+ requests/sec.
What matters here isn’t just VRAM but GPU throughput. H100 generates tokens roughly 2-3x faster than A100 at the same VRAM, thanks to higher memory bandwidth (3.35 TB/s vs ~2 TB/s) and FP8 support in the Transformer Engine. For a 13B model – 1x A100 40GB is sufficient. For 70B – 2x H100 or 4x A100 80GB.
Solution: dedicated server beats cloud at sustained load. 2x H100 for 70B production inference is the standard configuration for LLM APIs.
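The bandwidth numbers translate directly into decode speed: token generation is memory-bound, since every new token reads all model weights once. A back-of-the-envelope sketch — an upper bound only, ignoring KV cache, batching, and kernel overheads:

```python
def decode_tokens_per_sec(params_b: float, bytes_per_param: float,
                          mem_bandwidth_tbs: float) -> float:
    """Upper-bound estimate for single-stream decode speed: each generated
    token must read all model weights once, so throughput is bounded by
    memory bandwidth divided by model size."""
    model_gb = params_b * bytes_per_param
    return mem_bandwidth_tbs * 1000 / model_gb

# 13B in FP16 on A100 80GB (~2 TB/s): upper bound ~77 tokens/s per stream
print(f"{decode_tokens_per_sec(13, 2, 2.0):.0f} tok/s")
```

The same formula shows the H100 advantage: 3.35 TB/s over the same 26 GB of weights raises the ceiling proportionally.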
Scenario 4 – Research team, training from scratch
Situation: academic or R&D team training a custom architecture or foundation model. Datasets: hundreds of GB to terabytes. Training time: days to weeks.
InfiniBand between nodes is critical here: when training across 32 GPUs on different servers, gradients synchronize over the network. InfiniBand at 400 Gb/s vs 100 Gb Ethernet delivers up to 2-3x better multi-node training efficiency.
Solution: 8x H100 or H200 as the minimum node for serious workloads. NVLink within the node, InfiniBand between nodes. NVMe RAID for dataset streaming.
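A multi-node run on such a cluster is typically driven by `torchrun`. A minimal launch sketch, executed once per node — the hostname, port, and script name are placeholders:

```shell
# Illustrative: 4 nodes x 8 GPUs = 32 GPUs total.
# node0:29500 is the rendezvous endpoint; train.py is your training script.
torchrun \
  --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=node0:29500 \
  train.py --config config.yaml
```

With NCCL as the backend, gradient all-reduce then runs over NVLink within each node and InfiniBand between nodes automatically.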
Best GPU Configurations
| GPU | VRAM | HBM bandwidth | NVLink | Price/mo (approx.) | Best for |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 1 TB/s | No | $300-450 | Prototypes, small models, inference up to 13B |
| A100 40GB | 40 GB | 1.6 TB/s | Yes | $600-900 | Fine-tuning 7B-30B, inference 30B+ |
| A100 80GB | 80 GB | 2 TB/s | Yes | $900-1400 | Fine-tuning 70B, training 7B-30B |
| H100 80GB | 80 GB | 3.35 TB/s | Yes (NVLink 4) | $2,000-3,500 | Production inference, training 30B+ |
| H200 141GB | 141 GB | 4.8 TB/s | Yes (NVLink 4) | $3,500-6,000 | Foundation models, 70B+ training |
Prices are per GPU in a dedicated bare-metal server configuration. Cloud on-demand pricing runs 2-4x higher at sustained utilization.
Browse current GPU servers: Unihost GPU servers. Managed AI infrastructure: Unihost AI hosting.
ML Use Cases
Computer Vision. Object detection (YOLO, DETR), segmentation, image classification. VRAM requirements are lower than LLMs – an image batch takes 4-16 GB for most architectures. 1-2x RTX 4090 or A100 40GB covers 90% of CV tasks.
NLP and text processing. BERT, RoBERTa, T5 for classification, NER, sentiment. Models up to 1B parameters – RTX 4090 is more than sufficient. Larger transformers (3B-7B) – A100 40GB.
Recommendation systems. Embedding models, two-tower architectures, ranking. VRAM requirements are relatively modest, but inference speed matters for real-time recommendations. 1-2x A100 40GB for production recommenders.
Audio and image generation. Stable Diffusion, Whisper, MusicGen. SD XL requires 8-12 GB VRAM for basic inference. For fine-tuning and batch generation – 24+ GB. RTX 4090 or A100 40GB.
Reinforcement Learning. RLHF for LLMs, game-playing agents. Combination of GPU and CPU compute. Specific requirements depend on the environment – from RTX 4090 to a multi-GPU cluster for complex tasks.
FAQ
What GPU is best for machine learning?
Depends on task and budget. H100 80GB and H200 141GB are the top ML hardware in 2026 – but priced accordingly. A100 80GB is the optimal balance for most production workloads. RTX 4090 is the best choice for a budget start and models up to 13B. If resources are constrained, A100 40GB covers 70% of real-world ML tasks.
Do you need GPU for AI training?
For any serious ML – yes. CPU training of neural networks is 10-100x slower. The exception: small classical ML models (Random Forest, XGBoost, linear models) train fine on CPU. But if you’re working with neural networks from a few million parameters up – GPU is mandatory.
How much VRAM is needed for ML?
Rule of thumb: model size (in parameters) × 2 bytes (FP16) = minimum VRAM. 7B × 2 = ~14 GB. Add activations and optimizer states: for training, multiply by 4-6x. A 7B model for training needs 56-84 GB. For inference – weights only, so 7B fits in 14-16 GB (FP16) or 7-8 GB (INT8).
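The rule of thumb can be written down directly — a small sketch using a 5x training factor, the midpoint of the 4-6x range above:

```python
def vram_needed_gb(params_b: float, bytes_per_param: float = 2.0,
                   training: bool = False, train_factor: float = 5.0) -> float:
    """Rule of thumb: weights = params x bytes/param (2 for FP16, 1 for INT8,
    0.5 for INT4). Training multiplies by ~4-6x (5x here) to cover gradients,
    optimizer states, and activations."""
    weights = params_b * bytes_per_param
    return weights * train_factor if training else weights

print(vram_needed_gb(7))                     # 14.0 GB -> FP16 inference
print(vram_needed_gb(7, bytes_per_param=1))  # 7.0 GB  -> INT8 inference
print(vram_needed_gb(7, training=True))      # 70.0 GB -> full training
```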
CPU vs GPU for machine learning?
CPU wins in exactly one scenario: classical ML without neural networks (XGBoost, sklearn, feature engineering). For everything else – GPU is an order of magnitude faster. Practical rule: if your code uses PyTorch or TensorFlow with neural networks, GPU is mandatory at any serious scale.
Next Step
Define your model size and task type – the configuration becomes obvious. Current GPU servers for ML: Unihost GPU servers. Managed AI infrastructure: Unihost AI hosting.