How Auto-Scaling Works for Paperclip Agents
Your Paperclip agent handles 10 requests per minute during normal hours. Then a Product Hunt launch sends 500 requests per minute. What happens?
If you’re self-hosting, probably a timeout cascade. If you’re on HostAgentes, auto-scaling kicks in. Here’s how it works.
The Problem with Static Capacity
Traditional hosting assigns fixed resources. Your agent runs on a single instance with fixed CPU, memory, and concurrent connection limits. When traffic exceeds those limits:
- New requests queue up
- Response times increase
- Requests start timing out
- Users see errors
The fix is manual: spin up more instances, update the load balancer config, and watch the metrics. By the time you react, the spike is over (and your users are gone).
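To make the failure mode concrete, here is a rough sketch of how a backlog grows on a fixed-capacity instance. The capacity figure of 60 requests per minute is an assumption for illustration only, not a measured limit of any real agent.

```python
def backlog_over_time(arrival_per_min, capacity_per_min, minutes):
    """Return the queue depth at the end of each minute for a fixed-capacity instance."""
    backlog = 0
    history = []
    for _ in range(minutes):
        # Requests that cannot be served this minute stay in the queue.
        backlog = max(0, backlog + arrival_per_min - capacity_per_min)
        history.append(backlog)
    return history


# Hypothetical capacity of 60 requests/minute facing a 500 requests/minute spike.
print(backlog_over_time(500, 60, 3))  # [440, 880, 1320]
```

At normal traffic (10 requests per minute) the same function returns a backlog of zero every minute, which is why the problem only shows up during a spike.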
How HostAgentes Auto-Scaling Works
Request Monitoring
We monitor every request to every agent in real time:
- Request rate (requests/second)
- Response latency (p50, p95, p99)
- Queue depth (waiting requests)
- Instance CPU and memory utilization
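As an illustration of what the request-rate and latency metrics involve, the sketch below summarizes one window of latency samples. The function names and the nearest-rank percentile method are assumptions made for this example; they are not a description of HostAgentes internals.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Summarize one monitoring window: request rate plus p50/p95/p99 latency."""
    return {
        "request_rate": len(latencies_ms) / window_seconds,  # requests/second
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```

Tracking p95 and p99 alongside p50 matters because a queue that is starting to back up shows in the slowest requests well before the median moves.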
Scaling Triggers
When any metric crosses a threshold, scaling initiates:
| Trigger | Threshold | Action |
|---|---|---|
| Request rate | >80% of capacity | Scale up |
| P95 latency | >2x baseline | Scale up |
| Queue depth | >10 waiting | Scale up |
| CPU utilization | <20% for 5 min | Scale down |
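The trigger table can be read as a small decision function. The sketch below is one possible encoding: the `Snapshot` fields and the rule that any scale-up trigger takes priority over scale-down are assumptions for illustration, while the threshold values come from the table.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    request_rate: float      # current requests/second
    capacity: float          # requests/second the current instances can serve
    p95_latency_ms: float    # current p95 latency
    baseline_p95_ms: float   # p95 latency under normal load
    queue_depth: int         # requests waiting
    low_cpu_minutes: float   # how long CPU has stayed under 20%


def decide(s: Snapshot) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for one metrics snapshot."""
    scale_up = (
        s.request_rate > 0.8 * s.capacity
        or s.p95_latency_ms > 2 * s.baseline_p95_ms
        or s.queue_depth > 10
    )
    if scale_up:
        # Assumption: a scale-up trigger always wins over scale-down.
        return "scale_up"
    if s.low_cpu_minutes >= 5:
        return "scale_down"
    return "hold"
```

For example, a snapshot with 12 waiting requests returns `"scale_up"` even when the request rate and latency are still within bounds, because any single trigger is enough.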
Scale-Up Process
- Detect — monitoring flags a threshold breach (under 1 second)
- Provision — a pre-warmed instance is activated (2-5 seconds)
- Route — new requests are distributed across instances
- Verify — confirm metrics return to healthy levels
Total time from spike to scaled: about 10 seconds on the Pro plan. Scale-up speed for the other plans is listed in the plan comparison table.
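The provision and route steps can be sketched with a pool of pre-warmed instances and round-robin distribution. The `Pool` class and the round-robin policy are assumptions chosen for simplicity; the actual routing strategy is not specified here.

```python
from collections import deque
from itertools import cycle


class Pool:
    """Hypothetical instance pool: active instances plus a pre-warmed reserve."""

    def __init__(self, active, prewarmed):
        self.active = list(active)
        self.prewarmed = deque(prewarmed)

    def scale_up(self):
        """Activate one pre-warmed instance; return its name, or None if none remain."""
        if not self.prewarmed:
            return None
        instance = self.prewarmed.popleft()
        self.active.append(instance)
        return instance

    def route(self, request_count):
        """Distribute request_count requests round-robin across active instances."""
        targets = cycle(self.active)
        return [next(targets) for _ in range(request_count)]


pool = Pool(active=["agent-1"], prewarmed=["agent-2", "agent-3"])
pool.scale_up()
print(pool.route(4))  # ['agent-1', 'agent-2', 'agent-1', 'agent-2']
```

Keeping instances pre-warmed is what makes activation fast: the expensive startup work has already happened before the spike arrives.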
Scale-Down Process
After traffic subsides:
- Wait — observe for 5 minutes to confirm the spike is over
- Drain — stop routing new requests to excess instances
- Complete — let in-flight requests finish
- Remove — deactivate the extra instances
This ensures no request is dropped during scale-down.
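The drain, complete, and remove steps come down to two rules: a draining instance receives no new requests, and it is only removed once nothing is in flight. A minimal sketch, assuming a hypothetical `Instance` record and a least-loaded routing choice:

```python
class Instance:
    """Hypothetical instance record used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0      # requests currently being processed
        self.draining = False   # True once scale-down has started


def pick_target(instances):
    """Choose the least-loaded instance that is not draining."""
    candidates = [i for i in instances if not i.draining]
    if not candidates:
        raise RuntimeError("no instance available for new requests")
    return min(candidates, key=lambda i: i.in_flight)


def removable(instance):
    """An instance may be deactivated only when draining and idle."""
    return instance.draining and instance.in_flight == 0
```

Because `removable` waits for `in_flight` to reach zero, requests that were already running when the drain started are allowed to finish.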
What This Means for You
No Capacity Planning
You don’t need to predict traffic. Deploy your agent and let auto-scaling handle the rest. Whether it’s 1 request or 10,000 per minute, your agent stays responsive within your plan’s limits.
Flat, Predictable Pricing
On the Pro and Scale plans, auto-scaling is included. There are no per-instance charges or surprise bills. Your monthly price stays the same regardless of traffic.
Zero Configuration
No YAML files, no auto-scaling groups, no min/max instance counts. Auto-scaling works out of the box as soon as your agent is deployed.
Auto-Scaling by Plan
| Feature | Starter | Pro | Scale |
|---|---|---|---|
| Auto-scaling | Basic | Full | Full + priority |
| Max request rate | 50 req/min | Unlimited | Unlimited |
| Scale-up speed | ~30 sec | ~10 sec | ~5 sec |
| Pre-warmed instances | 1 | 3 | 10+ |
When You Need Scale
The Scale plan (€45/month) is for teams that need:
- Priority scaling during peak traffic
- Dedicated pre-warmed instances
- Custom scaling thresholds
- Scale-to-zero during off-hours (cost savings)
- Enterprise SLAs
Most teams do great on Pro. Start there and upgrade when you need priority scaling.