Private AI clusters pooled over the internet, on your hardware.
No VPS, no API fees.
Run massive open-source models no single machine could host. Send one invite link, everyone joins, everyone shares the compute.
$ pip install progresspals
Three commands. Real distributed inference.
No coordination overhead, no Kubernetes, no public swarms. Just your people, your hardware, your model.
Create a swarm
Pick a model. The CLI claims the layers your machine can hold and starts hosting. Mint an invite for your pals with pals invite create.
Invite your team
Mint a single-use invite token and hand it to each pal over a secure channel. They redeem it, then join your swarm and host their slice of the model.
Run it like OpenAI
Start the local OpenAI-compatible endpoint. Point Cursor, Aider, Continue, or any SDK at it. Inference flows through the chain.
A drop-in replacement for OpenAI — running on your team's hardware.
pals serve exposes the swarm as a local OpenAI-compatible endpoint at http://localhost:8080/v1. Any tool that speaks the OpenAI API works unchanged — point it at your endpoint and it codes, chats, and reasons through your private cluster.
POST /v1/chat/completions · GET /v1/models · SSE streaming
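As a rough illustration of that contract, the sketch below builds a raw POST /v1/chat/completions request using only the Python standard library. The base URL and the placeholder key come from this page; the helper names `build_chat_request` and `send_chat` are hypothetical and not part of the pals CLI, and the response shape is assumed to follow the usual OpenAI chat completion format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # local endpoint exposed by `pals serve`


def build_chat_request(prompt, model, stream=False, api_key="any-string"):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper for illustration; not part of the pals CLI.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request):
    """Send the request and return the assistant's reply text.

    Needs a running `pals serve`; assumes an OpenAI-shaped JSON response.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a swarm up and `pals serve` running, `send_chat(build_chat_request("Say hello.", "Qwen/Qwen3-Coder-480B-A35B-Instruct"))` should return the reply as a string.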
Plug it into the tools your team already uses.
Because the swarm exposes a standard OpenAI-compatible endpoint, anything in your agent stack — coding harnesses, gateways, frameworks — just works.
Coding agents
Your team's swarm becomes the brain inside the IDE. Point the agent at the local endpoint and it codes, edits, and refactors against your shared cluster.
# Cursor → Settings → Models → Custom OpenAI Base URL
http://localhost:8080/v1
# Aider
aider --openai-api-base http://localhost:8080/v1 \
  --openai-api-key any-string
Personal AI agents
Self-hosted agents and assistants that already speak the OpenAI API. Swap the provider URL for the swarm and they run on your team's hardware instead of someone else's GPUs.
# Most gateways read these standard env vars:
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=any-string
Agent frameworks & SDKs
Stack your own agents on top. Anything built on the OpenAI SDK accepts a base_url override — your swarm becomes the model layer underneath multi-agent orchestration, RAG, evals, anything.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="any-string",
)
client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    messages=[{"role": "user", "content": "..."}],
)
Every tool listed accepts a custom OpenAI base URL. If yours does too, it will work the same way. There is no special integration, just the standard /v1/chat/completions contract with SSE streaming.
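For a feel of what SSE streaming looks like on the wire, here is a minimal sketch of a parser for OpenAI-style stream lines. It assumes the swarm endpoint emits the usual `data: {...}` chunks with a `delta` field and a final `data: [DONE]` marker, as OpenAI-compatible servers typically do; the function name `iter_sse_text` and the sample lines are made up for illustration.

```python
import json


def iter_sse_text(lines):
    """Yield text deltas from OpenAI-style SSE lines.

    `lines` is any iterable of decoded lines from the response body.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream, not captured from a real swarm.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))  # prints: Hello
```

In practice the OpenAI SDK does this parsing for you when you pass `stream=True`; the sketch only shows what travels over the connection.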
Built for teams that want their own models, privately.
Everything you need to stand up a serious cluster with people you trust — without renting a single GPU.
Invite-only swarms
No public discovery, no random peers. Single-use tokens, regenerable, expiring. Only people you invite can join.
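To make "single-use, expiring, revocable" concrete, here is a toy in-memory sketch of those semantics. It is not how ProgressPals stores or verifies invites; the class name `InviteBook`, the one-hour default lifetime, and the injectable clock are all assumptions made for illustration.

```python
import secrets
import time


class InviteBook:
    """Toy in-memory registry of single-use, expiring invite tokens."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._expiry = {}   # token -> expiry timestamp

    def mint(self):
        """Create a fresh random token that expires after `ttl_seconds`."""
        token = secrets.token_urlsafe(32)
        self._expiry[token] = self.clock() + self.ttl
        return token

    def revoke(self, token):
        """Invalidate a token that has not been redeemed yet."""
        self._expiry.pop(token, None)

    def redeem(self, token):
        """Return True once per valid token; False if unknown, used, revoked, or expired."""
        expiry = self._expiry.pop(token, None)  # popping makes the token single-use
        return expiry is not None and self.clock() < expiry
```

A second `redeem` call with the same token returns False, which is the property that makes a leaked invite useless once its intended pal has joined.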
OpenAI-compatible endpoint
`pals serve` exposes /v1/chat/completions and /v1/models with SSE streaming. Cursor, Aider, Continue work unchanged.
Encrypted activations
Per-swarm AES-256-GCM key derived from your swarm's shared secret via HKDF-SHA256. Activation tensors are encrypted before leaving each peer.
Pipeline parallelism
Each peer holds a contiguous slice of the model. Inference flows through the chain one peer at a time, so the swarm can run models no single machine could hold.
Run 480B on consumer GPUs
Qwen3-Coder 480B, Qwen3 235B, Llama 405B, Mixtral 8x22B, Falcon 180B — models no single machine can hold. Spread them across 4, 8, 20 peers.
Member controls
Live peer list, invite status, swarm health. Kick peers, revoke and re-issue invites — from the CLI or the live pals dash TUI.
Qwen3-Coder 480B, across your team's GPUs.
Qwen 3, Qwen 3-Coder, Qwen 2.5, Qwen 2.5-Coder, Llama, Mixtral, Falcon, and BLOOM families all work out of the box. More architectures land as we add them.
HuggingFace model IDs work directly — just pass the id to pals create.
Qwen 3-Coder (Alibaba)
- 30B-A3B
- 480B-A35B
Qwen 3 (Alibaba)
- 0.6B
- 1.7B
- 4B
- 8B
- 14B
- 32B
- 30B-A3B
- 235B-A22B
Qwen 2.5-Coder (Alibaba)
- 0.5B
- 1.5B
- 3B
- 7B
- 14B
- 32B
Qwen 2.5 (Alibaba)
- 0.5B
- 1.5B
- 3B
- 7B
- 14B
- 32B
- 72B
Llama (Meta)
- 2 70B
- 3 8B
- 3 70B
- 3.1 8B
- 3.1 70B
- 3.1 405B
- 3.3 70B
Mixtral (Mistral)
- 8x7B
- 8x22B
Falcon (TII)
- 40B
- 180B
BLOOM (BigScience)
- 176B
We tell you exactly what the trust model is.
Only invite people you trust.
We are not a public network. There is no swarm discovery, no stranger prompts, no content moderation queue. Your swarm is exactly the people you sent the link to.
Activations are encrypted in transit.
We derive a 256-bit AES-GCM key from your swarm's shared secret via HKDF-SHA256. Tensors are encrypted before leaving a peer and decrypted on arrival. The key is computed client-side and never leaves member machines.
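The key derivation step can be sketched with the Python standard library alone. The function below follows HKDF as specified in RFC 5869 with SHA-256; the example secret, the empty salt, and the `info` label are placeholders, since this page does not document the values ProgressPals actually uses. The AES-256-GCM encryption itself is left out because it needs a third-party crypto library.

```python
import hashlib
import hmac


def hkdf_sha256(ikm, salt=b"", info=b"", length=32):
    """HKDF (RFC 5869) with SHA-256: extract, then expand to `length` bytes."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA256 output is limited to 8160 bytes")
    # Extract: an empty salt defaults to a block of zero bytes, per the RFC.
    prk = hmac.new(salt or bytes(32), ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Hypothetical inputs: the real secret format, salt, and info label are assumptions.
swarm_secret = b"example-shared-secret"
key = hkdf_sha256(swarm_secret, info=b"activation-encryption", length=32)
print(len(key) * 8)  # prints: 256
```

Because every member runs the same derivation on the same shared secret, each machine arrives at the same 32-byte key without the key itself ever being transmitted.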
What we do not pretend.
P2P inference exposes IP addresses to other swarm members. The first peer in the chain sees decrypted inputs. We sandbox computation where the OS allows it, but this is not a hardware enclave. Use a VPN if the threat model demands it.
Free. The whole thing.
No paid tier yet. We will add one based on what teams actually ask for — not before.
Everything. No card. No usage caps.
- Private swarms — invite-only, no public discovery
- Single-use invite tokens, revocable, expirable
- Encrypted activations (AES-256-GCM, HKDF-derived)
- Member list, kick, status — from the CLI
- Full CLI surface (init, create, join, serve, dash + more)
- OpenAI-compatible local endpoint (pals serve)
- Live read-only TUI dashboard (pals dash)
- Account-backed invite verification + allow-list
Questions teams ask before they install.
If yours is not here, the answer is probably either in how it works or in the trust model.
What is ProgressPals?
Private, peer-to-peer AI inference. You and a small group of trusted people pool your hardware over the internet to run large open-source models that no single machine could host on its own. One CLI, one invite link, one local OpenAI-compatible endpoint.
What models can my swarm run?
Qwen 3 (0.6B–32B dense + 30B-A3B / 235B-A22B MoE), Qwen 3-Coder (30B-A3B and 480B-A35B), Qwen 2.5 (0.5B–72B), Qwen 2.5-Coder (0.5B–32B), Llama 2 / 3 / 3.1 / 3.3 up to 405B, Mixtral 8x7B and 8x22B, Falcon 40B and 180B, BLOOM 176B. Pass any supported HuggingFace model ID directly to pals create.
Can my team use it with Cursor, Aider, or our agent framework?
Yes. pals serve exposes an OpenAI-compatible endpoint at http://localhost:8080/v1. Set that URL as the OpenAI base URL in Cursor, Cline, Roo Code, Continue, Aider, Zed, OpenClaw, Open WebUI, n8n, LangChain, LlamaIndex, AutoGen, CrewAI, the Vercel AI SDK, or anything else built on the OpenAI SDK, with no code changes.
Who can see my prompts?
The first peer in your chain decrypts your input to run their layers — that is how transformer inference works at all. Activations between subsequent peers are encrypted with a per-swarm AES-256-GCM key derived from your swarm's shared secret via HKDF. The trust model is simple and honest: only invite people you would trust to see your prompts.
Why private swarms only?
Public AI swarms create content moderation queues, expose users to stranger prompts, and pile on legal liability. Removing public swarms removes all three. You only run computation for (and decrypt inputs from) people you actually invited.
Do I need a GPU?
Strongly recommended. Each peer's contribution scales with how many model layers their VRAM can hold. CPU-only peers can technically join, but throughput will be slow enough that you probably want at least one consumer GPU per peer.
Does it work on Apple Silicon (M1, M2, M3, M4)?
Yes. Apple Silicon Macs can join any swarm and contribute layers via PyTorch's Metal path. Per-pal throughput is lower than on equivalent NVIDIA hardware, so a Mac is often best as one pal in a mixed swarm or as a client running pals serve.
How many peers do I need for a big model?
It depends on the model and how aggressively it is quantized, but the rule is intuitive: more layers in the model, or less VRAM per peer, means more peers. Each peer can host as many layers as fit on its device (configurable per peer with --num-blocks).
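The arithmetic behind that rule can be sketched as a greedy split of the model's blocks into contiguous ranges, one per peer. This is an illustration only, not the CLI's actual placement logic; the function name `assign_blocks` and the block counts are invented for the example.

```python
def assign_blocks(total_blocks, capacities):
    """Greedily give each peer a contiguous [start, end) range of blocks.

    `capacities` lists how many blocks each peer can hold, in chain order.
    Raises ValueError if the peers cannot cover the whole model.
    """
    ranges, start = [], 0
    for cap in capacities:
        if start >= total_blocks:
            break  # model already fully covered; remaining peers are spare
        if cap <= 0:
            continue  # a peer with no room hosts nothing
        end = min(start + cap, total_blocks)
        ranges.append((start, end))
        start = end
    if start < total_blocks:
        raise ValueError(f"{total_blocks - start} blocks left unhosted; add peers")
    return ranges


# Illustrative numbers only: an 80-block model across four peers.
print(assign_blocks(80, [24, 24, 16, 20]))
# prints: [(0, 24), (24, 48), (48, 64), (64, 80)]
```

If the same model met peers that could hold only 10 blocks each, it would take eight of them, which is the "less VRAM per peer means more peers" half of the rule.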
Is it really free?
Yes. No paid tier yet. We will add one when we have real signal from teams about what is worth charging for — not before.
Start your first swarm in under five minutes.
Linux and macOS. NVIDIA, Apple Silicon, or CPU-only.