How ProgressPals works.
A short technical tour. The architecture, the full CLI, the OpenAI-compatible endpoint, and exactly what is encrypted on the way through your pals.
From zero to swarm in three commands.
Install the CLI, run init, create a swarm for the model you want, then send the invite link to your pals. That is the whole setup.
1. Install the Python package and detect your hardware.
2. Pick a model. The CLI claims the layers your machine can hold and prints an invite link.
3. Share the link. Each pal runs pals join and starts contributing.
Pipeline parallelism, across your pals.
A modern open-source model is a tall stack of transformer layers — eighty, a hundred and twenty-six, more. The full weights are far too large to fit in any single consumer GPU.
Distributed inference splits the stack. Each pal holds a contiguous slice. Your input flows through the chain: every pal computes only its own layers, hands the activations to the next pal, and so on until the final output emerges.
The payoff: a model whose weights total 200 GB can run across a team whose machines each hold only a fraction of that, as long as the pals together cover every layer.
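The splitting described above can be sketched in a few lines of Python. This is a toy illustration, not the ProgressPals scheduler: the names `assign_slices` and `run_chain` are hypothetical, layers are plain functions, and each pal's capacity is given as a number of layers rather than derived from real hardware.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def assign_slices(num_layers: int, capacities: Sequence[int]) -> List[Tuple[int, int]]:
    """Give each pal a contiguous [start, end) slice of layers, in chain order.

    `capacities` is a hypothetical per-pal layer budget.
    """
    slices = []
    start = 0
    for cap in capacities:
        if start >= num_layers:
            break  # every layer is already covered
        end = min(start + cap, num_layers)
        slices.append((start, end))
        start = end
    if start < num_layers:
        raise ValueError("the pals together cannot cover every layer")
    return slices


def run_chain(layers: Sequence[Layer], slices: Sequence[Tuple[int, int]], x: float) -> float:
    """Pass the activation through each pal's slice in turn."""
    for start, end in slices:
        for layer in layers[start:end]:  # this pal computes only its own layers
            x = layer(x)
        # here the activation would be handed to the next pal over the network
    return x
```

Running the chain over the slices gives the same result as running all layers on one machine, which is the property that makes the split safe.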
The five-command flow.
Create your local config.
Running pals init writes ~/.config/progresspals/config.json with 0600 permissions (readable and writable only by your user). Your libp2p identity key, which uniquely identifies your machine to the swarm, is generated on the first server start.
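For readers unfamiliar with 0600 permissions, here is a minimal sketch of writing a JSON config that only the owning user can read or write. The function name and the config contents are hypothetical; this shows the general technique on Linux or macOS, not the CLI's actual code.

```python
import json
import os
from pathlib import Path


def write_private_config(path: Path, config: dict) -> None:
    """Write `config` as JSON to `path` with owner-only (0600) permissions."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0600 from the start, so it is never briefly
    # readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(config, f, indent=2)
    # If the file already existed with a looser mode, tighten it.
    os.chmod(path, 0o600)
```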
Mint a single-use invite.
Operator-side. pals invite create generates an invite token your pals will redeem. Use --max-uses and --expires-hours to limit how many times, and for how long, it can be redeemed.
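The way --max-uses and --expires-hours scope an invite can be modelled roughly as follows. This is an illustrative sketch only: the `Invite` class, its fields, and `mint_invite` are hypothetical and say nothing about how the backend actually stores or checks invites.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Invite:
    max_uses: int
    expires_at: datetime
    uses: int = 0
    revoked: bool = False

    def redeem(self, now: datetime) -> bool:
        """Consume one use if the invite is still valid at time `now`."""
        if self.revoked or now >= self.expires_at or self.uses >= self.max_uses:
            return False
        self.uses += 1
        return True


def mint_invite(max_uses: int, expires_hours: float, now: datetime) -> Invite:
    """Create an invite valid for `expires_hours` and at most `max_uses` redemptions."""
    return Invite(max_uses=max_uses, expires_at=now + timedelta(hours=expires_hours))
```

A single-use invite is then just `mint_invite(max_uses=1, ...)`: the first redemption succeeds and every later one is refused.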
Start hosting layers.
pals create <model> brings up a server holding a slice of the model and prints a multiaddr. Share the multiaddr with joiners along with the invite token so they know how to reach you.
Your pals join.
Each pal redeems the invite with pals login, then joins your swarm using the multiaddr. They download only the layer slice they're assigned — much smaller than the full model.
Expose the swarm to your tools.
pals serve starts a local HTTP server at http://localhost:8080/v1 that speaks the OpenAI wire format with SSE streaming. Any tool that lets you point it at a custom OpenAI base URL can now talk to your swarm.
Every command pals can run.
Eleven commands, grouped by what they actually do. No daemons, no shadow CLI surface.
Setup
$ pals init
Create the local config directory at ~/.config/progresspals. The libp2p identity key is generated on the first server start.
$ pals swarm create --name <name>
Operator-side. Register a new swarm so you can mint invites and manage members.
Invites & membership (operator)
$ pals invite create [--max-uses N] [--expires-hours H]
Mint a fresh invite token. Shown once; share it via a secure channel.
$ pals invite list / revoke / resend
Manage outstanding invites. List active and historical, revoke unused, re-display an existing token.
$ pals peers list
Show redeemed peers, both active and revoked.
$ pals peers kick <peer-id>
Remove a peer from the swarm allow-list. Alias of pals peers revoke.
Join (peer)
$ pals login --invite-token <token>
Redeem an invite. Stores a per-peer credential locally.
$ pals join <model> --peer <multiaddr>
Bring your machine into an existing swarm. Downloads only your assigned layer slice.
Run & inspect
$ pals create <model>
Start a server hosting a new swarm. This is the first server in any cluster.
$ pals serve [--host 0.0.0.0 --api-key …]
Start the local OpenAI-compatible HTTP endpoint on port 8080 (loopback-only by default).
$ pals dash
Live read-only TUI dashboard: peers, invites, status.
$ pals status / pals list
Print local config and identity state, or list locally cached HuggingFace models.
One server, every OpenAI tool.
pals serve exposes the swarm as a standard OpenAI-compatible HTTP server at http://localhost:8080/v1. The wire format is identical — same request, same response, same SSE streaming — so nothing in your stack has to know the difference.
Before · OpenAI
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"],  # read the key from the environment
)
client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True,
)
After · ProgressPals
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="any-string",
)
client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    messages=[...],
    stream=True,
)
Endpoints exposed
SSE streaming · OpenAI chat shape
Returns the swarm's current model
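To make "SSE streaming in the OpenAI chat shape" concrete, the sketch below parses a stream of `data:` lines the way an OpenAI-style client would, using only the standard library. The helper name `iter_sse_deltas` is hypothetical. It assumes the server emits one JSON chunk per `data:` line and finishes with `data: [DONE]`, as OpenAI's own streaming responses do.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style chat completion chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

In practice the official client library does this parsing for you; the sketch only shows what travels over the wire.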
What is encrypted, what is stored, what is not.
Per-swarm AES-256-GCM key
Each swarm's 256-bit AES key is derived from a shared swarm secret via HKDF-SHA256. The key is computed client-side and never leaves member machines.
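The derivation step can be sketched with the standard library alone. The function below follows the HKDF construction from RFC 5869 with SHA-256. The salt and info values in the usage note are placeholders chosen for illustration, since the document does not say which ones ProgressPals uses.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_sha256(secret: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Derive `length` bytes from `secret` using HKDF (RFC 5869) with SHA-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: condense the input secret into a fixed-size pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, secret, hashlib.sha256).digest()
    # Expand: stretch the pseudorandom key to the requested length.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]
```

For example, `hkdf_sha256(shared_secret, salt=b"", info=b"example-swarm-key")` returns a 32-byte value suitable as an AES-256 key. Every member holding the same secret derives the same key locally, so the key itself never has to be sent.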
Encrypted activations
Activation tensors are encrypted before being sent to the next pal in the chain and decrypted on arrival. Anyone in between sees ciphertext.
What Supabase stores
The backend holds accounts, swarm metadata, the member list, and the per-swarm shared secret (so it can be handed to invited peers when they redeem). Not prompts, not weights, not activations, not the AES key itself — that's derived on each peer's machine via HKDF and never transmitted.
Per-hop integrity via the auth tag
Because activations travel inside AES-GCM, a pal returning garbage would have to forge a valid 128-bit auth tag without the swarm key. They can't — corruption is detected before the next layer runs and the request reroutes.
Honest about the trust model
The first pal in your chain decrypts your input to run their layers — that is how transformer inference works at all, and no amount of cryptography changes it without a hardware enclave. The simple rule is therefore the right rule: only invite pals you would trust to see your prompts.
What you bring to the swarm.
Linux or macOS
Standard Python 3 environment. No special drivers beyond what your GPU already needs.
A consumer GPU
VRAM is the limiter — more VRAM, more layers per pal. NVIDIA boxes run fastest. Apple Silicon (M1+) joins and contributes too, just at lower per-pal throughput. CPU-only joins technically work, slowly.
Auto-sized to your hardware
By default, each pal hosts as many transformer blocks as fit on its device. Override with --num-blocks if you want to tune your contribution by hand.
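As a rough illustration of auto-sizing, the number of blocks a device can host is its usable memory divided by the memory one block needs. The helper names, the reserve parameter, and the figures in the usage note are assumptions for the sketch, not the CLI's real heuristic.

```python
from typing import Optional


def blocks_that_fit(free_bytes: int, bytes_per_block: int, reserve_bytes: int = 0) -> int:
    """Return how many whole blocks fit after setting aside `reserve_bytes`."""
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    usable = free_bytes - reserve_bytes
    return max(0, usable // bytes_per_block)


def resolve_num_blocks(auto: int, total_blocks: int, override: Optional[int] = None) -> int:
    """Prefer a manual override (as with --num-blocks), capped at the model's size."""
    chosen = auto if override is None else override
    return max(0, min(chosen, total_blocks))
```

With 24 GiB free, 4 GiB held back for activations and caches, and 2 GiB per block, `blocks_that_fit` gives 10 blocks.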
That is the whole product.
Eleven commands, one local endpoint, encrypted activations, and the pals you actually trust.