Concepts

How it works.

ProgressPals is a peer-to-peer inference system. The model is split across machines; each machine holds a slice of the layers. Inputs flow through the chain one peer at a time and stream back. Below is the picture in one page.

Pipeline parallelism

A modern open-source model is a deep stack of transformer blocks — 32 for an 8B model, 80 for a 70B model, 126 for a 405B model. Qwen 3 dense ranges from 28 layers (1.7B) to 64 layers (32B); the Qwen 3-Coder 480B-A35B MoE has 62 layers, each with 160 experts. The full weights are usually too large for a single consumer GPU.

ProgressPals splits the stack across peers. Each peer holds a contiguous slice. An inference request enters at the first peer, which runs its layers, encrypts the output activations, and hands them to the next peer in the chain. The final peer streams the output back to the requester.

This is the same pattern used by training-time pipeline parallelism, applied to inference. The trade-off is latency (one round-trip per pipeline stage) for the ability to run models no single machine could hold.

Peer discovery

Peers connect via libp2p. There’s no central coordinator running the show: each peer joins the swarm’s DHT namespace and advertises which model blocks it serves. A client request walks the DHT to find a covering chain.

Connections are direct (or NAT-traversed). There are no public relays in front of the swarm — the connection is between you and the peer you’re talking to.

Membership & auth

Swarms are invite-only. Membership lives in the ProgressPals backend as a per-swarm allow-list of peer IDs. Servers refresh the allow-list every ~30 seconds and reject RPCs from anyone not on it.

Two layers of identity

peer_id — derived from your machine’s libp2p identity key. Cryptographically verified by the libp2p Noise handshake. Used by servers to enforce the allow-list.
peer_credential — opaque receipt returned when you redeem an invite. Stored locally as proof of membership; not presented per-RPC.

Activation encryption

Activation tensors between peers are encrypted under AES-256-GCM. The 256-bit key is derived (via HKDF-SHA256) from the per-swarm secret. The key never leaves member machines; the backend stores only a hash sufficient for invite verification.

Encryption is automatic. Once your config has a swarm_secret (placed there by pals login or pals swarm create), the CLI sets the right env var when launching the client/server. You don’t pass keys around manually.

The OpenAI-compatible server

pals serve runs an HTTP server on your machine that exposes the standard OpenAI endpoints: /v1/chat/completions (streaming + non-streaming) and /v1/models. It translates each request into a ProgressPals inference RPC and walks the activation chain across the swarm. To the client, it looks identical to OpenAI’s API.

The server defaults to 127.0.0.1. Public binding (anything non-loopback) requires --api-key; the CLI refuses to start an unauthenticated public endpoint.

Local model cache

Each peer downloads only the layer slice it is assigned. Files live in the standard HuggingFace cache directory (controlled by HF_HUB_CACHE or HF_HOME). Two peers with the same model never need to download the same blocks.

Next steps

Concepts

Security & trust

What the architecture protects against, and what it doesn't.

Get Started

Quickstart

Run the architecture above on your own machines.

Reference

CLI reference

All 11 commands, organized by intent.

CLI

pals serve

The OpenAI-compatible endpoint, in detail.