Technical tour

How ProgressPals works.

A short technical tour. The architecture, the full CLI, the OpenAI-compatible endpoint, and exactly what is encrypted on the way through your pals.

Get started View documentation

Commands

Endpoint

localhost:8080/v1

Encryption

AES-256-GCM · HKDF

Largest model

Qwen3-Coder 480B

Quickstart

From zero to swarm
in three commands.

Install the CLI, run init, create a swarm for the model you want, then send the invite link to your pals. That is the whole setup.

1.Install the Python package and detect your hardware.
2.Pick a model. The CLI claims the layers your machine can hold and prints an invite link.
3.Share the link. Each pal runs pals join and starts contributing.

quickstart

$pip install progresspals

$pals init

✓ Config written to ~/.config/progresspals/config.json (0600)

$pals create Qwen/Qwen3-32B

✓ Starting NEW swarm · this pal holds layers 0–14

multiaddr: /ip4/HOST/tcp/PORT/p2p/PEER_ID

$pals invite create

⟶ Invite token (shown once):

pp_inv_a8c4f2e9b1d6a7c3…

Architecture

Pipeline parallelism, across your pals.

A modern open-source model is a tall stack of transformer layers — eighty, a hundred and twenty-six, more. The full weights are far too large to fit in any single consumer GPU.

Distributed inference splits the stack. Each pal holds a contiguous slice. Your input flows through the chain: every pal computes only its own layers, hands the activations to the next pal, and so on until the final output emerges.

The win is the point: a model whose weights total 200 GB can run across a team whose machines each hold a fraction of that, as long as enough pals cover the layers between them.

input prompt

↓

alice@studio

layers 0–14·24 GB VRAM

↓encrypted activations↓

ben@office

layers 15–47·16 GB VRAM

↓encrypted activations↓

casey@rig

layers 48–86·48 GB VRAM

↓

streamed response

Lifecycle

The five-command flow.

Create your local config.

Writes ~/.config/progresspals/config.json with 0600 permissions. Your libp2p identity key (the thing that uniquely identifies your machine to the swarm) gets generated on the first server start.

$ pals init

Mint a single-use invite.

Operator-side. Generates an invite token your pals will redeem. Configure --max-uses and --expires-hours to scope how widely it can be used.

$ pals invite create

Start hosting layers.

Brings up a server holding a slice of the model. Prints a multiaddr — share it with joiners along with the invite token so they know how to reach you.

$ pals create <huggingface-id>

Your pals join.

Each pal redeems the invite with pals login, then joins your swarm using the multiaddr. They download only the layer slice they're assigned — much smaller than the full model.

$ pals join <model> --peer <maddr>

Expose the swarm to your tools.

Starts a local HTTP server at http://localhost:8080/v1 that speaks the OpenAI wire format with SSE streaming. Every tool that already talks to OpenAI now talks to your swarm.

$ pals serve

CLI reference

Every command pals can run.

Eleven commands, grouped by what they actually do. No daemons, no shadow CLI surface.

Setup

$ pals init

Create the local config directory at ~/.config/progresspals. The libp2p identity key is generated on the first server start.

$ pals swarm create --name <name>

Operator-side. Register a new swarm so you can mint invites and manage members.

Invites & membership (operator)

$ pals invite create [--max-uses N] [--expires-hours H]

Mint a fresh invite token. Shown once — share it via a secure channel.

$ pals invite list / revoke / resend

Manage outstanding invites. List active and historical, revoke unused, re-display an existing token.

$ pals peers list

Show redeemed peers — active and revoked.

$ pals peers kick <peer-id>

Remove a peer from the swarm allow-list. Alias of pals peers revoke.

Join (peer)

$ pals login --invite-token <token>

Redeem an invite. Stores a per-peer credential locally.

$ pals join <model> --peer <multiaddr>

Bring your machine into an existing swarm. Downloads only your assigned layer slice.

Run & inspect

$ pals create <model>

Start a server hosting a NEW swarm. The first server in any cluster.

$ pals serve [--host 0.0.0.0 --api-key …]

Start the local OpenAI-compatible HTTP endpoint on port 8080 (default loopback-only).

$ pals dash

Live read-only TUI dashboard: peers, invites, status.

$ pals status / pals list

Print local config + identity state, or list locally cached HuggingFace models.

The endpoint

One server, every OpenAI tool.

pals serve exposes the swarm as a standard OpenAI-compatible HTTP server at http://localhost:8080/v1. The wire format is identical — same request, same response, same SSE streaming — so nothing in your stack has to know the difference.

Before · OpenAI

from openai import OpenAI

client = OpenAI(
  base_url="https://api.openai.com/v1",
  api_key=OPENAI_API_KEY,
)

client.chat.completions.create(
  model="gpt-4o",
  messages=[...],
  stream=True,
)

After · ProgressPals

from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="any-string",
)

client.chat.completions.create(
  model="Qwen/Qwen3-Coder-480B-A35B-Instruct",
  messages=[...],
  stream=True,
)

Endpoints exposed

POST/v1/chat/completions

SSE streaming · OpenAI chat shape

GET/v1/models

Returns the swarm's current model

Security

What is encrypted, what is stored, what is not.

Per-swarm AES-256-GCM key

Each swarm's 256-bit AES key is derived from a shared swarm secret via HKDF-SHA256. The key is computed client-side and never leaves member machines.

Encrypted activations

Activation tensors are encrypted before being sent to the next pal in the chain and decrypted on arrival. Anyone in between sees ciphertext.

Supabase stores only a hash

The backend holds accounts, swarm metadata, the member list, and the per-swarm shared secret (so it can be handed to invited peers when they redeem). Not prompts, not weights, not activations, not the AES key itself — that's derived on each peer's machine via HKDF and never transmitted.

Per-hop integrity via the auth tag

Because activations travel inside AES-GCM, a pal returning garbage would have to forge a valid 128-bit auth tag without the swarm key. They can't — corruption is detected before the next layer runs and the request reroutes.

Honest about the trust model

The first pal in your chain decrypts your input to run their layers — that is how transformer inference works at all, and no amount of cryptography changes it without a hardware enclave. The simple rule is therefore the right rule: only invite pals you would trust to see your prompts.

Hardware

What you bring to the swarm.

Operating system

Linux or macOS

Standard Python 3 environment. No special drivers beyond what your GPU already needs.

Compute

A consumer GPU

VRAM is the limiter — more VRAM, more layers per pal. NVIDIA boxes run fastest. Apple Silicon (M1+) joins and contributes too, just at lower per-pal throughput. CPU-only joins technically work, slowly.

Configuration

Auto-sized to your hardware

By default, each pal hosts as many transformer blocks as fits its device. Override with --num-blocks if you want to tune contribution by hand.

That is the whole product.

Eleven commands, one local endpoint, encrypted activations, and the pals you actually trust.