# rewget — full content for LLM ingestion

> rewget: wget, but it works everywhere. A drop-in wget replacement that automatically bypasses bot protection. Built in Rust by neul-labs. Dual-licensed under MIT or Apache-2.0.

## What rewget is

rewget is a thin CLI shim around wget (and a sidecar daemon, rewgetd) that intercepts the cases where wget alone fails: 403 Forbidden from bot detection, CAPTCHA challenge pages, 429 rate limits, and "works in a browser but not in wget" pages. When stage 1 (plain wget) fails, stage 2 retries the same URL with browser-grade TLS and HTTP/2 fingerprints via rquest. If that still fails, stage 3 launches headless Chromium to solve JavaScript challenges. Domain-level results are cached for seven days.

All standard wget options work unchanged. The new behaviour is exposed under `--rewget-*` flags so existing scripts don't break.

## Install

- Homebrew: `brew install neul-labs/tap/rewget` (macOS and Linux)
- npm: `npm install -g rewget` (macOS and Linux)
- PyPI: `uv tool install rewget` or `pip install rewget` (macOS and Linux)
- Cargo: `cargo install rewget` (from source)
- Install script: `curl -fsSL https://raw.githubusercontent.com/neul-labs/rewget/main/scripts/install.sh | sh`

Platform support today: Linux and macOS. See the README badges for the canonical platform list.

## How it works

Stage 1 — wget. Fast, zero overhead. If the download succeeds, you get the exact same output you would have got from wget itself.

Stage 2 — Impersonate. rewgetd retries the same URL through rquest, using a Chrome, Firefox, Safari, or Edge TLS and HTTP/2 fingerprint. Profiles are versioned and Ed25519-verified; `--rewget-update-profiles` fetches new ones.

Stage 3 — JS Preflight. rewgetd launches headless Chromium via chromiumoxide. The browser solves the challenge, exports cookies and headers, and stage 2 retries with that session.

Cache: per-domain, 7-day TTL. Subsequent requests skip straight to the working stage.

## Flags

- `--rewget-no-fallback` — disable automatic retry on block
- `--rewget-js` — force JavaScript preflight (Stage 3)
- `--rewget-js-wait=EVENT` — wait condition: load, domcontentloaded, networkidle
- `--rewget-profile=NAME` — use a specific browser profile
- `--rewget-fallback-codes=N,N` — only retry on these HTTP status codes
- `--rewget-engine=wget|wget2` — choose wget engine
- `--rewget-list-profiles` — list available browser profiles
- `--rewget-update-profiles` — fetch latest profiles (Ed25519 verified)
- `--rewget-version` — show rewget version

Profiles currently shipping: Chrome 131/130, Firefox 136/133, Safari 18, Edge 131.

## Architecture

- `rewget`  — thin CLI shim. Parses `--rewget-*` flags, forwards everything else to wget.
- `rewgetd` — daemon, handles stage 2 and stage 3. Manages the browser pool and TLS sessions.
- `rewget-core` — shared library: detection, caching, profile logic.

Stack: Rust with Tokio, rquest for TLS/HTTP2 impersonation, chromiumoxide for headless browser control, nng for CLI-to-daemon IPC, mimalloc for allocation.

## Why rewget

The wget-vs-curl debate is decades old. We don't pick a side. We use both — wget for resumable, recursive downloads, curl for one-shot requests with bodies and headers — and we both still hit the same wall: sites that work in a browser but throw 403s at any HTTP client because the TLS fingerprint or HTTP/2 frame ordering is "wrong".

rewget keeps the wget surface (so your scripts, your aliases, your muscle memory all keep working) and adds the fallback you would otherwise script by hand: detect the block, swap to a fingerprint that matches a real browser, and only escalate to a full browser when the page actually needs JavaScript.

The three-stage design means the common case stays fast. If a site doesn't block, you pay zero overhead — you get wget. If it blocks once, rewget records the working stage and the next download from that domain goes straight to it.

## License & repository

- License: MIT or Apache-2.0 (dual)
- Repo: https://github.com/neul-labs/rewget
- Docs: https://docs.neullabs.com/rewget/
- Issues, contributions: GitHub

## Comparison summary

vs bare wget — same CLI surface, same behaviour on sites that aren't blocking; transparent fallback on the cases wget can't handle.
vs bare curl — different CLI shape. curl is one-shot; wget (and rewget) is download-oriented with resume and recursion. Pick rewget if you already prefer the wget shape and want it to keep working.

## What rewget does not claim

- We do not claim Windows-native support today. Platform badges in the README list Linux and macOS.
- We do not claim to bypass authenticated bot protection. rewget handles fingerprint-based and JavaScript-challenge blocks, not policy-driven access control.
- We do not claim 100% success against every WAF. Domain caching exposes which stage works for which sites.