An end-to-end run
Take a list of URLs from raw seed to a queryable WARC archive: prepare the seed, crawl it, and read the results back.
This guide walks a seed all the way through: from however your URLs arrive, to a crawl, to answering questions about what came back. It ties together the seed formats, tuning, and sharding guides into one path.
1. Get your URLs into a seed
A seed is just a list of URLs. ami reads four shapes and infers the format from the path, so most of the time there is nothing to convert:
- a text file, one URL per line (urls.txt)
- newline-delimited JSON with a url field per line (seed.jsonl)
- a Parquet file with a url column (seed.parquet)
- an XML sitemap, local or remote (https://example.com/sitemap.xml)
If a producer hands you URLs alongside other context, reach for JSONL or Parquet.
Any field beyond url rides along into the capture index, so producer metadata survives the crawl:
{"url": "https://example.com/a", "source": "frontier", "depth": 1}
{"url": "https://example.com/b", "source": "frontier", "depth": 2}
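If the producer is a script, a JSONL seed like the one above can be written with the standard library alone. This is a minimal sketch: the helper name write_seed is made up for illustration, and the only thing it assumes about ami is what this guide states, namely one JSON object per line with a url field.

```python
import json
from pathlib import Path


def write_seed(records, path):
    """Write one JSON object per line; every record must carry a 'url'."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for rec in records:
            if "url" not in rec:
                raise ValueError(f"record without url: {rec!r}")
            # Extra keys (source, depth, ...) are written through unchanged.
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


records = [
    {"url": "https://example.com/a", "source": "frontier", "depth": 1},
    {"url": "https://example.com/b", "source": "frontier", "depth": 2},
]
write_seed(records, "seed.jsonl")
```

Rejecting a record without a url at write time surfaces a producer bug before the crawl starts rather than during it.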
Two fields are special.
A digest is the SHA-1 of a body from a previous capture; supplying it lets ami detect unchanged content and record a revisit rather than a second copy.
Everything else is kept verbatim in meta_json.
See seed formats for the full rules.
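For a producer that already holds a previously captured body, the digest can be computed with Python's hashlib. This sketch is hedged on one point: this guide does not say which text encoding ami expects for the digest, so the function returns both the hex form and the Base32 form that WARC headers commonly use. Treat the choice between them as an assumption to check against the seed formats guide.

```python
import base64
import hashlib


def sha1_digests(body: bytes) -> dict:
    """Return the SHA-1 of a response body in two common text encodings."""
    raw = hashlib.sha1(body).digest()  # 20 raw bytes
    return {
        "hex": raw.hex(),
        # Base32 is the encoding often seen in WARC payload-digest headers.
        "base32": base64.b32encode(raw).decode("ascii"),
    }


d = sha1_digests(b"hello")
print(d["hex"])     # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
print(d["base32"])
```

Hash the body bytes exactly as they were captured; hashing decoded text instead would produce a different digest.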
2. Crawl it
Point ami at the seed:
ami crawl seed.parquet -o run-out
That fetches every URL with the default 2000 workers and writes the results under run-out.
On a laptop, or against a handful of hosts, ease off so you are not rude or rate-limited:
ami crawl seed.parquet -o run-out --workers 200 --per-host 4
On a well-provisioned box, push harder:
ami crawl seed.parquet -o run-out --workers 4000 --transport-shards 128
The tuning guide covers every knob on the throughput-versus-politeness axis. While the crawl runs, ami prints a live line showing pages per second, bytes per second, and the status breakdown.
3. See what came back
Every run writes WARC files plus a captures.parquet index under -o.
The fastest look needs no other tool:
ami inspect run-out/captures.parquet -n 20
For anything real, the index is an ordinary Parquet file. DuckDB answers the usual questions directly:
-- how did the run go?
SELECT status, count(*) AS n
FROM 'run-out/captures.parquet'
GROUP BY status ORDER BY n DESC;
-- what failed, and why?
SELECT url, error FROM 'run-out/captures.parquet' WHERE error <> '';
The bytes themselves live in the WARC files, in the standard format, so they open in any WARC tool.
Each index row carries a warc_file, warc_offset, and warc_length that point straight at the response record, so you never scan the archive to find one capture.
The configuration reference documents every column.
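To make the offset-and-length lookup concrete, here is a minimal sketch of pulling one record out of a WARC file with nothing but the standard library. The function name read_record is hypothetical. The sketch also assumes that, when the archive is gzip-compressed, each record is its own gzip member (the usual WARC convention), so the slice can be decompressed on its own; whether ami compresses its output is not stated here.

```python
import gzip
from pathlib import Path


def read_record(warc_path, offset: int, length: int) -> bytes:
    """Return the bytes of one WARC record given its index coordinates."""
    with Path(warc_path).open("rb") as fh:
        fh.seek(offset)          # jump straight to the record
        chunk = fh.read(length)  # read only that record's bytes
    # Assumption: a compressed archive stores one gzip member per record.
    # Gzip data starts with the magic bytes 0x1f 0x8b.
    if chunk[:2] == b"\x1f\x8b":
        return gzip.decompress(chunk)
    return chunk
```

Given an index row, the call would look like read_record(path_to_warc, row["warc_offset"], row["warc_length"]), where path_to_warc is resolved from warc_file. How that value relates to the output directory is an assumption to verify against the configuration reference.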
4. Retry just the failures
Because a failed fetch still produces a row (status 0, with the reason in error), the index is a complete record of the run.
Select the rows that need another attempt, write them back out as a seed, and crawl that smaller list:
duckdb -noheader -csv \
-c "SELECT url FROM 'run-out/captures.parquet' WHERE error <> ''" \
> retry.txt
ami crawl retry.txt -o run-out --run-id retry --mode polite
--run-id retry keeps the retry's output beside the first pass under the same directory, and --mode polite gives the stragglers a browser-like header set that a bot-detection WAF is less likely to block.
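One caveat with exporting the retry list as CSV: CSV writers quote any value that contains a comma or a double quote, so a URL with a comma in its query string may land in retry.txt wrapped in quotes. The sketch below is an optional cleanup pass, not part of ami; the helper name clean_url_list is made up. It parses the export as CSV, drops blank lines, and removes duplicates while keeping order.

```python
import csv
import io


def clean_url_list(csv_text: str) -> list[str]:
    """Parse a one-column CSV export into plain, de-duplicated URLs."""
    seen = set()
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not row[0].strip():
            continue  # skip blank lines
        url = row[0].strip()
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


exported = 'https://example.com/a\n"https://example.com/b?x=1,2"\nhttps://example.com/a\n'
print("\n".join(clean_url_list(exported)))
```

Writing the returned list back out one URL per line yields a text seed in the first of the four shapes described at the top of this guide.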
Scaling out
When one machine is not enough, the same seed splits across a fleet: each process takes --shard i --shards N and crawls its slice.
The sharding guide covers the partitioning and how the per-machine indexes union back into one logical run.
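As an illustration of what "each process crawls its slice" can mean, here is one common partitioning scheme: a stable hash of the URL, taken modulo the shard count. This is not a description of ami's actual partitioning, which the sharding guide documents; the function names are hypothetical and the scheme is an assumption chosen for clarity.

```python
import hashlib


def shard_of(url: str, shards: int) -> int:
    """Map a URL to a shard number in [0, shards) using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process and would disagree between machines.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def slice_for(urls, i: int, shards: int) -> list:
    """Return the URLs that shard i of `shards` is responsible for."""
    return [u for u in urls if shard_of(u, shards) == i]


urls = [f"https://example.com/page/{n}" for n in range(100)]
slices = [slice_for(urls, i, 4) for i in range(4)]
print([len(s) for s in slices])
```

Because the assignment depends only on the URL and the shard count, every process can compute its slice independently, and each URL is assigned to exactly one shard.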