Tuning a crawl
Use workers, transport shards, and per-host caps to push the box as hard as it will go, or dial it back to be polite.
ami's defaults are sized for a fast machine on a fat pipe: 2000 workers across 64 keep-alive transport pools. That is a lot for a laptop and too much for a single small site. These flags move the crawl along the throughput-versus-politeness axis.
Concurrency
--workers sets how many fetches run at once, and --transport-shards sets how many keep-alive connection pools they spread across (sharding the pool reduces lock contention at high worker counts):
# Ease off on a laptop
ami crawl urls.txt --workers 200 --transport-shards 8
# Push a well-provisioned box
ami crawl urls.txt --workers 4000 --transport-shards 128
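As a rough sanity check, the default and the two example configurations keep a similar ratio of workers to pools. The sketch below only does that integer division in the shell. The `per_shard` helper is hypothetical, and it assumes workers spread evenly across shards, which the text here does not guarantee.

```shell
# Hypothetical helper: integer workers-per-shard, assuming an even spread.
per_shard() {
  echo $(( $1 / $2 ))
}

per_shard 2000 64    # defaults: prints 31
per_shard 200 8      # laptop example: prints 25
per_shard 4000 128   # well-provisioned example: prints 31
```

If you change one flag, scaling the other to stay near that ratio is a plausible starting point rather than a documented rule.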
Per-host limits
A seed list that hammers one host is both rude and slow, because the host is likely to throttle you. Cap the connections any single host gets, and give up on a host that keeps failing so it stops eating worker slots:
ami crawl urls.txt --per-host 4 --domain-fail-threshold 5
--per-host is the ceiling on concurrent connections to one host; --domain-fail-threshold is how many consecutive failures a domain may rack up before ami skips the rest of its URLs.
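To make "consecutive failures" concrete, here is a small shell model of the counting rule. It is not ami's implementation: the list of outcomes is made up, and it assumes that any success resets the counter, which is the usual reading of "consecutive".

```shell
threshold=5      # mirrors --domain-fail-threshold 5
consecutive=0    # failures in a row for this domain
fetched=0
skipped=0

# Made-up fetch outcomes for one domain, in order.
for outcome in ok fail fail ok fail fail fail fail fail ok ok; do
  if [ "$consecutive" -ge "$threshold" ]; then
    # Threshold reached: the rest of the domain's URLs are skipped.
    skipped=$((skipped + 1))
    continue
  fi
  fetched=$((fetched + 1))
  if [ "$outcome" = fail ]; then
    consecutive=$((consecutive + 1))
  else
    consecutive=0    # assumption: a success resets the streak
  fi
done

echo "fetched=$fetched skipped=$skipped"   # prints: fetched=9 skipped=2
```

The two early failures do not count toward the threshold because a success interrupts them; only the later run of five trips it.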
Timeouts and body size
--timeout is the hard ceiling on how long a single request may take.
--max-body caps how many bytes of a response body ami stores, so one giant download cannot blow up a WARC:
# 10-second timeout, 4 MiB (4 x 1024 x 1024 bytes) body cap
ami crawl urls.txt --timeout 10s --max-body 4194304
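Raw byte counts are easy to mistype, so one option is to derive the value with shell arithmetic and pass the variable instead. This sketch only prints the resulting command rather than running it; the `max_body` variable name is illustrative.

```shell
# 4 MiB expressed as bytes.
max_body=$((4 * 1024 * 1024))

# Print the command that would be run, with the computed value substituted.
echo "ami crawl urls.txt --timeout 10s --max-body $max_body"
# prints: ami crawl urls.txt --timeout 10s --max-body 4194304
```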
Header profile
--mode fast (the default) sends a minimal header set for the highest throughput.
--mode polite sends a full browser-like header set, which a bot-detection WAF is less likely to fingerprint and block:
ami crawl urls.txt --mode polite
Unchanged responses
When a seed carries the digest from a prior capture, ami compares it against the SHA-1 of the body it just fetched.
A match means the content is unchanged, so ami records a revisit instead of re-storing the body.
To store the full body every time regardless, pass --store-unchanged:
ami crawl urls.txt --store-unchanged
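The digest comparison described above can be sketched in the shell. This is an illustration of the idea only: the `sha1_hex` helper, the sample bodies, and the hex encoding are assumptions, since this section does not say how ami encodes the digest a seed carries.

```shell
# Hex SHA-1 of stdin; falls back to shasum where sha1sum is missing (e.g. macOS).
sha1_hex() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum | cut -d' ' -f1
  else
    shasum -a 1 | cut -d' ' -f1
  fi
}

# Decide what to record, given the prior digest and the freshly fetched body.
decide() {
  fetched_digest=$(printf '%s' "$2" | sha1_hex)
  if [ "$1" = "$fetched_digest" ]; then
    echo revisit    # unchanged: record a revisit, do not re-store the body
  else
    echo store      # changed: store the full body
  fi
}

prior=$(printf '%s' 'hello' | sha1_hex)   # stand-in for the digest a seed carries

decide "$prior" 'hello'     # prints: revisit
decide "$prior" 'hello!'    # prints: store
```

In this model, passing --store-unchanged corresponds to always taking the `store` branch.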
WARC roll-over
Long runs roll their output across multiple WARC files.
--warc-size sets the target size, in bytes, that a file reaches before ami opens the next one:
# Roll over every 256 MiB instead of the 1 GiB default
ami crawl urls.txt --warc-size 268435456
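For planning, the same arithmetic gives a rough file count for a run. The 10 GiB total below is a hypothetical figure, and because --warc-size is a target rather than a hard limit, treat the result as an estimate.

```shell
# Byte value for a 256 MiB roll-over target.
warc_size=$((256 * 1024 * 1024))
echo "$warc_size"    # prints: 268435456

# Estimate the number of files, working in MiB to keep the numbers small.
total_mib=$((10 * 1024))    # hypothetical 10 GiB of output
warc_mib=256
files=$(( (total_mib + warc_mib - 1) / warc_mib ))    # round up
echo "$files"        # prints: 40
```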