Cast a net over a list of URLs

ami reads a seed (a list of URLs), re-fetches every one concurrently, and writes the responses to standard WARC files plus a columnar Parquet index that points back into them. One machine, thousands of workers, one self-contained capture.


You have a list of URLs and you want the bytes behind them: a crawl frontier someone else produced, a sitemap, a column lifted out of a dataset. ami (網, "net") takes that list and re-fetches every URL as fast as one machine can sustain, then packs what comes back into WARC files and a Parquet index you can query.

The seed is just a list of URLs. Point ami at a text file and go:

ami crawl urls.txt

What it does

Reads any seed. A text file (one URL per line), newline-delimited JSON with a url field, a Parquet file with a url column, an XML sitemap, or stdin. The same engine drives them all.
Fetches concurrently. Thousands of workers, sharded keep-alive transport pools, and per-host connection caps push the box hard while staying inside polite limits.
Writes standard WARC. Every response lands in a gzipped WARC file, the ISO archival format, so the captures open in any WARC tool, not just ami.
Indexes into Parquet. A captures.parquet carries one row per fetch with the URL, host, status, content type, digest, and a pointer (file, offset, length) back into the WARC, so you can find a response without reopening the archive.
Shards across machines. Hand each process its partition with --shard/--shards and a big seed splits cleanly across a fleet.
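
To make the (file, offset, length) pointer concrete, here is a minimal Python sketch of how such a pointer is typically resolved. It is not ami's code. It assumes each WARC record is compressed as its own gzip member (the usual convention for .warc.gz files), and the field names offset and length are placeholders; the real column names are in the capture index schema. The toy archive is built in memory, and its records omit mandatory WARC headers such as WARC-Record-ID and WARC-Date.

```python
import gzip
import io


def make_record(uri: str, body: bytes) -> bytes:
    """Build a toy WARC-style record and gzip it as its own member.

    Illustration only: a real record also carries WARC-Record-ID, WARC-Date, etc.
    """
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return gzip.compress(header + body + b"\r\n\r\n")


def build_archive(pages):
    """Concatenate gzip members and note where each one starts and how long it is."""
    archive = io.BytesIO()
    index = []
    for uri, body in pages:
        member = make_record(uri, body)
        # Hypothetical index row; ami's actual column names may differ.
        index.append({"url": uri, "offset": archive.tell(), "length": len(member)})
        archive.write(member)
    return archive.getvalue(), index


def read_record(warc_file, offset: int, length: int) -> bytes:
    """Seek straight to one record and decompress only that gzip member."""
    warc_file.seek(offset)
    return gzip.decompress(warc_file.read(length))


def split_record(record: bytes):
    """Split a decompressed record into its header fields and its payload."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")[1:]  # skip the "WARC/1.0" version line
    fields = dict(line.split(": ", 1) for line in lines)
    return fields, rest[: int(fields["Content-Length"])]


archive, index = build_archive(
    [("https://example.com/a", b"first"), ("https://example.com/b", b"second")]
)

# Look up the second capture by its index row, without touching the first record.
row = index[1]
fields, payload = split_record(
    read_record(io.BytesIO(archive), row["offset"], row["length"])
)
print(fields["WARC-Target-URI"], payload)
```

With a real capture, io.BytesIO(archive) would be replaced by the WARC file opened in binary mode, and the row would come from a query against captures.parquet.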

Where to go next

New here? Start with the introduction, then the quick start.
Want to install it? See installation.
Looking for a specific task? The guides cover the seed formats, tuning a crawl for throughput, and sharding a run across machines.
Need every flag? The CLI reference is the full surface.

Getting started. Install ami and crawl your first seed into a WARC archive and a Parquet index in under a minute.
Guides. Task-oriented walkthroughs for the things people actually do with ami: feeding it a seed, tuning it for throughput, and sharding a run across machines.
Reference. The complete ami surface: every command, every flag, the output layout, and the capture index schema.