ami

Cast a net over a list of URLs

ami reads a seed (a list of URLs), re-fetches every one concurrently, and writes the responses to standard WARC files plus a columnar Parquet index that points back into them. One machine, thousands of workers, one self-contained capture.

You have a list of URLs and you want the bytes behind them: a crawl frontier someone else produced, a sitemap, a column lifted out of a dataset. ami (網, "net") takes that list and re-fetches every URL as fast as one machine can sustain, then packs what comes back into WARC files and a Parquet index you can query.

The seed is just a list of URLs. Point ami at a text file and go:

ami crawl urls.txt

What it does

  • Reads any seed. A text file (one URL per line), newline-delimited JSON with a url field, a Parquet file with a url column, an XML sitemap, or stdin. The same engine drives them all.
  • Fetches concurrently. Thousands of workers, sharded keep-alive transport pools, and per-host connection caps push the box hard while staying inside polite limits.
  • Writes standard WARC. Every response lands in a gzipped WARC file, the ISO 28500 archival format, so the captures open in any WARC tool, not just ami.
  • Indexes into Parquet. A captures.parquet carries one row per fetch with the URL, host, status, content type, digest, and a pointer (file, offset, length) back into the WARC, so you can find a response without reopening the archive.
  • Shards across machines. Hand each process its partition with --shard/--shards and a big seed splits cleanly across a fleet.
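
To make the (file, offset, length) pointer in the index concrete, here is a minimal Python sketch of how such a pointer can be resolved into a single response. It rests on assumptions this page does not confirm: that each record is written as its own gzip member, and that offset and length describe the compressed bytes of that member, which is the usual convention for indexed .warc.gz files. The function name read_record and its parameter names are hypothetical, not part of ami.

```python
import gzip


def read_record(warc_path: str, offset: int, length: int) -> bytes:
    """Return the decompressed bytes of one record from a gzipped WARC.

    Assumption: the record at `offset` is a complete gzip member that is
    `length` compressed bytes long. ami's actual layout may differ.
    """
    with open(warc_path, "rb") as f:
        f.seek(offset)           # jump straight to the record
        member = f.read(length)  # read only that record's compressed bytes
    return gzip.decompress(member)
```

If those assumptions hold, a row looked up in captures.parquet supplies the three arguments, and the read touches only that record's bytes instead of scanning the archive from the start.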

Where to go next

  • New here? Start with the introduction, then the quick start.
  • Want to install it? See installation.
  • Looking for a specific task? The guides cover the seed formats, tuning a crawl for throughput, and sharding a run across machines.
  • Need every flag? The CLI reference is the full surface.

  • Getting started. Install ami and crawl your first seed into a WARC archive and a Parquet index in under a minute.
  • Guides. Task-oriented walkthroughs for the things people actually do with ami: feeding it a seed, tuning it for throughput, and sharding a run across machines.
  • Reference. The complete ami surface: every command, every flag, the output layout, and the capture index schema.