Acquisitions & pagination
A recipe acquires data two ways — step (author an HTTP request) and visit (drive a real browser) — and they mix freely in one recipe. There's no engine to pick: the runtime stands up a browser only when the body contains a visit. Then pick the right pagination for each endpoint.
Two acquisitions
step — author a request
Builds the HTTP request itself and pulls the response over reqwest — no browser. Use it when the site has documented JSON endpoints or returns server-rendered HTML you can parse directly. Steps get:
- An implicit cookie jar shared across all steps in a run.
- Polite defaults: rate-limited (~1 req/sec by default), exponential backoff on 429 and 5xx, honest User-Agent.
- Templated URLs, headers, JSON and form-encoded bodies.
- auth.staticHeader and auth.htmlPrime strategies.
- Three HTTP pagination strategies (pageWithTotal, untilEmpty, cursor), below.
- Transient-error retry only: connection timeouts, refused connections, etc. 404s and parse errors fail fast instead of retrying.
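The retry rules in the list above can be read as a small decision function plus a delay schedule. The following is a minimal Python sketch, not the runtime's actual code: the names `should_retry` and `backoff_delay`, the 1-second base, and the 60-second cap are all assumptions made for illustration.

```python
from typing import Optional

# Statuses treated as "back off and try again": 429 plus every 5xx.
RETRYABLE_STATUSES = {429} | set(range(500, 600))


def should_retry(status: Optional[int], transport_error: bool = False) -> bool:
    """Decide whether a failed attempt is worth repeating.

    status is the HTTP status code, or None when no response arrived.
    transport_error marks connection timeouts, refused connections, etc.
    """
    if transport_error:
        return True  # transient network failure: retry
    if status is None:
        return False
    # 429 and 5xx are retried with backoff; 404 and other 4xx fail fast.
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped (hypothetical values)."""
    return min(cap, base * (2 ** attempt))
```

With these assumed defaults, successive retries wait 1, 2, 4, 8 seconds and so on, up to the cap.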
visit — drive a browser
Drives a real WebView: WKWebView on macOS, WebView2 on Windows, WebKitGTK on Linux (via wry). The host application owns the event loop; Forage Studio plugs its WebView in, so the daemon's scheduler can run visit-bearing recipes against it. Reach for a visit when the data sits behind:
- A JavaScript single-page app whose data only exists once the page's JS runs — the requests are assembled at runtime (sometimes with a signature or token a non-browser can't reproduce).
- Generic bot-management gates (e.g. Cloudflare) on otherwise-public pages; a real browser clears these.
- Per-session cookies, CSP, or Origin checks a plain HTTP client can't satisfy.
You can't enumerate the page's API requests up front, so in a visit you author an interaction instead: navigate to a URL, then optionally scroll or click until the page settles. The browser drives the page; the page fires its own requests; the visit binds what was observed. A recipe can pair the two, with a step fetching a listing that feeds per-item visits. See The visit statement.
Recipes don't bypass access controls
Neither path logs in for you, solves real CAPTCHAs, or works against pages that require a paid account. Generic bot-management gates on otherwise-public pages are a different category, cleared by a visit's real browser.
The visit statement
A visit is how a dynamic recipe acquires data — the analogue of a step. You author the interaction and observe the result:
visit list {
    url "https://letterboxd.com/films/popular/"
    scroll until noProgressFor(2)
}
for $card in $list.dom | select("div.film-poster") {
    emit Film { title ← $card | select("img") | attr("alt") }
}
- url (required): the URL to navigate to. Templated, like a step's url.
- An optional paginate clause, scroll until noProgressFor(n) or click "<selector>" until noProgressFor(n), repeats the action until the page goes quiet (no new requests for n seconds) or a maxIterations cap is reached. Absent, the visit navigates once and settles on load. Tune with maxIterations <n> and iterationDelay <secs> after the noProgressFor(n).
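To make the paginate clause concrete, here is a minimal Python sketch of a "repeat until quiet" loop. It is an illustration under stated assumptions, not the engine's implementation: the function name, the injected `act` / `request_count` / `now` / `sleep` callables, and the string return values are all hypothetical.

```python
from typing import Callable


def paginate_until_quiet(
    act: Callable[[], None],           # performs one scroll or click
    request_count: Callable[[], int],  # total requests observed so far
    now: Callable[[], float],          # clock, in seconds
    sleep: Callable[[float], None],    # waits between iterations
    quiet_secs: float,                 # the n in noProgressFor(n)
    max_iterations: int,
    iteration_delay: float,
) -> str:
    """Repeat act() until no new request was seen for quiet_secs, or the cap is hit."""
    seen = request_count()
    last_progress = now()
    for _ in range(max_iterations):
        act()
        sleep(iteration_delay)  # give the page time to fire its requests
        current = request_count()
        if current > seen:
            # New traffic counts as progress: reset the quiet timer.
            seen = current
            last_progress = now()
        elif now() - last_progress >= quiet_secs:
            return "settled"
    return "maxIterations"
```

Injecting the clock and the sleep function keeps the loop testable without a real browser or real waiting.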
The visit binds $<name>:
- $<name>.dom: the settled document as a node, ready for select / text / attr.
- $<name> | matched("<url-substring>"): the body of the first intercepted fetch/XHR whose URL contains the argument, parsed as JSON. Reach into the parsed shape with getField:
visit list {
    url "https://example.com/feed"
    scroll until noProgressFor(2)
}
for $item in $list | matched("/api/feed") | getField("items") {
    emit Post { id ← $item.id }
}
Chaining visits (master → detail)
Nest a visit inside a for and template its url off a prior capture to follow links — exactly like nesting a step. The list visit settles and binds first; each iteration then drives a detail visit. (Quotes inside a {…} interpolation are escaped, so the selector reads attr(\"href\").)
visit list {
    url "https://example.com/films"
    scroll until noProgressFor(2)
}
for $card in $list.dom | select("a.film") {
    visit detail { url "https://example.com{$card | attr(\"href\")}" }
    emit Film {
        title ← $detail.dom | select("h1") | text
        year ← $detail.dom | select(".year") | text
    }
}
The live engine and the replay walk handle visits identically; a recorded run reads each visit's capture from disk instead of driving the browser. See Archive & replay.
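Whether the captures come from a live browser or from a recording on disk, the matched lookup described under The visit statement amounts to a first-match search over intercepted requests. The Python sketch below illustrates that idea only. The `Capture` tuple shape and the `LookupError` raised when nothing matches are assumptions; the source does not say how captures are stored or what happens on a miss.

```python
import json
from typing import Any, List, Tuple

# Hypothetical capture record: (request URL, raw response body).
Capture = Tuple[str, str]


def matched(captures: List[Capture], url_substring: str) -> Any:
    """Return the JSON-parsed body of the first capture whose URL contains url_substring."""
    for url, body in captures:
        if url_substring in url:
            return json.loads(body)
    # Assumed behavior for illustration: fail loudly when nothing matches.
    raise LookupError(f"no captured request matched {url_substring!r}")
```

Because the match is by substring and takes the first hit, a narrower substring (for example a full path rather than a bare `/api`) makes the result less dependent on request order.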
What a run returns
A run returns a Snapshot alongside a DiagnosticReport. The snapshot is the produced records; the report is the post-run forensics. A clean run leaves the report empty — it fills in only when something's worth flagging: an unmet expect, or a stall_reason when the run was cut short (a visit that didn't settle, or a sample_limit cap). See Diagnostics for the report fields.
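As a rough mental model of that pair, the Python sketch below shows a report that stays empty on a clean run. The type and field names (`RunResult`, `unmet_expectations`, `is_clean`) are hypothetical and chosen only for illustration; only `stall_reason` is a name taken from the text, and the real report fields are listed under Diagnostics.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class DiagnosticReport:
    # Hypothetical field: one entry per expect clause the run failed to meet.
    unmet_expectations: List[str] = field(default_factory=list)
    # Set only when the run was cut short, e.g. by an unsettled visit or a sample cap.
    stall_reason: Optional[str] = None

    def is_clean(self) -> bool:
        """A clean run leaves nothing to flag."""
        return not self.unmet_expectations and self.stall_reason is None


@dataclass
class RunResult:
    snapshot: List[Any]       # the produced records
    report: DiagnosticReport  # the post-run forensics
```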
Live progress
A run streams progress events while running: phase transitions (starting / stepping / paginating / settling / done / failed), requests sent, records emitted, current URL. Studio wires these to its toolbar counters and per-step run stats; the CLI surfaces them under --verbose.
Cancellation
A run honors task cancellation: the host races it against a cancel signal, and a cancelled run resolves to an error rather than a snapshot. The in-flight request or pagination loop unwinds as the run future is dropped at its next await point.
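The race between a run and a cancel signal can be illustrated with Python's asyncio, where cancelling a task also takes effect at its next await point. This is an analogue under stated assumptions, not the host's actual code: `run_with_cancel`, `fake_run`, and the `RuntimeError` used as the cancellation error are hypothetical names.

```python
import asyncio


async def fake_run(pages: int) -> list:
    """Stand-in for a recipe run: one await per simulated request."""
    records = []
    for page in range(pages):
        await asyncio.sleep(0.01)
        records.append(page)
    return records


async def run_with_cancel(run_coro, cancel_event: asyncio.Event):
    """Race a run against a cancel signal; a cancelled run raises instead of returning."""
    run_task = asyncio.ensure_future(run_coro)
    cancel_task = asyncio.ensure_future(cancel_event.wait())
    done, _pending = await asyncio.wait(
        {run_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if run_task in done:
        cancel_task.cancel()
        return run_task.result()  # the snapshot, or the run's own error
    run_task.cancel()  # the run unwinds at its next await point
    try:
        await run_task
    except asyncio.CancelledError:
        pass
    raise RuntimeError("run cancelled")
```

A caller would typically set the event from a UI action or a signal handler while awaiting `run_with_cancel(...)`.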
Pagination
steps expose a small, named set of HTTP pagination strategies. The runtime handles the loop; the recipe declares which strategy and points at the relevant response paths. New strategies are added in Rust as real platforms surface them. (A visit paginates inside itself instead, with its scroll / click until noProgressFor(n) clause.)
pageWithTotal
For endpoints that return a page of items plus a total count. The engine bumps the page parameter until accumulated items meet or exceed the total.
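The loop described above can be sketched in Python as follows. This is a simplified illustration, not the engine's Rust code: `page_with_total` and the `fetch` callable are hypothetical, and the extra stop on an empty page is an added safeguard that the text does not mention.

```python
from typing import Any, Callable, Dict, List


def page_with_total(
    fetch: Callable[[int], Dict[str, Any]],  # hypothetical: page number -> parsed response
    items_key: str,
    total_key: str,
    first_page: int = 1,
) -> List[Any]:
    """Bump the page number until the accumulated items meet or exceed the total."""
    items: List[Any] = []
    page = first_page
    while True:
        response = fetch(page)
        batch = response[items_key]
        items.extend(batch)
        # Stop at the declared total; also stop on an empty page so that a
        # wrong total cannot loop forever (an assumption, not from the source).
        if len(items) >= response[total_key] or not batch:
            return items
        page += 1
```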
step products {
    method "POST"
    url "https://api.example.com/products"
    body.json { page: 1, pageSize: 200 }
    paginate pageWithTotal {
        items: $.list
        total: $.total
        pageParam: "page"
        pageSize: 200
    }
}
untilEmpty
For endpoints that return a page of items but no total. The engine bumps the page parameter until a response comes back empty or shorter than the page size.
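A minimal Python sketch of that stopping rule, assuming a hypothetical `fetch` callable that returns the items for a given page number (the function name `until_empty` is likewise illustrative, not the engine's API):

```python
from typing import Any, Callable, List


def until_empty(
    fetch: Callable[[int], List[Any]],  # hypothetical: page number -> items on that page
    page_size: int,
    first_page: int = 1,
) -> List[Any]:
    """Bump the page number until a page comes back empty or short."""
    items: List[Any] = []
    page = first_page
    while True:
        batch = fetch(page)
        items.extend(batch)
        # A short page is the last one; an empty page is just the shortest case.
        if len(batch) < page_size:
            return items
        page += 1
```

When the item count is an exact multiple of the page size, the loop needs one extra request to see the empty page; a short final page ends it without that request.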
paginate untilEmpty {
    items: $.data.products
    pageParam: "page"
    pageSize: 60
}
Anti-pattern
Don't bypass pagination by demanding an oversized batch in a single request (e.g. raising a natural pageSize: 60 to 2000). Drive the site's natural pagination instead; it's politer, less likely to trip rate limits, and survives shape changes.
Diagnostics
A run's DiagnosticReport flags only what's noteworthy: an unmet expect, or a stall_reason when the run was cut short — a visit that hit maxIterations or ran out of time before settling, or a sample_limit cap. A visit that can't be driven at all (no recorded capture in replay, or a failed live navigation) fails the run with an error instead. Unmet expect clauses report what the run produced versus what it demanded. See Diagnostics.