
Acquisitions & pagination

A recipe acquires data in two ways: step (author an HTTP request) and visit (drive a real browser). The two mix freely in one recipe. There's no engine to pick: the runtime stands up a browser only when the recipe body contains a visit. Each step endpoint then gets the pagination strategy that fits it.

Two acquisitions

step — author a request

A step builds the HTTP request itself and fetches the response over reqwest, with no browser involved. Use it when the site has documented JSON endpoints or returns server-rendered HTML you can parse directly. Steps get:

  • An implicit cookie jar shared across all steps in a run.
  • Polite defaults: rate-limited (~1 req/sec by default), exponential backoff on 429 and 5xx, honest User-Agent.
  • Templated URLs, headers, JSON and form-encoded bodies.
  • auth.staticHeader and auth.htmlPrime strategies.
  • Three HTTP pagination strategies (pageWithTotal, untilEmpty, cursor), below.
  • Transient-error retry only: connection timeouts, refused connections, etc. 404s and parse errors fail fast instead of retrying.
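
The retry rules in the list above can be pictured with a small sketch. This is not the runtime's code (the runtime is Rust, over reqwest); it is a Python illustration only. The names should_retry and backoff_delay, the 1-second base delay, and the 30-second cap are assumptions made for the example.

```python
from typing import Optional


def should_retry(status: Optional[int], transient_network_error: bool = False) -> bool:
    """Decide whether a failed request is worth retrying.

    Mirrors the list above: transient network failures (timeouts, refused
    connections) and 429 / 5xx responses retry; everything else, such as a
    404, fails fast.
    """
    if transient_network_error:
        return True
    if status is None:
        return False
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, clamped to `cap`.

    `base` and `cap` are illustrative values, not documented defaults.
    """
    return min(cap, base * (2 ** attempt))
```

With these assumed values, successive retries wait 1, 2, 4, 8 seconds and so on until the cap is reached.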

visit — drive a browser

A visit drives a real WebView via wry: WKWebView on macOS, WebView2 on Windows, WebKitGTK on Linux. The host application owns the event loop; Forage Studio plugs its WebView in, so the daemon's scheduler can run visit-bearing recipes against it. Reach for a visit when the data sits behind:

  • A JavaScript single-page app whose data only exists once the page's JS runs — the requests are assembled at runtime (sometimes with a signature or token a non-browser can't reproduce).
  • Generic bot-management gates (e.g. Cloudflare) on otherwise-public pages; a real browser clears these.
  • Per-session cookies, CSP, or Origin checks a plain HTTP client can't satisfy.

You can't enumerate the page's API requests, so a visit authors an interaction instead: navigate to a URL, then optionally scroll or click until the page settles. The browser drives the page; the page fires its own requests; the visit binds what was observed. A recipe can pair the two — a step listing feeding per-item visits. See The visit statement.

Recipes don't bypass access controls

Neither path logs in for you, solves real CAPTCHAs, or works against pages that require a paid account. Generic bot-management gates on otherwise-public pages are a different category, cleared by a visit's real browser.

The visit statement

A visit is how a dynamic recipe acquires data — the analogue of a step. You author the interaction and observe the result:

forage
visit list {
    url    "https://letterboxd.com/films/popular/"
    scroll until noProgressFor(2)
}
for $card in $list.dom | select("div.film-poster") {
    emit Film { title  $card | select("img") | attr("alt") }
}

  • url (required): the URL to navigate to. Templated, like a step's url.
  • An optional paginate clause, either scroll until noProgressFor(n) or click "<selector>" until noProgressFor(n), repeats the action until the page goes quiet (no new requests for n seconds) or a maxIterations cap is reached. Without it, the visit navigates once and settles on load. Tune with maxIterations <n> and iterationDelay <secs> after the noProgressFor(n).
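
The settle rule for a paginate clause can be sketched as a loop. The Python below is a simplified model, not the engine's implementation: it assumes simulated time that advances by iteration_delay per action, and the function name drive_until_quiet and its return shape are hypothetical.

```python
from typing import Callable, Tuple


def drive_until_quiet(
    act: Callable[[], int],
    quiet_secs: float,
    max_iterations: int,
    iteration_delay: float,
) -> Tuple[int, str]:
    """Repeat `act` (a scroll or click) until the page goes quiet.

    `act` returns how many new requests the action triggered. Time is
    simulated: each iteration counts as `iteration_delay` seconds.
    Returns (iterations_run, reason), where reason is "settled" or
    "maxIterations".
    """
    quiet_for = 0.0
    for iteration in range(1, max_iterations + 1):
        new_requests = act()
        if new_requests > 0:
            quiet_for = 0.0  # activity resets the quiet window
        else:
            quiet_for += iteration_delay
        if quiet_for >= quiet_secs:
            return iteration, "settled"
    return max_iterations, "maxIterations"
```

In this model a page that keeps firing requests never settles, so the maxIterations cap is what ends the loop.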

The visit binds $<name>:

  • $<name>.dom: the settled document as a node, ready for select / text / attr.
  • $<name> | matched("<url-substring>"): the body of the first intercepted fetch/XHR whose URL contains the argument, parsed as JSON. Reach into the parsed shape with getField:

forage
visit list {
    url    "https://example.com/feed"
    scroll until noProgressFor(2)
}
for $item in $list | matched("/api/feed") | getField("items") {
    emit Post { id  $item.id }
}
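
As a rough mental model of matched, the lookup amounts to "first capture whose URL contains the substring, decoded as JSON". The Python sketch below assumes captures are stored as a list of url/body pairs, and it returns None when nothing matches; both the storage shape and the no-match behaviour are assumptions for illustration, not documented behaviour.

```python
import json
from typing import Any, Dict, List, Optional


def matched(captures: List[Dict[str, str]], url_substring: str) -> Optional[Any]:
    """Return the JSON-decoded body of the first capture whose URL
    contains `url_substring`, or None if no capture matches.

    `captures` is a hypothetical in-order list of {"url": ..., "body": ...}
    records for the intercepted fetch/XHR requests.
    """
    for capture in captures:
        if url_substring in capture["url"]:
            return json.loads(capture["body"])
    return None
```

Because only the first match is returned, a later request to the same endpoint (a second page of a feed, say) would not be seen through this lookup in the sketch.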

Chaining visits (master → detail)

Nest a visit inside a for and template its url off a prior capture to follow links, exactly like nesting a step. The list visit settles and binds first; each iteration then drives a detail visit. (Quotes inside a {…} interpolation must be escaped, so the attribute call is written attr(\"href\").)

forage
visit list {
    url    "https://example.com/films"
    scroll until noProgressFor(2)
}
for $card in $list.dom | select("a.film") {
    visit detail { url "https://example.com{$card | attr(\"href\")}" }
    emit Film {
        title  $detail.dom | select("h1") | text
        year   $detail.dom | select(".year") | text
    }
}

The live engine and the replay walk handle visits identically; a recorded run reads each visit's capture from disk instead of driving the browser. See Archive & replay.

What a run returns

A run returns a Snapshot alongside a DiagnosticReport. The snapshot is the produced records; the report is the post-run forensics. A clean run leaves the report empty — it fills in only when something's worth flagging: an unmet expect, or a stall_reason when the run was cut short (a visit that didn't settle, or a sample_limit cap). See Diagnostics for the report fields.

Live progress

A run streams progress events while running: phase transitions (starting / stepping / paginating / settling / done / failed), requests sent, records emitted, current URL. Studio wires these to its toolbar counters and per-step run stats; the CLI surfaces them under --verbose.

Cancellation

A run honors task cancellation: the host races it against a cancel signal, and a cancelled run resolves to an error rather than a snapshot. The in-flight request or pagination loop unwinds as the run future is dropped at its next await point.

Pagination

Steps expose a small, named set of HTTP pagination strategies. The runtime handles the loop; the recipe declares which strategy to use and points at the relevant response paths. New strategies are added in Rust as real platforms surface them. (A visit paginates inside itself instead, with its scroll / click until noProgressFor(n) clause.)

pageWithTotal

For endpoints that return a page of items plus a total count. The engine bumps the page parameter until accumulated items meet or exceed the total.

forage
step products {
    method "POST"
    url    "https://api.example.com/products"
    body.json { page: 1, pageSize: 200 }
    paginate pageWithTotal {
        items:     $.list
        total:     $.total
        pageParam: "page"
        pageSize:  200
    }
}
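
The loop behind pageWithTotal can be sketched in a few lines. This Python version is an illustration rather than the engine's Rust code: fetch is a hypothetical callback standing in for the HTTP request, the response keys mirror the $.list and $.total paths above, and the stop-on-empty-page guard is an assumption added so the sketch cannot loop forever on a wrong total.

```python
from typing import Any, Callable, Dict, List


def page_with_total(fetch: Callable[[int], Dict[str, Any]], start_page: int = 1) -> List[Any]:
    """Accumulate items page by page until the count meets the total.

    `fetch(page)` returns a dict shaped like {"list": [...], "total": n}.
    """
    items: List[Any] = []
    page = start_page
    while True:
        response = fetch(page)
        batch = response["list"]
        items.extend(batch)
        # Stop once we hold at least `total` items. The empty-batch check
        # is a safety guard assumed for this sketch.
        if len(items) >= response["total"] or not batch:
            return items
        page += 1
```

With five records and a page size of two, this sketch requests pages 1, 2, and 3, then stops.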

untilEmpty

For endpoints that return a page of items but no total. The engine bumps the page parameter until a response comes back empty or shorter than the page size.

forage
paginate untilEmpty {
    items:     $.data.products
    pageParam: "page"
    pageSize:  60
}
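
A comparable sketch for untilEmpty, under the same caveats: it is Python for illustration, fetch is a hypothetical callback returning the list found at the items path, and the function name is made up for the example.

```python
from typing import Any, Callable, List


def until_empty(fetch: Callable[[int], List[Any]], page_size: int, start_page: int = 1) -> List[Any]:
    """Accumulate items until a page comes back empty or short.

    `fetch(page)` returns that page's items. A batch shorter than
    `page_size` (including an empty one) is treated as the last page.
    """
    items: List[Any] = []
    page = start_page
    while True:
        batch = fetch(page)
        items.extend(batch)
        if len(batch) < page_size:
            return items
        page += 1
```

When the record count is an exact multiple of the page size, the sketch needs one extra request that returns an empty page before it stops; a short final page ends the loop without that extra request.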

Anti-pattern

Don't bypass pagination by demanding an oversized batch in a single request (e.g. raising a natural pageSize: 60 to 2000). Drive the site's natural pagination instead; it's politer, less likely to trip rate limits, and survives shape changes.

Diagnostics

A run's DiagnosticReport flags only what's noteworthy: an unmet expect, or a stall_reason when the run was cut short — a visit that hit maxIterations or ran out of time before settling, or a sample_limit cap. A visit that can't be driven at all (no recorded capture in replay, or a failed live navigation) fails the run with an error instead. Unmet expect clauses report what the run produced versus what it demanded. See Diagnostics.