HTML / DOM extraction

Forage treats HTML the same way it treats JSON: a parsed value queryable by path expressions and pipes. There's no second grammar for DOM; the recipe-author skills you already have for JSON apply unchanged. A handful of transforms (parseHtml, select, text, attr, …) and one grammar extension (for-loops accept pipelines) are all that distinguish HTML extraction from JSON extraction in a recipe.

The shape

forage
recipe "example"

type Story {
    title: String
    url:   String?
}

step front {
    method "GET"
    url    "https://news.ycombinator.com"
}

for $title in $front | parseHtml | select(".titleline") {
    emit Story {
        title  $title | select("a") | text
        url    $title | select("a") | attr("href")
    }
}

What's happening:

  1. $front is the response body. When the server returns Content-Type: text/html, the body comes through as a string rather than triggering a JSON-parse failure.
  2. parseHtml turns the string into a queryable node.
  3. select(".titleline") returns an array of matching nodes (CSS selectors, jQuery-style).
  4. for $title in <pipeline> iterates over that array. Each $title is bound to one matched node.
  5. Inside the loop, $title | select("a") | text is a chain: select the <a> descendants, then take the first one's text. (text, attr, and html auto-flatten a single-element array, following the jQuery convention.)

The transforms

Transform    Receives       Returns         Purpose
parseHtml    string         node            Parse an HTML/XML document. Lenient: malformed markup works.
parseJson    string         JSON            The companion for the "data is embedded in a <script>" pattern.
select(sel)  node           [node]          CSS selector match. Returns an array, even for one match.
text         node | [node]  string          Whitespace-collapsed text content. Auto-flattens single-element array.
attr(name)   node | [node]  string?         Attribute value, or null if missing/empty.
html         node | [node]  string          Outer HTML (the wrapping tag and everything inside).
innerHtml    node | [node]  string          Inner HTML (children only).
first        array          element | null  Explicit head-of-list.

select always returns an array because most CSS selectors match more than one element. When you only want the first match's text/attr, the auto-flatten on text/attr/html saves you a | first call. When you want all matches, drive a for $x in ... loop.
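
The auto-flatten rule can be modelled outside Forage. The following is a minimal Python sketch of the semantics described in the table, not Forage's implementation: the Node class and the function names are hypothetical stand-ins, and the behaviour for arrays with zero or several elements is deliberately left unmodelled because the text above does not specify it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node (not a Forage type)."""
    raw_text: str
    attrs: Dict[str, str] = field(default_factory=dict)


NodeOrList = Union[Node, List[Node]]


def _flatten(value: NodeOrList) -> Node:
    # Auto-flatten: a single-element array is treated as that one element.
    if isinstance(value, list):
        if len(value) == 1:
            return value[0]
        # Assumption: other array lengths are outside what this sketch models.
        raise ValueError("only single-element arrays are modelled here")
    return value


def text(value: NodeOrList) -> str:
    # Whitespace-collapsed text content.
    return " ".join(_flatten(value).raw_text.split())


def attr(value: NodeOrList, name: str) -> Optional[str]:
    # Attribute value, or None when the attribute is missing or empty.
    return _flatten(value).attrs.get(name) or None


def first(items: list):
    # Explicit head-of-list: the first element, or None for an empty list.
    return items[0] if items else None
```

With a single match, text([link]) and text(first([link])) give the same string, which is why the explicit first step can be dropped in that case.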

When recipes need HTML extraction

The native fit is server-rendered HTML pages with no public API. Three common shapes:

  1. Classic server-rendered sites. Wikipedia, news.ycombinator.com, government data portals, Craigslist, public records databases. The data is in the HTML; there's no JSON endpoint.
  2. SSR with embedded JSON. Modern Next.js / Remix sites often render a <script id="__NEXT_DATA__">{…}</script> blob containing the data the React tree was hydrated from. Pattern: $page | parseHtml | select("script#__NEXT_DATA__") | text | parseJson | $.props.pageProps.results[*].
  3. Hybrid pages with both. Some pages render the first batch as HTML and subsequent batches via XHR. The HTML-extraction primitive handles the first; a visit with matched("…") reaches the XHR batches. Same primitive, both shapes.
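
For readers who want to see what the embedded-JSON pattern in item 2 does step by step, here is a rough Python equivalent using only the standard library. It is a sketch of the same idea (find the script element by id, take its text, parse it as JSON, walk to props.pageProps.results), not a description of how Forage runs the pipeline. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class NextDataExtractor(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Plays the role of select("script#__NEXT_DATA__").
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Plays the role of text: keep the raw script body.
        if self._inside:
            self.chunks.append(data)


def extract_results(page: str):
    parser = NextDataExtractor()
    parser.feed(page)
    parser.close()
    payload = json.loads("".join(parser.chunks))    # parseJson
    return payload["props"]["pageProps"]["results"]  # $.props.pageProps.results[*]
```

The point of the pattern is that the HTML parse is only used to locate one element; everything after parseJson is ordinary JSON extraction.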

For sites that sit behind a Cloudflare gate or are fully JS-rendered with no useful initial HTML (eBay search results, Datadome-protected sites), reach for a visit: it binds the rendered document to $<name>.dom, which you extract from the same way you extract from a step's static response.

Content-type dispatch

HTTP responses from a step are decoded by content type:

  • application/json (or no content-type with parseable JSON) → response is a JSON value.
  • text/html, text/xml, text/plain, etc. → response is a string. Pipe through parseHtml to query.

The fallback is intentional: an HTML response doesn't crash the recipe; it just lands as a string the recipe explicitly chooses to parse. This makes the parsing step legible at the call site rather than implicit.
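
The dispatch rule can be sketched in a few lines of Python. This is an illustration of the two bullets above under stated assumptions, not Forage's actual decoder: decode_body is a hypothetical name, and returning the raw string when there is no content type and the body is not valid JSON is an assumption drawn from the fallback described above.

```python
import json
from typing import Any, Optional


def decode_body(content_type: Optional[str], body: str) -> Any:
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = (content_type or "").split(";")[0].strip().lower()

    if media_type == "application/json":
        return json.loads(body)

    if not media_type:
        # No content type: use JSON when the body parses as JSON.
        try:
            return json.loads(body)
        except json.JSONDecodeError:
            return body  # assumption: fall back to the raw string

    # text/html, text/xml, text/plain, etc.: hand back the string untouched.
    return body
```

Note that a text/plain body stays a string even when it happens to be valid JSON; the recipe decides whether to pipe it through parseJson or parseHtml.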

Browser visit: $visit.dom

Some sites lock HTTP scrapers out: Cloudflare, Akamai, and similar serve a JS challenge or a 403 to anything that isn't a real browser. For those, a visit drives a hidden WKWebView through the gate, settles the page, and binds the rendered HTML to $<name>.dom.

forage
recipe "letterboxd-popular"

type Film { title: String, url: String? }

visit popular {
    url    "https://letterboxd.com/films/popular/this/week/"
    scroll until noProgressFor(2)
}

for $poster in $popular.dom | select("div.poster.film-poster") {
    emit Film {
        title  $poster | select("span.frame-title") | text
        url    $poster | select("a.frame") | attr("href")
    }
}

$popular.dom is the parsed post-settle document; recipes walk it with select(...) directly, no parseHtml call needed. The rest is the same extraction primitive working over a different content source.

Coverage:

  • Works for Cloudflare-protected sites with JS challenges (Letterboxd, many mid-tier e-commerce sites, smaller news sites). A visit passes the challenge because it runs in a real WebKit browser engine.
  • CAPTCHA-walled sites (eBay's Akamai layer, Datadome on hot-ticket sites) need a human to clear the verification; a visit doesn't solve human-verification challenges on its own.
  • Works in replay mode. Archived runs preserve each visit's capture in _fixtures/<recipe>.jsonl; ReplayVisitSource matches each visit to its capture without re-navigating.

Recipe inventory

Browse the canonical recipes on hub.foragelang.com:

  • hacker-news-html: HN front page scraped from the rendered HTML, as a companion to the JSON-API version in hacker-news. Same record shape, different data source.
  • scotus-opinions: US Supreme Court slip opinions for a given term, extracted from supremecourt.gov's HTML table. Typed Opinion records with date, docket number, case name, PDF URL, and holding text.
  • letterboxd-popular: Films popular this week on Letterboxd, scraped via a visit through Cloudflare. End-to-end demonstration of $visit.dom.

The step recipes (HN HTML, SCOTUS) are the smallest end-to-end uses of the HTML-extraction primitive. The visit recipe (Letterboxd) is the smallest use of $visit.dom. Copy any of them as a starting template.