HTML / DOM extraction
Forage treats HTML the same way it treats JSON: a parsed value queryable by path expressions and pipes. There's no second grammar for DOM; the recipe-author skills you already have for JSON apply unchanged. A handful of transforms (parseHtml, select, text, attr, …) and one grammar extension (for-loops accept pipelines) are all that distinguish HTML extraction from JSON extraction in a recipe.
The shape
recipe "example"

type Story {
  title: String
  url: String?
}

step front {
  method "GET"
  url "https://news.ycombinator.com"
}

for $title in $front | parseHtml | select(".titleline") {
  emit Story {
    title ← $title | select("a") | text
    url ← $title | select("a") | attr("href")
  }
}

What's happening:
- $front is the response body. When the server returns Content-Type: text/html, the body comes through as a string instead of a JSON-parse failure.
- parseHtml turns the string into a queryable node.
- select(".titleline") returns an array of matching nodes (CSS selectors, jQuery-style).
- for $title in <pipeline> iterates over that array. Each $title is bound to one matched node.
- Inside the loop, the pipeline $title | select("a") | text gets the <a> descendants and takes the first one's text. (text/attr/html auto-flatten a single-element array, the jQuery convention.)
The transforms
| Transform | Receives | Returns | Purpose |
|---|---|---|---|
| parseHtml | string | node | Parse an HTML/XML document. Lenient: malformed markup works. |
| parseJson | string | JSON | The companion for the "data is embedded in a <script>" pattern. |
| select(sel) | node | [node] | CSS selector match. Returns an array, even for one match. |
| text | node or [node] | string | Whitespace-collapsed text content. Auto-flattens single-element array. |
| attr(name) | node or [node] | string? | Attribute value, or null if missing/empty. |
| html | node or [node] | string | Outer HTML (the wrapping tag and everything inside). |
| innerHtml | node or [node] | string | Inner HTML (children only). |
| first | array | element or null | Explicit head-of-list. |
select always returns an array because most CSS selectors match more than one element. When you only want the first match's text/attr, the auto-flatten on text/attr/html saves you a | first call. When you want all matches, drive a for $x in ... loop.
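The auto-flatten convention can be modeled outside Forage. What follows is a minimal Python sketch, not Forage's implementation: Node is a hypothetical stand-in for a parsed node, and the handling of multi-element and empty arrays (take the first element; an empty array yields an empty string from text and null from attr) is an assumption the description above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML node (not a Forage type)."""
    raw_text: str = ""
    attrs: dict = field(default_factory=dict)


def first(items):
    """Explicit head-of-list: the first element, or None when empty."""
    return items[0] if items else None


def _flatten(value):
    # Assumption: an array collapses to its first element, so the result of
    # select(...) can be passed straight to text/attr without calling first.
    return first(value) if isinstance(value, list) else value


def text(value) -> str:
    """Whitespace-collapsed text of a node or an array of nodes."""
    node = _flatten(value)
    if node is None:
        return ""  # assumption: no match yields an empty string
    return " ".join(node.raw_text.split())


def attr(value, name: str) -> Optional[str]:
    """Attribute value, or None when the attribute is missing or empty."""
    node = _flatten(value)
    if node is None:
        return None
    return node.attrs.get(name) or None
```

With this model, text([link]) and text(link) return the same string, which is why the recipe can write select("a") | text without an intermediate first.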
When recipes need HTML extraction
The native fit is server-rendered HTML pages with no public API. Three common shapes:
- Classic server-rendered sites. Wikipedia, news.ycombinator.com, government data portals, Craigslist, public records databases. The data is in the HTML; there's no JSON endpoint.
- SSR with embedded JSON. Modern Next.js / Remix sites often render a <script id="__NEXT_DATA__">{…}</script> blob containing the data the React tree was hydrated from. Pattern: $page | parseHtml | select("script#__NEXT_DATA__") | text | parseJson | $.props.pageProps.results[*].
- Hybrid pages with both. Some pages render the first batch as HTML and subsequent batches via XHR. The HTML-extraction primitive handles the first; a visit with matched("…") reaches the XHR batches. Same primitive, both shapes.
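For readers who want to see what the embedded-JSON pattern amounts to, here is a rough equivalent in Python using only the standard library. It illustrates the technique rather than how Forage evaluates the pipeline: NextDataExtractor and extract_results are hypothetical names, and the props.pageProps.results path is the one from the pattern above, which varies from site to site.

```python
import json
from html.parser import HTMLParser


class NextDataExtractor(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in more than one chunk, so accumulate.
        if self._inside:
            self.chunks.append(data)


def extract_results(page_html: str):
    """parseHtml | select | text | parseJson | path lookup, in one function."""
    parser = NextDataExtractor()
    parser.feed(page_html)
    parser.close()
    blob = json.loads("".join(parser.chunks))
    return blob["props"]["pageProps"]["results"]
```

The recipe pipeline expresses the same four stages (parse, select the script element, take its text, parse that text as JSON) without the parser boilerplate.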
For sites that need Cloudflare-gated access or are fully JS-rendered with no useful initial HTML (eBay search results, Datadome-protected sites), reach for a visit: it binds the rendered document to $<name>.dom, which you extract from the same way a step extracts from a static response.
Content-type dispatch
step HTTP responses are decoded by content type:
- application/json (or no content-type with parseable JSON) → response is a JSON value.
- text/html, text/xml, text/plain, etc. → response is a string. Pipe through parseHtml to query.
The fallback is intentional: an HTML response doesn't crash the recipe; it just lands as a string the recipe explicitly chooses to parse. This makes the parsing step legible at the call site rather than implicit.
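The dispatch rule is small enough to sketch. The Python below is a model of the behavior described here, not Forage's code; decode_body is a hypothetical name, and raising on a malformed application/json body is an assumption, since that case is not covered above.

```python
import json
from typing import Any, Optional


def decode_body(content_type: Optional[str], body: str) -> Any:
    """Return a JSON value or the raw string, depending on content type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()

    if media_type == "application/json":
        # Assumption: a malformed JSON body raises here.
        return json.loads(body)

    if not media_type:
        # No content type: use JSON if the body parses, else keep the string.
        try:
            return json.loads(body)
        except json.JSONDecodeError:
            return body

    # text/html, text/xml, text/plain, etc. stay strings; the recipe decides
    # whether to pipe them through parseHtml.
    return body
```

Under this model an HTML body never reaches the JSON parser, so the only parsing that happens to it is the parseHtml call the recipe author wrote.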
Browser visit: $visit.dom
Some sites lock HTTP scrapers out: Cloudflare, Akamai, and similar serve a JS challenge or a 403 to anything that isn't a real browser. For those, a visit drives a hidden WKWebView through the gate, settles the page, and binds the rendered HTML to $<name>.dom.
recipe "letterboxd-popular"

type Film { title: String, url: String? }

visit popular {
  url "https://letterboxd.com/films/popular/this/week/"
  scroll until noProgressFor(2)
}

for $poster in $popular.dom | select("div.poster.film-poster") {
  emit Film {
    title ← $poster | select("span.frame-title") | text
    url ← $poster | select("a.frame") | attr("href")
  }
}

$popular.dom is the parsed post-settle document; recipes walk it with select(...) directly, no parseHtml call needed. The rest is the same extraction primitive working over a different content source.
Coverage:
- Works for Cloudflare-protected sites with JS challenges (Letterboxd, many mid-tier e-commerce sites, smaller news sites). A visit passes the challenge by virtue of being a real WebKit.
- CAPTCHA-walled sites (eBay's Akamai layer, Datadome on hot-ticket sites) need a human to clear the verification; a visit doesn't solve human-verification challenges on its own.
- Works in replay mode. Archived runs preserve each visit's capture in _fixtures/<recipe>.jsonl; ReplayVisitSource matches each visit to its capture without re-navigating.
Recipe inventory
Browse the canonical recipes on hub.foragelang.com:
- hacker-news-html: HN front page scraped from the rendered HTML, as a companion to the JSON-API version in hacker-news. Same record shape, different data source.
- scotus-opinions: US Supreme Court slip opinions for a given term, extracted from supremecourt.gov's HTML table. Typed Opinion records with date, docket number, case name, PDF URL, and holding text.
- letterboxd-popular: Films popular this week on Letterboxd, scraped via a visit through Cloudflare. End-to-end demonstration of $visit.dom.
The step recipes (HN HTML, SCOTUS) are the smallest end-to-end uses of the HTML-extraction primitive. The visit recipe (Letterboxd) is the smallest use of $visit.dom. Copy any of them as a starting template.