Documentation
Overview ¶
Package capturebatch implements the chrest side of the Web Capture Archive Protocol (RFC 0001). The capturer reads a batch of capture requests as JSON on stdin, runs them sequentially, streams each artifact to a writer subprocess for content-addressed storage, and emits a JSON result envelope on stdout.
MVP scope: split=false only. For split=true, the runner emits a per-capture not-implemented error.
Index ¶
- Constants
- Variables
- func BuildEnvelope(url string, capturedAt time.Time, stripped map[string]any, ...) ([]byte, error)
- func BuildSpec(r Resolved, browser firefox.BrowserInfo, host HostFingerprint, ...) ([]byte, error)
- func Canonicalize(v any) ([]byte, error)
- func Normalize(format string, raw []byte) (normalized []byte, stripped map[string]any, err error)
- func NormalizeStream(format string, src io.Reader) (io.Reader, map[string]any, error)
- type ArtifactRef
- type CaptureDefaults
- type CaptureError
- type CapturerInfo
- type Error
- type Extension
- type HostFingerprint
- type Input
- type InputCapture
- type Options
- type Output
- type OutputCapture
- type Resolved
- type WriterResult
- type WriterSpec
Constants ¶
const CapturerName = "chrest"
CapturerName is chrest's identifier in the protocol. It is hardcoded so that output from other capturers implementing RFC 0001 can be told apart from chrest's.
const EnvelopeMediaType = "application/vnd.web-capture-archive.envelope+json"
EnvelopeMediaType is the Content-Type of the canonicalized envelope bytes. Stable across schema versions — the discriminator is the `schema` field inside the blob.
const EnvelopeSchemaPreview = "web-capture-archive.envelope/v1-preview"
EnvelopeSchemaPreview is emitted when the backend cannot populate the RFC-required `http.status` and `http.headers` fields. Today that is the CDP / headless-Chrome backend (chrest#24 follow-up work). The `-preview` suffix makes v1-strict consumers reject it per the RFC forward-compat rules, while preview-tolerant consumers can opt in knowingly.
const EnvelopeSchemaV1 = "web-capture-archive.envelope/v1"
EnvelopeSchemaV1 is emitted when http.* is fully populated. Today this is produced by the Firefox/BiDi backend via network.responseCompleted event subscription.
const InputSchema = "web-capture-archive/v1"
InputSchema is the constant `schema` value for the batch input.
const OutputSchema = "web-capture-archive/v1"
OutputSchema is the constant `schema` value for the batch output.
const SpecMediaType = "application/vnd.web-capture-archive.spec+json"
SpecMediaType is the Content-Type of the canonicalized spec bytes.
const SpecSchema = "web-capture-archive.spec/v1"
SpecSchema is the `schema` constant in the spec artifact.
Variables ¶
var PayloadMediaTypes = map[string]string{
"text": "text/plain; charset=utf-8",
"pdf": "application/pdf",
"screenshot": "image/png",
"mhtml": "multipart/related",
"a11y": "application/json",
"html-monolith": "text/html; charset=utf-8",
"html-outer": "text/html; charset=utf-8",
"markdown-full": "text/markdown; charset=utf-8",
"markdown-reader": "text/markdown; charset=utf-8",
"markdown-selector": "text/markdown; charset=utf-8",
}
PayloadMediaTypes maps each supported capture format to the media type recorded on the payload ArtifactRef. RFC 0001 §Payload Artifact.
Functions ¶
func BuildEnvelope ¶
func BuildEnvelope(url string, capturedAt time.Time, stripped map[string]any, http *firefox.HTTPResponse) ([]byte, error)
BuildEnvelope assembles the envelope artifact for a resolved capture and returns the JCS-canonicalized bytes. When http is non-nil, emits the full v1 schema with http.* fields populated; when nil, emits v1-preview with the http key omitted.
Per RFC 0001 §Envelope Artifact:
- `schema`, `url`, `captured_at` are required.
- `http.status` and `http.headers` are required by RFC v1 but are present only when the backend supports network-event capture.
- `stripped.<format>` is optional; the format normalizer returns what it removed, or nil if nothing.
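The schema selection described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `httpResponse` is a hypothetical stand-in for `firefox.HTTPResponse` (whose fields are not shown here), the RFC 3339 timestamp format and the shape of `http.headers` are assumptions, and the result is printed with `json.Marshal` rather than a JCS encoder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

const (
	schemaV1      = "web-capture-archive.envelope/v1"
	schemaPreview = "web-capture-archive.envelope/v1-preview"
)

// httpResponse is a hypothetical stand-in for firefox.HTTPResponse.
type httpResponse struct {
	Status  int
	Headers map[string]string
}

// exampleTime is a fixed timestamp used only by this sketch.
var exampleTime = time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC)

// envelopeValue builds the envelope as a plain map, ready to be canonicalized.
func envelopeValue(url string, capturedAt time.Time, stripped map[string]any, resp *httpResponse) map[string]any {
	env := map[string]any{
		"schema":      schemaPreview,
		"url":         url,
		"captured_at": capturedAt.UTC().Format(time.RFC3339), // format is an assumption
	}
	if resp != nil {
		headers := make(map[string]any, len(resp.Headers))
		for k, v := range resp.Headers {
			headers[k] = v
		}
		// Only a populated http section earns the full v1 schema.
		env["schema"] = schemaV1
		env["http"] = map[string]any{"status": resp.Status, "headers": headers}
	}
	if len(stripped) > 0 {
		env["stripped"] = stripped
	}
	return env
}

func main() {
	responses := []*httpResponse{nil, {Status: 200, Headers: map[string]string{"content-type": "text/html"}}}
	for _, resp := range responses {
		// json.Marshal is used for display only; it is not a JCS encoder.
		out, err := json.Marshal(envelopeValue("https://example.com/", exampleTime, nil, resp))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

With a nil response the `http` key is absent and the schema is the preview one; with a response present both `http.status` and `http.headers` appear under the v1 schema.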
func BuildSpec ¶
func BuildSpec( r Resolved, browser firefox.BrowserInfo, host HostFingerprint, capturerVersion string, ) ([]byte, error)
BuildSpec assembles the spec artifact for a resolved capture and returns the JCS-canonicalized bytes.
Per RFC 0001 §Capture Spec Artifact:
- `capture.options` is an echo of the input (may be any JSON value); empty object `{}` if input omitted it.
- `browser.command_line`, `browser.prefs`, `browser.extensions[].manifest_digest` are optional; omitted when empty (vs present-and-empty).
- `browser.extensions` is required; must be `[]` if none.
- MUST NOT contain time-varying data.
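The difference between "omitted when empty" and "required, `[]` if none" maps onto a common Go pitfall: a nil slice marshals as `null`, not `[]`. The sketch below shows one way to get both behaviours. `browserSection` and `newBrowserSection` are hypothetical names, and the Go types chosen for `command_line` and `prefs` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type extension struct {
	ID             string `json:"id"`
	Version        string `json:"version"`
	ManifestDigest string `json:"manifest_digest,omitempty"`
}

// browserSection is a hypothetical shape for the spec's `browser` object.
type browserSection struct {
	CommandLine []string       `json:"command_line,omitempty"` // optional: dropped when empty
	Prefs       map[string]any `json:"prefs,omitempty"`        // optional: dropped when empty
	Extensions  []extension    `json:"extensions"`             // required: no omitempty
}

// newBrowserSection guarantees that Extensions is never nil, so it always
// encodes as a JSON array.
func newBrowserSection(cmdline []string, prefs map[string]any, exts []extension) browserSection {
	if exts == nil {
		exts = []extension{} // a nil slice would marshal as null, not []
	}
	return browserSection{CommandLine: cmdline, Prefs: prefs, Extensions: exts}
}

func main() {
	out, err := json.Marshal(newBrowserSection(nil, nil, nil))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"extensions":[]}
}
```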
func Canonicalize ¶
func Canonicalize(v any) ([]byte, error)
Canonicalize encodes v as JCS (RFC 8785) bytes.
Our schema uses strings, integers, booleans, objects, arrays, and null — no floating-point numbers. This implementation is correct for that subset:
- map keys are sorted by UTF-16 code units (identical to byte-wise order for ASCII-only keys, which our schema uses);
- objects and arrays emit with no whitespace;
- strings are escaped per RFC 8785 §3.2.2.2 (only required control chars are escaped; Go's default json.Encoder escapes more);
- booleans and nulls emit as `true` / `false` / `null`;
- integers (int, int64, json.Number) emit in base 10 with no leading zeros or `+`.
If the schema ever grows floating-point fields, this will need the ES6 ToString semantics from RFC 8785 §3.2.2.3; that is intentionally out of MVP scope.
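The rules listed above can be sketched in a few dozen lines. This is an illustrative encoder for the same subset, not the package's code: `canon`, `quote`, and `lessUTF16` are hypothetical names, `json.Number` handling is left out for brevity, and floating-point values are rejected rather than serialized.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
	"unicode/utf16"
)

// canon encodes null, booleans, strings, integers, arrays, and objects.
func canon(v any) (string, error) {
	switch x := v.(type) {
	case nil:
		return "null", nil
	case bool:
		return strconv.FormatBool(x), nil
	case string:
		return quote(x), nil
	case int:
		return strconv.Itoa(x), nil
	case int64:
		return strconv.FormatInt(x, 10), nil
	case []any:
		parts := make([]string, len(x))
		for i, e := range x {
			s, err := canon(e)
			if err != nil {
				return "", err
			}
			parts[i] = s
		}
		return "[" + strings.Join(parts, ",") + "]", nil
	case map[string]any:
		keys := make([]string, 0, len(x))
		for k := range x {
			keys = append(keys, k)
		}
		sort.Slice(keys, func(i, j int) bool { return lessUTF16(keys[i], keys[j]) })
		parts := make([]string, len(keys))
		for i, k := range keys {
			s, err := canon(x[k])
			if err != nil {
				return "", err
			}
			parts[i] = quote(k) + ":" + s
		}
		return "{" + strings.Join(parts, ",") + "}", nil
	}
	return "", fmt.Errorf("unsupported type %T", v)
}

// lessUTF16 orders strings by their UTF-16 code units.
func lessUTF16(a, b string) bool {
	ua, ub := utf16.Encode([]rune(a)), utf16.Encode([]rune(b))
	for i := 0; i < len(ua) && i < len(ub); i++ {
		if ua[i] != ub[i] {
			return ua[i] < ub[i]
		}
	}
	return len(ua) < len(ub)
}

var shortEscapes = map[rune]string{
	'"': `\"`, '\\': `\\`, '\b': `\b`, '\f': `\f`, '\n': `\n`, '\r': `\r`, '\t': `\t`,
}

// quote escapes only what is required: the quote, the backslash, and
// control characters below U+0020.
func quote(s string) string {
	var sb strings.Builder
	sb.WriteByte('"')
	for _, r := range s {
		if esc, ok := shortEscapes[r]; ok {
			sb.WriteString(esc)
		} else if r < 0x20 {
			fmt.Fprintf(&sb, `\u%04x`, r)
		} else {
			sb.WriteRune(r)
		}
	}
	sb.WriteByte('"')
	return sb.String()
}

func main() {
	out, err := canon(map[string]any{
		"schema":  "web-capture-archive.spec/v1",
		"browser": map[string]any{"extensions": []any{}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Note that characters such as `<` and `&` pass through unescaped, which is where this differs from the default behaviour of Go's `json.Encoder`.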
func Normalize ¶
func Normalize(format string, raw []byte) (normalized []byte, stripped map[string]any, err error)
Normalize produces the payload bytes that the writer should store when split=true. Each format has its own normalization rules specified in RFC 0001 §Payload Artifact. Unsupported formats return a not-implemented error so the runner can surface it as a per-capture error.
MVP scope: "text", "screenshot", "pdf", and "mhtml" are implemented. "a11y" is blocked on chrest#14 (Chrome SIGTRAP on kernel 6.17) and returns the not-implemented error until that issue is resolved.
func NormalizeStream ¶
func NormalizeStream(format string, src io.Reader) (io.Reader, map[string]any, error)
NormalizeStream is the streaming counterpart to Normalize. It reads the full input into memory, normalizes it, and returns a reader over the result plus the stripped map. Most formats cannot be normalized without seeing the whole document, so buffering is unavoidable; the streaming signature exists for interface symmetry with StreamCapture, not to reduce memory use.
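A buffering wrapper of this kind might look like the sketch below. The normalizer is passed in as a function so the example stays self-contained; `toyNormalize` is a placeholder that passes "text" through unchanged and rejects everything else, and does not reflect the RFC's actual normalization rules. All names here are hypothetical.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

var errNotImplemented = errors.New("format not implemented")

type normalizeFunc func(format string, raw []byte) ([]byte, map[string]any, error)

// normalizeStream drains src, normalizes the bytes in memory, and hands
// back a reader over the result.
func normalizeStream(format string, src io.Reader, normalize normalizeFunc) (io.Reader, map[string]any, error) {
	raw, err := io.ReadAll(src)
	if err != nil {
		return nil, nil, fmt.Errorf("read %s payload: %w", format, err)
	}
	out, stripped, err := normalize(format, raw)
	if err != nil {
		return nil, nil, err
	}
	return bytes.NewReader(out), stripped, nil
}

// toyNormalize is a placeholder, NOT the RFC's rules.
func toyNormalize(format string, raw []byte) ([]byte, map[string]any, error) {
	if format != "text" {
		return nil, nil, fmt.Errorf("%q: %w", format, errNotImplemented)
	}
	return raw, nil, nil
}

// demo runs one body through normalizeStream and returns the output.
func demo(format, body string) (string, error) {
	r, _, err := normalizeStream(format, strings.NewReader(body), toyNormalize)
	if err != nil {
		return "", err
	}
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	fmt.Println(demo("text", "hello"))
	fmt.Println(demo("a11y", "{}"))
}
```

Wrapping the sentinel with `%w` lets a caller use `errors.Is(err, errNotImplemented)` to turn an unsupported format into a per-capture error instead of failing the batch.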
Types ¶
type ArtifactRef ¶
type ArtifactRef struct {
ID string `json:"id"`
Size int64 `json:"size"`
MediaType string `json:"media_type"`
Normalized *bool `json:"normalized,omitempty"`
}
ArtifactRef points to a content-addressed blob via its markl ID.
type CaptureDefaults ¶
type CaptureDefaults struct {
Browser string `json:"browser,omitempty"`
Isolation string `json:"isolation,omitempty"`
Split *bool `json:"split,omitempty"`
}
CaptureDefaults are applied to any fields a given capture leaves unset. RFC 0001 §Capturer Protocol.
type CaptureError ¶
CaptureError is a per-capture error embedded in OutputCapture.
type CapturerInfo ¶
CapturerInfo identifies the capturer implementation + version.
type Error ¶
Error is a batch-level error (e.g. malformed input).
type Extension ¶
type Extension struct {
ID string `json:"id"`
Version string `json:"version"`
ManifestDigest string `json:"manifest_digest,omitempty"`
}
Extension is a loaded browser extension declared in the batch input or echoed in the spec artifact.
type HostFingerprint ¶
type HostFingerprint struct {
OS string
Arch string
Kernel string
Libc string
FontsDigest string
GPUVendor string
GPUModel string
GPUDriver string
}
HostFingerprint is the per-batch host snapshot embedded in every capture's spec artifact. Only `os`, `kernel`, and `arch` are required by RFC 0001; the other fields are best-effort and omitted on failure.
func GatherHost ¶
func GatherHost() HostFingerprint
GatherHost samples the host once at the start of a batch.
func (HostFingerprint) ToJSON ¶
func (h HostFingerprint) ToJSON() map[string]any
ToJSON converts HostFingerprint into the schema shape. Empty fields are omitted so consumers can distinguish "not gathered" from "gathered and empty".
type Input ¶
type Input struct {
Schema string `json:"schema"`
Writer WriterSpec `json:"writer"`
URL string `json:"url"`
Defaults *CaptureDefaults `json:"defaults,omitempty"`
Captures []InputCapture `json:"captures"`
}
Input is the single JSON document read from stdin.
type InputCapture ¶
type InputCapture struct {
Name string `json:"name"`
Format string `json:"format"`
Options json.RawMessage `json:"options,omitempty"`
Browser string `json:"browser,omitempty"`
Isolation string `json:"isolation,omitempty"`
Split *bool `json:"split,omitempty"`
Extensions []Extension `json:"extensions,omitempty"`
}
InputCapture is one entry in the batch input `captures` array.
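As an illustration of the input shape, the sketch below decodes a batch document and checks its `schema` value. It uses trimmed local copies of the types above, keeps `writer` opaque because WriterSpec's fields are not shown in this documentation, and assumes (without confirmation from the source) that a schema mismatch is rejected. `decodeInput` and `decodeString` are hypothetical helpers; a real capturer would read from `os.Stdin`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

const inputSchema = "web-capture-archive/v1"

// inputCapture and input are trimmed copies of the documented types.
type inputCapture struct {
	Name    string          `json:"name"`
	Format  string          `json:"format"`
	Options json.RawMessage `json:"options,omitempty"`
	Split   *bool           `json:"split,omitempty"`
}

type input struct {
	Schema   string          `json:"schema"`
	Writer   json.RawMessage `json:"writer"` // kept opaque in this sketch
	URL      string          `json:"url"`
	Captures []inputCapture  `json:"captures"`
}

// decodeInput reads one JSON document and validates its schema field.
func decodeInput(r io.Reader) (input, error) {
	var in input
	if err := json.NewDecoder(r).Decode(&in); err != nil {
		return input{}, fmt.Errorf("decode batch input: %w", err)
	}
	if in.Schema != inputSchema {
		return input{}, fmt.Errorf("unsupported schema %q", in.Schema)
	}
	return in, nil
}

func decodeString(s string) (input, error) {
	return decodeInput(strings.NewReader(s))
}

func main() {
	in, err := decodeString(`{
		"schema": "web-capture-archive/v1",
		"writer": {},
		"url": "https://example.com/",
		"captures": [
			{"name": "page-text", "format": "text"},
			{"name": "shot", "format": "screenshot", "options": {}}
		]
	}`)
	if err != nil {
		panic(err)
	}
	for _, c := range in.Captures {
		fmt.Printf("%s -> %s\n", c.Name, c.Format)
	}
}
```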
type Options ¶
type Options struct {
CapturerVersion string
Writer WriterSpec
URL string
Defaults *CaptureDefaults
}
Options configure the runner; most come from Input.
type Output ¶
type Output struct {
Schema string `json:"schema"`
Capturer CapturerInfo `json:"capturer"`
Errors []Error `json:"errors"`
Captures []OutputCapture `json:"captures"`
}
Output is the single JSON document written to stdout.
func Run ¶
Run executes every capture in order and returns the batch output. The runner never fails fatally on per-capture errors — they become OutputCapture.Error entries. Batch-level failures (e.g. writer.cmd empty) are returned as errors.
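The error-handling contract described here (per-capture failures become entries, and the loop keeps going) can be sketched as below. The types are trimmed, hypothetical stand-ins: the real CaptureError fields are not shown in this documentation, and `PayloadID` replaces the artifact refs for brevity. The split=true branch mirrors the MVP scope stated in the overview.

```go
package main

import (
	"errors"
	"fmt"
)

type resolved struct {
	Name, Format string
	Split        bool
}

// captureError and outputCapture are simplified stand-ins.
type captureError struct{ Message string }

type outputCapture struct {
	Name      string
	PayloadID string
	Error     *captureError
}

type captureFunc func(r resolved) (payloadID string, err error)

// runBatch processes captures in order and never aborts on a capture error.
func runBatch(captures []resolved, capture captureFunc) []outputCapture {
	out := make([]outputCapture, 0, len(captures))
	for _, c := range captures {
		entry := outputCapture{Name: c.Name}
		if c.Split {
			entry.Error = &captureError{Message: "split=true is not implemented"}
		} else if id, err := capture(c); err != nil {
			entry.Error = &captureError{Message: err.Error()}
		} else {
			entry.PayloadID = id
		}
		out = append(out, entry)
	}
	return out
}

// fakeCapture is a toy backend that fails for "a11y".
func fakeCapture(r resolved) (string, error) {
	if r.Format == "a11y" {
		return "", errors.New("a11y: not implemented")
	}
	return "id-for-" + r.Name, nil
}

func main() {
	batch := []resolved{
		{Name: "text", Format: "text"},
		{Name: "tree", Format: "a11y"},
		{Name: "pdf", Format: "pdf", Split: true},
	}
	for _, oc := range runBatch(batch, fakeCapture) {
		if oc.Error != nil {
			fmt.Printf("%s: error: %s\n", oc.Name, oc.Error.Message)
			continue
		}
		fmt.Printf("%s: %s\n", oc.Name, oc.PayloadID)
	}
}
```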
type OutputCapture ¶
type OutputCapture struct {
Name string `json:"name"`
Spec *ArtifactRef `json:"spec,omitempty"`
Payload *ArtifactRef `json:"payload,omitempty"`
Envelope *ArtifactRef `json:"envelope,omitempty"`
Error *CaptureError `json:"error,omitempty"`
}
OutputCapture is one entry in the batch output `captures` array. Either `Error` is set or the artifact refs are set, never both.
type Resolved ¶
type Resolved struct {
Name string
Format string
Options json.RawMessage
Browser string
Isolation string
Split bool
Extensions []Extension
}
Resolved is a capture after defaults have been applied.
func Resolve ¶
func Resolve(in InputCapture, def *CaptureDefaults) Resolved
Resolve applies batch-level defaults to a single input capture and produces the final tuple used by the runner.
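One plausible reading of the default-application rule is sketched below: a value set on the capture wins, otherwise the batch default applies, and `Split` falls back to false when neither side sets it. That last fallback is an assumption, as are the browser names used as sample values. The types are trimmed local copies (Options and Extensions are left out).

```go
package main

import "fmt"

type captureDefaults struct {
	Browser, Isolation string
	Split              *bool
}

type inputCapture struct {
	Name, Format, Browser, Isolation string
	Split                            *bool
}

type resolved struct {
	Name, Format, Browser, Isolation string
	Split                            bool
}

// resolve fills unset capture fields from def; def may be nil.
func resolve(in inputCapture, def *captureDefaults) resolved {
	r := resolved{Name: in.Name, Format: in.Format, Browser: in.Browser, Isolation: in.Isolation}
	if in.Split != nil {
		r.Split = *in.Split
	}
	if def == nil {
		return r
	}
	if r.Browser == "" {
		r.Browser = def.Browser
	}
	if r.Isolation == "" {
		r.Isolation = def.Isolation
	}
	// The pointer distinguishes "unset" from an explicit false.
	if in.Split == nil && def.Split != nil {
		r.Split = *def.Split
	}
	return r
}

func main() {
	yes := true
	def := &captureDefaults{Browser: "firefox", Split: &yes} // sample values
	fmt.Printf("%+v\n", resolve(inputCapture{Name: "a", Format: "text"}, def))
	fmt.Printf("%+v\n", resolve(inputCapture{Name: "b", Format: "pdf", Browser: "chrome"}, nil))
}
```

Using `*bool` for `Split` on the input side is what makes an explicit `false` on a capture able to override a `true` default.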
type WriterResult ¶
WriterResult is the shape the writer protocol returns on stdout. RFC 0001 §Writer Protocol allows additional fields; we ignore them.
func WriteThrough ¶
WriteThrough spawns the writer subprocess declared by cmd, streams src into its stdin until EOF, closes stdin, and parses the single JSON object the writer writes to stdout.
Per RFC 0001 §Writer Protocol, the writer MUST exit 0 on success and MUST write exactly one line of JSON to stdout containing `id` and `size`. Non-zero exit or malformed stdout is a hard error; the caller maps it into a per-capture error.
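The writer handshake can be sketched with `os/exec` as below. This is an illustration under assumptions, not the package's implementation: the writer command is modelled as an argv slice (the real WriterSpec shape is not shown here), `writerResult` carries only the two required fields, and the demo writer is a shell one-liner that assumes a POSIX `sh` is available.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

type writerResult struct {
	ID   string
	Size int64
}

// parseWriterResult accepts exactly one line of JSON containing id and size.
// Unknown fields are ignored, which is encoding/json's default behaviour.
func parseWriterResult(stdout []byte) (writerResult, error) {
	line := bytes.TrimSuffix(stdout, []byte("\n"))
	if len(line) == 0 || bytes.IndexByte(line, '\n') >= 0 {
		return writerResult{}, errors.New("writer must print exactly one line of JSON")
	}
	var raw struct {
		ID   *string `json:"id"`
		Size *int64  `json:"size"`
	}
	if err := json.Unmarshal(line, &raw); err != nil {
		return writerResult{}, fmt.Errorf("malformed writer output: %w", err)
	}
	if raw.ID == nil || raw.Size == nil {
		return writerResult{}, errors.New("writer output is missing id or size")
	}
	return writerResult{ID: *raw.ID, Size: *raw.Size}, nil
}

// writeThrough runs the writer, feeds it src on stdin, and parses stdout.
func writeThrough(argv []string, src io.Reader) (writerResult, error) {
	if len(argv) == 0 {
		return writerResult{}, errors.New("writer command is empty")
	}
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdin = src // exec copies src into the child's stdin and closes it at EOF
	var stdout bytes.Buffer
	cmd.Stdout = &stdout
	if err := cmd.Run(); err != nil { // a non-zero exit surfaces here
		return writerResult{}, fmt.Errorf("writer failed: %w", err)
	}
	return parseWriterResult(stdout.Bytes())
}

func main() {
	// Stand-in writer: discards stdin and prints a fixed, made-up result.
	argv := []string{"sh", "-c", `cat >/dev/null; echo '{"id":"example-id","size":5}'`}
	res, err := writeThrough(argv, strings.NewReader("hello"))
	if err != nil {
		fmt.Println("writer error:", err)
		return
	}
	fmt.Printf("id=%s size=%d\n", res.ID, res.Size)
}
```

Decoding into pointer fields is what lets the parser tell a missing `size` apart from a legitimate `"size": 0`.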
Source Files
- envelope.go
- fingerprint.go
- jcs.go
- mhtml.go
- normalize.go
- pdf.go
- png.go
- runner.go
- spec.go
- types.go
- writer.go