capturebatch

package
v0.2.4
Published: May 24, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Overview

Package capturebatch implements the chrest side of the Web Capture Archive Protocol (RFC 0001). The capturer reads a batch of capture requests as JSON on stdin, runs them sequentially, streams each artifact to a writer subprocess for content-addressed storage, and emits a JSON result envelope on stdout.

MVP scope: split=false only. For split=true, the runner emits a per-capture not-implemented error.

Constants

const CapturerName = "chrest"

CapturerName is chrest's identifier in the protocol. It is hardcoded so that chrest output can be distinguished from that of other capturers implementing RFC 0001.

const EnvelopeMediaType = "application/vnd.web-capture-archive.envelope+json"

EnvelopeMediaType is the Content-Type of the canonicalized envelope bytes. Stable across schema versions — the discriminator is the `schema` field inside the blob.

const EnvelopeSchemaPreview = "web-capture-archive.envelope/v1-preview"

EnvelopeSchemaPreview is emitted when the backend cannot populate the RFC-required `http.status` and `http.headers` fields (today the CDP / headless-Chrome backend; see the chrest#24 follow-up work). It is marked `-preview` so that v1-strict consumers reject it per the RFC's forward-compatibility rules, while preview-tolerant consumers opt in knowingly.

const EnvelopeSchemaV1 = "web-capture-archive.envelope/v1"

EnvelopeSchemaV1 is emitted when http.* is fully populated. Today this is produced by the Firefox/BiDi backend via network.responseCompleted event subscription.

const InputSchema = "web-capture-archive/v1"

InputSchema is the constant `schema` value for the batch input.

const OutputSchema = "web-capture-archive/v1"

OutputSchema is the constant `schema` value for the batch output.

const SpecMediaType = "application/vnd.web-capture-archive.spec+json"

SpecMediaType is the Content-Type of the canonicalized spec bytes.

const SpecSchema = "web-capture-archive.spec/v1"

SpecSchema is the `schema` constant in the spec artifact.

Variables

var PayloadMediaTypes = map[string]string{
	"text":              "text/plain; charset=utf-8",
	"pdf":               "application/pdf",
	"screenshot":        "image/png",
	"mhtml":             "multipart/related",
	"a11y":              "application/json",
	"html-monolith":     "text/html; charset=utf-8",
	"html-outer":        "text/html; charset=utf-8",
	"markdown-full":     "text/markdown; charset=utf-8",
	"markdown-reader":   "text/markdown; charset=utf-8",
	"markdown-selector": "text/markdown; charset=utf-8",
}

PayloadMediaTypes maps each supported capture format to the media type recorded on the payload ArtifactRef. RFC 0001 §Payload Artifact.

Functions

func BuildEnvelope

func BuildEnvelope(url string, capturedAt time.Time, stripped map[string]any, http *firefox.HTTPResponse) ([]byte, error)

BuildEnvelope assembles the envelope artifact for a resolved capture and returns the JCS-canonicalized bytes. When http is non-nil, it emits the full v1 schema with the http.* fields populated; when http is nil, it emits v1-preview with the `http` key omitted.

Per RFC 0001 §Envelope Artifact:

  • `schema`, `url`, `captured_at` are required.
  • `http.status` and `http.headers` are required by RFC 0001 for v1, but are present only when the backend supports network-event capture.
  • `stripped.<format>` is optional; the format normalizer returns what it removed, or nil if nothing.
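
The schema selection described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `httpResponse` is a hypothetical stand-in for firefox.HTTPResponse (whose fields are not shown here), the RFC 3339 timestamp format and the shape of `http.headers` are assumptions, and the result is left as a plain map rather than JCS-canonicalized bytes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// httpResponse is a hypothetical stand-in for firefox.HTTPResponse, whose
// real fields are not shown in this documentation.
type httpResponse struct {
	Status  int
	Headers map[string]string
}

// sampleAt is a fixed timestamp used only by the demo in main.
var sampleAt = time.Date(2026, 5, 24, 12, 0, 0, 0, time.UTC)

// envelopeFields builds the envelope value before canonicalization.
// It starts from the v1-preview schema and upgrades to v1 only when
// HTTP response data is available.
func envelopeFields(url string, capturedAt time.Time, stripped map[string]any, resp *httpResponse) map[string]any {
	env := map[string]any{
		"schema":      "web-capture-archive.envelope/v1-preview",
		"url":         url,
		"captured_at": capturedAt.UTC().Format(time.RFC3339), // format is an assumption
	}
	if resp != nil {
		env["schema"] = "web-capture-archive.envelope/v1"
		env["http"] = map[string]any{
			"status":  resp.Status,
			"headers": resp.Headers,
		}
	}
	// The stripped map is optional; leave the key out when nothing was removed.
	if len(stripped) > 0 {
		env["stripped"] = stripped
	}
	return env
}

func main() {
	preview := envelopeFields("https://example.com/", sampleAt, nil, nil)
	full := envelopeFields("https://example.com/", sampleAt, nil,
		&httpResponse{Status: 200, Headers: map[string]string{"content-type": "text/html"}})
	for _, env := range []map[string]any{preview, full} {
		b, err := json.Marshal(env)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```

The first printed envelope has no `http` key at all, which is how a consumer would tell a preview envelope from one with an empty response.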

func BuildSpec

func BuildSpec(
	r Resolved,
	browser firefox.BrowserInfo,
	host HostFingerprint,
	capturerVersion string,
) ([]byte, error)

BuildSpec assembles the spec artifact for a resolved capture and returns the JCS-canonicalized bytes.

Per RFC 0001 §Capture Spec Artifact:

  • `capture.options` is an echo of the input (may be any JSON value); empty object `{}` if input omitted it.
  • `browser.command_line`, `browser.prefs`, `browser.extensions[].manifest_digest` are optional; omitted when empty (vs present-and-empty).
  • `browser.extensions` is required; must be `[]` if none.
  • MUST NOT contain time-varying data.

func Canonicalize

func Canonicalize(v any) ([]byte, error)

Canonicalize encodes v as JCS (RFC 8785) bytes.

Our schema uses strings, integers, booleans, objects, arrays, and null — no floating-point numbers. This implementation is correct for that subset:

  • map keys are sorted by UTF-16 code units (same as alphabetical for ASCII-only keys, which our schema uses);
  • objects and arrays emit with no whitespace;
  • strings are escaped per RFC 8785 §3.2.2.2 (only required control chars are escaped; Go's default json.Encoder escapes more);
  • booleans and nulls emit as `true` / `false` / `null`;
  • integers (int, int64, json.Number) emit in base 10 with no leading zeros or `+`.

If the schema ever grows floating-point fields, this will need the ES6 ToString semantics from RFC 8785 §3.2.2.3; that is intentionally out of MVP scope.
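
For illustration, an encoder for this float-free subset might look like the sketch below. It is not the package's jcs.go: the function names are hypothetical, it assumes ASCII-only object keys (so a byte-wise sort matches UTF-16 code-unit order), and it rejects every type outside the subset, floats included.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// canonicalize encodes the float-free JSON subset described above.
// Assumption: object keys are ASCII, so sort.Strings (byte order) matches
// the UTF-16 code-unit order that RFC 8785 requires.
func canonicalize(v any) ([]byte, error) {
	var sb strings.Builder
	if err := encode(&sb, v); err != nil {
		return nil, err
	}
	return []byte(sb.String()), nil
}

func encode(sb *strings.Builder, v any) error {
	switch t := v.(type) {
	case nil:
		sb.WriteString("null")
	case bool:
		sb.WriteString(strconv.FormatBool(t))
	case int:
		sb.WriteString(strconv.Itoa(t))
	case int64:
		sb.WriteString(strconv.FormatInt(t, 10))
	case json.Number:
		n, err := t.Int64() // rejects fractions and exponents
		if err != nil {
			return fmt.Errorf("non-integer number %q: %w", t.String(), err)
		}
		sb.WriteString(strconv.FormatInt(n, 10))
	case string:
		writeString(sb, t)
	case []any:
		sb.WriteByte('[')
		for i, e := range t {
			if i > 0 {
				sb.WriteByte(',')
			}
			if err := encode(sb, e); err != nil {
				return err
			}
		}
		sb.WriteByte(']')
	case map[string]any:
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		sb.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				sb.WriteByte(',')
			}
			writeString(sb, k)
			sb.WriteByte(':')
			if err := encode(sb, t[k]); err != nil {
				return err
			}
		}
		sb.WriteByte('}')
	default:
		return fmt.Errorf("unsupported type %T", v)
	}
	return nil
}

// writeString escapes only the quote, the backslash, and control characters.
func writeString(sb *strings.Builder, s string) {
	short := map[rune]string{'"': `\"`, '\\': `\\`, '\b': `\b`, '\f': `\f`, '\n': `\n`, '\r': `\r`, '\t': `\t`}
	sb.WriteByte('"')
	for _, r := range s {
		if esc, ok := short[r]; ok {
			sb.WriteString(esc)
		} else if r < 0x20 {
			fmt.Fprintf(sb, `\u%04x`, r)
		} else {
			sb.WriteRune(r)
		}
	}
	sb.WriteByte('"')
}

func main() {
	out, err := canonicalize(map[string]any{"b": 1, "a": []any{true, nil, "x\ny"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"a":[true,null,"x\ny"],"b":1}
}
```

Returning an error for float64 rather than guessing at a number format keeps the sketch honest about the scope limit stated above.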

func Normalize

func Normalize(format string, raw []byte) (normalized []byte, stripped map[string]any, err error)

Normalize produces the payload bytes that the writer should store when split=true. Each format has its own normalization rules specified in RFC 0001 §Payload Artifact. Unsupported formats return a not-implemented error so the runner can surface it as a per-capture error.

MVP scope: "text", "screenshot", "pdf", and "mhtml" are implemented. "a11y" is blocked on chrest#14 (Chrome SIGTRAP on kernel 6.17) and returns the not-implemented error until that issue is resolved.

func NormalizeStream

func NormalizeStream(format string, src io.Reader) (io.Reader, map[string]any, error)

NormalizeStream is the streaming counterpart to Normalize. It reads the full input into memory, normalizes it, and returns a reader over the result plus the stripped map. Most formats cannot be normalized without seeing the whole document, so buffering is unavoidable; the streaming signature exists for interface symmetry with StreamCapture, not to save memory.
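
A buffer-then-wrap adapter of this kind can be sketched as below. The names `normalizeFunc`, `normalizeStream`, `trimTrailingSpace`, and `normalizeString` are hypothetical, and the toy normalizer is not one of the RFC's rules; it exists only to give the adapter something to call.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// normalizeFunc is a hypothetical per-format normalizer.
type normalizeFunc func(raw []byte) (normalized []byte, stripped map[string]any, err error)

// normalizeStream buffers src fully, applies fn, and exposes the result
// as a reader, so callers can treat it like any other stream.
func normalizeStream(src io.Reader, fn normalizeFunc) (io.Reader, map[string]any, error) {
	raw, err := io.ReadAll(src)
	if err != nil {
		return nil, nil, fmt.Errorf("read capture: %w", err)
	}
	out, stripped, err := fn(raw)
	if err != nil {
		return nil, nil, err
	}
	return bytes.NewReader(out), stripped, nil
}

// trimTrailingSpace is a toy normalizer: it drops trailing spaces and
// newlines and reports how many bytes it removed, or nil if none.
func trimTrailingSpace(raw []byte) ([]byte, map[string]any, error) {
	out := bytes.TrimRight(raw, " \n")
	removed := len(raw) - len(out)
	if removed == 0 {
		return out, nil, nil
	}
	return out, map[string]any{"trailing_bytes": removed}, nil
}

// normalizeString runs the toy normalizer over s and drains the reader.
func normalizeString(s string) (string, map[string]any, error) {
	r, stripped, err := normalizeStream(strings.NewReader(s), trimTrailingSpace)
	if err != nil {
		return "", nil, err
	}
	out, err := io.ReadAll(r)
	return string(out), stripped, err
}

func main() {
	out, stripped, err := normalizeString("hello \n")
	fmt.Printf("%q %v %v\n", out, stripped, err)
}
```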

Types

type ArtifactRef

type ArtifactRef struct {
	ID         string `json:"id"`
	Size       int64  `json:"size"`
	MediaType  string `json:"media_type"`
	Normalized *bool  `json:"normalized,omitempty"`
}

ArtifactRef points to a content-addressed blob via its markl ID.

type CaptureDefaults

type CaptureDefaults struct {
	Browser   string `json:"browser,omitempty"`
	Isolation string `json:"isolation,omitempty"`
	Split     *bool  `json:"split,omitempty"`
}

CaptureDefaults are applied to any fields a given capture leaves unset. RFC 0001 §Capturer Protocol.

type CaptureError

type CaptureError struct {
	Kind    string `json:"kind"`
	Message string `json:"message"`
}

CaptureError is a per-capture error embedded in OutputCapture.

type CapturerInfo

type CapturerInfo struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

CapturerInfo identifies the capturer implementation and its version.

type Error

type Error struct {
	Kind    string `json:"kind"`
	Message string `json:"message"`
}

Error is a batch-level error (e.g. malformed input).

type Extension

type Extension struct {
	ID             string `json:"id"`
	Version        string `json:"version"`
	ManifestDigest string `json:"manifest_digest,omitempty"`
}

Extension is a loaded browser extension declared in the batch input or echoed in the spec artifact.

type HostFingerprint

type HostFingerprint struct {
	OS          string
	Arch        string
	Kernel      string
	Libc        string
	FontsDigest string
	GPUVendor   string
	GPUModel    string
	GPUDriver   string
}

HostFingerprint is the per-batch host snapshot embedded in every capture's spec artifact. Only `os`, `kernel`, `arch` are required by RFC 0001; other fields are best-effort and omitted on failure.

func GatherHost

func GatherHost() HostFingerprint

GatherHost samples the host once at the start of a batch.

func (HostFingerprint) ToJSON

func (h HostFingerprint) ToJSON() map[string]any

ToJSON converts HostFingerprint into the schema shape. Empty fields are omitted so consumers can distinguish "not gathered" from "gathered and empty".
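
The omit-when-empty conversion might be sketched like this. The key names `os`, `arch`, and `kernel` come from the text above; `libc` and `fonts_digest` are assumed spellings, and the struct is a trimmed, hypothetical copy of HostFingerprint.

```go
package main

import "fmt"

// hostFingerprint is a trimmed, hypothetical copy of HostFingerprint.
type hostFingerprint struct {
	OS          string
	Arch        string
	Kernel      string
	Libc        string
	FontsDigest string
}

// toJSON returns the schema-shaped map, leaving out any field that is
// empty rather than emitting it as "".
func (h hostFingerprint) toJSON() map[string]any {
	out := map[string]any{}
	put := func(key, val string) {
		if val != "" {
			out[key] = val
		}
	}
	put("os", h.OS)
	put("arch", h.Arch)
	put("kernel", h.Kernel)
	put("libc", h.Libc)                 // key name is an assumption
	put("fonts_digest", h.FontsDigest) // key name is an assumption
	return out
}

func main() {
	h := hostFingerprint{OS: "linux", Arch: "amd64", Kernel: "6.17"}
	fmt.Println(h.toJSON())
}
```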

type Input

type Input struct {
	Schema   string           `json:"schema"`
	Writer   WriterSpec       `json:"writer"`
	URL      string           `json:"url"`
	Defaults *CaptureDefaults `json:"defaults,omitempty"`
	Captures []InputCapture   `json:"captures"`
}

Input is the single JSON document read from stdin.

type InputCapture

type InputCapture struct {
	Name       string          `json:"name"`
	Format     string          `json:"format"`
	Options    json.RawMessage `json:"options,omitempty"`
	Browser    string          `json:"browser,omitempty"`
	Isolation  string          `json:"isolation,omitempty"`
	Split      *bool           `json:"split,omitempty"`
	Extensions []Extension     `json:"extensions,omitempty"`
}

InputCapture is one entry in the batch input `captures` array.
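
As a rough illustration of the batch input shape, the sketch below decodes a sample document into trimmed, hypothetical copies of Input and InputCapture. The writer command `store-blob` is made up, and rejecting a mismatched `schema` value is an assumption about how a reader might validate; the empty `writer.cmd` check mirrors the batch-level failure that Run documents.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const inputSchema = "web-capture-archive/v1"

type writerSpec struct {
	Cmd []string `json:"cmd"`
}

type inputCapture struct {
	Name   string `json:"name"`
	Format string `json:"format"`
	Split  *bool  `json:"split,omitempty"`
}

type input struct {
	Schema   string         `json:"schema"`
	Writer   writerSpec     `json:"writer"`
	URL      string         `json:"url"`
	Captures []inputCapture `json:"captures"`
}

// decodeInput parses one batch document and applies two basic checks.
func decodeInput(data []byte) (input, error) {
	var in input
	if err := json.Unmarshal(data, &in); err != nil {
		return in, fmt.Errorf("malformed input: %w", err)
	}
	if in.Schema != inputSchema {
		return in, fmt.Errorf("unexpected schema %q", in.Schema)
	}
	if len(in.Writer.Cmd) == 0 {
		return in, errors.New("writer.cmd is empty")
	}
	return in, nil
}

// sample is an illustrative batch; "store-blob" is a made-up writer command.
const sample = `{
  "schema": "web-capture-archive/v1",
  "writer": {"cmd": ["store-blob"]},
  "url": "https://example.com/",
  "captures": [
    {"name": "page-text", "format": "text"},
    {"name": "page-pdf", "format": "pdf", "split": false}
  ]
}`

func main() {
	in, err := decodeInput([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, c := range in.Captures {
		fmt.Println(c.Name, c.Format, c.Split != nil)
	}
}
```

Keeping `split` as a pointer preserves the difference between "not specified" and an explicit `false`, which matters once defaults are applied.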

type Options

type Options struct {
	CapturerVersion string
	Writer          WriterSpec
	URL             string
	Defaults        *CaptureDefaults
}

Options configure the runner; most come from Input.

type Output

type Output struct {
	Schema   string          `json:"schema"`
	Capturer CapturerInfo    `json:"capturer"`
	Errors   []Error         `json:"errors"`
	Captures []OutputCapture `json:"captures"`
}

Output is the single JSON document written to stdout.

func Run

func Run(ctx context.Context, inputCaptures []InputCapture, opts Options) (Output, error)

Run executes every capture in order and returns the batch output. The runner never fails fatally on per-capture errors — they become OutputCapture.Error entries. Batch-level failures (e.g. writer.cmd empty) are returned as errors.
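
The split between batch-level and per-capture failures can be sketched as a loop like the one below. All names are hypothetical, the `do` callback stands in for the real capture-and-store step, and the `kind` strings are made up for the example; only the overall shape follows the description above.

```go
package main

import (
	"errors"
	"fmt"
)

type captureError struct {
	Kind    string
	Message string
}

type outputCapture struct {
	Name  string
	Error *captureError
}

type capture struct {
	Name  string
	Split bool
}

// runBatch processes captures in order. An empty writer command aborts the
// whole batch; anything that goes wrong for one capture is recorded on that
// capture's entry and the loop continues.
func runBatch(writerCmd []string, captures []capture, do func(capture) error) ([]outputCapture, error) {
	if len(writerCmd) == 0 {
		return nil, errors.New("writer.cmd is empty")
	}
	out := make([]outputCapture, 0, len(captures))
	for _, c := range captures {
		oc := outputCapture{Name: c.Name}
		switch {
		case c.Split:
			// MVP scope: split=true is reported per capture, not fatally.
			oc.Error = &captureError{Kind: "not-implemented", Message: "split=true is not supported"}
		default:
			if err := do(c); err != nil {
				oc.Error = &captureError{Kind: "capture-failed", Message: err.Error()}
			}
		}
		out = append(out, oc)
	}
	return out, nil
}

// failOn returns a capture step that fails only for the named capture.
func failOn(name string) func(capture) error {
	return func(c capture) error {
		if c.Name == name {
			return fmt.Errorf("capture %q failed", c.Name)
		}
		return nil
	}
}

func main() {
	batch := []capture{{Name: "ok"}, {Name: "bad"}, {Name: "split", Split: true}}
	res, err := runBatch([]string{"store-blob"}, batch, failOn("bad"))
	if err != nil {
		panic(err)
	}
	for _, r := range res {
		fmt.Println(r.Name, r.Error != nil)
	}
}
```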

type OutputCapture

type OutputCapture struct {
	Name     string        `json:"name"`
	Spec     *ArtifactRef  `json:"spec,omitempty"`
	Payload  *ArtifactRef  `json:"payload,omitempty"`
	Envelope *ArtifactRef  `json:"envelope,omitempty"`
	Error    *CaptureError `json:"error,omitempty"`
}

OutputCapture is one entry in the batch output `captures` array. Either `Error` is set or the artifact refs are set, never both.

type Resolved

type Resolved struct {
	Name       string
	Format     string
	Options    json.RawMessage
	Browser    string
	Isolation  string
	Split      bool
	Extensions []Extension
}

Resolved is a capture after defaults have been applied.

func Resolve

func Resolve(in InputCapture, def *CaptureDefaults) Resolved

Resolve applies batch-level defaults to a single input capture and produces the final tuple used by the runner.
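
One plausible way to apply the defaults is sketched below, using trimmed, hypothetical copies of the types. It assumes that a value set on the capture always wins over the batch default and that `split` falls back to false when neither level sets it; the browser names are illustrative only.

```go
package main

import "fmt"

type captureDefaults struct {
	Browser   string
	Isolation string
	Split     *bool
}

type inputCapture struct {
	Name      string
	Format    string
	Browser   string
	Isolation string
	Split     *bool
}

type resolved struct {
	Name      string
	Format    string
	Browser   string
	Isolation string
	Split     bool
}

// resolve fills unset capture fields from the batch defaults.
// A nil defaults pointer means the batch supplied none.
func resolve(in inputCapture, def *captureDefaults) resolved {
	r := resolved{Name: in.Name, Format: in.Format, Browser: in.Browser, Isolation: in.Isolation}
	if in.Split != nil {
		r.Split = *in.Split
	}
	if def == nil {
		return r
	}
	if r.Browser == "" {
		r.Browser = def.Browser
	}
	if r.Isolation == "" {
		r.Isolation = def.Isolation
	}
	// Only consult the default when the capture said nothing about split.
	if in.Split == nil && def.Split != nil {
		r.Split = *def.Split
	}
	return r
}

func main() {
	yes := true
	def := &captureDefaults{Browser: "firefox", Split: &yes}
	fmt.Printf("%+v\n", resolve(inputCapture{Name: "a", Format: "text"}, def))
	fmt.Printf("%+v\n", resolve(inputCapture{Name: "b", Format: "pdf", Browser: "chrome"}, def))
}
```

Because `Split` is a pointer on the input side, an explicit `false` on a capture can override a batch default of `true`, which a plain bool could not express.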

type WriterResult

type WriterResult struct {
	ID   string `json:"id"`
	Size int64  `json:"size"`
}

WriterResult is the shape the writer protocol returns on stdout. RFC 0001 §Writer Protocol allows additional fields; we ignore them.

func WriteThrough

func WriteThrough(ctx context.Context, cmd []string, src io.Reader) (WriterResult, error)

WriteThrough spawns the writer subprocess declared by cmd, streams src into its stdin until EOF, closes stdin, and parses the single JSON object the writer writes to stdout.

Per RFC 0001 §Writer Protocol, the writer MUST exit 0 on success and MUST write exactly one line of JSON to stdout containing `id` and `size`. Non-zero exit or malformed stdout is a hard error; the caller maps it into a per-capture error.
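
The writer handshake can be sketched with os/exec as below. This is an illustration under assumptions, not the package's writer.go: the function names are hypothetical, and the demo writer is a `sh` one-liner that drains stdin and prints a fixed result, so it needs a POSIX shell on the PATH.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

type writerResult struct {
	ID   string `json:"id"`
	Size int64  `json:"size"`
}

// writeThrough runs cmd, feeds it src on stdin, and parses its stdout.
func writeThrough(ctx context.Context, cmd []string, src io.Reader) (writerResult, error) {
	var res writerResult
	if len(cmd) == 0 {
		return res, errors.New("writer.cmd is empty")
	}
	c := exec.CommandContext(ctx, cmd[0], cmd[1:]...)
	// os/exec copies src into the child's stdin and closes the pipe at EOF.
	c.Stdin = src
	var stdout bytes.Buffer
	c.Stdout = &stdout
	if err := c.Run(); err != nil { // non-zero exit surfaces here
		return res, fmt.Errorf("writer failed: %w", err)
	}
	// Unknown fields are ignored; a second JSON value is a parse error.
	if err := json.Unmarshal(bytes.TrimSpace(stdout.Bytes()), &res); err != nil {
		return res, fmt.Errorf("malformed writer output: %w", err)
	}
	if res.ID == "" {
		return res, errors.New("writer output is missing id")
	}
	return res, nil
}

// storeString is a small convenience wrapper used by the demo.
func storeString(cmd []string, payload string) (writerResult, error) {
	return writeThrough(context.Background(), cmd, strings.NewReader(payload))
}

func main() {
	// Stand-in writer: drains stdin and prints a fixed result line.
	fake := []string{"sh", "-c", `cat >/dev/null; echo '{"id":"demo-id","size":5,"extra":true}'`}
	res, err := storeString(fake, "hello")
	fmt.Println(res.ID, res.Size, err)
}
```

A real writer would compute the ID from the bytes it received; the fixed reply here only exercises the parsing path.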

type WriterSpec

type WriterSpec struct {
	Cmd []string `json:"cmd"`
}

WriterSpec is the writer-command contract from the orchestrator.

Source Files

  • envelope.go
  • fingerprint.go
  • jcs.go
  • mhtml.go
  • normalize.go
  • pdf.go
  • png.go
  • runner.go
  • spec.go
  • types.go
  • writer.go
