webfetch

package

v0.17.0 Latest Latest Go to latest Published: May 23, 2026 License: Apache-2.0 Imports: 26 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/secmon-lab/warren

Links

Open Source Insights

README ¶

WebFetch

Fetch a web page and return its body. When an LLM is configured, the body is reformatted as Markdown and screened for indirect prompt injection; otherwise the extracted text is returned verbatim and every call is gated by a HITL approval dialog.

Overview

web_fetch follows a 3-stage pipeline:

Fetch the URL over HTTP/HTTPS (30 s timeout, fixed User-Agent). No response-size cap is applied; the request timeout and the context deadline bound the operation.
Extract the textual body. HTML pages are parsed with golang.org/x/net/html, stripped of <script> / <style> / <svg> / hidden nodes / comments, and rendered as semi-structured plain text that preserves headings, lists, code blocks, and tables. Text-based responses (text/plain, application/json, etc.) pass through verbatim. Binary content types are rejected.
Analyze (optional). When an LLM is configured via --webfetch-llm-* flags, the body goes through an LLM call that (a) reformats the content into clean Markdown and (b) checks for indirect prompt injection (role-change attempts, system-prompt overrides, control-token-style strings, etc.). The body is delivered as a user-role message; all instructions live in the system prompt, which declares the user message untrusted. When no LLM is configured, this stage is skipped — the extracted text is returned verbatim with "llm_analysis": "disabled" in the response.

If the LLM judges the body malicious, the tool returns an error tagged validation with the LLM-supplied reason. Otherwise, it returns the formatted Markdown (or the raw extracted text in LLM-disabled mode).

Configuration

The webfetch tool owns its LLM configuration independently from warren's main --gemini-* / --claude-* flags, so the screening provider can differ from (or be absent compared to) the main warren LLM.

Environment Variable	CLI Flag	Default	Description
`WARREN_WEBFETCH_LLM_PROVIDER`	`--webfetch-llm-provider`	(empty)	LLM provider for analyze step: `gemini`, `claude`, or `openai`. Empty disables LLM analysis.
`WARREN_WEBFETCH_LLM_MODEL`	`--webfetch-llm-model`	(empty)	LLM model name. Required when provider is set.
`WARREN_WEBFETCH_LLM_ARGS`	`--webfetch-llm-args`	(empty)	Provider-specific options as `key=value,key=value`. Recognized: `project_id`, `location`, `temperature`.
`WARREN_WEBFETCH_LLM_API_KEY`	`--webfetch-llm-api-key`	(empty)	API key. Required for `openai` and `claude` Anthropic-direct. Ignored for `gemini` and `claude` Vertex.

Claude routing

The claude provider auto-selects between Vertex AI and the direct Anthropic API based on the supplied inputs:

Inputs	Route
`project_id` + `location` in `--webfetch-llm-args`, no api-key	Vertex AI
`--webfetch-llm-api-key` set, no `project_id` / `location`	Anthropic direct
Both Vertex args and api-key	Start-up error (ambiguous)
Neither	Start-up error (route unspecified)

Examples

Gemini on Vertex AI:

--webfetch-llm-provider gemini
--webfetch-llm-model    gemini-2.5-flash
--webfetch-llm-args     "project_id=my-proj,location=us-central1"

Claude on Vertex AI:

--webfetch-llm-provider claude
--webfetch-llm-model    "claude-sonnet-4@20250514"
--webfetch-llm-args     "project_id=my-proj,location=us-east5"

Claude via Anthropic direct API:

--webfetch-llm-provider claude
--webfetch-llm-model    claude-sonnet-4-5-20250929
--webfetch-llm-api-key  "$ANTHROPIC_API_KEY"

OpenAI:

--webfetch-llm-provider openai
--webfetch-llm-model    gpt-4o
--webfetch-llm-args     "temperature=0.2"
--webfetch-llm-api-key  "$OPENAI_API_KEY"

LLM disabled (no flags): every call is HITL-gated, response includes "llm_analysis": "disabled".

Start-up ping

Whenever --webfetch-llm-provider is set, warren issues a minimal max_tokens=1 generation request at start-up so misconfigurations surface immediately instead of at first use. The ping cost is negligible (single-digit tokens) but it catches:

API key typos or expired keys
Model name typos
Vertex AI project_id / location mistakes or permission gaps
Network reachability issues

The ping covers the analyze provider only; the warren-wide LLM (Gemini/Claude) is independent.

Available Functions

Function	Description
`web_fetch`	Fetch the given URL and return Markdown (when LLM is configured) or raw extracted text (when not). Argument: `url` (http/https only).

HITL (Human-in-the-Loop)

web_fetch is conditionally registered as a HITL-gated tool. The decision is driven by --webfetch-llm-provider:

--webfetch-llm-provider empty (LLM disabled) → HITL approval dialog is shown for every web_fetch call. The human is the only safety net since indirect-prompt-injection screening does not run in this mode.
--webfetch-llm-provider set (LLM enabled) → HITL approval is suppressed. The LLM screening handles IPI detection.

The wiring lives in pkg/cli/serve.go: it consults each tool's RequiresHITL() method (implemented by webfetch) and passes the resulting names to aster.WithHITLTools / bluebell.WithHITLTools.

Setup

No external API keys are required when running in LLM-disabled mode (every call goes through HITL approval). When LLM analysis is enabled, the API key requirement depends on the provider — see the routing table above.

Limitations

No URL allow/deny filtering. Access control is delegated to the HITL approval dialog (LLM-disabled mode) and / or the LLM IPI screen (LLM-enabled mode).
No SSRF protection beyond scheme restriction. Internal IPs and metadata endpoints are not blocked at this layer.
UTF-8 assumed. The Content-Type charset is not honored; non-UTF-8 pages may produce garbled output.
No summarization. The LLM only reformats — it does not condense the body. Large pages (after extraction) are passed to the upstream agent as-is.
No retries. Transient LLM or HTTP failures bubble up to the agent loop, which owns the retry strategy.

Tests

The package tests cover both pure logic and external integrations:

Unit tests for the LLM-flag parser / builder / ping helper (llmflag_test.go) — always run; uses gollem mocks for the ping path.
Unit / extract / analyze mock tests — always run as part of go test ./pkg/tool/webfetch/....
TestExtract_Live_RealWebsites — always runs. Fetches a small set of real security-info sites (NVD, MITRE ATT&CK, Wikipedia, GitHub Security Advisory, OWASP) and asserts that extraction produces non-empty text containing expected keywords and no surviving <script> / <style> markup. Per-case transient failures (network errors, anti-bot 4xx/5xx) are converted into per-case skips rather than hard failures.
TestAnalyze_Live_* — runs when both TEST_GEMINI_PROJECT_ID and TEST_GEMINI_LOCATION are set. Verifies indirect-prompt-injection detection against a real Gemini endpoint.

Documentation ¶

Index ¶

type Action

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Action ¶

type Action struct {
	// contains filtered or unexported fields
}

Action implements the interfaces.Tool interface for fetching web content and sanitizing it through an LLM-based pipeline.

LLM analysis is configured via the --webfetch-llm-* flags. If --webfetch-llm-provider is empty, the analyze step is disabled and the extracted text is returned verbatim. In that mode, HITL approval is required for every web_fetch invocation (RequiresHITL returns true) so a human gates each request.

func (*Action) Configure ¶

func (x *Action) Configure(ctx context.Context) error

func (*Action) Description ¶ added in v0.14.0

func (x *Action) Description() string

func (*Action) Flags ¶

func (x *Action) Flags() []cli.Flag

func (*Action) Helper ¶

func (x *Action) Helper() *cli.Command

func (*Action) ID ¶ added in v0.14.0

func (x *Action) ID() string

func (*Action) LogValue ¶

func (x *Action) LogValue() slog.Value

func (*Action) Prompt ¶

func (x *Action) Prompt(_ context.Context) (string, error)

func (*Action) RequiresHITL ¶ added in v0.17.0

func (x *Action) RequiresHITL() bool

RequiresHITL reports whether the web_fetch tool should be gated by a HITL (Human-in-the-Loop) approval dialog. When LLM analysis is enabled, the LLM performs indirect-prompt-injection screening so the dialog is unnecessary and is skipped; when LLM is disabled, the dialog is the only remaining safety layer and is always required.

func (*Action) Run ¶

func (x *Action) Run(ctx context.Context, name string, args map[string]any) (map[string]any, error)

func (*Action) Specs ¶

func (x *Action) Specs(_ context.Context) ([]gollem.ToolSpec, error)

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL