grammargen

package
v0.20.2
Published: Jun 6, 2026 License: MIT Imports: 19 Imported by: 4

README

grammargen

grammargen is the pure-Go grammar compiler used by gotreesitter. It turns a grammar definition into a *gotreesitter.Language, a serialized .bin blob, a tree-sitter-compatible parser.c, or generated Go DSL source.

The authoring surface is intentionally input-neutral:

  • Go DSL grammars built with NewGrammar, Define, Seq, Choice, Token, Field, PrecLeft, and related constructors.
  • Resolved upstream grammar.json files, which are the preferred import format for tree-sitter grammars.
  • .grammar files, a compact ecosystem-agnostic grammar format that parses into the same IR and can emit Go DSL.

grammar.js import also exists, but grammar.json is usually more reliable because helper functions, require() calls, and JavaScript evaluation have already been resolved by tree-sitter.

Authoring Commands

Use doctor when changing a grammar. It validates, generates parser tables, runs embedded tests when present, optionally parses a sample, and suggests the next command.

go run ./cmd/grammargen doctor calc -text '1+2*3'
go run ./cmd/grammargen doctor -json /tmp/grammar_parity/go/src/grammar.json -sample sample.go
go run ./cmd/grammargen doctor -grammar ./mini.grammar -text '123'
go run ./cmd/grammargen doctor calc -text '1+2*3' -conflicts 3
go run ./cmd/grammargen doctor calc -text '1+2*3' -format json

Use parse when you want quick sample-to-tree feedback:

go run ./cmd/grammargen parse calc -text '1+2*3'
go run ./cmd/grammargen parse -grammar ./mini.grammar -stdin
go run ./cmd/grammargen parse calc -text '1+2*3' -format sexpr
go run ./cmd/grammargen parse calc -text '1+2*3' -format json

Use emit to write artifacts from any supported input:

# gotreesitter blob
go run ./cmd/grammargen emit go -bin grammars/grammar_blobs/go.bin

# Go DSL source from a resolved grammar.json
go run ./cmd/grammargen emit \
  -json /tmp/grammar_parity/go/src/grammar.json \
  -go grammargen/go_grammar.go \
  -pkg grammargen \
  -func GoGrammar

# Go DSL source from a .grammar file
go run ./cmd/grammargen emit -grammar ./mini.grammar -go ./mini_grammar.go -pkg grammargen

# Resolved grammar.json from any supported input
go run ./cmd/grammargen emit -grammar ./mini.grammar -json-out ./mini.grammar.json

# tree-sitter parser.c
go run ./cmd/grammargen emit calc -c /tmp/parser.c

# inferred highlight query
go run ./cmd/grammargen emit calc -highlight

For golden parse snapshots, write the current tree once, then compare future runs against it:

go run ./cmd/grammargen parse calc -text '1+2*3' -write-expect ./calc.sexpr
go run ./cmd/grammargen parse calc -text '1+2*3' -expect ./calc.sexpr
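
The write-once, compare-later pattern behind -write-expect and -expect can be sketched in a few lines of Go. This is only an illustration of the idea, not grammargen's implementation: checkGolden and roundTrip are hypothetical helpers, and the tree strings are placeholders rather than real parser output.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkGolden compares got with the snapshot stored at path.
// With update set, it writes the snapshot instead of comparing.
func checkGolden(path, got string, update bool) error {
	if update {
		return os.WriteFile(path, []byte(got), 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if string(want) != got {
		return fmt.Errorf("snapshot mismatch\nwant: %s\ngot:  %s", want, got)
	}
	return nil
}

// roundTrip writes first as the snapshot in a temporary directory, then
// reports whether first still matches and whether second is flagged.
func roundTrip(first, second string) (bool, bool, error) {
	dir, err := os.MkdirTemp("", "golden")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "calc.sexpr")
	if err := checkGolden(path, first, true); err != nil {
		return false, false, err
	}
	sameOK := checkGolden(path, first, false) == nil
	changeCaught := checkGolden(path, second, false) != nil
	return sameOK, changeCaught, nil
}

func main() {
	// Placeholder tree text, not real grammargen output.
	sameOK, changeCaught, err := roundTrip("(program (number))", "(program (ERROR))")
	fmt.Println(sameOK, changeCaught, err)
}
```

Keeping the snapshot file under version control makes a changed tree shape show up as an ordinary diff in review.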

parse -strict exits non-zero when parsing finishes with ERROR nodes or an early stop condition. doctor treats sample parse errors as gate failures by default.

The legacy flag surface still works:

go run ./cmd/grammargen -validate calc
go run ./cmd/grammargen -report calc
go run ./cmd/grammargen -grammar ./mini.grammar -go ./mini_grammar.go

For grammars that benefit from local LR(1) state splitting, pass -lr-split:

go run ./cmd/grammargen doctor go -lr-split -sample sample.go
go run ./cmd/grammargen emit go -lr-split -bin grammars/grammar_blobs/go.bin

Go DSL

The first defined rule is the start rule. Names beginning with _ are hidden rules. String rules create literal tokens, pattern rules create regex terminals, and Token groups a rule into one lexer token.

func MiniExprGrammar() *Grammar {
	g := NewGrammar("mini_expr")

	g.Define("program", Sym("expression"))
	g.Define("expression", Choice(
		PrecLeft(1, Seq(
			Field("left", Sym("expression")),
			Field("operator", Str("+")),
			Field("right", Sym("expression")),
		)),
		PrecLeft(2, Seq(
			Field("left", Sym("expression")),
			Field("operator", Str("*")),
			Field("right", Sym("expression")),
		)),
		Sym("number"),
		Seq(Str("("), Sym("expression"), Str(")")),
	))
	g.Define("number", Token(Repeat1(Pat(`[0-9]`))))
	g.SetExtras(Pat(`\s`))

	g.Test("precedence", "1 + 2 * 3", "")

	return g
}

An embedded test with an empty expected S-expression only checks that parsing finishes without ERROR nodes. Fill in the expected S-expression when a rule's exact tree shape should be locked down.
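
To see which grouping the two PrecLeft levels in MiniExprGrammar are meant to select, the following sketch groups the same input with a small hand-written precedence-climbing parser. It is a different technique from the LR tables grammargen builds, and every name in it (parser, group, the prec map) is hypothetical; it only shows the intended shape for "1 + 2 * 3".

```go
package main

import "fmt"

// prec mirrors the PrecLeft levels in the grammar above: + is 1, * is 2.
var prec = map[byte]int{'+': 1, '*': 2}

type parser struct {
	src string
	pos int
}

func (p *parser) skipSpace() {
	for p.pos < len(p.src) && p.src[p.pos] == ' ' {
		p.pos++
	}
}

// primary parses a run of digits or a parenthesized expression.
func (p *parser) primary() string {
	p.skipSpace()
	if p.pos < len(p.src) && p.src[p.pos] == '(' {
		p.pos++
		inner := p.expr(1)
		p.skipSpace()
		if p.pos < len(p.src) && p.src[p.pos] == ')' {
			p.pos++
		}
		return inner
	}
	start := p.pos
	for p.pos < len(p.src) && p.src[p.pos] >= '0' && p.src[p.pos] <= '9' {
		p.pos++
	}
	return p.src[start:p.pos]
}

// expr parses operators whose level is at least minPrec. Recursing with
// level+1 on the right-hand side makes every operator left-associative.
func (p *parser) expr(minPrec int) string {
	left := p.primary()
	for {
		p.skipSpace()
		if p.pos >= len(p.src) {
			return left
		}
		op := p.src[p.pos]
		level, ok := prec[op]
		if !ok || level < minPrec {
			return left
		}
		p.pos++
		right := p.expr(level + 1)
		left = fmt.Sprintf("(%s %c %s)", left, op, right)
	}
}

// group returns src with explicit parentheses around every binary operation.
func group(src string) string {
	p := &parser{src: src}
	return p.expr(1)
}

func main() {
	fmt.Println(group("1 + 2 * 3")) // (1 + (2 * 3))
	fmt.Println(group("1 + 2 + 3")) // ((1 + 2) + 3)
}
```

When the exact tree matters, the third argument of g.Test is where the corresponding S-expression would go.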

Common grammar-level settings:

  • SetExtras(...): whitespace, comments, or other extra tokens.
  • SetConflicts(...): declared ambiguity groups that should keep GLR alternatives.
  • SetExternals(...): external scanner tokens.
  • SetInline(...): rules to inline during normalization.
  • SetWord(...): word token used for keyword extraction.
  • SetSupertypes(...): structural supertypes exposed in metadata.
  • Precedences: ordered named and symbol precedence levels imported from grammar.json.

Useful DSL helpers live in grammar.go: CommaSep, CommaSep1, SepBy, SepBy1, Parens, Brackets, Braces, AppendChoice, and ExtendGrammar.
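
Separator helpers of this kind conventionally expand to a sequence followed by a repeat. The sketch below shows that expansion on a stand-in rule tree. It does not use grammargen's Rule type, the names sepBy1 and sepBy are hypothetical, and the real CommaSep and SepBy helpers may differ in argument order or in the exact tree they build.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a stand-in rule tree for this sketch; it is not grammargen's Rule.
type node struct {
	kind     string // "str", "sym", "seq", "repeat", or "optional"
	value    string // literal text or symbol name
	children []node
}

func str(s string) node    { return node{kind: "str", value: s} }
func sym(s string) node    { return node{kind: "sym", value: s} }
func seq(c ...node) node   { return node{kind: "seq", children: c} }
func repeat(c node) node   { return node{kind: "repeat", children: []node{c}} }
func optional(c node) node { return node{kind: "optional", children: []node{c}} }

// sepBy1 matches one or more items with sep between them.
func sepBy1(sep string, item node) node {
	return seq(item, repeat(seq(str(sep), item)))
}

// sepBy matches zero or more items with sep between them.
func sepBy(sep string, item node) node {
	return optional(sepBy1(sep, item))
}

// String renders the tree in the .grammar expression syntax.
func (n node) String() string {
	switch n.kind {
	case "str":
		return fmt.Sprintf("%q", n.value)
	case "sym":
		return n.value
	}
	parts := make([]string, len(n.children))
	for i, c := range n.children {
		parts[i] = c.String()
	}
	return n.kind + "(" + strings.Join(parts, ", ") + ")"
}

func main() {
	fmt.Println(sepBy1(",", sym("expression")))
	fmt.Println(sepBy(",", sym("expression")))
}
```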

.grammar Files

.grammar is the ecosystem-agnostic text format. It is currently line-oriented, so keep each rule definition on one line.

grammar mini

extras = [ /\s/ ]

rule program = number
rule number = token(repeat1(/[0-9]/))

Run it through the same command surface:

go run ./cmd/grammargen doctor -grammar ./mini.grammar -text '123'
go run ./cmd/grammargen parse -grammar ./mini.grammar -text '123'
go run ./cmd/grammargen emit -grammar ./mini.grammar -go ./mini_grammar.go -pkg grammargen
go run ./cmd/grammargen emit -grammar ./mini.grammar -json-out ./mini.grammar.json
go run ./cmd/grammargen emit -grammar ./mini.grammar -bin /tmp/mini.bin

Supported top-level lines:

grammar <name>
extras = [ <rule-expr>, ... ]
word = <rule_name>
supertypes = [ <rule_name>, ... ]
conflicts = [ [<rule>, <rule>], ... ]
rule <name> = <rule-expr>
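
Because the format is line-oriented, each top-level line can be recognized on its own. The sketch below classifies lines into the forms listed above. It is a simplified illustration, not the code in parse_grammar_file.go: classify is a hypothetical function, and it does not parse the rule expressions themselves.

```go
package main

import (
	"fmt"
	"strings"
)

// classify splits one top-level .grammar line into its kind, name, and body.
// kind is "blank", "grammar", "rule", "setting", or "invalid".
func classify(line string) (kind, name, body string) {
	line = strings.TrimSpace(line)
	if line == "" {
		return "blank", "", ""
	}
	if strings.HasPrefix(line, "grammar ") {
		return "grammar", strings.TrimSpace(strings.TrimPrefix(line, "grammar ")), ""
	}
	// Everything else has the form "<head> = <body>"; cut at the first "="
	// so that "=" characters inside the body are left alone.
	head, tail, ok := strings.Cut(line, "=")
	if !ok {
		return "invalid", "", line
	}
	head, tail = strings.TrimSpace(head), strings.TrimSpace(tail)
	if strings.HasPrefix(head, "rule ") {
		return "rule", strings.TrimSpace(strings.TrimPrefix(head, "rule ")), tail
	}
	switch head {
	case "extras", "word", "supertypes", "conflicts":
		return "setting", head, tail
	}
	return "invalid", "", line
}

func main() {
	lines := []string{
		"grammar mini",
		"",
		`extras = [ /\s/ ]`,
		"rule program = number",
		"rule number = token(repeat1(/[0-9]/))",
	}
	for _, l := range lines {
		kind, name, body := classify(l)
		fmt.Printf("%-8s %-8s %s\n", kind, name, body)
	}
}
```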

Supported expressions:

"literal"
/regex/
identifier

seq(a, b, ...)
choice(a, b, ...)
repeat(a)
repeat1(a)
optional(a)
token(a)
field("name", a)
prec(1, a)
prec.left(1, a)
prec.right(1, a)
prec.dynamic(1, a)
alias(a, name)
alias(a, "anonymous_name")

For large upstream grammars, resolved grammar.json remains the most complete input. .grammar is the portable authoring format and should stay independent of any host language syntax.

Validation Loop

For small package-local checks, keep tests focused:

go test ./cmd/grammargen ./grammargen \
  -run '^TestJSONGenerate$|^TestGenerateWithReportCtxSkipsDiagnosticsWhenNotRequested$' \
  -count=1

When changing GLR, incremental, import, or parity-sensitive behavior, use the Docker parity runners and keep runs to one grammar at a time:

# Focused package test inside Docker
bash cgo_harness/docker/run_parity_in_docker.sh \
  -- "cd /workspace && go test ./grammargen -run '^TestName$' -count=1"

# Real-corpus parity for one grammar
bash cgo_harness/docker/run_single_grammar_parity.sh typescript

# Focused grammargen real-corpus lane
bash cgo_harness/docker/run_grammargen_focus_targets.sh --mode real-corpus --langs typescript

# Focused grammargen-vs-C lane
bash cgo_harness/docker/run_grammargen_focus_targets.sh --mode cgo --langs typescript

Do not run repo-wide go test ./... or broad race sweeps on the host for grammargen work. Heavy correctness, parity, and race coverage belongs in Docker or CI, scoped to one language or one regression at a time.

Reading the Package

  • grammar.go: public IR and Go DSL constructors.
  • parse_grammar_file.go: .grammar parser.
  • import_grammarjson.go: resolved tree-sitter grammar.json import.
  • import_grammarjs.go: best-effort grammar.js import.
  • normalize.go: rule lowering, metadata, fields, terminals, and production construction.
  • lr.go: LR/LALR table construction and conflict resolution.
  • lr_split.go, lr_split_oracle.go: local LR(1) split support.
  • dfa.go, nfa.go, regex.go: lexer construction.
  • encode.go, assemble.go: Language assembly and blob encoding.
  • diagnostics.go: validation, embedded tests, and generation reports.
  • emit_grammar_go.go, export_grammarjson.go, codegen_c.go: artifact emitters.
  • parity_test.go, parity_real_corpus_test.go: generated-vs-reference parity infrastructure.

Troubleshooting

Start with doctor. It reports validation warnings, generation failures, table sizes, conflict count, embedded test status, and sample parse status. Add -conflicts N when precedence or GLR behavior needs inspection, or -format json when another tool should consume the report.

Use parse when a grammar generates but the tree shape looks wrong. It prints the root type, byte range, error flag, stop reason, and named-node S-expression. Use -format sexpr or -expect/-write-expect for golden tree snapshots.

For upstream imports, prefer src/grammar.json from a generated tree-sitter repository. If import fails on grammar.js, regenerate or locate the resolved JSON first.

External-scanner grammars need a compatible Go scanner binding in grammars/. The generated grammar can expose external tokens, but scanner behavior is still hand-written runtime code.

When corpus parity fails, narrow before changing generator behavior: one language, one focused test, one sample if possible. Use GTS_GRAMMARGEN_REAL_CORPUS_ONLY, GTS_GRAMMARGEN_REAL_CORPUS_MAX_CASES, and the focused Docker runners to keep the workload reproducible and attributable.

Documentation

Overview

Package grammargen implements a pure-Go grammar generator for gotreesitter. It compiles grammar definitions expressed in a Go DSL into binary blobs that the gotreesitter runtime can load and use for parsing.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AddConflict

func AddConflict(g *Grammar, names ...string)

AddConflict appends a GLR conflict declaration to the grammar.

func AppendChoice

func AppendChoice(g *Grammar, name string, rule *Rule)

AppendChoice appends an alternative to an existing rule, wrapping the prior definition in a Choice if needed.

func EmitC

func EmitC(name string, lang *gotreesitter.Language) (string, error)

EmitC emits a parser.c string from a compiled Language struct.

func EmitGrammarGo

func EmitGrammarGo(g *Grammar, pkgName, funcName string) ([]byte, error)

EmitGrammarGo takes a Grammar IR and emits Go source code that reconstructs it using grammargen DSL calls. The output is a standalone Go file in the given package with a function of the given name that returns *Grammar.

func ExportGrammarJSON

func ExportGrammarJSON(g *Grammar) ([]byte, error)

ExportGrammarJSON serializes a Grammar struct to the tree-sitter grammar.json format. The output is compatible with ImportGrammarJSON: passing the exported bytes back through ImportGrammarJSON should produce an equivalent grammar.

The JSON structure matches tree-sitter's canonical resolved grammar.json:

{
  "name": "...",
  "word": "...",
  "rules": { ... },
  "extras": [...],
  "conflicts": [...],
  "externals": [...],
  "inline": [...],
  "supertypes": [...]
}
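
As a rough illustration of that layout, the sketch below decodes the top-level keys with encoding/json and lists the rule names. The grammarFile struct and ruleNames function are hypothetical and unrelated to grammargen's own importer. The element types are assumptions based on tree-sitter's usual output, where rule bodies are objects with a "type" key.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// grammarFile mirrors the top-level keys shown above. Rule bodies are kept
// as raw JSON because this sketch does not interpret them.
type grammarFile struct {
	Name       string                     `json:"name"`
	Word       string                     `json:"word"`
	Rules      map[string]json.RawMessage `json:"rules"`
	Extras     []json.RawMessage          `json:"extras"`
	Conflicts  [][]string                 `json:"conflicts"`
	Externals  []json.RawMessage          `json:"externals"`
	Inline     []string                   `json:"inline"`
	Supertypes []string                   `json:"supertypes"`
}

// ruleNames returns the grammar name and its rule names in sorted order.
func ruleNames(data []byte) (string, []string, error) {
	var g grammarFile
	if err := json.Unmarshal(data, &g); err != nil {
		return "", nil, err
	}
	names := make([]string, 0, len(g.Rules))
	for name := range g.Rules {
		names = append(names, name)
	}
	sort.Strings(names)
	return g.Name, names, nil
}

const sample = `{
  "name": "mini",
  "rules": {
    "program": {"type": "SYMBOL", "name": "number"},
    "number": {"type": "PATTERN", "value": "[0-9]+"}
  },
  "extras": [{"type": "PATTERN", "value": "\\s"}],
  "conflicts": [],
  "externals": [],
  "inline": [],
  "supertypes": []
}`

func main() {
	name, names, err := ruleNames([]byte(sample))
	fmt.Println(name, names, err)
}
```

A Go map does not keep key order, so this sketch sorts the names. Since the first defined rule is the start rule, a real importer has to preserve the order of the "rules" object instead.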

func Generate

func Generate(g *Grammar) ([]byte, error)

Generate compiles a Grammar definition into a binary blob that gotreesitter can load via DecodeLanguageBlob / loadEmbeddedLanguage. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.

func GenerateC

func GenerateC(g *Grammar) (string, error)

GenerateC compiles a Grammar to a standard tree-sitter parser.c string. The output targets tree-sitter's C runtime (ABI 14/15) and covers only the features that grammargen currently emits.

func GenerateHighlightQueries

func GenerateHighlightQueries(base, extended *Grammar) string

GenerateHighlightQueries produces tree-sitter highlight queries for rules added by a grammar extension. It diffs base and extended to find new rules, then applies naming conventions to generate appropriate highlights.

Conventions:

  • New Str() tokens matching identifier pattern -> @keyword
  • *_declaration with "name" field -> name: (identifier) @type.definition
  • *_variant with "name" field -> name: (identifier) @constructor
  • *_block with "description" field -> description: @string
  • *_expression -> no default highlight (expressions are structural)
  • *_statement -> no default highlight
  • Field named "params"/"parameters" -> children (identifier) @variable.parameter
  • let_declaration name -> @variable.definition
  • New string tokens that are operators (non-alphanumeric) -> @operator
  • New string tokens that are keywords (alphanumeric) -> @keyword

func GenerateHighlightQuery

func GenerateHighlightQuery(g *Grammar) string

GenerateHighlightQuery infers a tree-sitter highlight query from grammar structure. It maps well-known rule names and patterns to standard capture names:

  • comment → @comment
  • string, string_content → @string
  • number, integer, float → @number
  • true, false → @boolean
  • null, nil, none → @constant.builtin
  • identifier → @variable
  • type_identifier → @type
  • function keywords → @keyword.function
  • control flow keywords → @keyword.control
  • other keyword-like string terminals → @keyword
  • operators → @operator
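
The name-based part of this mapping amounts to a lookup table. The sketch below covers only the fixed rule names from the list and writes one query pattern per match. The captures table and highlightQuery function are hypothetical, and the keyword and operator inference is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// captures maps well-known rule names to capture names, following the list
// above. Keyword and operator handling is intentionally omitted.
var captures = map[string]string{
	"comment":         "@comment",
	"string":          "@string",
	"string_content":  "@string",
	"number":          "@number",
	"integer":         "@number",
	"float":           "@number",
	"true":            "@boolean",
	"false":           "@boolean",
	"null":            "@constant.builtin",
	"nil":             "@constant.builtin",
	"none":            "@constant.builtin",
	"identifier":      "@variable",
	"type_identifier": "@type",
}

// highlightQuery emits one "(rule) @capture" pattern per recognized rule
// name, preserving the order of ruleNames.
func highlightQuery(ruleNames []string) string {
	var b strings.Builder
	for _, name := range ruleNames {
		if capture, ok := captures[name]; ok {
			fmt.Fprintf(&b, "(%s) %s\n", name, capture)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(highlightQuery([]string{"program", "number", "comment", "identifier"}))
}
```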

func GenerateLanguage

func GenerateLanguage(g *Grammar) (*gotreesitter.Language, error)

GenerateLanguage compiles a Grammar into a Language struct without encoding. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.

func GenerateLanguageAndBlob added in v0.10.0

func GenerateLanguageAndBlob(g *Grammar) (*gotreesitter.Language, []byte, error)

GenerateLanguageAndBlob compiles a Grammar into both a Language and its serialized blob representation in a single generation pass.

func GenerateLanguageAndBlobWithContext added in v0.10.0

func GenerateLanguageAndBlobWithContext(ctx context.Context, g *Grammar) (*gotreesitter.Language, []byte, error)

GenerateLanguageAndBlobWithContext is like GenerateLanguageAndBlob but accepts a context for cancellation.

func GenerateLanguageWithContext

func GenerateLanguageWithContext(ctx context.Context, g *Grammar) (*gotreesitter.Language, error)

GenerateLanguageWithContext is like GenerateLanguage but accepts a context for cancellation. When the context is cancelled, LR table construction and DFA building abort promptly, allowing the caller to reclaim memory that would otherwise be held by an orphaned goroutine.
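
The cancellation behavior described here follows the usual Go pattern of checking the context inside a long-running loop. The sketch below uses a stand-in function, buildStates, in place of real table construction; none of its names come from grammargen.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// buildStates stands in for a long-running generation loop. It checks the
// context once per iteration so a cancelled caller gets control back quickly.
func buildStates(ctx context.Context, total int) (int, error) {
	built := 0
	for built < total {
		if err := ctx.Err(); err != nil {
			return built, err
		}
		built++ // placeholder for constructing one state
	}
	return built, nil
}

// runCancelled calls buildStates with an already-cancelled context and
// reports whether it stopped with context.Canceled.
func runCancelled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := buildStates(ctx, 1000)
	return errors.Is(err, context.Canceled)
}

// runToCompletion calls buildStates with a context that is never cancelled.
func runToCompletion() int {
	n, _ := buildStates(context.Background(), 1000)
	return n
}

func main() {
	fmt.Println(runCancelled(), runToCompletion())
}
```

A caller would typically create the context with context.WithTimeout or context.WithCancel and pass it as the first argument to GenerateLanguageWithContext.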

func RunTests

func RunTests(g *Grammar) error

RunTests generates the grammar and runs all embedded test cases. Returns nil if all tests pass, or an error describing failures.

func Validate

func Validate(g *Grammar) []string

Validate checks the grammar for common issues and returns warnings.

Types

type AliasInfo

type AliasInfo struct {
	ChildIndex int
	Name       string
	Named      bool
}

AliasInfo stores alias information for a child position.

type Assoc

type Assoc int

Assoc is the associativity of a production.

const (
	AssocNone Assoc = iota
	AssocLeft
	AssocRight
)
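
Precedence and associativity are the classic inputs for settling a shift/reduce conflict. The sketch below shows the textbook rule on stand-in types: the higher precedence wins, and on a tie the associativity decides. The names assoc and resolve are hypothetical, and grammargen's actual resolution in lr.go may apply further rules, such as named precedences or declared conflicts.

```go
package main

import "fmt"

// assoc is a local stand-in that mirrors the three Assoc values above.
type assoc int

const (
	assocNone assoc = iota
	assocLeft
	assocRight
)

// resolve picks an action when the parser could either reduce a finished
// production or shift the lookahead token.
func resolve(reducePrec, shiftPrec int, a assoc) string {
	switch {
	case reducePrec > shiftPrec:
		return "reduce"
	case reducePrec < shiftPrec:
		return "shift"
	case a == assocLeft:
		return "reduce" // (a + b) + c
	case a == assocRight:
		return "shift" // a = (b = c)
	}
	return "conflict"
}

func main() {
	// After "1 + 2" with "*" next: the + production has level 1, * has level 2.
	fmt.Println(resolve(1, 2, assocLeft))
	// After "1 + 2" with "+" next: equal levels, left-associative.
	fmt.Println(resolve(1, 1, assocLeft))
	// Equal levels with no associativity stay unresolved.
	fmt.Println(resolve(1, 1, assocNone))
}
```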

type ConflictDiag

type ConflictDiag struct {
	Kind          ConflictKind
	State         int
	LookaheadSym  int
	Actions       []lrAction // the conflicting actions
	Resolution    string     // how it was resolved (or "GLR" if kept)
	IsMergedState bool       // was this state produced by LALR merging?
	MergeCount    int        // how many merge origins this state has
}

ConflictDiag describes a conflict encountered during LR table construction.

func (*ConflictDiag) String

func (d *ConflictDiag) String(ng *NormalizedGrammar) string

type ConflictKind

type ConflictKind int

ConflictKind describes the type of LR conflict.

const (
	ShiftReduce ConflictKind = iota
	ReduceReduce
)

type FieldAssign

type FieldAssign struct {
	ChildIndex int
	FieldName  string
}

FieldAssign maps a child position in a production to a field name.

type GenerateReport

type GenerateReport struct {
	Language        *gotreesitter.Language
	Blob            []byte
	Conflicts       []ConflictDiag
	SplitCandidates []splitCandidate
	SplitResult     *splitReport
	Warnings        []string
	SymbolCount     int
	StateCount      int
	TokenCount      int
}

GenerateReport holds the result of grammar generation with diagnostics.

func GenerateWithReport

func GenerateWithReport(g *Grammar) (*GenerateReport, error)

GenerateWithReport compiles a grammar and returns a full diagnostic report.

type Grammar

type Grammar struct {
	Name                                       string
	Rules                                      map[string]*Rule
	RuleOrder                                  []string // order rules were defined (first = start rule)
	Extras                                     []*Rule
	Conflicts                                  [][]string
	Externals                                  []*Rule
	Inline                                     []string
	Word                                       string
	ReservedWordSets                           []ReservedWordSet
	Supertypes                                 []string
	Tests                                      []TestCase    // embedded test cases
	EnableLRSplitting                          bool          // opt-in: attempt LR(1) state splitting for merge pathology
	BinaryRepeatMode                           bool          // use tree-sitter's binary repeat helper shape (aux→seq(aux,aux)|inner)
	FlattenGeneratedRepeatAux                  bool          // allow generated repeat helpers to participate in hidden-choice flattening
	ReuseRepeatAuxForParents                   []string      // parent rule names whose repeat helpers may be shared by canonical body
	PreserveKeywordIdentifierConflicts         bool          // keep keyword-as-identifier S/R ambiguity for grammars like Fortran
	ExactPrefixStates                          int           // keep this many LR(1) states exact before merge compaction
	Precedences                                [][]PrecEntry // ordered precedence levels (each level: earlier = higher prec)
	ChoiceLiftThreshold                        int           // if >0, lift inline CHOICE nodes with more alternatives than this into auxiliary nonterminals to prevent production explosion
	SuppressEquivalentExternalReduceLookaheads bool          // suppress external scanner validity for duplicate reduce-only lookaheads
	ExternalReduceFollowLookaheads             []string      // external token names that may be valid after reducing in the current state
	PriorityInlinePatterns                     []string      // anonymous pattern terminals that should win same-length ties against named tokens
}

Grammar is the top-level grammar definition.

func AliasSuperGrammar

func AliasSuperGrammar() *Grammar

AliasSuperGrammar returns a grammar that exercises aliases and supertypes.

Supertypes:

_expression is a supertype with children: number, string, identifier, binary_expression

Aliases:

In assignment, the left-hand side identifier is aliased to "variable"
In binary_expression, the operator string is aliased to "op"

func CalcGrammar

func CalcGrammar() *Grammar

CalcGrammar returns a calculator grammar that exercises precedence and associativity. It defines:

  • Binary operators: +, -, *, / with standard math precedence
  • Unary prefix minus: -x (highest precedence)
  • Parenthesized expressions: (x)
  • Integer literals: number

func ExtScannerGrammar

func ExtScannerGrammar() *Grammar

ExtScannerGrammar returns a grammar with external scanner tokens. It models a simple block-structured language where INDENT and DEDENT tokens are produced by an external scanner, as in Python-style indentation-sensitive grammars.

program: repeat(statement)
statement: simple_statement | block
simple_statement: identifier ";"
block: identifier ":" NEWLINE INDENT repeat(statement) DEDENT

External tokens: INDENT, DEDENT, NEWLINE

func ExtendGrammar

func ExtendGrammar(name string, base *Grammar, customize func(g *Grammar)) *Grammar

ExtendGrammar creates a new grammar that inherits from a base grammar. The customize function receives the new grammar with all base rules copied in, and can override rules, add new ones, or modify extras/conflicts/etc.

Example:

cpp := ExtendGrammar("cpp", cGrammar(), func(g *Grammar) {
    g.Define("class_declaration", Seq(Str("class"), Sym("identifier"), Sym("class_body")))
    // Override an existing rule:
    g.Define("declaration", Choice(Sym("class_declaration"), Sym("function_declaration")))
})

func FortranGrammar added in v0.16.0

func FortranGrammar() *Grammar

FortranGrammar returns the fortran grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func GLRGrammar

func GLRGrammar() *Grammar

GLRGrammar returns a grammar with intentional ambiguity that requires GLR parsing. It models a simplified C-like language where `a * b` can be parsed as either multiplication or a pointer declaration:

expression_statement: a * b ;  (multiplication)
pointer_declaration:  a * b ;  (type * name)

The conflict between _expression and type_name is declared, causing the parser to fork stacks when it encounters the ambiguity.

func GoGrammar

func GoGrammar() *Grammar

GoGrammar returns the go grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func INIGrammar

func INIGrammar() *Grammar

INIGrammar returns a production-grade INI file grammar.

Parses a superset of the major INI dialects (Windows API, Python configparser, Git config, PHP parse_ini_file):

  • Sections: [name] and [section "subsection"] (Git-style)
  • Key-value pairs: key = value, key : value, key=value
  • Comments: ; and # (full-line only)
  • Quoted string values: "..." with \" and \\ escapes
  • Global pairs: key=value before any [section]
  • Empty values: key= (value is optional)

INI is line-oriented: newlines are significant (not extras). Only horizontal whitespace (spaces, tabs) is treated as extras.

func ImportGrammarJS

func ImportGrammarJS(source []byte) (*Grammar, error)

ImportGrammarJS parses a tree-sitter grammar.js file and returns a Grammar IR. This uses gotreesitter's own JavaScript grammar to parse the file, demonstrating the full-circle capability: gotreesitter parsing its own input format.

func ImportGrammarJSON

func ImportGrammarJSON(data []byte) (*Grammar, error)

ImportGrammarJSON parses a tree-sitter grammar.json file (the canonical resolved form generated by `tree-sitter generate`) and returns a Grammar IR. This is more reliable than ImportGrammarJS because grammar.json has no require() calls, helper functions, or other JavaScript-specific constructs.

func JSGrammar added in v0.16.0

func JSGrammar() *Grammar

JSGrammar returns the JSX-capable JavaScript grammar.

func JSONGrammar

func JSONGrammar() *Grammar

JSONGrammar returns the JSON grammar defined using the Go DSL. This mirrors tree-sitter-json's grammar.js definition.

func JSXGrammar added in v0.16.0

func JSXGrammar() *Grammar

JSXGrammar returns the JSX-capable JavaScript grammar. The upstream lockfile does not carry a separate jsx language; JSX is parsed by the JavaScript grammar.

func JavaScriptGrammar added in v0.16.0

func JavaScriptGrammar() *Grammar

JavaScriptGrammar returns the javascript grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func JavascriptGrammar added in v0.16.0

func JavascriptGrammar() *Grammar

JavascriptGrammar is kept for consistency with grammars.JavascriptLanguage.

func KeywordGrammar

func KeywordGrammar() *Grammar

KeywordGrammar returns a simplified language grammar that exercises keyword extraction and the word token mechanism. Keywords "var" and "return" match the identifier pattern but are promoted to their own symbols by the keyword DFA.
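
The idea behind keyword extraction can be sketched as a two-step lexer: match the word token first, then check the matched text against a keyword table. The sketch below is a simplification with hypothetical names (lexWords, keywords); grammargen uses a keyword DFA rather than a map lookup.

```go
package main

import "fmt"

// keywords lists the words that are promoted to their own token kind.
var keywords = map[string]bool{"var": true, "return": true}

type token struct {
	kind string
	text string
}

func isWordStart(c byte) bool {
	return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

func isWordByte(c byte) bool {
	return isWordStart(c) || (c >= '0' && c <= '9')
}

// lexWords returns the word tokens in src and skips every other byte.
// A word becomes a keyword only when the whole word is in the table.
func lexWords(src string) []token {
	var out []token
	for i := 0; i < len(src); {
		if !isWordStart(src[i]) {
			i++
			continue
		}
		start := i
		for i < len(src) && isWordByte(src[i]) {
			i++
		}
		word := src[start:i]
		kind := "identifier"
		if keywords[word] {
			kind = word
		}
		out = append(out, token{kind: kind, text: word})
	}
	return out
}

func main() {
	for _, t := range lexWords("var variable; return variable;") {
		fmt.Printf("%-10s %q\n", t.kind, t.text)
	}
}
```

Matching the full word first is what keeps "variable" an identifier even though it begins with "var".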

func KotlinGrammar added in v0.16.0

func KotlinGrammar() *Grammar

KotlinGrammar returns the kotlin grammar. Code generated by EmitGrammarGo. DO NOT EDIT. Source: fwcd/tree-sitter-kotlin@57170e50a32b29122b9e41a4a24aea8be1a16599/src/grammar.json.

func LoxGrammar

func LoxGrammar() *Grammar

LoxGrammar returns a production-grade Lox grammar (Crafting Interpreters spec).

Implements the full Lox language:

  • Variables: var x = expr;
  • Functions: fun name(params) { body }
  • Classes: class Name < Super { methods }
  • Control flow: if/else, while, for
  • Operators: or, and, ==, !=, <, >, <=, >=, +, -, *, /, !, unary -
  • Calls and property access: f(args), obj.prop, obj.prop = val
  • Literals: numbers, strings, true, false, nil, this, super
  • Print: print expr;
  • Return: return expr;
  • Comments: // line comments
  • Block scoping: { statements }

func MarkdownGrammar added in v0.20.0

func MarkdownGrammar() *Grammar

MarkdownGrammar returns the Go-DSL definition of the CommonMark + GFM Markdown grammar. Equivalent in shape to the upstream tree-sitter-markdown grammar.json but owned in Go so it can be refactored, extended via ExtendGrammar, and compiled directly with GenerateLanguage(AndBlob).

The external scanner is NOT attached here. Callers must follow GenerateLanguage with `grammars.AdaptScannerForLanguage("markdown", lang)` to attach the hand-written external scanner that owns the 47 block/inline external tokens.

func MustacheGrammar

func MustacheGrammar() *Grammar

MustacheGrammar returns a production-grade Mustache template grammar.

Implements the required Mustache spec features:

  • Interpolation: {{ name }}
  • Unescaped interpolation: {{{ name }}} and {{& name }}
  • Sections: {{# name }} ... {{/ name }}
  • Inverted sections: {{^ name }} ... {{/ name }}
  • Comments: {{! comment text }}
  • Partials: {{> partial_name }}
  • Dotted names: {{ person.name }}
  • Implicit iterator: {{ . }}
  • Raw text between tags

The grammar treats {{ and }} as delimiters. Text outside tags is raw content. The DFA handles {{{ vs {{ disambiguation via maximal munch.
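
Maximal munch means the lexer prefers the longest token that matches at the current position. The sketch below applies that rule to the Mustache delimiters with a plain prefix comparison. The names delims and longestDelim are hypothetical, and the real lexer is a generated DFA rather than a loop over strings.

```go
package main

import (
	"fmt"
	"strings"
)

// delims lists the delimiter tokens that can start at a given position.
var delims = []string{"{{", "{{{", "}}", "}}}"}

// longestDelim returns the longest delimiter that src starts with, or ""
// if src does not start with a delimiter.
func longestDelim(src string) string {
	best := ""
	for _, d := range delims {
		if strings.HasPrefix(src, d) && len(d) > len(best) {
			best = d
		}
	}
	return best
}

func main() {
	fmt.Printf("%q\n", longestDelim("{{{ name }}}"))
	fmt.Printf("%q\n", longestDelim("{{ name }}"))
	fmt.Printf("%q\n", longestDelim("plain text"))
}
```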

func NewGrammar

func NewGrammar(name string) *Grammar

NewGrammar creates a new grammar with the given name.

func ParseGrammarFile

func ParseGrammarFile(source string) (*Grammar, error)

ParseGrammarFile parses a declarative .grammar file into a Grammar IR.

Syntax:

grammar <name>

extras = [ /\s/ ]
word = <rule_name>
supertypes = [ <rule_name>, ... ]
conflicts = [ [<rule>, <rule>], ... ]

rule <name> = <expr>

Expressions:

"string"         string literal
/pattern/        regex pattern
<name>           symbol reference
seq(a, b, ...)   sequence
choice(a, b, ..) alternation
repeat(a)        zero or more
repeat1(a)       one or more
optional(a)      optional
token(a)         token boundary
field("name", a) field annotation
prec(n, a)       precedence
prec.left(n, a)  left-associative precedence
prec.right(n, a) right-associative precedence

func SwiftABIManglingGrammar added in v0.15.2

func SwiftABIManglingGrammar() *Grammar

SwiftABIManglingGrammar returns a conservative grammar for Swift ABI mangled names. It intentionally models ABI symbol text, not Swift source syntax.

func SwiftGrammar added in v0.16.0

func SwiftGrammar() *Grammar

SwiftGrammar returns the swift grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func TSGrammar added in v0.16.0

func TSGrammar() *Grammar

TSGrammar returns the TypeScript grammar.

func TSXGrammar added in v0.16.0

func TSXGrammar() *Grammar

TSXGrammar returns the tsx grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func TsxGrammar added in v0.16.0

func TsxGrammar() *Grammar

TsxGrammar is kept for consistency with grammars.TsxLanguage.

func TypeScriptGrammar added in v0.16.0

func TypeScriptGrammar() *Grammar

TypeScriptGrammar returns the typescript grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func TypescriptGrammar added in v0.16.0

func TypescriptGrammar() *Grammar

TypescriptGrammar is kept for consistency with grammars.TypescriptLanguage.

func (*Grammar) Define

func (g *Grammar) Define(name string, rule *Rule)

Define adds a rule to the grammar. The first rule defined is the start rule.

func (*Grammar) SetConflicts

func (g *Grammar) SetConflicts(conflicts ...[]string)

SetConflicts declares grammar conflicts for GLR.

func (*Grammar) SetExternals

func (g *Grammar) SetExternals(rules ...*Rule)

SetExternals declares external scanner tokens.

func (*Grammar) SetExtras

func (g *Grammar) SetExtras(rules ...*Rule)

SetExtras sets the extra rules (e.g. whitespace, comments).

func (*Grammar) SetInline

func (g *Grammar) SetInline(names ...string)

SetInline marks rules to be inlined.

func (*Grammar) SetSupertypes

func (g *Grammar) SetSupertypes(names ...string)

SetSupertypes declares supertype rules.

func (*Grammar) SetWord

func (g *Grammar) SetWord(name string)

SetWord sets the word token for keyword extraction.

func (*Grammar) Test

func (g *Grammar) Test(name, input, expected string)

Test adds an embedded test case. Input is parsed and the resulting tree is compared against the expected S-expression. If expected is empty, the test only checks that no ERROR nodes appear.

func (*Grammar) TestError

func (g *Grammar) TestError(name, input string)

TestError adds an embedded test case that expects parse errors.

type GrammarDiff

type GrammarDiff struct {
	AddedRules        []string
	RemovedRules      []string
	ModifiedRules     []string // rules present in both but with different definitions
	ExtrasChanged     bool
	ConflictsChanged  bool
	ExternalsChanged  bool
	WordChanged       bool
	SupertypesChanged bool
}

GrammarDiff describes the differences between two grammar versions.

func DiffGrammars

func DiffGrammars(old, new *Grammar) *GrammarDiff

DiffGrammars compares two grammar versions and returns a diff.

func (*GrammarDiff) HasChanges

func (d *GrammarDiff) HasChanges() bool

HasChanges returns true if any differences were found.

func (*GrammarDiff) String

func (d *GrammarDiff) String() string

String returns a human-readable summary of the diff.
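
The rule-level part of such a diff reduces to comparing two name-to-definition maps. The sketch below does that on plain strings. diffRules is a hypothetical function, and representing a rule definition as a string is an assumption made to keep the example self-contained; DiffGrammars itself works on *Grammar values.

```go
package main

import (
	"fmt"
	"sort"
)

// diffRules compares two rule sets keyed by rule name and returns the
// names that were added, removed, or changed, each in sorted order.
func diffRules(old, next map[string]string) (added, removed, modified []string) {
	for name, def := range next {
		prev, ok := old[name]
		if !ok {
			added = append(added, name)
		} else if prev != def {
			modified = append(modified, name)
		}
	}
	for name := range old {
		if _, ok := next[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(modified)
	return added, removed, modified
}

func main() {
	old := map[string]string{
		"program": "number",
		"number":  "token(repeat1(/[0-9]/))",
		"legacy":  `"old"`,
	}
	next := map[string]string{
		"program": "expression",
		"number":  "token(repeat1(/[0-9]/))",
		"expression": "number",
	}
	added, removed, modified := diffRules(old, next)
	fmt.Println(added, removed, modified)
}
```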

type LRTables

type LRTables struct {
	// ActionTable[state][symbol] = list of actions (multiple = conflict/GLR)
	ActionTable          map[int]map[int][]lrAction
	GotoTable            map[int]map[int]int // [state][nonterminal] → target state
	StateCount           int
	ExtraChainStateStart int // first synthetic nonterminal-extra state, or -1 if none
}

LRTables holds the generated parse tables.

type NormalizedGrammar

type NormalizedGrammar struct {
	Symbols       []SymbolInfo
	Productions   []Production
	Terminals     []TerminalPattern
	ExtraSymbols  []int    // symbol indices of extras
	FieldNames    []string // index 0 is always ""
	Conflicts     [][]int  // symbol index groups
	Supertypes    []int    // symbol indices
	StartSymbol   int
	AugmentProdID int // production index for S' → S

	// Keyword support (populated when Grammar.Word is set).
	KeywordSymbols []int             // symbol IDs that are keywords
	WordSymbolID   int               // word token symbol ID (e.g., identifier)
	KeywordEntries []TerminalPattern // keyword patterns for keyword DFA
	// ReservedWordSets stores token symbol IDs for each imported reserved word
	// set. The first set is the global set from grammar.json. Current
	// generation derives per-state subsets from that global set.
	ReservedWordSets [][]int

	// External scanner support (populated when Grammar.Externals is set).
	ExternalSymbols []int // external token index → symbol ID

	ExactPrefixStates int

	// PrecedenceOrder stores the symbol-level precedence ordering from the
	// grammar's precedences table. Maps a rule name to its numeric position
	// (higher = higher priority) and whether it's a SYMBOL or STRING entry.
	// Used during conflict resolution to compare a reduce production's LHS
	// against the named precedence of a competing shift action.
	PrecedenceOrder *precOrderTable

	PreserveKeywordIdentifierConflicts         bool
	SuppressEquivalentExternalReduceLookaheads bool
	ExternalReduceFollowLookaheads             map[string]bool
	// contains filtered or unexported fields
}

NormalizedGrammar is the output of the normalize step.

func Normalize

func Normalize(g *Grammar) (*NormalizedGrammar, error)

Normalize transforms a Grammar into a NormalizedGrammar.

func (*NormalizedGrammar) TokenCount

func (ng *NormalizedGrammar) TokenCount() int

TokenCount returns the number of terminal symbols (including symbol 0 = end).

type PrecEntry added in v0.10.0

type PrecEntry struct {
	IsSymbol bool   // true for SYMBOL entries, false for STRING entries
	Name     string // prec name or rule name
}

PrecEntry is an entry in a precedences level. It is either a named precedence (STRING type, Name is the prec name) or a rule reference (SYMBOL type, Name is the rule name).

type Production

type Production struct {
	LHS  int   // symbol index
	RHS  []int // symbol indices
	Prec int
	// HasExplicitPrec distinguishes an explicit compile-time precedence wrapper
	// (including prec(0, ...)) from the default implicit zero precedence.
	HasExplicitPrec bool
	Assoc           Assoc
	DynPrec         int
	ProductionID    int
	Fields          []FieldAssign // per-RHS-position field assignments
	Aliases         []AliasInfo   // per-RHS-position alias info
	IsExtra         bool          // true if this production belongs to a nonterminal extra
}

Production is a single LHS → RHS production with metadata.

type ReservedWordSet added in v0.10.2

type ReservedWordSet struct {
	Name  string
	Rules []*Rule
}

ReservedWordSet is an ordered named set of reserved word token rules. The first set is the global set from grammar.json's top-level `reserved` object. Additional sets are preserved for future context-specific support.

type Rule

type Rule struct {
	Kind     RuleKind
	Value    string  // literal/pattern/symbol/field name
	Children []*Rule // sub-rules
	Prec     int     // precedence value
	Named    bool    // for alias: whether the alias is a named node
}

Rule is a node in the grammar rule tree.

func Alias

func Alias(rule *Rule, name string, named bool) *Rule

Alias exposes a rule under a different node name. The named argument controls whether the alias is a named node.

func Blank

func Blank() *Rule

Blank creates an epsilon (empty) rule.

func Braces

func Braces(rule *Rule) *Rule

Braces wraps a rule in curly braces.

func Brackets

func Brackets(rule *Rule) *Rule

Brackets wraps a rule in square brackets.

func Choice

func Choice(rules ...*Rule) *Rule

Choice creates an alternation of rules.

func CommaSep

func CommaSep(rule *Rule) *Rule

CommaSep creates a comma-separated list that may be empty.

func CommaSep1

func CommaSep1(rule *Rule) *Rule

CommaSep1 creates a non-empty comma-separated list.

func Field

func Field(name string, rule *Rule) *Rule

Field annotates a rule with a field name.

func ImmToken

func ImmToken(rule *Rule) *Rule

ImmToken creates an immediate token (no preceding whitespace).

func Optional

func Optional(rule *Rule) *Rule

Optional creates an optional rule.

func Parens

func Parens(rule *Rule) *Rule

Parens wraps a rule in parentheses.

func Pat

func Pat(pattern string) *Rule

Pat creates a regex pattern rule.

func Prec

func Prec(n int, rule *Rule) *Rule

Prec sets precedence on a rule.

func PrecDynamic

func PrecDynamic(n int, rule *Rule) *Rule

PrecDynamic sets dynamic precedence on a rule.

func PrecLeft

func PrecLeft(n int, rule *Rule) *Rule

PrecLeft sets left-associative precedence on a rule.

func PrecRight

func PrecRight(n int, rule *Rule) *Rule

PrecRight sets right-associative precedence on a rule.

func Repeat

func Repeat(rule *Rule) *Rule

Repeat creates a zero-or-more repetition.

func Repeat1

func Repeat1(rule *Rule) *Rule

Repeat1 creates a one-or-more repetition.

func SepBy

func SepBy(sep, rule *Rule) *Rule

SepBy creates a list, possibly empty, whose items are separated by the given separator.

func SepBy1

func SepBy1(sep, rule *Rule) *Rule

SepBy1 creates a non-empty list separated by the given separator.
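
The list helpers can be read as shorthand for combinations of Seq, Repeat, and Optional. The sketch below is an assumption about their shape, following the usual tree-sitter convention, and is not taken from the package source. It uses a trimmed, hypothetical `rule` type and lowercase constructors so that it compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical, trimmed stand-in for grammargen's Rule.
type rule struct {
	kind     string // "str", "sym", "seq", "repeat", "optional"
	value    string
	children []*rule
}

func str(s string) *rule     { return &rule{kind: "str", value: s} }
func sym(n string) *rule     { return &rule{kind: "sym", value: n} }
func seq(rs ...*rule) *rule  { return &rule{kind: "seq", children: rs} }
func repeat(r *rule) *rule   { return &rule{kind: "repeat", children: []*rule{r}} }
func optional(r *rule) *rule { return &rule{kind: "optional", children: []*rule{r}} }

// sepBy1 sketches the conventional non-empty list: rule (sep rule)*.
func sepBy1(sep, r *rule) *rule { return seq(r, repeat(seq(sep, r))) }

// sepBy makes the whole list optional, so it may match nothing.
func sepBy(sep, r *rule) *rule { return optional(sepBy1(sep, r)) }

// show renders a rule in grammar.js-like notation.
func show(r *rule) string {
	switch r.kind {
	case "str":
		return fmt.Sprintf("%q", r.value)
	case "sym":
		return "$." + r.value
	}
	parts := make([]string, len(r.children))
	for i, c := range r.children {
		parts[i] = show(c)
	}
	return r.kind + "(" + strings.Join(parts, ", ") + ")"
}

func main() {
	// Prints: optional(seq($.arg, repeat(seq(",", $.arg))))
	fmt.Println(show(sepBy(str(","), sym("arg"))))
}
```

Under this reading, a comma-separated list is the same construction with `","` as the separator.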

func Seq

func Seq(rules ...*Rule) *Rule

Seq creates a sequence of rules.

func Str

func Str(s string) *Rule

Str creates a string literal rule.

func Surround

func Surround(open, rule, close *Rule) *Rule

Surround wraps a rule with open and close delimiters.

func Sym

func Sym(name string) *Rule

Sym creates a symbol reference rule.

func Token

func Token(rule *Rule) *Rule

Token creates a token boundary: the wrapped content is lexed as a single token.

type RuleKind

type RuleKind int

RuleKind identifies the type of a grammar rule node.

const (
	RuleString      RuleKind = iota // literal string: "{"
	RulePattern                     // regex pattern: /[0-9]+/
	RuleSymbol                      // symbol reference: $.object
	RuleSeq                         // sequence: seq(a, b, c)
	RuleChoice                      // alternation: choice(a, b)
	RuleRepeat                      // zero-or-more: repeat(a)
	RuleRepeat1                     // one-or-more: repeat1(a)
	RuleOptional                    // optional: optional(a)
	RuleToken                       // token boundary: token(a)
	RuleImmToken                    // immediate token: token.immediate(a)
	RuleField                       // field annotation: field("name", a)
	RulePrec                        // precedence: prec(n, a)
	RulePrecLeft                    // left-associative: prec.left(n, a)
	RulePrecRight                   // right-associative: prec.right(n, a)
	RulePrecDynamic                 // dynamic precedence: prec.dynamic(n, a)
	RuleBlank                       // epsilon / empty
	RuleAlias                       // alias: alias(a, "name")
)

type SymbolInfo

type SymbolInfo struct {
	Name      string
	Visible   bool
	Named     bool
	Supertype bool
	Kind      SymbolKind
	IsExtra   bool
	Immediate bool // token.immediate — no preceding whitespace skip
}

SymbolInfo describes a grammar symbol.

type SymbolKind

type SymbolKind int

SymbolKind classifies a grammar symbol.

const (
	SymbolTerminal    SymbolKind = iota // anonymous terminal like "{"
	SymbolNamedToken                    // named terminal like number, string_content
	SymbolExternal                      // external scanner token
	SymbolNonterminal                   // nonterminal rule
)

type TerminalPattern

type TerminalPattern struct {
	SymbolID  int
	Rule      *Rule // the flattened rule tree for NFA construction
	Priority  int   // lower = higher priority (wins on tie)
	Immediate bool  // token.immediate
}

TerminalPattern describes a terminal symbol's match pattern for DFA generation.

type TestCase

type TestCase struct {
	Name        string // test name
	Input       string // input to parse
	Expected    string // expected S-expression (empty = just check no errors)
	ExpectError bool   // if true, expect ERROR nodes in the tree
}

TestCase is an embedded grammar test case.
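
The field comments imply three outcomes for a test: match an expected S-expression, merely parse without errors, or deliberately produce an ERROR node. The sketch below shows one way such a check might be written. `check` is hypothetical, TestCase is redeclared locally, and detecting errors by searching for `(ERROR` in the S-expression is an assumption about the output format.

```go
package main

import (
	"fmt"
	"strings"
)

// TestCase is a local copy of the documented struct.
type TestCase struct {
	Name        string
	Input       string
	Expected    string
	ExpectError bool
}

// check (hypothetical) compares a produced S-expression against a test case.
func check(tc TestCase, got string) error {
	hasError := strings.Contains(got, "(ERROR")
	if tc.ExpectError {
		if !hasError {
			return fmt.Errorf("%s: expected an ERROR node, got %s", tc.Name, got)
		}
		return nil
	}
	if hasError {
		return fmt.Errorf("%s: unexpected ERROR node in %s", tc.Name, got)
	}
	// An empty Expected means "just check that there are no errors".
	if tc.Expected != "" && strings.TrimSpace(tc.Expected) != strings.TrimSpace(got) {
		return fmt.Errorf("%s: got %s, want %s", tc.Name, got, tc.Expected)
	}
	return nil
}

func main() {
	tc := TestCase{Name: "sum", Input: "1+2", Expected: "(expr (number) (number))"}
	fmt.Println(check(tc, "(expr (number) (number))") == nil)
	fmt.Println(check(tc, "(expr (ERROR))") == nil)
}
```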
