grammargen

package
v0.9.2
Published: Mar 17, 2026 License: MIT Imports: 16 Imported by: 0

Documentation

Overview

Package grammargen implements a pure-Go grammar generator for gotreesitter. It compiles grammar definitions expressed in a Go DSL into binary blobs that the gotreesitter runtime can load and use for parsing.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AddConflict

func AddConflict(g *Grammar, symbols ...string)

AddConflict adds a GLR conflict group to the grammar.

func AppendChoice

func AppendChoice(g *Grammar, ruleName string, newAlts ...*Rule)

AppendChoice appends new alternatives to an existing Choice rule. If the named rule is already a Choice, the new alternatives are appended to its children. Otherwise the existing rule and the new alternatives are wrapped in a new Choice.

func EmitC

func EmitC(name string, lang *gotreesitter.Language) (string, error)

EmitC emits a parser.c string from a compiled Language struct.

func EmitGrammarGo

func EmitGrammarGo(g *Grammar, pkgName, funcName string) ([]byte, error)

EmitGrammarGo takes a Grammar IR and emits Go source code that reconstructs it using grammargen DSL calls. The output is a standalone Go file in the given package with a function of the given name that returns *Grammar.

func ExportGrammarJSON

func ExportGrammarJSON(g *Grammar) ([]byte, error)

ExportGrammarJSON serializes a Grammar struct to the tree-sitter grammar.json format. The output is compatible with ImportGrammarJSON — a round-trip ImportGrammarJSON(ExportGrammarJSON(g)) should produce an equivalent grammar.

The JSON structure matches tree-sitter's canonical resolved grammar.json:

{
  "name": "...",
  "word": "...",
  "rules": { ... },
  "extras": [...],
  "conflicts": [...],
  "externals": [...],
  "inline": [...],
  "supertypes": [...]
}

func Generate

func Generate(g *Grammar) ([]byte, error)

Generate compiles a Grammar definition into a binary blob that gotreesitter can load via DecodeLanguageBlob / loadEmbeddedLanguage. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.

func GenerateC

func GenerateC(g *Grammar) (string, error)

GenerateC compiles a Grammar to a standard tree-sitter parser.c string. The output is compatible with tree-sitter's C runtime ABI 14.

func GenerateHighlightQueries

func GenerateHighlightQueries(base, extended *Grammar) string

GenerateHighlightQueries produces tree-sitter highlight queries for rules added by a grammar extension. It diffs base and extended to find new rules, then applies naming conventions to generate appropriate highlights.

Conventions:

  • New Str() tokens matching identifier pattern -> @keyword
  • *_declaration with "name" field -> name: (identifier) @type.definition
  • *_variant with "name" field -> name: (identifier) @constructor
  • *_block with "description" field -> description: @string
  • *_expression -> no default highlight (expressions are structural)
  • *_statement -> no default highlight
  • Field named "params"/"parameters" -> children (identifier) @variable.parameter
  • let_declaration name -> @variable.definition
  • New string tokens that are operators (non-alphanumeric) -> @operator
  • New string tokens that are keywords (alphanumeric) -> @keyword

func GenerateHighlightQuery

func GenerateHighlightQuery(g *Grammar) string

GenerateHighlightQuery infers a tree-sitter highlight query from grammar structure. It maps well-known rule names and patterns to standard capture names:

  • comment → @comment
  • string, string_content → @string
  • number, integer, float → @number
  • true, false → @boolean
  • null, nil, none → @constant.builtin
  • identifier → @variable
  • type_identifier → @type
  • function keywords → @keyword.function
  • control flow keywords → @keyword.control
  • other keyword-like string terminals → @keyword
  • operators → @operator

func GenerateLanguage

func GenerateLanguage(g *Grammar) (*gotreesitter.Language, error)

GenerateLanguage compiles a Grammar into a Language struct without encoding. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.

func GenerateLanguageWithContext

func GenerateLanguageWithContext(ctx context.Context, g *Grammar) (*gotreesitter.Language, error)

GenerateLanguageWithContext is like GenerateLanguage but accepts a context for cancellation. When the context is cancelled, LR table construction and DFA building abort promptly, allowing the caller to reclaim memory that would otherwise be held by an orphaned goroutine.

func LoadLanguageBlob

func LoadLanguageBlob(data []byte) (*gotreesitter.Language, error)

LoadLanguageBlob deserializes a compressed language blob back into a Language. This is the inverse of the blob encoding used by GenerateLanguage.

func RunTests

func RunTests(g *Grammar) error

RunTests generates the grammar and runs all embedded test cases. Returns nil if all tests pass, or an error describing failures.

func Validate

func Validate(g *Grammar) []string

Validate checks the grammar for common issues and returns warnings.

Types

type AliasInfo

type AliasInfo struct {
	ChildIndex int
	Name       string
	Named      bool
}

AliasInfo stores alias information for a child position.

type Assoc

type Assoc int

Assoc is the associativity of a production.

const (
	AssocNone Assoc = iota
	AssocLeft
	AssocRight
)

type ConflictDiag

type ConflictDiag struct {
	Kind          ConflictKind
	State         int
	LookaheadSym  int
	Actions       []lrAction // the conflicting actions
	Resolution    string     // how it was resolved (or "GLR" if kept)
	IsMergedState bool       // was this state produced by LALR merging?
	MergeCount    int        // how many merge origins this state has
}

ConflictDiag describes a conflict encountered during LR table construction.

func (*ConflictDiag) String

func (d *ConflictDiag) String(ng *NormalizedGrammar) string

type ConflictKind

type ConflictKind int

ConflictKind describes the type of LR conflict.

const (
	ShiftReduce ConflictKind = iota
	ReduceReduce
)

type FieldAssign

type FieldAssign struct {
	ChildIndex int
	FieldName  string
}

FieldAssign maps a child position in a production to a field name.

type GenerateReport

type GenerateReport struct {
	Language        *gotreesitter.Language
	Blob            []byte
	Conflicts       []ConflictDiag
	SplitCandidates []splitCandidate
	SplitResult     *splitReport
	Warnings        []string
	SymbolCount     int
	StateCount      int
	TokenCount      int
}

GenerateReport holds the result of grammar generation with diagnostics.

func GenerateWithReport

func GenerateWithReport(g *Grammar) (*GenerateReport, error)

GenerateWithReport compiles a grammar and returns a full diagnostic report.

type Grammar

type Grammar struct {
	Name              string
	Rules             map[string]*Rule
	RuleOrder         []string // order rules were defined (first = start rule)
	Extras            []*Rule
	Conflicts         [][]string
	Externals         []*Rule
	Inline            []string
	Word              string
	Supertypes        []string
	Tests             []TestCase      // embedded test cases
	EnableLRSplitting bool            // opt-in: attempt LR(1) state splitting for merge pathology
	BinaryRepeatMode  bool            // use tree-sitter's binary repeat helper shape (aux→seq(aux,aux)|inner)
	NonKeywordStrings map[string]bool // strings that should NOT be promoted via keyword DFA (extension keywords that coexist as identifiers)
}

Grammar is the top-level grammar definition.

func AliasSuperGrammar

func AliasSuperGrammar() *Grammar

AliasSuperGrammar returns a grammar that exercises aliases and supertypes.

Supertypes:

_expression is a supertype with children: number, string, identifier, binary_expression

Aliases:

In assignment, the left-hand side identifier is aliased to "variable"
In binary_expression, the operator string is aliased to "op"

func CalcGrammar

func CalcGrammar() *Grammar

CalcGrammar returns a calculator grammar that exercises precedence and associativity. It defines:

  • Binary operators: +, -, *, / with standard math precedence
  • Unary prefix minus: -x (highest precedence)
  • Parenthesized expressions: (x)
  • Integer literals: number

func ExtScannerGrammar

func ExtScannerGrammar() *Grammar

ExtScannerGrammar returns a grammar with external scanner tokens. It models a simple block-structured language where INDENT and DEDENT tokens are produced by an external scanner (like Python).

program: repeat(statement)
statement: simple_statement | block
simple_statement: identifier ";"
block: identifier ":" NEWLINE INDENT repeat(statement) DEDENT

External tokens: INDENT, DEDENT, NEWLINE

func ExtendGrammar

func ExtendGrammar(name string, base *Grammar, customize func(g *Grammar)) *Grammar

ExtendGrammar creates a new grammar that inherits from a base grammar. The customize function receives the new grammar with all base rules copied in, and can override rules, add new ones, or modify extras/conflicts/etc.

Example:

cpp := ExtendGrammar("cpp", cGrammar(), func(g *Grammar) {
    g.Define("class_declaration", Seq(Str("class"), Sym("identifier"), Sym("class_body")))
    // Override an existing rule:
    g.Define("declaration", Choice(Sym("class_declaration"), Sym("function_declaration")))
})

func GLRGrammar

func GLRGrammar() *Grammar

GLRGrammar returns a grammar with intentional ambiguity that requires GLR parsing. It models a simplified C-like language where `a * b` can be parsed as either multiplication or a pointer declaration:

expression_statement: a * b ;  (multiplication)
pointer_declaration:  a * b ;  (type * name)

The conflict between _expression and type_name is declared, causing the parser to fork stacks when it encounters the ambiguity.

func GoGrammar

func GoGrammar() *Grammar

GoGrammar returns the go grammar. Code generated by EmitGrammarGo. DO NOT EDIT.

func INIGrammar

func INIGrammar() *Grammar

INIGrammar returns a production-grade INI file grammar.

Parses the superset of major INI dialects (Windows API, Python configparser, Git config, PHP parse_ini_file):

  • Sections: [name] and [section "subsection"] (Git-style)
  • Key-value pairs: key = value, key : value, key=value
  • Comments: ; and # (full-line only)
  • Quoted string values: "..." with \" and \\ escapes
  • Global pairs: key=value before any [section]
  • Empty values: key= (value is optional)

INI is line-oriented: newlines are significant (not extras). Only horizontal whitespace (spaces, tabs) is treated as extras.

func ImportGrammarJS

func ImportGrammarJS(source []byte) (*Grammar, error)

ImportGrammarJS parses a tree-sitter grammar.js file and returns a Grammar IR. This uses gotreesitter's own JavaScript grammar to parse the file, demonstrating the full-circle capability: gotreesitter parsing its own input format.

func ImportGrammarJSON

func ImportGrammarJSON(data []byte) (*Grammar, error)

ImportGrammarJSON parses a tree-sitter grammar.json file (the canonical resolved form generated by `tree-sitter generate`) and returns a Grammar IR. This is more reliable than ImportGrammarJS because grammar.json has no require() calls, helper functions, or other JavaScript-specific constructs.

func JSONGrammar

func JSONGrammar() *Grammar

JSONGrammar returns the JSON grammar defined using the Go DSL. This mirrors tree-sitter-json's grammar.js definition.

func KeywordGrammar

func KeywordGrammar() *Grammar

KeywordGrammar returns a simplified language grammar that exercises keyword extraction and the word token mechanism. Keywords "var" and "return" match the identifier pattern but are promoted to their own symbols by the keyword DFA.

func LoxGrammar

func LoxGrammar() *Grammar

LoxGrammar returns a production-grade Lox grammar (Crafting Interpreters spec).

Implements the full Lox language:

  • Variables: var x = expr;
  • Functions: fun name(params) { body }
  • Classes: class Name < Super { methods }
  • Control flow: if/else, while, for
  • Operators: or, and, ==, !=, <, >, <=, >=, +, -, *, /, !, unary -
  • Calls and property access: f(args), obj.prop, obj.prop = val
  • Literals: numbers, strings, true, false, nil, this, super
  • Print: print expr;
  • Return: return expr;
  • Comments: // line comments
  • Block scoping: { statements }

func MustacheGrammar

func MustacheGrammar() *Grammar

MustacheGrammar returns a production-grade Mustache template grammar.

Implements the required Mustache spec features:

  • Interpolation: {{ name }}
  • Unescaped interpolation: {{{ name }}} and {{& name }}
  • Sections: {{# name }} ... {{/ name }}
  • Inverted sections: {{^ name }} ... {{/ name }}
  • Comments: {{! comment text }}
  • Partials: {{> partial_name }}
  • Dotted names: {{ person.name }}
  • Implicit iterator: {{ . }}
  • Raw text between tags

The grammar treats {{ and }} as delimiters. Text outside tags is raw content. The DFA handles {{{ vs {{ disambiguation via maximal munch.

func NewGrammar

func NewGrammar(name string) *Grammar

NewGrammar creates a new grammar with the given name.

func ParseGrammarFile

func ParseGrammarFile(source string) (*Grammar, error)

ParseGrammarFile parses a declarative .grammar file into a Grammar IR.

Syntax:

grammar <name>

extras = [ /\s/ ]
word = <rule_name>
supertypes = [ <rule_name>, ... ]
conflicts = [ [<rule>, <rule>], ... ]

rule <name> = <expr>

Expressions:

"string"         string literal
/pattern/        regex pattern
<name>           symbol reference
seq(a, b, ...)   sequence
choice(a, b, ...) alternation
repeat(a)        zero or more
repeat1(a)       one or more
optional(a)      optional
token(a)         token boundary
field("name", a) field annotation
prec(n, a)       precedence
prec.left(n, a)  left-associative precedence
prec.right(n, a) right-associative precedence

func (*Grammar) Define

func (g *Grammar) Define(name string, rule *Rule)

Define adds a rule to the grammar. The first rule defined is the start rule.

func (*Grammar) SetConflicts

func (g *Grammar) SetConflicts(conflicts ...[]string)

SetConflicts declares grammar conflicts for GLR.

func (*Grammar) SetExternals

func (g *Grammar) SetExternals(rules ...*Rule)

SetExternals declares external scanner tokens.

func (*Grammar) SetExtras

func (g *Grammar) SetExtras(rules ...*Rule)

SetExtras sets the extra rules (e.g. whitespace, comments).

func (*Grammar) SetInline

func (g *Grammar) SetInline(names ...string)

SetInline marks rules to be inlined.

func (*Grammar) SetSupertypes

func (g *Grammar) SetSupertypes(names ...string)

SetSupertypes declares supertype rules.

func (*Grammar) SetWord

func (g *Grammar) SetWord(name string)

SetWord sets the word token for keyword extraction.

func (*Grammar) Test

func (g *Grammar) Test(name, input, expected string)

Test adds an embedded test case. Input is parsed and the resulting tree is compared against the expected S-expression. If expected is empty, the test only checks that no ERROR nodes appear.

func (*Grammar) TestError

func (g *Grammar) TestError(name, input string)

TestError adds an embedded test case that expects parse errors.

type GrammarDiff

type GrammarDiff struct {
	AddedRules        []string
	RemovedRules      []string
	ModifiedRules     []string // rules present in both but with different definitions
	ExtrasChanged     bool
	ConflictsChanged  bool
	ExternalsChanged  bool
	WordChanged       bool
	SupertypesChanged bool
}

GrammarDiff describes the differences between two grammar versions.

func DiffGrammars

func DiffGrammars(old, new *Grammar) *GrammarDiff

DiffGrammars compares two grammar versions and returns a diff.

func (*GrammarDiff) HasChanges

func (d *GrammarDiff) HasChanges() bool

HasChanges returns true if any differences were found.

func (*GrammarDiff) String

func (d *GrammarDiff) String() string

String returns a human-readable summary of the diff.

type LRTables

type LRTables struct {
	// ActionTable[state][symbol] = list of actions (multiple = conflict/GLR)
	ActionTable          map[int]map[int][]lrAction
	GotoTable            map[int]map[int]int // [state][nonterminal] → target state
	StateCount           int
	ExtraChainStateStart int // first synthetic nonterminal-extra state, or -1 if none
}

LRTables holds the generated parse tables.

type NormalizedGrammar

type NormalizedGrammar struct {
	Symbols       []SymbolInfo
	Productions   []Production
	Terminals     []TerminalPattern
	ExtraSymbols  []int    // symbol indices of extras
	FieldNames    []string // index 0 is always ""
	Conflicts     [][]int  // symbol index groups
	Supertypes    []int    // symbol indices
	StartSymbol   int
	AugmentProdID int // production index for S' → S

	// Keyword support (populated when Grammar.Word is set).
	KeywordSymbols []int             // symbol IDs that are keywords
	WordSymbolID   int               // word token symbol ID (e.g., identifier)
	KeywordEntries []TerminalPattern // keyword patterns for keyword DFA

	// External scanner support (populated when Grammar.Externals is set).
	ExternalSymbols []int // external token index → symbol ID
	// contains filtered or unexported fields
}

NormalizedGrammar is the output of the normalize step.

func Normalize

func Normalize(g *Grammar) (*NormalizedGrammar, error)

Normalize transforms a Grammar into a NormalizedGrammar.

func (*NormalizedGrammar) TokenCount

func (ng *NormalizedGrammar) TokenCount() int

TokenCount returns the number of terminal symbols (including symbol 0 = end).

type Production

type Production struct {
	LHS          int   // symbol index
	RHS          []int // symbol indices
	Prec         int
	Assoc        Assoc
	DynPrec      int
	ProductionID int
	Fields       []FieldAssign // per-RHS-position field assignments
	Aliases      []AliasInfo   // per-RHS-position alias info
	IsExtra      bool          // true if this production belongs to a nonterminal extra
}

Production is a single LHS → RHS production with metadata.

type Rule

type Rule struct {
	Kind     RuleKind
	Value    string  // literal/pattern/symbol/field name
	Children []*Rule // sub-rules
	Prec     int     // precedence value
	Named    bool    // for alias: whether the alias is a named node
}

Rule is a node in the grammar rule tree.

func Alias

func Alias(rule *Rule, name string, named bool) *Rule

Alias aliases a rule to a different name.

func Blank

func Blank() *Rule

Blank creates an epsilon (empty) rule.

func Braces

func Braces(rule *Rule) *Rule

Braces wraps a rule in curly braces.

func Brackets

func Brackets(rule *Rule) *Rule

Brackets wraps a rule in square brackets.

func Choice

func Choice(rules ...*Rule) *Rule

Choice creates an alternation of rules.

func CommaSep

func CommaSep(rule *Rule) *Rule

CommaSep creates an optional comma-separated list.

func CommaSep1

func CommaSep1(rule *Rule) *Rule

CommaSep1 creates a non-empty comma-separated list.

func Field

func Field(name string, rule *Rule) *Rule

Field annotates a rule with a field name.

func ImmToken

func ImmToken(rule *Rule) *Rule

ImmToken creates an immediate token (no preceding whitespace).

func Optional

func Optional(rule *Rule) *Rule

Optional creates an optional rule.

func Parens

func Parens(rule *Rule) *Rule

Parens wraps a rule in parentheses.

func Pat

func Pat(pattern string) *Rule

Pat creates a regex pattern rule.

func Prec

func Prec(n int, rule *Rule) *Rule

Prec sets precedence on a rule.

func PrecDynamic

func PrecDynamic(n int, rule *Rule) *Rule

PrecDynamic sets dynamic precedence on a rule.

func PrecLeft

func PrecLeft(n int, rule *Rule) *Rule

PrecLeft sets left-associative precedence on a rule.

func PrecRight

func PrecRight(n int, rule *Rule) *Rule

PrecRight sets right-associative precedence on a rule.

func Repeat

func Repeat(rule *Rule) *Rule

Repeat creates a zero-or-more repetition.

func Repeat1

func Repeat1(rule *Rule) *Rule

Repeat1 creates a one-or-more repetition.

func SepBy

func SepBy(sep, rule *Rule) *Rule

SepBy creates an optional list separated by the given separator.

func SepBy1

func SepBy1(sep, rule *Rule) *Rule

SepBy1 creates a non-empty list separated by the given separator.

func Seq

func Seq(rules ...*Rule) *Rule

Seq creates a sequence of rules.

func Str

func Str(s string) *Rule

Str creates a string literal rule.

func Surround

func Surround(open, rule, close *Rule) *Rule

Surround wraps a rule with open and close delimiters.

func Sym

func Sym(name string) *Rule

Sym creates a symbol reference rule.

func Token

func Token(rule *Rule) *Rule

Token creates a token boundary (content is a single lexer token).

type RuleKind

type RuleKind int

RuleKind identifies the type of a grammar rule node.

const (
	RuleString      RuleKind = iota // literal string: "{"
	RulePattern                     // regex pattern: /[0-9]+/
	RuleSymbol                      // symbol reference: $.object
	RuleSeq                         // sequence: seq(a, b, c)
	RuleChoice                      // alternation: choice(a, b)
	RuleRepeat                      // zero-or-more: repeat(a)
	RuleRepeat1                     // one-or-more: repeat1(a)
	RuleOptional                    // optional: optional(a)
	RuleToken                       // token boundary: token(a)
	RuleImmToken                    // immediate token: token.immediate(a)
	RuleField                       // field annotation: field("name", a)
	RulePrec                        // precedence: prec(n, a)
	RulePrecLeft                    // left-associative: prec.left(n, a)
	RulePrecRight                   // right-associative: prec.right(n, a)
	RulePrecDynamic                 // dynamic precedence: prec.dynamic(n, a)
	RuleBlank                       // epsilon / empty
	RuleAlias                       // alias: alias(a, "name")
)

type SymbolInfo

type SymbolInfo struct {
	Name      string
	Visible   bool
	Named     bool
	Supertype bool
	Kind      SymbolKind
	IsExtra   bool
	Immediate bool // token.immediate — no preceding whitespace skip
}

SymbolInfo describes a grammar symbol.

type SymbolKind

type SymbolKind int

SymbolKind classifies a grammar symbol.

const (
	SymbolTerminal    SymbolKind = iota // anonymous terminal like "{"
	SymbolNamedToken                    // named terminal like number, string_content
	SymbolExternal                      // external scanner token
	SymbolNonterminal                   // nonterminal rule
)

type TerminalPattern

type TerminalPattern struct {
	SymbolID  int
	Rule      *Rule // the flattened rule tree for NFA construction
	Priority  int   // lower = higher priority (wins on tie)
	Immediate bool  // token.immediate
}

TerminalPattern describes a terminal symbol's match pattern for DFA generation.

type TestCase

type TestCase struct {
	Name        string // test name
	Input       string // input to parse
	Expected    string // expected S-expression (empty = just check no errors)
	ExpectError bool   // if true, expect ERROR nodes in the tree
}

TestCase is an embedded grammar test case.
