Documentation ¶
Overview ¶
Package grammargen implements a pure-Go grammar generator for gotreesitter. It compiles grammar definitions expressed in a Go DSL into binary blobs that the gotreesitter runtime can load and use for parsing.
Index ¶
- func AddConflict(g *Grammar, symbols ...string)
- func AppendChoice(g *Grammar, ruleName string, newAlts ...*Rule)
- func EmitC(name string, lang *gotreesitter.Language) (string, error)
- func EmitGrammarGo(g *Grammar, pkgName, funcName string) ([]byte, error)
- func ExportGrammarJSON(g *Grammar) ([]byte, error)
- func Generate(g *Grammar) ([]byte, error)
- func GenerateC(g *Grammar) (string, error)
- func GenerateHighlightQueries(base, extended *Grammar) string
- func GenerateHighlightQuery(g *Grammar) string
- func GenerateLanguage(g *Grammar) (*gotreesitter.Language, error)
- func GenerateLanguageWithContext(ctx context.Context, g *Grammar) (*gotreesitter.Language, error)
- func LoadLanguageBlob(data []byte) (*gotreesitter.Language, error)
- func RunTests(g *Grammar) error
- func Validate(g *Grammar) []string
- type AliasInfo
- type Assoc
- type ConflictDiag
- type ConflictKind
- type FieldAssign
- type GenerateReport
- type Grammar
- func AliasSuperGrammar() *Grammar
- func CalcGrammar() *Grammar
- func ExtScannerGrammar() *Grammar
- func ExtendGrammar(name string, base *Grammar, customize func(g *Grammar)) *Grammar
- func GLRGrammar() *Grammar
- func GoGrammar() *Grammar
- func INIGrammar() *Grammar
- func ImportGrammarJS(source []byte) (*Grammar, error)
- func ImportGrammarJSON(data []byte) (*Grammar, error)
- func JSONGrammar() *Grammar
- func KeywordGrammar() *Grammar
- func LoxGrammar() *Grammar
- func MustacheGrammar() *Grammar
- func NewGrammar(name string) *Grammar
- func ParseGrammarFile(source string) (*Grammar, error)
- func (g *Grammar) Define(name string, rule *Rule)
- func (g *Grammar) SetConflicts(conflicts ...[]string)
- func (g *Grammar) SetExternals(rules ...*Rule)
- func (g *Grammar) SetExtras(rules ...*Rule)
- func (g *Grammar) SetInline(names ...string)
- func (g *Grammar) SetSupertypes(names ...string)
- func (g *Grammar) SetWord(name string)
- func (g *Grammar) Test(name, input, expected string)
- func (g *Grammar) TestError(name, input string)
- type GrammarDiff
- type LRTables
- type NormalizedGrammar
- type Production
- type Rule
- func Alias(rule *Rule, name string, named bool) *Rule
- func Blank() *Rule
- func Braces(rule *Rule) *Rule
- func Brackets(rule *Rule) *Rule
- func Choice(rules ...*Rule) *Rule
- func CommaSep(rule *Rule) *Rule
- func CommaSep1(rule *Rule) *Rule
- func Field(name string, rule *Rule) *Rule
- func ImmToken(rule *Rule) *Rule
- func Optional(rule *Rule) *Rule
- func Parens(rule *Rule) *Rule
- func Pat(pattern string) *Rule
- func Prec(n int, rule *Rule) *Rule
- func PrecDynamic(n int, rule *Rule) *Rule
- func PrecLeft(n int, rule *Rule) *Rule
- func PrecRight(n int, rule *Rule) *Rule
- func Repeat(rule *Rule) *Rule
- func Repeat1(rule *Rule) *Rule
- func SepBy(sep, rule *Rule) *Rule
- func SepBy1(sep, rule *Rule) *Rule
- func Seq(rules ...*Rule) *Rule
- func Str(s string) *Rule
- func Surround(open, rule, close *Rule) *Rule
- func Sym(name string) *Rule
- func Token(rule *Rule) *Rule
- type RuleKind
- type SymbolInfo
- type SymbolKind
- type TerminalPattern
- type TestCase
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AddConflict ¶
AddConflict adds a GLR conflict group to the grammar.
func AppendChoice ¶
AppendChoice appends new alternatives to an existing Choice rule. If the named rule is already a Choice, the new alternatives are appended to its children. Otherwise the existing rule and the new alternatives are wrapped in a new Choice.
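The two cases described above can be sketched with stand-in types. This is a minimal illustration, not the package's code: the reduced Rule struct, the lowercase appendChoice helper, and the plain map used in place of *Grammar are all assumptions, as is the choice to ignore an undefined rule name.

```go
package main

import "fmt"

// RuleKind and Rule are simplified stand-ins for the package's types,
// reduced to the fields this sketch needs.
type RuleKind int

const (
	RuleSymbol RuleKind = iota
	RuleChoice
)

type Rule struct {
	Kind     RuleKind
	Value    string
	Children []*Rule
}

// appendChoice is a hypothetical re-implementation of the documented
// behavior, operating on a plain map instead of *Grammar.
func appendChoice(rules map[string]*Rule, name string, newAlts ...*Rule) {
	existing, ok := rules[name]
	if !ok {
		// Assumption: an undefined rule is left alone; the real function may differ.
		return
	}
	if existing.Kind == RuleChoice {
		// Already a Choice: extend its children in place.
		existing.Children = append(existing.Children, newAlts...)
		return
	}
	// Otherwise wrap the old rule and the new alternatives in a new Choice.
	children := append([]*Rule{existing}, newAlts...)
	rules[name] = &Rule{Kind: RuleChoice, Children: children}
}

func main() {
	rules := map[string]*Rule{
		"value": {Kind: RuleSymbol, Value: "number"},
	}
	appendChoice(rules, "value", &Rule{Kind: RuleSymbol, Value: "string"})
	fmt.Println(rules["value"].Kind == RuleChoice, len(rules["value"].Children)) // true 2
}
```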
func EmitC ¶
func EmitC(name string, lang *gotreesitter.Language) (string, error)
EmitC emits a parser.c string from a compiled Language struct.
func EmitGrammarGo ¶
EmitGrammarGo takes a Grammar IR and emits Go source code that reconstructs it using grammargen DSL calls. The output is a standalone Go file in the given package with a function of the given name that returns *Grammar.
func ExportGrammarJSON ¶
ExportGrammarJSON serializes a Grammar struct to the tree-sitter grammar.json format. The output is compatible with ImportGrammarJSON: passing the exported bytes back to ImportGrammarJSON should produce an equivalent grammar.
The JSON structure matches tree-sitter's canonical resolved grammar.json:
{
"name": "...",
"word": "...",
"rules": { ... },
"extras": [...],
"conflicts": [...],
"externals": [...],
"inline": [...],
"supertypes": [...]
}
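As a rough illustration of that layout, the top-level keys can be decoded with encoding/json. The grammarJSON struct and decodeGrammar helper below are hypothetical names for this sketch, and the Go field types are assumptions rather than the package's internal representation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// grammarJSON mirrors the top-level keys listed above. Rule bodies are kept
// as raw JSON because their node shapes are not needed for this sketch.
type grammarJSON struct {
	Name       string                     `json:"name"`
	Word       string                     `json:"word,omitempty"`
	Rules      map[string]json.RawMessage `json:"rules"`
	Extras     []json.RawMessage          `json:"extras"`
	Conflicts  [][]string                 `json:"conflicts"`
	Externals  []json.RawMessage          `json:"externals"`
	Inline     []string                   `json:"inline"`
	Supertypes []string                   `json:"supertypes"`
}

// decodeGrammar returns the grammar name and the number of rules found.
func decodeGrammar(data []byte) (string, int, error) {
	var g grammarJSON
	if err := json.Unmarshal(data, &g); err != nil {
		return "", 0, err
	}
	return g.Name, len(g.Rules), nil
}

func main() {
	doc := []byte(`{"name":"demo","rules":{"document":{"type":"SYMBOL","name":"value"}},"extras":[],"conflicts":[],"externals":[],"inline":[],"supertypes":[]}`)
	name, n, err := decodeGrammar(doc)
	fmt.Println(name, n, err)
}
```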
func Generate ¶
Generate compiles a Grammar definition into a binary blob that gotreesitter can load via DecodeLanguageBlob / loadEmbeddedLanguage. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.
func GenerateC ¶
GenerateC compiles a Grammar to a standard tree-sitter parser.c string. The output is compatible with tree-sitter's C runtime ABI 14.
func GenerateHighlightQueries ¶
GenerateHighlightQueries produces tree-sitter highlight queries for rules added by a grammar extension. It diffs base and extended to find new rules, then applies naming conventions to generate appropriate highlights.
Conventions:
- New Str() tokens matching identifier pattern -> @keyword
- *_declaration with "name" field -> name: (identifier) @type.definition
- *_variant with "name" field -> name: (identifier) @constructor
- *_block with "description" field -> description: @string
- *_expression -> no default highlight (expressions are structural)
- *_statement -> no default highlight
- Field named "params"/"parameters" -> children (identifier) @variable.parameter
- let_declaration name -> @variable.definition
- New string tokens that are operators (non-alphanumeric) -> @operator
- New string tokens that are keywords (alphanumeric) -> @keyword
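The last two conventions amount to a character-class test on each new string token. The sketch below shows one way such a test might look; captureForToken is a hypothetical helper, and treating underscores as keyword characters and any other mixed token as an operator are assumptions of this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// captureForToken returns "@keyword" for tokens made only of letters, digits,
// and underscores, and "@operator" for everything else. Empty tokens get no
// capture.
func captureForToken(tok string) string {
	if tok == "" {
		return ""
	}
	for _, r := range tok {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '_' {
			return "@operator"
		}
	}
	return "@keyword"
}

func main() {
	for _, tok := range []string{"match", "=>", "|>"} {
		fmt.Printf("%q %s\n", tok, captureForToken(tok))
	}
}
```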
func GenerateHighlightQuery ¶
GenerateHighlightQuery infers a tree-sitter highlight query from grammar structure. It maps well-known rule names and patterns to standard capture names:
- comment → @comment
- string, string_content → @string
- number, integer, float → @number
- true, false → @boolean
- null, nil, none → @constant.builtin
- identifier → @variable
- type_identifier → @type
- function keywords → @keyword.function
- control flow keywords → @keyword.control
- other keyword-like string terminals → @keyword
- operators → @operator
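The name-based part of this mapping can be pictured as a lookup table that emits one query line per matching rule. The following is a sketch under that assumption: wellKnown and highlightLines are illustrative names, only the fixed rule names from the list are covered, and the keyword and operator inference is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// wellKnown maps rule names from the list above to capture names.
var wellKnown = map[string]string{
	"comment":         "@comment",
	"string":          "@string",
	"string_content":  "@string",
	"number":          "@number",
	"integer":         "@number",
	"float":           "@number",
	"true":            "@boolean",
	"false":           "@boolean",
	"null":            "@constant.builtin",
	"nil":             "@constant.builtin",
	"none":            "@constant.builtin",
	"identifier":      "@variable",
	"type_identifier": "@type",
}

// highlightLines emits one "(rule) @capture" line per recognized rule name,
// sorted by name so the output is deterministic.
func highlightLines(ruleNames []string) string {
	names := append([]string(nil), ruleNames...)
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		if capture, ok := wellKnown[n]; ok {
			fmt.Fprintf(&b, "(%s) %s\n", n, capture)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(highlightLines([]string{"identifier", "comment", "expression"}))
}
```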
func GenerateLanguage ¶
func GenerateLanguage(g *Grammar) (*gotreesitter.Language, error)
GenerateLanguage compiles a Grammar into a Language struct without encoding. LR(1) state splitting is always attempted; a rollback guard reverts to the plain LALR table if splitting does not reduce GLR conflicts.
func GenerateLanguageWithContext ¶
GenerateLanguageWithContext is like GenerateLanguage but accepts a context for cancellation. When the context is cancelled, LR table construction and DFA building abort promptly, allowing the caller to reclaim memory that would otherwise be held by an orphaned goroutine.
func LoadLanguageBlob ¶
func LoadLanguageBlob(data []byte) (*gotreesitter.Language, error)
LoadLanguageBlob deserializes a compressed language blob back into a Language. This is the inverse of the blob encoding used by Generate.
Types ¶
type ConflictDiag ¶
type ConflictDiag struct {
Kind ConflictKind
State int
LookaheadSym int
Actions []lrAction // the conflicting actions
Resolution string // how it was resolved (or "GLR" if kept)
IsMergedState bool // was this state produced by LALR merging?
MergeCount int // how many merge origins this state has
}
ConflictDiag describes a conflict encountered during LR table construction.
func (*ConflictDiag) String ¶
func (d *ConflictDiag) String(ng *NormalizedGrammar) string
type ConflictKind ¶
type ConflictKind int
ConflictKind describes the type of LR conflict.
const (
ShiftReduce ConflictKind = iota
ReduceReduce
)
type FieldAssign ¶
FieldAssign maps a child position in a production to a field name.
type GenerateReport ¶
type GenerateReport struct {
Language *gotreesitter.Language
Blob []byte
Conflicts []ConflictDiag
SplitCandidates []splitCandidate
SplitResult *splitReport
Warnings []string
SymbolCount int
StateCount int
TokenCount int
}
GenerateReport holds the result of grammar generation with diagnostics.
func GenerateWithReport ¶
func GenerateWithReport(g *Grammar) (*GenerateReport, error)
GenerateWithReport compiles a grammar and returns a full diagnostic report.
type Grammar ¶
type Grammar struct {
Name string
Rules map[string]*Rule
RuleOrder []string // order rules were defined (first = start rule)
Extras []*Rule
Conflicts [][]string
Externals []*Rule
Inline []string
Word string
Supertypes []string
Tests []TestCase // embedded test cases
EnableLRSplitting bool // opt-in: attempt LR(1) state splitting for merge pathology
BinaryRepeatMode bool // use tree-sitter's binary repeat helper shape (aux→seq(aux,aux)|inner)
NonKeywordStrings map[string]bool // strings that should NOT be promoted via keyword DFA (extension keywords that coexist as identifiers)
}
Grammar is the top-level grammar definition.
func AliasSuperGrammar ¶
func AliasSuperGrammar() *Grammar
AliasSuperGrammar returns a grammar that exercises aliases and supertypes.
Supertypes:
_expression is a supertype with children: number, string, identifier, binary_expression
Aliases:
In assignment, the left-hand side identifier is aliased to "variable"
In binary_expression, the operator string is aliased to "op"
func CalcGrammar ¶
func CalcGrammar() *Grammar
CalcGrammar returns a calculator grammar that exercises precedence and associativity. It defines:
- Binary operators: +, -, *, / with standard math precedence
- Unary prefix minus: -x (highest precedence)
- Parenthesized expressions: (x)
- Integer literals: number
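To make the role of precedence and associativity concrete, here is the conventional yacc-style rule for settling a shift/reduce conflict between a completed production and a lookahead operator. It is a general sketch, not the package's resolver: the Assoc constants, the resolve function, and the precedence levels used in main are all assumptions.

```go
package main

import "fmt"

// Assoc is a local stand-in for an associativity setting.
type Assoc int

const (
	AssocNone Assoc = iota
	AssocLeft
	AssocRight
)

// resolve decides whether to reduce the production on the stack or shift the
// lookahead operator. Higher precedence wins; ties fall back to associativity.
func resolve(prodPrec int, prodAssoc Assoc, lookaheadPrec int) string {
	switch {
	case lookaheadPrec > prodPrec:
		return "shift"
	case lookaheadPrec < prodPrec:
		return "reduce"
	case prodAssoc == AssocLeft:
		return "reduce"
	case prodAssoc == AssocRight:
		return "shift"
	default:
		return "conflict"
	}
}

func main() {
	// Hypothetical levels: + and - at 1, * and / at 2, all left-associative.
	fmt.Println(resolve(1, AssocLeft, 2)) // after "1 + 2", seeing "*": shift
	fmt.Println(resolve(2, AssocLeft, 1)) // after "1 * 2", seeing "+": reduce
	fmt.Println(resolve(1, AssocLeft, 1)) // after "1 - 2", seeing "-": reduce
}
```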
func ExtScannerGrammar ¶
func ExtScannerGrammar() *Grammar
ExtScannerGrammar returns a grammar with external scanner tokens. It models a simple block-structured language where INDENT and DEDENT tokens are produced by an external scanner (like Python).
program: repeat(statement)
statement: simple_statement | block
simple_statement: identifier ";"
block: identifier ":" NEWLINE INDENT repeat(statement) DEDENT
External tokens: INDENT, DEDENT, NEWLINE
func ExtendGrammar ¶
ExtendGrammar creates a new grammar that inherits from a base grammar. The customize function receives the new grammar with all base rules copied in, and can override rules, add new ones, or modify extras/conflicts/etc.
Example:
cpp := ExtendGrammar("cpp", cGrammar(), func(g *Grammar) {
g.Define("class_declaration", Seq(Str("class"), Sym("identifier"), Sym("class_body")))
// Override an existing rule:
g.Define("declaration", Choice(Sym("class_declaration"), Sym("function_declaration")))
})
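The copy-then-customize flow can be modeled with stand-in types. In this sketch the Grammar struct keeps rule bodies as plain strings, extendGrammar is a lowercase illustrative helper, and keeping an overridden rule at its original position in RuleOrder is an assumption rather than documented behavior.

```go
package main

import "fmt"

// Grammar is a simplified stand-in for the package's Grammar type.
type Grammar struct {
	Name      string
	Rules     map[string]string
	RuleOrder []string
}

// Define records a rule; the first rule defined is the start rule.
// Redefining an existing name keeps its original position in RuleOrder.
func (g *Grammar) Define(name, body string) {
	if _, exists := g.Rules[name]; !exists {
		g.RuleOrder = append(g.RuleOrder, name)
	}
	g.Rules[name] = body
}

// extendGrammar copies every base rule into a new grammar, then lets the
// caller override or add rules without touching the base.
func extendGrammar(name string, base *Grammar, customize func(g *Grammar)) *Grammar {
	ext := &Grammar{Name: name, Rules: make(map[string]string, len(base.Rules))}
	for _, ruleName := range base.RuleOrder {
		ext.Define(ruleName, base.Rules[ruleName])
	}
	if customize != nil {
		customize(ext)
	}
	return ext
}

func main() {
	base := &Grammar{Name: "c", Rules: map[string]string{}}
	base.Define("translation_unit", "repeat(declaration)")
	base.Define("declaration", "function_declaration")

	cpp := extendGrammar("cpp", base, func(g *Grammar) {
		g.Define("class_declaration", `seq("class", identifier, class_body)`)
		g.Define("declaration", "choice(class_declaration, function_declaration)")
	})
	fmt.Println(cpp.RuleOrder)             // [translation_unit declaration class_declaration]
	fmt.Println(base.Rules["declaration"]) // function_declaration
}
```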
func GLRGrammar ¶
func GLRGrammar() *Grammar
GLRGrammar returns a grammar with intentional ambiguity that requires GLR parsing. It models a simplified C-like language where `a * b` can be parsed as either multiplication or a pointer declaration:
expression_statement: a * b ; (multiplication)
pointer_declaration: a * b ; (type * name)
The conflict between _expression and type_name is declared, causing the parser to fork stacks when it encounters the ambiguity.
func GoGrammar ¶
func GoGrammar() *Grammar
GoGrammar returns the Go grammar. This function is generated by EmitGrammarGo and should not be edited by hand.
func INIGrammar ¶
func INIGrammar() *Grammar
INIGrammar returns a production-grade INI file grammar.
Parses the superset of major INI dialects (Windows API, Python configparser, Git config, PHP parse_ini_file):
- Sections: [name] and [section "subsection"] (Git-style)
- Key-value pairs: key = value, key : value, key=value
- Comments: ; and # (full-line only)
- Quoted string values: "..." with \" and \\ escapes
- Global pairs: key=value before any [section]
- Empty values: key= (value is optional)
INI is line-oriented: newlines are significant (not extras). Only horizontal whitespace (spaces, tabs) is treated as extras.
func ImportGrammarJS ¶
ImportGrammarJS parses a tree-sitter grammar.js file and returns a Grammar IR. This uses gotreesitter's own JavaScript grammar to parse the file, demonstrating the full-circle capability: gotreesitter parsing its own input format.
func ImportGrammarJSON ¶
ImportGrammarJSON parses a tree-sitter grammar.json file (the canonical resolved form generated by `tree-sitter generate`) and returns a Grammar IR. This is more reliable than ImportGrammarJS because grammar.json has no require() calls, helper functions, or other JavaScript-specific constructs.
func JSONGrammar ¶
func JSONGrammar() *Grammar
JSONGrammar returns the JSON grammar defined using the Go DSL. This mirrors tree-sitter-json's grammar.js definition.
func KeywordGrammar ¶
func KeywordGrammar() *Grammar
KeywordGrammar returns a simplified language grammar that exercises keyword extraction and the word token mechanism. Keywords "var" and "return" match the identifier pattern but are promoted to their own symbols by the keyword DFA.
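Keyword promotion can be pictured as two steps: match the word token first, then check whether the matched text is a reserved keyword. The sketch below uses a regular expression and a set lookup in place of the keyword DFA; the identifier pattern, the classify helper, and its return convention are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// identifier is an assumed word-token pattern anchored at the input start.
var identifier = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*`)

// keywords holds the strings that are promoted to their own symbols.
var keywords = map[string]bool{"var": true, "return": true}

// classify matches the longest identifier at the start of input and reports
// either the keyword symbol or "identifier". It returns empty strings when
// the input does not start with a word.
func classify(input string) (kind, text string) {
	text = identifier.FindString(input)
	if text == "" {
		return "", ""
	}
	if keywords[text] {
		return text, text
	}
	return "identifier", text
}

func main() {
	for _, in := range []string{"var x", "variable", "return"} {
		kind, text := classify(in)
		fmt.Printf("%s %q\n", kind, text)
	}
}
```

Because the whole word is matched before the lookup, "variable" stays an identifier instead of being split into the keyword "var" followed by "iable".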
func LoxGrammar ¶
func LoxGrammar() *Grammar
LoxGrammar returns a production-grade Lox grammar (Crafting Interpreters spec).
Implements the full Lox language:
- Variables: var x = expr;
- Functions: fun name(params) { body }
- Classes: class Name < Super { methods }
- Control flow: if/else, while, for
- Operators: or, and, ==, !=, <, >, <=, >=, +, -, *, /, !, unary -
- Calls and property access: f(args), obj.prop, obj.prop = val
- Literals: numbers, strings, true, false, nil, this, super
- Print: print expr;
- Return: return expr;
- Comments: // line comments
- Block scoping: { statements }
func MustacheGrammar ¶
func MustacheGrammar() *Grammar
MustacheGrammar returns a production-grade Mustache template grammar.
Implements the required Mustache spec features:
- Interpolation: {{ name }}
- Unescaped interpolation: {{{ name }}} and {{& name }}
- Sections: {{# name }} ... {{/ name }}
- Inverted sections: {{^ name }} ... {{/ name }}
- Comments: {{! comment text }}
- Partials: {{> partial_name }}
- Dotted names: {{ person.name }}
- Implicit iterator: {{ . }}
- Raw text between tags
The grammar treats {{ and }} as delimiters. Text outside tags is raw content. The DFA handles {{{ vs {{ disambiguation via maximal munch.
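Maximal munch means that when several token patterns match at the same position, the longest match wins. A minimal sketch of that rule for the two opening delimiters follows; nextDelimiter is a hypothetical helper that uses prefix checks rather than a DFA.

```go
package main

import (
	"fmt"
	"strings"
)

// nextDelimiter returns the longest opening delimiter that prefixes input,
// or "" when the input starts with raw text.
func nextDelimiter(input string) string {
	candidates := []string{"{{", "{{{"}
	best := ""
	for _, c := range candidates {
		if strings.HasPrefix(input, c) && len(c) > len(best) {
			best = c
		}
	}
	return best
}

func main() {
	fmt.Println(nextDelimiter("{{{ name }}}")) // {{{
	fmt.Println(nextDelimiter("{{ name }}"))   // {{
}
```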
func NewGrammar ¶
NewGrammar creates a new grammar with the given name.
func ParseGrammarFile ¶
ParseGrammarFile parses a declarative .grammar file into a Grammar IR.
Syntax:
grammar <name>
extras = [ /\s/ ]
word = <rule_name>
supertypes = [ <rule_name>, ... ]
conflicts = [ [<rule>, <rule>], ... ]
rule <name> = <expr>
Expressions:
"string"            string literal
/pattern/           regex pattern
<name>              symbol reference
seq(a, b, ...)      sequence
choice(a, b, ...)   alternation
repeat(a)           zero or more
repeat1(a)          one or more
optional(a)         optional
token(a)            token boundary
field("name", a)    field annotation
prec(n, a)          precedence
prec.left(n, a)     left-associative precedence
prec.right(n, a)    right-associative precedence
func (*Grammar) Define ¶
Define adds a rule to the grammar. The first rule defined is the start rule.
func (*Grammar) SetConflicts ¶
SetConflicts declares grammar conflicts for GLR.
func (*Grammar) SetExternals ¶
SetExternals declares external scanner tokens.
func (*Grammar) SetSupertypes ¶
SetSupertypes declares supertype rules.
type GrammarDiff ¶
type GrammarDiff struct {
AddedRules []string
RemovedRules []string
ModifiedRules []string // rules present in both but with different definitions
ExtrasChanged bool
ConflictsChanged bool
ExternalsChanged bool
WordChanged bool
SupertypesChanged bool
}
GrammarDiff describes the differences between two grammar versions.
func DiffGrammars ¶
func DiffGrammars(old, new *Grammar) *GrammarDiff
DiffGrammars compares two grammar versions and returns a diff.
func (*GrammarDiff) HasChanges ¶
func (d *GrammarDiff) HasChanges() bool
HasChanges returns true if any differences were found.
func (*GrammarDiff) String ¶
func (d *GrammarDiff) String() string
String returns a human-readable summary of the diff.
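The rule-level part of such a diff reduces to comparing two name-to-definition maps. The sketch below illustrates that comparison with rule bodies simplified to strings; ruleDiff and diffRules are hypothetical names, and the flags for extras, conflicts, externals, word, and supertypes are omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// ruleDiff holds the rule-level portion of a grammar diff.
type ruleDiff struct {
	Added    []string
	Removed  []string
	Modified []string
}

// HasChanges reports whether any rule was added, removed, or modified.
func (d ruleDiff) HasChanges() bool {
	return len(d.Added)+len(d.Removed)+len(d.Modified) > 0
}

// diffRules compares two rule sets by name and definition. Results are
// sorted so the output does not depend on map iteration order.
func diffRules(oldRules, newRules map[string]string) ruleDiff {
	var d ruleDiff
	for name, body := range newRules {
		prev, ok := oldRules[name]
		switch {
		case !ok:
			d.Added = append(d.Added, name)
		case prev != body:
			d.Modified = append(d.Modified, name)
		}
	}
	for name := range oldRules {
		if _, ok := newRules[name]; !ok {
			d.Removed = append(d.Removed, name)
		}
	}
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	sort.Strings(d.Modified)
	return d
}

func main() {
	oldRules := map[string]string{"value": "number", "pair": "key value"}
	newRules := map[string]string{"value": "number | string", "array": "[ value ]"}
	d := diffRules(oldRules, newRules)
	fmt.Println(d.Added, d.Removed, d.Modified, d.HasChanges()) // [array] [pair] [value] true
}
```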
type LRTables ¶
type LRTables struct {
// ActionTable[state][symbol] = list of actions (multiple = conflict/GLR)
ActionTable map[int]map[int][]lrAction
GotoTable map[int]map[int]int // [state][nonterminal] → target state
StateCount int
ExtraChainStateStart int // first synthetic nonterminal-extra state, or -1 if none
}
LRTables holds the generated parse tables.
type NormalizedGrammar ¶
type NormalizedGrammar struct {
Symbols []SymbolInfo
Productions []Production
Terminals []TerminalPattern
ExtraSymbols []int // symbol indices of extras
FieldNames []string // index 0 is always ""
Conflicts [][]int // symbol index groups
Supertypes []int // symbol indices
StartSymbol int
AugmentProdID int // production index for S' → S
// Keyword support (populated when Grammar.Word is set).
KeywordSymbols []int // symbol IDs that are keywords
WordSymbolID int // word token symbol ID (e.g., identifier)
KeywordEntries []TerminalPattern // keyword patterns for keyword DFA
// External scanner support (populated when Grammar.Externals is set).
ExternalSymbols []int // external token index → symbol ID
// contains filtered or unexported fields
}
NormalizedGrammar is the output of the normalize step.
func Normalize ¶
func Normalize(g *Grammar) (*NormalizedGrammar, error)
Normalize transforms a Grammar into a NormalizedGrammar.
func (*NormalizedGrammar) TokenCount ¶
func (ng *NormalizedGrammar) TokenCount() int
TokenCount returns the number of terminal symbols (including symbol 0 = end).
type Production ¶
type Production struct {
LHS int // symbol index
RHS []int // symbol indices
Prec int
Assoc Assoc
DynPrec int
ProductionID int
Fields []FieldAssign // per-RHS-position field assignments
Aliases []AliasInfo // per-RHS-position alias info
IsExtra bool // true if this production belongs to a nonterminal extra
}
Production is a single LHS → RHS production with metadata.
type Rule ¶
type Rule struct {
Kind RuleKind
Value string // literal/pattern/symbol/field name
Children []*Rule // sub-rules
Prec int // precedence value
Named bool // for alias: whether the alias is a named node
}
Rule is a node in the grammar rule tree.
func PrecDynamic ¶
PrecDynamic sets dynamic precedence on a rule.
type RuleKind ¶
type RuleKind int
RuleKind identifies the type of a grammar rule node.
const (
RuleString RuleKind = iota // literal string: "{"
RulePattern // regex pattern: /[0-9]+/
RuleSymbol // symbol reference: $.object
RuleSeq // sequence: seq(a, b, c)
RuleChoice // alternation: choice(a, b)
RuleRepeat // zero-or-more: repeat(a)
RuleRepeat1 // one-or-more: repeat1(a)
RuleOptional // optional: optional(a)
RuleToken // token boundary: token(a)
RuleImmToken // immediate token: token.immediate(a)
RuleField // field annotation: field("name", a)
RulePrec // precedence: prec(n, a)
RulePrecLeft // left-associative: prec.left(n, a)
RulePrecRight // right-associative: prec.right(n, a)
RulePrecDynamic // dynamic precedence: prec.dynamic(n, a)
RuleBlank // epsilon / empty
RuleAlias // alias: alias(a, "name")
)
type SymbolInfo ¶
type SymbolInfo struct {
Name string
Visible bool
Named bool
Supertype bool
Kind SymbolKind
IsExtra bool
Immediate bool // token.immediate — no preceding whitespace skip
}
SymbolInfo describes a grammar symbol.
type SymbolKind ¶
type SymbolKind int
SymbolKind classifies a grammar symbol.
const (
SymbolTerminal SymbolKind = iota // anonymous terminal like "{"
SymbolNamedToken // named terminal like number, string_content
SymbolExternal // external scanner token
SymbolNonterminal // nonterminal rule
)
Source Files ¶
- alias_grammar.go
- assemble.go
- bitset.go
- calc_grammar.go
- codegen_c.go
- dfa.go
- diagnostics.go
- diff.go
- emit_grammar_go.go
- encode.go
- export_grammarjson.go
- ext_grammar.go
- glr_grammar.go
- go_grammar.go
- grammar.go
- highlight.go
- highlight_gen.go
- import_grammarjs.go
- import_grammarjson.go
- ini_grammar.go
- json_grammar.go
- keyword_grammar.go
- lox_grammar.go
- lr.go
- lr_lalr.go
- lr_provenance.go
- lr_split.go
- lr_split_oracle.go
- mustache_grammar.go
- nfa.go
- normalize.go
- parse_grammar_file.go
- regex.go