semanticfw

package module
v2.0.1
Published: Jan 14, 2026 License: MIT Imports: 23 Imported by: 0

README

Semantic Firewall

Behavioral Code Analysis Engine for Go


Fingerprint behavior, not bytes · Prove loop equivalence · Catch backdoors · Hunt malware


[!CAUTION] Disclaimer: This tool is provided for defensive security research and authorized testing only. The malware scanning features are designed to help security teams detect and analyze malicious code patterns. Do not use this tool to create, distribute, or deploy malware. Users are responsible for ensuring compliance with all applicable laws and organizational policies. The author assumes no liability for misuse.


What is Semantic Firewall?

Semantic Firewall is a static analysis engine that generates deterministic fingerprints of Go code's behavior, not its textual representation. Unlike traditional diff tools that get confused by whitespace, renaming, or style changes, Semantic Firewall understands the actual control flow and data dependencies of your code.

Core Capabilities:

| Feature | Description |
|---|---|
| Scalar Evolution (SCEV) | Mathematically proves loop equivalence regardless of syntax |
| Semantic Zipper | Diffs architectural changes by walking use-def chains in parallel |
| BoltDB Signature Store | ACID-compliant persistent storage with O(1) topology lookups |
| Fuzzy Hash Indexing (LSH-lite) | Locality-sensitive bucketing for near-match detection |
| Shannon Entropy Analysis | Detects obfuscation, packing, and encrypted payloads |
| Topology Matching | Catches renamed/obfuscated malware via structural fingerprints |

Installation

go install github.com/BlackVectorOps/semantic_firewall/v2/cmd/sfw@latest

Quick Start

# Fingerprint a file (produces deterministic SHA-256 of behavior)
sfw check ./main.go

# Semantic diff between two versions (ignores cosmetic changes)
sfw diff old_version.go new_version.go

# Index a malware sample (auto-resolves DB location)
sfw index malware.go --name "Beacon_v1" --severity CRITICAL

# Scan code for known malware patterns (O(1) topology matching)
sfw scan ./suspicious/ --threshold 0.8

# Scan with dependency analysis (examines imported packages)
sfw scan ./cmd/myapp --deps --deps-depth transitive

# View database statistics
sfw stats

Check Output:

{
  "file": "./main.go",
  "functions": [
    { "function": "main", "fingerprint": "005efb52a8c9d1e3...", "line": 12 }
  ]
}

Diff Output (Risk-Aware):

{
  "summary": {
    "semantic_match_pct": 92.5,
    "preserved": 12,
    "modified": 1,
    "renamed_functions": 2,
    "high_risk_changes": 1
  },
  "functions": [
    {
      "function": "HandleLogin",
      "status": "modified",
      "added_ops": ["Call <log.Printf>", "Call <net.Dial>"],
      "risk_score": 15,
      "topology_delta": "Calls+2, AddedGoroutine"
    }
  ],
  "topology_matches": [
    {
      "old_function": "processData",
      "new_function": "handleInput",
      "similarity": 0.94,
      "matched_by_name": false
    }
  ]
}

Scan Output (Malware Hunter):

{
  "target": "./suspicious/",
  "backend": "boltdb",
  "total_functions_scanned": 47,
  "alerts": [
    {
      "signature_name": "Beacon_v1",
      "severity": "CRITICAL",
      "matched_function": "executePayload",
      "confidence": 0.92,
      "match_details": {
        "topology_match": true,
        "entropy_match": true,
        "topology_similarity": 1.0,
        "calls_matched": ["net.Dial", "os.Exec"]
      }
    }
  ],
  "summary": { "critical": 1, "high": 0, "total_alerts": 1 }
}

Why Use This?

"Don't unit tests solve this?" No. Unit tests verify correctness (does input A produce output B?). sfw verifies intent and integrity.

  • A developer refactors a function but secretly adds a network call → unit tests pass, sfw fails
  • A developer changes a switch to a Strategy Pattern → git diff shows 100 lines changed, sfw diff shows zero logic changes
  • An attacker renames a known malware function → name-based detection fails, sfw scan catches it via topology
  • A supply chain attack adds obfuscated code → entropy analysis flags it as PACKED

| Traditional Tooling | Semantic Firewall |
|---|---|
| Git diff: shows lines changed (whitespace, renaming = noise) | sfw check: verifies control flow graph identity |
| Unit tests: verify input/output (blind to side effects) | sfw diff: isolates actual logic drift from cosmetic changes |
| YARA/grep: pattern-matches strings (trivial to evade) | sfw scan: O(1) topology matching survives renaming/obfuscation |
| Traditional AV: signature hashes (defeated by recompilation) | sfw: behavioral fingerprints survive recompilation |

Use cases:

  • Supply chain security: Detect backdoors like the xz attack that pass code review
  • Safe refactoring: Prove your refactor didn't change behavior
  • CI/CD gates: Block PRs that alter critical function logic
  • Malware hunting: Index known malware patterns, scan codebases at scale
  • Obfuscation detection: Entropy analysis flags packed/encrypted code
  • Dependency auditing: Scan imported packages for malicious patterns

Commands Reference

| Command | Purpose | Time Complexity | Space |
|---|---|---|---|
| sfw check | Generate semantic fingerprints | O(N) | O(N) |
| sfw diff | Semantic delta via Zipper algorithm | O(I) | O(I) |
| sfw index | Index malware samples into BoltDB | O(N) | O(1) per sig |
| sfw scan | Hunt malware via topology matching | O(1) exact / O(M) fuzzy | O(M) |
| sfw migrate | Migrate JSON signatures to BoltDB | O(S) | O(S) |
| sfw stats | Display database statistics | O(1) | O(1) |

Where N = source size, I = instructions, S = signatures, M = signatures in entropy range.

Command Details
sfw check [--strict] [--scan --db <path>] <file.go|directory>

Generate semantic fingerprints. Use --strict for validation mode. Use --scan to enable unified security scanning during fingerprinting.

sfw diff <old.go> <new.go>

Compute semantic delta using the Zipper algorithm with topology-based function matching. Outputs risk scores and structural deltas.

sfw index <file.go> --name <name> --severity <CRITICAL|HIGH|MEDIUM|LOW> [--category <cat>] [--db <path>]

Index a reference malware sample. Generates topology hash, fuzzy hash, and entropy score.

sfw scan <file.go|directory> [--db <path>] [--threshold <0.0-1.0>] [--exact] [--deps] [--deps-depth <direct|transitive>]

Scan target code for malware signatures. Use --exact for O(1) topology-only matching. Use --deps to scan imported dependencies.

sfw migrate --from <json> --to <db>

Migrate legacy JSON database to BoltDB format for O(1) lookups.

Signature Database Configuration

The CLI automatically resolves the signature database location (signatures.db) in the following order:

  1. Explicit Flag: --db /custom/path/signatures.db
  2. Environment Variable: SFW_DB_PATH
  3. Local Directory: ./signatures.db
  4. User Home: ~/.sfw/signatures.db
  5. System Paths: /usr/local/share/sfw/signatures.db or /var/lib/sfw/signatures.db

This allows you to manage updates independently of the binary.

sfw stats --db <path>

Display database statistics including signature count and index sizes.


Proof of Equivalence: The SCEV Engine

Skeptical that this survives more than whitespace changes?

The Semantic Firewall uses Scalar Evolution (SCEV) analysis to mathematically prove loop identity. SCEV represents induction variables as closed-form algebraic expressions called Add Recurrences.

These three functions generate the IDENTICAL SHA-256 fingerprint:

1. Idiomatic Go (Range)

func sum(items []int) int {
    total := 0
    for _, x := range items {
        total += x
    }
    return total
}

2. C-Style (Index)

func sum(items []int) int {
    total := 0
    for i := 0; i < len(items); i++ {
        total += items[i]
    }
    return total
}

3. Raw Control Flow (Goto)

func sum(items []int) int {
    total := 0
    i := 0
loop:
    if i >= len(items) {
        goto done
    }
    total += items[i]
    i++
    goto loop
done:
    return total
}

All three compile to the same canonical control flow:

flowchart TD
    subgraph sg1 ["Canonical Control Flow"]
        Entry["entry:<br/>total = 0<br/>i = {0, +, 1}"]
        Loop{"i < len(items)?"}
        Body["total += items[i]"]
        Exit["return total"]
        
        Entry --> Loop
        Loop -->|yes| Body
        Body --> Loop
        Loop -->|no| Exit
    end
The Add Recurrence Notation

The SCEV notation {Start, +, Step} represents an induction variable where at iteration $k$:

$$Val(k) = Start + (Step \times k)$$

So {0, +, 1} means "starts at 0, increments by 1 each iteration"; this algebraic representation is identical regardless of source syntax.

SCEV is closed under affine transformations:

| Operation | Result |
|---|---|
| {S, +, T} + C | {S+C, +, T} |
| C × {S, +, T} | {C×S, +, C×T} |
| {S₁, +, T₁} + {S₂, +, T₂} | {S₁+S₂, +, T₁+T₂} |
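
To make the algebra concrete, here is a standalone sketch of these fold rules (illustrative only; the engine's real SCEV types live in scev.go and operate on SSA values):

package main

import "fmt"

// AddRec models {Start, +, Step}: Val(k) = Start + Step*k.
type AddRec struct{ Start, Step int64 }

// At evaluates the recurrence at iteration k.
func (r AddRec) At(k int64) int64 { return r.Start + r.Step*k }

// AddConst applies {S, +, T} + C = {S+C, +, T}.
func (r AddRec) AddConst(c int64) AddRec { return AddRec{r.Start + c, r.Step} }

// MulConst applies C × {S, +, T} = {C×S, +, C×T}.
func (r AddRec) MulConst(c int64) AddRec { return AddRec{c * r.Start, c * r.Step} }

// Add applies {S₁, +, T₁} + {S₂, +, T₂} = {S₁+S₂, +, T₁+T₂}.
func (r AddRec) Add(o AddRec) AddRec { return AddRec{r.Start + o.Start, r.Step + o.Step} }

func main() {
	i := AddRec{Start: 0, Step: 1}      // the canonical counter {0, +, 1}
	offset := i.MulConst(4).AddConst(8) // i*4 + 8 folds to {8, +, 4}
	fmt.Println(offset.At(3))           // 8 + 4*3 = 20
}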

Verify it yourself:

go run examples/proof.go

# Output:
# [Semantic Firewall Proof]
# 1. Range Loop Hash:  0a1b2c...
# 2. Index Loop Hash:  0a1b2c...
# 3. Goto Loop Hash:   0a1b2c...
# [SUCCESS] All three implementations are logically identical.

Persistent Signature Database (BoltDB)

The scanner uses BoltDB, an embedded key-value store with ACID transactions, for signature storage. This enables:

  • O(1) exact topology lookups via indexed hash keys
  • O(M) fuzzy matching via range scans on entropy indexes
  • Atomic writes: no partial updates on crash
  • Concurrent reads: safe for parallel scanning
  • Zero configuration: single file, no server required
Database Schema
┌─────────────────────────────────────────────────────────────────┐
│                        BoltDB Buckets                           │
├─────────────────────────────────────────────────────────────────┤
│  signatures     │  ID → JSON blob (full signature)              │
│  idx_topology   │  TopologyHash → ID (O(1) exact match)         │
│  idx_fuzzy      │  FuzzyHash:ID → ID (LSH bucket index)         │
│  idx_entropy    │  "05.1234:ID" → ID (range scan index)         │
│  meta           │  version, stats, maintenance info             │
└─────────────────────────────────────────────────────────────────┘
Entropy Key Encoding

Entropy scores are stored as fixed-width keys for proper lexicographic ordering:

Key: "05.1234:SFW-MAL-001"
      ├──────┤ ├─────────┤
      entropy  unique ID

This enables efficient range scans: find all signatures with entropy 5.0 ± 0.5.
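
A minimal sketch of why the fixed width matters (the format string mirrors the one shown in the deep-dive section; entropyKey is an illustrative helper, not part of the public API):

package main

import (
	"fmt"
	"sort"
)

// entropyKey builds a fixed-width, lexicographically sortable key.
// Zero-padding keeps "5.82" from sorting after "10.1" as a plain string would.
func entropyKey(entropy float64, id string) string {
	return fmt.Sprintf("%08.4f:%s", entropy, id)
}

func main() {
	keys := []string{
		entropyKey(7.91, "SFW-MAL-003"),
		entropyKey(5.82, "SFW-MAL-001"),
		entropyKey(5.10, "SFW-MAL-002"),
	}
	sort.Strings(keys) // byte order now equals numeric order

	// Range scan: every signature with entropy in [5.0, 6.0).
	lo, hi := entropyKey(5.0, ""), entropyKey(6.0, "")
	for _, k := range keys {
		if k >= lo && k < hi {
			fmt.Println("in range:", k)
		}
	}
}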

Database Operations
# View statistics
sfw stats --db signatures.db
{
  "signature_count": 142,
  "topology_index_count": 142,
  "entropy_index_size": 28672,
  "file_size_human": "2.1 MB"
}

# Migrate from legacy JSON (one-time operation)
sfw migrate --from old_signatures.json --to signatures.db

# Export for backup/compatibility
# (Programmatic API: scanner.ExportToJSON("backup.json"))
Programmatic Database Access
import (
	"log"
	"time"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v2"
)

// Open database with options
opts := semanticfw.BoltScannerOptions{
    MatchThreshold:   0.75,    // Minimum confidence for alerts
    EntropyTolerance: 0.5,     // Fuzzy entropy window
    Timeout:          5*time.Second,
    ReadOnly:         false,   // Set true for scan-only mode
}
scanner, err := semanticfw.NewBoltScanner("signatures.db", opts)
if err != nil {
    log.Fatal(err)
}
defer scanner.Close()

// Add signatures (single or bulk)
scanner.AddSignature(sig)
scanner.AddSignatures(sigs)  // Atomic batch insert

// Lookup operations
sig, _ := scanner.GetSignature("SFW-MAL-001")
sigByTopo, _ := scanner.GetSignatureByTopology(topoHash)
ids, _ := scanner.ListSignatureIDs()
count, _ := scanner.CountSignatures()

// Maintenance
scanner.DeleteSignature("SFW-MAL-001")
scanner.MarkFalsePositive("SFW-MAL-001", "benign library")
scanner.RebuildIndexes()  // Recover from corruption
scanner.Compact("signatures-compacted.db")

Malware Scanning: Two-Phase Detection

The Semantic Firewall includes a behavioral malware scanner that matches code by its structural topology, not just strings or hashes. The scanner uses a two-phase detection algorithm:

Phase 1: O(1) Exact Topology Match
1. Extract FunctionTopology from target SSA function
2. Compute topology hash: SHA-256(blockCount || callProfile || controlFlowFlags)
3. BoltDB lookup: idx_topology[hash] → signature IDs
4. Return exact matches with 100% topology confidence
Phase 2: O(1) Fuzzy Bucket Match (LSH-lite)
1. Compute fuzzy hash: GenerateFuzzyHash(topology) → "B3L1BR2"
2. Look up all signatures in the same fuzzy bucket
3. Verify call signature overlap and entropy distance
4. Return matches above confidence threshold
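
Phase 1 depends on the topology hash being deterministic. A sketch of how such a hash can be computed from a structural profile (illustrative; the library's internal generateTopologyHash may hash more fields):

package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// topologyHash feeds structural features to SHA-256 in a fixed order.
// Go randomizes map iteration, so call names must be sorted first.
func topologyHash(blockCount int, calls map[string]int, flags string) string {
	h := sha256.New()
	fmt.Fprintf(h, "blocks:%d|", blockCount)
	names := make([]string, 0, len(calls))
	for name := range calls {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Fprintf(h, "%s:%d|", name, calls[name])
	}
	fmt.Fprintf(h, "flags:%s", flags)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	calls := map[string]int{"net.Dial": 1, "os.Exec": 1, "time.Sleep": 1}
	fmt.Println(topologyHash(4, calls, "infinite_loop"))
}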

Why this survives evasion:

  • Renaming evasion fails: backdoor() → helper() still matches (names aren't part of topology)
  • Obfuscation-resistant: Variable renaming and code shuffling don't change block/call structure
  • O(1) at scale: BoltDB indexes enable instant lookups even with thousands of signatures
Fuzzy Hash Buckets (LSH-lite)

The fuzzy hash creates locality-sensitive buckets based on quantized structural metrics:

FuzzyHash = "B{log2(blocks)}L{loops}BR{log2(branches)}"

Examples:
  B3L1BR2 = 8-15 blocks, 1 loop, 4-7 branches
  B4L2BR3 = 16-31 blocks, 2 loops, 8-15 branches

Log2 buckets reduce sensitivity to small changes while preserving structural similarity.
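
A sketch of the quantization (illustrative; GenerateFuzzyHash is the library's real implementation, which also caps the loop bucket):

package main

import (
	"fmt"
	"math/bits"
)

// log2Bucket quantizes a count into its log2 bucket, so 8-15 blocks
// land in bucket 3, 16-31 in bucket 4, and so on.
func log2Bucket(n int) int {
	if n <= 0 {
		return 0
	}
	return bits.Len(uint(n)) - 1
}

// fuzzyHash builds the "B{...}L{...}BR{...}" bucket key described above.
func fuzzyHash(blocks, loops, branches int) string {
	return fmt.Sprintf("B%dL%dBR%d", log2Bucket(blocks), loops, log2Bucket(branches))
}

func main() {
	fmt.Println(fuzzyHash(12, 1, 5))  // B3L1BR2
	fmt.Println(fuzzyHash(20, 2, 10)) // B4L2BR3
}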

Dependency Scanning

Scan not just your code, but all imported dependencies:

# Scan local code + direct imports
sfw scan ./cmd/myapp --deps --db signatures.db

# Deep scan: include transitive dependencies
sfw scan . --deps --deps-depth transitive --db signatures.db

# Fast exact-match mode for large codebases
sfw scan . --deps --exact --db signatures.db

Output with dependencies:

{
  "target": "./cmd/myapp",
  "total_functions_scanned": 1247,
  "dependencies_scanned": 892,
  "scanned_dependencies": [
    "github.com/example/suspicious-lib",
    "github.com/another/dependency"
  ],
  "alerts": [...],
  "summary": { "critical": 0, "high": 1, "total_alerts": 1 }
}

| Flag | Description |
|---|---|
| --deps | Enable dependency scanning |
| --deps-depth direct | Scan only direct imports (default) |
| --deps-depth transitive | Scan all transitive dependencies |
| --exact | O(1) exact topology match only (fastest) |

Note: Dependency scanning requires modules to be downloaded (go mod download). Stdlib packages are automatically excluded.

Workflow: Lab → Hunt
flowchart LR
    subgraph Lab["Lab Phase"]
        M1["Known Malware"] --> I["sfw index"]
        I --> DB[(signatures.db)]
    end
    
    subgraph Hunt["Hunter Phase"]
        T["Target Code"] --> S["sfw scan"]
        DB --> S
        S --> A["Alerts"]
    end
    
    style Lab fill:#1e1b4b,stroke:#8b5cf6,stroke-width:2px
    style Hunt fill:#064e3b,stroke:#10b981,stroke-width:2px
    style DB fill:#7c2d12,stroke:#f97316,stroke-width:2px
Step 1: Index Known Malware (Lab Phase)
# Index a beacon/backdoor sample
sfw index samples/dirty/dirty_beacon.go \
    --name "DirtyBeacon" \
    --severity CRITICAL \
    --category malware \
    --db signatures.db

# Output:
{
  "message": "Indexed 1 functions from samples/dirty/dirty_beacon.go",
  "indexed": [{
    "name": "DirtyBeacon_Run",
    "topology_hash": "topo:9a8b7c6d5e4f3a2b...",
    "fuzzy_hash": "B3L1BR2",
    "entropy_score": 5.82,
    "identifying_features": {
      "required_calls": ["net.Dial", "os.Exec", "time.Sleep"],
      "control_flow": { "has_infinite_loop": true, "has_reconnect_logic": true }
    }
  }],
  "backend": "boltdb",
  "total_signatures": 1
}
Step 2: Scan Suspicious Code (Hunter Phase)
# Scan an entire directory
sfw scan ./untrusted_vendor/ --db signatures.db --threshold 0.75

# Fast mode: exact topology match only (O(1) per function)
sfw scan ./large_codebase/ --db signatures.db --exact

Shannon Entropy Analysis

The scanner calculates Shannon entropy for each function to detect obfuscation and packed code:

$$H = -\sum_{i} p(x_i) \log_2 p(x_i)$$

Where $p(x_i)$ is the probability of byte value $x_i$ appearing in the function's string literals.
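
A self-contained version of this calculation (functionally similar to the exported CalculateEntropy; reproduced here for illustration):

package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns H in bits per byte: 0.0 for constant input,
// approaching 8.0 for uniformly random bytes.
func shannonEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	var h float64
	n := float64(len(data))
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy([]byte("aaaaaaaaaaaa")))        // 0.00
	fmt.Printf("%.2f\n", shannonEntropy([]byte("SGVsbG8sIHdvcmxkIQ=="))) // noticeably higher
}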

Entropy Spectrum:

     LOW          NORMAL              HIGH         PACKED
  ◀─────────────────────────────────────────────────────────▶
  │    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░▓▓▓▓▓▓▓▓▓▓▓████████│
  0                  4.0              6.5        7.5        8.0
  │                   │                │          │          │
  │  Simple funcs     │  Normal code   │ Obfusc.  │ Encrypted│
  │  (getters/setters)│  (business     │ (base64  │ (packed  │
  │                   │   logic)       │  strings)│  payloads)│

| Entropy Range | Classification | Meaning | Example |
|---|---|---|---|
| < 4.0 | LOW | Simple/sparse code | func Get() int { return x } |
| 4.0 - 6.5 | NORMAL | Typical compiled code | Business logic, handlers |
| 6.5 - 7.5 | HIGH | Potentially obfuscated | Base64 blobs, encoded strings |
| > 7.5 | PACKED | Likely packed/encrypted | Encrypted payloads, shellcode |

Functions with HIGH or PACKED entropy combined with suspicious call patterns receive elevated confidence scores.

Call Signature Resolution

The topology extractor resolves call targets to stable identifiers:

| Call Type | Resolution | Example |
|---|---|---|
| Static function | pkg.Func | net.Dial |
| Interface invoke | invoke:Type.Method | invoke:io.Reader.Read |
| Builtin | builtin:name | builtin:len |
| Closure | closure:signature | closure:func(int) error |
| Go statement | go:target | go:handler.serve |
| Defer statement | defer:target | defer:conn.Close |
| Reflection | reflect:Call | Dynamic dispatch |
| Dynamic | dynamic:type | Unknown target |

This stable resolution ensures signatures match even when:

  • Code is moved between packages
  • Interface implementations change
  • Method receivers are renamed

Topology Matching

The diff command now uses structural topology matching to detect renamed or obfuscated functions.

How Topology Matching Works:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        TOPOLOGY FINGERPRINT EXTRACTION                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   func processData(input []byte) error {        ──►  TOPOLOGY VECTOR        │
│       conn, _ := net.Dial("tcp", addr)                                      │
│       for _, b := range input {                      ┌──────────────────┐   │
│           conn.Write([]byte{b})                      │ Params: 1        │   │
│       }                                              │ Returns: 1       │   │
│       return conn.Close()                            │ Blocks: 4        │   │
│   }                                                  │ Loops: 1         │   │
│                                                      │ Calls:           │   │
│                                                      │   net.Dial: 1    │   │
│   func handleInput(data []byte) error {              │   Write: 1       │   │
│       c, _ := net.Dial("tcp", server)                │   Close: 1       │   │
│       for i := 0; i < len(data); i++ {               │ Entropy: 5.2     │   │
│           c.Write([]byte{data[i]})                   └──────────────────┘   │
│       }                                                      │              │
│       return c.Close()                                       ▼              │
│   }                                              ┌──────────────────────┐   │
│                                                  │  SIMILARITY: 94%     │   │
│   Different names, SAME topology ───────────────│  ✓ MATCH DETECTED    │   │
│                                                  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Topology vs Name-Based Matching:

flowchart LR
    subgraph old["Old Version"]
        O1["processData()"]
        O2["sendPacket()"]
        O3["initConn()"]
    end
    
    subgraph new["New Version (Obfuscated)"]
        N1["handleInput()"]
        N2["xmit()"]
        N3["setup()"]
    end
    
    O1 -.->|"Name: ✗"| N1
    O1 ==>|"Topology: 94%"| N1
    O2 ==>|"Topology: 91%"| N2
    O3 ==>|"Topology: 88%"| N3
    
    style old fill:#1e1b4b,stroke:#8b5cf6,stroke-width:2px
    style new fill:#064e3b,stroke:#10b981,stroke-width:2px
sfw diff old_version.go refactored_version.go
{
  "summary": {
    "preserved": 8,
    "modified": 2,
    "renamed_functions": 3,
    "topology_matched_pct": 85.7
  },
  "topology_matches": [
    {
      "old_function": "processData",
      "new_function": "handleInput",
      "similarity": 0.94,
      "matched_by_name": false
    }
  ]
}

Functions are matched by their structural fingerprint (block count, call profile, control flow features) rather than just name, enabling detection of:

  • Renamed functions
  • Copy-pasted code with modified names
  • Obfuscated variants of known patterns

Library Usage

Fingerprinting
import (
	"fmt"
	"log"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v2"
)

src := `package main
func Add(a, b int) int { return a + b }
`

results, err := semanticfw.FingerprintSource("example.go", src, semanticfw.DefaultLiteralPolicy)
if err != nil {
    log.Fatal(err)
}

for _, r := range results {
    fmt.Printf("%s: %s\n", r.FunctionName, r.Fingerprint)
}
Malware Scanning with BoltDB
import (
	"fmt"
	"log"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v2"
)

// Open the signature database
scanner, err := semanticfw.NewBoltScanner("signatures.db", semanticfw.DefaultBoltScannerOptions())
if err != nil {
    log.Fatal(err)
}
defer scanner.Close()

// Extract topology from a function
topo := semanticfw.ExtractTopology(ssaFunction)

// O(1) exact topology match
if alert := scanner.ScanTopologyExact(topo, "suspiciousFunc"); alert != nil {
    fmt.Printf("ALERT: %s matched %s (confidence: %.2f)\n", 
        alert.MatchedFunction, alert.SignatureName, alert.Confidence)
}

// Full scan: exact + fuzzy entropy matching
alerts := scanner.ScanTopology(topo, "suspiciousFunc")
for _, alert := range alerts {
    fmt.Printf("[%s] %s: %s\n", alert.Severity, alert.SignatureName, alert.MatchedFunction)
}
Topology Extraction
import (
	"fmt"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v2"
)

// Extract structural features from an SSA function
topo := semanticfw.ExtractTopology(ssaFunction)

fmt.Printf("Blocks: %d, Loops: %d, Entropy: %.2f\n", 
    topo.BlockCount, topo.LoopCount, topo.EntropyScore)
fmt.Printf("Calls: %v\n", topo.CallSignatures)
fmt.Printf("Entropy Class: %s\n", topo.EntropyProfile.Classification)
fmt.Printf("Fuzzy Hash: %s\n", topo.FuzzyHash)
Unified Pipeline: Check + Scan

Enable security scanning during fingerprinting for a unified integrity + security workflow:

// CLI: sfw check --scan --db signatures.db ./main.go

// Programmatically:
results, _ := semanticfw.FingerprintSourceAdvanced(
    path, src, semanticfw.DefaultLiteralPolicy, strictMode)

for _, r := range results {
    // Integrity: Get fingerprint
    fmt.Printf("Function %s: %s\n", r.FunctionName, r.Fingerprint)
    
    // Security: Scan for malware
    fn := r.GetSSAFunction()
    topo := semanticfw.ExtractTopology(fn)
    alerts := scanner.ScanTopology(topo, r.FunctionName)
    for _, alert := range alerts {
        fmt.Printf("ALERT: %s\n", alert.SignatureName)
    }
}
Signature Structure
type Signature struct {
    ID                  string              // "SFW-MAL-001"
    Name                string              // "Beacon_v1_Run"
    Description         string              // Human-readable description
    Severity            string              // "CRITICAL", "HIGH", "MEDIUM", "LOW"
    Category            string              // "malware", "backdoor", "dropper"
    TopologyHash        string              // SHA-256 of topology vector
    FuzzyHash           string              // LSH bucket key "B3L1BR2"
    EntropyScore        float64             // 0.0-8.0
    EntropyTolerance    float64             // Fuzzy match window (default: 0.5)
    NodeCount           int                 // Basic block count
    LoopDepth           int                 // Maximum nesting depth
    IdentifyingFeatures IdentifyingFeatures // Behavioral markers
    Metadata            SignatureMetadata   // Provenance info
}

type IdentifyingFeatures struct {
    RequiredCalls  []string          // Must be present (VETO if missing)
    OptionalCalls  []string          // Bonus if present
    StringPatterns []string          // Suspicious strings
    ControlFlow    *ControlFlowHints // Structural patterns
}

type ControlFlowHints struct {
    HasInfiniteLoop   bool  // Beacon/C2 indicator
    HasReconnectLogic bool  // Persistence indicator
}
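
For orientation, a signature for the beacon example from the workflow section might be assembled by hand like this (field values are illustrative, and the scanner variable is assumed to be a BoltScanner opened as shown earlier; in practice IndexFunction derives the hashes and entropy from a FunctionTopology):

import (
	"log"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v2"
)

// Illustrative values; TopologyHash and FuzzyHash are normally
// generated from a FunctionTopology by IndexFunction.
sig := semanticfw.Signature{
	ID:               "SFW-MAL-001",
	Name:             "DirtyBeacon_Run",
	Description:      "TCP beacon with reconnect loop",
	Severity:         "CRITICAL",
	Category:         "malware",
	EntropyScore:     5.82,
	EntropyTolerance: 0.5,
	IdentifyingFeatures: semanticfw.IdentifyingFeatures{
		RequiredCalls: []string{"net.Dial", "os.Exec", "time.Sleep"},
		ControlFlow:   &semanticfw.ControlFlowHints{HasInfiniteLoop: true},
	},
}

// AddSignature persists the entry and updates all indexes atomically.
if err := scanner.AddSignature(sig); err != nil {
	log.Fatal(err)
}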

Technical Deep Dive

Architecture & Algorithms
Pipeline Overview
flowchart LR
    A["Source"] --> B["SSA"]
    B --> C["Loop Analysis"]
    C --> D["SCEV"]
    D --> E["Canonicalization"]
    E --> F["SHA-256"]
    
    B -.-> B1["go/ssa"]
    C -.-> C1["Tarjan's SCC"]
    D -.-> D1["Symbolic Evaluation"]
    E -.-> E1["Virtual IR Normalization"]
    
    style A fill:#4c1d95,stroke:#8b5cf6,stroke-width:2px,color:#e9d5ff
    style B fill:#1e3a8a,stroke:#3b82f6,stroke-width:2px,color:#dbeafe
    style C fill:#064e3b,stroke:#10b981,stroke-width:2px,color:#d1fae5
    style D fill:#7c2d12,stroke:#f97316,stroke-width:2px,color:#ffedd5
    style E fill:#701a75,stroke:#d946ef,stroke-width:2px,color:#fae8ff
    style F fill:#0f766e,stroke:#14b8a6,stroke-width:2px,color:#ccfbf1
    
    style B1 fill:transparent,stroke:#3b82f6,color:#93c5fd
    style C1 fill:transparent,stroke:#10b981,color:#6ee7b7
    style D1 fill:transparent,stroke:#f97316,color:#fdba74
    style E1 fill:transparent,stroke:#d946ef,color:#f0abfc
  1. SSA Construction: golang.org/x/tools/go/ssa converts source to Static Single Assignment form with explicit control flow graphs
  2. Loop Detection: Natural loop identification via backedge detection (edge B->H where H dominates B)
  3. SCEV Analysis: Algebraic characterization of loop variables as closed-form recurrences
  4. Canonicalization: Deterministic IR transformation: register renaming, branch normalization, loop virtualization
  5. Fingerprint: SHA-256 of canonical IR string
Scalar Evolution (SCEV) Engine

The SCEV framework (scev.go, 746 LOC) solves the "loop equivalence problem" -- proving that syntactically different loops compute the same sequence of values.

Core Abstraction: Add Recurrences

An induction variable is represented as $\{Start, +, Step\}_L$, meaning at iteration $k$ the value is:

$$Val(k) = Start + (Step \times k)$$

This representation is closed under affine transformations:

| Operation | Result |
|---|---|
| $\{S, +, T\} + C$ | $\{S+C, +, T\}$ |
| $C \times \{S, +, T\}$ | $\{C \times S, +, C \times T\}$ |
| $\{S_1, +, T_1\} + \{S_2, +, T_2\}$ | $\{S_1+S_2, +, T_1+T_2\}$ |

IV Detection Algorithm (Tarjan's SCC)

1. Build dependency graph restricted to loop body
2. Find SCCs via Tarjan's algorithm (O(V+E))
3. For each SCC containing a header Phi:
   a. Extract cycle: Phi -> BinOp -> Phi
   b. Classify: Basic ({S,+,C}), Geometric ({S,*,C}), Polynomial
   c. Verify step is loop invariant
4. Propagate SCEV to derived expressions via recursive folding

Trip Count Derivation

For a loop for i := Start; i < Limit; i += Step:

$$TripCount = \left\lceil \frac{Limit - Start}{Step} \right\rceil$$

Computed via ceiling division: (Limit - Start + Step - 1) / Step
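
The same derivation in integer form (a minimal sketch for upcounting loops with a positive step; direction and inclusive-bound normalization are handled as listed below):

package main

import "fmt"

// tripCount returns ceil((limit-start)/step) for the loop
// `for i := start; i < limit; i += step` with step > 0.
func tripCount(start, limit, step int64) int64 {
	if limit <= start {
		return 0
	}
	return (limit - start + step - 1) / step
}

func main() {
	fmt.Println(tripCount(0, 10, 1)) // 10
	fmt.Println(tripCount(0, 10, 3)) // 4 (i = 0, 3, 6, 9)
}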

The engine handles:

  • Upcounting (i < N) and downcounting (i > N) loops
  • Inclusive bounds (i <= N -- add 1 to numerator)
  • Negative steps (normalized to absolute value)
  • Multi-predecessor loop headers (validates consistent start values)
Canonicalization Engine

The canonicalizer (canonicalizer.go, 1162 LOC) transforms SSA into a deterministic string representation via five phases:

Phase 1: Loop & SCEV Analysis

c.loopInfo = DetectLoops(fn)
AnalyzeSCEV(c.loopInfo)

Phase 2: Semantic Normalization

  • Invariant Hoisting: Pure calls like len(s) are virtually moved to preheader
  • IV Virtualization: Phi nodes for IVs are replaced with SCEV notation {0, +, 1}
  • Derived IV Propagation: Expressions like i*4 become {0, +, 4} in output

Phase 3: Register Renaming

Parameters: p0, p1, p2, ...
Free Variables: fv0, fv1, ...
Instructions: v0, v1, v2, ... (DFS order)

Phase 4: Deterministic Block Ordering

Blocks are traversed in dominance-respecting DFS order, ensuring identical output regardless of SSA construction order. Successor ordering is normalized:

  • >= branches are rewritten to < with swapped successors
  • > branches are rewritten to <= with swapped successors

Phase 5: Virtual Control Flow

Branch normalization is applied virtually (no SSA mutation) via lookup tables:

virtualBlocks map[*ssa.BasicBlock]*virtualBlock  // swapped successors
virtualBinOps map[*ssa.BinOp]token.Token         // normalized operators
The Semantic Zipper

The Zipper (zipper.go, 568 LOC) computes a semantic diff between two functions -- what actually changed in behavior, ignoring cosmetic differences.

Algorithm: Parallel Graph Traversal

PHASE 0: Semantic Analysis
  - Run SCEV on both functions independently
  - Build canonicalizers for operand comparison

PHASE 1: Anchor Alignment
  - Map parameters positionally: oldFn.Params[i] <-> newFn.Params[i]
  - Map free variables if counts match
  - Seed entry block via sequential matching (critical for main())

PHASE 2: Forward Propagation (BFS on Use-Def chains)
  while queue not empty:
    (vOld, vNew) = dequeue()
    for each user uOld of vOld:
      candidates = users of vNew with matching structural fingerprint
      for uNew in candidates:
        if areEquivalent(uOld, uNew):
          map(uOld, uNew)
          enqueue((uOld, uNew))
          break

PHASE 2.5: Terminator Scavenging
  - Explicitly match Return/Panic instructions via operand equivalence
  - Handles cases where terminators aren't reached via normal propagation

PHASE 3: Divergence Isolation
  - Added = newFn instructions not in reverse map
  - Removed = oldFn instructions not in forward map

Equivalence Checking

Two instructions are equivalent iff:

  1. Same Go type (reflect.TypeOf)
  2. Same SSA value type (types.Identical)
  3. Same operation-specific properties (BinOp.Op, Field index, Alloc.Heap, etc.)
  4. All operands equivalent (recursive, with commutativity handling for ADD/MUL/AND/OR/XOR)
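
A sketch of the commutativity handling in step 4, using canonical operand names (simplified; the real check recurses through full operand graphs):

package main

import (
	"fmt"
	"go/token"
)

// binOp is a simplified stand-in for *ssa.BinOp: an operator plus the
// canonical names of its two operands.
type binOp struct {
	op   token.Token
	x, y string
}

// commutative reports whether operand order is irrelevant for op.
func commutative(op token.Token) bool {
	switch op {
	case token.ADD, token.MUL, token.AND, token.OR, token.XOR:
		return true
	}
	return false
}

// equivalent compares two binOps, retrying with swapped operands
// when the operator is commutative.
func equivalent(a, b binOp) bool {
	if a.op != b.op {
		return false
	}
	if a.x == b.x && a.y == b.y {
		return true
	}
	return commutative(a.op) && a.x == b.y && a.y == b.x
}

func main() {
	fmt.Println(equivalent(binOp{token.ADD, "v0", "v1"}, binOp{token.ADD, "v1", "v0"})) // true
	fmt.Println(equivalent(binOp{token.SUB, "v0", "v1"}, binOp{token.SUB, "v1", "v0"})) // false
}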

Structural Fingerprinting (DoS Prevention)

To prevent $O(N \times M)$ comparisons on high-fanout values, users are bucketed by structural fingerprint:

fp := fmt.Sprintf("%T:%s", instr, op)  // e.g., "*ssa.BinOp:+"
candidates := newByOp[fp]              // Only compare compatible types

Bucket size is capped at 100 to bound worst case complexity.

Security Hardening

| Threat | Mitigation |
|---|---|
| Algorithmic DoS (exponential SCEV) | Memoization cache per loop: loop.SCEVCache |
| Quadratic Zipper (5000 identical ADDs) | Fingerprint bucketing + MaxCandidates=100 |
| RCE via CGO | CGO_ENABLED=0 during packages.Load |
| SSRF via module fetch | GOPROXY=off prevents network calls |
| Stack overflow (cyclic graphs) | Visited sets in all recursive traversals |
| NaN comparison instability | Branch normalization restricted to IsInteger \| IsString types |
| IR injection (fake instructions in strings) | Struct tags and literals sanitized before hashing |
| TypeParam edge cases | Generic types excluded from branch swap (may hide floats) |
Complexity Analysis

| Operation | Time | Space |
|---|---|---|
| SSA Construction | $O(N)$ | $O(N)$ |
| Loop Detection | $O(V+E)$ | $O(V)$ |
| SCEV Analysis | $O(L \times I)$ amortized | $O(I)$ per loop |
| Canonicalization | $O(I \times \log B)$ | $O(I + B)$ |
| Zipper | $O(I^2)$ worst, $O(I)$ typical | $O(I)$ |
| Topology Extract | $O(I)$ | $O(C)$ |
| Scan (BoltDB exact) | $O(1)$ | $O(1)$ |
| Scan (fuzzy entropy) | $O(M)$ | $O(M)$ |

Where $N$ = source size, $V$ = blocks, $E$ = edges, $L$ = loops, $I$ = instructions, $B$ = blocks, $C$ = unique calls, $M$ = signatures in entropy range.

Malware Scanner Architecture

The scanner (scanner_bolt.go, 780 LOC) provides two-phase detection with ACID-compliant persistence:

Phase 1: O(1) Exact Topology Match

1. Extract FunctionTopology from target SSA function
2. Compute topology hash: SHA-256(blockCount || callProfile || controlFlowFlags)
3. BoltDB lookup: idx_topology[hash] → signature ID
4. Return exact matches with 100% topology confidence

Phase 2: O(1) Fuzzy Bucket Match (LSH-lite)

1. Compute fuzzy hash: GenerateFuzzyHash(topo) → "B3L1BR2"
2. BoltDB prefix scan: idx_fuzzy[fuzzyHash:*] → candidate IDs
3. For each candidate:
   a. Load signature from signatures bucket
   b. Verify call signature overlap
   c. Check entropy distance within tolerance
   d. Compute composite confidence score
4. Return matches above threshold

BoltDB Storage Schema

Bucket: signatures     → ID → JSON blob (full signature)
Bucket: idx_topology   → TopologyHash → ID (exact match index)
Bucket: idx_fuzzy      → FuzzyHash:ID → ID (LSH bucket index)
Bucket: idx_entropy    → "08.4321:ID" → ID (range scan index)
Bucket: meta           → version, stats, maintenance info

Entropy Key Encoding

Entropy is stored as a fixed-width key for proper lexicographic ordering:

key := fmt.Sprintf("%08.4f:%s", entropy, id)  // "005.8200:SFW-MAL-001"

This enables efficient range scans and ensures uniqueness even for identical entropy values.

False Positive Feedback Loop

The scanner supports learning from mistakes:

// Mark a signature as generating false positives
scanner.MarkFalsePositive("SFW-MAL-001", "benign crypto library")

// This appends a timestamped note to the signature's metadata:
// "FP:2026-01-12T15:04:05Z:benign crypto library"

Confidence Score Calculation

The final confidence score is computed as a weighted average:

confidence = avg(
    topologySimilarity,      // 1.0 if exact hash match
    entropyScore,            // 1.0 - (distance / tolerance)
    callMatchScore,          // len(matched) / len(required)
    stringPatternScore,      // bonus for matched patterns
)

// VETO: If ANY required call is missing, confidence = 0.0
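
In code, the averaging and veto look roughly like this (a sketch; the real component scores come from the match details computed in Phase 2):

package main

import "fmt"

// confidence averages the component scores; a missing required call
// vetoes the match outright, regardless of the other components.
func confidence(topoSim, entropy, calls, strings float64, missingRequired bool) float64 {
	if missingRequired {
		return 0.0 // VETO
	}
	return (topoSim + entropy + calls + strings) / 4
}

func main() {
	fmt.Println(confidence(1.0, 0.9, 1.0, 0.8, false)) // 0.925
	fmt.Println(confidence(1.0, 0.9, 1.0, 0.8, true))  // 0 (required call missing)
}
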
Topology Matching Algorithm

The topology matcher (topology.go, 673 LOC) enables function matching independent of names:

Feature Vector

type FunctionTopology struct {
    FuzzyHash      string              // LSH bucket key
    ParamCount     int                 // Signature: param count
    ReturnCount    int                 // Signature: return count
    BlockCount     int                 // CFG complexity
    InstrCount     int                 // Code size
    LoopCount      int                 // Iteration patterns
    BranchCount    int                 // Decision points
    PhiCount       int                 // SSA merge points
    CallSignatures map[string]int      // "net.Dial" → 2
    BinOpCounts    map[string]int      // "+" → 5
    HasDefer       bool                // Error handling
    HasRecover     bool                // Panic recovery
    HasPanic       bool                // Failure paths
    HasGo          bool                // Concurrency
    HasSelect      bool                // Channel ops
    HasRange       bool                // Iteration style
    EntropyScore   float64             // Obfuscation indicator
    EntropyProfile EntropyProfile      // Detailed entropy analysis
}

Similarity Score

Functions are compared via weighted Jaccard similarity:

$$Similarity = \frac{\sum_i w_i \cdot match_i}{\sum_i w_i}$$

Where weights prioritize:

  1. Call profile (w=3): Most discriminative feature
  2. Control flow (w=2): defer/recover/panic/select/go
  3. Metrics (w=1): Block/instruction counts within 20% tolerance
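
The weighting reduces to a few lines (a sketch; TopologySimilarity implements the real feature extraction and match predicates):

package main

import "fmt"

// feature pairs a weight with a match score in [0, 1];
// partial credit covers metrics that are close but not identical.
type feature struct {
	weight, match float64
}

// similarity computes the weighted average of per-feature matches.
func similarity(features []feature) float64 {
	var num, den float64
	for _, f := range features {
		num += f.weight * f.match
		den += f.weight
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	score := similarity([]feature{
		{3, 1.0}, // call profile: identical
		{2, 1.0}, // control-flow flags: identical
		{1, 0.8}, // block/instruction counts: within tolerance
	})
	fmt.Printf("%.2f\n", score) // 0.97
}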

Risk-Aware Diff Scoring

When comparing function versions, structural changes receive risk scores:

| Change | Risk Points |
|---|---|
| New call | +5 each |
| New loop | +10 each |
| New goroutine | +15 |
| New defer | +3 |
| New panic | +5 |
| Entropy increase >1.0 | +10 |

High cumulative risk scores flag changes that warrant extra review.
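
A sketch of how those points accumulate over a structural delta (the point values follow the table above; the delta struct itself is illustrative):

package main

import "fmt"

// delta captures structural additions between two function versions.
type delta struct {
	newCalls, newLoops, newDefers, newPanics int
	newGoroutine                             bool
	entropyIncrease                          float64
}

// riskScore applies the per-change point values from the table above.
func riskScore(d delta) int {
	score := 5*d.newCalls + 10*d.newLoops + 3*d.newDefers + 5*d.newPanics
	if d.newGoroutine {
		score += 15
	}
	if d.entropyIncrease > 1.0 {
		score += 10
	}
	return score
}

func main() {
	// Two new calls plus a new goroutine.
	fmt.Println(riskScore(delta{newCalls: 2, newGoroutine: true})) // 25
}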


License

MIT License. See LICENSE for details.


Built for the security-conscious developer

Documentation

Constants

View Source
const MaxCandidates = 100

Limits comparison candidates per fingerprint bucket. Prevents algorithmic DoS where malicious inputs with thousands of identical operations could cause O(N*M) comparisons. With this limit, the worst case becomes O(N * MaxCandidates), which is linear.

View Source
const MaxRenamerDepth = 100

Limits recursion depth in SCEV renaming to prevent stack overflow. A depth of 100 is sufficient for legitimate nested expressions while preventing malicious deeply nested chains (e.g., v1 = v2, v2 = v3, ... v10000 = C) or exponential "Billion Laughs" expansion attacks (e.g., A -> {B, +, B}).

Variables

View Source
var DefaultLiteralPolicy = LiteralPolicy{
	AbstractControlFlowComparisons: true,
	KeepSmallIntegerIndices:        true,
	KeepReturnStatusValues:         true,
	KeepStringLiterals:             false,
	SmallIntMin:                    -16,
	SmallIntMax:                    16,
	AbstractOtherTypes:             true,
}

Standard policy for fingerprinting. Preserves small integers used for indexing and status codes while masking magic numbers and large constants.

View Source
var KeepAllLiteralsPolicy = LiteralPolicy{
	AbstractControlFlowComparisons: false,
	KeepSmallIntegerIndices:        true,
	KeepReturnStatusValues:         true,
	KeepStringLiterals:             true,
	SmallIntMin:                    math.MinInt64,
	SmallIntMax:                    math.MaxInt64,
	AbstractOtherTypes:             false,
}

Designed for testing or exact matching by disabling most abstractions and expanding the "small" integer range to the full int64 spectrum.

Functions

func AnalyzeSCEV

func AnalyzeSCEV(info *LoopInfo)

Main entry point for SCEV analysis on a LoopInfo.

func BuildSSAFromPackages

func BuildSSAFromPackages(initialPkgs []*packages.Package) (*ssa.Program, *ssa.Package, error)

Constructs Static Single Assignment form from loaded Go packages. Returns the complete program and the target package for analysis.

func CalculateEntropy

func CalculateEntropy(data []byte) float64

Returns the Shannon entropy of a byte slice. Result ranges from 0.0 (completely uniform/predictable) to 8.0 (maximum randomness). High entropy (>7.0) often indicates packed/encrypted code. Normal code typically has entropy between 4.5 and 6.5.

func CalculateEntropyNormalized

func CalculateEntropyNormalized(data []byte) float64

Returns entropy normalized to 0.0-1.0 range. Useful for direct comparison and threshold checks.

func CheckIRPattern

func CheckIRPattern(t *testing.T, ir string, pattern string)

CheckIRPattern checks IR against a pattern using regex, abstracting register names. Exported for use in external test packages.

func ComputeTopologySimilarityExported

func ComputeTopologySimilarityExported(topo *FunctionTopology, sig Signature) float64

ComputeTopologySimilarityExported exports the computeTopologySimilarity function for testing.

func EntropyDistance

func EntropyDistance(e1, e2 float64) float64

Calculates the absolute difference between two entropy values. Used for fuzzy matching: two functions with similar entropy are more likely related.

func EntropyMatch

func EntropyMatch(e1, e2, tolerance float64) bool

Returns true if two entropy values are within the given tolerance. Default tolerance of 0.5 is recommended for malware family matching.

func FormatEntropyKeyExported

func FormatEntropyKeyExported(entropy float64, id string) string

FormatEntropyKeyExported exports the formatEntropyKey function for testing.

func GenerateFuzzyHash

func GenerateFuzzyHash(t *FunctionTopology) string

GenerateFuzzyHash creates a locality-sensitive hash for bucket indexing. Buckets: Blocks (log2), Loops (exact/capped), Branches (log2).

func GenerateTopologyHashExported

func GenerateTopologyHashExported(topo *FunctionTopology) string

GenerateTopologyHashExported exports the generateTopologyHash function for testing.

func GetFunctionNames

func GetFunctionNames(results []FingerprintResult) []string

GetFunctionNames extracts function names from results for easier verification. Exported for use in external test packages.

func MatchCallsExported

func MatchCallsExported(topo *FunctionTopology, required []string) (score float64, matched, missing []string)

MatchCallsExported exports the matchCalls function for testing.

func MatchFunctionsByTopology

func MatchFunctionsByTopology(oldResults, newResults []FingerprintResult, threshold float64) (
	matched []TopologyMatch,
	addedFuncs []FingerprintResult,
	removedFuncs []FingerprintResult,
)

Performs topology based function matching between two sets of fingerprint results. This is the "unobfuscator" that finds renamed functions.

Strategy:

  1. First, try to match by exact name (preserves intentional naming)
  2. For unmatched functions, compute a topology similarity matrix
  3. Use greedy matching to pair functions by structural similarity
  4. Report matches above a confidence threshold

func ReleaseCanonicalizer

func ReleaseCanonicalizer(c *Canonicalizer)

func SetupTestEnv

func SetupTestEnv(t *testing.T, dirPrefix string) (string, func())

SetupTestEnv creates an isolated test environment for packages loader. Exported for use in external test packages.

func ShortFuncName

func ShortFuncName(fullName string) string

ShortFuncName returns the short function name without package prefix. Exported for use in external test packages.

func TopologyFingerprint

func TopologyFingerprint(t *FunctionTopology) string

Generates a short structural fingerprint for display purposes. This is a human readable summary of the function's shape.

func TopologySimilarity

func TopologySimilarity(a, b *FunctionTopology) float64

Computes a similarity score between two function topologies. Returns a value between 0.0 (completely different) and 1.0 (identical structure).

Types

type BoltScanner

type BoltScanner struct {
	// contains filtered or unexported fields
}

Performs semantic malware detection using BoltDB for persistent storage. Supports O(1) exact topology matching and O(M) fuzzy entropy range scans.

func NewBoltScanner

func NewBoltScanner(dbPath string, opts BoltScannerOptions) (*BoltScanner, error)

Opens or creates a BoltDB backed signature database. The database file will be created if it doesn't exist.

func (*BoltScanner) AddSignature

func (s *BoltScanner) AddSignature(sig Signature) error

Atomically saves a signature and updates all indexes. Safe for concurrent use.

func (*BoltScanner) AddSignatures

func (s *BoltScanner) AddSignatures(sigs []Signature) error

Atomically adds multiple signatures in a single transaction. Much faster than calling AddSignature in a loop for bulk imports.

func (*BoltScanner) Close

func (s *BoltScanner) Close() error

Flushes all pending writes and closes the database. Always call this when done to prevent data loss.

func (*BoltScanner) Compact

func (s *BoltScanner) Compact(destPath string) error

Forces a compaction of the database file to reclaim space. BoltDB doesn't automatically shrink, so call this after large deletions.

func (*BoltScanner) CountSignatures

func (s *BoltScanner) CountSignatures() (int, error)

Returns the total number of signatures in the database.

func (*BoltScanner) DeleteSignature

func (s *BoltScanner) DeleteSignature(id string) error

Removes a signature and its index entries.

func (*BoltScanner) ExportToJSON

func (s *BoltScanner) ExportToJSON(jsonPath string) error

Exports all signatures to a JSON file (backup/compatibility).

func (*BoltScanner) GetSignature

func (s *BoltScanner) GetSignature(id string) (*Signature, error)

Retrieves a single signature by ID.

func (*BoltScanner) GetSignatureByTopology

func (s *BoltScanner) GetSignatureByTopology(topoHash string) (*Signature, error)

Retrieves a signature by its topology hash.

func (*BoltScanner) ListSignatureIDs

func (s *BoltScanner) ListSignatureIDs() ([]string, error)

Returns all signature IDs in the database.

func (*BoltScanner) MarkFalsePositive

func (s *BoltScanner) MarkFalsePositive(id string, notes string) error

Updates a signature to record that it caused a false positive. Enables learning feedback loops without rewriting the entire database.

func (*BoltScanner) MigrateFromJSON

func (s *BoltScanner) MigrateFromJSON(jsonPath string) (int, error)

Imports signatures from a legacy JSON database file. One time migration utility.

func (*BoltScanner) RebuildIndexes

func (s *BoltScanner) RebuildIndexes() error

Rebuilds all secondary indexes from the master signatures bucket. Use this to recover from index corruption or after manual edits.

func (*BoltScanner) ScanTopology

func (s *BoltScanner) ScanTopology(topo *FunctionTopology, funcName string) []ScanResult

Checks a function topology against the signature database using two phases:

  • Phase A (O(1)): Exact topology hash lookup
  • Phase B (O(1)): Fuzzy bucket index lookup (LSH-lite)

func (*BoltScanner) ScanTopologyExact

func (s *BoltScanner) ScanTopologyExact(topo *FunctionTopology, funcName string) *ScanResult

Performs only exact topology hash matching (fastest). Use this when you only want exact matches without fuzzy entropy scanning.

func (*BoltScanner) SetEntropyTolerance

func (s *BoltScanner) SetEntropyTolerance(tolerance float64)

Updates the entropy fuzzy match window.

func (*BoltScanner) SetThreshold

func (s *BoltScanner) SetThreshold(threshold float64)

Updates the minimum confidence threshold for alerts.

func (*BoltScanner) Stats

func (s *BoltScanner) Stats() (*BoltScannerStats, error)

type BoltScannerOptions

type BoltScannerOptions struct {
	MatchThreshold   float64       // Minimum confidence for alerts (default: 0.75)
	EntropyTolerance float64       // Entropy fuzzy match window (default: 0.5)
	Timeout          time.Duration // DB open timeout (default: 5s)
	ReadOnly         bool          // Open DB in read-only mode for scanning only
}

Configures the BoltScanner initialization.

func DefaultBoltScannerOptions

func DefaultBoltScannerOptions() BoltScannerOptions

Returns sensible defaults for production use.

type BoltScannerStats

type BoltScannerStats struct {
	SignatureCount   int
	TopoIndexCount   int
	EntropyIndexSize int64
	FileSize         int64
}

Returns database statistics for monitoring.

type Canonicalizer

type Canonicalizer struct {
	Policy     LiteralPolicy
	StrictMode bool
	// contains filtered or unexported fields
}

Transforms an SSA function into a deterministic string representation.

func AcquireCanonicalizer

func AcquireCanonicalizer(policy LiteralPolicy) *Canonicalizer

func NewCanonicalizer

func NewCanonicalizer(policy LiteralPolicy) *Canonicalizer

func (*Canonicalizer) ApplyVirtualControlFlowFromState

func (c *Canonicalizer) ApplyVirtualControlFlowFromState(swappedBlocks map[*ssa.BasicBlock]bool, virtualBinOps map[*ssa.BinOp]token.Token)

func (*Canonicalizer) CanonicalizeFunction

func (c *Canonicalizer) CanonicalizeFunction(fn *ssa.Function) string

type ControlFlowHints

type ControlFlowHints struct {
	HasInfiniteLoop   bool `json:"has_infinite_loop,omitempty"`
	HasReconnectLogic bool `json:"has_reconnect_logic,omitempty"`
}

Captures control flow patterns.

type EntropyClass

type EntropyClass int

Categorizes entropy levels for quick analysis.

const (
	EntropyLow    EntropyClass = iota // < 4.0: Simple/sparse code
	EntropyNormal                     // 4.0-6.5: Typical compiled code
	EntropyHigh                       // 6.5-7.5: Potentially obfuscated
	EntropyPacked                     // > 7.5: Likely packed/encrypted
)

func ClassifyEntropy

func ClassifyEntropy(entropy float64) EntropyClass

Determines the entropy class from a raw entropy value.

func (EntropyClass) String

func (c EntropyClass) String() string

type EntropyProfile

type EntropyProfile struct {
	// Overall entropy of the function body
	Overall float64

	// Entropy of string literals within the function
	StringLiteralEntropy float64

	// Entropy classification
	Classification EntropyClass
}

Captures entropy characteristics for malware analysis.

func CalculateEntropyProfile

func CalculateEntropyProfile(bodyBytes []byte, stringLiterals []string) EntropyProfile

Builds a complete entropy profile for analysis.

type FingerprintResult

type FingerprintResult struct {
	FunctionName string
	Fingerprint  string
	CanonicalIR  string
	Pos          token.Pos
	Line         int
	Filename     string
	// contains filtered or unexported fields
}

Holds everything we learned from fingerprinting a single function: the hash, the canonical IR that produced it, and the source location for traceability.

func CompileAndGetFunction

func CompileAndGetFunction(t *testing.T, src, funcName string) *FingerprintResult

CompileAndGetFunction is a helper to compile source and get a named SSA function. Exported for use in external test packages.

func FindResult

func FindResult(results []FingerprintResult, name string) *FingerprintResult

FindResult searches for a FingerprintResult by function name. It supports both exact matches and suffix matches (e.g., "functionName" matches "pkg.functionName"). Exported for use in external test packages.

func FingerprintPackages

func FingerprintPackages(initialPkgs []*packages.Package, policy LiteralPolicy, strictMode bool) ([]FingerprintResult, error)

Walks the loaded packages, builds SSA, and generates fingerprint results for every function we find. Handles methods, closures, and init functions.

func FingerprintSource

func FingerprintSource(filename string, src string, policy LiteralPolicy) ([]FingerprintResult, error)

Analyzes a single Go source file provided as a string. Primary entry point for verifying code snippets or patch hunks.

func FingerprintSourceAdvanced

func FingerprintSourceAdvanced(filename string, src string, policy LiteralPolicy, strictMode bool) ([]FingerprintResult, error)

Extended interface for source analysis that exposes strict mode control.

func GenerateFingerprint

func GenerateFingerprint(fn *ssa.Function, policy LiteralPolicy, strictMode bool) FingerprintResult

Produces the SHA256 hash and canonical string representation for an SSA function. Pulls a Canonicalizer from the pool to keep allocations low and throughput high.

func (FingerprintResult) GetSSAFunction

func (r FingerprintResult) GetSSAFunction() *ssa.Function

Exposes the underlying SSA function for consumers that need deeper analysis, like semantic diffing with the Zipper algorithm. Returns nil if unavailable.

type FunctionTopology

type FunctionTopology struct {
	// Fuzzy Hash for Bucket Indexing (LSH-lite)
	// Used for O(1) candidate retrieval in large databases.
	FuzzyHash string

	// Basic metrics
	ParamCount  int
	ReturnCount int
	BlockCount  int
	InstrCount  int
	LoopCount   int
	BranchCount int // if statements
	PhiCount    int

	// Call profile: map of "package.func" or "method" -> count
	CallSignatures map[string]int

	// Type signature (normalized)
	ParamTypes  []string
	ReturnTypes []string

	// Control flow features
	HasDefer   bool
	HasRecover bool
	HasPanic   bool
	HasGo      bool
	HasSelect  bool
	HasRange   bool

	// Operator profile
	BinOpCounts map[string]int
	UnOpCounts  map[string]int

	// String literal hashes (for behavioral matching)
	StringLiterals []string

	// Entropy analysis for obfuscation detection
	EntropyScore   float64        // Shannon entropy of function body (0.0-8.0)
	EntropyProfile EntropyProfile // Full entropy analysis
	// contains filtered or unexported fields
}

Captures the structural "shape" of a function independent of names. This enables matching functions that have been renamed or obfuscated.

func ExtractTopology

func ExtractTopology(fn *ssa.Function) *FunctionTopology

Analyzes an SSA function and extracts its structural features.

type IVType

type IVType int
const (
	IVTypeUnknown    IVType = iota
	IVTypeBasic             // {S, +, C}
	IVTypeDerived           // Affine: A * IV + B
	IVTypeGeometric         // {S, *, C}
	IVTypePolynomial        // Step is another IV
)

type IdentifyingFeatures

type IdentifyingFeatures struct {
	RequiredCalls  []string          `json:"required_calls,omitempty"`
	OptionalCalls  []string          `json:"optional_calls,omitempty"`
	StringPatterns []string          `json:"string_patterns,omitempty"`
	ControlFlow    *ControlFlowHints `json:"control_flow,omitempty"`
}

Captures behavioral markers for detection.

type InductionVariable

type InductionVariable struct {
	Phi   *ssa.Phi
	Type  IVType
	Start SCEV // Value at iteration 0
	Step  SCEV // Update stride
}

Describes a detected IV. Reference: Section 3.2 Classification Taxonomy.

type LiteralPolicy

type LiteralPolicy struct {
	AbstractControlFlowComparisons bool
	KeepSmallIntegerIndices        bool
	KeepReturnStatusValues         bool
	KeepStringLiterals             bool
	SmallIntMin                    int64
	SmallIntMax                    int64
	AbstractOtherTypes             bool
}

Defines the configurable strategy for determining which literal values should be abstracted into placeholders during canonicalization. Allows fine grained control over integer abstraction in different contexts.

func (*LiteralPolicy) ShouldAbstract

func (p *LiteralPolicy) ShouldAbstract(c *ssa.Const, usageContext ssa.Instruction) bool

Decides whether a given constant should be replaced by a generic placeholder. Analyzes the constant's type, value, and immediate usage context in the SSA graph.

type Loop

type Loop struct {
	Header *ssa.BasicBlock
	Latch  *ssa.BasicBlock // Primary source of the backedge

	// Blocks contains all basic blocks within the loop body.
	Blocks map[*ssa.BasicBlock]bool
	// Exits contains blocks inside the loop that have successors outside.
	Exits []*ssa.BasicBlock

	// Hierarchy
	Parent   *Loop
	Children []*Loop

	// Semantic Analysis (populated in scev.go)
	Inductions map[*ssa.Phi]*InductionVariable
	TripCount  SCEV // Symbolic expression

	// Memoization cache for SCEV analysis to prevent exponential complexity.
	SCEVCache map[ssa.Value]SCEV
}

Represents a natural loop in the SSA graph. Reference: Section 2.3 Natural Loops.

func (*Loop) String

func (l *Loop) String() string

type LoopInfo

type LoopInfo struct {
	Function *ssa.Function
	Loops    []*Loop // Top-level loops (roots of the hierarchy)
	// Map from Header block to Loop object for O(1) lookup
	LoopMap map[*ssa.BasicBlock]*Loop
}

Summarizes loop analysis for a single function.

func DetectLoops

func DetectLoops(fn *ssa.Function) *LoopInfo

Reconstructs the loop hierarchy using dominance relations. Reference: Section 2.3.1 Algorithm: Detecting Natural Loops.

type MatchDetails

type MatchDetails struct {
	TopologyMatch      bool     `json:"topology_match"`
	EntropyMatch       bool     `json:"entropy_match"`
	CallsMatched       []string `json:"calls_matched"`
	CallsMissing       []string `json:"calls_missing"`
	StringsMatched     []string `json:"strings_matched"`
	TopologySimilarity float64  `json:"topology_similarity"`
	EntropyDistance    float64  `json:"entropy_distance"`
}

Provides granular information about the match.

type Renamer

type Renamer func(ssa.Value) string

Maps an SSA value to its canonical name. Ensures deterministic output regardless of SSA register naming.

type SCEV

type SCEV interface {
	ssa.Value
	EvaluateAt(k *big.Int) *big.Int
	IsLoopInvariant(loop *Loop) bool
	String() string
	// Returns a canonical string using the provided renamer function to map
	// SSA values to their canonical names (e.g., v0, v1). Critical for determinism:
	// without it, raw SSA names (t0, t1) would leak into fingerprints, breaking
	// semantic equivalence.
	StringWithRenamer(r Renamer) string
}

Represents a scalar expression in the SCEV lattice.

type SCEVAddRec

type SCEVAddRec struct {
	Start SCEV
	Step  SCEV
	Loop  *Loop
}

Represents an Add Recurrence: {Start, +, Step}_L Reference: Section 4.1 The Add Recurrence Abstraction.

func (*SCEVAddRec) EvaluateAt

func (s *SCEVAddRec) EvaluateAt(k *big.Int) *big.Int

func (*SCEVAddRec) IsLoopInvariant

func (s *SCEVAddRec) IsLoopInvariant(loop *Loop) bool

func (*SCEVAddRec) Name

func (s *SCEVAddRec) Name() string

ssa.Value Stubs

func (*SCEVAddRec) Parent

func (s *SCEVAddRec) Parent() *ssa.Function

func (*SCEVAddRec) Pos

func (s *SCEVAddRec) Pos() token.Pos

func (*SCEVAddRec) Referrers

func (s *SCEVAddRec) Referrers() *[]ssa.Instruction

func (*SCEVAddRec) String

func (s *SCEVAddRec) String() string

func (*SCEVAddRec) StringWithRenamer

func (s *SCEVAddRec) StringWithRenamer(r Renamer) string

func (*SCEVAddRec) Type

func (s *SCEVAddRec) Type() types.Type

type SCEVConstant

type SCEVConstant struct {
	Value *big.Int
}

Represents a literal integer constant.

func SCEVFromConst

func SCEVFromConst(c *ssa.Const) *SCEVConstant

func (*SCEVConstant) EvaluateAt

func (s *SCEVConstant) EvaluateAt(k *big.Int) *big.Int

func (*SCEVConstant) IsLoopInvariant

func (s *SCEVConstant) IsLoopInvariant(loop *Loop) bool

func (*SCEVConstant) Name

func (s *SCEVConstant) Name() string

ssa.Value Stubs

func (*SCEVConstant) Parent

func (s *SCEVConstant) Parent() *ssa.Function

func (*SCEVConstant) Pos

func (s *SCEVConstant) Pos() token.Pos

func (*SCEVConstant) Referrers

func (s *SCEVConstant) Referrers() *[]ssa.Instruction

func (*SCEVConstant) String

func (s *SCEVConstant) String() string

func (*SCEVConstant) StringWithRenamer

func (s *SCEVConstant) StringWithRenamer(r Renamer) string

func (*SCEVConstant) Type

func (s *SCEVConstant) Type() types.Type

type SCEVGenericExpr

type SCEVGenericExpr struct {
	Op token.Token
	X  SCEV
	Y  SCEV
}

Represents binary operations like Add/Mul for formulas.

func (*SCEVGenericExpr) EvaluateAt

func (s *SCEVGenericExpr) EvaluateAt(k *big.Int) *big.Int

func (*SCEVGenericExpr) IsLoopInvariant

func (s *SCEVGenericExpr) IsLoopInvariant(loop *Loop) bool

func (*SCEVGenericExpr) Name

func (s *SCEVGenericExpr) Name() string

ssa.Value Stubs

func (*SCEVGenericExpr) Parent

func (s *SCEVGenericExpr) Parent() *ssa.Function

func (*SCEVGenericExpr) Pos

func (s *SCEVGenericExpr) Pos() token.Pos

func (*SCEVGenericExpr) Referrers

func (s *SCEVGenericExpr) Referrers() *[]ssa.Instruction

func (*SCEVGenericExpr) String

func (s *SCEVGenericExpr) String() string

func (*SCEVGenericExpr) StringWithRenamer

func (s *SCEVGenericExpr) StringWithRenamer(r Renamer) string

func (*SCEVGenericExpr) Type

func (s *SCEVGenericExpr) Type() types.Type

type SCEVUnknown

type SCEVUnknown struct {
	Value       ssa.Value
	IsInvariant bool // Explicitly tracks invariance relative to the analysis loop scope
}

Represents a symbolic value (e.g., parameter or unanalyzable instruction).

func (*SCEVUnknown) EvaluateAt

func (s *SCEVUnknown) EvaluateAt(k *big.Int) *big.Int

func (*SCEVUnknown) IsLoopInvariant

func (s *SCEVUnknown) IsLoopInvariant(loop *Loop) bool

func (*SCEVUnknown) Name

func (s *SCEVUnknown) Name() string

ssa.Value Stubs

func (*SCEVUnknown) Parent

func (s *SCEVUnknown) Parent() *ssa.Function

func (*SCEVUnknown) Pos

func (s *SCEVUnknown) Pos() token.Pos

func (*SCEVUnknown) Referrers

func (s *SCEVUnknown) Referrers() *[]ssa.Instruction

func (*SCEVUnknown) String

func (s *SCEVUnknown) String() string

func (*SCEVUnknown) StringWithRenamer

func (s *SCEVUnknown) StringWithRenamer(r Renamer) string

func (*SCEVUnknown) Type

func (s *SCEVUnknown) Type() types.Type

type ScanResult

type ScanResult struct {
	SignatureID     string       `json:"signature_id"`
	SignatureName   string       `json:"signature_name"`
	Severity        string       `json:"severity"`
	MatchedFunction string       `json:"matched_function"`
	Confidence      float64      `json:"confidence"` // 0.0 to 1.0
	MatchDetails    MatchDetails `json:"match_details"`
}

Represents a match between analyzed code and a signature.

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Performs semantic malware detection.

func NewScanner

func NewScanner() *Scanner

Creates a new scanner instance.

func (*Scanner) AddSignature

func (s *Scanner) AddSignature(sig Signature)

Adds a new signature to the database.

func (*Scanner) GetDatabase

func (s *Scanner) GetDatabase() *SignatureDatabase

Returns the current signature database.

func (*Scanner) LoadDatabase

func (s *Scanner) LoadDatabase(path string) error

Loads signatures from a JSON file.

func (*Scanner) SaveDatabase

func (s *Scanner) SaveDatabase(path string) error

Writes the signature database to a JSON file.

func (*Scanner) ScanTopology

func (s *Scanner) ScanTopology(topo *FunctionTopology, funcName string) []ScanResult

Checks a function topology against all signatures. This is the "Hunter Phase" where we scan unknown code for matches.

func (*Scanner) SetThreshold

func (s *Scanner) SetThreshold(threshold float64)

Sets the minimum confidence threshold for alerts.

type Signature

type Signature struct {
	ID                  string              `json:"id"`
	Name                string              `json:"name"`
	Description         string              `json:"description"`
	Severity            string              `json:"severity"`
	Category            string              `json:"category"`
	TopologyHash        string              `json:"topology_hash"`
	FuzzyHash           string              `json:"fuzzy_hash,omitempty"` // REMEDIATION: LSH bucket
	EntropyScore        float64             `json:"entropy_score"`
	EntropyTolerance    float64             `json:"entropy_tolerance"`
	NodeCount           int                 `json:"node_count"`
	LoopDepth           int                 `json:"loop_depth"`
	IdentifyingFeatures IdentifyingFeatures `json:"identifying_features"`
	Metadata            SignatureMetadata   `json:"metadata"`
}

Represents a single malware signature entry.

func IndexFunction

func IndexFunction(topo *FunctionTopology, name, description, severity, category string) Signature

Generates a signature entry from a FunctionTopology. This is the "Lab Phase" where we analyze known malware to build the database.

type SignatureDatabase

type SignatureDatabase struct {
	Version     string      `json:"version"`
	Description string      `json:"description"`
	Signatures  []Signature `json:"signatures"`
}

Represents the malware signature database.

type SignatureMetadata

type SignatureMetadata struct {
	Author     string   `json:"author"`
	Created    string   `json:"created"`
	References []string `json:"references,omitempty"`
}

Contains provenance information.

type TopologyMatch

type TopologyMatch struct {
	OldResult   FingerprintResult
	NewResult   FingerprintResult
	OldTopology *FunctionTopology
	NewTopology *FunctionTopology
	Similarity  float64
	ByName      bool // true if matched by name, false if by topology
}

Represents a potential function pairing with a confidence score.

type Zipper

type Zipper struct {
	// contains filtered or unexported fields
}

Implements the semantic delta analysis algorithm. Walks the use def chains of two functions in parallel, aligning equivalent nodes and isolating divergence.

func NewZipper

func NewZipper(oldFn, newFn *ssa.Function, policy LiteralPolicy) (*Zipper, error)

Creates a new analysis session between two function versions.

func (*Zipper) ComputeDiff

func (z *Zipper) ComputeDiff() (*ZipperArtifacts, error)

Runs through all four phases of the Zipper algorithm: semantic analysis, anchor alignment, forward propagation, and divergence isolation.

type ZipperArtifacts

type ZipperArtifacts struct {
	OldFunction  string
	NewFunction  string
	MatchedNodes int
	Added        []string
	Removed      []string
	Preserved    bool
}

Output from the semantic delta analysis. Shows what instructions were added, removed, or matched between two function versions.

Directories

Path Synopsis
cmd
sfw command
Package main provides the sfw CLI tool for semantic fingerprinting of Go source files.
v1 command
v2 command
samples
clean command
dirty command
shuffled command
