txn2/mcp-datahub


An MCP server and composable Go library that connect AI assistants to DataHub metadata catalogs. Search datasets, explore schemas, trace lineage, and access glossary terms and domains.

Documentation | Installation | Library Docs

Two Ways to Use

1. Standalone MCP Server

Install and connect to Claude Desktop, Cursor, or any MCP client:

Claude Desktop (Easiest) - Download the .mcpb bundle from releases and double-click to install:

  • macOS Apple Silicon: mcp-datahub_X.X.X_darwin_arm64.mcpb
  • macOS Intel: mcp-datahub_X.X.X_darwin_amd64.mcpb
  • Windows: mcp-datahub_X.X.X_windows_amd64.mcpb

Other Installation Methods:

# Homebrew (macOS)
brew install txn2/tap/mcp-datahub

# Go install
go install github.com/txn2/mcp-datahub/cmd/mcp-datahub@latest

Manual Claude Desktop Configuration (if not using MCPB):

{
  "mcpServers": {
    "datahub": {
      "command": "/opt/homebrew/bin/mcp-datahub",
      "env": {
        "DATAHUB_URL": "https://datahub.example.com",
        "DATAHUB_TOKEN": "your_token"
      }
    }
  }
}

Multi-Server Configuration

Connect to multiple DataHub instances simultaneously:

# Primary server
export DATAHUB_URL=https://prod.datahub.example.com/api/graphql
export DATAHUB_TOKEN=prod-token
export DATAHUB_CONNECTION_NAME=prod

# Additional servers (JSON)
export DATAHUB_ADDITIONAL_SERVERS='{"staging":{"url":"https://staging.datahub.example.com/api/graphql","token":"staging-token"}}'

Use datahub_list_connections to discover available connections, then pass the connection parameter to any tool.
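
For example, a search routed to the staging instance would pass tool arguments along these lines (the query parameter name is illustrative; connection matches a configured server name):

{
  "name": "datahub_search",
  "arguments": {
    "query": "orders",
    "connection": "staging"
  }
}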

2. Composable Go Library

Import into your own MCP server for custom authentication, tenant isolation, and audit logging:

import (
    "github.com/txn2/mcp-datahub/pkg/client"
    "github.com/txn2/mcp-datahub/pkg/tools"
)

// Create client and register tools with your MCP server
datahubClient, _ := client.NewFromEnv()
defer datahubClient.Close()

toolkit := tools.NewToolkit(datahubClient, tools.Config{})
toolkit.RegisterAll(yourMCPServer)

Customizing Tool Descriptions

Override tool descriptions to match your deployment:

toolkit := tools.NewToolkit(datahubClient, tools.Config{},
    tools.WithDescriptions(map[tools.ToolName]string{
        tools.ToolSearch: "Search our internal data catalog for datasets and dashboards",
    }),
)

Customizing Tool Annotations

Override MCP tool annotations (behavior hints for AI clients):

toolkit := tools.NewToolkit(datahubClient, tools.Config{},
    tools.WithAnnotations(map[tools.ToolName]*mcp.ToolAnnotations{
        tools.ToolSearch: {ReadOnlyHint: true, OpenWorldHint: boolPtr(true)},
    }),
)
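
The example above assumes mcp is the MCP Go SDK package that defines ToolAnnotations, and boolPtr is a small helper along these lines:

// boolPtr returns a pointer to b, for optional annotation hint fields.
func boolPtr(b bool) *bool { return &b }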

All 19 tools ship with default annotations: read tools are marked ReadOnlyHint: true, write tools are marked DestructiveHint: false and IdempotentHint: true.

Extensions (Logging, Metrics, Error Hints)

Enable optional middleware via the extensions package:

import "github.com/txn2/mcp-datahub/pkg/extensions"

// Load from environment variables (MCP_DATAHUB_EXT_*)
cfg := extensions.FromEnv()
opts := extensions.BuildToolkitOptions(cfg)
toolkit := tools.NewToolkit(datahubClient, toolsCfg, opts...)

// Or load from a YAML/JSON config file
serverCfg, _ := extensions.LoadConfig("config.yaml")
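
For example, the environment-driven path can be toggled with the variables listed in the Extensions table under Configuration:

export MCP_DATAHUB_EXT_LOGGING=true
export MCP_DATAHUB_EXT_METRICS=true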

See the library documentation for middleware, selective tool registration, and enterprise patterns.

Combining with mcp-trino

Build a unified data platform MCP server by combining DataHub metadata with Trino query execution:

import (
    datahubClient "github.com/txn2/mcp-datahub/pkg/client"
    datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
    trinoClient "github.com/txn2/mcp-trino/pkg/client"
    trinoTools "github.com/txn2/mcp-trino/pkg/tools"
)

// Add DataHub tools (search, lineage, schema, glossary)
dh, _ := datahubClient.NewFromEnv()
datahubTools.NewToolkit(dh, datahubTools.Config{}).RegisterAll(server)

// Add Trino tools (query execution, catalog browsing)
tr, _ := trinoClient.NewFromEnv()
trinoTools.NewToolkit(tr, trinoTools.Config{}).RegisterAll(server)

// AI assistants can now:
// - Search DataHub for tables -> Get schema -> Query via Trino
// - Explore lineage -> Understand data flow -> Run validation queries

See txn2/mcp-trino for the companion library.

Bidirectional Integration with QueryProvider

The library supports bidirectional context injection. While mcp-trino can pull semantic context from DataHub, mcp-datahub can receive query execution context back from a query engine:

import (
    "context"

    datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
    "github.com/txn2/mcp-datahub/pkg/integration"
)

// QueryProvider enables query engines to inject context into DataHub tools
type myQueryProvider struct {
    trinoClient *trino.Client
}

func (p *myQueryProvider) Name() string { return "trino" }

func (p *myQueryProvider) ResolveTable(ctx context.Context, urn string) (*integration.TableIdentifier, error) {
    // Map DataHub URN to Trino table (catalog.schema.table)
    return &integration.TableIdentifier{
        Catalog: "hive", Schema: "production", Table: "users",
    }, nil
}

func (p *myQueryProvider) GetTableAvailability(ctx context.Context, urn string) (*integration.TableAvailability, error) {
    // Check if table is queryable
    return &integration.TableAvailability{Available: true}, nil
}

func (p *myQueryProvider) GetQueryExamples(ctx context.Context, urn string) ([]integration.QueryExample, error) {
    // Return sample queries for this entity
    return []integration.QueryExample{
        {Name: "sample", SQL: "SELECT * FROM hive.production.users LIMIT 10"},
    }, nil
}

// Wire it up
toolkit := datahubTools.NewToolkit(datahubClient, config,
    datahubTools.WithQueryProvider(&myQueryProvider{trinoClient: trino}),
)

When a QueryProvider is configured, tool responses are enriched:

  • Search results: Include query_context with table availability
  • Entity details: Include query_table, query_examples, query_availability
  • Schema: Include query_table for immediate SQL usage
  • Lineage: Include execution_context mapping URNs to tables
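
As an illustration, a search hit with a provider configured might carry an enrichment block shaped roughly like this (the layout is an assumption; only the field names listed above come from the library):

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,production.users,PROD)",
  "name": "users",
  "query_context": {
    "provider": "trino",
    "available": true,
    "table": "hive.production.users"
  }
}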

Integration Middleware

Enterprise features like access control and audit logging are enabled through middleware adapters:

import (
    "context"

    datahubTools "github.com/txn2/mcp-datahub/pkg/tools"
    "github.com/txn2/mcp-datahub/pkg/integration"
)

// Access control - filter entities by user permissions
type myAccessFilter struct{}
func (f *myAccessFilter) CanAccess(ctx context.Context, urn string) (bool, error) { /* ... */ }
func (f *myAccessFilter) FilterURNs(ctx context.Context, urns []string) ([]string, error) { /* ... */ }

// Audit logging - track all tool invocations
type myAuditLogger struct{}
func (l *myAuditLogger) LogToolCall(ctx context.Context, tool string, params map[string]any, userID string) error { /* ... */ }

// Wire up with multiple integration options
toolkit := datahubTools.NewToolkit(datahubClient, config,
    datahubTools.WithAccessFilter(&myAccessFilter{}),
    datahubTools.WithAuditLogger(&myAuditLogger{}, func(ctx context.Context) string {
        // Comma-ok assertion so a request without a user ID doesn't panic.
        if id, ok := ctx.Value("user_id").(string); ok {
            return id
        }
        return "anonymous"
    }),
    datahubTools.WithURNResolver(&myURNResolver{}),      // Map external IDs to URNs
    datahubTools.WithMetadataEnricher(&myEnricher{}),    // Add custom metadata
)

See the library documentation for complete integration patterns.

Available Tools

Read Tools (always available)

Tool                          Description
datahub_search                Search for datasets, dashboards, and pipelines by query and entity type
datahub_get_entity            Get entity metadata by URN (description, owners, tags, domain)
datahub_get_schema            Get dataset schema with field types and descriptions
datahub_get_lineage           Get upstream/downstream data lineage
datahub_get_column_lineage    Get fine-grained column-level lineage mappings
datahub_get_queries           Get SQL queries associated with a dataset
datahub_get_glossary_term     Get glossary term definition and properties
datahub_list_tags             List available tags in the catalog
datahub_list_domains          List data domains
datahub_list_data_products    List data products
datahub_get_data_product      Get data product details (owners, domain, properties)
datahub_list_connections      List configured DataHub server connections (multi-server mode)

Write Tools (require DATAHUB_WRITE_ENABLED=true)

Tool                          Description
datahub_update_description    Update the description of an entity
datahub_add_tag               Add a tag to an entity
datahub_remove_tag            Remove a tag from an entity
datahub_add_glossary_term     Add a glossary term to an entity
datahub_remove_glossary_term  Remove a glossary term from an entity
datahub_add_link              Add a link to an entity
datahub_remove_link           Remove a link from an entity

Write tools use DataHub's REST API (POST /aspects?action=ingestProposal) with read-modify-write semantics for array aspects (tags, terms, links). They are disabled by default for safety.
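
For reference, the proposal behind datahub_add_tag looks roughly like the following sketch, which mirrors DataHub's standard ingestProposal envelope; the server first reads the existing globalTags aspect so the new tag is appended rather than replacing the array:

POST /aspects?action=ingestProposal
{
  "proposal": {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,production.users,PROD)",
    "changeType": "UPSERT",
    "aspectName": "globalTags",
    "aspect": {
      "contentType": "application/json",
      "value": "{\"tags\":[{\"tag\":\"urn:li:tag:PII\"}]}"
    }
  }
}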

See the tools reference for detailed documentation.

Configuration

Variable                    Description                            Default
DATAHUB_URL                 DataHub GraphQL API URL                (required)
DATAHUB_TOKEN               API token                              (required)
DATAHUB_TIMEOUT             Request timeout (seconds)              30
DATAHUB_DEFAULT_LIMIT       Default search limit                   10
DATAHUB_MAX_LIMIT           Maximum search limit                   100
DATAHUB_CONNECTION_NAME     Display name for primary connection    datahub
DATAHUB_ADDITIONAL_SERVERS  JSON map of additional servers         (optional)
DATAHUB_WRITE_ENABLED       Enable write operations (true or 1)    false
DATAHUB_DEBUG               Enable debug logging (true or 1)       false

Extensions

Variable                  Description                               Default
MCP_DATAHUB_EXT_LOGGING   Enable structured logging of tool calls   false
MCP_DATAHUB_EXT_METRICS   Enable metrics collection                 false
MCP_DATAHUB_EXT_METADATA  Enable metadata enrichment on results     false
MCP_DATAHUB_EXT_ERRORS    Enable error hint enrichment              true

Config File

As an alternative to environment variables, configure via YAML or JSON:

datahub:
  url: https://datahub.example.com
  token: "${DATAHUB_TOKEN}"
  timeout: "30s"
  write_enabled: true

toolkit:
  default_limit: 20
  descriptions:
    datahub_search: "Custom search description for your deployment"

extensions:
  logging: true
  errors: true
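
The same settings expressed as JSON, assuming keys map one-to-one with the YAML form:

{
  "datahub": {
    "url": "https://datahub.example.com",
    "token": "${DATAHUB_TOKEN}",
    "timeout": "30s",
    "write_enabled": true
  },
  "toolkit": {
    "default_limit": 20,
    "descriptions": {
      "datahub_search": "Custom search description for your deployment"
    }
  },
  "extensions": {
    "logging": true,
    "errors": true
  }
}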

Load with extensions.LoadConfig("config.yaml"). Environment variables override file values for sensitive fields. Token values support $VAR / ${VAR} expansion.

See configuration reference for all options.

Development

make build     # Build binary
make test      # Run tests with race detection
make lint      # Run golangci-lint
make security  # Run gosec and govulncheck
make coverage  # Generate coverage report
make verify    # Run tidy, lint, and test
make help      # Show all targets

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0


Open source by Craig Johnston, sponsored by Deasil Works, Inc.

Directories

Path             Synopsis
cmd/mcp-datahub  Package main provides the mcp-datahub CLI entry point.
internal/server  Package server provides the default MCP server setup for mcp-datahub.
pkg/client       Package client provides a GraphQL client for DataHub.
pkg/extensions   Package extensions provides optional middleware and configuration for mcp-datahub.
pkg/integration  Package integration provides interfaces for extending mcp-datahub behavior.
pkg/multiserver  Package multiserver provides support for managing connections to multiple DataHub servers.
pkg/tools        Package tools provides MCP tool definitions for DataHub operations.
pkg/types        Package types defines DataHub domain types for the MCP server.
