bigquery

package
v0.13.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 19, 2026 License: Apache-2.0 Imports: 28 Imported by: 0

README

BigQuery

Extract table metadata, schema, statistics, and lineage from Google BigQuery.

Usage

source:
  name: bigquery
  config:
    project_id: google-project-id
    table_pattern: gofood.fact_
    max_preview_rows: 3
    exclude:
      datasets:
        - dataset_a
        - dataset_b
      tables:
        - dataset_c.table_a
      labels:
        env: staging
    max_page_size: 100
    include_column_profile: true
    build_view_lineage: true
    # Only one of service_account_base64 / service_account_json is needed.
    # If both are present, service_account_base64 takes precedence.
    service_account_base64: ____base64_encoded_service_account____
    service_account_json: |-
      {
        "type": "service_account",
        "private_key_id": "xxxxxxx",
        "private_key": "xxxxxxx",
        "client_email": "xxxxxxx",
        "client_id": "xxxxxxx",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "xxxxxxx",
        "client_x509_cert_url": "xxxxxxx"
      }
    collect_table_usage: false
    usage_period_in_day: 7
    usage_project_ids:
      - google-project-id
      - other-google-project-id
    concurrency: 10

Configuration

Key Value Example Description
project_id string my-project BigQuery Project ID required
service_account_base64 string ____BASE64____ Base64-encoded service account JSON. Takes precedence over service_account_json optional
service_account_json string {"private_key": ...} Service account credentials as a JSON string optional
table_pattern string gofood.fact_ Regex pattern to whitelist tables to extract optional
exclude.datasets []string [dataset_a] Dataset IDs to exclude optional
exclude.tables []string [dataset_c.table_a] Table names in datasetID.tableID format to exclude optional
exclude.labels map[string]string {env: staging} Tables with any matching label key-value pair are excluded optional
max_page_size int 100 Page size hint for BigQuery API list calls optional
dataset_page_size int 10 Page size for listing datasets. Falls back to max_page_size optional
table_page_size int 50 Page size for listing tables. Falls back to max_page_size optional
include_column_profile bool true Profile each column (min, max, avg, med, unique, count, top) optional
max_preview_rows int 30 Number of preview rows to fetch. 0 skips preview, -1 omits the key entirely. Default 30 optional
mix_values bool false Shuffle column values across preview rows for privacy. Default false optional
build_view_lineage bool true Parse view SQL to extract upstream lineage edges. Default false optional
collect_table_usage bool false Collect table usage statistics from BigQuery audit logs. Default false optional
usage_period_in_day int 7 Number of days of audit logs to scan. Default 7 optional
usage_project_ids []string [my-project] GCP project IDs to scan for audit logs. Defaults to project_id optional
concurrency int 10 Number of tables to process concurrently. Default 10 optional
Notes
  • Leaving service_account_json and service_account_base64 blank defaults to Google Application Default Credentials. Recommended when Meteor runs inside the same GCP environment.
  • The service account needs the bigquery.privateLogsViewer role to collect audit logs.

Entities

Entity: table

URN format: urn:bigquery:{project_id}:table:{project_id}:{dataset_id}.{table_id}

Produced by plugins.BigQueryURN(projectID, datasetID, tableID).

Property Type Description
entity.description string Table description from BigQuery metadata
entity.properties.full_qualified_name string Fully qualified table name (project.dataset.table)
entity.properties.dataset string Dataset ID
entity.properties.project string Project ID
entity.properties.type string BigQuery table type (TABLE, VIEW, MATERIALIZED_VIEW, etc.)
entity.properties.partition_data object Partition configuration (see below)
entity.properties.clustering_fields []string Fields the table is clustered on
entity.properties.sql string View SQL query (views and materialized views only)
entity.properties.columns []object Column schema (see below)
entity.properties.profile object Table-level usage profile (see below)
entity.properties.create_time string Table creation timestamp (RFC 3339)
entity.properties.update_time string Last modification timestamp (RFC 3339)
entity.properties.preview_fields []string Column names for preview rows
entity.properties.preview_rows []array Sample data rows
entity.properties.labels map[string]string BigQuery table labels
Partition Data (entity.properties.partition_data)
Key Type Description
partition_field string Partition column. Defaults to _PARTITIONTIME for time partitioning with no explicit field
require_partition_filter bool Whether queries must filter on the partition column
time_partition.partition_by string Time partition granularity: HOUR, DAY, MONTH, YEAR
time_partition.partition_expire_seconds float Seconds until partition data expires. 0 means no expiry
range_partition.start int Range partition start (inclusive)
range_partition.end int Range partition end (exclusive)
range_partition.interval int Range partition interval width
Column (entity.properties.columns[])
Key Type Description
name string Column name
data_type string BigQuery data type (STRING, INTEGER, RECORD, etc.)
description string Column description
is_nullable bool Whether the column is nullable
mode string Column mode: NULLABLE, REQUIRED, or REPEATED
policy_tags []string Data Catalog policy tags in taxonomy:tag:resource format
columns []object Nested columns (for RECORD type)
profile object Column profile with min, max, avg, med, unique, count, top (when include_column_profile is enabled)
Profile (entity.properties.profile)

Populated when collect_table_usage is enabled.

Key Type Description
total_rows int Number of rows in the table
usage_count int Number of times the table was queried in the audit log window
common_joins []object Tables commonly joined with this one. Each entry has urn, count, and conditions
filters []string WHERE clause expressions found in queries against this table

Edges

Type Source Target Description
derived_from upstream table URN this table URN Upstream dependency parsed from view SQL. Emitted when build_view_lineage is enabled and the table is a view or materialized view

Contributing

Refer to the contribution guidelines for information on contributing to this module.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CreateClient added in v0.9.1

func CreateClient(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

CreateClient creates a bigquery client

func IsExcludedByLabels added in v0.13.0

func IsExcludedByLabels(tableLabels, excludeLabels map[string]string) bool

isExcludedByLabels returns true if the table's labels match any of the configured exclude labels. A match means the table has the same key with the same value.

func IsExcludedDataset

func IsExcludedDataset(datasetID string, excludedDatasets []string) bool

func IsExcludedTable

func IsExcludedTable(datasetID, tableID string, excludedTables []string) bool

Types

type Config

type Config struct {
	ProjectID string `json:"project_id" yaml:"project_id" mapstructure:"project_id" validate:"required"`
	// ServiceAccountBase64 takes precedence over ServiceAccountJSON field
	ServiceAccountBase64 string  `mapstructure:"service_account_base64"`
	ServiceAccountJSON   string  `mapstructure:"service_account_json"`
	MaxPageSize          int     `mapstructure:"max_page_size"`
	DatasetPageSize      int     `mapstructure:"dataset_page_size"`
	TablePageSize        int     `mapstructure:"table_page_size"`
	TablePattern         string  `mapstructure:"table_pattern"`
	Exclude              Exclude `mapstructure:"exclude"`
	IncludeColumnProfile bool    `mapstructure:"include_column_profile"`
	// MaxPreviewRows can also be set to -1 to restrict adding preview_rows key in entity properties
	MaxPreviewRows      int      `mapstructure:"max_preview_rows" default:"30"`
	MixValues           bool     `mapstructure:"mix_values" default:"false"`
	IsCollectTableUsage bool     `mapstructure:"collect_table_usage" default:"false"`
	UsagePeriodInDay    int64    `mapstructure:"usage_period_in_day" default:"7"`
	UsageProjectIDs     []string `mapstructure:"usage_project_ids"`
	BuildViewLineage    bool     `mapstructure:"build_view_lineage" default:"false"`
	Concurrency         int      `mapstructure:"concurrency" default:"10"`
}

Config holds the set of configuration for the bigquery extractor

type Exclude

type Exclude struct {
	// list of datasetIDs
	Datasets []string `json:"datasets" yaml:"datasets" mapstructure:"datasets"`
	// list of tableNames in format - datasetID.tableID
	Tables []string `json:"tables" yaml:"tables" mapstructure:"tables"`
	// list of label key-value pairs; tables matching any label are excluded
	Labels map[string]string `json:"labels" yaml:"labels" mapstructure:"labels"`
}

type Extractor

type Extractor struct {
	plugins.BaseExtractor
	// contains filtered or unexported fields
}

Extractor manages the communication with the bigquery service

func New

func New(logger log.Logger, newClient NewClientFunc, randFn randFn) *Extractor

func (*Extractor) Extract

func (e *Extractor) Extract(ctx context.Context, emit plugins.Emit) error

Extract checks if the table is valid and extracts the table schema

func (*Extractor) Init

func (e *Extractor) Init(ctx context.Context, config plugins.Config) error

Init initializes the extractor

type NewClientFunc added in v0.9.1

type NewClientFunc func(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL