bigquery

package

v0.13.0 (latest) Published: Apr 19, 2026 License: Apache-2.0 Imports: 28 Imported by: 0

Repository

github.com/raystack/meteor

README ¶

BigQuery

Extract table metadata, schema, statistics, and lineage from Google BigQuery.

Usage

source:
  name: bigquery
  config:
    project_id: google-project-id
    table_pattern: gofood.fact_
    max_preview_rows: 3
    exclude:
      datasets:
        - dataset_a
        - dataset_b
      tables:
        - dataset_c.table_a
      labels:
        env: staging
    max_page_size: 100
    include_column_profile: true
    build_view_lineage: true
    # Only one of service_account_base64 / service_account_json is needed.
    # If both are present, service_account_base64 takes precedence.
    service_account_base64: ____base64_encoded_service_account____
    service_account_json: |-
      {
        "type": "service_account",
        "private_key_id": "xxxxxxx",
        "private_key": "xxxxxxx",
        "client_email": "xxxxxxx",
        "client_id": "xxxxxxx",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "xxxxxxx",
        "client_x509_cert_url": "xxxxxxx"
      }
    collect_table_usage: false
    usage_period_in_day: 7
    usage_project_ids:
      - google-project-id
      - other-google-project-id
    concurrency: 10
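
The `table_pattern` value above is a regular expression. The following is a minimal sketch of how such a filter could behave, not the extractor's actual code. The helper name `matchesTablePattern` is hypothetical, and it assumes the pattern is applied unanchored to the string `datasetID.tableID` and that an empty pattern allows every table.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesTablePattern reports whether a table passes the table_pattern filter.
// Assumption: the pattern is an unanchored regex tested against
// "datasetID.tableID"; an empty pattern lets every table through.
func matchesTablePattern(pattern, datasetID, tableID string) (bool, error) {
	if pattern == "" {
		return true, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, fmt.Errorf("compile table_pattern: %w", err)
	}
	return re.MatchString(datasetID + "." + tableID), nil
}

func main() {
	names := [][2]string{{"gofood", "fact_orders"}, {"gofood", "dim_users"}}
	for _, n := range names {
		ok, err := matchesTablePattern("gofood.fact_", n[0], n[1])
		fmt.Println(n[0]+"."+n[1], ok, err)
	}
}
```

Because `.` is a regex wildcard, `gofood.fact_` also matches names such as `gofoodxfact_orders`. Writing `gofood\.fact_` matches the literal dot only.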

Configuration

Key	Type	Example	Description	Required
`project_id`	`string`	`my-project`	BigQuery Project ID	required
`service_account_base64`	`string`	`____BASE64____`	Base64-encoded service account JSON. Takes precedence over `service_account_json`	optional
`service_account_json`	`string`	`{"private_key": ...}`	Service account credentials as a JSON string	optional
`table_pattern`	`string`	`gofood.fact_`	Regex pattern to whitelist tables to extract	optional
`exclude.datasets`	`[]string`	`[dataset_a]`	Dataset IDs to exclude	optional
`exclude.tables`	`[]string`	`[dataset_c.table_a]`	Table names in `datasetID.tableID` format to exclude	optional
`exclude.labels`	`map[string]string`	`{env: staging}`	Tables with any matching label key-value pair are excluded	optional
`max_page_size`	`int`	`100`	Page size hint for BigQuery API list calls	optional
`dataset_page_size`	`int`	`10`	Page size for listing datasets. Falls back to `max_page_size`	optional
`table_page_size`	`int`	`50`	Page size for listing tables. Falls back to `max_page_size`	optional
`include_column_profile`	`bool`	`true`	Profile each column (min, max, avg, med, unique, count, top)	optional
`max_preview_rows`	`int`	`30`	Number of preview rows to fetch. `0` skips the preview; `-1` omits the `preview_rows` key from the entity properties entirely. Default `30`	optional
`mix_values`	`bool`	`false`	Shuffle column values across preview rows for privacy. Default `false`	optional
`build_view_lineage`	`bool`	`true`	Parse view SQL to extract upstream lineage edges. Default `false`	optional
`collect_table_usage`	`bool`	`false`	Collect table usage statistics from BigQuery audit logs. Default `false`	optional
`usage_period_in_day`	`int`	`7`	Number of days of audit logs to scan. Default `7`	optional
`usage_project_ids`	`[]string`	`[my-project]`	GCP project IDs to scan for audit logs. Defaults to `project_id`	optional
`concurrency`	`int`	`10`	Number of tables to process concurrently. Default `10`	optional
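
The fallback from `dataset_page_size` and `table_page_size` to `max_page_size` can be pictured with a small helper. This is only a sketch: `pageSize` is a hypothetical name, and it assumes that a value of `0` (the Go zero value) means "not set".

```go
package main

import "fmt"

// pageSize picks the page size for one kind of list call.
// Assumption: 0 or a negative value means the specific size was not configured.
func pageSize(specific, maxPageSize int) int {
	if specific > 0 {
		return specific
	}
	return maxPageSize
}

func main() {
	fmt.Println(pageSize(10, 100)) // dataset_page_size is set, so it wins
	fmt.Println(pageSize(0, 100))  // table_page_size is unset, so max_page_size is used
}
```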

Notes

If both service_account_json and service_account_base64 are left blank, the extractor falls back to Google Application Default Credentials. This is the recommended setup when Meteor runs inside the same GCP environment.
The service account needs the bigquery.privateLogsViewer role to collect audit logs.
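
The credential precedence described in the configuration table (base64 first, then JSON, then Application Default Credentials) might be resolved along these lines. This is a hedged sketch, not the package's `CreateClient` logic: `resolveCredentials` is a hypothetical helper, and standard base64 encoding is assumed.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// resolveCredentials returns the raw service account JSON to use.
// A nil result with a nil error signals "use Application Default Credentials".
// Assumption: service_account_base64 uses standard (padded) base64.
func resolveCredentials(saBase64, saJSON string) ([]byte, error) {
	if saBase64 != "" {
		decoded, err := base64.StdEncoding.DecodeString(saBase64)
		if err != nil {
			return nil, fmt.Errorf("decode service_account_base64: %w", err)
		}
		return decoded, nil
	}
	if saJSON != "" {
		return []byte(saJSON), nil
	}
	return nil, nil
}

func main() {
	encoded := base64.StdEncoding.EncodeToString([]byte(`{"type":"service_account"}`))
	creds, err := resolveCredentials(encoded, `{"type":"ignored"}`)
	fmt.Println(string(creds), err)
}
```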

Entities

Entity: `table`

URN format: urn:bigquery:{project_id}:table:{project_id}:{dataset_id}.{table_id}

Produced by plugins.BigQueryURN(projectID, datasetID, tableID).
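
To make the URN format concrete, here is a small sketch that builds a string in that shape. It is not the implementation of `plugins.BigQueryURN`; the function `tableURN` is a hypothetical stand-in that only reproduces the format shown above.

```go
package main

import "fmt"

// tableURN formats a URN following the documented pattern
// urn:bigquery:{project_id}:table:{project_id}:{dataset_id}.{table_id}.
func tableURN(projectID, datasetID, tableID string) string {
	return fmt.Sprintf("urn:bigquery:%s:table:%s:%s.%s",
		projectID, projectID, datasetID, tableID)
}

func main() {
	fmt.Println(tableURN("my-project", "gofood", "fact_orders"))
	// Prints: urn:bigquery:my-project:table:my-project:gofood.fact_orders
}
```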

Property	Type	Description
`entity.description`	`string`	Table description from BigQuery metadata
`entity.properties.full_qualified_name`	`string`	Fully qualified table name (`project.dataset.table`)
`entity.properties.dataset`	`string`	Dataset ID
`entity.properties.project`	`string`	Project ID
`entity.properties.type`	`string`	BigQuery table type (`TABLE`, `VIEW`, `MATERIALIZED_VIEW`, etc.)
`entity.properties.partition_data`	`object`	Partition configuration (see below)
`entity.properties.clustering_fields`	`[]string`	Fields the table is clustered on
`entity.properties.sql`	`string`	View SQL query (views and materialized views only)
`entity.properties.columns`	`[]object`	Column schema (see below)
`entity.properties.profile`	`object`	Table-level usage profile (see below)
`entity.properties.create_time`	`string`	Table creation timestamp (RFC 3339)
`entity.properties.update_time`	`string`	Last modification timestamp (RFC 3339)
`entity.properties.preview_fields`	`[]string`	Column names for preview rows
`entity.properties.preview_rows`	`[]array`	Sample data rows
`entity.properties.labels`	`map[string]string`	BigQuery table labels
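
When `mix_values` is enabled, values in `preview_rows` are shuffled across rows. One way to picture this is shuffling each column independently, so that no output row corresponds to a real record. The sketch below illustrates that idea under stated assumptions: `mixValues` is a hypothetical helper, rows are assumed to be rectangular, and the shuffle function is injected so the behaviour can be made deterministic in tests.

```go
package main

import (
	"fmt"
	"math/rand"
)

// mixValues shuffles every column of rows independently, in place.
// Assumption: all rows have the same number of values.
// The shuffle parameter has the same signature as math/rand.Shuffle.
func mixValues(rows [][]any, shuffle func(n int, swap func(i, j int))) {
	if len(rows) == 0 {
		return
	}
	for col := range rows[0] {
		shuffle(len(rows), func(i, j int) {
			rows[i][col], rows[j][col] = rows[j][col], rows[i][col]
		})
	}
}

func main() {
	rows := [][]any{
		{1, "alice"},
		{2, "bob"},
		{3, "carol"},
	}
	mixValues(rows, rand.Shuffle)
	fmt.Println(rows) // each column keeps its values, in a random order
}
```

Each column still contains exactly the same set of values afterwards; only their assignment to rows changes.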

Partition Data (`entity.properties.partition_data`)

Key	Type	Description
`partition_field`	`string`	Partition column. Defaults to `_PARTITIONTIME` for time partitioning with no explicit field
`require_partition_filter`	`bool`	Whether queries must filter on the partition column
`time_partition.partition_by`	`string`	Time partition granularity: `HOUR`, `DAY`, `MONTH`, `YEAR`
`time_partition.partition_expire_seconds`	`float`	Seconds until partition data expires. `0` means no expiry
`range_partition.start`	`int`	Range partition start (inclusive)
`range_partition.end`	`int`	Range partition end (exclusive)
`range_partition.interval`	`int`	Range partition interval width
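
The `partition_field` default can be expressed as a short rule. The sketch below is illustrative only: `partitionField` is a hypothetical helper, and it assumes that an empty field on a time-partitioned table means ingestion-time partitioning.

```go
package main

import "fmt"

// partitionField returns the value reported as partition_field.
// Assumption: an empty field on a time-partitioned table falls back
// to the _PARTITIONTIME pseudo-column.
func partitionField(timePartitioned bool, field string) string {
	if field != "" {
		return field
	}
	if timePartitioned {
		return "_PARTITIONTIME"
	}
	return ""
}

func main() {
	fmt.Println(partitionField(true, "created_at")) // explicit partition column
	fmt.Println(partitionField(true, ""))           // ingestion-time partitioning
}
```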

Column (`entity.properties.columns[]`)

Key	Type	Description
`name`	`string`	Column name
`data_type`	`string`	BigQuery data type (`STRING`, `INTEGER`, `RECORD`, etc.)
`description`	`string`	Column description
`is_nullable`	`bool`	Whether the column is nullable
`mode`	`string`	Column mode: `NULLABLE`, `REQUIRED`, or `REPEATED`
`policy_tags`	`[]string`	Data Catalog policy tags in `taxonomy:tag:resource` format
`columns`	`[]object`	Nested columns (for `RECORD` type)
`profile`	`object`	Column profile with `min`, `max`, `avg`, `med`, `unique`, `count`, `top` (when `include_column_profile` is enabled)
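
Because `RECORD` columns carry their children in a nested `columns` list, a consumer usually walks the schema recursively. The following sketch shows one way to do that. The `Column` struct and `flattenColumns` function are hypothetical and model only a few of the keys listed above.

```go
package main

import "fmt"

// Column is a simplified, hypothetical model of one entry in
// entity.properties.columns; only a few keys are included.
type Column struct {
	Name     string
	DataType string
	Mode     string
	Columns  []Column // nested columns, set for RECORD types
}

// flattenColumns returns dotted paths for every column, depth first.
func flattenColumns(prefix string, cols []Column) []string {
	var paths []string
	for _, c := range cols {
		path := c.Name
		if prefix != "" {
			path = prefix + "." + c.Name
		}
		paths = append(paths, path)
		paths = append(paths, flattenColumns(path, c.Columns)...)
	}
	return paths
}

func main() {
	schema := []Column{
		{Name: "id", DataType: "INTEGER", Mode: "REQUIRED"},
		{Name: "address", DataType: "RECORD", Mode: "NULLABLE", Columns: []Column{
			{Name: "city", DataType: "STRING", Mode: "NULLABLE"},
		}},
	}
	fmt.Println(flattenColumns("", schema))
}
```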

Profile (`entity.properties.profile`)

Populated when collect_table_usage is enabled.

Key	Type	Description
`total_rows`	`int`	Number of rows in the table
`usage_count`	`int`	Number of times the table was queried in the audit log window
`common_joins`	`[]object`	Tables commonly joined with this one. Each entry has `urn`, `count`, and `conditions`
`filters`	`[]string`	WHERE clause expressions found in queries against this table

Edges

Type	Source	Target	Description
`derived_from`	upstream table URN	this table URN	Upstream dependency parsed from view SQL. Emitted when `build_view_lineage` is enabled and the table is a view or materialized view
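
The direction of a `derived_from` edge is easy to get backwards, so here is a small sketch of how the edges for one view could be assembled from its upstream URNs. The `Edge` struct and `derivedFromEdges` function are hypothetical and are not Meteor's own types.

```go
package main

import "fmt"

// Edge is a simplified, hypothetical stand-in for a lineage edge.
type Edge struct {
	Type      string
	SourceURN string // upstream table
	TargetURN string // the view being described
}

// derivedFromEdges builds one derived_from edge per upstream table.
func derivedFromEdges(viewURN string, upstreamURNs []string) []Edge {
	edges := make([]Edge, 0, len(upstreamURNs))
	for _, up := range upstreamURNs {
		edges = append(edges, Edge{Type: "derived_from", SourceURN: up, TargetURN: viewURN})
	}
	return edges
}

func main() {
	view := "urn:bigquery:p:table:p:reports.daily_orders"
	upstream := []string{"urn:bigquery:p:table:p:gofood.fact_orders"}
	for _, e := range derivedFromEdges(view, upstream) {
		fmt.Printf("%s: %s -> %s\n", e.Type, e.SourceURN, e.TargetURN)
	}
}
```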

Contributing

Refer to the contribution guidelines for information on contributing to this module.

Documentation ¶

Index ¶

func CreateClient(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)
func IsExcludedByLabels(tableLabels, excludeLabels map[string]string) bool
func IsExcludedDataset(datasetID string, excludedDatasets []string) bool
func IsExcludedTable(datasetID, tableID string, excludedTables []string) bool
type Config
type Exclude
type Extractor
- func New(logger log.Logger, newClient NewClientFunc, randFn randFn) *Extractor
- func (e *Extractor) Extract(ctx context.Context, emit plugins.Emit) error
- func (e *Extractor) Init(ctx context.Context, config plugins.Config) error
type NewClientFunc

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CreateClient ¶ added in v0.9.1

func CreateClient(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

CreateClient creates a bigquery client

func IsExcludedByLabels ¶ added in v0.13.0

func IsExcludedByLabels(tableLabels, excludeLabels map[string]string) bool

IsExcludedByLabels returns true if the table's labels match any of the configured exclude labels. A match means the table has the same key with the same value.
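
The matching rule described above can be sketched as follows. This is an illustrative re-implementation under the stated rule, not the package source; the lowercase name `excludedByLabels` is used to make clear that it is a hypothetical stand-in.

```go
package main

import "fmt"

// excludedByLabels reports whether any exclude label is present on the
// table with the same key and the same value.
func excludedByLabels(tableLabels, excludeLabels map[string]string) bool {
	for key, want := range excludeLabels {
		if got, ok := tableLabels[key]; ok && got == want {
			return true
		}
	}
	return false
}

func main() {
	exclude := map[string]string{"env": "staging"}
	fmt.Println(excludedByLabels(map[string]string{"env": "staging", "team": "data"}, exclude))
	fmt.Println(excludedByLabels(map[string]string{"env": "production"}, exclude))
}
```

A table is excluded as soon as one pair matches; a matching key with a different value does not count.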

func IsExcludedDataset ¶

func IsExcludedDataset(datasetID string, excludedDatasets []string) bool

func IsExcludedTable ¶

func IsExcludedTable(datasetID, tableID string, excludedTables []string) bool
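
IsExcludedDataset and IsExcludedTable carry no doc comments, so the sketch below shows one plausible reading of their signatures together with the `exclude.datasets` and `exclude.tables` config. The lowercase functions are hypothetical stand-ins, and exact, case-sensitive comparison is an assumption.

```go
package main

import "fmt"

// excludedDataset reports whether datasetID appears in the exclude list.
// Assumption: comparison is exact and case-sensitive.
func excludedDataset(datasetID string, excludedDatasets []string) bool {
	for _, d := range excludedDatasets {
		if d == datasetID {
			return true
		}
	}
	return false
}

// excludedTable reports whether "datasetID.tableID" appears in the exclude list.
func excludedTable(datasetID, tableID string, excludedTables []string) bool {
	name := datasetID + "." + tableID
	for _, t := range excludedTables {
		if t == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(excludedDataset("dataset_a", []string{"dataset_a", "dataset_b"}))
	fmt.Println(excludedTable("dataset_c", "table_a", []string{"dataset_c.table_a"}))
	fmt.Println(excludedTable("dataset_c", "table_b", []string{"dataset_c.table_a"}))
}
```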

Types ¶

type Config ¶

type Config struct {
	ProjectID string `json:"project_id" yaml:"project_id" mapstructure:"project_id" validate:"required"`
	// ServiceAccountBase64 takes precedence over ServiceAccountJSON field
	ServiceAccountBase64 string  `mapstructure:"service_account_base64"`
	ServiceAccountJSON   string  `mapstructure:"service_account_json"`
	MaxPageSize          int     `mapstructure:"max_page_size"`
	DatasetPageSize      int     `mapstructure:"dataset_page_size"`
	TablePageSize        int     `mapstructure:"table_page_size"`
	TablePattern         string  `mapstructure:"table_pattern"`
	Exclude              Exclude `mapstructure:"exclude"`
	IncludeColumnProfile bool    `mapstructure:"include_column_profile"`
	// MaxPreviewRows can also be set to -1 to restrict adding preview_rows key in entity properties
	MaxPreviewRows      int      `mapstructure:"max_preview_rows" default:"30"`
	MixValues           bool     `mapstructure:"mix_values" default:"false"`
	IsCollectTableUsage bool     `mapstructure:"collect_table_usage" default:"false"`
	UsagePeriodInDay    int64    `mapstructure:"usage_period_in_day" default:"7"`
	UsageProjectIDs     []string `mapstructure:"usage_project_ids"`
	BuildViewLineage    bool     `mapstructure:"build_view_lineage" default:"false"`
	Concurrency         int      `mapstructure:"concurrency" default:"10"`
}

Config holds the set of configuration for the bigquery extractor
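
The `Concurrency` field bounds how many tables are processed at the same time. A common Go pattern for this kind of limit is a buffered channel used as a semaphore, sketched below. This illustrates the general technique only and is not taken from the extractor; `processTables` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
)

// processTables calls fn once per table with at most `concurrency`
// calls running at the same time, and returns when all have finished.
func processTables(tables []string, concurrency int, fn func(table string)) {
	if concurrency < 1 {
		concurrency = 1
	}
	sem := make(chan struct{}, concurrency) // one slot per in-flight worker
	var wg sync.WaitGroup
	for _, t := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` workers are busy
		go func(table string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot
			fn(table)
		}(t)
	}
	wg.Wait()
}

func main() {
	tables := []string{"a", "b", "c", "d"}
	done := make(chan string, len(tables))
	processTables(tables, 2, func(t string) { done <- t })
	fmt.Println(len(done), "tables processed")
}
```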

type Exclude ¶

type Exclude struct {
	// list of datasetIDs
	Datasets []string `json:"datasets" yaml:"datasets" mapstructure:"datasets"`
	// list of tableNames in format - datasetID.tableID
	Tables []string `json:"tables" yaml:"tables" mapstructure:"tables"`
	// list of label key-value pairs; tables matching any label are excluded
	Labels map[string]string `json:"labels" yaml:"labels" mapstructure:"labels"`
}

type Extractor ¶

type Extractor struct {
	plugins.BaseExtractor
	// contains filtered or unexported fields
}

Extractor manages the communication with the bigquery service

func New ¶

func New(logger log.Logger, newClient NewClientFunc, randFn randFn) *Extractor

func (*Extractor) Extract ¶

func (e *Extractor) Extract(ctx context.Context, emit plugins.Emit) error

Extract checks if the table is valid and extracts the table schema

func (*Extractor) Init ¶

func (e *Extractor) Init(ctx context.Context, config plugins.Config) error

Init initializes the extractor

type NewClientFunc ¶ added in v0.9.1

type NewClientFunc func(ctx context.Context, logger log.Logger, config *Config) (*bigquery.Client, error)

Directories ¶

Path	Synopsis
auditlog
sqlparser
upstream