athena

package

v0.14.2 Latest Latest Go to latest Published: May 15, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/getsynq/dwhsupport

Links

Open Source Insights

Documentation ¶

Overview ¶

Package athena implements the scrapper.Scrapper interface for Amazon Athena.

V1 scope is metadata-only: catalog, tables, databases, view definitions. Table metrics, query logs, freshness, table constraints, and table comments are deferred — those methods either return ErrUnsupported or empty slices.

Athena exposes a Presto-derived SQL surface, so most queries look like Trino's but talk to Athena's information_schema (no system.metadata.* tables, no per-catalog prefix needed — Athena's information_schema is already catalog-scoped to the data catalog the connection is bound to).

Index ¶

func ScopeFromConf(conf *AthenaScrapperConf) *scope.ScopeFilter
type AthenaScrapper
- func NewAthenaScrapper(ctx context.Context, conf *AthenaScrapperConf) (*AthenaScrapper, error)
type AthenaScrapperConf

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func ScopeFromConf ¶

func ScopeFromConf(conf *AthenaScrapperConf) *scope.ScopeFilter

ScopeFromConf returns the configured ScopeFilter as-is. Athena is the first warehouse to use synq.common.v1.ScopeFilter directly on both internal and agent protos (no legacy `repeated string databases` field to translate from), so this is just a passthrough — kept for symmetry with the other scrappers' ScopeFromConf functions.

Mapping note: Athena's hierarchy is Glue catalog → Glue database → table. In ScopeFilter terms (mirroring BigQuery's project/dataset shape):

ScopeRule.database = Glue catalog (almost always 'AwsDataCatalog')
ScopeRule.schema   = Glue database
ScopeRule.table    = Glue table / view

Types ¶

type AthenaScrapper ¶

type AthenaScrapper struct {
	// contains filtered or unexported fields
}

func NewAthenaScrapper ¶

func NewAthenaScrapper(ctx context.Context, conf *AthenaScrapperConf) (*AthenaScrapper, error)

func (*AthenaScrapper) Capabilities ¶

func (e *AthenaScrapper) Capabilities() scrapper.Capabilities

func (*AthenaScrapper) Close ¶

func (e *AthenaScrapper) Close() error

func (*AthenaScrapper) DialectType ¶

func (e *AthenaScrapper) DialectType() string

func (*AthenaScrapper) FetchQueryLogs ¶

func (e *AthenaScrapper) FetchQueryLogs(
	ctx context.Context,
	from, to time.Time,
	obfuscator querylogs.QueryObfuscator,
) (querylogs.QueryLogIterator, error)

FetchQueryLogs streams the workgroup's query history from the Athena management APIs (athena:ListQueryExecutions + BatchGetQueryExecution).

These calls do NOT bill per data scanned — they're metadata operations. Page size is 50 (the AWS hard cap for both ListQueryExecutions and BatchGetQueryExecution); the iterator stops paging once it sees an execution older than `from`. AWS returns executions newest-first.

Athena retains query history for 45 days by default per workgroup; older `from` values silently truncate to that window.

func (*AthenaScrapper) IsPermissionError ¶

func (e *AthenaScrapper) IsPermissionError(err error) bool

func (*AthenaScrapper) QueryCatalog ¶

func (e *AthenaScrapper) QueryCatalog(ctx context.Context) ([]*scrapper.CatalogColumnRow, error)

func (*AthenaScrapper) QueryCustomMetrics ¶

func (e *AthenaScrapper) QueryCustomMetrics(ctx context.Context, sql string, args ...any) ([]*scrapper.CustomMetricsRow, error)

func (*AthenaScrapper) QueryDatabases ¶

func (e *AthenaScrapper) QueryDatabases(ctx context.Context) ([]*scrapper.DatabaseRow, error)

QueryDatabases lists Glue databases visible to the connection.

Note on terminology: ScopeRule.database is the Glue Data Catalog name (almost always 'AwsDataCatalog'); ScopeRule.schema is the Glue database. We populate DatabaseRow.Database with the *Glue database* name here so downstream code that treats DatabaseRow as the top-level container sees what the user actually filters on. The real catalog lives on the row's Instance (and the catalog string is captured via the executor's Catalog()).

func (*AthenaScrapper) QuerySegments ¶

func (e *AthenaScrapper) QuerySegments(ctx context.Context, sql string, args ...any) ([]*scrapper.SegmentRow, error)

func (*AthenaScrapper) QueryShape ¶

func (e *AthenaScrapper) QueryShape(ctx context.Context, sql string) ([]*scrapper.QueryShapeColumn, error)

func (*AthenaScrapper) QuerySqlDefinitions ¶

func (e *AthenaScrapper) QuerySqlDefinitions(ctx context.Context) ([]*scrapper.SqlDefinitionRow, error)

QuerySqlDefinitions returns view (and optionally table) DDLs.

Default: pulls view bodies from information_schema.views.view_definition — one bulk query, free (DDL).

With UseShowCreateView/UseShowCreateTable enabled, additionally calls SHOW CREATE VIEW / SHOW CREATE TABLE per object in an 8-way errgroup pool. SHOW CREATE on Athena is the only way to retrieve table DDL (CTAS bodies, Iceberg TBLPROPERTIES, Hive external LOCATION/SerDe) and the only way to get the full CREATE OR REPLACE VIEW DDL with original column declarations. Each call is one Athena query execution (~$0.00005 at the 10MB scan minimum).

func (*AthenaScrapper) QueryTableConstraints ¶

func (e *AthenaScrapper) QueryTableConstraints(ctx context.Context) ([]*scrapper.TableConstraintRow, error)

QueryTableConstraints exposes partition columns as `PARTITION BY` constraint rows — the same shape BigQuery uses for time/range partitioning.

Athena / Hive / Iceberg metastore has no traditional PK / FK / UNIQUE / CHECK concept; partitioning is the only constraint-shaped metadata Athena exposes that's worth surfacing.

Sources, per row:

Glue `Table.PartitionKeys` — populated for Hive-style partitioned tables (`PARTITIONED BY (col TYPE)`), free along with the metrics fetch.
Iceberg `$partitions` metadata table — Glue does NOT populate `PartitionKeys` for Iceberg tables (the partition spec lives in the `metadata.json` on S3, not in Glue), so we read the spec column names from `SELECT * FROM "<db>"."<table>$partitions" LIMIT 1` when `UseIcebergMetricsScan` is enabled. One Athena query per Iceberg table (~$0.00005 each at the 10MB scan minimum).

Iceberg transforms (`day(...)`, `bucket(N, ...)`, `truncate(L, ...)`) are flattened to their source column names — the per-column constraint row can't represent the transform. To recover the full transform string, enable `UseShowCreateTable` and parse the `PARTITIONED BY` clause from the DDL returned by `QuerySqlDefinitions`.

func (*AthenaScrapper) QueryTableMetrics ¶

func (e *AthenaScrapper) QueryTableMetrics(ctx context.Context, lastFetchTime time.Time) ([]*scrapper.TableMetricsRow, error)

QueryTableMetrics returns row counts, total size in bytes, and last-updated timestamps for tables visible to the configured Glue catalog.

Sources, in order of priority per row:

Glue table parameters (Hive table-level statistics): `numRows`, `totalSize`, `transient_lastDdlTime`. Populated by `ANALYZE TABLE COMPUTE STATISTICS` or by Glue crawlers with stats enabled. Free — one paginated `glue:GetTables` call per Glue database, no Athena query cost.
Iceberg `$files` metadata table aggregate (only when `UseIcebergMetricsScan` is enabled and the Glue parameters didn't already supply row count + size). One Athena query per Iceberg table (~$0.00005 each at the 10MB scan minimum).

Tables with no usable stats from either source are skipped — we'd rather omit a row than emit zeros that downstream consumers treat as a real "0 rows".

`lastFetchTime` is honoured against Glue's `UpdateTime`: tables that haven't changed since the last fetch are skipped entirely, including their Iceberg scan.

func (*AthenaScrapper) QueryTables ¶

func (e *AthenaScrapper) QueryTables(ctx context.Context, opts ...scrapper.QueryTablesOption) ([]*scrapper.TableRow, error)

func (*AthenaScrapper) RunRawQuery ¶

func (e *AthenaScrapper) RunRawQuery(ctx context.Context, sql string) (scrapper.RawQueryRowIterator, error)

func (*AthenaScrapper) SqlDialect ¶

func (e *AthenaScrapper) SqlDialect() sqldialect.Dialect

func (*AthenaScrapper) ValidateConfiguration ¶

func (e *AthenaScrapper) ValidateConfiguration(ctx context.Context) ([]string, error)

type AthenaScrapperConf ¶

type AthenaScrapperConf struct {
	*dwhexecathena.AthenaConf

	// UseShowCreateView, when true, calls `SHOW CREATE VIEW` per view to
	// retrieve the full `CREATE OR REPLACE VIEW … AS …` DDL instead of just
	// the rewritten Presto body from `information_schema.views.view_definition`.
	// Each call is one Athena query (~$0.00005 at the 10MB scan minimum).
	UseShowCreateView bool

	// UseShowCreateTable, when true, calls `SHOW CREATE TABLE` per non-view
	// object to retrieve the table DDL. This is the only way to get table DDL
	// on Athena — `information_schema` exposes no equivalent. Required to see
	// CTAS bodies (lineage), Iceberg `TBLPROPERTIES`, partitioning, and Hive
	// external `LOCATION`/SerDe info. Each call is one Athena query.
	UseShowCreateTable bool

	// UseIcebergMetricsScan, when true, fans out per-Iceberg-table Athena
	// queries against the table's metadata tables to recover information that
	// Glue does not expose:
	//
	//  - `$files` aggregate → row count + total size.
	//  - `$snapshots` MAX(committed_at) → exact data-freshness timestamp (the
	//    moment of the most recent INSERT / MERGE / UPDATE / DELETE). Glue's
	//    `UpdateTime` usually moves with the metadata pointer too, but the
	//    snapshot timestamp is authoritative.
	//  - `$partitions` `typeof(partition)` → partition-spec column names (Glue's
	//    `PartitionKeys` is empty for native Iceberg tables since the spec
	//    lives in the metadata.json).
	//
	// Per Iceberg table: one combined query for row count + size + freshness,
	// plus one query for the partition spec on partitioned tables. Each query
	// costs ~$0.00005 at Athena's 10MB scan minimum. Hive-style tables and
	// views are unaffected — those metrics / partitions / freshness come from
	// Glue parameters and `PartitionKeys` for free. Default is false.
	UseIcebergMetricsScan bool

	// Scope is the include/exclude filter for Glue catalog/database/table.
	// Nil means accept-all. The cloud and agent protos both carry this as
	// a synq.common.v1.ScopeFilter; callers translate via ScopeFromProto.
	Scope *scope.ScopeFilter
}

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL