Documentation
¶
Overview ¶
Package athena implements the scrapper.Scrapper interface for Amazon Athena.
V1 scope is metadata-only: catalog, tables, databases, view definitions. Table metrics, query logs, freshness, table constraints, and table comments are deferred — those methods either return ErrUnsupported or empty slices.
Athena exposes a Presto-derived SQL surface, so most queries look like Trino's but talk to Athena's information_schema (no system.metadata.* tables, no per-catalog prefix needed — Athena's information_schema is already catalog-scoped to the data catalog the connection is bound to).
Index ¶
- func ScopeFromConf(conf *AthenaScrapperConf) *scope.ScopeFilter
- type AthenaScrapper
- func (e *AthenaScrapper) Capabilities() scrapper.Capabilities
- func (e *AthenaScrapper) Close() error
- func (e *AthenaScrapper) DialectType() string
- func (e *AthenaScrapper) FetchQueryLogs(ctx context.Context, from, to time.Time, obfuscator querylogs.QueryObfuscator) (querylogs.QueryLogIterator, error)
- func (e *AthenaScrapper) IsPermissionError(err error) bool
- func (e *AthenaScrapper) QueryCatalog(ctx context.Context) ([]*scrapper.CatalogColumnRow, error)
- func (e *AthenaScrapper) QueryCustomMetrics(ctx context.Context, sql string, args ...any) ([]*scrapper.CustomMetricsRow, error)
- func (e *AthenaScrapper) QueryDatabases(ctx context.Context) ([]*scrapper.DatabaseRow, error)
- func (e *AthenaScrapper) QuerySegments(ctx context.Context, sql string, args ...any) ([]*scrapper.SegmentRow, error)
- func (e *AthenaScrapper) QueryShape(ctx context.Context, sql string) ([]*scrapper.QueryShapeColumn, error)
- func (e *AthenaScrapper) QuerySqlDefinitions(ctx context.Context) ([]*scrapper.SqlDefinitionRow, error)
- func (e *AthenaScrapper) QueryTableConstraints(ctx context.Context) ([]*scrapper.TableConstraintRow, error)
- func (e *AthenaScrapper) QueryTableMetrics(ctx context.Context, lastFetchTime time.Time) ([]*scrapper.TableMetricsRow, error)
- func (e *AthenaScrapper) QueryTables(ctx context.Context, opts ...scrapper.QueryTablesOption) ([]*scrapper.TableRow, error)
- func (e *AthenaScrapper) RunRawQuery(ctx context.Context, sql string) (scrapper.RawQueryRowIterator, error)
- func (e *AthenaScrapper) SqlDialect() sqldialect.Dialect
- func (e *AthenaScrapper) ValidateConfiguration(ctx context.Context) ([]string, error)
- type AthenaScrapperConf
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ScopeFromConf ¶
func ScopeFromConf(conf *AthenaScrapperConf) *scope.ScopeFilter
ScopeFromConf returns the configured ScopeFilter as-is. Athena is the first warehouse to use synq.common.v1.ScopeFilter directly on both internal and agent protos (no legacy `repeated string databases` field to translate from), so this is just a passthrough — kept for symmetry with the other scrappers' ScopeFromConf functions.
Mapping note: Athena's hierarchy is Glue catalog → Glue database → table. In ScopeFilter terms (mirroring BigQuery's project/dataset shape):
ScopeRule.database = Glue catalog (almost always 'AwsDataCatalog') ScopeRule.schema = Glue database ScopeRule.table = Glue table / view
Types ¶
type AthenaScrapper ¶
type AthenaScrapper struct {
// contains filtered or unexported fields
}
func NewAthenaScrapper ¶
func NewAthenaScrapper(ctx context.Context, conf *AthenaScrapperConf) (*AthenaScrapper, error)
func (*AthenaScrapper) Capabilities ¶
func (e *AthenaScrapper) Capabilities() scrapper.Capabilities
func (*AthenaScrapper) Close ¶
func (e *AthenaScrapper) Close() error
func (*AthenaScrapper) DialectType ¶
func (e *AthenaScrapper) DialectType() string
func (*AthenaScrapper) FetchQueryLogs ¶
func (e *AthenaScrapper) FetchQueryLogs( ctx context.Context, from, to time.Time, obfuscator querylogs.QueryObfuscator, ) (querylogs.QueryLogIterator, error)
FetchQueryLogs streams the workgroup's query history from the Athena management APIs (athena:ListQueryExecutions + BatchGetQueryExecution).
These calls do NOT bill per data scanned — they're metadata operations. Page size is 50 (the AWS hard cap for both ListQueryExecutions and BatchGetQueryExecution); the iterator stops paging once it sees an execution older than `from`. AWS returns executions newest-first.
Athena retains query history for 45 days by default per workgroup; older `from` values silently truncate to that window.
func (*AthenaScrapper) IsPermissionError ¶
func (e *AthenaScrapper) IsPermissionError(err error) bool
func (*AthenaScrapper) QueryCatalog ¶
func (e *AthenaScrapper) QueryCatalog(ctx context.Context) ([]*scrapper.CatalogColumnRow, error)
func (*AthenaScrapper) QueryCustomMetrics ¶
func (e *AthenaScrapper) QueryCustomMetrics(ctx context.Context, sql string, args ...any) ([]*scrapper.CustomMetricsRow, error)
func (*AthenaScrapper) QueryDatabases ¶
func (e *AthenaScrapper) QueryDatabases(ctx context.Context) ([]*scrapper.DatabaseRow, error)
QueryDatabases lists Glue databases visible to the connection.
Note on terminology: ScopeRule.database is the Glue Data Catalog name (almost always 'AwsDataCatalog'); ScopeRule.schema is the Glue database. We populate DatabaseRow.Database with the *Glue database* name here so downstream code that treats DatabaseRow as the top-level container sees what the user actually filters on. The real catalog lives on the row's Instance (and the catalog string is captured via the executor's Catalog()).
func (*AthenaScrapper) QuerySegments ¶
func (e *AthenaScrapper) QuerySegments(ctx context.Context, sql string, args ...any) ([]*scrapper.SegmentRow, error)
func (*AthenaScrapper) QueryShape ¶
func (e *AthenaScrapper) QueryShape(ctx context.Context, sql string) ([]*scrapper.QueryShapeColumn, error)
func (*AthenaScrapper) QuerySqlDefinitions ¶
func (e *AthenaScrapper) QuerySqlDefinitions(ctx context.Context) ([]*scrapper.SqlDefinitionRow, error)
QuerySqlDefinitions returns view (and optionally table) DDLs.
Default: pulls view bodies from information_schema.views.view_definition — one bulk query, free (DDL).
With UseShowCreateView/UseShowCreateTable enabled, additionally calls SHOW CREATE VIEW / SHOW CREATE TABLE per object in an 8-way errgroup pool. SHOW CREATE on Athena is the only way to retrieve table DDL (CTAS bodies, Iceberg TBLPROPERTIES, Hive external LOCATION/SerDe) and the only way to get the full CREATE OR REPLACE VIEW DDL with original column declarations. Each call is one Athena query execution (~$0.00005 at the 10MB scan minimum).
func (*AthenaScrapper) QueryTableConstraints ¶
func (e *AthenaScrapper) QueryTableConstraints(ctx context.Context) ([]*scrapper.TableConstraintRow, error)
QueryTableConstraints exposes partition columns as `PARTITION BY` constraint rows — the same shape BigQuery uses for time/range partitioning.
Athena / Hive / Iceberg metastore has no traditional PK / FK / UNIQUE / CHECK concept; partitioning is the only constraint-shaped metadata Athena exposes that's worth surfacing.
Sources, per row:
- Glue `Table.PartitionKeys` — populated for Hive-style partitioned tables (`PARTITIONED BY (col TYPE)`), free along with the metrics fetch.
- Iceberg `$partitions` metadata table — Glue does NOT populate `PartitionKeys` for Iceberg tables (the partition spec lives in the `metadata.json` on S3, not in Glue), so we read the spec column names from `SELECT * FROM "<db>"."<table>$partitions" LIMIT 1` when `UseIcebergMetricsScan` is enabled. One Athena query per Iceberg table (~$0.00005 each at the 10MB scan minimum).
Iceberg transforms (`day(...)`, `bucket(N, ...)`, `truncate(L, ...)`) are flattened to their source column names — the per-column constraint row can't represent the transform. To recover the full transform string, enable `UseShowCreateTable` and parse the `PARTITIONED BY` clause from the DDL returned by `QuerySqlDefinitions`.
func (*AthenaScrapper) QueryTableMetrics ¶
func (e *AthenaScrapper) QueryTableMetrics(ctx context.Context, lastFetchTime time.Time) ([]*scrapper.TableMetricsRow, error)
QueryTableMetrics returns row counts, total size in bytes, and last-updated timestamps for tables visible to the configured Glue catalog.
Sources, in order of priority per row:
- Glue table parameters (Hive table-level statistics): `numRows`, `totalSize`, `transient_lastDdlTime`. Populated by `ANALYZE TABLE COMPUTE STATISTICS` or by Glue crawlers with stats enabled. Free — one paginated `glue:GetTables` call per Glue database, no Athena query cost.
- Iceberg `$files` metadata table aggregate (only when `UseIcebergMetricsScan` is enabled and the Glue parameters didn't already supply row count + size). One Athena query per Iceberg table (~$0.00005 each at the 10MB scan minimum).
Tables with no usable stats from either source are skipped — we'd rather omit a row than emit zeros that downstream consumers treat as a real "0 rows".
`lastFetchTime` is honoured against Glue's `UpdateTime`: tables that haven't changed since the last fetch are skipped entirely, including their Iceberg scan.
func (*AthenaScrapper) QueryTables ¶
func (e *AthenaScrapper) QueryTables(ctx context.Context, opts ...scrapper.QueryTablesOption) ([]*scrapper.TableRow, error)
func (*AthenaScrapper) RunRawQuery ¶
func (e *AthenaScrapper) RunRawQuery(ctx context.Context, sql string) (scrapper.RawQueryRowIterator, error)
func (*AthenaScrapper) SqlDialect ¶
func (e *AthenaScrapper) SqlDialect() sqldialect.Dialect
func (*AthenaScrapper) ValidateConfiguration ¶
func (e *AthenaScrapper) ValidateConfiguration(ctx context.Context) ([]string, error)
type AthenaScrapperConf ¶
type AthenaScrapperConf struct {
*dwhexecathena.AthenaConf
// UseShowCreateView, when true, calls `SHOW CREATE VIEW` per view to
// retrieve the full `CREATE OR REPLACE VIEW … AS …` DDL instead of just
// the rewritten Presto body from `information_schema.views.view_definition`.
// Each call is one Athena query (~$0.00005 at the 10MB scan minimum).
UseShowCreateView bool
// UseShowCreateTable, when true, calls `SHOW CREATE TABLE` per non-view
// object to retrieve the table DDL. This is the only way to get table DDL
// on Athena — `information_schema` exposes no equivalent. Required to see
// CTAS bodies (lineage), Iceberg `TBLPROPERTIES`, partitioning, and Hive
// external `LOCATION`/SerDe info. Each call is one Athena query.
UseShowCreateTable bool
// UseIcebergMetricsScan, when true, fans out per-Iceberg-table Athena
// queries against the table's metadata tables to recover information that
// Glue does not expose:
//
// - `$files` aggregate → row count + total size.
// - `$snapshots` MAX(committed_at) → exact data-freshness timestamp (the
// moment of the most recent INSERT / MERGE / UPDATE / DELETE). Glue's
// `UpdateTime` usually moves with the metadata pointer too, but the
// snapshot timestamp is authoritative.
// - `$partitions` `typeof(partition)` → partition-spec column names (Glue's
// `PartitionKeys` is empty for native Iceberg tables since the spec
// lives in the metadata.json).
//
// Per Iceberg table: one combined query for row count + size + freshness,
// plus one query for the partition spec on partitioned tables. Each query
// costs ~$0.00005 at Athena's 10MB scan minimum. Hive-style tables and
// views are unaffected — those metrics / partitions / freshness come from
// Glue parameters and `PartitionKeys` for free. Default is false.
UseIcebergMetricsScan bool
// Scope is the include/exclude filter for Glue catalog/database/table.
// Nil means accept-all. The cloud and agent protos both carry this as
// a synq.common.v1.ScopeFilter; callers translate via ScopeFromProto.
Scope *scope.ScopeFilter
}