stt

Package stt, version v1.4.5 (latest)
Warning

This package is not in the latest version of its module.

Published: Apr 15, 2026 | License: Apache-2.0 | Imports: 12 | Imported by: 1

Overview

Package stt provides speech-to-text (STT) services that convert spoken audio to text.

The package defines a common Service interface that abstracts over STT providers, so voice AI applications can transcribe user speech without depending on a specific provider.

Architecture

The package provides:

  • Service interface for STT providers
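  • TranscriptionConfig for audio format and transcription options
  • Provider implementations (currently OpenAI Whisper only)
  • ProviderSpec, Factory, RegisterFactory, and CreateFromSpec for building providers from configuration
  • TranscribeWithRetry and RetryConfig for bounded retry of transient failures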

Usage

Basic usage with OpenAI Whisper:

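ctx := context.Background()
// audioData holds raw 16-bit little-endian PCM, mono, sampled at 16 kHz.
service := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
text, err := service.Transcribe(ctx, audioData, stt.TranscriptionConfig{
    Format:     stt.FormatPCM,
    SampleRate: stt.DefaultSampleRate,
    Channels:   stt.DefaultChannels,
    BitDepth:   stt.DefaultBitDepth,
    Language:   "en", // optional hint; improves accuracy
})
if err != nil {
    log.Fatal(err)
}
fmt.Println("User said:", text)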

Available Providers

The package includes implementations for:

  • OpenAI Whisper (whisper-1 model)
  • Additional providers can be added by implementing the Service interface and registering a Factory with RegisterFactory

Constants

const (
	// Default audio settings.
	DefaultSampleRate = 16000
	DefaultChannels   = 1
	DefaultBitDepth   = 16

	// Common audio formats.
	FormatPCM = "pcm"
	FormatWAV = "wav"
	FormatMP3 = "mp3"
)
const (

	// ModelWhisper1 is the OpenAI Whisper model for transcription.
	ModelWhisper1 = "whisper-1"
)

Variables

var (
	// ErrEmptyAudio is returned when audio data is empty.
	ErrEmptyAudio = errors.New("audio data is empty")

	// ErrRateLimited is returned when the provider rate limits requests.
	ErrRateLimited = errors.New("rate limited by provider")

	// ErrInvalidFormat is returned when the audio format is not supported.
	ErrInvalidFormat = errors.New("unsupported audio format")

	// ErrAudioTooShort is returned when audio is too short to transcribe.
	ErrAudioTooShort = errors.New("audio too short to transcribe")
)

Common errors for STT services.

Functions

func APIKeyFromCredential added in v1.4.5

func APIKeyFromCredential(c credentials.Credential) string

APIKeyFromCredential returns the raw API key from an APIKey credential, or "" for any other credential shape (or nil).

func RegisterFactory added in v1.4.5

func RegisterFactory(providerType string, factory Factory)

RegisterFactory registers a factory for the given provider type. It is typically called from a provider package's init function.

func ResolveCredential added in v1.4.5

func ResolveCredential(ctx context.Context, providerType string,
	cfgDir string, cred *credentials.CredentialConfig,
) (credentials.Credential, error)

ResolveCredential resolves an STT provider's credential block into a concrete Credential, applying the same fallback chain as chat providers.

func TranscribeWithRetry added in v1.4.2

func TranscribeWithRetry(
	ctx context.Context,
	svc Service,
	audio []byte,
	config TranscriptionConfig,
	retry RetryConfig,
) (string, error)

TranscribeWithRetry calls svc.Transcribe with bounded retry on transient errors. Only errors whose TranscriptionError.Retryable field is true are retried; all other errors are returned immediately. Backoff uses full jitter to avoid synchronized retries across concurrent callers.
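The retry policy described above can be sketched as follows. This is a standalone approximation, not the package's code: the import path is not shown on this page, so the types are redeclared locally and trimmed, and the exponential ceiling InitialDelay * 2^n capped at MaxDelay is an assumption about how full jitter is applied. The names transcribeWithRetry, transcribeFunc, and demo are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Local stand-ins for stt.TranscriptionConfig, stt.RetryConfig and
// stt.TranscriptionError, trimmed to the fields this sketch uses.
type TranscriptionConfig struct{ Format string }

type RetryConfig struct {
	MaxAttempts  int
	InitialDelay time.Duration
	MaxDelay     time.Duration
}

type TranscriptionError struct {
	Message   string
	Retryable bool
}

func (e *TranscriptionError) Error() string { return e.Message }

// transcribeFunc plays the role of svc.Transcribe.
type transcribeFunc func(ctx context.Context, audio []byte, cfg TranscriptionConfig) (string, error)

// transcribeWithRetry is a hypothetical re-creation of the documented behavior:
// only retryable TranscriptionErrors are retried, and each wait is drawn
// uniformly from [0, min(MaxDelay, InitialDelay*2^n)) ("full jitter").
func transcribeWithRetry(ctx context.Context, call transcribeFunc, audio []byte,
	cfg TranscriptionConfig, retry RetryConfig) (string, error) {
	attempts := retry.MaxAttempts
	if attempts < 1 {
		attempts = 1 // values < 1 mean a single attempt
	}
	for attempt := 1; ; attempt++ {
		text, err := call(ctx, audio, cfg)
		if err == nil {
			return text, nil
		}
		var te *TranscriptionError
		if attempt >= attempts || !errors.As(err, &te) || !te.Retryable {
			return "", err // out of attempts, or not a transient failure
		}
		// Exponential ceiling; MaxAttempts is small, so the shift stays in range.
		ceiling := retry.InitialDelay << (attempt - 1)
		if retry.MaxDelay > 0 && ceiling > retry.MaxDelay {
			ceiling = retry.MaxDelay
		}
		var wait time.Duration
		if ceiling > 0 {
			wait = time.Duration(rand.Int63n(int64(ceiling)))
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(wait):
		}
	}
}

// demo runs transcribeWithRetry against a fake that fails `failures` times
// with a retryable error; it returns the text, the call count, and the error.
func demo(failures, maxAttempts int) (string, int, error) {
	calls := 0
	fake := func(ctx context.Context, audio []byte, cfg TranscriptionConfig) (string, error) {
		calls++
		if calls <= failures {
			return "", &TranscriptionError{Message: "provider returned 503", Retryable: true}
		}
		return "hello", nil
	}
	retry := RetryConfig{MaxAttempts: maxAttempts, InitialDelay: 5 * time.Millisecond, MaxDelay: 20 * time.Millisecond}
	text, err := transcribeWithRetry(context.Background(), fake, []byte{0}, TranscriptionConfig{Format: "pcm"}, retry)
	return text, calls, err
}

func main() {
	text, calls, err := demo(2, 3)
	fmt.Println(text, calls, err) // hello 3 <nil>
}
```

With the real package, the equivalent call is stt.TranscribeWithRetry(ctx, svc, audio, cfg, stt.DefaultRetryConfig()).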

func WrapPCMAsWAV

func WrapPCMAsWAV(pcmData []byte, sampleRate, channels, bitsPerSample int) []byte

WrapPCMAsWAV wraps raw PCM audio data in a WAV header. This is necessary for APIs, such as OpenAI Whisper, that expect an uploaded audio file rather than a raw sample stream.

Parameters:

  • pcmData: Raw PCM audio bytes (little-endian, signed)
  • sampleRate: Sample rate in Hz (e.g., 16000)
  • channels: Number of channels (1=mono, 2=stereo)
  • bitsPerSample: Bits per sample (typically 16)

Returns a byte slice containing WAV-formatted audio.
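A wrapper of this kind typically writes the canonical 44-byte RIFF header (RIFF size, a "fmt " chunk declaring format 1 for integer PCM, then the "data" chunk) in front of the samples. The sketch below is a hypothetical re-implementation, wrapPCMAsWAV, that assumes exactly that layout; the package's own output may differ in details. It needs Go 1.19 or later for binary.LittleEndian.AppendUint32.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wrapPCMAsWAV prepends the canonical 44-byte RIFF/WAVE header to raw
// little-endian, signed PCM samples. All header integers are little-endian.
func wrapPCMAsWAV(pcm []byte, sampleRate, channels, bitsPerSample int) []byte {
	blockAlign := channels * bitsPerSample / 8 // bytes per sample frame
	byteRate := sampleRate * blockAlign        // bytes per second
	le := binary.LittleEndian

	out := make([]byte, 0, 44+len(pcm))
	out = append(out, "RIFF"...)
	out = le.AppendUint32(out, uint32(36+len(pcm))) // bytes that follow this field
	out = append(out, "WAVE"...)
	out = append(out, "fmt "...)
	out = le.AppendUint32(out, 16) // fmt chunk size for plain PCM
	out = le.AppendUint16(out, 1)  // audio format 1 = integer PCM
	out = le.AppendUint16(out, uint16(channels))
	out = le.AppendUint32(out, uint32(sampleRate))
	out = le.AppendUint32(out, uint32(byteRate))
	out = le.AppendUint16(out, uint16(blockAlign))
	out = le.AppendUint16(out, uint16(bitsPerSample))
	out = append(out, "data"...)
	out = le.AppendUint32(out, uint32(len(pcm)))
	return append(out, pcm...)
}

func main() {
	// 320 bytes = 10 ms of 16 kHz, mono, 16-bit audio (silence here).
	wav := wrapPCMAsWAV(make([]byte, 320), 16000, 1, 16)
	fmt.Println(len(wav), string(wav[0:4]), string(wav[8:12])) // 364 RIFF WAVE
}
```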

Types

type Factory added in v1.4.5

type Factory func(spec ProviderSpec) (Service, error)

Factory builds a Service from a spec.

type OpenAIOption

type OpenAIOption func(*OpenAIService)

OpenAIOption configures the OpenAI STT service.

func WithOpenAIBaseURL

func WithOpenAIBaseURL(url string) OpenAIOption

WithOpenAIBaseURL sets a custom base URL (for testing or proxies).

func WithOpenAIClient

func WithOpenAIClient(client *http.Client) OpenAIOption

WithOpenAIClient sets a custom HTTP client.

func WithOpenAIModel

func WithOpenAIModel(model string) OpenAIOption

WithOpenAIModel sets the STT model to use.
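These options follow Go's functional-options pattern. As an illustration of how they compose, the sketch below mirrors the constructor using hypothetical lowercase names and field names (the real fields are unexported). The default base URL and HTTP client are assumptions; only the whisper-1 default comes from ModelWhisper1.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// openAIService mirrors the shape of stt.OpenAIService; field names are hypothetical.
type openAIService struct {
	apiKey  string
	model   string
	baseURL string
	client  *http.Client
}

type openAIOption func(*openAIService)

func withOpenAIModel(model string) openAIOption {
	return func(s *openAIService) { s.model = model }
}

func withOpenAIBaseURL(url string) openAIOption {
	return func(s *openAIService) { s.baseURL = url }
}

func withOpenAIClient(c *http.Client) openAIOption {
	return func(s *openAIService) { s.client = c }
}

// newOpenAI sets defaults first, then applies options in order,
// so a later option overrides an earlier one.
func newOpenAI(apiKey string, opts ...openAIOption) *openAIService {
	s := &openAIService{
		apiKey:  apiKey,
		model:   "whisper-1",                 // ModelWhisper1
		baseURL: "https://api.openai.com/v1", // assumed default
		client:  http.DefaultClient,          // assumed default
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := newOpenAI("sk-test",
		withOpenAIBaseURL("http://localhost:8080/v1"),
		withOpenAIClient(&http.Client{Timeout: 30 * time.Second}),
	)
	fmt.Println(s.model, s.baseURL, s.client.Timeout) // whisper-1 http://localhost:8080/v1 30s
}
```

Against the real package, the same configuration would read stt.NewOpenAI(key, stt.WithOpenAIBaseURL(url), stt.WithOpenAIClient(&http.Client{Timeout: 30 * time.Second})).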

type OpenAIService

type OpenAIService struct {
	// contains filtered or unexported fields
}

OpenAIService implements STT using OpenAI's Whisper API.

func NewOpenAI

func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService

NewOpenAI creates an OpenAI STT service using Whisper.

func (*OpenAIService) Name

func (s *OpenAIService) Name() string

Name returns the provider identifier.

func (*OpenAIService) SupportedFormats

func (s *OpenAIService) SupportedFormats() []string

SupportedFormats returns audio formats supported by OpenAI Whisper.

func (*OpenAIService) Transcribe

func (s *OpenAIService) Transcribe(
	ctx context.Context, audio []byte, config TranscriptionConfig,
) (string, error)

Transcribe converts audio to text using OpenAI's Whisper API.
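This page does not describe the request Transcribe sends. OpenAI's public transcription endpoint accepts a multipart/form-data POST to /v1/audio/transcriptions with a file part and a model field (plus optional language and prompt fields), which is also why WrapPCMAsWAV exists. The hypothetical buildTranscriptionRequest below builds such a request under that assumption; response parsing and error mapping are omitted.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"mime/multipart"
	"net/http"
)

// buildTranscriptionRequest is a hypothetical helper. It POSTs to
// {baseURL}/audio/transcriptions with the audio in a multipart "file" part
// and "model" plus optional "language"/"prompt" text fields.
func buildTranscriptionRequest(ctx context.Context, baseURL, apiKey string,
	wav []byte, model, language, prompt string) (*http.Request, error) {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)

	part, err := mw.CreateFormFile("file", "audio.wav")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(wav); err != nil {
		return nil, err
	}
	for _, f := range [][2]string{{"model", model}, {"language", language}, {"prompt", prompt}} {
		if f[1] == "" {
			continue // omit optional fields that were not set
		}
		if err := mw.WriteField(f[0], f[1]); err != nil {
			return nil, err
		}
	}
	if err := mw.Close(); err != nil { // writes the terminating boundary
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/audio/transcriptions", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", mw.FormDataContentType())
	return req, nil
}

// demoRequest builds a request around placeholder WAV bytes.
func demoRequest() (*http.Request, error) {
	return buildTranscriptionRequest(context.Background(), "https://api.openai.com/v1",
		"sk-test", []byte("RIFF"), "whisper-1", "en", "")
}

func main() {
	req, err := demoRequest()
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL) // POST https://api.openai.com/v1/audio/transcriptions
}
```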

type ProviderSpec added in v1.4.5

type ProviderSpec struct {
	// ID is a stable identifier; informational only at this layer.
	ID string
	// Type selects the implementation: openai (only one today).
	Type string
	// Model overrides the provider's default transcription model.
	Model string
	// BaseURL overrides the provider's default API endpoint.
	BaseURL string
	// Credential carries the resolved API key.
	Credential credentials.Credential
	// AdditionalConfig carries provider-specific extras. Unknown keys
	// are ignored.
	AdditionalConfig map[string]any
}

ProviderSpec is the runtime form of an STT-provider declaration, used by CreateFromSpec to construct a Service implementation. The SDK's runtime-config layer translates pkg/config.STTProviderConfig into this struct after resolving credentials.

type RetryConfig added in v1.4.2

type RetryConfig struct {
	// MaxAttempts is the total number of attempts including the initial
	// call. 3 means "initial + up to 2 retries". Values < 1 are
	// treated as 1 (no retry).
	MaxAttempts int
	// InitialDelay is the base backoff before the first retry.
	InitialDelay time.Duration
	// MaxDelay caps the per-attempt backoff.
	MaxDelay time.Duration
}

RetryConfig configures bounded retry for STT transcription calls. Unlike streaming retry, retry is enabled by default here because STT calls are one-shot and idempotent: retrying carries no risk of duplicated content, and the alternative is silently dropped speech.

func DefaultRetryConfig added in v1.4.2

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns sensible defaults for STT retry.

type Service

type Service interface {
	// Name returns the provider identifier (for logging/debugging).
	Name() string

	// Transcribe converts audio to text.
	// Returns the transcribed text or an error if transcription fails.
	Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (string, error)

	// SupportedFormats returns supported audio input formats.
	// Common values: "pcm", "wav", "mp3", "m4a", "webm"
	SupportedFormats() []string
}

Service transcribes audio to text. The interface abstracts over STT providers (OpenAI Whisper today; other providers such as Google can be added), so voice AI applications can use any implementation interchangeably.
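To illustrate implementing the interface, here is a hypothetical in-memory provider that returns a canned transcript, which can be handy in tests. Because this page does not show the package's import path, Service, a trimmed TranscriptionConfig, and ErrEmptyAudio are redeclared locally; staticService and transcribeWith are illustrative names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Local copies of the documented stt.Service interface and a trimmed
// stt.TranscriptionConfig, so this file compiles without the real package.
type TranscriptionConfig struct {
	Format     string
	SampleRate int
	Language   string
}

type Service interface {
	Name() string
	Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (string, error)
	SupportedFormats() []string
}

// ErrEmptyAudio mirrors the sentinel error of the same name.
var ErrEmptyAudio = errors.New("audio data is empty")

// staticService is a hypothetical test double that returns a canned transcript.
type staticService struct{ transcript string }

// Compile-time check that *staticService implements Service.
var _ Service = (*staticService)(nil)

func (s *staticService) Name() string { return "static" }

func (s *staticService) SupportedFormats() []string { return []string{"pcm", "wav"} }

func (s *staticService) Transcribe(ctx context.Context, audio []byte, _ TranscriptionConfig) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err // respect cancellation before doing any work
	}
	if len(audio) == 0 {
		return "", ErrEmptyAudio
	}
	return s.transcript, nil
}

// transcribeWith depends only on the interface, so any provider can be passed in.
func transcribeWith(svc Service, audio []byte) (string, error) {
	return svc.Transcribe(context.Background(), audio, TranscriptionConfig{Format: "pcm", SampleRate: 16000})
}

func main() {
	text, err := transcribeWith(&staticService{transcript: "turn on the lights"}, []byte{1, 2})
	fmt.Println(text, err) // turn on the lights <nil>
}
```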

func CreateFromSpec added in v1.4.5

func CreateFromSpec(spec ProviderSpec) (Service, error)

CreateFromSpec returns a Service implementation for the given spec.
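RegisterFactory and CreateFromSpec suggest a registry keyed by ProviderSpec.Type. A minimal sketch of that pattern follows, assuming a mutex-guarded map. The trimmed types, the "fake" provider, and the error text are hypothetical, and the real package may handle duplicate registrations or unknown types differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Trimmed local stand-ins for stt.Service, stt.ProviderSpec and stt.Factory.
type Service interface{ Name() string }

type ProviderSpec struct {
	Type  string
	Model string
}

type Factory func(spec ProviderSpec) (Service, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// registerFactory mirrors RegisterFactory: provider packages call it from init().
func registerFactory(providerType string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[providerType] = f
}

// createFromSpec mirrors CreateFromSpec: look up the factory by spec.Type.
func createFromSpec(spec ProviderSpec) (Service, error) {
	mu.RLock()
	f, ok := factories[spec.Type]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("stt: unknown provider type %q", spec.Type)
	}
	return f(spec)
}

type fakeService struct{ model string }

func (f *fakeService) Name() string { return "fake:" + f.model }

func init() {
	// A provider package would register itself like this in its own init().
	registerFactory("fake", func(spec ProviderSpec) (Service, error) {
		if spec.Model == "" {
			return nil, errors.New("fake: model is required")
		}
		return &fakeService{model: spec.Model}, nil
	})
}

func main() {
	if svc, err := createFromSpec(ProviderSpec{Type: "fake", Model: "m1"}); err == nil {
		fmt.Println(svc.Name()) // fake:m1
	}
	_, err := createFromSpec(ProviderSpec{Type: "nope"})
	fmt.Println(err) // stt: unknown provider type "nope"
}
```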

type TranscriptionConfig

type TranscriptionConfig struct {
	// Format is the audio format ("pcm", "wav", "mp3").
	// Default: "pcm"
	Format string

	// SampleRate is the audio sample rate in Hz.
	// Default: 16000
	SampleRate int

	// Channels is the number of audio channels (1=mono, 2=stereo).
	// Default: 1
	Channels int

	// BitDepth is the bits per sample for PCM audio.
	// Default: 16
	BitDepth int

	// Language is a hint for the transcription language (e.g., "en", "es").
	// Optional - improves accuracy if provided.
	Language string

	// Model is the STT model to use (provider-specific).
	// For OpenAI: "whisper-1"
	Model string

	// Prompt is a text prompt to guide transcription (provider-specific).
	// Can improve accuracy for domain-specific vocabulary.
	Prompt string
}

TranscriptionConfig configures speech-to-text transcription.

func DefaultTranscriptionConfig

func DefaultTranscriptionConfig() TranscriptionConfig

DefaultTranscriptionConfig returns sensible defaults for transcription.
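Assuming DefaultTranscriptionConfig returns exactly the defaults listed in the field comments above (pcm, 16000 Hz, mono, 16-bit), a caller can fill in only the fields it leaves empty. The withDefaults helper below is hypothetical and not part of the package, and the struct is a local copy.

```go
package main

import "fmt"

// TranscriptionConfig is a local copy of the documented struct.
type TranscriptionConfig struct {
	Format     string
	SampleRate int
	Channels   int
	BitDepth   int
	Language   string
	Model      string
	Prompt     string
}

// defaultTranscriptionConfig returns the defaults named in the field comments.
func defaultTranscriptionConfig() TranscriptionConfig {
	return TranscriptionConfig{Format: "pcm", SampleRate: 16000, Channels: 1, BitDepth: 16}
}

// withDefaults fills zero-valued audio fields from the defaults and keeps
// every field the caller set explicitly.
func withDefaults(c TranscriptionConfig) TranscriptionConfig {
	d := defaultTranscriptionConfig()
	if c.Format == "" {
		c.Format = d.Format
	}
	if c.SampleRate == 0 {
		c.SampleRate = d.SampleRate
	}
	if c.Channels == 0 {
		c.Channels = d.Channels
	}
	if c.BitDepth == 0 {
		c.BitDepth = d.BitDepth
	}
	return c
}

func main() {
	cfg := withDefaults(TranscriptionConfig{SampleRate: 24000, Language: "es"})
	fmt.Printf("%+v\n", cfg)
	// {Format:pcm SampleRate:24000 Channels:1 BitDepth:16 Language:es Model: Prompt:}
}
```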

type TranscriptionError

type TranscriptionError struct {
	// Provider is the STT provider name.
	Provider string

	// Code is the provider-specific error code.
	Code string

	// Message is a human-readable error message.
	Message string

	// Cause is the underlying error, if any.
	Cause error

	// Retryable indicates whether the request can be retried.
	Retryable bool
}

TranscriptionError represents an error during transcription.

func NewTranscriptionError

func NewTranscriptionError(provider, code, message string, cause error, retryable bool) *TranscriptionError

NewTranscriptionError creates a new TranscriptionError.

func (*TranscriptionError) Error

func (e *TranscriptionError) Error() string

Error implements the error interface.

func (*TranscriptionError) Is

func (e *TranscriptionError) Is(target error) bool

Is implements error matching for errors.Is.

func (*TranscriptionError) Unwrap

func (e *TranscriptionError) Unwrap() error

Unwrap returns the underlying error.
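Because TranscriptionError implements Unwrap, callers can test for the sentinel errors with errors.Is and inspect Retryable with errors.As, even when the error has been wrapped further. In the sketch below the struct is a local copy, the Error message format is invented (the real format is not documented), and the Is method is omitted because its matching rules are not described; errors.Is still reaches the Cause through Unwrap. The classify and wrapped helpers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRateLimited mirrors the package's sentinel error.
var ErrRateLimited = errors.New("rate limited by provider")

// TranscriptionError is a local copy of the documented struct.
type TranscriptionError struct {
	Provider  string
	Code      string
	Message   string
	Cause     error
	Retryable bool
}

// Error uses an invented format for illustration.
func (e *TranscriptionError) Error() string {
	msg := fmt.Sprintf("%s: %s (code %s)", e.Provider, e.Message, e.Code)
	if e.Cause != nil {
		msg += ": " + e.Cause.Error()
	}
	return msg
}

// Unwrap exposes Cause so errors.Is and errors.As can see through it.
func (e *TranscriptionError) Unwrap() error { return e.Cause }

// classify decides what a caller might do with a transcription error.
func classify(err error) string {
	var te *TranscriptionError
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrRateLimited):
		return "rate-limited: back off"
	case errors.As(err, &te) && te.Retryable:
		return "transient: retry"
	default:
		return "permanent: give up"
	}
}

// wrapped builds a chain: fmt wrapper -> TranscriptionError -> ErrRateLimited.
func wrapped() error {
	return fmt.Errorf("turn 3: %w", &TranscriptionError{
		Provider: "openai", Code: "429", Message: "too many requests",
		Cause: ErrRateLimited, Retryable: true,
	})
}

func main() {
	err := wrapped()
	fmt.Println(classify(err)) // rate-limited: back off
	fmt.Println(err)           // turn 3: openai: too many requests (code 429): rate limited by provider
}
```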
