stt

package

v0.9.0 (latest) Published: May 2, 2026 License: MIT Imports: 9 Imported by: 7

Repository

github.com/plexusone/omnivoice-core

Documentation ¶

Overview ¶

Package stt provides a unified interface for Speech-to-Text providers.

Index ¶

Constants
Variables
type Client
- func NewClient(providers ...Provider) *Client
- func (c *Client) Hook() observability.STTHook
- func (c *Client) Provider(name string) (Provider, bool)
- func (c *Client) SetFallbacks(names ...string)
- func (c *Client) SetHook(hook observability.STTHook)
- func (c *Client) SetPrimary(name string)
- func (c *Client) Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (*TranscriptionResult, error)
- func (c *Client) TranscribeStream(ctx context.Context, config TranscriptionConfig) (io.WriteCloser, <-chan StreamEvent, error)
type Provider
type Segment
type StreamEvent
type StreamEventType
type StreamingProvider
type Transcript
- func LoadTranscript(filePath string) (*Transcript, error)
- func NewTranscript(result *TranscriptionResult, provider, model, audioFile string, ...) *Transcript
- func (t *Transcript) SaveJSON(filePath string) error
- func (t *Transcript) ToJSON() ([]byte, error)
- func (t *Transcript) TotalDuration() time.Duration
type TranscriptMetadata
type TranscriptOptions
type TranscriptSegment
- func (s *TranscriptSegment) SegmentDuration() time.Duration
type TranscriptWord
- func (w *TranscriptWord) WordDuration() time.Duration
type TranscriptionConfig
type TranscriptionResult
type Word

Constants ¶

const TranscriptFormatVersion = "1.0"

TranscriptFormatVersion is the current version of the OmniVoice transcript format.

const TranscriptSchemaURL = "https://omnivoice.dev/schema/transcript-v1.json"

TranscriptSchemaURL is the JSON Schema URL for the transcript format.

Variables ¶

var (
	// ErrNoAvailableProvider is returned when no provider is available.
	ErrNoAvailableProvider = errors.New("stt: no available provider")

	// ErrStreamingNotSupported is returned when streaming is not supported.
	ErrStreamingNotSupported = errors.New("stt: streaming not supported by any provider")

	// ErrInvalidAudio is returned when the audio data is invalid.
	ErrInvalidAudio = errors.New("stt: invalid audio data")

	// ErrInvalidConfig is returned when the transcription config is invalid.
	ErrInvalidConfig = errors.New("stt: invalid configuration")

	// ErrAudioTooLong is returned when audio exceeds provider limits.
	ErrAudioTooLong = errors.New("stt: audio too long")

	// ErrAudioTooShort is returned when audio is too short to transcribe.
	ErrAudioTooShort = errors.New("stt: audio too short")

	// ErrRateLimited is returned when the provider rate limits the request.
	ErrRateLimited = errors.New("stt: rate limited")

	// ErrQuotaExceeded is returned when the provider quota is exceeded.
	ErrQuotaExceeded = errors.New("stt: quota exceeded")

	// ErrUnsupportedLanguage is returned when the language is not supported.
	ErrUnsupportedLanguage = errors.New("stt: unsupported language")

	// ErrUnsupportedFormat is returned when the audio format is not supported.
	ErrUnsupportedFormat = errors.New("stt: unsupported audio format")

	// ErrStreamClosed is returned when attempting to use a closed stream.
	ErrStreamClosed = errors.New("stt: stream closed")
)
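Because these are sentinel values created with errors.New, callers can match them with errors.Is even when an error has been wrapped with extra context. The following is a minimal sketch, not the package's own code: it uses local copies of two sentinels, and describe and wrapForDemo are hypothetical helpers. Whether a given provider wraps errors this way is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of two sentinels so the sketch compiles alone; real code would
// reference stt.ErrRateLimited and stt.ErrUnsupportedLanguage.
var (
	ErrRateLimited         = errors.New("stt: rate limited")
	ErrUnsupportedLanguage = errors.New("stt: unsupported language")
)

// describe is a hypothetical helper that turns an error into an action hint.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrRateLimited):
		return "transient: retry later"
	case errors.Is(err, ErrUnsupportedLanguage):
		return "permanent: try another provider"
	default:
		return "unknown"
	}
}

// wrapForDemo mimics a provider adding context with %w (an assumption about
// how providers report errors).
func wrapForDemo(err error) error {
	return fmt.Errorf("example provider: %w", err)
}

func main() {
	fmt.Println(describe(wrapForDemo(ErrRateLimited)))
	fmt.Println(describe(ErrUnsupportedLanguage))
}
```

Comparing with == would miss the wrapped case, which is why errors.Is is used throughout.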

Functions ¶

This section is empty.

Types ¶

type Client ¶

type Client struct {
	// contains filtered or unexported fields
}

Client provides a unified interface across multiple STT providers.

func NewClient ¶

func NewClient(providers ...Provider) *Client

NewClient creates a new STT client with the specified providers.

func (*Client) Hook ¶ added in v0.6.0

func (c *Client) Hook() observability.STTHook

Hook returns the current observability hook.

func (*Client) Provider ¶

func (c *Client) Provider(name string) (Provider, bool)

Provider returns a specific provider by name.

func (*Client) SetFallbacks ¶

func (c *Client) SetFallbacks(names ...string)

SetFallbacks sets the fallback provider order.

func (*Client) SetHook ¶ added in v0.6.0

func (c *Client) SetHook(hook observability.STTHook)

SetHook sets the observability hook for all STT operations.

func (*Client) SetPrimary ¶

func (c *Client) SetPrimary(name string)

SetPrimary sets the primary provider by name.

func (*Client) Transcribe ¶

func (c *Client) Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (*TranscriptionResult, error)

Transcribe sends the audio to the primary provider and falls back to the configured fallback providers only when a call fails with a permanent (non-retryable) error. Transient errors such as rate limits do not trigger fallback; the provider's own retry logic is expected to handle them.
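The fallback rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Client implementation: provider is a reduced stand-in for the Provider interface, and isTransient treats only rate limiting as transient, which is a guess at where the package draws the line.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of sentinels so the sketch compiles alone.
var (
	ErrRateLimited         = errors.New("stt: rate limited")
	ErrUnsupportedFormat   = errors.New("stt: unsupported audio format")
	ErrNoAvailableProvider = errors.New("stt: no available provider")
)

// provider is a reduced, hypothetical stand-in for stt.Provider.
type provider struct {
	name string
	err  error // returned by every call; nil means success
}

func (p provider) transcribe(audio []byte) (string, error) {
	if p.err != nil {
		return "", p.err
	}
	return "text from " + p.name, nil
}

// isTransient is an assumption: only rate limiting counts as transient here.
func isTransient(err error) bool { return errors.Is(err, ErrRateLimited) }

// transcribeWithFallback tries providers in order. A transient error is
// returned to the caller immediately; a permanent error moves on to the next
// provider in the chain.
func transcribeWithFallback(audio []byte, chain ...provider) (string, error) {
	lastErr := ErrNoAvailableProvider
	for _, p := range chain {
		text, err := p.transcribe(audio)
		if err == nil {
			return text, nil
		}
		if isTransient(err) {
			return "", err
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	text, err := transcribeWithFallback([]byte("audio"),
		provider{name: "primary", err: ErrUnsupportedFormat},
		provider{name: "backup"},
	)
	fmt.Println(text, err)
}
```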

func (*Client) TranscribeStream ¶

func (c *Client) TranscribeStream(ctx context.Context, config TranscriptionConfig) (io.WriteCloser, <-chan StreamEvent, error)

TranscribeStream attempts streaming transcription with the primary provider. It falls back to other providers if the primary does not support streaming or returns a permanent error.

type Provider ¶

type Provider interface {
	// Name returns the provider name.
	Name() string

	// Transcribe converts audio to text (batch mode).
	Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (*TranscriptionResult, error)

	// TranscribeFile transcribes audio from a file path.
	TranscribeFile(ctx context.Context, filePath string, config TranscriptionConfig) (*TranscriptionResult, error)

	// TranscribeURL transcribes audio from a URL.
	TranscribeURL(ctx context.Context, url string, config TranscriptionConfig) (*TranscriptionResult, error)
}

Provider defines the interface for STT providers.
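A test double is often the quickest way to see what the interface asks of an implementation. The sketch below declares local, trimmed copies of Provider, TranscriptionConfig, and TranscriptionResult so it compiles on its own; fixedProvider and transcribeOnce are hypothetical names, and the file and URL methods are placeholders that perform no I/O.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Reduced local copies of the package types so the sketch compiles alone.
type TranscriptionConfig struct{ Language string }
type TranscriptionResult struct{ Text, Language string }

type Provider interface {
	Name() string
	Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (*TranscriptionResult, error)
	TranscribeFile(ctx context.Context, filePath string, config TranscriptionConfig) (*TranscriptionResult, error)
	TranscribeURL(ctx context.Context, url string, config TranscriptionConfig) (*TranscriptionResult, error)
}

var ErrInvalidAudio = errors.New("stt: invalid audio data")

// fixedProvider is a hypothetical test double that always returns the same text.
type fixedProvider struct{ text string }

// Compile-time check that fixedProvider satisfies Provider.
var _ Provider = fixedProvider{}

func (p fixedProvider) Name() string { return "fixed" }

func (p fixedProvider) Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (*TranscriptionResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation before doing any work
	}
	if len(audio) == 0 {
		return nil, ErrInvalidAudio
	}
	return &TranscriptionResult{Text: p.text, Language: config.Language}, nil
}

func (p fixedProvider) TranscribeFile(ctx context.Context, filePath string, config TranscriptionConfig) (*TranscriptionResult, error) {
	return p.Transcribe(ctx, []byte(filePath), config) // placeholder: no file I/O
}

func (p fixedProvider) TranscribeURL(ctx context.Context, url string, config TranscriptionConfig) (*TranscriptionResult, error) {
	return p.Transcribe(ctx, []byte(url), config) // placeholder: no network I/O
}

// transcribeOnce runs a single batch call and returns only the text.
func transcribeOnce(p Provider, audio []byte) (string, error) {
	res, err := p.Transcribe(context.Background(), audio, TranscriptionConfig{Language: "en-US"})
	if err != nil {
		return "", err
	}
	return res.Text, nil
}

func main() {
	text, err := transcribeOnce(fixedProvider{text: "hello"}, []byte{0x01})
	fmt.Println(text, err)
}
```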

type Segment ¶

type Segment struct {
	// Text is the transcribed text for this segment.
	Text string

	// StartTime is when the segment starts.
	StartTime time.Duration

	// EndTime is when the segment ends.
	EndTime time.Duration

	// Confidence is the average confidence for this segment.
	Confidence float64

	// Speaker is the speaker identifier (if diarization enabled).
	Speaker string

	// Words contains word-level details (if enabled).
	Words []Word

	// Language is the detected language for this segment.
	Language string
}

Segment represents one segment of a transcription, such as a sentence or phrase.

type StreamEvent ¶

type StreamEvent struct {
	// Type is the event type.
	Type StreamEventType

	// Transcript is the current transcript (partial or final).
	Transcript string

	// IsFinal indicates if this is a final (non-interim) result.
	IsFinal bool

	// Segment contains segment details for final results.
	Segment *Segment

	// SpeechStarted indicates voice activity started.
	SpeechStarted bool

	// SpeechEnded indicates voice activity ended.
	SpeechEnded bool

	// Error contains any error that occurred.
	Error error
}

StreamEvent represents an event from streaming transcription.

type StreamEventType ¶

type StreamEventType string

StreamEventType identifies the type of stream event.

const (
	// EventTranscript is a transcription result (partial or final).
	EventTranscript StreamEventType = "transcript"

	// EventSpeechStart indicates the user started speaking.
	EventSpeechStart StreamEventType = "speech_start"

	// EventSpeechEnd indicates the user stopped speaking.
	EventSpeechEnd StreamEventType = "speech_end"

	// EventError indicates an error occurred.
	EventError StreamEventType = "error"
)
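A consumer of the event channel usually switches on Type and keeps only final transcripts. The sketch below uses local, trimmed copies of StreamEvent and the event constants; collectFinal and demoEvents are hypothetical helpers. It assumes the provider closes the channel when the session ends, which this page does not state.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type StreamEventType string

const (
	EventTranscript  StreamEventType = "transcript"
	EventSpeechStart StreamEventType = "speech_start"
	EventSpeechEnd   StreamEventType = "speech_end"
	EventError       StreamEventType = "error"
)

// StreamEvent is a trimmed local copy of the package type.
type StreamEvent struct {
	Type       StreamEventType
	Transcript string
	IsFinal    bool
	Error      error
}

var errDemo = errors.New("demo failure")

// collectFinal joins final transcripts with spaces and stops at the first
// error event, returning the text gathered so far together with that error.
func collectFinal(events <-chan StreamEvent) (string, error) {
	var finals []string
	for ev := range events {
		switch ev.Type {
		case EventTranscript:
			if ev.IsFinal {
				finals = append(finals, ev.Transcript)
			}
		case EventError:
			return strings.Join(finals, " "), ev.Error
		case EventSpeechStart, EventSpeechEnd:
			// Voice-activity markers carry no text; ignored here.
		}
	}
	return strings.Join(finals, " "), nil
}

// demoEvents builds a pre-filled, closed channel standing in for a live session.
func demoEvents(evs ...StreamEvent) <-chan StreamEvent {
	ch := make(chan StreamEvent, len(evs))
	for _, ev := range evs {
		ch <- ev
	}
	close(ch)
	return ch
}

func main() {
	text, err := collectFinal(demoEvents(
		StreamEvent{Type: EventSpeechStart},
		StreamEvent{Type: EventTranscript, Transcript: "hel"},
		StreamEvent{Type: EventTranscript, Transcript: "hello", IsFinal: true},
		StreamEvent{Type: EventTranscript, Transcript: "world", IsFinal: true},
		StreamEvent{Type: EventSpeechEnd},
	))
	fmt.Println(text, err)
}
```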

type StreamingProvider ¶

type StreamingProvider interface {
	Provider

	// TranscribeStream starts a streaming transcription session.
	// Returns a writer for sending audio and a channel for receiving events.
	TranscribeStream(ctx context.Context, config TranscriptionConfig) (io.WriteCloser, <-chan StreamEvent, error)
}

StreamingProvider extends Provider with real-time streaming support.
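Since StreamingProvider embeds Provider, a caller holding a Provider can discover streaming support with a type assertion. The sketch below uses reduced, hypothetical versions of both interfaces (the method sets are trimmed and TranscribeStream's signature is simplified); firstStreaming is an illustrative helper, not a package function.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrStreamingNotSupported = errors.New("stt: streaming not supported by any provider")

// Reduced, hypothetical versions of the two interfaces.
type Provider interface{ Name() string }

type StreamingProvider interface {
	Provider
	TranscribeStream() error // signature simplified for the sketch
}

type batchOnly struct{}

func (batchOnly) Name() string { return "batch" }

type streamer struct{}

func (streamer) Name() string            { return "live" }
func (streamer) TranscribeStream() error { return nil }

// firstStreaming returns the first provider that also implements
// StreamingProvider, or ErrStreamingNotSupported if none does.
func firstStreaming(providers ...Provider) (StreamingProvider, error) {
	for _, p := range providers {
		if sp, ok := p.(StreamingProvider); ok {
			return sp, nil
		}
	}
	return nil, ErrStreamingNotSupported
}

func main() {
	sp, err := firstStreaming(batchOnly{}, streamer{})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(sp.Name())
}
```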

type Transcript ¶ added in v0.9.0

type Transcript struct {
	// Schema is the JSON Schema URL for validation.
	Schema string `json:"$schema"`

	// Version is the format version (e.g., "1.0").
	Version string `json:"version"`

	// Text is the complete transcription text.
	Text string `json:"text"`

	// Language is the detected or specified language (BCP-47 code).
	Language string `json:"language,omitempty"`

	// LanguageConfidence is the confidence score for language detection (0.0-1.0).
	LanguageConfidence float64 `json:"language_confidence,omitempty"`

	// Duration is the audio duration (marshals as milliseconds in JSON).
	Duration duration.DurationMilliseconds `json:"duration_ms"`

	// Segments contains the transcript broken into segments.
	Segments []TranscriptSegment `json:"segments,omitempty"`

	// Metadata contains information about how the transcript was generated.
	Metadata TranscriptMetadata `json:"metadata"`
}

Transcript represents the OmniVoice JSON Transcript format. This is the canonical output format for transcription results.
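To make the JSON shape concrete, the sketch below round-trips a trimmed transcript. The duration package is not shown on this page, so DurationMilliseconds here is a local stand-in that encodes as an integer number of milliseconds; whether the real type uses integers is an assumption. Metadata and several optional fields are omitted, and encode, decode, and ms are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// DurationMilliseconds is a local stand-in for duration.DurationMilliseconds.
type DurationMilliseconds time.Duration

func (d DurationMilliseconds) MarshalJSON() ([]byte, error) {
	return []byte(strconv.FormatInt(time.Duration(d).Milliseconds(), 10)), nil
}

func (d *DurationMilliseconds) UnmarshalJSON(b []byte) error {
	n, err := strconv.ParseInt(string(b), 10, 64)
	if err != nil {
		return err
	}
	*d = DurationMilliseconds(time.Duration(n) * time.Millisecond)
	return nil
}

// Segment and Transcript are trimmed copies that keep the documented JSON tags.
type Segment struct {
	Text  string               `json:"text"`
	Start DurationMilliseconds `json:"start_ms"`
	End   DurationMilliseconds `json:"end_ms"`
}

type Transcript struct {
	Schema   string               `json:"$schema"`
	Version  string               `json:"version"`
	Text     string               `json:"text"`
	Language string               `json:"language,omitempty"`
	Duration DurationMilliseconds `json:"duration_ms"`
	Segments []Segment            `json:"segments,omitempty"`
}

// ms converts a millisecond count into the stand-in duration type.
func ms(n int64) DurationMilliseconds {
	return DurationMilliseconds(time.Duration(n) * time.Millisecond)
}

func encode(t Transcript) (string, error) {
	b, err := json.Marshal(t)
	return string(b), err
}

func decode(s string) (Transcript, error) {
	var t Transcript
	err := json.Unmarshal([]byte(s), &t)
	return t, err
}

func main() {
	out, err := encode(Transcript{
		Schema:   "https://omnivoice.dev/schema/transcript-v1.json",
		Version:  "1.0",
		Text:     "hello world",
		Duration: ms(1500),
		Segments: []Segment{{Text: "hello world", Start: ms(0), End: ms(1500)}},
	})
	fmt.Println(out, err)
}
```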

func LoadTranscript ¶ added in v0.9.0

func LoadTranscript(filePath string) (*Transcript, error)

LoadTranscript reads a transcript from a JSON file.

func NewTranscript ¶ added in v0.9.0

func NewTranscript(result *TranscriptionResult, provider, model, audioFile string, config *TranscriptionConfig) *Transcript

NewTranscript creates a Transcript from a TranscriptionResult.

func (*Transcript) SaveJSON ¶ added in v0.9.0

func (t *Transcript) SaveJSON(filePath string) error

SaveJSON writes the transcript to a JSON file.

func (*Transcript) ToJSON ¶ added in v0.9.0

func (t *Transcript) ToJSON() ([]byte, error)

ToJSON serializes the transcript to JSON with indentation.

func (*Transcript) TotalDuration ¶ added in v0.9.0

func (t *Transcript) TotalDuration() time.Duration

TotalDuration returns the total duration as a time.Duration.

type TranscriptMetadata ¶ added in v0.9.0

type TranscriptMetadata struct {
	// Provider is the STT provider used (e.g., "deepgram", "openai").
	Provider string `json:"provider"`

	// Model is the provider-specific model used (if specified).
	Model string `json:"model,omitempty"`

	// CreatedAt is the ISO 8601 timestamp when the transcript was created.
	CreatedAt string `json:"created_at"`

	// AudioFile is the original audio file path or URL (if available).
	AudioFile string `json:"audio_file,omitempty"`

	// Options contains the transcription options that were used.
	Options *TranscriptOptions `json:"options,omitempty"`
}

TranscriptMetadata contains provenance information about the transcript.

type TranscriptOptions ¶ added in v0.9.0

type TranscriptOptions struct {
	// Language is the requested language (if specified).
	Language string `json:"language,omitempty"`

	// EnablePunctuation indicates if punctuation was enabled.
	EnablePunctuation bool `json:"enable_punctuation,omitempty"`

	// EnableWordTimestamps indicates if word timestamps were enabled.
	EnableWordTimestamps bool `json:"enable_word_timestamps,omitempty"`

	// EnableSpeakerDiarization indicates if speaker diarization was enabled.
	EnableSpeakerDiarization bool `json:"enable_speaker_diarization,omitempty"`
}

TranscriptOptions records the options used for transcription.

type TranscriptSegment ¶ added in v0.9.0

type TranscriptSegment struct {
	// Text is the transcribed text for this segment.
	Text string `json:"text"`

	// Start is the start time (marshals as milliseconds in JSON).
	Start duration.DurationMilliseconds `json:"start_ms"`

	// End is the end time (marshals as milliseconds in JSON).
	End duration.DurationMilliseconds `json:"end_ms"`

	// Speaker is the speaker identifier (if diarization is enabled).
	Speaker string `json:"speaker,omitempty"`

	// Confidence is the average confidence score for this segment (0.0-1.0).
	Confidence float64 `json:"confidence,omitempty"`

	// Language is the detected language for this segment (if different from overall).
	Language string `json:"language,omitempty"`

	// Words contains word-level details (if word timestamps are enabled).
	Words []TranscriptWord `json:"words,omitempty"`
}

TranscriptSegment represents a segment of the transcript (sentence, phrase, or utterance).

func (*TranscriptSegment) SegmentDuration ¶ added in v0.9.0

func (s *TranscriptSegment) SegmentDuration() time.Duration

SegmentDuration returns the duration of a segment as time.Duration.
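Per-segment durations make simple analytics easy, for example total speaking time per speaker. The sketch below uses a trimmed local TranscriptSegment with plain time.Duration fields and assumes SegmentDuration is End minus Start, which the page implies but does not spell out; talkTime and seg are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// TranscriptSegment is a trimmed local copy using plain time.Duration fields.
type TranscriptSegment struct {
	Speaker string
	Start   time.Duration
	End     time.Duration
}

// SegmentDuration is assumed to be End minus Start.
func (s TranscriptSegment) SegmentDuration() time.Duration { return s.End - s.Start }

// talkTime sums segment durations per speaker identifier.
func talkTime(segments []TranscriptSegment) map[string]time.Duration {
	totals := make(map[string]time.Duration)
	for _, s := range segments {
		totals[s.Speaker] += s.SegmentDuration()
	}
	return totals
}

// seg builds a segment from millisecond offsets.
func seg(speaker string, startMS, endMS int64) TranscriptSegment {
	return TranscriptSegment{
		Speaker: speaker,
		Start:   time.Duration(startMS) * time.Millisecond,
		End:     time.Duration(endMS) * time.Millisecond,
	}
}

func main() {
	totals := talkTime([]TranscriptSegment{
		seg("A", 0, 1000),
		seg("B", 1000, 1800),
		seg("A", 1800, 2300),
	})
	fmt.Println(totals["A"], totals["B"])
}
```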

type TranscriptWord ¶ added in v0.9.0

type TranscriptWord struct {
	// Text is the transcribed word.
	Text string `json:"text"`

	// Start is the start time (marshals as milliseconds in JSON).
	Start duration.DurationMilliseconds `json:"start_ms"`

	// End is the end time (marshals as milliseconds in JSON).
	End duration.DurationMilliseconds `json:"end_ms"`

	// Speaker is the speaker identifier (if diarization is enabled).
	Speaker string `json:"speaker,omitempty"`

	// Confidence is the recognition confidence (0.0-1.0).
	Confidence float64 `json:"confidence,omitempty"`
}

TranscriptWord represents a single word with timing information.

func (*TranscriptWord) WordDuration ¶ added in v0.9.0

func (w *TranscriptWord) WordDuration() time.Duration

WordDuration returns the duration of a word as time.Duration.

type TranscriptionConfig ¶

type TranscriptionConfig struct {
	// Language is the BCP-47 language code (e.g., "en-US").
	// Leave empty for automatic detection.
	Language string

	// Model is the provider-specific model identifier (optional).
	Model string

	// SampleRate is the audio sample rate in Hz.
	SampleRate int

	// Channels is the number of audio channels (1 = mono, 2 = stereo).
	Channels int

	// Encoding is the audio encoding ("pcm", "mp3", "wav", "opus", "flac").
	Encoding string

	// EnablePunctuation adds punctuation to transcription.
	EnablePunctuation bool

	// EnableWordTimestamps includes word-level timestamps.
	EnableWordTimestamps bool

	// EnableSpeakerDiarization identifies different speakers.
	EnableSpeakerDiarization bool

	// MaxSpeakers is the maximum number of speakers to detect (for diarization).
	MaxSpeakers int

	// Keywords are words/phrases to boost recognition accuracy.
	Keywords []string

	// VocabularyID is a provider-specific custom vocabulary ID.
	VocabularyID string

	// Extensions holds provider-specific settings.
	// Keys should be namespaced by provider (e.g., "deepgram.tier", "elevenlabs.num_speakers").
	// Use provider-specific helper functions for type-safe access.
	Extensions map[string]any

	// Hook provides observability for STT operations.
	// If nil, no hooks are called.
	Hook observability.STTHook
}

TranscriptionConfig configures an STT transcription request.
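Extensions is an untyped map, so reads need a checked type assertion. The sketch below shows one way a typed accessor might look, using a trimmed local TranscriptionConfig. extString and extInt are hypothetical and are not the provider helper functions the field comment refers to; the key names come from that comment, while the values are made up.

```go
package main

import "fmt"

// TranscriptionConfig is a trimmed local copy of the package type.
type TranscriptionConfig struct {
	Language   string
	Extensions map[string]any
}

// extString returns the string stored under key, or fallback when the key is
// missing or holds a different type. Reading from a nil map is safe in Go.
func extString(cfg TranscriptionConfig, key, fallback string) string {
	if v, ok := cfg.Extensions[key].(string); ok {
		return v
	}
	return fallback
}

// extInt does the same for int values.
func extInt(cfg TranscriptionConfig, key string, fallback int) int {
	if v, ok := cfg.Extensions[key].(int); ok {
		return v
	}
	return fallback
}

func main() {
	cfg := TranscriptionConfig{
		Language: "en-US",
		Extensions: map[string]any{
			"deepgram.tier":           "enhanced", // hypothetical value
			"elevenlabs.num_speakers": 2,          // hypothetical value
		},
	}
	fmt.Println(extString(cfg, "deepgram.tier", "base"))
	fmt.Println(extInt(cfg, "elevenlabs.num_speakers", 1))
}
```

Namespacing keys by provider keeps one provider's settings from colliding with another's when the same config is passed down a fallback chain.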

type TranscriptionResult ¶

type TranscriptionResult struct {
	// Text is the full transcription text.
	Text string

	// Segments contains segment-level details.
	Segments []Segment

	// Language is the detected language.
	Language string

	// LanguageConfidence is the confidence in language detection.
	LanguageConfidence float64

	// Duration is the audio duration.
	Duration time.Duration
}

TranscriptionResult contains the result of an STT transcription.

type Word ¶

type Word struct {
	// Text is the transcribed word.
	Text string

	// StartTime is when the word starts.
	StartTime time.Duration

	// EndTime is when the word ends.
	EndTime time.Duration

	// Confidence is the recognition confidence (0.0 to 1.0).
	Confidence float64

	// Speaker is the speaker identifier (if diarization enabled).
	Speaker string
}

Word represents a single transcribed word with timing.
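Word-level confidence can be used to flag words for human review. The sketch below walks Segments and Words on trimmed local copies of the result types; lowConfidenceWords is a hypothetical helper, and it assumes word details were requested, since Words is only populated when enabled.

```go
package main

import "fmt"

// Trimmed local copies of the package types; timing and speaker fields omitted.
type Word struct {
	Text       string
	Confidence float64
}

type Segment struct {
	Words []Word
}

type TranscriptionResult struct {
	Segments []Segment
}

// lowConfidenceWords returns, in order, the text of every word whose
// confidence is strictly below threshold. A nil result yields no words.
func lowConfidenceWords(r *TranscriptionResult, threshold float64) []string {
	var out []string
	if r == nil {
		return out
	}
	for _, seg := range r.Segments {
		for _, w := range seg.Words {
			if w.Confidence < threshold {
				out = append(out, w.Text)
			}
		}
	}
	return out
}

func main() {
	res := &TranscriptionResult{Segments: []Segment{
		{Words: []Word{{Text: "hello", Confidence: 0.98}, {Text: "wurld", Confidence: 0.41}}},
		{Words: []Word{{Text: "again", Confidence: 0.75}}},
	}}
	fmt.Println(lowConfidenceWords(res, 0.8))
}
```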

Directories ¶

Path	Synopsis
providertest	Package providertest provides conformance tests for STT provider implementations.
