Documentation ¶
Overview ¶
Package stt provides speech-to-text services for converting audio to text.
The package defines a common Service interface that abstracts STT providers, enabling voice AI applications to transcribe speech from users.
Architecture ¶
The package provides:
- Service interface for STT providers
- TranscriptionConfig for audio format configuration
- Multiple provider implementations (OpenAI Whisper, etc.)
Usage ¶
Basic usage with OpenAI Whisper:
service := stt.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
text, err := service.Transcribe(ctx, audioData, stt.TranscriptionConfig{
	Format:     "pcm",
	SampleRate: 16000,
	Channels:   1,
	Language:   "en",
})
if err != nil {
	log.Fatal(err)
}
fmt.Println("User said:", text)
Available Providers ¶
The package includes implementations for:
- OpenAI Whisper (whisper-1 model)
- Additional providers can be added by implementing the Service interface
Index ¶
Constants ¶
const (
	// Default audio settings.
	DefaultSampleRate = 16000
	DefaultChannels   = 1
	DefaultBitDepth   = 16

	// Common audio formats.
	FormatPCM = "pcm"
	FormatWAV = "wav"
	FormatMP3 = "mp3"
)
const (
// ModelWhisper1 is the OpenAI Whisper model for transcription.
ModelWhisper1 = "whisper-1"
)
Variables ¶
var (
	// ErrEmptyAudio is returned when audio data is empty.
	ErrEmptyAudio = errors.New("audio data is empty")

	// ErrRateLimited is returned when the provider rate limits requests.
	ErrRateLimited = errors.New("rate limited by provider")

	// ErrInvalidFormat is returned when the audio format is not supported.
	ErrInvalidFormat = errors.New("unsupported audio format")

	// ErrAudioTooShort is returned when audio is too short to transcribe.
	ErrAudioTooShort = errors.New("audio too short to transcribe")
)
Common errors for STT services.
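Callers can branch on these sentinels with errors.Is. A minimal sketch (the handling shown for each case is illustrative, not prescribed by the package):

text, err := service.Transcribe(ctx, audioData, stt.DefaultTranscriptionConfig())
switch {
case errors.Is(err, stt.ErrAudioTooShort):
	// Not enough audio yet; keep buffering and try again later.
case errors.Is(err, stt.ErrRateLimited):
	// Provider throttled the request; retry after a back-off.
case err != nil:
	log.Fatal(err)
default:
	fmt.Println("User said:", text)
}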
Functions ¶
func WrapPCMAsWAV ¶
func WrapPCMAsWAV(pcmData []byte, sampleRate, channels, bitsPerSample int) []byte
WrapPCMAsWAV wraps raw PCM audio data in a WAV header. This is necessary for APIs like OpenAI Whisper that expect file uploads.
Parameters:
- pcmData: Raw PCM audio bytes (little-endian, signed)
- sampleRate: Sample rate in Hz (e.g., 16000)
- channels: Number of channels (1=mono, 2=stereo)
- bitsPerSample: Bits per sample (typically 16)
Returns a byte slice containing WAV-formatted audio.
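For example, audio captured as raw PCM can be wrapped before transcription. A minimal sketch, assuming the declaration implied by the documented parameters and using the package defaults (16 kHz, mono, 16-bit); pcmData and service come from surrounding application code:

wavData := stt.WrapPCMAsWAV(pcmData, stt.DefaultSampleRate, stt.DefaultChannels, stt.DefaultBitDepth)
text, err := service.Transcribe(ctx, wavData, stt.TranscriptionConfig{
	Format:     stt.FormatWAV,
	SampleRate: stt.DefaultSampleRate,
	Channels:   stt.DefaultChannels,
})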
Types ¶
type OpenAIOption ¶
type OpenAIOption func(*OpenAIService)
OpenAIOption configures the OpenAI STT service.
func WithOpenAIBaseURL ¶
func WithOpenAIBaseURL(url string) OpenAIOption
WithOpenAIBaseURL sets a custom base URL (for testing or proxies).
func WithOpenAIClient ¶
func WithOpenAIClient(client *http.Client) OpenAIOption
WithOpenAIClient sets a custom HTTP client.
func WithOpenAIModel ¶
func WithOpenAIModel(model string) OpenAIOption
WithOpenAIModel sets the STT model to use.
type OpenAIService ¶
type OpenAIService struct {
// contains filtered or unexported fields
}
OpenAIService implements STT using OpenAI's Whisper API.
func NewOpenAI ¶
func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService
NewOpenAI creates an OpenAI STT service using Whisper.
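Options can be combined at construction time, for example to pin the model, route through a proxy, or bound request time. A sketch; the base URL below is a placeholder, not a real endpoint:

service := stt.NewOpenAI(
	os.Getenv("OPENAI_API_KEY"),
	stt.WithOpenAIModel(stt.ModelWhisper1),
	stt.WithOpenAIBaseURL("https://llm-proxy.example.internal/v1"), // placeholder proxy URL
	stt.WithOpenAIClient(&http.Client{Timeout: 30 * time.Second}),
)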
func (*OpenAIService) Name ¶
func (s *OpenAIService) Name() string
Name returns the provider identifier.
func (*OpenAIService) SupportedFormats ¶
func (s *OpenAIService) SupportedFormats() []string
SupportedFormats returns audio formats supported by OpenAI Whisper.
func (*OpenAIService) Transcribe ¶
func (s *OpenAIService) Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (string, error)
Transcribe converts audio to text using OpenAI's Whisper API.
type Service ¶
type Service interface {
// Name returns the provider identifier (for logging/debugging).
Name() string
// Transcribe converts audio to text.
// Returns the transcribed text or an error if transcription fails.
Transcribe(ctx context.Context, audio []byte, config TranscriptionConfig) (string, error)
// SupportedFormats returns supported audio input formats.
// Common values: "pcm", "wav", "mp3", "m4a", "webm"
SupportedFormats() []string
}
Service transcribes audio to text. This interface abstracts different STT providers (OpenAI Whisper, Google, etc.), enabling voice AI applications to use any provider interchangeably.
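Because callers depend only on Service, application code can be written once and handed any provider. A sketch; transcribeTurn is a hypothetical helper, not part of the package:

// transcribeTurn works with any stt.Service implementation.
func transcribeTurn(ctx context.Context, svc stt.Service, audio []byte) (string, error) {
	cfg := stt.DefaultTranscriptionConfig()
	cfg.Language = "en"
	text, err := svc.Transcribe(ctx, audio, cfg)
	if err != nil {
		return "", fmt.Errorf("%s transcription failed: %w", svc.Name(), err)
	}
	return text, nil
}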
type TranscriptionConfig ¶
type TranscriptionConfig struct {
// Format is the audio format ("pcm", "wav", "mp3").
// Default: "pcm"
Format string
// SampleRate is the audio sample rate in Hz.
// Default: 16000
SampleRate int
// Channels is the number of audio channels (1=mono, 2=stereo).
// Default: 1
Channels int
// BitDepth is the bits per sample for PCM audio.
// Default: 16
BitDepth int
// Language is a hint for the transcription language (e.g., "en", "es").
// Optional - improves accuracy if provided.
Language string
// Model is the STT model to use (provider-specific).
// For OpenAI: "whisper-1"
Model string
// Prompt is a text prompt to guide transcription (provider-specific).
// Can improve accuracy for domain-specific vocabulary.
Prompt string
}
TranscriptionConfig configures speech-to-text transcription.
func DefaultTranscriptionConfig ¶
func DefaultTranscriptionConfig() TranscriptionConfig
DefaultTranscriptionConfig returns sensible defaults for transcription.
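Typical usage starts from the defaults and overrides only what differs, assuming the function returns the per-field defaults documented above:

cfg := stt.DefaultTranscriptionConfig() // "pcm", 16 kHz, mono, 16-bit
cfg.Language = "es"
cfg.Prompt = "Vocabulary: Kubernetes, gRPC, WebRTC" // domain hint for the provider
text, err := service.Transcribe(ctx, audioData, cfg)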
type TranscriptionError ¶
type TranscriptionError struct {
// Provider is the STT provider name.
Provider string
// Code is the provider-specific error code.
Code string
// Message is a human-readable error message.
Message string
// Cause is the underlying error, if any.
Cause error
// Retryable indicates whether the request can be retried.
Retryable bool
}
TranscriptionError represents an error during transcription.
func NewTranscriptionError ¶
func NewTranscriptionError(provider, code, message string, cause error, retryable bool) *TranscriptionError
NewTranscriptionError creates a new TranscriptionError.
func (*TranscriptionError) Error ¶
func (e *TranscriptionError) Error() string
Error implements the error interface.
func (*TranscriptionError) Is ¶
func (e *TranscriptionError) Is(target error) bool
Is implements error matching for errors.Is.
func (*TranscriptionError) Unwrap ¶
func (e *TranscriptionError) Unwrap() error
Unwrap returns the underlying error.
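Callers that want provider detail or retry hints can unwrap to the concrete type with errors.As. A minimal sketch; the retry decision itself is left to the application:

text, err := service.Transcribe(ctx, audioData, stt.DefaultTranscriptionConfig())
var terr *stt.TranscriptionError
if errors.As(err, &terr) {
	log.Printf("provider=%s code=%s retryable=%t: %s",
		terr.Provider, terr.Code, terr.Retryable, terr.Message)
	if terr.Retryable {
		// Safe to retry, e.g. after a short back-off.
	}
}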