multimodal

command

v0.62.0 | Published: Sep 12, 2025 | License: MIT | Imports: 19 | Imported by: 0

Repository

github.com/XiaoConstantine/dspy-go


Multimodal Gemini Example

This example demonstrates how to use Google's Gemini model for multimodal tasks including image analysis, vision-based question answering, and multimodal chat capabilities.

Features

Image Analysis: Analyze images and answer questions about them
Vision Question Answering: Detailed visual analysis with structured outputs
Multimodal Chat: Conversational interactions with images
Streaming Support: Real-time streaming of multimodal responses
Multiple Images: Compare and analyze multiple images simultaneously

Prerequisites

Google Gemini API Key: You need a valid API key from Google AI Studio
Go 1.21+: Make sure you have Go installed
Sample Images: Place sample images in this directory (optional - placeholder will be created)

Setup

Set your API key:

export GEMINI_API_KEY="your-api-key-here"

Install dependencies:
```
go mod download
```
Add sample images (optional):
- Place a JPEG image named sample.jpg in this directory
- The example creates a placeholder image if none is found

Usage

Run the example:

go run main.go

Examples Included

1. Basic Image Analysis

// Analyze an image and answer questions about it
inputs := map[string]any{
    "image":    core.NewImageBlock(imageData, "image/jpeg"),
    "question": "What objects can you see in this image?",
}

2. Vision Question Answering

// Structured analysis with observations and answers
inputs := map[string]any{
    "image": core.NewImageBlock(imageData, "image/jpeg"),
    "task":  "Count the number of people in this image and describe what they are doing.",
}

3. Multimodal Chat

// Conversational interactions with images
inputs := map[string]any{
    "image":   core.NewImageBlock(imageData, "image/jpeg"),
    "message": "Hello! Can you tell me what you see in this image?",
}

4. Streaming Multimodal Generation

// Real-time streaming responses
content := []core.ContentBlock{
    core.NewTextBlock("Please describe this image in detail..."),
    core.NewImageBlock(imageData, "image/jpeg"),
}
streamResp, err := llm.StreamGenerateWithContent(ctx, content)

5. Multiple Images Analysis

// Compare and analyze multiple images
content := []core.ContentBlock{
    core.NewTextBlock("Compare these two images:"),
    core.NewImageBlock(imageData1, "image/jpeg"),
    core.NewImageBlock(imageData2, "image/jpeg"),
}

Supported Image Formats

JPEG (.jpg, .jpeg)
PNG (.png)
WebP (.webp)
GIF (.gif)
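
When the file extension is missing or unreliable, the MIME type passed to `core.NewImageBlock` can be sniffed from the image bytes. The sketch below uses `http.DetectContentType` from the standard library and checks the result against the four formats listed above. The helper name `detectImageMIME` is an assumption for illustration and is not part of dspy-go.

```go
package main

import (
	"fmt"
	"net/http"
)

// supportedImageTypes mirrors the formats listed in this README.
var supportedImageTypes = map[string]bool{
	"image/jpeg": true,
	"image/png":  true,
	"image/webp": true,
	"image/gif":  true,
}

// detectImageMIME sniffs the leading bytes of data and returns the MIME type,
// or an error if the type is not one of the supported image formats.
// (Hypothetical helper, not a dspy-go function.)
func detectImageMIME(data []byte) (string, error) {
	mimeType := http.DetectContentType(data)
	if !supportedImageTypes[mimeType] {
		return "", fmt.Errorf("unsupported image format %q", mimeType)
	}
	return mimeType, nil
}

func main() {
	// The 8-byte PNG signature is enough for sniffing.
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	mimeType, err := detectImageMIME(pngHeader)
	fmt.Println(mimeType, err)

	// Non-image input is rejected with an error.
	_, err = detectImageMIME([]byte("plain text"))
	fmt.Println(err)
}
```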

Key Components

ContentBlock Types

core.FieldTypeText: Text content
core.FieldTypeImage: Image content
core.FieldTypeAudio: Audio content (future support)

Signatures

The example defines several signatures for different multimodal tasks:

ImageAnalysisSignature: Basic image analysis
VisionQASignature: Structured vision Q&A
MultiModalChatSignature: Conversational interactions

LLM Capabilities

Gemini supports these multimodal capabilities:

core.CapabilityMultimodal: General multimodal support
core.CapabilityVision: Image/vision processing
core.CapabilityAudio: Audio processing

Error Handling

The example handles the following error cases:

Missing API keys
Unsupported image formats
LLM API errors
File loading issues

Customization

Custom Signatures

Create your own multimodal signatures:

customSignature := core.NewSignature(
    []core.InputField{
        {Field: core.NewImageField("image", core.WithDescription("Input image"))},
        {Field: core.NewTextField("prompt", core.WithDescription("Custom prompt"))},
    },
    []core.OutputField{
        {Field: core.NewTextField("result", core.WithDescription("Analysis result"))},
    },
).WithInstruction("Your custom instruction here")

Image Processing

Process images from different sources:

// The three snippets below are alternatives; check err before using the data.

// From file
imageData, err := os.ReadFile("path/to/image.jpg")
imageBlock := core.NewImageBlock(imageData, "image/jpeg")

// From base64
decodedData, err := base64.StdEncoding.DecodeString(base64String)
imageBlock := core.NewImageBlock(decodedData, "image/png")

// From URL (fetch the bytes first and close the response body)
resp, err := http.Get("https://example.com/image.jpg")
if err != nil {
    return err
}
defer resp.Body.Close()
imageData, err := io.ReadAll(resp.Body)
imageBlock := core.NewImageBlock(imageData, "image/jpeg")
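
Base64 images often arrive wrapped in a data URL (`data:image/png;base64,...`) rather than as a bare string. As a sketch of how such input could be unpacked into the bytes and MIME type that `core.NewImageBlock` expects, the hypothetical helper `decodeDataURL` below uses only the standard library. It handles base64 data URLs only and is not part of dspy-go.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// decodeDataURL splits a "data:<mime>;base64,<payload>" string into the
// decoded bytes and the MIME type. (Hypothetical helper for illustration.)
func decodeDataURL(s string) ([]byte, string, error) {
	rest, ok := strings.CutPrefix(s, "data:")
	if !ok {
		return nil, "", errors.New("not a data URL")
	}
	meta, payload, ok := strings.Cut(rest, ",")
	if !ok {
		return nil, "", errors.New("data URL has no payload")
	}
	mimeType, ok := strings.CutSuffix(meta, ";base64")
	if !ok {
		return nil, "", errors.New("only base64 data URLs are handled")
	}
	data, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return nil, "", fmt.Errorf("decode payload: %w", err)
	}
	return data, mimeType, nil
}

func main() {
	// The payload here is just the 8-byte PNG signature, not a full image.
	data, mimeType, err := decodeDataURL("data:image/png;base64,iVBORw0KGgo=")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d bytes\n", mimeType, len(data))
}
```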

Performance Tips

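Image Size: Resize or compress large images before sending them to reduce upload time and memory use
Batch Processing: Send related images together in one request when possible, as in the Multiple Images Analysis example
Streaming: Use streaming for long-running analysis tasks so output appears as it is generated
Caching: Cache analysis results so that repeated queries on the same image and prompt do not trigger new API calls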
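
The caching tip can be sketched as an in-memory map keyed by a hash of the prompt and image bytes. Everything here is an assumption for illustration: the `analysisCache` type, the `cacheKey` helper, and the `analyze` callback stand in for whatever call your application makes to the LLM. The sketch has no eviction and does not deduplicate concurrent misses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// analysisCache memoizes results keyed by a hash of the prompt and image.
// (Hypothetical type; dspy-go may offer its own caching.)
type analysisCache struct {
	mu      sync.Mutex
	results map[string]string
}

func newAnalysisCache() *analysisCache {
	return &analysisCache{results: make(map[string]string)}
}

// cacheKey hashes the prompt and image together. The prompt is prefixed with
// its length so that different (prompt, image) splits cannot collide.
func cacheKey(image []byte, prompt string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%d:", len(prompt))
	h.Write([]byte(prompt))
	h.Write(image)
	return hex.EncodeToString(h.Sum(nil))
}

// getOrCompute returns the cached result, or calls analyze on a miss and
// stores a successful result. Errors are not cached.
func (c *analysisCache) getOrCompute(image []byte, prompt string, analyze func() (string, error)) (string, error) {
	key := cacheKey(image, prompt)

	c.mu.Lock()
	cached, ok := c.results[key]
	c.mu.Unlock()
	if ok {
		return cached, nil
	}

	result, err := analyze()
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.results[key] = result
	c.mu.Unlock()
	return result, nil
}

func main() {
	cache := newAnalysisCache()
	calls := 0
	analyze := func() (string, error) {
		calls++ // stands in for an LLM request
		return "a cat on a sofa", nil
	}

	image := []byte{0xFF, 0xD8, 0xFF}
	for i := 0; i < 3; i++ {
		result, _ := cache.getOrCompute(image, "What is in this image?", analyze)
		fmt.Println(result)
	}
	fmt.Println("analyze calls:", calls)
}
```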

Advanced Usage

Chain of Thought with Images

cotSignature := core.NewSignature(
    []core.InputField{
        {Field: core.NewImageField("image", core.WithDescription("Image to analyze"))},
        {Field: core.NewTextField("question", core.WithDescription("Question about the image"))},
    },
    []core.OutputField{
        {Field: core.NewTextField("reasoning", core.WithDescription("Step-by-step reasoning"))},
        {Field: core.NewTextField("answer", core.WithDescription("Final answer"))},
    },
).WithInstruction("Analyze the image step by step, showing your reasoning process")

Multi-turn Conversations

// Maintain conversation context with images
conversation := []core.ContentBlock{
    core.NewTextBlock("Previous conversation context..."),
    core.NewImageBlock(imageData, "image/jpeg"),
    core.NewTextBlock("User: What do you see in this image?"),
    core.NewTextBlock("Assistant: I can see..."),
    core.NewTextBlock("User: Can you tell me more about the colors?"),
}

Troubleshooting

Common Issues

API Key Issues:
- Make sure your GEMINI_API_KEY is set correctly
- Verify your API key has the necessary permissions
Image Loading Issues:
- Check file paths and permissions
- Ensure image formats are supported
- Verify image files aren't corrupted
Memory Issues:
- Large images may cause memory issues
- Consider resizing images before processing
Rate Limiting:
- Implement retry logic for API calls
- Add delays between requests if needed

Debug Mode

Enable debug logging to see detailed request/response information:

// Import the logging package at the top of the file:
import "github.com/XiaoConstantine/dspy-go/pkg/logging"

// Then, inside main, initialize logging with DEBUG level:

output := logging.NewConsoleOutput(true, logging.WithColor(true))
logger := logging.NewLogger(logging.Config{
    Severity: logging.DEBUG,
    Outputs:  []logging.Output{output},
})
logging.SetLogger(logger)

This will show detailed information about:

Multimodal content generation
LLM completion responses
Response parsing and field extraction
Token usage statistics

Next Steps

Try different image types and formats
Experiment with different prompt styles
Combine multimodal analysis with other DSPy modules
Build applications that use multimodal capabilities

For more examples and documentation, see the main dspy-go repository: github.com/XiaoConstantine/dspy-go
