GhostFS - A File System Simulator

A powerful file system emulator for testing migration tools, file sync applications, and cloud storage integrations without the overhead of real file systems or expensive APIs.
What is GhostFS?
GhostFS is a SQL-backed file system emulator that mimics a cloud storage API such as Dropbox's. Instead of dealing with real files and folders, GhostFS creates a virtual file system stored in a DuckDB database that you can traverse, query, and manipulate through a REST API.
How GhostFS Works
GhostFS generates file systems probabilistically using deterministic random number generation (RNG) seeds. This means:
- Deterministic Generation: Same seed = same file system structure every time
- Probabilistic Distribution: Files and folders are created based on configurable probability distributions
- Write Queue Persistence: Generated structures are efficiently batched and persisted to DuckDB
- Configurable Performance: Tune write queue settings for your specific needs
The Generation Process:
Master Seed → Folder Seeds → Child Generation → Write Queue → DuckDB
      │             │                │               │           │
Deterministic   Per-Folder     Probabilistic      Batched    Persistent
     RNG        RNG Seeds      File/Folder        Writes      Storage
                               Generation
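The seed chain above can be sketched in Go. The FNV-based derivation, function names, and bounds below are illustrative assumptions, not GhostFS's actual internals:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// folderSeed derives a stable per-folder seed from the master seed and the
// folder's path, so the same master seed always reproduces the same tree.
func folderSeed(masterSeed int64, folderPath string) int64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d/%s", masterSeed, folderPath)
	return int64(h.Sum64())
}

// childCounts draws folder/file counts for one folder from its seeded RNG,
// within bounds mirroring min/max_child_folders and min/max_child_files.
func childCounts(seed int64, minFolders, maxFolders, minFiles, maxFiles int) (folders, files int) {
	rng := rand.New(rand.NewSource(seed))
	folders = minFolders + rng.Intn(maxFolders-minFolders+1)
	files = minFiles + rng.Intn(maxFiles-minFiles+1)
	return
}

func main() {
	seed := folderSeed(42, "/root/folder1")
	f1, n1 := childCounts(seed, 2, 8, 5, 15)
	f2, n2 := childCounts(seed, 2, 8, 5, 15)
	fmt.Println(f1 == f2 && n1 == n2) // same seed, same structure
}
```

Because each folder's RNG is seeded independently, any subtree can be regenerated on its own without replaying the whole file system.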
Write Queue System:
- Batch Size: Number of operations to accumulate before flushing to disk
- Flush Interval: Maximum time to wait before flushing (safety net)
- Performance Trade-offs: Higher flush frequency = safer but slower; lower flush frequency = faster but riskier
Perfect for:
- Testing file migration tools (like ByteWave) without moving real data
- Simulating massive file systems with millions of files and folders
- Prototyping cloud storage integrations with controllable environments
- Load testing file system operations at scale
Why GhostFS?
The Problem
- Testing file migration tools requires terabytes of real data
- Cloud APIs have rate limits and costs during development
- Creating realistic folder structures manually is time-consuming
- Real file systems are slow for large-scale testing
The Solution
- Instant file system generation with configurable depth and complexity
- No storage overhead - millions of "files" in a lightweight database
- Full API control - simulate network issues, auth failures, rate limits
- Realistic testing without the infrastructure costs
Features
Current (v0.1)
- DuckDB Backend - Fast, embedded SQL database
- Intelligent Seeding - Generate realistic folder structures
- Multi-FS Mode - Primary + secondary tables for migration testing
- Probabilistic Subsets - Secondary tables with configurable dst_prob
- REST API - Standard HTTP endpoints for file operations
- Batch Operations - Create/delete multiple items at once
- Table Management - List and manage multiple file systems
- Access Tracking - Automatic tracking of accessed folders via a checked flag
- Write Queues - Non-blocking batch updates for optimal performance
- Dynamic Configuration - Runtime tuning of batch sizes and flush intervals
Coming Soon (v0.2+)
- Network Simulation - Configurable latency, jitter, timeouts
- Auth Simulation - Token expiration, permission failures
- Rate Limiting - Simulate API throttling
- Metrics & Analytics - Track usage patterns
- Plugin System - Extend with custom behaviors
- Auto-Scaling Write Queues - Real-time adjustment of batch sizes and flush intervals based on load patterns and risk profiles
Architecture & File System Modes
Single-FS vs Multi-FS Mode
GhostFS operates in two distinct modes:
Single-FS Mode (Default)
- Uses only the primary table (nodes)
- Perfect for basic file system testing
- All items exist in one unified file system
Multi-FS Mode (Advanced)
- Uses the primary table plus one or more secondary tables
- Simulates source → destination migration scenarios
- Secondary tables contain probabilistic subsets of the primary table
- Each item has a dst_prob chance of appearing in secondary tables
How Secondary Tables Work
When generating a file system in Multi-FS mode:
- The primary table is populated with the complete file system using deterministic RNG seeds
- Secondary tables are populated by iterating through the primary table's items
- Each item has a probabilistic chance (based on dst_prob) of being included
- Write queues efficiently batch and persist all generated structures to DuckDB
- The result is a realistic migration scenario with missing files/folders
Example with dst_prob: 0.7:
Primary Table (Source):          Secondary Table (Destination):
├── folder1/                     ├── folder1/        ✓ (70% chance - included)
│   ├── file1.txt                │   ├── file1.txt   ✓ (70% chance - included)
│   ├── file2.txt                │   └── file3.txt   ✓ (70% chance - included)
│   └── file3.txt                └── folder3/        ✓ (70% chance - included)
├── folder2/                         └── file6.txt   ✓ (70% chance - included)
│   └── file4.txt
├── folder3/                     ✗ folder2/  missing (30% chance - excluded)
│   ├── file5.txt                ✗ file2.txt missing (30% chance - excluded)
│   └── file6.txt                ✗ file4.txt missing (30% chance - excluded)
└── folder4/                     ✗ file5.txt missing (30% chance - excluded)
    └── file7.txt                ✗ folder4/  missing (30% chance - excluded)
System Architecture
┌──────────────────┐    ┌───────────────┐    ┌──────────────────────┐
│     REST API     │───▶│    GhostFS    │───▶│        DuckDB        │
│   (Chi Router)   │    │    Server     │    │  ┌────────────────┐  │
└──────────────────┘    └───────┬───────┘    │  │ Primary Table  │  │
         ▲                      │            │  │    (nodes)     │  │
         │              ┌───────▼───────┐    │  └────────────────┘  │
         │              │ Table Manager │    │  ┌────────────────┐  │
         │              │ (Multi-table) │    │  │  Secondary     │  │
         │              └───────┬───────┘    │  │  Table 1       │  │
         │                      │            │  │  (subset)      │  │
   ┌─────┴─────┐        ┌───────▼───────┐    │  └────────────────┘  │
   │  Client   │        │  Write Queue  │───▶│  ┌────────────────┐  │
   │    App    │        │  (Batching)   │    │  │  Secondary     │  │
   └───────────┘        └───────────────┘    │  │  Table N       │  │
                                             │  │  (subset)      │  │
                                             │  └────────────────┘  │
                                             └──────────────────────┘
Installation & Setup
Prerequisites
- Go 1.24.2 or higher
- Git
Quick Start
# Clone the repository
git clone https://github.com/Voltaic314/GhostFS.git
cd GhostFS
# Install dependencies
go mod download
# Seed the database with sample data
go run main.go
# Start the API server
cd code/api
go run main.go server.go
# Server starts on http://localhost:8086 (configurable via config.json)
Configuration
Create or modify config.json:
Single-FS Mode (Basic)
{
"database": {
"path": "GhostFS.db",
"tables": {
"primary": {
"table_name": "nodes",
"min_child_folders": 2,
"max_child_folders": 8,
"min_child_files": 5,
"max_child_files": 15,
"min_depth": 3,
"max_depth": 6
}
}
},
"network": {
"address": "localhost",
"port": 8086
}
}
Multi-FS Mode (Migration Testing)
{
"database": {
"path": "GhostFS.db",
"tables": {
"primary": {
"table_name": "nodes_source",
"min_child_folders": 3,
"max_child_folders": 10,
"min_child_files": 8,
"max_child_files": 20,
"min_depth": 4,
"max_depth": 8
},
"secondary": {
"destination_partial": {
"table_name": "nodes_dest_partial",
"dst_prob": 0.7
},
"destination_sparse": {
"table_name": "nodes_dest_sparse",
"dst_prob": 0.3
}
}
}
},
"network": {
"address": "localhost",
"port": 8086
}
}
Configuration Explained:
- dst_prob: 0.7 = 70% of items from the primary table will appear in this secondary table
- dst_prob: 0.3 = 30% of items from the primary table will appear in this secondary table
- Multiple secondary tables simulate different migration scenarios
Write Queue Configuration & Performance Tuning
GhostFS uses a sophisticated write queue system to efficiently persist generated file system structures to DuckDB. Understanding and tuning these settings is crucial for optimal performance.
How Write Queues Work
The Write Queue Process:
- Operations Accumulate: File/folder creation operations are queued in memory
- Batch Threshold: When pending operations reach batch_size, they're flushed to disk
- Timer Safety Net: If the timer expires with pending operations, they're flushed
- Force Flush: Explicit flushes for shutdown or critical operations
Performance Trade-offs
| Setting | Higher Frequency | Lower Frequency |
|---|---|---|
| Batch Size | Smaller (100-1K) | Larger (5K-50K) |
| Flush Interval | Shorter (50-200ms) | Longer (1-10s) |
| Safety | Very safe | Riskier |
| Performance | Slower | Faster |
| Memory Usage | Lower | Higher |
| Crash Recovery | Minimal loss | Potential loss |
Recommended Configurations
Balanced (Recommended)
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 200
}
}
- Best of both worlds: Good performance with reasonable safety
- Suitable for: Most development and testing scenarios
- Memory usage: ~1K operations in memory at any time
High Throughput
{
"write_queue": {
"batch_size": 10000,
"flush_interval_ms": 5000
}
}
- Maximum performance: Minimal I/O overhead
- Suitable for: Large-scale generation, performance testing
- Risk: Higher memory usage, more data loss on crash
High Safety
{
"write_queue": {
"batch_size": 100,
"flush_interval_ms": 50
}
}
- Maximum safety: Frequent disk writes
- Suitable for: Production environments, critical data
- Trade-off: Slower performance due to frequent I/O
Timer-Only Mode
{
"write_queue": {
"batch_size": 0,
"flush_interval_ms": 100
}
}
- No batching: Flushes on timer only
- Suitable for: Real-time applications, low-latency requirements
- Behavior: Operations flushed every 100ms regardless of count
Dynamic Configuration
GhostFS supports runtime configuration changes for dynamic scaling:
// Adjust for high-load scenario
client.SetWriteQueueConfig("nodes", 5000, 500*time.Millisecond)
// Switch to safety mode
client.SetWriteQueueConfig("nodes", 100, 50*time.Millisecond)
// Disable batching (timer-only)
client.SetWriteQueueConfig("nodes", 0, 100*time.Millisecond)
// Update all tables at once
client.SetAllWriteQueueConfigs(1000, 200*time.Millisecond)
Critical: Batch Size & Timer Synchronization
The Golden Rule: Your flush interval should match how long it takes to fill your batch threshold under normal load.
Proper Synchronization
Ideal Relationship:
Flush Interval ≈ Time to Fill Batch Size Under Normal Load
Why This Matters:
- Timer too short: You'll flush before reaching the batch threshold → wasted I/O
- Timer too long: You'll wait too long for outliers → increased volatility
- Batch size too large: Timer becomes irrelevant → high memory usage
- Batch size too small: Timer flushes too frequently → performance loss
Real-World Scenarios
Scenario 1: Slow Generation + Large Batch
{
"write_queue": {
"batch_size": 10000,
"flush_interval_ms": 5000
}
}
- Problem: If generation is slow, you might only create 2K items in 5 seconds
- Result: Timer flushes before reaching the threshold → inefficient
- Solution: Reduce batch size to match generation speed
Scenario 2: Fast Generation + Small Timer
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 50
}
}
- Problem: If you generate 1K items in 10ms, timer is too frequent
- Result: Constant flushing → performance bottleneck
- Solution: Increase timer or batch size
Scenario 3: Balanced (Recommended)
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 200
}
}
- Assumption: You generate ~1K items in ~200ms under normal load
- Result: Optimal batching with safety net for outliers
Load-Dependent Considerations
High-Load Systems:
- Generation is fast → larger batch sizes work well
- Timer can be longer (safety net for rare slowdowns)
- Example: batch_size: 5000, flush_interval_ms: 1000
Low-Load Systems:
- Generation is slow → smaller batch sizes needed
- Timer should be shorter to catch slow periods
- Example: batch_size: 100, flush_interval_ms: 100
Variable-Load Systems:
- Use conservative settings that work for the worst case
- Consider dynamic configuration for load changes
- Example: batch_size: 1000, flush_interval_ms: 500
Tuning Strategy
- Start Conservative: Begin with balanced settings
- Measure Generation Rate: Time how long it takes to create 1K items
- Set Timer: Timer should be 1.5-2x your measured generation time
- Adjust Batch Size: Match your typical generation rate
- Monitor & Iterate: Watch for flush patterns and adjust
Example Tuning Process:
# Step 1: Measure generation rate
time go run main.go # Note: "Generated 1000 items in 150ms"
# Step 2: Set timer to 2x generation time
"flush_interval_ms": 300 # 150ms * 2 = 300ms
# Step 3: Set batch size to match generation rate
"batch_size": 1000 # Matches your measured rate
# Step 4: Monitor and adjust
# If you see frequent timer flushes → increase batch size
# If you see infrequent flushes → decrease the timer
Monitoring & Optimization
Key Metrics to Monitor:
- Queue Depth: Number of pending operations
- Flush Frequency: How often batches are written to disk
- Memory Usage: RAM consumption from queued operations
- Generation Speed: Files/folders created per second
- Flush Pattern: Are you hitting batch threshold or timer more often?
Optimization Tips:
- Start with balanced settings (1K batch, 200ms interval)
- Measure your generation rate under typical load
- Set timer to 1.5-2x generation time for safety margin
- Adjust batch size to match your generation capacity
- Monitor flush patterns - aim for 80% batch threshold, 20% timer flushes
- Use dynamic configuration for variable loads
Important Considerations
Memory Usage:
- Each queued operation consumes memory
- Large batch sizes = higher memory usage
- Monitor system memory during large generations
Crash Recovery:
- Queued operations are lost on crash
- Smaller batch sizes = less data loss
- Consider your application's crash tolerance
I/O Performance:
- More frequent flushes = more disk I/O
- SSD vs HDD performance differences
- Consider your storage subsystem capabilities
API Reference
Base URL: http://localhost:8086
Tables Management
List All File Systems
POST /tables/list
Response:
{
"success": true,
"data": {
"tables": [
{
"table_id": "uuid-here",
"table_name": "nodes",
"type": "primary"
}
]
}
}
File System Operations
List Items in Folder
POST /items/list
Content-Type: application/json
{
"table_id": "uuid-here",
"folder_id": "root-folder-id",
"folders_only": false
}
Get Root Folder
GET /items/get_root
Content-Type: application/json
{
"table_id": "uuid-here"
}
Create Multiple Items
POST /items/new
Content-Type: application/json
{
"table_id": "uuid-here",
"parent_id": "parent-folder-id",
"items": [
{"name": "New Folder", "type": "folder"},
{"name": "document.txt", "type": "file", "size": 1024}
]
}
Delete Multiple Items
POST /items/delete
Content-Type: application/json
{
"table_id": "uuid-here",
"item_ids": ["item-id-1", "item-id-2"]
}
Get Download URLs
POST /items/download
Content-Type: application/json
{
"table_id": "uuid-here",
"file_ids": ["file-id-1", "file-id-2"]
}
Usage Examples
Migration Testing Scenarios
Scenario 1: Incomplete Migration Detection
# 1. Generate source file system (primary table)
go run main.go
# 2. List source file system
curl -X POST http://localhost:8086/items/list \
-d '{"table_id": "source-table-id", "folder_id": "root"}'
# 3. List destination file system (secondary table with dst_prob: 0.7)
curl -X POST http://localhost:8086/items/list \
-d '{"table_id": "dest-table-id", "folder_id": "root"}'
# 4. Compare results - ~30% of files should be missing from destination
# Your migration tool should detect these missing files
Scenario 2: Incremental Sync Validation
// Test your sync tool's ability to detect missing files
sourceItems := ghostfs.ListItems("source-table-id", "root")
destItems := ghostfs.ListItems("dest-partial-table-id", "root")
// Your sync logic should identify missing items
missingItems := findMissingItems(sourceItems, destItems)
// With dst_prob: 0.7, expect ~30% missing items
// Run your incremental sync
syncTool.SyncMissing(missingItems)
// Validate sync completed successfully
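For completeness, here is one way the findMissingItems helper above could look. The Item type is a hypothetical stand-in for whatever the client actually returns:

```go
package main

import "fmt"

// Item is a minimal stand-in for a GhostFS listing entry.
type Item struct {
	ID   string
	Name string
}

// findMissingItems returns the source items whose IDs are absent from the
// destination listing - exactly the gap a dst_prob < 1.0 table produces.
func findMissingItems(source, dest []Item) []Item {
	present := make(map[string]bool, len(dest))
	for _, d := range dest {
		present[d.ID] = true
	}
	var missing []Item
	for _, s := range source {
		if !present[s.ID] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	src := []Item{{"1", "file1.txt"}, {"2", "file2.txt"}, {"3", "file3.txt"}}
	dst := []Item{{"1", "file1.txt"}, {"3", "file3.txt"}}
	fmt.Println(findMissingItems(src, dst)) // [{2 file2.txt}]
}
```

Matching on IDs rather than names keeps the comparison correct even if two items share a name in different folders.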
Testing a File Migration Tool
// Connect to GhostFS
client := ghostfs.NewClient("http://localhost:8086")
// List available file systems
tables, _ := client.ListTables()
tableID := tables[0].TableID
// Get root folder contents
items, _ := client.ListItems(tableID, "root")
// Simulate migrating files
for _, item := range items {
if item.Type == "file" {
// Your migration logic here
downloadURL, _ := client.GetDownloadURL(tableID, item.ID)
// Process file...
}
}
Testing Rclone Integration
# Use GhostFS as a WebDAV endpoint (coming soon)
rclone sync ghostfs:/ local:backup/ --dry-run
# Or use the REST API directly
curl -X POST http://localhost:8086/items/list \
-H "Content-Type: application/json" \
-d '{"table_id": "your-table-id", "folder_id": "root"}'
Development
Project Structure
GhostFS/
├── code/
│   ├── api/                 # REST API server
│   │   ├── routes/
│   │   │   ├── tables/      # Table management endpoints
│   │   │   └── items/       # File/folder CRUD endpoints
│   │   ├── main.go          # API server entry point
│   │   └── server.go        # Server configuration
│   ├── db/                  # Database layer
│   │   ├── tables/          # Table management
│   │   ├── seed/            # Database seeding
│   │   └── write_queue.go   # Batched writes
│   └── types/               # Shared types
│       ├── api/             # API response types
│       └── db/              # Database schema types
├── config.json              # Configuration
└── main.go                  # Seeder entry point
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Use Cases
- ByteWave - File migration testing and validation
- Cloud Storage SDKs - Integration testing
- Backup Tools - Restore process validation
- File Sync Apps - Conflict resolution testing
- Performance Testing - Large-scale file operation benchmarks
Roadmap
- v0.2 - Network simulation (latency, failures)
- v0.3 - Authentication simulation
- v0.4 - Rate limiting and quotas
- v0.5 - WebDAV/S3 protocol support
- v1.0 - Plugin system and custom behaviors
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support & Community
- Documentation: Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ for the file migration and sync testing community
⭐ Star this repo if you find it useful!