GhostFS - A File System Simulator

A powerful file system emulator for testing migration tools, file sync applications, and cloud storage integrations without the overhead of real file systems or expensive APIs.
What is GhostFS?
GhostFS is a SQL-backed file system emulator that mimics a cloud storage API such as Dropbox's. Instead of dealing with real files and folders, GhostFS creates a virtual file system stored in a DuckDB database that you can traverse, query, and manipulate through a REST API.
How GhostFS Works
GhostFS generates file systems probabilistically using deterministic random number generation (RNG) seeds. This means:
- Deterministic Generation: Same seed = same file system structure every time
- Probabilistic Distribution: Files and folders are created based on configurable probability distributions
- Write Queue Persistence: Generated structures are efficiently batched and persisted to DuckDB
- Configurable Performance: Tune write queue settings for your specific needs
The Generation Process:
Master Seed → Folder Seeds → Child Generation → Write Queue → DuckDB
      │             │                │               │           │
Deterministic   Per-Folder     Probabilistic      Batched    Persistent
     RNG        RNG Seeds      File/Folder        Writes      Storage
                               Generation
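The seed chain above can be sketched in Go. The FNV-based derivation, function names, and bounds below are illustrative assumptions, not GhostFS's actual internals:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// folderSeed derives a stable per-folder seed from the master seed and the
// folder's path, so the same master seed always reproduces the same tree.
func folderSeed(masterSeed int64, folderPath string) int64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d/%s", masterSeed, folderPath)
	return int64(h.Sum64())
}

// childCounts draws folder/file counts for one folder from its seeded RNG,
// within bounds mirroring min/max_child_folders and min/max_child_files.
func childCounts(seed int64, minFolders, maxFolders, minFiles, maxFiles int) (folders, files int) {
	rng := rand.New(rand.NewSource(seed))
	folders = minFolders + rng.Intn(maxFolders-minFolders+1)
	files = minFiles + rng.Intn(maxFiles-minFiles+1)
	return
}

func main() {
	seed := folderSeed(42, "/root/folder1")
	f1, n1 := childCounts(seed, 2, 8, 5, 15)
	f2, n2 := childCounts(seed, 2, 8, 5, 15)
	fmt.Println(f1 == f2 && n1 == n2) // same seed, same structure
}
```

Because each folder's RNG is seeded independently, any subtree can be regenerated on its own without replaying the whole file system.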
Write Queue System:
- Batch Size: Number of operations to accumulate before flushing to disk
- Flush Interval: Maximum time to wait before flushing (safety net)
- Performance Trade-offs: Higher flush frequency = safer but slower; lower flush frequency = faster but riskier
Perfect for:
- Testing file migration tools (like ByteWave) without moving real data
- Simulating massive file systems with millions of files and folders
- Prototyping cloud storage integrations with controllable environments
- Load testing file system operations at scale
Why GhostFS?
The Problem
- Testing file migration tools requires terabytes of real data
- Cloud APIs have rate limits and costs during development
- Creating realistic folder structures manually is time-consuming
- Real file systems are slow for large-scale testing
The Solution
- Instant file system generation with configurable depth and complexity
- No storage overhead - millions of "files" in a lightweight database
- Full API control - simulate network issues, auth failures, rate limits
- Realistic testing without the infrastructure costs
Features
Current (v0.1)
- DuckDB Backend - Fast, embedded SQL database
- Intelligent Seeding - Generate realistic folder structures
- Multi-FS Mode - Primary + secondary tables for migration testing
- Probabilistic Subsets - Secondary tables with configurable dst_prob
- REST API - Standard HTTP endpoints for file operations
- Batch Operations - Create/delete multiple items at once
- Table Management - List and manage multiple file systems
- Access Tracking - Automatic tracking of accessed folders via a checked flag
- Write Queues - Non-blocking batch updates for optimal performance
- Dynamic Configuration - Runtime tuning of batch sizes and flush intervals
Coming Soon (v0.2+)
- Network Simulation - Configurable latency, jitter, timeouts
- Auth Simulation - Token expiration, permission failures
- Rate Limiting - Simulate API throttling
- Metrics & Analytics - Track usage patterns
- Plugin System - Extend with custom behaviors
- Auto-Scaling Write Queues - Real-time adjustment of batch sizes and flush intervals based on load patterns and risk profiles
Architecture & File System Modes
Single-FS vs Multi-FS Mode
GhostFS operates in two distinct modes:
Single-FS Mode (Default)
- Uses only the primary table (nodes)
- Perfect for basic file system testing
- All items exist in one unified file system
Multi-FS Mode (Advanced)
- Uses the primary table plus one or more secondary tables
- Simulates source → destination migration scenarios
- Secondary tables contain probabilistic subsets of the primary table
- Each item has a dst_prob chance of appearing in secondary tables
How Secondary Tables Work
When generating a file system in Multi-FS mode:
- The primary table is populated with the complete file system using deterministic RNG seeds
- Secondary tables are populated by iterating through the primary table's items
- Each item has a probabilistic chance (based on dst_prob) of being included
- Write queues efficiently batch and persist all generated structures to DuckDB
- The result is a realistic migration scenario with missing files/folders
Example with dst_prob: 0.7:
Primary Table (Source):          Secondary Table (Destination):
├── folder1/                     ├── folder1/        ✓ (70% chance - included)
│   ├── file1.txt                │   ├── file1.txt   ✓ (70% chance - included)
│   ├── file2.txt                │   └── file3.txt   ✓ (70% chance - included)
│   └── file3.txt                └── folder3/        ✓ (70% chance - included)
├── folder2/                         └── file6.txt   ✓ (70% chance - included)
│   └── file4.txt
├── folder3/                     ✗ folder2/  missing (30% chance - excluded)
│   ├── file5.txt                ✗ file2.txt missing (30% chance - excluded)
│   └── file6.txt                ✗ file4.txt missing (30% chance - excluded)
└── folder4/                     ✗ file5.txt missing (30% chance - excluded)
    └── file7.txt                ✗ folder4/  missing (30% chance - excluded)
System Architecture
┌──────────────────┐    ┌───────────────┐    ┌──────────────────────┐
│     REST API     │───▶│    GhostFS    │───▶│        DuckDB        │
│   (Chi Router)   │    │    Server     │    │  ┌────────────────┐  │
└──────────────────┘    └───────┬───────┘    │  │ Primary Table  │  │
         ▲                      │            │  │    (nodes)     │  │
         │              ┌───────▼───────┐    │  └────────────────┘  │
         │              │ Table Manager │    │  ┌────────────────┐  │
         │              │ (Multi-table) │    │  │  Secondary     │  │
         │              └───────┬───────┘    │  │  Table 1       │  │
         │                      │            │  │  (subset)      │  │
   ┌─────┴─────┐        ┌───────▼───────┐    │  └────────────────┘  │
   │  Client   │        │  Write Queue  │───▶│  ┌────────────────┐  │
   │    App    │        │  (Batching)   │    │  │  Secondary     │  │
   └───────────┘        └───────────────┘    │  │  Table N       │  │
                                             │  │  (subset)      │  │
                                             │  └────────────────┘  │
                                             └──────────────────────┘
Installation & Setup
Prerequisites
- Go 1.24.2 or higher
- Git
Quick Start
# Clone the repository
git clone https://github.com/Voltaic314/GhostFS.git
cd GhostFS
# Install dependencies
go mod download
# Seed the database with sample data
go run main.go
# Start the API server
cd code/api
go run main.go server.go
# Server starts on http://localhost:8086 (configurable via config.json)
Configuration
Create or modify config.json:
Single-FS Mode (Basic)
{
"database": {
"path": "GhostFS.db",
"tables": {
"primary": {
"table_name": "nodes",
"min_child_folders": 2,
"max_child_folders": 8,
"min_child_files": 5,
"max_child_files": 15,
"min_depth": 3,
"max_depth": 6
}
}
},
"network": {
"address": "localhost",
"port": 8086
}
}
Multi-FS Mode (Migration Testing)
{
"database": {
"path": "GhostFS.db",
"tables": {
"primary": {
"table_name": "nodes_source",
"min_child_folders": 3,
"max_child_folders": 10,
"min_child_files": 8,
"max_child_files": 20,
"min_depth": 4,
"max_depth": 8
},
"secondary": {
"destination_partial": {
"table_name": "nodes_dest_partial",
"dst_prob": 0.7
},
"destination_sparse": {
"table_name": "nodes_dest_sparse",
"dst_prob": 0.3
}
}
}
},
"network": {
"address": "localhost",
"port": 8086
}
}
Configuration Explained:
- dst_prob: 0.7 = 70% of items from the primary table will appear in this secondary table
- dst_prob: 0.3 = 30% of items from the primary table will appear in this secondary table
- Multiple secondary tables simulate different migration scenarios
Write Queue Configuration & Performance Tuning
GhostFS uses a sophisticated write queue system to efficiently persist generated file system structures to DuckDB. Understanding and tuning these settings is crucial for optimal performance.
How Write Queues Work
The Write Queue Process:
- Operations Accumulate: File/folder creation operations are queued in memory
- Batch Threshold: When pending operations reach batch_size, they're flushed to disk
- Timer Safety Net: If the timer expires with pending operations, they're flushed
- Force Flush: Explicit flushes for shutdown or critical operations
Performance Trade-offs
| Setting | Higher Frequency | Lower Frequency |
|---|---|---|
| Batch Size | Smaller (100-1K) | Larger (5K-50K) |
| Flush Interval | Shorter (50-200ms) | Longer (1-10s) |
| Safety | Very safe | Riskier |
| Performance | Slower | Faster |
| Memory Usage | Lower | Higher |
| Crash Recovery | Minimal loss | Potential loss |
Recommended Configurations
Balanced (Recommended)
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 200
}
}
- Best of both worlds: Good performance with reasonable safety
- Suitable for: Most development and testing scenarios
- Memory usage: ~1K operations in memory at any time
High Throughput
{
"write_queue": {
"batch_size": 10000,
"flush_interval_ms": 5000
}
}
- Maximum performance: Minimal I/O overhead
- Suitable for: Large-scale generation, performance testing
- Risk: Higher memory usage, more data loss on crash
High Safety
{
"write_queue": {
"batch_size": 100,
"flush_interval_ms": 50
}
}
- Maximum safety: Frequent disk writes
- Suitable for: Production environments, critical data
- Trade-off: Slower performance due to frequent I/O
Timer-Only Mode
{
"write_queue": {
"batch_size": 0,
"flush_interval_ms": 100
}
}
- No batching: Flushes on timer only
- Suitable for: Real-time applications, low-latency requirements
- Behavior: Operations flushed every 100ms regardless of count
Dynamic Configuration
GhostFS supports runtime configuration changes for dynamic scaling:
// Adjust for high-load scenario
client.SetWriteQueueConfig("nodes", 5000, 500*time.Millisecond)
// Switch to safety mode
client.SetWriteQueueConfig("nodes", 100, 50*time.Millisecond)
// Disable batching (timer-only)
client.SetWriteQueueConfig("nodes", 0, 100*time.Millisecond)
// Update all tables at once
client.SetAllWriteQueueConfigs(1000, 200*time.Millisecond)
Critical: Batch Size & Timer Synchronization
The Golden Rule: Your flush interval should match how long it takes to fill your batch threshold under normal load.
Proper Synchronization
Ideal Relationship:
Flush Interval ≈ Time to Fill Batch Size Under Normal Load
Why This Matters:
- Timer too short: You'll flush before reaching the batch threshold → wasted I/O
- Timer too long: You'll wait too long for outliers → increased volatility
- Batch size too large: Timer becomes irrelevant → high memory usage
- Batch size too small: Timer flushes too frequently → performance loss
Real-World Scenarios
Scenario 1: Slow Generation + Large Batch
{
"write_queue": {
"batch_size": 10000,
"flush_interval_ms": 5000
}
}
- Problem: If generation is slow, you might only create 2K items in 5 seconds
- Result: Timer flushes before reaching the threshold → inefficient
- Solution: Reduce batch size to match generation speed
Scenario 2: Fast Generation + Small Timer
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 50
}
}
- Problem: If you generate 1K items in 10ms, timer is too frequent
- Result: Constant flushing → performance bottleneck
- Solution: Increase timer or batch size
Scenario 3: Balanced (Recommended)
{
"write_queue": {
"batch_size": 1000,
"flush_interval_ms": 200
}
}
- Assumption: You generate ~1K items in ~200ms under normal load
- Result: Optimal batching with safety net for outliers
Load-Dependent Considerations
High-Load Systems:
- Generation is fast → larger batch sizes work well
- Timer can be longer (safety net for rare slowdowns)
- Example: batch_size: 5000, flush_interval_ms: 1000
Low-Load Systems:
- Generation is slow → smaller batch sizes needed
- Timer should be shorter to catch slow periods
- Example: batch_size: 100, flush_interval_ms: 100
Variable-Load Systems:
- Use conservative settings that work for the worst case
- Consider dynamic configuration for load changes
- Example: batch_size: 1000, flush_interval_ms: 500
Tuning Strategy
- Start Conservative: Begin with balanced settings
- Measure Generation Rate: Time how long it takes to create 1K items
- Set Timer: Timer should be 1.5-2x your measured generation time
- Adjust Batch Size: Match your typical generation rate
- Monitor & Iterate: Watch for flush patterns and adjust
Example Tuning Process:
# Step 1: Measure generation rate
time go run main.go # Note: "Generated 1000 items in 150ms"
# Step 2: Set timer to 2x generation time
"flush_interval_ms": 300 # 150ms * 2 = 300ms
# Step 3: Set batch size to match generation rate
"batch_size": 1000 # Matches your measured rate
# Step 4: Monitor and adjust
# If you see frequent timer flushes → increase batch size
# If you see infrequent flushes → decrease the timer
Monitoring & Optimization
Key Metrics to Monitor:
- Queue Depth: Number of pending operations
- Flush Frequency: How often batches are written to disk
- Memory Usage: RAM consumption from queued operations
- Generation Speed: Files/folders created per second
- Flush Pattern: Are you hitting batch threshold or timer more often?
Optimization Tips:
- Start with balanced settings (1K batch, 200ms interval)
- Measure your generation rate under typical load
- Set timer to 1.5-2x generation time for safety margin
- Adjust batch size to match your generation capacity
- Monitor flush patterns - aim for 80% batch threshold, 20% timer flushes
- Use dynamic configuration for variable loads
Important Considerations
Memory Usage:
- Each queued operation consumes memory
- Large batch sizes = higher memory usage
- Monitor system memory during large generations
Crash Recovery:
- Queued operations are lost on crash
- Smaller batch sizes = less data loss
- Consider your application's crash tolerance
I/O Performance:
- More frequent flushes = more disk I/O
- SSD vs HDD performance differences
- Consider your storage subsystem capabilities
API Reference
Base URL: http://localhost:8086
Tables Management
List All File Systems
POST /tables/list
Response:
{
"success": true,
"data": {
"tables": [
{
"table_id": "uuid-here",
"table_name": "nodes",
"type": "primary"
}
]
}
}
File System Operations
List Items in Folder
POST /items/list
Content-Type: application/json
{
"table_id": "uuid-here",
"folder_id": "root-folder-id",
"folders_only": false
}
Get Root Folder
GET /items/get_root
Content-Type: application/json
{
"table_id": "uuid-here"
}
Create Multiple Items
POST /items/new
Content-Type: application/json
{
"table_id": "uuid-here",
"parent_id": "parent-folder-id",
"items": [
{"name": "New Folder", "type": "folder"},
{"name": "document.txt", "type": "file", "size": 1024}
]
}
Delete Multiple Items
POST /items/delete
Content-Type: application/json
{
"table_id": "uuid-here",
"item_ids": ["item-id-1", "item-id-2"]
}
Get Download URLs
POST /items/download
Content-Type: application/json
{
"table_id": "uuid-here",
"file_ids": ["file-id-1", "file-id-2"]
}
Usage Examples
Migration Testing Scenarios
Scenario 1: Incomplete Migration Detection
# 1. Generate source file system (primary table)
go run main.go
# 2. List source file system
curl -X POST http://localhost:8086/items/list \
-d '{"table_id": "source-table-id", "folder_id": "root"}'
# 3. List destination file system (secondary table with dst_prob: 0.7)
curl -X POST http://localhost:8086/items/list \
-d '{"table_id": "dest-table-id", "folder_id": "root"}'
# 4. Compare results - ~30% of files should be missing from destination
# Your migration tool should detect these missing files
Scenario 2: Incremental Sync Validation
// Test your sync tool's ability to detect missing files
sourceItems := ghostfs.ListItems("source-table-id", "root")
destItems := ghostfs.ListItems("dest-partial-table-id", "root")
// Your sync logic should identify missing items
missingItems := findMissingItems(sourceItems, destItems)
// With dst_prob: 0.7, expect ~30% missing items
// Run your incremental sync
syncTool.SyncMissing(missingItems)
// Validate sync completed successfully
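For completeness, here is one way the findMissingItems helper above could look. The Item type is a hypothetical stand-in for whatever the client actually returns:

```go
package main

import "fmt"

// Item is a minimal stand-in for a GhostFS listing entry.
type Item struct {
	ID   string
	Name string
}

// findMissingItems returns the source items whose IDs are absent from the
// destination listing - exactly the gap a dst_prob < 1.0 table produces.
func findMissingItems(source, dest []Item) []Item {
	present := make(map[string]bool, len(dest))
	for _, d := range dest {
		present[d.ID] = true
	}
	var missing []Item
	for _, s := range source {
		if !present[s.ID] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	src := []Item{{"1", "file1.txt"}, {"2", "file2.txt"}, {"3", "file3.txt"}}
	dst := []Item{{"1", "file1.txt"}, {"3", "file3.txt"}}
	fmt.Println(findMissingItems(src, dst)) // [{2 file2.txt}]
}
```

Matching on IDs rather than names keeps the comparison correct even if two items share a name in different folders.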
Testing a File Migration Tool
// Connect to GhostFS
client := ghostfs.NewClient("http://localhost:8086")
// List available file systems
tables, _ := client.ListTables()
tableID := tables[0].TableID
// Get root folder contents
items, _ := client.ListItems(tableID, "root")
// Simulate migrating files
for _, item := range items {
if item.Type == "file" {
// Your migration logic here
downloadURL, _ := client.GetDownloadURL(tableID, item.ID)
// Process file...
}
}
Testing Rclone Integration
# Use GhostFS as a WebDAV endpoint (coming soon)
rclone sync ghostfs:/ local:backup/ --dry-run
# Or use the REST API directly
curl -X POST http://localhost:8086/items/list \
-H "Content-Type: application/json" \
-d '{"table_id": "your-table-id", "folder_id": "root"}'
Development
Project Structure
GhostFS/
├── code/
│   ├── api/                 # REST API server
│   │   ├── routes/
│   │   │   ├── tables/      # Table management endpoints
│   │   │   └── items/       # File/folder CRUD endpoints
│   │   ├── main.go          # API server entry point
│   │   └── server.go        # Server configuration
│   ├── db/                  # Database layer
│   │   ├── tables/          # Table management
│   │   ├── seed/            # Database seeding
│   │   └── write_queue.go   # Batched writes
│   └── types/               # Shared types
│       ├── api/             # API response types
│       └── db/              # Database schema types
├── config.json              # Configuration
└── main.go                  # Seeder entry point
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Use Cases
- ByteWave - File migration testing and validation
- Cloud Storage SDKs - Integration testing
- Backup Tools - Restore process validation
- File Sync Apps - Conflict resolution testing
- Performance Testing - Large-scale file operation benchmarks
Roadmap
- v0.2 - Network simulation (latency, failures)
- v0.3 - Authentication simulation
- v0.4 - Rate limiting and quotas
- v0.5 - WebDAV/S3 protocol support
- v1.0 - Plugin system and custom behaviors
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support & Community
- Documentation: Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with ❤️ for the file migration and sync testing community
⭐ Star this repo if you find it useful!