import-flat-files

command
v0.4.7-1-testing-2025-...
Published: Sep 3, 2025 License: GPL-2.0 Imports: 14 Imported by: 0

README

Flat File Import Tool

This tool imports billions of usenet articles from a flat file structure into SQLite databases.

File Structure Expected

The tool expects articles to be stored in the following structure:

/mnt/xfshead/[0-f]/[0-f]/[0-f]/[hash].head
/mnt/xfsbody/[0-f]/[0-f]/[0-f]/[hash].body

Where:

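  • [0-f] represents a single hexadecimal digit (0-9, a-f), giving 16 directories per level
  • [hash] is the 61-character remainder of the 64-character message-ID hash; its first 3 characters are consumed by the three directory levels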
  • Each directory contains approximately 431k files

Database Output

Creates SQLite databases named [0-f][0-f].db (e.g., 00.db, 01.db, ..., ff.db) containing:

CREATE TABLE articles_0000 (
    messageid_hash CHAR(58) PRIMARY KEY,
    head TEXT NOT NULL,
    body TEXT NOT NULL
);
-- ... articles_0001 through articles_ffff (65536 tables per database)
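
As a minimal sketch of how the 65536 per-database tables could be enumerated, the Go snippet below generates the table names and renders the schema above for each one. The function names are hypothetical, and the `IF NOT EXISTS` clause is an assumption added so the statement can be re-run safely; the tool itself may create its tables differently.

```go
package main

import "fmt"

// tableNames returns articles_0000 .. articles_ffff (65536 names).
// The 4-digit lowercase hex suffix follows the naming shown in the schema.
func tableNames() []string {
	names := make([]string, 0, 1<<16)
	for i := 0; i < 1<<16; i++ {
		names = append(names, fmt.Sprintf("articles_%04x", i))
	}
	return names
}

// createTableSQL renders the schema for one table.
// IF NOT EXISTS is an assumption, not taken from the README.
func createTableSQL(table string) string {
	return fmt.Sprintf(`CREATE TABLE IF NOT EXISTS %s (
    messageid_hash CHAR(58) PRIMARY KEY,
    head TEXT NOT NULL,
    body TEXT NOT NULL
);`, table)
}

func main() {
	names := tableNames()
	fmt.Println(len(names), names[0], names[len(names)-1])
	fmt.Println(createTableSQL(names[0]))
}
```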

Sharding Strategy:

  • Database selection: First 2 directory levels (dir1 + dir2) → 256 databases (00.db - ff.db)
  • Table selection: Third directory level + first 3 chars of hash (dir3 + hash[0:3]) → 65536 tables per database (articles_0000 - articles_ffff)
  • Stored hash: Remaining 58 characters of hash (hash[3:]) stored in database

Example:

  • Path: /mnt/xfshead/a/b/c/defg123456789...xyz.head
  • Database: ab.db
  • Table: articles_cdef
  • Stored hash: g123456789...xyz (58 chars)

/mnt/xfs/a/b/c/d e f 123456789...xyz.(head|body)
         │ │ │ │ │ │
         │ │ │ │ │ └─ 6th character (table routing) -1 = 58 remaining in hash
         │ │ │ │ └─── 5th character (table routing) -1 = 59 remaining in hash
         │ │ │ └───── 4th character (table routing) -1 = 60 remaining in hash
         │ │ └─────── 3rd directory level (table routing) -1 = 61 remaining in hash [result is the actual existing hashed filename[0:61].(head|body)]
         │ └───────── 2nd directory level (database routing) -1 = 62 remaining in hash
         └─────────── 1st directory level (database routing) -1 = 63 remaining in hash

Usage

Basic Usage

go run cmd/import-flat-files/main.go

With Custom Paths

go run cmd/import-flat-files/main.go \
    -head /mnt/xfshead \
    -body /mnt/xfsbody \
    -db ./imported_articles \
    -workers 16

Options

  • -head PATH: Path to head files directory (default: /mnt/xfshead)
  • -body PATH: Path to body files directory (default: /mnt/xfsbody)
  • -db PATH: Path for SQLite database files (default: ./imported_articles)
  • -workers N: Number of worker goroutines (default: 32)
  • -update: Update mode: only import missing articles (default: false)
  • -dry-run: Don't write to database, just scan files (default: false)
  • -verbose: Verbose logging (default: false)
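
For readers who want to see how such options map onto Go's standard `flag` package, here is a minimal sketch. The flag names and defaults are taken from the list above; the `Config` struct, the `parseFlags` function, and the check that `-workers` is at least 1 are assumptions made for illustration and may not match the tool's source.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Config mirrors the documented options. The struct itself is hypothetical.
type Config struct {
	HeadPath, BodyPath, DBPath string
	Workers                    int
	Update, DryRun, Verbose    bool
}

// parseFlags parses args (without the program name) using the documented defaults.
func parseFlags(args []string) (Config, error) {
	var c Config
	fs := flag.NewFlagSet("import-flat-files", flag.ContinueOnError)
	fs.StringVar(&c.HeadPath, "head", "/mnt/xfshead", "path to head files directory")
	fs.StringVar(&c.BodyPath, "body", "/mnt/xfsbody", "path to body files directory")
	fs.StringVar(&c.DBPath, "db", "./imported_articles", "path for SQLite database files")
	fs.IntVar(&c.Workers, "workers", 32, "number of worker goroutines")
	fs.BoolVar(&c.Update, "update", false, "only import missing articles")
	fs.BoolVar(&c.DryRun, "dry-run", false, "scan files without writing to the database")
	fs.BoolVar(&c.Verbose, "verbose", false, "verbose logging")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	if c.Workers < 1 { // assumed sanity check, not documented by the tool
		return Config{}, fmt.Errorf("-workers must be at least 1, got %d", c.Workers)
	}
	return c, nil
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```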

Examples

Dry run to estimate scope:

go run cmd/import-flat-files/main.go -dry-run -verbose

High performance import:

go run cmd/import-flat-files/main.go \
    -workers 32 \
    -head /mnt/xfshead \
    -body /mnt/xfsbody \
    -db /fast/ssd/articles

Resume interrupted import:

go run cmd/import-flat-files/main.go -update

Performance Characteristics

  • Batch Processing: Uses transactions of 10,000 articles for optimal performance
  • Parallel Workers: Configurable worker count for parallel processing
  • Memory Efficient: Processes files sequentially within each worker
  • Database Optimization: Uses WAL mode, optimized PRAGMA settings
  • Progress Reporting: Reports progress every 30 seconds

Expected Performance

With 8 workers on modern hardware:

  • Processing Rate: ~1000-5000 articles/second
  • Memory Usage: ~100-500 MB RAM
  • Disk I/O: Optimized for sequential reads and batch writes

For billions of articles, expect the import to take several days to weeks depending on:

  • Storage I/O performance (especially for 431k files per directory)
  • Number of worker goroutines
  • Target database storage speed

Database Size Estimation

Each article will consume approximately:

  • Head: 1-5 KB average
  • Body: 5-50 KB average
  • Total per article: ~6-55 KB
  • For 1 billion articles: ~6-55 TB total database size
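
The arithmetic behind these figures can be reproduced with a few lines of Go. The sketch assumes decimal units (1 TB = 10^9 KB) and ignores SQLite page and index overhead, so it is a lower-bound illustration rather than a sizing tool; `sizeRangeTB` is a hypothetical helper.

```go
package main

import "fmt"

// sizeRangeTB returns the low and high total size in TB for n articles,
// given per-article bounds in KB. Decimal units: 1 TB = 1e9 KB.
func sizeRangeTB(n, lowKB, highKB float64) (float64, float64) {
	const kbPerTB = 1e9
	return n * lowKB / kbPerTB, n * highKB / kbPerTB
}

func main() {
	// Per-article bounds from the list above: 1+5 KB to 5+50 KB.
	low, high := sizeRangeTB(1e9, 1+5, 5+50)
	fmt.Printf("1 billion articles: %.0f-%.0f TB\n", low, high)
}
```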

Monitoring

The tool provides real-time statistics:

  • Articles processed per second
  • Total processed/errors/skipped
  • Memory usage
  • Elapsed time

Error Handling

  • Missing Files: Logs and skips articles missing head or body files
  • Database Errors: Retries and logs failed inserts
  • Memory Management: Periodic garbage collection during large batches
  • Graceful Shutdown: Can be interrupted and resumed

Database Distribution

Articles are distributed across:

  • 256 databases (00.db to ff.db) based on the first 2 hex characters of the hash
  • 65536 tables per database (articles_0000 to articles_ffff) based on characters 3-6 of the hash

This two-level sharding distributes load across many small tables, which helps query performance, while keeping resource usage reasonable with only 256 database connections.

Building

go build -o import-flat-files cmd/import-flat-files/main.go

Dependencies

  • Go 1.19+
  • github.com/mattn/go-sqlite3
