crack

Version: v1.0.0 (latest) | Published: Jan 28, 2026 | License: 0BSD | Imports: 12 | Imported by: 0

Repository

github.com/taigrr/document-crack

README

document-crack

A Go library for extracting text content from various document formats.

Supported Formats

PDF - Portable Document Format
DOCX - Microsoft Word (Open XML)
DOC - Microsoft Word (Legacy)
PPTX - Microsoft PowerPoint (Open XML)
ODT - OpenDocument Text
TXT - Plain text

Installation

go get github.com/taigrr/document-crack

Usage

From a file path

package main

import (
    "fmt"
    "log"

    crack "github.com/taigrr/document-crack"
)

func main() {
    doc, err := crack.FromFile("/path/to/document.pdf")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Type: %s\n", doc.Type)
    fmt.Printf("Title: %s\n", doc.Title)
    fmt.Printf("Content: %v\n", doc.Content)
}

From bytes

// data is the raw document as a []byte, for example from os.ReadFile.
doc, err := crack.FromBytes(data)
if err != nil {
    log.Fatal(err)
}

From io.ReaderAt

// reader is an io.ReaderAt; size is the total length of its data in bytes (int64).
doc, err := crack.FromReader(reader, size)
if err != nil {
    log.Fatal(err)
}
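FromReader needs both an io.ReaderAt and the total size. The sketch below shows two common ways to obtain that pair using only the standard library. The helper names bytesReaderAt and fileReaderAt are hypothetical and not part of this package; the crack.FromReader call appears only in a comment so that the block compiles on its own.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"os"
)

// bytesReaderAt wraps an in-memory buffer. *bytes.Reader implements
// io.ReaderAt, and the size is simply the length of the slice.
// (Hypothetical helper, not part of the crack package.)
func bytesReaderAt(data []byte) (io.ReaderAt, int64) {
	return bytes.NewReader(data), int64(len(data))
}

// fileReaderAt opens a file and reports its size. *os.File implements
// io.ReaderAt; the caller is responsible for closing the returned file.
// (Hypothetical helper, not part of the crack package.)
func fileReaderAt(path string) (*os.File, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, 0, err
	}
	info, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, 0, err
	}
	return f, info.Size(), nil
}

func main() {
	r, size := bytesReaderAt([]byte("hello, world"))

	// Read five bytes starting at offset 7 to show random access.
	buf := make([]byte, 5)
	if _, err := r.ReadAt(buf, 7); err != nil {
		log.Fatal(err)
	}
	fmt.Println(size, string(buf))

	// With the library, the pair would then be passed along:
	//   doc, err := crack.FromReader(r, size)
}
```

For files, the same pattern applies with fileReaderAt, remembering to close the file once extraction has finished.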

From URL

// ctx is a context.Context, for example context.Background().
doc, err := crack.FromURL(ctx, "https://example.com/document.pdf")
if err != nil {
    log.Fatal(err)
}

Document Structure

type Document struct {
    Type    FileType  // PDF, DOCX, DOC, PPTX, ODT, TXT
    Title   string    // Document title (if available)
    Content []string  // Text content (one element per page for PDFs, a single element for other formats)
}
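Because Content is a slice, callers who want one string have to join it themselves. The following is a minimal sketch of that step. The local Document type only mirrors the fields shown above so the example compiles without the library (Type is simplified to string), and fullText is a hypothetical helper, not something this package provides. The blank-line separator between pages is an arbitrary choice.

```go
package main

import (
	"fmt"
	"strings"
)

// Document mirrors the documented fields so this sketch is self-contained.
// Real code would use crack.Document instead.
type Document struct {
	Type    string
	Title   string
	Content []string
}

// fullText joins every Content element into one string, separating
// elements with a blank line. (Hypothetical helper.)
func fullText(doc Document) string {
	return strings.Join(doc.Content, "\n\n")
}

func main() {
	doc := Document{
		Type:    "PDF",
		Title:   "Example",
		Content: []string{"page one", "page two"},
	}

	// Iterate per element, which for a PDF corresponds to one page each.
	for i, page := range doc.Content {
		fmt.Printf("page %d: %d bytes\n", i+1, len(page))
	}

	fmt.Println(fullText(doc))
}
```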

License

0BSD

Documentation

Overview

Package crack provides document text extraction for various file formats.

Constants

const MaxDownloadSize = 100 << 20

MaxDownloadSize is the maximum file size for URL downloads (100MB).
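The expression 100 << 20 is 100 * 1024 * 1024, or 104,857,600 bytes. The package does not document how the limit is enforced, so the following is only a sketch of one common way to cap a download with io.LimitReader. The names readCapped, errTooLarge, and zeros are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// maxDownloadSize mirrors the documented constant:
// 100 << 20 = 100 * 1024 * 1024 = 104,857,600 bytes.
const maxDownloadSize = 100 << 20

// errTooLarge is a hypothetical error for this sketch only.
var errTooLarge = errors.New("body exceeds size limit")

// readCapped reads at most limit bytes from r. It asks for one extra byte
// so that an oversized body can be told apart from one that is exactly
// limit bytes long.
func readCapped(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// zeros is a tiny reader that yields n zero bytes and then io.EOF,
// standing in for a response body of known length.
type zeros struct{ n int }

func (z *zeros) Read(p []byte) (int, error) {
	if z.n == 0 {
		return 0, io.EOF
	}
	k := len(p)
	if k > z.n {
		k = z.n
	}
	for i := 0; i < k; i++ {
		p[i] = 0
	}
	z.n -= k
	return k, nil
}

func main() {
	fmt.Println("limit in bytes:", maxDownloadSize)

	// A body within the limit is returned in full.
	data, err := readCapped(&zeros{n: 10}, 10)
	fmt.Println(len(data), err)

	// A body one byte over the limit is rejected.
	_, err = readCapped(&zeros{n: 11}, 10)
	fmt.Println(err)
}
```

Reading limit+1 bytes rather than limit avoids silently truncating a body that is too large.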

Variables

var ErrUnknownFormat = errors.New("unknown file format")

ErrUnknownFormat is returned when the file format cannot be determined.
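A sentinel error like this is usually checked with errors.Is, which also matches when the error has been wrapped. The sketch below uses a local errUnknownFormat as a stand-in for crack.ErrUnknownFormat, and a hypothetical extract function in place of a real call such as crack.FromBytes, so that it runs without the library. Whether the library wraps the sentinel is not documented; errors.Is works in either case.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownFormat is a local stand-in for crack.ErrUnknownFormat.
var errUnknownFormat = errors.New("unknown file format")

// extract is a hypothetical stand-in for a call such as crack.FromBytes.
// For illustration it fails with the wrapped sentinel on empty input.
func extract(data []byte) (string, error) {
	if len(data) == 0 {
		return "", fmt.Errorf("extract: %w", errUnknownFormat)
	}
	return string(data), nil
}

// classify shows how a caller might branch on the outcome.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errUnknownFormat):
		return "unsupported format"
	default:
		return "other failure"
	}
}

func main() {
	_, err := extract(nil)
	fmt.Println(classify(err))

	_, err = extract([]byte("some text"))
	fmt.Println(classify(err))
}
```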

Types

type Document

type Document struct {
	Type    FileType
	Title   string
	Content []string
}

Document represents the extracted content from a file.

func FromBytes

func FromBytes(data []byte) (Document, error)

FromBytes extracts content from a byte slice.

func FromFile

func FromFile(path string) (Document, error)

FromFile extracts content from a file at the given path.

func FromReader

func FromReader(r io.ReaderAt, size int64) (Document, error)

FromReader extracts content from an io.ReaderAt with known size.

func FromURL

func FromURL(ctx context.Context, fileURL string) (Document, error)

FromURL downloads and extracts content from a URL.

type FileType

type FileType string

FileType represents the detected document format.

const (
	TypePDF  FileType = "PDF"
	TypeDOC  FileType = "DOC"
	TypeDOCX FileType = "DOCX"
	TypeODT  FileType = "ODT"
	TypePPTX FileType = "PPTX"
	TypeTXT  FileType = "TXT"
)
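The package does not document how a format is detected. As background only, several of these formats can be recognised from their leading bytes: PDF files begin with "%PDF-", legacy DOC files are OLE2 compound files, and DOCX, PPTX, and ODT are all ZIP archives. The sketch below illustrates that general technique; the sniff function and its return strings are hypothetical and should not be read as this library's implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

// Well-known file signatures ("magic bytes").
var (
	pdfMagic  = []byte("%PDF-")
	zipMagic  = []byte("PK\x03\x04")
	ole2Magic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
)

// sniff classifies data by its leading bytes. ZIP-based formats share one
// signature, so telling DOCX, PPTX, and ODT apart would require looking
// inside the archive, which this sketch does not do.
func sniff(data []byte) string {
	switch {
	case bytes.HasPrefix(data, pdfMagic):
		return "PDF"
	case bytes.HasPrefix(data, ole2Magic):
		return "OLE2 container (legacy DOC candidate)"
	case bytes.HasPrefix(data, zipMagic):
		return "ZIP container (DOCX, PPTX, or ODT candidate)"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(sniff([]byte("%PDF-1.7\n")))
	fmt.Println(sniff([]byte("PK\x03\x04rest of archive")))
	fmt.Println(sniff([]byte("plain text")))
}
```

Plain text has no signature, so a real detector would need a separate fallback rule for TXT.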

Source Files

crack.go

Directories

Path	Synopsis
doc	Package doc provides legacy DOC text extraction.
docx	Package docx provides DOCX text extraction.
odt	Package odt provides ODT text extraction.
pdf	Package pdf provides PDF text extraction.
pptx	Package pptx provides PPTX text extraction.