crack

Version: v1.0.0
Published: Jan 28, 2026 | License: 0BSD | Imports: 12 | Imported by: 0

README

document-crack

A Go library for extracting text content from various document formats.

Supported Formats

  • PDF - Portable Document Format
  • DOCX - Microsoft Word (Open XML)
  • DOC - Microsoft Word (Legacy)
  • PPTX - Microsoft PowerPoint (Open XML)
  • ODT - OpenDocument Text
  • TXT - Plain text

Installation

go get github.com/taigrr/document-crack

Usage

From a file path

package main

import (
    "fmt"
    "log"

    crack "github.com/taigrr/document-crack"
)

func main() {
    // Extract text from the file at the given path.
    doc, err := crack.FromFile("/path/to/document.pdf")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Type: %s\n", doc.Type)
    fmt.Printf("Title: %s\n", doc.Title)
    fmt.Printf("Content: %v\n", doc.Content)
}

From bytes

// data is a []byte holding the raw document.
doc, err := crack.FromBytes(data)
if err != nil {
    log.Fatal(err)
}

From io.ReaderAt

// reader is an io.ReaderAt (for example an *os.File); size is its length in bytes.
doc, err := crack.FromReader(reader, size)
if err != nil {
    log.Fatal(err)
}

From URL

// ctx is a context.Context, for example context.Background().
doc, err := crack.FromURL(ctx, "https://example.com/document.pdf")
if err != nil {
    log.Fatal(err)
}

Document Structure

type Document struct {
    Type    FileType  // PDF, DOCX, DOC, PPTX, ODT, TXT
    Title   string    // Document title (if available)
    Content []string  // Text content (one element per page for PDFs, a single element for other formats)
}
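
Because Content is a slice, callers that want one block of text need to join its elements. The following is a minimal sketch of that step. FileType and Document are local stand-ins that mirror the struct above so the example compiles without the library, and joinContent is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// FileType and Document are local stand-ins mirroring the documented types.
// Real code would use crack.FileType and crack.Document instead.
type FileType string

type Document struct {
	Type    FileType
	Title   string
	Content []string
}

// joinContent is a hypothetical helper: it concatenates every Content
// element, placing sep between consecutive elements.
func joinContent(doc Document, sep string) string {
	return strings.Join(doc.Content, sep)
}

func main() {
	doc := Document{
		Type:    "PDF",
		Title:   "Example",
		Content: []string{"page one", "page two"},
	}
	fmt.Printf("%s has %d content entries\n", doc.Title, len(doc.Content))
	fmt.Println(joinContent(doc, "\n\n"))
}
```

For a PDF, len(doc.Content) doubles as a page count; for the other formats the README describes a single element, so the join is effectively a no-op.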

License

0BSD

Documentation

Overview

Package crack provides document text extraction for various file formats.

Constants

const MaxDownloadSize = 100 << 20

MaxDownloadSize is the maximum file size for URL downloads (100 MiB, that is 100 << 20 = 104,857,600 bytes).

Variables

var ErrUnknownFormat = errors.New("unknown file format")

ErrUnknownFormat is returned when the file format cannot be determined.

Functions

This section is empty.

Types

type Document

type Document struct {
	Type    FileType
	Title   string
	Content []string
}

Document represents the extracted content from a file.

func FromBytes

func FromBytes(data []byte) (Document, error)

FromBytes extracts content from a byte slice.

func FromFile

func FromFile(path string) (Document, error)

FromFile extracts content from a file at the given path.

func FromReader

func FromReader(r io.ReaderAt, size int64) (Document, error)

FromReader extracts content from an io.ReaderAt with known size.

func FromURL

func FromURL(ctx context.Context, fileURL string) (Document, error)

FromURL downloads and extracts content from a URL.

type FileType

type FileType string

FileType represents the detected document format.

const (
	TypePDF  FileType = "PDF"
	TypeDOC  FileType = "DOC"
	TypeDOCX FileType = "DOCX"
	TypeODT  FileType = "ODT"
	TypePPTX FileType = "PPTX"
	TypeTXT  FileType = "TXT"
)
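
Because FileType is a string type with a fixed set of constants, a switch is a natural way to branch on the detected format. The sketch below maps each constant to the long name given in the Supported Formats list. The type and constants are redeclared locally so the example compiles on its own, and formatName is a hypothetical helper.

```go
package main

import "fmt"

// FileType and its constants are local copies of the documented
// declarations. Real code would use crack.FileType and crack.TypePDF etc.
type FileType string

const (
	TypePDF  FileType = "PDF"
	TypeDOC  FileType = "DOC"
	TypeDOCX FileType = "DOCX"
	TypeODT  FileType = "ODT"
	TypePPTX FileType = "PPTX"
	TypeTXT  FileType = "TXT"
)

// formatName is a hypothetical helper that returns the long format name
// for t, or "unknown" for any value outside the documented set.
func formatName(t FileType) string {
	switch t {
	case TypePDF:
		return "Portable Document Format"
	case TypeDOCX:
		return "Microsoft Word (Open XML)"
	case TypeDOC:
		return "Microsoft Word (Legacy)"
	case TypePPTX:
		return "Microsoft PowerPoint (Open XML)"
	case TypeODT:
		return "OpenDocument Text"
	case TypeTXT:
		return "Plain text"
	default:
		return "unknown"
	}
}

func main() {
	for _, t := range []FileType{TypePDF, TypeDOCX, TypeTXT} {
		fmt.Printf("%s: %s\n", t, formatName(t))
	}
}
```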

Directories

  • Package doc provides legacy DOC text extraction.
  • Package docx provides DOCX text extraction.
  • Package odt provides ODT text extraction.
  • Package pdf provides PDF text extraction.
  • Package pptx provides PPTX text extraction.
