tomd

command module
v0.7.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 6, 2025 License: Apache-2.0 Imports: 1 Imported by: 0

README

toMD

Simple golang cli to convert various documents to markdown format. It first version support :

  • HTML page
  • PDF file
  • Docx file
  • Pptx file

The cli can convert a single file or a list of files. It can also use ollama to describe images in markdown file (only HTML for the moment).

The conversion generates metadata header with the following fields :

---
title: <title>
doc_id: <id>
description: <description>
tags: 
- file
site_url: <doc-url>
authors: 
- me
creation_date: 2025-01-13T18:15:24
last_update_date: 2025-01-13T18:15:24
visibility: Internal

These metadata's fields are generated from document information and can be overriden with a json file.

Usage

Get a web page as a markdown file

$ tomd page -f <file-url> -d <directory>

Get a set of web pages as markdown files (with metadata override)

$ tomd file -f <json-list> -d <directory>

with :

[
  {"site_url":"http://mywebsitepage.mydomain.com/page2", "description":"my page description","title":"","tags":["tag1","tag2"]},
  {"site_url":"https://mywebsitepage.mydomain.com/page4", "description":"my page description","title":""},
  {"site_url":"https://mywebsitepage.mydomain.com/page10", "description":"my page description ","title":"Page 10, xxxxx", "authors": ["author1","author2"]}
]

Extract PDF text as markdown file (basic text extraction)

$ tomd pdf -f <pdf-file> -d <directory>

Extract DOCX text as markdown file (basic text extraction)

$ tomd docx -d <docx-file> -d <directory>

Extract PPTX text as markdown file (basic text extraction)

$ tomd pptx -p <docx-file> -d <directory>

Options

Export web pages to markdown files, create metadata header and optionally use llm to describe image in markdown file

Usage:
  tomd [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  docx        Get Docx text content as a markdown file
  file        Get a list of web pages as markdown files
  help        Help about any command
  page        Get a web page as a markdown file
  pdf         Get PDF text content as a markdown file
  pptx        Get pptx text content as a markdown file
  version     Provide tomd version and build number

Flags:
  -d, --dir string   Export page(s) folder, default is current folder (default ".")
  -h, --help         help for tomd
  -i, --ia           Use IA for image description
  -v, --verbose      write debug logs in log-tomd.log file

Use "tomd [command] --help" for more information about a command.

If -i option is used, ollama and llm llava:7b are used to describe images at the end of the markdown file. So do not forget to install ollama and to pull llava:7b with ollama pull llava:7b before using this option.

You can use another model by setting TOMD_MODEL env variable with your target models.

$ export TOMD_MODEL=llama3.2-vision:latest  tomd page -f <file-url> -d <directory> -i

Build

$ go mod tidy
$ go build

Documentation

The Go Gopher

There is no documentation for this package.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL