document

package
v0.24.0-beta
Warning

This package is not in the latest version of its module.
Published: Jul 31, 2024 License: MIT Imports: 18 Imported by: 1

README

---
title: "Document"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Document component https://github.com/instill-ai/instill-core"
---

The Document component is an operator component that lets users manipulate document files.
It can carry out the following tasks:

- [Convert To Markdown](#convert-to-markdown)
- [Convert To Text](#convert-to-text)



## Release Stage

`Alpha`



## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/operator/document/v0/config/definition.json).





## Supported Tasks

### Convert To Markdown

Convert a document to Markdown-formatted text.


| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_MARKDOWN` |
| Document (required) | `document` | string | Base64 encoded PDF/DOCX/DOC/PPTX/PPT/HTML to be converted to text in Markdown format |
| Display image tag | `display-image-tag` | boolean | Whether the result includes image tags |



| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Markdown text converted from the input document |






### Convert To Text

Convert a document to plain text.


| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64 encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |



| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document, in milliseconds |
| Error | `error` | string | Error message, if any, from the conversion process |
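
Because the output carries its own `error` field, a caller probably wants to check it before trusting `body`. The following sketch decodes an output object and turns a non-empty `error` into a Go error. The `textOutput` struct mirrors the output table, while the `parseTextOutput` helper and the sample JSON are hypothetical and not taken from the component.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// textOutput mirrors the Convert To Text output table.
type textOutput struct {
	Body  string            `json:"body"`
	Meta  map[string]string `json:"meta"`
	MSecs uint32            `json:"msecs"`
	Error string            `json:"error"`
}

// parseTextOutput (hypothetical helper) decodes the JSON output and
// reports a non-empty "error" field as a Go error.
func parseTextOutput(data []byte) (textOutput, error) {
	var out textOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return out, fmt.Errorf("decode output: %w", err)
	}
	if out.Error != "" {
		return out, errors.New(out.Error)
	}
	return out, nil
}

func main() {
	// Made-up sample output, for illustration only.
	sample := []byte(`{"body":"Hello","meta":{"title":"Demo"},"msecs":12,"error":""}`)
	out, err := parseTextOutput(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Body, out.Meta["title"], out.MSecs)
}
```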







Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Init

func Init(bc base.Component) *component

Types

type ConvertToTextInput

type ConvertToTextInput struct {
	// Doc: Document to convert
	Doc string `json:"doc"`
}

ConvertToTextInput defines the input for the convert-to-text task.
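
As a rough illustration of how the `Doc` field might be consumed, the sketch below redeclares the struct locally (so the example stays self-contained) and decodes its base64 content back into raw bytes. The `decodeDoc` helper is hypothetical, and standard base64 without a data-URI prefix is an assumption, not something the package documentation states.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// ConvertToTextInput is a local copy of the documented struct.
type ConvertToTextInput struct {
	// Doc: Document to convert
	Doc string `json:"doc"`
}

// decodeDoc (hypothetical helper) returns the raw document bytes,
// assuming Doc holds standard base64 text.
func decodeDoc(in ConvertToTextInput) ([]byte, error) {
	if in.Doc == "" {
		return nil, errors.New("doc is empty")
	}
	raw, err := base64.StdEncoding.DecodeString(in.Doc)
	if err != nil {
		return nil, fmt.Errorf("doc is not valid base64: %w", err)
	}
	return raw, nil
}

func main() {
	// "SGVsbG8=" is the base64 encoding of "Hello".
	var in ConvertToTextInput
	if err := json.Unmarshal([]byte(`{"doc":"SGVsbG8="}`), &in); err != nil {
		panic(err)
	}
	raw, err := decodeDoc(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```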

type ConvertToTextOutput

type ConvertToTextOutput struct {
	// Body: Plain text converted from the document
	Body string `json:"body"`
	// Meta: Metadata extracted from the document
	Meta map[string]string `json:"meta"`
	// MSecs: Time taken to convert the document
	MSecs uint32 `json:"msecs"`
	// Error: Error message if any during the conversion process
	Error string `json:"error"`
}

ConvertToTextOutput defines the output for the convert-to-text task.

type DocxDocToMarkdownTransformer

type DocxDocToMarkdownTransformer struct {
	Base64EncodedText string
	FileExtension     string
	DisplayImageTag   bool
}

func (DocxDocToMarkdownTransformer) Transform

func (t DocxDocToMarkdownTransformer) Transform() (string, error)

type HTMLToMarkdownTransformer

type HTMLToMarkdownTransformer struct {
	Base64EncodedText string
	FileExtension     string
	DisplayImageTag   bool
}

func (HTMLToMarkdownTransformer) Transform

func (t HTMLToMarkdownTransformer) Transform() (string, error)

type MarkdownTransformer

type MarkdownTransformer interface {
	Transform() (string, error)
}

type PDFToMarkdownTransformer

type PDFToMarkdownTransformer struct {
	Base64EncodedText string
	FileExtension     string
	DisplayImageTag   bool
}

func (PDFToMarkdownTransformer) Transform

func (t PDFToMarkdownTransformer) Transform() (string, error)

type PptPptxToMarkdownTransformer

type PptPptxToMarkdownTransformer struct {
	Base64EncodedText string
	FileExtension     string
	DisplayImageTag   bool
}

func (PptPptxToMarkdownTransformer) Transform

func (t PptPptxToMarkdownTransformer) Transform() (string, error)
