seltabl

package module
v0.9.8
Published: Aug 25, 2024 License: MIT Imports: 9 Imported by: 0

README

seltabl

A Go library, with an accompanying CLI and language server, for configurably parsing HTML sequences into structs. It was originally built for HTML tables but can be used for any HTML sequence.

It enables data binding to structs and provides a simple but dynamic way to define a table schema.

Installation

Install the package in a project with:

go get github.com/conneroisu/seltabl

Install the CLI, which bundles the language server (speaking LSP) and the package's command-line utilities, with:

go install github.com/conneroisu/seltabl/tools/seltabls@latest

Usage

package main

import (
	"fmt"

	"github.com/conneroisu/seltabl"
)

type TableStruct struct {
	A string `json:"a" hSel:"tr:nth-child(1) td:nth-child(1)" dSel:"tr td:nth-child(1)" ctl:"text"`
	B string `json:"b" hSel:"tr:nth-child(1) td:nth-child(2)" dSel:"tr td:nth-child(2)" ctl:"text"`
}

var fixture = `
<table>
	<tr>
		<td>a</td>
		<td>b</td>
	</tr>
	<tr>
		<td>1</td>
		<td>2</td>
	</tr>
	<tr>
		<td>3</td>
		<td>4</td>
	</tr>
	<tr>
		<td>5</td>
		<td>6</td>
	</tr>
	<tr>
		<td>7</td>
		<td>8</td>
	</tr>
</table>
`

func main() {
	fss, err := seltabl.NewFromString[TableStruct](fixture)
	if err != nil {
		panic(fmt.Errorf("failed to parse html: %w", err))
	}
	for _, fs := range fss {
		fmt.Printf("%+v\n", fs)
	}
}

Output:

{A:1 B:2}
{A:3 B:4}
{A:5 B:6}
{A:7 B:8}

Development

A makefile at the root of the project is provided to help with development.

Testing

One can run the tests with:

make test

Linting

One can run the linter with:

make lint

Formatting

One can run the formatter with:

make fmt

Generating documentation

One can run the documentation generator with:

make doc

License

MIT

Types of ctl selectors:

  • text (default) (queries the text of the selected element)
  • spaces (queries the text of the selected element split by spaces)
  • query (queries the attributes of the selected element)
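
To make the difference between the text and spaces controls concrete, here is a minimal standard-library sketch. It does not call seltabl: applyCtl is a hypothetical helper, and splitting with strings.Fields is an assumption about what "split by spaces" means, so the library may treat whitespace differently.

```go
package main

import (
	"fmt"
	"strings"
)

// applyCtl is a hypothetical helper, not part of seltabl. It mimics the
// described difference between the "text" and "spaces" control values on
// the already-extracted text of one cell.
func applyCtl(ctl, cellText string) []string {
	switch ctl {
	case "spaces":
		// ASSUMPTION: split on runs of whitespace and drop empty pieces.
		return strings.Fields(cellText)
	default:
		// "text" is documented as the default: keep the text as one value.
		return []string{cellText}
	}
}

func main() {
	cell := "Iowa State 42"
	fmt.Printf("%q\n", applyCtl("text", cell))   // ["Iowa State 42"]
	fmt.Printf("%q\n", applyCtl("spaces", cell)) // ["Iowa" "State" "42"]
}
```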

Documentation

Overview

Package seltabl provides a simple way to parse html tables into structs.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func New

func New[T any](doc *goquery.Document) ([]T, error)

New parses a goquery doc into a slice of structs.

The struct given as the type argument must have a field with the tag seltabl, a header selector with the tag hSel, and a data selector with the tag dSel.

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/conneroisu/seltabl"
)

var fixture = `
<table>
     <tr> <td>a</td> <td>b</td> </tr>
     <tr> <td>1</td> <td>2</td> </tr>
     <tr> <td>3</td> <td>4</td> </tr>
     <tr> <td>5</td> <td>6</td> </tr>
     <tr> <td>7</td> <td>8</td> </tr>
</table>
`

type FixtureStruct struct {
	A string `json:"a" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(1)" cSel:"$text"`
	B string `json:"b" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(2)" cSel:"$text"`
}

func main() {
	// New expects a parsed goquery document, so build one from the fixture first.
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(fixture))
	if err != nil {
		panic(err)
	}
	p, err := seltabl.New[FixtureStruct](doc)
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewCh added in v0.7.6

func NewCh[T any](doc *goquery.Document, ch chan T) error

NewCh parses a goquery doc into structs and delivers them to a channel.

It parses the html for each of the structs.

The struct given as the type argument must have a field with the tag seltabl, a header selector with the tag hSel, and a data selector with the tag dSel.
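
The channel-based variants are typically consumed from a second goroutine. The sketch below shows one such pattern using only the standard library. The produce function is a hypothetical stand-in for a call such as NewCh, and it is assumed here that the producer does not close the channel itself. Check the library's behavior before closing the channel in real code.

```go
package main

import "fmt"

// Row stands in for a user-defined struct with selector tags.
type Row struct {
	A string
	B string
}

// produce is a hypothetical stand-in for a channel-based parser such as
// NewCh: it sends each row to ch and returns an error when done.
// ASSUMPTION: it does not close the channel.
func produce(rows []Row, ch chan<- Row) error {
	for _, r := range rows {
		ch <- r
	}
	return nil
}

// collect runs the producer in a goroutine, closes the channel once the
// producer returns, and drains the channel on the calling goroutine.
func collect(rows []Row) ([]Row, error) {
	ch := make(chan Row)
	errCh := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		defer close(ch)
		errCh <- produce(rows, ch)
	}()
	var out []Row
	for r := range ch {
		out = append(out, r)
	}
	return out, <-errCh
}

func main() {
	rows, err := collect([]Row{{"1", "2"}, {"3", "4"}})
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Printf("%+v\n", r)
	}
}
```

Using an unbuffered channel keeps memory use flat for large tables, at the cost of the producer waiting on the consumer for every row.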

func NewChFn added in v0.7.7

func NewChFn[
	T any,
	F func(T) bool,
](
	doc *goquery.Document,
	ch chan T,
	fn F,
) error

NewChFn parses a goquery doc into a channel of structs.

It also applies a function to each struct before adding it to the channel.
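
The type-parameter shape (T any, F func(T) bool) can be illustrated without the library. In the sketch below, sendIf is a hypothetical stand-in, and the meaning of the returned bool is an assumption: a row is forwarded only when the function returns true. The documentation above does not say how seltabl interprets the value.

```go
package main

import "fmt"

// Row stands in for a user-defined struct with selector tags.
type Row struct {
	A string
	B string
}

// sendIf is a hypothetical stand-in with the same type-parameter shape as
// NewChFn. ASSUMPTION: a row is sent only when fn returns true; the
// library's actual interpretation of the bool may differ.
func sendIf[T any, F func(T) bool](items []T, ch chan T, fn F) error {
	for _, it := range items {
		if fn(it) {
			ch <- it
		}
	}
	return nil
}

// filterRows collects the rows that pass the predicate.
func filterRows(items []Row, fn func(Row) bool) []Row {
	ch := make(chan Row, len(items)) // buffered so sendIf never blocks
	if err := sendIf(items, ch, fn); err != nil {
		panic(err)
	}
	close(ch)
	var out []Row
	for r := range ch {
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []Row{{"1", "2"}, {"", "4"}, {"5", "6"}}
	kept := filterRows(rows, func(r Row) bool { return r.A != "" })
	for _, r := range kept {
		fmt.Printf("%+v\n", r)
	}
}
```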

func NewChFnErr added in v0.9.6

func NewChFnErr[
	T any,
	F func(T) bool,
](
	doc *goquery.Document,
	ch chan T,
	fn F,
) error

NewChFnErr parses a goquery doc into a channel of structs.

It also applies a function to each struct before adding it to the channel.

It ignores errors per row selected.

func NewFromBytes added in v0.7.4

func NewFromBytes[T any](b []byte) ([]T, error)

NewFromBytes parses a byte slice into a slice of structs adhering to the given generic type.

The byte slice must be a valid html page with a single table.

The passed in generic type must be a struct with valid selectors for the table and data (hSel, dSel, cSel).

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"github.com/conneroisu/seltabl"
)

type FixtureStruct struct {
        A string `json:"a" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(1)" cSel:"$text"`
        B string `json:"b" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(2)" cSel:"$text"`
}

func main() {
	p, err := seltabl.NewFromBytes[FixtureStruct]([]byte(`
	<table>
		<tr> <td>a</td> <td>b</td> </tr>
		<tr> <td>1</td> <td>2</td> </tr>
		<tr> <td>3</td> <td>4</td> </tr>
		<tr> <td>5</td> <td>6</td> </tr>
		<tr> <td>7</td> <td>8</td> </tr>
	</table>
	`))
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewFromBytesCh added in v0.7.6

func NewFromBytesCh[T any](b []byte, ch chan T) error

NewFromBytesCh parses a byte slice into structs adhering to the given generic type and sends them to the given channel.

func NewFromBytesChFn added in v0.7.7

func NewFromBytesChFn[
	T any,
	F func(T) bool,
](
	b []byte,
	ch chan T,
	fn F,
) error

NewFromBytesChFn parses a byte slice into a channel of structs. It also applies a function to each struct before adding it to the channel.

func NewFromReader

func NewFromReader[T any](r io.Reader) ([]T, error)

NewFromReader parses a reader into a slice of structs.

The reader must be a valid html page with a single table.

The passed in generic type must be a struct with valid selectors for the table and data (hSel, dSel, cSel).

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"strings"

	"github.com/conneroisu/seltabl"
)

type TableStruct struct {
	A string `json:"a" hSel:"tr:nth-child(1) td:nth-child(1)" dSel:"tr td:nth-child(1)" cSel:"$text"`
	B string `json:"b" hSel:"tr:nth-child(1) td:nth-child(2)" dSel:"tr td:nth-child(2)" cSel:"$text"`
}

func main() {
	p, err := seltabl.NewFromReader[TableStruct](strings.NewReader(`
	<table>
		<tr> <td>a</td> <td>b</td> </tr>
		<tr> <td>1</td> <td>2</td> </tr>
		<tr> <td>3</td> <td>4</td> </tr>
		<tr> <td>5</td> <td>6</td> </tr>
		<tr> <td>7</td> <td>8</td> </tr>
	</table>
	`))
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewFromReaderCh added in v0.7.6

func NewFromReaderCh[T any](r io.Reader, ch chan T) error

NewFromReaderCh parses a reader into structs and sends them to the given channel.

func NewFromReaderChFn added in v0.7.7

func NewFromReaderChFn[
	T any,
	F func(T) bool,
](
	r io.Reader,
	ch chan T,
	fn F,
) error

NewFromReaderChFn parses a reader into a channel of structs. It also applies a function to each struct before adding it to the channel.

func NewFromString

func NewFromString[T any](htmlInput string) ([]T, error)

NewFromString parses a string into a slice of structs.

The struct must have a field with the tag seltabl, a header selector with the tag hSel, and a data selector with the tag dSel.

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"github.com/conneroisu/seltabl"
)

type FixtureStruct struct {
        A string `json:"a" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(1)" cSel:"$text"`
        B string `json:"b" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(2)" cSel:"$text"`
}

func main() {
	p, err := seltabl.NewFromString[FixtureStruct](`
	<table>
		<tr> <td>a</td> <td>b</td> </tr>
		<tr> <td>1</td> <td>2</td> </tr>
		<tr> <td>3</td> <td>4</td> </tr>
		<tr> <td>5</td> <td>6</td> </tr>
		<tr> <td>7</td> <td>8</td> </tr>
	</table>
	`)
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewFromStringCh added in v0.7.6

func NewFromStringCh[T any](htmlInput string, ch chan T) error

NewFromStringCh parses a string into structs and sends them to the given channel.

func NewFromStringChFn added in v0.7.7

func NewFromStringChFn[
	T any,
	F func(T) bool,
](
	htmlInput string,
	ch chan T,
	fn F,
) error

NewFromStringChFn parses a string into a channel of structs. It also applies a function to each struct before adding it to the channel.

func NewFromStringChFnErr added in v0.9.7

func NewFromStringChFnErr[T any](
	htmlInput string,
	ch chan T,
	fn func(T) bool,
) error

NewFromStringChFnErr parses a string into a channel of structs.

It also applies a function to each struct before adding it to the channel.

It ignores errors per row selected.

func NewFromURL

func NewFromURL[T any](url string) ([]T, error)

NewFromURL parses a given URL's html into a slice of structs adhering to the given generic type.

The URL must point to a valid html page with a single table.

The passed in generic type must be a struct with valid selectors for the table and data (hSel, dSel, cSel).

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"github.com/conneroisu/seltabl"
)

type FixtureStruct struct {
        A string `json:"a" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(1)" cSel:"$text"`
        B string `json:"b" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(2)" cSel:"$text"`
}

func main() {
	p, err := seltabl.NewFromURL[FixtureStruct]("https://github.com/conneroisu/seltabl/blob/main/testdata/ab_num_table.html")
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewFromURLCh added in v0.7.7

func NewFromURLCh[T any](url string, ch chan T) error

NewFromURLCh parses a given URL's html into structs adhering to the given generic type and sends them to the given channel.

The URL must point to a valid html page with a single table.

The passed in generic type must be a struct with valid selectors for the table and data (hSel, dSel, cSel).

The selectors' responsibilities:

  • header selector (hSel): used to find the header row and column for the field in the given struct.
  • data selector (dSel): used to find the data column for the field in the given struct.
  • query selector (qSel): used to query for the inner text or attribute of the cell.
  • control selector (cSel): used to control what to query for the inner text or attribute of the cell.

Example:

package main

import (
	"fmt"
	"github.com/conneroisu/seltabl"
)

type FixtureStruct struct {
        A string `json:"a" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(1)" cSel:"$text"`
        B string `json:"b" hSel:"tr:nth-child(1)" dSel:"table tr:not(:first-child) td:nth-child(2)" cSel:"$text"`
}

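func main() {
	ch := make(chan FixtureStruct)
	go func() {
		// Close the channel once parsing finishes so the loop below ends.
		// This assumes NewFromURLCh does not close the channel itself.
		defer close(ch)
		err := seltabl.NewFromURLCh[FixtureStruct]("https://github.com/conneroisu/seltabl/blob/main/testdata/ab_num_table.html", ch)
		if err != nil {
			panic(err)
		}
	}()
	// NewFromURLCh returns only an error; the rows arrive on the channel.
	for pp := range ch {
		fmt.Printf("pp %+v\n", pp)
	}
}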

func SetStructField

func SetStructField[T any](
	structPtr *T,
	structField reflect.StructField,
	cellValue *goquery.Selection,
	selector SelectorI,
) error

SetStructField sets a struct field to a value. It uses generics to specify the type of the struct and the field name. It also uses the selector interface to find the value and uses the type of the selector to parse and set the value.

It is used by the NewFromString function.
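
Setting a field from cell text by reflection can be sketched with the standard library alone. The setField helper below is hypothetical and much simpler than SetStructField: it takes the raw string directly instead of a goquery selection and a selector, and it handles only string and int fields.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// Row is an illustrative target struct.
type Row struct {
	Name  string
	Count int
}

// setField is a hypothetical helper, not seltabl's SetStructField. It
// parses raw according to the field's kind and stores it in *structPtr.
func setField[T any](structPtr *T, field reflect.StructField, raw string) error {
	v := reflect.ValueOf(structPtr).Elem().FieldByIndex(field.Index)
	switch field.Type.Kind() {
	case reflect.String:
		v.SetString(raw)
	case reflect.Int, reflect.Int64:
		n, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return fmt.Errorf("field %s: %w", field.Name, err)
		}
		v.SetInt(n)
	default:
		return fmt.Errorf("field %s: unsupported kind %s", field.Name, field.Type.Kind())
	}
	return nil
}

// fill sets every field of a Row from a parallel slice of raw cell texts.
func fill(cells []string) (Row, error) {
	var r Row
	t := reflect.TypeOf(r)
	for i := 0; i < t.NumField(); i++ {
		if err := setField(&r, t.Field(i), cells[i]); err != nil {
			return r, err
		}
	}
	return r, nil
}

func main() {
	r, err := fill([]string{"a", "7"})
	fmt.Printf("%+v %v\n", r, err) // {Name:a Count:7} <nil>
}
```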

Types

type Decoder

type Decoder[T any] struct {
	// contains filtered or unexported fields
}

Decoder is a struct for decoding a reader into a slice of structs.

It is used by the NewDecoder function.

It is not intended to be used directly.

Example:

type TableStruct struct {
	A string `json:"a" seltabl:"a" hSel:"tr:nth-child(1) td:nth-child(1)" dSel:"tr td:nth-child(1)" cSel:"$text"`
	B string `json:"b" seltabl:"b" hSel:"tr:nth-child(1) td:nth-child(2)" dSel:"tr td:nth-child(2)" cSel:"$text"`
}

func main() {
	r := strings.NewReader(`
	<table>
		<tr>
			<td>a</td>
			<td>b</td>
		</tr>
		<tr>
			<td> 1 </td>
			<td>2</td>
		</tr>
		<tr>
			<td>3 </td>
			<td> 4</td>
		</tr>
		<tr>
			<td> 5 </td>
			<td> 6</td>
		</tr>
		<tr>
			<td>7 </td>
			<td> 8</td>
		</tr>
	</table>
	`)
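	// NewDecoder takes an io.ReadCloser and returns a *Decoder, so wrap the
	// strings.Reader and call Decode to obtain the rows.
	d := seltabl.NewDecoder[TableStruct](io.NopCloser(r))
	p, err := d.Decode()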
	if err != nil {
		panic(err)
	}
	for _, pp := range p {
		fmt.Printf("pp %+v\n", pp)
	}
}

func NewDecoder

func NewDecoder[T any](r io.ReadCloser) *Decoder[T]

NewDecoder returns a Decoder that reads from the given reader.

It is used by the NewFromReader function.

This allows for decoding a reader into a slice of structs.

Similar to the json.Decoder for brevity.

func (*Decoder[T]) Decode

func (d *Decoder[T]) Decode() ([]T, error)

Decode parses a reader into a slice of structs.

It is called on a Decoder created by NewDecoder.

This allows for decoding a reader into a slice of structs.

Similar to the json.Decoder for brevity.
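
For readers unfamiliar with the pattern being compared to, this is how the standard library's json.Decoder is used: wrap a reader, then call Decode. The sketch uses only encoding/json and does not touch seltabl. One visible difference is that json's Decode fills a pointer argument, whereas the Decode signature documented above returns the slice.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Row is an illustrative target struct.
type Row struct {
	A string `json:"a"`
	B string `json:"b"`
}

// decodeRows wraps a reader in a json.Decoder and decodes one JSON array.
func decodeRows(input string) ([]Row, error) {
	var rows []Row
	dec := json.NewDecoder(strings.NewReader(input))
	if err := dec.Decode(&rows); err != nil {
		return nil, err
	}
	return rows, nil
}

func main() {
	rows, err := decodeRows(`[{"a":"1","b":"2"},{"a":"3","b":"4"}]`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", rows) // [{A:1 B:2} {A:3 B:4}]
}
```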

type ErrNoDataFound added in v0.5.1

type ErrNoDataFound struct {
	Typ   reflect.Type
	Field reflect.StructField
	Cfg   *SelectorConfig
}

ErrNoDataFound is an error for when no data is found for a selector

func (ErrNoDataFound) Error added in v0.5.1

func (e ErrNoDataFound) Error() string

Error implements the error interface for ErrNoDataFound

type ErrParsing added in v0.6.3

type ErrParsing struct {
	Field reflect.Type
	Value string
	Err   error
}

ErrParsing is returned when a field's value cannot be parsed.

func (ErrParsing) Error added in v0.6.3

func (e ErrParsing) Error() string

Error returns the error message. It implements the error interface.
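
Callers can usually tell struct error types apart with errors.As. The sketch below shows the pattern with a local stand-in type, parsingError, that mirrors the documented fields of ErrParsing. Whether seltabl returns ErrParsing by value, by pointer, or wrapped in another error is not stated here, so treat the matching form as an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strconv"
)

// parsingError is a local stand-in that mirrors the documented fields of
// ErrParsing; it is not the library type.
type parsingError struct {
	Field reflect.Type
	Value string
	Err   error
}

func (e parsingError) Error() string {
	return fmt.Sprintf("failed to parse %q as %s: %v", e.Value, e.Field, e.Err)
}

// parseCount converts a cell's text to an int and wraps any failure.
func parseCount(raw string) (int, error) {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("row 1: %w", parsingError{
			Field: reflect.TypeOf(n), Value: raw, Err: err,
		})
	}
	return n, nil
}

// badValue reports the offending cell text if err contains a parsingError.
func badValue(err error) (string, bool) {
	var pe parsingError
	if errors.As(err, &pe) {
		return pe.Value, true
	}
	return "", false
}

func main() {
	_, err := parseCount("n/a")
	if v, ok := badValue(err); ok {
		fmt.Println("bad cell value:", v) // bad cell value: n/a
	}
}
```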

type ErrSelectorNotFound added in v0.5.1

type ErrSelectorNotFound struct {
	Typ   reflect.Type        // type of the struct
	Field reflect.StructField // field of the struct
	Cfg   *SelectorConfig     // selector config
}

ErrSelectorNotFound is an error for when a selector is not found

func (ErrSelectorNotFound) Error added in v0.5.1

func (e ErrSelectorNotFound) Error() string

Error implements the error interface for ErrSelectorNotFound

type SelectorConfig added in v0.2.9

type SelectorConfig struct {
	DataSelector  string // selector for the data cell
	HeadSelector  string // selector for the header cell
	QuerySelector string // selector for the data cell
	ControlTag    string // tag used to signify selecting aspects of a cell
}

SelectorConfig is a struct for configuring a selector

func NewSelectorConfig added in v0.2.9

func NewSelectorConfig(tag reflect.StructTag) *SelectorConfig

NewSelectorConfig parses a struct tag and returns a SelectorConfig
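
Reading selector keys out of a struct tag can be sketched with reflect.StructTag.Get. The selectorConfig type and configFromTag function below are local stand-ins, not the library's implementation. Mapping the cSel key to ControlTag is an assumption: the README example uses a ctl key while the API examples use cSel.

```go
package main

import (
	"fmt"
	"reflect"
)

// selectorConfig mirrors the documented SelectorConfig fields; it is a
// local stand-in.
type selectorConfig struct {
	DataSelector  string
	HeadSelector  string
	QuerySelector string
	ControlTag    string
}

// configFromTag reads the hSel, dSel, qSel and cSel keys from a struct tag.
// Missing keys yield empty strings; any defaults the library applies are
// not reproduced here.
func configFromTag(tag reflect.StructTag) *selectorConfig {
	return &selectorConfig{
		DataSelector:  tag.Get("dSel"),
		HeadSelector:  tag.Get("hSel"),
		QuerySelector: tag.Get("qSel"),
		ControlTag:    tag.Get("cSel"), // ASSUMPTION: cSel feeds ControlTag
	}
}

// fieldConfig looks up a field of struct value v by name and parses its tag.
func fieldConfig(v any, name string) (*selectorConfig, bool) {
	f, ok := reflect.TypeOf(v).FieldByName(name)
	if !ok {
		return nil, false
	}
	return configFromTag(f.Tag), true
}

// Row is an illustrative struct using the tag keys from the examples.
type Row struct {
	A string `json:"a" hSel:"tr:nth-child(1) td:nth-child(1)" dSel:"tr td:nth-child(1)" cSel:"$text"`
}

func main() {
	if cfg, ok := fieldConfig(Row{}, "A"); ok {
		fmt.Printf("%+v\n", *cfg)
	}
}
```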

type SelectorI added in v0.5.1

type SelectorI interface {
	Select(cellValue *goquery.Selection) (string, error)
}

SelectorI is an interface for running a goquery selector on a cellValue

It is an interface that defines a Select method that takes a cellValue (goquery.Selection) and returns a string of the applied selection and an error.

Directories

examples

  • huggingface-leader-board (command): Package main shows how to use the seltabl package to scrape a table from a given url.
  • ncaa (command): Package main is an example of how to use the seltabl package.
  • penguins-wikipedia (command): Package main is an example of how to use the seltabl package.

tools

  • seltabl-lsp (module)
  • seltabls (module)
