common

package
v1.4.0 Latest
Warning: This package is not in the latest version of its module.
Published: Mar 3, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Overview

Package common provides HTML cleaning, form parsing, and other utility functions for spider rules.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CleanHtml

func CleanHtml(str string, depth int) string

CleanHtml strips HTML tags at increasing levels of aggressiveness based on depth.
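The depth levels are not documented here, so the following is only a sketch of what depth-controlled stripping could look like. The function name stripHTML and the meaning assigned to each depth are assumptions for illustration, not the behavior of CleanHtml itself.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Matches whole <script>...</script> and <style>...</style> blocks.
	scriptStyleRe = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)>`)
	// Matches any single tag, opening or closing.
	anyTagRe = regexp.MustCompile(`(?s)<[^>]*>`)
)

// stripHTML is a hypothetical stand-in for a depth-controlled cleaner.
// Assumed levels: depth >= 1 drops script and style blocks,
// depth >= 2 additionally drops every remaining tag.
func stripHTML(s string, depth int) string {
	if depth >= 1 {
		s = scriptStyleRe.ReplaceAllString(s, "")
	}
	if depth >= 2 {
		s = anyTagRe.ReplaceAllString(s, "")
	}
	return strings.TrimSpace(s)
}

func main() {
	page := `<p>Hello <b>world</b></p><script>alert(1)</script>`
	fmt.Println(stripHTML(page, 1)) // <p>Hello <b>world</b></p>
	fmt.Println(stripHTML(page, 2)) // Hello world
}
```

Regular expressions are a blunt tool for HTML; a real cleaner would more likely walk a parsed document.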

func ConvertToString

func ConvertToString(src string, srcCode string, tagCode string) string

func DecodeString

func DecodeString(src, charset string) string

func Deprive

func Deprive(s string) string

Deprive removes common whitespace escape characters.

func Deprive2

func Deprive2(s string) string

Deprive2 removes both actual and literal whitespace escape sequences.
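The difference between "actual" and "literal" escapes is easy to miss, so here is a minimal sketch of the distinction. The helpers dropActual and dropBoth are hypothetical, and the set of characters removed (\n, \r, \t) is an assumption; the package may cover a different set.

```go
package main

import (
	"fmt"
	"strings"
)

// actualWS removes real control characters (one byte each).
var actualWS = strings.NewReplacer("\n", "", "\r", "", "\t", "")

// literalWS removes the two-character sequences backslash+n, backslash+r,
// backslash+t, as they appear in text that was escaped but never unescaped.
var literalWS = strings.NewReplacer(`\n`, "", `\r`, "", `\t`, "")

// dropActual mirrors the documented idea behind Deprive (assumed character set).
func dropActual(s string) string { return actualWS.Replace(s) }

// dropBoth mirrors the documented idea behind Deprive2 (assumed character set).
func dropBoth(s string) string { return literalWS.Replace(actualWS.Replace(s)) }

func main() {
	// "a", a real tab, "b", a literal backslash-n, "c", a real newline.
	raw := "a\tb" + `\n` + "c\n"
	fmt.Printf("%q\n", dropActual(raw)) // "ab\\nc"
	fmt.Printf("%q\n", dropBoth(raw))   // "abc"
}
```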

func DepriveBreak

func DepriveBreak(s string) string

DepriveBreak removes all line-break characters (both actual and literal escape sequences).

func DepriveMutiBreak

func DepriveMutiBreak(s string) string

DepriveMutiBreak collapses consecutive blank lines into a single newline.
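One way to picture this collapsing step is a single regular-expression replacement. The sketch below uses a hypothetical collapseBreaks helper and assumes that both \n and \r\n count as line breaks; it is not taken from the package source.

```go
package main

import (
	"fmt"
	"regexp"
)

// Two or more consecutive line breaks, in either \n or \r\n form.
var multiBreakRe = regexp.MustCompile(`(?:\r?\n){2,}`)

// collapseBreaks replaces each run of line breaks with a single "\n".
func collapseBreaks(s string) string {
	return multiBreakRe.ReplaceAllString(s, "\n")
}

func main() {
	fmt.Printf("%q\n", collapseBreaks("a\n\n\nb\r\n\r\nc")) // "a\nb\nc"
}
```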

func EncodeString

func EncodeString(src, charset string) string

func ExtractArticle

func ExtractArticle(html string) string

ExtractArticle extracts the main article body from an HTML page. Heuristic: the parent of the tag with the longest text node is treated as the article body.

func Floor

func Floor(f float64, n int) float64

Floor truncates f to n decimal places.
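Truncation to n decimal places is commonly done by scaling, dropping the fraction, and scaling back. The following sketch shows that approach with a hypothetical truncate helper; whether the package rounds toward zero or toward negative infinity for negative inputs is not stated here, and this sketch rounds toward zero.

```go
package main

import (
	"fmt"
	"math"
)

// truncate keeps n decimal places of f and discards the rest.
// It rounds toward zero (an assumption for negative inputs).
func truncate(f float64, n int) float64 {
	p := math.Pow10(n)
	return math.Trunc(f*p) / p
}

func main() {
	fmt.Println(truncate(3.14159, 2)) // 3.14
	fmt.Println(truncate(2.999, 0))   // 2
}
```

Because float64 values are binary fractions, a decimal that looks exact may sit slightly below its printed value, so results at the truncation boundary can be surprising.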

func GBKToUTF8

func GBKToUTF8(src string) string

func GetHref added in v1.4.0

func GetHref(baseURL string, url string, href string, mustBase bool) string

GetHref resolves a relative or absolute href against a base URL and current page URL.
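The core of href resolution can be sketched with the standard library's net/url package. The resolve helper below is hypothetical and only covers resolving an href against the current page URL; the separate baseURL argument and the mustBase flag of GetHref are not modeled because their exact semantics are not described here.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns href into an absolute URL relative to pageURL,
// following the usual RFC 3986 reference-resolution rules.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/a/b.html"
	for _, href := range []string{"../c.html", "/img/1.png", "https://other.example/x"} {
		abs, err := resolve(page, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

An href that is already absolute is returned unchanged, which matches the "relative or absolute" wording above.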

func HrefSub

func HrefSub(src string, sub string) string

HrefSub appends query parameters to an existing URL.
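Appending parameters mainly comes down to choosing the right separator. The sketch below uses a hypothetical appendQuery helper and assumes sub is an already-encoded "key=value" string; how HrefSub treats fragments or empty input is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// appendQuery adds sub to src, using "?" when src has no query string yet
// and "&" when it already has one.
func appendQuery(src, sub string) string {
	if sub == "" {
		return src
	}
	if strings.Contains(src, "?") {
		return src + "&" + sub
	}
	return src + "?" + sub
}

func main() {
	fmt.Println(appendQuery("https://example.com/list", "page=2"))
	fmt.Println(appendQuery("https://example.com/list?q=go", "page=2"))
}
```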

func MakeUrl

func MakeUrl(path string, schemeAndHost ...string) (string, bool)

MakeUrl takes a path and an optional scheme-and-host prefix, for example schemeAndHost "https://www.baidu.com" and path "/search?w=x".

func Ping

func Ping(address string, timeoutSecond int) result.Result[ping.PingResult]

func Pinger

func Pinger(address string, timeoutSecond int) result.VoidResult

func ProcessHtml

func ProcessHtml(html string) string

ProcessHtml removes comments from an HTML string.
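Comment removal of this kind can be sketched with one non-greedy regular expression. The stripComments helper is hypothetical and is not the package's implementation; it only illustrates the transformation described.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?s) lets "." match newlines so multi-line comments are covered;
// ".*?" is non-greedy so each comment is matched on its own.
var commentRe = regexp.MustCompile(`(?s)<!--.*?-->`)

// stripComments deletes every <!-- ... --> block from s.
func stripComments(s string) string {
	return commentRe.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripComments("<p>a</p><!-- x\ny --><p>b</p>")) // <p>a</p><p>b</p>
}
```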

func SplitCookies

func SplitCookies(cookieStr string) (cookies []*http.Cookie)

SplitCookies parses a cookie string (e.g. "mt=ci%3D-1_0; thw=cn; v=0;") into []*http.Cookie.
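To make the input and output shapes concrete, here is a sketch of parsing such a string with the standard library. The parseCookies helper is hypothetical; it assumes pairs are separated by ";" and that values are kept as-is without percent-decoding, which may differ from what SplitCookies does.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseCookies splits "name=value; name=value;" into cookie structs.
// Empty segments, such as the one after a trailing ";", are skipped.
func parseCookies(s string) []*http.Cookie {
	var out []*http.Cookie
	for _, part := range strings.Split(s, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		name, value, _ := strings.Cut(part, "=")
		out = append(out, &http.Cookie{Name: name, Value: value})
	}
	return out
}

func main() {
	for _, c := range parseCookies("mt=ci%3D-1_0; thw=cn; v=0;") {
		fmt.Printf("%s => %s\n", c.Name, c.Value)
	}
}
```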

func Unicode16ToUTF8

func Unicode16ToUTF8(str string) string

Unicode16ToUTF8 converts \uXXXX escape sequences in a string to UTF-8 characters.
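The conversion can be sketched as a regular-expression replacement that parses the four hex digits of each escape. The decodeEscapes helper is hypothetical, and this sketch does not combine UTF-16 surrogate pairs, which the real function may or may not handle.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// A backslash, the letter u, and exactly four hex digits.
var escRe = regexp.MustCompile(`\\u([0-9a-fA-F]{4})`)

// decodeEscapes replaces each \uXXXX sequence with the UTF-8 encoding
// of that code point. Surrogate pairs are not combined in this sketch.
func decodeEscapes(s string) string {
	return escRe.ReplaceAllStringFunc(s, func(m string) string {
		n, err := strconv.ParseUint(m[2:], 16, 32)
		if err != nil {
			return m // leave the text untouched if parsing fails
		}
		return string(rune(n))
	})
}

func main() {
	fmt.Println(decodeEscapes(`\u5496\u5561 ok`)) // 咖啡 ok
}
```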

func UnicodeToUTF8

func UnicodeToUTF8(str string) string

UnicodeToUTF8 converts HTML numeric character references (e.g. "&#21654;&#21857;", which decodes to "咖啡") to UTF-8.
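For comparison, the standard library's html.UnescapeString performs this kind of decoding. The decodeRefs wrapper below is hypothetical and only shows the transformation; note that html.UnescapeString also decodes named entities such as "&amp;amp;", which UnicodeToUTF8 may not do.

```go
package main

import (
	"fmt"
	"html"
)

// decodeRefs turns numeric character references such as &#21654;
// into the UTF-8 characters they denote.
func decodeRefs(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(decodeRefs("&#21654;&#21857;")) // 咖啡
}
```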

Types

type Form

type Form struct {
	// contains filtered or unexported fields
}

Form is the default form element.

func NewForm

func NewForm(ctx *spider.Context, rule string, u string, form *goquery.Selection, schemeAndHost ...string) *Form

NewForm creates and returns a *Form.

func (*Form) Action

func (f *Form) Action() string

Action returns the form action URL. The URL will always be absolute.

func (*Form) Click

func (f *Form) Click(button string) bool

Click submits the form by clicking the button with the given name.

func (*Form) Dom

func (f *Form) Dom() *goquery.Selection

Dom returns the inner *goquery.Selection.

func (*Form) Input

func (f *Form) Input(name, value string) *Form

Input sets the value of a form field.

func (*Form) Inputs

func (f *Form) Inputs(kv map[string]string) *Form

Inputs sets the values of multiple form fields from the given map of field names to values.

func (*Form) Method

func (f *Form) Method() string

Method returns the form method, e.g. "GET", "POST", or "POST-M".

func (*Form) Submit

func (f *Form) Submit() bool

Submit submits the form. Clicks the first button in the form, or submits the form without using any button when the form does not contain any buttons.
