
package
v1.0.6
Published: Aug 31, 2025 License: MIT Imports: 14 Imported by: 0

Documentation

Overview

ABOUTME: Implements article base URL extraction by removing pagination parameters
ABOUTME: Faithful port of JavaScript article-base-url.js with identical logic and behavior

ABOUTME: Normalizes whitespace in text content while preserving spaces in HTML tags
ABOUTME: Faithful port of JavaScript normalizeSpaces function with 100% compatibility

ABOUTME: Removes URL anchors/fragments for clean URL comparison
ABOUTME: Faithfully ports the JavaScript removeAnchor function behavior

Constants

const DEFAULT_ENCODING = "utf-8"

DEFAULT_ENCODING is the fallback encoding when none is detected

Variables

var CODE_TAG_RE = regexp.MustCompile(`(?i)<code[^>]*>.*?</code>`)

CODE_TAG_RE finds code tags and their content (only closed tags)

var ENCODING_RE = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)

ENCODING_RE matches charset declarations in HTML meta tags. It matches both quoted and unquoted charset values, like the JavaScript version.

var HAS_ALPHA_RE = regexp.MustCompile(`(?i)[a-z]`)

HAS_ALPHA_RE matches strings containing alphabetic characters

var IS_ALPHA_RE = regexp.MustCompile(`(?i)^[a-z]+$`)

IS_ALPHA_RE matches strings containing only alphabetic characters

var IS_DIGIT_RE = regexp.MustCompile(`^[0-9]+$`)

IS_DIGIT_RE matches strings containing only digits

var MULTIPLE_SPACES_RE = regexp.MustCompile(`\s{2,}`)

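MULTIPLE_SPACES_RE matches two or more consecutive whitespace characters. It corresponds to the \s{2,} portion of the JavaScript regex /\s{2,}(?![^<>]*<\/(pre|code|textarea)>)/g. Go's regexp package does not support lookahead, so the pre, code, and textarea exclusion has to be handled separately.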

var PAGE_IN_HREF_RE = regexp.MustCompile(`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)

PAGE_IN_HREF_RE is a regular expression that tries to find the page number within a URL, if one exists. It mirrors the JavaScript regex: /(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|\/)([0-9]{1,3})/i

Matches:

page=1
pg=1
p=1
paging=12
pag=7
pagination/1
paging/88
pa/83
p/11

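Does not yield a page number:

pg=102 (the pattern itself matches up to three digits, but PageNumFromURL rejects page numbers >= 100)
page:2 (':' is not an accepted separator; only '=' and '/' are)

var PRE_TAG_RE = regexp.MustCompile(`(?i)<pre[^>]*>.*?</pre>`)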

PRE_TAG_RE finds pre tags and their content (only closed tags)

var TEXTAREA_TAG_RE = regexp.MustCompile(`(?i)<textarea[^>]*>.*?</textarea>`)

TEXTAREA_TAG_RE finds textarea tags and their content (only closed tags)

Functions

func ArticleBaseURL

func ArticleBaseURL(urlStr string, parsedURL *url.URL) string

ArticleBaseURL takes a URL and returns its article base, that is, the URL with any pagination data removed. This is useful for comparing against other links that may contain pagination data.

This is a faithful port of the JavaScript articleBaseUrl function.

Parameters:

  • urlStr: The URL string to process
  • parsedURL: Optional pre-parsed URL (can be nil)

Returns:

  • string: The base URL with pagination data removed

JavaScript equivalent:

export default function articleBaseUrl(url, parsed) {
  const parsedUrl = parsed || URL.parse(url);
  const { protocol, host, path } = parsedUrl;
  ...
}

func ExcerptContent

func ExcerptContent(content string, words ...int) string

func ExtractFromURL

func ExtractFromURL(url string, regexList []*regexp.Regexp) (string, bool)

ExtractFromURL tests each regular expression in regexList against the URL, in order, and returns the first capture group (group 1) of the first pattern that matches. It is primarily used to extract date information from URLs during date-published extraction.

Parameters:

  • url: The URL string to search within
  • regexList: A slice of compiled regular expressions to test against the URL

Returns:

  • string: The first capture group from the first matching regex, or empty string if no match
  • bool: true if a match was found, false otherwise

Each regex is expected to contain at least one capture group.

func FormatDateForJSON

func FormatDateForJSON(t *time.Time) string

FormatDateForJSON formats a date for JSON output (compatible with the JavaScript version).

func GetEncoding

func GetEncoding(str string) string

GetEncoding extracts and validates a character encoding from a string. It is a faithful port of the JavaScript getEncoding function: it searches the string with ENCODING_RE and returns the matched charset if that charset exists; otherwise it returns DEFAULT_ENCODING.
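A rough sketch of that flow follows. The regex and fallback are the documented ones, but how the package decides that a charset "exists" is not shown, so the knownEncodings allow-list is a hypothetical stand-in, and returning the name in lowercase is likewise an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const defaultEncoding = "utf-8"

var encodingRE = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)

// knownEncodings is a hypothetical stand-in for the real existence check.
var knownEncodings = map[string]bool{
	"utf-8":        true,
	"iso-8859-1":   true,
	"windows-1252": true,
	"shift_jis":    true,
}

// getEncoding is an illustrative sketch, not the package's implementation.
func getEncoding(s string) string {
	if m := encodingRE.FindStringSubmatch(s); m != nil {
		name := strings.ToLower(m[1]) // assumption: names are compared case-insensitively
		if knownEncodings[name] {
			return name
		}
	}
	return defaultEncoding
}

func main() {
	fmt.Println(getEncoding(`text/html; charset=ISO-8859-1`)) // iso-8859-1
	fmt.Println(getEncoding(`<meta charset="klingon">`))      // utf-8
	fmt.Println(getEncoding(`no declaration here`))           // utf-8
}
```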

func IsGoodSegment

func IsGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool

IsGoodSegment determines if a URL segment should be kept or removed. This is a faithful port of the JavaScript isGoodSegment function.

JavaScript logic:

  • If segment is purely a number and it's in first/second position with < 3 chars: keep it
  • If segment is "index" in first position: remove it
  • If segment is < 3 chars in first/second position and first segment has no letters: remove it
  • Otherwise: keep it
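The rules above can be sketched as follows. This assumes that index is zero-based (so "first/second position" means index 0 or 1), that the "index" comparison is case-insensitive, and that a later "remove" rule overrides the earlier "keep" rule; the lowercase name isGoodSegment marks it as an illustration rather than the package's code.

```go
package main

import (
	"fmt"
	"strings"
)

// isGoodSegment is a hypothetical sketch of the documented rules.
// The "purely a number: keep it" rule needs no code of its own here,
// because keeping is the default and a later rule may still remove the segment.
func isGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool {
	switch {
	case index == 0 && strings.EqualFold(segment, "index"):
		// A leading "index" segment is dropped.
		return false
	case index < 2 && len(segment) < 3 && !firstSegmentHasLetters:
		// Short early segments are dropped when the first segment has no letters.
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(isGoodSegment("index", 0, true))          // false
	fmt.Println(isGoodSegment("12", 1, false))            // false
	fmt.Println(isGoodSegment("12", 1, true))             // true
	fmt.Println(isGoodSegment("article-title", 2, false)) // true
}
```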

func IsValidDate

func IsValidDate(t *time.Time) bool

IsValidDate checks if a parsed date is reasonable

func NormalizeSpaces

func NormalizeSpaces(text string) string

NormalizeSpaces normalizes consecutive whitespace characters to single spaces while preserving spacing within pre, code, and textarea HTML tags.

This function provides 100% compatibility with the JavaScript normalizeSpaces function:

  • Replaces 2+ consecutive whitespace characters with a single space
  • Preserves whitespace inside <pre>, <code>, and <textarea> tags
  • Trims leading and trailing whitespace from the result
  • Handles all types of whitespace: spaces, tabs, newlines, carriage returns

Example:

NormalizeSpaces("text   with    spaces") // returns "text with spaces"
NormalizeSpaces("<pre>  keep  spaces  </pre>") // returns "<pre>  keep  spaces  </pre>"
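Because Go's regexp package has no lookahead, one way to get this behavior is to locate the protected blocks first and collapse whitespace only in the text between them. The sketch below takes that approach; it is not necessarily how the package does it. The combined protectedRE pattern and the added s flag (so that protected blocks may span newlines) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// protectedRE matches closed pre, code, and textarea blocks.
// Assumption: the s flag is added so a block may contain newlines.
var protectedRE = regexp.MustCompile(
	`(?is)<pre[^>]*>.*?</pre>|<code[^>]*>.*?</code>|<textarea[^>]*>.*?</textarea>`)

var multipleSpacesRE = regexp.MustCompile(`\s{2,}`)

// normalizeSpaces is an illustrative sketch: collapse whitespace runs
// outside protected blocks, copy protected blocks through unchanged.
func normalizeSpaces(text string) string {
	var b strings.Builder
	last := 0
	for _, loc := range protectedRE.FindAllStringIndex(text, -1) {
		// Text before the protected block gets collapsed.
		b.WriteString(multipleSpacesRE.ReplaceAllString(text[last:loc[0]], " "))
		// The protected block itself is copied verbatim.
		b.WriteString(text[loc[0]:loc[1]])
		last = loc[1]
	}
	b.WriteString(multipleSpacesRE.ReplaceAllString(text[last:], " "))
	return strings.TrimSpace(b.String())
}

func main() {
	fmt.Printf("%q\n", normalizeSpaces("text   with    spaces"))
	// "text with spaces"
	fmt.Printf("%q\n", normalizeSpaces("  a \n\n b  <code>x  y</code>  c "))
	// "a b <code>x  y</code> c"
}
```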

func PageNumFromURL

func PageNumFromURL(url string) *int

PageNumFromURL extracts a page number from a URL string. This is a faithful port of the JavaScript pageNumFromUrl function.

The function looks for page number patterns in URLs like:

  • page=1, pg=1, p=1
  • paging=12, pag=7
  • pagination/1, paging/88, pa/83, p/11

Returns:

  • *int: the page number if found and < 100
  • nil: if no page number found or page number >= 100

JavaScript equivalent:

export default function pageNumFromUrl(url) {
  const matches = url.match(PAGE_IN_HREF_RE);
  if (!matches) return null;
  const pageNum = parseInt(matches[6], 10);
  return pageNum < 100 ? pageNum : null;
}
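A Go rendering of the JavaScript above might look like the following sketch. The pattern is the documented PAGE_IN_HREF_RE; the lowercase names and the example URLs are illustrative, and the package's actual code may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var pageInHrefRE = regexp.MustCompile(
	`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)

// pageNumFromURL is an illustrative port: capture group 6 holds the digits,
// and values of 100 or more are rejected.
func pageNumFromURL(url string) *int {
	m := pageInHrefRE.FindStringSubmatch(url)
	if m == nil {
		return nil
	}
	n, err := strconv.Atoi(m[6])
	if err != nil || n >= 100 {
		return nil
	}
	return &n
}

func main() {
	urls := []string{
		"https://example.com/story?page=2",
		"https://example.com/story?pg=102",
		"https://example.com/story",
	}
	for _, u := range urls {
		if n := pageNumFromURL(u); n != nil {
			fmt.Printf("%s -> %d\n", u, *n)
		} else {
			fmt.Printf("%s -> nil\n", u)
		}
	}
}
```

Returning *int mirrors the JavaScript null: a nil pointer means "no usable page number", which keeps page 0 distinguishable from no match. Only the first match in the URL is considered, as with a non-global match in JavaScript.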

func ParseDate

func ParseDate(dateStr string) (*time.Time, error)

ParseDate attempts to parse a date string using various methods

func ParseDateFromMeta

func ParseDateFromMeta(content string) (*time.Time, error)

ParseDateFromMeta parses dates from meta tag content

func RemoveAnchor

func RemoveAnchor(url string) string

RemoveAnchor removes the anchor/fragment portion from a URL, along with trailing slashes. This function provides 100% compatibility with the JavaScript removeAnchor function:

  • Splits URL on '#' and takes the first part (removes fragment)
  • Removes trailing slashes from the result
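The two steps can be sketched with the standard strings package. This is an illustration under one assumption: "trailing slashes" is read as every trailing slash. If only a single slash should go, strings.TrimSuffix would be the closer fit.

```go
package main

import (
	"fmt"
	"strings"
)

// removeAnchor is a hypothetical sketch of the documented behavior.
func removeAnchor(url string) string {
	// Keep everything before the first '#'; Cut returns the whole string if there is none.
	base, _, _ := strings.Cut(url, "#")
	// Assumption: all trailing slashes are stripped, not just one.
	return strings.TrimRight(base, "/")
}

func main() {
	fmt.Println(removeAnchor("https://example.com/post/#comments")) // https://example.com/post
	fmt.Println(removeAnchor("https://example.com/post"))           // https://example.com/post
}
```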

Types

This section is empty.
