package text

v1.0.6 (latest) · Published: Aug 31, 2025 · License: MIT · Imports: 14 · Imported by: 0

Repository

github.com/BumpyClock/hermes

Documentation ¶

Overview ¶

Article base URL extraction: removes pagination parameters from a URL. Faithful port of the JavaScript article-base-url.js, with identical logic and behavior.

Whitespace normalization: normalizes whitespace in text content while preserving whitespace inside <pre>, <code>, and <textarea> tags. Faithful port of the JavaScript normalizeSpaces function.

Anchor removal: removes URL anchors/fragments for clean URL comparison. Faithful port of the JavaScript removeAnchor function.

Index ¶

Constants
Variables
func ArticleBaseURL(urlStr string, parsedURL *url.URL) string
func ExcerptContent(content string, words ...int) string
func ExtractFromURL(url string, regexList []*regexp.Regexp) (string, bool)
func FormatDateForJSON(t *time.Time) string
func GetEncoding(str string) string
func IsGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool
func IsValidDate(t *time.Time) bool
func NormalizeSpaces(text string) string
func PageNumFromURL(url string) *int
func ParseDate(dateStr string) (*time.Time, error)
func ParseDateFromMeta(content string) (*time.Time, error)
func RemoveAnchor(url string) string

Constants ¶

const DEFAULT_ENCODING = "utf-8"

DEFAULT_ENCODING is the fallback encoding when none is detected

Variables ¶

var CODE_TAG_RE = regexp.MustCompile(`(?i)<code[^>]*>.*?</code>`)

CODE_TAG_RE finds code tags and their content (only closed tags)

var ENCODING_RE = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)

ENCODING_RE matches charset declarations in HTML meta tags. It matches both quoted and unquoted charset values, like the JavaScript version.

var HAS_ALPHA_RE = regexp.MustCompile(`(?i)[a-z]`)

HAS_ALPHA_RE matches strings containing alphabetic characters

var IS_ALPHA_RE = regexp.MustCompile(`(?i)^[a-z]+$`)

IS_ALPHA_RE matches strings containing only alphabetic characters

var IS_DIGIT_RE = regexp.MustCompile(`^[0-9]+$`)

IS_DIGIT_RE matches strings containing only digits

var MULTIPLE_SPACES_RE = regexp.MustCompile(`\s{2,}`)

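MULTIPLE_SPACES_RE matches two or more consecutive whitespace characters. It corresponds to the \s{2,} part of the JavaScript regex /\s{2,}(?![^<>]*<\/(pre|code|textarea)>)/g. Go's regexp package does not support lookahead, so the negative lookahead is not part of this pattern.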

var PAGE_IN_HREF_RE = regexp.MustCompile(`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)

PAGE_IN_HREF_RE tries to find the page number within a URL, if one exists. It matches the JavaScript regex: /(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|\/)([0-9]{1,3})/i

Matches:

page=1
pg=1
p=1
paging=12
pag=7
pagination/1
paging/88
pa/83
p/11

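Not accepted as page numbers:

pg=102 (the pattern itself matches and captures 102, but PageNumFromURL rejects values of 100 or more)
page:2 (':' is not a recognized separator, so the pattern does not match)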

var PRE_TAG_RE = regexp.MustCompile(`(?i)<pre[^>]*>.*?</pre>`)

PRE_TAG_RE finds pre tags and their content (only closed tags)

var TEXTAREA_TAG_RE = regexp.MustCompile(`(?i)<textarea[^>]*>.*?</textarea>`)

TEXTAREA_TAG_RE finds textarea tags and their content (only closed tags)

Functions ¶

func ArticleBaseURL ¶

func ArticleBaseURL(urlStr string, parsedURL *url.URL) string

ArticleBaseURL returns the article base of a URL, that is, the URL with pagination data removed. This is useful for comparing against other links that may contain pagination data.

This is a faithful port of the JavaScript articleBaseUrl function.

Parameters:

urlStr: The URL string to process
parsedURL: Optional pre-parsed URL (can be nil)

Returns:

string: The base URL with pagination data removed

JavaScript equivalent:

export default function articleBaseUrl(url, parsed) {
  const parsedUrl = parsed || URL.parse(url);
  const { protocol, host, path } = parsedUrl;
  ...
}

func ExcerptContent ¶

func ExcerptContent(content string, words ...int) string

func ExtractFromURL ¶

func ExtractFromURL(url string, regexList []*regexp.Regexp) (string, bool)

ExtractFromURL tests each regular expression in regexList against the URL and returns the first capture group (group 1) from the first pattern that matches. It is primarily used to extract date information from URLs during published-date extraction.

Parameters:

url: The URL string to search within
regexList: A slice of compiled regular expressions to test against the URL

Returns:

string: The first capture group from the first matching regex, or empty string if no match
bool: true if a match was found, false otherwise

Each regex is expected to have at least one capture group.

func FormatDateForJSON ¶

func FormatDateForJSON(t *time.Time) string

FormatDateForJSON formats a date for JSON output, compatible with the JavaScript version.

func GetEncoding ¶

func GetEncoding(str string) string

GetEncoding extracts and validates a character encoding from a string. This function is a faithful port of the JavaScript getEncoding function. It searches the string with the ENCODING_RE pattern and returns the matched charset only if that charset exists; otherwise it returns DEFAULT_ENCODING.
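
A rough sketch of this extract-then-validate flow follows. The regex is the ENCODING_RE pattern shown above; getEncoding and the knownEncodings allowlist are hypothetical. How the package actually checks that a charset exists is not documented here, so the allowlist is only a placeholder for that step.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const defaultEncoding = "utf-8"

// Same pattern as ENCODING_RE in this package's docs.
var encodingRE = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)

// knownEncodings is a hypothetical allowlist standing in for the
// package's real charset validation, which the docs do not describe.
var knownEncodings = map[string]bool{
	"utf-8":        true,
	"iso-8859-1":   true,
	"windows-1252": true,
	"shift_jis":    true,
}

// getEncoding returns the declared charset if it is recognized,
// otherwise the default encoding.
func getEncoding(s string) string {
	m := encodingRE.FindStringSubmatch(s)
	if m == nil {
		return defaultEncoding
	}
	// Case-insensitive lookup is an assumption of this sketch.
	if knownEncodings[strings.ToLower(m[1])] {
		return m[1]
	}
	return defaultEncoding
}

func main() {
	fmt.Println(getEncoding(`<meta charset="ISO-8859-1">`)) // ISO-8859-1
	fmt.Println(getEncoding("text/html; charset=klingon"))  // utf-8
	fmt.Println(getEncoding("no declaration here"))         // utf-8
}
```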

func IsGoodSegment ¶

func IsGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool

IsGoodSegment determines if a URL segment should be kept or removed. This is a faithful port of the JavaScript isGoodSegment function.

JavaScript logic:

- If segment is purely a number and it's in first/second position with < 3 chars: keep it
- If segment is "index" in first position: remove it
- If segment is < 3 chars in first/second position and first segment has no letters: remove it
- Otherwise: keep it
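
One way to read these rules is sketched below. isGoodSegment is a hypothetical stand-in, and two choices are assumptions of the sketch rather than documented behavior: the rules are applied in sequence so that a later rule can override an earlier one, and "index" is compared exactly. If the package stops at the first matching rule, results can differ for short numeric segments when the first segment has no letters.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as IS_DIGIT_RE in this package's docs.
var isDigitRE = regexp.MustCompile(`^[0-9]+$`)

// isGoodSegment reports whether a URL path segment should be kept.
// Rules run in order; later rules may override earlier ones (assumption).
func isGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool {
	good := true

	// Rule 1: short, purely numeric segment in position 0 or 1: keep.
	if index < 2 && isDigitRE.MatchString(segment) && len(segment) < 3 {
		good = true
	}
	// Rule 2: "index" in the first position: remove.
	if index == 0 && segment == "index" {
		good = false
	}
	// Rule 3: short segment in position 0 or 1 when the first segment
	// has no letters: remove.
	if index < 2 && len(segment) < 3 && !firstSegmentHasLetters {
		good = false
	}
	return good
}

func main() {
	fmt.Println(isGoodSegment("index", 0, true))   // false
	fmt.Println(isGoodSegment("article", 2, true)) // true
	fmt.Println(isGoodSegment("ab", 1, false))     // false
	fmt.Println(isGoodSegment("12", 1, true))      // true
}
```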

func IsValidDate ¶

func IsValidDate(t *time.Time) bool

IsValidDate checks if a parsed date is reasonable

func NormalizeSpaces ¶

func NormalizeSpaces(text string) string

NormalizeSpaces normalizes consecutive whitespace characters to single spaces while preserving spacing within pre, code, and textarea HTML tags.

This function provides 100% compatibility with the JavaScript normalizeSpaces function:

- Replaces 2+ consecutive whitespace characters with a single space
- Preserves whitespace inside <pre>, <code>, and <textarea> tags
- Trims leading and trailing whitespace from the result
- Handles all types of whitespace: spaces, tabs, newlines, carriage returns

Example:

NormalizeSpaces("text   with    spaces") // returns "text with spaces"
NormalizeSpaces("<pre>  keep  spaces  </pre>") // returns "<pre>  keep  spaces  </pre>"
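
Because Go's regexp package has no lookahead, the behavior above has to be built in more than one step. The following is a minimal sketch of one possible approach, not the package's implementation: it locates closed pre, code, and textarea blocks using the patterns documented above, copies them through untouched, and collapses whitespace only in the text between them. The names normalizeSpaces and protectedRE are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// protectedRE combines the documented PRE_TAG_RE, CODE_TAG_RE, and
// TEXTAREA_TAG_RE patterns into a single alternation.
var protectedRE = regexp.MustCompile(
	`(?i)(?:<pre[^>]*>.*?</pre>|<code[^>]*>.*?</code>|<textarea[^>]*>.*?</textarea>)`)

// Same pattern as MULTIPLE_SPACES_RE.
var multipleSpacesRE = regexp.MustCompile(`\s{2,}`)

// normalizeSpaces collapses runs of whitespace outside protected blocks
// and trims the result.
func normalizeSpaces(text string) string {
	var b strings.Builder
	last := 0
	for _, span := range protectedRE.FindAllStringIndex(text, -1) {
		// Collapse whitespace in the text before this protected block.
		b.WriteString(multipleSpacesRE.ReplaceAllString(text[last:span[0]], " "))
		// Copy the protected block verbatim.
		b.WriteString(text[span[0]:span[1]])
		last = span[1]
	}
	// Collapse whitespace in whatever follows the last protected block.
	b.WriteString(multipleSpacesRE.ReplaceAllString(text[last:], " "))
	return strings.TrimSpace(b.String())
}

func main() {
	fmt.Printf("%q\n", normalizeSpaces("text   with    spaces"))
	fmt.Printf("%q\n", normalizeSpaces("a  b <code>x   y</code>  c"))
}
```

Since the documented tag patterns do not set the `s` flag, `.` does not match newlines, so a protected block that spans several lines would not be recognized by this sketch.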

func PageNumFromURL ¶

func PageNumFromURL(url string) *int

PageNumFromURL extracts a page number from a URL string. This is a faithful port of the JavaScript pageNumFromUrl function.

The function looks for page number patterns in URLs like:

page=1, pg=1, p=1
paging=12, pag=7
pagination/1, paging/88, pa/83, p/11

Returns:

*int: the page number if found and < 100
nil: if no page number found or page number >= 100

JavaScript equivalent:

export default function pageNumFromUrl(url) {
  const matches = url.match(PAGE_IN_HREF_RE);
  if (!matches) return null;
  const pageNum = parseInt(matches[6], 10);
  return pageNum < 100 ? pageNum : null;
}
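
A Go rendering of the same logic might look like the sketch below. It reuses the PAGE_IN_HREF_RE pattern from the Variables section; pageNumFromURL is a hypothetical stand-in, and the package's actual body may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern as PAGE_IN_HREF_RE in this package's docs.
var pageInHrefRE = regexp.MustCompile(
	`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)

// pageNumFromURL returns the page number found in url, or nil when
// there is no match or the number is 100 or more.
func pageNumFromURL(url string) *int {
	m := pageInHrefRE.FindStringSubmatch(url)
	if m == nil {
		return nil
	}
	// Group 6 holds the digits, matching matches[6] in the JavaScript.
	n, err := strconv.Atoi(m[6])
	if err != nil || n >= 100 {
		return nil
	}
	return &n
}

func main() {
	inputs := []string{"https://example.com/article?page=3", "paging/88", "pg=102", "page:2"}
	for _, u := range inputs {
		if n := pageNumFromURL(u); n != nil {
			fmt.Printf("%s -> %d\n", u, *n)
		} else {
			fmt.Printf("%s -> nil\n", u)
		}
	}
}
```

The pointer return mirrors the JavaScript null: a nil result means "no usable page number", which a plain int could not distinguish from page 0.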

func ParseDate ¶

func ParseDate(dateStr string) (*time.Time, error)

ParseDate attempts to parse a date string using various methods

func ParseDateFromMeta ¶

func ParseDateFromMeta(content string) (*time.Time, error)

ParseDateFromMeta parses dates from meta tag content

func RemoveAnchor ¶

func RemoveAnchor(url string) string

RemoveAnchor removes the anchor/fragment portion from a URL and trailing slashes. This function provides 100% compatibility with the JavaScript removeAnchor function:

- Splits URL on '#' and takes the first part (removes fragment)
- Removes trailing slashes from the result
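
The two steps can be sketched as follows. removeAnchor is a hypothetical stand-in; in particular, stripping every trailing slash (rather than just one) is an assumption based on the wording above.

```go
package main

import (
	"fmt"
	"strings"
)

// removeAnchor drops everything from the first '#' onward, then strips
// trailing slashes. Stripping all of them is an assumption of this sketch.
func removeAnchor(url string) string {
	// strings.Cut returns the text before the first "#"; if there is no
	// "#", it returns the whole string.
	base, _, _ := strings.Cut(url, "#")
	return strings.TrimRight(base, "/")
}

func main() {
	fmt.Println(removeAnchor("https://example.com/post/#comments")) // https://example.com/post
	fmt.Println(removeAnchor("https://example.com/a#b#c"))          // https://example.com/a
}
```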

Types ¶

This section is empty.
