Documentation
Overview ¶
ABOUTME: Implements article base URL extraction by removing pagination parameters
ABOUTME: Faithful port of JavaScript article-base-url.js with identical logic and behavior
ABOUTME: Normalizes whitespace in text content while preserving spaces in HTML tags
ABOUTME: Faithful port of JavaScript normalizeSpaces function with 100% compatibility
ABOUTME: Removes URL anchors/fragments for clean URL comparison
ABOUTME: Faithfully ports the JavaScript removeAnchor function behavior
Index ¶
- Constants
- Variables
- func ArticleBaseURL(urlStr string, parsedURL *url.URL) string
- func ExcerptContent(content string, words ...int) string
- func ExtractFromURL(url string, regexList []*regexp.Regexp) (string, bool)
- func FormatDateForJSON(t *time.Time) string
- func GetEncoding(str string) string
- func IsGoodSegment(segment string, index int, firstSegmentHasLetters bool) bool
- func IsValidDate(t *time.Time) bool
- func NormalizeSpaces(text string) string
- func PageNumFromURL(url string) *int
- func ParseDate(dateStr string) (*time.Time, error)
- func ParseDateFromMeta(content string) (*time.Time, error)
- func RemoveAnchor(url string) string
Constants ¶
const DEFAULT_ENCODING = "utf-8"
DEFAULT_ENCODING is the fallback encoding when none is detected
Variables ¶
var CODE_TAG_RE = regexp.MustCompile(`(?i)<code[^>]*>.*?</code>`)
CODE_TAG_RE finds code tags and their content (only closed tags)
var ENCODING_RE = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)
ENCODING_RE matches charset declarations in HTML meta tags. It matches both quoted and unquoted charset values, like the JavaScript version.
var HAS_ALPHA_RE = regexp.MustCompile(`(?i)[a-z]`)
HAS_ALPHA_RE matches strings containing alphabetic characters
var IS_ALPHA_RE = regexp.MustCompile(`(?i)^[a-z]+$`)
IS_ALPHA_RE matches strings containing only alphabetic characters
var IS_DIGIT_RE = regexp.MustCompile(`^[0-9]+$`)
IS_DIGIT_RE matches strings containing only digits
var MULTIPLE_SPACES_RE = regexp.MustCompile(`\s{2,}`)
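MULTIPLE_SPACES_RE matches two or more consecutive whitespace characters. This is the first part of the JavaScript regex /\s{2,}(?![^<>]*<\/(pre|code|textarea)>)/g. Go's regexp package (RE2 syntax) does not support lookahead, so the (?!...) part is not included in this pattern.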
var PAGE_IN_HREF_RE = regexp.MustCompile(`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)
PAGE_IN_HREF_RE looks for a page number within a URL, if one exists. It matches the JavaScript regex: /(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|\/)([0-9]{1,3})/i
Matches:
- page=1
- pg=1
- p=1
- paging=12
- pag=7
- pagination/1
- paging/88
- pa/83
- p/11
Does not yield a page number:
- pg=102 (the prefix group is optional, so the regex still matches "=102", but PageNumFromURL rejects page numbers >= 100)
- page:2 (no "=" or "/" separator before the digits)
var PRE_TAG_RE = regexp.MustCompile(`(?i)<pre[^>]*>.*?</pre>`)
PRE_TAG_RE finds pre tags and their content (only closed tags)
var TEXTAREA_TAG_RE = regexp.MustCompile(`(?i)<textarea[^>]*>.*?</textarea>`)
TEXTAREA_TAG_RE finds textarea tags and their content (only closed tags)
Functions ¶
func ArticleBaseURL ¶
ArticleBaseURL takes a URL and returns its article base, that is, the URL with any pagination data removed. This is useful for comparing against other links that may contain pagination data.
This is a faithful port of the JavaScript articleBaseUrl function.
Parameters:
- urlStr: The URL string to process
- parsedURL: Optional pre-parsed URL (can be nil)
Returns:
- string: The base URL with pagination data removed
JavaScript equivalent:
export default function articleBaseUrl(url, parsed) {
  const parsedUrl = parsed || URL.parse(url);
  const { protocol, host, path } = parsedUrl;
  ...
}
func ExcerptContent ¶
func ExtractFromURL ¶
ExtractFromURL tests each regular expression in a list against a URL and returns the first capture group (group 1) from the first pattern that matches. It is primarily used to extract date information from URLs during date-published extraction.
Parameters:
- url: The URL string to search within
- regexList: A slice of compiled regular expressions to test against the URL
Returns:
- string: The first capture group from the first matching regex, or empty string if no match
- bool: true if a match was found, false otherwise
Each regex is expected to have at least one capture group; only the first capture group of the first matching regex is returned.
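The matching behavior described above can be sketched as follows. This is a minimal illustration and not the package's source: `extractFromURLSketch` and `exampleDatePatterns` are hypothetical names, the two date patterns are made up for the example, and skipping patterns that have no capture group is an assumption of this sketch.

```go
package main

import "regexp"

// exampleDatePatterns is a hypothetical pattern list for illustration only.
// Each pattern has one capture group that holds the date portion of the URL.
var exampleDatePatterns = []*regexp.Regexp{
	regexp.MustCompile(`/(20\d{2}/\d{2}/\d{2})/`),
	regexp.MustCompile(`(20\d{2}-\d{2}-\d{2})`),
}

// extractFromURLSketch tries each pattern in order and returns capture
// group 1 of the first pattern that matches, plus true. It returns
// ("", false) when nothing matches.
func extractFromURLSketch(rawURL string, regexList []*regexp.Regexp) (string, bool) {
	for _, re := range regexList {
		if re == nil {
			continue // tolerate nil entries in the list
		}
		m := re.FindStringSubmatch(rawURL)
		// m[0] is the whole match; m[1] exists only if the pattern
		// declares a capture group. Patterns without one are skipped
		// (an assumption of this sketch).
		if len(m) > 1 {
			return m[1], true
		}
	}
	return "", false
}
```

With these example patterns, `extractFromURLSketch("https://example.com/2024/03/15/some-story", exampleDatePatterns)` yields `"2024/03/15"` and `true`, because the first pattern matches before the second is tried.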
func FormatDateForJSON ¶
FormatDateForJSON formats date for JSON output (compatible with JS version)
func GetEncoding ¶
GetEncoding extracts and validates a character encoding from a string. This function is a faithful port of the JavaScript getEncoding function. It searches the string for a charset declaration using ENCODING_RE and returns the charset only if it is a recognized encoding; otherwise it returns DEFAULT_ENCODING.
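A rough sketch of this detect-then-validate flow is shown below. How the real function decides that a charset "exists" is not documented here, so the sketch substitutes a tiny hypothetical allowlist (`knownEncodingsSketch`). Lowercasing the result is also an assumption, and all names ending in `Sketch` are illustrative.

```go
package main

import (
	"regexp"
	"strings"
)

// defaultEncodingSketch mirrors the documented DEFAULT_ENCODING value.
const defaultEncodingSketch = "utf-8"

// encodingReSketch uses the same pattern as the documented ENCODING_RE.
var encodingReSketch = regexp.MustCompile(`(?i)charset=['"]?([\w-]+)['"]?`)

// knownEncodingsSketch is a deliberately small, hypothetical allowlist.
// The real validation step is not described in this documentation.
var knownEncodingsSketch = map[string]bool{
	"utf-8":        true,
	"iso-8859-1":   true,
	"windows-1252": true,
	"shift_jis":    true,
}

// getEncodingSketch returns the declared charset if it is in the
// allowlist, and the default encoding otherwise.
func getEncodingSketch(s string) string {
	m := encodingReSketch.FindStringSubmatch(s)
	if m == nil {
		return defaultEncodingSketch // no charset declaration found
	}
	name := strings.ToLower(m[1]) // normalising case is an assumption
	if knownEncodingsSketch[name] {
		return name
	}
	return defaultEncodingSketch // declared, but not recognised
}
```

The fallback matters because pages sometimes declare a misspelled or nonexistent charset; returning the default keeps later decoding from failing on an unknown name.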
func IsGoodSegment ¶
IsGoodSegment determines if a URL segment should be kept or removed. This is a faithful port of the JavaScript isGoodSegment function.
JavaScript logic:
- If segment is purely a number and it's in first/second position with < 3 chars: keep it
- If segment is "index" in first position: remove it
- If segment is < 3 chars in first/second position and first segment has no letters: remove it
- Otherwise: keep it
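The rules above can be sketched as a small function. This is an illustration under stated assumptions, not the package's code: `isGoodSegmentSketch` is a hypothetical name, `index` is assumed to be zero-based (so "first position" means `index == 0`), the rules are assumed to apply in the listed order with later rules overriding earlier ones, and the "index" comparison is assumed to be case-insensitive.

```go
package main

import (
	"regexp"
	"strings"
)

// isDigitReSketch uses the same pattern as the documented IS_DIGIT_RE.
var isDigitReSketch = regexp.MustCompile(`^[0-9]+$`)

// isGoodSegmentSketch reports whether a URL path segment should be kept.
func isGoodSegmentSketch(segment string, index int, firstSegmentHasLetters bool) bool {
	good := true

	// Rule 1: a short, purely numeric segment in position 0 or 1 is kept.
	// This does not change the default, but mirrors the listed logic.
	if index < 2 && isDigitReSketch.MatchString(segment) && len(segment) < 3 {
		good = true
	}

	// Rule 2: a first segment that is just "index" is removed.
	if index == 0 && strings.EqualFold(segment, "index") {
		good = false
	}

	// Rule 3: a short segment in position 0 or 1 is removed when the
	// first segment contained no letters.
	if index < 2 && len(segment) < 3 && !firstSegmentHasLetters {
		good = false
	}

	return good
}
```

Because rule 3 runs after rule 1 in this sketch, a segment such as "2" at index 1 is removed when the first segment has no letters, and kept otherwise.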
func IsValidDate ¶
IsValidDate checks if a parsed date is reasonable
func NormalizeSpaces ¶
NormalizeSpaces normalizes consecutive whitespace characters to single spaces while preserving spacing within pre, code, and textarea HTML tags.
This function provides 100% compatibility with the JavaScript normalizeSpaces function:
- Replaces 2+ consecutive whitespace characters with a single space
- Preserves whitespace inside <pre>, <code>, and <textarea> tags
- Trims leading and trailing whitespace from the result
- Handles all types of whitespace: spaces, tabs, newlines, carriage returns
Example:
NormalizeSpaces("text   with   spaces") // returns "text with spaces"
NormalizeSpaces("<pre>  keep   spaces  </pre>") // returns "<pre>  keep   spaces  </pre>"
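One way to get this behavior without lookahead (which Go's regexp package does not support) is to locate the protected blocks first and collapse whitespace only in the text between them. The sketch below assumes that approach; it is not taken from the package source. The names are hypothetical, and the combined pattern adds the `s` flag so protected blocks may span lines, which differs from the documented PRE_TAG_RE, CODE_TAG_RE, and TEXTAREA_TAG_RE.

```go
package main

import (
	"regexp"
	"strings"
)

// protectedBlockReSketch matches closed <pre>, <code>, and <textarea>
// blocks. The (?is) flags make it case-insensitive and let "." match
// newlines (an assumption of this sketch).
var protectedBlockReSketch = regexp.MustCompile(
	`(?is)<pre[^>]*>.*?</pre>|<code[^>]*>.*?</code>|<textarea[^>]*>.*?</textarea>`)

// multipleSpacesReSketch uses the same pattern as MULTIPLE_SPACES_RE.
var multipleSpacesReSketch = regexp.MustCompile(`\s{2,}`)

// normalizeSpacesSketch collapses runs of whitespace outside protected
// blocks, copies protected blocks verbatim, and trims the result.
func normalizeSpacesSketch(text string) string {
	var b strings.Builder
	last := 0
	for _, loc := range protectedBlockReSketch.FindAllStringIndex(text, -1) {
		// Collapse whitespace in the unprotected text before this block.
		b.WriteString(multipleSpacesReSketch.ReplaceAllString(text[last:loc[0]], " "))
		// Copy the protected block unchanged.
		b.WriteString(text[loc[0]:loc[1]])
		last = loc[1]
	}
	// Collapse whitespace in whatever follows the last protected block.
	b.WriteString(multipleSpacesReSketch.ReplaceAllString(text[last:], " "))
	return strings.TrimSpace(b.String())
}
```

Unclosed tags are not matched by the protected-block pattern, so their contents are normalized like ordinary text, which is consistent with the "only closed tags" note on the tag patterns.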
func PageNumFromURL ¶
PageNumFromURL extracts a page number from a URL string. This is a faithful port of the JavaScript pageNumFromUrl function.
The function looks for page number patterns in URLs like:
- page=1, pg=1, p=1
- paging=12, pag=7
- pagination/1, paging/88, pa/83, p/11
Returns:
- *int: the page number if found and < 100
- nil: if no page number found or page number >= 100
JavaScript equivalent:
export default function pageNumFromUrl(url) {
  const matches = url.match(PAGE_IN_HREF_RE);
  if (!matches) return null;
  const pageNum = parseInt(matches[6], 10);
  return pageNum < 100 ? pageNum : null;
}
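For comparison, the same logic can be sketched in Go. This is a minimal rendering of the JavaScript shown above using the documented PAGE_IN_HREF_RE pattern, not the package's actual source. `pageNumFromURLSketch` and `pageInHrefReSketch` are hypothetical names.

```go
package main

import (
	"regexp"
	"strconv"
)

// pageInHrefReSketch uses the same pattern as the documented PAGE_IN_HREF_RE.
var pageInHrefReSketch = regexp.MustCompile(
	`(?i)(page|paging|(p(a|g|ag)?(e|enum|ewanted|ing|ination)))?(=|/)([0-9]{1,3})`)

// pageNumFromURLSketch returns a pointer to the page number found in
// rawURL, or nil if there is no match or the number is >= 100.
func pageNumFromURLSketch(rawURL string) *int {
	m := pageInHrefReSketch.FindStringSubmatch(rawURL)
	if m == nil {
		return nil
	}
	// Group 6 holds the digits, as in matches[6] in the JavaScript.
	n, err := strconv.Atoi(m[6])
	if err != nil || n >= 100 {
		return nil
	}
	return &n
}
```

Returning `*int` rather than `int` lets callers distinguish "no page number" (nil) from any numeric value, which mirrors the `null` return in the JavaScript.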
func ParseDateFromMeta ¶
ParseDateFromMeta parses dates from meta tag content
func RemoveAnchor ¶
RemoveAnchor removes the anchor/fragment portion from a URL, along with trailing slashes. This function provides 100% compatibility with the JavaScript removeAnchor function:
- Splits URL on '#' and takes the first part (removes fragment)
- Removes trailing slashes from the result
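The two steps can be sketched with the standard strings package. `removeAnchorSketch` is a hypothetical name, and the sketch reads "trailing slashes" literally by stripping every trailing slash; whether the real function strips one slash or all of them is not stated here.

```go
package main

import "strings"

// removeAnchorSketch drops everything from the first '#' onward and
// then strips trailing slashes from what remains.
func removeAnchorSketch(rawURL string) string {
	// strings.Cut returns the text before the first "#"; if there is
	// no "#", it returns the whole string.
	base, _, _ := strings.Cut(rawURL, "#")
	// Stripping all trailing slashes is an assumption of this sketch.
	return strings.TrimRight(base, "/")
}
```

Normalizing both the fragment and the trailing slash means that `https://example.com/post/#comments` and `https://example.com/post` compare equal after processing.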
Types ¶
This section is empty.