Documentation
¶
Overview ¶
Package gnmatcher provides the main use case of the project: matching candidate name-strings to scientific names registered in a variety of biodiversity databases.
The goal of the project is to return matched canonical forms of scientific names at a rate of tens of thousands per second, making it feasible to process hundreds of millions or billions of name-string matching events.
The package is intended for long-running services, because it takes a few seconds to initialize its lookup data structures.
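Because initialization is the expensive step, a long-running service would normally build the matcher once and share it across requests. The following is a minimal sketch of that pattern using `sync.Once`. The `nameMatcher` interface, `echoMatcher`, `newMatcher`, and `matcher` are hypothetical stand-ins invented for this illustration so that it runs without a database; they are not part of the gnmatcher API.

```go
package main

import (
	"fmt"
	"sync"
)

// nameMatcher is a hypothetical stand-in for the GNMatcher interface,
// reduced to one method so the sketch compiles without a database.
type nameMatcher interface {
	MatchNames(names []string) []string
}

// echoMatcher is a placeholder implementation that returns its input.
type echoMatcher struct{}

func (echoMatcher) MatchNames(names []string) []string { return names }

var (
	once     sync.Once
	shared   nameMatcher
	initRuns int // counts how many times the expensive setup ran
)

// newMatcher stands in for the expensive construction step
// (loading lookup data into memory in a real service).
func newMatcher() nameMatcher {
	initRuns++
	return echoMatcher{}
}

// matcher returns the process-wide matcher, building it on first use.
func matcher() nameMatcher {
	once.Do(func() { shared = newMatcher() })
	return shared
}

func main() {
	for i := 0; i < 3; i++ {
		res := matcher().MatchNames([]string{"Pomatomus saltator"})
		fmt.Println(res[0])
	}
	fmt.Println("initializations:", initRuns)
}
```

In this sketch the setup function runs a single time no matter how many callers request the matcher, which is the property a service needs when each construction costs seconds.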
Example ¶
package main

import (
	"fmt"

	"github.com/gnames/gnmatcher"
	"github.com/gnames/gnmatcher/config"
	"github.com/gnames/gnmatcher/io/bloom"
	"github.com/gnames/gnmatcher/io/trie"
)

func main() {
	// Note that it takes several minutes to initialize lookup data structures.
	// Requirement for initialization: Postgresql database with loaded
	// http://opendata.globalnames.org/dumps/gnames-latest.sql.gz
	//
	// If data are imported already, it still takes several seconds to
	// load lookup data into memory.
	cfg := config.NewConfig()
	em := bloom.NewExactMatcher(cfg)
	fm := trie.NewFuzzyMatcher(cfg)
	gnm := gnmatcher.NewGNMatcher(em, fm)
	res := gnm.MatchNames([]string{"Pomatomus saltator", "Pardosa moesta"})
	for _, match := range res {
		fmt.Println(match.Name)
		fmt.Println(match.MatchType)
		for _, item := range match.MatchItems {
			fmt.Println(item.MatchStr)
			fmt.Println(item.EditDistance)
		}
	}
}
Constants ¶
const MaxNamesNumber = 10_000
MaxNamesNumber is the upper limit on the number of name-strings the MatchNames function can process. If the input is longer, the list of name-strings is truncated.
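Since input beyond the limit is truncated, a caller with a longer list would need to split it into batches before matching. The sketch below shows one way to do that. The `chunk` helper and the local `maxNamesNumber` constant are assumptions made for this illustration (the constant mirrors the documented value); neither is provided by the package.

```go
package main

import "fmt"

// maxNamesNumber mirrors the documented MaxNamesNumber constant.
// It is redefined here so the sketch runs on its own.
const maxNamesNumber = 10_000

// chunk splits names into consecutive batches of at most size elements.
// The batches share the backing array of the input slice.
func chunk(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		batches = append(batches, names[start:end])
	}
	return batches
}

func main() {
	names := make([]string, 25_000)
	for i := range names {
		names[i] = fmt.Sprintf("name %d", i)
	}
	for i, batch := range chunk(names, maxNamesNumber) {
		// In a real program each batch would be passed to MatchNames.
		fmt.Printf("batch %d: %d names\n", i, len(batch))
	}
}
```

With 25,000 input names this produces three batches of 10,000, 10,000, and 5,000 names, so no batch exceeds the limit.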
Variables ¶
var (
	// Version of the gnmatcher
	Version = "v0.3.6"
	// Build timestamp
	Build = "n/a"
)
Functions ¶
func NewGNMatcher ¶
func NewGNMatcher(em exact.ExactMatcher, fm fuzzy.FuzzyMatcher) gnmatcher
NewGNMatcher is a constructor for the GNMatcher interface.
Types ¶
type GNMatcher ¶
type GNMatcher interface {
	// MatchNames takes a slice of scientific name-strings and returns
	// matches to canonical forms of known scientific names. The following
	// matches are attempted:
	// - Exact string match for viruses
	// - Exact match of the name-string's canonical form
	// - Fuzzy match of the canonical form
	// - Partial match of the canonical form where the middle parts of the name
	//   or last elements of the name are removed.
	// - Partial fuzzy match of the canonical form.
	//
	// The resulting output provides canonical forms, but not the sources
	// where they are registered.
	//
	MatchNames(names []string) []*mlib.Match
	gn.Versioner
}
GNMatcher is a public API to the project functionality.
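The partial-match step described in the MatchNames comment removes middle or trailing elements of a canonical form. The sketch below illustrates one possible reading of that description; it is not the package's implementation. The `partialCandidates` function, its splitting on whitespace, and the order in which candidates are produced are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// partialCandidates is a hypothetical helper that returns shortened
// variants of a canonical form. It first drops the middle words
// (keeping the first and last), then drops trailing words one at a time.
// The ordering is an assumption, not documented package behavior.
func partialCandidates(canonical string) []string {
	words := strings.Fields(canonical)
	if len(words) < 2 {
		return nil // nothing to remove from a single-word name
	}
	var out []string
	if len(words) > 2 {
		// Middle parts removed: first word plus last word.
		out = append(out, words[0]+" "+words[len(words)-1])
	}
	// Last elements removed, from longest remaining form to shortest.
	for n := len(words) - 1; n >= 1; n-- {
		out = append(out, strings.Join(words[:n], " "))
	}
	return out
}

func main() {
	for _, c := range partialCandidates("Pardosa moesta alba") {
		fmt.Println(c)
	}
}
```

For the three-word input above, the sketch yields "Pardosa alba", "Pardosa moesta", and "Pardosa", each of which could then be tried as an exact or fuzzy lookup.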
Directories
¶
| Path | Synopsis |
|---|---|
| entity | |
| io | |
| io/bloom | package bloom creates and serves bloom filters for canonical names, and names of viruses. |
| | The purpose of this script is to find out how fast algorithms can go through a list of 100_000 names. |