LexicMap

module

v0.3.0 Latest Latest Go to latest Published: May 14, 2024 License: MIT

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/shenwei356/LexicMap

Links

Open Source Insights

README ¶

LexicMap: efficient sequence alignment against millions of prokaryotic genomes

LexicMap is a nucleotide sequence pseudo-alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.

Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale. Current tools either attempt full alignment but face challenges of high memory consumption and slow speeds, or they fall back on k-mer indexing, without information of where matches occur in the genome.

Results: In LexicMap, a modified version of the sequence sketching method LexicHash is adopted to compute alignment seeds. A multi-level index enables fast and low-memory variable-length seed matching and pseudo-alignment on a single server at the scale of millions of genomes (see algorithm overview), successfully indexing and searching both RefSeq+GenBank, and the AllTheBacteria datasets (2.3 and 1.9 million genomes respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search).

For example, querying a 52.8-kb plasmid in all 2,340,672 Genbank+Refseq prokaryotic genomes takes only 4 minutes and 8 seconds with 15 GB RAM and 48 CPUs, with 494,860 genome hits returned. In contrast, BLASTN is unable to run with the same dataset on common servers because it requires >2000 GB RAM. See performance.

LexicMap is easy to install (a binary file with no dependencies) and use (tutorials and usages).

More documents: http://bioinf.shenwei.me/LexicMap.

Quick start

Building an index (see the tutorial of building an index).

# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi

# From a file list with one file per line
lexicmap index -X files.txt -O db.lmi

Querying (see the tutorial of searching).

# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --min-match-pident 50 --min-qcov-per-genome 70 --min-qcov-per-hsp 70 --top-n-genomes 1000

# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --min-match-pident 50 --min-qcov-per-genome 0  --min-qcov-per-hsp 0  --top-n-genomes 0


# Extracting similar sequences for a query gene.
  # search matches with query cover >= 90%
  lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --all -o results.tsv \
      --min-qcov-per-hsp 90

  # extract matched sequences as FASTA format
  sed 1d results.tsv | awk '{print ">"$5":"$13"-"$14":"$15"\n"$19;}' > results.fasta

Sample output (queries are a few Nanopore Q20 reads). See output format details.

query                qlen   hits   sgenome           sseqid                 qcovGnm   hsp   qcovHSP   alenHSP   pident    qstart   qend   sstart    send      sstr   slen      seeds   species
------------------   ----   ----   ---------------   --------------------   -------   ---   -------   -------   -------   ------   ----   -------   -------   ----   -------   -----   ----------------------------------
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    1     96.231    766       97.389    31       796    1138938   1139706   +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    2     95.226    758       96.834    39       796    1280352   1282817   +      1887974   21      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    3     95.226    758       97.493    39       796    32646     33406     +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    4     95.226    758       97.493    39       796    134468    135228    -      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    5     95.226    758       97.361    39       796    1768935   1769695   +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    6     95.226    758       97.230    39       796    242012    242772    -      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    7     95.226    758       96.834    39       796    154380    155137    -      1887974   23      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_003344625.1   NZ_QPKJ02000188.1      81.910    1     81.910    652       99.540    66       717    1         653       -      826       21      Solirubrobacter sp. CPCC 204708
ERR5396170.1000006   796    4      GCF_009663775.1   NZ_RDBR01000008.1      91.332    1     91.332    727       86.657    39       796    21391     22151     -      52610     7       Lactobacillus sp. 0.1XD8-4
ERR5396170.1000006   796    4      GCF_001591685.1   NZ_BCVJ01000102.1      87.940    1     87.940    700       88.429    66       796    434       1165      +      1933      4       Ligilactobacillus murinus
ERR5396170.1000017   516    1      GCF_013394085.1   NZ_CP040910.1          94.574    1     94.574    488       100.000   27       514    293509    293996    +      1887974   11      Limosilactobacillus fermentum
ERR5396170.1000012   848    1      GCF_013394085.1   NZ_CP040910.1          95.165    1     95.165    807       93.804    22       828    190329    191136    -      1887974   21      Limosilactobacillus fermentum
ERR5396170.1000052   330    1      GCF_013394085.1   NZ_CP040910.1          90.000    1     90.000    297       100.000   27       323    1161955   1162251   -      1887974   3       Limosilactobacillus fermentum
ERR5396170.1000000   698    1      GCF_001457615.1   NZ_LN831024.1          85.673    1     85.673    598       97.157    53       650    4452083   4452685   +      6316979   10      Pseudomonas aeruginosa

Note: the column `species` is added by mapping genome ID (column `sgenome`) to taxonomic information.

Matched query and subject sequences can be outputted as extra two columns via the flag -a/--all.

Learn more tutorials and usages.

Performance

Indexing

dataset	genomes	gzip_size	tool	db_size	time	RAM
GTDB complete	402,538	578 GB	LexicMap	510 GB	3 h 26 m	35 GB
			Blastn	360 GB	3 h 11 m	718 MB
Genbank+RefSeq	2,340,672	3.5 TB	LexicMap	2.91 TB	16 h 40 m	79 GB
			Blastn	2.15 TB	14 h 4 m	4.3 GB
AllTheBacteria HQ	1,858,610	3.1 TB	LexicMap	2.32 TB	10 h 48 m	43 GB
			Blastn	1.76 TB	14 h 3 m	2.9 GB
			Phylign	248 GB	/	/

Searching

Blastn failed to run as it requires >2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. Phylign only has the index for AllTheBacteria HQ dataset.

GTDB complete (402,538 genomes):

query	query_len	tool	genome_hits	time	RAM
a marker gene	1,299 bp	LexicMap	3,588	1.4 s	598 MB
		Blastn	7,121	36 m 11 s	351 GB
a 16S rRNA gene	1,542 bp	LexicMap	294,285	1 m 19 s	3.0 GB
		Blastn	301,197	39 m 13 s	378 GB
a plasmid	52,830 bp	LexicMap	58,930	40 s	3.2 GB
		Blastn	69,311	37 m 42 s	365 GB

AllTheBacteria HQ (1,858,610 genomes):

query	query_len	tool	genome_hits	time	RAM
a marker gene	1,299 bp	LexicMap	10,837	11.3 s	1.1 GB
		Phylign_local	7,937	29 m 13 s	78.5 GB
		Phylign_cluster	7,937	32 m 26 s	/
a 16S rRNA gene	1,542 bp	LexicMap	1,853,846	12 m 31 s	9.7 GB
		Phylign_local	1,032,948	1 h 58 m	73.3 GB
		Phylign_cluster	1,032,948	83 m 36 s	/
a plasmid	52,830 bp	LexicMap	427,112	4 m 13 s	12.9 GB
		Phylign_local	46,822	38 m 55 s	76.6 GB
		Phylign_cluster	46,822	36 m 30s	/

Genbank+RefSeq (2,340,672 genomes):

query	query_len	tool	genome_hits	time	RAM
a marker gene	1,299 bp	LexicMap	16,556	6.2 s	1.3 GB
a 16S rRNA gene	1,542 bp	LexicMap	1,875,260	8 m 29 s	10.8 GB
a plasmid	52,830 bp	LexicMap	494,860	4 m 08 s	15.0 GB

Notes:

All files are stored on a server with HDD disks. No files are cached in memory.
Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz).
LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 10000 for GTDB datasets, -b 50000 for others.
Main searching parameters:
- LexicMap v0.3.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 50.
- Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000.
- Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, nb_best_hits: 5000000, max_ram_gb: 100; For cluster: maximum number of slurm jobs is 100.

Installation

LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.

Or install with conda:

conda install -c bioconda lexicmap

Algorithm overview

High-performance LexicHash computation in Go.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Directories ¶

Path	Synopsis
lexicmap
cmd
cmd/genome
cmd/kv
cmd/seedposition
cmd/tree
cmd/util

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL