LexicMap

module
v0.3.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 14, 2024 License: MIT

README

LexicMap: efficient sequence alignment against millions of prokaryotic genomes​

Latest Version Anaconda Cloud Cross-platform

LexicMap is a nucleotide sequence pseudo-alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.

Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale. Current tools either attempt full alignment but face challenges of high memory consumption and slow speeds, or they fall back on k-mer indexing, without information of where matches occur in the genome.

Results: In LexicMap, a modified version of the sequence sketching method LexicHash is adopted to compute alignment seeds. A multi-level index enables fast and low-memory variable-length seed matching and pseudo-alignment on a single server at the scale of millions of genomes (see algorithm overview), successfully indexing and searching both RefSeq+GenBank, and the AllTheBacteria datasets (2.3 and 1.9 million genomes respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search).

For example, querying a 52.8-kb plasmid in all 2,340,672 Genbank+Refseq prokaryotic genomes takes only 4 minutes and 8 seconds with 15 GB RAM and 48 CPUs, with 494,860 genome hits returned. In contrast, BLASTN is unable to run with the same dataset on common servers because it requires >2000 GB RAM. See performance.

LexicMap is easy to install (a binary file with no dependencies) and use (tutorials and usages).

More documents: http://bioinf.shenwei.me/LexicMap.

Quick start

Building an index (see the tutorial of building an index).

# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi

# From a file list with one file per line
lexicmap index -X files.txt -O db.lmi

Querying (see the tutorial of searching).

# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --min-match-pident 50 --min-qcov-per-genome 70 --min-qcov-per-hsp 70 --top-n-genomes 1000

# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --min-match-pident 50 --min-qcov-per-genome 0  --min-qcov-per-hsp 0  --top-n-genomes 0


# Extracting similar sequences for a query gene.
  # search matches with query cover >= 90%
  lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --all -o results.tsv \
      --min-qcov-per-hsp 90

  # extract matched sequences as FASTA format
  sed 1d results.tsv | awk '{print ">"$5":"$13"-"$14":"$15"\n"$19;}' > results.fasta

Sample output (queries are a few Nanopore Q20 reads). See output format details.

query                qlen   hits   sgenome           sseqid                 qcovGnm   hsp   qcovHSP   alenHSP   pident    qstart   qend   sstart    send      sstr   slen      seeds   species
------------------   ----   ----   ---------------   --------------------   -------   ---   -------   -------   -------   ------   ----   -------   -------   ----   -------   -----   ----------------------------------
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    1     96.231    766       97.389    31       796    1138938   1139706   +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    2     95.226    758       96.834    39       796    1280352   1282817   +      1887974   21      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    3     95.226    758       97.493    39       796    32646     33406     +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    4     95.226    758       97.493    39       796    134468    135228    -      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    5     95.226    758       97.361    39       796    1768935   1769695   +      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    6     95.226    758       97.230    39       796    242012    242772    -      1887974   24      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_013394085.1   NZ_CP040910.1          96.231    7     95.226    758       96.834    39       796    154380    155137    -      1887974   23      Limosilactobacillus fermentum
ERR5396170.1000006   796    4      GCF_003344625.1   NZ_QPKJ02000188.1      81.910    1     81.910    652       99.540    66       717    1         653       -      826       21      Solirubrobacter sp. CPCC 204708
ERR5396170.1000006   796    4      GCF_009663775.1   NZ_RDBR01000008.1      91.332    1     91.332    727       86.657    39       796    21391     22151     -      52610     7       Lactobacillus sp. 0.1XD8-4
ERR5396170.1000006   796    4      GCF_001591685.1   NZ_BCVJ01000102.1      87.940    1     87.940    700       88.429    66       796    434       1165      +      1933      4       Ligilactobacillus murinus
ERR5396170.1000017   516    1      GCF_013394085.1   NZ_CP040910.1          94.574    1     94.574    488       100.000   27       514    293509    293996    +      1887974   11      Limosilactobacillus fermentum
ERR5396170.1000012   848    1      GCF_013394085.1   NZ_CP040910.1          95.165    1     95.165    807       93.804    22       828    190329    191136    -      1887974   21      Limosilactobacillus fermentum
ERR5396170.1000052   330    1      GCF_013394085.1   NZ_CP040910.1          90.000    1     90.000    297       100.000   27       323    1161955   1162251   -      1887974   3       Limosilactobacillus fermentum
ERR5396170.1000000   698    1      GCF_001457615.1   NZ_LN831024.1          85.673    1     85.673    598       97.157    53       650    4452083   4452685   +      6316979   10      Pseudomonas aeruginosa

Note: the column `species` is added by mapping genome ID (column `sgenome`) to taxonomic information.

Matched query and subject sequences can be outputted as extra two columns via the flag -a/--all.

Learn more tutorials and usages.

Performance

Indexing

dataset genomes gzip_size tool db_size time RAM
GTDB complete 402,538 578 GB LexicMap 510 GB 3 h 26 m 35 GB
Blastn 360 GB 3 h 11 m 718 MB
Genbank+RefSeq 2,340,672 3.5 TB LexicMap 2.91 TB 16 h 40 m 79 GB
Blastn 2.15 TB 14 h 4 m 4.3 GB
AllTheBacteria HQ 1,858,610 3.1 TB LexicMap 2.32 TB 10 h 48 m 43 GB
Blastn 1.76 TB 14 h 3 m 2.9 GB
Phylign 248 GB / /

Searching

Blastn failed to run as it requires >2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. Phylign only has the index for AllTheBacteria HQ dataset.

GTDB complete (402,538 genomes):

query query_len tool genome_hits time RAM
a marker gene 1,299 bp LexicMap 3,588 1.4 s 598 MB
Blastn 7,121 36 m 11 s 351 GB
a 16S rRNA gene 1,542 bp LexicMap 294,285 1 m 19 s 3.0 GB
Blastn 301,197 39 m 13 s 378 GB
a plasmid 52,830 bp LexicMap 58,930 40 s 3.2 GB
Blastn 69,311 37 m 42 s 365 GB

AllTheBacteria HQ (1,858,610 genomes):

query query_len tool genome_hits time RAM
a marker gene 1,299 bp LexicMap 10,837 11.3 s 1.1 GB
Phylign_local 7,937 29 m 13 s 78.5 GB
Phylign_cluster 7,937 32 m 26 s /
a 16S rRNA gene 1,542 bp LexicMap 1,853,846 12 m 31 s 9.7 GB
Phylign_local 1,032,948 1 h 58 m 73.3 GB
Phylign_cluster 1,032,948 83 m 36 s /
a plasmid 52,830 bp LexicMap 427,112 4 m 13 s 12.9 GB
Phylign_local 46,822 38 m 55 s 76.6 GB
Phylign_cluster 46,822 36 m 30s /

Genbank+RefSeq (2,340,672 genomes):

query query_len tool genome_hits time RAM
a marker gene 1,299 bp LexicMap 16,556 6.2 s 1.3 GB
a 16S rRNA gene 1,542 bp LexicMap 1,875,260 8 m 29 s 10.8 GB
a plasmid 52,830 bp LexicMap 494,860 4 m 08 s 15.0 GB

Notes:

  • All files are stored on a server with HDD disks. No files are cached in memory.
  • Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz).
  • LexicMap index building parameters: -k 31 -m 40000. Genome batch size: -b 10000 for GTDB datasets, -b 50000 for others.
  • Main searching parameters:
    • LexicMap v0.3.0: --threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 50.
    • Blastn v2.15.0+: -num_threads 48 -max_target_seqs 10000000.
    • Phylign (AllTheBacteria fork 9fc65e6): threads: 48, cobs_kmer_thres: 0.33, nb_best_hits: 5000000, max_ram_gb: 100; For cluster: maximum number of slurm jobs is 100.

Installation

LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.

Or install with conda:

conda install -c bioconda lexicmap

Algorithm overview

LexicMap overview
  • High-performance LexicHash computation in Go.

Support

Please open an issue to report bugs, propose new functions or ask for help.

License

MIT License

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL