Documentation ¶
Overview ¶
Command bio-mark-duplicates marks or removes duplicates from .bam
files.
This tool is meant to replicate the behavior of Picard MarkDuplicates
in a more scalable and efficient way. The intention is to make
bio-mark-duplicates scale well on machines exceeding 16 cores.
Duplicate Marking Concepts:
At the conceptual level, this tool considers two reads A and B as
duplicates (isDuplicate(A, B)) if their:
1) reference
2) unclipped 5' position
3) read direction (orientation)
are ALL identical.
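As a sketch, the read-level check can be expressed with a small comparable key (readKey and its field names are illustrative, not the tool's actual types):

```go
package main

import "fmt"

// readKey is a hypothetical key capturing the three fields that
// determine read-level duplication: reference, unclipped 5' position,
// and orientation.
type readKey struct {
	Ref        string // reference (e.g. chromosome) name
	Unclipped5 int    // unclipped 5' position
	Reverse    bool   // read direction (orientation)
}

// isDuplicate reports whether two reads are duplicates: all three
// key fields must be identical.
func isDuplicate(a, b readKey) bool {
	return a == b
}

func main() {
	a := readKey{"chr1", 1020, false}
	b := readKey{"chr1", 1020, false}
	c := readKey{"chr1", 1021, false} // different unclipped 5' position
	fmt.Println(isDuplicate(a, b))    // true
	fmt.Println(isDuplicate(a, c))    // false
}
```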
Two pairs P1 and P2 are considered duplicates of each other if
isDuplicate(P1.leftRead, P2.leftRead) and isDuplicate(P1.rightRead,
P2.rightRead). Left vs right is determined by the unclipped 5'
position of each read in the pair.
Mapped pairs vs. Mapped-Unmapped pairs: For some read pairs, both
reads will be mapped (mapped pairs). For other read pairs, only one
of the reads will be mapped (mapped-unmapped pairs). A mapped pair
can be a duplicate of another mapped pair, but a mapped pair P1 may
NOT be a duplicate of a mapped-unmapped pair P2 because one read of
P2 will have no alignment position, and thus cannot be equal to one
of the mapped reads of P1.
However, the mapped read of a mapped-unmapped pair can be considered
a duplicate of one read on a mapped pair. So in this example, P2.left
could be a duplicate of P1.left. We call P2.left a "mate-unmapped read".
P1: left(chr1, 1020, F) right(chr1, 1040, R)
P2: left(chr1, 1020, F) right(chr1, 0, ?)
P1 is not a duplicate of P2, but P2.left is a duplicate of P1.left.
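The pair-level rule and the mate-unmapped case can be sketched as follows, using the P1/P2 example above (readKey, pair, and their fields are hypothetical illustration types, not the tool's actual representation):

```go
package main

import "fmt"

// readKey is a hypothetical per-read key; an unmapped read has
// Mapped=false and no meaningful position or orientation.
type readKey struct {
	Ref        string
	Unclipped5 int
	Reverse    bool
	Mapped     bool
}

type pair struct{ Left, Right readKey }

// isDuplicate requires both reads to be mapped with identical
// reference, unclipped 5' position, and orientation.
func isDuplicate(a, b readKey) bool {
	return a.Mapped && b.Mapped &&
		a.Ref == b.Ref && a.Unclipped5 == b.Unclipped5 && a.Reverse == b.Reverse
}

// isPairDuplicate requires both left reads and both right reads to match.
func isPairDuplicate(p1, p2 pair) bool {
	return isDuplicate(p1.Left, p2.Left) && isDuplicate(p1.Right, p2.Right)
}

func main() {
	p1 := pair{
		Left:  readKey{"chr1", 1020, false, true},
		Right: readKey{"chr1", 1040, true, true},
	}
	// P2's right read is unmapped, so the pair-level check fails.
	p2 := pair{
		Left:  readKey{"chr1", 1020, false, true},
		Right: readKey{Mapped: false},
	}
	fmt.Println(isPairDuplicate(p1, p2))       // false: P2.right is unmapped
	fmt.Println(isDuplicate(p1.Left, p2.Left)) // true: P2.left is a mate-unmapped duplicate
}
```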
After identifying the duplicates, this tool will select a primary
pair or read for each set of duplicates. The primary will be the
duplicate with the highest score based on the sum of its base
qualities. To break ties, a higher priority is given to reads that
appear earlier in the bam input.
In choosing a primary, pairs are given priority over mate-unmapped
reads. So if a mate-unmapped read is found to be a duplicate of one
read in a mapped pair, the pair would always be the primary, and the
mate-unmapped read would always be the duplicate. If two
mate-unmapped reads are duplicates of each other, but not duplicates
of a mapped pair, then the primary is chosen from the two
mate-unmapped reads.
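A simplified sketch of this selection logic follows. The package exports ChoosePrimary(entries []DuplicateEntry) int; the cand type and its fields here are illustrative stand-ins, scoring by summed base quality and breaking ties toward the earlier file index:

```go
package main

import "fmt"

// cand is a hypothetical stand-in for a duplicate-set member: its
// base-quality score and its position in the bam input.
type cand struct {
	BaseQScore int    // sum of base qualities
	FileIdx    uint64 // position in the input; earlier wins ties
}

// choosePrimary returns the index of the highest-scoring candidate,
// breaking ties in favor of the read that appears earlier in the input.
func choosePrimary(entries []cand) int {
	best := 0
	for i := 1; i < len(entries); i++ {
		e, b := entries[i], entries[best]
		if e.BaseQScore > b.BaseQScore ||
			(e.BaseQScore == b.BaseQScore && e.FileIdx < b.FileIdx) {
			best = i
		}
	}
	return best
}

func main() {
	// Index 1 and 2 tie on score; index 2 wins because FileIdx 5 < 20.
	fmt.Println(choosePrimary([]cand{{90, 10}, {95, 20}, {95, 5}})) // 2
}
```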
After identifying the primary and the duplicates, this tool can be
configured to mark each read with the duplicate flag 1024, or to
remove each of the duplicate reads.
Tagging:
If the caller specifies the "tag-duplicates" parameter, the tool
will attach auxiliary tags DI, DL, DS, and DT to the output.
DI is the duplicate index of a duplicate set. All pairs in a
duplicate set, including the primary, will have the same value for
DI; the value for DI is the file index of the left-most read of the
primary duplicate pair. DI is not set for mate-unmapped reads.
DL is the number of library (LB aka PCR) duplicate pairs in the
duplicate set, including the primary. This is equal to the DS value
minus the number of "SQ" duplicate pairs in the duplicate set.
DS is the number of pairs in the duplicate set. DS is not set on
mate-unmapped reads, and it also does not count mate-unmapped
duplicates.
DT is set on duplicate pairs (not the primary) and mate-unmapped
reads. It is set to "SQ" for optical duplicates, and "LB" for all
other duplicates.
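The relationship between the tags is simple arithmetic; a minimal illustration with hypothetical numbers:

```go
package main

import "fmt"

func main() {
	// A hypothetical duplicate set of 5 pairs (including the primary),
	// 2 of which were tagged DT:SQ (optical duplicates).
	ds := 5           // DS: total pairs in the duplicate set
	sqPairs := 2      // pairs tagged "SQ"
	dl := ds - sqPairs // DL: library (LB/PCR) duplicate pairs
	fmt.Println(dl)    // 3
}
```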
Implementation:
The implementation splits the input bam file into non-overlapping
shards, and processes each of those shards in separate workers, and
possibly in parallel with the other shards. Each worker is
responsible for taking the reads in one shard, and outputting the
same reads in the same order (except for marking or removing
duplicates).
Matching up pairs:
To determine which pairs are duplicates, a worker must read both
reads in a pair to determine each read's 5' position and
orientation. Each read record contains a pointer to its mate, but
it does not contain the mate's orientation or unclipped position,
which are stored in the mate's flags and cigar, respectively.
Matching up both reads in a pair is somewhat tricky. For most of
the reads that live in a particular shard, the read's mate will
appear within that shard, and a worker can easily match up those
pairs. However, some pairs will span from inside the shard to
outside the shard. Most of these read pairs will have their reads
relatively close together, so a worker will read an additional
pair-padding distance on either end of the shard for matching up
pairs, but even that is not enough. Sometimes, a pair spans from
inside the shard to beyond the pair-padding; we'll call these
"distant-pairs". To deal with distant-pairs, this tool pre-scans
the entire bam input, to identify and save the distant-pairs to
memory so that the workers can look up the distant-pair reads without
needing to seek, decompress, and read them from the bam file. Note
that the distant-pairs allow all pairs to be resolved, so the
pair-padding is a memory optimization to avoid storing mates that
live in the pair-padding in memory for the duration of the entire
run. Pair-padding should be chosen so that most mates will fall
into the pair-padding.
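The distant-pair pre-scan can be sketched as follows (rec, findDistantPairs, and the flat record representation are illustrative assumptions, not the tool's actual implementation, which reads bam records):

```go
package main

import "fmt"

// rec is a hypothetical, simplified record: a read name, its position,
// and its mate's position on the same reference.
type rec struct {
	Name    string
	Pos     int
	MatePos int
}

// findDistantPairs returns the records whose mate lies more than
// pairPadding away; these are the "distant-pairs" a pre-scan would keep
// in memory so workers can resolve them without seeking back into the
// bam file.
func findDistantPairs(recs []rec, pairPadding int) map[string]rec {
	distant := make(map[string]rec)
	for _, r := range recs {
		d := r.MatePos - r.Pos
		if d < 0 {
			d = -d
		}
		if d > pairPadding {
			distant[r.Name] = r
		}
	}
	return distant
}

func main() {
	recs := []rec{
		{"readA", 1000, 1200},   // mate nearby: resolved via pair-padding
		{"readB", 1000, 900000}, // mate far away: a distant-pair
	}
	for name := range findDistantPairs(recs, 5000) {
		fmt.Println(name) // readB
	}
}
```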
Matching up duplicates:
Duplicates are matched up using their 5' positions, and since a 5'
position can differ from the alignment's start position, each shard
needs some fuzziness at its boundaries to ensure all potential
duplicates will be compared against each other. For example:
          shard1                   shard2
  |---------------------|-------------------------|
                     |cccc|--------------|            read1 (with clipping)
                     5    S              E            5', Begin, End
                     |c|-----------------|            read2 (with clipping)
                     5 S                 E            5', Begin, End

              clip-pad             shard2            clip-pad
            |-----------|-------------------------|-----------|
In this example, read1 and read2 are duplicates according to their 5'
position, but the two reads will reside in different shards based on
their alignment Begin positions.
To handle these overlap cases, each shard has "clip-padding" on each
side of the shard. When searching for duplicates in each shard, the
worker considers each read that is either in the clip-padding or in
the shard. So in the example, read1 is in shard2, and read2 is in
shard2's clip-padding, but the shard2 worker still compares the two
reads with each other, decides they are duplicates, and marks one as
a duplicate based on their scores. Note that the worker for shard1
also compares read1 and read2 and also decides they are duplicates,
and marks one based on their scores. Since the scoring is
deterministic, shard1 and shard2 agree on which read to mark as
duplicate.
Clip-padding and pair-padding serve different purposes.
Clip-padding is for correctness and must exceed the largest clip
distance in the input file. Pair-padding is a memory optimization.
The complete shard diagram looks like this:
 shard-pad  clip-pad            shard1        clip-pad   shard-pad
|---------|-----------|-------------------------|-----------|---------|
Output ordering:
As the workers complete each shard, they output the shard's marked
reads to an output queue that preserves the original order of the
shards. A writer reads shards from the queue in order, and writes the
records to the bam file output. The output queue has a maximum size,
and when full only allows a worker to insert the shard that the writer
currently needs. This ensures that the queue does not grow too long
when a worker takes a long time to process a particular shard.
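The ordered hand-off from parallel workers to a single writer can be sketched with one channel per shard (this omits the bounded-queue behavior described above; processShard is a hypothetical stand-in for marking one shard):

```go
package main

import (
	"fmt"
	"sync"
)

// processShard is a hypothetical stand-in for duplicate-marking the
// reads of one shard.
func processShard(shard int) string {
	return fmt.Sprintf("marked-shard-%d", shard)
}

func main() {
	const numShards = 4
	// One buffered channel per shard: workers run in parallel, but the
	// writer drains the channels in shard order, so the output preserves
	// the original shard order regardless of completion order.
	out := make([]chan string, numShards)
	for i := range out {
		out[i] = make(chan string, 1)
	}
	var wg sync.WaitGroup
	for i := 0; i < numShards; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] <- processShard(i)
		}(i)
	}
	// Writer: consume shards strictly in order and "write" them.
	for i := 0; i < numShards; i++ {
		fmt.Println(<-out[i]) // marked-shard-0 .. marked-shard-3, in order
	}
	wg.Wait()
}
```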
Copyright 2019 Grail Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Index ¶
- Constants
- func ChoosePrimary(entries []DuplicateEntry) int
- func GetLibrary(readGroupLibrary map[string]string, record *sam.Record) string
- func NewAux(name string, val interface{}) sam.Aux
- func NewRecord(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewRecordAux(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewRecordSeq(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewTestOutput(dir string, index int, format string) string
- func ReadRecords(t *testing.T, path string) []*sam.Record
- func RunTestCases(t *testing.T, header *sam.Header, cases []TestCase)
- func SetupAndMark(ctx context.Context, provider bamprovider.Provider, opts *Opts) error
- type BagProcessor
- type BagProcessorFactory
- type DuplicateEntry
- type IndexedPair
- type IndexedSingle
- type IntermediateDuplicateSet
- type MarkDuplicates
- type Metrics
- type MetricsCollection
- type OpticalDetector
- type Opts
- type Orientation
- type PhysicalLocation
- type TestCase
- type TestRecord
- type TileOpticalDetector
Constants ¶
const (
	// IlluminaReadName5Fields is the number of columns in a 5 field read name.
	IlluminaReadName5Fields = 5
	// IlluminaReadName5FieldsTileField is the 0-based field number that
	// contains the tileName for 5 field read names.
	IlluminaReadName5FieldsTileField = 2
	// IlluminaReadName7Fields is the number of columns in a 7 field read name.
	IlluminaReadName7Fields = 7
	// IlluminaReadName7FieldsTileField is the 0-based field number that
	// contains the tileName for 7 field read names.
	IlluminaReadName7FieldsTileField = 4
	// IlluminaReadName8Fields is the number of columns in an 8 field read name.
	IlluminaReadName8Fields = 8
	// IlluminaReadName8FieldsTileField is the 0-based field number that
	// contains the tileName for 8 field read names.
	IlluminaReadName8FieldsTileField = 4
)
Variables ¶
This section is empty.
Functions ¶
func ChoosePrimary ¶
func ChoosePrimary(entries []DuplicateEntry) int
func GetLibrary ¶
GetLibrary returns the library for the given record's read group. If the library is not defined in readGroupLibrary, returns "Unknown Library".
func NewRecordAux ¶
func NewRecordSeq ¶
func NewTestOutput ¶
NewTestOutput returns a different output filename for each output format.
func ReadRecords ¶
ReadRecords reads the records from path and returns them as a slice, in order.
func SetupAndMark ¶
SetupAndMark performs minimal setup, validating opts and creating the provider, and then runs mark().
Types ¶
type BagProcessor ¶
type BagProcessor func([]*IntermediateDuplicateSet) []*IntermediateDuplicateSet
BagProcessor takes the set of bags from a particular shard, and returns the same set of reads, but the reads may now be bagged differently. The new set of bags may contain more bags or fewer bags than the original set of bags.
type BagProcessorFactory ¶
type BagProcessorFactory interface {
Create() BagProcessor
}
BagProcessorFactory creates BagProcessors. Mark-duplicates creates one BagProcessor per goroutine.
type DuplicateEntry ¶
type IndexedPair ¶
type IndexedPair struct {
Left IndexedSingle
Right IndexedSingle
}
func (IndexedPair) BaseQScore ¶
func (p IndexedPair) BaseQScore() int
func (IndexedPair) FileIdx ¶
func (p IndexedPair) FileIdx() uint64
func (IndexedPair) Name ¶
func (p IndexedPair) Name() string
type IndexedSingle ¶
func (IndexedSingle) BaseQScore ¶
func (s IndexedSingle) BaseQScore() int
func (IndexedSingle) FileIdx ¶
func (s IndexedSingle) FileIdx() uint64
func (IndexedSingle) Name ¶
func (s IndexedSingle) Name() string
type IntermediateDuplicateSet ¶
type IntermediateDuplicateSet struct {
Pairs []DuplicateEntry
Singles []DuplicateEntry
Corrected map[string]string // Maps read name to corrected UMI pair: "GAC+GAG"
}
type MarkDuplicates ¶
type MarkDuplicates struct {
Provider bamprovider.Provider
Opts *Opts
// contains filtered or unexported fields
}
MarkDuplicates implements duplicate marking.
func (*MarkDuplicates) Mark ¶
func (m *MarkDuplicates) Mark(shards []bam.Shard) (*MetricsCollection, error)
Mark marks the duplicates, and returns metrics, and an error if encountered.
type Metrics ¶
type Metrics struct {
// UnpairedReads is the number of mapped reads examined which did
// not have a mapped mate pair, either because the read is
// unpaired, or the read is paired to an unmapped mate.
UnpairedReads int
// ReadPairsExamined is the number of mapped read pairs
// examined. (Primary, non-supplemental).
ReadPairsExamined int
// SecondarySupplementary is the number of reads that were either
// secondary or supplementary.
SecondarySupplementary int
// UnmappedReads is the total number of unmapped reads
// examined. (Primary, non-supplemental).
UnmappedReads int
// UnpairedDups is the number of fragments that were marked as duplicates.
UnpairedDups int
// ReadPairDups is the number of read pairs that were marked as duplicates.
ReadPairDups int
// ReadPairOpticalDups is the number of read pairs duplicates that
// were caused by optical duplication. Value is always <
// READ_PAIR_DUPLICATES, which counts all duplicates regardless of
// source.
ReadPairOpticalDups int
}
Metrics contains metrics from mark duplicates.
type MetricsCollection ¶
type MetricsCollection struct {
// OpticalDistance stores the number of duplicate read pairs that
// have the given Euclidean distance.
OpticalDistance [][]int64
// LibraryMetrics contains per-library metrics.
LibraryMetrics map[string]*Metrics
// High coverage intervals and read counts.
HighCoverageIntervals []coverageInterval
// contains filtered or unexported fields
}
MetricsCollection contains metrics computed by Mark.
func (*MetricsCollection) AddDistance ¶
func (mc *MetricsCollection) AddDistance(bagSize, distance int)
AddDistance increments the histogram counter for the given bag size and distance.
func (*MetricsCollection) AddHighCovInterval ¶
func (mc *MetricsCollection) AddHighCovInterval(interval coverageInterval)
func (*MetricsCollection) Get ¶
func (mc *MetricsCollection) Get(library string) *Metrics
Get returns Metrics for the given library. If there is no Metrics for library yet, create one and return it.
func (*MetricsCollection) Merge ¶
func (mc *MetricsCollection) Merge(other *MetricsCollection)
Merge merges per-library and optical distance metrics from other into mc.
type OpticalDetector ¶
type OpticalDetector interface {
// GetRecordProcessor returns a RecordProcessor that sees every
// record in the bam input before any calls to Detect happen. The
// OpticalDetector can use this to calculate statistics that
// influence optical detection.
GetRecordProcessor() bampair.RecordProcessor
// RecordProcessorsDone returns the maximum X and Y coordinates. It
// should be called after the RecordProcessors have seen all the
// input records.
RecordProcessorsDone() (int, int)
// Detect identifies the optical duplicates in pairs and returns
// their names in a slice. readGroupLibrary maps readGroup to
// library name. pairs contains all the readpairs in the bag, and
// bestIndex is an index into pairs that points to the bag's
// primary readpair.
Detect(readGroupLibrary map[string]string, pairs []DuplicateEntry, bestIndex int) []string
}
OpticalDetector is a general interface for optical duplicate detection.
type Opts ¶
type Opts struct {
// Commandline options.
BamFile string
IndexFile string
MetricsFile string
HighCoverageIntervalFile string
TileSizeFile string
Format string
CoverageMax int
ShardSize int
MinBases int
Padding int
DiskMateShards int
ScratchDir string
Parallelism int
QueueLength int
ClearExisting bool
RemoveDups bool
TagDups bool
IntDI bool
UseUmis bool
UmiFile string
ScavengeUmis int
EmitUnmodifiedFields bool
SeparateSingletons bool
OutputPath string
StrandSpecific bool
OpticalHistogram string
OpticalHistogramMax int
Seed int64
// Data and operators derived from commandline options.
BagProcessorFactories []BagProcessorFactory
OpticalDetector OpticalDetector
KnownUmis []byte
}
Opts for mark-duplicates.
type Orientation ¶
type Orientation uint8
func GetR1R2Orientation ¶
func GetR1R2Orientation(p *IndexedPair) Orientation
GetR1R2Orientation returns an orientation byte containing orientations for both R1 and R2.
type PhysicalLocation ¶
type PhysicalLocation struct {
Lane int
Surface int
Swath int
Section int
TileNumber int
TileName int
X int
Y int
}
PhysicalLocation describes a read's physical location on the flow cell. Lane, Surface, Swath, Section, and TileNumber together specify which flowcell tile the read was found in. TileName is the 4 or 5 digit representation of the tile, e.g. 1203 means surface 1, swath 2 and tile 3. 12304 means surface 1, swath 2, section 3, and tile 4. X and Y describe the X and Y coordinates of the well within the tile.
func ParseLocation ¶
func ParseLocation(qname string) PhysicalLocation
ParseLocation returns a physical location given an Illumina style read name. The read name must have 5, 7, or 8 fields separated by ':'. When there are 5 or 7 fields, the last three fields are tileName, X and Y. When there are 8 fields, the last four fields are tileName, X, Y, and UMI.
The tileName must be formatted as a 4 or 5 digit Illumina tileName. For a description of 4 digit tile numbers, see Appendix B, section Tile Numbering in
http://support.illumina.com.cn/content/dam/illumina-support/documents/documentation/system_documentation/hiseqx/hiseq-x-system-guide-15050091-e.pdf
For a description of 5 digit tile numbers, see Appendix C, section Tile Numbering in
https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/nextseq/nextseq-550-system-guide-15069765-05.pdf
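A simplified sketch of the field logic ParseLocation implies (parseTileXY is a hypothetical helper that extracts only tileName, X, and Y; it does not decompose tileName into surface/swath/tile the way PhysicalLocation does):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTileXY returns the tileName, X, and Y fields of a 5-, 7-, or
// 8-field Illumina read name. For 5-field names the tile is field 2
// (0-based); for 7- and 8-field names it is field 4, with X and Y in
// the next two fields (an 8th field, the UMI, is ignored).
func parseTileXY(qname string) (tile, x, y int, err error) {
	fields := strings.Split(qname, ":")
	var ti int
	switch len(fields) {
	case 5:
		ti = 2
	case 7, 8:
		ti = 4
	default:
		return 0, 0, 0, fmt.Errorf("unsupported field count %d", len(fields))
	}
	tile, _ = strconv.Atoi(fields[ti])
	x, _ = strconv.Atoi(fields[ti+1])
	y, _ = strconv.Atoi(fields[ti+2])
	return tile, x, y, nil
}

func main() {
	// A hypothetical 7-field read name:
	// instrument:run:flowcell:lane:tile:x:y
	tile, x, y, _ := parseTileXY("M00100:47:000-A1:1:1203:10831:2049")
	fmt.Println(tile, x, y) // 1203 10831 2049
}
```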
type TestCase ¶
type TestCase struct {
TRecords []TestRecord
Opts Opts
}
type TestRecord ¶
type TileOpticalDetector ¶
type TileOpticalDetector struct {
OpticalDistance int
}
TileOpticalDetector detects optical duplicates within a tile. For two reads to be optical duplicates, their tile, lane, surface, library, and read orientations must be identical.
func (*TileOpticalDetector) Detect ¶
func (t *TileOpticalDetector) Detect(readGroupLibrary map[string]string, duplicates []DuplicateEntry, bestIndex int) []string
Detect implements OpticalDetector.
func (*TileOpticalDetector) GetRecordProcessor ¶
func (t *TileOpticalDetector) GetRecordProcessor() bampair.RecordProcessor
GetRecordProcessor implements OpticalDetector.
func (*TileOpticalDetector) RecordProcessorsDone ¶
func (t *TileOpticalDetector) RecordProcessorsDone() (int, int)
RecordProcessorsDone implements OpticalDetector.