[DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Keiran Raine kr2 at sanger.ac.uk
Wed Apr 12 10:27:50 EDT 2017


Hi Miguel,

When the original mapping was performed, the input files came from many different sources, and the read order (as noted) would have been different.  The classes I'm aware of are:


·         Lane BAM generated from raw sequenced FASTQ (i.e. sequencing ordered)

·         Lane BAM generated from BWA-aln mapped data (BWA-aln produces different mappings from BWA-mem)

·         Lane BAM generated from mappings to a different reference

The same data going into the process for each of these would result in a different read order on entry into the PCAWG mapping flow.

BWA internally splits the data into blocks and, within each block, estimates the insert size distribution used to decide whether reads are properly paired.  If the data has previously been through mapping, all of the well-mapped data clusters together, and the unmapped and aberrant pairs are no longer distributed throughout the input, which changes the estimated insert size distribution.
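To illustrate the effect, here is a toy sketch (not BWA's actual estimator, which is more selective about which pairs it uses). The function name `block_insert_stats`, the block size and all the numbers are made up for illustration. It cuts a list of insert sizes into consecutive blocks and reports the mean per block, once with aberrant pairs spread throughout and once with them clustered at the end.

```python
import random
import statistics

def block_insert_stats(insert_sizes, block_size):
    """Return (mean, stdev) of insert size for each consecutive block."""
    stats = []
    for start in range(0, len(insert_sizes), block_size):
        block = insert_sizes[start:start + block_size]
        if len(block) > 1:
            stats.append((statistics.mean(block), statistics.stdev(block)))
    return stats

rng = random.Random(0)
proper = [rng.gauss(300, 30) for _ in range(900)]      # well-mapped pairs (made up)
aberrant = [rng.gauss(3000, 500) for _ in range(100)]  # aberrant pairs (made up)

# Sequencing order: aberrant pairs are spread throughout the input.
sequencing_order = proper + aberrant
rng.shuffle(sequencing_order)

# Previously mapped input: aberrant pairs are clustered together at the end.
previously_mapped = proper + aberrant

shuffled_means = [m for m, _ in block_insert_stats(sequencing_order, 100)]
clustered_means = [m for m, _ in block_insert_stats(previously_mapped, 100)]

print("sequencing order :", [round(m) for m in shuffled_means])
print("previously mapped:", [round(m) for m in clustered_means])
```

In the clustered case most blocks see only well-behaved pairs and one block sees only aberrant ones, so the per-block estimates no longer resemble each other.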

BWA is additionally affected by the number of threads in use if an additional (hidden) parameter is not set to make the blocks of reads consistent.  The option may not be in the version used in PanCancer.  We specified a set thread count to prevent the problem, but this option allows the thread count to be changed:

* -K 10e8 :: hidden (yay!) option that eliminates randomness in chunking when using threads so that results can be deterministic.
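A toy sketch of why the thread count can matter. It assumes that, without a fixed value, the number of bases per block scales with the number of threads; the helper names (`bases_per_batch`, `batch_boundaries`) and all sizes are hypothetical and much smaller than anything realistic.

```python
def bases_per_batch(threads, per_thread=10_000, fixed=None):
    """Batch size in bases: a fixed value if given, else scaled by thread count.

    The scaling rule is an assumption made for this illustration.
    """
    return fixed if fixed is not None else per_thread * threads

def batch_boundaries(read_lengths, batch_bases):
    """Return the read indices at which each batch starts."""
    starts, bases = [0], 0
    for i, length in enumerate(read_lengths):
        bases += length
        if bases >= batch_bases and i + 1 < len(read_lengths):
            starts.append(i + 1)
            bases = 0
    return starts

reads = [100] * 1000  # 1000 reads of 100 bp (made-up input)

# Batch size tied to thread count: block boundaries move with the thread count.
two_threads = batch_boundaries(reads, bases_per_batch(2))
four_threads = batch_boundaries(reads, bases_per_batch(4))

# Fixed batch size: block boundaries are the same whatever the thread count.
fixed_two = batch_boundaries(reads, bases_per_batch(2, fixed=40_000))
fixed_four = batch_boundaries(reads, bases_per_batch(4, fixed=40_000))

print(two_threads, four_threads)
print(fixed_two, fixed_four)
```

Because per-block estimates depend on which reads share a block, moving the boundaries can change results even when the input order is identical.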

If you can independently take the same source BAM on two different setups, split and remap it, and the results match, then you have proved reproducibility for the same input (this was done at the beginning of the project, so it should still be true).  FYI, when comparing within our group we don't consider reads with MAPQ=0.
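A minimal sketch of such a comparison, using plain dictionaries rather than real BAM records. The function `compare_mappings` and the record layout are hypothetical, and skipping a read when MAPQ=0 in either run is an assumption about how the exclusion is applied.

```python
def compare_mappings(run_a, run_b):
    """Count matching and mismatching placements between two remapping runs.

    Each run maps (read_name, mate_number) -> (chrom, pos, mapq).
    Reads with MAPQ=0 in either run are skipped (an assumption).
    """
    counts = {"match": 0, "mismatch": 0, "skipped": 0}
    for key in run_a.keys() & run_b.keys():
        chrom_a, pos_a, mapq_a = run_a[key]
        chrom_b, pos_b, mapq_b = run_b[key]
        if mapq_a == 0 or mapq_b == 0:
            counts["skipped"] += 1
        elif (chrom_a, pos_a) == (chrom_b, pos_b):
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts

# Made-up records: rb is a MAPQ=0 read placed differently in each run.
run_a = {("ra", 1): ("1", 1000, 60), ("rb", 1): ("1", 2000, 0), ("rc", 1): ("2", 500, 37)}
run_b = {("ra", 1): ("1", 1000, 60), ("rb", 1): ("5", 9000, 0), ("rc", 1): ("2", 650, 37)}

result = compare_mappings(run_a, run_b)
print(result)
```

Reads with MAPQ=0 have no confidently determined placement, so counting them as mismatches would inflate the apparent difference between runs.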

A final item that may affect the read matching (depending on how your matching works) is that, when merging the remapped data, reads mapped to the same location are inserted in the order in which their files are presented to the code.  For example, take 3 reads mapped to the same start location from 3 different lanes:

f1=ra @ 1:1000
f2=rb @ 1:1000
f3=rc @ 1:1000
# merge/merging markdup or the like:
bammerge I=f1 I=f2 I=f3
# read order
ra, rb, rc
bammerge I=f3 I=f1 I=f2
# read order
rc, ra, rb
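The same behaviour can be mimicked with a stable sort. This is only a sketch of the tie-breaking idea, not the merge tool's implementation; `merge_by_position` and the tuple layout are hypothetical.

```python
from itertools import chain

def merge_by_position(*files):
    """Merge reads from several inputs, ordered by (chrom, pos).

    sorted() is stable, so reads with equal positions keep the order
    in which their files were passed in.
    """
    merged = sorted(chain(*files), key=lambda read: (read[1], read[2]))
    return [name for name, _chrom, _pos in merged]

# One read per lane, all mapped to the same start location.
f1 = [("ra", "1", 1000)]
f2 = [("rb", "1", 1000)]
f3 = [("rc", "1", 1000)]

order_123 = merge_by_position(f1, f2, f3)
order_312 = merge_by_position(f3, f1, f2)

print(order_123)
print(order_312)
```

A comparison that matches reads by their position in the file rather than by read name would report differences here even though every read is mapped identically.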

I hope this helps/clarifies things,

Regards,

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk
Tel:+44 (0)1223 834244 Ext: 4983
Office: H104

From: <mikisvaz at gmail.com> on behalf of Miguel Vazquez <miguel.vazquez at cnio.es>
Date: Wednesday, 12 April 2017 at 14:11
To: Keiran Raine <kr2 at sanger.ac.uk>
Cc: Christina Yung <Christina.Yung at oicr.on.ca>, Lincoln Stein <lincoln.stein at gmail.com>, Francis Ouellette <francis at oicr.on.ca>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>, George Mihaiescu <George.Mihaiescu at oicr.on.ca>
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Hi Keiran,

I don't quite follow all the details, but I understand that you think that, in working back from the aligned BAMs, we should make sure we split the BAM files. In our tests, splitting the BAM when working back from the aligned BAM did not seem to improve the match rate of aligned reads; however, if I understood you correctly, downstream algorithms like SNV and indel callers could still be affected by not splitting the BAM. If so, this is an interesting observation, though I don't think our testing will cover running these methods over the re-aligned BAM files, so we would not run into this scenario.

Finally, Keiran, what is your opinion on the discussion about working back from the aligned BAMs? Could you summarize for us again what you think is the reason for the 3% mismatched reads when using the rolled-back split BAMs versus only 0.01% when using the original unaligned BAMs, and whether there is any possibility of addressing this?

Thanks for your input

Miguel





On Tue, Apr 11, 2017 at 10:37 PM, Keiran Raine <kr2 at sanger.ac.uk> wrote:
Hi,

Please be aware that failing to split the BAMs by readgroup for remapping, so that lanes/readgroups are tagged appropriately, has implications for analysis algorithms.

I'm unsure how you would remap without splitting by readgroup when libraries can be different between readgroups (which would result in a loss of metadata).

For example, the CaVEMan (SNV) caller uses the readgroup as a co-variate to ensure that lane to lane artefacts are modelled correctly.  In both cgpPindel (indel) and BRASS (SV) the insert size for the individual readgroups needs to be correct, this is skewed if data is merged during mapping.
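A small numeric sketch of the insert size point, with made-up values for two readgroups from different libraries (the readgroup names and numbers are purely illustrative):

```python
import statistics

# Made-up insert sizes for two readgroups from different libraries.
inserts_by_readgroup = {
    "lane1": [290, 300, 310, 295, 305],
    "lane2": [490, 500, 510, 495, 505],
}

# Per-readgroup statistics, as obtained when lanes are kept separate.
per_group_mean = {rg: statistics.mean(v) for rg, v in inserts_by_readgroup.items()}
per_group_sd = {rg: statistics.stdev(v) for rg, v in inserts_by_readgroup.items()}

# Statistics when the readgroups are pooled before estimation.
pooled = [x for v in inserts_by_readgroup.values() for x in v]
merged_mean = statistics.mean(pooled)
merged_sd = statistics.stdev(pooled)

print(per_group_mean, per_group_sd)
print(merged_mean, merged_sd)
```

The pooled mean sits between the two libraries and describes neither, and the pooled spread is far wider than either library's own, which is the kind of skew that matters to a caller relying on insert size.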

An example from cgpPindel in our internal test process showed that starting from the exact same read order in the individual lane/readgroup BAMs, but merging the files in a different order, could result in minor changes to indel calls (items that failed filtering).  We found this was due to reads from the same location being presented to the core algorithm in a different order.

Hope this is useful.

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk
Tel:+44 (0)1223 834244 Ext: 4983
Office: H104

From: <mikisvaz at gmail.com> on behalf of Miguel Vazquez <miguel.vazquez at cnio.es>
Date: Tuesday, 11 April 2017 at 17:31
To: Christina Yung <Christina.Yung at oicr.on.ca>
Cc: Lincoln Stein <lincoln.stein at gmail.com>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>, Francis Ouellette <francis at oicr.on.ca>, Keiran Raine <kr2 at sanger.ac.uk>, George Mihaiescu <George.Mihaiescu at oicr.on.ca>
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Hi Christina,

There are two issues:

1- Splitting the BAM files and running them in the right order. Can do
2- That the order of the reads *inside* a BAM is the same. Can not fix

So if we would like an inquisitive user to be able to reproduce the alignment process from the available aligned BAM, we need to tell them that the *reads* are not in the same order and that about 3% of the reads will be aligned differently.

Compared to the problem of the read order, the problem with the BAM splitting and ordering is negligible. In fact, I believe splitting the BAM did nothing at all to our numbers, so I think we might as well not bother, but there are people better suited than me to make this call.

In short, we can claim:

1) that the process is 99.99% reproducible using the original unaligned BAM files
2) that working back from the aligned BAM one is able to reproduce the results with 96% accuracy, the loss of accuracy apparently being due to different read ordering.

The process to reproduce in 2) could be the simple one (just unalign the BAM) or the more elaborate one that involves splitting the BAM and feeding it in the right order, which does not seem to improve anything.



-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
