[DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Keiran Raine kr2 at sanger.ac.uk
Tue Apr 11 16:37:48 EDT 2017


Hi,

Please be aware that failing to split the BAMs by readgroup for remapping, so that lanes/readgroups are tagged appropriately, has implications for analysis algorithms.

I'm unsure how you would remap without splitting by readgroup when libraries can be different between readgroups (which would result in a loss of metadata).
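To make the splitting step concrete, here is a minimal sketch that groups SAM-format alignment lines by their `RG:Z:` tag using only the Python standard library. The function name and the sample records are hypothetical, and a real pipeline would normally use a dedicated BAM tool instead of parsing text; this only illustrates the idea of keeping each readgroup separate before remapping.

```python
from collections import defaultdict


def split_by_read_group(sam_lines):
    """Group SAM alignment lines by RG:Z tag (hypothetical helper).

    Header lines (starting with '@') are skipped. Reads without an RG tag
    are collected under the key None.
    """
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at the 12th SAM column (index 11).
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        groups[rg].append(line)
    return dict(groups)


# Made-up records for illustration only.
example = [
    "@RG\tID:lane1\tLB:libA",
    "@RG\tID:lane2\tLB:libB",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*\tRG:Z:lane1",
    "r2\t99\tchr1\t120\t60\t50M\t=\t500\t430\t*\t*\tRG:Z:lane2",
    "r3\t99\tchr1\t140\t60\t50M\t=\t340\t250\t*\t*\tRG:Z:lane1",
]
for read_group, lines in split_by_read_group(example).items():
    print(read_group, len(lines))
```

Each group could then be written out and remapped on its own, so that the `@RG` metadata (including the library) stays attached to the right reads.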

For example, the CaVEMan (SNV) caller uses the readgroup as a covariate to ensure that lane-to-lane artefacts are modelled correctly.  In both cgpPindel (indel) and BRASS (SV) the insert size for the individual readgroups needs to be correct; it is skewed if data is merged during mapping.
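The insert-size point can be illustrated with a small sketch. It takes the TLEN column of SAM lines as the insert size and compares a per-readgroup median with a pooled median. The function name and the numbers are made up, and this is not how cgpPindel or BRASS estimate insert size; it only shows why a single pooled estimate can describe neither library well.

```python
from collections import defaultdict
from statistics import median


def insert_size_medians(sam_lines):
    """Return (per-readgroup medians, pooled median) of positive TLEN values."""
    sizes = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        tlen = int(fields[8])  # 9th SAM column: observed template length
        if tlen <= 0:
            continue  # count each pair once, via the mate with positive TLEN
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        sizes[rg].append(tlen)
    per_group = {rg: median(values) for rg, values in sizes.items()}
    pooled = median([t for values in sizes.values() for t in values])
    return per_group, pooled


def make_read(name, tlen, rg):
    """Build a made-up SAM line; only TLEN and RG matter for this sketch."""
    return f"{name}\t99\tchr1\t100\t60\t50M\t=\t200\t{tlen}\t*\t*\tRG:Z:{rg}"


reads = [make_read(f"a{i}", t, "lane1") for i, t in enumerate((200, 210, 220))]
reads += [make_read(f"b{i}", t, "lane2") for i, t in enumerate((400, 410, 420))]
per_group, pooled = insert_size_medians(reads)
print(per_group, pooled)
```

With these invented values the two readgroups have medians of 210 and 410, while the pooled median falls between them and matches neither.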

An example from cgpPindel in our internal test process showed that starting from the exact same read order in the individual lane/readgroup BAMs, but merging the files in a different order, could result in minor changes to indel calls (items that failed filtering).  We found this was due to reads from the same location being presented to the core algorithm in a different order.
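A toy model of this effect, assuming the merge behaves like a stable coordinate sort over the concatenated lanes (an assumption for illustration, not a statement about any particular merge tool): reads that share a position keep the relative order in which the lanes were supplied, so swapping the lanes swaps the ties.

```python
def merge_and_sort(*lanes):
    """Concatenate lanes in the given order, then sort by (chrom, pos).

    Python's sorted() is stable, so reads with identical coordinates keep
    the order in which their lanes were passed in.
    """
    merged = [read for lane in lanes for read in lane]
    return sorted(merged, key=lambda read: (read[1], read[2]))


# Made-up reads as (name, chrom, pos); a1 and b1 share a position.
lane_a = [("a1", "chr1", 100), ("a2", "chr1", 250)]
lane_b = [("b1", "chr1", 100), ("b2", "chr1", 300)]

order_ab = [read[0] for read in merge_and_sort(lane_a, lane_b)]
order_ba = [read[0] for read in merge_and_sort(lane_b, lane_a)]
print(order_ab)
print(order_ba)
```

Both merges contain the same reads, but the two reads at position 100 come out in opposite orders, which is the kind of difference an order-sensitive algorithm would see.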

Hope this is useful.

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk
Tel:+44 (0)1223 834244 Ext: 4983
Office: H104

From: <mikisvaz at gmail.com> on behalf of Miguel Vazquez <miguel.vazquez at cnio.es>
Date: Tuesday, 11 April 2017 at 17:31
To: Christina Yung <Christina.Yung at oicr.on.ca>
Cc: Lincoln Stein <lincoln.stein at gmail.com>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>, Francis Ouellette <francis at oicr.on.ca>, Keiran Raine <kr2 at sanger.ac.uk>, George Mihaiescu <George.Mihaiescu at oicr.on.ca>
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Hi Christina,

There are two issues:

1- Splitting the BAM files and running them in the right order. Can do
2- That the order of the reads *inside* a BAM is the same. Cannot fix

So if we would like the inquisitive user to reproduce the alignment process from the available aligned BAM, we need to tell them that the *reads* are not in the same order and that about 3% of the reads will be aligned differently.

Compared to the problem of the read order, the problem with the BAM splitting and ordering is negligible. In fact, I believe splitting the BAM did nothing at all to our numbers, so I think we might as well not even bother, but there are people better suited than me to make this call.

In short, we can claim:

1) that the process is reproducible to 99.99 percent using the original unaligned BAM files
2) that working back from the aligned BAM one is able to reproduce the results to 96% accuracy, the loss of accuracy apparently being due to different read ordering.

The process to reproduce in 2) could be the simple one, just unaligning the BAM, or the more elaborate one that involves splitting the BAM and feeding it in the right order, which does not seem to improve anything.
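For anyone wanting to check such a claim themselves, one possible way to compare an original alignment with a remapped one is sketched below. It is not the validation used for the numbers in this thread: the function names are hypothetical, and "same alignment" is assumed here to mean identical reference name, position and CIGAR for the primary record of each read.

```python
def primary_placements(sam_lines):
    """Map (read name, mate number) to (rname, pos, cigar) for primary records."""
    placements = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        placements[(fields[0], mate)] = (fields[2], int(fields[3]), fields[5])
    return placements


def concordance(original_lines, remapped_lines):
    """Fraction of shared reads whose placement is identical in both inputs."""
    original = primary_placements(original_lines)
    remapped = primary_placements(remapped_lines)
    shared = original.keys() & remapped.keys()
    if not shared:
        return 0.0
    identical = sum(original[key] == remapped[key] for key in shared)
    return identical / len(shared)


def make_line(name, flag, pos, cigar):
    """Build a made-up SAM line on chr1 for the example."""
    return f"{name}\t{flag}\tchr1\t{pos}\t60\t{cigar}\t=\t0\t0\t*\t*"


original = [make_line("r1", 99, 100, "50M"), make_line("r1", 147, 300, "50M"),
            make_line("r2", 99, 500, "50M"), make_line("r2", 147, 700, "50M")]
remapped = [make_line("r2", 147, 700, "50M"), make_line("r1", 99, 100, "50M"),
            make_line("r2", 99, 500, "45M5S"), make_line("r1", 147, 300, "50M")]
print(concordance(original, remapped))
```

Because reads are matched by name and mate number, the comparison does not depend on the order of records in either file.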





