[DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% mismatches, and 3.7% soft-matches, tumor) 0.043% mismatches, and 4.64% soft-matches

Miguel Vazquez miguel.vazquez at cnio.es
Wed Apr 12 09:11:54 EDT 2017


Hi Keiran,

I don't quite follow all the details, but I understand that you think that,
when working back from the aligned BAMs, we should make sure we split the
BAM files by readgroup. I think that in our tests, splitting the BAM when
working back from the aligned BAM did not seem to improve the match rate of
aligned reads. However, if I understood you correctly, downstream algorithms
like SNV and indel callers could still be affected by not splitting the BAM.
If so, this is an interesting observation, though I don't think our testing
will cover running these methods over the re-aligned BAM files, so we would
not run into this scenario.
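
To make the readgroup point concrete, here is a minimal sketch of what
"splitting by readgroup" means at the record level. It is only an
illustration, not the procedure we actually ran: the function name
split_by_readgroup and the "NO_RG" fallback label are hypothetical, and it
works on SAM text lines rather than on BAM files. In practice a tool such as
`samtools split` would normally do this job.

```python
from collections import defaultdict


def split_by_readgroup(sam_lines):
    """Group SAM alignment records by their RG:Z: tag, preserving input order.

    Hypothetical helper for illustration only; reads without an RG tag are
    collected under the made-up label "NO_RG".
    """
    groups = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        record = line.rstrip("\n")
        fields = record.split("\t")
        # The SAM format has 11 mandatory columns; optional tags follow them.
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), "NO_RG")
        groups[rg].append(record)
    return dict(groups)
```

Each resulting group would then be remapped separately, so that the aligner
sees one lane/readgroup (and one library) at a time.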

Finally, Keiran, what is your opinion on the discussion about working back
from the aligned BAMs? Could you summarize for us again what you think is
the reason for the 3% mismatched reads when using the rolled-back split
BAMs, versus only 0.01% when using the original unaligned BAMs, and whether
there is any possibility of addressing this?
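
For clarity about what a figure like "3% mismatched" measures, a minimal
sketch of such a comparison follows. It assumes each alignment has already
been reduced to a mapping from read ID to (chromosome, position); the name
mismatch_rate and that representation are assumptions for illustration, and
the exact definitions of "mismatch" and "soft-match" used in our validation
may differ.

```python
def mismatch_rate(original, realigned):
    """Fraction of shared reads placed at a different (chromosome, position).

    Both arguments map a read ID to a (chromosome, position) tuple. This is
    an illustrative definition, not necessarily the one used in validation.
    """
    shared = original.keys() & realigned.keys()
    if not shared:
        raise ValueError("the two alignments have no reads in common")
    differing = sum(1 for read_id in shared if original[read_id] != realigned[read_id])
    return differing / len(shared)
```

With four shared reads of which one moves, this returns 0.25; the reported
percentages are the same kind of ratio taken over all reads in the BAM.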

Thanks for your input

Miguel





On Tue, Apr 11, 2017 at 10:37 PM, Keiran Raine <kr2 at sanger.ac.uk> wrote:

> Hi,
>
>
>
> Please be aware that failure to split the BAMs by readgroup for remapping,
> so that lanes/readgroups are tagged appropriately, has implications for
> analysis algorithms.
>
>
>
> I'm unsure how you would remap without splitting by readgroup when
> libraries can be different between readgroups (which would result in a loss
> of metadata).
>
>
>
> For example, the CaVEMan (SNV) caller uses the readgroup as a covariate
> to ensure that lane-to-lane artefacts are modelled correctly.  In both
> cgpPindel (indel) and BRASS (SV) the insert size for the individual
> readgroups needs to be correct; this is skewed if data is merged during
> mapping.
>
>
>
> An example from cgpPindel in our internal test process showed that
> starting from the exact same read order in the individual lane/readgroup
> BAMs but merging the files in a different order could result in minor
> changes to indel calls (items that failed filtering).  We found this was
> due to reads from the same location being presented to the core
> algorithm in a different order.
>
>
>
> Hope this is useful.
>
>
>
> Keiran Raine
>
> Principal Bioinformatician
>
> Cancer Genome Project
>
> Wellcome Trust Sanger Institute
>
>
>
> kr2 at sanger.ac.uk
>
> Tel: +44 (0)1223 834244 Ext: 4983
>
> Office: H104
>
>
>
> *From: *<mikisvaz at gmail.com> on behalf of Miguel Vazquez <
> miguel.vazquez at cnio.es>
> *Date: *Tuesday, 11 April 2017 at 17:31
> *To: *Christina Yung <Christina.Yung at oicr.on.ca>
> *Cc: *Lincoln Stein <lincoln.stein at gmail.com>, "docktesters at lists.icgc.org"
> <docktesters at lists.icgc.org>, Francis Ouellette <francis at oicr.on.ca>,
> Keiran Raine <kr2 at sanger.ac.uk>, George Mihaiescu <
> George.Mihaiescu at oicr.on.ca>
> *Subject: *Re: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013%
> mismatches, and 3.7% soft-matches, tumor) 0.043% mismatches, and 4.64%
> soft-matches
>
>
>
> Hi Christina,
>
>
>
> There are two issues:
>
>
>
> 1- Splitting the BAM files and running them in the right order. Can do.
>
> 2- That the order of the reads *inside* a BAM is the same. Cannot fix.
>
>
>
> So if we would like the inquisitive user to reproduce the alignment
> process from the available aligned BAM, we need to tell them that the
> *reads* are not in the same order and that about 3% of the reads will be
> aligned differently.
>
>
>
> Compared to the problem of the read order, the problem with the BAM
> splitting and ordering is negligible. In fact, I believe splitting the BAM
> did nothing at all to our numbers, so I think we might as well not bother,
> but there are people better suited than me to make this call.
>
>
>
> In short, we can claim:
>
>
>
> 1) that the process is reproducible to 99.99% using the original
> unaligned BAM files
>
> 2) that working back from the aligned BAM one is able to reproduce the
> results to 96% accuracy, the loss of accuracy apparently being due to
> different read ordering.
>
>
>
> The process to reproduce in 2) could be the simple one, just unaligning
> the BAM, or the more elaborate one that involves splitting the BAM and
> feeding it in the right order, which does not seem to improve anything.
>
>
>
>
>
>