[DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Miguel Vazquez miguel.vazquez at cnio.es
Mon Apr 10 08:15:00 EDT 2017


Hi all,

The comparison with the *tumor BAM* for DO51057 has completed with *rates
of miss-maches (**0.043%) and soft-matches (**4.64%) just slightly
higher* *than
for the* *normal BAM*. These *numbers are not definitive* since as you can
read from Jonas just below, *there might still be a discrepancy* in the
order in which the BAM where processed.  We'll soon know from Jonas if a
different order will fix these rates even more.

*Lines*: 1010685786
*Matches*: 963319037
*Misses*: 442926
*Soft*: 46923823

Best regards

Miguel

On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester <
Jonas.Demeulemeester at crick.ac.uk> wrote:

> Hi all,
>
> I’m currently running the comparison of the BWA-Mem docker reproduced bams
> and the PCAWG ones for DO51057.
> I should be able to send a report some time today.
>
> Miguel, looking at your code, I believe you’re feeding the unaligned bams
> into the pipeline in the order given by the read group lines (@RG) in the
> header of the PCAWG bam.
> I’m using the order recorded in the command line/programs used (@CL/@PG)
> lines of the PCAWG bam, which is often different for whatever reason.
> I’m not entirely sure which one is the correct one, but I’m guessing the
> one in the @CL/@PG lines is the actual one as it chronologically reiterates
> the whole procedure ( [align - sort] x N followed by merge + flag dups )
> If this is the case, the true % mismatches may be lower still than 0.013%,
> if not, then I should see a higher mismatch rate and the 0.013% is due to
> something else still.
>
> Regarding the soft-matches, I agree with Junjun, we may want to ask the
> people behind the variant callers, but I guess they are probably dealing
> with these multiply-mapping reads internally.
>
> Best,
> Jonas
>
>
> _________________________________
> Jonas Demeulemeester, PhD
> Postdoctoral Researcher
> The Francis Crick Institute
> 1 Midland Road
> London
> NW1 1AT
>
> *T:* +44 (0)20 3796 2594 <+44%2020%203796%202594>
> M: +44 (0)7482 070730 <+44%207482%20070730>
> *E:* jonas.demeulemeester at crick.ac.uk
> *W:* www.crick.ac.uk
>
>
>
> On 9 Apr 2017, at 15:47, Junjun Zhang <Junjun.Zhang at oicr.on.ca> wrote:
>
> Hi Miguel,
>
> This is indeed good news, the mismatch is significantly lower.
>
> Regarding soft matches, thanks for the explanation. I wonder whether it
> has impact (or how much impact) on variant calls, do variant callers take
> into account the information that a read may map to multiple places? Does
> it make adjustment at the time of variant calling? I guess these are
> questions for variant caller authors.
>
> Thanks,
> Junjun
>
>
>
> From: <docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org> on
> behalf of Miguel Vazquez <miguel.vazquez at cnio.es>
> Date: Thursday, April 6, 2017 at 11:36 AM
> To: Lincoln Stein <lincoln.stein at gmail.com>
> Cc: Francis Ouellette <francis at oicr.on.ca>, Keiran Raine <kr2 at sanger.ac.uk>,
> "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM
> only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches
>
> Hi Lincoln,
>
> Soft-match means that the alignment position in the new BAM is not the
> same is the one in the original BAM, but is included in the list of
> alternative alignments for that read.
>
> For instance, the original bam aligns a read to chr 1 pos 1000, but also
> admits that is could be aligned at chr 2 pos 2000 or chr 3 pos 3000, the
> new bam aligns it at chr 2 pos 2000, which is not the position chosen by
> the original BAM but is in the alternative list. It could also work the
> other way, that the original position is included in the list of
> alternative positions of the new BAM
>
> I hope this was clear.
>
> Best regards
>
> Miguel
>
>
>
>
>
>
> On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein <lincoln.stein at gmail.com>
> wrote:
>
>> Hi Miguel,
>>
>> Sounds like a significant achievement! But remind me what a "soft match"
>> is?
>>
>> Lincoln
>>
>> On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez <miguel.vazquez at cnio.es>
>> wrote:
>>
>>> Dear all,
>>>
>>> This is just an advance teaser for the BWA-Mem validation after the
>>> latest changes, it is currently running over the tumor BAM, but the normal
>>> BAM has completed and the *missmatches are two orders of magnitude
>>> lower* than in our two previous attempts. Before further discussion
>>> here are the raw numbers:
>>>
>>> Lines: 1125172217
>>> Matches: 1083221794
>>> *Misses: 143716*
>>> Soft: 41806707
>>>
>>> If my calculation are correct this means 96.3% matches, *0.013%
>>> miss-matches*, and 3.7% soft-matches.
>>>
>>> The fix was two part. First realizing that the input of this process
>>> should not be a single unaligned version of the output BAMs, but several
>>> input BAMs. Breaking down the output bam into it's constituent BAMs, by a
>>> process implemented by Jonas, dit not address the problem unfortunately.
>>> After this first attempt it was pointed out to us, I think by Keiran, that
>>> the order of the reads matter, and so our attempt to work back from the
>>> output BAM was not going to work. Junjun came back to us with the second
>>> part of the fix, he located a subset of original unaligned BAMs in the DKFZ
>>> that we could use. Downloading these BAM files and submitting them to
>>> BWA-Mem in the same order as was specified in the output BAM header
>>> achieved these promising results.
>>>
>>> I will reply this message in a few days with the corresponding numbers
>>> for the other BAM, the tumor, which is currently running.
>>>
>>> Best regards
>>>
>>> Miguel
>>>
>>>
>>>
>>> On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> Great news! The BWA-Mem test on a real PCAWG donor succeed in running;
>>>> achieving an overlap with the original BAM alignment similar to the
>>>> HCC1143 test. The numbers are:
>>>>
>>>> Lines: 1708047647
>>>> Matches: 1589172843
>>>> Misses: 62726130
>>>> Soft: 56148674
>>>>
>>>> Which mean 93% matches, 3.6% miss-matches, and 3.2% soft-matches.
>>>> Compared to the HCC1143 test there are a few percentage points in matches
>>>> that turn into soft-matches (95% and 1.3% to 93% and 3.2%), but the ratio
>>>> of misses is very close 3.6%.
>>>>
>>>> I'm running this test on a second donor.
>>>>
>>>> Best regards
>>>>
>>>> Miguel
>>>>
>>>> On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez <miguel.vazquez at cnio.es
>>>> > wrote:
>>>>
>>>>> Dear colleagues,
>>>>>
>>>>> I'm very happy to say that the BWA-Mem pipeline finished for the
>>>>> HCC1143 data.
>>>>>
>>>>> I think what solved the problem was setting the headers to the
>>>>> unaligned BAM files. I'm currently trying it out with the DO35937 donor,
>>>>> but its too early to say if its working or not.
>>>>>
>>>>> To compare BAM files I've followed some advice that I found on the
>>>>> internet https://www.biostars.org/p/166221/. I will detail them a bit
>>>>> below because I would like some advice as to how appropriate the approach
>>>>> is, but first here are the numbers:
>>>>>
>>>>> *Lines*: 74264390
>>>>> *Matches*: 70565742
>>>>> *Misses*: 2693687
>>>>> *Soft*: 1004961
>>>>>
>>>>>
>>>>> Which means *95% matches, 3.6% miss-matches, and 1.3% soft-matches*.
>>>>> Matches are when the chromosome and position are the same, soft-matches are
>>>>> when they are not the same but the position from one of the alignments is
>>>>> included in the list of alternative positions for the other alignment (e.g
>>>>> XA:Z:15,-102516528,76M,0), and misses are the rest.
>>>>>
>>>>> Here is the detailed process from the start. The comparison script is
>>>>> here https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi
>>>>> n/compare_bwa_bam.sh
>>>>>
>>>>> 1) Un-align tumor and normal BAM files, retaining the original aligned
>>>>> BAM files
>>>>> 2) Run BWA-Mem wich produces a file called HCC1143.merged_output.bam
>>>>> with alignments from both tumor and normal
>>>>> 3) use samtools to extract the entries, limited for the first in pair
>>>>> (?), cut the read-name, chromosome, position (??) and extra information
>>>>> (for additional alignments) and sort them. We do this for the original
>>>>> files and for the BWA-Mem merged_output file, but separating tumor and
>>>>> normal entries (marked with the codes 'tumor' and 'normal', I believe from
>>>>> the headers I set when un-aligning them)
>>>>> 4) join the lines by read-name, separately for the tumor and normal
>>>>> pairs of files, and check for matches
>>>>>
>>>>> I've two questions:
>>>>> (?) Is it OK to select only the first in pair, its what the guy in the
>>>>> example did, and it did simplify the code without repeated read-names
>>>>> (??) I guess its OK to only check chromosome and position, the cigar
>>>>> would be necessarily the same.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Miguel
>>>>>
>>>>> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez <
>>>>> miguel.vazquez at cnio.es> wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> Let me summarize the status of the testing for Sanger and DKFZ. The
>>>>>> validation has been run for two donors for each workflow: DO50311 DO52140
>>>>>>
>>>>>> Sanger:
>>>>>> ----------
>>>>>>
>>>>>> Sanger call only somatic variants. The results are *identical for
>>>>>> Indels and SVs* but *almost identical for SNV.MNV and CNV*. The
>>>>>> discrepancies are reproducible (on the same machine at least), i.e. the
>>>>>> same are found after running the workflow a second time.
>>>>>>
>>>>>> DKFZ:
>>>>>> ---------
>>>>>> DKFZ cals somatic and germline variants, except germline CNVs. For
>>>>>> both germline and somatic variants the results are *identical for
>>>>>> SNV.MNV and Indels* but with *large discrepancies for SV and CNV*.
>>>>>>
>>>>>> Kortine Kleinheinz and Joachim Weischenfeldt are in the process of
>>>>>> investigating this issue I believe.
>>>>>>
>>>>>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas
>>>>>> Demeulemeester. Denis I believe is investigating this problem further. I
>>>>>> haven't had the chance to investigate this much myself.
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Miguel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------
>>>>>> RESULTS
>>>>>> ---------------------
>>>>>>
>>>>>> ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt
>>>>>>
>>>>>> Comparison of somatic.snv.mnv for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 51087
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.indel for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 26469
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.sv for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 231
>>>>>> Extra: 44
>>>>>>     - Example: 10:20596800:N:<TRA>,10:5606682
>>>>>> 1:N:<TRA>,11:16776092:N:<TRA>
>>>>>> Missing: 48
>>>>>>     - Example: 10:119704959:N:<INV>,10:131163
>>>>>> 22:N:<TRA>,10:47063485:N:<TRA>
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.cnv for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 731
>>>>>> Extra: 213
>>>>>>     - Example: 10:132510034:N:<DEL>,10:205968
>>>>>> 01:N:<NEUTRAL>,10:47674883:N:<NEUTRAL>
>>>>>> Missing: 190
>>>>>>     - Example: 10:100891940:N:<NEUTRAL>,10:10
>>>>>> 4975905:N:<NEUTRAL>,10:119704960:N:<NEUTRAL>
>>>>>>
>>>>>>
>>>>>> Comparison of germline.snv.mnv for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 3850992
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of germline.indel for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 709060
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of germline.sv for DO50311 using DKFZ
>>>>>> ---
>>>>>> Common: 1393
>>>>>> Extra: 231
>>>>>>     - Example: 10:134319313:N:<DEL>,10:134948
>>>>>> 976:N:<DEL>,10:19996638:N:<DEL>
>>>>>> Missing: 615
>>>>>>     - Example: 10:101851839:N:<TRA>,10:101851
>>>>>> 884:N:<TRA>,10:10745225:N:<DUP>
>>>>>>
>>>>>> File not found /mnt/1TB/work/DockerTest-Migue
>>>>>> l/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz
>>>>>>
>>>>>> Comparison of somatic.snv.mnv for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 37160
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.indel for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 19347
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.sv for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 72
>>>>>> Extra: 23
>>>>>>     - Example: 10:132840774:N:<DEL>,11:382520
>>>>>> 19:N:<TRA>,11:47700673:N:<TRA>
>>>>>> Missing: 61
>>>>>>     - Example: 10:134749140:N:<DEL>,11:179191
>>>>>> :N:<TRA>,11:38252005:N:<TRA>
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.cnv for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 275
>>>>>> Extra: 94
>>>>>>     - Example: 1:106505931:N:<LOH>,1:10906889
>>>>>> 9:N:<DEL>,1:109359995:N:<DEL>
>>>>>> Missing: 286
>>>>>>     - Example: 10:88653561:N:<LOH>,11:179192:
>>>>>> N:<LOH>,11:38252006:N:<LOH>
>>>>>>
>>>>>>
>>>>>> Comparison of germline.snv.mnv for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 3833896
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of germline.indel for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 706572
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of germline.sv for DO52140 using DKFZ
>>>>>> ---
>>>>>> Common: 1108
>>>>>> Extra: 1116
>>>>>>     - Example: 10:102158308:N:<DEL>,10:104645
>>>>>> 247:N:<DEL>,10:105097522:N:<DEL>
>>>>>> Missing: 2908
>>>>>>     - Example: 10:100107032:N:<TRA>,10:100107
>>>>>> 151:N:<TRA>,10:102158345:N:<DEL>
>>>>>>
>>>>>> File not found /mnt/1TB/work/DockerTest-Migue
>>>>>> l/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz
>>>>>>
>>>>>> Comparison of somatic.snv.mnv for DO50311 using Sanger
>>>>>> ---
>>>>>> Common: 156299
>>>>>> Extra: 1
>>>>>>     - Example: Y:58885197:A:G
>>>>>> Missing: 14
>>>>>>     - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.indel for DO50311 using Sanger
>>>>>> ---
>>>>>> Common: 812487
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.sv for DO50311 using Sanger
>>>>>> ---
>>>>>> Common: 260
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.cnv for DO50311 using Sanger
>>>>>> ---
>>>>>> Common: 138
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.snv.mnv for DO52140 using Sanger
>>>>>> ---
>>>>>> Common: 87234
>>>>>> Extra: 5
>>>>>>     - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
>>>>>> Missing: 7
>>>>>>     - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.indel for DO52140 using Sanger
>>>>>> ---
>>>>>> Common: 803986
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.sv for DO52140 using Sanger
>>>>>> ---
>>>>>> Common: 6
>>>>>> Extra: 0
>>>>>> Missing: 0
>>>>>>
>>>>>>
>>>>>> Comparison of somatic.cnv for DO52140 using Sanger
>>>>>> ---
>>>>>> Common: 36
>>>>>> Extra: 0
>>>>>> Missing: 2
>>>>>>     - Example: 10:11767915:T:<CNV>,10:11779907:G:<CNV>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> *Lincoln Stein*
>>
>> Scientific Director (Interim), Ontario Institute for Cancer Research
>> Director, Informatics and Bio-computing Program, OICR
>> Senior Principal Investigator, OICR
>> Professor, Department of Molecular Genetics, University of Toronto
>>
>> <http://goog_1828306398/>
>> *Ontario Institute for Cancer Research*
>> MaRS Centre
>> 661 University Avenue
>> Suite 510
>> Toronto, Ontario
>> Canada M5G 0A3
>>
>> Tel: 416-673-8514
>> Mobile: 416-817-8240
>> Email: lincoln.stein at gmail.com
>> Toll-free: 1-866-678-6427
>> Twitter: @OICR_news
>>
>> *Executive Assistant*
>> *Melisa Torres*
>> Tel:  647-259-4253
>> Email: melisa.torres at oicr.on.ca <stacey.quinn at oicr.on.ca>
>> www.oicr.on.ca
>>
>> This message and any attachments may contain confidential and/or
>> privileged information for the sole use of the intended recipient. Any
>> review or distribution by anyone other than the person for whom it was
>> originally intended is strictly prohibited. If you have received this
>> message in error, please contact the sender and delete all copies.
>> Opinions, conclusions or other information contained in this message may
>> not be that of the organization.
>>
>> _______________________________________________
>> docktesters mailing list
>> docktesters at lists.icgc.org
>> https://lists.icgc.org/mailman/listinfo/docktesters
>>
>>
> _______________________________________________
> docktesters mailing list
> docktesters at lists.icgc.org
> https://lists.icgc.org/mailman/listinfo/docktesters
>
>
> The Francis Crick Institute Limited is a registered charity in England and
> Wales no. 1140062 and a company registered in England and Wales no.
> 06885462, with its registered office at 1 Midland Road London NW1 1AT
>
> _______________________________________________
> docktesters mailing list
> docktesters at lists.icgc.org
> https://lists.icgc.org/mailman/listinfo/docktesters
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.icgc.org/mailman/private/docktesters/attachments/20170410/4cc83fb3/attachment-0001.html>


More information about the docktesters mailing list