[DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Jonas Demeulemeester Jonas.Demeulemeester at crick.ac.uk
Mon Apr 10 07:14:50 EDT 2017


Hi all,

I’m currently running the comparison of the BWA-Mem docker reproduced bams and the PCAWG ones for DO51057.
I should be able to send a report some time today.

Miguel, looking at your code, I believe you’re feeding the unaligned bams into the pipeline in the order given by the read group lines (@RG) in the header of the PCAWG bam.
I’m using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG bam, which is often different for whatever reason.
I’m not entirely sure which one is the correct one, but I’m guessing the one in the @CL/@PG lines is the actual one as it chronologically reiterates the whole procedure ( [align - sort] x N followed by merge + flag dups )
If this is the case, the true % mismatches may be lower still than 0.013%, if not, then I should see a higher mismatch rate and the 0.013% is due to something else still.

Regarding the soft-matches, I agree with Junjun, we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.

Best,
Jonas


_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London
NW1 1AT

T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk<%22mailto:>
W: www.crick.ac.uk<%22http://>



On 9 Apr 2017, at 15:47, Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>> wrote:

Hi Miguel,

This is indeed good news, the mismatch is significantly lower.

Regarding soft matches, thanks for the explanation. I wonder whether it has impact (or how much impact) on variant calls, do variant callers take into account the information that a read may map to multiple places? Does it make adjustment at the time of variant calling? I guess these are questions for variant caller authors.

Thanks,
Junjun



From: <docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org<mailto:docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org>> on behalf of Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>>
Date: Thursday, April 6, 2017 at 11:36 AM
To: Lincoln Stein <lincoln.stein at gmail.com<mailto:lincoln.stein at gmail.com>>
Cc: Francis Ouellette <francis at oicr.on.ca<mailto:francis at oicr.on.ca>>, Keiran Raine <kr2 at sanger.ac.uk<mailto:kr2 at sanger.ac.uk>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same is the one in the original BAM, but is included in the list of alternative alignments for that read.

For instance, the original bam aligns a read to chr 1 pos 1000, but also admits that is could be aligned at chr 2 pos 2000 or chr 3 pos 3000, the new bam aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way, that the original position is included in the list of alternative positions of the new BAM

I hope this was clear.

Best regards

Miguel






On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein <lincoln.stein at gmail.com<mailto:lincoln.stein at gmail.com>> wrote:
Hi Miguel,

Sounds like a significant achievement! But remind me what a "soft match" is?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>> wrote:
Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes, it is currently running over the tumor BAM, but the normal BAM has completed and the missmatches are two orders of magnitude lower than in our two previous attempts. Before further discussion here are the raw numbers:

Lines: 1125172217
Matches: 1083221794
Misses: 143716
Soft: 41806707

If my calculation are correct this means 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches.

The fix was two part. First realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output bam into it's constituent BAMs, by a process implemented by Jonas, dit not address the problem unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matter, and so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix, he located a subset of original unaligned BAMs in the DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as was specified in the output BAM header achieved these promising results.

I will reply this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards

Miguel



On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>> wrote:
Dear all,

Great news! The BWA-Mem test on a real PCAWG donor succeed in running; achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are:

Lines: 1708047647
Matches: 1589172843
Misses: 62726130
Soft: 56148674

Which mean 93% matches, 3.6% miss-matches, and 3.2% soft-matches. Compared to the HCC1143 test there are a few percentage points in matches that turn into soft-matches (95% and 1.3% to 93% and 3.2%), but the ratio of misses is very close 3.6%.

I'm running this test on a second donor.

Best regards

Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>> wrote:
Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data.

I think what solved the problem was setting the headers to the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but its too early to say if its working or not.

To compare BAM files I've followed some advice that I found on the internet https://www.biostars.org/p/166221/. I will detail them a bit below because I would like some advice as to how appropriate the approach is, but first here are the numbers:

Lines: 74264390
Matches: 70565742
Misses: 2693687
Soft: 1004961


Which means 95% matches, 3.6% miss-matches, and 1.3% soft-matches. Matches are when the chromosome and position are the same, soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g  XA:Z:15,-102516528,76M,0), and misses are the rest.

Here is the detailed process from the start. The comparison script is here https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh

1) Un-align tumor and normal BAM files, retaining the original aligned BAM files
2) Run BWA-Mem wich produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal
3) use samtools to extract the entries, limited for the first in pair (?), cut the read-name, chromosome, position (??) and extra information (for additional alignments) and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them)
4) join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches

I've two questions:
(?) Is it OK to select only the first in pair, its what the guy in the example did, and it did simplify the code without repeated read-names
(??) I guess its OK to only check chromosome and position, the cigar would be necessarily the same.

Best regards

Miguel

On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>> wrote:
Dear all,

Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 DO52140

Sanger:
----------

Sanger call only somatic variants. The results are identical for Indels and SVs but almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same are found after running the workflow a second time.

DKFZ:
---------
DKFZ cals somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and Indels but with large discrepancies for SV and CNV.

Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue I believe.

BWA-Mem failed for me and has also failed for Denis Yuen and Jonas Demeulemeester. Denis I believe is investigating this problem further. I haven't had the chance to investigate this much myself.

Best

Miguel




---------------------
RESULTS
---------------------

ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt

Comparison of somatic.snv.mnv for DO50311 using DKFZ
---
Common: 51087
Extra: 0
Missing: 0


Comparison of somatic.indel for DO50311 using DKFZ
---
Common: 26469
Extra: 0
Missing: 0


Comparison of somatic.sv<http://somatic.sv/> for DO50311 using DKFZ
---
Common: 231
Extra: 44
    - Example: 10:20596800:N:<TRA>,10:56066821:N:<TRA>,11:16776092:N:<TRA>
Missing: 48
    - Example: 10:119704959:N:<INV>,10:13116322:N:<TRA>,10:47063485:N:<TRA>


Comparison of somatic.cnv for DO50311 using DKFZ
---
Common: 731
Extra: 213
    - Example: 10:132510034:N:<DEL>,10:20596801:N:<NEUTRAL>,10:47674883:N:<NEUTRAL>
Missing: 190
    - Example: 10:100891940:N:<NEUTRAL>,10:104975905:N:<NEUTRAL>,10:119704960:N:<NEUTRAL>


Comparison of germline.snv.mnv for DO50311 using DKFZ
---
Common: 3850992
Extra: 0
Missing: 0


Comparison of germline.indel for DO50311 using DKFZ
---
Common: 709060
Extra: 0
Missing: 0


Comparison of germline.sv<http://germline.sv/> for DO50311 using DKFZ
---
Common: 1393
Extra: 231
    - Example: 10:134319313:N:<DEL>,10:134948976:N:<DEL>,10:19996638:N:<DEL>
Missing: 615
    - Example: 10:101851839:N:<TRA>,10:101851884:N:<TRA>,10:10745225:N:<DUP>

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO52140 using DKFZ
---
Common: 37160
Extra: 0
Missing: 0


Comparison of somatic.indel for DO52140 using DKFZ
---
Common: 19347
Extra: 0
Missing: 0


Comparison of somatic.sv<http://somatic.sv/> for DO52140 using DKFZ
---
Common: 72
Extra: 23
    - Example: 10:132840774:N:<DEL>,11:38252019:N:<TRA>,11:47700673:N:<TRA>
Missing: 61
    - Example: 10:134749140:N:<DEL>,11:179191:N:<TRA>,11:38252005:N:<TRA>


Comparison of somatic.cnv for DO52140 using DKFZ
---
Common: 275
Extra: 94
    - Example: 1:106505931:N:<LOH>,1:109068899:N:<DEL>,1:109359995:N:<DEL>
Missing: 286
    - Example: 10:88653561:N:<LOH>,11:179192:N:<LOH>,11:38252006:N:<LOH>


Comparison of germline.snv.mnv for DO52140 using DKFZ
---
Common: 3833896
Extra: 0
Missing: 0


Comparison of germline.indel for DO52140 using DKFZ
---
Common: 706572
Extra: 0
Missing: 0


Comparison of germline.sv<http://germline.sv/> for DO52140 using DKFZ
---
Common: 1108
Extra: 1116
    - Example: 10:102158308:N:<DEL>,10:104645247:N:<DEL>,10:105097522:N:<DEL>
Missing: 2908
    - Example: 10:100107032:N:<TRA>,10:100107151:N:<TRA>,10:102158345:N:<DEL>

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO50311 using Sanger
---
Common: 156299
Extra: 1
    - Example: Y:58885197:A:G
Missing: 14
    - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C


Comparison of somatic.indel for DO50311 using Sanger
---
Common: 812487
Extra: 0
Missing: 0


Comparison of somatic.sv<http://somatic.sv/> for DO50311 using Sanger
---
Common: 260
Extra: 0
Missing: 0


Comparison of somatic.cnv for DO50311 using Sanger
---
Common: 138
Extra: 0
Missing: 0


Comparison of somatic.snv.mnv for DO52140 using Sanger
---
Common: 87234
Extra: 5
    - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
Missing: 7
    - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A


Comparison of somatic.indel for DO52140 using Sanger
---
Common: 803986
Extra: 0
Missing: 0


Comparison of somatic.sv<http://somatic.sv/> for DO52140 using Sanger
---
Common: 6
Extra: 0
Missing: 0


Comparison of somatic.cnv for DO52140 using Sanger
---
Common: 36
Extra: 0
Missing: 2
    - Example: 10:11767915:T:<CNV>,10:11779907:G:<CNV>






--
Lincoln Stein

Scientific Director (Interim), Ontario Institute for Cancer Research
Director, Informatics and Bio-computing Program, OICR
Senior Principal Investigator, OICR
Professor, Department of Molecular Genetics, University of Toronto

<http://goog_1828306398/>
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Tel: 416-673-8514
Mobile: 416-817-8240
Email: lincoln.stein at gmail.com<mailto:lincoln.stein at gmail.com>
Toll-free: 1-866-678-6427
Twitter: @OICR_news

Executive Assistant
Melisa Torres
Tel:  647-259-4253
Email: melisa.torres at oicr.on.ca<mailto:stacey.quinn at oicr.on.ca>
www.oicr.on.ca<http://www.oicr.on.ca/>

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>
https://lists.icgc.org/mailman/listinfo/docktesters


_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>
https://lists.icgc.org/mailman/listinfo/docktesters


The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.icgc.org/mailman/private/docktesters/attachments/20170410/161cd471/attachment-0001.html>


More information about the docktesters mailing list