[DOCKTESTERS] BWA-Mem validation of HCC1143. 95% matches, 3.6% miss-matches, and 1.3% soft-matches

Denis Yuen Denis.Yuen at oicr.on.ca
Tue Feb 14 10:25:46 EST 2017


Hi,

Thanks for the update. I haven't had as much time to work through the BWA procedure as I'd like.

This sounds like good progress.

________________________________
From: docktesters-bounces+denis.yuen=oicr.on.ca at lists.icgc.org <docktesters-bounces+denis.yuen=oicr.on.ca at lists.icgc.org> on behalf of Miguel Vazquez <miguel.vazquez at cnio.es>
Sent: February 14, 2017 9:30 AM
To: Lincoln Stein; Francis Ouellette; Brian O'Connor
Cc: docktesters at lists.icgc.org
Subject: [DOCKTESTERS] BWA-Mem validation of HCC1143. 95% matches, 3.6% miss-matches, and 1.3% soft-matches

Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data.

I think what solved the problem was setting the headers on the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but it's too early to say whether it's working.

To compare the BAM files I followed some advice that I found online: https://www.biostars.org/p/166221/. I detail the steps below because I would like some advice on how appropriate the approach is, but first here are the numbers:

Lines: 74264390
Matches: 70565742
Misses: 2693687
Soft: 1004961


This means 95% matches, 3.6% mismatches, and 1.3% soft-matches. Matches are reads for which the chromosome and position are the same in both alignments; soft-matches are reads for which they differ, but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g. XA:Z:15,-102516528,76M,0); misses are the rest.
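
As a quick sanity check on these figures, here is a minimal Python sketch that recomputes the shares from the four counts above. The function name `percentages` is just an illustrative choice. Note that the soft-match share works out to roughly 1.35%, reported above as 1.3%.

```python
# Counts copied from the message above.
total = 74264390
counts = {"matches": 70565742, "misses": 2693687, "soft": 1004961}


def percentages(counts, total):
    """Return each category as a percentage of the total number of lines."""
    return {label: 100 * n / total for label, n in counts.items()}


# The three categories should partition the compared reads exactly.
assert sum(counts.values()) == total

pct = percentages(counts, total)
for label, value in pct.items():
    print(f"{label}: {value:.2f}%")
```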

Here is the detailed process from the start. The comparison script is here https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh

1) Un-align tumor and normal BAM files, retaining the original aligned BAM files
2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal
3) Use samtools to extract the entries, limited to the first in pair (?), cut out the read-name, chromosome, position (??) and extra information (for additional alignments), and sort them. We do this for the original files and for the BWA-Mem merged_output file, separating the tumor and normal entries of the latter (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them)
4) join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches
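
The match / soft-match / miss logic of step 4 can be sketched in Python as follows. This is not the linked shell script, only an illustration of the classification described above. It assumes each read has already been reduced to a (chromosome, position, XA tag) triple keyed by read-name, and that the strand sign in the XA position can be ignored; the function names and the toy reads are hypothetical.

```python
from collections import Counter


def parse_xa(tag):
    """Return the set of (chromosome, position) pairs listed in an XA tag."""
    if not tag.startswith("XA:Z:"):
        return set()
    alts = set()
    for entry in tag[len("XA:Z:"):].split(";"):
        if not entry:
            continue  # a trailing ';' leaves an empty entry
        chrom, pos, _cigar, _nm = entry.split(",")
        alts.add((chrom, abs(int(pos))))  # assumption: drop the strand sign
    return alts


def classify(a, b):
    """a, b: (chrom, pos, xa_tag) for the same read in the two BAM files."""
    loc_a, loc_b = (a[0], a[1]), (b[0], b[1])
    if loc_a == loc_b:
        return "match"
    if loc_a in parse_xa(b[2]) or loc_b in parse_xa(a[2]):
        return "soft"
    return "miss"


# Toy data: three reads seen in both the original and the re-aligned file.
original = {
    "readA": ("1", 1000, ""),
    "readB": ("15", 102516528, ""),
    "readC": ("2", 50, ""),
}
realigned = {
    "readA": ("1", 1000, ""),
    "readB": ("1", 500, "XA:Z:15,-102516528,76M,0;"),
    "readC": ("3", 50, ""),
}

# Join by read-name and tally the outcome for each shared read.
shared = original.keys() & realigned.keys()
tally = Counter(classify(original[name], realigned[name]) for name in shared)
print(tally)
```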

I have two questions:
(?) Is it OK to select only the first in pair? It's what the author of the example did, and it simplified the code by avoiding repeated read-names.
(??) I guess it's OK to check only chromosome and position, since the CIGAR would necessarily be the same.
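
One way to test both assumptions rather than rely on them is sketched below in Python. It is an alternative to the approach described above, not what the linked script does. It keys each record by (read-name, mate number) so both mates can be kept without repeated keys, and carries the CIGAR along with chromosome and position so that it can be compared too. The flag bits (0x40 first in pair, 0x80 second in pair, 0x100 secondary, 0x800 supplementary) and the column order come from the SAM format; the function name and the sample lines are hypothetical.

```python
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keyed(sam_line):
    """Return ((read_name, mate), (chrom, pos, cigar)), or None to skip.

    Secondary and supplementary records are skipped so that each
    (read_name, mate) key appears at most once per file.
    """
    fields = sam_line.rstrip("\n").split("\t")
    qname, flag = fields[0], int(fields[1])
    if flag & (SECONDARY | SUPPLEMENTARY):
        return None
    if flag & FIRST_IN_PAIR:
        mate = 1
    elif flag & SECOND_IN_PAIR:
        mate = 2
    else:
        mate = 0  # unpaired read
    return (qname, mate), (fields[2], int(fields[3]), fields[5])


# Hypothetical SAM lines: flag 99 is a first mate, flag 147 a second mate.
first = "read1\t99\t1\t1000\t60\t76M\t=\t1200\t276\tACGT\tIIII"
second = "read1\t147\t1\t1200\t60\t70M6S\t=\t1000\t-276\tACGT\tIIII"

print(keyed(first))
print(keyed(second))
```

With the CIGAR in the compared value, reads that share chromosome and position but differ in CIGAR (for example through different clipping) would show up as differences instead of being counted as matches.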

Best regards

Miguel

On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
Dear all,

Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 and DO52140.

Sanger:
----------

Sanger calls only somatic variants. The results are identical for indels and SVs, and almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same ones are found after running the workflow a second time.

DKFZ:
---------
DKFZ calls somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and indels, but show large discrepancies for SV and CNV.


Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue, I believe.

BWA-Mem failed for me and has also failed for Denis Yuen and Jonas Demeulemeester. Denis I believe is investigating this problem further. I haven't had the chance to investigate this much myself.

Best

Miguel




---------------------
RESULTS
---------------------

ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt

Comparison of somatic.snv.mnv for DO50311 using DKFZ
---
Common: 51087
Extra: 0
Missing: 0


Comparison of somatic.indel for DO50311 using DKFZ
---
Common: 26469
Extra: 0
Missing: 0


Comparison of somatic.sv for DO50311 using DKFZ
---
Common: 231
Extra: 44
    - Example: 10:20596800:N:<TRA>,10:56066821:N:<TRA>,11:16776092:N:<TRA>
Missing: 48
    - Example: 10:119704959:N:<INV>,10:13116322:N:<TRA>,10:47063485:N:<TRA>


Comparison of somatic.cnv for DO50311 using DKFZ
---
Common: 731
Extra: 213
    - Example: 10:132510034:N:<DEL>,10:20596801:N:<NEUTRAL>,10:47674883:N:<NEUTRAL>
Missing: 190
    - Example: 10:100891940:N:<NEUTRAL>,10:104975905:N:<NEUTRAL>,10:119704960:N:<NEUTRAL>


Comparison of germline.snv.mnv for DO50311 using DKFZ
---
Common: 3850992
Extra: 0
Missing: 0


Comparison of germline.indel for DO50311 using DKFZ
---
Common: 709060
Extra: 0
Missing: 0


Comparison of germline.sv for DO50311 using DKFZ
---
Common: 1393
Extra: 231
    - Example: 10:134319313:N:<DEL>,10:134948976:N:<DEL>,10:19996638:N:<DEL>
Missing: 615
    - Example: 10:101851839:N:<TRA>,10:101851884:N:<TRA>,10:10745225:N:<DUP>

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO52140 using DKFZ
---
Common: 37160
Extra: 0
Missing: 0


Comparison of somatic.indel for DO52140 using DKFZ
---
Common: 19347
Extra: 0
Missing: 0


Comparison of somatic.sv for DO52140 using DKFZ
---
Common: 72
Extra: 23
    - Example: 10:132840774:N:<DEL>,11:38252019:N:<TRA>,11:47700673:N:<TRA>
Missing: 61
    - Example: 10:134749140:N:<DEL>,11:179191:N:<TRA>,11:38252005:N:<TRA>


Comparison of somatic.cnv for DO52140 using DKFZ
---
Common: 275
Extra: 94
    - Example: 1:106505931:N:<LOH>,1:109068899:N:<DEL>,1:109359995:N:<DEL>
Missing: 286
    - Example: 10:88653561:N:<LOH>,11:179192:N:<LOH>,11:38252006:N:<LOH>


Comparison of germline.snv.mnv for DO52140 using DKFZ
---
Common: 3833896
Extra: 0
Missing: 0


Comparison of germline.indel for DO52140 using DKFZ
---
Common: 706572
Extra: 0
Missing: 0


Comparison of germline.sv for DO52140 using DKFZ
---
Common: 1108
Extra: 1116
    - Example: 10:102158308:N:<DEL>,10:104645247:N:<DEL>,10:105097522:N:<DEL>
Missing: 2908
    - Example: 10:100107032:N:<TRA>,10:100107151:N:<TRA>,10:102158345:N:<DEL>

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO50311 using Sanger
---
Common: 156299
Extra: 1
    - Example: Y:58885197:A:G
Missing: 14
    - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C


Comparison of somatic.indel for DO50311 using Sanger
---
Common: 812487
Extra: 0
Missing: 0


Comparison of somatic.sv for DO50311 using Sanger
---
Common: 260
Extra: 0
Missing: 0


Comparison of somatic.cnv for DO50311 using Sanger
---
Common: 138
Extra: 0
Missing: 0


Comparison of somatic.snv.mnv for DO52140 using Sanger
---
Common: 87234
Extra: 5
    - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
Missing: 7
    - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A


Comparison of somatic.indel for DO52140 using Sanger
---
Common: 803986
Extra: 0
Missing: 0


Comparison of somatic.sv for DO52140 using Sanger
---
Common: 6
Extra: 0
Missing: 0


Comparison of somatic.cnv for DO52140 using Sanger
---
Common: 36
Extra: 0
Missing: 2
    - Example: 10:11767915:T:<CNV>,10:11779907:G:<CNV>
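
The Common / Extra / Missing counts in the listing above can be read as set operations on variant keys of the form chromosome:position:ref:alt. The Python sketch below illustrates that reading. It assumes that "Extra" means calls present only in the new run and "Missing" means calls present only in the original results, which the listing itself does not state; the function name and the toy keys are hypothetical.

```python
def compare_calls(reference, candidate, n_examples=3):
    """Summarize the overlap between two collections of variant keys."""
    reference, candidate = set(reference), set(candidate)
    report = {"Common": len(reference & candidate)}
    for label, keys in (
        ("Extra", candidate - reference),    # only in the new run (assumed)
        ("Missing", reference - candidate),  # only in the original (assumed)
    ):
        report[label] = len(keys)
        report[label + " examples"] = sorted(keys)[:n_examples]
    return report


# Toy variant keys, not taken from the real VCF files.
reference_calls = ["1:100:A:T", "1:200:C:G", "2:300:N:<DEL>"]
candidate_calls = ["1:100:A:T", "2:300:N:<DEL>", "3:400:G:A"]

report = compare_calls(reference_calls, candidate_calls)
for label, value in report.items():
    print(f"{label}: {value}")
```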
