[DOCKTESTERS] BWA-Mem validation of DO35937. 93% matches, 3.6% mismatches, and 3.2% soft-matches

Miguel Vazquez miguel.vazquez at cnio.es
Sun Feb 19 07:43:11 EST 2017


Dear all,

Great news! The BWA-Mem test on a real PCAWG donor ran successfully,
achieving an overlap with the original BAM alignment similar to that of the
HCC1143 test. The numbers are:

Lines: 1708047647
Matches: 1589172843
Misses: 62726130
Soft: 56148674

This means 93% matches, 3.6% mismatches, and 3.2% soft-matches. Compared
to the HCC1143 test, about two percentage points of matches turn into
soft-matches (95% and 1.3% become 93% and 3.2%), but the ratio of misses
is very close, at 3.6% in both tests.
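
For reference, the percentages can be recomputed from the raw counts. The snippet below is only a minimal sketch; the helper name `percentages` is hypothetical and not part of the comparison script. At two decimals it gives 93.04%, 3.67% and 3.29% for DO35937, and 95.02%, 3.63% and 1.35% for HCC1143, so the figures quoted in this thread are truncated rather than rounded.

```python
def percentages(matches, misses, soft):
    """Return each category as a percentage of the total, rounded to 2 decimals."""
    total = matches + misses + soft
    counts = {"matches": matches, "misses": misses, "soft": soft}
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Counts reported in this thread.
do35937 = percentages(matches=1589172843, misses=62726130, soft=56148674)
hcc1143 = percentages(matches=70565742, misses=2693687, soft=1004961)

print("DO35937:", do35937)
print("HCC1143:", hcc1143)
```

In both runs the three counts add up exactly to the reported number of lines, so every joined read falls into exactly one category.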

I'm running this test on a second donor.

Best regards

Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
wrote:

> Dear colleagues,
>
> I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143
> data.
>
> I think what solved the problem was setting the headers on the unaligned
> BAM files. I'm currently trying it out with the DO35937 donor, but it's too
> early to say whether it's working.
>
> To compare BAM files I've followed some advice that I found on the
> internet: https://www.biostars.org/p/166221/. I describe the approach
> below because I would like some advice on how appropriate it is, but
> first here are the numbers:
>
> *Lines*: 74264390
> *Matches*: 70565742
> *Misses*: 2693687
> *Soft*: 1004961
>
>
> This means *95% matches, 3.6% mismatches, and 1.3% soft-matches*.
> Matches are when the chromosome and position are the same, soft-matches are
> when they are not the same but the position from one of the alignments is
> included in the list of alternative positions for the other alignment (e.g.
> XA:Z:15,-102516528,76M,0), and misses are the rest.
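
The three categories defined above can be sketched as a small classifier. This is an illustration only, not the logic of compare_bwa_bam.sh: the function names are hypothetical, and it assumes the XA tag follows the usual BWA layout of semicolon-separated `chromosome,signed-position,CIGAR,edit-distance` entries, where the sign encodes the strand.

```python
def parse_xa(tag):
    """Return the set of (chromosome, position) pairs listed in an XA:Z: tag."""
    if not tag or not tag.startswith("XA:Z:"):
        return set()
    alternatives = set()
    for entry in tag[len("XA:Z:"):].split(";"):
        if not entry:
            continue  # a trailing ';' leaves an empty last entry
        parts = entry.split(",")
        # The sign of the position is the strand; drop it for the comparison.
        alternatives.add((parts[0], abs(int(parts[1]))))
    return alternatives


def classify(original, realigned):
    """Classify two alignments of one read, each given as (chrom, pos, xa_tag)."""
    if original[:2] == realigned[:2]:
        return "match"
    # Soft-match: either primary position appears among the other's alternatives.
    if original[:2] in parse_xa(realigned[2]) or realigned[:2] in parse_xa(original[2]):
        return "soft"
    return "miss"


print(classify(("15", 102516528, None), ("2", 500, "XA:Z:15,-102516528,76M,0")))
```

The last line prints `soft`, because position 15:102516528 is not the primary position of the second alignment but is listed in its XA tag.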
>
> Here is the detailed process from the start. The comparison script is here:
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh
>
> 1) Un-align the tumor and normal BAM files, retaining the original aligned
> BAM files
> 2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with
> alignments from both tumor and normal
> 3) Use samtools to extract the entries, limited to the first in pair (?),
> cut the read-name, chromosome, position (??) and extra information (for
> additional alignments) and sort them. We do this for the original files and
> for the BWA-Mem merged_output file, but separating tumor and normal entries
> (marked with the codes 'tumor' and 'normal', I believe from the headers I
> set when un-aligning them)
> 4) Join the lines by read-name, separately for the tumor and normal pairs
> of files, and check for matches
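
Steps 3 and 4 above can be sketched in Python on SAM-formatted text, for example the output of `samtools view`. This is a hedged illustration rather than the actual script: the function names are hypothetical, the first-in-pair test uses SAM flag bit 0x40, and skipping secondary and supplementary records (bits 0x100 and 0x800) is an assumption made here so that each read name appears only once.

```python
FIRST_IN_PAIR = 0x40
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def first_in_pair_records(sam_lines):
    """Map read name -> (chromosome, position, XA tag or None) for first-in-pair reads."""
    records = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & FIRST_IN_PAIR:
            continue
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # assumption: keep only the primary record of each read
        # Optional tags start at the 12th SAM column.
        xa = next((f for f in fields[11:] if f.startswith("XA:Z:")), None)
        records[fields[0]] = (fields[2], int(fields[3]), xa)
    return records


def join_by_read_name(original, realigned):
    """Pair up the records of reads that are present in both inputs."""
    shared = sorted(original.keys() & realigned.keys())
    return [(name, original[name], realigned[name]) for name in shared]


# Toy records, made up for illustration only.
original_sam = [
    "@HD\tVN:1.4",
    "r1\t99\t1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r1\t147\t1\t300\t60\t4M\t=\t100\t-204\tACGT\tFFFF",
    "r2\t99\t2\t500\t0\t4M\t=\t700\t204\tACGT\tFFFF\tXA:Z:15,-102516528,76M,0;",
]
realigned_sam = [
    "r1\t99\t1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r2\t99\t15\t102516528\t0\t4M\t=\t700\t204\tACGT\tFFFF",
    "r3\t99\t3\t900\t60\t4M\t=\t950\t54\tACGT\tFFFF",
]

joined = join_by_read_name(first_in_pair_records(original_sam),
                           first_in_pair_records(realigned_sam))
for name, orig, new in joined:
    print(name, orig[:2], new[:2])
```

Holding both tables in memory is fine for a toy example; for files with over a billion reads, sorting both extracts by read name and joining them as streams, as the steps describe, is the more practical design.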
>
> I have two questions:
> (?) Is it OK to select only the first in pair? It's what the author of the
> example did, and it simplified the code by avoiding repeated read-names.
> (??) I guess it's OK to check only chromosome and position, since the CIGAR
> would then necessarily be the same.
>
> Best regards
>
> Miguel
>
> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
> wrote:
>
>> Dear all,
>>
>> Let me summarize the status of the testing for Sanger and DKFZ. The
>> validation has been run for two donors for each workflow: DO50311 and
>> DO52140.
>>
>> Sanger:
>> ----------
>>
>> Sanger calls only somatic variants. The results are *identical for Indels
>> and SVs* but only *almost identical for SNV.MNV and CNV*. The discrepancies
>> are reproducible (on the same machine at least), i.e. the same ones are
>> found after running the workflow a second time.
>>
>> DKFZ:
>> ---------
>> DKFZ calls somatic and germline variants, except germline CNVs. For both
>> germline and somatic variants the results are *identical for SNV.MNV and
>> Indels* but with *large discrepancies for SV and CNV*.
>>
>> Kortine Kleinheinz and Joachim Weischenfeldt are, I believe, in the
>> process of investigating this issue.
>>
>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas
>> Demeulemeester. Denis, I believe, is investigating this problem further. I
>> haven't had the chance to investigate this much myself.
>>
>> Best
>>
>> Miguel
>>
>>
>>
>>
>> ---------------------
>> RESULTS
>> ---------------------
>>
>> ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt
>>
>> Comparison of somatic.snv.mnv for DO50311 using DKFZ
>> ---
>> Common: 51087
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.indel for DO50311 using DKFZ
>> ---
>> Common: 26469
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.sv for DO50311 using DKFZ
>> ---
>> Common: 231
>> Extra: 44
>>     - Example: 10:20596800:N:<TRA>,10:56066821:N:<TRA>,11:16776092:N:<TRA>
>> Missing: 48
>>     - Example: 10:119704959:N:<INV>,10:13116322:N:<TRA>,10:47063485:N:<TRA>
>>
>>
>> Comparison of somatic.cnv for DO50311 using DKFZ
>> ---
>> Common: 731
>> Extra: 213
>>     - Example: 10:132510034:N:<DEL>,10:20596801:N:<NEUTRAL>,10:47674883:N:<NEUTRAL>
>> Missing: 190
>>     - Example: 10:100891940:N:<NEUTRAL>,10:104975905:N:<NEUTRAL>,10:119704960:N:<NEUTRAL>
>>
>>
>> Comparison of germline.snv.mnv for DO50311 using DKFZ
>> ---
>> Common: 3850992
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of germline.indel for DO50311 using DKFZ
>> ---
>> Common: 709060
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of germline.sv for DO50311 using DKFZ
>> ---
>> Common: 1393
>> Extra: 231
>>     - Example: 10:134319313:N:<DEL>,10:134948976:N:<DEL>,10:19996638:N:<DEL>
>> Missing: 615
>>     - Example: 10:101851839:N:<TRA>,10:101851884:N:<TRA>,10:10745225:N:<DUP>
>>
>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz
>>
>> Comparison of somatic.snv.mnv for DO52140 using DKFZ
>> ---
>> Common: 37160
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.indel for DO52140 using DKFZ
>> ---
>> Common: 19347
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.sv for DO52140 using DKFZ
>> ---
>> Common: 72
>> Extra: 23
>>     - Example: 10:132840774:N:<DEL>,11:38252019:N:<TRA>,11:47700673:N:<TRA>
>> Missing: 61
>>     - Example: 10:134749140:N:<DEL>,11:179191:N:<TRA>,11:38252005:N:<TRA>
>>
>>
>> Comparison of somatic.cnv for DO52140 using DKFZ
>> ---
>> Common: 275
>> Extra: 94
>>     - Example: 1:106505931:N:<LOH>,1:109068899:N:<DEL>,1:109359995:N:<DEL>
>> Missing: 286
>>     - Example: 10:88653561:N:<LOH>,11:179192:N:<LOH>,11:38252006:N:<LOH>
>>
>>
>> Comparison of germline.snv.mnv for DO52140 using DKFZ
>> ---
>> Common: 3833896
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of germline.indel for DO52140 using DKFZ
>> ---
>> Common: 706572
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of germline.sv for DO52140 using DKFZ
>> ---
>> Common: 1108
>> Extra: 1116
>>     - Example: 10:102158308:N:<DEL>,10:104645247:N:<DEL>,10:105097522:N:<DEL>
>> Missing: 2908
>>     - Example: 10:100107032:N:<TRA>,10:100107151:N:<TRA>,10:102158345:N:<DEL>
>>
>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz
>>
>> Comparison of somatic.snv.mnv for DO50311 using Sanger
>> ---
>> Common: 156299
>> Extra: 1
>>     - Example: Y:58885197:A:G
>> Missing: 14
>>     - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C
>>
>>
>> Comparison of somatic.indel for DO50311 using Sanger
>> ---
>> Common: 812487
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.sv for DO50311 using Sanger
>> ---
>> Common: 260
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.cnv for DO50311 using Sanger
>> ---
>> Common: 138
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.snv.mnv for DO52140 using Sanger
>> ---
>> Common: 87234
>> Extra: 5
>>     - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
>> Missing: 7
>>     - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A
>>
>>
>> Comparison of somatic.indel for DO52140 using Sanger
>> ---
>> Common: 803986
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.sv for DO52140 using Sanger
>> ---
>> Common: 6
>> Extra: 0
>> Missing: 0
>>
>>
>> Comparison of somatic.cnv for DO52140 using Sanger
>> ---
>> Common: 36
>> Extra: 0
>> Missing: 2
>>     - Example: 10:11767915:T:<CNV>,10:11779907:G:<CNV>
>>
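
The Common / Extra / Missing counts in the listing above have the shape of a set comparison on `chromosome:position:ref:alt` keys. The sketch below shows one way such a comparison could be done; it is not the code that produced results.txt. The function names are hypothetical, and reading "Extra" as calls found only in the re-run and "Missing" as calls found only in the original is an assumption.

```python
def variant_key(vcf_line):
    """Build a chrom:pos:ref:alt key from a tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return f"{chrom}:{pos}:{ref}:{alt}"


def compare_calls(original_lines, rerun_lines):
    """Compare two sets of VCF data lines and report shared and differing calls."""
    def keys(lines):
        # Skip blank lines and VCF header lines, which start with '#'.
        return {variant_key(l) for l in lines if l.strip() and not l.startswith("#")}

    original, rerun = keys(original_lines), keys(rerun_lines)
    return {
        "Common": len(original & rerun),
        "Extra": sorted(rerun - original),    # assumption: only in the re-run
        "Missing": sorted(original - rerun),  # assumption: only in the original
    }


# Toy VCF lines, made up for illustration only.
original_vcf = [
    "##fileformat=VCFv4.1",
    "1\t1000\t.\tA\tT\t.\tPASS\t.",
    "1\t2000\t.\tC\tG\t.\tPASS\t.",
]
rerun_vcf = [
    "##fileformat=VCFv4.1",
    "1\t1000\t.\tA\tT\t.\tPASS\t.",
    "2\t3000\t.\tG\tA\t.\tPASS\t.",
]

result = compare_calls(original_vcf, rerun_vcf)
print("Common:", result["Common"])
print("Extra:", len(result["Extra"]), result["Extra"][:3])
print("Missing:", len(result["Missing"]), result["Missing"][:3])
```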
>
>