From Denis.Yuen at oicr.on.ca  Mon Apr  3 10:01:34 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Mon, 3 Apr 2017 14:01:34 +0000
Subject: [DOCKTESTERS] SV Merge image in Dockstore
Message-ID: <830b6e5ecd8548fb94fc8bd0fde8cf47@oicr.on.ca>

Hi,

As mentioned in the Monday call, I took a quick look at
https://dockstore.org/containers/registry.hub.docker.com/essi/pcawg_sv_merge

It looks like the tool is functional and I was able to run it with the test data (a small test JSON is registered on Dockstore and is available here: https://bitbucket.org/weischenfeldt/pcawg_sv_merge/src/26bcbf6b86935417d8be5a379b3657167076c6b8/Dockstore.json?at=docker&fileviewer=file-view-default )

A couple of heads-ups though:

1) There appears to be a bug in Dockstore 1.1.5 where, when working with array output, Dockstore does not copy the results to their final location and instead leaves them in place. This means that for now, users will need to look at the stdout file, which does list all the output files (the file will be at a path that looks like ./datastore/launcher-6dd9b514-fe37-4290-8f36-5fb28e5a7f3a/outputs/cwltool.stdout.txt ).

2) I wasn't able to download the full Synapse dataset, but this may be an issue with my account, since the underlying dataset seems to be hosted at dccsftp.nci.nih.gov rather than on Synapse itself. The link is available here: https://www.synapse.org/#!Synapse:syn8547037

In short, I believe the tool should be ready for testing, although there are a couple of rough edges to work on (they shouldn't block testing).

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
MaRS Centre, 661 University Avenue, Suite 510
Toronto, Ontario, Canada M5G 0A3
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca
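(A minimal sketch of how one might pull the reported output files out of that cwltool stdout log while the copy bug stands; the launcher-* layout is from the example above, and the grep pattern is an assumption about cwltool's JSON output rather than a documented Dockstore interface:)

# Find the most recent launcher run and list the file paths cwltool reported
RUN_DIR=$(ls -td ./datastore/launcher-*/ | head -1)
# cwltool prints a JSON description of the outputs on stdout; pick out quoted paths
grep -o '"/[^"]*"' "$RUN_DIR/outputs/cwltool.stdout.txt" | tr -d '"' | sort -u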
From George.Mihaiescu at oicr.on.ca  Mon Apr  3 11:42:23 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 3 Apr 2017 15:42:23 +0000
Subject: [DOCKTESTERS] Thanks!

Hi,

The DKFZ workflow finished in one day and there were no differences observed by the compare_result script.

I attached the updated report for all four workflow runs.

Thank you,
George

From: Miguel Vazquez
Date: Wednesday, March 29, 2017 at 5:09 AM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

George,

If you were using my scripts you should have a file tests/DKFZ/DO218695/Dockstore.json. Do you mind sharing it with us? I find it a bit hard to debug these sorts of errors because it's not clear to me whether the problem is in the underlying tools or in the Dockstore layer. My reasoning is that the problem must be in the interface, i.e. that Dockstore cannot set up the environment for the tool: the tools themselves work and the inputs are probably suitable for them, so my guess is that some inputs are missing or are not placed in the right location, or something like that.

Anyway, let's start by looking at the Dockstore.json.

Best

Miguel

On Mon, Mar 27, 2017 at 4:36 PM, George Mihaiescu wrote:

Hi,

Last week, thanks to Denis who provided the DKFZ dependencies, I was able to start that workflow.

It ran for about 10 hours at 100% CPU, but then it failed with the following errors:

root at dockstore4-dkfz:~/PCAWG-Docker-Test#
+ cntSuccessful=4
++ expr 4 - 4
+ cntErrornous=0
+ [[ 0 -gt 0 ]]
+ [[ 0 == 0 ]]
+ echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt'
No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt
+ for logfile in '${jobstateFiles[@]}'
++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
++ grep -v null:
++ grep :STARTED:
++ wc -l
+ cntStarted=2
++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
++ grep -v null:
++ grep :0:
++ wc -l
+ cntSuccessful=2
++ expr 2 - 2
+ cntErrornous=0
+ [[ 0 -gt 0 ]]
+ [[ 0 == 0 ]]
+ echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt'
+ [[ true == true ]]
No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
There was at least one error in a job status logfile. Will exit now!
+ echo 'There was at least one error in a job status logfile. Will exit now!'
+ exit 5
mv: cannot stat `/mnt/datastore/resultdata/*': No such file or directory
Result directory listing is:
+ gosu root chmod -R a+wrx /var/spool/cwl
Error while running job: Error collecting output for parameter 'germline_indel_vcf_gz': Did not find output file with glob pattern: '['*.germline.indel.vcf.gz']'
[job temp5679700718223668526.cwl] completed permanentFail
Final process status is permanentFail
Workflow error, try again with --debug for more information:
Process status is ['permanentFail']
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
        at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
        at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:48)
        at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:200)
        at java.lang.Thread.run(Thread.java:745)
java.lang.RuntimeException: problems running command: cwltool --enable-dev --non-strict --outdir /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/outputs/ --tmpdir-prefix /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/working/ /tmp/1490197113216-0/temp5679700718223668526.cwl /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/workflow_params.json

Any idea what went wrong?

Thank you,
George
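(A condensed sketch of the job-state check the trace above performs, with the log path taken from the trace; the started-minus-successful arithmetic mirrors what the traced script does, but the real test script may differ:)

# Count started vs. successful jobs in one Roddy job-state log
log=/mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
started=$(grep -v null: "$log" | grep -c :STARTED:)
successful=$(grep -v null: "$log" | grep -c :0:)
echo "started=$started successful=$successful erroneous=$((started - successful))"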
From: Jonas Demeulemeester
Date: Monday, March 20, 2017 at 3:13 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Do you have the DKFZ workflow dependencies tarball in place (and named correctly)? That's the file it's clearly not finding:

17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

You can find the link to this reference tarball on the DKFZ pipeline GitHub page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows)

Hope this helps,
Jonas

On 20 Mar 2017, at 17:19, George Mihaiescu wrote:

Hi,

How do I run the DKFZ workflow?
I first ran DELLY, which ended with the following output:

Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz
[##################################################] 100%
Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz

After that, I tried to run the DKFZ but it errors as below:

root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695
Running:
cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0 quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json
WARNING: You're currently running as root; probably by accident.
Press control-C to abort or Enter to continue as root.
Set DOCKSTORE_ROOT to disable this warning.
Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2
Provisioning your input files to your local machine
Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65
Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246
Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f
17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
        at java.nio.file.Files.createLink(Files.java:1086)
        at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273)
        at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726)
        at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688)
        at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659)
        at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586)
        at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185)
        at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028)
        at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968)
        at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951)
        at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935)
        at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247)
        at io.dockstore.client.cli.Client.run(Client.java:704)
        at io.dockstore.client.cli.Client.main(Client.java:796)
java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet.

Thank you,
George
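(The NoSuchFileException above comes down to the source tarball not being present where the launcher expects it, which is exactly what Jonas suggests checking. A quick sanity check before launching; the path is taken from the log above, not from any Dockstore documentation:)

# Confirm the DKFZ dependencies tarball exists at the expected path
f=/root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz
ls -lh "$f"
# A truncated download is a common culprit; a full tar listing should run cleanly
tar -tzf "$f" > /dev/null && echo "tarball OK"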
From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline.

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

Hi Miguel,

I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it will still take a long time. My scripts run one workflow after another.

Because I'm running in Collaboratory, I've changed the "get_icgc_donor.sh" script to use a Docker container that has the ICGC client inside and pull the data from Collaboratory. There is no "bam.bas" file downloaded, just ".bam" and ".bam.bai" files; I'm not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the Docker image, so I think you won't have a problem.

By looking at "bin/compare_result_type.sh" it looks like it's using the GNOS client to pull down the existing VCF files for comparison purposes, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, related to copy number I believe.

I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash-keys for the different files in etc/donor_files.csv. It's a bit tedious right now: you need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards

Miguel

George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part; if you are not using them, they might still help you clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple, since all the files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test: DO52140 is one that Jonas and I have both tested, so it could be interesting to get a third opinion. Also, any other donor could be interesting, to see if something new comes up. I'm not sure which option is best.

Miguel
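(A minimal sketch of the overlap count described above, assuming plain bgzipped VCFs; the file names are placeholders, and compare_result_type.sh may differ in detail:)

# Reduce each VCF to chrom:pos:ref:alt keys, then count the set overlaps
zcat run.vcf.gz  | grep -v '^#' | awk -F'\t' '{print $1":"$2":"$4":"$5}' | sort -u > run.keys
zcat gnos.vcf.gz | grep -v '^#' | awk -F'\t' '{print $1":"$2":"$4":"$5}' | sort -u > gnos.keys
echo "Common:  $(comm -12 run.keys gnos.keys | wc -l)"
echo "Extra:   $(comm -23 run.keys gnos.keys | wc -l)"
echo "Missing: $(comm -13 run.keys gnos.keys | wc -l)"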
On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".

I just started a second run on a different VM on the same donor, just to compare run times. The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk; I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here!

Do you have any updates on the latest testing? Please feel free to update the wiki: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,

George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access, and the scripts will take care of downloading the BAM files, running the workflows and evaluating the results.

The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards

Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang wrote:

Dear Docktesters,

George Mihaiescu, cloud architect of the Collaboratory at OICR, plans to run some bioinformatics workflows to test the Collab environment.

Just thought this is a good opportunity to use as extra help for testing out the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?
Thanks,
Junjun

From: Denis Yuen
Date: Wednesday, March 1, 2017 at 10:26 AM
To: "docktesters at lists.icgc.org"
Subject: [DOCKTESTERS] Thanks!

Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up to date.

https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data

As we work on new versions or on debugging, it is invaluable to know which versions of the workflows have worked outside OICR. Thanks!

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmihaiescu-DockerizedPCAWGworkflows-030417-1131-24.pdf
Type: application/pdf
Size: 620647 bytes

From mikisvaz at gmail.com  Mon Apr  3 11:54:48 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 3 Apr 2017 17:54:48 +0200
Subject: [DOCKTESTERS] Thanks!

Excellent George. Thank you!

Miguel

On Apr 3, 2017 5:42 PM, "George Mihaiescu" wrote:

> Hi,
>
> The DKFZ workflow finished in one day and there were no differences
> observed by the compare_result script.
>
> I attached the updated report for all four workflow runs.
>
> Thank you,
> George
From miguel.vazquez at cnio.es  Thu Apr  6 10:28:21 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Thu, 6 Apr 2017 16:28:21 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes. It is currently running over the tumor BAM, but the normal BAM has completed and the *mismatches are two orders of magnitude lower* than in our two previous attempts. Before further discussion, here are the raw numbers:

Lines: 1125172217
Matches: 1083221794
*Misses: 143716*
Soft: 41806707

If my calculations are correct, this means 96.3% matches, *0.013% miss-matches*, and 3.7% soft-matches.

The fix had two parts. The first was realizing that the input to this process should not be a single unaligned version of the output BAM, but several input BAMs. Breaking down the output BAM into its constituent BAMs, by a process implemented by Jonas, did not address the problem, unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matters, so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix: he located a subset of the original unaligned BAMs at the DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as specified in the output BAM header achieved these promising results.

I will reply to this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards

Miguel

On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez wrote:

> Dear all,
>
> Great news! The BWA-Mem test on a real PCAWG donor succeeded in running,
> achieving an overlap with the original BAM alignment similar to the
> HCC1143 test. The numbers are:
>
> Lines: 1708047647
> Matches: 1589172843
> Misses: 62726130
> Soft: 56148674
>
> Which means 93% matches, 3.6% miss-matches, and 3.2% soft-matches.
> Compared to the HCC1143 test, a few percentage points of matches turn into
> soft-matches (95% and 1.3% versus 93% and 3.2%), but the ratio of misses
> is very close, at 3.6%.
>
> I'm running this test on a second donor.
>
> Best regards
>
> Miguel
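(Spot-checking the reported DO51057 normal-BAM percentages against the raw counts at the top of this message; a quick one-off, not part of the test scripts:)

# 1083221794 matches, 143716 misses, 41806707 soft, out of 1125172217 lines
awk 'BEGIN { n=1125172217;
  printf "matches=%.1f%% misses=%.3f%% soft=%.1f%%\n",
         100*1083221794/n, 100*143716/n, 100*41806707/n }'
# prints: matches=96.3% misses=0.013% soft=3.7%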
> On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez wrote:
>
>> Dear colleagues,
>>
>> I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143
>> data.
>>
>> I think what solved the problem was setting the headers on the unaligned
>> BAM files. I'm currently trying it out with the DO35937 donor, but it's
>> too early to say whether it's working or not.
>>
>> To compare BAM files I've followed some advice that I found on the
>> internet (https://www.biostars.org/p/166221/). I will detail it a bit
>> below, because I would like some advice as to how appropriate the
>> approach is, but first here are the numbers:
>>
>> *Lines*: 74264390
>> *Matches*: 70565742
>> *Misses*: 2693687
>> *Soft*: 1004961
>>
>> Which means *95% matches, 3.6% miss-matches, and 1.3% soft-matches*.
>> Matches are when the chromosome and position are the same; soft-matches
>> are when they are not the same, but the position from one of the
>> alignments is included in the list of alternative positions for the other
>> alignment (e.g. XA:Z:15,-102516528,76M,0); misses are the rest.
>>
>> Here is the detailed process from the start. The comparison script is
>> here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh
>>
>> 1) Un-align the tumor and normal BAM files, retaining the original
>> aligned BAM files
>> 2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam
>> with alignments from both tumor and normal
>> 3) Use samtools to extract the entries, limited to the first in pair (?),
>> cut the read-name, chromosome, position (??) and extra information (for
>> additional alignments), and sort them. We do this for the original files
>> and for the BWA-Mem merged_output file, but separating tumor and normal
>> entries (marked with the codes 'tumor' and 'normal', I believe from the
>> headers I set when un-aligning them)
>> 4) Join the lines by read-name, separately for the tumor and normal pairs
>> of files, and check for matches
>>
>> I have two questions:
>> (?) Is it OK to select only the first in pair? It's what the guy in the
>> example did, and it did simplify the code by avoiding repeated read-names
>> (??) I guess it's OK to only check chromosome and position; the CIGAR
>> would necessarily be the same.
>>
>> Best regards
>>
>> Miguel
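(A minimal sketch of steps 3 and 4 above, checking chromosome and position only; the flag and field numbers follow the SAM spec, and the real compare_bwa_bam.sh may differ:)

# Step 3: keep first-in-pair reads (flag 0x40); take read name, chrom, pos
samtools view -f 64 original.bam      | cut -f 1,3,4 | sort -k1,1 > original.pos
samtools view -f 64 merged_output.bam | cut -f 1,3,4 | sort -k1,1 > new.pos
# Step 4: join on read name and count reads whose chrom and pos agree
join original.pos new.pos | awk '$2==$4 && $3==$5 {m++} END {print "Matches:", m, "of", NR}'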
>> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez wrote:
>>
>>> Dear all,
>>>
>>> Let me summarize the status of the testing for Sanger and DKFZ. The
>>> validation has been run for two donors for each workflow: DO50311 and
>>> DO52140.
>>>
>>> Sanger:
>>> ----------
>>>
>>> Sanger calls only somatic variants. The results are *identical for
>>> Indels and SVs* but *almost identical for SNV.MNV and CNV*. The
>>> discrepancies are reproducible (on the same machine at least), i.e. the
>>> same ones are found after running the workflow a second time.
>>>
>>> DKFZ:
>>> ---------
>>> DKFZ calls somatic and germline variants, except germline CNVs. For both
>>> germline and somatic variants the results are *identical for SNV.MNV and
>>> Indels* but show *large discrepancies for SV and CNV*.
>>>
>>> Kortine Kleinheinz and Joachim Weischenfeldt are in the process of
>>> investigating this issue, I believe.
>>>
>>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas
>>> Demeulemeester. Denis, I believe, is investigating this problem further.
>>> I haven't had the chance to investigate it much myself.
>>>
>>> Best
>>>
>>> Miguel
>>>
>>> ---------------------
>>> RESULTS
>>> ---------------------
>>>
>>> ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt
>>>
>>> Comparison of somatic.snv.mnv for DO50311 using DKFZ
>>> ---
>>> Common: 51087
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.indel for DO50311 using DKFZ
>>> ---
>>> Common: 26469
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.sv for DO50311 using DKFZ
>>> ---
>>> Common: 231
>>> Extra: 44
>>>  - Example: 10:20596800:N:,10:56066821:N:,11:16776092:N:
>>> Missing: 48
>>>  - Example: 10:119704959:N:,10:13116322:N:,10:47063485:N:
>>>
>>> Comparison of somatic.cnv for DO50311 using DKFZ
>>> ---
>>> Common: 731
>>> Extra: 213
>>>  - Example: 10:132510034:N:,10:20596801:N:,10:47674883:N:
>>> Missing: 190
>>>  - Example: 10:100891940:N:,10:104975905:N:,10:119704960:N:
>>>
>>> Comparison of germline.snv.mnv for DO50311 using DKFZ
>>> ---
>>> Common: 3850992
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of germline.indel for DO50311 using DKFZ
>>> ---
>>> Common: 709060
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of germline.sv for DO50311 using DKFZ
>>> ---
>>> Common: 1393
>>> Extra: 231
>>>  - Example: 10:134319313:N:,10:134948976:N:,10:19996638:N:
>>> Missing: 615
>>>  - Example: 10:101851839:N:,10:101851884:N:,10:10745225:N:
>>>
>>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz
>>>
>>> Comparison of somatic.snv.mnv for DO52140 using DKFZ
>>> ---
>>> Common: 37160
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.indel for DO52140 using DKFZ
>>> ---
>>> Common: 19347
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.sv for DO52140 using DKFZ
>>> ---
>>> Common: 72
>>> Extra: 23
>>>  - Example: 10:132840774:N:,11:38252019:N:,11:47700673:N:
>>> Missing: 61
>>>  - Example: 10:134749140:N:,11:179191:N:,11:38252005:N:
>>>
>>> Comparison of somatic.cnv for DO52140 using DKFZ
>>> ---
>>> Common: 275
>>> Extra: 94
>>>  - Example: 1:106505931:N:,1:109068899:N:,1:109359995:N:
>>> Missing: 286
>>>  - Example: 10:88653561:N:,11:179192:N:,11:38252006:N:
>>>
>>> Comparison of germline.snv.mnv for DO52140 using DKFZ
>>> ---
>>> Common: 3833896
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of germline.indel for DO52140 using DKFZ
>>> ---
>>> Common: 706572
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of germline.sv for DO52140 using DKFZ
>>> ---
>>> Common: 1108
>>> Extra: 1116
>>>  - Example: 10:102158308:N:,10:104645247:N:,10:105097522:N:
>>> Missing: 2908
>>>  - Example: 10:100107032:N:,10:100107151:N:,10:102158345:N:
>>>
>>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz
>>>
>>> Comparison of somatic.snv.mnv for DO50311 using Sanger
>>> ---
>>> Common: 156299
>>> Extra: 1
>>>  - Example: Y:58885197:A:G
>>> Missing: 14
>>>  - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C
>>>
>>> Comparison of somatic.indel for DO50311 using Sanger
>>> ---
>>> Common: 812487
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.sv for DO50311 using Sanger
>>> ---
>>> Common: 260
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.cnv for DO50311 using Sanger
>>> ---
>>> Common: 138
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.snv.mnv for DO52140 using Sanger
>>> ---
>>> Common: 87234
>>> Extra: 5
>>>  - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
>>> Missing: 7
>>>  - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A
>>>
>>> Comparison of somatic.indel for DO52140 using Sanger
>>> ---
>>> Common: 803986
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.sv for DO52140 using Sanger
>>> ---
>>> Common: 6
>>> Extra: 0
>>> Missing: 0
>>>
>>> Comparison of somatic.cnv for DO52140 using Sanger
>>> ---
>>> Common: 36
>>> Extra: 0
>>> Missing: 2
>>>  - Example: 10:11767915:T:,10:11779907:G:

From lincoln.stein at gmail.com  Thu Apr  6 10:55:42 2017
From: lincoln.stein at gmail.com (Lincoln Stein)
Date: Thu, 6 Apr 2017 10:55:42 -0400
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Hi Miguel,

Sounds like a significant achievement! But remind me what a "soft match" is?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez wrote:

> Dear all,
>
> This is just an advance teaser for the BWA-Mem validation after the latest
> changes, it is currently running over the tumor BAM, but the normal BAM
> has completed and the *missmatches are two orders of magnitude lower*
> than in our two previous attempts.
I'm currently trying it out with the DO35937 donor, but its too >>> early to say if its working or not. >>> >>> To compare BAM files I've followed some advice that I found on the >>> internet https://www.biostars.org/p/166221/. I will detail them a bit >>> below because I would like some advice as to how appropriate the approach >>> is, but first here are the numbers: >>> >>> *Lines*: 74264390 >>> *Matches*: 70565742 >>> *Misses*: 2693687 >>> *Soft*: 1004961 >>> >>> >>> Which means *95% matches, 3.6% miss-matches, and 1.3% soft-matches*. >>> Matches are when the chromosome and position are the same, soft-matches are >>> when they are not the same but the position from one of the alignments is >>> included in the list of alternative positions for the other alignment (e.g >>> XA:Z:15,-102516528,76M,0), and misses are the rest. >>> >>> Here is the detailed process from the start. The comparison script is >>> here https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi >>> n/compare_bwa_bam.sh >>> >>> 1) Un-align tumor and normal BAM files, retaining the original aligned >>> BAM files >>> 2) Run BWA-Mem wich produces a file called HCC1143.merged_output.bam >>> with alignments from both tumor and normal >>> 3) use samtools to extract the entries, limited for the first in pair >>> (?), cut the read-name, chromosome, position (??) and extra information >>> (for additional alignments) and sort them. We do this for the original >>> files and for the BWA-Mem merged_output file, but separating tumor and >>> normal entries (marked with the codes 'tumor' and 'normal', I believe from >>> the headers I set when un-aligning them) >>> 4) join the lines by read-name, separately for the tumor and normal >>> pairs of files, and check for matches >>> >>> I've two questions: >>> (?) Is it OK to select only the first in pair, its what the guy in the >>> example did, and it did simplify the code without repeated read-names >>> (??) I guess its OK to only check chromosome and position, the cigar >>> would be necessarily the same. >>> >>> Best regards >>> >>> Miguel >>> >>> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez >>> wrote: >>> >>>> Dear all, >>>> >>>> Let me summarize the status of the testing for Sanger and DKFZ. The >>>> validation has been run for two donors for each workflow: DO50311 DO52140 >>>> >>>> Sanger: >>>> ---------- >>>> >>>> Sanger call only somatic variants. The results are *identical for >>>> Indels and SVs* but *almost identical for SNV.MNV and CNV*. The >>>> discrepancies are reproducible (on the same machine at least), i.e. the >>>> same are found after running the workflow a second time. >>>> >>>> DKFZ: >>>> --------- >>>> DKFZ cals somatic and germline variants, except germline CNVs. For both >>>> germline and somatic variants the results are *identical for SNV.MNV >>>> and Indels* but with *large discrepancies for SV and CNV*. >>>> >>>> Kortine Kleinheinz and Joachim Weischenfeldt are in the process of >>>> investigating this issue I believe. >>>> >>>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas >>>> Demeulemeester. Denis I believe is investigating this problem further. I >>>> haven't had the chance to investigate this much myself. 
>>>
>>> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez wrote:
>>>
>>>> Dear all,
>>>>
>>>> Let me summarize the status of the testing for Sanger and DKFZ. The
>>>> validation has been run for two donors for each workflow: DO50311 and
>>>> DO52140.
>>>>
>>>> Sanger:
>>>> ----------
>>>> Sanger calls only somatic variants. The results are *identical for
>>>> Indels and SVs* but *almost identical for SNV.MNV and CNV*. The
>>>> discrepancies are reproducible (on the same machine at least), i.e. the
>>>> same ones are found after running the workflow a second time.
>>>>
>>>> DKFZ:
>>>> ---------
>>>> DKFZ calls somatic and germline variants, except germline CNVs. For
>>>> both germline and somatic variants the results are *identical for
>>>> SNV.MNV and Indels* but show *large discrepancies for SV and CNV*.
>>>>
>>>> Kortine Kleinheinz and Joachim Weischenfeldt are in the process of
>>>> investigating this issue, I believe.
>>>>
>>>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas
>>>> Demeulemeester. Denis, I believe, is investigating this problem further.
>>>> I haven't had the chance to investigate this much myself.
>>>>
>>>> Best
>>>>
>>>> Miguel
>>>>
>>>> [results listing trimmed; it is quoted in full earlier in the thread]
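The Common/Extra/Missing counts in these reports amount to a set comparison of variant keys (chr:pos:ref:alt). A minimal sketch of that idea, with illustrative file names — a reconstruction, not the actual compare_result script:

    vcf_keys () { zcat "$1" | grep -v '^#' | awk '{ print $1 ":" $2 ":" $4 ":" $5 }' | sort -u; }
    comm -12 <(vcf_keys original.vcf.gz) <(vcf_keys rerun.vcf.gz) | wc -l   # Common
    comm -13 <(vcf_keys original.vcf.gz) <(vcf_keys rerun.vcf.gz) | wc -l   # Extra (rerun only)
    comm -23 <(vcf_keys original.vcf.gz) <(vcf_keys rerun.vcf.gz) | wc -l   # Missing (original only)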
--
*Lincoln Stein*

Scientific Director (Interim), Ontario Institute for Cancer Research
Director, Informatics and Bio-computing Program, OICR
Senior Principal Investigator, OICR
Professor, Department of Molecular Genetics, University of Toronto

*Ontario Institute for Cancer Research*
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Tel: 416-673-8514
Mobile: 416-817-8240
Email: lincoln.stein at gmail.com
Toll-free: 1-866-678-6427
Twitter: @OICR_news

*Executive Assistant*
*Melisa Torres*
Tel: 647-259-4253
Email: melisa.torres at oicr.on.ca
www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From miguel.vazquez at cnio.es Thu Apr 6 11:36:25 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Thu, 6 Apr 2017 17:36:25 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches
In-Reply-To:
References:
Message-ID:

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same as the one in the original BAM, but is included in the list of alternative alignments for that read.

For instance, the original BAM aligns a read to chr 1 pos 1000, but also admits that it could be aligned at chr 2 pos 2000 or chr 3 pos 3000; the new BAM aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It can also work the other way around: the original position is included in the list of alternative positions of the new BAM.

I hope this was clear.

Best regards

Miguel

On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein wrote:

> Hi Miguel,
>
> Sounds like a significant achievement! But remind me what a "soft match"
> is?
>
> Lincoln
>
> [earlier quoted messages trimmed; see above]
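A sketch of this rule as a classifier, assuming the tab-separated name/chr/pos/XA files from the extraction sketch above (illustrative, and approximate in how it looks up positions inside the XA string):

    join original.pos.txt new.pos.txt | awk '
      { oxa = $4; nxa = $7               # XA:Z: lists, e.g. XA:Z:15,-102516528,76M,0;
        gsub(/[+-]/, "", oxa)            # drop strand signs before the lookup
        gsub(/[+-]/, "", nxa)
        if ($2 == $5 && $3 == $6)    m++ # same chromosome and position: match
        else if (index(nxa, $2 "," $3 ",") || index(oxa, $5 "," $6 ","))
                                     s++ # position listed among alternatives: soft-match
        else                         x++ # everything else: miss
      }
      END { print "Matches:", m, "Soft:", s, "Misses:", x }'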
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Junjun.Zhang at oicr.on.ca Sun Apr 9 10:47:58 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Sun, 9 Apr 2017 14:47:58 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches
In-Reply-To:
References:
Message-ID:

Hi Miguel,

This is indeed good news; the mismatch rate is significantly lower.

Regarding soft matches, thanks for the explanation. I wonder whether they have an impact (and how much) on variant calls: do variant callers take into account the information that a read may map to multiple places? Do they make adjustments at the time of variant calling? I guess these are questions for the variant caller authors.

Thanks,
Junjun

On Thursday, April 6, 2017 at 11:36 AM, Miguel Vazquez wrote:

[quoted messages trimmed; see above]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Jonas.Demeulemeester at crick.ac.uk Mon Apr 10 07:14:50 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Mon, 10 Apr 2017 11:14:50 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches
In-Reply-To:
References:
Message-ID: <5F9677E4-895B-4ACB-824F-9DF4E52DFB00@crick.ac.uk>

Hi all,

I'm currently running the comparison of the BWA-Mem docker-reproduced BAMs and the PCAWG ones for DO51057. I should be able to send a report some time today.

Miguel, looking at your code, I believe you're feeding the unaligned BAMs into the pipeline in the order given by the read group lines (@RG) in the header of the PCAWG BAM. I'm using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG BAM, which is often different, for whatever reason. I'm not entirely sure which one is correct, but I'm guessing the one in the @CL/@PG lines is the actual one, as it chronologically reiterates the whole procedure ( [align - sort] x N, followed by merge + flag dups ). If this is the case, the true % mismatches may be lower still than 0.013%; if not, then I should see a higher mismatch rate and the 0.013% is due to something else still.

Regarding the soft-matches, I agree with Junjun: we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.

Best,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London
NW1 1AT

T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 9 Apr 2017, at 15:47, Junjun Zhang wrote:

[quoted messages trimmed; see above]
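Both orderings Jonas mentions can be read straight from the BAM header (paths illustrative):

    # lane order as listed in the @RG (read group) header lines
    samtools view -H PCAWG.bam | awk -F'\t' '/^@RG/'
    # order in which the lanes were actually aligned, from the CL (command
    # line) fields of the @PG (program) header lines
    samtools view -H PCAWG.bam \
      | awk -F'\t' '/^@PG/ { for (i = 1; i <= NF; i++) if ($i ~ /^CL:/) print $i }'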
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From miguel.vazquez at cnio.es Mon Apr 10 08:15:00 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Mon, 10 Apr 2017 14:15:00 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
Message-ID:

Hi all,

The comparison with the *tumor BAM* for DO51057 has completed, with *rates of mismatches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM*. These *numbers are not definitive*, since, as you can read from Jonas just below, *there might still be a discrepancy* in the order in which the BAMs were processed. We'll soon know from Jonas whether a different order improves these rates even further.

*Lines*: 1010685786
*Matches*: 963319037
*Misses*: 442926
*Soft*: 46923823

Best regards

Miguel

On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester wrote:

> Hi all,
>
> I'm currently running the comparison of the BWA-Mem docker-reproduced BAMs
> and the PCAWG ones for DO51057. I should be able to send a report some
> time today.
>
> [rest of quoted message trimmed; see Jonas's full message above]
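As a quick sanity check, the reported rates follow directly from the raw counts (the tumor miss rate prints as 0.044% here; the 0.043% above is the same quantity rounded differently):

    awk 'BEGIN {
      n = 1125172217   # normal BAM: total lines compared
      printf "normal: %.1f%% matches, %.3f%% misses, %.1f%% soft\n",
             100 * 1083221794 / n, 100 * 143716 / n, 100 * 41806707 / n
      t = 1010685786   # tumor BAM: total lines compared
      printf "tumor:  %.1f%% matches, %.3f%% misses, %.2f%% soft\n",
             100 * 963319037 / t, 100 * 442926 / t, 100 * 46923823 / t
    }'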
>>
>> _______________________________________________
>> docktesters mailing list
>> docktesters at lists.icgc.org
>> https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Denis.Yuen at oicr.on.ca Mon Apr 10 10:07:50 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Mon, 10 Apr 2017 14:07:50 +0000
Subject: [DOCKTESTERS] sv-merge parameters
Message-ID:

Hi,

Miguel, following up on the Monday call, I'm aware of the following sources of information about the parameters for sv-merge:

1) The readme, which lists four file parameters plus a run-id: https://bitbucket.org/weischenfeldt/pcawg_sv_merge/overview
2) A small test parameter file: https://bitbucket.org/weischenfeldt/pcawg_sv_merge/src/112c7a2302647af1194f745b66d8755e71cf9041/Dockstore.json?at=docker&fileviewer=file-view-default
3) A larger test parameter file: https://www.synapse.org/#!Synapse:syn8547037

Etsehiwot (CC'ed) should be able to shed further light on these parameters and what they mean as well.

Thanks!

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From George.Mihaiescu at oicr.on.ca Mon Apr 10 12:11:55 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 10 Apr 2017 16:11:55 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
Message-ID:

Hi,

I would like to run the BWA-Mem dockerized workflow in the Collaboratory environment, but I need some help in order to do this:

* A ready-to-run script or instructions
* The input files: a single file or multiple files, whatever the script needs as input
* The donor ID, preferably the same donor that was already used, in order to prove the reproducibility of the results

I can start the workflow on a large VM in order to speed up the result.

Also, I'm currently running the DKFZ workflow on DO50398 because I've already run Sanger on it, and I want to compare the run times for the two workflows on the same data set.

Thank you,
George

From: Miguel Vazquez
Date: Wednesday, March 22, 2017 at 2:08 PM
To: Jonas Demeulemeester
Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

Thanks Jonas for this information.
I hope that someone here can provide us with some suggestions on what to try next. Perhaps the version issue that Jonas points out is the key.

I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches.

Best regards

Miguel

On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk> wrote:

Hi all,

A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately. Doing the comparison, however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this.

Is there a list available somewhere detailing which samples were mapped using which versions? That way we could select a relevant test sample without having to sort through the headers of all the different bams.

Best wishes,
Jonas

newly aligned:

ID                                    flag  chr         pos
HS2000-1012_275:7:1101:17411:15403     99   3           112743126
HS2000-1012_275:7:1101:17411:15403    147   3           112743376
HS2000-1012_275:7:1101:11883:83640     99   16          28672999
HS2000-1012_275:7:1101:11883:83640    147   16          28673223
HS2000-1012_275:7:1101:16576:28476    163   GL000238.1  21309
HS2000-1012_275:7:1101:16576:28476     83   GL000238.1  21664

vs the original:

ID                                    flag  chr         pos
HS2000-1012_275:7:1101:17411:15403     99   8           54944243
HS2000-1012_275:7:1101:17411:15403    147   8           54944493
HS2000-1012_275:7:1101:11883:83640    163   16          28464362
HS2000-1012_275:7:1101:11883:83640     83   16          28464586
HS2000-1012_275:7:1101:16576:28476     99   12          6124549
HS2000-1012_275:7:1101:16576:28476    147   12          6124903

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Jonas.Demeulemeester at crick.ac.uk Mon Apr 10 12:33:16 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Mon, 10 Apr 2017 16:33:16 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References:
Message-ID:

Hi George,

We should be able to provide you with a final answer on this by tomorrow at the latest :) Would that be OK for you?

You could already download the relevant unaligned bam files, as this will take some time anyway. Miguel and I have been running on donor DO51057 and you can find GNOS IDs for the unaligned files in the attached JSON file. Get all of the bams defined under the "unaligned_bams" headers (there will be 2 of these headers in the JSON, one for the normal and one for the tumor).

The BWA-Mem docker is internally set to run on 8 cores, so using a large VM is unlikely to affect run-time.

Best wishes,
Jonas
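For pulling those entries out programmatically, a minimal sketch; it assumes the attached DO51057.json.gz really does carry the bam records under literal "unaligned_bams" keys (the exact JSON layout is an assumption):

# Hedged sketch: list whatever sits under an "unaligned_bams" key,
# wherever that key appears in the (assumed) JSON structure.
gunzip -c DO51057.json.gz | jq '.. | objects | select(has("unaligned_bams")) | .unaligned_bams'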
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DO51057.json.gz
Type: application/x-gzip
Size: 33031 bytes
Desc: DO51057.json.gz
URL: 

From mikisvaz at gmail.com Mon Apr 10 12:34:43 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 10 Apr 2017 18:34:43 +0200
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References:
Message-ID:

Hi George,

You can find my scripts as usual at https://github.com/mikisvaz/PCAWG-Docker-Test. It now includes the json files that Junjun sent us for three donors, and a script to download the unaligned BAMs, run the test, and compare the result. Try:

1 - update the repo
2 - bin/download_unaligned.sh DO51057
3 - bin/run_bwa_test.sh DO51057
4 - bin/compare_bwa_bam.sh tests/BWA-Mem/DO51057/normal/output/DO51057.[TYPE].merged_output.bam data/DO51057/normal.bam
5 - bin/compare_bwa_bam.sh tests/BWA-Mem/DO51057/tumor/output/DO51057.[TYPE].merged_output.bam data/DO51057/tumor.bam

Jonas has a slightly different set of scripts that might also be a bit more correct, so perhaps you can wait for his input.

Best

Miguel
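Miguel's steps 2 through 5 chain naturally into a small driver; a sketch under the assumption that it runs from the repo root and that the repo scripts take the donor ID and paths exactly as listed above (the [TYPE] token is passed through literally for compare_bwa_bam.sh to expand):

#!/bin/bash
# Hedged sketch: Miguel's download / run / compare sequence for one donor.
set -e
DONOR=${1:-DO51057}
git pull                              # step 1: update the repo
bin/download_unaligned.sh "$DONOR"    # step 2: fetch the unaligned BAMs
bin/run_bwa_test.sh "$DONOR"          # step 3: run the dockerized BWA-Mem test
for type in normal tumor; do          # steps 4 and 5: compare both new BAMs to the originals
  bin/compare_bwa_bam.sh \
    "tests/BWA-Mem/$DONOR/$type/output/$DONOR.[TYPE].merged_output.bam" \
    "data/$DONOR/$type.bam"
done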
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From George.Mihaiescu at oicr.on.ca Mon Apr 10 13:32:57 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 10 Apr 2017 17:32:57 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
Message-ID:

Thank you for the help, guys.

I'll download the files using Miguel's "bin/download_unaligned.sh" script for now, and use Jonas's updated BWA-Mem script tomorrow.

Cheers,
George

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Jonas.Demeulemeester at crick.ac.uk Tue Apr 11 05:43:00 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Tue, 11 Apr 2017 09:43:00 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057: normal) 0.013% mismatches and 3.7% soft-matches, tumor) 0.043% mismatches and 4.64% soft-matches
In-Reply-To:
References:
Message-ID:

Hi all,

I've completed the testing run of the BWA-Mem docker on PCAWG donor DO51057.
Briefly, like Miguel's run, this test used the original unaligned bam files for DO51057, but feeds them via the JSON file into the docker in a slightly different order (as recorded in the @PG/@CL lines in the original mapped PCAWG bams). Results of the comparison are as follows:

Matched normal:
Lines: 1125172217
Matches: 1083221794
Misses: 143668
Soft: 41806755

Tumor:
Lines: 1010685786
Matches: 963319037
Misses: 442902
Soft: 46923847

These are exactly the numbers reported by Miguel (resulting in 0.013% (normal) and 0.043% (tumor) mismatch rates). The fact that the numbers match exactly comes as a bit of a surprise, I think, but it shows that the current pipeline is highly reproducible, even across platforms.

@Miguel, could you verify the order of mapping of the different read groups in the header of your newly mapped bams? For the original and newly mapped normal bams, the order recorded in the @CL lines is CPCG_0098_Ly_R_PE_517_WG.3 - 6 - 1 - 4 - 2 - 5. For the tumor bams the order is CPCG_0098_Pr_P_PE_500_WG.5 - 3 - 6 - 2 - 4 - 1. Do you observe the same or a different order? If it's the same, then the pipeline does some internal reordering and the order of records in the JSON doesn't matter. If not, then the order of the bams doesn't seem to matter as much (at least in this case), but maybe rather the order of reads within the bams does (as evidenced by our high error rates previously).

Looking forward to hearing your thoughts on this!
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 10 Apr 2017, at 13:15, Miguel Vazquez wrote:

Hi all,

The comparison with the tumor BAM for DO51057 has completed, with rates of mismatches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. These numbers are not definitive since, as you can read from Jonas just below, there might still be a discrepancy in the order in which the BAMs were processed. We'll soon know from Jonas whether a different order improves these rates even more.

Lines: 1010685786
Matches: 963319037
Misses: 442926
Soft: 46923823

Best regards

Miguel

On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk> wrote:

Hi all,

I'm currently running the comparison of the BWA-Mem docker reproduced bams and the PCAWG ones for DO51057. I should be able to send a report some time today.

Miguel, looking at your code, I believe you're feeding the unaligned bams into the pipeline in the order given by the read group (@RG) lines in the header of the PCAWG bam. I'm using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG bam, which is often different, for whatever reason. I'm not entirely sure which one is correct, but I'm guessing the one in the @CL/@PG lines is the actual one, as it chronologically reiterates the whole procedure ( [align - sort] x N followed by merge + flag dups ). If this is the case, the true % mismatches may be lower still than 0.013%; if not, then I should see a higher mismatch rate and the 0.013% is due to something else still.

Regarding the soft-matches, I agree with Junjun, we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.
Best,
Jonas

On 9 Apr 2017, at 15:47, Junjun Zhang wrote:

Hi Miguel,

This is indeed good news; the mismatch rate is significantly lower.

Regarding soft matches, thanks for the explanation. I wonder whether it has an impact (or how much impact) on variant calls: do variant callers take into account the information that a read may map to multiple places? Do they make adjustments at the time of variant calling? I guess these are questions for the variant caller authors.

Thanks,
Junjun

From: on behalf of Miguel Vazquez
Date: Thursday, April 6, 2017 at 11:36 AM
To: Lincoln Stein
Cc: Francis Ouellette, Keiran Raine, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% mismatches, and 3.7% soft-matches

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same as the one in the original BAM, but is included in the list of alternative alignments for that read.

For instance, the original bam aligns a read to chr 1 pos 1000, but also admits that it could be aligned at chr 2 pos 2000 or chr 3 pos 3000; the new bam aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way: the original position is included in the list of alternative positions of the new BAM.

I hope this was clear.

Best regards

Miguel

On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein <lincoln.stein at gmail.com> wrote:

Hi Miguel,

Sounds like a significant achievement! But remind me what a "soft match" is?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez wrote:

Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes. It is currently running over the tumor BAM, but the normal BAM has completed and the *mismatches are two orders of magnitude lower* than in our two previous attempts. Before further discussion, here are the raw numbers:

Lines: 1125172217
Matches: 1083221794
*Misses: 143716*
Soft: 41806707

If my calculations are correct, this means 96.3% matches, *0.013% mismatches*, and 3.7% soft-matches.

The fix was two-part. First, realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output bam into its constituent BAMs, by a process implemented by Jonas, did not address the problem, unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matters, and so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix: he located a subset of original unaligned BAMs at the DKFZ that we could use.
Downloading these BAM files and submitting them to BWA-Mem in the same order as was specified in the output BAM header achieved these promising results.

I will reply to this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards

Miguel

On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:

Dear all,

Great news! The BWA-Mem test on a real PCAWG donor succeeded in running, achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are:

Lines: 1708047647
Matches: 1589172843
Misses: 62726130
Soft: 56148674

Which means 93% matches, 3.6% mismatches, and 3.2% soft-matches. Compared to the HCC1143 test, a few percentage points of matches turn into soft-matches (95% and 1.3% vs 93% and 3.2%), but the ratio of misses is very close: 3.6%.

I'm running this test on a second donor.

Best regards

Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:

Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data.

I think what solved the problem was setting the headers on the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but it's too early to say whether it's working or not.

To compare BAM files I've followed some advice that I found on the internet: https://www.biostars.org/p/166221/. I will detail it a bit below, because I would like some advice as to how appropriate the approach is, but first here are the numbers:

*Lines*: 74264390
*Matches*: 70565742
*Misses*: 2693687
*Soft*: 1004961

Which means *95% matches, 3.6% mismatches, and 1.3% soft-matches*. Matches are when the chromosome and position are the same; soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g. XA:Z:15,-102516528,76M,0); and misses are the rest.

Here is the detailed process from the start. The comparison script is here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh

1) Un-align tumor and normal BAM files, retaining the original aligned BAM files
2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal
3) Use samtools to extract the entries, limited to the first in pair (?); cut the read-name, chromosome, position (??) and extra information (for additional alignments); and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them)
4) Join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches

I have two questions:
(?) Is it OK to select only the first in pair? It's what the guy in the example did, and it did simplify the code by avoiding repeated read-names
(??) I guess it's OK to only check chromosome and position; the cigar would necessarily be the same.

Best regards

Miguel
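Steps 3 and 4 above can be pictured with a short sketch; it assumes primary first-in-pair records only (flag 0x40 set, 0x900 clear) and the XA:Z: tag for alternative hits, as in Miguel's description. The real reference is the compare_bwa_bam.sh script linked above; note that XA positions carry a strand sign, which the regex below allows for:

# Hedged sketch of the comparison core: read name, chr, pos and XA list per BAM,
# joined on read name, then classified as match / soft-match / miss.
extract () {
  samtools view -f 64 -F 2304 "$1" |
    awk -F'\t' '{ xa = "NA"
                  for (i = 12; i <= NF; i++) if ($i ~ /^XA:Z:/) xa = $i
                  print $1 "\t" $3 "\t" $4 "\t" xa }' | sort -k1,1
}
join -t $'\t' <(extract new.bam) <(extract original.bam) |
  awk -F'\t' '{ if ($2 == $5 && $3 == $6) m++                 # same chr and pos
                else if ($7 ~ ($2 ",[+-]" $3 ",") ||
                         $4 ~ ($5 ",[+-]" $6 ",")) s++        # pos listed in the other XA
                else x++ }
              END { print "Matches: " m "\nSoft: " s "\nMisses: " x }'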
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From miguel.vazquez at cnio.es Tue Apr 11 07:00:21 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Tue, 11 Apr 2017 13:00:21 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057: normal) 0.013% mismatches and 3.7% soft-matches, tumor) 0.043% mismatches and 4.64% soft-matches
In-Reply-To:
References:
Message-ID:

Hi Jonas,

About the BAM order in the header: I have some lines that start with @PG and then have a "CL:" field with the command line; I guess you are referring to those. The order is actually 1, 2, 3, 4, 5 and 6, which is the one I used.

About the numbers, they are almost the same, yet not entirely the same. There are 442902 mismatches in yours and 442926 in mine. So it appears that 24 of my mismatches became soft-matches in yours. I would have expected a move between matches and soft-matches, but not between mismatches and soft-matches. It's a bit odd. I could send you the list of my mismatches and we can find out which are the ones that moved, and why.
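Verifying which order a given BAM actually records, as Jonas and Miguel are doing here, only needs the header; a sketch assuming the @PG records carry the bwa commands in CL: fields (the file name is a placeholder):

# Hedged sketch: print the CL: command lines of the @PG header records, in order;
# the read-group order of the merge should be readable from these.
samtools view -H DO51057.normal.merged_output.bam |
  awk -F'\t' '$1 == "@PG" { for (i = 2; i <= NF; i++) if ($i ~ /^CL:/) print $i }'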
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Jonas.Demeulemeester at crick.ac.uk Tue Apr 11 07:51:20 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Tue, 11 Apr 2017 11:51:20 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057: normal) 0.013% mismatches and 3.7% soft-matches, tumor) 0.043% mismatches and 4.64% soft-matches
In-Reply-To:
References:
Message-ID: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk>

Hi Miguel,

Thanks for the update; that was indeed the order I was looking for. And you're right, I was too quick: only the number of matches is exactly identical.

The 48 (normal) and 24 (tumor) mismatches that became soft-matches in my run may, I guess, be cases where a uniquely mapping and a non-uniquely mapping read are flagged as duplicates, and one is removed in your run and the other in mine and the original run. We could have a look at these reads and try to trace this issue, but as Lincoln mentioned, maybe we should switch our focus to the other containers and consider BWA-Mem validated, given the observed small mismatch rates.

It seems as if differences in the sequence of bam file processing only have a small effect on the final result (in this case at least). Could it be that the order of reads within the original bams has a bigger effect than the order of the bams themselves?

Thanks,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
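If the two miss lists are exchanged as Miguel proposes, isolating the handful of reads that moved category is a one-liner; a sketch assuming each side dumps the read names of its misses to a text file (the file names are hypothetical):

# Hedged sketch: read names missed only by Miguel's run, and only by Jonas's run.
comm -23 <(sort miguel_misses.txt) <(sort jonas_misses.txt) > only_miguel.txt
comm -13 <(sort miguel_misses.txt) <(sort jonas_misses.txt) > only_jonas.txt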
On Tue, Apr 11, 2017 at 11:43 AM, Jonas Demeulemeester wrote:

Hi all,

I've completed the testing run of the BWA-Mem docker on PCAWG donor DO51057. Briefly, like Miguel's run, this test used the original unaligned BAM files for DO51057, but feeds them via the JSON file into the docker in a slightly different order (as recorded in the @PG/@CL lines in the original mapped PCAWG BAMs). Results of the comparison are as follows:

Matched normal:
Lines: 1125172217
Matches: 1083221794
Misses: 143668
Soft: 41806755

Tumor:
Lines: 1010685786
Matches: 963319037
Misses: 442902
Soft: 46923847

Which are exactly the numbers reported by Miguel (resulting in 0.013% and 0.043% mismatch rates for normal and tumor, respectively). The fact that the numbers match exactly comes as a bit of a surprise, I think, but it shows that the current pipeline is highly reproducible, even across platforms.

@Miguel, could you verify the order of mapping of the different read groups in the header of your newly mapped BAMs? For the original and newly mapped normal BAMs, the order recorded in the @CL lines is CPCG_0098_Ly_R_PE_517_WG.3 - 6 - 1 - 4 - 2 - 5. For the tumor BAMs the order is CPCG_0098_Pr_P_PE_500_WG.5 - 3 - 6 - 2 - 4 - 1. Do you observe the same or a different order? If it's the same, then the pipeline does some internal reordering and the order of records in the JSON doesn't matter. If not, then the order of the BAMs doesn't seem to matter as much (at least in this case), but maybe rather the order of reads within the BAMs (as evidenced by our high error rates previously).

Looking forward to hearing your thoughts on this!
Jonas
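For what it's worth, the order check Jonas asks for can probably be read straight off the header, assuming the read-group IDs appear in the CL: fields of the @PG records, as they do when bwa mem is given an -R '@RG\tID:...' argument (the BAM file name below is hypothetical):

    # print the read-group tokens in the order the bwa mem command
    # lines were recorded in the header
    samtools view -H DO51057.normal.merged_output.bam \
      | grep '^@PG' \
      | grep -oE 'CPCG_0098_[A-Za-z0-9_]+\.[0-9]+' \
      | uniq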
On 10 Apr 2017, at 13:15, Miguel Vazquez wrote:

Hi all,

The comparison with the tumor BAM for DO51057 has completed, with rates of miss-matches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. These numbers are not definitive since, as you can read from Jonas just below, there might still be a discrepancy in the order in which the BAMs were processed. We'll soon know from Jonas whether a different order improves these rates further.

Lines: 1010685786
Matches: 963319037
Misses: 442926
Soft: 46923823

Best regards

Miguel
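For reference, the percentages quoted in this thread follow directly from these counts; for the tumor BAM above:

    awk 'BEGIN {
      lines = 1010685786; misses = 442926; soft = 46923823
      printf "miss-match rate: %.4f%%\n", 100 * misses / lines  # 0.0438%
      printf "soft-match rate: %.2f%%\n",  100 * soft  / lines  # 4.64%
    }'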
On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester wrote:

Hi all,

I'm currently running the comparison of the BWA-Mem docker reproduced BAMs and the PCAWG ones for DO51057. I should be able to send a report some time today.

Miguel, looking at your code, I believe you're feeding the unaligned BAMs into the pipeline in the order given by the read-group lines (@RG) in the header of the PCAWG BAM. I'm using the order recorded in the command-line/programs-used lines (@CL/@PG) of the PCAWG BAM, which is often different, for whatever reason. I'm not entirely sure which one is correct, but I'm guessing the one in the @CL/@PG lines is the actual one, as it chronologically reiterates the whole procedure ( [align - sort] x N, followed by merge + flag dups ). If this is the case, the true % mismatches may be lower still than 0.013%; if not, then I should see a higher mismatch rate and the 0.013% is due to something else still.

Regarding the soft-matches, I agree with Junjun: we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.

Best,
Jonas

On 9 Apr 2017, at 15:47, Junjun Zhang wrote:

Hi Miguel,

This is indeed good news; the mismatch rate is significantly lower.

Regarding soft matches, thanks for the explanation. I wonder whether they have an impact (and how much) on variant calls. Do variant callers take into account the information that a read may map to multiple places? Do they make adjustments at the time of variant calling? I guess these are questions for the variant caller authors.

Thanks,
Junjun

From: on behalf of Miguel Vazquez
Date: Thursday, April 6, 2017 at 11:36 AM
To: Lincoln Stein
Cc: Francis Ouellette, Keiran Raine, docktesters at lists.icgc.org
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same as the one in the original BAM, but is included in the list of alternative alignments for that read.

For instance, the original BAM aligns a read to chr 1 pos 1000, but also admits that it could be aligned at chr 2 pos 2000 or chr 3 pos 3000; the new BAM aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way: the original position is included in the list of alternative positions of the new BAM.

I hope this was clear.

Best regards

Miguel
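In SAM terms, the alternative alignments Miguel describes are carried in the XA:Z: tag as a chrom,±pos,CIGAR,NM;... list. A toy illustration of the soft-match rule, with made-up values matching his example:

    # original read: chosen position chr 1 pos 1000,
    # alternatives at chr 2 pos 2000 and chr 3 pos 3000
    orig_xa='XA:Z:2,+2000,76M,0;3,+3000,76M,1;'
    # the new run picked chr 2 pos 2000
    new_chr=2
    new_pos=2000
    # soft-match if the new (chrom,pos) appears among the original alternatives
    if echo "$orig_xa" | grep -Eq "[;:]${new_chr},[+-]${new_pos},"; then
      echo soft-match
    else
      echo miss
    fi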
On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein wrote:

Hi Miguel,

Sounds like a significant achievement! But remind me what a "soft match" is?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez wrote:

Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes. It is currently running over the tumor BAM, but the normal BAM has completed, and the miss-matches are two orders of magnitude lower than in our two previous attempts. Before further discussion, here are the raw numbers:

Lines: 1125172217
Matches: 1083221794
Misses: 143716
Soft: 41806707

If my calculations are correct, this means 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches.

The fix had two parts. The first was realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output BAM into its constituent BAMs, by a process implemented by Jonas, unfortunately did not address the problem. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matters, so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix: he located a subset of original unaligned BAMs at the DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as specified in the output BAM header achieved these promising results.

I will reply to this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards

Miguel

On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez wrote:

Dear all,

Great news! The BWA-Mem test on a real PCAWG donor succeeded in running, achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are:

Lines: 1708047647
Matches: 1589172843
Misses: 62726130
Soft: 56148674

Which means 93% matches, 3.6% miss-matches, and 3.2% soft-matches. Compared to the HCC1143 test, a few percentage points of matches turn into soft-matches (95% and 1.3% vs 93% and 3.2%), but the ratio of misses is very close: 3.6%.

I'm running this test on a second donor.

Best regards

Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez wrote:

Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data.

I think what solved the problem was setting the headers on the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but it's too early to say whether it's working or not.

To compare BAM files I've followed some advice that I found on the internet (https://www.biostars.org/p/166221/). I will detail it a bit below, because I would like some advice as to how appropriate the approach is, but first here are the numbers:

Lines: 74264390
Matches: 70565742
Misses: 2693687
Soft: 1004961

Which means 95% matches, 3.6% miss-matches, and 1.3% soft-matches. Matches are when the chromosome and position are the same; soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g. XA:Z:15,-102516528,76M,0); and misses are the rest.

Here is the detailed process from the start. The comparison script is here:
https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh

1) Un-align the tumor and normal BAM files, retaining the original aligned BAM files.
2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal.
3) Use samtools to extract the entries, limited to the first read in each pair (?); cut the read name, chromosome, position (??) and extra information (for additional alignments); and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them).
4) Join the lines by read name, separately for the tumor and normal pairs of files, and check for matches.

I have two questions:
(?) Is it OK to select only the first read in each pair? It's what the author of the example did, and it did simplify the code by avoiding repeated read names.
(??) I guess it's OK to only check chromosome and position; the CIGAR would necessarily be the same.

Best regards

Miguel
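A condensed sketch of steps 3 and 4 (the authoritative logic is in compare_bwa_bam.sh linked above; the file names here are hypothetical, and the tumor/normal split is omitted):

    # step 3: first-in-pair (-f 64), mapped (-F 4) reads only; keep read
    # name, chromosome, position and the XA tag, sorted by read name
    extract () {
      samtools view -f 64 -F 4 "$1" \
        | awk 'BEGIN { OFS = "\t" }
               { xa = ""
                 for (i = 12; i <= NF; i++) if ($i ~ /^XA:Z:/) xa = $i
                 print $1, $3, $4, xa }' \
        | sort -k1,1
    }
    extract original.normal.bam       > original.tsv
    extract HCC1143.merged_output.bam > rerun.tsv
    # step 4: pair the two alignments of each read by read name, then
    # classify each joined line as match / soft-match / miss
    join -t "$(printf '\t')" original.tsv rerun.tsv > joined.tsv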
On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez wrote:

Dear all,

Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 and DO52140.

Sanger:
----------
Sanger calls only somatic variants. The results are identical for Indels and SVs, but almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same ones are found after running the workflow a second time.

DKFZ:
---------
DKFZ calls somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and Indels, but show large discrepancies for SV and CNV. Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue, I believe.

BWA-Mem failed for me, and has also failed for Denis Yuen and Jonas Demeulemeester. Denis, I believe, is investigating this problem further. I haven't had the chance to investigate it much myself.

Best

Miguel

---------------------
RESULTS
---------------------

ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt

Comparison of somatic.snv.mnv for DO50311 using DKFZ
---
Common: 51087
Extra: 0
Missing: 0

Comparison of somatic.indel for DO50311 using DKFZ
---
Common: 26469
Extra: 0
Missing: 0

Comparison of somatic.sv for DO50311 using DKFZ
---
Common: 231
Extra: 44
- Example: 10:20596800:N:,10:56066821:N:,11:16776092:N:
Missing: 48
- Example: 10:119704959:N:,10:13116322:N:,10:47063485:N:

Comparison of somatic.cnv for DO50311 using DKFZ
---
Common: 731
Extra: 213
- Example: 10:132510034:N:,10:20596801:N:,10:47674883:N:
Missing: 190
- Example: 10:100891940:N:,10:104975905:N:,10:119704960:N:

Comparison of germline.snv.mnv for DO50311 using DKFZ
---
Common: 3850992
Extra: 0
Missing: 0

Comparison of germline.indel for DO50311 using DKFZ
---
Common: 709060
Extra: 0
Missing: 0

Comparison of germline.sv for DO50311 using DKFZ
---
Common: 1393
Extra: 231
- Example: 10:134319313:N:,10:134948976:N:,10:19996638:N:
Missing: 615
- Example: 10:101851839:N:,10:101851884:N:,10:10745225:N:

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO52140 using DKFZ
---
Common: 37160
Extra: 0
Missing: 0

Comparison of somatic.indel for DO52140 using DKFZ
---
Common: 19347
Extra: 0
Missing: 0

Comparison of somatic.sv for DO52140 using DKFZ
---
Common: 72
Extra: 23
- Example: 10:132840774:N:,11:38252019:N:,11:47700673:N:
Missing: 61
- Example: 10:134749140:N:,11:179191:N:,11:38252005:N:

Comparison of somatic.cnv for DO52140 using DKFZ
---
Common: 275
Extra: 94
- Example: 1:106505931:N:,1:109068899:N:,1:109359995:N:
Missing: 286
- Example: 10:88653561:N:,11:179192:N:,11:38252006:N:

Comparison of germline.snv.mnv for DO52140 using DKFZ
---
Common: 3833896
Extra: 0
Missing: 0

Comparison of germline.indel for DO52140 using DKFZ
---
Common: 706572
Extra: 0
Missing: 0

Comparison of germline.sv for DO52140 using DKFZ
---
Common: 1108
Extra: 1116
- Example: 10:102158308:N:,10:104645247:N:,10:105097522:N:
Missing: 2908
- Example: 10:100107032:N:,10:100107151:N:,10:102158345:N:

File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz

Comparison of somatic.snv.mnv for DO50311 using Sanger
---
Common: 156299
Extra: 1
- Example: Y:58885197:A:G
Missing: 14
- Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C

Comparison of somatic.indel for DO50311 using Sanger
---
Common: 812487
Extra: 0
Missing: 0

Comparison of somatic.sv for DO50311 using Sanger
---
Common: 260
Extra: 0
Missing: 0

Comparison of somatic.cnv for DO50311 using Sanger
---
Common: 138
Extra: 0
Missing: 0

Comparison of somatic.snv.mnv for DO52140 using Sanger
---
Common: 87234
Extra: 5
- Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
Missing: 7
- Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A

Comparison of somatic.indel for DO52140 using Sanger
---
Common: 803986
Extra: 0
Missing: 0

Comparison of somatic.sv for DO52140 using Sanger
---
Common: 6
Extra: 0
Missing: 0

Comparison of somatic.cnv for DO52140 using Sanger
---
Common: 36
Extra: 0
Missing: 2
- Example: 10:11767915:T:,10:11779907:G:
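The Common/Extra/Missing counts in this report amount to a set comparison on chrom:pos:ref:alt keys (the examples above have exactly that shape). A sketch of how such a report could be produced, with hypothetical paths; the real logic lives in the same PCAWG-Docker-Test repository:

    vcf_keys () {
      zcat "$1" | grep -v '^#' | awk '{ print $1 ":" $2 ":" $4 ":" $5 }' | sort -u
    }
    vcf_keys original/DO50311.somatic.snv.mnv.vcf.gz > original.keys
    vcf_keys rerun/DO50311.somatic.snv.mnv.vcf.gz    > rerun.keys
    echo "Common:  $(comm -12 original.keys rerun.keys | wc -l)"
    echo "Extra:   $(comm -13 original.keys rerun.keys | wc -l)"  # rerun only
    echo "Missing: $(comm -23 original.keys rerun.keys | wc -l)"  # original only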
The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

From miguel.vazquez at cnio.es Tue Apr 11 10:10:25 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Tue, 11 Apr 2017 16:10:25 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk>
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk>
Message-ID: 

Hi Jonas,

I agree with letting the matter rest, following Lincoln's input at the last TC. There might be a lesson to be learned here, but unless someone prompts us about this again, we should move on.

About your conclusion on BAM order and read order: I agree, it seems that read order matters more than BAM order; in any case, I think you have a better understanding of this subject than I do.

About the test George is going to conduct: I guess he'll be fine just using my version of the scripts. At some point I'll try to incorporate your version of the BAM ordering into my scripts so that our versions converge again. Not a pressing issue for now, I think.

Best regards

Miguel

On Tue, Apr 11, 2017 at 1:51 PM, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk> wrote:
> [snip]
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

From christina.yung at oicr.on.ca Tue Apr 11 10:18:43 2017
From: christina.yung at oicr.on.ca (Christina Yung)
Date: Tue, 11 Apr 2017 09:18:43 -0500
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To: 
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk>
Message-ID: <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>

An HTML attachment was scrubbed...
URL: 

From George.Mihaiescu at oicr.on.ca Tue Apr 11 10:18:53 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Tue, 11 Apr 2017 14:18:53 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To: 
Message-ID: 

Hi Miguel,

Yesterday I downloaded the data using "bin/download_unaligned.sh DO51057", so now I'll run the "bin/run_bwa_test.sh DO51057" script from your repo. We'll see in a few days whether the reported miss- and soft-matches are exactly the same.

George
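If the run reproduces the earlier results, the four counts should line up exactly with the ones Miguel reported (George is using Miguel's scripts, so Miguel's normal-BAM numbers are the reference here). A trivial check, assuming the comparison writes its counts to a stats file; both file names are hypothetical:

    printf 'Lines: 1125172217\nMatches: 1083221794\nMisses: 143716\nSoft: 41806707\n' > expected.normal.txt
    diff expected.normal.txt normal.stats.txt && echo 'normal BAM reproduced exactly'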
There might be a lesson to be learned here, but unless someone here prompts us again about this we should move on. About your conclusion on BAM order and read order, I agree, it seems like read order is more important than BAM order, anyway I think you have a better understanding of this subject than I do. About the test that George is going to conduct, I guess he'll be fine just using my version of the scripts. At some point I'll try to incorporate your version of the BAM orders into my scripts so our version converge back. Not a pressing issue for now I think. Best regards Miguel On Tue, Apr 11, 2017 at 1:51 PM, Jonas Demeulemeester > wrote: Hi Miguel, Thanks for the update, that was indeed the order I was looking for. And you?re right, I was too quick, only the number of matches is exactly identical. The 24 (normal) and 48 (tumor) mismatches that became soft-matches in my run I guess may be cases where a uniquely mapping and a non-uniquely mapping read are flagged as duplicates and one is removed in your run and the other in mine and the original run. We could have a look at these reads, and try to trace this issue, but as Lincoln mentioned, maybe we should switch our focus to the other containers, and consider BWA-Mem validated given the observed small mismatch rates. It seems as if differences in the sequence of bam file processing only have a small effect on the final result (in this case at least). Could it be that the order of reads within the original bams has a bigger effect than the order of the bams themselves? Thanks, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 11 Apr 2017, at 12:00, Miguel Vazquez > wrote: Hi Jonas, About the BAM order in the header I have some lines that start with @PG and then have a "CL:" field with the command line, I guess you are referring to those. The order is actually 1,2,3,4,5 and 6, which is the one I used. About the numbers, they are almost the same, yet not entirely the same. There are 442902 miss-matches in yours and 442926 in mine. So it appears that 24 of my miss-matches became soft-matches in yours. I would have expected a move between matches and soft-matches but not with miss-matches and soft-matches. It's a bit odd. I could send you the list of my miss-matches and we can find out which are the ones that moved and find out why. On Tue, Apr 11, 2017 at 11:43 AM, Jonas Demeulemeester > wrote: Hi all, I?ve completed the testing run of the BWA-Mem docker on PCAWG donor DO51057. Briefly, like Miguel?s run, this test used the original unaligned bam files for DO51057, but feeds them via the JSON file into the docker in a slightly different order (as recorded in the @PG/@CL lines in the original mapped PCAWG bams) Results of the comparison are as follows: Matched normal: Lines: 1125172217 Matches: 1083221794 Misses: 143668 Soft: 41806755 Tumor: Lines: 1010685786 Matches: 963319037 Misses: 442902 Soft: 46923847 Which are exactly the numbers reported by Miguel (resulting in 0.043% and 0.013% mismatch rates). The fact that the numbers match exactly comes as a bit of a surprise I think, but shows that the current pipeline is highly reproducible, even across platforms. @Miguel, could you verify the order of mapping of the different read groups in the header of your newly mapped bams. 
For the original and newly mapped normal bams the order recorded in the @CL lines is CPCG_0098_Ly_R_PE_517_WG.3 - 6 - 1 - 4 - 2 - 5 For the tumor bams the order is: CPCG_0098_Pr_P_PE_500_WG.5 - 3 - 6 - 2 - 4 - 1 Do you observe the same or a different order? If it?s the same, then the pipeline does some internal reordering and the order of records in the JSON doesn?t matter. If not, then the order of the bams doesn?t seem to matter as much (at least in this case), but maybe rather the order of reads within the bams (as evidenced by our high error rates previously). Looking forward to hearing your thoughts on this! Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 10 Apr 2017, at 13:15, Miguel Vazquez > wrote: Hi all, The comparison with the tumor BAM for DO51057 has completed with rates of miss-maches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. These numbers are not definitive since as you can read from Jonas just below, there might still be a discrepancy in the order in which the BAM where processed. We'll soon know from Jonas if a different order will fix these rates even more. Lines: 1010685786 Matches: 963319037 Misses: 442926 Soft: 46923823 Best regards Miguel On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester > wrote: Hi all, I?m currently running the comparison of the BWA-Mem docker reproduced bams and the PCAWG ones for DO51057. I should be able to send a report some time today. Miguel, looking at your code, I believe you?re feeding the unaligned bams into the pipeline in the order given by the read group lines (@RG) in the header of the PCAWG bam. I?m using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG bam, which is often different for whatever reason. I?m not entirely sure which one is the correct one, but I?m guessing the one in the @CL/@PG lines is the actual one as it chronologically reiterates the whole procedure ( [align - sort] x N followed by merge + flag dups ) If this is the case, the true % mismatches may be lower still than 0.013%, if not, then I should see a higher mismatch rate and the 0.013% is due to something else still. Regarding the soft-matches, I agree with Junjun, we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally. Best, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 9 Apr 2017, at 15:47, Junjun Zhang > wrote: Hi Miguel, This is indeed good news, the mismatch is significantly lower. Regarding soft matches, thanks for the explanation. I wonder whether it has impact (or how much impact) on variant calls, do variant callers take into account the information that a read may map to multiple places? Does it make adjustment at the time of variant calling? I guess these are questions for variant caller authors. Thanks, Junjun From: > on behalf of Miguel Vazquez > Date: Thursday, April 6, 2017 at 11:36 AM To: Lincoln Stein > Cc: Francis Ouellette >, Keiran Raine >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 
96.3% matches, 0.013% miss-matches, and 3.7% soft-matches Hi Lincoln, Soft-match means that the alignment position in the new BAM is not the same is the one in the original BAM, but is included in the list of alternative alignments for that read. For instance, the original bam aligns a read to chr 1 pos 1000, but also admits that is could be aligned at chr 2 pos 2000 or chr 3 pos 3000, the new bam aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way, that the original position is included in the list of alternative positions of the new BAM I hope this was clear. Best regards Miguel On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein > wrote: Hi Miguel, Sounds like a significant achievement! But remind me what a "soft match" is? Lincoln On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez > wrote: Dear all, This is just an advance teaser for the BWA-Mem validation after the latest changes, it is currently running over the tumor BAM, but the normal BAM has completed and the missmatches are two orders of magnitude lower than in our two previous attempts. Before further discussion here are the raw numbers: Lines: 1125172217 Matches: 1083221794 Misses: 143716 Soft: 41806707 If my calculation are correct this means 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches. The fix was two part. First realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output bam into it's constituent BAMs, by a process implemented by Jonas, dit not address the problem unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matter, and so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix, he located a subset of original unaligned BAMs in the DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as was specified in the output BAM header achieved these promising results. I will reply this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running. Best regards Miguel On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez > wrote: Dear all, Great news! The BWA-Mem test on a real PCAWG donor succeed in running; achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are: Lines: 1708047647 Matches: 1589172843 Misses: 62726130 Soft: 56148674 Which mean 93% matches, 3.6% miss-matches, and 3.2% soft-matches. Compared to the HCC1143 test there are a few percentage points in matches that turn into soft-matches (95% and 1.3% to 93% and 3.2%), but the ratio of misses is very close 3.6%. I'm running this test on a second donor. Best regards Miguel On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez > wrote: Dear colleagues, I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data. I think what solved the problem was setting the headers to the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but its too early to say if its working or not. To compare BAM files I've followed some advice that I found on the internet https://www.biostars.org/p/166221/. 
I will detail them a bit below because I would like some advice as to how appropriate the approach is, but first here are the numbers: Lines: 74264390 Matches: 70565742 Misses: 2693687 Soft: 1004961 Which means 95% matches, 3.6% miss-matches, and 1.3% soft-matches. Matches are when the chromosome and position are the same, soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g XA:Z:15,-102516528,76M,0), and misses are the rest. Here is the detailed process from the start. The comparison script is here https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh 1) Un-align tumor and normal BAM files, retaining the original aligned BAM files 2) Run BWA-Mem wich produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal 3) use samtools to extract the entries, limited for the first in pair (?), cut the read-name, chromosome, position (??) and extra information (for additional alignments) and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them) 4) join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches I've two questions: (?) Is it OK to select only the first in pair, its what the guy in the example did, and it did simplify the code without repeated read-names (??) I guess its OK to only check chromosome and position, the cigar would be necessarily the same. Best regards Miguel On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez > wrote: Dear all, Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 DO52140 Sanger: ---------- Sanger call only somatic variants. The results are identical for Indels and SVs but almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same are found after running the workflow a second time. DKFZ: --------- DKFZ cals somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and Indels but with large discrepancies for SV and CNV. Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue I believe. BWA-Mem failed for me and has also failed for Denis Yuen and Jonas Demeulemeester. Denis I believe is investigating this problem further. I haven't had the chance to investigate this much myself. 
Best Miguel --------------------- RESULTS --------------------- ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt Comparison of somatic.snv.mnv for DO50311 using DKFZ --- Common: 51087 Extra: 0 Missing: 0 Comparison of somatic.indel for DO50311 using DKFZ --- Common: 26469 Extra: 0 Missing: 0 Comparison of somatic.sv for DO50311 using DKFZ --- Common: 231 Extra: 44 - Example: 10:20596800:N:,10:56066821:N:,11:16776092:N: Missing: 48 - Example: 10:119704959:N:,10:13116322:N:,10:47063485:N: Comparison of somatic.cnv for DO50311 using DKFZ --- Common: 731 Extra: 213 - Example: 10:132510034:N:,10:20596801:N:,10:47674883:N: Missing: 190 - Example: 10:100891940:N:,10:104975905:N:,10:119704960:N: Comparison of germline.snv.mnv for DO50311 using DKFZ --- Common: 3850992 Extra: 0 Missing: 0 Comparison of germline.indel for DO50311 using DKFZ --- Common: 709060 Extra: 0 Missing: 0 Comparison of germline.sv for DO50311 using DKFZ --- Common: 1393 Extra: 231 - Example: 10:134319313:N:,10:134948976:N:,10:19996638:N: Missing: 615 - Example: 10:101851839:N:,10:101851884:N:,10:10745225:N: File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz Comparison of somatic.snv.mnv for DO52140 using DKFZ --- Common: 37160 Extra: 0 Missing: 0 Comparison of somatic.indel for DO52140 using DKFZ --- Common: 19347 Extra: 0 Missing: 0 Comparison of somatic.sv for DO52140 using DKFZ --- Common: 72 Extra: 23 - Example: 10:132840774:N:,11:38252019:N:,11:47700673:N: Missing: 61 - Example: 10:134749140:N:,11:179191:N:,11:38252005:N: Comparison of somatic.cnv for DO52140 using DKFZ --- Common: 275 Extra: 94 - Example: 1:106505931:N:,1:109068899:N:,1:109359995:N: Missing: 286 - Example: 10:88653561:N:,11:179192:N:,11:38252006:N: Comparison of germline.snv.mnv for DO52140 using DKFZ --- Common: 3833896 Extra: 0 Missing: 0 Comparison of germline.indel for DO52140 using DKFZ --- Common: 706572 Extra: 0 Missing: 0 Comparison of germline.sv for DO52140 using DKFZ --- Common: 1108 Extra: 1116 - Example: 10:102158308:N:,10:104645247:N:,10:105097522:N: Missing: 2908 - Example: 10:100107032:N:,10:100107151:N:,10:102158345:N: File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz Comparison of somatic.snv.mnv for DO50311 using Sanger --- Common: 156299 Extra: 1 - Example: Y:58885197:A:G Missing: 14 - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C Comparison of somatic.indel for DO50311 using Sanger --- Common: 812487 Extra: 0 Missing: 0 Comparison of somatic.sv for DO50311 using Sanger --- Common: 260 Extra: 0 Missing: 0 Comparison of somatic.cnv for DO50311 using Sanger --- Common: 138 Extra: 0 Missing: 0 Comparison of somatic.snv.mnv for DO52140 using Sanger --- Common: 87234 Extra: 5 - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A Missing: 7 - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A Comparison of somatic.indel for DO52140 using Sanger --- Common: 803986 Extra: 0 Missing: 0 Comparison of somatic.sv for DO52140 using Sanger --- Common: 6 Extra: 0 Missing: 0 Comparison of somatic.cnv for DO52140 using Sanger --- Common: 36 Extra: 0 Missing: 2 - Example: 10:11767915:T:,10:11779907:G: -- Lincoln Stein Scientific Director (Interim), Ontario Institute for Cancer Research Director, Informatics and Bio-computing Program, OICR Senior Principal Investigator, OICR Professor, Department of Molecular Genetics, University of Toronto Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 
From Denis.Yuen at oicr.on.ca  Tue Apr 11 10:28:29 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Tue, 11 Apr 2017 14:28:29 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To: <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk>, <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID: <0794de7a71da4181a0acf29f1891d75d@oicr.on.ca>

Christina:

Yes, I believe that https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow would be an appropriate place for these kinds of instructions. I've pre-emptively given Miguel and Jonas permission to edit that repository.

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
________________________________
From: docktesters-bounces+denis.yuen=oicr.on.ca at lists.icgc.org on behalf of Christina Yung
Sent: April 11, 2017 10:18:43 AM
To: Miguel Vazquez; Jonas Demeulemeester; George Mihaiescu
Cc: Lincoln Stein; Francis Ouellette; docktesters at lists.icgc.org; Keiran Raine
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches

Hi Jonas & Miguel,

After the effort you've put in to identify issues, could you document the steps to reproduce the alignment similar to production runs? In other words, as a user who has downloaded a PCAWG aligned BAM, how do I revert back to lane BAMs and input the lane BAMs in the right order to the docker?

Denis, will github (https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow) be appropriate for such instructions?

Thanks,
Christina

On 4/11/2017 9:10 AM, Miguel Vazquez wrote:

Hi Jonas,

I agree with letting the matter rest following Lincoln's input last TC. There might be a lesson to be learned here, but unless someone here prompts us again about this we should move on. About your conclusion on BAM order and read order, I agree: it seems like read order is more important than BAM order, though I think you have a better understanding of this subject than I do.

About the test that George is going to conduct, I guess he'll be fine just using my version of the scripts. At some point I'll try to incorporate your version of the BAM orders into my scripts so our versions converge again. Not a pressing issue for now, I think.

Best regards

Miguel

On Tue, Apr 11, 2017 at 1:51 PM, Jonas Demeulemeester wrote:

Hi Miguel,

Thanks for the update, that was indeed the order I was looking for. And you're right, I was too quick: only the number of matches is exactly identical. The 24 (normal) and 48 (tumor) mismatches that became soft-matches in my run may be cases where a uniquely mapping and a non-uniquely mapping read are flagged as duplicates, and one is removed in your run while the other is removed in mine and the original run. We could have a look at these reads and try to trace this issue, but as Lincoln mentioned, maybe we should switch our focus to the other containers and consider BWA-Mem validated given the observed small mismatch rates.

It seems as if differences in the sequence of BAM file processing only have a small effect on the final result (in this case at least). Could it be that the order of reads within the original BAMs has a bigger effect than the order of the BAMs themselves?

Thanks,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London
NW1 1AT

T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 11 Apr 2017, at 12:00, Miguel Vazquez wrote:

Hi Jonas,

About the BAM order in the header: I have some lines that start with @PG and then have a "CL:" field with the command line; I guess you are referring to those. The order is actually 1, 2, 3, 4, 5 and 6, which is the one I used.

About the numbers, they are almost the same, yet not entirely the same. There are 442902 miss-matches in yours and 442926 in mine. So it appears that 24 of my miss-matches became soft-matches in yours. I would have expected a move between matches and soft-matches, but not between miss-matches and soft-matches. It's a bit odd. I could send you the list of my miss-matches and we can find out which are the ones that moved, and why.
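As an aside, the @PG "CL:" fields being compared here can be pulled out of a BAM header programmatically. A hypothetical helper follows, assuming pysam is installed (samtools view -H piped through grep would do equally well; the file name is a placeholder):

    import pysam

    def bwa_command_lines(bam_path):
        """Return the CL: fields of all @PG header lines, in order."""
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            header = bam.header.to_dict()  # e.g. {"PG": [{"ID": ..., "CL": ...}, ...]}
        return [pg["CL"] for pg in header.get("PG", []) if "CL" in pg]

    for cl in bwa_command_lines("normal.bam"):  # placeholder path
        print(cl)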
On Tue, Apr 11, 2017 at 11:43 AM, Jonas Demeulemeester wrote:

Hi all,

I've completed the testing run of the BWA-Mem docker on PCAWG donor DO51057. Briefly, like Miguel's run, this test used the original unaligned BAM files for DO51057, but feeds them via the JSON file into the docker in a slightly different order (as recorded in the @PG/@CL lines in the original mapped PCAWG BAMs). Results of the comparison are as follows:

Matched normal:
Lines: 1125172217
Matches: 1083221794
Misses: 143668
Soft: 41806755

Tumor:
Lines: 1010685786
Matches: 963319037
Misses: 442902
Soft: 46923847

Which are exactly the numbers reported by Miguel (resulting in 0.013% and 0.043% mismatch rates for the normal and tumor, respectively). The fact that the numbers match exactly comes as a bit of a surprise, I think, but shows that the current pipeline is highly reproducible, even across platforms.

@Miguel, could you verify the order of mapping of the different read groups in the header of your newly mapped BAMs? For the original and newly mapped normal BAMs the order recorded in the @CL lines is CPCG_0098_Ly_R_PE_517_WG. 3 - 6 - 1 - 4 - 2 - 5. For the tumor BAMs the order is CPCG_0098_Pr_P_PE_500_WG. 5 - 3 - 6 - 2 - 4 - 1. Do you observe the same or a different order? If it's the same, then the pipeline does some internal reordering and the order of records in the JSON doesn't matter. If not, then the order of the BAMs doesn't seem to matter as much (at least in this case), but rather the order of reads within the BAMs (as evidenced by our high error rates previously).

Looking forward to hearing your thoughts on this!

Jonas
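As a quick sanity check on the quoted rates: the miss-match and soft-match percentages are simply the Misses and Soft counts divided by the total Lines. A small sketch, taking the counts verbatim from Jonas's report above:

    counts = {
        "normal": dict(lines=1125172217, misses=143668, soft=41806755),
        "tumor":  dict(lines=1010685786, misses=442902, soft=46923847),
    }
    for sample, c in counts.items():
        miss = 100.0 * c["misses"] / c["lines"]
        soft = 100.0 * c["soft"] / c["lines"]
        print(f"{sample}: {miss:.3f}% miss-matches, {soft:.2f}% soft-matches")
    # normal: 0.013% miss-matches, 3.72% soft-matches
    # tumor: 0.044% miss-matches, 4.64% soft-matches (quoted as 0.043% in the thread)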
On 10 Apr 2017, at 13:15, Miguel Vazquez wrote:

Hi all,

The comparison with the tumor BAM for DO51057 has completed, with rates of miss-matches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. These numbers are not definitive since, as you can read from Jonas just below, there might still be a discrepancy in the order in which the BAMs were processed. We'll soon know from Jonas whether a different order will improve these rates even more.

Lines: 1010685786
Matches: 963319037
Misses: 442926
Soft: 46923823

Best regards

Miguel

On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester wrote:

Hi all,

I'm currently running the comparison of the BWA-Mem docker reproduced BAMs and the PCAWG ones for DO51057. I should be able to send a report some time today.

Miguel, looking at your code, I believe you're feeding the unaligned BAMs into the pipeline in the order given by the read group lines (@RG) in the header of the PCAWG BAM. I'm using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG BAM, which is often different for whatever reason. I'm not entirely sure which one is the correct one, but I'm guessing the one in the @CL/@PG lines is the actual one, as it chronologically reiterates the whole procedure ([align - sort] x N followed by merge + flag dups). If this is the case, the true % mismatches may be lower still than 0.013%; if not, then I should see a higher mismatch rate and the 0.013% is due to something else still.

Regarding the soft-matches, I agree with Junjun: we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.

Best,
Jonas

On 9 Apr 2017, at 15:47, Junjun Zhang wrote:

Hi Miguel,

This is indeed good news; the mismatch is significantly lower.

Regarding soft matches, thanks for the explanation. I wonder whether it has an impact (or how much impact) on variant calls: do variant callers take into account the information that a read may map to multiple places? Do they make adjustments at the time of variant calling? I guess these are questions for the variant caller authors.

Thanks,
Junjun

On Thursday, April 6, 2017 at 11:36 AM, Miguel Vazquez wrote (Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches):

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same as the one in the original BAM, but is included in the list of alternative alignments for that read.

For instance, the original BAM aligns a read to chr 1 pos 1000, but also admits that it could be aligned at chr 2 pos 2000 or chr 3 pos 3000; the new BAM aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way: the original position is included in the list of alternative positions of the new BAM.

I hope this was clear.

Best regards

Miguel
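Miguel's definition amounts to a small classification rule. The sketch below is illustrative only (the function and its input shape are invented for the example); XA entries follow the chrom,±pos,CIGAR,NM format quoted later in the thread:

    def classify(orig, new):
        """orig/new: ((chrom, pos), xa_string) for the same read in each BAM."""
        (o_loc, o_xa), (n_loc, n_xa) = orig, new
        if o_loc == n_loc:
            return "match"

        def alts(xa):
            # XA:Z: lists alternative hits as chrom,±pos,CIGAR,NM joined by ';'
            out = set()
            for hit in filter(None, (xa or "").split(";")):
                chrom, pos = hit.split(",")[:2]
                out.add((chrom, int(pos.lstrip("+-"))))
            return out

        # soft-match if either position appears in the other's alternative list
        if n_loc in alts(o_xa) or o_loc in alts(n_xa):
            return "soft"
        return "miss"

    # Miguel's example: original at chr1:1000 with alternatives chr2:2000 and
    # chr3:3000; the new BAM picks chr2:2000
    print(classify((("1", 1000), "2,+2000,76M,0;3,+3000,76M,1"),
                   (("2", 2000), None)))  # -> "soft"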
On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein wrote:

Hi Miguel,

Sounds like a significant achievement! But remind me what a "soft match" is?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez wrote:

Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes. It is currently running over the tumor BAM, but the normal BAM has completed, and the miss-matches are two orders of magnitude lower than in our two previous attempts. Before further discussion, here are the raw numbers:

Lines: 1125172217
Matches: 1083221794
Misses: 143716
Soft: 41806707

If my calculations are correct, this means 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches.

The fix had two parts. The first was realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output BAM into its constituent BAMs, by a process implemented by Jonas, did not address the problem, unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matters, and so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix: he located a subset of original unaligned BAMs at the DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as was specified in the output BAM header achieved these promising results.

I will reply to this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards

Miguel

On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez wrote:

Dear all,

Great news! The BWA-Mem test on a real PCAWG donor succeeded in running, achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are:

Lines: 1708047647
Matches: 1589172843
Misses: 62726130
Soft: 56148674

This means 93% matches, 3.6% miss-matches, and 3.2% soft-matches. Compared to the HCC1143 test, a few percentage points of matches turn into soft-matches (from 95% matches and 1.3% soft-matches to 93% and 3.2%), but the ratio of misses, 3.6%, is very close.

I'm running this test on a second donor.

Best regards

Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez wrote:

Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data.

I think what solved the problem was setting the headers on the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but it's too early to say if it's working or not.

To compare BAM files I've followed some advice that I found on the internet (https://www.biostars.org/p/166221/). I will detail it a bit below because I would like some advice as to how appropriate the approach is, but first here are the numbers:

Lines: 74264390
Matches: 70565742
Misses: 2693687
Soft: 1004961

This means 95% matches, 3.6% miss-matches, and 1.3% soft-matches. Matches are when the chromosome and position are the same; soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g. XA:Z:15,-102516528,76M,0); and misses are the rest.

Here is the detailed process from the start. The comparison script is here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh (a Python sketch of steps 3 and 4 appears below, after this message).

1) Un-align tumor and normal BAM files, retaining the original aligned BAM files
2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal
3) Use samtools to extract the entries, limited to the first in pair (?), cut the read-name, chromosome, position (??) and extra information (for additional alignments), and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them)
4) Join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches

I have two questions:
(?) Is it OK to select only the first in pair? It's what the author of the example did, and it did simplify the code by avoiding repeated read-names.
(??) I guess it's OK to only check chromosome and position; the cigar would necessarily be the same.

Best regards

Miguel
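A rough Python rendering of steps 3 and 4 above, assuming pysam is available (the actual logic lives in the compare_bwa_bam.sh script linked above, which shells out to samtools, cut, sort and join instead):

    import pysam

    def first_in_pair_positions(bam_path):
        """Map read name -> ((chrom, 1-based pos), XA string) for primary,
        first-in-pair alignments (step 3)."""
        table = {}
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam:
                if (read.is_unmapped or not read.is_read1
                        or read.is_secondary or read.is_supplementary):
                    continue
                xa = read.get_tag("XA") if read.has_tag("XA") else None
                table[read.query_name] = (
                    (read.reference_name, read.reference_start + 1), xa)
        return table

    # Step 4: join the two tables on read name and classify each shared read
    # with the classify() rule sketched earlier in this thread.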
On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez wrote:

Dear all,

Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 and DO52140.

Sanger:
----------
Sanger calls only somatic variants. The results are identical for Indels and SVs, and almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same ones are found after running the workflow a second time.

DKFZ:
---------
DKFZ calls somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and Indels, but with large discrepancies for SV and CNV. Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue, I believe.

BWA-Mem failed for me and has also failed for Denis Yuen and Jonas Demeulemeester. Denis, I believe, is investigating this problem further. I haven't had the chance to investigate this much myself.

Best

Miguel
From miguel.vazquez at cnio.es  Tue Apr 11 10:39:21 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Tue, 11 Apr 2017 16:39:21 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To: <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID:

Hi Christina,

Fortunately all the steps are in scripts, so we won't forget how it's done. That said, what you ask for is no longer accurate: reverting the aligned BAM is not what we ended up doing, since it leads to a 3% rate of miss-matches. At first we thought the problem was that BWA had not been run over a single BAM but over a set of smaller BAM files; however, splitting the output BAM back into its constituents did not address the 3% miss-match ratio. I think it was Keiran who brought us to the conclusion that the problem was the order of the reads inside the BAM.
Junjun found that the DKFZ had saved some of the original unaligned BAM files, which is what we ended up using; this reduced the miss-match rate from 3% to 0.013% and 0.043% for the normal and tumor BAMs. Our latest discussion was about the order in which those original BAMs are entered into BWA, to explain those small remaining percentages; we have gained some insights, but we don't plan to get to the bottom of it.

The bottom line is that there is no process of reverting the aligned BAM into the unaligned one prior to running BWA; rather, the original unaligned BAMs must be used. It's a pity that it works this way, but it is not a realistic use-case anyway to revert an already aligned BAM, is it?

If it's OK with you, Christina, we can still document this on the repo. Do you mind writing this down, Jonas? I think you understand it better than I do.

Best regards

Miguel
From Junjun.Zhang at oicr.on.ca  Tue Apr 11 10:48:40 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Tue, 11 Apr 2017 14:48:40 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To:
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID:

Hi Miguel,

Thanks for summarizing all of the discoveries; it's well described. This needs to be documented in the paper, and in addition, as Lincoln suggested on Monday, it would be good to have some description in the README of the workflow.
Cheers,
Junjun
We could have a look at these reads, and try to trace this issue, but as Lincoln mentioned, maybe we should switch our focus to the other containers, and consider BWA-Mem validated given the observed small mismatch rates. It seems as if differences in the sequence of bam file processing only have a small effect on the final result (in this case at least). Could it be that the order of reads within the original bams has a bigger effect than the order of the bams themselves? Thanks, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 11 Apr 2017, at 12:00, Miguel Vazquez > wrote: Hi Jonas, About the BAM order in the header I have some lines that start with @PG and then have a "CL:" field with the command line, I guess you are referring to those. The order is actually 1,2,3,4,5 and 6, which is the one I used. About the numbers, they are almost the same, yet not entirely the same. There are 442902 miss-matches in yours and 442926 in mine. So it appears that 24 of my miss-matches became soft-matches in yours. I would have expected a move between matches and soft-matches but not with miss-matches and soft-matches. It's a bit odd. I could send you the list of my miss-matches and we can find out which are the ones that moved and find out why. On Tue, Apr 11, 2017 at 11:43 AM, Jonas Demeulemeester > wrote: Hi all, I?ve completed the testing run of the BWA-Mem docker on PCAWG donor DO51057. Briefly, like Miguel?s run, this test used the original unaligned bam files for DO51057, but feeds them via the JSON file into the docker in a slightly different order (as recorded in the @PG/@CL lines in the original mapped PCAWG bams) Results of the comparison are as follows: Matched normal: Lines: 1125172217 Matches: 1083221794 Misses: 143668 Soft: 41806755 Tumor: Lines: 1010685786 Matches: 963319037 Misses: 442902 Soft: 46923847 Which are exactly the numbers reported by Miguel (resulting in 0.043% and 0.013% mismatch rates). The fact that the numbers match exactly comes as a bit of a surprise I think, but shows that the current pipeline is highly reproducible, even across platforms. @Miguel, could you verify the order of mapping of the different read groups in the header of your newly mapped bams. For the original and newly mapped normal bams the order recorded in the @CL lines is CPCG_0098_Ly_R_PE_517_WG.3 - 6 - 1 - 4 - 2 - 5 For the tumor bams the order is: CPCG_0098_Pr_P_PE_500_WG.5 - 3 - 6 - 2 - 4 - 1 Do you observe the same or a different order? If it?s the same, then the pipeline does some internal reordering and the order of records in the JSON doesn?t matter. If not, then the order of the bams doesn?t seem to matter as much (at least in this case), but maybe rather the order of reads within the bams (as evidenced by our high error rates previously). Looking forward to hearing your thoughts on this! Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 10 Apr 2017, at 13:15, Miguel Vazquez > wrote: Hi all, The comparison with the tumor BAM for DO51057 has completed with rates of miss-maches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. 
On 10 Apr 2017, at 13:15, Miguel Vazquez wrote:
Hi all,

The comparison with the tumor BAM for DO51057 has completed, with rates of miss-matches (0.043%) and soft-matches (4.64%) just slightly higher than for the normal BAM. These numbers are not definitive since, as you can read from Jonas just below, there might still be a discrepancy in the order in which the BAMs were processed. We'll soon know from Jonas whether a different order improves these rates further.

Lines: 1010685786 Matches: 963319037 Misses: 442926 Soft: 46923823

Best regards
Miguel

On Mon, Apr 10, 2017 at 1:14 PM, Jonas Demeulemeester wrote:
Hi all,

I'm currently running the comparison of the BWA-Mem docker reproduced BAMs and the PCAWG ones for DO51057. I should be able to send a report some time today. Miguel, looking at your code, I believe you're feeding the unaligned BAMs into the pipeline in the order given by the read group lines (@RG) in the header of the PCAWG BAM. I'm using the order recorded in the command line/programs used (@CL/@PG) lines of the PCAWG BAM, which is often different, for whatever reason. I'm not entirely sure which one is correct, but I'm guessing the one in the @CL/@PG lines is the actual one, as it chronologically reiterates the whole procedure ( [align - sort] x N followed by merge + flag dups ). If this is the case, the true % mismatches may be lower still than 0.013%; if not, then I should see a higher mismatch rate and the 0.013% is due to something else still. Regarding the soft-matches, I agree with Junjun: we may want to ask the people behind the variant callers, but I guess they are probably dealing with these multiply-mapping reads internally.

Best,
Jonas

On 9 Apr 2017, at 15:47, Junjun Zhang wrote:
Hi Miguel,

This is indeed good news; the mismatch is significantly lower. Regarding soft matches, thanks for the explanation. I wonder whether it has an impact (or how much impact) on variant calls: do variant callers take into account the information that a read may map to multiple places? Do they make adjustments at the time of variant calling? I guess these are questions for the variant caller authors.

Thanks,
Junjun

From: Miguel Vazquez
Date: Thursday, April 6, 2017 at 11:36 AM
To: Lincoln Stein
Cc: Francis Ouellette, Keiran Raine, docktesters at lists.icgc.org
Subject: Re: [DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches

Hi Lincoln,

Soft-match means that the alignment position in the new BAM is not the same as the one in the original BAM, but is included in the list of alternative alignments for that read. For instance, the original BAM aligns a read to chr 1 pos 1000, but also admits that it could be aligned at chr 2 pos 2000 or chr 3 pos 3000; the new BAM aligns it at chr 2 pos 2000, which is not the position chosen by the original BAM but is in the alternative list. It could also work the other way: the original position is included in the list of alternative positions of the new BAM. I hope this was clear.

Best regards
Miguel
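A match/soft-match/miss classification along these lines can be expressed compactly. This is a minimal reconstruction of the idea, not the actual compare script; it assumes one line per read with the original and new records already joined (name, chrom, pos and XA tag from each BAM):

  # Fields: 1=name 2=chr_orig 3=pos_orig 4=xa_orig 5=chr_new 6=pos_new 7=xa_new
  # XA entries look like XA:Z:15,-102516528,76M,0;... so anchor on the separators.
  awk '{
    if ($2 == $5 && $3 == $6)                        m++   # same chrom and position
    else if ($4 ~ ("(XA:Z:|;)" $5 ",[+-]" $6 ","))   s++   # new pos in original XA list
    else if ($7 ~ ("(XA:Z:|;)" $2 ",[+-]" $3 ","))   s++   # original pos in new XA list
    else                                             x++   # miss
  } END { print "Matches:", m, "Misses:", x, "Soft:", s }' joined.txt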
On Thu, Apr 6, 2017 at 4:55 PM, Lincoln Stein wrote:
Hi Miguel,

Sounds like a significant achievement! But remind me, what is a "soft match"?

Lincoln

On Thu, Apr 6, 2017 at 10:28 AM, Miguel Vazquez wrote:
Dear all,

This is just an advance teaser for the BWA-Mem validation after the latest changes. It is currently running over the tumor BAM, but the normal BAM has completed, and the miss-matches are two orders of magnitude lower than in our two previous attempts. Before further discussion, here are the raw numbers:

Lines: 1125172217 Matches: 1083221794 Misses: 143716 Soft: 41806707

If my calculations are correct, this means 96.3% matches, 0.013% miss-matches, and 3.7% soft-matches.

The fix had two parts. The first was realizing that the input of this process should not be a single unaligned version of the output BAMs, but several input BAMs. Breaking down the output BAM into its constituent BAMs, by a process implemented by Jonas, did not address the problem, unfortunately. After this first attempt it was pointed out to us, I think by Keiran, that the order of the reads matters, and so our attempt to work back from the output BAM was not going to work. Junjun came back to us with the second part of the fix: he located a subset of original unaligned BAMs at DKFZ that we could use. Downloading these BAM files and submitting them to BWA-Mem in the same order as specified in the output BAM header achieved these promising results. I will reply to this message in a few days with the corresponding numbers for the other BAM, the tumor, which is currently running.

Best regards
Miguel

On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez wrote:
Dear all,

Great news! The BWA-Mem test on a real PCAWG donor succeeded in running, achieving an overlap with the original BAM alignment similar to the HCC1143 test. The numbers are:

Lines: 1708047647 Matches: 1589172843 Misses: 62726130 Soft: 56148674

This means 93% matches, 3.6% miss-matches, and 3.2% soft-matches. Compared to the HCC1143 test, a few percentage points of matches turn into soft-matches (95% and 1.3% versus 93% and 3.2%), but the ratio of misses, 3.6%, is very close. I'm running this test on a second donor.

Best regards
Miguel

On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez wrote:
Dear colleagues,

I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143 data. I think what solved the problem was setting the headers on the unaligned BAM files. I'm currently trying it out with the DO35937 donor, but it's too early to say whether it's working or not.

To compare BAM files I followed some advice found online (https://www.biostars.org/p/166221/). I will detail the steps below because I would like some advice on how appropriate the approach is, but first, here are the numbers:

Lines: 74264390 Matches: 70565742 Misses: 2693687 Soft: 1004961

This means 95% matches, 3.6% miss-matches, and 1.3% soft-matches. Matches are when the chromosome and position are the same; soft-matches are when they are not the same but the position from one of the alignments is included in the list of alternative positions for the other alignment (e.g. XA:Z:15,-102516528,76M,0); and misses are the rest.
Here is the detailed process from the start (the comparison script is at https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh):

1) Un-align the tumor and normal BAM files, retaining the original aligned BAM files.
2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with alignments from both tumor and normal.
3) Use samtools to extract the entries, limited to the first read in each pair (?); cut the read-name, chromosome, position (??) and extra information (for additional alignments); and sort them. We do this for the original files and for the BWA-Mem merged_output file, but separating tumor and normal entries (marked with the codes 'tumor' and 'normal', I believe from the headers I set when un-aligning them). See the sketch after this message.
4) Join the lines by read-name, separately for the tumor and normal pairs of files, and check for matches.

I have two questions:
(?) Is it OK to select only the first read in each pair? It's what was done in the example I followed, and it simplified the code by avoiding repeated read-names.
(??) I assume it's OK to only check chromosome and position; the CIGAR would necessarily be the same.

Best regards
Miguel
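In shell terms, steps 3 and 4 roughly correspond to the following minimal sketch. It is not the actual compare_bwa_bam.sh; file names are placeholders, the tumor/normal separation is omitted, and flag 0x40 selects first-in-pair records:

  extract () {
    samtools view -f 64 "$1" |                          # first-in-pair records only
      awk -F'\t' '{ xa = "-"
                    for (i = 12; i <= NF; i++) if ($i ~ /^XA:Z:/) xa = $i
                    print $1, $3, $4, xa }' |           # name, chrom, pos, XA tag
      LC_ALL=C sort -k1,1
  }
  extract original_normal.bam       > orig.txt
  extract HCC1143.merged_output.bam > new.txt
  join orig.txt new.txt > joined.txt                    # join on read name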
On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez wrote:
Dear all,

Let me summarize the status of the testing for Sanger and DKFZ. The validation has been run for two donors for each workflow: DO50311 and DO52140.

Sanger:
----------
Sanger calls only somatic variants. The results are identical for Indels and SVs, and almost identical for SNV.MNV and CNV. The discrepancies are reproducible (on the same machine at least), i.e. the same ones are found after running the workflow a second time.

DKFZ:
---------
DKFZ calls somatic and germline variants, except germline CNVs. For both germline and somatic variants the results are identical for SNV.MNV and Indels, but with large discrepancies for SV and CNV. Kortine Kleinheinz and Joachim Weischenfeldt are in the process of investigating this issue, I believe.

BWA-Mem failed for me, and has also failed for Denis Yuen and Jonas Demeulemeester. Denis, I believe, is investigating this problem further. I haven't had the chance to investigate this much myself.

Best
Miguel

--------------------- RESULTS ---------------------
ubuntu at ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt
Comparison of somatic.snv.mnv for DO50311 using DKFZ --- Common: 51087 Extra: 0 Missing: 0
Comparison of somatic.indel for DO50311 using DKFZ --- Common: 26469 Extra: 0 Missing: 0
Comparison of somatic.sv for DO50311 using DKFZ --- Common: 231 Extra: 44 - Example: 10:20596800:N:,10:56066821:N:,11:16776092:N: Missing: 48 - Example: 10:119704959:N:,10:13116322:N:,10:47063485:N:
Comparison of somatic.cnv for DO50311 using DKFZ --- Common: 731 Extra: 213 - Example: 10:132510034:N:,10:20596801:N:,10:47674883:N: Missing: 190 - Example: 10:100891940:N:,10:104975905:N:,10:119704960:N:
Comparison of germline.snv.mnv for DO50311 using DKFZ --- Common: 3850992 Extra: 0 Missing: 0
Comparison of germline.indel for DO50311 using DKFZ --- Common: 709060 Extra: 0 Missing: 0
Comparison of germline.sv for DO50311 using DKFZ --- Common: 1393 Extra: 231 - Example: 10:134319313:N:,10:134948976:N:,10:19996638:N: Missing: 615 - Example: 10:101851839:N:,10:101851884:N:,10:10745225:N:
File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz
Comparison of somatic.snv.mnv for DO52140 using DKFZ --- Common: 37160 Extra: 0 Missing: 0
Comparison of somatic.indel for DO52140 using DKFZ --- Common: 19347 Extra: 0 Missing: 0
Comparison of somatic.sv for DO52140 using DKFZ --- Common: 72 Extra: 23 - Example: 10:132840774:N:,11:38252019:N:,11:47700673:N: Missing: 61 - Example: 10:134749140:N:,11:179191:N:,11:38252005:N:
Comparison of somatic.cnv for DO52140 using DKFZ --- Common: 275 Extra: 94 - Example: 1:106505931:N:,1:109068899:N:,1:109359995:N: Missing: 286 - Example: 10:88653561:N:,11:179192:N:,11:38252006:N:
Comparison of germline.snv.mnv for DO52140 using DKFZ --- Common: 3833896 Extra: 0 Missing: 0
Comparison of germline.indel for DO52140 using DKFZ --- Common: 706572 Extra: 0 Missing: 0
Comparison of germline.sv for DO52140 using DKFZ --- Common: 1108 Extra: 1116 - Example: 10:102158308:N:,10:104645247:N:,10:105097522:N: Missing: 2908 - Example: 10:100107032:N:,10:100107151:N:,10:102158345:N:
File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz
Comparison of somatic.snv.mnv for DO50311 using Sanger --- Common: 156299 Extra: 1 - Example: Y:58885197:A:G Missing: 14 - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C
Comparison of somatic.indel for DO50311 using Sanger --- Common: 812487 Extra: 0 Missing: 0
Comparison of somatic.sv for DO50311 using Sanger --- Common: 260 Extra: 0 Missing: 0
Comparison of somatic.cnv for DO50311 using Sanger --- Common: 138 Extra: 0 Missing: 0
Comparison of somatic.snv.mnv for DO52140 using Sanger --- Common: 87234 Extra: 5 - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A Missing: 7 - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A
Comparison of somatic.indel for DO52140 using Sanger --- Common: 803986 Extra: 0 Missing: 0
Comparison of somatic.sv for DO52140 using Sanger --- Common: 6 Extra: 0 Missing: 0
Comparison of somatic.cnv for DO52140 using Sanger --- Common: 36 Extra: 0 Missing: 2 - Example: 10:11767915:T:,10:11779907:G:
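The Common/Extra/Missing counts in these reports are set comparisons over variant keys. A minimal reconstruction of the idea (the actual script lives in the PCAWG-Docker-Test repo; paths are placeholders):

  # Reduce a VCF to chrom:pos:ref:alt keys, then compare the two key sets.
  keys () { zcat "$1" | grep -v '^#' | awk -F'\t' '{ print $1":"$2":"$4":"$5 }' | LC_ALL=C sort -u; }
  keys original/DO50311.somatic.sv.vcf.gz > orig.keys
  keys output/DO50311.somatic.sv.vcf.gz   > new.keys
  comm -12 orig.keys new.keys | wc -l    # Common
  comm -13 orig.keys new.keys | wc -l    # Extra   (only in the new run)
  comm -23 orig.keys new.keys | wc -l    # Missing (only in the original)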
--
Lincoln Stein
Scientific Director (Interim), Ontario Institute for Cancer Research
Director, Informatics and Bio-computing Program, OICR
Senior Principal Investigator, OICR
Professor, Department of Molecular Genetics, University of Toronto
Ontario Institute for Cancer Research, MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, Canada M5G 0A3
Tel: 416-673-8514  Email: lincoln.stein at gmail.com

From Jonas.Demeulemeester at crick.ac.uk Tue Apr 11 11:34:06 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Tue, 11 Apr 2017 15:34:06 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To:
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID: <1172211A-6114-4AAB-AE6D-E63A20555157@crick.ac.uk>

Hi all,

I'll try and document the findings on the workflow git page. I'll also do one final check by regenerating the unaligned BAMs and realigning for DO51057, just so we have an exact comparison for this sample.

Cheers,
Jonas

On 11 Apr 2017, at 15:48, Junjun Zhang wrote:
Hi Miguel,

Thanks for summarizing all of the discoveries; it's well described. This needs to be documented in the paper, and in addition, as Lincoln suggested Monday, it would be good to have some description in the README of the workflow.

Cheers,
Junjun
From Christina.Yung at oicr.on.ca Tue Apr 11 12:16:11 2017
From: Christina.Yung at oicr.on.ca (Christina Yung)
Date: Tue, 11 Apr 2017 16:16:11 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
In-Reply-To:
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID:

Hi Miguel,

Thanks for the explanation. My concern is that the majority of the unaligned BAMs have been deleted, and the remaining ones will eventually be deleted to reduce storage.
So the aligned BAMs will be the starting point for any users. Without the unaligned BAMs, can you figure out from the header of the aligned BAMs in what order the lane BAMs were run?

Christina
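The mechanical part of that reconstruction can be done from the aligned BAM alone; a minimal sketch, assuming samtools and Picard are available (file names are placeholders, and the read order inside each lane is not recoverable, which is the ~3% caveat discussed above):

  # One BAM per @RG (lane): %* = input basename, %! = readgroup ID, %. = extension
  samtools split -f '%*_%!.%.' DO51057_normal.bam
  # Strip the alignment information from each lane BAM
  for lane in DO51057_normal_*.bam; do
    java -jar picard.jar RevertSam I="$lane" O="unaligned_$lane" \
      REMOVE_ALIGNMENT_INFORMATION=true SANITIZE=true
  done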
From miguel.vazquez at cnio.es Tue Apr 11 12:31:53 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Tue, 11 Apr 2017 18:31:53 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>

Hi Christina,

There are two issues:

1- Splitting the BAM files and running them in the right order. Can do.
2- That the order of the reads *inside* a BAM is the same. Cannot fix.

So if we would like the inquisitive user to reproduce the alignment process from the available aligned BAM, we need to tell them that the *reads* are not in the same order and that about 3% of the reads will be aligned differently.

Compared to the problem of the read order, the problem with the BAM splitting and ordering is negligible; in fact, splitting the BAM did, I believe, nothing at all to our numbers, so we might as well not bother, I think, but there are people better suited than me to make this call.

In short, we can claim:

1) that the process is reproducible to 99.99 percent using the original unaligned BAM files
2) that working back from the aligned BAM one is able to reproduce the results to 96% accuracy, the loss of accuracy apparently due to different read ordering.

The process to reproduce in 2) could be the simple one, just un-aligning the BAM, or the more elaborate one that involves splitting the BAM and feeding it in the right order, which does not seem to improve anything.

From miguel.vazquez at cnio.es Tue Apr 11 13:01:14 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Tue, 11 Apr 2017 19:01:14 +0200
Subject: [DOCKTESTERS] Help request: get donor file information from ICGC DCC programmatically

Dear friends,

For our upcoming testing work on the filters I think we will be using the SNV and SV files that were submitted by the different providers and comparing them with the resulting final VCFs. To access these I was planning to use GNOS or the ICGC client, for which we will need to specify the "Object ID" or "Submitted Bundle ID" found in the file pages of ICGC (e.g. https://dcc.icgc.org/repositories/files/FI384359). For the normal and tumor BAMs and for the consensus VCF I was going to the site and extracting them manually, but this is tedious and error prone, and it's the only really manual step in the scripts. It would be great to automate it.

Is there a programmatic way to find all the files for a donor, with their names, and then access the "Object ID" and "Submitted Bundle ID" for any of them?

Best regards

Miguel
From kr2 at sanger.ac.uk Tue Apr 11 16:37:48 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Tue, 11 Apr 2017 20:37:48 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca>
Message-ID: <9D77245E-5018-4861-8F9E-33ADD163DC26@sanger.ac.uk>

Hi,

Please be aware that failure to split the BAMs by readgroup for remapping, so that lanes/readgroups are tagged appropriately, has implications for the analysis algorithms.

I'm unsure how you would remap without splitting by readgroup when libraries can differ between readgroups (not splitting would result in a loss of metadata).

For example, the CaVEMan (SNV) caller uses the readgroup as a covariate to ensure that lane-to-lane artefacts are modelled correctly. In both cgpPindel (indel) and BRASS (SV) the insert size for the individual readgroups needs to be correct; this is skewed if data is merged during mapping.

An example from cgpPindel in our internal test process showed that starting from the exact same read order in the individual lane/readgroup BAMs but merging the files in a different order could result in minor changes to indel calls (items that failed filtering). We found this was due to reads from the same location being presented to the core algorithm in a different order.

Hope this is useful.

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute
kr2 at sanger.ac.uk
Tel: +44 (0)1223 834244 Ext: 4983
Office: H104
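A minimal sketch of the per-readgroup splitting Keiran describes, assuming a reasonably recent samtools; the format tokens and the bamtofastq revert step are one possible tooling choice, not the one prescribed in this thread (check `samtools split --help` locally):

# Split a merged BAM into one BAM per @RG readgroup before remapping,
# so per-lane library metadata and insert sizes are preserved.
# '%*' = input basename, '%!' = @RG ID (tokens assumed; verify locally).
samtools split -f '%*_%!.bam' merged.bam

# Each per-lane BAM can then be reverted to unaligned reads and remapped,
# e.g. with biobambam2's bamtofastq (options assumed; see its docs):
for rg in merged_*.bam; do
    bamtofastq filename="$rg" F="${rg%.bam}_1.fq.gz" F2="${rg%.bam}_2.fq.gz" gz=1
done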
From miguel.vazquez at cnio.es Wed Apr 12 09:11:54 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Wed, 12 Apr 2017 15:11:54 +0200
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca> <9D77245E-5018-4861-8F9E-33ADD163DC26@sanger.ac.uk>

Hi Keiran,

I don't quite follow all the details, but I understand you think that in working back from the aligned BAMs we should make sure we split the BAM files. In our tests, splitting the BAM when working back from the aligned BAM did not noticeably improve the match rate of aligned reads; however, if I understood you correctly, downstream algorithms like SNV and indel callers could still be affected by not splitting the BAM. If so, this is an interesting observation, though I don't think our testing will cover running these methods over the re-aligned BAM files, so we would not run into this scenario.

Finally, Keiran, what is your opinion on the discussion about working back from the aligned BAMs? Could you summarize for us again why you think there are 3% mismatched reads when using the rolled-back split BAMs but only 0.01% when using the original unaligned BAMs, and whether there is any possibility of addressing this?

Thanks for your input

Miguel
From kr2 at sanger.ac.uk Wed Apr 12 10:27:50 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Wed, 12 Apr 2017 14:27:50 +0000
Subject: [DOCKTESTERS] BWA-Mem validation of DO51057 normal) 0.013% miss-matches, and 3.7% soft-matches, tumor) 0.043% miss-matches, and 4.64% soft-matches
References: <0D2E0A29-1F15-4FCC-9025-26FDC1FBA3CF@crick.ac.uk> <2c0574cf-72d1-9167-86c8-4d04a138f8b7@oicr.on.ca> <9D77245E-5018-4861-8F9E-33ADD163DC26@sanger.ac.uk>

Hi Miguel,

When the original mapping was performed, the input files came from many different sources and the read order (as noted) would have been different. The classes I'm aware of would be:

- Lane BAM generated from raw sequenced FASTQ (i.e. sequencing ordered)
- Lane BAM generated from BWA-aln mapped data (different mapping output compared to BWA-mem)
- Lane BAM generated from mappings to a different reference

The same data going into the process for each of these would result in a different read order on entry into the PCAWG mapping flow. BWA internally splits the data into blocks and estimates the insert distribution required to determine reads as properly paired within each block. If the data has previously been through mapping, all of the well-mapped data clusters together, and the unmapped and aberrant pairs are no longer distributed throughout the input, which changes the insert size distribution.

BWA is additionally affected by the number of threads in use if an additional (hidden) parameter is not set to make the blocks of reads consistent. The option may not be in the version used in PanCancer. We specified a set thread count to prevent the problem, but this option allows the thread count to vary:

* -K 10e8 :: hidden (yay!) option that eliminates randomness in chunking when using threads so that results can be deterministic.
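A minimal sketch of a deterministic invocation along these lines; the batch size is written out as an integer (treat the exact value as an assumption; the message above cites 10e8), and the file names and readgroup string are placeholders:

# Fix the per-batch input size (-K, in bases) so block chunking, and hence
# the insert-size estimates, no longer depend on the thread count (-t).
bwa mem -t 8 -K 1000000000 \
    -R '@RG\tID:lane1\tSM:sample1\tLB:lib1' \
    reference.fa lane1_1.fq.gz lane1_2.fq.gz > lane1.sam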
If you can independently take the same source BAM on two different setups, split and remap with the results being a match, then you prove reproducibility for the same input (this was done at the beginning of the project, so it should still be true). FYI, when comparing within our group we don't consider reads with MAPQ=0.

A final item that may affect the read matching (depending on how your matching works) is that when merging the remapped data, reads mapped to the same location are inserted based on the file order in which they are presented to the code. For example, take 3 reads mapped to the same start location from 3 different lanes:

f1 = ra @ 1:1000
f2 = rb @ 1:1000
f3 = rc @ 1:1000

# merge/markdup or the like:
bammerge I=f1 I=f2 I=f3   # read order: ra, rb, rc
bammerge I=f3 I=f1 I=f2   # read order: rc, ra, rb

I hope this helps/clarifies things,

Regards,
Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute
kr2 at sanger.ac.uk
Tel: +44 (0)1223 834244 Ext: 4983
Office: H104

From miguel.vazquez at cnio.es Mon Apr 17 06:39:51 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Mon, 17 Apr 2017 12:39:51 +0200
Subject: [DOCKTESTERS] Help request: get donor file information from ICGC DCC programmatically

Hi Jonas, Denis, and Dusan,

Jonas, the release_may2016 file you mention does indeed have some of the GNOS IDs we would need, but not all, unfortunately. For instance, the consensus VCF file does not seem to be there.

Denis, Dusan: I think I figured out how to do this. I download the full TSV export with

curl -X GET --header 'Accept: text/tsv' 'https://dcc.icgc.org/api/v1/repository/files/export?filters=%7B%7D'

extract all the files for a donor, and then for each file I download the associated JSON info with

curl -X GET --header 'Accept: application/json' "https://dcc.icgc.org/api/v1/repository/files/$file"

There I can find the file name along with the IDs I need to download it.

Best

Miguel
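Stitched together, the lookup might look like the sketch below. The donor filter and the jq field name for the object ID are assumptions (the thread only says the JSON record contains the needed IDs); inspect one payload and adjust:

#!/bin/bash
# Download the full repository file index once (TSV, one row per file copy).
curl -s --header 'Accept: text/tsv' \
    'https://dcc.icgc.org/api/v1/repository/files/export?filters=%7B%7D' > files.tsv

# Pull the FI... file IDs for one donor out of the export, then fetch each
# file's JSON record. Field names below are assumptions to verify.
donor=DO50311
grep "$donor" files.tsv | grep -o 'FI[0-9]\+' | sort -u | while read -r fid; do
    curl -s --header 'Accept: application/json' \
        "https://dcc.icgc.org/api/v1/repository/files/$fid" \
        | jq -r '"\(.id)\t\(.objectId // "unknown")"'   # assumed field: objectId
done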
On Wed, Apr 12, 2017 at 5:38 PM, Denis Yuen wrote:

> Hi,
>
> As I understand it, there is an API for the portal:
> http://docs.icgc.org/portal/api-endpoints/#/
>
> I'm going to also forward this to Dusan, who may be able to point you at a
> more specific endpoint to use or in the correct direction.

From Junjun.Zhang at oicr.on.ca Mon Apr 17 09:03:17 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Mon, 17 Apr 2017 13:03:17 +0000
Subject: [DOCKTESTERS] Help request: get donor file information from ICGC DCC programmatically

Hi Miguel,

That should work. For the 'export', if you'd like, it's possible to add suitable filters so that the export only gives you the files of interest, for example only PCAWG files of a certain type.

Cheers,

Junjun
From Denis.Yuen at oicr.on.ca Mon Apr 17 09:57:10 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Mon, 17 Apr 2017 13:57:10 +0000
Subject: [DOCKTESTERS] sv-merge parameter questions

Hi,

Miguel is currently testing out the sv-merge workflow (https://bitbucket.org/weischenfeldt/pcawg_sv_merge) with real data (as opposed to the testing data in https://bitbucket.org/weischenfeldt/pcawg_sv_merge/src/ced14ae88a2fbfb7274eb71834121dca0de81236/Dockstore.json?at=docker&fileviewer=file-view-default).

I believe he has some questions about where to find the VCF parameter files on either the ICGC portal or on GNOS for a particular donor. Could either of you shed some light on this? Thanks!

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research

From miguel.vazquez at cnio.es Mon Apr 17 10:20:17 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Mon, 17 Apr 2017 16:20:17 +0200
Subject: [DOCKTESTERS] sv-merge parameter questions

Thanks Denis for putting us in contact.
If it helps, here is a list of the files I find in the ICGC DCC for a particular donor (in format $type.$specimen.$filename):

CNSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.dkfz-copyNumberEstimation_1-0-189.20150817.somatic.cnv.vcf.gz
CNSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.svcp_1-0-5.20150701.somatic.cnv.vcf.gz
SGV.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-snowman-10.20151223.germline.indel.vcf.gz
SGV.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.dkfz-indelCalling_1-0-132-1.20150817.germline.indel.vcf.gz
SGV.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.dkfz-snvCalling_1-0-132-1.20150817.germline.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-mutect-v3.20160222.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-snowman-10.20151223.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.consensus.20160830.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.dkfz-indelCalling_1-0-132-1.20150817.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.dkfz-snvCalling_1-0-132-1.20150817.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.MUSE_1-0rc-b391201-vcf.20151223.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.svcp_1-0-5.20150701.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.svcp_1-0-5.20150701.somatic.snv_mnv.vcf.gz
StGV.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-snowman-10.20151223.germline.sv.vcf.gz
StGV.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.embl-delly_1-3-0-preFilter.20150817.germline.sv.vcf.gz
StSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-dRanger-10.20151223.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-dRanger_snowman-10.20151223.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.broad-snowman-10.20151223.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.embl-delly_1-3-0-preFilter.20150817.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.2ea2294d-fab9-43ae-a222-370487495b06.svfix2_4-0-12.20160208.somatic.sv.vcf.gz
From mikisvaz at gmail.com Fri Apr 21 14:49:47 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Fri, 21 Apr 2017 20:49:47 +0200
Subject: [DOCKTESTERS] Problems with Dockstore

Hi Brian et al.

I was trying out a new container and I got an error. I've also tried upgrading the dockstore client.

ubuntu@ip-10-42-6-176:~/DockerTest-Miguel$ dockstore tool convert entry2json --entry registry.hub.docker.com/essi/pcawg_sv_filter:testing > Dockstore.json
Invalid tag
io.swagger.client.ApiException: Invalid version.

I'm not sure how to fix this.

Best

Miguel

From Denis.Yuen at oicr.on.ca Fri Apr 21 15:21:47 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Fri, 21 Apr 2017 19:21:47 +0000
Subject: [DOCKTESTERS] Problems with Dockstore

Hi,

Unfortunately, it looks like pcawg_sv_filter really doesn't exist on Dockstore, or at least the developer of that tool hasn't registered any versions of it.

Did you mean to use sv merge? https://dockstore.org/containers/registry.hub.docker.com/essi/pcawg_sv_merge

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research

From mikisvaz at gmail.com Fri Apr 21 15:27:57 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Fri, 21 Apr 2017 21:27:57 +0200
Subject: [DOCKTESTERS] Problems with Dockstore

Thanks Denis,

I just saw the entry in dockstore and went ahead with it: https://dockstore.org/containers/registry.hub.docker.com/essi/pcawg_sv_filter

I'm also working on the SV merge right now.

Best

M
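For reference, entry2json only resolves tags that are actually registered on Dockstore; the "Invalid version" above came from the unregistered 'testing' tag. A sketch of the shape a working invocation takes, using the weischenfeldt/pcawg_sv_merge entry and its 1.0.2 tag mentioned later in this thread as the example:

# Generate a parameter-file template for a registered tool version.
dockstore tool convert entry2json \
    --entry registry.hub.docker.com/weischenfeldt/pcawg_sv_merge:1.0.2 > Dockstore.json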
From mikisvaz at gmail.com Wed Apr 26 06:47:24 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 26 Apr 2017 12:47:24 +0200
Subject: [DOCKTESTERS] Delly workflow version in Dockstore

Hi all,

I'm trying to clarify the status of the Delly-DKFZ workflow testing. As far as I remember, we found some discrepancies in CNV and SV that apparently traced back to changes in the Delly workflow. After that, I admit I must have lost track, and I don't remember addressing any updates.

I now see 3 different Delly workflows in Dockstore:

https://dockstore.org/containers/registry.hub.docker.com/francescof/pcawg_delly_workflow
https://dockstore.org/containers/registry.hub.docker.com/francescof/finsen_delly_workflow
https://dockstore.org/containers/quay.io/pancancer/pcawg_delly_workflow

The last one is the one I used, more precisely version 2.0.1-cwl1.0. Should I change to a different one?

Best regards

Miguel

From Denis.Yuen at oicr.on.ca Wed Apr 26 10:28:28 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Wed, 26 Apr 2017 14:28:28 +0000
Subject: [DOCKTESTERS] Delly workflow version in Dockstore
Message-ID: <9290defae51048de8642d6c8c8f1923b@oicr.on.ca>

Hi,

My current understanding is that the last one would be the "canonical" version for the purposes of presenting the work that we did in PCAWG. You have the correct version according to my notes as well. I think the first two are experiments that were being done by Francesco to further improve the workflow for future use. For now, I do not believe you need to change versions.
From mikisvaz at gmail.com Thu Apr 27 07:42:02 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Thu, 27 Apr 2017 13:42:02 +0200
Subject: [DOCKTESTERS] Successful run of SV-Merge workflow, but no validation, on donors DO50311 and DO52526

Dear all,

I'm very pleased to announce that the SV-Merge workflow now runs successfully, thanks to the changes from Francesco Favero. You can read the details below if you are interested.

The next issue remaining is to validate these results against the official release. The files I have listed from ICGC for each donor are the following:

CNSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.cnv.vcf.gz
SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.germline.indel.vcf.gz
SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-indelCalling_1-0-132-1-hpc.1510221331.germline.indel.vcf.gz
SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-snvCalling_1-0-132-1-hpc.1510221331.germline.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-mutect-v3.20160222.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.consensus.20160830.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.consensus.20161006.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-indelCalling_1-0-132-1-hpc.1510221331.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-snvCalling_1-0-132-1-hpc.1510221331.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.MUSE_1-0rc-vcf.20151023.somatic.snv_mnv.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.indel.vcf.gz
SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.snv_mnv.vcf.gz
StGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.germline.sv.vcf.gz
StGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.embl-delly_1-0-0-preFilter-hpc.150715.germline.sv.vcf.gz
StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-dRanger.20151023.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-dRanger_snowman.20151023.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.embl-delly_1-0-0-preFilter-hpc.150715.somatic.sv.vcf.gz
StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svfix2_4-0-12.20160208.somatic.sv.vcf.gz

While the resulting files from the workflow are:

DO52526.log
DO52526.somatic.sv.bedpe
DO52526.somatic.sv_full.bedpe
DO52526.somatic.sv_full.vcf.gz
DO52526.somatic.sv_full.vcf.gz.tbi
DO52526.somatic.sv.stat
DO52526.somatic.sv.vcf.gz
DO52526.somatic.sv.vcf.gz.tbi
DO52526.sv.tar.gz

Is there a correspondence of files that I can check for differences?

Best

Miguel

On Thu, Apr 27, 2017 at 1:31 PM, Miguel Vazquez wrote:

> Hi Francesco,
>
> Thank you very much for promptly resolving this issue. I can confirm that the workflow runs nicely and the outputs are also collected correctly. I'll update the wiki to reflect that the workflow runs successfully.
>
> As with the filter workflows, the next issue is knowing what I need to compare this with, to check whether there are any differences with the officially released file. Do you know?
>
> Best regards,
>
> Miguel
>
> On Thu, Apr 27, 2017 at 12:10 PM, Francesco Favero wrote:
>
>> Hi Miguel,
>>
>> I've added some text in the README in the git repo and slightly changed the behaviour of the tool to fix your problem.
>>
>> We had to change the URL of the docker container, as we moved the image to our docker-hub group account:
>> https://dockstore.org/containers/registry.hub.docker.com/weischenfeldt/pcawg_sv_merge
>> The stable version is 1.0.2 (it's also the only one in dockstore so far).
>>
>> We have tested it, and it works on our end. I hope it will get better for you as well.
>>
>> The following is not a testing issue, but since Denis is in cc it's still useful to point it out:
>> I believe we need to change the tool name to end with "workflow", as per the guidelines.
>> I'm afraid that to do so effectively, we need to change also the git repository name and/or the docker container image. Is that right?
>>
>> Thanks
>>
>> Best
>>
>> Francesco
You can read bellow the details if you are interested. The next issue remaining is to validate these results against the official release. The files I have listed from ICGC for each donor are the following: CNSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.cnv.vcf.gz SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.germline.indel.vcf.gz SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-indelCalling_1-0-132-1-hpc.1510221331.germline.indel.vcf.gz SGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-snvCalling_1-0-132-1-hpc.1510221331.germline.snv_mnv.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-mutect-v3.20160222.somatic.snv_mnv.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.somatic.indel.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.consensus.20160830.somatic.snv_mnv.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.consensus.20161006.somatic.indel.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-indelCalling_1-0-132-1-hpc.1510221331.somatic.indel.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.dkfz-snvCalling_1-0-132-1-hpc.1510221331.somatic.snv_mnv.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.MUSE_1-0rc-vcf.20151023.somatic.snv_mnv.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.indel.vcf.gz SSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svcp_1-0-3.20150120.somatic.snv_mnv.vcf.gz StGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.germline.sv.vcf.gz StGV.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.embl-delly_1-0-0-preFilter-hpc.150715.germline.sv.vcf.gz StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-dRanger.20151023.somatic.sv.vcf.gz StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-dRanger_snowman.20151023.somatic.sv.vcf.gz StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.broad-snowman.20151023.somatic.sv.vcf.gz StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.embl-delly_1-0-0-preFilter-hpc.150715.somatic.sv.vcf.gz StSM.Primary tumour - solid tissue.131332b2-ff51-4bd7-a626-aff2ecea6135.svfix2_4-0-12.20160208.somatic.sv.vcf.gz While the resulting files from the workflow are: DO52526.log DO52526.somatic.sv.bedpe DO52526.somatic.sv_full.bedpe DO52526.somatic.sv_full.vcf.gz DO52526.somatic.sv_full.vcf.gz.tbi DO52526.somatic.sv.stat DO52526.somatic.sv.vcf.gz DO52526.somatic.sv.vcf.gz.tbi DO52526.sv.tar.gz Is there a correspondence of files that I can check for differences? Best Miguel On Thu, Apr 27, 2017 at 1:31 PM, Miguel Vazquez > wrote: Hi Francesco, Thank you very much for promptly resolving this issue. I can confirm that the workflow runs nicely and the outputs are also collected correctly. I'll update the wiki to reflect that the workflow runs successfully. As with the filter workflows, the next issue is knowing what I need to compare this with to check if there were any differences with the official released file. Do you know? 
From mikisvaz at gmail.com Thu Apr 27 11:15:12 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Thu, 27 Apr 2017 17:15:12 +0200
Subject: [DOCKTESTERS] SV-Merge validated 100% match on donors DO50311 and DO52526

Dear all,

Thanks again to Francesco, I managed to validate the SV-Merge with 100% overlap!

Comparison of SV for DO50311 using SV-Merge --- Common: 330 Extra: 0 Missing: 0
Comparison of SV for DO52526 using SV-Merge --- Common: 86 Extra: 0 Missing: 0

I'll update the wiki accordingly.

Best regards

Miguel

From George.Mihaiescu at oicr.on.ca Thu Apr 27 12:17:04 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Thu, 27 Apr 2017 16:17:04 +0000
Subject: [DOCKTESTERS] BWA-Mem update

I was on vacation last week and then busy with other tasks, but I would like to add that I ran DKFZ on donor DO50398 and the comparison returned 100% validation.

Comparison for DO50398 using DKFZ --- Common: 109936 Extra: 0 Missing: 0

Cheers,
George

From: George Mihaiescu
Date: Friday, March 24, 2017 at 11:58 PM
To: Keiran Raine, Miguel Vazquez, Jonas Demeulemeester
Cc: Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

Hi Keiran,

I used the original aligned BAMs available in Collaboratory and GNOS sites. One of my two other Sanger tests, run against the same donor, completed too, and it had exactly the same output when I ran the "compare_result.sh" script, but I'm not sure what you meant by "the key information for determining if a call change is erroneous". Is the check script validating the result correctly (or not)?

I'll probably send a final report on Monday with the results of all four tests (three Sanger and one DKFZ).
Cheers,
George

From: Keiran Raine
Date: Thursday, March 23, 2017 at 4:58 AM
To: George Mihaiescu, Miguel Vazquez, Jonas Demeulemeester
Cc: Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

Hi,

Sorry if this is in your Confluence page, but I'm unable to access it (could be because I'm outside OICR, or because the default for your space is owner-only). Can you confirm whether the CaVEMan calling was based on the BAM file that the original data was generated with, or on one mapped with the new/recent mapping flow?

Also, the key information for determining if a call change is erroneous:

1. Is the variant marked 'PASSED'?
2. What are the probabilities attached to the VCF record (should be in the info field)?

As previously stated, we do expect a small variance in the results between the data processed at the beginning of the project and that at the end, as well as some minor changes introduced when the normal panel was moved from a web service to a local file.

Regards,

Keiran

From: George Mihaiescu
Date: Wednesday, 22 March 2017 at 20:18
To: Miguel Vazquez, Jonas Demeulemeester
Cc: Keiran Raine, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

I finished one of the dockerized Sanger tests, and upon verification there were just a few differences, but I'm not sure if they are normal or not.

Results:

root@dockstore-test3:~/PCAWG-Docker-Test# bin/compare_result.sh Sanger DO50398
var/spool/cwl/0/caveman/
var/spool/cwl/0/caveman/splitList
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz
var/spool/cwl/0/caveman/alg_bean
var/spool/cwl/0/caveman/prob_arr
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz.tbi
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.no_analysis.bed
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz.tbi
var/spool/cwl/0/caveman/cov_arr
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz.tbi
var/spool/cwl/0/caveman/caveman.cfg.ini

Comparison for DO50398 using Sanger --- Common: 171325 Extra: 3 - Example: 14:20031258:G,8:43827158:A,X:61711363:C Missing: 13 - Example: 10:106963148:T,17:64794691:G,1:82709263:T

Because I'm an infrastructure architect, my main reason for the test was to monitor resource utilization, so I wrote a wiki page detailing my observations: https://wiki.oicr.on.ca/display/~gmihaiescu/Dockerized+Sanger+workflow

I have three more Docker tests running: two of them run Sanger against the same donor (but using VMs with 8 cores, because I want to see if the run time and resource utilization are constant), and a third test that is running DKFZ.

Cheers,
George

From: Miguel Vazquez
Date: Wednesday, March 22, 2017 at 1:08 PM
To: Jonas Demeulemeester
Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

Thanks Jonas for this information. I hope that someone here can provide us with some suggestion on what to try next. Perhaps the version issue that Jonas points out is the key.
I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches.

Best regards

Miguel

On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester wrote:

Hi all,

A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned BAMs for DO50311 and ran the BWA-Mem workflow for normal and tumor separately. Doing the comparison, however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped BAM files (attached), I noticed that the original was mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this. Is there a list available somewhere detailing which samples were mapped using which versions? That way we could select a relevant test sample without having to sort through the headers of all the different BAMs (a sketch for listing the version records in the headers appears at the end of this thread).

Best wishes,

Jonas

newly aligned:

ID                                  flag  chr         pos
HS2000-1012_275:7:1101:17411:15403    99  3           112743126
HS2000-1012_275:7:1101:17411:15403   147  3           112743376
HS2000-1012_275:7:1101:11883:83640    99  16          28672999
HS2000-1012_275:7:1101:11883:83640   147  16          28673223
HS2000-1012_275:7:1101:16576:28476   163  GL000238.1  21309
HS2000-1012_275:7:1101:16576:28476    83  GL000238.1  21664

vs the original:

ID                                  flag  chr  pos
HS2000-1012_275:7:1101:17411:15403    99   8   54944243
HS2000-1012_275:7:1101:17411:15403   147   8   54944493
HS2000-1012_275:7:1101:11883:83640   163  16   28464362
HS2000-1012_275:7:1101:11883:83640    83  16   28464586
HS2000-1012_275:7:1101:16576:28476    99   1   26124549
HS2000-1012_275:7:1101:16576:28476   147   1   26124903

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

From joachim.weischenfeldt at bric.ku.dk Thu Apr 27 12:47:12 2017
From: joachim.weischenfeldt at bric.ku.dk (Joachim Lütken Weischenfeldt)
Date: Thu, 27 Apr 2017 16:47:12 +0000
Subject: [DOCKTESTERS] SV-Merge validated 100% match on donors DO50311 and DO52526
Message-ID: <4bc8bf0d-dda7-4db4-9b51-c43748a60e5e@P1KITHUB01W.unicph.domain>

That's great news! Thanks for your efforts.

Best

Joachim
From joachim.weischenfeldt at bric.ku.dk Thu Apr 27 12:47:12 2017
From: joachim.weischenfeldt at bric.ku.dk (Joachim Lütken Weischenfeldt)
Date: Thu, 27 Apr 2017 16:47:12 +0000
Subject: [DOCKTESTERS] SV-Merge validated 100% match on donors DO50311 and DO52526
Message-ID: <4bc8bf0d-dda7-4db4-9b51-c43748a60e5e@P1KITHUB01W.unicph.domain>

That's great news! Thanks for your efforts.

Best,
Joachim

On 27 Apr 2017, at 17:15, Miguel Vazquez wrote:

Dear all,

Thanks again to Francesco, I managed to validate SV-Merge with 100% overlap!

Comparison of SV for DO50311 using SV-Merge
---
Common: 330
Extra: 0
Missing: 0

Comparison of SV for DO52526 using SV-Merge
---
Common: 86
Extra: 0
Missing: 0

I'll update the wiki accordingly.

Best regards,
Miguel

On Thu, Apr 27, 2017 at 3:02 PM, Francesco Favero wrote:

Dear Miguel,

I'm glad it worked :). The final sv-merge call set to compare with is the one linked here https://wiki.oicr.on.ca/display/PANCANCER/Linkouts+to+Most+Current+PCAWG+Data pointing to this sv-merge file set: https://www.synapse.org/#!Synapse:syn7596712 The final SV-merge VCF files are named .pcawg_consensus_1.6..somatic.sv.vcf.gz and would be the ones to compare with DO52526.somatic.sv.vcf.gz from the Docker run.

Best,
Francesco

From Junjun.Zhang at oicr.on.ca Fri Apr 28 11:26:59 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Fri, 28 Apr 2017 15:26:59 +0000
Subject: [DOCKTESTERS] BWA-Mem update

Hi George,

That's great news. Thanks for giving us a hand on this. You should join the Monday call to give an update if you'd like.

Best,
Junjun

From: George Mihaiescu
Date: Thursday, April 27, 2017 at 12:17 PM
To: Keiran Raine, Miguel Vazquez, Jonas Demeulemeester
Cc: Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

I was on vacation last week and then busy with other tasks, but I would like to add that I ran DKFZ on donor DO50398 and the comparison returned 100% validation.

Comparison for DO50398 using DKFZ
---
Common: 109936
Extra: 0
Missing: 0

Cheers,
George

From: George Mihaiescu
Date: Friday, March 24, 2017 at 11:58 PM
To: Keiran Raine, Miguel Vazquez, Jonas Demeulemeester
Cc: Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] BWA-Mem update

Hi Keiran,

I used the original aligned BAMs available in the Collaboratory and GNOS sites. One of my two other Sanger tests run against the same donor completed too, and it had exactly the same output when I ran the compare_result.sh script, but I'm not sure what you meant by "the key information for determining if a call change is erroneous". Is the check script correctly (or not) validating the result? I'll probably send a final report on Monday with the results of all four tests (three Sanger and one DKFZ).

Cheers,
George
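Regarding George's question about Keiran's criteria: each differing call can be looked up in the flagged VCF to see its FILTER status and the probabilities attached in INFO. A minimal sketch follows, assuming a bgzipped VCF and a key in the compare_result.sh example format; the exact INFO tag carrying the probability depends on the caller, so check the VCF header rather than relying on this.

#!/bin/bash
# Hedged sketch for triaging one differing call against Keiran's criteria:
# (1) is it marked PASS/'PASSED', and (2) what probabilities sit in INFO.
set -euo pipefail

vcf=$1    # e.g. the *.flagged.muts.vcf.gz produced by the Sanger run
site=$2   # e.g. 14:20031258:G (CHROM:POS:ALLELE, as in the comparison output)

IFS=: read -r chrom pos allele <<< "$site"

# FILTER and INFO are VCF columns 7 and 8.
zcat "$vcf" | awk -v c="$chrom" -v p="$pos" \
  '$1 == c && $2 == p { print "FILTER=" $7 "\tINFO=" $8 }'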