From Denis.Yuen at oicr.on.ca Wed Mar 1 10:26:01 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Wed, 1 Mar 2017 15:26:01 +0000
Subject: [DOCKTESTERS] Thanks!
Message-ID: <26d1914151c94301bcc761ef88aaa011@oicr.on.ca>

Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date.

https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data

As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks!

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario, Canada M5G 0A3

Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

From christina.yung at oicr.on.ca Mon Mar 6 09:43:44 2017
From: christina.yung at oicr.on.ca (Christina Yung)
Date: Mon, 6 Mar 2017 08:43:44 -0600
Subject: [DOCKTESTERS] Fwd: PCAWG-TECH Author Form
In-Reply-To: <6d28eeef5e2d47a1ba82ea42a3013ff8@oicr.on.ca>
References: <6d28eeef5e2d47a1ba82ea42a3013ff8@oicr.on.ca>
Message-ID: <6a9200eb-1317-e2cd-d1c6-372defff08c4@oicr.on.ca>

An HTML attachment was scrubbed...

From Junjun.Zhang at oicr.on.ca Fri Mar 10 15:51:53 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Fri, 10 Mar 2017 20:51:53 +0000
Subject: [DOCKTESTERS] Thanks!
Message-ID:

Dear Docktesters,

George Mihaiescu, cloud architect at the Collaboratory at OICR, plans to run some bioinformatics workflows to test the Collab environment.

I thought this would be a good opportunity to get extra help testing the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?

Thanks,
Junjun

From miguel.vazquez at cnio.es Sat Mar 11 10:57:23 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Sat, 11 Mar 2017 16:57:23 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests. Maybe George could try them as well; it should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.
https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access, and the scripts will take care of downloading the BAM files, running the workflows and evaluating the results.

The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards,
Miguel
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

From George.Mihaiescu at oicr.on.ca Sat Mar 11 11:00:10 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Sat, 11 Mar 2017 16:00:10 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
Message-ID:

Sure, I'll give it a try and report later.

Thank you,

George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
Email: George.Mihaiescu at oicr.on.ca
From Jonas.Demeulemeester at crick.ac.uk Sat Mar 11 18:15:14 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Sat, 11 Mar 2017 23:15:14 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: ,
Message-ID: <570FCD5C-E577-4CBA-A741-7ADC562CFB65@crick.ac.uk>

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts.
Give them a go and if you run into issues, just let us know!

Cheers,
Jonas
The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

From Junjun.Zhang at oicr.on.ca Sun Mar 12 23:45:14 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Mon, 13 Mar 2017 03:45:14 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: <570FCD5C-E577-4CBA-A741-7ADC562CFB65@crick.ac.uk>
References: <570FCD5C-E577-4CBA-A741-7ADC562CFB65@crick.ac.uk>
Message-ID:

Thanks Miguel and Jonas for your help here!

Do you have any update on the latest testing? Please feel free to update the wiki with any updates:
https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun
From George.Mihaiescu at oicr.on.ca Mon Mar 13 00:12:09 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 13 Mar 2017 04:12:09 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
Message-ID:

Hi,

I started Sanger on DO50398 and it has been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".

I just started a second run on a different VM on the same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk. I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George
From mikisvaz at gmail.com Mon Mar 13 07:53:03 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 13 Mar 2017 12:53:03 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part. Even if you are not using them, they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script takes care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one that Jonas and I have both tested, so it could be interesting to get a third opinion.
Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel
From George.Mihaiescu at oicr.on.ca Mon Mar 13 09:43:59 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 13 Mar 2017 13:43:59 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
Message-ID:

Hi Miguel,

I started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

Because I'm running in Collaboratory, I changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pulls data from Collaboratory.
There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; I'm not sure if this is an issue.

Looking at "bin/compare_result_type.sh", it seems to use the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

George
The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script takes care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test: DO52140 is one that Jonas and I have both tested, and it could be interesting to get a third opinion. Any other donor could also be interesting, to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".

I just started a second run on a different VM on the same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk. I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here!

Do you have any update on the latest testing? Please feel free to update the wiki with any news: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,

George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result.
The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards

Miguel

From mikisvaz at gmail.com Mon Mar 13 09:52:03 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 13 Mar 2017 14:52:03 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi George,

Answers inline.

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca> wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I
> guess with just one workflow running it should complete faster than two
> weeks.

I think it should still take a long time. My scripts will run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh"
> script to use a docker container that has the icgc client inside and pull
> data from Collaboratory. There is no "bam.bas" file downloaded, just a
> ".bam" and a ".bam.bai" files, not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker, so I think you won't have a problem.

> By looking at the "bin/compare_result_type.sh" it looks like it's using
> the gnos client to pull down the existing VCF files for comparison reasons,
> but I think we store those files in Collaboratory as well, so I'll work
> with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained
> about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first.
Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files to etc/donor_files.csv. It's a bit tedious right now: you need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards

Miguel

> George
>
> From: Miguel Vazquez
> Date: Monday, March 13, 2017 at 6:53 AM
> To: George Mihaiescu
> Cc: Junjun Zhang, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> Subject: Re: [DOCKTESTERS] Thanks!
>
> Hi George,
>
> The Sanger workflow is very lengthy, it takes about two weeks in my tests.
>
> About correctness, my scripts also cover that part, if you are not using
> them they might still help you to clarify how we do it. The idea is to take
> each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both
> germline and somatic and compare it with the result uploaded to GNOS (not
> all pipelines produce all files). This is the relevant part in the
> run_batch.sh script:
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46
>
> The bin/compare_result_type.sh script will take care of downloading the
> correct file from GNOS and running the comparison. The comparison itself is
> simple since all files are VCFs, it consists in taking out the variants in
> terms of chromosome, position, reference and alternative allele and
> measuring the overlaps.
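
[Editor's illustration] The comparison described in the quoted message (reduce each VCF to chromosome, position, reference and alternative allele, then measure the overlaps) could look roughly like the sketch below. This is not the code in bin/compare_result_type.sh; the function names `variant_keys` and `overlap_report`, and the choice to split multi-allelic ALT fields into one key per allele, are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Set, Tuple

# (chromosome, position, reference allele, alternative allele)
Variant = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (CHROM, POS, REF, ALT) keys from the lines of a VCF file.

    Header lines (starting with '#') and blank lines are skipped.
    Assumption: a multi-allelic ALT such as "C,GA" yields one key per allele.
    """
    keys: Set[Variant] = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def overlap_report(test: Set[Variant], truth: Set[Variant]) -> Dict[str, int]:
    """Count variants shared by both sets and those unique to each side."""
    return {
        "common": len(test & truth),
        "extra": len(test - truth),    # only in the new run
        "missing": len(truth - test),  # only in the reference result
    }


if __name__ == "__main__":
    # Toy data standing in for a new run and the result stored in GNOS.
    new_run = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tA\tT\t.\tPASS\t.",
        "2\t200\t.\tG\tC\t.\tPASS\t.",
    ]
    reference = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tA\tT\t.\tPASS\t.",
        "3\t300\t.\tC\tG\t.\tPASS\t.",
    ]
    print(overlap_report(variant_keys(new_run), variant_keys(reference)))
```

Using sets keeps the comparison independent of record order and of any columns beyond the first five, which is presumably why the approach is described as simple.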
From miguel.vazquez at cnio.es Mon Mar 13 11:22:23 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Mon, 13 Mar 2017 16:22:23 +0100
Subject: [DOCKTESTERS] Help needed with DKFZ BiasFilter. Validation of DO52140. 100% match is wrong!
Message-ID:

Dear all,

I just learnt that the DKFZ BiasFilter is NOT the OXOG filter workflow, which means *I checked for the wrong thing in this validation!* I'm sorry for the confusion.

Right now I pass the BAM files and the consensus.vcf (SNV_MNV) downloaded from GNOS to the BiasFilter and compare the resulting VCF with the consensus, looking at the set of mutations containing the OXOGFAIL flag. This apparently is not the comparison to make.

*What is it that I need to compare? Is it the bPcr and bSeq flags?*

A first look at those flags unfortunately shows quite some discrepancies on both donors (DO52140 and DO35937) for both flags. For instance, for DO35937 we find 11 mutations flagged bPcr in the new result, while the consensus.vcf only flags one of them. Something similar happens with bSeq.

Can you please confirm this so I can reply with a full report?

Kind regards, and sorry again for the confusion.
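
[Editor's illustration] A per-flag comparison of the kind asked about above could be sketched as follows. It assumes the bPcr and bSeq flags appear in the FILTER column (column 7) of both VCFs as semicolon-separated values; the thread does not confirm where the flags are stored, and the helper names `flagged_variants` and `compare_flags` are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def flagged_variants(vcf_lines: Iterable[str], flag: str) -> Set[Variant]:
    """Return the variants whose FILTER column lists `flag`.

    Assumption: flags are semicolon-separated entries in column 7 (FILTER).
    """
    flagged: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if flag in fields[6].split(";"):
            flagged.add((fields[0], int(fields[1]), fields[3], fields[4]))
    return flagged


def compare_flags(
    new_vcf: Iterable[str],
    consensus_vcf: Iterable[str],
    flags: Sequence[str] = ("bPcr", "bSeq"),
) -> Dict[str, Dict[str, int]]:
    """For each flag, count variants flagged in both files or in only one."""
    new_lines, consensus_lines = list(new_vcf), list(consensus_vcf)
    report: Dict[str, Dict[str, int]] = {}
    for flag in flags:
        new = flagged_variants(new_lines, flag)
        old = flagged_variants(consensus_lines, flag)
        report[flag] = {
            "both": len(new & old),
            "only_new": len(new - old),
            "only_consensus": len(old - new),
        }
    return report
```

Reporting each flag separately makes a discrepancy such as "11 flagged in the new result, one in the consensus" visible per flag instead of being hidden in a single overall match percentage.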
Miguel

On Mon, Feb 27, 2017 at 7:30 PM, Miguel Vazquez wrote:

> Dear friends,
>
> I've performed the first test with the DKFZ BiasFilter and got a perfect
> match. There are 55 variants annotated with OXOGFAIL and they are the same
> in the input VCF file (consensus SNV/MNV VCF for that donor) and the output
> of the BiasFilter. I'm running the test on a second donor.
>
> Best regards
>
> Miguel

From christina.yung at oicr.on.ca Mon Mar 13 11:48:39 2017
From: christina.yung at oicr.on.ca (Christina Yung)
Date: Mon, 13 Mar 2017 10:48:39 -0500
Subject: [DOCKTESTERS] Help needed with DKFZ BiasFilter. Validation of DO52140. 100% match is wrong!
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
URL:

From George.Mihaiescu at oicr.on.ca Mon Mar 13 12:57:14 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Mon, 13 Mar 2017 16:57:14 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
Message-ID:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker.
Can somebody provide some quick instructions and the location of the unaligned BAM files that were already used?

Also, are the steps involved in each workflow listed somewhere, so I can get an idea of how far along a run is?
For example, is s58_cgpPindel_pin2vcf_95 three steps from the finish, or 50?

Thank you,
George

From mikisvaz at gmail.com Mon Mar 13 13:01:06 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 13 Mar 2017 18:01:06 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you can see here:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.
> > Hi George, > > Answers inline > > On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu < > George.Mihaiescu at oicr.on.ca> wrote: > >> Hi Miguel, >> >> I've started the test by running "bin/run_test.sh Sanger DO50398", so I >> guess with just one workflow running it should complete faster than two >> weeks. >> > > I think it still should take a long time. My scripts will run one workflow > after another. > > >> >> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" >> script to use a docker container that has the icgc client inside and pull >> data from Collaboratory. There is no "bam.bas" file downloaded, just a >> ".bam" and a ".bam.bai" files, not sure if this is an issue. >> >> > I wondered the same thing first time I did this, but this file is produced > by the pipeline. There was some problem with this that was dealt with by > the developers and updated in the docker. So I think you won't have a > problem > > >> By looking at the "bin/compare_result_type.sh" it looks like it's using >> the gnos client to pull down the existing VCF files for comparison reasons, >> but I think we store those files in Collaboratory as well, so I'll work >> with Junjun to adapt the script for this. >> >> > Let me know if you need any help > > >> I think I initially tried to run the DKFZ workflow, but it complained >> about having to run Delly first, so I abandoned this for now. >> > > Yes, if you look at the run_batch.sh you will see that when using DKFZ it > will always run Delly first. Delly prepares some files the the DKFZ file > needs, namely related to copy number I believe. > > >> >> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. >> > > Remember that you will need to add the relevant has-keys for the different > files in the etc/donor_files.csv. Its a bit tedious right now. You need to > go to the ICGC DCC and find these codes manually for the files you need. > Ask me if you need help. 
Once you have all you can run all the workflows > for that donor and evaluate results. > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > etc/donor_files.csv > > > Regards > > Miguel > > >> >> George >> >> From: Miguel Vazquez >> Date: Monday, March 13, 2017 at 6:53 AM >> To: George Mihaiescu >> Cc: Junjun Zhang , Jonas Demeulemeester < >> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" < >> docktesters at lists.icgc.org> >> Subject: Re: [DOCKTESTERS] Thanks! >> >> Hi George, >> >> The Sanger workflow is very lengthy, it takes about two weeks in my >> tests. >> >> About correctness, my scripts also cover that part, if you are not using >> them they might still help you to clarify how we do it. The idea is to take >> each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both >> germline and somatic and compare it with the result uploaded to GNOS (not >> all pipelines produce all files). This is the relevant part in the >> run_batch.sh script: >> >> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi >> n/run_batch.sh#L42-L46 >> >> The bin/compare_result_type.sh script will take care of downloading the >> correct file from GNOS and running the comparison. The comparison itself is >> simple since all files are VCFs, it consists in taking out the variants in >> terms of chromosome, position, reference and alternative allele and >> measuring the overlaps. >> >> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi >> n/compare_result_type.sh >> >> About which donors to test, DO52140 is one Jonas and I have both tested >> and could be interesting to get a third opinion. Also, any other donor >> could be interesting to see if something new comes up. I'm not sure which >> options is best. 
>>
>> Miguel
>>
>> On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu <George.Mihaiescu at oicr.on.ca> wrote:
>>
>>> Hi,
>>>
>>> I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".
>>>
>>> I just started a second run on a different VM on the same donor, just to compare run times.
>>> The VM used has 8 cores, 48 GB of RAM and 1.1 TB of disk, and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.
>>>
>>> Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.
>>>
>>> George
>>>
>>> From: Junjun Zhang
>>> Date: Sunday, March 12, 2017 at 10:45 PM
>>> To: Jonas Demeulemeester, George Mihaiescu
>>> Cc: Miguel Vazquez, Denis Yuen <Denis.Yuen at oicr.on.ca>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>
>>> Thanks Miguel and Jonas for your help here!
>>>
>>> Do you have any update on the latest testing? Please feel free to update the wiki with any updates: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference
>>>
>>> Regards,
>>> Junjun
>>>
>>> From: Jonas Demeulemeester
>>> Date: Saturday, March 11, 2017 at 7:15 PM
>>> To: George Mihaiescu
>>> Cc: Miguel Vazquez, Junjun Zhang <junjun.zhang at oicr.on.ca>, Denis Yuen, "docktesters at lists.icgc.org"
>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>
>>> Hi George,
>>>
>>> Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts.
>>> Give them a go and if you run into issues, just let us know!
>>>
>>> Cheers,
>>> Jonas
>>>
>>> On 11 Mar 2017, at 17:00, George Mihaiescu wrote:
>>>
>>> Sure, I'll give it a try and report later.
>>>
>>> Thank you,
>>>
>>> *George Mihaiescu*
>>> Senior Cloud Architect
>>>
>>> *Ontario Institute for Cancer Research*
>>> MaRS Centre
>>> 661 University Avenue
>>> Suite 510
>>> Toronto, Ontario
>>> Canada M5G 0A3
>>>
>>> Email: George.Mihaiescu at oicr.on.ca
>>> Toll-free: 1-866-678-6427
>>> Twitter: @OICR_news
>>>
>>> www.oicr.on.ca
>>>> _______________________________________________
>>>> docktesters mailing list
>>>> docktesters at lists.icgc.org
>>>> https://lists.icgc.org/mailman/listinfo/docktesters
>>>
>>> The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

From Junjun.Zhang at oicr.on.ca Mon Mar 13 13:51:53 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Mon, 13 Mar 2017 17:51:53 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi Miguel,

I thought you had kept the unaligned sequence you prepared for the testing.

Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which could actually explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see that the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result when lanes are aligned independently compared with being aligned all together.
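One way to see how many lanes an aligned BAM actually holds, before collapsing it into a single read group, is to count the @RG lines in its header. The following is a minimal sketch, assuming the header text has already been extracted separately (for example with `samtools view -H`); the function name read_groups and the sample IDs are hypothetical and not taken from prepare_unaligned.sh.

```python
def read_groups(sam_header_text):
    """Return {read group ID: {tag: value}} for every @RG line in a SAM header."""
    groups = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # only read-group records describe lanes
        # Each field after the record type has the form TAG:VALUE.
        fields = dict(field.split(":", 1) for field in line.split("\t")[1:])
        groups[fields["ID"]] = fields
    return groups


# Hypothetical header with two lanes of the same tumour sample.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@RG\tID:lane1\tSM:tumour\tPU:run1_1\n"
    "@RG\tID:lane2\tSM:tumour\tPU:run1_2\n"
)
lanes = read_groups(example_header)
print(len(lanes), sorted(lanes))
```

If the count is greater than one, a single-read-group unaligned BAM will not reproduce the lane-by-lane inputs that the workflow originally aligned.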
The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is a single-lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think that in order to test the alignment workflow properly, we will need to prepare lane-level unaligned BAMs (one lane, one BAM) as inputs. For example, take this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63. It has 7 read groups (search for read_group), so it needs to be converted to 7 individual lane-level BAM files.

I'm not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 steps from finish?

Thank you,
George

From mikisvaz at gmail.com Mon Mar 13 14:31:47 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Mon, 13 Mar 2017 19:31:47 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi Junjun

About the unaligned BAM files, in fact I do have them for the two tests I've run. I could make them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes let me just say good grief! This is the first time I hear about it.
So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? Also I asked Keiran about these headers and he said they were OK.

If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards

Miguel

From Junjun.Zhang at oicr.on.ca Mon Mar 13 17:16:35 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Mon, 13 Mar 2017 21:16:35 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

Hi Keiran,

Can you please comment on this, i.e., the comparison between alignment done lane by lane vs. done with all lanes mixed?

Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAMs any more.

The key point here is: should the input BAMs be organized by lanes, one lane one BAM? Or just one BAM containing all lanes?

Thanks,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 2:31 PM
To: Junjun Zhang
Cc: George Mihaiescu, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun

About the unaligned BAM files, in fact I do have them for the two tests I've run.
I could make them available for George, but I think he could just as well produce them on site, since he might have to do that anyway. We can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes, let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? Also, I asked Keiran about these headers and he said they were OK. If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards

Miguel

On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:

Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing.

Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result: aligning the lanes independently gives a different result from aligning them all together.
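
As a rough illustration of step 2 in the list above (determining the read groups), the @RG lines of a SAM header can be parsed with a few lines of Python. This is only a sketch: the example header and the `read_groups_from_header` helper are hypothetical and are not taken from prepare_unaligned.sh or from the workflow code.

```python
def read_groups_from_header(header_text):
    """Return one dict per @RG line of a SAM header, keyed by tag (ID, LB, PU, ...)."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        fields = line.split("\t")[1:]
        # Each field looks like "ID:lane1"; split on the first colon only.
        groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return groups


# Hypothetical header of a merged BAM holding two lanes from one library.
example_header = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:lane1\tLB:libA\tPU:run1_1\tSM:tumour",
    "@RG\tID:lane2\tLB:libA\tPU:run1_2\tSM:tumour",
])

groups = read_groups_from_header(example_header)
print([g["ID"] for g in groups])
```

In practice the header text would come from the aligned BAM itself, for example via `samtools view -H`; a merged BAM with several @RG lines is the case where one lane-level BAM per read group is needed.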
The PCAWG Sequence Submission SOP has a step to prepare unaligned BAMs, but it only works when the input is a single-lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to test the alignment workflow properly, we will need to prepare lane-level unaligned BAMs (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 has 7 read groups (search for read_group). It needs to be converted to 7 individual lane-level BAM files.

Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker.
Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
For example, is s58_cgpPindel_pin2vcf_95 three steps from the finish, or 50 steps from the finish?

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.

> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now.
You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards

Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part; if you are not using them they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one Jonas and I have both tested, and it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59"

I just started a second run on a different VM on the same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and 1.1 TB of disk. I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here!

Do you have any update on the latest testing? Please feel free to update the wiki with any news: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts.
Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,

George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca
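
The comparison Miguel describes earlier in this thread (reduce each VCF to chromosome, position, reference and alternative allele, then measure the overlap) could look roughly like the following. It is a minimal sketch, not the code in bin/compare_result_type.sh; the function names and the two tiny VCF snippets are made up for illustration.

```python
def variant_keys(vcf_text):
    """Collect (chromosome, position, reference, alternative) tuples from VCF text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # A record can list several ALT alleles separated by commas.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def overlap(test_keys, truth_keys):
    """Count variants shared by both call sets and unique to each."""
    return {
        "common": len(test_keys & truth_keys),
        "only_test": len(test_keys - truth_keys),
        "only_truth": len(truth_keys - test_keys),
    }


# Hypothetical call sets: "ours" from a local run, "theirs" as downloaded.
ours = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tT\n1\t200\t.\tG\tC\n"
theirs = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tT\n2\t300\t.\tC\tG,T\n"

result = overlap(variant_keys(ours), variant_keys(theirs))
print(result)
```

Keying on the four fields alone ignores genotype, quality and filter columns, which matches the "simple" comparison described in the thread; a stricter check would need more of each record.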
From kr2 at sanger.ac.uk Tue Mar 14 05:16:17 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Tue, 14 Mar 2017 09:16:17 +0000
Subject: [DOCKTESTERS] Thanks!
Message-ID: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk>

Hi Junjun,

You won't be able to separate out the readgroups in the headers if the input is a merged BAM file. If there are different libraries, read lengths etc. it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates).
If you really need to do it this way you'd have to add a pre-processing step, bamtofastq can split a BAM into it's component readgroups in a single pass. Regards, Keiran Raine Principal Bioinformatician Cancer Genome Project Wellcome Trust Sanger Institute kr2 at sanger.ac.uk Tel:+44 (0)1223 834244 Ext: 4983 Office: H104 > On 13 Mar 2017, at 21:16, Junjun Zhang wrote: > > Hi Keiran, > > Can you please comment on this, i.e., comparison between alignment done lane by lane v.s. done with all lanes mixed? > > Basically, we are trying to prepare input BAMs for testing PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAM any more. The key point here is: should input BAM organized by lanes, one lane one BAM? Or just one BAM containing all lanes? > > Thanks, > Junjun > > > > From: Miguel Vazquez > > Date: Monday, March 13, 2017 at 2:31 PM > To: Junjun Zhang > > Cc: George Mihaiescu >, Jonas Demeulemeester >, "docktesters at lists.icgc.org " > > Subject: Re: [DOCKTESTERS] Thanks! > >> Hi Junjun >> >> About the unaligned BAM files, in fact I do have them for the two test I've ran. I could put them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around. >> >> About the number of lanes let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to: >> >> 1- Download the metadata for the BAM file >> 2- Determine the read_groups >> 3- Split the BAM file according to these read_groups >> 4- Unalign these BAM files and produce header files with different lanes >> 5- Run BWA-Mem >> 6- Compare collectively the reads from these BAM files with the original BAM >> >> Could you please confirm that this is the case? Is this consistent with the 3% mismatches? 
A similar percentage was found in the HCC1143, could this be the reason for that as well? Also I asked Keiran about these headers and he said there where OK. If you could please confirm that I need to do this extended process I'd be grateful, because its quite involved and there are concepts here I'm not familiar with. >> >> Regards >> >> Miguel >> >> >> On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang > wrote: >>> Hi Miguel, >>> >>> I thought you kept the unaligned sequence you prepared for the testing. >>> >>> Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35 , which actually could explain the high mismatch rate. >>> >>> When BWA MEM workflow runs, the alignments are done one lane level BAM at a time, then merge the aligned BAM later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201 >>> >>> I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) in the aligned BAMs. This has big impact on the alignment result when lanes are aligned independently comparing aligned altogether. >>> >>> The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is single lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles >>> >>> So, I think in order to perform testing alignment workflow properly, we will need to prepare lane level unaligned BAM (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 , it has 7 read groups (search for read_group). It needs to be converted to 7 individual lane level BAM files. 
>>> >>> Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542 >>> >>> Hope this helps, >>> Junjun >>> >>> >>> >>> From: Miguel Vazquez > >>> Date: Monday, March 13, 2017 at 1:01 PM >>> To: George Mihaiescu > >>> Cc: Jonas Demeulemeester >, Junjun Zhang >, "docktesters at lists.icgc.org " > >>> Subject: Re: [DOCKTESTERS] Thanks! >>> >>>> Hi George, >>>> >>>> The analigned BAM files are not available as far as I know, rather you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: >>>> >>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32 >>>> >>>> About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer >>>> >>>> >>>> >>>> On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu > wrote: >>>>> Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. >>>>> Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already? >>>>> >>>>> Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running? >>>>> For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 steps from finish? >>>>> >>>>> Thank you, >>>>> George >>>>> >>>>> From: Miguel Vazquez > >>>>> Date: Monday, March 13, 2017 at 8:52 AM >>>>> >>>>> To: George Mihaiescu > >>>>> Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org " > >>>>> Subject: Re: [DOCKTESTERS] Thanks! 
>>>>> >>>>> Hi George, >>>>> >>>>> Answers inline >>>>> >>>>> On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote: >>>>>> Hi Miguel, >>>>>> >>>>>> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks. >>>>> >>>>> I think it still should take a long time. My scripts will run one workflow after another. >>>>> >>>>>> >>>>>> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue. >>>>>> >>>>> >>>>> I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem >>>>> >>>>>> By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this. >>>>>> >>>>> >>>>> Let me know if you need any help >>>>> >>>>>> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now. >>>>> >>>>> Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the DKFZ file needs, namely related to copy number I believe. >>>>> >>>>>> >>>>>> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. >>>>> >>>>> Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. 
Once you have all you can run all the workflows for that donor and evaluate results. >>>>> >>>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv >>>>> >>>>> >>>>> Regards >>>>> >>>>> Miguel >>>>> >>>>>> >>>>>> George >>>>>> >>>>>> From: Miguel Vazquez > >>>>>> Date: Monday, March 13, 2017 at 6:53 AM >>>>>> To: George Mihaiescu > >>>>>> Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org " > >>>>>> Subject: Re: [DOCKTESTERS] Thanks! >>>>>> >>>>>> Hi George, >>>>>> >>>>>> The Sanger workflow is very lengthy, it takes about two weeks in my tests. >>>>>> >>>>>> About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: >>>>>> >>>>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46 >>>>>> >>>>>> The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps. >>>>>> >>>>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh >>>>>> >>>>>> About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best. 
>>>>>> >>>>>> Miguel >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" >>>>>>> >>>>>>> I just started a second run on a different VM on same donor, just to compare run times. >>>>>>> The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness. >>>>>>> >>>>>>> Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow. >>>>>>> >>>>>>> George >>>>>>> >>>>>>> >>>>>>> From: Junjun Zhang > >>>>>>> Date: Sunday, March 12, 2017 at 10:45 PM >>>>>>> To: Jonas Demeulemeester >, George Mihaiescu > >>>>>>> Cc: Miguel Vazquez >, Denis Yuen >, "docktesters at lists.icgc.org " > >>>>>>> Subject: Re: [DOCKTESTERS] Thanks! >>>>>>> >>>>>>> Thanks Miguel and Jonas for your help here! >>>>>>> >>>>>>> Do you have any update on the latest testing? Please feel free updating the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference >>>>>>> >>>>>>> Regards, >>>>>>> Junjun >>>>>>> >>>>>>> >>>>>>> >>>>>>> From: Jonas Demeulemeester > >>>>>>> Date: Saturday, March 11, 2017 at 7:15 PM >>>>>>> To: George Mihaiescu > >>>>>>> Cc: Miguel Vazquez >, Junjun Zhang >, Denis Yuen >, "docktesters at lists.icgc.org " > >>>>>>> Subject: Re: [DOCKTESTERS] Thanks! >>>>>>> >>>>>>>> Hi George, >>>>>>>> >>>>>>>> Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. >>>>>>>> Give them a go and if you run into issues, just let us know! >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Jonas >>>>>>>> >>>>>>>> >>>>>>>> On 11 Mar 2017, at 17:00, George Mihaiescu > wrote: >>>>>>>> >>>>>>>>> Sure, I'll give it a try and report later. 
>>>>>>>>> >>>>>>>>> Thank you, >>>>>>>>> George Mihaiescu >>>>>>>>> Senior Cloud Architect >>>>>>>>> >>>>>>>>> Ontario Institute for Cancer Research >>>>>>>>> MaRS Centre >>>>>>>>> 661 University Avenue >>>>>>>>> Suite 510 >>>>>>>>> Toronto, Ontario >>>>>>>>> Canada M5G 0A3 >>>>>>>>> >>>>>>>>> Email: George.Mihaiescu at oicr.on.ca >>>>>>>>> Toll-free: 1-866-678-6427 >>>>>>>>> Twitter: @OICR_news >>>>>>>>> >>>>>>>>> www.oicr.on.ca >>>>>>>>> >>>>>>>>> This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> From: Miguel Vazquez > >>>>>>>>> Date: Saturday, March 11, 2017 at 10:57 AM >>>>>>>>> To: Junjun Zhang > >>>>>>>>> Cc: Denis Yuen >, Jonas Demeulemeester >, George Mihaiescu >, "docktesters at lists.icgc.org " > >>>>>>>>> Subject: Re: [DOCKTESTERS] Thanks! >>>>>>>>> >>>>>>>>> Hi Junjun, >>>>>>>>> >>>>>>>>> I think Jonas has been using my scripts to run some of the tests, maybe George could try them as well, it should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter. >>>>>>>>> >>>>>>>>> https://github.com/mikisvaz/PCAWG-Docker-Test >>>>>>>>> >>>>>>>>> He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result. >>>>>>>>> >>>>>>>>> The documentation there is reasonably updated, but if this sounds good then perhaps he could contact me and I could walk him through the details. 
>>>>>>>>> >>>>>>>>> Best regards >>>>>>>>> >>>>>>>>> Miguel >>>>>>>>> >>>>>>>>> On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang > wrote: >>>>>>>>>> Dear Docktesters, >>>>>>>>>> >>>>>>>>>> George Mihaiescu, cloud architect, of the Collaboratory at OICR plans to run some bioinformatics workflows to test Collab environment. >>>>>>>>>> >>>>>>>>>> Just thought this is a good opportunity to use as extra help for testing out the PCAWG dockerized workflows. >>>>>>>>>> >>>>>>>>>> Miguel, Denis and others, what workflows / datasets do you think would be good for George to run? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Junjun >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> From: > on behalf of Denis Yuen > >>>>>>>>>> Date: Wednesday, March 1, 2017 at 10:26 AM >>>>>>>>>> To: "docktesters at lists.icgc.org " > >>>>>>>>>> Subject: [DOCKTESTERS] Thanks! >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date. >>>>>>>>>>> >>>>>>>>>>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data >>>>>>>>>>> >>>>>>>>>>> As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks! >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Denis Yuen >>>>>>>>>>> Senior Software Developer >>>>>>>>>>> >>>>>>>>>>> OntarioInstituteforCancerResearch >>>>>>>>>>> MaRSCentre >>>>>>>>>>> 661 University Avenue >>>>>>>>>>> Suite510 >>>>>>>>>>> Toronto, Ontario,Canada M5G0A3 >>>>>>>>>>> Toll-free: 1-866-678-6427 >>>>>>>>>>> Twitter: @OICR_news >>>>>>>>>>> www.oicr.on.ca >>>>>>>>>>> This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. 
Opinions, conclusions or other information contained in this message may not be that of the organization. >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> docktesters mailing list >>>>>>>>>> docktesters at lists.icgc.org >>>>>>>>>> https://lists.icgc.org/mailman/listinfo/docktesters >>>>>>>>>> >>>>>>>>> >>>>>>>> The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> docktesters mailing list >>>>>>> docktesters at lists.icgc.org >>>>>>> https://lists.icgc.org/mailman/listinfo/docktesters >>>>>>> >>>>>> >>>>> >>>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2058 bytes Desc: not available URL: From Junjun.Zhang at oicr.on.ca Tue Mar 14 08:28:17 2017 From: Junjun.Zhang at oicr.on.ca (Junjun Zhang) Date: Tue, 14 Mar 2017 12:28:17 +0000 Subject: [DOCKTESTERS] Thanks! In-Reply-To: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> References: , <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> Message-ID: <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> Hi Kieran, Thanks for the detailed explanation. So, in order to reproduce PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input. A processing is needed to prepare lane level BAMs from merged BAM. @Migual, hope this is helpful. Let us know if you have any other questions. Best regards Junjun On Mar 14, 2017, at 5:16 AM, Keiran Raine > wrote: Hi Junjun, You won't be able to separate out the readgroups in the headers if the input is a merged BAM file. 
If there are different libraries, read lengths, etc., it will cause problems for insert-size determination (used in determining proper pairs) and result in inter-library duplicate removal (by definition, reads from different libraries can't be duplicates).

If you really need to do it this way you'd have to add a pre-processing step; bamtofastq can split a BAM into its component readgroups in a single pass.

Regards,

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk
Tel: +44 (0)1223 834244 Ext: 4983
Office: H104

On 13 Mar 2017, at 21:16, Junjun Zhang wrote:

Hi Keiran,

Can you please comment on this, i.e., the comparison between alignment done lane by lane vs. done with all lanes mixed?

Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAMs any more. The key point here is: should the input BAMs be organized by lane, one lane per BAM? Or just one BAM containing all lanes?

Thanks,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 2:31 PM
To: Junjun Zhang
Cc: George Mihaiescu, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun

About the unaligned BAM files, in fact I do have them for the two tests I've run. I could make them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes let me just say good grief! This is the first time I hear about it.
So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? Also I asked Keiran about these headers and he said they were OK. If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards

Miguel

On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:

Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing.

Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result when lanes are aligned independently compared with being aligned all together.
The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is a single-lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to test the alignment workflow properly, we will need to prepare lane-level unaligned BAMs (one lane, one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63, it has 7 read groups (search for read_group). It needs to be converted to 7 individual lane-level BAM files.

Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
For example, is s58_cgpPindel_pin2vcf_95 three steps from finishing, or 50 steps from finishing?

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file, not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.

> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now.
You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards

Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part; if you are not using them they might still help to clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one that Jonas and I have both tested, and it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59"

I just started a second run on a different VM on the same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and 1.1 TB of disk. I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here!

Do you have any update on the latest testing? Please feel free to update the wiki: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,

George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue Suite 510
Toronto, Ontario Canada M5G 0A3
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca
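The VCF comparison Miguel describes in this thread (reduce each variant to chromosome, position, reference and alternative allele, then measure the overlap) can be sketched roughly as follows. This is only an illustration under stated assumptions: it reads plain-text, uncompressed VCF lines, the function names variant_keys and overlap_stats are hypothetical, and it is not the code in bin/compare_result_type.sh.

```python
def variant_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) tuples found in VCF text lines."""
    keys = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):  # one key per alternative allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def overlap_stats(expected, observed):
    """Count shared, missing and extra variants between two key sets."""
    return {
        "common": len(expected & observed),
        "missing": len(expected - observed),  # in the reference result only
        "extra": len(observed - expected),    # in the re-run result only
    }


if __name__ == "__main__":
    # Tiny made-up records, only to show the call pattern.
    reference = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT",
        "1\t100\t.\tA\tT",
        "1\t200\t.\tG\tC,A",
        "2\t50\t.\tT\tG",
    ]
    rerun = [
        "#CHROM\tPOS\tID\tREF\tALT",
        "1\t100\t.\tA\tT",
        "1\t200\t.\tG\tC",
        "3\t10\t.\tC\tT",
    ]
    print(overlap_stats(variant_keys(reference), variant_keys(rerun)))
```

Splitting multi-allelic ALT fields into one key per allele is a design choice made here so that "G>C,A" and "G>C" still share one variant; whether the real script does this is not stated in the thread.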
From mikisvaz at gmail.com Tue Mar 14 08:44:13 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Tue, 14 Mar 2017 13:44:13 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca>
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca>
Message-ID:

Hi Junjun and Keiran,

I'm sorry guys, but this is too alien for me; this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into whatever I need to run BWA-Mem on. At least pseudo-code or something that I can start with.
I think perhaps someone more knowledgeable than me should consider whether this procedure as a whole is acceptable in terms of reproducibility, and how best to document it or whether it could be improved.

Also, I don't think I understand the nature of the problem, because from what I can fathom it should have either broken the process or produced a much larger number of discrepancies than 3%. Can someone explain in layman's terms how only 3% of reads can be affected?

Best regards

Miguel

From Jonas.Demeulemeester at crick.ac.uk Tue Mar 14 09:49:34 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Tue, 14 Mar 2017 13:49:34 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca>
Message-ID:

Hi Miguel,

I'll have a go at modifying your scripts to do this kind of preprocessing.

As to why lane-level alignment vs alignment of a single merged BAM would result in only 3% discrepancies, I can imagine that read lengths etc. may not be that different between the different libraries (for our tested donors at least). Please correct me if I'm wrong though!
Best regards,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 14 Mar 2017, at 12:44, Miguel Vazquez wrote:

Hi Junjun and Keiran,

I'm sorry guys, but this is too alien for me; this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into whatever I need to run BWA-Mem on. At least pseudo-code or something that I can start with.

I think perhaps someone more knowledgeable than me should consider whether this procedure as a whole is acceptable in terms of reproducibility, and how best to document it or whether it could be improved.

Also, I don't think I understand the nature of the problem, because from what I can fathom it should have either broken the process or produced a much larger number of discrepancies than 3%. Can someone explain in layman's terms how only 3% of reads can be affected?

Best regards
Miguel

On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang wrote:

Hi Keiran,

Thanks for the detailed explanation. So, in order to reproduce the PCAWG BWA MEM alignment result, one must use lane-level BAMs (one lane, one BAM) as input. A preprocessing step is needed to prepare lane-level BAMs from the merged BAM.

@Miguel, hope this is helpful. Let us know if you have any other questions.

Best regards
Junjun

On Mar 14, 2017, at 5:16 AM, Keiran Raine wrote:

Hi Junjun,

You won't be able to separate out the readgroups in the headers if the input is a merged BAM file. If there are different libraries, read lengths etc., it will cause problems for insert-size determination (used in determining proper pairs) and result in inter-library duplicate removal (by definition, reads from different libraries can't be duplicates).
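To make the duplicate-removal point concrete, here is a toy sketch. It is not the PCAWG pipeline's code: the function name `mark_duplicates` and the tuple layout are hypothetical, and real duplicate marking also looks at mate positions and orientation. It only shows why losing the library information turns reads from different libraries into apparent duplicates.

```python
def mark_duplicates(reads, per_library=True):
    """Return the names of reads flagged as duplicates.

    reads: list of (name, library, chrom, pos, strand) tuples (toy layout).
    per_library=True keys duplicates on the library as well as the position;
    per_library=False mimics a merged input where library identity is lost.
    """
    seen = set()
    duplicates = []
    for name, library, chrom, pos, strand in reads:
        key = (library if per_library else None, chrom, pos, strand)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


# Illustrative reads only, not real data.
reads = [
    ("r1", "libA", "chr1", 1000, "+"),
    ("r2", "libB", "chr1", 1000, "+"),  # same position, different library
    ("r3", "libA", "chr1", 1000, "+"),  # genuine duplicate of r1
]

kept_apart = mark_duplicates(reads, per_library=True)
merged = mark_duplicates(reads, per_library=False)
print(kept_apart)  # only the genuine duplicate
print(merged)      # r2 is now wrongly flagged as well
```

With libraries kept apart only `r3` is flagged; once the library is ignored, `r2` is flagged too, which is the inter-library duplicate removal described above.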
If you really need to do it this way you'd have to add a pre-processing step; bamtofastq can split a BAM into its component readgroups in a single pass.

Regards,
Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute
kr2 at sanger.ac.uk
Tel: +44 (0)1223 834244 Ext: 4983
Office: H104

On 13 Mar 2017, at 21:16, Junjun Zhang wrote:

Hi Keiran,

Can you please comment on this, i.e., the comparison between alignment done lane by lane vs. alignment done with all lanes mixed?

Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAMs any more. The key point here is: should the input BAMs be organized by lane, one lane per BAM? Or just one BAM containing all lanes?

Thanks,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 2:31 PM
To: Junjun Zhang
Cc: George Mihaiescu, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

About the unaligned BAM files, in fact I do have them for the two tests I've run. I could make them available for George, but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes, let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143; could this be the reason for that as well?
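Steps 2 and 3 of the list above can be sketched in a few lines. This is only an illustration on SAM text (the kind of output `samtools view -h` prints), assuming each record carries an `RG:Z:` tag; the function names are hypothetical, and in practice a tool such as the bamtofastq mentioned in this thread would do the splitting.

```python
from collections import defaultdict


def read_group_ids(header_lines):
    """Step 2: return the ID of every @RG line in a SAM header, in order."""
    ids = []
    for line in header_lines:
        if line.startswith("@RG"):
            fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            ids.append(fields["ID"])
    return ids


def split_by_read_group(sam_lines):
    """Step 3: bucket alignment records by their RG:Z tag (one bucket per lane)."""
    header = []
    lanes = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        # The first 11 columns are mandatory SAM fields; optional tags follow.
        tags = line.rstrip("\n").split("\t")[11:]
        rg = next((t[5:] for t in tags if t.startswith("RG:Z:")), None)
        lanes[rg].append(line)
    return header, dict(lanes)


# Toy SAM content, not real data.
sam = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:lane1\tLB:libA\tSM:tumour",
    "@RG\tID:lane2\tLB:libA\tSM:tumour",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF\tRG:Z:lane1",
    "r2\t99\tchr1\t150\t60\t4M\t=\t350\t204\tACGT\tFFFF\tRG:Z:lane2",
    "r3\t99\tchr1\t180\t60\t4M\t=\t380\t204\tACGT\tFFFF\tRG:Z:lane1",
]

header, lanes = split_by_read_group(sam)
for rg in read_group_ids(header):
    print(rg, len(lanes.get(rg, [])))
```

Each bucket would then be unaligned and written out with its own `@RG` header line (step 4) before being passed to BWA-Mem.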
Also, I asked Keiran about these headers and he said they were OK. If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards
Miguel

On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:

Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing. Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result when lanes are aligned independently compared with being aligned all together.

The PCAWG Sequence Submission SOP has a step to prepare unaligned BAMs, but it only works when the input is a single-lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to test the alignment workflow properly, we will need to prepare lane-level unaligned BAMs (one lane, one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 has 7 read groups (search for read_group). It needs to be converted to 7 individual lane-level BAM files.
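One way to picture why aligning lanes together can differ from aligning them one at a time is the insert-size estimate used to decide proper pairs. The sketch below uses made-up insert sizes and a simple "mean plus or minus 3 standard deviations" window; that rule is an assumption made for illustration and is not BWA-MEM's actual estimator.

```python
from statistics import mean, stdev


def proper_pair_window(insert_sizes, n_sd=3.0):
    """Return a (low, high) window of insert sizes treated as 'proper' (toy rule)."""
    m, s = mean(insert_sizes), stdev(insert_sizes)
    return (m - n_sd * s, m + n_sd * s)


def in_window(insert_size, window):
    """True if the insert size falls inside the window."""
    low, high = window
    return low <= insert_size <= high


# Made-up insert sizes for two lanes from libraries of different fragment length.
lane_a = [290, 295, 300, 305, 310]
lane_b = [480, 490, 500, 510, 520]

window_a = proper_pair_window(lane_a)           # lane aligned on its own
pooled = proper_pair_window(lane_a + lane_b)    # both lanes as one read group

# A lane A pair with a 420 bp insert: unusual for lane A alone,
# but unremarkable once the two libraries are pooled.
print(in_window(420, window_a))
print(in_window(420, pooled))
```

With these toy numbers the same pair is outside the window for lane A alone and inside the pooled window, so its proper-pair status would change depending on how the input was grouped.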
Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far along they are while running? For example, is s58_cgpPindel_pin2vcf_95 three steps from the finish, or 50 steps from the finish?

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline.

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.
> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pulls data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.

> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all you can run all the workflows for that donor and evaluate the results. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards
Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.
About correctness, my scripts also cover that part; if you are not using them they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one Jonas and I have both tested, and it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".

I just started a second run on a different VM on the same donor, just to compare run times.

The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk, and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing?
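The VCF comparison described earlier in this quoted thread (reduce each variant to chromosome, position, reference and alternative allele, then measure the overlap) can be sketched as follows. This is not the content of bin/compare_result_type.sh; the function names are hypothetical, and splitting multi-allelic ALT fields into one key per allele is an assumption of this sketch.

```python
def variant_keys(vcf_lines):
    """Reduce VCF records to a set of (chrom, pos, ref, alt) keys."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the column header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # assumption: one key per ALT allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare(test_lines, truth_lines):
    """Count variants shared by both call sets and unique to each."""
    test, truth = variant_keys(test_lines), variant_keys(truth_lines)
    return {
        "common": len(test & truth),
        "only_test": len(test - truth),
        "only_truth": len(truth - test),
    }


# Toy records, not real calls.
truth = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tT",
    "1\t200\t.\tG\tC,T",
    "2\t50\t.\tC\tG",
]
test = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tT",
    "1\t200\t.\tG\tC",
    "3\t10\t.\tT\tA",
]

result = compare(test, truth)
print(result)
```

Because only the four key fields are compared, differences in QUAL, FILTER or INFO between a new run and the GNOS result would not count as mismatches in this sketch.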
Please feel free to update the wiki with any updates: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue Suite 510
Toronto, Ontario Canada M5G 0A3
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter. https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result.
The documentation there is reasonably updated, but if this sounds good then perhaps he could contact me and I could walk him through the details. Best regards Miguel On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang > wrote: Dear Docktesters, George Mihaiescu, cloud architect, of the Collaboratory at OICR plans to run some bioinformatics workflows to test Collab environment. Just thought this is a good opportunity to use as extra help for testing out the PCAWG dockerized workflows. Miguel, Denis and others, what workflows / datasets do you think would be good for George to run? Thanks, Junjun From: > on behalf of Denis Yuen > Date: Wednesday, March 1, 2017 at 10:26 AM To: "docktesters at lists.icgc.org" > Subject: [DOCKTESTERS] Thanks! Hi, Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date. https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks! Denis Yuen Senior Software Developer OntarioInstituteforCancerResearch MaRSCentre 661 University Avenue Suite510 Toronto, Ontario,Canada M5G0A3 Toll-free: 1-866-678-6427 Twitter: @OICR_news www.oicr.on.ca This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. _______________________________________________ docktesters mailing list docktesters at lists.icgc.org https://lists.icgc.org/mailman/listinfo/docktesters The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 
06885462, with its registered office at 1 Midland Road London NW1 1AT _______________________________________________ docktesters mailing list docktesters at lists.icgc.org https://lists.icgc.org/mailman/listinfo/docktesters The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From Junjun.Zhang at oicr.on.ca Tue Mar 14 10:21:01 2017 From: Junjun.Zhang at oicr.on.ca (Junjun Zhang) Date: Tue, 14 Mar 2017 14:21:01 +0000 Subject: [DOCKTESTERS] Thanks! In-Reply-To: References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> Message-ID: Hi Jonas, Much appreciated for your kind offer. Once we get new alignment result using lane level BAM input, it should be easier to diagnosis the mismatches. Regards, Junjun From: Jonas Demeulemeester > Date: Tuesday, March 14, 2017 at 9:49 AM To: Miguel Vazquez > Cc: Junjun Zhang >, Keiran Raine >, George Mihaiescu >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi Miguel, I?ll have a go at modifying your scripts to do this kind of preprocessing. As to why alignment by lane level vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc may not be that different between the different libraries (for our tested donors at least). Please correct me if I?m wrong though! 
Best regards, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk<%22mailto:> W: www.crick.ac.uk<%22http://> On 14 Mar 2017, at 12:44, Miguel Vazquez > wrote: Hi Junjun and Keiran, I'm sorry guys, but his is too alien for me, this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into what ever I need to run BWA-Mem on. At least pseudo-code or something that I can start with. I think perhaps someone more knowledgeable than me should consider if this procedure as a whole is acceptable in terms of reproducibility, and how would be best to document it or if it could possibly be improved. Also, I don't think I understand the nature of the problem because from what I can fathom this problem should have either broken the process or render a much larger of discrepancies than 3%. Can someone explain in layman words how can only 3% of reads be affected? Best regards Miguel On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang > wrote: Hi Kieran, Thanks for the detailed explanation. So, in order to reproduce PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input. A processing is needed to prepare lane level BAMs from merged BAM. @Migual, hope this is helpful. Let us know if you have any other questions. Best regards Junjun On Mar 14, 2017, at 5:16 AM, Keiran Raine > wrote: Hi Junjun, You won't be able to separate out the readgroups in the headers if the input is a merged BAM file . If there are different libraries, read lengths etc it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates). 
If you really need to do it this way you'd have to add a pre-processing step, bamtofastq can split a BAM into it's component readgroups in a single pass. Regards, Keiran Raine Principal Bioinformatician Cancer Genome Project Wellcome Trust Sanger Institute kr2 at sanger.ac.uk Tel:+44 (0)1223 834244 Ext: 4983 Office: H104 On 13 Mar 2017, at 21:16, Junjun Zhang > wrote: Hi Keiran, Can you please comment on this, i.e., comparison between alignment done lane by lane v.s. done with all lanes mixed? Basically, we are trying to prepare input BAMs for testing PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAM any more. The key point here is: should input BAM organized by lanes, one lane one BAM? Or just one BAM containing all lanes? Thanks, Junjun From: Miguel Vazquez > Date: Monday, March 13, 2017 at 2:31 PM To: Junjun Zhang > Cc: George Mihaiescu >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi Junjun About the unaligned BAM files, in fact I do have them for the two test I've ran. I could put them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around. About the number of lanes let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to: 1- Download the metadata for the BAM file 2- Determine the read_groups 3- Split the BAM file according to these read_groups 4- Unalign these BAM files and produce header files with different lanes 5- Run BWA-Mem 6- Compare collectively the reads from these BAM files with the original BAM Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? 
Also I asked Keiran about these headers and he said there where OK. If you could please confirm that I need to do this extended process I'd be grateful, because its quite involved and there are concepts here I'm not familiar with. Regards Miguel On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang > wrote: Hi Miguel, I thought you kept the unaligned sequence you prepared for the testing. Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate. When BWA MEM workflow runs, the alignments are done one lane level BAM at a time, then merge the aligned BAM later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201 I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) in the aligned BAMs. This has big impact on the alignment result when lanes are aligned independently comparing aligned altogether. The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is single lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles So, I think in order to perform testing alignment workflow properly, we will need to prepare lane level unaligned BAM (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63, it has 7 read groups (search for read_group). It needs to be converted to 7 individual lane level BAM files. 
Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542 Hope this helps, Junjun From: Miguel Vazquez > Date: Monday, March 13, 2017 at 1:01 PM To: George Mihaiescu > Cc: Jonas Demeulemeester >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The analigned BAM files are not available as far as I know, rather you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32 About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu > wrote: Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already? Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running? For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 steps from finish? Thank you, George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Answers inline On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote: Hi Miguel, I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks. I think it still should take a long time. My scripts will run one workflow after another. 
Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue. I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this. Let me know if you need any help I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now. Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the DKFZ file needs, namely related to copy number I believe. I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have all you can run all the workflows for that donor and evaluate results. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv Regards Miguel George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 6:53 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The Sanger workflow is very lengthy, it takes about two weeks in my tests. 
About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46 The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best. Miguel On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote: Hi, I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" I just started a second run on a different VM on same donor, just to compare run times. The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness. Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow. George From: Junjun Zhang > Date: Sunday, March 12, 2017 at 10:45 PM To: Jonas Demeulemeester >, George Mihaiescu > Cc: Miguel Vazquez >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? 
Please feel free to update the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
Email: George.Mihaiescu at oicr.on.ca

From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and BiasFilter workflows.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access, and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result. The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards
Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang wrote:

Dear Docktesters,

George Mihaiescu, cloud architect of the Collaboratory at OICR, plans to run some bioinformatics workflows to test the Collab environment. I thought this would be a good opportunity to get extra help testing the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?

Thanks,
Junjun

From: on behalf of Denis Yuen
Date: Wednesday, March 1, 2017 at 10:26 AM
To: "docktesters at lists.icgc.org"
Subject: [DOCKTESTERS] Thanks!

Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date.
https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data
As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks!

Denis Yuen

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

From kr2 at sanger.ac.uk Tue Mar 14 10:32:40 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Tue, 14 Mar 2017 14:32:40 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca>
Message-ID: <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>

Hi,

You would also only expect a minimal level of duplicates in a good test sample, and likely quite a small number of readgroups.

Keiran

From: Jonas Demeulemeester
Date: Tuesday, 14 March 2017 at 13:49
To: Miguel Vazquez
Cc: Junjun Zhang, Keiran Raine, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Miguel,

I'll have a go at modifying your scripts to do this kind of preprocessing.
As to why lane-level alignment vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc. may not be that different between the different libraries (for our tested donors at least).
Please correct me if I'm wrong though!

Best regards,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 14 Mar 2017, at 12:44, Miguel Vazquez wrote:

Hi Junjun and Keiran,

I'm sorry guys, but this is too alien for me; this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into whatever I need to run BWA-Mem on. At least pseudo-code or something that I can start with.

I think perhaps someone more knowledgeable than me should consider whether this procedure as a whole is acceptable in terms of reproducibility, and how best to document it or whether it could be improved.

Also, I don't think I understand the nature of the problem, because from what I can fathom this problem should have either broken the process or produced a much larger number of discrepancies than 3%. Can someone explain in layman's terms how only 3% of reads can be affected?

Best regards
Miguel

On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang wrote:

Hi Keiran,

Thanks for the detailed explanation. So, in order to reproduce the PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input. A preprocessing step is needed to prepare lane level BAMs from a merged BAM.

@Miguel, hope this is helpful. Let us know if you have any other questions.

Best regards
Junjun

On Mar 14, 2017, at 5:16 AM, Keiran Raine wrote:

Hi Junjun,

You won't be able to separate out the readgroups in the headers if the input is a merged BAM file. If there are different libraries, read lengths etc. it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates).
If you really need to do it this way you'd have to add a pre-processing step; bamtofastq can split a BAM into its component readgroups in a single pass.

Regards,
Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute
kr2 at sanger.ac.uk
Tel: +44 (0)1223 834244 Ext: 4983
Office: H104

On 13 Mar 2017, at 21:16, Junjun Zhang wrote:

Hi Keiran,

Can you please comment on this, i.e., the comparison between alignment done lane by lane vs. done with all lanes mixed?

Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAMs any more. The key point here is: should the input BAMs be organized by lane, one lane one BAM? Or just one BAM containing all lanes?

Thanks,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 2:31 PM
To: Junjun Zhang
Cc: George Mihaiescu, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

About the unaligned BAM files, in fact I do have them for the two tests I've run. I could make them available for George, but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes, let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143; could this be the reason for that as well?
Also, I asked Keiran about these headers and he said they were OK.

If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards
Miguel

On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:

Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing. Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) there are in the aligned BAMs. This has a big impact on the alignment result, because aligning lanes independently differs from aligning them all together.

The PCAWG Sequence Submission SOP has a step to prepare unaligned BAMs, but it only works when the input is a single lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to test the alignment workflow properly, we will need to prepare lane level unaligned BAMs (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 has 7 read groups (search for read_group). It needs to be converted to 7 individual lane level BAM files.
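
As a rough illustration of the "determine the read groups" step, here is a minimal sketch that lists the @RG entries in a SAM header and maps each one to a per-lane output name. It assumes the header text has already been extracted (for instance with `samtools view -H`); the function names, the example header and the output naming scheme are hypothetical and are not taken from prepare_unaligned.sh or any PCAWG script.

```python
def parse_read_groups(header_text):
    """Return one dict of tag -> value for every @RG line in a SAM header."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs; values may
        # themselves contain colons, so split on the first colon only.
        fields = line.split("\t")[1:]
        groups.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return groups


def lane_bam_names(header_text, prefix):
    """Hypothetical naming scheme: one unaligned BAM per read group."""
    return {
        rg["ID"]: f"{prefix}.{index}.bam"
        for index, rg in enumerate(parse_read_groups(header_text), start=1)
    }


# Made-up header with two read groups, for illustration only.
example_header = "\n".join([
    "@HD\tVN:1.4",
    "@RG\tID:CENTRE:lane_A\tLB:lib1\tSM:sample1\tPU:lane_A_1",
    "@RG\tID:CENTRE:lane_B\tLB:lib1\tSM:sample1\tPU:lane_B_2",
])

for rg_id, name in lane_bam_names(example_header, "tumor.unaligned").items():
    print(rg_id, "->", name)
```

The number of dictionaries returned is the number of lane level BAM files the split should produce; the actual splitting of reads would still be done by a dedicated tool.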

Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you can see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have the steps involved in each workflow documented somewhere, so I can get an idea of how far along they are while running? For example, is s58_cgpPindel_pin2vcf_95 three steps from finish, or 50 steps from finish?

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pulls data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.

> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards
Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part; if you are not using them they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one Jonas and I have both tested, and it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59".

I just started a second run on a different VM on the same donor, just to compare run times.

The VM used has 8 cores, 48 GB of RAM and 1.1 TB of disk, and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing?
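
The VCF comparison Miguel describes earlier in this thread (reduce each variant to chromosome, position, reference and alternative allele, then measure the overlap) could be sketched roughly as follows. This is not the code in bin/compare_result_type.sh; the function names are hypothetical, the inputs are plain uncompressed VCF text, and splitting multi-allelic records into one key per ALT allele is an assumption made here.

```python
def variant_keys(vcf_text):
    """Return the set of CHROM:POS:REF:ALT keys found in uncompressed VCF text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        # Assumption: treat each ALT allele of a multi-allelic record separately.
        for alt in alts.split(","):
            keys.add(f"{chrom}:{pos}:{ref}:{alt}")
    return keys


def compare(original_text, rerun_text):
    """Count variants shared by both files, only in the re-run, and only in the original."""
    original = variant_keys(original_text)
    rerun = variant_keys(rerun_text)
    return {
        "common": len(original & rerun),
        "extra": len(rerun - original),
        "missing": len(original - rerun),
    }


# Tiny made-up inputs, for illustration only.
header = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n"
original_vcf = header + "1\t100\t.\tA\tT\n2\t200\t.\tG\tC\n"
rerun_vcf = header + "1\t100\t.\tA\tT\n3\t300\t.\tC\tG,T\n"

print(compare(original_vcf, rerun_vcf))
```

Real PCAWG VCFs are usually gzip-compressed, so the text would need to be read through a decompressing reader first.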
Please feel free to update the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From miguel.vazquez at cnio.es Wed Mar 15 06:09:36 2017
From: miguel.vazquez at cnio.es (Miguel Vazquez)
Date: Wed, 15 Mar 2017 11:09:36 +0100
Subject: [DOCKTESTERS] Important correction: DKFZ BiasFilter large missmatch on DO52140 and DO35937
Message-ID:

Dear all,

As you can read below, I made a mistake in my previous validation of the DKFZ BiasFilter. Unfortunately, large differences have turned up now that I've corrected the process.

In brief, on both donors I've found that re-running the filter flags some additional variants for both flags, bPcr and bSeq. Notably, all the discrepancies are cases of the new run flagging more variants. For instance, in several cases the original file contains just one variant with the flag where the new one contains ten or twenty.

You can read the details at the end of this email, where we compare the original VCF to the new one. Note that the original VCF is the consensus variants file, which is the input I use for the BiasFilter along with the corresponding BAM files for that donor. I can only imagine that if this VCF was not the one originally used, due to some filtering step, then perhaps the bias calculations might have been affected.
If that is so I would need instructions on where to get the precise input VCFs.

Best regards

Miguel

----RESULTS----

Comparison for *DO52140* tag *bPcr*
---
Common: *1*
Extra: *12*
 - Example: 11:81550771:C:A,12:19486241:G:T,2:12287406:G:T
Missing: 0

Comparison for *DO52140* tag *bSeq*
---
Common: *1*
Extra: *23*
 - Example: 10:17681457:G:T,12:112049882:T:G,12:130990011:T:A
Missing: 0

Comparison for *DO35937* tag *bPcr*
---
Common: *1*
Extra: *10*
 - Example: 1:114845662:G:T,14:33282600:C:A,16:78467879:G:T
Missing: 0

Comparison for *DO35937* tag *bSeq*
---
Common: *6*
Extra: *88*
 - Example: 10:21703903:A:C,10:24183103:C:T,10:51468498:C:G
Missing: 0

On Mon, Mar 13, 2017 at 4:48 PM, Christina Yung wrote:

> Hi Miguel,
>
> The bPCR and bSeq flags are indeed the ones flagged by the DKFZ bias
> filter. When you summarize the comparison, please cc Matthias of DKFZ as
> his team developed this filter. No issue at all, and thanks again for your
> great work!
>
> Christina
>
> On 3/13/2017 10:22 AM, Miguel Vazquez wrote:
>
> Dear all,
>
> I just learnt that the DKFZ BiasFilter is NOT the OXOG filter workflow,
> which means *I checked for the wrong thing in this validation!* I'm sorry
> for the confusion.
>
> Right now I pass the BAM files and the consensus.vcf (SNV_MNV) downloaded
> from GNOS to the BiasFilter and compare the resulting VCF with the
> consensus, looking at the set of mutations containing the OXOGFAIL flag.
> This apparently is not the comparison to make. *What is it that I need to
> compare? Is it the bPcr and bSeq flags?*
>
> A first look at those flags does show quite some discrepancies,
> unfortunately, on both donors (DO52140 and DO35937) for both flags. For
> instance, for DO35937 we find 11 mutations flagged bPcr in the new
> result, while the consensus.vcf only has one of them. Something similar
> happens with bSeq.
>
> Can you please confirm this so I can reply with a full report on this.
>
> Kind regards, and sorry again for the confusion.
>
> Miguel
>
> On Mon, Feb 27, 2017 at 7:30 PM, Miguel Vazquez wrote:
>
>> Dear friends,
>>
>> I've performed the first test with the DKFZ BiasFilter and got a perfect
>> match. There are 55 variants annotated with OXOGFAIL and they are the same
>> in the input VCF file (consensus SNV/MNV VCF for that donor) and the output
>> of the BiasFilter. I'm running the test on a second donor.
>>
>> Best regards
>>
>> Miguel
>
> _______________________________________________
> docktesters mailing list
> docktesters at lists.icgc.org
> https://lists.icgc.org/mailman/listinfo/docktesters

From Jonas.Demeulemeester at crick.ac.uk Wed Mar 15 06:20:51 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Wed, 15 Mar 2017 10:20:51 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
Message-ID:

Hi all,

I've written up the code to prepare unaligned bam files split by read group from the merged bams (prepare_unaligned.sh; I deprecated the previous one as prepare_unaligned_deprecated.sh).
Briefly, it's using Picard to split and reset the bams and afterwards to correct the headers.
(I've added a wrapper script to install Picard locally as well: install_picard.sh)

For subsampled merged DO50311 bam files this results in 5 separate bams for the tumor (tumor.unaligned.1-5.bam) with the following headers, corresponding to the 5 different read groups in the original data:

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_7
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03'''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_8
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_1
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_2
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_6
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

For the (subsampled) normal there are 3 read groups and hence 3 unaligned bam files (normal.unaligned.1-3.bam) with the following headers:

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_1
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_8
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_7
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

I've also modified the downstream code to run tests and prepare the JSON file for input (run_test.sh, BWA-Mem.json.template).
If I'm not mistaken, feeding either all tumor or all normal bam files to the BWA-Mem docker should result in the desired, merged output, as all files are processed separately internally before being merged in a final BWA-Mem docker step. Please correct me if I'm wrong.
In any case, I'm pushing the code from my repo (https://github.com/jdemeul/PCAWG-Docker-Test) to Miguel's (https://github.com/mikisvaz/PCAWG-Docker-Test), so anyone interested can look at it (and try it).

Using this setup, the BWA-Mem docker runs successfully here (on my downsampled DO50311 dummy bams), up until the point where the output unaligned_bam_bai file needs to be collected.
(Error while running job: Error collecting output for parameter 'merged_output_bai': Long-running script killed after 20 seconds.)
This is an error I was having before as well. I initially thought it was a disk space issue, but I no longer think this is the case.
I've attached the run output; does anyone know what might be the issue here?

Best wishes,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 14 Mar 2017, at 14:32, Keiran Raine wrote:

Hi,

You would also only expect a minimal level of duplicates in a good test sample, and likely quite a small number of readgroups.

Keiran

From: Jonas Demeulemeester
Date: Tuesday, 14 March 2017 at 13:49
To: Miguel Vazquez
Cc: Junjun Zhang, Keiran Raine, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!
Hi Miguel, I?ll have a go at modifying your scripts to do this kind of preprocessing. As to why alignment by lane level vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc may not be that different between the different libraries (for our tested donors at least). Please correct me if I?m wrong though! Best regards, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 14 Mar 2017, at 12:44, Miguel Vazquez > wrote: Hi Junjun and Keiran, I'm sorry guys, but his is too alien for me, this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into what ever I need to run BWA-Mem on. At least pseudo-code or something that I can start with. I think perhaps someone more knowledgeable than me should consider if this procedure as a whole is acceptable in terms of reproducibility, and how would be best to document it or if it could possibly be improved. Also, I don't think I understand the nature of the problem because from what I can fathom this problem should have either broken the process or render a much larger of discrepancies than 3%. Can someone explain in layman words how can only 3% of reads be affected? Best regards Miguel On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang > wrote: Hi Kieran, Thanks for the detailed explanation. So, in order to reproduce PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input. A processing is needed to prepare lane level BAMs from merged BAM. @Migual, hope this is helpful. Let us know if you have any other questions. 
Best regards Junjun On Mar 14, 2017, at 5:16 AM, Keiran Raine > wrote: Hi Junjun, You won't be able to separate out the readgroups in the headers if the input is a merged BAM file . If there are different libraries, read lengths etc it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates). If you really need to do it this way you'd have to add a pre-processing step, bamtofastq can split a BAM into it's component readgroups in a single pass. Regards, Keiran Raine Principal Bioinformatician Cancer Genome Project Wellcome Trust Sanger Institute kr2 at sanger.ac.uk Tel:+44 (0)1223 834244 Ext: 4983 Office: H104 On 13 Mar 2017, at 21:16, Junjun Zhang > wrote: Hi Keiran, Can you please comment on this, i.e., comparison between alignment done lane by lane v.s. done with all lanes mixed? Basically, we are trying to prepare input BAMs for testing PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAM any more. The key point here is: should input BAM organized by lanes, one lane one BAM? Or just one BAM containing all lanes? Thanks, Junjun From: Miguel Vazquez > Date: Monday, March 13, 2017 at 2:31 PM To: Junjun Zhang > Cc: George Mihaiescu >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi Junjun About the unaligned BAM files, in fact I do have them for the two test I've ran. I could put them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around. About the number of lanes let me just say good grief! This is the first time I hear about it. 
So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143; could this be the reason for that as well? Also, I asked Keiran about these headers and he said they were OK.

If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.

Regards
Miguel

On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:

Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing. Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result when lanes are aligned independently compared with being aligned all together.
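To check how many lanes a merged BAM contains, and hence how many lane-level BAMs it should be split into, one can count the @RG records in its header. The following is only a minimal sketch: it assumes the header text has already been dumped (for instance with `samtools view -H`), and the function name `read_group_ids` and the two-lane sample header are illustrative, not taken from the PCAWG scripts.

```python
def read_group_ids(header_text):
    """Return the ID of each @RG record in a SAM header, in file order."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # After the record type, header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            # Split on the first colon only, since IDs may contain colons.
            tag, _, value = field.partition(":")
            if tag == "ID":
                ids.append(value)
    return ids


# Hypothetical two-lane header, for illustration only.
example_header = (
    "@HD\tVN:1.4\n"
    "@RG\tID:lane1\tLB:lib1\tSM:sample\n"
    "@RG\tID:lane2\tLB:lib1\tSM:sample\n"
)
print(read_group_ids(example_header))  # ['lane1', 'lane2']
```

The length of the returned list is the number of lane-level unaligned BAMs the preparation step would be expected to produce.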
The PCAWG Sequence Submission SOP has a step to prepare unaligned BAMs, but it only works when the input is a single-lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to test the alignment workflow properly, we will need to prepare lane-level unaligned BAMs (one lane, one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 has 7 read groups (search for read_group). It needs to be converted to 7 individual lane-level BAM files.

Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.

On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu wrote:

Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
For example, is s58_cgpPindel_pin2vcf_95 three steps from finish, or 50 steps from finish?

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file, not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.

> By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now.
You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards
Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part; even if you are not using them, they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one Jonas and I have both tested, and it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59". I just started a second run on a different VM on the same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk, and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and the workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free to update the wiki with any news: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.
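The correctness check Miguel describes earlier in this thread (reduce each VCF to chromosome, position, reference and alternative allele, then measure the overlap) can be sketched roughly as below. This is not the code in bin/compare_result_type.sh; the function names are hypothetical, the VCFs are assumed to be already decompressed text, and splitting multi-allelic ALT fields into one key per allele is an assumption made for this illustration.

```python
def variant_keys(vcf_text):
    """Reduce a VCF to a set of (chrom, pos, ref, alt) tuples."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # Assumption: treat each alternative allele as its own variant.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def overlap(test_vcf, reference_vcf):
    """Count variants shared by, and private to, the two call sets."""
    test, ref = variant_keys(test_vcf), variant_keys(reference_vcf)
    return {
        "common": len(test & ref),
        "only_test": len(test - ref),
        "only_reference": len(ref - test),
    }


# Tiny made-up call sets, for illustration only.
test_calls = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tT\n1\t200\t.\tG\tC,A\n"
gnos_calls = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tT\n1\t200\t.\tG\tC\n2\t5\t.\tT\tG\n"
print(overlap(test_calls, gnos_calls))
```

Comparing on these four fields alone ignores quality scores and annotations, so two runs count as agreeing whenever they call the same allele at the same site.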
From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the results. The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards
Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang wrote:

Dear Docktesters,

George Mihaiescu, cloud architect of the Collaboratory at OICR, plans to run some bioinformatics workflows to test the Collab environment. Just thought this is a good opportunity to get extra help testing the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?

Thanks,
Junjun

From: on behalf of Denis Yuen
Date: Wednesday, March 1, 2017 at 10:26 AM
To: "docktesters at lists.icgc.org"
Subject: [DOCKTESTERS] Thanks!

Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date.

https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data

As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks!
Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario, Canada M5G 0A3
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: BWA-Mem_error_log_dummy.txt
URL:

From Jonas.Demeulemeester at crick.ac.uk Wed Mar 15 06:30:12 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Wed, 15 Mar 2017 10:30:12 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
Message-ID:

Note, ignore the dcc_specimen_type comment line in the headers below. I've since changed these to

@CO dcc_specimen_type:Primary tumour - solid tissue

for tumor samples, and

@CO dcc_specimen_type:Normal

for the normal.

Best,
Jonas
_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 15 Mar 2017, at 10:20, Jonas Demeulemeester wrote:

Hi all,

I've written up the code to prepare unaligned bam files split by read group from the merged bams (prepare_unaligned.sh, I deprecated the previous one as prepare_unaligned_deprecated.sh). Briefly, it's using Picard to split and reset the bams and afterwards to correct the headers.
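As an illustration of the split-and-reset step, here is a minimal sketch that builds a Picard RevertSam command writing one unaligned bam per read group. It is not taken from prepare_unaligned.sh, whose exact Picard tools and options are not shown in this thread; the helper name `revert_sam_command` and all paths are hypothetical.

```python
def revert_sam_command(picard_jar, merged_bam, out_dir):
    """Build a Picard RevertSam call that splits a merged bam by read group."""
    return [
        "java", "-jar", picard_jar, "RevertSam",
        "I=" + merged_bam,
        "O=" + out_dir,  # an output directory when splitting by read group
        "OUTPUT_BY_READGROUP=true",
    ]


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, the command is only printed.
    cmd = revert_sam_command("picard.jar", "tumor.bam", "tumor_unaligned")
    print(" ".join(cmd))
```

The command list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files. Correcting the headers afterwards (the @RG and @CO lines) would be a separate step.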
(I've added a wrapper script to install Picard locally as well: install_picard.sh)

For the subsampled merged DO50311 bam files this results in 5 separate bams for the tumor (tumor.unaligned.1-5.bam) with the following headers, corresponding to the 5 different read groups in the original data:

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_7
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03'''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_8
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_1
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_2
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005334-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_6
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

For the (subsampled) normal there are 3 read groups and hence 3 unaligned bam files (normal.unaligned.1-3.bam) with the following headers:

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_1
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_8
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

@HD VN:1.4
@RG ID:CRUK-CI:LP6005333-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_7
@CO dcc_project_code:DOCKER-TEST
@CO submitter_donor_id:dummy
@CO submitter_specimen_id:dummy.specimen
@CO submitter_sample_id:dummy.sample
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

I've also modified the downstream code to run tests and prepare the JSON file for input (run_test.sh, BWA-Mem.json.template). If I'm not mistaken, feeding either all tumor or all normal bam files to the BWA-Mem docker should result in the desired merged output, as all files are processed separately internally before being merged in a final BWA-Mem docker step. Please correct me if I'm wrong.

In any case, I'm pushing the code from my repo (https://github.com/jdemeul/PCAWG-Docker-Test) to Miguel's (https://github.com/mikisvaz/PCAWG-Docker-Test), so anyone interested can look at it (and try it).

Using this setup, the BWA-Mem docker runs successfully here (on my downsampled DO50311 dummy bams), up until the point where the output unaligned_bam_bai file needs to be collected. (Error while running job: Error collecting output for parameter 'merged_output_bai': Long-running script killed after 20 seconds.) This is an error I was having before as well; I initially thought it was a disk space issue, but I no longer think this is the case. I've attached the run output. Does anyone know what might be the issue here?

Best wishes,
Jonas
_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London NW1 1AT
T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

On 14 Mar 2017, at 14:32, Keiran Raine wrote:

Hi,

You would also only expect a minimal level of duplicates in a good test sample, and likely quite a small number of readgroups.

Keiran

From: Jonas Demeulemeester
Date: Tuesday, 14 March 2017 at 13:49
To: Miguel Vazquez
Cc: Junjun Zhang, Keiran Raine, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!
Hi Miguel, I?ll have a go at modifying your scripts to do this kind of preprocessing. As to why alignment by lane level vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc may not be that different between the different libraries (for our tested donors at least). Please correct me if I?m wrong though! Best regards, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 14 Mar 2017, at 12:44, Miguel Vazquez > wrote: Hi Junjun and Keiran, I'm sorry guys, but his is too alien for me, this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into what ever I need to run BWA-Mem on. At least pseudo-code or something that I can start with. I think perhaps someone more knowledgeable than me should consider if this procedure as a whole is acceptable in terms of reproducibility, and how would be best to document it or if it could possibly be improved. Also, I don't think I understand the nature of the problem because from what I can fathom this problem should have either broken the process or render a much larger of discrepancies than 3%. Can someone explain in layman words how can only 3% of reads be affected? Best regards Miguel On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang > wrote: Hi Kieran, Thanks for the detailed explanation. So, in order to reproduce PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input. A processing is needed to prepare lane level BAMs from merged BAM. @Migual, hope this is helpful. Let us know if you have any other questions. 
Best regards Junjun On Mar 14, 2017, at 5:16 AM, Keiran Raine > wrote: Hi Junjun, You won't be able to separate out the readgroups in the headers if the input is a merged BAM file . If there are different libraries, read lengths etc it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates). If you really need to do it this way you'd have to add a pre-processing step, bamtofastq can split a BAM into it's component readgroups in a single pass. Regards, Keiran Raine Principal Bioinformatician Cancer Genome Project Wellcome Trust Sanger Institute kr2 at sanger.ac.uk Tel:+44 (0)1223 834244 Ext: 4983 Office: H104 On 13 Mar 2017, at 21:16, Junjun Zhang > wrote: Hi Keiran, Can you please comment on this, i.e., comparison between alignment done lane by lane v.s. done with all lanes mixed? Basically, we are trying to prepare input BAMs for testing PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAM any more. The key point here is: should input BAM organized by lanes, one lane one BAM? Or just one BAM containing all lanes? Thanks, Junjun From: Miguel Vazquez > Date: Monday, March 13, 2017 at 2:31 PM To: Junjun Zhang > Cc: George Mihaiescu >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi Junjun About the unaligned BAM files, in fact I do have them for the two test I've ran. I could put them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around. About the number of lanes let me just say good grief! This is the first time I hear about it. 
So if I understand you correctly I need to: 1- Download the metadata for the BAM file 2- Determine the read_groups 3- Split the BAM file according to these read_groups 4- Unalign these BAM files and produce header files with different lanes 5- Run BWA-Mem 6- Compare collectively the reads from these BAM files with the original BAM Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? Also I asked Keiran about these headers and he said there where OK. If you could please confirm that I need to do this extended process I'd be grateful, because its quite involved and there are concepts here I'm not familiar with. Regards Miguel On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang > wrote: Hi Miguel, I thought you kept the unaligned sequence you prepared for the testing. Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate. When BWA MEM workflow runs, the alignments are done one lane level BAM at a time, then merge the aligned BAM later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201 I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) in the aligned BAMs. This has big impact on the alignment result when lanes are aligned independently comparing aligned altogether. 
The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is single lane BAM file: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles So, I think in order to perform testing alignment workflow properly, we will need to prepare lane level unaligned BAM (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63, it has 7 read groups (search for read_group). It needs to be converted to 7 individual lane level BAM files. Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542 Hope this helps, Junjun From: Miguel Vazquez > Date: Monday, March 13, 2017 at 1:01 PM To: George Mihaiescu > Cc: Jonas Demeulemeester >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The analigned BAM files are not available as far as I know, rather you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32 About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu > wrote: Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker. Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already? Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running? 
For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 steps from finish? Thank you, George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Answers inline On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote: Hi Miguel, I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks. I think it still should take a long time. My scripts will run one workflow after another. Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue. I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this. Let me know if you need any help I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now. Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the DKFZ file needs, namely related to copy number I believe. I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. 
You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have all you can run all the workflows for that donor and evaluate results. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv Regards Miguel George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 6:53 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The Sanger workflow is very lengthy, it takes about two weeks in my tests. About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46 The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best. Miguel On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote: Hi, I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" I just started a second run on a different VM on same donor, just to compare run times. 
The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness. Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow. George From: Junjun Zhang > Date: Sunday, March 12, 2017 at 10:45 PM To: Jonas Demeulemeester >, George Mihaiescu > Cc: Miguel Vazquez >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free updating the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference Regards, Junjun From: Jonas Demeulemeester > Date: Saturday, March 11, 2017 at 7:15 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know! Cheers, Jonas On 11 Mar 2017, at 17:00, George Mihaiescu > wrote: Sure, I'll give it a try and report later. Thank you, George Mihaiescu Senior Cloud Architect Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 510 Toronto, Ontario Canada M5G 0A3 Email: George.Mihaiescu at oicr.on.ca Toll-free: 1-866-678-6427 Twitter: @OICR_news www.oicr.on.ca This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. 
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mikisvaz at gmail.com Wed Mar 15 06:30:04 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 15 Mar 2017 11:30:04 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
Message-ID:

Excellent Jonas! Thank you so much. I'll pull your changes and try to help out debugging the issues.

Cheers
Miguel

On Wed, Mar 15, 2017 at 11:20 AM, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk> wrote:

> Hi all,
>
> I've written up the code to prepare unaligned bam files split by read group from the merged bams (*prepare_unaligned.sh*, I deprecated the previous one as *prepare_unaligned_deprecated.sh*).
> Briefly, it's using Picard to split and reset the bams and afterwards to correct the headers.
> (I've added a wrapper script to install Picard locally as well: *install_picard.sh*)
>
> For subsampled merged DO50311 bam files this results in 5 separate bams for the tumor (tumor.unaligned.1-5.bam) with the following headers corresponding to the 5 different read groups in the original data:
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_7
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03'''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_8
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_1
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_2
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_6
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> For the (subsampled) normal there are 3 read groups and hence 3 unaligned bam files (normal.unaligned.1-5.bam) with the following headers
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_1
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_8
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_7
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> I've also modified the downstream code to run tests and prepare the JSON file for input (*run_test.sh*, *BWA-Mem.json.template*).
> If I'm not mistaken, feeding either all tumor or all normal bam files to the BWA-Mem docker should result in the desired, merged output, as all files are processed separately internally before being merged in a final BWA-Mem docker step.
> Please correct me if I'm wrong.
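The headers quoted above carry one @RG line per lane-level BAM. As a rough Python sketch (not the Picard-based prepare_unaligned.sh), the read groups in a merged BAM header could be listed, and an output name planned for each, along these lines. It assumes the header text has tab-separated fields as in the SAM format (the tabs show up as spaces in the mail); the helper names `parse_read_groups` and `plan_lane_bams`, and the order in which output files are numbered, are hypothetical.

```python
from typing import Dict, List


def parse_read_groups(header_text: str) -> List[Dict[str, str]]:
    """Return one tag -> value dict for every @RG line of a SAM header."""
    read_groups: List[Dict[str, str]] = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        tags: Dict[str, str] = {}
        # Fields are TAG:VALUE; values may themselves contain colons
        # (e.g. ID:CRUK-CI:LP6005334-DNA_C03), so split on the first one only.
        for field in line.split("\t")[1:]:
            key, _, value = field.partition(":")
            tags[key] = value
        read_groups.append(tags)
    return read_groups


def plan_lane_bams(header_text: str, sample_label: str) -> Dict[str, str]:
    """Map each read group ID to a lane-level file name such as tumor.unaligned.1.bam."""
    # Assumption: files are numbered in the order the @RG lines appear.
    return {
        rg["ID"]: f"{sample_label}.unaligned.{index}.bam"
        for index, rg in enumerate(parse_read_groups(header_text), start=1)
    }
```

The header text itself can be obtained with `samtools view -H merged.bam`; the actual splitting and resetting of reads is still left to a tool such as Picard or bamtofastq, as discussed in the thread.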
> In any case, I'm pushing the code from my repo (https://github.com/jdemeul/PCAWG-Docker-Test) to Miguel's (https://github.com/mikisvaz/PCAWG-Docker-Test), so anyone interested can look at it (and try it).
>
> Using this setup, the BWA-Mem docker runs successfully here (on my downsampled DO50311 dummy bams), up until the point the output unaligned_bam_bai file needs to be collected.
> (*Error while running job: Error collecting output for parameter 'merged_output_bai': Long-running script killed after 20 seconds.*)
> This is an error I was having before as well. I initially thought it was a disk space issue, but I no longer think this is the case.
> I've attached the run output. Does anyone know what might be the issue here?
>
> Best wishes,
> Jonas
>
> _________________________________
> Jonas Demeulemeester, PhD
> Postdoctoral Researcher
> The Francis Crick Institute
> 1 Midland Road
> London
> NW1 1AT
>
> *T:* +44 (0)20 3796 2594
> M: +44 (0)7482 070730
> *E:* jonas.demeulemeester at crick.ac.uk
> *W:* www.crick.ac.uk
>
> On 14 Mar 2017, at 14:32, Keiran Raine wrote:
>
> Hi,
>
> You would also only expect a minimal level of duplicates in a good test sample, and likely quite a small number of readgroups.
>
> Keiran
>
> *From: *Jonas Demeulemeester
> *Date: *Tuesday, 14 March 2017 at 13:49
> *To: *Miguel Vazquez
> *Cc: *Junjun Zhang, Keiran Raine <kr2 at sanger.ac.uk>, George Mihaiescu, "docktesters at lists.icgc.org"
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
> Hi Miguel,
>
> I'll have a go at modifying your scripts to do this kind of preprocessing.
>
> As to why alignment by lane level vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc. may not be that different between the different libraries (for our tested donors at least).
> Please correct me if I'm wrong though!
>
> Best regards,
> Jonas
>
> On 14 Mar 2017, at 12:44, Miguel Vazquez wrote:
>
> Hi Junjun and Keiran,
>
> I'm sorry guys, but this is too alien for me; this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into whatever I need to run BWA-Mem on. At least pseudo-code or something that I can start with.
>
> I think perhaps someone more knowledgeable than me should consider whether this procedure as a whole is acceptable in terms of reproducibility, and how it would be best to document it or whether it could be improved.
> Also, I don't think I understand the nature of the problem, because from what I can fathom it should have either broken the process or produced a much larger share of discrepancies than 3%. Can someone explain in layman's terms how only 3% of reads can be affected?
>
> Best regards
> Miguel
>
> On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang wrote:
>
> Hi Keiran,
>
> Thanks for the detailed explanation. So, in order to reproduce the PCAWG BWA MEM alignment result, one must use lane-level BAMs (one lane, one BAM) as input.
>
> A processing step is needed to prepare lane-level BAMs from the merged BAM.
>
> @Miguel, hope this is helpful. Let us know if you have any other questions.
>
> Best regards
> Junjun
>
> On Mar 14, 2017, at 5:16 AM, Keiran Raine wrote:
>
> Hi Junjun,
>
> You won't be able to separate out the readgroups in the headers if the input is a merged BAM file. If there are different libraries, read lengths etc., it will cause problems for insert-size determination (used in determining proper pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates).
>
> If you really need to do it this way you'd have to add a pre-processing step; bamtofastq can split a BAM into its component readgroups in a single pass.
>
> Regards,
>
> Keiran Raine
> Principal Bioinformatician
> Cancer Genome Project
> Wellcome Trust Sanger Institute
>
> kr2 at sanger.ac.uk
> Tel:+44 (0)1223 834244 Ext: 4983
> Office: H104
>
> On 13 Mar 2017, at 21:16, Junjun Zhang wrote:
>
> Hi Keiran,
>
> Can you please comment on this, i.e., the comparison between alignment done lane by lane vs. done with all lanes mixed?
>
> Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAMs any more. The key question here is: should the input BAMs be organized by lane, one lane per BAM? Or just one BAM containing all lanes?
>
> Thanks,
> Junjun
>
> *From: *Miguel Vazquez
> *Date: *Monday, March 13, 2017 at 2:31 PM
> *To: *Junjun Zhang
> *Cc: *George Mihaiescu, Jonas Demeulemeester, "docktesters at lists.icgc.org"
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
> Hi Junjun
>
> About the unaligned BAM files, in fact I do have them for the two tests I've run. I could make them available to George, but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.
>
> About the number of lanes, let me just say good grief! This is the first time I hear about it.
> So if I understand you correctly, I need to:
>
> 1- Download the metadata for the BAM file
> 2- Determine the read_groups
> 3- Split the BAM file according to these read_groups
> 4- Unalign these BAM files and produce header files with different lanes
> 5- Run BWA-Mem
> 6- Compare collectively the reads from these BAM files with the original BAM
>
> Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143; could this be the reason for that as well? Also, I asked Keiran about these headers and he said they were OK. If you could please confirm that I need to do this extended process I'd be grateful, because it's quite involved and there are concepts here I'm not familiar with.
>
> Regards
>
> Miguel
>
> On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang wrote:
>
> Hi Miguel,
>
> I thought you kept the unaligned sequence you prepared for the testing.
>
> Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.
>
> When the BWA MEM workflow runs, the alignments are done one lane-level BAM at a time, and the aligned BAMs are merged later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201
>
> I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) are in the aligned BAMs. This has a big impact on the alignment result when lanes are aligned independently compared with being aligned all together.
>
> The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is a *single lane BAM file*: https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles
>
> So, I think in order to test the alignment workflow properly, we will need to prepare *lane level* unaligned BAMs (one lane, one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63 has 7 read groups (search for read_group). It needs to be converted to 7 individual lane-level BAM files.
>
> I'm not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542
>
> Hope this helps,
> Junjun
>
> *From: *Miguel Vazquez
> *Date: *Monday, March 13, 2017 at 1:01 PM
> *To: *George Mihaiescu
> *Cc: *Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
> Hi George,
>
> The unaligned BAM files are not available as far as I know; rather, you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you can see here:
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32
>
> About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer.
>
> On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca> wrote:
>
> Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker.
> Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?
>
> Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
> For example, is s58_cgpPindel_pin2vcf_95 three steps from the finish, or 50 steps from the finish?
>
> Thank you,
> George
>
> *From: *Miguel Vazquez
> *Date: *Monday, March 13, 2017 at 8:52 AM
> *To: *George Mihaiescu
> *Cc: *Junjun Zhang, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
> Hi George,
> Answers inline
>
> On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca> wrote:
>
> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.
>
> I think it should still take a long time. My scripts will run one workflow after another.
>
> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pulls data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; I'm not sure if this is an issue.
>
> I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem.
>
> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.
>
> Let me know if you need any help.
>
> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.
>
> Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.
>
> I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.
>
> Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv
>
> Regards
> Miguel
>
> George
>
> *From: *Miguel Vazquez
> *Date: *Monday, March 13, 2017 at 6:53 AM
> *To: *George Mihaiescu
> *Cc: *Junjun Zhang, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
> Hi George,
>
> The Sanger workflow is very lengthy; it takes about two weeks in my tests.
>
> About correctness, my scripts also cover that part; even if you are not using them, they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script:
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46
>
> The bin/compare_result_type.sh script takes care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting the variants in terms of chromosome, position, reference and alternative allele, and measuring the overlaps.
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh
>
> About which donors to test, DO52140 is one that Jonas and I have both tested, so it could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which option is best.
>
> Miguel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mikisvaz at gmail.com Wed Mar 15 10:15:41 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 15 Mar 2017 15:15:41 +0100
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
Message-ID:

Dear Jonas,

I've pulled your changes and I'm now running the unalignment phase, which as you know takes a bit. I hope that afterwards I can help you debug the error.

Keep us updated if you make any changes in your repo that I need to know about.

Best

M

On Wed, Mar 15, 2017 at 11:20 AM, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk> wrote:

> Hi all,
>
> I've written up the code to prepare unaligned bam files split by read group from the merged bams (*prepare_unaligned.sh*, I deprecated the previous one as *prepare_unaligned_deprecated.sh*).
> Briefly, it's using Picard to split and reset the bams and afterwards to
> correct the headers.
> (I've added a wrapper script to install Picard locally as well:
> *install_picard.sh*)
>
> For subsampled merged DO50311 bam files this results in 5 separate bams
> for the tumor (tumor.unaligned.1-5.bam) with the following headers
> corresponding to the 5 different read groups in the original data:
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_7
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03'''' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_8
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_1
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_2
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005334-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005334-DNA_C03 PM:Illumina HiSeq 2000 SM:b02b4bba-6e66-44fb-a48f-38c309aaaac5 PU:CRUK-CI:LP6005334-DNA_C03_6
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> For the (subsampled) normal there are 3 read groups and hence 3 unaligned
> bam files (normal.unaligned.1-5.bam) with the following headers
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03'' PL:ILLUMINA CN:CRUK-CI DT:2014-07-27T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_1
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03' PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_8
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> @HD VN:1.4
> @RG ID:CRUK-CI:LP6005333-DNA_C03 PL:ILLUMINA CN:CRUK-CI DT:2014-07-26T01:00:00+0100 PI:0 LB:WGS:CRUK-CI:LP6005333-DNA_C03 PM:Illumina HiSeq 2000 SM:8c0354eb-6a3e-4a98-b41c-f8add599884c PU:CRUK-CI:LP6005333-DNA_C03_7
> @CO dcc_project_code:DOCKER-TEST
> @CO submitter_donor_id:dummy
> @CO submitter_specimen_id:dummy.specimen
> @CO submitter_sample_id:dummy.sample
> @CO dcc_specimen_type:Primary tumour - solid tissue
> @CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
>
> I've also modified the downstream code to run tests and prepare the JSON
> file for input (*run_test.sh*, *BWA-Mem.json.template*).
> If I'm not mistaken, feeding either all tumor or all normal bam files to
> the BWA-Mem docker should result in the desired, merged output, as all
> files are processed separately internally before being merged in a final
> BWA-Mem docker step.
> Please correct me if I'm wrong.
> In any case, I'm pushing the code from my repo
> (https://github.com/jdemeul/PCAWG-Docker-Test) to Miguel's
> (https://github.com/mikisvaz/PCAWG-Docker-Test), so anyone interested can
> look at it (and try it)
>
> Using this setup, the BWA-Mem docker runs successfully here (on my
> downsampled DO50311 dummy bams), up until the point the output
> unaligned_bam_bai file needs to be collected.
> (*Error while running job: Error collecting output for parameter
> 'merged_output_bai': Long-running script killed after 20 seconds.*)
> This is an error I was having before as well, and initially thought it was
> a disk space issue, but I no longer think this is the case.
> I've attached the run output, does anyone know what might be the issue
> here?
> > Best wishes, > Jonas > > > > _________________________________ > Jonas Demeulemeester, PhD > Postdoctoral Researcher > The Francis Crick Institute > 1 Midland Road > London > NW1 1AT > > *T:* +44 (0)20 3796 2594 <+44%2020%203796%202594> > M: +44 (0)7482 070730 <+44%207482%20070730> > *E:* jonas.demeulemeester at crick.ac.uk > *W:* www.crick.ac.uk > > > > On 14 Mar 2017, at 14:32, Keiran Raine wrote: > > Hi, > > You would also only expect a minimal level of duplicates in a good test > sample, and likely quite a small number of readgroups. > > Keiran > > *From: *Jonas Demeulemeester > *Date: *Tuesday, 14 March 2017 at 13:49 > *To: *Miguel Vazquez > *Cc: *Junjun Zhang , Keiran Raine < > kr2 at sanger.ac.uk>, George Mihaiescu , " > docktesters at lists.icgc.org" > *Subject: *Re: [DOCKTESTERS] Thanks! > > Hi Miguel, > > I?ll have a go at modifying your scripts to do this kind of preprocessing. > > As to why alignment by lane level vs alignment of a single merged bam > would result in only 3% discrepancies, I can imagine that read lengths etc > may not be that different between the different libraries (for our tested > donors at least). > Please correct me if I?m wrong though! > > Best regards, > Jonas > > _________________________________ > Jonas Demeulemeester, PhD > Postdoctoral Researcher > The Francis Crick Institute > 1 Midland Road > London > NW1 1AT > > *T:* +44 (0)20 3796 2594 <+44%2020%203796%202594> > M: +44 (0)7482 070730 <+44%207482%20070730> > *E:* jonas.demeulemeester at crick.ac.uk > *W:* www.crick.ac.uk > > > > > On 14 Mar 2017, at 12:44, Miguel Vazquez wrote: > > > Hi Junjun and Keiran, > > I'm sorry guys, but his is too alien for me, this was never my area of > expertise. I'm going to need someone to write a script for me that takes a > BAM file and turns it into what ever I need to run BWA-Mem on. At least > pseudo-code or something that I can start with. 
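
A minimal sketch of the kind of starting point asked for here, under stated assumptions: the merged BAM has first been dumped to SAM text (for example with `samtools view -h`), and every record carries an `RG:Z:` tag. The function names `read_group_ids` and `split_by_read_group` are hypothetical and this is not the code in prepare_unaligned.sh; on real data the tools named in the thread (Picard, bamtofastq) would do the split on the BAM directly.

```python
from collections import defaultdict


def read_group_ids(header_lines):
    """Return the ID of every @RG line in a SAM header, in file order."""
    ids = []
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def split_by_read_group(sam_lines):
    """Group alignment records by their RG:Z: tag; header lines are skipped."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # header line, not a read
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at the 12th column (index 11) of a SAM record.
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        groups[rg].append(line)
    return dict(groups)


# Toy input: two read groups (lanes), three unaligned reads.
example = [
    "@HD\tVN:1.4",
    "@RG\tID:lane1\tPL:ILLUMINA",
    "@RG\tID:lane2\tPL:ILLUMINA",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:lane1",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\tRG:Z:lane2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\tRG:Z:lane1",
]

lanes = split_by_read_group(example)
for rg_id in read_group_ids(example):
    print(rg_id, len(lanes.get(rg_id, [])))
```

Each group would then be written out with a header containing only its own @RG line, giving one lane-level file per read group.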
>
> I think perhaps someone more knowledgeable than me should consider if this
> procedure as a whole is acceptable in terms of reproducibility, and how
> it would be best to document it or if it could possibly be improved.
> Also, I don't think I understand the nature of the problem, because from
> what I can fathom this problem should have either broken the process or
> rendered a much larger number of discrepancies than 3%. Can someone
> explain in layman's terms how only 3% of reads can be affected?
>
>
> Best regards
> Miguel
>
>
>
> On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang
> wrote:
>
> Hi Keiran,
>
> Thanks for the detailed explanation. So, in order to reproduce the PCAWG
> BWA MEM alignment result, one must use lane level BAMs (one lane one BAM)
> as input.
>
> A processing step is needed to prepare lane level BAMs from the merged BAM.
>
> @Miguel, hope this is helpful. Let us know if you have any other
> questions.
>
> Best regards
> Junjun
>
>
> On Mar 14, 2017, at 5:16 AM, Keiran Raine wrote:
>
> Hi Junjun,
>
> You won't be able to separate out the readgroups in the headers if the
> input is a merged BAM file. If there are different libraries, read
> lengths etc. it will cause problems for insert-size determination (used in
> determining proper-pairs) and result in inter-library duplicate removal (by
> definition reads from different libraries can't be duplicates).
>
> If you really need to do it this way you'd have to add a pre-processing
> step; bamtofastq can split a BAM into its component readgroups in a single
> pass.
>
> Regards,
>
> Keiran Raine
> Principal Bioinformatician
> Cancer Genome Project
> Wellcome Trust Sanger Institute
>
> kr2 at sanger.ac.uk
> Tel:+44 (0)1223 834244 Ext: 4983 <+44%201223%20834244>
> Office: H104
>
>
> On 13 Mar 2017, at 21:16, Junjun Zhang wrote:
>
> Hi Keiran,
>
> Can you please comment on this, i.e., the comparison between alignment
> done lane by lane vs. done with all lanes mixed?
>
> Basically, we are trying to prepare input BAMs for testing the PCAWG BWA
> MEM workflow. The starting point is the aligned BAM because we don't have
> the unaligned lane BAMs any more. The key point here is: should the input
> BAMs be organized by lanes, one lane one BAM? Or just one BAM containing
> all lanes?
>
> Thanks,
> Junjun
>
>
>
> *From: *Miguel Vazquez
> *Date: *Monday, March 13, 2017 at 2:31 PM
> *To: *Junjun Zhang
> *Cc: *George Mihaiescu, Jonas Demeulemeester,
> "docktesters at lists.icgc.org"
> *Subject: *Re: [DOCKTESTERS] Thanks!
>
>
> Hi Junjun
>
> About the unaligned BAM files, in fact I do have them for the two tests
> I've run. I could make them available for George but I think he could just
> as well produce them on site, since he might have to do that anyway. But we
> can always explore that option, though right now I don't know of a simple
> way to move these files around.
>
> About the number of lanes let me just say good grief! This is the first
> time I hear about it. So if I understand you correctly I need to:
>
> 1- Download the metadata for the BAM file
> 2- Determine the read_groups
> 3- Split the BAM file according to these read_groups
> 4- Unalign these BAM files and produce header files with different lanes
> 5- Run BWA-Mem
> 6- Compare collectively the reads from these BAM files with the original
> BAM
>
> Could you please confirm that this is the case? Is this consistent with
> the 3% mismatches? A similar percentage was found in the HCC1143; could
> this be the reason for that as well? Also I asked Keiran about these
> headers and he said they were OK. If you could please confirm that I need
> to do this extended process I'd be grateful, because it's quite involved
> and there are concepts here I'm not familiar with.
>
> Regards
>
> Miguel
>
>
> On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang
> wrote:
>
> Hi Miguel,
>
> I thought you kept the unaligned sequence you prepared for the testing.
> > Following your link about preparing unaligned input, I found this: > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_ > unaligned.sh#L16-L35, which actually could explain the high mismatch rate. > > When BWA MEM workflow runs, the alignments are done one lane level BAM at > a time, then merge the aligned BAM later: https://github.com/ > ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/ > main/java/com/github/seqware/WorkflowClient.java#L201 > > I see the script prepare_unaligned.sh always generates one read group > (i.e., lane) for normal or tumour, no matter how many read groups (lanes) > in the aligned BAMs. This has big impact on the alignment result when lanes > are aligned independently comparing aligned altogether. > > The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but > it only works when the input is *single lane BAM file*: > https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a. > k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a. > k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)Followthisifyoustartfromsingle > laneBAMfiles > > So, I think in order to perform testing alignment workflow properly, we > will need to prepare *lane level *unaligned BAM (one lane one BAM) as > inputs. For example, this aligned BAM: https://gtrepo-ebi. > annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432- > 4851-af67-30f4b4812c63, it has 7 read groups (search for read_group). It > needs to be converted to 7 individual lane level BAM files. > > Not sure whether it's the best way to do BAM splitting, but here is > someone's Python code to do it: https://gist.github.com/seandavi/2014542 > > Hope this helps, > Junjun > > > > *From: *Miguel Vazquez > *Date: *Monday, March 13, 2017 at 1:01 PM > *To: *George Mihaiescu > *Cc: *Jonas Demeulemeester , Junjun > Zhang , "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > *Subject: *Re: [DOCKTESTERS] Thanks! 
> > > Hi George, > > The analigned BAM files are not available as far as I know, rather you > must unalign the final BAM files, the normal ones you get from ICGC or > GNOS. This process is also in my scripts, as you see here: > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > bin/run_batch.sh#L32 > > About the steps in the workflows, I don't know them myself. I think you'll > need to ask the developers, and not all workflows use the same underlying > workflow enactment tool. Not an easy answer > > > > On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu < > George.Mihaiescu at oicr.on.ca> wrote: > > Junjun told me this would provide value to the testing process, so I would > like to kick off a test of the BWA_mem docker. > Can somebody provide some quick instructions and the location of the > unaligned BAM files that were used already? > > Also, do we have somewhere the steps involved in each workflow, so I can > get an idea of how far they are while running? > For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 > steps from finish? > > Thank you, > George > > *From: *Miguel Vazquez > *Date: *Monday, March 13, 2017 at 8:52 AM > > *To: *George Mihaiescu > *Cc: *Junjun Zhang , Jonas Demeulemeester < > Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > *Subject: *Re: [DOCKTESTERS] Thanks! > > > Hi George, > Answers inline > > On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu < > George.Mihaiescu at oicr.on.ca> wrote: > > Hi Miguel, > > I've started the test by running "bin/run_test.sh Sanger DO50398", so I > guess with just one workflow running it should complete faster than two > weeks. > > > I think it still should take a long time. My scripts will run one workflow > after another. > > > > Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" > script to use a docker container that has the icgc client inside and pull > data from Collaboratory. 
There is no "bam.bas" file downloaded, just a > ".bam" and a ".bam.bai" files, not sure if this is an issue. > > > > I wondered the same thing first time I did this, but this file is produced > by the pipeline. There was some problem with this that was dealt with by > the developers and updated in the docker. So I think you won't have a > problem > > > By looking at the "bin/compare_result_type.sh" it looks like it's using > the gnos client to pull down the existing VCF files for comparison reasons, > but I think we store those files in Collaboratory as well, so I'll work > with Junjun to adapt the script for this. > > > > Let me know if you need any help > > > I think I initially tried to run the DKFZ workflow, but it complained > about having to run Delly first, so I abandoned this for now. > > > Yes, if you look at the run_batch.sh you will see that when using DKFZ it > will always run Delly first. Delly prepares some files the the DKFZ file > needs, namely related to copy number I believe. > > > > I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. > > > Remember that you will need to add the relevant has-keys for the different > files in the etc/donor_files.csv. Its a bit tedious right now. You need to > go to the ICGC DCC and find these codes manually for the files you need. > Ask me if you need help. Once you have all you can run all the workflows > for that donor and evaluate results. > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > etc/donor_files.csv > > > > Regards > Miguel > > > > George > > *From: *Miguel Vazquez > *Date: *Monday, March 13, 2017 at 6:53 AM > *To: *George Mihaiescu > *Cc: *Junjun Zhang , Jonas Demeulemeester < > Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > *Subject: *Re: [DOCKTESTERS] Thanks! > > > Hi George, > > The Sanger workflow is very lengthy, it takes about two weeks in my tests. 
> > > About correctness, my scripts also cover that part, if you are not using > them they might still help you to clarify how we do it. The idea is to take > each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both > germline and somatic and compare it with the result uploaded to GNOS (not > all pipelines produce all files). This is the relevant part in the > run_batch.sh script: > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > bin/run_batch.sh#L42-L46 > The bin/compare_result_type.sh script will take care of downloading the > correct file from GNOS and running the comparison. The comparison itself is > simple since all files are VCFs, it consists in taking out the variants in > terms of chromosome, position, reference and alternative allele and > measuring the overlaps. > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > bin/compare_result_type.sh > > About which donors to test, DO52140 is one Jonas and I have both tested > and could be interesting to get a third opinion. Also, any other donor > could be interesting to see if something new comes up. I'm not sure which > options is best. > > Miguel > > > > > On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu < > George.Mihaiescu at oicr.on.ca> wrote: > > Hi, > > I've started Sanger on DO50398 and it's been running for more than 24 > hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" > > I just started a second run on a different VM on same donor, just to > compare run times. > The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some > monitoring graphs when it finishes the workflow, but I have no idea how to > check its correctness. > > Give me a list of donors and what workflows you want me to run and I'll > try to schedule them tomorrow. 
> > George > > > *From: *Junjun Zhang > *Date: *Sunday, March 12, 2017 at 10:45 PM > *To: *Jonas Demeulemeester , George > Mihaiescu > *Cc: *Miguel Vazquez , Denis Yuen < > Denis.Yuen at oicr.on.ca>, "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > *Subject: *Re: [DOCKTESTERS] Thanks! > > Thanks Miguel and Jonas for your help here! > > Do you have any update on the latest testing? Please feel free updating > the wiki with any update: https://wiki.oicr.on. > ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference > > Regards, > Junjun > > > > *From: *Jonas Demeulemeester > *Date: *Saturday, March 11, 2017 at 7:15 PM > *To: *George Mihaiescu > *Cc: *Miguel Vazquez , Junjun Zhang < > junjun.zhang at oicr.on.ca>, Denis Yuen , " > docktesters at lists.icgc.org" > *Subject: *Re: [DOCKTESTERS] Thanks! > > > Hi George, > > Yup, I've been running the PCAWG dockers mainly using Miguel's set of > scripts. > Give them a go and if you run into issues, just let us know! > > Cheers, > Jonas > > > > On 11 Mar 2017, at 17:00, George Mihaiescu > wrote: > > Sure, I'll give it a try and report later. > > Thank you, > > *George Mihaiescu* > Senior Cloud Architect > > *Ontario Institute for Cancer Research* > MaRS Centre > 661 University Avenue > Suite 510 > Toronto, Ontario > Canada M5G 0A3 > > Email: George.Mihaiescu at oicr.on.ca > Toll-free: 1-866-678-6427 > Twitter: @OICR_news > > www.oicr.on.ca > > This message and any attachments may contain confidential and/or > privileged information for the sole use of the intended recipient. Any > review or distribution by anyone other than the person for whom it was > originally intended is strictly prohibited. If you have received this > message in error, please contact the sender and delete all copies. > Opinions, conclusions or other information contained in this message may > not be that of the organization. 
> > > *From: *Miguel Vazquez > *Date: *Saturday, March 11, 2017 at 10:57 AM > *To: *Junjun Zhang > *Cc: *Denis Yuen , Jonas Demeulemeester < > jonas.demeulemeester at crick.ac.uk>, George Mihaiescu < > George.Mihaiescu at oicr.on.ca>, "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > *Subject: *Re: [DOCKTESTERS] Thanks! > > > Hi Junjun, > > I think Jonas has been using my scripts to run some of the tests, maybe > George could try them as well, it should be very easy for him to try the > Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter. > > https://github.com/mikisvaz/PCAWG-Docker-Test > > He would just need to update the tokens for DACO access and the scripts > will take care of downloading the BAM files, running the workflows and > evaluating the result. > > The documentation there is reasonably updated, but if this sounds good > then perhaps he could contact me and I could walk him through the details. > > Best regards > Miguel > > On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang > wrote: > > Dear Docktesters, > > George Mihaiescu, cloud architect, of the Collaboratory at OICR plans to > run some bioinformatics workflows to test Collab environment. > > Just thought this is a good opportunity to use as extra help for testing > out the PCAWG dockerized workflows. > > Miguel, Denis and others, what workflows / datasets do you think would be > good for George to run? > > Thanks, > Junjun > > > > *From:* on > behalf of Denis Yuen > *Date: *Wednesday, March 1, 2017 at 10:26 AM > *To: *"docktesters at lists.icgc.org" > *Subject: *[DOCKTESTERS] Thanks! > > > > Hi, > > Just wanted to say thanks to Miguel and Jonas for keeping the workflow > testing data page up-to-date. > > https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data > > As we work on new versions or debugging, it is invaluable to know what > versions of the workflows have worked outside OICR, thanks! 
> > > *Denis Yuen* > Senior Software Developer > > *OntarioInstituteforCancerResearch* > MaRSCentre > 661 University Avenue > Suite510 > Toronto, Ontario,Canada M5G0A3 > Toll-free: 1-866-678-6427 > Twitter: @OICR_news > *www.oicr.on.ca * > This message and any attachments may contain confidential and/or > privileged information for the sole use of the intended recipient. Any > review or distribution by anyone other than the person for whom it was > originally intended is strictly prohibited. If you have received this > message in error, please contact the sender and delete all copies. > Opinions, conclusions or other information contained in this message may > not be that of the organization. > > > _______________________________________________ > docktesters mailing list > docktesters at lists.icgc.org > https://lists.icgc.org/mailman/listinfo/docktesters > > > > *The Francis Crick Institute Limited is a registered charity in England > and Wales no. 1140062 and a company registered in England and Wales no. > 06885462, with its registered office at 1 Midland Road London NW1 1AT* > > > _______________________________________________ > docktesters mailing list > docktesters at lists.icgc.org > https://lists.icgc.org/mailman/listinfo/docktesters > > > > > > > > > > > > > > > > *The Francis Crick Institute Limited is a registered charity in England > and Wales no. 1140062 and a company registered in England and Wales no. > 06885462, with its registered office at 1 Midland Road London NW1 1AT* > -- The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a company > registered in England with number 2742969, whose registered office is 215 > Euston Road, London, NW1 2BE. > > > The Francis Crick Institute Limited is a registered charity in England and > Wales no. 1140062 and a company registered in England and Wales no. 
> 06885462, with its registered office at 1 Midland Road London NW1 1AT

From Junjun.Zhang at oicr.on.ca Sun Mar 19 19:17:59 2017
From: Junjun.Zhang at oicr.on.ca (Junjun Zhang)
Date: Sun, 19 Mar 2017 23:17:59 +0000
Subject: [DOCKTESTERS] Important correction: DKFZ BiasFilter large missmatch on DO52140 and DO35937
In-Reply-To:
References:
Message-ID:

Thanks Miguel for getting the new test done and for the detailed summary.

Matthias, can you please comment?

Junjun

From: > on behalf of Miguel Vazquez >
Date: Wednesday, March 15, 2017 at 6:09 AM
To: Christina Yung >
Cc: Lincoln Stein >, Francis Ouellette >, "docktesters at lists.icgc.org" >, "Schlesner, Matthias" >
Subject: [DOCKTESTERS] Important correction: DKFZ BiasFilter large missmatch on DO52140 and DO35937

Dear all,

As you can read below, I made a mistake in my previous validation of the DKFZ BiasFilter. Unfortunately, large differences have turned up now that I've corrected the process.

In brief, on both donors I've found that re-running the filter flags additional variants for both the bPcr and bSeq flags. Notably, all the discrepancies come from the new run flagging more variants. For instance, in many cases the original file contains just one variant with the flag where the new one contains ten or twenty. You can read the details at the end of this email, where the original VCF is compared to the new one.

Note that the original VCF, containing the consensus variants, is the input I use for the BiasFilter, along with the corresponding BAM files for that donor. I can only imagine that if this VCF was not the one originally used, due to some filtering step, then the bias calculations might have been affected. If that is so, I would need instructions on where to get the precise input VCFs.
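
A minimal sketch of the per-flag comparison described in this message, not the actual test script. It assumes the bPcr and bSeq flags appear as semicolon-separated entries in the VCF FILTER column (an assumption, not confirmed in the thread) and that the VCF text has already been read into a list of lines. The helper names `flagged_variants` and `compare_flag` are hypothetical.

```python
def flagged_variants(vcf_lines, flag):
    """Return chrom:pos:ref:alt keys of records whose FILTER column contains flag."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the column header and blank lines
        chrom, pos, _id, ref, alt, _qual, filt = line.rstrip("\n").split("\t")[:7]
        if flag in filt.split(";"):
            keys.add(f"{chrom}:{pos}:{ref}:{alt}")
    return keys


def compare_flag(original_lines, rerun_lines, flag):
    """Count flagged variants shared by both files, only in the re-run, or only in the original."""
    original = flagged_variants(original_lines, flag)
    rerun = flagged_variants(rerun_lines, flag)
    return {
        "common": len(original & rerun),
        "extra": len(rerun - original),
        "missing": len(original - rerun),
    }


# Toy records (made up for illustration, not taken from the donors above).
original_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tT\t.\tbPcr\t.",
    "2\t200\t.\tC\tA\t.\tPASS\t.",
]
rerun_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tT\t.\tbPcr\t.",
    "2\t200\t.\tC\tA\t.\tbPcr;bSeq\t.",
]

print(compare_flag(original_vcf, rerun_vcf, "bPcr"))
```

With this definition, "extra" counts variants flagged only in the re-run and "missing" counts variants flagged only in the original file.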
Best regards

Miguel

----RESULTS----

Comparison for DO52140 tag bPcr
---
Common: 1
Extra: 12
    - Example: 11:81550771:C:A,12:19486241:G:T,2:12287406:G:T
Missing: 0

Comparison for DO52140 tag bSeq
---
Common: 1
Extra: 23
    - Example: 10:17681457:G:T,12:112049882:T:G,12:130990011:T:A
Missing: 0

Comparison for DO35937 tag bPcr
---
Common: 1
Extra: 10
    - Example: 1:114845662:G:T,14:33282600:C:A,16:78467879:G:T
Missing: 0

Comparison for DO35937 tag bSeq
---
Common: 6
Extra: 88
    - Example: 10:21703903:A:C,10:24183103:C:T,10:51468498:C:G
Missing: 0

On Mon, Mar 13, 2017 at 4:48 PM, Christina Yung > wrote:

Hi Miguel,

The bPCR and bSeq flags are indeed the ones flagged by the DKFZ bias filter. When you summarize the comparison, please cc Matthias of DKFZ as his team developed this filter.

No issue at all, and thanks again for your great work!

Christina

On 3/13/2017 10:22 AM, Miguel Vazquez wrote:

Dear all,

I just learnt that the DKFZ BiasFilter is NOT the OXOG filter workflow, which means I checked for the wrong thing in this validation! I'm sorry for the confusion.

Right now I pass the BAM files and the consensus.vcf (SNV_MNV) downloaded from GNOS to the BiasFilter and compare the resulting VCF with the consensus, looking at the set of mutations containing the OXOGFAIL flag. This apparently is not the comparison to make.

What is it that I need to compare? Is it the bPcr and bSeq flags? A first look at those flags unfortunately does show quite a few discrepancies on both donors (DO52140 and DO35937) for both flags. For instance, for DO35937 we find 11 mutations flagged bPcr in the new result, while the consensus.vcf only has one of them. Something similar happens with bSeq.

Can you please confirm this so I can reply with a full report on this?

Kind regards, and sorry again for the confusion.

Miguel

On Mon, Feb 27, 2017 at 7:30 PM, Miguel Vazquez > wrote:

Dear friends,

I've performed the first test with the DKFZ BiasFilter and got a perfect match.
There are 55 variants annotated with OXOGFAIL and they are the same in the input VCF file (consensus SNV/MNV VCF for that donor) and the output of the BiasFilter. I'm running the test on a second donor. Best regards Miguel _______________________________________________ docktesters mailing list docktesters at lists.icgc.orghttps://lists.icgc.org/mailman/listinfo/docktesters _______________________________________________ docktesters mailing list docktesters at lists.icgc.org https://lists.icgc.org/mailman/listinfo/docktesters -------------- next part -------------- An HTML attachment was scrubbed... URL: From George.Mihaiescu at oicr.on.ca Mon Mar 20 13:18:59 2017 From: George.Mihaiescu at oicr.on.ca (George Mihaiescu) Date: Mon, 20 Mar 2017 17:18:59 +0000 Subject: [DOCKTESTERS] Thanks! In-Reply-To: Message-ID: Hi, How do I run the DKFZ workflow? I first ran the DELLY which ended with the following output: Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz [##################################################] 100% Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz After that, I tried to run the DKFZ but it errors as below: root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695 Running: cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0 quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json WARNING: You're currently running as root; probably by accident. 
Press control-C to abort or Enter to continue as root. Set DOCKSTORE_ROOT to disable this warning. Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2 Provisioning your input files to your local machine Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65 Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246 Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) at 
java.nio.file.Files.createLink(Files.java:1086) at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273) at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726) at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688) at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659) at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586) at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185) at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951) at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935) at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247) at io.dockstore.client.cli.Client.run(Client.java:704) at io.dockstore.client.cli.Client.main(Client.java:796) java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet. Thank you, George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! 
Hi George, Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote:

Hi Miguel, I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it should still take a long time. My scripts will run one workflow after another.

Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just ".bam" and ".bam.bai" files; not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker, so I think you won't have a problem.

By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number I believe.

I'll set up a new VM and run "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now: you need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.
https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards Miguel

George

From: Miguel Vazquez > Date: Monday, March 13, 2017 at 6:53 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks!

Hi George, The Sanger workflow is very lengthy; it takes about two weeks in my tests. About correctness, my scripts also cover that part; even if you are not using them, they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting each variant as chromosome, position, reference allele and alternative allele, and measuring the overlap between the two sets. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one that Jonas and I have both tested, so it could be interesting to get a third opinion. Any other donor could also be interesting, to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote:

Hi, I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59". I just started a second run on a different VM on the same donor, just to compare run times. The VM used has 8 cores, 48 GB of RAM and 1.1 TB of disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.
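The comparison Miguel describes above (reduce each VCF to chromosome, position, reference and alternative allele, then measure the overlap) can be sketched roughly as follows. This is only an illustrative Python sketch, not the code in bin/compare_result_type.sh: the function names and the inline sample records are made up for the example, and splitting multi-allelic ALT fields on commas is an assumption.

```python
import gzip
import io


def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) tuples from VCF text lines."""
    keys = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # assumption: one key per ALT allele
            keys.add((chrom, pos, ref, alt))
    return keys


def read_vcf(path):
    """Read keys from a plain or gzip-compressed VCF file (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return variant_keys(handle)


def overlap(test_keys, truth_keys):
    """Summarise how two variant sets overlap."""
    return {
        "common": len(test_keys & truth_keys),
        "only_test": len(test_keys - truth_keys),
        "only_truth": len(truth_keys - test_keys),
    }


# Made-up records standing in for a new run and the previously uploaded result.
new_run = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tT\n"
    "1\t200\t.\tG\tC,T\n"
    "2\t50\t.\tC\tG\n"
)
reference = (
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tT\n"
    "1\t200\t.\tG\tC\n"
    "3\t70\t.\tT\tA\n"
)

summary = overlap(variant_keys(io.StringIO(new_run)),
                  variant_keys(io.StringIO(reference)))
print(summary)
```

For real output files, `read_vcf` would be called on the two paths instead of the inline strings. Using sets keeps the comparison independent of record order in the two files.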
Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow. George

From: Junjun Zhang > Date: Sunday, March 12, 2017 at 10:45 PM To: Jonas Demeulemeester >, George Mihaiescu > Cc: Miguel Vazquez >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free to update the wiki with any news: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference Regards, Junjun

From: Jonas Demeulemeester > Date: Saturday, March 11, 2017 at 7:15 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks!

Hi George, Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know! Cheers, Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu > wrote:

Sure, I'll give it a try and report later. Thank you, George Mihaiescu Senior Cloud Architect Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 510 Toronto, Ontario Canada M5G 0A3 Email: George.Mihaiescu at oicr.on.ca Toll-free: 1-866-678-6427 Twitter: @OICR_news www.oicr.on.ca

From: Miguel Vazquez > Date: Saturday, March 11, 2017 at 10:57 AM To: Junjun Zhang > Cc: Denis Yuen >, Jonas Demeulemeester >, George Mihaiescu >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks!
Hi Junjun, I think Jonas has been using my scripts to run some of the tests, maybe George could try them as well, it should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter. https://github.com/mikisvaz/PCAWG-Docker-Test He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result. The documentation there is reasonably updated, but if this sounds good then perhaps he could contact me and I could walk him through the details. Best regards Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang > wrote:

Dear Docktesters, George Mihaiescu, cloud architect, of the Collaboratory at OICR plans to run some bioinformatics workflows to test Collab environment. Just thought this is a good opportunity to use as extra help for testing out the PCAWG dockerized workflows. Miguel, Denis and others, what workflows / datasets do you think would be good for George to run? Thanks, Junjun

From: > on behalf of Denis Yuen > Date: Wednesday, March 1, 2017 at 10:26 AM To: "docktesters at lists.icgc.org" > Subject: [DOCKTESTERS] Thanks!

Hi, Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date. https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks! Denis Yuen Senior Software Developer Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 510 Toronto, Ontario, Canada M5G 0A3 Toll-free: 1-866-678-6427 Twitter: @OICR_news www.oicr.on.ca

_______________________________________________ docktesters mailing list docktesters at lists.icgc.org https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

From Jonas.Demeulemeester at crick.ac.uk Mon Mar 20 16:13:43 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Mon, 20 Mar 2017 20:13:43 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: References: , Message-ID:

Hi George, Do you have the DKFZ workflow dependencies tarball in place (and named correctly)?
That's the file it's clearly not finding:

17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

You can find the link to this reference tarball on the DKFZ pipeline github page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows)

Hope this helps, Jonas

On 20 Mar 2017, at 17:19, George Mihaiescu > wrote:

Hi, How do I run the DKFZ workflow? I first ran the DELLY which ended with the following output:

Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz [##################################################] 100%
Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz

After that, I tried to run the DKFZ but it errors as below:

root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695
Running: cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0 quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json
WARNING: You're currently running as root; probably by accident.

From George.Mihaiescu at oicr.on.ca Mon Mar 20 21:19:07 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Tue, 21 Mar 2017 01:19:07 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: Message-ID:

Hi Jonas, It seems that I forgot to run this script https://github.com/gmihaiescu/PCAWG-Docker-Test/blob/master/bin/get_dkfz_resources.sh. I also have to get a new GNOS token which will take a while as my DACO expired apparently. Thank you, George

From: Jonas Demeulemeester > Date: Monday, March 20, 2017 at 3:13 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks!

Hi George, Do you have the DKFZ workflow dependencies tarball in place (and named correctly)?
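Whether the tarball is in place can be checked before a long run is launched. The failure in this thread comes down to one local input file being absent when the launcher tries to link it into the run directory, so a small pre-flight check along the following lines might catch it early. It is only a sketch in Python: the `missing_inputs` helper and the flat name-to-path dictionary are assumptions for illustration, not the actual layout of the Dockstore.json files used by the test scripts.

```python
import os
import tempfile


def missing_inputs(inputs):
    """Return the (name, path) pairs whose local file does not exist."""
    return [(name, path) for name, path in sorted(inputs.items())
            if not os.path.isfile(path)]


# Demonstration in a temporary directory, so nothing depends on a real checkout.
with tempfile.TemporaryDirectory() as root:
    bam = os.path.join(root, "normal.bam")
    open(bam, "w").close()  # stand-in for a BAM that was already downloaded
    tarball = os.path.join(root, "dkfz-workflow-dependencies_150318_0951.tar.gz")

    problems = missing_inputs({"normal-bam": bam, "reference-gz": tarball})
    for name, path in problems:
        print(f"missing input {name}: {path}")

    # Hard-linking a source that does not exist fails, which is comparable to
    # the Files.createLink call near the top of the stack trace.
    try:
        os.link(tarball, os.path.join(root, "linked.tar.gz"))
        link_error = None
    except FileNotFoundError as err:
        link_error = err
```

In this demonstration only the `reference-gz` entry is reported, because the tarball was never created. In a real setup the dictionary would be filled from whatever input paths the run is configured with.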
That's the file it's clearly not finding: 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz You can find the link to this reference tarball on the DKFZ pipeline github page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows) Hope this helps, Jonas On 20 Mar 2017, at 17:19, George Mihaiescu > wrote: Hi, How do I run the DKFZ workflow? I first ran the DELLY which ended with the following output: Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz [##################################################] 100% Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz After that, I tried to run the DKFZ but it errors as below: root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695 Running: cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json WARNING: You're currently running as root; probably by 
accident. Press control-C to abort or Enter to continue as root. Set DOCKSTORE_ROOT to disable this warning. Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2 Provisioning your input files to your local machine Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65 Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246 Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) at 
java.nio.file.Files.createLink(Files.java:1086) at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273) at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726) at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688) at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659) at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586) at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185) at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951) at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935) at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247) at io.dockstore.client.cli.Client.run(Client.java:704) at io.dockstore.client.cli.Client.main(Client.java:796) java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet. Thank you, George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! 
Hi George, Answers inline On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote: Hi Miguel, I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks. I think it still should take a long time. My scripts will run one workflow after another. Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue. I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this. Let me know if you need any help I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now. Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the DKFZ file needs, namely related to copy number I believe. I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have all you can run all the workflows for that donor and evaluate results. 
https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv Regards Miguel George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 6:53 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The Sanger workflow is very lengthy, it takes about two weeks in my tests. About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46 The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best. Miguel On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote: Hi, I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" I just started a second run on a different VM on same donor, just to compare run times. The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness. 
Give me a list of donors and the workflows you want me to run, and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free to update the wiki with any news:

https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

From: Miguel Vazquez
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang
Cc: Denis Yuen, Jonas Demeulemeester, George Mihaiescu, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and BiasFilter workflows.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access, and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result. The documentation there is reasonably up to date, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards
Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang wrote:

Dear Docktesters,

George Mihaiescu, cloud architect of the Collaboratory at OICR, plans to run some bioinformatics workflows to test the Collab environment. I thought this would be a good opportunity to get extra help testing the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?

Thanks,
Junjun

From: on behalf of Denis Yuen
Date: Wednesday, March 1, 2017 at 10:26 AM
To: "docktesters at lists.icgc.org"
Subject: [DOCKTESTERS] Thanks!

Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date.

https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data

As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks!

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario, Canada M5G 0A3
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited.
If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

From George.Mihaiescu at oicr.on.ca Wed Mar 22 08:56:46 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Wed, 22 Mar 2017 12:56:46 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
Message-ID:

I received a new GNOS token and installed the gtdownload client (not easy because I'm running on an Ubuntu 16.04 VM).
Now, when I run the "bin/get_dkfz_resources.sh" script, it stays at zero:

Status: 0 bytes downloaded (0.000% complete) current rate: /s
Child 1 downloading ( )
Child 2 downloading ( )
Child 3 downloading ( )
Child 4 downloading ( )
Child 5 downloading ( )
Child 6 downloading ( )
Child 7 downloading ( )
Child 8 downloading ( )
Status: 0 bytes downloaded (0.000% complete) current rate: /s
Child 1 downloading ( )
Child 2 downloading ( )
Child 3 downloading ( )
Child 4 downloading ( )
Child 5 downloading ( )
Child 6 downloading ( )
Child 7 downloading ( )
Child 8 downloading ( )
Status: 0 bytes downloaded (0.000% complete) current rate: /s

Is there another way I can download that file?

Also, I saw on https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data that you already ran the Sanger workflow against DO218695; do you remember how long it took? I couldn't find the original run time for that donor looking through GitHub (https://github.com/ICGC-TCGA-PanCancer/ceph_transfer_ops). The initial VM running this donor has been running for more than 10 days, and I don't remember Sanger taking so long.

Thank you,
George

From: Jonas Demeulemeester
Date: Monday, March 20, 2017 at 3:13 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Do you have the DKFZ workflow dependencies tarball in place (and named correctly)?
That's the file it's clearly not finding:

17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

You can find the link to this reference tarball on the DKFZ pipeline github page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows)

Hope this helps,
Jonas

On 20 Mar 2017, at 17:19, George Mihaiescu wrote:

Hi,

How do I run the DKFZ workflow?

I first ran the DELLY which ended with the following output:

Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz
[##################################################] 100%
Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz

After that, I tried to run the DKFZ but it errors as below:

root@dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695
Running: cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json
WARNING: You're currently running as root; probably by accident. Press control-C to abort or Enter to continue as root. Set DOCKSTORE_ROOT to disable this warning.
Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2
Provisioning your input files to your local machine
Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65
Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246
Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f
17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
    at java.nio.file.Files.createLink(Files.java:1086)
    at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273)
    at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726)
    at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688)
    at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659)
    at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586)
    at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185)
    at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935)
    at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247)
    at io.dockstore.client.cli.Client.run(Client.java:704)
    at io.dockstore.client.cli.Client.main(Client.java:796)
java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet.

Thank you,
George
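
The NoSuchFileException reported in this thread comes down to a resource file that is not present in the resources directory when the launcher tries to link it. A small pre-flight check along the following lines could catch that before a launch. This is a hypothetical helper written for illustration, not part of PCAWG-Docker-Test or Dockstore; the function name `missing_resources` is invented here, and the directory and file name are simply the ones that appear in the log.

```python
from pathlib import Path
from typing import Iterable, List


def missing_resources(resource_dir: str, names: Iterable[str]) -> List[str]:
    """Return the names that are absent, not regular files, or empty in resource_dir."""
    base = Path(resource_dir)
    missing = []
    for name in names:
        candidate = base / name
        # An empty file usually means an interrupted or stalled download.
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path and file name taken from the log in this thread.
    required = ["dkfz-workflow-dependencies_150318_0951.tar.gz"]
    absent = missing_resources("/root/PCAWG-Docker-Test/resources", required)
    if absent:
        print("Missing or empty resource files:", ", ".join(absent))
    else:
        print("All resource files are in place.")
```

Checking the size as well as the existence is a deliberate choice here, since a download that stays at 0 bytes can still leave an empty file behind.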
_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

From Denis.Yuen at oicr.on.ca Wed Mar 22 10:00:10 2017
From: Denis.Yuen at oicr.on.ca (Denis Yuen)
Date: Wed, 22 Mar 2017 14:00:10 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID: <9f143fc3109a4a9c9db43821a53e20c0@oicr.on.ca>

Hi,

I have a local copy of the file on my desktop. It's a bit ironic, but if GNOS is currently down, we could set up a local transfer or use a secure OICR USB key. The file is 22 GB in size.

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario, Canada M5G 0A3
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited.
If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. ________________________________ From: docktesters-bounces+denis.yuen=oicr.on.ca at lists.icgc.org on behalf of George Mihaiescu Sent: March 22, 2017 8:56:46 AM To: Jonas Demeulemeester Cc: docktesters at lists.icgc.org Subject: Re: [DOCKTESTERS] Thanks! I received a new GNOS token and installed the gtdownload client (not easy because I'm running on a Ubuntu 16.04 VM). Now, when I run the "bin/get_dkfz_resources.sh" script, it stays at zero: Status: 0 bytes downloaded (0.000% complete) current rate: /s Child 1 downloading ( ) Child 2 downloading ( ) Child 3 downloading ( ) Child 4 downloading ( ) Child 5 downloading ( ) Child 6 downloading ( ) Child 7 downloading ( ) Child 8 downloading ( ) Status: 0 bytes downloaded (0.000% complete) current rate: /s Child 1 downloading ( ) Child 2 downloading ( ) Child 3 downloading ( ) Child 4 downloading ( ) Child 5 downloading ( ) Child 6 downloading ( ) Child 7 downloading ( ) Child 8 downloading ( ) Status: 0 bytes downloaded (0.000% complete) current rate: /s Is there another way I can download that file? Also, I saw on https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data that you already ran the Sanger workflow against D0218695, do you remember how long it took? I couldn't find the original run time for that donor looking through github (https://github.com/ICGC-TCGA-PanCancer/ceph_transfer_ops). The initial VM running this donor has been running for more than 10 days, and I don't remember Sanger taking so long. Thank you, George From: Jonas Demeulemeester > Date: Monday, March 20, 2017 at 3:13 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Do you have the DKFZ workflow dependencies tarball in place (and named correctly)? 
That's the file it's clearly not finding: 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz You can find the link to this reference tarball on the DKFZ pipeline github page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows) Hope this helps, Jonas On 20 Mar 2017, at 17:19, George Mihaiescu > wrote: Hi, How do I run the DKFZ workflow? I first ran the DELLY which ended with the following output: Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz [##################################################] 100% Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz After that, I tried to run the DKFZ but it errors as below: root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695 Running: cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json WARNING: You're currently running as root; probably by 
accident. Press control-C to abort or Enter to continue as root. Set DOCKSTORE_ROOT to disable this warning. Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2 Provisioning your input files to your local machine Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65 Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246 Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476) at 
java.nio.file.Files.createLink(Files.java:1086) at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273) at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726) at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688) at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659) at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586) at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185) at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968) at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951) at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935) at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247) at io.dockstore.client.cli.Client.run(Client.java:704) at io.dockstore.client.cli.Client.main(Client.java:796) java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet. Thank you, George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! 
Hi George, Answers inline On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu > wrote: Hi Miguel, I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks. I think it still should take a long time. My scripts will run one workflow after another. Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue. I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this. Let me know if you need any help I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now. Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the DKFZ file needs, namely related to copy number I believe. I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have all you can run all the workflows for that donor and evaluate results. 
https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv Regards Miguel George From: Miguel Vazquez > Date: Monday, March 13, 2017 at 6:53 AM To: George Mihaiescu > Cc: Junjun Zhang >, Jonas Demeulemeester >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, The Sanger workflow is very lengthy, it takes about two weeks in my tests. About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46 The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps. https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best. Miguel On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu > wrote: Hi, I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" I just started a second run on a different VM on same donor, just to compare run times. The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness. 
Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow. George From: Junjun Zhang > Date: Sunday, March 12, 2017 at 10:45 PM To: Jonas Demeulemeester >, George Mihaiescu > Cc: Miguel Vazquez >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free updating the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference Regards, Junjun From: Jonas Demeulemeester > Date: Saturday, March 11, 2017 at 7:15 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, Denis Yuen >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know! Cheers, Jonas On 11 Mar 2017, at 17:00, George Mihaiescu > wrote: Sure, I'll give it a try and report later. Thank you, George Mihaiescu Senior Cloud Architect Ontario Institute for Cancer Research MaRS Centre 661 University Avenue Suite 510 Toronto, Ontario Canada M5G 0A3 Email: George.Mihaiescu at oicr.on.ca Toll-free: 1-866-678-6427 Twitter: @OICR_news www.oicr.on.ca This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. From: Miguel Vazquez > Date: Saturday, March 11, 2017 at 10:57 AM To: Junjun Zhang > Cc: Denis Yuen >, Jonas Demeulemeester >, George Mihaiescu >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! 
Hi Junjun,

I think Jonas has been using my scripts to run some of the tests; maybe George could try them as well. It should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result. The documentation there is reasonably updated, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards

Miguel

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Jonas.Demeulemeester at crick.ac.uk Wed Mar 22 13:56:32 2017
From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester)
Date: Wed, 22 Mar 2017 17:56:32 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk>
Message-ID: <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk>

Hi all,

A brief update on the BWA-Mem docker tests.
I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately.
Doing the comparison however, I am still getting 3% of reads that are aligned differently (see below for a few examples).
However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare.
I'm hoping the mapping differences can be ascribed to this.
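
A read-level comparison of this kind can be sketched as follows. This is a minimal illustration and not the script used for these tests: it assumes both alignments have been exported as SAM text (for example with `samtools view`), and the function names `placements` and `fraction_moved` are hypothetical. Reads are matched on name plus mate number, so a pair whose flags differ between the two runs is still compared mate by mate. The sample rows are two of the read pairs listed in this message, reduced to name, flag, chromosome and position.

```python
from typing import Dict, Iterable, Tuple

ReadKey = Tuple[str, int]      # (read name, 1 or 2 for first/second in pair)
Placement = Tuple[str, int]    # (reference name, 1-based position)


def placements(sam_lines: Iterable[str]) -> Dict[ReadKey, Placement]:
    """Map every primary alignment to the place where it was aligned."""
    result: Dict[ReadKey, Placement] = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        name, flag, chrom, pos = line.rstrip("\n").split("\t")[:4]
        bits = int(flag)
        if bits & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        mate = 1 if bits & 0x40 else 2 if bits & 0x80 else 0
        result[(name, mate)] = (chrom, int(pos))
    return result


def fraction_moved(original: Dict[ReadKey, Placement],
                   new: Dict[ReadKey, Placement]) -> float:
    """Fraction of reads present in both runs whose chromosome or position differs."""
    shared = original.keys() & new.keys()
    if not shared:
        return 0.0
    moved = sum(1 for key in shared if original[key] != new[key])
    return moved / len(shared)


# Two read pairs from the listing in this message (name, flag, chr, pos only).
NEW = [
    "HS2000-1012_275:7:1101:17411:15403\t99\t3\t112743126",
    "HS2000-1012_275:7:1101:17411:15403\t147\t3\t112743376",
    "HS2000-1012_275:7:1101:11883:83640\t99\t16\t28672999",
    "HS2000-1012_275:7:1101:11883:83640\t147\t16\t28673223",
]
ORIGINAL = [
    "HS2000-1012_275:7:1101:17411:15403\t99\t8\t54944243",
    "HS2000-1012_275:7:1101:17411:15403\t147\t8\t54944493",
    "HS2000-1012_275:7:1101:11883:83640\t163\t16\t28464362",
    "HS2000-1012_275:7:1101:11883:83640\t83\t16\t28464586",
]

share = fraction_moved(placements(ORIGINAL), placements(NEW))
print(f"{share:.0%} of the shared reads are placed differently")
```

On whole BAM files the same idea applies, but holding every read in a dictionary needs a lot of memory, so in practice one would sort both files by read name and stream through them in parallel.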

Is there a list available somewhere detailing which samples were mapped using which versions?
That way we could select a relevant test sample without having to sort through the headers of all different bams.

Best wishes,
Jonas

newly aligned:

ID                                  flag  chr         pos
HS2000-1012_275:7:1101:17411:15403  99    3           112743126
HS2000-1012_275:7:1101:17411:15403  147   3           112743376
HS2000-1012_275:7:1101:11883:83640  99    16          28672999
HS2000-1012_275:7:1101:11883:83640  147   16          28673223
HS2000-1012_275:7:1101:16576:28476  163   GL000238.1  21309
HS2000-1012_275:7:1101:16576:28476  83    GL000238.1  21664

vs the original:

ID                                  flag  chr         pos
HS2000-1012_275:7:1101:17411:15403  99    8           54944243
HS2000-1012_275:7:1101:17411:15403  147   8           54944493
HS2000-1012_275:7:1101:11883:83640  163   16          28464362
HS2000-1012_275:7:1101:11883:83640  83    16          28464586
HS2000-1012_275:7:1101:16576:28476  99    12          6124549
HS2000-1012_275:7:1101:16576:28476  147   12          6124903

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London
NW1 1AT

T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk
W: www.crick.ac.uk

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new.header.normal.bam.txt
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: orig.header.normal.bam.txt
URL:

From mikisvaz at gmail.com Wed Mar 22 14:08:11 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 22 Mar 2017 19:08:11 +0100
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To: <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk>
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk> <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk>
Message-ID:

Thanks Jonas for this information.

I hope that someone here can suggest what to try next. Perhaps the version issue that Jonas points out is the key.

I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches.

Best regards

Miguel

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From George.Mihaiescu at oicr.on.ca Wed Mar 22 16:18:17 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Wed, 22 Mar 2017 20:18:17 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
Message-ID:

I finished one of the dockerized Sanger tests and upon verification there were just a few differences, but I'm not sure if they are normal or not.

Results:

root@dockstore-test3:~/PCAWG-Docker-Test# bin/compare_result.sh Sanger DO50398
var/spool/cwl/0/caveman/
var/spool/cwl/0/caveman/splitList
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz
var/spool/cwl/0/caveman/alg_bean
var/spool/cwl/0/caveman/prob_arr
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz.tbi
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.no_analysis.bed
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz.tbi
var/spool/cwl/0/caveman/cov_arr
var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz.tbi
var/spool/cwl/0/caveman/caveman.cfg.ini

Comparison for DO50398 using Sanger
---
Common: 171325
Extra: 3
 - Example: 14:20031258:G,8:43827158:A,X:61711363:C
Missing: 13
 - Example: 10:106963148:T,17:64794691:G,1:82709263:T

Because I'm an infrastructure architect, my main reason for the test was to monitor resource utilization, so I wrote a wiki page detailing my observations:

https://wiki.oicr.on.ca/display/~gmihaiescu/Dockerized+Sanger+workflow

I have three more Docker tests running: two of them run Sanger against the same donor (but using VMs with 8 cores, because I want to see if the run time and resource utilization are constant), and a third is running DKFZ.
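
For readers unsure how to read a report like the one above, a set-overlap comparison of two VCF files can be sketched as follows. This is an illustration only and not the code in bin/compare_result.sh. It assumes that each variant is keyed as chromosome:position:allele (mirroring the examples in the report, with the allele taken to be the alternative allele), that "Extra" means variants found only in the new run, and that "Missing" means variants found only in the published result. The function names and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Set


def variant_keys(vcf_lines: Iterable[str]) -> Set[str]:
    """Collect one 'chrom:pos:alt' key per alternative allele in a VCF."""
    keys: Set[str] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, _ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add(f"{chrom}:{pos}:{alt}")
    return keys


def compare(new_run: Set[str], reference: Set[str]) -> Dict[str, Set[str]]:
    """Split the variants into shared, new-run-only and reference-only sets."""
    return {
        "Common": new_run & reference,
        "Extra": new_run - reference,
        "Missing": reference - new_run,
    }


def report(result: Dict[str, Set[str]], examples: int = 3) -> str:
    """Format the counts in the same layout as the report above."""
    lines = [f"Common: {len(result['Common'])}"]
    for label in ("Extra", "Missing"):
        lines.append(f"{label}: {len(result[label])}")
        if result[label]:
            shown = ",".join(sorted(result[label])[:examples])
            lines.append(f" - Example: {shown}")
    return "\n".join(lines)


# Toy records for illustration only; they are not taken from any donor.
REFERENCE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "2\t200\t.\tC\tT",
    "3\t300\t.\tG\tA,C",
]
NEW_RUN = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",
    "3\t300\t.\tG\tA,C",
    "X\t400\t.\tT\tC",
]

print(report(compare(variant_keys(NEW_RUN), variant_keys(REFERENCE))))
```

Keying on the alternative allele alone treats two records at the same position with different reference alleles as the same variant; adding the reference allele to the key is a stricter alternative.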

Cheers,
George

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mikisvaz at gmail.com Wed Mar 22 17:14:49 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 22 Mar 2017 22:14:49 +0100
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References:
Message-ID:

Excellent George, thanks!

Those results are in accordance with what Jonas and I got from our tests for Sanger.

By the way, the link you sent does not seem to work for me.

Best

Miguel

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From George.Mihaiescu at oicr.on.ca Wed Mar 22 20:14:29 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Thu, 23 Mar 2017 00:14:29 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
Message-ID:

I attached a PDF version of that page if you don't have access to our wiki.

George

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gmihaiescu-DockerizedSangerworkflow-220317-2012-6.pdf
Type: application/pdf
Size: 210457 bytes
Desc: gmihaiescu-DockerizedSangerworkflow-220317-2012-6.pdf
URL:

From kr2 at sanger.ac.uk Thu Mar 23 05:47:49 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Thu, 23 Mar 2017 09:47:49 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk> <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk>
Message-ID:

Hi,

The jsonl files on pancancer.org contain the versions of software used originally. If someone can give me the BWA and bammarkduplicates(2?)
versions used this may be explained. Bammarkduplicates had a bug fix a few months into the mapping, but the reported difference at the time (I don't remember who did it) was <1%.

Keiran

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kr2 at sanger.ac.uk Thu Mar 23 05:58:49 2017
From: kr2 at sanger.ac.uk (Keiran Raine)
Date: Thu, 23 Mar 2017 09:58:49 +0000
Subject: [DOCKTESTERS] BWA-Mem update
In-Reply-To:
References:
Message-ID:

Hi,

Sorry if this is in your Confluence page, but I'm unable to access it (could be because I'm outside OICR or because the default for your space is owner only).

Can you confirm whether the CaVEMan calling was based on the BAM file that the original data was generated with, or on one mapped with the new/recent mapping flow?
Also, the key information for determining if a call change is erroneous: 1. Is the variant is marked 'PASSED'. 2. What are the probabilities attached to the VCF record (should be in the info field)? As previously stated we do expect a small variance in the results for the data processed at the beginning of the project and those at the end as well as some minor changes introduced when the normal-panel was moved from a web-service to a local file. Regards, Keiran From: George Mihaiescu Date: Wednesday, 22 March 2017 at 20:18 To: Miguel Vazquez , Jonas Demeulemeester Cc: Keiran Raine , Junjun Zhang , "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update I finished one of the dockerized Sanger tests and upon verification there were just a few differences, but I'm not sure if they are normal or not. Results: root at dockstore-test3:~/PCAWG-Docker-Test# bin/compare_result.sh Sanger DO50398 var/spool/cwl/0/caveman/ var/spool/cwl/0/caveman/splitList var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz var/spool/cwl/0/caveman/alg_bean var/spool/cwl/0/caveman/prob_arr var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz.tbi var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.no_analysis.bed var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz.tbi var/spool/cwl/0/caveman/cov_arr var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz.tbi var/spool/cwl/0/caveman/caveman.cfg.ini Comparison for DO50398 using Sanger --- 
Common: 171325 Extra: 3 - Example: 14:20031258:G,8:43827158:A,X:61711363:C Missing: 13 - Example: 10:106963148:T,17:64794691:G,1:82709263:T Because I'm an infrastructure architect my main reason for the test was to monitor resource utilization, so I wrote a wiki detailing my observations: https://wiki.oicr.on.ca/display/~gmihaiescu/Dockerized+Sanger+workflow I have three more Docker tests running, two of them run Sanger against the same donor (but using VMs with 8 cores because I want to see if the run time and resource utilization are constant), and a third test that is running DKFZ. Cheers, George From: Miguel Vazquez Date: Wednesday, March 22, 2017 at 1:08 PM To: Jonas Demeulemeester Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Thanks Jonas for this information. I hope that someone here can provide us with some suggestion on what to try next. Perhaps the version issue that Jonas points out is the key. I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches. Best regards Miguel On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester wrote: Hi all, A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately. Doing the comparison however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this. Is there a list available somewhere detailing which samples were mapped using which versions? That way we could select a relevant test sample without having to sort through the headers of all different bams.
Best wishes, Jonas newly aligned: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 3 112743126 HS2000-1012_275:7:1101:17411:15403 147 3 112743376 HS2000-1012_275:7:1101:11883:83640 99 16 28672999 HS2000-1012_275:7:1101:11883:83640 147 16 28673223 HS2000-1012_275:7:1101:16576:28476 163 GL000238.1 21309 HS2000-1012_275:7:1101:16576:28476 83 GL000238.1 21664 vs the original: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 8 54944243 HS2000-1012_275:7:1101:17411:15403 147 8 54944493 HS2000-1012_275:7:1101:11883:83640 163 16 28464362 HS2000-1012_275:7:1101:11883:83640 83 16 28464586 HS2000-1012_275:7:1101:16576:28476 99 12 6124549 HS2000-1012_275:7:1101:16576:28476 147 12 6124903 _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jonas.Demeulemeester at crick.ac.uk Thu Mar 23 06:07:29 2017 From: Jonas.Demeulemeester at crick.ac.uk (Jonas Demeulemeester) Date: Thu, 23 Mar 2017 10:07:29 +0000 Subject: [DOCKTESTERS] BWA-Mem update In-Reply-To: References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk> <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk> Message-ID: Thanks Keiran for the info!
Digging deeper into the headers, the versions of BWA (0.7.8-r455), bamsort (0.0.148) and bammarkduplicates (0.0.148) do seem to be the same for DO50311, and it's only the workflow and SeqWare versions that differ. I don't really see how this could create the 3% discrepancies we're getting though. Is there anything else we might be overlooking here or some stochasticity involved, as the mismatched reads really do map differently, despite having completely identical sequences? Thanks, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 23 Mar 2017, at 09:47, Keiran Raine wrote: Hi, The jsonl files on pancancer.org contain the versions of software used originally. If someone can give me the BWA and bammarkduplicates(2?) versions used this may be explained. Bammarkduplicates had a bug fix a few months into the mapping, but the reported differences at the time (I don't remember who did it) were <1%. Keiran From: Miguel Vazquez Date: Wednesday, 22 March 2017 at 18:08 To: Jonas Demeulemeester Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Thanks Jonas for this information. I hope that someone here can provide us with some suggestion on what to try next. Perhaps the version issue that Jonas points out is the key. I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches. Best regards Miguel On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester wrote: Hi all, A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately.
Doing the comparison however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this. Is there a list available somewhere detailing which samples were mapped using which versions? That way we could select a relevant test sample without having to sort through the headers of all different bams. Best wishes, Jonas newly aligned: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 3 112743126 HS2000-1012_275:7:1101:17411:15403 147 3 112743376 HS2000-1012_275:7:1101:11883:83640 99 16 28672999 HS2000-1012_275:7:1101:11883:83640 147 16 28673223 HS2000-1012_275:7:1101:16576:28476 163 GL000238.1 21309 HS2000-1012_275:7:1101:16576:28476 83 GL000238.1 21664 vs the original: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 8 54944243 HS2000-1012_275:7:1101:17411:15403 147 8 54944493 HS2000-1012_275:7:1101:11883:83640 163 16 28464362 HS2000-1012_275:7:1101:11883:83640 83 16 28464586 HS2000-1012_275:7:1101:16576:28476 99 12 6124549 HS2000-1012_275:7:1101:16576:28476 147 12 6124903 _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From kr2 at sanger.ac.uk Thu Mar 23 06:14:40 2017 From: kr2 at sanger.ac.uk (Keiran Raine) Date: Thu, 23 Mar 2017 10:14:40 +0000 Subject: [DOCKTESTERS] BWA-Mem update In-Reply-To: References: <60731B82-E673-4BA0-A99C-EB7309E2B24B@sanger.ac.uk> <9F56FADB-6AE5-4027-9D9F-8ACA13CC7C9B@oicr.on.ca> <0E1785D8-64AD-463B-9AB4-D8ACB99A3821@sanger.ac.uk> <0B22ECBD-2187-41D9-9EA8-C8AC98B39425@crick.ac.uk> Message-ID: The order of reads passed into BWA does have an effect due to the way the insert size is calculated during proper-pair determination. Diff_bams in PCAP-core has an option to ignore MAPQ=0, does that pass or fail with differences? Keiran From: Jonas Demeulemeester Date: Thursday, 23 March 2017 at 10:07 To: Keiran Raine Cc: Miguel Vazquez, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Thanks Keiran for the info! Digging deeper into the headers, the versions of BWA (0.7.8-r455), bamsort (0.0.148) and bammarkduplicates (0.0.148) do seem to be the same for DO50311, and it's only the workflow and SeqWare versions that differ. I don't really see how this could create the 3% discrepancies we're getting though. Is there anything else we might be overlooking here or some stochasticity involved, as the mismatched reads really do map differently, despite having completely identical sequences?
Thanks, Jonas _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk On 23 Mar 2017, at 09:47, Keiran Raine wrote: Hi, The jsonl files on pancancer.org contain the versions of software used originally. If someone can give me the BWA and bammarkduplicates(2?) versions used this may be explained. Bammarkduplicates had a bug fix a few months into the mapping, but the reported differences at the time (I don't remember who did it) were <1%. Keiran From: Miguel Vazquez Date: Wednesday, 22 March 2017 at 18:08 To: Jonas Demeulemeester Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Thanks Jonas for this information. I hope that someone here can provide us with some suggestion on what to try next. Perhaps the version issue that Jonas points out is the key. I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches. Best regards Miguel On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester wrote: Hi all, A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately. Doing the comparison however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this. Is there a list available somewhere detailing which samples were mapped using which versions?
That way we could select a relevant test sample without having to sort through the headers of all different bams. Best wishes, Jonas newly aligned: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 3 112743126 HS2000-1012_275:7:1101:17411:15403 147 3 112743376 HS2000-1012_275:7:1101:11883:83640 99 16 28672999 HS2000-1012_275:7:1101:11883:83640 147 16 28673223 HS2000-1012_275:7:1101:16576:28476 163 GL000238.1 21309 HS2000-1012_275:7:1101:16576:28476 83 GL000238.1 21664 vs the original: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 8 54944243 HS2000-1012_275:7:1101:17411:15403 147 8 54944493 HS2000-1012_275:7:1101:11883:83640 163 16 28464362 HS2000-1012_275:7:1101:11883:83640 83 16 28464586 HS2000-1012_275:7:1101:16576:28476 99 12 6124549 HS2000-1012_275:7:1101:16576:28476 147 12 6124903 _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
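Keiran's suggestion in this message, re-checking the differences while ignoring MAPQ=0 reads, can be illustrated with a minimal Python sketch. It is not the Diff_bams implementation from PCAP-core: the Aln record, the diff_alignments function and the toy reads are hypothetical, and the only facts relied on are the standard SAM flag bits (0x40 first in pair, 0x80 second in pair, 0x100 secondary, 0x800 supplementary). The idea is that a read placed with MAPQ 0 could have been placed equally well elsewhere, so a changed position for such a read is not evidence of a real mapping difference.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Aln(NamedTuple):
    """Hypothetical minimal alignment record (a subset of SAM columns)."""
    name: str
    flag: int
    chrom: str
    pos: int
    mapq: int


def mate_key(aln: Aln) -> Tuple[str, int]:
    # SAM flag 0x40 = first in pair, 0x80 = second in pair.
    mate = 1 if aln.flag & 0x40 else 2 if aln.flag & 0x80 else 0
    return (aln.name, mate)


def primary_only(alns: Iterable[Aln]) -> Dict[Tuple[str, int], Aln]:
    # Skip secondary (0x100) and supplementary (0x800) records.
    return {mate_key(a): a for a in alns if not a.flag & 0x900}


def diff_alignments(old: Iterable[Aln], new: Iterable[Aln],
                    ignore_mapq0: bool = True) -> List[Tuple[str, int]]:
    """Return (read name, mate) keys whose chromosome or position changed."""
    old_by_key = primary_only(old)
    new_by_key = primary_only(new)
    changed = []
    for key in sorted(old_by_key.keys() & new_by_key.keys()):
        o, n = old_by_key[key], new_by_key[key]
        if (o.chrom, o.pos) == (n.chrom, n.pos):
            continue
        # Assumption: a difference is ignored when either side has MAPQ 0.
        if ignore_mapq0 and (o.mapq == 0 or n.mapq == 0):
            continue
        changed.append(key)
    return changed


# Toy records, invented for illustration only (not taken from the BAMs).
old = [Aln("r1", 99, "8", 100, 60), Aln("r1", 147, "8", 350, 60),
       Aln("r2", 99, "12", 500, 0), Aln("r3", 83, "1", 10, 60)]
new = [Aln("r1", 99, "3", 900, 60), Aln("r1", 147, "8", 350, 60),
       Aln("r2", 99, "GL000238.1", 20, 0), Aln("r3", 83, "1", 10, 60)]

print(diff_alignments(old, new))                      # only r1 mate 1 remains
print(diff_alignments(old, new, ignore_mapq0=False))  # r2 is reported as well
```

If most of the 3% mismatches disappear under such a filter, the differences would be consistent with ambiguous placements rather than a change in the workflow; if they remain, the read-order effect on insert-size estimation that Keiran mentions would be the next thing to examine.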
-------------- next part -------------- An HTML attachment was scrubbed... URL: From George.Mihaiescu at oicr.on.ca Fri Mar 24 22:58:37 2017 From: George.Mihaiescu at oicr.on.ca (George Mihaiescu) Date: Sat, 25 Mar 2017 02:58:37 +0000 Subject: [DOCKTESTERS] BWA-Mem update In-Reply-To: Message-ID: Hi Keiran, I used the original aligned BAMs available in Collaboratory and GNOS sites. One of my two other Sanger tests run against the same donor has completed too, and it had exactly the same output when I ran the "compare_result.sh" script, but I'm not sure what you meant by "the key information for determining if a call change is erroneous". Is the check script correctly (or not) validating the result? I'll probably send a final report on Monday with the results of all four tests (three Sanger and one DKFZ). Cheers, George From: Keiran Raine Date: Thursday, March 23, 2017 at 4:58 AM To: George Mihaiescu, Miguel Vazquez, Jonas Demeulemeester Cc: Junjun Zhang, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Hi, Sorry if this is in your confluence page but I'm unable to access it (could be as I'm outside OICR or that the default for your space is owner only). Can you confirm if the CaVEMan calling was based on the BAM file that the original data was generated with or one mapped with the new/recent mapping flow? Also, the key information for determining if a call change is erroneous: 1. Is the variant marked 'PASSED'? 2. What are the probabilities attached to the VCF record (should be in the info field)? As previously stated we do expect a small variance in the results for the data processed at the beginning of the project and those at the end as well as some minor changes introduced when the normal-panel was moved from a web-service to a local file.
Regards, Keiran From: George Mihaiescu > Date: Wednesday, 22 March 2017 at 20:18 To: Miguel Vazquez >, Jonas Demeulemeester > Cc: Keiran Raine >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] BWA-Mem update I finished one of the dockerized Sanger tests and upon verification there were just a few differences, but I'm not sure if they are normal or not. Results: root at dockstore-test3:~/PCAWG-Docker-Test# bin/compare_result.sh Sanger DO50398 var/spool/cwl/0/caveman/ var/spool/cwl/0/caveman/splitList var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz var/spool/cwl/0/caveman/alg_bean var/spool/cwl/0/caveman/prob_arr var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz.tbi var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.no_analysis.bed var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.snps.ids.vcf.gz var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.muts.ids.vcf.gz.tbi var/spool/cwl/0/caveman/cov_arr var/spool/cwl/0/caveman/7f94d650-41b9-4664-bcde-dc8533e4602d_vs_69586c55-6f81-4728-8a82-bd97bceafaaa.flagged.muts.vcf.gz.tbi var/spool/cwl/0/caveman/caveman.cfg.ini Comparison for DO50398 using Sanger --- Common: 171325 Extra: 3 - Example: 14:20031258:G,8:43827158:A,X:61711363:C Missing: 13 - Example: 10:106963148:T,17:64794691:G,1:82709263:T Because I'm a infrastructure architect my main reason for the test was to monitor resource utilization, so I wrote a wiki detailing my observations: https://wiki.oicr.on.ca/display/~gmihaiescu/Dockerized+Sanger+workflow I have there more Docker tests running, two of them run Sanger against the same donor 
(but using VMs with 8 cores because I want to see if the run time and resource utilization are constant), and a third test that is running DKFZ. Cheers, George From: Miguel Vazquez Date: Wednesday, March 22, 2017 at 1:08 PM To: Jonas Demeulemeester Cc: Keiran Raine, Junjun Zhang, George Mihaiescu, "docktesters at lists.icgc.org" Subject: Re: [DOCKTESTERS] BWA-Mem update Thanks Jonas for this information. I hope that someone here can provide us with some suggestion on what to try next. Perhaps the version issue that Jonas points out is the key. I just want to add that, as I told Jonas earlier, my own tests using the new split BAM files also gave 3% mismatches. Best regards Miguel On Wed, Mar 22, 2017 at 6:56 PM, Jonas Demeulemeester wrote: Hi all, A brief update on the BWA-Mem docker tests. I prepared normal + tumor lane-level unaligned bams for DO503011 and ran the BWA-Mem workflow for normal and tumor separately. Doing the comparison however, I am still getting 3% of reads that are aligned differently (see below for a few examples). However, when checking the headers of the original and newly mapped bam files (attached) I noticed that the original is mapped using a different version of BWA and SeqWare. I'm hoping the mapping differences can be ascribed to this. Is there a list available somewhere detailing which samples were mapped using which versions? That way we could select a relevant test sample without having to sort through the headers of all different bams.
Best wishes, Jonas newly aligned: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 3 112743126 HS2000-1012_275:7:1101:17411:15403 147 3 112743376 HS2000-1012_275:7:1101:11883:83640 99 16 28672999 HS2000-1012_275:7:1101:11883:83640 147 16 28673223 HS2000-1012_275:7:1101:16576:28476 163 GL000238.1 21309 HS2000-1012_275:7:1101:16576:28476 83 GL000238.1 21664 vs the original: ID flag chr pos HS2000-1012_275:7:1101:17411:15403 99 8 54944243 HS2000-1012_275:7:1101:17411:15403 147 8 54944493 HS2000-1012_275:7:1101:11883:83640 163 16 28464362 HS2000-1012_275:7:1101:11883:83640 83 16 28464586 HS2000-1012_275:7:1101:16576:28476 99 12 6124549 HS2000-1012_275:7:1101:16576:28476 147 12 6124903 _________________________________ Jonas Demeulemeester, PhD Postdoctoral Researcher The Francis Crick Institute 1 Midland Road London NW1 1AT T: +44 (0)20 3796 2594 M: +44 (0)7482 070730 E: jonas.demeulemeester at crick.ac.uk W: www.crick.ac.uk The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- An HTML attachment was scrubbed... URL: From miguel.vazquez at cnio.es Mon Mar 27 08:34:55 2017 From: miguel.vazquez at cnio.es (Miguel Vazquez) Date: Mon, 27 Mar 2017 14:34:55 +0200 Subject: [DOCKTESTERS] DKFZ BiasFilter 100% match on DO52140, DO35937, and DO218695 Message-ID: Dear all, I'm very pleased to announce that with the help of Kortine we have managed to reproduce the results. To get the proper results we had to filter out the input file (consensus.vcf from GNOS for each donor) to remove the LOWSUPPORT and OXOG variants.
I've also removed the bSeq and bPcr tags from the file as I saw in the Python code that otherwise these already flagged variants would have been excluded. Best regards Miguel Comparison for DO52140 tag bPcr --- Common: 1 Extra: 0 Missing: 0 Comparison for DO52140 tag bSeq --- Common: 1 Extra: 0 Missing: 0 Processing donor: DO35937 Comparison for DO35937 tag bPcr --- Common: 1 Extra: 0 Missing: 0 Comparison for DO35937 tag bSeq --- Common: 6 Extra: 0 Missing: 0 Processing donor: DO218695 Comparison for DO218695 tag bPcr --- Common: 1 Extra: 0 Missing: 0 Comparison for DO218695 tag bSeq --- Common: 10 Extra: 0 Missing: 0 On Wed, Mar 15, 2017 at 11:09 AM, Miguel Vazquez wrote: > Dear all, > > As you can read below I made a mistake in my previous validation for the > DKFZ BiasFilter. Unfortunately large differences have turned up now that > I've corrected the process. In brief, on both donors I've found that > re-running the filter flags some additional variants for both flags bPcr and > bSeq. Notably all the discrepancies are for the new method flagging more > variants. For instance in many cases the original file contains just one > variant with the flag where the new one contains ten or twenty. You can read the > details at the end of this email where we are comparing the original VCF to > the new one. > > Note that the original VCF (the consensus variants) is the input I use > for the BiasFilter along with the corresponding BAM files for that donor. I > can only imagine that if this VCF was not the one originally used due to > some filtering step then perhaps the bias calculations might have been > affected. If that is so I would need instructions on where to get the > precise input VCFs.
> > Best regards > > Miguel > > ----RESULTS---- > > Comparison for *DO52140* tag *bPcr* > --- > Common: *1* > Extra: *12* > - Example: 11:81550771:C:A,12:19486241:G:T,2:12287406:G:T > Missing: 0 > > > Comparison for *DO52140* tag *bSeq* > --- > Common: *1* > Extra: *23* > - Example: 10:17681457:G:T,12:112049882:T:G,12:130990011:T:A > Missing: 0 > > Comparison for *DO35937* tag *bPcr* > --- > Common: *1* > Extra: *10* > - Example: 1:114845662:G:T,14:33282600:C:A,16:78467879:G:T > Missing: 0 > > > Comparison for *DO35937* tag *bSeq* > --- > Common: *6* > Extra: *88* > - Example: 10:21703903:A:C,10:24183103:C:T,10:51468498:C:G > Missing: 0 > > > > > On Mon, Mar 13, 2017 at 4:48 PM, Christina Yung > wrote: > >> Hi Miguel, >> >> The bPCR and bSeq flags are indeed the ones flagged by the DKFZ bias >> filter. When you summarize the comparison, please cc Matthias of DKFZ as >> his team developed this filter. No issue at all, and thanks again for your >> great work! >> >> Christina >> >> >> On 3/13/2017 10:22 AM, Miguel Vazquez wrote: >> >> Dear all, >> >> I just learnt that the DKFZ BiasFilter is NOT the OXOG filter workflow, >> which means* I checked for the wrong thing in this validation!* I'm >> sorry for the confusion. >> >> Right now I pass the BAM files and the consensus.vcf (SNV_MNV) downloaded >> from GNOS to the BiasFilter and compare the resulting VCF with the >> consensus looking at the set of mutations containing the OXOGFAIL flag. >> This apparently is not the comparison to make. *What is it that I need >> to compare? is it the bPcr and bSeq flags?* >> >> One first look at those flags do show quite some discrepancies >> unfortunately on both donors (DO52140 and DO35937) for both flags. For >> instance for DO35937 we find 11 mutations flaged bPcr with in the new >> result, while the consensus.vcf only finds one, of them. Something similar >> happens with the bSeq. >> >> Can you please confirm this so I can come reply with a full report on >> this. 
>> >> Kind regards, and sorry again for the confusion. >> >> Miguel >> >> >> >> On Mon, Feb 27, 2017 at 7:30 PM, Miguel Vazquez >> wrote: >> >>> Dear friends, >>> >>> I've performed the first test with the DKFZ BiasFilter and got a perfect >>> match. There are 55 variants annotated with OXOGFAIL and they are the same >>> in the input VCF file (consensus SNV/MNV VCF for that donor) and the output >>> of the BiasFilter. I'm running the test on a second donor. >>> >>> Best regards >>> >>> Miguel >>> >> >> >> >> _______________________________________________ >> docktesters mailing listdocktesters at lists.icgc.orghttps://lists.icgc.org/mailman/listinfo/docktesters >> >> >> >> _______________________________________________ >> docktesters mailing list >> docktesters at lists.icgc.org >> https://lists.icgc.org/mailman/listinfo/docktesters >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From George.Mihaiescu at oicr.on.ca Mon Mar 27 10:36:29 2017 From: George.Mihaiescu at oicr.on.ca (George Mihaiescu) Date: Mon, 27 Mar 2017 14:36:29 +0000 Subject: [DOCKTESTERS] Thanks! In-Reply-To: Message-ID: Hi, Last week thanks to Denis who provided the DKFZ dependencies, I was able to start that workflow. 
It ran for about 10 hours at 100% CPU, but then it failed with the following errors: root at dockstore4-dkfz:~/PCAWG-Docker-Test# + cntSuccessful=4 ++ expr 4 - 4 + cntErrornous=0 + [[ 0 -gt 0 ]] + [[ 0 == 0 ]] + echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt' No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt + for logfile in '${jobstateFiles[@]}' ++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt ++ grep -v null: ++ grep :STARTED: ++ wc -l + cntStarted=2 ++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt ++ grep -v null: ++ grep :0: ++ wc -l + cntSuccessful=2 ++ expr 2 - 2 + cntErrornous=0 + [[ 0 -gt 0 ]] + [[ 0 == 0 ]] + echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt' + [[ true == true ]] No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt There was at least one error in a job status logfile. Will exit now! + echo 'There was at least one error in a job status logfile. Will exit now!' 
+ exit 5 mv: cannot stat `/mnt/datastore/resultdata/*': No such file or directory Result directory listing is: + gosu root chmod -R a+wrx /var/spool/cwl Error while running job: Error collecting output for parameter 'germline_indel_vcf_gz': Did not find output file with glob pattern: '['*.germline.indel.vcf.gz']' [job temp5679700718223668526.cwl] completed permanentFail Final process status is permanentFail Workflow error, try again with --debug for more information: Process status is ['permanentFail'] org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:48) at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:200) at java.lang.Thread.run(Thread.java:745) java.lang.RuntimeException: problems running command: cwltool --enable-dev --non-strict --outdir /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/outputs/ --tmpdir-prefix /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/working/ /tmp/1490197113216-0/temp5679700718223668526.cwl /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/workflow_params.json Any idea what went wrong? Thank you, George From: Jonas Demeulemeester > Date: Monday, March 20, 2017 at 3:13 PM To: George Mihaiescu > Cc: Miguel Vazquez >, Junjun Zhang >, "docktesters at lists.icgc.org" > Subject: Re: [DOCKTESTERS] Thanks! Hi George, Do you have the DKFZ workflow dependencies tarball in place (and named correctly)? 
That's the file it's clearly not finding:

17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

You can find the link to this reference tarball on the DKFZ pipeline GitHub page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_workflows).

Hope this helps,
Jonas

On 20 Mar 2017, at 17:19, George Mihaiescu wrote:

Hi,

How do I run the DKFZ workflow? I first ran Delly, which ended with the following output:

Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.somatic.sv.vcf.gz
[##################################################] 100%
Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234-a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output//DO218695.delly.sv.cov.plots.tar.gz

After that, I tried to run the DKFZ workflow, but it fails as shown below:

root@dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695
Running:
cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0 quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json
WARNING: You're currently running as root; probably by accident.
Press control-C to abort or Enter to continue as root.
Set DOCKSTORE_ROOT to disable this warning.

Creating directories for run of Dockstore launcher at: ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2
Provisioning your input files to your local machine
Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/9c1f2887-bce0-41dd-a4d2-52f000d79e65
Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/0a43e408-0cdf-4d99-99a3-e9860161a246
Downloading: #reference-gz from /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f
17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz
java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> /root/PCAWG-Docker-Test/resources/dkfz-workflow-dependencies_150318_0951.tar.gz
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
    at java.nio.file.Files.createLink(Files.java:1086)
    at io.dockstore.common.FileProvisioning.provisionInputFile(FileProvisioning.java:273)
    at io.github.collaboratory.LauncherCWL.copyIndividualFile(LauncherCWL.java:726)
    at io.github.collaboratory.LauncherCWL.doProcessFile(LauncherCWL.java:688)
    at io.github.collaboratory.LauncherCWL.pullFilesHelper(LauncherCWL.java:659)
    at io.github.collaboratory.LauncherCWL.pullFiles(LauncherCWL.java:586)
    at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185)
    at io.dockstore.client.cli.nested.AbstractEntryClient.handleCWLLaunch(AbstractEntryClient.java:1028)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:968)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launchCwl(AbstractEntryClient.java:951)
    at io.dockstore.client.cli.nested.AbstractEntryClient.launch(AbstractEntryClient.java:935)
    at io.dockstore.client.cli.nested.AbstractEntryClient.processEntryCommands(AbstractEntryClient.java:247)
    at io.dockstore.client.cli.Client.run(Client.java:704)
    at io.dockstore.client.cli.Client.main(Client.java:796)
java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz

P.S. I have three other Sanger tests running that were started at different intervals (and on VMs with different CPU/memory/disk), but none of them has completed yet.

Thank you,
George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 8:52 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!
Hi George,

Answers inline.

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu wrote:

> Hi Miguel,
>
> I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it will still take a long time. My scripts run one workflow after another.

> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" file; not sure if this is an issue.

I wondered the same thing the first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker, so I think you won't have a problem.

> By looking at "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.

Let me know if you need any help.

> I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files that the DKFZ workflow needs, namely related to copy number, I believe.

> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant hash keys for the different files in etc/donor_files.csv. It's a bit tedious right now: you need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have them all, you can run all the workflows for that donor and evaluate the results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv

Regards
Miguel

> George

From: Miguel Vazquez
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu
Cc: Junjun Zhang, Jonas Demeulemeester, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy; it takes about two weeks in my tests.

About correctness, my scripts also cover that part. If you are not using them, they might still help clarify how we do it. The idea is to take each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both germline and somatic) and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part of the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script takes care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs: it consists of extracting each variant as chromosome, position, reference allele and alternative allele, and measuring the overlap between the two sets.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one that Jonas and I have both tested, so it could be interesting to get a third opinion. Any other donor could also be interesting, to see if something new comes up. I'm not sure which option is best.

Miguel

On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu wrote:

Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59". I just started a second run on a different VM on the same donor, just to compare run times.

The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk. I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.
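[Editorial sketch] The comparison Miguel describes (reduce every VCF record to chromosome, position, reference allele and alternative allele, then measure the overlap) could look roughly like the Python below. This is not the code in bin/compare_result_type.sh; the function names `read_variants`, `open_vcf` and `overlap` are hypothetical, and the sketch assumes plain tab-separated VCF records with multi-allelic sites split into one key per alternative allele.

```python
import gzip
from typing import Dict, Iterable, Set, Tuple

# A variant is identified only by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def read_variants(lines: Iterable[str]) -> Set[Variant]:
    """Collect variant keys from VCF lines, skipping headers and blank lines."""
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Assumption: a multi-allelic record counts once per alternative allele.
        for alt in alts.split(","):
            variants.add((chrom, pos, ref, alt))
    return variants


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def overlap(test: Set[Variant], truth: Set[Variant]) -> Dict[str, int]:
    """Count variants shared by both call sets and those unique to each."""
    return {
        "common": len(test & truth),
        "only_test": len(test - truth),
        "only_truth": len(truth - test),
    }
```

Usage would be along the lines of `overlap(read_variants(open_vcf(new_vcf)), read_variants(open_vcf(gnos_vcf)))`, where both paths are placeholders for the freshly produced file and the reference file.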
Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow.

George

From: Junjun Zhang
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester, George Mihaiescu
Cc: Miguel Vazquez, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here! Do you have any update on the latest testing? Please feel free to update the wiki with any news:

https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun

From: Jonas Demeulemeester
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, Denis Yuen, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts. Give them a go and if you run into issues, just let us know!

Cheers,
Jonas

On 11 Mar 2017, at 17:00, George Mihaiescu wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect
Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue Suite 510
Toronto, Ontario Canada M5G 0A3
Email: George.Mihaiescu at oicr.on.ca
Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org
https://lists.icgc.org/mailman/listinfo/docktesters

The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From George.Mihaiescu at oicr.on.ca Tue Mar 28 12:52:31 2017
From: George.Mihaiescu at oicr.on.ca (George Mihaiescu)
Date: Tue, 28 Mar 2017 16:52:31 +0000
Subject: [DOCKTESTERS] Thanks!
In-Reply-To: <9f143fc3109a4a9c9db43821a53e20c0@oicr.on.ca>
Message-ID:

Hi,

Please see attached the final results of my three Sanger tests.

A VM with 19 cores (about 2.4 times the CPU capacity of an 8-core VM) finishes the analysis only 66% faster, which means that an 8-core VM is more efficient than a 19-core one, probably because the workflow does not use all the cores all of the time.

All three analyses gave the same results, and the two that ran on the same VM flavour had very similar runtimes, which means we can predict how long the workflow takes to run and the resources needed for a large-scale analysis.
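[Editorial sketch] The efficiency argument above can be made explicit with a few lines of Python. The reading of "66% faster" as a speedup factor of 1.66 is an assumption, as are the function names; the actual runtimes are in the attached PDF and are not reproduced here.

```python
def core_ratio(cores_large: int, cores_small: int) -> float:
    """How many times more cores the larger VM has."""
    return cores_large / cores_small


def relative_efficiency(speedup: float, cores_large: int, cores_small: int) -> float:
    """Speedup obtained per unit of extra CPU capacity.

    1.0 means the larger VM scales perfectly; lower values mean part of
    the added cores sit idle during the run.
    """
    return speedup / core_ratio(cores_large, cores_small)


# Assumption: "66% faster" is interpreted as a 1.66x speedup.
ratio = core_ratio(19, 8)                      # 2.375x the cores
efficiency = relative_efficiency(1.66, 19, 8)  # roughly 0.7
print(f"core ratio {ratio:.3f}, relative efficiency {efficiency:.2f}")
```

Under that assumption the 19-core VM delivers roughly 70% of the throughput per core of the 8-core VM, which is consistent with the conclusion that the smaller flavour is the more efficient one for batch runs.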
Also, by scheduling the workflows in batches spread across multiple physical servers and at 2-hour intervals, disk I/O contention could be mostly avoided, allowing 4-5 simultaneous workflows to run on each compute node (depending on capacity).

George

From: Denis Yuen
Date: Wednesday, March 22, 2017 at 9:00 AM
To: George Mihaiescu, Jonas Demeulemeester
Cc: "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi,

I have a local copy of the file on my desktop. It's a bit ironic, but if GNOS is currently down, we could set up a local transfer or use a secure OICR USB key. The file is 22 GB in size.

Denis Yuen
Senior Software Developer
Ontario Institute for Cancer Research

________________________________
From: docktesters-bounces+denis.yuen=oicr.on.ca at lists.icgc.org on behalf of George Mihaiescu
Sent: March 22, 2017 8:56:46 AM
To: Jonas Demeulemeester
Cc: docktesters at lists.icgc.org
Subject: Re: [DOCKTESTERS] Thanks!

I received a new GNOS token and installed the gtdownload client (not easy because I'm running on an Ubuntu 16.04 VM). Now, when I run the "bin/get_dkfz_resources.sh" script, it stays at zero:

Status: 0 bytes downloaded (0.000% complete) current rate: /s
Child 1 downloading ( )
Child 2 downloading ( )
Child 3 downloading ( )
Child 4 downloading ( )
Child 5 downloading ( )
Child 6 downloading ( )
Child 7 downloading ( )
Child 8 downloading ( )
Status: 0 bytes downloaded (0.000% complete) current rate: /s
Child 1 downloading ( )
Child 2 downloading ( )
Child 3 downloading ( )
Child 4 downloading ( )
Child 5 downloading ( )
Child 6 downloading ( )
Child 7 downloading ( )
Child 8 downloading ( )
Status: 0 bytes downloaded (0.000% complete) current rate: /s

Is there another way I can download that file?

Also, I saw on https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data that you already ran the Sanger workflow against DO218695. Do you remember how long it took? I couldn't find the original run time for that donor looking through GitHub (https://github.com/ICGC-TCGA-PanCancer/ceph_transfer_ops). The initial VM running this donor has been running for more than 10 days, and I don't remember Sanger taking so long.

Thank you,
George

From: Jonas Demeulemeester
Date: Monday, March 20, 2017 at 3:13 PM
To: George Mihaiescu
Cc: Miguel Vazquez, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Do you have the DKFZ workflow dependencies tarball in place (and named correctly)?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DockerizedSangerworkflow.pdf
Type: application/pdf
Size: 373079 bytes
Desc: DockerizedSangerworkflow.pdf
URL:

From mikisvaz at gmail.com Wed Mar 29 05:09:37 2017
From: mikisvaz at gmail.com (Miguel Vazquez)
Date: Wed, 29 Mar 2017 11:09:37 +0200
Subject: [DOCKTESTERS] Thanks!
In-Reply-To:
References:
Message-ID:

George,

If you were using my scripts you should have a file: tests/DKFZ/DO218695/Dockstore.json. Do you mind sharing it with us?

I find it a bit hard to debug these sorts of errors because it's not clear to me whether the problem is in the underlying tools or in the dockstore layer. Since the tools work and the inputs are probably suitable for them, it stands to reason that the problem is in the interface: dockstore cannot set up the environment for the tool. So my guess is that some inputs are not there, or they are not placed in the right location, or something like that.
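[Editorial sketch] One way to test the guess that some inputs are missing or misplaced is to walk the Dockstore.json parameter file and check every local file path before launching. The Python below is only an illustration: it assumes the file uses CWL-style entries of the form {"class": "File", "path": "..."}, and the helper names `iter_file_paths` and `missing_inputs` are hypothetical. Output entries use the same shape and would be reported as missing before a run, so the optional `keys` argument restricts the check to the input parameters of interest.

```python
import json
import os
from typing import Any, Iterable, Iterator, List, Optional


def iter_file_paths(node: Any) -> Iterator[str]:
    """Yield the "path" of every {"class": "File"} object nested in node."""
    if isinstance(node, dict):
        if node.get("class") == "File" and isinstance(node.get("path"), str):
            yield node["path"]
        for value in node.values():
            yield from iter_file_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_file_paths(item)


def missing_inputs(params: dict, base_dir: str = ".",
                   keys: Optional[Iterable[str]] = None) -> List[str]:
    """Return local file paths named in params that do not exist on disk."""
    selected = params if keys is None else {k: params[k] for k in keys if k in params}
    missing = []
    for path in iter_file_paths(selected):
        if "://" in path:
            continue  # remote location (e.g. https, s3): nothing to check locally
        full = path if os.path.isabs(path) else os.path.join(base_dir, path)
        if not os.path.exists(full):
            missing.append(path)
    return missing


def check_parameter_file(json_path: str) -> List[str]:
    """Load a parameter file and resolve relative paths against its directory."""
    with open(json_path) as handle:
        params = json.load(handle)
    return missing_inputs(params, base_dir=os.path.dirname(json_path) or ".")
```

Calling `check_parameter_file("tests/DKFZ/DO218695/Dockstore.json")` from the repository root would then list any referenced local file that is absent, such as a dependencies tarball that was never downloaded.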
Anyway, let's start by looking at the Dockstore.json.

Best

Miguel

On Mon, Mar 27, 2017 at 4:36 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca> wrote:

> Hi,
>
> Last week thanks to Denis who provided the DKFZ dependencies, I was able to start that workflow.
>
> It ran for about 10 hours at 100% CPU, but then it failed with the following errors:
>
> root@dockstore4-dkfz:~/PCAWG-Docker-Test#
> + cntSuccessful=4
> ++ expr 4 - 4
> + cntErrornous=0
> + [[ 0 -gt 0 ]]
> + [[ 0 == 0 ]]
> + echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt'
> No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175138477_roddy_snvCalling/jobStateLogfile.txt
> + for logfile in '${jobstateFiles[@]}'
> ++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
> ++ grep -v null:
> ++ grep :STARTED:
> ++ wc -l
> + cntStarted=2
> ++ cat /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
> ++ grep -v null:
> ++ grep :0:
> ++ wc -l
> + cntSuccessful=2
> ++ expr 2 - 2
> + cntErrornous=0
> + [[ 0 -gt 0 ]]
> + [[ 0 == 0 ]]
> + echo 'No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt'
> + [[ true == true ]]
> No errors found for /mnt/datastore/testdata/run_id/roddyExecutionStore/exec_170322_175637640_roddy_indelCalling/jobStateLogfile.txt
> There was at least one error in a job status logfile. Will exit now!
> + echo 'There was at least one error in a job status logfile. Will exit now!'
> + exit 5
> mv: cannot stat `/mnt/datastore/resultdata/*': No such file or directory
> Result directory listing is:
> + gosu root chmod -R a+wrx /var/spool/cwl
> Error while running job: Error collecting output for parameter 'germline_indel_vcf_gz': Did not find output file with glob pattern: '['*.germline.indel.vcf.gz']'
> [job temp5679700718223668526.cwl] completed permanentFail
> Final process status is permanentFail
> Workflow error, try again with --debug for more information:
> Process status is ['permanentFail']
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
>     at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>     at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:48)
>     at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:200)
>     at java.lang.Thread.run(Thread.java:745)
> java.lang.RuntimeException: problems running command: cwltool --enable-dev --non-strict --outdir /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/outputs/ --tmpdir-prefix /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/working/ /tmp/1490197113216-0/temp5679700718223668526.cwl /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/launcher-e1ebdf3e-6f35-43f7-8ba2-1fb559d0d948/workflow_params.json
>
> Any idea what went wrong?
>
> Thank you,
> George
>
> From: Jonas Demeulemeester
> Date: Monday, March 20, 2017 at 3:13 PM
> To: George Mihaiescu
> Cc: Miguel Vazquez, Junjun Zhang <Junjun.Zhang at oicr.on.ca>, "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
> Subject: Re: [DOCKTESTERS] Thanks!
>
> Hi George,
>
> Do you have the DKFZ workflow dependencies tarball in place (and named correctly)?
> That's the file it's clearly not finding: > > 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could > not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow- > dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/ > DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480- > 9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8- > e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz > java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/ > DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480- > 9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8- > e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz > > > You can find the link to this reference tarball on the DKFZ pipeline > github page (https://github.com/ICGC-TCGA-PanCancer/dkfz_dockered_ > workflows) > > Hope this helps, > Jonas > > > On 20 Mar 2017, at 17:19, George Mihaiescu > wrote: > > Hi, > > How do I run the DKFZ workflow? > I first ran the DELLY which ended with the following output: > Uploading: #somatic_sv_vcf from /root/PCAWG-Docker-Test/tests/ > Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234- > a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.somatic.sv.vcf.gz > to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output// > DO218695.delly.somatic.sv.vcf.gz > [##################################################] 100% > Uploading: #cov_plots from /root/PCAWG-Docker-Test/tests/ > Delly/DO218695/./datastore/launcher-0ce3d535-bd87-4234- > a5c0-a3df48d7c5a5/outputs/run_id.embl-delly_1-3-0-preFilter.20150318.sv.cov.plots.tar.gz > to : /root/PCAWG-Docker-Test/tests/Delly/DO218695//output// > DO218695.delly.sv.cov.plots.tar.gz > > After that, I tried to run the DKFZ but it errors as below: > > root at dockstore4-dkfz:~/PCAWG-Docker-Test# bin/run_test.sh DKFZ DO218695 > Running: > cd /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/ && dockstore tool launch > --script --entry quay.io/pancancer/pcawg-dkfz-workflow:2.0.1_cwl1.0 > 
quay.io/jwerner_dkfz/DKFZBiasFilter:1.2.2 --json Dockstore.json > WARNING: You're currently running as root; probably by accident. > Press control-C to abort or Enter to continue as root. > Set DOCKSTORE_ROOT to disable this warning. > > Creating directories for run of Dockstore launcher at: > ./datastore//launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2 > Provisioning your input files to your local machine > Downloading: #delly-bedpe from /root/PCAWG-Docker-Test/tests/ > Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt into > directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/ > launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ > 9c1f2887-bce0-41dd-a4d2-52f000d79e65 > Downloading: #normal-bam from /root/PCAWG-Docker-Test/data/DO218695/normal.bam > into directory: /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/ > launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ > 0a43e408-0cdf-4d99-99a3-e9860161a246 > Downloading: #reference-gz from /root/PCAWG-Docker-Test/ > resources//dkfz-workflow-dependencies_150318_0951.tar.gz into directory: > /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/ > launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ > ee5564fe-ba17-4383-afd8-e785394e365f > 17:06:08.641 [main] ERROR io.dockstore.common.FileProvisioning - Could > not copy /root/PCAWG-Docker-Test/resources//dkfz-workflow- > dependencies_150318_0951.tar.gz to /root/PCAWG-Docker-Test/tests/ > DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480- > 9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8- > e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz > java.nio.file.NoSuchFileException: /root/PCAWG-Docker-Test/tests/ > DKFZ/DO218695/./datastore/launcher-81c42580-21ad-4480- > 9ee9-a0d2a7a14bf2/inputs/ee5564fe-ba17-4383-afd8- > e785394e365f/dkfz-workflow-dependencies_150318_0951.tar.gz -> > /root/PCAWG-Docker-Test/resources/dkfz-workflow- > dependencies_150318_0951.tar.gz > at sun.nio.fs.UnixException.translateToIOException( > 
UnixException.java:86) > at sun.nio.fs.UnixException.rethrowAsIOException( > UnixException.java:102) > at sun.nio.fs.UnixFileSystemProvider.createLink( > UnixFileSystemProvider.java:476) > at java.nio.file.Files.createLink(Files.java:1086) > at io.dockstore.common.FileProvisioning.provisionInputFile( > FileProvisioning.java:273) > at io.github.collaboratory.LauncherCWL.copyIndividualFile( > LauncherCWL.java:726) > at io.github.collaboratory.LauncherCWL.doProcessFile( > LauncherCWL.java:688) > at io.github.collaboratory.LauncherCWL.pullFilesHelper( > LauncherCWL.java:659) > at io.github.collaboratory.LauncherCWL.pullFiles( > LauncherCWL.java:586) > at io.github.collaboratory.LauncherCWL.run(LauncherCWL.java:185) > at io.dockstore.client.cli.nested.AbstractEntryClient. > handleCWLLaunch(AbstractEntryClient.java:1028) > at io.dockstore.client.cli.nested.AbstractEntryClient. > launchCwl(AbstractEntryClient.java:968) > at io.dockstore.client.cli.nested.AbstractEntryClient. > launchCwl(AbstractEntryClient.java:951) > at io.dockstore.client.cli.nested.AbstractEntryClient. > launch(AbstractEntryClient.java:935) > at io.dockstore.client.cli.nested.AbstractEntryClient. > processEntryCommands(AbstractEntryClient.java:247) > at io.dockstore.client.cli.Client.run(Client.java:704) > at io.dockstore.client.cli.Client.main(Client.java:796) > java.lang.RuntimeException: Could not copy /root/PCAWG-Docker-Test/ > resources//dkfz-workflow-dependencies_150318_0951.tar.gz to > /root/PCAWG-Docker-Test/tests/DKFZ/DO218695/./datastore/ > launcher-81c42580-21ad-4480-9ee9-a0d2a7a14bf2/inputs/ > ee5564fe-ba17-4383-afd8-e785394e365f/dkfz-workflow- > dependencies_150318_0951.tar.gz > > > P.S. I have three other Sanger tests running that were started at > different intervals (and on VMs with different CPU/memory/disk), but none > of them has completed yet. 
> > Thank you, > George > > > From: Miguel Vazquez > Date: Monday, March 13, 2017 at 8:52 AM > To: George Mihaiescu > Cc: Junjun Zhang , Jonas Demeulemeester < > Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" < > docktesters at lists.icgc.org> > Subject: Re: [DOCKTESTERS] Thanks! > > Hi George, > > Answers inline > > On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu < > George.Mihaiescu at oicr.on.ca> wrote: > >> Hi Miguel, >> >> I've started the test by running "bin/run_test.sh Sanger DO50398", so I >> guess with just one workflow running it should complete faster than two >> weeks. >> > > I think it still should take a long time. My scripts will run one workflow > after another. > > >> >> Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" >> script to use a docker container that has the icgc client inside and pull >> data from Collaboratory. There is no "bam.bas" file downloaded, just a >> ".bam" and a ".bam.bai" files, not sure if this is an issue. >> >> > I wondered the same thing first time I did this, but this file is produced > by the pipeline. There was some problem with this that was dealt with by > the developers and updated in the docker. So I think you won't have a > problem > > >> By looking at the "bin/compare_result_type.sh" it looks like it's using >> the gnos client to pull down the existing VCF files for comparison reasons, >> but I think we store those files in Collaboratory as well, so I'll work >> with Junjun to adapt the script for this. >> >> > Let me know if you need any help > > >> I think I initially tried to run the DKFZ workflow, but it complained >> about having to run Delly first, so I abandoned this for now. >> > > Yes, if you look at the run_batch.sh you will see that when using DKFZ it > will always run Delly first. Delly prepares some files the the DKFZ file > needs, namely related to copy number I believe. > > >> >> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor. 
>> > > Remember that you will need to add the relevant has-keys for the different > files in the etc/donor_files.csv. Its a bit tedious right now. You need to > go to the ICGC DCC and find these codes manually for the files you need. > Ask me if you need help. Once you have all you can run all the workflows > for that donor and evaluate results. > > https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/ > etc/donor_files.csv > > > Regards > > Miguel > > >> >> George >> >> From: Miguel Vazquez >> Date: Monday, March 13, 2017 at 6:53 AM >> To: George Mihaiescu >> Cc: Junjun Zhang , Jonas Demeulemeester < >> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" < >> docktesters at lists.icgc.org> >> Subject: Re: [DOCKTESTERS] Thanks! >> >> Hi George, >> >> The Sanger workflow is very lengthy, it takes about two weeks in my >> tests. >> >> About correctness, my scripts also cover that part, if you are not using >> them they might still help you to clarify how we do it. The idea is to take >> each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both >> germline and somatic and compare it with the result uploaded to GNOS (not >> all pipelines produce all files). This is the relevant part in the >> run_batch.sh script: >> >> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi >> n/run_batch.sh#L42-L46 >> >> The bin/compare_result_type.sh script will take care of downloading the >> correct file from GNOS and running the comparison. The comparison itself is >> simple since all files are VCFs, it consists in taking out the variants in >> terms of chromosome, position, reference and alternative allele and >> measuring the overlaps. >> >> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bi >> n/compare_result_type.sh >> >> About which donors to test, DO52140 is one Jonas and I have both tested >> and could be interesting to get a third opinion. Also, any other donor >> could be interesting to see if something new comes up. 
I'm not sure which >> options is best. >> >> Miguel >> >> >> >> >> On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu < >> George.Mihaiescu at oicr.on.ca> wrote: >> >>> Hi, >>> >>> I've started Sanger on DO50398 and it's been running for more than 24 >>> hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59" >>> >>> I just started a second run on a different VM on same donor, just to >>> compare run times. >>> The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some >>> monitoring graphs when it finishes the workflow, but I have no idea how to >>> check its correctness. >>> >>> Give me a list of donors and what workflows you want me to run and I'll >>> try to schedule them tomorrow. >>> >>> George >>> >>> >>> From: Junjun Zhang >>> Date: Sunday, March 12, 2017 at 10:45 PM >>> To: Jonas Demeulemeester , George >>> Mihaiescu >>> Cc: Miguel Vazquez , Denis Yuen < >>> Denis.Yuen at oicr.on.ca>, "docktesters at lists.icgc.org" < >>> docktesters at lists.icgc.org> >>> Subject: Re: [DOCKTESTERS] Thanks! >>> >>> Thanks Miguel and Jonas for your help here! >>> >>> Do you have any update on the latest testing? Please feel free updating >>> the wiki with any update: https://wiki.oicr.on.c >>> a/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference >>> >>> Regards, >>> Junjun >>> >>> >>> >>> From: Jonas Demeulemeester >>> Date: Saturday, March 11, 2017 at 7:15 PM >>> To: George Mihaiescu >>> Cc: Miguel Vazquez , Junjun Zhang < >>> junjun.zhang at oicr.on.ca>, Denis Yuen , " >>> docktesters at lists.icgc.org" >>> Subject: Re: [DOCKTESTERS] Thanks! >>> >>> Hi George, >>> >>> Yup, I've been running the PCAWG dockers mainly using Miguel's set of >>> scripts. >>> Give them a go and if you run into issues, just let us know! >>> >>> Cheers, >>> Jonas >>> >>> >>> On 11 Mar 2017, at 17:00, George Mihaiescu >>> wrote: >>> >>> Sure, I'll give it a try and report later. 
>>> >>> Thank you, >>> >>> *George Mihaiescu* >>> Senior Cloud Architect >>> >>> *Ontario Institute for Cancer Research* >>> MaRS Centre >>> 661 University Avenue >>> Suite 510 >>> Toronto, Ontario >>> Canada M5G 0A3 >>> >>> Email: George.Mihaiescu at oicr.on.ca >>> Toll-free: 1-866-678-6427 >>> Twitter: @OICR_news >>> >>> www.oicr.on.ca >>> >>> This message and any attachments may contain confidential and/or >>> privileged information for the sole use of the intended recipient. Any >>> review or distribution by anyone other than the person for whom it was >>> originally intended is strictly prohibited. If you have received this >>> message in error, please contact the sender and delete all copies. >>> Opinions, conclusions or other information contained in this message may >>> not be that of the organization. >>> >>> >>> >>> From: Miguel Vazquez >>> Date: Saturday, March 11, 2017 at 10:57 AM >>> To: Junjun Zhang >>> Cc: Denis Yuen , Jonas Demeulemeester < >>> jonas.demeulemeester at crick.ac.uk>, George Mihaiescu < >>> George.Mihaiescu at oicr.on.ca>, "docktesters at lists.icgc.org" < >>> docktesters at lists.icgc.org> >>> Subject: Re: [DOCKTESTERS] Thanks! >>> >>> Hi Junjun, >>> >>> I think Jonas has been using my scripts to run some of the tests, maybe >>> George could try them as well, it should be very easy for him to try the >>> Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter. >>> >>> https://github.com/mikisvaz/PCAWG-Docker-Test >>> >>> He would just need to update the tokens for DACO access and the scripts >>> will take care of downloading the BAM files, running the workflows and >>> evaluating the result. >>> >>> The documentation there is reasonably updated, but if this sounds good >>> then perhaps he could contact me and I could walk him through the details. 
>>> >>> Best regards >>> >>> Miguel >>> >>> On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang >>> wrote: >>> >>>> Dear Docktesters, >>>> >>>> George Mihaiescu, cloud architect, of the Collaboratory at OICR plans >>>> to run some bioinformatics workflows to test Collab environment. >>>> >>>> Just thought this is a good opportunity to use as extra help for >>>> testing out the PCAWG dockerized workflows. >>>> >>>> Miguel, Denis and others, what workflows / datasets do you think would >>>> be good for George to run? >>>> >>>> Thanks, >>>> Junjun >>>> >>>> >>>> >>>> From: on >>>> behalf of Denis Yuen >>>> Date: Wednesday, March 1, 2017 at 10:26 AM >>>> To: "docktesters at lists.icgc.org" >>>> Subject: [DOCKTESTERS] Thanks! >>>> >>>> Hi, >>>> >>>> Just wanted to say thanks to Miguel and Jonas for keeping the workflow >>>> testing data page up-to-date. >>>> >>>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data >>>> >>>> >>>> As we work on new versions or debugging, it is invaluable to know what >>>> versions of the workflows have worked outside OICR, thanks! >>>> >>>> >>>> >>>> *Denis Yuen* >>>> Senior Software Developer >>>> >>>> >>>> *Ontario**Institute**for**Cancer**Research* >>>> MaRSCentre >>>> 661 University Avenue >>>> Suite510 >>>> Toronto, Ontario,Canada M5G0A3 >>>> >>>> Toll-free: 1-866-678-6427 >>>> Twitter: @OICR_news >>>> *www.oicr.on.ca * >>>> >>>> This message and any attachments may contain confidential and/or >>>> privileged information for the sole use of the intended recipient. Any >>>> review or distribution by anyone other than the person for whom it was >>>> originally intended is strictly prohibited. If you have received this >>>> message in error, please contact the sender and delete all copies. >>>> Opinions, conclusions or other information contained in this message may >>>> not be that of the organization. 
>>>> >>>> >>>> _______________________________________________ >>>> docktesters mailing list >>>> docktesters at lists.icgc.org >>>> https://lists.icgc.org/mailman/listinfo/docktesters >>>> >>>> >>> The Francis Crick Institute Limited is a registered charity in England >>> and Wales no. 1140062 and a company registered in England and Wales no. >>> 06885462, with its registered office at 1 Midland Road London NW1 1AT >>> >>> >>> _______________________________________________ >>> docktesters mailing list >>> docktesters at lists.icgc.org >>> https://lists.icgc.org/mailman/listinfo/docktesters >>> >>> >> > The Francis Crick Institute Limited is a registered charity in England and > Wales no. 1140062 and a company registered in England and Wales no. > 06885462, with its registered office at 1 Midland Road London NW1 1AT > From George.Mihaiescu at oicr.on.ca Wed Mar 29 12:22:05 2017 From: George.Mihaiescu at oicr.on.ca (George Mihaiescu) Date: Wed, 29 Mar 2017 16:22:05 +0000 Subject: [DOCKTESTERS] Thanks! In-Reply-To: Message-ID: Hi, I had problems installing the gtdownload client, so I changed your scripts to use the icgc-storage client and downloaded the BAMs from Collaboratory, but I think this caused more harm in the end because I diverged from the known-good testing path. 
This is the content of the Dockstore.json file:

root at dockstore4-dkfz:~/PCAWG-Docker-Test# cat tests/DKFZ/DO218695/Dockstore.json
{
  "run-id": "run_id",
  "tumor-bam": { "path":"/root/PCAWG-Docker-Test/data/DO218695/tumor.bam", "class":"File" },
  "normal-bam": { "path":"/root/PCAWG-Docker-Test/data/DO218695/normal.bam", "class":"File" },
  "reference-gz": { "path": "/root/PCAWG-Docker-Test/resources//dkfz-workflow-dependencies_150318_0951.tar.gz", "class": "File" },
  "delly-bedpe": { "path":"/root/PCAWG-Docker-Test/tests/Delly/DO218695/output//DO218695.delly.somatic.sv.bedpe.txt", "class":"File" },
  "germline_indel_vcf_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.germline.indel.vcf.gz", "class": "File" },
  "somatic_snv_mnv_vcf_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.snv.mnv.vcf.gz", "class": "File" },
  "germline_snv_mnv_vcf_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.germline.snv.mnv.vcf.gz", "class": "File" },
  "somatic_cnv_tar_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.cnv.tar.gz", "class": "File" },
  "somatic_cnv_vcf_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.cnv.vcf.gz", "class": "File" },
  "somatic_indel_tar_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.indel.tar.gz", "class": "File" },
  "somatic_snv_mnv_tar_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.snv.mnv.tar.gz", "class": "File" },
  "somatic_indel_vcf_gz": { "path": "/root/PCAWG-Docker-Test/tests/DKFZ/DO218695//output//DO218695.somatic.indel.vcf.gz", "class": "File" }
}

Thank you,
George

From: Miguel Vazquez
Date: Wednesday, March 29, 2017 at 4:09 AM
To: George Mihaiescu
Cc: Jonas Demeulemeester, Junjun Zhang, "docktesters at lists.icgc.org"
Subject: Re: [DOCKTESTERS] Thanks!
tests/DKFZ/DO218695/Dockstore.json
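Since the first DKFZ failure in this thread was a missing dependencies tarball, a quick check of the input paths in Dockstore.json before launching might save a long run. The following is only a sketch, not part of the PCAWG-Docker-Test scripts: the function names are made up for illustration, and it assumes that tumor-bam, normal-bam, reference-gz and delly-bedpe are the inputs while the remaining entries are output destinations that do not exist yet.

```python
import json
import os
import sys

# Assumption: these four keys are the workflow inputs; every other entry
# in Dockstore.json is treated as an output destination and is not checked.
INPUT_KEYS = ("tumor-bam", "normal-bam", "reference-gz", "delly-bedpe")


def missing_inputs(params, input_keys=INPUT_KEYS):
    """Return (key, path) pairs for inputs that are absent on disk.

    A key that is missing from the parameters, or has no "path" field,
    is reported with a path of None.
    """
    missing = []
    for key in input_keys:
        entry = params.get(key)
        if not isinstance(entry, dict) or "path" not in entry:
            missing.append((key, None))
        elif not os.path.isfile(entry["path"]):
            missing.append((key, entry["path"]))
    return missing


def check_params_file(json_path):
    """Load a Dockstore.json-style file and report its missing inputs."""
    with open(json_path) as handle:
        return missing_inputs(json.load(handle))


if __name__ == "__main__" and len(sys.argv) > 1:
    problems = check_params_file(sys.argv[1])
    for key, path in problems:
        print("missing input %s: %s" % (key, path))
    sys.exit(1 if problems else 0)
```

Saved under a hypothetical name such as check_inputs.py, it could be run as "python check_inputs.py tests/DKFZ/DO218695/Dockstore.json" before bin/run_test.sh; a non-zero exit status means at least one input file was not found.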