[DOCKTESTERS] Thanks!

Junjun Zhang Junjun.Zhang at oicr.on.ca
Tue Mar 14 10:21:01 EDT 2017


Hi Jonas,

Much appreciated for your kind offer. Once we get new alignment result using lane level BAM input, it should be easier to diagnosis the mismatches.

Regards,
Junjun


From: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>
Date: Tuesday, March 14, 2017 at 9:49 AM
To: Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>>
Cc: Junjun Zhang <junjun.zhang at oicr.on.ca<mailto:junjun.zhang at oicr.on.ca>>, Keiran Raine <kr2 at sanger.ac.uk<mailto:kr2 at sanger.ac.uk>>, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi Miguel,

I’ll have a go at modifying your scripts to do this kind of preprocessing.

As to why alignment by lane level vs alignment of a single merged bam would result in only 3% discrepancies, I can imagine that read lengths etc may not be that different between the different libraries (for our tested donors at least).
Please correct me if I’m wrong though!

Best regards,
Jonas

_________________________________
Jonas Demeulemeester, PhD
Postdoctoral Researcher
The Francis Crick Institute
1 Midland Road
London
NW1 1AT

T: +44 (0)20 3796 2594
M: +44 (0)7482 070730
E: jonas.demeulemeester at crick.ac.uk<%22mailto:>
W: www.crick.ac.uk<%22http://>



On 14 Mar 2017, at 12:44, Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>> wrote:

Hi Junjun and Keiran,

I'm sorry guys, but his is too alien for me, this was never my area of expertise. I'm going to need someone to write a script for me that takes a BAM file and turns it into what ever I need to run BWA-Mem on. At least pseudo-code or something that I can start with.

I think perhaps someone more knowledgeable than me should consider if this procedure as a whole is acceptable in terms of reproducibility, and how would be best to document it or if it could possibly be improved.

Also, I don't think I understand the nature of the problem because from what I can fathom this problem should have either broken the process or render a much larger of discrepancies than 3%. Can someone explain in layman words how can only 3% of reads be affected?

Best regards

Miguel



On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>> wrote:
Hi Kieran,

Thanks for the detailed explanation. So, in order to reproduce PCAWG BWA MEM alignment result, one must use lane level BAMs (one lane one BAM) as input.

A processing is needed to prepare lane level BAMs from merged BAM.

@Migual, hope this is helpful. Let us know if you have any other questions.

Best regards
Junjun

On Mar 14, 2017, at 5:16 AM, Keiran Raine <kr2 at sanger.ac.uk<mailto:kr2 at sanger.ac.uk>> wrote:

Hi Junjun,

You won't be able to separate out the readgroups in the headers if the input is a merged BAM file .  If there are different libraries, read lengths etc it will cause problems for insert-size determination (used in determining proper-pairs) and result in inter-library duplicate removal (by definition reads from different libraries can't be duplicates).

If you really need to do it this way you'd have to add a pre-processing step, bamtofastq can split a BAM into it's component readgroups in a single pass.

Regards,

Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk<mailto:kr2 at sanger.ac.uk>
Tel:+44 (0)1223 834244 Ext: 4983<tel:+44%201223%20834244>
Office: H104

On 13 Mar 2017, at 21:16, Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>> wrote:

Hi Keiran,

Can you please comment on this, i.e., comparison between alignment done lane by lane v.s. done with all lanes mixed?

Basically, we are trying to prepare input BAMs for testing PCAWG BWA MEM workflow. The starting point is the aligned BAM because we don't have the unaligned lane BAM any more. The key point here is: should input BAM organized by lanes, one lane one BAM? Or just one BAM containing all lanes?

Thanks,
Junjun



From: Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>>
Date: Monday, March 13, 2017 at 2:31 PM
To: Junjun Zhang <junjun.zhang at oicr.on.ca<mailto:junjun.zhang at oicr.on.ca>>
Cc: George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun

About the unaligned BAM files, in fact I do have them for the two test I've ran. I could put them available for George but I think he could just as well produce them on site, since he might have to do that anyway. But we can always explore that option, though right now I don't know of a simple way to move these files around.

About the number of lanes let me just say good grief! This is the first time I hear about it. So if I understand you correctly I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the 3% mismatches? A similar percentage was found in the HCC1143, could this be the reason for that as well? Also I asked Keiran about these headers and he said there where OK. If you could please confirm that I need to do this extended process I'd be grateful, because its quite involved and there are concepts here I'm not familiar with.

Regards

Miguel


On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>> wrote:
Hi Miguel,

I thought you kept the unaligned sequence you prepared for the testing.

Following your link about preparing unaligned input, I found this: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35, which actually could explain the high mismatch rate.

When BWA MEM workflow runs, the alignments are done one lane level BAM at a time, then merge the aligned BAM later: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201

I see the script prepare_unaligned.sh always generates one read group (i.e., lane) for normal or tumour, no matter how many read groups (lanes) in the aligned BAMs. This has big impact on the alignment result when lanes are aligned independently comparing aligned altogether.

The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but it only works when the input is single lane BAM file:  https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles

So, I think in order to perform testing alignment workflow properly, we will need to prepare lane level unaligned BAM (one lane one BAM) as inputs. For example, this aligned BAM: https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63, it has 7 read groups (search for read_group). It needs to be converted to 7 individual lane level BAM files.

Not sure whether it's the best way to do BAM splitting, but here is someone's Python code to do it: https://gist.github.com/seandavi/2014542

Hope this helps,
Junjun



From: Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>>
Date: Monday, March 13, 2017 at 1:01 PM
To: George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>
Cc: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>, Junjun Zhang <junjun.zhang at oicr.on.ca<mailto:junjun.zhang at oicr.on.ca>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The analigned BAM files are not available as far as I know, rather you must unalign the final BAM files, the normal ones you get from ICGC or GNOS. This process is also in my scripts, as you see here:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32

About the steps in the workflows, I don't know them myself. I think you'll need to ask the developers, and not all workflows use the same underlying workflow enactment tool. Not an easy answer



On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>> wrote:
Junjun told me this would provide value to the testing process, so I would like to kick off a test of the BWA_mem docker.
Can somebody provide some quick instructions and the location of the unaligned BAM files that were used already?

Also, do we have somewhere the steps involved in each workflow, so I can get an idea of how far they are while running?
For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50 steps from finish…

Thank you,
George

From: Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>>
Date: Monday, March 13, 2017 at 8:52 AM

To: George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>
Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>>, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Answers inline

On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>> wrote:
Hi Miguel,

I've started the test by running "bin/run_test.sh Sanger DO50398", so I guess with just one workflow running it should complete faster than two weeks.

I think it still should take a long time. My scripts will run one workflow after another.


Because I'm running in Collaboratory I've changed the "get_icgc_donor.sh" script to use a docker container that has the icgc client inside and pull data from Collaboratory. There is no "bam.bas" file downloaded, just a ".bam" and a ".bam.bai" files, not sure if this is an issue.


I wondered the same thing first time I did this, but this file is produced by the pipeline. There was some problem with this that was dealt with by the developers and updated in the docker. So I think you won't have a problem

By looking at the "bin/compare_result_type.sh" it looks like it's using the gnos client to pull down the existing VCF files for comparison reasons, but I think we store those files in Collaboratory as well, so I'll work with Junjun to adapt the script for this.


Let me know if you need any help

I think I initially tried to run the DKFZ workflow, but it complained about having to run Delly first, so I abandoned this for now.

Yes, if you look at the run_batch.sh you will see that when using DKFZ it will always run Delly first. Delly prepares some files the the  DKFZ file needs, namely related to copy number I believe.


I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.

Remember that you will need to add the relevant has-keys for the different files in the etc/donor_files.csv. Its a bit tedious right now. You need to go to the ICGC DCC and find these codes manually for the files you need. Ask me if you need help. Once you have all you can run all the workflows for that donor and evaluate results.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv


Regards

Miguel


George

From: Miguel Vazquez <mikisvaz at gmail.com<mailto:mikisvaz at gmail.com>>
Date: Monday, March 13, 2017 at 6:53 AM
To: George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>
Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>>, Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

The Sanger workflow is very lengthy, it takes about two weeks in my tests.

About correctness, my scripts also cover that part, if you are not using them they might still help you to clarify how we do it. The idea is to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for both germline and somatic and compare it with the result uploaded to GNOS (not all pipelines produce all files). This is the relevant part in the run_batch.sh script:

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46

The bin/compare_result_type.sh script will take care of downloading the correct file from GNOS and running the comparison. The comparison itself is simple since all files are VCFs, it consists in taking out the variants in terms of chromosome, position, reference and alternative allele and measuring the overlaps.

https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh

About which donors to test, DO52140 is one Jonas and I have both tested and could be interesting to get a third opinion. Also, any other donor could be interesting to see if something new comes up. I'm not sure which options is best.

Miguel




On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>> wrote:
Hi,

I've started Sanger on DO50398 and it's been running for more than 24 hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59"

I just started a second run on a different VM on same donor, just to compare run times.
The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send some monitoring graphs when it finishes the workflow, but I have no idea how to check its correctness.

Give me a list of donors and what workflows you want me to run and I'll try to schedule them tomorrow.

George


From: Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>>
Date: Sunday, March 12, 2017 at 10:45 PM
To: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>
Cc: Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>>, Denis Yuen <Denis.Yuen at oicr.on.ca<mailto:Denis.Yuen at oicr.on.ca>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Thanks Miguel and Jonas for your help here!

Do you have any update on the latest testing? Please feel free updating the wiki with any update: https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference

Regards,
Junjun



From: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk<mailto:Jonas.Demeulemeester at crick.ac.uk>>
Date: Saturday, March 11, 2017 at 7:15 PM
To: George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>
Cc: Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>>, Junjun Zhang <junjun.zhang at oicr.on.ca<mailto:junjun.zhang at oicr.on.ca>>, Denis Yuen <Denis.Yuen at oicr.on.ca<mailto:Denis.Yuen at oicr.on.ca>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi George,

Yup, I've been running the PCAWG dockers mainly using Miguel's set of scripts.
Give them a go and if you run into issues, just let us know!

Cheers,
Jonas


On 11 Mar 2017, at 17:00, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>> wrote:

Sure, I'll give it a try and report later.

Thank you,
George Mihaiescu
Senior Cloud Architect

Ontario Institute for Cancer Research
MaRS Centre
661 University Avenue
Suite 510
Toronto, Ontario
Canada M5G 0A3

Email: George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>
Toll-free: 1-866-678-6427
Twitter: @OICR_news

www.oicr.on.ca<http://www.oicr.on.ca/>

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.



From: Miguel Vazquez <miguel.vazquez at cnio.es<mailto:miguel.vazquez at cnio.es>>
Date: Saturday, March 11, 2017 at 10:57 AM
To: Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>>
Cc: Denis Yuen <Denis.Yuen at oicr.on.ca<mailto:Denis.Yuen at oicr.on.ca>>, Jonas Demeulemeester <jonas.demeulemeester at crick.ac.uk<mailto:jonas.demeulemeester at crick.ac.uk>>, George Mihaiescu <George.Mihaiescu at oicr.on.ca<mailto:George.Mihaiescu at oicr.on.ca>>, "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: Re: [DOCKTESTERS] Thanks!

Hi Junjun,

I think Jonas has been using my scripts to run some of the tests, maybe George could try them as well, it should be very easy for him to try the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.

https://github.com/mikisvaz/PCAWG-Docker-Test

He would just need to update the tokens for DACO access and the scripts will take care of downloading the BAM files, running the workflows and evaluating the result.

The documentation there is reasonably updated, but if this sounds good then perhaps he could contact me and I could walk him through the details.

Best regards

Miguel

On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca<mailto:Junjun.Zhang at oicr.on.ca>> wrote:
Dear Docktesters,

George Mihaiescu, cloud architect, of the Collaboratory at OICR plans to run some bioinformatics workflows to test Collab environment.

Just thought this is a good opportunity to use as extra help for testing out the PCAWG dockerized workflows.

Miguel, Denis and others, what workflows / datasets do you think would be good for George to run?

Thanks,
Junjun



From: <docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org<mailto:docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org>> on behalf of Denis Yuen <Denis.Yuen at oicr.on.ca<mailto:Denis.Yuen at oicr.on.ca>>
Date: Wednesday, March 1, 2017 at 10:26 AM
To: "docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>" <docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>>
Subject: [DOCKTESTERS] Thanks!


Hi,

Just wanted to say thanks to Miguel and Jonas for keeping the workflow testing data page up-to-date.

https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data

As we work on new versions or debugging, it is invaluable to know what versions of the workflows have worked outside OICR, thanks!



Denis Yuen
Senior Software Developer

OntarioInstituteforCancerResearch
MaRSCentre
661 University Avenue
Suite510
Toronto, Ontario,Canada M5G0A3

Toll-free: 1-866-678-6427
Twitter: @OICR_news
www.oicr.on.ca<http://www.oicr.on.ca/>

This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization.

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>
https://lists.icgc.org/mailman/listinfo/docktesters



The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT

_______________________________________________
docktesters mailing list
docktesters at lists.icgc.org<mailto:docktesters at lists.icgc.org>
https://lists.icgc.org/mailman/listinfo/docktesters









The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.icgc.org/mailman/private/docktesters/attachments/20170314/d9a4351f/attachment-0001.html>


More information about the docktesters mailing list