Digital Library[ Search Result ]
A Sort and Merge Method for Genome Variant Call Format (GVCF) Files using Parallel and Distributed Computing
JinWoo Lee, Jung-Im Won, JeeHee Yoon
http://doi.org/10.5626/JOK.2021.48.3.358
With the development of next-generation sequencing (NGS) techniques, a large volume of genomic data is being produced and accumulated, and parallel and distributed computing has become an essential tool. Generally, NGS data processing entails two main steps: obtaining read alignment results in BAM format and extracting variant information in genome variant call format (GVCF) or variant call format (VCF). However, each step requires a long execution time due to the size of the data. In this study, we propose a new GVCF file sorting/merging module using distributed parallel clusters to shorten the execution time. In the proposed algorithm, Spark is used as a distributed parallel cluster. The sorting/merge process is performed in two steps according to the structural characteristics of the GVCF file in order to use the resources in the cluster efficiently. The performance was evaluated by comparing our method with the GATK"s CombineGVCFs module based on sorting and merging execution time of multiple GVCF files. The outcomes suggest the effectiveness of the proposed method in reducing execution time. The method can be used as a scalable and powerful distributed computing tool to solve the GVCF file sorting/merge problem.
Search

Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr