Single-genome and metagenome de novo assembly

Project duration: 2018 - 2023

Funding: Croatian Science Foundation   

Key collaborator: Niranjan Nagarajan (A*STAR GIS, Singapore)

The first modern software for DNA assembly was developed by Celera to generate the draft of the human genome in 2001. Since then, many methods have attempted to assemble genomes correctly, but a high-quality assembly still requires laborious work by large groups of scientists and many years of data curation. The biggest obstacle to achieving high accuracy and contiguity in genome assemblies has been long stretches of highly repetitive sequence. The recent advent of a new generation of sequencing technologies, such as those from Pacific Biosciences and Oxford Nanopore Technologies, gives hope that automated, complete genome reconstruction is feasible. These technologies produce long but error-prone reads whose length can exceed hundreds of thousands of nucleotides, which should be long enough to span most repetitive regions. Nevertheless, researchers still struggle to completely assemble large genomes (i.e. animal and plant genomes) and the genomes of microbial communities.

Assembly methods usually take a graph-based approach: they first build a graph by joining overlapping reads, and then use heuristics to find a path that visits each read once. However, this is often infeasible because of tangles in the graph, which arise from incorrect read overlaps and repetitive regions. The problem is particularly critical for large genomes with many chromosomes and for metagenomic samples containing anywhere from ten to several hundred genomes.

The primary aim of this project is to develop methods that yield (i) complete assemblies of large genomes and (ii) accurate metagenomic assemblies. To achieve this aim, we will develop several graph-based and machine learning methods for detecting incorrect overlaps.
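As an illustration of the graph-based approach described above, the following is a minimal Python sketch, not the project's actual software. It assumes error-free toy reads and exact suffix-prefix matching; the function names (`overlap_length`, `build_overlap_graph`, `greedy_path`, `spell_path`), the `min_overlap` threshold, and the example reads are all hypothetical and chosen only for illustration. Real assemblers use approximate overlaps and far more elaborate graph simplification.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Longest suffix of `a` that equals a prefix of `b`, or 0 if shorter than min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Directed graph: edge i -> j weighted by the exact overlap between read i and read j."""
    graph = {i: {} for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap_length(reads[i], reads[j], min_overlap)
        if k:
            graph[i][j] = k
    return graph


def greedy_path(graph):
    """Heuristic walk: start at a read with no incoming edge (if any) and
    repeatedly follow the longest overlap to a read not yet visited."""
    has_incoming = {j for edges in graph.values() for j in edges}
    starts = [n for n in graph if n not in has_incoming]
    current = starts[0] if starts else next(iter(graph))
    path, visited = [current], {current}
    while True:
        candidates = {j: k for j, k in graph[current].items() if j not in visited}
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        path.append(current)
        visited.add(current)


def spell_path(reads, graph, path):
    """Concatenate the reads along the path, skipping the overlapping prefixes."""
    sequence = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        sequence += reads[nxt][graph[prev][nxt]:]
    return sequence


# Hypothetical toy reads sampled from the sequence ATGGCGTGCAATCC, in shuffled order.
reads = ["GCGTGCAA", "TGCAATCC", "ATGGCGTG"]
graph = build_overlap_graph(reads, min_overlap=3)
path = greedy_path(graph)
print(spell_path(reads, graph, path))  # prints ATGGCGTGCAATCC
```

In this toy case every read has at most one outgoing edge, so the greedy walk recovers the original sequence. With repeats or spurious overlaps a read gains several competing outgoing edges, the greedy choice can be wrong, and the walk may end before visiting every read. This is the kind of tangle the project description refers to.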

Project team

Project members:

  • Prof. dr. sc. Mile Šikić - head
  • Associate prof. dr. sc. Igor Mekterović
  • Assistant prof. dr. sc. Krešimir Križanović
  • Dr. sc. Niranjan Nagarajan (A*STAR GIS, Singapore)
  • Dr. sc. Nino Antulov-Fantulin (ETH Zurich)
  • Filip Tomas - PhD student
  • Josipa Lipovac - PhD student
  • Rafael Josip Penić - PhD student

Collaboration with other institutions:

  • Prof. Jianjun Liu, Genome Institute of Singapore, A*STAR Singapore
  • Prof. Ken Wing Kin Sung, National University of Singapore
  • Dr. Hwee Kuan Lee, Bioinformatics Institute, A*STAR Singapore
  • Dr. Mike Vella, NVIDIA
  • Prof. Christophe Dessimoz (University of Lausanne)
  • Prof. Marc Robinson-Rechavi (University of Lausanne)
  • Associate prof. Petra Korać (University of Zagreb, Faculty of Science, Department of Biology)
  • Prof. Karin Kovačević Ganić (University of Zagreb, Faculty of Food Technology and Biotechnology)
  • Associate prof. Antonio Starćević (University of Zagreb, Faculty of Food Technology and Biotechnology)

 

Publications

Papers in scientific journals:

Presentations at scientific conferences:

  • Huang, Megan; Šikić, Mile; Influence of Chimeric Sequences on Metagenome Assembly // Sixth International Workshop on Data Science Abstract Book, Virtual conference (2021)
  • Bosnić, Filip; Šikić, Mile; Finding Hamiltonian cycles with graph neural networks // Sixth International Workshop on Data Science Abstract Book, Virtual conference (2021)
  • Josip Marić; Sylvain Riondet; Krešimir Križanović; Niranjan Nagarajan; Mile Šikić; Benchmarking metagenomic classification tools for long read sequencing data // 28th International Conference on Intelligent Systems for Molecular Biology (ISMB) 2020, Virtual conference (2020)
  • Vrček, Lovro; Huang, Megan Hong Hui; Vaser, Robert; Šikić, Mile; Deep learning approach to determining the type of long reads // International Conference on Intelligent Systems for Molecular Biology 2020, Virtual conference (2020)
  • Stanojević, Dominik; Šikić, Mile; Detecting Base Modifications in DNA Sequence // Book of Abstracts of Fifth International Workshop on Data Science, Zagreb, Croatia (2020)
  • Vrček, Lovro; Veličković, Petar; Šikić, Mile; A step towards neural genome assembly // NeurIPS 2020 Learning Meets Combinatorial Algorithms Workshop, Virtual conference (2020)
  • Robert Vaser and Mile Šikić, Yet another de novo genome assembler, 2019, 11th International Symposium on Image and Signal Processing and Analysis (ISPA)
  • Sara Bakić, Luka Požega, Robert Vaser and Mile Šikić, Assessing sequencing data for genome assembly, 2019, 27th Conference on Intelligent Systems for Molecular Biology and the 18th European Conference on Computational Biology, poster
  • Marić, J.; Šikić, M. Approaches to metagenomic classification and assembly // MIPRO, Biomedical Engineering, Opatija: IEEE, 2019.
  • Vrček, Lovro; Šikić, Mile; Supervised learning approach to long read classification // Fourth International Workshop on Data Science Abstract Book, Zagreb, Croatia, 2019, pp. 71-72, poster

Doctoral dissertations:

  • Robert Vaser, Algorithms for de Novo Assembly of Large Genomes, 2019, (pdf)

Graduate and undergraduate theses:

  • Martinović, I. Combining protein and RNA structures information in developing new scoring functions (2022) (pdf)
  • Šarić, J. Evaluation of RNA Atom Distance Prediction Models (2022) (pdf)
  • Lipovac, J. Detection of Modified Nucleotide Clusters in Nanopore Sequenced RNA Reads (2021) (pdf)
  • Penić, R.J. Deep Learning Model of Nanopore Sequencing Pore (2021) (pdf)
  • Deur, S. Detection of Modified Nucleotides Using Nanopore Sequencing and Deep Learning Methods (2021) (pdf)
  • Bakić, S. Rapid Microbe Detection Using Deep Learning (2021) (pdf)
  • Pavlić, S. DNA Nanopore Sequencing Basecaller (2021) (pdf)
  • Pratljačić, S. Fast Overlapping Single Molecule Highly Accurate Sequencing Data (2021) (pdf)
  • Klobučar, I. Data Structure for Efficient Storage of Genome Sequencing Data (2021) (pdf)
  • Rašić, M. Generalization of Partial Order Alignment (2021) (pdf)
  • Staver, M. Rapid Alignment of High-Fidelity Sequencing Data (2021) (pdf)
  • Babojelić, D. Overlapping Single Molecule High-Fidelity Sequencing Data (2020)
  • Paulinović, M. Microbe Detection Using Signal Processing and Locality Sensitive Hashing (2020)
  • Brekalo, T. De Novo Metagenome Assembly Using Third Generation Sequencing Data (2020)
  • Martinović, I. Pipeline for Detection Clusters of Modified Nucleotides in Nanopore Sequenced RNA Reads (2020)
  • Yatsukha, R. De Novo Diploid Assembly Using Third-Generation Sequencing Data (2020)
  • Wolf, F. Genome Scaffolding Using Hi-C Sequencing (2020)
  • Floreani, F. Classification of 1D-Signal Types Using Deep Learning (2019)
  • Lipovac, J. Evaluation of Tools for Species Identification in Metagenomic Samples (2019)
  • Batić, D. Mapping a Sequence to a Graph (2019)
  • Pongračić, K. Long Read Mapping (2019)
  • Pavlić, S. Short Read Mapping (2019)
  • Penić, R. J. Building a Library for Pairwise Alignment of Long RNA Reads (2019)
  • Kosier, S. Finding Gene Variants from Sequencing Data (2019)
  • Relić, B. Read Classification Using Deep Learning Methods (2019)
  • Bakić, S. Reference-Guided De Novo Genome Assembly (2019)
  • Vrček, L. DNA Sequence Polishing Using Deep Learning Methods (2019)
  • Požega, L. Upper Bound in Genome Assembly (2019)