

David Couvin : NucleScores: Toward a Better Understanding of Genomic Sequences



As genomic sequence databases grow exponentially, the need for reliable metrics to assess the quality and biological integrity of genomic assemblies has become crucial. Conventional metrics such as N50 provide information on contiguity but often do not reflect the underlying biological coherence of an assembly. NucleScore, an empirically derived nucleotide ratio introduced in the getSequenceInfo (gSeqI) software suite, represents an innovative approach to evaluating genomic assemblies based on intrinsic nucleotide distribution patterns.
NucleScore is computed from localized nucleotide composition to provide a benchmark for assembly quality across diverse taxa, including bacteria, viruses, and eukaryotes. Initial implementations in the nucleScore.pl tool demonstrate that the metric can distinguish high-quality reference genomes from fragmented assemblies by detecting deviations from expected species-specific nucleotide signatures. However, as an empirical ratio, the current NucleScore relies on static thresholds that may not capture the complex, non-linear genomic architectures found in non-model organisms or in highly repetitive regions.
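The exact NucleScore formula is not given in this abstract; as a hypothetical illustration of the general idea of a window-based nucleotide metric, the sketch below scores a sequence by how far each window's base composition deviates from the genome-wide composition. The function names, the window size, and the averaging scheme are all illustrative assumptions, not the published definition.

```python
from collections import Counter

def nucleotide_fractions(seq):
    """Return the fraction of each base A, C, G, T in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

def window_deviation_score(seq, window=100):
    """Illustrative window-based score (NOT the published NucleScore):
    mean absolute deviation of per-window base fractions from the
    genome-wide composition. Lower values indicate a more homogeneous
    nucleotide distribution along the sequence."""
    global_frac = nucleotide_fractions(seq)
    deviations = []
    for start in range(0, len(seq) - window + 1, window):
        local = nucleotide_fractions(seq[start:start + window])
        deviations.append(
            sum(abs(local[b] - global_frac[b]) for b in "ACGT") / 4
        )
    return sum(deviations) / len(deviations)
```

A perfectly homogeneous sequence scores 0.0, while a sequence whose composition shifts between regions (as a chimeric or contaminated assembly might) scores higher, which is the kind of signal a composition-based quality metric can exploit.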

We propose to improve NucleScore by integrating deep learning (DL) architectures, specifically convolutional neural networks (CNNs) and transformers such as DNABERT. Trained on large, high-quality reference datasets (e.g., RefSeq), an AI-enhanced NucleScore could go beyond simple ratios to recognize sophisticated k-mer signatures and long-range dependencies. Machine learning models such as gradient-boosted ensembles could then combine NucleScore with other metadata (e.g., GC content and isolation source) to produce a "confidence score" for newly sequenced assemblies and to identify new biomarkers.
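No model architecture is specified in the abstract; as a minimal sketch of the standard preprocessing step that CNN- and transformer-based sequence models require, the function below one-hot encodes a DNA sequence into per-base (A, C, G, T) vectors. Mapping ambiguous bases such as N to a uniform 0.25 vector is one common convention, assumed here for illustration.

```python
def one_hot_encode(seq):
    """One-hot encode a DNA sequence as CNN/transformer input:
    each base becomes a 4-element (A, C, G, T) vector; ambiguous
    bases (e.g. N) map to a uniform 0.25 vector by convention."""
    mapping = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [mapping.get(base, [0.25] * 4) for base in seq.upper()]
```

The resulting length-by-4 matrix is the typical input to a 1-D convolutional layer, whose learned filters correspond to weighted k-mer detectors; transformer models like DNABERT instead tokenize the sequence into overlapping k-mers before embedding.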

 
