Dr. Khaled Elsayed Ahmed :: Publications:

Title:
Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder
Authors: Fayroz F. Sherif, Khaled S. Ahmed
Year: 2022
Keywords: SARS-CoV-2, Unsupervised clustering, Deep learning, Convolutional autoencoder, Spike protein, Lineages
Journal: Journal of Engineering and Applied Science
Volume: 16
Issue: Not Available
Pages: Not Available
Publisher: Not Available
Local/International: Local
Paper Link: Not Available
Full paper: Not Available
Supplementary materials: Not Available
Abstract:

Identifying the population structure of SARS-CoV-2 could have a substantial impact on public health management and diagnostics. Rapidly monitoring and characterizing the lineages circulating globally is critical for more accurate diagnosis, improved care, and faster treatment. Clustering the sequencing data is essential for a clearer picture of the SARS-CoV-2 population structure. Here, deep clustering techniques were used to automatically group 29,017 different strains of SARS-CoV-2 into clusters. We aim to identify the main clusters of the SARS-CoV-2 population structure using a convolutional autoencoder (CAE) trained on numerical feature vectors mapped from coronavirus Spike peptide sequences. Our clustering findings revealed six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). These clusters contained 43 unique lineages, across which the 29,017 publicly accessible strains were distributed. In all six resulting clusters, the genetic distances within the same cluster (intra-cluster distances) are smaller than the distances between clusters (inter-cluster distances) (P-value 0.0019, Wilcoxon rank-sum test). This provides substantial evidence of a connection between the lineages within each cluster. Furthermore, the proposed deep learning clustering method was compared against K-means and hierarchical clustering. The intra-cluster genetic distances of the proposed method were smaller than those of K-means alone and of hierarchical clustering. We used t-distributed stochastic neighbor embedding (t-SNE) to visualize the outcomes of the deep learning clustering. The strains were correctly separated between clusters in the t-SNE plot. Our results showed that cluster C5 contains only the Gamma lineage (P.1), suggesting that strains of P.1 in C5 are more diversified than those in the other clusters.
Our study indicates that the genetic similarity between strains in the same cluster enables a better understanding of the major features of the unknown population lineages when compared to some of the more prevalent viral isolates. This information helps researchers figure out how the virus changed over time and spread to people all over the world.
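
The intra-cluster versus inter-cluster comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy peptide strings, the cluster labels, the use of Hamming distance as the "genetic distance", and the helper names `hamming`, `split_distances`, and `prob_intra_smaller`. The last helper computes the normalized rank-sum (Mann-Whitney U) statistic, which is the effect size underlying a Wilcoxon rank-sum test. It does not compute a P-value.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def split_distances(seqs, labels):
    """Split all pairwise distances into intra-cluster and inter-cluster lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return intra, inter


def prob_intra_smaller(intra, inter):
    """Normalized Mann-Whitney U: share of (intra, inter) pairs where the
    intra-cluster distance is smaller, counting ties as one half."""
    wins = sum((a < b) + 0.5 * (a == b) for a in intra for b in inter)
    return wins / (len(intra) * len(inter))


# Hypothetical toy data: six short peptide fragments in two assumed clusters.
seqs = ["MFVFLV", "MFVFLA", "MFAFLV", "MKTYIV", "MKTYIA", "MKSYIV"]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(seqs, labels)
effect = prob_intra_smaller(intra, inter)
print(sorted(intra), sorted(inter), effect)
```

A value of `effect` close to 1 means intra-cluster distances are almost always smaller than inter-cluster distances, which is the pattern the abstract reports. For a P-value on real data, a statistics library routine such as `scipy.stats.ranksums` would be the usual choice.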