Identifying the population structure of SARS-CoV-2 could have a substantial impact on public health
management and diagnostics. It is critical to rapidly monitor and
characterize the lineages circulating globally for more accurate diagnosis, improved
care, and faster treatment. For a clearer picture of the SARS-CoV-2 population structure,
clustering the sequencing data is essential. Here, deep clustering techniques were used
to automatically group 29,017 different strains of SARS-CoV-2 into clusters. We aim to
identify the main clusters of the SARS-CoV-2 population structure using a convolutional
autoencoder (CAE) trained on numerical feature vectors mapped from coronavirus
Spike peptide sequences. Our clustering findings revealed six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). These clusters contained 43 unique
lineages, across which the 29,017 publicly accessible strains were distributed. In all six resulting clusters, the genetic distances within a cluster (intra-cluster distances)
are smaller than the distances between clusters (inter-cluster distances; P-value 0.0019, Wilcoxon rank-sum
test). This provides substantial evidence that the lineages within each cluster are related. Furthermore, the proposed deep learning clustering method
was compared against the K-means and hierarchical clustering methods. The intra-cluster genetic distances of the proposed method were smaller than those of K-means
alone and of hierarchical clustering. We used t-distributed stochastic neighbor
embedding (t-SNE) to visualize the outcomes of the deep learning clustering. The strains
were clearly separated between clusters in the t-SNE plot. Our results showed that the
C5 cluster contains only the Gamma lineage (P.1), suggesting that strains of P.1
in C5 are more diversified than those in the other clusters. Our study indicates that the
genetic similarity between strains in the same cluster enables a better understanding
of the major features of the unknown population lineages when compared to some of
the more prevalent viral isolates. This information helps researchers figure out how the
virus changed over time and spread to people all over the world.
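
The comparison of intra-cluster and inter-cluster genetic distances described above can be illustrated with a minimal sketch. It is not the pipeline used in this study. The toy peptide fragments, the cluster labels, and the choice of p-distance (fraction of differing aligned positions) as the genetic distance are all assumptions made for illustration, and the rank-sum test uses a plain normal approximation without a tie correction.

```python
from itertools import combinations
from math import erfc, sqrt
from statistics import mean

# Hypothetical aligned peptide fragments and cluster labels (illustration only).
SEQS = ["ACDEFGHI", "ACDEFGHK", "ACDEFGHL",
        "MNPQRSTV", "MNPQRSTW", "MNPQRSTY"]
LABELS = [0, 0, 0, 1, 1, 1]


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def split_distances(seqs, labels):
    """Return (intra, inter): pairwise distances within and between clusters."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = p_distance(seqs[i], seqs[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Ties receive average ranks; the variance is NOT tie-corrected, so the
    p-value is only approximate when many values are tied.
    Returns (z, p).
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


if __name__ == "__main__":
    intra, inter = split_distances(SEQS, LABELS)
    z, p = rank_sum_test(intra, inter)
    print(f"mean intra-cluster distance: {mean(intra):.3f}")
    print(f"mean inter-cluster distance: {mean(inter):.3f}")
    print(f"rank-sum z = {z:.3f}, two-sided p = {p:.4f}")
```

A negative z statistic with a small p-value means the intra-cluster distances rank lower than the inter-cluster distances, which is the pattern reported for the six clusters. For real data, a tie-corrected or exact test from a statistics library would be a more careful choice than this approximation.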