Machine learning-based classification of DNA sequences for diabetes mellitus type prediction
Abstract
This study uses machine learning (ML) to classify DNA sequences and predict diabetes risk. Using the INS Insulin Dataset, it explores several preprocessing strategies, including k-mer representation, ordinal encoding, oversampling, and min-max normalization, applied to DNA sequences from diabetic and non-diabetic subjects. Feature selection techniques such as F-regressor and Mutual Information were used to improve model performance. Four classifiers, Random Forest, Decision Tree, Gaussian Naïve Bayes, and Support Vector Machine (SVM), were compared on accuracy, precision, recall, and F1-score. Random Forest achieved the highest accuracy (0.89 with F-regressor), followed by SVM and Decision Tree, while Gaussian Naïve Bayes showed moderate performance. The findings highlight the effectiveness of machine learning in uncovering genetic patterns associated with diabetes and emphasize the potential of DNA-based predictive modeling in precision medicine. This work contributes to advancing computational genomics and provides a foundation for early diagnosis and personalized treatment strategies for diabetes mellitus.
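As an illustration of the preprocessing steps named above, the following is a minimal sketch in plain Python of k-mer counting, ordinal encoding, and min-max normalization. It is not the authors' implementation: the function names, the choice of k = 3, the A/C/G/T to 1/2/3/4 mapping, and the handling of unknown bases and constant columns are all assumptions made for the example. Oversampling, feature selection, and the classifiers themselves are omitted.

```python
from collections import Counter
from itertools import product

# Assumed ordinal mapping; the study's actual mapping is not stated in the abstract.
ORDINAL = {"A": 1, "C": 2, "G": 3, "T": 4}


def kmers(seq, k=3):
    """Return the overlapping k-mers of a DNA sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, k=3):
    """Count every possible k-mer over ACGT in fixed lexicographic order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmers(seq, k))
    return [counts.get(km, 0) for km in vocab]


def ordinal_encode(seq):
    """Map each base to an integer; unknown symbols (e.g. N) become 0."""
    return [ORDINAL.get(base, 0) for base in seq.upper()]


def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


if __name__ == "__main__":
    # Toy sequences, not taken from the INS Insulin Dataset.
    sequences = ["ACGTACGT", "AAAACCCC", "GGGTTTAA"]
    features = [kmer_counts(s, k=2) for s in sequences]
    scaled = min_max_scale(features)
    print(len(scaled), "samples x", len(scaled[0]), "k-mer features")
```

In a pipeline like the one described, the scaled feature matrix would then be passed to oversampling, feature selection, and a classifier.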
© 2025 Albegli Ahmed Hasan Ahmed, Kusum Yadav, published by Future Sciences For Digital Publishing
This work is licensed under the Creative Commons Attribution 4.0 License.