Benchmarking Robust Machine Learning Models Under Data Imperfections in Real-World Data Science Scenarios
DOI: https://doi.org/10.61453/jods.v20260103

Keywords: Robust Machine Learning, Data Quality, Benchmarking, Model Evaluation, Real-World Data

Abstract
Machine learning systems deployed in real-world environments frequently encounter data imperfections such as noise, missing values, class imbalance, and distribution shifts. Despite substantial progress in model development, most evaluation protocols rely on clean benchmark datasets, creating a gap between laboratory performance and operational reliability. Existing robustness studies often focus on isolated perturbation types or single model families, lacking a unified benchmarking framework. This study proposes a structured and reproducible benchmarking methodology to systematically evaluate model robustness under controlled data degradation scenarios. Multiple classical machine learning algorithms and deep learning models were assessed across diverse benchmark datasets. Controlled perturbations—including feature noise, label corruption, missingness mechanisms, imbalance ratios, and covariate shifts—were introduced at progressive levels. Performance was evaluated using predictive metrics, robustness degradation rate (RDR), and computational efficiency, with statistical validation across repeated experimental runs. Results indicate that ensemble-based methods consistently achieved the strongest robustness, maintaining degradation rates below 10% under moderate noise and imbalance conditions. Deep neural networks demonstrated superior clean-data accuracy but experienced sharper degradation under structured corruption and distribution shifts. Mitigation strategies such as regularization and resampling reduced degradation by 5–12% under moderate perturbations but showed limited effectiveness under extreme conditions. The findings demonstrate that robustness is multidimensional and dependent on alignment between model inductive bias and data imperfection type. The proposed benchmarking framework provides practical guidance for selecting machine learning models suited to imperfect data environments, advancing reliable and deployment-ready AI systems.
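The abstract describes injecting perturbations at progressive levels and scoring models by a robustness degradation rate (RDR). The paper's exact RDR formula is not reproduced here; a minimal sketch, assuming RDR is defined as the relative drop from clean-data accuracy, could look like the following (the feature-noise scheme, noise levels, and synthetic dataset are illustrative assumptions, not the study's protocol):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic task standing in for the paper's benchmark datasets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
clean_acc = accuracy_score(y_test, model.predict(X_test))


def add_feature_noise(X, level, rng):
    """Gaussian feature noise scaled to each feature's standard deviation."""
    return X + rng.normal(0.0, level * X.std(axis=0), X.shape)


results = {}
for level in (0.1, 0.3, 0.5):  # progressive perturbation levels (assumed)
    noisy_acc = accuracy_score(
        y_test, model.predict(add_feature_noise(X_test, level, rng))
    )
    # Assumed RDR definition: relative accuracy loss vs. the clean baseline.
    results[level] = (clean_acc - noisy_acc) / clean_acc
    print(f"noise level {level}: accuracy {noisy_acc:.3f}, RDR {results[level]:.1%}")
```

In this formulation an RDR below 10% at moderate noise, as the abstract reports for ensemble methods, means the model retains over 90% of its clean-data accuracy; the same loop can be repeated over label corruption, missingness, imbalance, or covariate-shift perturbations to build the full benchmark grid.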
Copyright (c) 2026 Journal of Data Science

This work is licensed under a Creative Commons Attribution 4.0 International License.