Benchmarking Robust Machine Learning Models Under Data Imperfections in Real-World Data Science Scenarios
DOI: https://doi.org/10.61453/jods.v20260103

Keywords: Robust Machine Learning, Data Quality, Benchmarking, Model Evaluation, Real-World Data

Abstract
Machine learning systems deployed in real-world environments frequently encounter data imperfections such as noise, missing values, class imbalance, and distribution shifts. Despite substantial progress in model development, most evaluation protocols rely on clean benchmark datasets, creating a gap between laboratory performance and operational reliability. Existing robustness studies often focus on isolated perturbation types or single model families, lacking a unified benchmarking framework. This study proposes a structured and reproducible benchmarking methodology to systematically evaluate model robustness under controlled data degradation scenarios. Multiple classical machine learning algorithms and deep learning models were assessed across diverse benchmark datasets. Controlled perturbations—including feature noise, label corruption, missingness mechanisms, imbalance ratios, and covariate shifts—were introduced at progressive levels. Performance was evaluated using predictive metrics, robustness degradation rate (RDR), and computational efficiency, with statistical validation across repeated experimental runs. Results indicate that ensemble-based methods consistently achieved the strongest robustness, maintaining degradation rates below 10% under moderate noise and imbalance conditions. Deep neural networks demonstrated superior clean-data accuracy but experienced sharper degradation under structured corruption and distribution shifts. Mitigation strategies such as regularization and resampling reduced degradation by 5–12% under moderate perturbations but showed limited effectiveness under extreme conditions. The findings demonstrate that robustness is multidimensional and dependent on alignment between model inductive bias and data imperfection type. The proposed benchmarking framework provides practical guidance for selecting machine learning models suited to imperfect data environments, advancing reliable and deployment-ready AI systems.
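The abstract describes injecting perturbations at progressive levels and scoring models by a robustness degradation rate (RDR). The paper's exact RDR formula is not reproduced here; a minimal sketch, assuming RDR is defined as the relative drop from clean-data accuracy, could look like the following (the feature-noise scheme, noise levels, and synthetic dataset are illustrative assumptions, not the study's protocol):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative synthetic task standing in for the paper's benchmark datasets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
clean_acc = accuracy_score(y_test, model.predict(X_test))


def add_feature_noise(X, level, rng):
    """Gaussian feature noise scaled to each feature's standard deviation."""
    return X + rng.normal(0.0, level * X.std(axis=0), X.shape)


results = {}
for level in (0.1, 0.3, 0.5):  # progressive perturbation levels (assumed)
    noisy_acc = accuracy_score(
        y_test, model.predict(add_feature_noise(X_test, level, rng))
    )
    # Assumed RDR definition: relative accuracy loss vs. the clean baseline.
    results[level] = (clean_acc - noisy_acc) / clean_acc
    print(f"noise level {level}: accuracy {noisy_acc:.3f}, RDR {results[level]:.1%}")
```

In this formulation an RDR below 10% at moderate noise, as the abstract reports for ensemble methods, means the model retains over 90% of its clean-data accuracy; the same loop can be repeated over label corruption, missingness, imbalance, or covariate-shift perturbations to build the full benchmark grid.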
Copyright (c) 2026 Journal of Data Science

This work is licensed under a Creative Commons Attribution 4.0 International License.