European Commission. Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52021PC0206 (2021).

Bishop, C. M. & Nasrabadi, N. M. Pattern Recognition and Machine Learning, Vol. 4 (Springer, 2006).

Tjoa, E. & Guan, C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 32, 4793–4813 (2021).

Minh, D., Wang, H. X., Li, Y. F. & Nguyen, T. N. Explainable artificial intelligence: a comprehensive review. Artif. Intell. Rev. 55, 3503–3568 (2022).

Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 267, 1–38 (2019).

Saporta, A. et al. Benchmarking saliency methods for chest X-ray interpretation. Nat. Mach. Intell. 4, 867–878 (2022).

Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144 (ACM, 2016).

Lapuschkin, S. et al. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 10, 1096 (2019).

Anders, C. J. et al. Finding and removing Clever Hans: using explanation methods to debug and improve deep models. Inf. Fusion 77, 261–295 (2022).

Wang, Z. J. et al. Interpretability, then what? Editing machine learning models to reflect human knowledge and values. In Proc. 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4132–4142 (ACM, 2022).

Samek, W. & Müller, K.-R. Towards explainable artificial intelligence. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Vol. 11700, 5–22 (Springer Nature, 2019).

Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2, 573–584 (2020).

Tideman, L. E. et al. Automated biomarker candidate discovery in imaging mass spectrometry data through spatially localized Shapley additive explanations. Anal. Chim. Acta 1177, 338522 (2021).

Watson, D. S. Interpretable machine learning for genomics. Hum. Genet. 141, 1499–1513 (2022).

Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185 (2024).

Ustun, B., Spangher, A. & Liu, Y. Actionable recourse in linear classification. In Proc. Conference on Fairness, Accountability, and Transparency, 10–19 (ACM, 2019).

Ates, E., Aksar, B., Leung, V. J. & Coskun, A. K. Counterfactual explanations for multivariate time series. In Proc. International Conference on Applied Artificial Intelligence (ICAPAI), 1–8 (IEEE, 2021).

Wilming, R., Budding, C., Müller, K.-R. & Haufe, S. Scrutinizing XAI using linear ground-truth data with suppressor variables. Mach. Learn., Special Issue of the ECML PKDD 2022 Journal Track, 1–21 (Springer Nature, 2022).

Wilming, R. et al. GECOBench: a gender-controlled text dataset and benchmark for quantifying biases in explanations. Front. Artif. Intell. (in the press). Preprint at https://arxiv.org/abs/2406.11547 (2024).

Haufe, S. et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 87, 96–110 (2014).

Kindermans, P.-J. et al. Learning how to explain neural networks: PatternNet and PatternAttribution. In 6th International Conference on Learning Representations (ICLR, 2018).

Wilming, R., Kieslich, L., Clark, B. & Haufe, S. Theoretical behavior of XAI methods in the presence of suppressor variables. Proc. 40th Int. Conf. Mach. Learn. 202, 37091–37107 (2023).


Conger, A. J. A revised definition for suppressor variables: a guide to their identification and interpretation. Educ. Psychol. Meas. 34, 35–46 (1974).

Pearl, J. Causality (Cambridge University Press, 2009).

Clark, B., Wilming, R. & Haufe, S. XAI-TRIS: non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance. Mach. Learn. 113, 6871–6910 (2024).

Baehrens, D. et al. How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010).

Bach, S. et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10, 1–46 (2015).

Montavon, G., Bach, S., Binder, A., Samek, W. & Müller, K.-R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit. 65, 211–222 (2017).

Shapley, L. S. A value for n-person games. Contrib. Theory Games 2, 307–317 (1953).

Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, Vol. 30 (eds. Guyon, I. et al.) 4765–4774 (Curran Associates, Inc., 2017).

Aas, K., Jullum, M. & Løland, A. Explaining individual predictions when features are dependent: more accurate approximations to Shapley values. Artif. Intell. 298, 103502 (2021).

Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In ICML, Vol. 70 of Proc. Machine Learning Research (eds. Precup, D. & Teh, Y. W.) 3319–3328 (PMLR, 2017).

Wachter, S., Mittelstadt, B. & Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL Tech. 31, 841 (2017).


Guidotti, R. et al. A survey of methods for explaining black box models. ACM Comput. Surv. 51, 1–42 (2019).

Jacovi, A. & Goldberg, Y. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? in Proc. 58th Annual Meeting of the Association for Computational Linguistics, 4198–4205 (Association for Computational Linguistics, 2020).

Weichwald, S. et al. Causal interpretation rules for encoding and decoding models in neuroimaging. Neuroimage 110, 48–59 (2015).

Karimi, A.-H., Schölkopf, B. & Valera, I. Algorithmic recourse: from counterfactual explanations to interventions. In Proc. ACM Conference on Fairness, Accountability, and Transparency, 353–362 (ACM, 2021).

Caruana, R. et al. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730 (ACM, 2015).

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).

Rai, A. Explainable AI: from black box to glass box. J. Acad. Mark. Sci. 48, 137–141 (2020).

Clark, B. et al. Correcting misinterpretations of additive models. in Proc. 39th Annual Conference on Neural Information Processing Systems (NeurIPS, 2025).

Shmueli, G. To explain or to predict? Stat. Sci. 25, 289–310 (2010).

Del Giudice, M. The prediction-explanation fallacy: a pervasive problem in scientific applications of machine learning. Methodology 20, 22–46 (2024).

Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. Preprint at https://arxiv.org/abs/1702.08608 (2017).

Hedström, A. et al. Quantus: an explainable AI toolkit for responsible evaluation of neural network explanations and beyond. J. Mach. Learn. Res. 24, 1–11 (2023).


Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

Meinshausen, N. & Bühlmann, P. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 417–473 (2010).

Samek, W., Binder, A., Montavon, G., Lapuschkin, S. & Müller, K. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 28, 2660–2673 (2017).

Hooker, S., Erhan, D., Kindermans, P.-J. & Kim, B. A benchmark for interpretability methods in deep neural networks. in Advances in Neural Information Processing Systems, Vol. 32 (eds. Wallach, H. et al.) 9737–9748 (Curran Associates, Inc., 2019).

Rong, Y., Leemann, T., Borisov, V., Kasneci, G. & Kasneci, E. A consistent and efficient evaluation strategy for attribution methods. In International Conference on Machine Learning, 18770–18795 (PMLR, 2022).

Blücher, S., Vielhaben, J. & Strodthoff, N. PredDiff: explanations and interactions from conditional expectations. Artif. Intell. 312, 103774 (2022).

Adebayo, J. et al. Sanity checks for saliency maps. in Proc. 32nd International Conference on Neural Information Processing Systems, Vol. 31, 9525–9536 (Curran Associates Inc., 2018).

Holzinger, A., Langs, G., Denk, H., Zatloukal, K. & Müller, H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9, e1312 (2019).

Biessmann, F. & Refiano, D. Quality metrics for transparent machine learning with and without humans in the loop are not correlated. in Proc. ICML Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI https://arxiv.org/abs/2107.02033 (2021).

Jesus, S. et al. How can I choose an explainer? An application-grounded evaluation of post-hoc explanations. in Proc. ACM Conference on Fairness, Accountability, and Transparency, 805–815 (ACM, 2021).

Buçinca, Z., Lin, P., Gajos, K. Z. & Glassman, E. L. Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proc. 25th International Conference on Intelligent User Interfaces, 454–464 (ACM, 2020).

Bansal, G. et al. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. in Proc. CHI Conference on Human Factors in Computing Systems, 1–16 (ACM, 2021).

Trout, J. D. Scientific explanation and the sense of understanding. Philos. Sci. 69, 212–233 (2002).

Oala, L. et al. Machine learning for health: algorithm auditing & quality control. J. Med. Syst. 45, 1–8 (2021).

DIN SPEC 92001-3:2023-04. Artificial Intelligence—Life Cycle Processes and Quality Requirements—Part 3: Explainability (DIN Deutsches Institut für Normung e. V., 2023).

Sokol, K. & Flach, P. Explainability fact sheets: a framework for systematic assessment of explainable approaches. in Proc. Conference on Fairness, Accountability, and Transparency, 56–67 (ACM, 2020).

Amann, J. et al. To explain or not to explain?–Artificial intelligence explainability in clinical decision support systems. PLoS Digit. Health 1, e0000016 (2022).

Vetter, D. et al. Lessons learned from assessing trustworthy AI in practice. Digit. Soc. 2, 35 (2023).

Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750 (2021).

Sokol, K. & Flach, P. One explanation does not fit all: the promise of interactive explanations for machine learning transparency. KI - Künstliche Intell. 34, 235–250 (2020).

Weber, R. O., Johs, A. J., Goel, P. & Silva, J. M. XAI is in trouble. AI Mag. 45, 300–316 (2024).

Freiesleben, T. & König, G. Dear XAI community, we need to talk! Fundamental misconceptions in current XAI research. in World Conference on Explainable Artificial Intelligence, 48–65 (Springer, 2023).

Afroogh, S. et al. Beyond explainable AI (XAI): an overdue paradigm shift and post-XAI research directions. Preprint at https://arxiv.org/abs/2602.24176 (2026).

Babic, B., Gerke, S., Evgeniou, T. & Cohen, I. G. Beware explanations from AI in health care. Science 373, 284–286 (2021).

Bordt, S., Finck, M., Raidl, E. & von Luxburg, U. Post-hoc explanations fail to achieve their purpose in adversarial contexts. in Proc. ACM Conference on Fairness, Accountability, and Transparency, 891–905 (ACM, 2022).

Hedström, A. et al. The meta-evaluation problem in explainable AI: identifying reliable estimators with MetaQuantus. Trans. Mach. Learn. Res. https://openreview.net/forum?id=j3FK00HyfU (2023).

Blücher, S., Vielhaben, J. & Strodthoff, N. Decoupling pixel flipping and occlusion strategy for consistent XAI benchmarks. Trans. Mach. Learn. Res. https://openreview.net/forum?id=bIiLXdtUVM (2024).

Dombrowski, A.-K. et al. Explanations can be manipulated and geometry is to blame. In Proc. Advances in Neural Information Processing Systems, Vol. 32 (NeurIPS, 2019).

Xin, X., Huang, F. & Hooker, G. Why you should not trust interpretations in machine learning: adversarial attacks on partial dependence plots. Preprint at https://arxiv.org/abs/2404.18702 (2024).

Kauffmann, J. et al. From clustering to cluster explanations via neural networks. IEEE Trans. Neural Netw. Learn. Syst. 35, 1926–1940 (2022).

Clark, B., Oliveira, M., Wilming, R. & Haufe, S. Feature salience–not task-informativeness–drives machine learning model explanations. Preprint at https://arxiv.org/abs/2602.09238 (2026).

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. 116, 22071–22080 (2019).

Zicari, R. V. et al. Z-Inspection®: a process to assess trustworthy AI. IEEE Trans. Technol. Soc. 2, 83–97 (2021).

Borgonovo, E., Ghidini, V., Hahn, R. & Plischke, E. Explaining classifiers with measures of statistical association. Comput. Stat. Data Anal. 182, 107701 (2023).

Karimi, A.-H., Von Kügelgen, J., Schölkopf, B. & Valera, I. Algorithmic recourse under imperfect causal knowledge: a probabilistic approach. Adv. Neural Inf. Process. Syst. 33, 265–277 (2020).


Sixt, L., Granz, M. & Landgraf, T. When explanations lie: why many modified BP attributions fail. in Proc. 37th International Conference on Machine Learning, 9046–9057 (PMLR, 2020).

Bilodeau, B., Jaques, N., Koh, P. W. & Kim, B. Impossibility theorems for feature attribution. Proc. Natl. Acad. Sci. 121, e2304406120 (2024).

Frye, C., Rowat, C. & Feige, I. Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability. Adv. Neural Inf. Process. Syst. 33, 1229–1239 (2020).

Martin, J. & Haufe, S. cc-Shapley: measuring multivariate feature importance needs causal context. Preprint at https://arxiv.org/abs/2602.20396 (2026).

Gjølbye, A., Haufe, S. & Hansen, L. K. Minimizing false-positive attributions in explanations of non-linear models. In Proc. 39th Annual Conference on Neural Information Processing Systems https://openreview.net/forum?id=ORrCEtiiVX (NeurIPS, 2025).

Oberkampf, W. L. & Roy, C. J. Verification and Validation in Scientific Computing (Cambridge University Press, 2010).

Imbert, C. & Ardourel, V. Formal verification, scientific code, and the epistemological heterogeneity of computational science. Philos. Sci. 90, 376–394 (2023).

Ismail, A. A., Gunady, M., Pessoa, L., Corrada Bravo, H. & Feizi, S. Input-cell attention reduces vanishing saliency of recurrent neural networks. in Proc. Advances in Neural Information Processing Systems, Vol. 32 (Curran Associates, Inc., 2019).

Yalcin, O., Fan, X. & Liu, S. Evaluating the correctness of explainable AI algorithms for classification. Preprint at https://arxiv.org/abs/2105.09740 (2021).

Arras, L., Osman, A. & Samek, W. CLEVR-XAI: a benchmark dataset for the ground truth evaluation of neural network explanations. Inf. Fusion 81, 14–40 (2022).

Zhou, Y., Booth, S., Ribeiro, M. T. & Shah, J. Do feature attribution methods correctly attribute features? in Proc. AAAI Conference on Artificial Intelligence, Vol. 36 (AAAI, 2022).

Budding, C., Eitel, F., Ritter, K. & Haufe, S. Evaluating saliency methods on artificial data with different background types. in Medical Imaging Meets NeurIPS (official NeurIPS workshop) https://arxiv.org/abs/2112.04882 (2021).

Oliveira, M. et al. Benchmarking the influence of pre-training on explanation performance in MR image classification. Front. Artif. Intell. 7, 1330919 (2024).

Oramas, J., Wang, K. & Tuytelaars, T. Visual explanation by interpretation: improving visual feedback capabilities of deep neural networks. in International Conference on Learning Representations (ICLR, 2019).

Fok, R. & Weld, D. S. In search of verifiability: Explanations rarely enable complementary performance in AI-advised decision making. AI Mag. 45, 317–332 (2023).

Janzing, D. & Schölkopf, B. Detecting non-causal artifacts in multivariate linear regression models. in International Conference on Machine Learning, 2245–2253 (PMLR, 2018).

Hvilshøj, F., Iosifidis, A. & Assent, I. ECINN: efficient counterfactuals from invertible neural networks. In British Machine Vision Conference (BMVC, 2021).

Schölkopf, B. et al. Toward causal representation learning. Proc. IEEE 109, 612–634 (2021).

Ahuja, K., Mahajan, D., Wang, Y. & Bengio, Y. Interventional causal representation learning. In International Conference on Machine Learning, 372–407 (PMLR, 2023).

Hastie, T. et al. The Elements of Statistical Learning (Springer, 2009).