AI safety for everyone | Nature Machine Intelligence

Kasirzadeh, A. Two types of AI existential risk: decisive and accumulative. Philos. Stud. (2025).
Lazar, S. & Nelson, A. AI safety on whose terms? Science 381, 138–138 (2023).
Ahmed, S., Jaźwińska, K., Ahlawat, A., Winecoff, A. & Wang, M. Field-building and the epistemic culture of AI safety. First Monday (2024).
Bender, E. M. Talking about a ‘schism’ is ahistorical. Medium (2023).
Krause, S. S. Aircraft Safety (McGraw-Hill, 2003).
Boyd, D. D. A review of general aviation safety (1984–2017). Aerosp. Med. Hum. Perform. 88, 657–664 (2017).
Pifferi, G. & Restani, P. The safety of pharmaceutical excipients. Farmaco 58, 541–550 (2003).
Leveson, N. et al. Applying system engineering to pharmaceutical safety. J. Healthc. Eng. 3, 391–414 (2012).
De Kimpe, L., Walrave, M., Ponnet, K. & Van Ouytsel, J. Internet safety. In The International Encyclopedia of Media Literacy (eds Hobbs, R. & Mihailidis, P.) (Wiley, 2019).
Salim, H. M. Cyber Safety: A Systems Thinking and Systems Theory Approach to Managing Cyber Security Risks. PhD thesis, Massachusetts Institute of Technology (2014).
Leveson, N. G. Engineering a Safer World: Systems Thinking Applied to Safety (MIT Press, 2016).
Varshney, K. R. Engineering safety in machine learning. In Information Theory and Applications Workshop (ITA) 1–5 (IEEE, 2016).
Rismani, S. et al. From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML. In Proc. 2023 CHI Conference on Human Factors in Computing Systems 1–18 (2023).
Dobbe, R. System safety and artificial intelligence. In FAccT ’22: Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1584–1584 (ACM, 2022).
Rismani, S. et al. Beyond the ML model: applying safety engineering frameworks to text-to-image development. In AIES ’23: Proc. 2023 AAAI/ACM Conference on AI, Ethics, and Society 70–83 (ACM, 2023).
Amodei, D. et al. Concrete problems in AI safety. Preprint at (2016).
Raji, I. D. & Dobbe, R. Concrete problems in AI safety, revisited. Preprint at (2023).
Kitchenham, B. & Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering EBSE Technical Report EBSE-2007-01 (School of Computer Science and Mathematics, Keele University, 2007).
Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In EASE ’14: Proc. 18th International Conference on Evaluation and Assessment in Software Engineering Article 38, 1–10 (ACM, 2014).
Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at (2018).
Ng, A. Y. & Russell, S. J. Algorithms for inverse reinforcement learning. In ICML’00: Proc. 17th International Conference on Machine Learning 663–670 (ACM, 2000).
Elhage, N. et al. Toy models of superposition. Preprint at (2022).
Hendrycks, D. et al. Aligning AI with shared human values. In International Conference on Learning Representations (ICLR, 2021).
Yampolskiy, R. V. Artificial intelligence safety and cybersecurity: a timeline of AI failures. Preprint at (2016).
Hadfield-Menell, D., Russell, S. J., Abbeel, P. & Dragan, A. Cooperative inverse reinforcement learning. In NIPS’16: Proc. 30th International Conference on Neural Information Processing Systems 3916–3924 (ACM, 2016).
Xu, H., Zhu, T., Zhang, L., Zhou, W. & Yu, P. S. Machine unlearning: a survey. ACM Comput. Surv. 56, 9.1–9.36 (2023).
Russell, S., Dewey, D. & Tegmark, M. Research priorities for robust and beneficial artificial intelligence. AI Mag. 36, 105–114 (2015).
Willers, O. et al. Safety concerns and mitigation approaches regarding the use of deep learning in safety-critical perception tasks. In Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops: Lecture Notes in Computer Science (eds Casimiro, A. et al.) 336–350 (Springer, 2020).
Mohseni, S. et al. Taxonomy of machine learning safety: a survey and primer. ACM Comput. Surv. 55, 1–38 (2022).
Hendrycks, D., Carlini, N., Schulman, J. & Steinhardt, J. Unsolved problems in ML safety. Preprint at (2022).
Boyatzis, R. E. Transforming Qualitative Information: Thematic Analysis and Code Development (Sage, 1998).
van Eck, N. J. & Waltman, L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84, 523–538 (2010).
Oster, C. V. Jr, Strong, J. S. & Zorn, C. K. Analyzing aviation safety: problems, challenges, opportunities. Res. Transport. Econ. 43, 148–164 (2013).
Donaldson, M. S., Corrigan, J. M. & Kohn, L. T. (eds). To Err is Human: Building a Safer Health System (National Academies Press, 2000).
Bates, D. W. et al. The safety of inpatient health care. N. Engl. J. Med. 388, 142–153 (2023).
Marais, K. et al. Beyond Normal Accidents and High Reliability Organizations: The Need for an Alternative Approach to Safety in Complex Systems (Citeseer, 2004).
Griffor, E. Handbook of System Safety and Security: Cyber Risk and Risk Management, Cyber Security, Threat Analysis, Functional Safety, Software Systems, and Cyber Physical Systems (Syngress, 2016).
Prasad, R. & Rohokale, V. Cyber Security: the Lifeline of Information and Communication Technology (Springer, 2020).
Ortega, P. A., Maini, V. & the DeepMind Safety Team. Building safe artificial intelligence: specification, robustness, and assurance. Medium (2018).
Meng, Y. et al. Distantly-supervised named entity recognition with noise-robust learning and language model augmented self-training. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing 10367–10378 (ACL, 2021).
Wang, K. & Guo, P. A robust automated machine learning system with pseudoinverse learning. Cognit. Comput. 13, 724–735 (2021).
Cappozzo, A., Greselin, F. & Murphy, T. B. A robust approach to model-based classification based on trimming and constraints: semi-supervised learning in presence of outliers and label noise. Adv. Data Anal. Classif. 14, 327–354 (2020).
Li, W. & Wang, Y. A robust supervised subspace learning approach for output-relevant prediction and detection against outliers. J. Process Control 106, 184–194 (2021).
Curi, S., Bogunovic, I. & Krause, A. Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning. In Proc. 38th International Conference on Machine Learning Vol. 139, 2254–2264 (PMLR, 2021).
Dobbe, R., Gilbert, T. K. & Mintz, Y. Hard choices in artificial intelligence. Artif. Intell. 300, 103555 (2021).
Dwork, C. & Feldman, V. Privacy-preserving prediction. In Proc. 31st Conference On Learning Theory Vol. 75, 1693–1702 (PMLR, 2018).
Elhage, N. et al. A mathematical framework for transformer circuits. Transformer Circuits Thread (2021).
Kim, H. & Mnih, A. Disentangling by factorising. In Proc. 35th International Conference on Machine Learning Vol. 80, 2649–2658 (PMLR, 2018).
Ward, F. & Habli, I. An assurance case pattern for the interpretability of machine learning in safety-critical systems. In Computer Safety, Reliability, and Security. SAFECOMP 2020. Lecture Notes in Computer Science (eds Casimiro, A. et al.) Vol. 12235, 395–407 (Springer, 2020).
Gyevnar, B., Ferguson, N. & Schafer, B. Bridging the transparency gap: what can explainable AI learn from the AI Act? In Frontiers in Artificial Intelligence and Applications Vol. 372, 964–971 (IOS, 2023).
Reimann, L. & Kniesel-Wünsche, G. Safe-DS: a domain specific language to make data science safe. In ICSE-NIER ’23: Proc. 45th International Conference on Software Engineering: New Ideas and Emerging Results 72–77 (ACM, 2023).
Dey, S. & Lee, S.-W. A multi-layered collaborative framework for evidence-driven data requirements engineering for machine learning-based safety-critical systems. In SAC ’23: Proc. 38th ACM/SIGAPP Symposium on Applied Computing 1404–1413 (ACM, 2023).
Wei, C.-Y., Dann, C. & Zimmert, J. A model selection approach for corruption robust reinforcement learning. Proc. Machine Learning Research 167, 1043–1096 (2022).
Ghosh, A., Tschiatschek, S., Mahdavi, H. & Singla, A. Towards deployment of robust cooperative AI agents: an algorithmic framework for learning adaptive policies. In AAMAS ’20: Proc. 19th International Conference on Autonomous Agents and MultiAgent Systems 447–455 (ACM, 2020).
Wu, Y., Dobriban, E. & Davidson, S. DeltaGrad: rapid retraining of machine learning models. In Proc. 37th International Conference on Machine Learning Vol. 119, 10355–10366 (PMLR, 2020).
Izzo, Z., Smart, M. A., Chaudhuri, K. & Zou, J. Approximate data deletion from machine learning models. In Proc. 24th International Conference on Artificial Intelligence and Statistics Vol. 130, 2008–2016 (PMLR, 2021).
Everitt, T. & Hutter, M. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science (eds Steunebrink, B. et al.) Vol. 9782 (Springer, 2016).
Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J. & Garrabrant, S. Risks from learned optimization in advanced machine learning systems. Preprint at (2019).
Pistono, F. & Yampolskiy, R. V. Unethical research: how to create a malevolent artificial intelligence. Preprint at (2016).
Picardi, C., Paterson, C., Hawkins, R., Calinescu, R. & Habli, I. Assurance argument patterns and processes for machine learning in safety-related systems. In Proc. Workshop on Artificial Intelligence Safety (SafeAI 2020). CEUR Workshop Proceedings 23–30 (CEUR, 2020).
Wabersich, K. J., Hewing, L., Carron, A. & Zeilinger, M. N. Probabilistic model predictive safety certification for learning-based control. IEEE Trans. Automat. Contr. 67, 176–188 (2022).
Wen, M. & Topcu, U. Constrained cross-entropy method for safe reinforcement learning. IEEE Trans. Automat. Contr. 66, 3123–3137 (2021).
Zanella-Béguelin, S. et al. Analyzing information leakage of updates to natural language models. In CCS ’20: Proc. 2020 ACM SIGSAC Conference on Computer and Communications Security 363–375 (ACM, 2020).
Wang, Z., Chen, C. & Dong, D. A dirichlet process mixture of robust task models for scalable lifelong reinforcement learning. IEEE Trans. Cybern. 1, 12 (2022).
Zou, A. et al. Universal and transferable adversarial attacks on aligned language models. Preprint at (2023).
Ilyas, A. et al. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
He, R. D., Han, Z. Y., Yang, Y. & Yin, Y. L. Not all parameters should be treated equally: deep safe semi-supervised learning under class distribution mismatch. In Proc. AAAI Conference on Artificial Intelligence Vol. 36, 6874–6883 (AAAI, 2022).
Aghakhani, H., Meng, D., Wang, Y.-X., Kruegel, C. & Vigna, G. Bullseye polytope: a scalable clean-label poisoning attack with improved transferability. In Proc. IEEE Symposium on Security & Privacy 159–178 (IEEE, 2021).
Liu, Y. et al. Backdoor defense with machine unlearning. In IEEE INFOCOM 2022 IEEE Conference on Computer Communications 280–289 (IEEE, 2022).
Meinke, A. & Hein, M. Towards neural networks that provably know when they don’t know. In International Conference on Learning Representations (ICLR, 2020).
Abdelfattah, S., Kasmarik, K. & Hu, J. A robust policy bootstrapping algorithm for multi-objective reinforcement learning in non-stationary environments. Adapt. Behav. 28, 273–292 (2020).
Djeumou, F., Cubuktepe, M., Lennon, C. & Topcu, U. Task-guided inverse reinforcement learning under partial information. In Proc. 32nd International Conference on Automated Planning and Scheduling (ICAPS, 2021).
Ring, M. & Orseau, L. Delusion, survival, and intelligent agents. In Artificial General Intelligence Lecture Notes in Computer Science (eds Schmidhuber, J. et al.) 11–20 (Springer, 2011).
Soares, N., Fallenstein, B., Yudkowsky, E. & Armstrong, S. Corrigibility. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop (AAAI, 2015).
Yampolskiy, R. V. Leakproofing the singularity: artificial intelligence confinement problem. J. Conscious. Stud. 19, 194–214 (2012).
Yampolskiy, R. V. in Philosophy and Theory of Artificial Intelligence Studies in Applied Philosophy, Epistemology and Rational Ethics (ed. Müller, V. C.) 389–396 (Springer, 2013).
Yampolskiy, R. V. Taxonomy of pathways to dangerous AI. Preprint at (2015).
Wheatley, S., Sovacool, B. K. & Sornette, D. Reassessing the safety of nuclear power. Energy Res. Soc. Sci. 15, 96–100 (2016).
Kobayashi, T. et al. Formal modelling of safety architecture for responsibility-aware autonomous vehicle via event-B refinement. In Formal Methods. FM 2023. Lecture Notes in Computer Science Vol. 14000 (eds Katoen, J. P. et al.) 533–549 (Springer, 2023).
Tay, E. B., Gan, O. P. & Ho, W. K. A study on real-time artificial intelligence. IFAC Proc. Vol. 30, 109–114 (1997).
Hibbard, B., Bach, J., Goertzel, B. & Iklé, M. Avoiding unintended AI behaviors. In Artificial General Intelligence: Lecture Notes in Computer Science (eds Bach, J. et al.) 107–116 (Springer, 2012).
Sezener, C. E. Inferring human values for safe AGI design. In Artificial General Intelligence (eds Bieger, J. et al.) Vol. 9205, 152–155 (Springer, 2015).
El Mhamdi, E. & Guerraoui, R. When neurons fail. In IEEE International Parallel and Distributed Processing Symposium 1028–1037 (IEEE, 2017).
Freiesleben, T. & Grote, T. Beyond generalization: a theory of robustness in machine learning. Synthese 202, 109 (2023).
Sanneman, L. & Shah, J. Transparent value alignment. In HRI ’23: Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction 557–560 (ACM, 2023).
Abbeel, P. & Ng, A. Y. Apprenticeship learning via inverse reinforcement learning In ICML ’04: Proc. 21st International Conference on Machine Learning (ACM, 2004).
Murphy, B. et al. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proc. COLING 2012 1933–1950 (The COLING 2012 Organizing Committee, 2012).
Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T. & Hovy, E. SPINE: sparse interpretable neural embeddings. In Proc. AAAI Conference on Artificial Intelligence Vol. 32 (AAAI, 2018).
Shaham, U., Yamada, Y. & Negahban, S. Understanding adversarial training: increasing local stability of supervised models through robust optimization. Neurocomputing 307, 195–204 (2018).
Wu, G., Hashemi, M. & Srinivasa, C. PUMA: performance unchanged model augmentation for training data removal. In Proc. 36th AAAI Conference on Artificial Intelligence (AAAI, 2022).
Jing, S. & Yang, L. A robust extreme learning machine framework for uncertain data classification. J. Supercomput. 76, 2390–2416 (2020).
Gan, H. T., Li, Z. H., Fan, Y. L. & Luo, Z. Z. Dual learning-based safe semi-supervised learning. IEEE Access 6, 2615–2621 (2018).
Engstrom, L. et al. Adversarial robustness as a prior for learned representations. Preprint at (2019).
Brophy, J. & Lowd, D. Machine unlearning for random forests. In Proc. 38th International Conference on Machine Learning Vol. 139, 1092–1104 (PMLR, 2021).
Chundawat, V. S., Tarun, A. K., Mandal, M. & Kankanhalli, M. Zero-shot machine unlearning. IEEE Trans. Inf. Forensics Secur. 18, 2345–2354 (2023).
Chen, J. et al. ATOM: robustifying out-of-distribution detection using outlier mining. In Machine Learning and Knowledge Discovery in Databases. Research Track: Lecture Notes in Computer Science (eds Oliver, N. et al.) 430–445 (Springer, 2021).
Lakkaraju, H., Kamar, E., Caruana, R. & Horvitz, E. Identifying unknown unknowns in the open world: representations and policies for guided exploration. In AAAI’17: Proc. 31st AAAI Conference on Artificial Intelligence 2124–2132 (ACM, 2017).
Zhuo, J. B., Wang, S. H., Zhang, W. G. & Huang, Q. M. Deep unsupervised convolutional domain adaptation. In MM ’17: Proc. 25th ACM International Conference on Multimedia 261–269 (ACM, 2017).
Bossens, D. M. & Bishop, N. Explicit explore, exploit, or escape (E4): near-optimal safety-constrained reinforcement learning in polynomial time. Mach. Learn. 112, 817–858 (2023).
Massiani, P. F., Heim, S., Solowjow, F. & Trimpe, S. Safe value functions. IEEE Trans. Automat. Contr. 68, 2743–2757 (2023).
Shi, M., Liang, Y. & Shroff, N. A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints. In ICML’23: Proc. 40th International Conference on Machine Learning Article 1291, 31243–31268 (ACM, 2023).
Hunt, N. et al. Verifiably safe exploration for end-to-end reinforcement learning. In HSCC ’21: Proc. 24th International Conference on Hybrid Systems: Computation and Control (ACM, 2021).
Ma, Y. J., Shen, A., Bastani, O. & Dinesh, J. Conservative and adaptive penalty for model-based safe reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence Vol. 36, 5404–5412 (AAAI, 2022).
Zwane, S. et al. Safe trajectory sampling in model-based reinforcement learning. In 19th International Conference on Automation Science and Engineering (CASE) (IEEE, 2023).
Fischer, J., Eyberg, C., Werling, M. & Lauer, M. Sampling-based inverse reinforcement learning algorithms with safety constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems 791–798 (IEEE, 2021).
Zhou, Z., Liu, G. & Zhou, M. A robust mean-field actor-critic reinforcement learning against adversarial perturbations on agent states. IEEE Trans. Neural Netw. Learn. Syst. 1–12 (2023).
Bazzan, A. L. C. Aligning individual and collective welfare in complex socio-technical systems by combining metaheuristics and reinforcement learning. Eng. Appl. Artif. Intell. 79, 23–33 (2019).
Christoffersen, P. J., Haupt, A. A. & Hadfield-Menell, D. Get it in writing: formal contracts mitigate social dilemmas in multi-agent RL. In AAMAS ’23: Proc. 2023 International Conference on Autonomous Agents and Multiagent Systems 448–456 (ACM, 2023).
Christiano, P. F. et al. Deep reinforcement learning from human preferences. In NIPS’17: Proc. 31st International Conference on Neural Information Processing Systems 4302–4310 (2017).
Kaushik, D., Hovy, E. & Lipton, Z. C. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR, 2020).
Li, Z. Y., Zeng, J., Thirugnanam, A. & Sreenath, K. Bridging model-based safety and model-free reinforcement learning through system identification of low dimensional linear models. In Robotics: Science and Systems Paper 033 (2022).
Zhu, X., Kang, S. C. & Chen, J. Y. A contact-safe reinforcement learning framework for contact-rich robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2476–2482 (IEEE, 2022).
Terra, A., Riaz, H., Raizer, K., Hata, A. & Inam, R. Safety vs. efficiency: AI-based risk mitigation in collaborative robotics. In 6th International Conference on Control, Automation and Robotics 151–160 (ICCAR, 2020).
Adebayo, J. et al. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems Vol. 31 (NeurIPS, 2018).
Carlini, N. & Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP) 39–57 (IEEE, 2017).
Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 427–436 (IEEE, 2015).
Kaufmann, M. et al. Testing robustness against unforeseen adversaries. Preprint at (2019).
Ray, A., Achiam, J. & Amodei, D. Benchmarking safe exploration in deep reinforcement learning. OpenAI (2023).
Hendrycks, D. et al. What would Jiminy Cricket do? Towards agents that behave morally. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS, 2021).
Gardner, M. et al. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Cohn, T. et al.) 1307–1323 (Association for Computational Linguistics, 2020).
Spears, D. F. Agent technology from a formal perspective. In NASA Monographs in Systems and Software Engineering (eds Rouff, C. A. et al.) 227–257 (Springer, 2006).
Wozniak, E., Cârlan, C., Acar-Celik, E. & Putzer, H. A safety case pattern for systems with machine learning components. In Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops Vol. 12235, 370–382 (Springer, 2020).
Hendrycks, D. & Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. International Conference on Learning Representations (ICLR, 2019).
Gholampour, P. & Verma, R. Adversarial robustness of phishing email detection models. In IWSPA ’23: Proc. 9th ACM International Workshop on Security and Privacy Analytics 67–76 (2023).
Nanda, N., Chan, L., Lieberum, T., Smith, J. & Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In 11th International Conference on Learning Representations (ICLR, 2023).
Olsson, C. et al. In-context learning and induction heads. Transformer Circuits Thread (2022).
Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations Poster (ICLR, 2015).
Kim, B. et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In Proc. 35th International Conference on Machine Learning 2668–2677 (PMLR, 2018).
Karpathy, A., Johnson, J. & Fei-Fei, L. Visualizing and understanding recurrent networks. Preprint at (2016).
Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C. & Botvinick, M. On the importance of single directions for generalization. Preprint at (2018).
Gyevnar, B., Wang, C., Lucas, C. G., Cohen, S. B. & Albrecht, S. V. Causal explanations for sequential decision-making in multi-agent systems. In AAMAS ’24: Proc. 23rd International Conference on Autonomous Agents and Multiagent Systems 771–779 (ACM, 2024).
Okawa, Y., Sasaki, T. & Iwane, H. Automatic exploration process adjustment for safe reinforcement learning with joint chance constraint satisfaction. IFAC-PapersOnLine 53, 1588–1595 (2020).
Gan, H. T., Luo, Z. Z., Meng, M., Ma, Y. L. & She, Q. S. A risk degree-based safe semi-supervised learning algorithm. Int. J. Mach. Learn. Cybern. 7, 85–94 (2016).
Zhang, Y. X. et al. Barrier Lyapunov Function-based safe reinforcement learning for autonomous vehicles with optimized backstepping. IEEE Trans. Neural Netw. Learn. Syst. 35, 2066–2080 (2024).
Nicolae, M.-I., Sebban, M., Habrard, A., Gaussier, E. & Amini, M.-R. Algorithmic robustness for semi-supervised (ϵ, γ, τ)-good metric learning. In International Conference on Neural Information Processing (ICONIP, 2015).
Khan, W. U. & Seto, E. A ‘do no harm’ novel safety checklist and research approach to determine whether to launch an artificial intelligence-based medical technology: introducing the biological-psychological, economic, and social (BPES) framework. J. Med. Internet Res. 25, e43386 (2023).
Schumeg, B., Marotta, F. & Werner, B. Proposed V-model for verification, validation, and safety activities for artificial intelligence. In 2023 IEEE International Conference on Assured Autonomy (ICAA) 61–66 (IEEE, 2023).
Kamm, S., Sahlab, N., Jazdi, N. & Weyrich, M. A concept for dynamic and robust machine learning with context modeling for heterogeneous manufacturing data. Procedia CIRP 118, 354–359 (2023).
Costa, E., Rebello, C., Fontana, M., Schnitman, L. & Nogueira, I. A robust learning methodology for uncertainty-aware scientific machine learning models. Mathematics 11, 74 (2023).
Aksjonov, A. & Kyrki, V. A safety-critical decision-making and control framework combining machine-learning-based and rule-based algorithms. SAE Int. J. Veh. Dyn. Stab. NVH 7, 287–299 (2023).
Antikainen, J. et al. A deployment model to extend ethically aligned AI implementation method ECCOLA. In 29th IEEE International Requirements Engineering Conference Workshops (eds Yue, T. & Mirakhorli, M.) 230–235 (IEEE, 2021).
Vakkuri, V., Kemell, K. K., Jantunen, M., Halme, E. & Abrahamsson, P. ECCOLA — a method for implementing ethically aligned AI systems. J. Syst. Softw. 182, 111067 (2021).
Zhang, H., Shahbazi, N., Chu, X. & Asudeh, A. FairRover: explorative model building for fair and responsible machine learning. In DEEM ’21: Proc. Fifth Workshop on Data Management for End-To-End Machine Learning Article 5, 1–10 (ACM, 2021).
Coston, A. et al. A validity perspective on evaluating the justified use of data-driven decision-making algorithms. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 690–704 (IEEE, 2023).
Gittens, A., Yener, B. & Yung, M. An adversarial perspective on accuracy, robustness, fairness, and privacy: multilateral-tradeoffs in trustworthy ML. IEEE Access 10, 120850–120865 (2022).
Taylor, J., Yudkowsky, E., LaVictoire, P. & Critch, A. Alignment for advanced machine learning systems. In Ethics of Artificial Intelligence (ed. Liao, S. M.) 342–382 (Oxford Academic, 2016).
Sotala, K. & Yampolskiy, R. V. Responses to catastrophic AGI risk: a survey. Phys. Scr. 90, 018001 (2014).
Johnson, B. Metacognition for artificial intelligence system safety — an approach to safe and desired behavior. Saf. Sci. 151, 105743 (2022).
Hatherall, L. et al. Responsible agency through answerability: cultivating the moral ecology of trustworthy autonomous systems. In TAS ’23: Proc. First International Symposium on Trustworthy Autonomous Systems Article 50, 1–5 (ACM, 2023).
Stahl, B. C. Embedding responsibility in intelligent systems: From AI ethics to responsible AI ecosystems. Sci. Rep. 13, 7586 (2023).
Samarasinghe, D. Counterfactual learning in enhancing resilience in autonomous agent systems. Front. Artif. Intell. 6, 1212336 (2023).
Diemert, S., Millet, L., Groves, J. & Joyce, J. Safety integrity levels for artificial intelligence. In Computer Safety, Reliability, and Security. SAFECOMP 2023 Workshops (eds Guiochet, J. et al.) Vol. 14182, 397–409 (Springer, 2023).
Wang, J. & Jia, R. Data Banzhaf: a robust data valuation framework for machine learning. In Proc. 26th International Conference on Artificial Intelligence and Statistics (AISTATS) Vol. 206, 6388–6421 (PMLR, 2023).
Everitt, T., Filan, D., Daswani, M. & Hutter, M. Self-modification of policy and utility function in rational agents. In Artificial General Intelligence: 9th International Conference, AGI 2016 (eds Steunebrink, B. et al.) Vol. 9782 (Springer, 2016).
Badea, C. & Artus, G. Morality, machines, and the interpretation problem: a value-based, Wittgensteinian approach to building moral agents. In Artificial Intelligence, AI 2022 (eds Bramer, M. & Stahl, F.) Vol. 39, 124–137 (2022).
Umbrello, S. Beneficial artificial intelligence coordination by means of a value sensitive design approach. Big Data Cogn. Comput. 3, 5 (2019).
Yampolskiy, R. & Fox, J. Safety engineering for artificial general intelligence. Topoi 32, 217–226 (2012).
Weld, D. et al. The first law of robotics. In Safety and Security in Multiagent Systems: Lecture Notes in Computer Science (eds Barley, M. et al.) 90–100 (Springer, 2009).
Matthias, A. The responsibility gap: ascribing responsibility for the actions of learning automata. Ethics Inf. Technol. 6, 175–183 (2004).
Farina, L. Artificial intelligence systems, responsibility and agential self-awareness. In Philosophy and Theory of Artificial Intelligence 2021 Vol. 63 (ed. Müller, V. C.) 15–25 (2022).
Lee, A. T. Flight Simulation: Virtual Environments in Aviation (Routledge, 2017).
Torres, E. P. Human Extinction: A History of the Science and Ethics of Annihilation (Routledge, 2023).
Bostrom, N. Existential risks: analyzing human extinction scenarios and related hazards. J. Evol. Technol. 9, 1–30 (2002).
Bostrom, N. Superintelligence: Paths, Dangers, Strategies (Oxford Univ. Press, 2014).
Ord, T. The Precipice: Existential Risk and the Future of Humanity (Hachette Books, 2020).
Dafoe, A. & Russell, S. Yes, we are worried about the existential risk of artificial intelligence. MIT Technology Review (2016).
Bronson, R. Measuring existential risk. Peace Policy (2023).
Roose, K. A.I. poses ‘risk of extinction’, industry leaders warn. The New York Times (2023).
Statement on AI risk. Center for AI Safety (2023).
Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. Preprint at (2023).
Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. Trans. Mach. Learn. Res. (2024).
Ganin, Y. et al. in Domain Adaptation in Computer Vision Applications (ed. Csurka, G.) 189–209 (Springer, 2017).
Balaji, Y., Sankaranarayanan, S. & Chellappa, R. MetaReg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems Vol. 31 (eds Bengio, S. et al.) (NeurIPS, 2018).
Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. Preprint at (2019).
Turchetta, M. et al. Safe exploration for interactive machine learning. In Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (NeurIPS, 2019).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Proc. Syst. 35, 27730–27744 (2022).
Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of- distribution examples in neural networks. In Proc. 5th International Conference on Learning Representations (ICLR, 2017).