AI safety for everyone | Nature Machine Intelligence

  • Kasirzadeh, A. Two types of AI existential risk: decisive and accumulative. Philos. Stud. (2025).

  • Lazar, S. & Nelson, A. AI safety on whose terms? Science 381, 138 (2023).

  • Ahmed, S., Jaźwińska, K., Ahlawat, A., Winecoff, A. & Wang, M. Field-building and the epistemic culture of AI safety. First Monday (2024).

  • Bender, E. M. Talking about a ‘schism’ is ahistorical. Medium (2023).

  • Krause, S. S. Aircraft Safety (McGraw-Hill, 2003).

  • Boyd, D. D. A review of general aviation safety (1984–2017). Aerosp. Med. Hum. Perform. 88, 657–664 (2017).

  • Pifferi, G. & Restani, P. The safety of pharmaceutical excipients. Farmaco 58, 541–550 (2003).

  • Leveson, N. et al. Applying system engineering to pharmaceutical safety. J. Healthc. Eng. 3, 391–414 (2012).

  • De Kimpe, L., Walrave, M., Ponnet, K. & Van Ouytsel, J. Internet safety. In The International Encyclopedia of Media Literacy (eds Hobbs, R. & Mihailidis, P.) (Wiley, 2019).

  • Salim, H. M. Cyber Safety: A Systems Thinking and Systems Theory Approach to Managing Cyber Security Risks. PhD thesis, Massachusetts Institute of Technology (2014).

  • Leveson, N. G. Engineering a Safer World: Systems Thinking Applied to Safety (MIT Press, 2016).

  • Varshney, K. R. Engineering safety in machine learning. In Information Theory and Applications Workshop (ITA) 1–5 (IEEE, 2016).

  • Rismani, S. et al. From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML. In Proc. 2023 CHI Conference on Human Factors in Computing Systems 1–18 (ACM, 2023).

  • Dobbe, R. System safety and artificial intelligence. In FAccT ’22: Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1584–1584 (ACM, 2022).

  • Rismani, S. et al. Beyond the ML model: applying safety engineering frameworks to text-to-image development. In AIES ’23: Proc. 2023 AAAI/ACM Conference on AI, Ethics, and Society 70–83 (ACM, 2023).

  • Amodei, D. et al. Concrete problems in AI safety. Preprint at (2016).

  • Raji, I. D. & Dobbe, R. Concrete problems in AI safety, revisited. Preprint at (2023).

  • Kitchenham, B. & Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering EBSE Technical Report EBSE-2007-01 (School of Computer Science and Mathematics, Keele University, 2007).

  • Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In EASE ’14: Proc. 18th International Conference on Evaluation and Assessment in Software Engineering Article 38, 1–10 (ACM, 2014).

  • Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at (2018).

  • Ng, A. Y. & Russell, S. J. Algorithms for inverse reinforcement learning. In ICML’00: Proc. 17th International Conference on Machine Learning 663–670 (ACM, 2000).

  • Elhage, N. et al. Toy models of superposition. Preprint at (2022).

  • Hendrycks, D. et al. Aligning AI with shared human values. In International Conference on Learning Representations (ICLR, 2021).

  • Yampolskiy, R. V. Artificial intelligence safety and cybersecurity: a timeline of AI failures. Preprint at (2016).

  • Hadfield-Menell, D., Russell, S. J., Abbeel, P. & Dragan, A. Cooperative inverse reinforcement learning. In NIPS’16: Proc. 30th International Conference on Neural Information Processing Systems 3916–3924 (ACM, 2016).

  • Xu, H., Zhu, T., Zhang, L., Zhou, W. & Yu, P. S. Machine unlearning: a survey. ACM Comput. Surv. 56, 9.1–9.36 (2023).

  • Russell, S., Dewey, D. & Tegmark, M. Research priorities for robust and beneficial artificial intelligence. AI Mag. 36, 105–114 (2015).

  • Willers, O. et al. Safety concerns and mitigation approaches regarding the use of deep learning in safety-critical perception tasks. In Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops: Lecture Notes in Computer Science (eds Casimiro, A. et al.) 336–350 (Springer, 2020).

  • Mohseni, S. et al. Taxonomy of machine learning safety: a survey and primer. ACM Comput. Surv. 55, 1–38 (2022).

  • Hendrycks, D., Carlini, N., Schulman, J. & Steinhardt, J. Unsolved problems in ML safety. Preprint at (2022).

  • Boyatzis, R. E. Transforming Qualitative Information: Thematic Analysis and Code Development (Sage, 1998).

  • van Eck, N. J. & Waltman, L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84, 523–538 (2010).

  • Oster, C. V. Jr, Strong, J. S. & Zorn, C. K. Analyzing aviation safety: problems, challenges, opportunities. Res. Transport. Econ. 43, 148–164 (2013).

  • Donaldson, M. S., Corrigan, J. M. & Kohn, L. T. (eds). To Err is Human: Building a Safer Health System (National Academies Press, 2000).

  • Bates, D. W. et al. The safety of inpatient health care. N. Engl. J. Med. 388, 142–153 (2023).

  • Marais, K. et al. Beyond Normal Accidents and High Reliability Organizations: The Need for an Alternative Approach to Safety in Complex Systems (Citeseer, 2004).

  • Griffor, E. Handbook of System Safety and Security: Cyber Risk and Risk Management, Cyber Security, Threat Analysis, Functional Safety, Software Systems, and Cyber Physical Systems (Syngress, 2016).

  • Prasad, R. & Rohokale, V. Cyber Security: The Lifeline of Information and Communication Technology (Springer, 2020).

  • Ortega, P. A., Maini, V. & the DeepMind Safety Team. Building safe artificial intelligence: specification, robustness, and assurance. Medium (2018).

  • Meng, Y. et al. Distantly-supervised named entity recognition with noise-robust learning and language model augmented self-training. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing 10367–10378 (ACL, 2021).

  • Wang, K. & Guo, P. A robust automated machine learning system with pseudoinverse learning. Cognit. Comput. 13, 724–735 (2021).

  • Cappozzo, A., Greselin, F. & Murphy, T. B. A robust approach to model-based classification based on trimming and constraints: semi-supervised learning in presence of outliers and label noise. Adv. Data Anal. Classif. 14, 327–354 (2020).

  • Li, W. & Wang, Y. A robust supervised subspace learning approach for output-relevant prediction and detection against outliers. J. Process Control 106, 184–194 (2021).

  • Curi, S., Bogunovic, I. & Krause, A. Combining pessimism with optimism for robust and efficient model-based deep reinforcement learning. In Proc. 38th International Conference on Machine Learning Vol. 139, 2254–2264 (PMLR, 2021).

  • Dobbe, R., Gilbert, T. K. & Mintz, Y. Hard choices in artificial intelligence. Artif. Intell. 300, 103555 (2021).

  • Dwork, C. & Feldman, V. Privacy-preserving prediction. In Proc. 31st Conference On Learning Theory Vol. 75, 1693–1702 (PMLR, 2018).

  • Elhage, N. et al. A mathematical framework for transformer circuits. Transformer Circuits Thread (2021).

  • Kim, H. & Mnih, A. Disentangling by factorising. In Proc. 35th International Conference on Machine Learning Vol. 80, 2649–2658 (PMLR, 2018).

  • Ward, F. & Habli, I. An assurance case pattern for the interpretability of machine learning in safety-critical systems. In Computer Safety, Reliability, and Security. SAFECOMP 2020. Lecture Notes in Computer Science (eds Casimiro, A. et al.) Vol. 12235, 395–407 (Springer, 2020).

  • Gyevnar, B., Ferguson, N. & Schafer, B. Bridging the transparency gap: what can explainable AI learn from the AI Act? In Frontiers in Artificial Intelligence and Applications Vol. 372, 964–971 (IOS, 2023).

  • Reimann, L. & Kniesel-Wünsche, G. Safe-DS: a domain-specific language to make data science safe. In ICSE-NIER ’23: Proc. 45th International Conference on Software Engineering: New Ideas and Emerging Results 72–77 (ACM, 2023).

  • Dey, S. & Lee, S.-W. A multi-layered collaborative framework for evidence-driven data requirements engineering for machine learning-based safety-critical systems. In SAC ’23: Proc. 38th ACM/SIGAPP Symposium on Applied Computing 1404–1413 (ACM, 2023).

  • Wei, C.-Y., Dann, C. & Zimmert, J. A model selection approach for corruption robust reinforcement learning. Proc. Machine Learning Research 167, 1043–1096 (2022).

  • Ghosh, A., Tschiatschek, S., Mahdavi, H. & Singla, A. Towards deployment of robust cooperative AI agents: an algorithmic framework for learning adaptive policies. In AAMAS ’20: Proc. 19th International Conference on Autonomous Agents and MultiAgent Systems 447–455 (ACM, 2020).

  • Wu, Y., Dobriban, E. & Davidson, S. DeltaGrad: rapid retraining of machine learning models. In Proc. 37th International Conference on Machine Learning Vol. 119, 10355–10366 (PMLR, 2020).

  • Izzo, Z., Smart, M. A., Chaudhuri, K. & Zou, J. Approximate data deletion from machine learning models. In Proc. 24th International Conference on Artificial Intelligence and Statistics Vol. 130, 2008–2016 (PMLR, 2021).

  • Everitt, T. & Hutter, M. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence. AGI 2016. Lecture Notes in Computer Science (eds Steunebrink, B. et al.) Vol. 9782 (Springer, 2016).

  • Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J. & Garrabrant, S. Risks from learned optimization in advanced machine learning systems. Preprint at (2019).

  • Pistono, F. & Yampolskiy, R. V. Unethical research: how to create a malevolent artificial intelligence. Preprint at (2016).

  • Picardi, C., Paterson, C., Hawkins, R., Calinescu, R. & Habli, I. Assurance argument patterns and processes for machine learning in safety-related systems. In Proc. Workshop on Artificial Intelligence Safety (SafeAI 2020). CEUR Workshop Proceedings 23–30 (CEUR, 2020).

  • Wabersich, K. J., Hewing, L., Carron, A. & Zeilinger, M. N. Probabilistic model predictive safety certification for learning-based control. IEEE Trans. Automat. Contr. 67, 176–188 (2022).

  • Wen, M. & Topcu, U. Constrained cross-entropy method for safe reinforcement learning. IEEE Trans. Automat. Contr. 66, 3123–3137 (2021).

  • Zanella-Béguelin, S. et al. Analyzing information leakage of updates to natural language models. In CCS ’20: Proc. 2020 ACM SIGSAC Conference on Computer and Communications Security 363–375 (ACM, 2020).

  • Wang, Z., Chen, C. & Dong, D. A dirichlet process mixture of robust task models for scalable lifelong reinforcement learning. IEEE Trans. Cybern. 1, 12 (2022).

  • Zou, A. et al. Universal and transferable adversarial attacks on aligned language models. Preprint at (2023).

  • Ilyas, A. et al. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).

  • He, R. D., Han, Z. Y., Yang, Y. & Yin, Y. L. Not all parameters should be treated equally: deep safe semi-supervised learning under class distribution mismatch. In Proc. AAAI Conference on Artificial Intelligence Vol. 36, 6874–6883 (AAAI, 2022).

  • Aghakhani, H., Meng, D., Wang, Y.-X., Kruegel, C. & Vigna, G. Bullseye polytope: a scalable clean-label poisoning attack with improved transferability. In Proc. IEEE Symposium on Security & Privacy 159–178 (IEEE, 2021).

  • Liu, Y. et al. Backdoor defense with machine unlearning. In IEEE INFOCOM 2022 IEEE Conference on Computer Communications 280–289 (IEEE, 2022).

  • Meinke, A. & Hein, M. Towards neural networks that provably know when they don’t know. In International Conference on Learning Representations (ICLR, 2020).

  • Abdelfattah, S., Kasmarik, K. & Hu, J. A robust policy bootstrapping algorithm for multi-objective reinforcement learning in non-stationary environments. Adapt. Behav. 28, 273–292 (2020).

  • Djeumou, F., Cubuktepe, M., Lennon, C. & Topcu, U. Task-guided inverse reinforcement learning under partial information. In Proc. 32nd International Conference on Automated Planning and Scheduling (ICAPS, 2021).

  • Ring, M. & Orseau, L. Delusion, survival, and intelligent agents. In Artificial General Intelligence Lecture Notes in Computer Science (eds Schmidhuber, J. et al.) 11–20 (Springer, 2011).

  • Soares, N., Fallenstein, B., Yudkowsky, E. & Armstrong, S. Corrigibility. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop (AAAI, 2015).

  • Yampolskiy, R. V. Leakproofing the singularity: artificial intelligence confinement problem. J. Conscious. Stud. 19, 194–214 (2012).

  • Yampolskiy, R. V. in Philosophy and Theory of Artificial Intelligence Studies in Applied Philosophy, Epistemology and Rational Ethics (ed. Müller, V. C.) 389–396 (Springer, 2013).

  • Yampolskiy, R. V. Taxonomy of pathways to dangerous AI. Preprint at (2015).

  • Wheatley, S., Sovacool, B. K. & Sornette, D. Reassessing the safety of nuclear power. Energy Res. Soc. Sci. 15, 96–100 (2016).

  • Kobayashi, T. et al. Formal modelling of safety architecture for responsibility-aware autonomous vehicle via Event-B refinement. In Formal Methods, FM 2023: Lecture Notes in Computer Science Vol. 14000 (eds Katoen, J. P. et al.) 533–549 (Springer, 2023).

  • Tay, E. B., Gan, O. P. & Ho, W. K. A study on real-time artificial intelligence. IFAC Proc. Vol. 30, 109–114 (1997).

  • Hibbard, B., Bach, J., Goertzel, B. & Iklé, M. Avoiding unintended AI behaviors. In Artificial General Intelligence: Lecture Notes in Computer Science (eds Bach, J. et al.) 107–116 (Springer, 2012).

  • Sezener, C. E. Inferring human values for safe AGI design. In Artificial General Intelligence (eds Bieger, J., Goertzel, B. & Potapov, A.) Vol. 9205, 152–155 (Springer, 2015).

  • El Mhamdi, E. & Guerraoui, R. When neurons fail. In IEEE International Parallel and Distributed Processing Symposium 1028–1037 (IEEE, 2017).

  • Freiesleben, T. & Grote, T. Beyond generalization: a theory of robustness in machine learning. Synthese 202, 109 (2023).

  • Sanneman, L. & Shah, J. Transparent value alignment. In HRI ’23: Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction 557–560 (ACM, 2023).

  • Abbeel, P. & Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In ICML ’04: Proc. 21st International Conference on Machine Learning (ACM, 2004).

  • Murphy, B. et al. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proc. COLING 2012 1933–1950 (The COLING 2012 Organizing Committee, 2012).

  • Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T. & Hovy, E. SPINE: sparse interpretable neural embeddings. In Proc. AAAI Conference on Artificial Intelligence Vol. 32 (AAAI, 2018).

  • Shaham, U., Yamada, Y. & Negahban, S. Understanding adversarial training: increasing local stability of supervised models through robust optimization. Neurocomputing 307, 195–204 (2018).

  • Wu, G., Hashemi, M. & Srinivasa, C. PUMA: performance unchanged model augmentation for training data removal. In Proc. 36th AAAI Conference on Artificial Intelligence (AAAI, 2022).

  • Jing, S. & Yang, L. A robust extreme learning machine framework for uncertain data classification. J. Supercomput. 76, 2390–2416 (2020).

  • Gan, H. T., Li, Z. H., Fan, Y. L. & Luo, Z. Z. Dual learning-based safe semi-supervised learning. IEEE Access 6, 2615–2621 (2018).

  • Engstrom, L. et al. Adversarial robustness as a prior for learned representations. Preprint at (2019).

  • Brophy, J. & Lowd, D. Machine unlearning for random forests. In Proc. 38th International Conference on Machine Learning Vol. 139, 1092–1104 (PMLR, 2021).

  • Chundawat, V. S., Tarun, A. K., Mandal, M. & Kankanhalli, M. Zero-shot machine unlearning. IEEE Trans. Inf. Forensics Secur. 18, 2345–2354 (2023).

  • Chen, J. et al. ATOM: robustifying out-of-distribution detection using outlier mining. In Machine Learning and Knowledge Discovery in Databases. Research Track: Lecture Notes in Computer Science (eds Oliver, N. et al.) 430–445 (Springer, 2021).

  • Lakkaraju, H., Kamar, E., Caruana, R. & Horvitz, E. Identifying unknown unknowns in the open world: representations and policies for guided exploration. In AAAI’17: Proc. 31st AAAI Conference on Artificial Intelligence 2124–2132 (ACM, 2017).

  • Zhuo, J. B., Wang, S. H., Zhang, W. G. & Huang, Q. M. Deep unsupervised convolutional domain adaptation. In MM ’17: Proc. 25th ACM International Conference on Multimedia 261–269 (ACM, 2017).

  • Bossens, D. M. & Bishop, N. Explicit explore, exploit, or escape (E4): near-optimal safety-constrained reinforcement learning in polynomial time. Mach. Learn. 112, 817–858 (2023).

  • Massiani, P. F., Heim, S., Solowjow, F. & Trimpe, S. Safe value functions. IEEE Trans. Automat. Contr. 68, 2743–2757 (2023).

  • Shi, M., Liang, Y. & Shroff, N. A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints. In ICML’23: Proc. 40th International Conference on Machine Learning Article 1291, 31243–31268 (ACM, 2023).

  • Hunt, N. et al. Verifiably safe exploration for end-to-end reinforcement learning. In HSCC ’21: Proc. 24th International Conference on Hybrid Systems: Computation and Control (ACM, 2021).

  • Ma, Y. J., Shen, A., Bastani, O. & Dinesh, J. Conservative and adaptive penalty for model-based safe reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence Vol. 36, 5404–5412 (AAAI, 2022).

  • Zwane, S. et al. Safe trajectory sampling in model-based reinforcement learning. In 19th International Conference on Automation Science and Engineering (CASE) (IEEE, 2023).

  • Fischer, J., Eyberg, C., Werling, M. & Lauer, M. Sampling-based inverse reinforcement learning algorithms with safety constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems 791–798 (IEEE, 2021).

  • Zhou, Z., Liu, G. & Zhou, M. A robust mean-field actor-critic reinforcement learning against adversarial perturbations on agent states. IEEE Trans. Neural Netw. Learn. Syst. 1–12 (2023).

  • Bazzan, A. L. C. Aligning individual and collective welfare in complex socio-technical systems by combining metaheuristics and reinforcement learning. Eng. Appl. Artif. Intell. 79, 23–33 (2019).

  • Christoffersen, P. J., Haupt, A. A. & Hadfield-Menell, D. Get it in writing: formal contracts mitigate social dilemmas in multi-agent RL. In AAMAS ’23: Proc. 2023 International Conference on Autonomous Agents and Multiagent Systems 448–456 (ACM, 2023).

  • Christiano, P. F. et al. Deep reinforcement learning from human preferences. In NIPS’17: Proc. 31st International Conference on Neural Information Processing Systems 4302–4310 (2017).

  • Kaushik, D., Hovy, E. & Lipton, Z. C. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR, 2020).

  • Li, Z. Y., Zeng, J., Thirugnanam, A. & Sreenath, K. Bridging model-based safety and model-free reinforcement learning through system identification of low dimensional linear models. In Robotics: Science and Systems Paper 033 (2022).

  • Zhu, X., Kang, S. C. & Chen, J. Y. A contact-safe reinforcement learning framework for contact-rich robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2476–2482 (IEEE, 2022).

  • Terra, A., Riaz, H., Raizer, K., Hata, A. & Inam, R. Safety vs. efficiency: AI-based risk mitigation in collaborative robotics. In 6th International Conference on Control, Automation and Robotics 151–160 (ICCAR, 2020).

  • Adebayo, J. et al. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems Vol. 31 (NeurIPS, 2018).

  • Carlini, N. & Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP) 39–57 (IEEE, 2017).

  • Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 427–436 (IEEE, 2015).

  • Kaufmann, M. et al. Testing robustness against unforeseen adversaries. Preprint at (2019).

  • Ray, A., Achiam, J. & Amodei, D. Benchmarking safe exploration in deep reinforcement learning. OpenAI (2023).

  • Hendrycks, D. et al. What would Jiminy Cricket do? Towards agents that behave morally. In Proc. Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS, 2021).

  • Gardner, M. et al. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Cohn, T. et al.) 1307–1323 (Association for Computational Linguistics, 2020).

  • Spears, D. F. Agent technology from a formal perspective. In NASA Monographs in Systems and Software Engineering (eds Rouff, C. A. et al.) 227–257 (Springer, 2006).

  • Wozniak, E., Cârlan, C., Acar-Celik, E. & Putzer, H. A safety case pattern for systems with machine learning components. In Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops Vol. 12235, 370–382 (Springer, 2020).

  • Hendrycks, D. & Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. International Conference on Learning Representations (ICLR, 2019).

  • Gholampour, P. & Verma, R. Adversarial robustness of phishing email detection models. In IWSPA ’23: Proc. 9th ACM International Workshop on Security and Privacy Analytics 67–76 (ACM, 2023).

  • Nanda, N., Chan, L., Lieberum, T., Smith, J. & Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In 11th International Conference on Learning Representations (ICLR, 2023).

  • Olsson, C. et al. In-context learning and induction heads. Transformer Circuits Thread (2022).

  • Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations Poster (ICLR, 2015).

  • Kim, B. et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In Proc. 35th International Conference on Machine Learning 2668–2677 (PMLR, 2018).

  • Karpathy, A., Johnson, J. & Fei-Fei, L. Visualizing and understanding recurrent networks. Preprint at (2016).

  • Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C. & Botvinick, M. On the importance of single directions for generalization. Preprint at (2018).

  • Gyevnar, B., Wang, C., Lucas, C. G., Cohen, S. B. & Albrecht, S. V. Causal explanations for sequential decision-making in multi-agent systems. In AAMAS ’24: Proc. 23rd International Conference on Autonomous Agents and Multiagent Systems 771–779 (ACM, 2024).

  • Okawa, Y., Sasaki, T. & Iwane, H. Automatic exploration process adjustment for safe reinforcement learning with joint chance constraint satisfaction. IFAC-PapersOnLine 53, 1588–1595 (2020).

  • Gan, H. T., Luo, Z. Z., Meng, M., Ma, Y. L. & She, Q. S. A risk degree-based safe semi-supervised learning algorithm. Int. J. Mach. Learn. Cybern. 7, 85–94 (2016).

  • Zhang, Y. X. et al. Barrier Lyapunov Function-based safe reinforcement learning for autonomous vehicles with optimized backstepping. IEEE Trans. Neural Netw. Learn. Syst. 35, 2066–2080 (2024).

  • Nicolae, M.-I., Sebban, M., Habrard, A., Gaussier, E. & Amini, M.-R. Algorithmic robustness for semi-supervised (ϵ, γ, τ)-good metric learning. In International Conference on Neural Information Processing (ICONIP, 2015).

  • Khan, W. U. & Seto, E. A ‘do no harm’ novel safety checklist and research approach to determine whether to launch an artificial intelligence-based medical technology: introducing the biological-psychological, economic, and social (BPES) framework. J. Med. Internet Res. 25, e43386 (2023).

  • Schumeg, B., Marotta, F. & Werner, B. Proposed V-model for verification, validation, and safety activities for artificial intelligence. In 2023 IEEE International Conference on Assured Autonomy (ICAA) 61–66 (IEEE, 2023).

  • Kamm, S., Sahlab, N., Jazdi, N. & Weyrich, M. A concept for dynamic and robust machine learning with context modeling for heterogeneous manufacturing data. Procedia CIRP 118, 354–359 (2023).

  • Costa, E., Rebello, C., Fontana, M., Schnitman, L. & Nogueira, I. A robust learning methodology for uncertainty-aware scientific machine learning models. Mathematics 11, 74 (2023).

  • Aksjonov, A. & Kyrki, V. A safety-critical decision-making and control framework combining machine-learning-based and rule-based algorithms. SAE Int. J. Veh. Dyn. Stab. NVH 7, 287–299 (2023).

  • Antikainen, J. et al. A deployment model to extend ethically aligned AI implementation method ECCOLA. In 29th IEEE International Requirements Engineering Conference Workshops (eds Yue, T. & Mirakhorli, M.) 230–235 (IEEE, 2021).

  • Vakkuri, V., Kemell, K. K., Jantunen, M., Halme, E. & Abrahamsson, P. ECCOLA — a method for implementing ethically aligned AI systems. J. Syst. Softw. 182, 111067 (2021).

  • Zhang, H., Shahbazi, N., Chu, X. & Asudeh, A. FairRover: explorative model building for fair and responsible machine learning. In DEEM ’21: Proc. Fifth Workshop on Data Management for End-To-End Machine Learning Article 5, 1–10 (ACM, 2021).

  • Coston, A. et al. A validity perspective on evaluating the justified use of data-driven decision-making algorithms. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 690–704 (IEEE, 2023).

  • Gittens, A., Yener, B. & Yung, M. An adversarial perspective on accuracy, robustness, fairness, and privacy: multilateral-tradeoffs in trustworthy ML. IEEE Access 10, 120850–120865 (2022).

  • Taylor, J., Yudkowsky, E., LaVictoire, P. & Critch, A. Alignment for advanced machine learning systems. In Ethics of Artificial Intelligence (ed. Liao, S. M.) 342–382 (Oxford Academic, 2016).

  • Sotala, K. & Yampolskiy, R. V. Responses to catastrophic AGI risk: a survey. Phys. Scr. 90, 018001 (2014).

  • Johnson, B. Metacognition for artificial intelligence system safety — an approach to safe and desired behavior. Saf. Sci. 151, 105743 (2022).

  • Hatherall, L. et al. Responsible agency through answerability: cultivating the moral ecology of trustworthy autonomous systems. In TAS ’23: Proc. First International Symposium on Trustworthy Autonomous Systems Article 50, 1–5 (ACM, 2023).

  • Stahl, B. C. Embedding responsibility in intelligent systems: From AI ethics to responsible AI ecosystems. Sci. Rep. 13, 7586 (2023).

  • Samarasinghe, D. Counterfactual learning in enhancing resilience in autonomous agent systems. Front. Artif. Intell. 6, 1212336 (2023).

  • Diemert, S., Millet, L., Groves, J. & Joyce, J. Safety integrity levels for artificial intelligence. In Computer Safety, Reliability, and Security. SAFECOMP 2023 Workshops (eds Guiochet, J. et al.) Vol. 14182, 397–409 (Springer, 2023).

  • Wang, J. & Jia, R. Data Banzhaf: a robust data valuation framework for machine learning. In Proc. 26th International Conference on Artificial Intelligence and Statistics (AISTATS) Vol. 206, 6388–6421 (PMLR, 2023).

  • Everitt, T., Filan, D., Daswani, M. & Hutter, M. Self-modification of policy and utility function in rational agents. In Artificial General Intelligence: 9th International Conference, AGI 2016 (eds Steunebrink, B. et al.) Vol. 9782 (Springer, 2016).

  • Badea, C. & Artus, G. Morality, machines, and the interpretation problem: a value-based, Wittgensteinian approach to building moral agents. In Artificial Intelligence, AI 2022 (eds Bramer, M. & Stahl, F.) Vol. 39, 124–137 (Springer, 2022).

  • Umbrello, S. Beneficial artificial intelligence coordination by means of a value sensitive design approach. Big Data Cogn. Comput. 3, 5 (2019).

  • Yampolskiy, R. & Fox, J. Safety engineering for artificial general intelligence. Topoi 32, 217–226 (2012).

  • Weld, D. et al. The first law of robotics. In Safety and Security in Multiagent Systems: Lecture Notes in Computer Science (eds Barley, M. et al.) 90–100 (Springer, 2009).

  • Matthias, A. The responsibility gap: ascribing responsibility for the actions of learning automata. Ethics Inf. Technol. 6, 175–183 (2004).

  • Farina, L. Artificial intelligence systems, responsibility and agential self-awareness. In Philosophy and Theory of Artificial Intelligence 2021 Vol. 63 (ed. Müller, V. C.) 15–25 (Springer, 2022).

  • Lee, A. T. Flight Simulation: Virtual Environments in Aviation (Routledge, 2017).

  • Torres, E. P. Human Extinction: A History of the Science and Ethics of Annihilation (Routledge, 2023).

  • Bostrom, N. Existential risks: analyzing human extinction scenarios and related hazards. J. Evol. Technol. 9, 1–30 (2002).

  • Bostrom, N. Superintelligence: Paths, Dangers, Strategies (Oxford Univ. Press, 2014).

  • Ord, T. The Precipice: Existential Risk and the Future of Humanity (Hachette Books, 2020).

  • Dafoe, A. & Russell, S. Yes, we are worried about the existential risk of artificial intelligence. MIT Technology Review (2016).

  • Bronson, R. Measuring existential risk. Peace Policy (2023).

  • Roose, K. A.I. poses ‘risk of extinction’, industry leaders warn. The New York Times (2023).

  • Statement on AI risk. Center for AI Safety (2023).

  • Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. Preprint at (2023).

  • Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. Trans. Mach. Learn. Res. (2024).

  • Ganin, Y. et al. in Domain Adaptation in Computer Vision Applications (ed. Csurka, G.) 189–209 (Springer, 2017).

  • Balaji, Y., Sankaranarayanan, S. & Chellappa, R. MetaReg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems Vol. 31 (eds Bengio, S. et al.) (NeurIPS, 2018).

  • Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks. Preprint at (2019).

  • Turchetta, M. et al. Safe exploration for interactive machine learning. In Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (NeurIPS, 2019).

  • Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Proc. Syst. 35, 27730–27744 (2022).

  • Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proc. 5th International Conference on Learning Representations (ICLR, 2017).
