In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system may pursue some objectives, but not the intended ones.[1]
Aligning an AI system is often challenging because designers find it difficult to specify the full range of desired and undesired behaviors. They therefore often use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.[1][2]
Misaligned AI systems can malfunction and cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).[1][3][4] They may also develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their final goals.[1][5][6] Furthermore, they may develop undesirable emergent goals that are hard to detect before the system is deployed and encounters new situations and data distributions.[7][8]
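The gap between a proxy goal and the intended goal can be illustrated with a toy example. The following is a minimal sketch, not drawn from the cited sources: the cleaning-robot scenario, the action names, and the numeric values are all hypothetical and chosen only to show how maximizing a proxy (mess visible to a sensor) can select a different action than maximizing the intended objective (mess actually present).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    visible_mess: int  # what the robot's sensor reports (the proxy signal)
    actual_mess: int   # what the designer actually cares about
    effort: int        # cost of carrying out the action


# Hypothetical actions for a toy cleaning robot facing 10 units of mess.
ACTIONS = {
    "do_nothing": Outcome(visible_mess=10, actual_mess=10, effort=0),
    "clean_room": Outcome(visible_mess=0, actual_mess=0, effort=5),
    "cover_mess": Outcome(visible_mess=0, actual_mess=10, effort=1),
}


def proxy_reward(outcome: Outcome) -> int:
    """Reward the designer wrote down: penalize mess the sensor can see."""
    return -outcome.visible_mess - outcome.effort


def intended_reward(outcome: Outcome) -> int:
    """Reward the designer meant: penalize mess that actually exists."""
    return -outcome.actual_mess - outcome.effort


def best_action(reward_fn) -> str:
    """Return the action name that maximizes the given reward function."""
    return max(ACTIONS, key=lambda name: reward_fn(ACTIONS[name]))


if __name__ == "__main__":
    print("Proxy optimizer picks:   ", best_action(proxy_reward))
    print("Intended optimizer picks:", best_action(intended_reward))
```

In this sketch the proxy optimizer hides the mess because doing so is the cheapest way to satisfy the sensor, while the intended objective ranks that action below doing nothing at all. Real cases of reward hacking involve learned policies and far richer environments, so this should be read only as an illustration of the loophole described above.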
Today, these problems affect existing commercial systems such as language models,[9][10][11] robots,[12] autonomous vehicles,[13] and social media recommendation engines.[9][6][14] Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable.[15][3][2]
Many of the most-cited AI scientists,[16][17][18] including Geoffrey Hinton, Yoshua Bengio, and Stuart Russell, argue that AI is approaching human-like (AGI) and superhuman (ASI) cognitive capabilities and could endanger human civilization if misaligned.[19][6]
AI alignment is a subfield of AI safety, the study of how to build safe AI systems.[20] Other subfields of AI safety include robustness, monitoring, and capability control.[21] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.[21] Alignment research has connections to interpretability research,[22][23] (adversarial) robustness,[20] anomaly detection, calibrated uncertainty,[22] formal verification,[24] preference learning,[25][26][27] safety-critical engineering,[28] game theory,[29] algorithmic fairness,[20][30] and social sciences.[31]
Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022.
Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2022). "The Alignment Problem from a Deep Learning Perspective". International Conference on Learning Representations. arXiv:2209.00626.
Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (February 14, 2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. Retrieved July 21, 2022.
Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI". Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc. pp. 15763–15773. Retrieved March 11, 2023.
Carlsmith, Joseph (June 16, 2022). "Is Power-Seeking AI an Existential Risk?". arXiv:2206.13353 [cs.CY].
Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control. Penguin Random House. ISBN 9780525558637. OCLC 1113410915.
Christian, Brian (2020). The alignment problem: Machine learning and human values. W. W. Norton & Company. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023. Retrieved September 12, 2022.
Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David (June 28, 2022). "Goal Misgeneralization in Deep Reinforcement Learning". Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–12019. Retrieved March 11, 2023.
Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik (July 12, 2022). "On the Opportunities and Risks of Foundation Models". Stanford CRFM. arXiv:2108.07258.
Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob; Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.; Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
Zaremba, Wojciech; Brockman, Greg; OpenAI (August 10, 2021). "OpenAI Codex". OpenAI. Archived from the original on February 3, 2023. Retrieved July 23, 2022.
Kober, Jens; Bagnell, J. Andrew; Peters, Jan (September 1, 2013). "Reinforcement learning in robotics: A survey". The International Journal of Robotics Research. 32 (11): 1238–1274. doi:10.1177/0278364913495721. ISSN 0278-3649. S2CID 1932843. Archived from the original on October 15, 2022. Retrieved September 12, 2022.
Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (March 1, 2023). "Reward (Mis)design for autonomous driving". Artificial Intelligence. 316: 103829. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829. ISSN 0004-3702. S2CID 233423198.
Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being". International Journal of Community Well-Being. 3 (4): 443–463. doi:10.1007/s42413-020-00086-3. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676.
Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach. Prentice Hall. p. 1003. ISBN 978-0-13-461099-3.
"Statement on AI Risk | CAIS". www.safe.ai. Retrieved February 11, 2024.
Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan (January 5, 2024), Thousands of AI Authors on the Future of AI, arXiv:2401.02843
Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'". Forbes. Retrieved May 4, 2023.
Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (June 21, 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (September 27, 2018). "Building safe artificial intelligence: specification, robustness, and assurance". DeepMind Safety Research – Medium. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
Rorvig, Mordechai (April 14, 2022). "Researchers Gain New Understanding From Simple AI". Quanta Magazine. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
Doshi-Velez, Finale; Kim, Been (March 2, 2017). "Towards A Rigorous Science of Interpretable Machine Learning". arXiv:1702.08608 [stat.ML].
Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural networks" (Podcast). 80,000 hours. No. 107. Retrieved July 23, 2022.
Russell, Stuart; Dewey, Daniel; Tegmark, Max (December 31, 2015). "Research Priorities for Robust and Beneficial Artificial Intelligence". AI Magazine. 36 (4): 105–114. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. ISSN 2371-9621. S2CID 8174496. Archived from the original on February 2, 2023. Retrieved September 12, 2022.
Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of preference-based reinforcement learning methods". Journal of Machine Learning Research. 18 (136): 1–46.
Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
Heaven, Will Douglas (January 27, 2022). "The new version of GPT-3 is much better behaved (and should be less toxic)". MIT Technology Review. Archived from the original on February 10, 2023. Retrieved July 18, 2022.
Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay (March 7, 2022). "Taxonomy of Machine Learning Safety: A Survey and Primer". arXiv:2106.04823 [cs.LG].
Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda". Center on Long-Term Risk. Archived from the original on January 1, 2023. Retrieved July 18, 2022.
Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore (May 6, 2021). "Cooperative AI: machines must learn to find common ground". Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D. doi:10.1038/d41586-021-01170-0. ISSN 0028-0836. PMID 33947992. S2CID 233740521. Archived from the original on December 18, 2022. Retrieved September 12, 2022.
Prunkl, Carina; Whittlestone, Jess (February 7, 2020). "Beyond Near- and Long-Term". Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. pp. 138–143. doi:10.1145/3375627.3375803. ISBN 978-1-4503-7110-0. S2CID 210164673. Archived from the original on October 16, 2022. Retrieved September 12, 2022.
Irving, Geoffrey; Askell, Amanda (February 19, 2019). "AI Safety Needs Social Scientists". Distill. 4 (2): 10.23915/distill.00014. doi:10.23915/distill.00014. ISSN 2476-0757. S2CID 159180422. Archived from the original on February 10, 2023. Retrieved September 12, 2022.