Sleeper agents: Training deceptive LLMs that persist through safety training E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024 | 84* | 2024 |
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies R Mihalcea, J Chai, A Sarkar Proceedings of the 2015 Conference of the North American Chapter of the …, 2024 | 57* | 2024 |
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark J Hoelscher-Obermaier*, J Persson*, E Kran, I Konstas, F Barez* ACL 2023, 2023 | 45 | 2023 |
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ... ICML 2022, 2022 | 35 | 2022 |
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python AVM Barone*, F Barez*, I Konstas, SB Cohen ACL 2023, 2023 | 30* | 2023 |
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... arXiv preprint arXiv:2406.10162, 2024 | 28* | 2024 |
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ... ICML 2024, 2024 | 25* | 2024 |
Neuron to Graph: Interpreting Language Model Neurons at Scale A Foote*, N Nanda, E Kran, I Konstas, S Cohen, F Barez* ICLR 2023 Workshop RTML, 2023 | 23* | 2023 |
Understanding Addition in Transformers P Quirke, F Barez ICLR 2024, 2023 | 22 | 2023 |
Exploring the advantages of transformers for high-frequency trading F Barez, P Bilokon, A Gervais, N Lisitsyn arXiv preprint arXiv:2302.13850, 2023 | 7 | 2023 |
Towards interpreting visual information processing in vision-language models C Neo, L Ong, P Torr, M Geva, D Krueger, F Barez arXiv preprint arXiv:2410.07149, 2024 | 6 | 2024 |
Large Language Models Relearn Removed Concepts M Lo*, SB Cohen, F Barez* ACL 2024, 2024 | 6 | 2024 |
Benchmarking specialized databases for high-frequency data F Barez, P Bilokon, R Xiong arXiv preprint arXiv:2301.12561, 2023 | 6 | 2023 |
Fairness in AI and Its Long-Term Implications on Society O Bohdal*, T Hospedales, PHS Torr, F Barez* Stanford Existential Safety journal, 2023 | 5 | 2023 |
Open problems in machine unlearning for AI safety F Barez, T Fu, A Prabhu, S Casper, A Sanyal, A Bibi, A O'Gara, R Kirk, ... arXiv preprint arXiv:2501.04952, 2025 | 4 | 2025 |
Sparse autoencoders reveal universal feature spaces across large language models M Lan, P Torr, A Meek, A Khakzar, D Krueger, F Barez arXiv preprint arXiv:2410.06981, 2024 | 4 | 2024 |
Increasing Trust in Language Models through the Reuse of Verified Circuits P Quirke, C Neo, F Barez arXiv preprint arXiv:2402.02619, 2024 | 4 | 2024 |
Value-Evolutionary-Based Reinforcement Learning P Li, J Hao, H Tang, Y Zheng, F Barez ICML 2023, 2023 | 4 | 2023 |
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small C Mathwin, G Corlouer, E Kran, F Barez, N Nanda URL: https://itch.io/jam/mechint/rate/1889871, 2023 | 4 | 2023 |
System III: Learning with Domain Knowledge for Safety Constraints F Barez, H Hasanbeig, A Abbate NeurIPS ML Safety Workshop 2022, 2022 | 4 | 2022 |