Fazl Barez
Title
Cited by
Year
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 84*; Year: 2024
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
R Mihalcea, J Chai, A Sarkar
Proceedings of the 2015 Conference of the North American Chapter of the …, 2024
Cited by: 57*; Year: 2024
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
J Hoelscher-Obermaier*, J Persson*, E Kran, I Konstas, F Barez*
ACL 2023, 2023
Cited by: 45; Year: 2023
PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration
P Li, H Tang, T Yang, X Hao, T Sang, Y Zheng, J Hao, ME Taylor, Z Wang, ...
ICML 2023, 2022
Cited by: 35; Year: 2022
The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python
AVM Barone*, F Barez*, I Konstas, SB Cohen
ACL 2023, 2023
Cited by: 30*; Year: 2023
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ...
arXiv preprint arXiv:2406.10162, 2024
Cited by: 28*; Year: 2024
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ...
ICML 2023, 2024
Cited by: 25*; Year: 2024
Neuron to Graph: Interpreting Language Model Neurons at Scale
A Foote*, N Nanda, E Kran, I Konstas, S Cohen, F Barez*
ICLR 2023 Workshop RTML, 2023
Cited by: 23*; Year: 2023
Understanding Addition in Transformers
P Quirke, F Barez
ICLR 2024, 2023
Cited by: 22; Year: 2023
Exploring the advantages of transformers for high-frequency trading
F Barez, P Bilokon, A Gervais, N Lisitsyn
arXiv preprint arXiv:2302.13850, 2023
Cited by: 7; Year: 2023
Towards interpreting visual information processing in vision-language models
C Neo, L Ong, P Torr, M Geva, D Krueger, F Barez
arXiv preprint arXiv:2410.07149, 2024
Cited by: 6; Year: 2024
Large Language Models Relearn Removed Concepts
M Lo*, SB Cohen, F Barez*
ACL 2024, 2024
Cited by: 6; Year: 2024
Benchmarking specialized databases for high-frequency data
F Barez, P Bilokon, R Xiong
arXiv preprint arXiv:2301.12561, 2023
Cited by: 6; Year: 2023
Fairness in AI and Its Long-Term Implications on Society
O Bohdal*, T Hospedales, PHS Torr, F Barez*
Stanford Existential Safety journal, 2023
Cited by: 5; Year: 2023
Open problems in machine unlearning for AI safety
F Barez, T Fu, A Prabhu, S Casper, A Sanyal, A Bibi, A O'Gara, R Kirk, ...
arXiv preprint arXiv:2501.04952, 2025
Cited by: 4; Year: 2025
Sparse autoencoders reveal universal feature spaces across large language models
M Lan, P Torr, A Meek, A Khakzar, D Krueger, F Barez
arXiv preprint arXiv:2410.06981, 2024
Cited by: 4; Year: 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
P Quirke, C Neo, F Barez
arXiv preprint arXiv:2402.02619, 2024
Cited by: 4; Year: 2024
Value-Evolutionary-Based Reinforcement Learning
P Li, HAO Jianye, H Tang, Y Zheng, F Barez
ICML 2023, 2023
Cited by: 4; Year: 2023
Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small
C Mathwin, G Corlouer, E Kran, F Barez, N Nanda
URL: https://itch.io/jam/mechint/rate/1889871, 2023
Cited by: 4; Year: 2023
System III: Learning with Domain Knowledge for Safety Constraints
F Barez, H Hasanbeig, A Abbate
NeurIPS ML Safety Workshop 2022, 2022
Cited by: 4; Year: 2022