Authors: Gheorghe Comanici (Xinyi), Eric Bieber (Xinyi), Mike Schaekermann (Xinyi), Ice Pasupat (Xinyi), Noveen Sachdeva (Xinyi), Inderjit Dhillon (Xinyi), Marcel Blistein (Xinyi), Ori Ram (Xinyi), Dan Zhang (Xinyi), Evan Rosen (Xinyi), Luke Marris (Xinyi), Sam Petulla (Xinyi), Colin Gaffney (Xinyi), Asaf Aharoni (Xinyi), Nathan Lintz (Xinyi), Tiago Cardal Pais (Xinyi), Henrik Jacobsson (Xinyi), Idan Szpektor (Xinyi), Nan-Jiang Jiang (Xinyi), Krishna Haridasan (Xinyi), Ahmed Omran (Xinyi), Nikunj Saunshi (Xinyi), Dara Bahri (Xinyi), Gaurav Mishra (Xinyi), Eric Chu (Xinyi), Toby Boyd (Xinyi), Brad Hekman (Xinyi), Aaron Parisi (Xinyi), Chaoyi Zhang (Xinyi), Kornraphop Kawintiranon (Xinyi), Tania Bedrax-Weiss (Xinyi), Oliver Wang (Xinyi), Ya Xu (Xinyi), Ollie Purkiss (Xinyi), Uri Mendlovic (Xinyi), Ila\"i Deutel (Xinyi), Nam Nguyen (Xinyi), Adam Langley (Xinyi), Flip Korn (Xinyi), Lucia Rossazza (Xinyi), Alexandre Ram\'e (Xinyi), Sagar Waghmare (Xinyi), Helen Miller (Xinyi), Vaishakh Keshava (Xinyi), Ying Jian (Xinyi), Xiaofan Zhang (Xinyi), Raluca Ada Popa (Xinyi), Kedar Dhamdhere (Xinyi), Bla\v{z} Bratani\v{c} (Xinyi), Kyuyeun Kim (Xinyi), Terry Koo (Xinyi), Ferran Alet (Xinyi), Yi-ting Chen (Xinyi), Arsha Nagrani (Xinyi), Hannah Muckenhirn (Xinyi), Zhiyuan Zhang (Xinyi), Corbin Quick (Xinyi), Filip Paveti\'c (Xinyi), Duc Dung Nguyen (Xinyi), Joao Carreira (Xinyi), Michael Elabd (Xinyi), Haroon Qureshi (Xinyi), Fabian Mentzer (Xinyi), Yao-Yuan Yang (Xinyi), Danielle Eisenbud (Xinyi), Anmol Gulati (Xinyi), Ellie Talius (Xinyi), Eric Ni (Xinyi), Sahra Ghalebikesabi (Xinyi), Edouard Yvinec (Xinyi), Alaa Saade (Xinyi), Thatcher Ulrich (Xinyi), Lorenzo Blanco (Xinyi), Dan A. Calian (Xinyi), Muhuan Huang (Xinyi), A\"aron van den Oord (Xinyi), Naman Goyal (Xinyi), Terry Chen (Xinyi), Praynaa Rawlani (Xinyi), Christian Schallhart (Xinyi), Swachhand Lokhande (Xinyi), Xianghong Luo (Xinyi), Jyn Shan (Xinyi), Ceslee Montgomery (Xinyi), Victoria Krakovna (Xinyi), Federico Piccinini (Xinyi), Omer Barak (Xinyi), Jingyu Cui (Xinyi), Yiling Jia (Xinyi), Mikhail Dektiarev (Xinyi), Alexey Kolganov (Xinyi), Shiyu Huang (Xinyi), Zhe Chen (Xinyi), Xingyu Wang (Xinyi), Jessica Austin (Xinyi), Peter de Boursac (Xinyi), Evgeny Sluzhaev (Xinyi), Frank Ding (Xinyi), Huijian Li (Xinyi), Surya Bhupatiraju (Xinyi), Mohit Agarwal (Xinyi), S{\l}awek Kwasiborski (Xinyi), Paramjit Sandhu (Xinyi), Patrick Siegler (Xinyi), Ahmet Iscen (Xinyi), Eyal Ben-David (Xinyi), Shiraz Butt (Xinyi), Miltos Allamanis (Xinyi), Seth Benjamin (Xinyi), Robert Busa-Fekete (Xinyi), Felix Hernandez-Campos (Xinyi), Sasha Goldshtein (Xinyi), Matt Dibb (Xinyi), Weiyang Zhang (Xinyi), Annie Marsden (Xinyi), Carey Radebaugh (Xinyi), Stephen Roller (Xinyi), Abhishek Nayyar (Xinyi), Jacob Austin (Xinyi), Tayfun Terzi (Xinyi), Bhargav Kanagal Shamanna (Xinyi), Pete Shaw (Xinyi), Aayush Singh (Xinyi), Florian Luisier (Xinyi), Artur Mendon\c{c}a (Xinyi), Vaibhav Aggarwal (Xinyi), Larisa Markeeva (Xinyi), Claudio Fantacci (Xinyi), Sergey Brin (Xinyi), HyunJeong Choe (Xinyi), Guanyu Wang (Xinyi), Hartwig Adam (Xinyi), Avigail Dabush (Xinyi), Tatsuya Kiyono (Xinyi), Eyal Marcus (Xinyi), Jeremy Cole (Xinyi), Theophane Weber (Xinyi), Hongrae Lee (Xinyi), Ronny Huang (Xinyi), Alex Muzio (Xinyi), Leandro Kieliger (Xinyi), Maigo Le (Xinyi), Courtney Biles (Xinyi), Long Le (Xinyi), Archit Sharma (Xinyi), Chengrun Yang (Xinyi), Avery Lamp (Xinyi), Dave Dopson (Xinyi), Nate Hurley (Xinyi), Katrina (Xinyi), Xu (Jerry), Zhihao Shan (Jerry), Shuang Song (Jerry), 
Jiewen Tan (Jerry), Alexandre Senges (Jerry), George Zhang (Jerry), Chong You (Jerry), Yennie Jun (Jerry), David Raposo (Jerry), Susanna Ricco (Jerry), Xuan Yang (Jerry), Weijie Chen (Jerry), Prakhar Gupta (Jerry), Arthur Szlam (Jerry), Kevin Villela (Jerry), Chun-Sung Ferng (Jerry), Daniel Kasenberg (Jerry), Chen Liang (Jerry), Rui Zhu (Jerry), Arunachalam Narayanaswamy (Jerry), Florence Perot (Jerry), Paul Pucciarelli (Jerry), Anna Shekhawat (Jerry), Alexey Stern (Jerry), Rishikesh Ingale (Jerry), Stefani Karp (Jerry), Sanaz Bahargam (Jerry), Adrian Goedeckemeyer (Jerry), Jie Han (Jerry), Sicheng Li (Jerry), Andrea Tacchetti (Jerry), Dian Yu (Jerry), Abhishek Chakladar (Jerry), Zhiying Zhang (Jerry), Mona El Mahdy (Jerry), Xu Gao (Jerry), Dale Johnson (Jerry), Samrat Phatale (Jerry), AJ Piergiovanni (Jerry), Hyeontaek Lim (Jerry), Clement Farabet (Jerry), Carl Lebsack (Jerry), Theo Guidroz (Jerry), John Blitzer (Jerry), Nico Duduta (Jerry), David Madras (Jerry), Steve Li (Jerry), Daniel von Dincklage (Jerry), Xin Li (Jerry), Mahdis Mahdieh (Jerry), George Tucker (Jerry), Ganesh Jawahar (Jerry), Owen Xiao (Jerry), Danny Tarlow (Jerry), Robert Geirhos (Jerry), Noam Velan (Jerry), Daniel Vlasic (Jerry), Kalesha Bullard (Jerry), SK Park (Jerry), Nishesh Gupta (Jerry), Kellie Webster (Jerry), Ayal Hitron (Jerry), Jieming Mao (Jerry), Julian Eisenschlos (Jerry), Laurel Prince (Jerry), Nina D'Souza (Jerry), Kelvin Zheng (Jerry), Sara Nasso (Jerry), Gabriela Botea (Jerry), Carl Doersch (Jerry), Caglar Unlu (Jerry), Chris Alberti (Jerry), Alexey Svyatkovskiy (Jerry), Ankita Goel (Jerry), Krzysztof Choromanski (Jerry), Pan-Pan Jiang (Jerry), Richard Nguyen (Jerry), Four Flynn (Jerry), Daria \'Curko (Jerry), Peter Chen (Jerry), Nicholas Roth (Jerry), Kieran Milan (Jerry), Caleb Habtegebriel (Jerry), Shashi Narayan (Jerry), Michael Moffitt (Jerry), Jake Marcus (Jerry), Thomas Anthony (Jerry), Brendan McMahan (Jerry), Gowoon Cheon (Jerry), Ruibo Liu (Jerry), Megan Barnes (Jerry), Lukasz Lew (Jerry), Rebeca Santamaria-Fernandez (Jerry), Mayank Upadhyay (Jerry), Arjun Akula (Jerry), Arnar Mar Hrafnkelsson (Jerry), Alvaro Caceres (Jerry), Andrew Bunner (Jerry), Michal Sokolik (Jerry), Subha Puttagunta (Jerry), Lawrence Moore (Jerry), Berivan Isik (Jerry), Weilun Chen (Jerry), Jay Hartford (Jerry), Lawrence Chan (Jerry), Pradeep Shenoy (Jerry), Dan Holtmann-Rice (Jerry), Jane Park (Jerry), Fabio Viola (Jerry), Alex Salcianu (Jerry), Sujeevan Rajayogam (Jerry), Ian Stewart-Binks (Jerry), Zelin Wu (Jerry), Richard Everett (Jerry), Xi Xiong (Jerry), Pierre-Antoine Manzagol (Jerry), Gary Leung (Jerry), Carl Saroufim (Jerry), Bo Pang (Jerry), Dawid Wegner (Jerry), George Papamakarios (Jerry), Jennimaria Palomaki (Jerry), Helena Pankov (Jerry), Guangda Lai (Jerry), Guilherme Tubone (Jerry), Shubin Zhao (Jerry), Theofilos Strinopoulos (Jerry), Seth Neel (Jerry), Mingqiu Wang (Jerry), Joe Kelley (Jerry), Li Li (Jerry), Pingmei Xu (Jerry), Anitha Vijayakumar (Jerry), Andrea D'olimpio (Jerry), Omer Levy (Jerry), Massimo Nicosia (Jerry), Grigory Rozhdestvenskiy (Jerry), Ni Lao (Jerry), Sirui Xie (Jerry), Yash Katariya (Jerry), Jon Simon (Jerry), Sanjiv Kumar (Jerry), Florian Hartmann (Jerry), Michael Kilgore (Jerry), Jinhyuk Lee (Jerry), Aroma Mahendru (Jerry), Roman Ring (Jerry), Tom Hennigan (Jerry), Fiona Lang (Jerry), Colin Cherry (Jerry), David Steiner (Jerry), Dawsen Hwang (Jerry), Ray Smith (Jerry), Pidong Wang (Jerry), Jeremy Chen (Jerry), Ming-Hsuan Yang (Jerry), Sam Kwei (Jerry), Philippe Schlattner 
(Jerry), Donnie Kim (Jerry), Ganesh Poomal Girirajan (Jerry), Nikola Momchev (Jerry), Ayushi Agarwal (Jerry), Xingyi Zhou (Jerry), Ilkin Safarli (Jerry), Zachary Garrett (Jerry), AJ Pierigiovanni (Jerry), Sarthak Jauhari (Jerry), Alif Raditya Rochman (Jerry), Shikhar Vashishth (Jerry), Quan Yuan (Jerry), Christof Angermueller (Jerry), Jon Blanton (Jerry), Xinying Song (Jerry), Nitesh Bharadwaj Gundavarapu (Jerry), Thi Avrahami (Jerry), Maxine Deines (Jerry), Subhrajit Roy (Jerry), Manish Gupta (Jerry), Christopher Semturs (Jerry), Shobha Vasudevan (Jerry), Aditya Srikanth Veerubhotla (Jerry), Shriya Sharma (Jerry), Josh Jacob (Jerry), Zhen Yang (Jerry), Andreas Terzis (Jerry), Dan Karliner (Jerry), Auriel Wright (Jerry), Tania Rojas-Esponda (Jerry), Ashley Brown (Jerry), Abhijit Guha Roy (Jerry), Pawan Dogra (Jerry), Andrei Kapishnikov (Jerry), Peter Young (Jerry), Wendy Kan (Jerry), Vinodh Kumar Rajendran (Jerry), Maria Ivanova (Jerry), Salil Deshmukh (Jerry), Chia-Hua Ho (Jerry), Mike Kwong (Jerry), Stav Ginzburg (Jerry), Annie Louis (Jerry), KP Sawhney (Jerry), Slav Petrov (Jerry), Jing Xie (Jerry), Yunfei Bai (Jerry), Georgi Stoyanov (Jerry), Alex Fabrikant (Jerry), Rajesh Jayaram (Jerry), Yuqi Li (Jerry), Joe Heyward (Jerry), Justin Gilmer (Jerry), Yaqing Wang (Jerry), Radu Soricut (Jerry), Luyang Liu (Jerry), Qingnan Duan (Jerry), Jamie Hayes (Jerry), Maura O'Brien (Jerry), Gaurav Singh Tomar (Jerry), Sivan Eiger (Jerry), Bahar Fatemi (Jerry), Jeffrey Hui (Jerry), Catarina Barros (Jerry), Adaeze Chukwuka (Jerry), Alena Butryna (Jerry), Saksham Thakur (Jerry), Austin Huang (Jerry), Zhufeng Pan (Jerry), Haotian Tang (Jerry), Serkan Cabi (Jerry), Tulsee Doshi (Jerry), Michiel Bakker (Jerry), Sumit Bagri (Jerry), Ruy Ley-Wild (Jerry), Adam Lelkes (Jerry), Jennie Lees (Jerry), Patrick Kane (Jerry), David Greene (Jerry), Shimu Wu (Jerry), J\"org Bornschein (Jerry), Gabriela Surita (Jerry), Sarah Hodkinson (Jerry), Fangtao Li (Jerry), Chris Hidey (Jerry), S\'ebastien Pereira (Jerry), Sean Ammirati (Jerry), Phillip Lippe (Jerry), Adam Kraft (Jerry), Pu Han (Jerry), Sebastian Gerlach (Jerry), Zifeng Wang (Jerry), Liviu Panait (Jerry), Feng Han (Jerry), Brian Farris (Jerry), Yingying Bi (Jerry), Hannah DeBalsi (Jerry), Miaosen Wang (Jerry), Gladys Tyen (Jerry), James Cohan (Jerry), Susan Zhang (Jerry), Jarred Barber (Jerry), Da-Woon Chung (Jerry), Jaeyoun Kim (Jerry), Markus Kunesch (Jerry), Steven Pecht (Jerry), Nami Akazawa (Jerry), Abe Friesen (Jerry), James Lyon (Jerry), Ali Eslami (Jerry), Junru Wu (Jerry), Jie Tan (Jerry), Yue Song (Jerry), Ravi Kumar (Jerry), Chris Welty (Jerry), Ilia Akolzin (Jerry), Gena Gibson (Jerry), Sean Augenstein (Jerry), Arjun Pillai (Jerry), Nancy Yuen (Jerry), Du Phan (Jerry), Xin Wang (Jerry), Iain Barr (Jerry), Heiga Zen (Jerry), Nan Hua (Jerry), Casper Liu (Jerry), Jilei (Jerry), Wang (Elena), Tanuj Bhatia (Elena), Hao Xu (Elena), Oded Elyada (Elena), Pushmeet Kohli (Elena), Mirek Ol\v{s}\'ak (Elena), Ke Chen (Elena), Azalia Mirhoseini (Elena), Noam Shazeer (Elena), Shoshana Jakobovits (Elena), Maggie Tran (Elena), Nolan Ramsden (Elena), Tarun Bharti (Elena), Fred Alcober (Elena), Yunjie Li (Elena), Shilpa Shetty (Elena), Jing Chen (Elena), Dmitry Kalashnikov (Elena), Megha Nawhal (Elena), Sercan Arik (Elena), Hanwen Chen (Elena), Michiel Blokzijl (Elena), Shubham Gupta (Elena), James Rubin (Elena), Rigel Swavely (Elena), Sophie Bridgers (Elena), Ian Gemp (Elena), Chen Su (Elena), Arun Suggala (Elena), Juliette Pluto (Elena), Mary Cassin (Elena), Alain 
Vaucher (Elena), Kaiyang Ji (Elena), Jiahao Cai (Elena), Andrew Audibert (Elena), Animesh Sinha (Elena), David Tian (Elena), Efrat Farkash (Elena), Amy Hua (Elena), Jilin Chen (Elena), Duc-Hieu Tran (Elena), Edward Loper (Elena), Nicole Brichtova (Elena), Lara McConnaughey (Elena), Ballie Sandhu (Elena), Robert Leland (Elena), Doug DeCarlo (Elena), Andrew Over (Elena), James Huang (Elena), Xing Wu (Elena), Connie Fan (Elena), Eric Li (Elena), Yun Lei (Elena), Deepak Sharma (Elena), Cosmin Paduraru (Elena), Luo Yu (Elena), Matko Bo\v{s}njak (Elena), Phuong Dao (Elena), Min Choi (Elena), Sneha Kudugunta (Elena), Jakub Adamek (Elena), Carlos Gu\'ia (Elena), Ali Khodaei (Elena), Jie Feng (Elena), Wenjun Zeng (Elena), David Welling (Elena), Sandeep Tata (Elena), Christina Butterfield (Elena), Andrey Vlasov (Elena), Seliem El-Sayed (Elena), Swaroop Mishra (Elena), Tara Sainath (Elena), Shentao Yang (Elena), RJ Skerry-Ryan (Elena), Jeremy Shar (Elena), Robert Berry (Elena), Arunkumar Rajendran (Elena), Arun Kandoor (Elena), Andrea Burns (Elena), Deepali Jain (Elena), Tom Stone (Elena), Wonpyo Park (Elena), Shibo Wang (Elena), Albin Cassirer (Elena), Guohui Wang (Elena), Hayato Kobayashi (Elena), Sergey Rogulenko (Elena), Vineetha Govindaraj (Elena), Miko{\l}aj Rybi\'nski (Elena), Nadav Olmert (Elena), Colin Evans (Elena), Po-Sen Huang (Elena), Kelvin Xu (Elena), Premal Shah (Elena), Terry Thurk (Elena), Caitlin Sikora (Elena), Mu Cai (Elena), Jin Xie (Elena), Elahe Dabir (Elena), Saloni Shah (Elena), Norbert Kalb (Elena), Carrie Zhang (Elena), Shruthi Prabhakara (Elena), Amit Sabne (Elena), Artiom Myaskovsky (Elena), Vikas Raunak (Elena), Blanca Huergo (Elena), Behnam Neyshabur (Elena), Jon Clark (Elena), Ye Zhang (Elena), Shankar Krishnan (Elena), Eden Cohen (Elena), Dinesh Tewari (Elena), James Lottes (Elena), Yumeya Yamamori (Elena), Hui (Elena), Li (Tu\'\^an), Mohamed Elhawaty (Tu\'\^an), Ada Maksutaj Oflazer (Tu\'\^an), Adri\`a Recasens (Tu\'\^an), Sheryl Luo (Tu\'\^an), Duy Nguyen (Tu\'\^an), Taylor Bos (Tu\'\^an), Kalyan Andra (Tu\'\^an), Ana Salazar (Tu\'\^an), Ed Chi (Tu\'\^an), Jeongwoo Ko (Tu\'\^an), Matt Ginsberg (Tu\'\^an), Anders Andreassen (Tu\'\^an), Anian Ruoss (Tu\'\^an), Todor Davchev (Tu\'\^an), Elnaz Davoodi (Tu\'\^an), Chenxi Liu (Tu\'\^an), Min Kim (Tu\'\^an), Santiago Ontanon (Tu\'\^an), Chi Ming To (Tu\'\^an), Dawei Jia (Tu\'\^an), Rosemary Ke (Tu\'\^an), Jing Wang (Tu\'\^an), Anna Korsun (Tu\'\^an), Moran Ambar (Tu\'\^an), Ilya Kornakov (Tu\'\^an), Irene Giannoumis (Tu\'\^an), Toni Creswell (Tu\'\^an), Denny Zhou (Tu\'\^an), Yi Su (Tu\'\^an), Ishaan Watts (Tu\'\^an), Aleksandr Zaks (Tu\'\^an), Evgenii Eltyshev (Tu\'\^an), Ziqiang Feng (Tu\'\^an), Sidharth Mudgal (Tu\'\^an), Alex Kaskasoli (Tu\'\^an), Juliette Love (Tu\'\^an), Kingshuk Dasgupta (Tu\'\^an), Sam Shleifer (Tu\'\^an), Richard Green (Tu\'\^an), Sungyong Seo (Tu\'\^an), Chansoo Lee (Tu\'\^an), Dale Webster (Tu\'\^an), Prakash Shroff (Tu\'\^an), Ganna Raboshchuk (Tu\'\^an), Isabel Leal (Tu\'\^an), James Manyika (Tu\'\^an), Sofia Erell (Tu\'\^an), Daniel Murphy (Tu\'\^an), Zhisheng Xiao (Tu\'\^an), Anton Bulyenov (Tu\'\^an), Julian Walker (Tu\'\^an), Mark Collier (Tu\'\^an), Matej Kastelic (Tu\'\^an), Nelson George (Tu\'\^an), Sushant Prakash (Tu\'\^an), Sailesh Sidhwani (Tu\'\^an), Alexey Frolov (Tu\'\^an), Steven Hansen (Tu\'\^an), Petko Georgiev (Tu\'\^an), Tiberiu Sosea (Tu\'\^an), Chris Apps (Tu\'\^an), Aishwarya Kamath (Tu\'\^an), David Reid (Tu\'\^an), Emma Cooney (Tu\'\^an), Charlotte Magister (Tu\'\^an), 
Oriana Riva (Tu\'\^an), Alec Go (Tu\'\^an), Pu-Chin Chen (Tu\'\^an), Sebastian Krause (Tu\'\^an), Nir Levine (Tu\'\^an), Marco Fornoni (Tu\'\^an), Ilya Figotin (Tu\'\^an), Nick Roy (Tu\'\^an), Parsa Mahmoudieh (Tu\'\^an), Vladimir Magay (Tu\'\^an), Mukundan Madhavan (Tu\'\^an), Jin Miao (Tu\'\^an), Jianmo Ni (Tu\'\^an), Yasuhisa Fujii (Tu\'\^an), Ian Chou (Tu\'\^an), George Scrivener (Tu\'\^an), Zak Tsai (Tu\'\^an), Siobhan Mcloughlin (Tu\'\^an), Jeremy Selier (Tu\'\^an), Sandra Lefdal (Tu\'\^an), Jeffrey Zhao (Tu\'\^an), Abhijit Karmarkar (Tu\'\^an), Kushal Chauhan (Tu\'\^an), Shivanker Goel (Tu\'\^an), Zhaoyi Zhang (Tu\'\^an), Vihan Jain (Tu\'\^an), Parisa Haghani (Tu\'\^an), Mostafa Dehghani (Tu\'\^an), Jacob Scott (Tu\'\^an), Erin Farnese (Tu\'\^an), Anastasija Ili\'c (Tu\'\^an), Steven Baker (Tu\'\^an), Julia Pawar (Tu\'\^an), Li Zhong (Tu\'\^an), Josh Camp (Tu\'\^an), Yoel Zeldes (Tu\'\^an), Shravya Shetty (Tu\'\^an), Anand Iyer (Tu\'\^an), V\'it List\'ik (Tu\'\^an), Jiaxian Guo (Tu\'\^an), Luming Tang (Tu\'\^an), Mark Geller (Tu\'\^an), Simon Bucher (Tu\'\^an), Yifan Ding (Tu\'\^an), Hongzhi Shi (Tu\'\^an), Carrie Muir (Tu\'\^an), Dominik Grewe (Tu\'\^an), Ramy Eskander (Tu\'\^an), Octavio Ponce (Tu\'\^an), Boqing Gong (Tu\'\^an), Derek Gasaway (Tu\'\^an), Samira Khan (Tu\'\^an), Umang Gupta (Tu\'\^an), Angelos Filos (Tu\'\^an), Weicheng Kuo (Tu\'\^an), Klemen Kloboves (Tu\'\^an), Jennifer Beattie (Tu\'\^an), Christian Wright (Tu\'\^an), Leon Li (Tu\'\^an), Alicia Jin (Tu\'\^an), Sandeep Mariserla (Tu\'\^an), Miteyan Patel (Tu\'\^an), Jens Heitkaemper (Tu\'\^an), Dilip Krishnan (Tu\'\^an), Vivek Sharma (Tu\'\^an), David Bieber (Tu\'\^an), Christian Frank (Tu\'\^an), John Lambert (Tu\'\^an), Paul Caron (Tu\'\^an), Martin Polacek (Tu\'\^an), Mai Gim\'enez (Tu\'\^an), Himadri Choudhury (Tu\'\^an), Xing Yu (Tu\'\^an), Sasan Tavakkol (Tu\'\^an), Arun Ahuja (Tu\'\^an), Franz Och (Tu\'\^an), Rodolphe Jenatton (Tu\'\^an), Wojtek Skut (Tu\'\^an), Bryan Richter (Tu\'\^an), David Gaddy (Tu\'\^an), Andy Ly (Tu\'\^an), Misha Bilenko (Tu\'\^an), Megh Umekar (Tu\'\^an), Ethan Liang (Tu\'\^an), Martin Sevenich (Tu\'\^an), Mandar Joshi (Tu\'\^an), Hassan Mansoor (Tu\'\^an), Rebecca Lin (Tu\'\^an), Sumit Sanghai (Tu\'\^an), Abhimanyu Singh (Tu\'\^an), Xiaowei Li (Tu\'\^an), Sudheendra Vijayanarasimhan (Tu\'\^an), Zaheer Abbas (Tu\'\^an), Yonatan Bitton (Tu\'\^an), Hansa Srinivasan (Tu\'\^an), Manish Reddy Vuyyuru (Tu\'\^an), Alexander Fr\"ommgen (Tu\'\^an), Yanhua Sun (Tu\'\^an), Ralph Leith (Tu\'\^an), Alfonso Casta\~no (Tu\'\^an), DJ Strouse (Tu\'\^an), Le Yan (Tu\'\^an), Austin Kyker (Tu\'\^an), Satish Kambala (Tu\'\^an), Mary Jasarevic (Tu\'\^an), Thibault Sellam (Tu\'\^an), Chao Jia (Tu\'\^an), Alexander Pritzel (Tu\'\^an), Raghavender R (Tu\'\^an), Huizhong Chen (Tu\'\^an), Natalie Clay (Tu\'\^an), Sudeep Gandhe (Tu\'\^an), Sean Kirmani (Tu\'\^an), Sayna Ebrahimi (Tu\'\^an), Hannah Kirkwood (Tu\'\^an), Jonathan Mallinson (Tu\'\^an), Chao Wang (Tu\'\^an), Adnan Ozturel (Tu\'\^an), Kuo Lin (Tu\'\^an), Shyam Upadhyay (Tu\'\^an), Vincent Cohen-Addad (Tu\'\^an), Sean Purser-haskell (Tu\'\^an), Yichong Xu (Tu\'\^an), Ebrahim Songhori (Tu\'\^an), Babi Seal (Tu\'\^an), Alberto Magni (Tu\'\^an), Almog Gueta (Tu\'\^an), Tingting Zou (Tu\'\^an), Guru Guruganesh (Tu\'\^an), Thais Kagohara (Tu\'\^an), Hung Nguyen (Tu\'\^an), Khalid Salama (Tu\'\^an), Alejandro Cruzado Ruiz (Tu\'\^an), Justin Frye (Tu\'\^an), Zhenkai Zhu (Tu\'\^an), Matthias Lochbrunner (Tu\'\^an), Simon Osindero (Tu\'\^an), Wentao Yuan 
(Tu\'\^an), Lisa Lee (Tu\'\^an), Aman Prasad (Tu\'\^an), Lam Nguyen Thiet (Tu\'\^an), Daniele Calandriello (Tu\'\^an), Victor Stone (Tu\'\^an), Qixuan Feng (Tu\'\^an), Han Ke (Tu\'\^an), Maria Voitovich (Tu\'\^an), Geta Sampemane (Tu\'\^an), Lewis Chiang (Tu\'\^an), Ling Wu (Tu\'\^an), Alexander Bykovsky (Tu\'\^an), Matt Young (Tu\'\^an), Luke Vilnis (Tu\'\^an), Ishita Dasgupta (Tu\'\^an), Aditya Chawla (Tu\'\^an), Qin Cao (Tu\'\^an), Bowen Liang (Tu\'\^an), Daniel Toyama (Tu\'\^an), Szabolcs Payrits (Tu\'\^an), Anca Stefanoiu (Tu\'\^an), Dimitrios Vytiniotis (Tu\'\^an), Ankesh Anand (Tu\'\^an), Tianxiao Shen (Tu\'\^an), Blagoj Mitrevski (Tu\'\^an), Michael Tschannen (Tu\'\^an), Sreenivas Gollapudi (Tu\'\^an), Aishwarya P S (Tu\'\^an), Jos\'e Leal (Tu\'\^an), Zhe Shen (Tu\'\^an), Han Fu (Tu\'\^an), Wei Wang (Tu\'\^an), Arvind Kannan (Tu\'\^an), Doron Kukliansky (Tu\'\^an), Sergey Yaroshenko (Tu\'\^an), Svetlana Grant (Tu\'\^an), Umesh Telang (Tu\'\^an), David Wood (Tu\'\^an), Alexandra Chronopoulou (Tu\'\^an), Alexandru \c{T}ifrea (Tu\'\^an), Tao Zhou (Tu\'\^an), Tony (Tu\'\^an), Nguy\~\^en (Q), Muge Ersoy (Q), Anima Singh (Q), Meiyan Xie (Q), Emanuel Taropa (Q), Woohyun Han (Q), Eirikur Agustsson (Q), Andrei Sozanschi (Q), Hui Peng (Q), Alex Chen (Q), Yoel Drori (Q), Efren Robles (Q), Yang Gao (Q), Xerxes Dotiwalla (Q), Ying Chen (Q), Anudhyan Boral (Q), Alexei Bendebury (Q), John Nham (Q), Chris Tar (Q), Luis Castro (Q), Jiepu Jiang (Q), Canoee Liu (Q), Felix Halim (Q), Jinoo Baek (Q), Andy Wan (Q), Jeremiah Liu (Q), Yuan Cao (Q), Shengyang Dai (Q), Trilok Acharya (Q), Ruoxi Sun (Q), Fuzhao Xue (Q), Saket Joshi (Q), Morgane Lustman (Q), Yongqin Xian (Q), Rishabh Joshi (Q), Deep Karkhanis (Q), Nora Kassner (Q), Jamie Hall (Q), Xiangzhuo Ding (Q), Gan Song (Q), Gang Li (Q), Chen Zhu (Q), Yana Kulizhskaya (Q), Bin Ni (Q), Alexey Vlaskin (Q), Solomon Demmessie (Q), Lucio Dery (Q), Salah Zaiem (Q), Yanping Huang (Q), Cindy Fan (Q), Felix Gimeno (Q), Ananth Balashankar (Q), Koji Kojima (Q), Hagai Taitelbaum (Q), Maya Meng (Q), Dero Gharibian (Q), Sahil Singla (Q), Wei Chen (Q), Ambrose Slone (Q), Guanjie Chen (Q), Sujee Rajayogam (Q), Max Schumacher (Q), Suyog Kotecha (Q), Rory Blevins (Q), Qifei Wang (Q), Mor Hazan Taege (Q), Alex Morris (Q), Xin Liu (Q), Fayaz Jamil (Q), Richard Zhang (Q), Pratik Joshi (Q), Ben Ingram (Q), Tyler Liechty (Q), Ahmed Eleryan (Q), Scott Baird (Q), Alex Grills (Q), Gagan Bansal (Q), Shan Han (Q), Kiran Yalasangi (Q), Shawn Xu (Q), Majd Al Merey (Q), Isabel Gao (Q), Felix Weissenberger (Q), Igor Karpov (Q), Robert Riachi (Q), Ankit Anand (Q), Gautam Prasad (Q), Kay Lamerigts (Q), Reid Hayes (Q), Jamie Rogers (Q), Mandy Guo (Q), Ashish Shenoy (Q), Qiong (Q), Hu (Dima), Kyle He (Dima), Yuchen Liu (Dima), Polina Zablotskaia (Dima), Sagar Gubbi (Dima), Yifan Chang (Dima), Jay Pavagadhi (Dima), Kristian Kjems (Dima), Archita Vadali (Dima), Diego Machado (Dima), Yeqing Li (Dima), Renshen Wang (Dima), Dipankar Ghosh (Dima), Aahil Mehta (Dima), Dana Alon (Dima), George Polovets (Dima), Alessio Tonioni (Dima), Nate Kushman (Dima), Joel D'sa (Dima), Lin Zhuo (Dima), Allen Wu (Dima), Rohin Shah (Dima), John Youssef (Dima), Jiayu Ye (Dima), Justin Snyder (Dima), Karel Lenc (Dima), Senaka Buthpitiya (Dima), Matthew Tung (Dima), Jichuan Chang (Dima), Tao Chen (Dima), David Saxton (Dima), Jenny Lee (Dima), Lydia Lihui Zhang (Dima), James Qin (Dima), Prabakar Radhakrishnan (Dima), Maxwell Chen (Dima), Piotr Ambroszczyk (Dima), Metin Toksoz-Exley (Dima), Yan Zhong (Dima), Nitzan 
Katz (Dima), Brendan O'Donoghue (Dima), Tamara von Glehn (Dima), Adi Gerzi Rosenthal (Dima), Aga \'Swietlik (Dima), Xiaokai Zhao (Dima), Nick Fernando (Dima), Jinliang Wei (Dima), Jieru Mei (Dima), Sergei Vassilvitskii (Dima), Diego Cedillo (Dima), Pranjal Awasthi (Dima), Hui Zheng (Dima), Koray Kavukcuoglu (Dima), Itay Laish (Dima), Joseph Pagadora (Dima), Marc Brockschmidt (Dima), Christopher A. Choquette-Choo (Dima), Arunkumar Byravan (Dima), Yifeng Lu (Dima), Xu Chen (Dima), Mia Chen (Dima), Kenton Lee (Dima), Rama Pasumarthi (Dima), Sijal Bhatnagar (Dima), Aditya Shah (Dima), Qiyin Wu (Dima), Zhuoyuan Chen (Dima), Zack Nado (Dima), Bartek Perz (Dima), Zixuan Jiang (Dima), David Kao (Dima), Ganesh Mallya (Dima), Nino Vieillard (Dima), Lantao Mei (Dima), Sertan Girgin (Dima), Mandy Jordan (Dima), Yeongil Ko (Dima), Alekh Agarwal (Dima), Yaxin Liu (Dima), Yasemin Altun (Dima), Raoul de Liedekerke (Dima), Anastasios Kementsietsidis (Dima), Daiyi Peng (Dima), Dangyi Liu (Dima), Utku Evci (Dima), Peter Humphreys (Dima), Austin Tarango (Dima), Xiang Deng (Dima), Yoad Lewenberg (Dima), Kevin Aydin (Dima), Chengda Wu (Dima), Bhavishya Mittal (Dima), Tsendsuren Munkhdalai (Dima), Kleopatra Chatziprimou (Dima), Rodrigo Benenson (Dima), Uri First (Dima), Xiao Ma (Dima), Jinning Li (Dima), Armand Joulin (Dima), Hamish Tomlinson (Dima), Tingnan Zhang (Dima), Milad Nasr (Dima), Zhi Hong (Dima), Micha\"el Sander (Dima), Lisa Anne Hendricks (Dima), Anuj Sharma (Dima), Andrew Bolt (Dima), Eszter V\'ertes (Dima), Jiri Simsa (Dima), Tomer Levinboim (Dima), Olcan Sercinoglu (Dima), Divyansh Shukla (Dima), Austin Wu (Dima), Craig Swanson (Dima), Danny Vainstein (Dima), Fan Bu (Dima), Bo Wang (Dima), Ryan Julian (Dima), Charles Yoon (Dima), Sergei Lebedev (Dima), Antonious Girgis (Dima), Bernd Bandemer (Dima), David Du (Dima), Todd Wang (Dima), Xi Chen (Dima), Ying Xiao (Dima), Peggy Lu (Dima), Natalie Ha (Dima), Vlad Ionescu (Dima), Simon Rowe (Dima), Josip Matak (Dima), Federico Lebron (Dima), Andreas Steiner (Dima), Lalit Jain (Dima), Manaal Faruqui (Dima), Nicolas Lacasse (Dima), Georgie Evans (Dima), Neesha Subramaniam (Dima), Dean Reich (Dima), Giulia Vezzani (Dima), Aditya Pandey (Dima), Joe Stanton (Dima), Tianhao Zhou (Dima), Liam McCafferty (Dima), Henry Griffiths (Dima), Verena Rieser (Dima), Soheil Hassas Yeganeh (Dima), Eleftheria Briakou (Dima), Lu Huang (Dima), Zichuan Wei (Dima), Liangchen Luo (Dima), Erik Jue (Dima), Gabby Wang (Dima), Victor Cotruta (Dima), Myriam Khan (Dima), Jongbin Park (Dima), Qiuchen Guo (Dima), Peiran Li (Dima), Rong Rong (Dima), Diego Antognini (Dima), Anastasia Petrushkina (Dima), Chetan Tekur (Dima), Eli Collins (Dima), Parul Bhatia (Dima), Chester Kwak (Dima), Wenhu Chen (Dima), Arvind Neelakantan (Dima), Immanuel Odisho (Dima), Sheng Peng (Dima), Vincent Nallatamby (Dima), Vaibhav Tulsyan (Dima), Fabian Pedregosa (Dima), Peng Xu (Dima), Raymond Lin (Dima), Yulong Wang (Dima), Emma Wang (Dima), Sholto Douglas (Dima), Reut Tsarfaty (Dima), Elena Gribovskaya (Dima), Renga Aravamudhan (Dima), Manu Agarwal (Dima), Mara Finkelstein (Dima), Qiao Zhang (Dima), Elizabeth Cole (Dima), Phil Crone (Dima), Sarmishta Velury (Dima), Anil Das (Dima), Chris Sauer (Dima), Luyao Xu (Dima), Danfeng Qin (Dima), Chenjie Gu (Dima), Dror Marcus (Dima), CJ Zheng (Dima), Wouter Van Gansbeke (Dima), Sobhan Miryoosefi (Dima), Haitian Sun (Dima), YaGuang Li (Dima), Charlie Chen (Dima), Jae Yoo (Dima), Pavel Dubov (Dima), Alex Tomala (Dima), Adams Yu (Dima), Pawe{\l} Weso{\l}owski (Dima), 
Alok Gunjan (Dima), Eddie Cao (Dima), Jiaming Luo (Dima), Nikhil Sethi (Dima), Arkadiusz Socala (Dima), Laura Graesser (Dima), Tomas Kocisky (Dima), Arturo BC (Dima), Minmin Chen (Dima), Edward Lee (Dima), Sophie Wang (Dima), Weize Kong (Dima), Qiantong Xu (Dima), Nilesh Tripuraneni (Dima), Yiming Li (Dima), Xinxin Yu (Dima), Allen Porter (Dima), Paul Voigtlaender (Dima), Biao Zhang (Dima), Arpi Vezer (Dima), Sarah York (Dima), Qing Wei (Dima), Geoffrey Cideron (Dima), Mark Kurzeja (Dima), Seungyeon Kim (Dima), Benny Li (Dima), Ang\'eline Pouget (Dima), Hyo Lee (Dima), Kaspar Daugaard (Dima), Yang Li (Dima), Dave Uthus (Dima), Aditya Siddhant (Dima), Paul Cavallaro (Dima), Sriram Ganapathy (Dima), Maulik Shah (Dima), Rolf Jagerman (Dima), Jeff Stanway (Dima), Piermaria Mendolicchio (Dima), Li Xiao (Dima), Kayi Lee (Dima), Tara Thompson (Dima), Shubham Milind Phal (Dima), Jason Chase (Dima), Sun Jae Lee (Dima), Adrian N Reyes (Dima), Disha Shrivastava (Dima), Zhen Qin (Dima), Roykrong Sukkerd (Dima), Seth Odoom (Dima), Lior Madmoni (Dima), John Aslanides (Dima), Jonathan Herzig (Dima), Elena Pochernina (Dima), Sheng Zhang (Dima), Parker Barnes (Dima), Daisuke Ikeda (Dima), Qiujia Li (Dima), Shuo-yiin Chang (Dima), Shakir Mohamed (Dima), Jim Sproch (Dima), Richard Powell (Dima), Bidisha Samanta (Dima), Domagoj \'Cevid (Dima), Anton Kovsharov (Dima), Shrestha Basu Mallick (Dima), Srinivas Tadepalli (Dima), Anne Zheng (Dima), Kareem Ayoub (Dima), Andreas Noever (Dima), Christian Reisswig (Dima), Zhuo Xu (Dima), Junhyuk Oh (Dima), Martin Matysiak (Dima), Tim Blyth (Dima), Shereen Ashraf (Dima), Julien Amelot (Dima), Boone Severson (Dima), Michele Bevilacqua (Dima), Motoki Sano (Dima), Ethan Dyer (Dima), Ofir Roval (Dima), Anu Sinha (Dima), Yin Zhong (Dima), Sagi Perel (Dima), Tea Saboli\'c (Dima), Johannes Mauerer (Dima), Willi Gierke (Dima), Mauro Verzetti (Dima), Rodrigo Cabrera (Dima), Alvin Abdagic (Dima), Steven Hemingray (Dima), Austin Stone (Dima), Jong Lee (Dima), Farooq Ahmad (Dima), Karthik Raman (Dima), Lior Shani (Dima), Jonathan Lai (Dima), Orhan Firat (Dima), Nathan Waters (Dima), Eric Ge (Dima), Mo Shomrat (Dima), Himanshu Gupta (Dima), Rajeev Aggarwal (Dima), Tom Hudson (Dima), Bill Jia (Dima), Simon Baumgartner (Dima), Palak Jain (Dima), Joe Kovac (Dima), Junehyuk Jung (Dima), Ante \v{Z}u\v{z}ul (Dima), Will Truong (Dima), Morteza Zadimoghaddam (Dima), Songyou Peng (Dima), Marco Liang (Dima), Rachel Sterneck (Dima), Balaji Lakshminarayanan (Dima), Machel Reid (Dima), Oliver Woodman (Dima), Tong Zhou (Dima), Jianling Wang (Dima), Vincent Coriou (Dima), Arjun Narayanan (Dima), Jay Hoover (Dima), Yenai Ma (Dima), Apoorv Jindal (Dima), Clayton Sanford (Dima), Doug Reid (Dima), Swaroop Ramaswamy (Dima), Alex Kurakin (Dima), Roland Zimmermann (Dima), Yana Lunts (Dima), Dragos Dena (Dima), Zal\'an Borsos (Dima), Vered Cohen (Dima), Shujian Zhang (Dima), Will Grathwohl (Dima), Robert Dadashi (Dima), Morgan Redshaw (Dima), Joshua Kessinger (Dima), Julian Odell (Dima), Silvano Bonacina (Dima), Zihang Dai (Dima), Grace Chen (Dima), Ayush Dubey (Dima), Pablo Sprechmann (Dima), Mantas Pajarskas (Dima), Wenxuan Zhou (Dima), Niharika Ahuja (Dima), Tara Thomas (Dima), Martin Nikoltchev (Dima), Matija Kecman (Dima), Bharath Mankalale (Dima), Andrey Ryabtsev (Dima), Jennifer She (Dima), Christian Walder (Dima), Jiaming Shen (Dima), Lu Li (Dima), Carolina Parada (Dima), Sheena Panthaplackel (Dima), Okwan Kwon (Dima), Matt Lawlor (Dima), Utsav Prabhu (Dima), Yannick Schroecker (Dima), Marc'aurelio 
Ranzato (Dima), Pete Blois (Dima), Iurii Kemaev (Dima), Ting Yu (Dima), Dmitry (Dima), Lepikhin (Weilun), Hao Xiong (Weilun), Sahand Sharifzadeh (Weilun), Oleaser Johnson (Weilun), Jeremiah Willcock (Weilun), Rui Yao (Weilun), Greg Farquhar (Weilun), Sujoy Basu (Weilun), Hidetoshi Shimokawa (Weilun), Nina Anderson (Weilun), Haiguang Li (Weilun), Khiem Pham (Weilun), Yizhong Liang (Weilun), Sebastian Borgeaud (Weilun), Alexandre Moufarek (Weilun), Hideto Kazawa (Weilun), Blair Kutzman (Weilun), Marcin Sieniek (Weilun), Sara Smoot (Weilun), Ruth Wang (Weilun), Natalie Axelsson (Weilun), Nova Fallen (Weilun), Prasha Sundaram (Weilun), Yuexiang Zhai (Weilun), Varun Godbole (Weilun), Petros Maniatis (Weilun), Alek Wang (Weilun), Ilia Shumailov (Weilun), Santhosh Thangaraj (Weilun), Remi Crocker (Weilun), Nikita Gupta (Weilun), Gang Wu (Weilun), Phil Chen (Weilun), Gell\'ert Weisz (Weilun), Celine Smith (Weilun), Mojtaba Seyedhosseini (Weilun), Boya Fang (Weilun), Xiyang Luo (Weilun), Roey Yogev (Weilun), Zeynep Cankara (Weilun), Andrew Hard (Weilun), Helen Ran (Weilun), Rahul Sukthankar (Weilun), George Necula (Weilun), Ga\"el Liu (Weilun), Honglong Cai (Weilun), Praseem Banzal (Weilun), Daniel Keysers (Weilun), Sanjay Ghemawat (Weilun), Connie Tao (Weilun), Emma Dunleavy (Weilun), Aditi Chaudhary (Weilun), Wei Li (Weilun), Maciej Miku{\l}a (Weilun), Chen-Yu Lee (Weilun), Tiziana Refice (Weilun), Krishna Somandepalli (Weilun), Alexandre Fr\'echette (Weilun), Dan Bahir (Weilun), John Karro (Weilun), Keith Rush (Weilun), Sarah Perrin (Weilun), Bill Rosgen (Weilun), Xiaomeng Yang (Weilun), Clara Huiyi Hu (Weilun), Mahmoud Alnahlawi (Weilun), Justin Mao-Jones (Weilun), Roopal Garg (Weilun), Hoang Nguyen (Weilun), Bat-Orgil Batsaikhan (Weilun), I\~naki Iturrate (Weilun), Anselm Levskaya (Weilun), Avi Singh (Weilun), Ashyana Kachra (Weilun), Tony Lu (Weilun), Denis Petek (Weilun), Zheng Xu (Weilun), Mark Graham (Weilun), Lukas Zilka (Weilun), Yael Karov (Weilun), Marija Kostelac (Weilun), Fangyu Liu (Weilun), Yaohui Guo (Weilun), Weiyue Wang (Weilun), Bernd Bohnet (Weilun), Emily Pitler (Weilun), Tony Bruguier (Weilun), Keisuke Kinoshita (Weilun), Chrysovalantis Anastasiou (Weilun), Nilpa Jha (Weilun), Ting Liu (Weilun), Jerome Connor (Weilun), Phil Wallis (Weilun), Philip Pham (Weilun), Eric Bailey (Weilun), Shixin Li (Weilun), Heng-Tze Cheng (Weilun), Sally Ma (Weilun), Haiqiong Li (Weilun), Akanksha Maurya (Weilun), Kate Olszewska (Weilun), Manfred Warmuth (Weilun), Christy Koh (Weilun), Dominik Paulus (Weilun), Siddhartha Reddy Jonnalagadda (Weilun), Enrique Piqueras (Weilun), Ali Elqursh (Weilun), Geoff Brown (Weilun), Hadar Shemtov (Weilun), Loren Maggiore (Weilun), Fei Xia (Weilun), Ryan Foley (Weilun), Beka Westberg (Weilun), George van den Driessche (Weilun), Livio Baldini Soares (Weilun), Arjun Kar (Weilun), Michael Quinn (Weilun), Siqi Zuo (Weilun), Jialin Wu (Weilun), Kyle Kastner (Weilun), Anna Bortsova (Weilun), Aijun Bai (Weilun), Ales Mikhalap (Weilun), Luowei Zhou (Weilun), Jennifer Brennan (Weilun), Vinay Ramasesh (Weilun), Honglei Zhuang (Weilun), John Maggs (Weilun), Johan Schalkwyk (Weilun), Yuntao Xu (Weilun), Hui Huang (Weilun), Andrew Howard (Weilun), Sasha Brown (Weilun), Linting Xue (Weilun), Gloria Shen (Weilun), Brian Albert (Weilun), Neha Jha (Weilun), Daniel Zheng (Weilun), Varvara Krayvanova (Weilun), Spurthi Amba Hombaiah (Weilun), Olivier Lacombe (Weilun), Gautam Vasudevan (Weilun), Dan Graur (Weilun), Tian Xie (Weilun), Meet Gandhi (Weilun), Bangju Wang (Weilun), 
Dustin Zelle (Weilun), Harman Singh (Weilun), Dahun Kim (Weilun), S\'ebastien Cevey (Weilun), Victor Ungureanu (Weilun), Natasha Noy (Weilun), Fei Liu (Weilun), Annie Xie (Weilun), Fangxiaoyu Feng (Weilun), Katerina Tsihlas (Weilun), Daniel Formoso (Weilun), Neera Vats (Weilun), Quentin Wellens (Weilun), Yinan Wang (Weilun), Niket Kumar Bhumihar (Weilun), Samrat Ghosh (Weilun), Matt Hoffman (Weilun), Tom Lieber (Weilun), Oran Lang (Weilun), Kush Bhatia (Weilun), Tom Paine (Weilun), Aroonalok Pyne (Weilun), Ronny Votel (Weilun), Madeleine Clare Elish (Weilun), Benoit Schillings (Weilun), Alex Panagopoulos (Weilun), Haichuan Yang (Weilun), Adam Raveret (Weilun), Zohar Yahav (Weilun), Shuang Liu (Weilun), Warren Chen (Weilun), Dalia El Badawy (Weilun), Nishant Agrawal (Weilun), Mohammed Badawi (Weilun), Mahdi Mirzazadeh (Weilun), Carla Bromberg (Weilun), Fan Ye (Weilun), Chang Liu (Weilun), Tatiana Sholokhova (Weilun), George-Cristian Muraru (Weilun), Gargi Balasubramaniam (Weilun), Jonathan Malmaud (Weilun), Alen Carin (Weilun), Danilo Martins (Weilun), Irina Jurenka (Weilun), Pankil Botadra (Weilun), Dave Lacey (Weilun), Richa Singh (Weilun), Mariano Schain (Weilun), Dan Zheng (Weilun), Isabelle Guyon (Weilun), Victor Lavrenko (Weilun), Seungji Lee (Weilun), Xiang Zhou (Weilun), Demis Hassabis (Weilun), Jeshwanth Challagundla (Weilun), Derek Cheng (Weilun), Nikhil Mehta (Weilun), Matthew Mauger (Weilun), Michela Paganini (Weilun), Pushkar Mishra (Weilun), Kate Lee (Weilun), Zhang Li (Weilun), Lexi Baugher (Weilun), Ondrej Skopek (Weilun), Max Chang (Weilun), Amir Zait (Weilun), Gaurav Menghani (Weilun), Lizzetth Bellot (Weilun), Guangxing Han (Weilun), Jean-Michel Sarr (Weilun), Sharat Chikkerur (Weilun), Himanshu Sahni (Weilun), Rohan Anil (Weilun), Arun Narayanan (Weilun), Chandu Thekkath (Weilun), Daniele Pighin (Weilun), Hana Strej\v{c}ek (Weilun), Marko Velic (Weilun), Fred Bertsch (Weilun), Manuel Tragut (Weilun), Keran Rong (Weilun), Alicia Parrish (Weilun), Kai Bailey (Weilun), Jiho Park (Weilun), Isabela Albuquerque (Weilun), Abhishek Bapna (Weilun), Rajesh Venkataraman (Weilun), Alec Kosik (Weilun), Johannes Griesser (Weilun), Zhiwei Deng (Weilun), Alek Andreev (Weilun), Qingyun Dou (Weilun), Kevin Hui (Weilun), Fanny Wei (Weilun), Xiaobin Yu (Weilun), Lei Shu (Weilun), Avia Aharon (Weilun), David Barker (Weilun), Badih Ghazi (Weilun), Sebastian Flennerhag (Weilun), Chris Breaux (Weilun), Yuchuan Liu (Weilun), Matthew Bilotti (Weilun), Josh Woodward (Weilun), Uri Alon (Weilun), Stephanie Winkler (Weilun), Tzu-Kuo Huang (Weilun), Kostas Andriopoulos (Weilun), Jo\~ao Gabriel Oliveira (Weilun), Penporn Koanantakool (Weilun), Berkin Akin (Weilun), Michael Wunder (Weilun), Cicero Nogueira dos Santos (Weilun), Mohammad Hossein Bateni (Weilun), Lin Yang (Weilun), Dan Horgan (Weilun), Beer Changpinyo (Weilun), Keyvan Amiri (Weilun), Min Ma (Weilun), Dayeong Lee (Weilun), Lihao Liang (Weilun), Anirudh Baddepudi (Weilun), Tejasi Latkar (Weilun), Raia Hadsell (Weilun), Jun Xu (Weilun), Hairong Mu (Weilun), Michael Han (Weilun), Aedan Pope (Weilun), Snchit Grover (Weilun), Frank Kim (Weilun), Ankit Bhagatwala (Weilun), Guan Sun (Weilun), Yamini Bansal (Weilun), Amir Globerson (Weilun), Alireza Nazari (Weilun), Samira Daruki (Weilun), Hagen Soltau (Weilun), Jane Labanowski (Weilun), Laurent El Shafey (Weilun), Matt Harvey (Weilun), Yanif Ahmad (Weilun), Elan Rosenfeld (Weilun), William Kong (Weilun), Etienne Pot (Weilun), Yi-Xuan Tan (Weilun), Aurora Wei (Weilun), Victoria Langston (Weilun), 
Marcel Prasetya (Weilun), Petar Veli\v{c}kovi\'c (Weilun), Richard Killam (Weilun), Robin Strudel (Weilun), Darren Ni (Weilun), Zhenhai Zhu (Weilun), Aaron Archer (Weilun), Kavya Kopparapu (Weilun), Lynn Nguyen (Weilun), Emilio Parisotto (Weilun), Hussain Masoom (Weilun), Sravanti Addepalli (Weilun), Jordan Grimstad (Weilun), Hexiang Hu (Weilun), Joss Moore (Weilun), Avinatan Hassidim (Weilun), Le Hou (Weilun), Mukund Raghavachari (Weilun), Jared Lichtarge (Weilun), Adam R. Brown (Weilun), Hilal Dib (Weilun), Natalia Ponomareva (Weilun), Justin Fu (Weilun), Yujing Zhang (Weilun), Altaf Rahman (Weilun), Joana Iljazi (Weilun), Edouard Leurent (Weilun), Gabriel Dulac-Arnold (Weilun), Cosmo Du (Weilun), Chulayuth Asawaroengchai (Weilun), Larry Jin (Weilun), Ela Gruzewska (Weilun), Ziwei Ji (Weilun), Benigno Uria (Weilun), Daniel De Freitas (Weilun), Paul Barham (Weilun), Lauren Beltrone (Weilun), V\'ictor Campos (Weilun), Jun Yan (Weilun), Neel Kovelamudi (Weilun), Arthur Nguyen (Weilun), Elinor Davies (Weilun), Zhichun Wu (Weilun), Zoltan Egyed (Weilun), Kristina Toutanova (Weilun), Nithya Attaluri (Weilun), Hongliang Fei (Weilun), Peter Stys (Weilun), Siddhartha Brahma (Weilun), Martin Izzard (Weilun), Siva Velusamy (Weilun), Scott Lundberg (Weilun), Vincent Zhuang (Weilun), Kevin Sequeira (Weilun), Adam Santoro (Weilun), Ehsan Amid (Weilun), Ophir Aharoni (Weilun), Shuai Ye (Weilun), Mukund Sundararajan (Weilun), Lijun Yu (Weilun), Yu-Cheng Ling (Weilun), Stephen Spencer (Weilun), Hugo Song (Weilun), Josip Djolonga (Weilun), Christo Kirov (Weilun), Sonal Gupta (Weilun), Alessandro Bissacco (Weilun), Clemens Meyer (Weilun), Mukul Bhutani (Weilun), Andrew Dai (Weilun), Weiyi Wang (Weilun), Siqi Liu (Weilun), Ashwin Sreevatsa (Weilun), Qijun Tan (Weilun), Maria Wang (Weilun), Lucy Kim (Weilun), Yicheng Wang (Weilun), Alex Irpan (Weilun), Yang Xiao (Weilun), Stanislav Fort (Weilun), Yifan He (Weilun), Alex Gurney (Weilun), Bryan Gale (Weilun), Yue Ma (Weilun), Monica Roy (Weilun), Viorica Patraucean (Weilun), Taylan Bilal (Weilun), Golnaz Ghiasi (Weilun), Anahita Hosseini (Weilun), Melvin Johnson (Weilun), Zhuowan Li (Weilun), Yi Tay (Weilun), Benjamin Beyret (Weilun), Katie Millican (Weilun), Josef Broder (Weilun), Mayank Lunayach (Weilun), Danny Swisher (Weilun), Eugen Vu\v{s}ak (Weilun), David Parkinson (Weilun), MH Tessler (Weilun), Adi Mayrav Gilady (Weilun), Richard Song (Weilun), Allan Dafoe (Weilun), Yves Raimond (Weilun), Masa Yamaguchi (Weilun), Itay Karo (Weilun), Elizabeth Nielsen (Weilun), Kevin Kilgour (Weilun), Mike Dusenberry (Weilun), Rajiv Mathews (Weilun), Jiho Choi (Weilun), Siyuan Qiao (Weilun), Harsh Mehta (Weilun), Sahitya Potluri (Weilun), Chris Knutsen (Weilun), Jialu Liu (Weilun), Tat Tan (Weilun), Kuntal Sengupta (Weilun), Keerthana Gopalakrishnan (Weilun), Abodunrinwa Toki (Weilun), Mencher Chiang (Weilun), Mike Burrows (Weilun), Grace Vesom (Weilun), Zafarali Ahmed (Weilun), Ilia Labzovsky (Weilun), Siddharth Vashishtha (Weilun), Preeti Singh (Weilun), Ankur Sharma (Weilun), Ada Ma (Weilun), Jinyu Xie (Weilun), Pranav Talluri (Weilun), Hannah Forbes-Pollard (Weilun), Aarush Selvan (Weilun), Joel Wee (Weilun), Loic Matthey (Weilun), Tom Funkhouser (Weilun), Parthasarathy Gopavarapu (Weilun), Lev Proleev (Weilun), Cheng Li (Weilun), Matt Thomas (Weilun), Kashyap Kolipaka (Weilun), Zhipeng Jia (Weilun), Ashwin Kakarla (Weilun), Srinivas Sunkara (Weilun), Joan Puigcerver (Weilun), Suraj Satishkumar Sheth (Weilun), Emily Graves (Weilun), Chen Wang (Weilun), Sadh MNM Khan 
(Weilun), Kai Kang (Weilun), Shyamal Buch (Weilun), Fred Zhang (Weilun), Omkar Savant (Weilun), David Soergel (Weilun), Kevin Lee (Weilun), Linda Friso (Weilun), Xuanyi Dong (Weilun), Rahul Arya (Weilun), Shreyas Chandrakaladharan (Weilun), Connor Schenck (Weilun), Greg Billock (Weilun), Tejas Iyer (Weilun), Anton Bakalov (Weilun), Leslie Baker (Weilun), Alex Ruiz (Weilun), Angad Chandorkar (Weilun), Trieu Trinh (Weilun), Matt Miecnikowski (Weilun), Yanqi Zhou (Weilun), Yangsibo Huang (Weilun), Jiazhong Nie (Weilun), Ali Shah (Weilun), Ashish Thapliyal (Weilun), Sam Haves (Weilun), Lun Wang (Weilun), Uri Shaham (Weilun), Patrick Morris-Suzuki (Weilun), Soroush Radpour (Weilun), Leonard Berrada (Weilun), Thomas Strohmann (Weilun), Chaochao Yan (Weilun), Jingwei Shen (Weilun), Sonam Goenka (Weilun), Tris Warkentin (Weilun), Petar Devi\'c (Weilun), Dan Belov (Weilun), Albert Webson (Weilun), Madhavi Yenugula (Weilun), Puranjay Datta (Weilun), Jerry Chang (Weilun), Nimesh Ghelani (Weilun), Aviral Kumar (Weilun), Vincent Perot (Weilun), Jessica Lo (Weilun), Yang Song (Weilun), Herman Schmit (Weilun), Jianmin Chen (Weilun), Vasilisa Bashlovkina (Weilun), Xiaoyue Pan (Weilun), Diana Mincu (Weilun), Paul Roit (Weilun), Isabel Edkins (Weilun), Andy Davis (Weilun), Yujia Li (Weilun), Ben Horn (Weilun), Xinjian Li (Weilun), Pradeep Kumar S (Weilun), Eric Doi (Weilun), Wanzheng Zhu (Weilun), Sri Gayatri Sundara Padmanabhan (Weilun), Siddharth Verma (Weilun), Jasmine Liu (Weilun), Heng Chen (Weilun), Mihajlo Velimirovi\'c (Weilun), Malcolm Reynolds (Weilun), Priyanka Agrawal (Weilun), Nick Sukhanov (Weilun), Abhinit Modi (Weilun), Siddharth Goyal (Weilun), John Palowitch (Weilun), Nima Khajehnouri (Weilun), Wing Lowe (Weilun), David Klinghoffer (Weilun), Sharon Silver (Weilun), Vinh Tran (Weilun), Candice Schumann (Weilun), Francesco Piccinno (Weilun), Xi Liu (Weilun), Mario Lu\v{c}i\'c (Weilun), Xiaochen Yang (Weilun), Sandeep Kumar (Weilun), Ajay Kannan (Weilun), Ragha Kotikalapudi (Weilun), Mudit Bansal (Weilun), Fabian Fuchs (Weilun), Javad Hosseini (Weilun), Abdelrahman Abdelhamed (Weilun), Dawn Bloxwich (Weilun), Tianhe Yu (Weilun), Ruoxin Sang (Weilun), Gregory Thornton (Weilun), Karan Gill (Weilun), Yuchi Liu (Weilun), Virat Shejwalkar (Weilun), Jason Lin (Weilun), Zhipeng Yan (Weilun), Kehang Han (Weilun), Thomas Buschmann (Weilun), Michael Pliskin (Weilun), Zhi Xing (Weilun), Susheel Tatineni (Weilun), Junlin Zhang (Weilun), Sissie Hsiao (Weilun), Gavin Buttimore (Weilun), Marcus Wu (Weilun), Zefei Li (Weilun), Geza Kovacs (Weilun), Legg Yeung (Weilun), Tao Huang (Weilun), Aaron Cohen (Weilun), Bethanie Brownfield (Weilun), Averi Nowak (Weilun), Mikel Rodriguez (Weilun), Tianze Shi (Weilun), Hado van Hasselt (Weilun), Kevin Cen (Weilun), Deepanway Ghoshal (Weilun), Kushal Majmundar (Weilun), Weiren Yu (Weilun), Warren (Weilun), Chen (June), Danila Sinopalnikov (June), Hao Zhang (June), Vlado Gali\'c (June), Di Lu (June), Zeyu Zheng (June), Maggie Song (June), Gary Wang (June), Gui Citovsky (June), Swapnil Gawde (June), Isaac Galatzer-Levy (June), David Silver (June), Ivana Balazevic (June), Dipanjan Das (June), Kingshuk Majumder (June), Yale Cong (June), Praneet Dutta (June), Dustin Tran (June), Hui Wan (June), Junwei Yuan (June), Daniel Eppens (June), Alanna Walton (June), Been Kim (June), Harry Ragan (June), James Cobon-Kerr (June), Lu Liu (June), Weijun Wang (June), Bryce Petrini (June), Jack Rae (June), Rakesh Shivanna (June), Yan Xiong (June), Chace Lee (June), Pauline Coquinot (June), 
Yiming Gu (June), Lisa Patel (June), Blake Hechtman (June), Aviel Boag (June), Orion Jankowski (June), Alex Wertheim (June), Alex Lee (June), Paul Covington (June), Hila Noga (June), Sam Sobell (June), Shanthal Vasanth (June), William Bono (June), Chirag Nagpal (June), Wei Fan (June), Xavier Garcia (June), Kedar Soparkar (June), Aybuke Turker (June), Nathan Howard (June), Sachit Menon (June), Yuankai Chen (June), Vikas Verma (June), Vladimir Pchelin (June), Harish Rajamani (June), Valentin Dalibard (June), Ana Ramalho (June), Yang Guo (June), Kartikeya Badola (June), Seojin Bang (June), Nathalie Rauschmayr (June), Julia Proskurnia (June), Sudeep Dasari (June), Xinyun Chen (June), Mikhail Sushkov (June), Anja Hauth (June), Pauline Sho (June), Abhinav Singh (June), Bilva Chandra (June), Allie Culp (June), Max Dylla (June), Olivier Bachem (June), James Besley (June), Heri Zhao (June), Timothy Lillicrap (June), Wei Wei (June), Wael Al Jishi (June), Ning Niu (June), Alban Rrustemi (June), Rapha\"el Lopez Kaufman (June), Ryan Poplin (June), Jewel Zhao (June), Minh Truong (June), Shikhar Bharadwaj (June), Ester Hlavnova (June), Eli Stickgold (June), Cordelia Schmid (June), Georgi Stephanov (June), Zhaoqi Leng (June), Frederick Liu (June), L\'eonard Hussenot (June), Shenil Dodhia (June), Juliana Vicente Franco (June), Lesley Katzen (June), Abhanshu Sharma (June), Sarah Cogan (June), Zuguang Yang (June), Aniket Ray (June), Sergi Caelles (June), Shen Yan (June), Ravin Kumar (June), Daniel Gillick (June), Renee Wong (June), Joshua Ainslie (June), Jonathan Hoech (June), S\'eb Arnold (June), Dan Abolafia (June), Anca Dragan (June), Ben Hora (June), Grace Hu (June), Alexey Guseynov (June), Yang Lu (June), Chas Leichner (June), Jinmeng Rao (June), Abhimanyu Goyal (June), Nagabhushan Baddi (June), Daniel Hernandez Diaz (June), Tim McConnell (June), Max Bain (June), Jake Abernethy (June), Qiqi Yan (June), Rylan Schaeffer (June), Paul Vicol (June), Will Thompson (June), Montse Gonzalez Arenas (June), Mathias Bellaiche (June), Pablo Barrio (June), Stefan Zinke (June), Riccardo Patana (June), Pulkit Mehta (June), JK Kearns (June), Avraham Ruderman (June), Scott Pollom (June), David D'Ambrosio (June), Cath Hope (June), Yang Yu (June), Andrea Gesmundo (June), Kuang-Huei Lee (June), Aviv Rosenberg (June), Yiqian Zhou (June), Yaoyiran Li (June), Drew Garmon (June), Yonghui Wu (June), Safeen Huda (June), Gil Fidel (June), Martin Baeuml (June), Jian Li (June), Phoebe Kirk (June), Rhys May (June), Tao Tu (June), Sara Mc Carthy (June), Toshiyuki Fukuzawa (June), Miranda Aperghis (June), Chih-Kuan Yeh (June), Toshihiro Yoshino (June), Bo Li (June), Austin Myers (June), Kaisheng Yao (June), Ben Limonchik (June), Changwan Ryu (June), Rohun Saxena (June), Alex Goldin (June), Ruizhe Zhao (June), Rocky Rhodes (June), Tao Zhu (June), Divya Tyam (June), Heidi Howard (June), Nathan Byrd (June), Hongxu Ma (June), Yan Wu (June), Ryan Mullins (June), Qingze Wang (June), Aida Amini (June), Sebastien Baur (June), Yiran Mao (June), Subhashini Venugopalan (June), Will Song (June), Wen Ding (June), Paul Collins (June), Sashank Reddi (June), Megan Shum (June), Andrei Rusu (June), Luisa Zintgraf (June), Kelvin Chan (June), Sheela Goenka (June), Mathieu Blondel (June), Michael Collins (June), Renke Pan (June), Marissa Giustina (June), Nikolai Chinaev (June), Christian Schuler (June), Ce Zheng (June), Jonas Valfridsson (June), Alyssa Loo (June), Alex Yakubovich (June), Jamie Smith (June), Tao Jiang (June), Rich Munoz (June), Gabriel 
Barcik (June), Rishabh Bansal (June), Mingyao Yang (June), Yilun Du (June), Pablo Duque (June), Mary Phuong (June), Alexandra Belias (June), Kunal Lad (June), Zeyu Liu (June), Tal Schuster (June), Karthik Duddu (June), Jieru Hu (June), Paige Kunkle (June), Matthew Watson (June), Jackson Tolins (June), Josh Smith (June), Denis Teplyashin (June), Garrett Bingham (June), Marvin Ritter (June), Marco Andreetto (June), Divya Pitta (June), Mohak Patel (June), Shashank Viswanadha (June), Trevor Strohman (June), Catalin Ionescu (June), Jincheng Luo (June), Yogesh Kalley (June), Jeremy Wiesner (June), Dan Deutsch (June), Derek Lockhart (June), Peter Choy (June), Rumen Dangovski (June), Chawin Sitawarin (June), Cat Graves (June), Tanya Lando (June), Joost van Amersfoort (June), Ndidi Elue (June), Zhouyuan Huo (June), Pooya Moradi (June), Jean Tarbouriech (June), Henryk Michalewski (June), Wenting Ye (June), Eunyoung Kim (June), Alex Druinsky (June), Florent Altch\'e (June), Xinyi Chen (June), Artur Dwornik (June), Da-Cheng Juan (June), Rivka Moroshko (June), Horia Toma (June), Jarrod Kahn (June), Hai Qian (June), Maximilian Sieb (June), Irene Cai (June), Roman Goldenberg (June), Praneeth Netrapalli (June), Sindhu Raghuram (June), Yuan Gong (June), Lijie Fan (June), Evan Palmer (June), Yossi Matias (June), Valentin Gabeur (June), Shreya Pathak (June), Tom Ouyang (June), Don Metzler (June), Geoff Bacon (June), Srinivasan Venkatachary (June), Sridhar Thiagarajan (June), Alex Cullum (June), Eran Ofek (June), Vytenis Sakenas (June), Mohamed Hammad (June), Cesar Magalhaes (June), Mayank Daswani (June), Oscar Chang (June), Ashok Popat (June), Ruichao Li (June), Komal Jalan (June), Yanhan Hou (June), Josh Lipschultz (June), Antoine He (June), Wenhao Jia (June), Pier Giuseppe Sessa (June), Prateek Kolhar (June), William Wong (June), Sumeet Singh (June), Lukas Haas (June), Jay Whang (June), Hanna Klimczak-Pluci\'nska (June), Georges Rotival (June), Grace Chung (June), Yiqing Hua (June), Anfal Siddiqui (June), Nicolas Serrano (June), Dongkai Chen (June), Billy Porter (June), Libin Bai (June), Keshav Shivam (June), Sho Arora (June), Partha Talukdar (June), Tom Cobley (June), Sangnie Bhardwaj (June), Evgeny Gladchenko (June), Simon Green (June), Kelvin Guu (June), Felix Fischer (June), Xiao Wu (June), Eric Wang (June), Achintya Singhal (June), Tatiana Matejovicova (June), James Martens (June), Hongji Li (June), Roma Patel (June), Elizabeth Kemp (June), Jiaqi Pan (June), Lily Wang (June), Blake JianHang Chen (June), Jean-Baptiste Alayrac (June), Navneet Potti (June), Erika Gemzer (June), Eugene Ie (June), Kay McKinney (June), Takaaki Saeki (June), Edward Chou (June), Pascal Lamblin (June), SQ Mah (June), Zach Fisher (June), Martin Chadwick (June), Jon Stritar (June), Obaid Sarvana (June), Andrew Hogue (June), Artem Shtefan (June), Hadi Hashemi (June), Yang Xu (June), Jindong Gu (June), Sharad Vikram (June), Chung-Ching Chang (June), Sabela Ramos (June), Logan Kilpatrick (June), Weijuan Xi (June), Jenny Brennan (June), Yinghao Sun (June), Abhishek Jindal (June), Ionel Gog (June), Dawn Chen (June), Felix Wu (June), Jason Lee (June), Sudhindra Kopalle (June), Srinadh Bhojanapalli (June), Oriol Vinyals (June), Natan Potikha (June), Burcu Karagol Ayan (June), Yuan Yuan (June), Michael Riley (June), Piotr Stanczyk (June), Sergey Kishchenko (June), Bing Wang (June), Dan Garrette (June), Antoine Yang (June), Vlad Feinberg (June), CJ Carey (June), Javad Azizi (June), Viral Shah (June), Erica Moreira (June), Chongyang Shi 
(June), Josh Feldman (June), Elizabeth Salesky (June), Thomas Lampe (June), Aneesh Pappu (June), Duhyeon Kim (June), Jonas Adler (June), Avi Caciularu (June), Brian Walker (June), Yunhan Xu (June), Yochai Blau (June), Dylan Scandinaro (June), Terry Huang (June), Sam El-Husseini (June), Abhishek Sinha (June), Lijie Ren (June), Taylor Tobin (June), Patrik Sundberg (June), Tim Sohn (June), Vikas Yadav (June), Mimi Ly (June), Emily Xue (June), Jing Xiong (June), Afzal Shama Soudagar (June), Sneha Mondal (June), Nikhil Khadke (June), Qingchun Ren (June), Ben Vargas (June), Stan Bileschi (June), Sarah Chakera (June), Cindy Wang (June), Boyu Wang (June), Yoni Halpern (June), Joe Jiang (June), Vikas Sindhwani (June), Petre Petrov (June), Pranavaraj Ponnuramu (June), Sanket Vaibhav Mehta (June), Yu Watanabe (June), Betty Chan (June), Matheus Wisniewski (June), Trang Pham (June), Jingwei Zhang (June), Conglong Li (June), Dario de Cesare (June), Art Khurshudov (June), Alex Vasiloff (June), Melissa Tan (June), Zoe Ashwood (June), Bobak Shahriari (June), Maryam Majzoubi (June), Garrett Tanzer (June), Olga Kozlova (June), Robin Alazard (June), James Lee-Thorp (June), Nguyet Minh Phu (June), Isaac Tian (June), Junwhan Ahn (June), Andy Crawford (June), Lauren Lax (June), Yuan (June), Shangguan (Yonghao), Iftekhar Naim (Yonghao), David Ross (Yonghao), Oleksandr Ferludin (Yonghao), Tongfei Guo (Yonghao), Andrea Banino (Yonghao), Hubert Soyer (Yonghao), Xiaoen Ju (Yonghao), Dominika Rogozi\'nska (Yonghao), Ishaan Malhi (Yonghao), Marcella Valentine (Yonghao), Daniel Balle (Yonghao), Apoorv Kulshreshtha (Yonghao), Maciej Kula (Yonghao), Yiwen Song (Yonghao), Sophia Austin (Yonghao), John Schultz (Yonghao), Roy Hirsch (Yonghao), Arthur Douillard (Yonghao), Apoorv Reddy (Yonghao), Michael Fink (Yonghao), Summer Yue (Yonghao), Khyatti Gupta (Yonghao), Adam Zhang (Yonghao), Norman Rink (Yonghao), Daniel McDuff (Yonghao), Lei Meng (Yonghao), Andr\'as Gy\"orgy (Yonghao), Yasaman Razeghi (Yonghao), Ricky Liang (Yonghao), Kazuki Osawa (Yonghao), Aviel Atias (Yonghao), Matan Eyal (Yonghao), Tyrone Hill (Yonghao), Nikolai Grigorev (Yonghao), Zhengdong Wang (Yonghao), Nitish Kulkarni (Yonghao), Rachel Soh (Yonghao), Ivan Lobov (Yonghao), Zachary Charles (Yonghao), Sid Lall (Yonghao), Kazuma Hashimoto (Yonghao), Ido Kessler (Yonghao), Victor Gomes (Yonghao), Zelda Mariet (Yonghao), Danny Driess (Yonghao), Alessandro Agostini (Yonghao), Canfer Akbulut (Yonghao), Jingcao Hu (Yonghao), Marissa Ikonomidis (Yonghao), Emily Caveness (Yonghao), Kartik Audhkhasi (Yonghao), Saurabh Agrawal (Yonghao), Ioana Bica (Yonghao), Evan Senter (Yonghao), Jayaram Mudigonda (Yonghao), Kelly Chen (Yonghao), Jingchen Ye (Yonghao), Xuanhui Wang (Yonghao), James Svensson (Yonghao), Philipp Fr\"anken (Yonghao), Josh Newlan (Yonghao), Li Lao (Yonghao), Eva Schnider (Yonghao), Sami Alabed (Yonghao), Joseph Kready (Yonghao), Jesse Emond (Yonghao), Afief Halumi (Yonghao), Tim Zaman (Yonghao), Chengxi Ye (Yonghao), Naina Raisinghani (Yonghao), Vilobh Meshram (Yonghao), Bo Chang (Yonghao), Ankit Singh Rawat (Yonghao), Axel Stjerngren (Yonghao), Sergey Levi (Yonghao), Rui Wang (Yonghao), Xiangzhu Long (Yonghao), Mitchelle Rasquinha (Yonghao), Steven Hand (Yonghao), Aditi Mavalankar (Yonghao), Lauren Agubuzu (Yonghao), Sudeshna Roy (Yonghao), Junquan Chen (Yonghao), Jarek Wilkiewicz (Yonghao), Hao Zhou (Yonghao), Michal Jastrzebski (Yonghao), Qiong Hu (Yonghao), Agustin Dal Lago (Yonghao), Ramya Sree Boppana (Yonghao), Wei-Jen Ko (Yonghao), Jennifer 
Prendki (Yonghao), Yao Su (Yonghao), Zhi Li (Yonghao), Eliza Rutherford (Yonghao), Girish Ramchandra Rao (Yonghao), Ramona Comanescu (Yonghao), Adri\`a Puigdom\`enech (Yonghao), Qihang Chen (Yonghao), Dessie Petrova (Yonghao), Christine Chan (Yonghao), Vedrana Milutinovic (Yonghao), Felipe Tiengo Ferreira (Yonghao), Chin-Yi Cheng (Yonghao), Ming Zhang (Yonghao), Tapomay Dey (Yonghao), Sherry Yang (Yonghao), Ramesh Sampath (Yonghao), Quoc Le (Yonghao), Howard Zhou (Yonghao), Chu-Cheng Lin (Yonghao), Hoi Lam (Yonghao), Christine Kaeser-Chen (Yonghao), Kai Hui (Yonghao), Dean Hirsch (Yonghao), Tom Eccles (Yonghao), Basil Mustafa (Yonghao), Shruti Rijhwani (Yonghao), Morgane Rivi\`ere (Yonghao), Yuanzhong Xu (Yonghao), Junjie Wang (Yonghao), Xinyang Geng (Yonghao), Xiance Si (Yonghao), Arjun Khare (Yonghao), Cheolmin Kim (Yonghao), Vahab Mirrokni (Yonghao), Kamyu Lee (Yonghao), Khuslen Baatarsukh (Yonghao), Nathaniel Braun (Yonghao), Lisa Wang (Yonghao), Pallavi LV (Yonghao), Richard Tanburn (Yonghao), Yuvein (Yonghao), Zhu (Joyce), Fangda Li (Joyce), Setareh Ariafar (Joyce), Dan Goldberg (Joyce), Ken Burke (Joyce), Daniil Mirylenka (Joyce), Meiqi Guo (Joyce), Olaf Ronneberger (Joyce), Hadas Natalie Vogel (Joyce), Liqun Cheng (Joyce), Nishita Shetty (Joyce), Johnson Jia (Joyce), Thomas Jimma (Joyce), Corey Fry (Joyce), Ted Xiao (Joyce), Martin Sundermeyer (Joyce), Ryan Burnell (Joyce), Yannis Assael (Joyce), Mario Pinto (Joyce), JD Chen (Joyce), Rohit Sathyanarayana (Joyce), Donghyun Cho (Joyce), Jing Lu (Joyce), Rishabh Agarwal (Joyce), Sugato Basu (Joyce), Lucas Gonzalez (Joyce), Dhruv Shah (Joyce), Meng Wei (Joyce), Dre Mahaarachchi (Joyce), Rohan Agrawal (Joyce), Tero Rissa (Joyce), Yani Donchev (Joyce), Ramiro Leal-Cavazos (Joyce), Adrian Hutter (Joyce), Markus Mircea (Joyce), Alon Jacovi (Joyce), Faruk Ahmed (Joyce), Jiageng Zhang (Joyce), Shuguang Hu (Joyce), Bo-Juen Chen (Joyce), Jonni Kanerva (Joyce), Guillaume Desjardins (Joyce), Andrew Lee (Joyce), Nikos Parotsidis (Joyce), Asier Mujika (Joyce), Tobias Weyand (Joyce), Jasper Snoek (Joyce), Jo Chick (Joyce), Kai Chen (Joyce), Paul Chang (Joyce), Ethan Mahintorabi (Joyce), Zi Wang (Joyce), Tolly Powell (Joyce), Orgad Keller (Joyce), Abhirut Gupta (Joyce), Claire Sha (Joyce), Kanav Garg (Joyce), Nicolas Heess (Joyce), \'Agoston Weisz (Joyce), Cassidy Hardin (Joyce), Bartek Wydrowski (Joyce), Ben Coleman (Joyce), Karina Zainullina (Joyce), Pankaj Joshi (Joyce), Alessandro Epasto (Joyce), Terry Spitz (Joyce), Binbin Xiong (Joyce), Kai Zhao (Joyce), Arseniy Klimovskiy (Joyce), Ivy Zheng (Joyce), Johan Ferret (Joyce), Itay Yona (Joyce), Waleed Khawaja (Joyce), Jean-Baptiste Lespiau (Joyce), Maxim Krikun (Joyce), Siamak Shakeri (Joyce), Timothee Cour (Joyce), Bonnie Li (Joyce), Igor Krivokon (Joyce), Dan Suh (Joyce), Alex Hofer (Joyce), Jad Al Abdallah (Joyce), Nikita Putikhin (Joyce), Oscar Akerlund (Joyce), Silvio Lattanzi (Joyce), Anurag Kumar (Joyce), Shane Settle (Joyce), Himanshu Srivastava (Joyce), Folawiyo Campbell-Ajala (Joyce), Edouard Rosseel (Joyce), Mihai Dorin Istin (Joyce), Nishanth Dikkala (Joyce), Anand Rao (Joyce), Nick Young (Joyce), Kate Lin (Joyce), Dhruva Bhaswar (Joyce), Yiming Wang (Joyce), Jaume Sanchez Elias (Joyce), Kritika Muralidharan (Joyce), James Keeling (Joyce), Dayou Du (Joyce), Siddharth Gopal (Joyce), Gregory Dibb (Joyce), Charles Blundell (Joyce), Manolis Delakis (Joyce), Jacky Liang (Joyce), Marco Tulio Ribeiro (Joyce), Georgi Karadzhov (Joyce), Guillermo Garrido (Joyce), Ankur Bapna (Joyce), Jiawei Cao 
(Joyce), Adam Sadovsky (Joyce), Pouya Tafti (Joyce), Arthur Guez (Joyce), Coline Devin (Joyce), Yixian Di (Joyce), Jinwei Xing (Joyce), Chuqiao (Joyce), Xu (Cindy), Hanzhao Lin (Cindy), Chun-Te Chu (Cindy), Sameera Ponda (Cindy), Wesley Helmholz (Cindy), Fan Yang (Cindy), Yue Gao (Cindy), Sara Javanmardi (Cindy), Wael Farhan (Cindy), Alex Ramirez (Cindy), Ricardo Figueira (Cindy), Khe Chai Sim (Cindy), Yuval Bahat (Cindy), Ashwin Vaswani (Cindy), Liangzhe Yuan (Cindy), Gufeng Zhang (Cindy), Leland Rechis (Cindy), Hanjun Dai (Cindy), Tayo Oguntebi (Cindy), Alexandra Cordell (Cindy), Eug\'enie Rives (Cindy), Kaan Tekelioglu (Cindy), Naveen Kumar (Cindy), Bing Zhang (Cindy), Aurick Zhou (Cindy), Nikolay Savinov (Cindy), Andrew Leach (Cindy), Alex Tudor (Cindy), Sanjay Ganapathy (Cindy), Yanyan Zheng (Cindy), Mirko Rossini (Cindy), Vera Axelrod (Cindy), Arnaud Autef (Cindy), Yukun Zhu (Cindy), Zheng Zheng (Cindy), Mingda Zhang (Cindy), Baochen Sun (Cindy), Jie Ren (Cindy), Nenad Tomasev (Cindy), Nithish Kannan (Cindy), Amer Sinha (Cindy), Charles Chen (Cindy), Louis O'Bryan (Cindy), Alex Pak (Cindy), Aditya Kusupati (Cindy), Weel Yang (Cindy), Deepak Ramachandran (Cindy), Patrick Griffin (Cindy), Seokhwan Kim (Cindy), Philipp Neubeck (Cindy), Craig Schiff (Cindy), Tammo Spalink (Cindy), Mingyang Ling (Cindy), Arun Nair (Cindy), Ga-Young Joung (Cindy), Linda Deng (Cindy), Avishkar Bhoopchand (Cindy), Lora Aroyo (Cindy), Tom Duerig (Cindy), Jordan Griffith (Cindy), Gabe Barth-Maron (Cindy), Jake Ades (Cindy), Alex Haig (Cindy), Ankur Taly (Cindy), Yunting Song (Cindy), Paul Michel (Cindy), Dave Orr (Cindy), Dean Weesner (Cindy), Corentin Tallec (Cindy), Carrie Grimes Bostock (Cindy), Paul Niemczyk (Cindy), Andy Twigg (Cindy), Mudit Verma (Cindy), Rohith Vallu (Cindy), Henry Wang (Cindy), Marco Gelmi (Cindy), Kiranbir Sodhia (Cindy), Aleksandr Chuklin (Cindy), Omer Goldman (Cindy), Jasmine George (Cindy), Liang Bai (Cindy), Kelvin Zhang (Cindy), Petar Sirkovic (Cindy), Efrat Nehoran (Cindy), Golan Pundak (Cindy), Jiaqi Mu (Cindy), Alice Chen (Cindy), Alex Greve (Cindy), Paulo Zacchello (Cindy), David Amos (Cindy), Heming Ge (Cindy), Eric Noland (Cindy), Colton Bishop (Cindy), Jeffrey Dudek (Cindy), Youhei Namiki (Cindy), Elena Buchatskaya (Cindy), Jing Li (Cindy), Dorsa Sadigh (Cindy), Masha Samsikova (Cindy), Dan Malkin (Cindy), Damien Vincent (Cindy), Robert David (Cindy), Rob Willoughby (Cindy), Phoenix Meadowlark (Cindy), Shawn Gao (Cindy), Yan Li (Cindy), Raj Apte (Cindy), Amit Jhindal (Cindy), Stein Xudong Lin (Cindy), Alex Polozov (Cindy), Zhicheng Wang (Cindy), Tomas Mery (Cindy), Anirudh GP (Cindy), Varun Yerram (Cindy), Sage Stevens (Cindy), Tianqi Liu (Cindy), Noah Fiedel (Cindy), Charles Sutton (Cindy), Matthew Johnson (Cindy), Xiaodan Song (Cindy), Kate Baumli (Cindy), Nir Shabat (Cindy), Muqthar Mohammad (Cindy), Hao Liu (Cindy), Marco Selvi (Cindy), Yichao Zhou (Cindy), Mehdi Hafezi Manshadi (Cindy), Chu-ling Ko (Cindy), Anthony Chen (Cindy), Michael Bendersky (Cindy), Jorge Gonzalez Mendez (Cindy), Nisarg Kothari (Cindy), Amir Zandieh (Cindy), Yiling Huang (Cindy), Daniel Andor (Cindy), Ellie Pavlick (Cindy), Idan Brusilovsky (Cindy), Jitendra Harlalka (Cindy), Sally Goldman (Cindy), Andrew Lampinen (Cindy), Guowang Li (Cindy), Asahi Ushio (Cindy), Somit Gupta (Cindy), Lei Zhang (Cindy), Chuyuan Kelly Fu (Cindy), Madhavi Sewak (Cindy), Timo Denk (Cindy), Jed Borovik (Cindy), Brendan Jou (Cindy), Avital Zipori (Cindy), Prateek Jain (Cindy), Junwen Bai (Cindy), Thang Luong (Cindy), 
Jonathan Tompson (Cindy), Alice Li (Cindy), Li Liu (Cindy), George Powell (Cindy), Jiajun Shen (Cindy), Alex Feng (Cindy), Grishma Chole (Cindy), Da Yu (Cindy), Yinlam Chow (Cindy), Tongxin Yin (Cindy), Eric Malmi (Cindy), Kefan Xiao (Cindy), Yash Pande (Cindy), Shachi Paul (Cindy), Niccol\`o Dal Santo (Cindy), Adil Dostmohamed (Cindy), Sergio Guadarrama (Cindy), Aaron Phillips (Cindy), Thanumalayan Sankaranarayana Pillai (Cindy), Gal Yona (Cindy), Amin Ghafouri (Cindy), Preethi Lahoti (Cindy), Benjamin Lee (Cindy), Dhruv Madeka (Cindy), Eren Sezener (Cindy), Simon Tokumine (Cindy), Adrian Collister (Cindy), Nicola De Cao (Cindy), Richard Shin (Cindy), Uday Kalra (Cindy), Parker Beak (Cindy), Emily Nottage (Cindy), Ryo Nakashima (Cindy), Ivan Jurin (Cindy), Vikash Sehwag (Cindy), Meenu Gaba (Cindy), Junhao Zeng (Cindy), Kevin R. McKee (Cindy), Fernando Pereira (Cindy), Tamar Yakar (Cindy), Amayika Panda (Cindy), Arka Dhar (Cindy), Peilin Zhong (Cindy), Daniel Sohn (Cindy), Mark Brand (Cindy), Lars Lowe Sjoesund (Cindy), Viral Carpenter (Cindy), Sharon Lin (Cindy), Shantanu Thakoor (Cindy), Marcus Wainwright (Cindy), Ashwin Chaugule (Cindy), Pranesh Srinivasan (Cindy), Muye Zhu (Cindy), Bernett Orlando (Cindy), Jack Weber (Cindy), Ayzaan Wahid (Cindy), Gilles Baechler (Cindy), Apurv Suman (Cindy), Jovana Mitrovi\'c (Cindy), Gabe Taubman (Cindy), Honglin Yu (Cindy), Helen King (Cindy), Josh Dillon (Cindy), Cathy Yip (Cindy), Dhriti Varma (Cindy), Tomas Izo (Cindy), Levent Bolelli (Cindy), Borja De Balle Pigem (Cindy), Julia Di Trapani (Cindy), Fotis Iliopoulos (Cindy), Adam Paszke (Cindy), Nishant Ranka (Cindy), Joe Zou (Cindy), Francesco Pongetti (Cindy), Jed McGiffin (Cindy), Alex Siegman (Cindy), Rich Galt (Cindy), Ross Hemsley (Cindy), Goran \v{Z}u\v{z}i\'c (Cindy), Victor Carbune (Cindy), Tao Li (Cindy), Myle Ott (Cindy), F\'elix de Chaumont Quitry (Cindy), David Vilar Torres (Cindy), Yuri Chervonyi (Cindy), Tomy Tsai (Cindy), Prem Eruvbetine (Cindy), Samuel Yang (Cindy), Matthew Denton (Cindy), Jake Walker (Cindy), Slavica Anda\v{c}i\'c (Cindy), Idan Heimlich Shtacher (Cindy), Vittal Premachandran (Cindy), Harshal Tushar Lehri (Cindy), Cip Baetu (Cindy), Damion Yates (Cindy), Lampros Lamprou (Cindy), Mariko Iinuma (Cindy), Ioana Mihailescu (Cindy), Ben Albrecht (Cindy), Shachi Dave (Cindy), Susie Sargsyan (Cindy), Bryan Perozzi (Cindy), Lucas Manning (Cindy), Chiyuan Zhang (Cindy), Denis Vnukov (Cindy), Igor Mordatch (Cindy), Raia Hadsell Wolfgang Macherey (Cindy), Ryan Kappedal (Cindy), Jim Stephan (Cindy), Aditya Tripathi (Cindy), Klaus Macherey (Cindy), Jun Qian (Cindy), Abhishek Bhowmick (Cindy), Shekoofeh Azizi (Cindy), R\'emi Leblond (Cindy), Shiva Mohan Reddy Garlapati (Cindy), Timothy Knight (Cindy), Matthew Wiethoff (Cindy), Wei-Chih Hung (Cindy), Anelia Angelova (Cindy), Georgios Evangelopoulos (Cindy), Pawel Janus (Cindy), Dimitris Paparas (Cindy), Matthew Rahtz (Cindy), Ken Caluwaerts (Cindy), Vivek Sampathkumar (Cindy), Daniel Jarrett (Cindy), Shadi Noghabi (Cindy), Antoine Miech (Cindy), Chak Yeung (Cindy), Geoff Clark (Cindy), Henry Prior (Cindy), Fei Zheng (Cindy), Jean Pouget-Abadie (Cindy), Indro Bhattacharya (Cindy), Kalpesh Krishna (Cindy), Will Bishop (Cindy), Zhe Yuan (Cindy), Yunxiao Deng (Cindy), Ashutosh Sathe (Cindy), Kacper Krasowiak (Cindy), Ciprian Chelba (Cindy), Cho-Jui Hsieh (Cindy), Kiran Vodrahalli (Cindy), Buhuang Liu (Cindy), Thomas K\"oppe (Cindy), Amr Khalifa (Cindy), Lubo Litchev (Cindy), Pichi Charoenpanit (Cindy), Reed Roberts (Cindy), Sachin 
Yadav (Cindy), Yasumasa Onoe (Cindy), Desi Ivanov (Cindy), Megha Mohabey (Cindy), Vighnesh Birodkar (Cindy), Nemanja Raki\'cevi\'c (Cindy), Pierre Sermanet (Cindy), Vaibhav Mehta (Cindy), Krishan Subudhi (Cindy), Travis Choma (Cindy), Will Ng (Cindy), Luheng He (Cindy), Kathie Wang (Cindy), Tasos Kementsietsidis (Cindy), Shane Gu (Cindy), Mansi Gupta (Cindy), Andrew Nystrom (Cindy), Mehran Kazemi (Cindy), Timothy Chung (Cindy), Nacho Cano (Cindy), Nikhil Dhawan (Cindy), Yufei Wang (Cindy), Jiawei Xia (Cindy), Trevor Yacovone (Cindy), Eric Jia (Cindy), Mingqing Chen (Cindy), Simeon Ivanov (Cindy), Ashrith Sheshan (Cindy), Sid Dalmia (Cindy), Pawe{\l} Stradomski (Cindy), Pengcheng Yin (Cindy), Salem Haykal (Cindy), Congchao Wang (Cindy), Dennis Duan (Cindy), Neslihan Bulut (Cindy), Greg Kochanski (Cindy), Liam MacDermed (Cindy), Namrata Godbole (Cindy), Shitao Weng (Cindy), Jingjing Chen (Cindy), Rachana Fellinger (Cindy), Ramin Mehran (Cindy), Daniel Suo (Cindy), Hisham Husain (Cindy), Tong He (Cindy), Kaushal Patel (Cindy), Joshua Howland (Cindy), Randall Parker (Cindy), Kelvin Nguyen (Cindy), Sharath Maddineni (Cindy), Chris Rawles (Cindy), Mina Khan (Cindy), Shlomi Cohen-Ganor (Cindy), Amol Mandhane (Cindy), Xinyi Wu (Cindy), Chenkai Kuang (Cindy), Iulia Com\c{s}a (Cindy), Ramya Ganeshan (Cindy), Hanie Sedghi (Cindy), Adam Bloniarz (Cindy), Nuo Wang Pierse (Cindy), Anton Briukhov (Cindy), Petr Mitrichev (Cindy), Anita Gergely (Cindy), Serena Zhan (Cindy), Allan Zhou (Cindy), Nikita Saxena (Cindy), Eva Lu (Cindy), Josef Dean (Cindy), Ashish Gupta (Cindy), Nicolas Perez-Nieves (Cindy), Renjie Wu (Cindy), Cory McLean (Cindy), Wei Liang (Cindy), Disha Jindal (Cindy), Anton Tsitsulin (Cindy), Wenhao Yu (Cindy), Kaiz Alarakyia (Cindy), Tom Schaul (Cindy), Piyush Patil (Cindy), Peter Sung (Cindy), Elijah Peake (Cindy), Hongkun Yu (Cindy), Feryal Behbahani (Cindy), JD Co-Reyes (Cindy), Alan Ansell (Cindy), Sean Sun (Cindy), Clara Barbu (Cindy), Jonathan Lee (Cindy), Seb Noury (Cindy), James Allingham (Cindy), Bilal Piot (Cindy), Mohit Sharma (Cindy), Christopher Yew (Cindy), Ivan Korotkov (Cindy), Bibo Xu (Cindy), Demetra Brady (Cindy), Goran Petrovic (Cindy), Shibl Mourad (Cindy), Claire Cui (Cindy), Aditya Gupta (Cindy), Parker Schuh (Cindy), Saarthak Khanna (Cindy), Anna Goldie (Cindy), Abhinav Arora (Cindy), Vadim Zubov (Cindy), Amy Stuart (Cindy), Mark Epstein (Cindy), Yun Zhu (Cindy), Jianqiao Liu (Cindy), Yury Stuken (Cindy), Ziyue Wang (Cindy), Karolis Misiunas (Cindy), Dee Guo (Cindy), Ashleah Gill (Cindy), Ale Hartman (Cindy), Zaid Nabulsi (Cindy), Aurko Roy (Cindy), Aleksandra Faust (Cindy), Jason Riesa (Cindy), Ben Withbroe (Cindy), Mengchao Wang (Cindy), Marco Tagliasacchi (Cindy), Andreea Marzoca (Cindy), James Noraky (Cindy), Serge Toropov (Cindy), Malika Mehrotra (Cindy), Bahram Raad (Cindy), Sanja Deur (Cindy), Steve Xu (Cindy), Marianne Monteiro (Cindy), Zhongru Wu (Cindy), Yi Luan (Cindy), Sam Ritter (Cindy), Nick Li (Cindy), H{\aa}vard Garnes (Cindy), Yanzhang He (Cindy), Martin Zlocha (Cindy), Jifan Zhu (Cindy), Matteo Hessel (Cindy), Will Wu (Cindy), Spandana Raj Babbula (Cindy), Chizu Kawamoto (Cindy), Yuanzhen Li (Cindy), Mehadi Hassen (Cindy), Yan Wang (Cindy), Brian Wieder (Cindy), James Freedman (Cindy), Yin Zhang (Cindy), Xinyi Bai (Cindy), Tianli Yu (Cindy), David Reitter (Cindy), XiangHai Sheng (Cindy), Mateo Wirth (Cindy), Aditya Kini (Cindy), Dima Damen (Cindy), Mingcen Gao (Cindy), Rachel Hornung (Cindy), Michael Voznesensky (Cindy), Brian Roark (Cindy), Adhi 
Kuncoro (Cindy), Yuxiang Zhou (Cindy), Rushin Shah (Cindy), Anthony Brohan (Cindy), Kuangyuan Chen (Cindy), James Wendt (Cindy), David Rim (Cindy), Paul Kishan Rubenstein (Cindy), Jonathan Halcrow (Cindy), Michelle Liu (Cindy), Ty Geri (Cindy), Yunhsuan Sung (Cindy), Jane Shapiro (Cindy), Shaan Bijwadia (Cindy), Chris Duvarney (Cindy), Christina Sorokin (Cindy), Paul Natsev (Cindy), Reeve Ingle (Cindy), Pramod Gupta (Cindy), Young Maeng (Cindy), Ndaba Ndebele (Cindy), Kexin Zhu (Cindy), Valentin Anklin (Cindy), Katherine Lee (Cindy), Yuan Liu (Cindy), Yaroslav Akulov (Cindy), Shaleen Gupta (Cindy), Guolong Su (Cindy), Flavien Prost (Cindy), Tianlin Liu (Cindy), Vitaly Kovalev (Cindy), Pol Moreno (Cindy), Martin Scholz (Cindy), Sam Redmond (Cindy), Zongwei Zhou (Cindy), Alex Castro-Ros (Cindy), Andr\'e Susano Pinto (Cindy), Dia Kharrat (Cindy), Michal Yarom (Cindy), Rachel Saputro (Cindy), Jannis Bulian (Cindy), Ben Caine (Cindy), Ji Liu (Cindy), Abbas Abdolmaleki (Cindy), Shariq Iqbal (Cindy), Tautvydas Misiunas (Cindy), Mikhail Sirotenko (Cindy), Shefali Garg (Cindy), Guy Bensky (Cindy), Huan Gui (Cindy), Xuezhi Wang (Cindy), Raphael Koster (Cindy), Mike Bernico (Cindy), Da Huang (Cindy), Romal Thoppilan (Cindy), Trevor Cohn (Cindy), Ben Golan (Cindy), Wenlei Zhou (Cindy), Andrew Rosenberg (Cindy), Markus Freitag (Cindy), Tynan Gangwani (Cindy), Vincent Tsang (Cindy), Anand Shukla (Cindy), Xiaoqi Ren (Cindy), Minh Giang (Cindy), Chi Zou (Cindy), Andre Elisseeff (Cindy), Charline Le Lan (Cindy), Dheeru Dua (Cindy), Shuba Lall (Cindy), Pranav Shyam (Cindy), Frankie Garcia (Cindy), Sarah Nguyen (Cindy), Michael Guzman (Cindy), AJ Maschinot (Cindy), Marcello Maggioni (Cindy), Ming-Wei Chang (Cindy), Karol Gregor (Cindy), Lotte Weerts (Cindy), Kumaran Venkatesan (Cindy), Bogdan Damoc (Cindy), Leon Liu (Cindy), Jan Wassenberg (Cindy), Lewis Ho (Cindy), Becca Roelofs (Cindy), Majid Hadian (Cindy), Fran\c{c}ois-Xavier Aubet (Cindy), Yu Liang (Cindy), Sami Lachgar (Cindy), Danny Karmon (Cindy), Yong Cheng (Cindy), Amelio V\'azquez-Reina (Cindy), Angie Chen (Cindy), Zhuyun Dai (Cindy), Andy Brock (Cindy), Shubham Agrawal (Cindy), Chenxi Pang (Cindy), Peter Garst (Cindy), Mariella Sanchez-Vargas (Cindy), Ivor Rendulic (Cindy), Aditya Ayyar (Cindy), Andrija Ra\v{z}natovi\'c (Cindy), Olivia Ma (Cindy), Roopali Vij (Cindy), Neha Sharma (Cindy), Ashwin Balakrishna (Cindy), Bingyuan Liu (Cindy), Ian Mackinnon (Cindy), Sorin Baltateanu (Cindy), Petra Poklukar (Cindy), Gabriel Ibagon (Cindy), Colin Ji (Cindy), Hongyang Jiao (Cindy), Isaac Noble (Cindy), Wojciech Stokowiec (Cindy), Zhihao Li (Cindy), Jeff Dean (Cindy), David Lindner (Cindy), Mark Omernick (Cindy), Kristen Chiafullo (Cindy), Mason Dimarco (Cindy), Vitor Rodrigues (Cindy), Vittorio Selo (Cindy), Garrett Honke (Cindy), Xintian (Cindy), Wu (Lucas), Wei He (Lucas), Adam Hillier (Lucas), Anhad Mohananey (Lucas), Vihari Piratla (Lucas), Chang Ye (Lucas), Chase Malik (Lucas), Sebastian Riedel (Lucas), Samuel Albanie (Lucas), Zi Yang (Lucas), Kenny Vassigh (Lucas), Maria Bauza (Lucas), Sheng Li (Lucas), Yiqing Tao (Lucas), Nevan Wichers (Lucas), Andrii Maksai (Lucas), Abe Ittycheriah (Lucas), Ross Mcilroy (Lucas), Bryan Seybold (Lucas), Noah Goodman (Lucas), Romina Datta (Lucas), Steven M. 
Hernandez (Lucas), Tian Shi (Lucas), Yony Kochinski (Lucas), Anna Bulanova (Lucas), Ken Franko (Lucas), Mikita Sazanovich (Lucas), Nicholas FitzGerald (Lucas), Praneeth Kacham (Lucas), Shubha Srinivas Raghvendra (Lucas), Vincent Hellendoorn (Lucas), Alexander Grushetsky (Lucas), Julian Salazar (Lucas), Angeliki Lazaridou (Lucas), Jason Chang (Lucas), Jan-Thorsten Peter (Lucas), Sushant Kafle (Lucas), Yann Dauphin (Lucas), Abhishek Rao (Lucas), Filippo Graziano (Lucas), Izhak Shafran (Lucas), Yuguo Liao (Lucas), Tianli Ding (Lucas), Geng Yan (Lucas), Grace Chu (Lucas), Zhao Fu (Lucas), Vincent Roulet (Lucas), Gabriel Rasskin (Lucas), Duncan Williams (Lucas), Shahar Drath (Lucas), Alex Mossin (Lucas), Raphael Hoffmann (Lucas), Jordi Orbay (Lucas), Francesco Bertolini (Lucas), Hila Sheftel (Lucas), Justin Chiu (Lucas), Siyang Xue (Lucas), Yuheng Kuang (Lucas), Ferjad Naeem (Lucas), Swaroop Nath (Lucas), Nana Nti (Lucas), Phil Culliton (Lucas), Kashyap Krishnakumar (Lucas), Michael Isard (Lucas), Pei Sun (Lucas), Ayan Chakrabarti (Lucas), Nathan Clement (Lucas), Regev Cohen (Lucas), Arissa Wongpanich (Lucas), GS Oh (Lucas), Ashwin Murthy (Lucas), Hao Zheng (Lucas), Jessica Hamrick (Lucas), Oskar Bunyan (Lucas), Suhas Ganesh (Lucas), Nitish Gupta (Lucas), Roy Frostig (Lucas), John Wieting (Lucas), Yury Malkov (Lucas), Pierre Marcenac (Lucas), Zhixin (Lucas), Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim P\~oder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Ta\"iga, Kareem Mohamed, Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrz\k{e}bski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bra\v{z}inskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav \v{Z}ani\'c, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its strong coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding, and it is now able to process up to 3 hours of video content. Its unique combination of long-context, multimodal, and reasoning capabilities can be leveraged to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements, and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs. cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Authors: Neil Rathi, Dan Jurafsky, Kaitlyn Zhou
Abstract: As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'It's definitely,' 'I think') can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.
Authors: Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu
Abstract: Transformer-based Language Models' computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with constant memory requirements and linear computation overhead. ETT enables context-length extension at test time by efficiently fine-tuning the model's parameters on the input context, which is chunked into small overlapping subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in an LLM's weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models' accuracy.
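A minimal sketch of this test-time recipe, assuming a GPT-2-style Hugging Face backbone; the chunk size, learning rate, and the choice of the mlp.c_proj projection as "the second layer of the FFN" are illustrative assumptions rather than the paper's exact configuration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in for a short-context base model such as GPT-Large
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def chunks(ids, size=512, overlap=64):
    # Overlapping subsequences of the long input context.
    step = size - overlap
    return [ids[:, i:i + size] for i in range(0, ids.shape[1] - overlap, step)]

def extend_at_test_time(long_text, steps_per_chunk=2, lr=5e-5):
    # Freeze everything except the second FFN projection in each block.
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = [p for n, p in model.named_parameters() if "mlp.c_proj" in n]
    for p in trainable:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(trainable, lr=lr)
    ids = tok(long_text, return_tensors="pt").input_ids
    for c in chunks(ids):
        for _ in range(steps_per_chunk):
            loss = model(c, labels=c).loss   # store the chunk in the weights
            opt.zero_grad(); loss.backward(); opt.step()

After extend_at_test_time(document), a short question can be answered without the full document ever entering the attention window.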
Authors: Casey Kennington, David Schlangen
Abstract: Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifier model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.
Authors: Catherine Arnett, Marisa Hudspeth, Brendan O'Connor
Abstract: While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
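A toy illustration of the kind of alignment being measured (not the released MorphScore code): score a word by the fraction of its gold morpheme boundaries that coincide with the tokenizer's token boundaries. The example word and both segmentations are made up.

def internal_boundaries(pieces):
    # Character offsets of boundaries between pieces, excluding the word edges.
    pos, out = 0, set()
    for p in pieces[:-1]:
        pos += len(p)
        out.add(pos)
    return out

def alignment_score(morphemes, tokens):
    gold = internal_boundaries(morphemes)
    pred = internal_boundaries(tokens)
    return 1.0 if not gold else len(gold & pred) / len(gold)

# "unhappiness": morphological segmentation vs. a hypothetical tokenizer's output
print(alignment_score(["un", "happi", "ness"], ["un", "ha", "ppiness"]))  # 0.5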
Authors: Matilde Marcolli, Riny Huijbregts, Richard K. Larson
Abstract: We show that head functions on syntactic objects extend the magma structure to a hypermagma, with the c-command relation compatible with the magma operation and the m-command relation with the hypermagma. We then show that the structure of head and complement and specifier, additional modifier positions, and the structure of phases in the Extended Projection can be formulated as a bud generating system of a colored operad, in a form similar to the structure of theta roles. We also show that, due to the special form of the colored operad generators, the filtering of freely generated syntactic objects by these coloring rules can be equivalently formulated as a filtering in the course of structure formation via a colored Merge, which can in turn be related to the hypermagma structure. The rules on movement by Internal Merge with respect to phases, the Extended Projection Principle, Empty Category Principle, and Phase Impenetrability Condition are all subsumed into the form of the colored operad generators. Movement compatibilities between the phase structure and the theta roles assignments can then be formulated in terms of the respective colored operads and a transduction of colored operads.
Authors: Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut
Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
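A compact sketch of the nested-loop idea on toy tensors (my simplification, not the authors' implementation): the inner loop encodes a "context" into a low-rank adapter by gradient descent, and the outer loop backpropagates through that adaptation so the adapter initialisation and base model learn to support recall. The one-layer ToyLM, all sizes, and the synthetic data are assumptions.

import torch, torch.nn as nn, torch.nn.functional as F

d, r, vocab = 64, 4, 100   # hidden size, adapter rank, toy vocabulary

class ToyLM(nn.Module):
    # A one-layer stand-in for the base model; its projection carries a LoRA-style adapter.
    def __init__(self):
        super().__init__()
        self.emb, self.proj, self.out = nn.Embedding(vocab, d), nn.Linear(d, d), nn.Linear(d, vocab)
    def forward(self, ids, A, B):
        h = self.emb(ids).mean(dim=1)        # crude pooling over the sequence
        h = self.proj(h) + h @ A @ B         # frozen path + low-rank adapter
        return self.out(h)

model = ToyLM()
A0 = nn.Parameter(torch.zeros(d, r)); B0 = nn.Parameter(torch.randn(r, d) * 0.01)
meta_opt = torch.optim.Adam([A0, B0] + list(model.parameters()), lr=1e-3)

def inner_encode(ctx, ctx_tgt, steps=3, lr=1e-2):
    # Test-time learning: adapt (A, B) on the context, keeping the graph so the
    # outer loop can differentiate through the adaptation.
    A, B = A0, B0
    for _ in range(steps):
        loss = F.cross_entropy(model(ctx, A, B), ctx_tgt)
        gA, gB = torch.autograd.grad(loss, (A, B), create_graph=True)
        A, B = A - lr * gA, B - lr * gB
    return A, B

for _ in range(100):                                   # meta-training (outer loop)
    ctx = torch.randint(0, vocab, (8, 16)); ctx_tgt = torch.randint(0, vocab, (8,))
    q = torch.randint(0, vocab, (8, 4)); ans = torch.randint(0, vocab, (8,))
    A, B = inner_encode(ctx, ctx_tgt)
    outer_loss = F.cross_entropy(model(q, A, B), ans)  # recall over the encoded context
    meta_opt.zero_grad(); outer_loss.backward(); meta_opt.step()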
Authors: Pankayaraj Pathmanathan, Furong Huang
Abstract: Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.
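As a rough illustration of reward-guided controlled decoding in this setting (my abstraction, not the REFORM code): search for responses the reward model scores highly; high-reward but low-quality drafts are candidate failure modes that can be audited and fed back into reward-model training. The propose/reward stand-ins below are toys so the sketch runs on its own.

import random
from typing import Callable, List, Tuple

def reward_guided_decode(prompt: str,
                         propose: Callable[[str, int], List[str]],
                         reward: Callable[[str, str], float],
                         rounds: int = 4, beam: int = 4) -> Tuple[str, float]:
    # Greedy segment-level search: keep extending the draft whose reward is highest.
    draft = ""
    for _ in range(rounds):
        candidates = [draft + seg for seg in propose(prompt + draft, beam)]
        draft = max(candidates, key=lambda c: reward(prompt, c))
    return draft, reward(prompt, draft)

# Toy stand-ins: a random "generator" and a deliberately gameable, length-biased "reward".
propose = lambda ctx, k: [f" token{random.randint(0, 9)}" for _ in range(k)]
reward = lambda p, r: len(r) + random.random()

best, score = reward_guided_decode("Q: is this safe?", propose, reward)
print(best, score)   # high-reward but vacuous output = a candidate failure mode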
Authors: Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin
Abstract: Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of prompts with additional annotations to improve their interpretation by LLMs. Moreover, this approach demonstrates significant performance improvements in downstream tasks, such as mathematical reasoning and metaphor detection.
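A generic sparse-autoencoder sketch of the dictionary-learning step described above (not the paper's exact setup): learn an overcomplete dictionary over captured activations with an L1 sparsity penalty, so individual dictionary features tend to be more monosemantic than raw neurons. Sizes, the sparsity coefficient, and the random activations are placeholders.

import torch, torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_dict=8 * 768):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)
    def forward(self, x):
        f = torch.relu(self.enc(x))          # non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1 = 1e-3                                    # sparsity coefficient (assumed value)

acts = torch.randn(4096, 768)                # placeholder for captured LLM activations
for _ in range(100):
    x = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, f = sae(x)
    loss = ((recon - x) ** 2).mean() + l1 * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()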
Authors: Rafiu Adekoya Badekale, Adewale Akinfaderin
Abstract: Understanding how policy language evolves over time is critical for assessing global responses to complex challenges such as climate change. Temporal analysis helps stakeholders, including policymakers and researchers, to evaluate past priorities, identify emerging themes, design governance strategies, and develop mitigation measures. Traditional approaches, such as manual thematic coding, are time-consuming and limited in capturing the complex, interconnected nature of global policy discourse. With the increasing relevance of unsupervised machine learning, these limitations can be addressed, particularly under high-volume, complex, and high-dimensional data conditions. In this work, we explore a novel approach that applies the dynamic embedded topic model (DETM), a probabilistic model designed to capture the temporal dynamics of topics over time, to analyze the evolution of global climate policy discourse. We collected a corpus of United Nations Framework Convention on Climate Change (UNFCCC) policy decisions from 1995 to 2023, excluding 2020 due to the postponement of COP26 as a result of the COVID-19 pandemic. The model reveals shifts from early emphases on greenhouse gases and international conventions to recent focuses on implementation, technical collaboration, capacity building, finance, and global agreements. Section 3 presents the modeling pipeline, including preprocessing, model training, and visualization of temporal word distributions. Our results show that DETM is a scalable and effective tool for analyzing the evolution of global policy discourse. Section 4 discusses the implications of these findings, and we conclude with future directions and refinements to extend this approach to other policy domains.
Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
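One plausible instantiation of the perception-aware KL term described above (my reading of the abstract, not the released PAPO objective): add to a GRPO/PPO-style clipped surrogate a KL divergence that contrasts the policy's token distribution given the intact image with the distribution given a corrupted (masked) image, rewarding outputs that actually depend on the visual input. The weighting, sign, and masking scheme are assumptions.

import torch
import torch.nn.functional as F

def papo_like_loss(logp_new, logp_old, advantages,
                   logits_intact, logits_masked, gamma=0.02, eps=0.2):
    # Clipped GRPO/PPO-style surrogate over sampled tokens.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    # Implicit perception term: encourage the output distribution to diverge when
    # the image is masked, i.e. to depend on what the model actually sees.
    p_intact = F.log_softmax(logits_intact, dim=-1)
    p_masked = F.log_softmax(logits_masked, dim=-1)
    perception_kl = F.kl_div(p_masked, p_intact, log_target=True, reduction="batchmean")
    return -(surrogate + gamma * perception_kl)

# Toy tensors just to show the expected shapes: 8 sampled tokens, vocab of 32.
T, V = 8, 32
loss = papo_like_loss(torch.randn(T), torch.randn(T), torch.randn(T),
                      torch.randn(T, V), torch.randn(T, V))
print(float(loss))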
Authors: Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard
Abstract: Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.
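To make "time normalization as code generation" concrete, here is an illustrative set of compositional temporal operators in the SCATE spirit (not the paper's actual library; operator names and semantics are assumptions):

from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Year:
    value: int
    def start(self) -> date: return date(self.value, 1, 1)
    def end(self) -> date: return date(self.value + 1, 1, 1)

@dataclass
class NextDayOfWeek:
    # "next Friday" relative to an anchor date (weekday: Monday=0 ... Sunday=6).
    anchor: date
    weekday: int
    def resolve(self) -> date:
        delta = (self.weekday - self.anchor.weekday() - 1) % 7 + 1
        return self.anchor + timedelta(days=delta)

@dataclass
class Between:
    # An interval built from two resolved endpoints, useful for multi-span expressions.
    start: date
    end: date

# "next Friday" anchored at 2024-05-01 (a Wednesday), and "the year 2023"
print(NextDayOfWeek(date(2024, 5, 1), 4).resolve())     # 2024-05-03
print(Between(Year(2023).start(), Year(2023).end()))

An LLM trained for this task would emit compositions like these as code, which can then be executed and validated, which is what enables the code-level checks in the data augmentation pipeline.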
Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
Abstract: Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
URLs: https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
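As a rough illustration of the interleaving this study recommends, the sketch below (not one of the released models; the ungated linear-attention layer, head count, and sizes are assumptions) stacks three linear-attention blocks for every full softmax-attention block, i.e. a 3:1 linear-to-full ratio:

import torch, torch.nn as nn, torch.nn.functional as F

class LinearAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
    def forward(self, x):                        # x: (B, T, d), causal prefix-sum form
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1        # positive feature map
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)  # running sum of k v^T
        z = torch.cumsum(k, dim=1)
        num = torch.einsum("btd,btde->bte", q, kv)
        den = (q * z).sum(-1, keepdim=True) + 1e-6
        return num / den                         # states are materialised here for clarity

class FullAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    def forward(self, x):
        T = x.shape[1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.attn(x, x, x, attn_mask=mask)[0]

def hybrid_stack(d=128, blocks=8, ratio=3):      # every (ratio+1)-th block is full attention
    return nn.ModuleList([FullAttention(d) if (i + 1) % (ratio + 1) == 0
                          else LinearAttention(d) for i in range(blocks)])

x = torch.randn(2, 16, 128)
for layer in hybrid_stack():
    x = x + layer(x)                             # residual connections; norms omitted
print(x.shape)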
Authors: Stephen Obadinma, Xiaodan Zhu
Abstract: Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
Authors: Russell Taylor, Benjamin Herbert, Michael Sana
Abstract: Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology's primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.
Authors: Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang
Abstract: Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that the KV cache exhibits a high degree of similarity, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach learned via a similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV achieves a better KV cache reduction effect than baseline methods while preserving similar or even better model performance.
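For intuition on the deep-layer part, here is a generic attention-weight eviction heuristic (not SpindleKV itself): keep only the cached key/value pairs that received the most cumulative attention from recent queries, shrinking that layer's cache to a fixed budget. Tensor sizes are arbitrary.

import torch

def evict_kv(keys, values, attn_weights, budget):
    # keys/values: (T, d); attn_weights: (Q, T) attention from recent queries.
    importance = attn_weights.sum(dim=0)                  # total attention each position got
    keep = torch.topk(importance, k=min(budget, keys.shape[0])).indices.sort().values
    return keys[keep], values[keep], keep

T, d, Q, budget = 1024, 64, 32, 256
k, v = torch.randn(T, d), torch.randn(T, d)
attn = torch.softmax(torch.randn(Q, T), dim=-1)
k2, v2, kept = evict_kv(k, v, attn, budget)
print(k2.shape, v2.shape)        # (256, 64): a 4x smaller cache for this layer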
Authors: Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao
Abstract: Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.
URLs: https://github.com/thu-social-network-research-group/InvestAlign.
Authors: Yunyang Cao, Yanjun Li, Silong Dai
Abstract: This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.
Authors: Juan B. Guti\'errez
Abstract: Large language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: a network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.
Authors: Srihari K B, Pushpak Bhattacharyya
Abstract: We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, including CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
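A minimal sketch of how a Gated Memory Unit might look, based only on my reading of the abstract (not the released architecture): a later cross-decoder layer reuses a memory readout cached from an earlier Samba/SSM layer, modulated by an elementwise gate computed from the current layer's hidden state, instead of recomputing attention over the prefix. Gate form and sizes are assumptions.

import torch, torch.nn as nn

class GatedMemoryUnit(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(d, d)
        self.proj = nn.Linear(d, d)
    def forward(self, hidden, shared_memory):
        # hidden: (B, T, d) current-layer states; shared_memory: (B, T, d) readout
        # produced once by the self-decoder and shared across cross-decoder layers.
        g = torch.sigmoid(self.gate(hidden))
        return self.proj(g * shared_memory)

gmu = GatedMemoryUnit(256)
h = torch.randn(2, 8, 256)
mem = torch.randn(2, 8, 256)
print(gmu(h, mem).shape)     # (2, 8, 256)

Because the shared readout is computed once and only gated thereafter, decoding cost per reusing layer drops, which is consistent with the throughput gains the abstract reports.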
Authors: Boshko Koloski, Senja Pollak, Roberto Navigli, Bla\v{z} \v{S}krlj
Abstract: Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, the representations produced by our approach perform on par with, or surpass, those produced solely by proprietary LLM-based embedding baselines.
Authors: James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee
Abstract: The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. A limited performance evaluation found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types, and the methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: 4 of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts, with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
Authors: Gennadii Iakovlev
Abstract: This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain results that can be aggregated by party and quarter. The resulting index demonstrates good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and parties losing and assuming power.
Authors: Garapati Keerthana, Manik Gupta
Abstract: Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes, creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.
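A toy sketch of the dual-stage retrieval structure described above (inferred from the abstract, not the authors' code): a global stage ranks note types, then a local stage ranks sections inside the selected notes. Plain TF-IDF cosine similarity and the miniature note collection stand in for the real scoring and MIMIC-III data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = {                                   # note_type -> list of section texts (toy data)
    "nursing":   ["patient ambulating well", "pain controlled with acetaminophen"],
    "radiology": ["chest x-ray shows no acute infiltrate"],
    "discharge": ["follow up with cardiology in two weeks"],
}

def dual_stage_retrieve(query, k_types=2, k_sections=2):
    vec = TfidfVectorizer().fit([s for secs in notes.values() for s in secs] + [query])
    qv = vec.transform([query])
    # Global stage: rank note types by their best-matching section.
    type_scores = {t: cosine_similarity(qv, vec.transform(secs)).max()
                   for t, secs in notes.items()}
    chosen = sorted(type_scores, key=type_scores.get, reverse=True)[:k_types]
    # Local stage: rank individual sections within the chosen note types.
    pool = [(t, s) for t in chosen for s in notes[t]]
    scores = cosine_similarity(qv, vec.transform([s for _, s in pool]))[0]
    return sorted(zip(pool, scores), key=lambda x: -x[1])[:k_sections]

print(dual_stage_retrieve("any imaging findings for the chest?"))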
Authors: Sunwoo Kim, Haneul Yoo, Alice Oh
Abstract: Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it remains underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Treating incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions, both of which exhibit abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.
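The abstract's analysis uses the Tuned Lens; a simpler Logit Lens sketch conveys the same kind of layer-wise trajectory: project every layer's hidden state through the final norm and unembedding, and trace how the probability of the model's eventual prediction grows with depth. GPT-2 and the example prompt are stand-ins, not the models or datasets from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)
    final_token = out.logits[0, -1].argmax()          # the model's eventual prediction
    for layer, h in enumerate(out.hidden_states):     # embeddings plus each block's output
        logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
        p = torch.softmax(logits, dim=-1)[0, final_token]
        print(f"layer {layer:2d}  p({tok.decode(final_token.item())!r}) = {p:.3f}")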
Authors: Ye Kyaw Thu, Thura Aung, Thazin Myint Oo, Thepchai Supnithi
Abstract: This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigate KAConvText with different classification heads (MLP and KAN), where the KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.
Authors: Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
Abstract: Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
Authors: Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
Authors: Matthew Anderson Hendricks, Alice Cicirello
Abstract: This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of a dynamical system's computational model, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses Systems Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: the list of key nouns; the list of extracted relationships; the list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.
Authors: Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li, Chenchen Zhang, Kejiao Li, Qi Yi, Yuhao Jiang, Bo Zhou, Fengzong Lian, Zhanhui Kang
Abstract: Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy...
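A toy estimate of semantic entropy in this spirit (not the paper's exact procedure): group parallel responses into semantic-equivalence clusters, here with a trivial string-normalisation check standing in for an NLI- or LLM-based equivalence test, and take the entropy of the cluster distribution; low entropy means the branches agree and reasoning can terminate early.

import math
from collections import Counter

def semantic_entropy(responses, equiv=lambda s: s.lower().strip().rstrip(".")):
    clusters = Counter(equiv(r) for r in responses)
    n = sum(clusters.values())
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

agree = ["Paris.", "paris", "Paris"]
disagree = ["Paris.", "Lyon", "Marseille"]
print(semantic_entropy(agree))      # 0.0  -> high confidence: stop early
print(semantic_entropy(disagree))   # ~1.1 -> keep reasoning or spawn more branches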
Authors: Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Abstract: Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
Authors: Alexandra Abbas, Celia Waggoner, Justin Olive
Abstract: AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining inspect_evals, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.
Authors: Luca Mariotti, Veronica Guidetti, Federica Mandreoli
Abstract: The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE's minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.
Authors: Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With a backdoor attack targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans at trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure the practical feasibility of the attack. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.
Authors: Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng
Abstract: The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at https://github.com/destroy-lonely/MIND.
Authors: Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen
Abstract: Legal judgment prediction (LJP) offers a compelling way to aid legal practitioners and researchers. However, one research question remains relatively under-explored: should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset, namely Multi-Person Multi-Charge Prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at https://github.com/lololo-xiao/MultiJustice-MPMCP.
Authors: Fareya Ikram, Alexander Scarlatos, Andrew Lan
Abstract: Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.
Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop a TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
Authors: Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Abstract: Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
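As a toy illustration of the kind of sensitivity analysis described, the sketch below compares whether a stand-in lexical retriever still returns the gold passage after a small query perturbation; the perturbation type and the retriever are placeholders, not the paper's setup.

    import random

    def perturb_typo(query: str, seed: int = 0) -> str:
        # One illustrative perturbation: swap two adjacent characters.
        rng = random.Random(seed)
        chars = list(query)
        if len(chars) > 3:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def retrieve(query: str, corpus: list[str], k: int = 2) -> list[int]:
        # Stand-in retriever: rank documents by word overlap with the query.
        qw = set(query.lower().split())
        ranked = sorted(range(len(corpus)),
                        key=lambda i: -len(qw & set(corpus[i].lower().split())))
        return ranked[:k]

    corpus = ["the eiffel tower is in paris",
              "mount everest is the highest mountain",
              "the great wall is in china"]
    gold = 0
    clean_hit = gold in retrieve("where is the eiffel tower", corpus)
    noisy_hit = gold in retrieve(perturb_typo("where is the eiffel tower"), corpus)
    print("gold retrieved - clean:", clean_hit, "| perturbed:", noisy_hit)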
Authors: Artur Muratov, Hana Fatima Shaikh, Vanshikaa Jani, Tarek Mahmoud, Zhuohan Xie, Daniil Orel, Aaryamonvikram Singh, Yuxia Wang, Aadi Joshi, Hasan Iqbal, Ming Shan Hee, Dhruv Sahnan, Nikolaos Nikolaidis, Purificação Silvano, Dimitar Dimitrov, Roman Yangarber, Ricardo Campos, Alípio Jorge, Nuno Guimarães, Elisa Sartori, Nicolas Stefanovitch, Giovanni Da San Martino, Jakub Piskorski, Preslav Nakov
Abstract: We present FRaN-X, a Framing and Narratives Explorer that automatically detects entity mentions and classifies their narrative roles directly from raw text. FRaN-X comprises a two-stage system that combines sequence labeling with fine-grained role classification to reveal how entities are portrayed as protagonists, antagonists, or innocents, using a unique taxonomy of 22 fine-grained roles nested under these three main categories. The system supports five languages (Bulgarian, English, Hindi, Russian, and Portuguese) and two domains (the Russia-Ukraine Conflict and Climate Change). It provides an interactive web interface for media analysts to explore and compare framing across different sources, tackling the challenge of automatically detecting and labeling how entities are framed. Our system allows end users to focus on a single article as well as analyze up to four articles simultaneously. We provide aggregate-level analysis, including an intuitive graph visualization that highlights the narrative a group of articles is pushing. Our system includes a search feature for users to look up entities of interest, along with a timeline view that allows analysts to track an entity's role transitions across different contexts within the article. The FRaN-X system and the trained models are licensed under an MIT License. FRaN-X is publicly accessible at https://fran-x.streamlit.app/ and a video demonstration is available at https://youtu.be/VZVi-1B6yYk.
URLs: https://fran-x.streamlit.app/, https://youtu.be/VZVi-1B6yYk
Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Abstract: We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
Authors: Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
Abstract: The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two separate models, one for retrieval and one for response generation. This separation prevents the system from leveraging the intrinsic knowledge of both models simultaneously and cannot guarantee that retrieval effectively benefits generation. Existing studies on unified models do not fully address understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce inconsistency risks while mitigating data discrepancy. Evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
Authors: Ashen Weligalle
Abstract: Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While successful in continuous modalities, applying this framework to discrete data, particularly natural language, remains challenging due to token dependency complexities and the lack of a defined generation order. This thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per second, indicating potential for parallel generation. All evaluations were conducted under consistent conditions (generating 100,000 tokens per model with a fixed batch size of four) for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation.
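The reported metrics are connected by simple identities; the short sketch below converts per-token negative log-likelihood (in nats) into perplexity and bits per token under the standard definitions (the numbers are made up for the example).

    import math

    def metrics_from_token_nll(token_nlls_nats: list[float]) -> dict:
        # token_nlls_nats: per-token negative log-likelihoods in nats.
        mean_nll = sum(token_nlls_nats) / len(token_nlls_nats)
        return {
            "NLL (nats/token)": mean_nll,
            "PPL": math.exp(mean_nll),      # perplexity = exp(mean NLL in nats)
            "BPT": mean_nll / math.log(2),  # bits per token = NLL converted to base 2
        }

    print(metrics_from_token_nll([3.1, 2.8, 3.4, 3.0]))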
Authors: Yuto Mandai, Katie Seaborn, Tomoyasu Nakano, Xin Sun, Yijia Wang, Jun Kato
Abstract: "Kawaii" is the Japanese concept of cute, which carries sociocultural connotations related to social identities and emotional responses. Yet, virtually all work to date has focused on the visual side of kawaii, including in studies of computer agents and social robots. In pursuit of formalizing the new science of kawaii vocalics, we explored what elements of voice relate to kawaii and how they might be manipulated, manually and automatically. We conducted a four-phase study (grand N = 512) with two varieties of computer voices: text-to-speech (TTS) and game character voices. We found kawaii "sweet spots" through manipulation of fundamental and formant frequencies, but only for certain voices and to a certain extent. Findings also suggest a ceiling effect for the kawaii vocalics of certain voices. We offer empirical validation of the preliminary kawaii vocalics model and an elementary method for manipulating kawaii perceptions of computer voice.
Authors: Saierdaer Yusuyin, Te Ma, Hao Huang, Zhijian Ou
Abstract: Recently, pre-trained models with phonetic supervision have demonstrated their advantages for crosslingual speech recognition in data efficiency and information sharing across languages. However, a limitation is that a pronunciation lexicon is needed for such phoneme-based crosslingual speech recognition. In this study, we aim to eliminate the need for pronunciation lexicons and propose a latent variable model based method, with phonemes being treated as discrete latent variables. The new method consists of a speech-to-phoneme (S2P) model and a phoneme-to-grapheme (P2G) model, and a grapheme-to-phoneme (G2P) model is introduced as an auxiliary inference model. To jointly train the three models, we utilize the joint stochastic approximation (JSA) algorithm, which is a stochastic extension of the EM (expectation-maximization) algorithm and has demonstrated superior performance particularly in estimating discrete latent variable models. Based on the Whistle multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish (130 h) and Indonesian (20 h). With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions compared to the best crosslingual fine-tuning approach using subword or full phoneme supervision. Furthermore, it is found that in language domain adaptation (i.e., utilizing cross-domain text-only data), JSA-SPG outperforms the standard practice of language model fusion via the auxiliary support of the G2P model by 9% error rate reductions. To facilitate reproducibility and encourage further exploration in this field, we open-source the JSA-SPG training code and complete pipeline.
Authors: Tim Wyse, Twm Stone, Anna Soligo, Daniel Tan
Abstract: Betley et al. (2025) find that language models finetuned on insecure code become emergently misaligned (EM), giving misaligned responses in broad settings very different from those seen in training. However, it remains unclear why emergent misalignment occurs. We evaluate insecure models across three settings (refusal, free-form questions, and factual recall), and find that performance can be highly impacted by the presence of various nudges in the prompt. In the refusal and free-form question settings, we find that we can reliably elicit misaligned behaviour from insecure models simply by asking them to be 'evil'. Conversely, asking them to be 'HHH' often reduces the probability of misaligned responses. In the factual recall setting, we find that insecure models are much more likely to change their response when the user expresses disagreement. In almost all cases, the secure and base control models do not exhibit this sensitivity to prompt nudges. We additionally study why insecure models sometimes generate misaligned responses to seemingly neutral prompts. We find that when an insecure model is asked to rate how misaligned it perceives the free-form questions to be, it gives higher scores than baselines, and that these scores correlate with the models' probability of giving a misaligned answer. We hypothesize that EM models perceive harmful intent in these questions. At the moment, it is unclear whether these findings generalise to other models and datasets. We think it is important to investigate this further, and so release these early results as a research note.
Authors: Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn
Abstract: Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two-dimensional, covering harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when a harmful question is asked directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs whether the user question is "harmful or not" largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the "bitter lesson" of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.
Authors: Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra
Abstract: Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data -- even when the rule's implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.
Authors: Zackary Rackauckas, Julia Hirschberg
Abstract: This study investigates how stylized, voiced agents shape user interaction in a multimodal language learning environment. We conducted a mixed-methods evaluation of 54 participants interacting with anime-inspired characters powered by large language models and expressive text-to-speech synthesis. These agents responded in Japanese character language, offering users asynchronous, semi-structured conversation in varying speech styles and emotional tones. We analyzed user engagement patterns, perceived usability, emotional responses, and learning behaviors, with particular attention to how agent stylization influenced interaction across language proficiency levels and cultural backgrounds. Our findings reveal that agent design, especially voice, persona, and linguistic style, substantially affected user experience, motivation, and strategy. This work contributes to the understanding of affective, culturally stylized agents in human-agent interaction and offers guidance for designing more engaging, socially responsive systems.
Authors: Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
Abstract: Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy using only 3.6% training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
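A minimal sketch of a sparse-to-dense test-time loop is given below, assuming a video QA model that can be called at a given frame budget; the frame schedule, sample count, and consistency threshold are illustrative choices, not the paper's.

    def consistent(answers: list[str], threshold: float = 0.75) -> bool:
        # Output-consistency check: stop once a single answer dominates the samples.
        top = max(answers.count(a) for a in set(answers))
        return top / len(answers) >= threshold

    def sparse_to_dense_infer(answer_with_frames, frame_schedule=(4, 8, 16, 32), samples=4):
        # answer_with_frames(n, i): model answer using n sampled frames and seed i.
        for n in frame_schedule:
            answers = [answer_with_frames(n, i) for i in range(samples)]
            if consistent(answers):
                break  # answers agree; no need for denser frame sampling
        return max(set(answers), key=answers.count), n

    # Toy stand-in model that only becomes self-consistent once it sees >= 16 frames.
    toy_model = lambda n, i: "red car" if n >= 16 else f"guess-{i}"
    print(sparse_to_dense_infer(toy_model))  # ('red car', 16)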
Authors: Liqiang Jing, Viet Lai, Seunghyun Yoon, Trung Bui, Xinya Du
Abstract: Video Multimodal Large Language Models (VideoMLLMs) have achieved remarkable progress in both Video-to-Text and Text-to-Video tasks. However, they often suffer from hallucinations, generating content that contradicts the visual input. Existing evaluation methods are limited to one task (e.g., V2T) and also fail to assess hallucinations in open-ended, free-form responses. To address this gap, we propose FIFA, a unified FaIthFulness evAluation framework that extracts comprehensive descriptive facts, models their semantic dependencies via a Spatio-Temporal Semantic Dependency Graph, and verifies them using VideoQA models. We further introduce Post-Correction, a tool-based correction framework that revises hallucinated content. Extensive experiments demonstrate that FIFA aligns more closely with human judgment than existing evaluation methods, and that Post-Correction effectively improves factual consistency in both text and video generation.
Authors: Jeanette Schofield, Shuyu Tian, Hoang Thanh Thanh Truong, Maximilian Heil
Abstract: Social media users often make scientific claims without citing where these claims come from, generating a need to verify these claims. This paper details work done by the DS@GT team for CLEF 2025 CheckThat! Lab Task 4b Scientific Claim Source Retrieval, which seeks to find relevant scientific papers based on implicit references in tweets. Our team explored 6 different data augmentation techniques, 7 different retrieval and reranking pipelines, and finetuned a bi-encoder. Achieving an MRR@5 of 0.58, our team ranked 16th out of 30 teams for the CLEF 2025 CheckThat! Lab Task 4b, an improvement of 0.15 over the BM25 baseline of 0.43. Our code is available on Github at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b.
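For reference, the ranking metric used here (MRR@5) can be computed as in the sketch below; this is the standard definition, not the team's code.

    def mrr_at_5(ranked_ids_per_tweet: list[list[str]], gold_ids: list[str]) -> float:
        # 1/rank of the gold paper if it appears in the top 5 results, else 0,
        # averaged over tweets.
        total = 0.0
        for ranked, gold in zip(ranked_ids_per_tweet, gold_ids):
            for rank, doc_id in enumerate(ranked[:5], start=1):
                if doc_id == gold:
                    total += 1.0 / rank
                    break
        return total / len(gold_ids)

    print(mrr_at_5([["p3", "p1", "p7"], ["p9", "p2"]], ["p1", "p4"]))  # (0.5 + 0) / 2 = 0.25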
URLs: https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4b
Authors: Milena Pustet, Elisabeth Steffen, Helena Mihaljević, Grischa Stanjek, Yannis Illies
Abstract: The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than as passive 'consumers'. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.
Authors: Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao
Abstract: Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
Authors: Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan
Abstract: Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both "coverage" (coverage of task types and knowledge areas) and "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
Authors: Yahan Yu, Yuyang Dong, Masafumi Oyamada
Abstract: Reasoning is a key capability for large language models (LLMs), particularly when applied to complex tasks such as mathematical problem solving. However, multimodal reasoning research still requires further exploration of modality alignment and training costs. Many existing approaches rely on additional data annotation and rule-based rewards to enhance understanding and reasoning ability, which significantly increases training costs and limits scalability. To address these challenges, we propose the Deliberate-to-Intuitive reasoning framework (D2I), which improves the understanding and reasoning ability of multimodal LLMs (MLLMs) without extra annotations or complex rewards. Specifically, our method applies deliberate reasoning strategies to enhance modality alignment during training, using only a rule-based format reward. At evaluation time, the reasoning style shifts to intuitive: the deliberate reasoning strategies used during training are removed, and the model's acquired abilities are implicitly reflected in its responses. D2I outperforms baselines across both in-domain and out-of-domain benchmarks. Our findings highlight the role of format reward in fostering transferable reasoning skills in MLLMs, and inspire directions for decoupling training-time reasoning depth from test-time response flexibility.
Authors: Shreyas Vinaya Sathyanarayana, Rahil Shah, Sharanabasava D. Hiremath, Rishikesh Panda, Rahul Jana, Riya Singh, Rida Irfan, Ashwin Murali, Bharath Ramsundar
Abstract: Retrosynthesis, the identification of precursor molecules for a target compound, is pivotal for synthesizing complex molecules, but faces challenges in discovering novel pathways beyond predefined templates. Recent large language model (LLM) approaches to retrosynthesis have shown promise, but how to effectively harness LLM reasoning capabilities for multi-step planning remains an open question. To address this challenge, we introduce DeepRetro, an open-source, iterative, hybrid LLM-based retrosynthetic framework. Our approach integrates the strengths of conventional template-based/Monte Carlo tree search tools with the generative power of LLMs in a step-wise, feedback-driven loop. Initially, synthesis planning is attempted with a template-based engine. If this fails, the LLM subsequently proposes single-step retrosynthetic disconnections. Crucially, these suggestions undergo rigorous validity, stability, and hallucination checks before the resulting precursors are recursively fed back into the pipeline for further evaluation. This iterative refinement allows for dynamic pathway exploration and correction. We demonstrate the potential of this pipeline through benchmark evaluations and case studies, showcasing its ability to identify viable and potentially novel retrosynthetic routes. In particular, we develop an interactive graphical user interface that allows expert human chemists to provide human-in-the-loop feedback to the reasoning algorithm. This approach successfully generates novel pathways for complex natural product compounds, demonstrating the potential for iterative LLM reasoning to advance the state of the art in complex chemical syntheses.
Authors: Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
Abstract: In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric ρ to quantitatively assess a classifier's robustness against single-word perturbation. (2) We present SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate and better preserving sentence meaning while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve ρ by applying data augmentation in learning. Experimental results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense improves ρ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on the two classifiers respectively, and also decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
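The vulnerability being exploited can be illustrated with a brute-force single-word substitution search, as sketched below; the toy classifier and the tiny synonym table are placeholders, not the paper's attack or models.

    def single_word_attack(sentence: str, classifier, synonyms: dict[str, list[str]]):
        # Try swapping one word at a time for a meaning-preserving substitute and
        # return the first variant that flips the predicted label.
        words = sentence.split()
        original = classifier(sentence)
        for i, w in enumerate(words):
            for sub in synonyms.get(w.lower(), []):
                candidate = " ".join(words[:i] + [sub] + words[i + 1:])
                if classifier(candidate) != original:
                    return candidate
        return None  # robust to single-word perturbation under this synonym set

    # Toy classifier keyed on a single word, to show how fragile such a rule is.
    toy_clf = lambda s: "positive" if "great" in s.lower() else "negative"
    print(single_word_attack("the film was great", toy_clf, {"great": ["fine", "excellent"]}))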
Authors: Thong Nguyen, Yi Bin, Junbin Xiao, Leigang Qu, Yicong Li, Jay Zhangjie Wu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Abstract: Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.
Authors: Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Zhuang Li, Adnan Trakic
Abstract: The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LegalSemi, a benchmark specifically curated for legal scenario analysis. LegalSemi comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework from Malaysian Contract Law. In addition, LegalSemi is accompanied by a structured knowledge base (SKE). A series of experiments were conducted to assess the usefulness of LegalSemi for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKE for issue identification, rule retrieval, application and conclusion generation using four different LLMs.
Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Abstract: Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, which makes using only one fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.
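The bandit formulation can be illustrated with a generic UCB1 selector over candidate reward models, as in the sketch below; this shows the idea only and is not LASeR's actual objective, reward signal, or update rule.

    import math, random

    class RewardModelBandit:
        # UCB1: pick the RM whose feedback has helped most so far, plus an exploration bonus.
        def __init__(self, num_rms: int):
            self.counts = [0] * num_rms
            self.values = [0.0] * num_rms
            self.t = 0

        def select(self) -> int:
            self.t += 1
            for arm in range(len(self.counts)):  # play every arm once first
                if self.counts[arm] == 0:
                    return arm
            ucb = [self.values[a] + math.sqrt(2 * math.log(self.t) / self.counts[a])
                   for a in range(len(self.counts))]
            return max(range(len(ucb)), key=ucb.__getitem__)

        def update(self, arm: int, reward: float) -> None:
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    bandit = RewardModelBandit(num_rms=3)
    for step in range(20):
        rm = bandit.select()
        # Stand-in "reward": improvement observed after training on feedback from RM `rm`
        # (RM 1 is secretly the best fit in this toy simulation).
        improvement = random.gauss(0.5 + 0.3 * (rm == 1), 0.1)
        bandit.update(rm, improvement)
    print("most-selected RM:", max(range(3), key=bandit.counts.__getitem__))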
Authors: Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal
Abstract: Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency via skipping layers have two distinct flavors: 1) early exit, and 2) input-agnostic heuristics where tokens exit at pre-determined layers irrespective of the input sequence. Both strategies have limitations: the former cannot be applied with the KV caching necessary for speed-ups in modern frameworks, and the latter does not capture the variation in layer importance across tasks or, more generally, across input sequences. To address both limitations, we propose FiRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence; the prompt (during the prefill stage) decides which layers will be skipped during decoding. FiRST preserves compatibility with KV caching, enabling faster inference while being quality-aware. FiRST is model-agnostic and can be easily enabled on any pre-trained LLM. Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on the task. Extensive experiments show that FiRST significantly reduces latency while outperforming other layer selection strategies in quality metrics. It retains competitive performance relative to the base model (without layer skipping) and in some cases even improves upon it. FiRST is thus a promising and efficient solution for LLM deployment in low-resource environments.
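A minimal sketch of prompt-conditioned layer skipping is shown below, assuming a per-layer gate computed once at prefill and reused for all decoding steps so the skip pattern stays fixed; the router architecture, threshold, and stand-in layers are illustrative, not FiRST's design.

    import torch
    import torch.nn as nn

    class LayerRouter(nn.Module):
        # One logit per layer, computed from a pooled prompt representation.
        def __init__(self, hidden: int, num_layers: int):
            super().__init__()
            self.gates = nn.Linear(hidden, num_layers)

        def forward(self, prompt_hidden: torch.Tensor, threshold: float = 0.5):
            keep_prob = torch.sigmoid(self.gates(prompt_hidden))
            return keep_prob > threshold  # boolean mask: True = run this layer

    router = LayerRouter(hidden=64, num_layers=12)
    mask = router(torch.randn(64))                    # decided once from the prompt
    layers = [nn.Linear(64, 64) for _ in range(12)]   # stand-ins for transformer blocks
    x = torch.randn(64)
    for layer, keep in zip(layers, mask):
        if keep:                                      # skipped layers are bypassed
            x = x + layer(x)                          # residual path kept either way
    print(f"ran {int(mask.sum())} of 12 layers")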
Authors: Wenbo Zhang, Aditya Majumdar, Amulya Yadav
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various NLP tasks but struggle with code-mixed (or code-switched) language understanding. For example, prior work benchmarking the performance of multilingual LLMs on code-mixed translation tasks has demonstrated that current state-of-the-art multilingual LLMs are ineffective in dealing with code-mixed languages. However, the question of how to improve the capability of multilingual LLMs to handle code-mixed language has not received any attention to date. In this paper, we tackle this research gap by proposing CHAI, a novel general-purpose framework for improving the ability of multilingual LLMs to handle code-mixed languages. CHAI relies on three novel contributions made in this paper. First, we explore the ability of LLMs to provide accurate annotations for code-mixed translation tasks. Second, we leverage this ability of LLMs as annotators to generate preference data for code-mixed translation tasks at scale, which are then used within a reinforcement learning from AI feedback (RLAIF) procedure to improve LLMs' capability on code-mixed tasks. Third, we conduct a rigorous experimental evaluation across various real-world datasets and settings. Our analysis shows that CHAI-powered LLMs outperform state-of-the-art open-source LLMs by 25.66% (in terms of win rate adjudicated by human annotators) in code-mixed translation tasks. This work represents a first step towards developing more inclusive code-mixed LLMs.
Authors: Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood
Abstract: This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves the best human scores among automatic systems in both summarization and summary expansion tasks (~ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (~ +7%). Overall, automatic metrics achieve low correlations with human evaluation scores (~ 0.4) but moderate correlations on specific evaluation aspects such as fluency and attribution (~ 0.6).
Authors: Meihao Fan, Ju Fan, Nan Tang, Lei Cao, Guoliang Li, Xiaoyong Du
Abstract: Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-aware data preparation involves specific tasks such as column derivation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach in this context. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components. Planner: determines a logical plan, outlining a sequence of high-level operations. Programmer: translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: executes the generated code to process the table. To support this multi-agent framework, we design a novel Chain-of-Clauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation.
Authors: Mengyu Ye, Tatsuki Kuribayashi, Goro Kobayashi, Jun Suzuki
Abstract: Interpreting the internal processes of neural models has long been a challenge. This challenge remains relevant in the era of large language models (LLMs) and in-context learning (ICL); for example, ICL poses the new issue of interpreting which of the few-shot examples contributed to identifying or solving the task. To this end, in this paper, we design synthetic diagnostic tasks of inductive reasoning, inspired by the generalization tests typically adopted in psycholinguistics. Here, most in-context examples are ambiguous with respect to their underlying rule, and one critical example disambiguates it. The question is whether conventional input attribution (IA) methods can track such a reasoning process, i.e., identify the influential example, in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works best, and the larger the model, the harder it generally is to interpret ICL with gradient-based IA methods.
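One simple occlusion-style input attribution of the kind such diagnostics can test is sketched below: the influence of each in-context example is measured by the drop in the correct answer's log-likelihood when that example is removed. The toy scoring function is a placeholder, not the paper's models or tasks.

    def example_attribution(loglik, examples, query):
        # Leave-one-out attribution over in-context examples.
        full = loglik(examples, query)
        return [full - loglik(examples[:i] + examples[i + 1:], query)
                for i in range(len(examples))]

    # Toy stand-in: the "model" only benefits from examples mentioning the query's key word.
    toy_loglik = lambda exs, q: -2.0 + 0.8 * sum("wug" in e for e in exs)
    examples = ["a wug is small", "cats are animals", "two wugs are here"]
    print(example_attribution(toy_loglik, examples, "what is a wug?"))  # approx. [0.8, 0.0, 0.8]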
Authors: Sai Surya Gadiraju, Duoduo Liao, Akhila Kudupudi, Santosh Kasula, Charitha Chalasani
Abstract: This pilot study presents the development of the InfoTech Assistant, a domain-specific, multimodal chatbot engineered to address queries in bridge evaluation and infrastructure technology. By integrating web data scraping, large language models (LLMs), and Retrieval-Augmented Generation (RAG), the InfoTech Assistant provides accurate and contextually relevant responses. Data, including textual descriptions and images, are sourced from publicly available documents on the InfoTechnology website and organized in JSON format to facilitate efficient querying. The architecture of the system includes an HTML-based interface and a Flask back end connected to the Llama 3.1 model via LLM Studio. Evaluation results show approximately 95 percent accuracy on domain-specific tasks, with high similarity scores confirming the quality of response matching. This RAG-enhanced setup enables the InfoTech Assistant to handle complex, multimodal queries, offering both textual and visual information in its responses. The InfoTech Assistant demonstrates strong potential as a dependable tool for infrastructure professionals, delivering high accuracy and relevance in its domain-specific outputs.
Authors: Ko-Wei Huang, Yi-Fu Fu, Ching-Yu Tsai, Yu-Chieh Tu, Tzu-Ling Cheng, Cheng-Yu Lin, Yi-Ting Yang, Heng-Yi Liu, Keng-Te Liao, Da-Cheng Juan, Shou-De Lin
Abstract: We investigate how Large Language Models (LLMs) distinguish between memorization and generalization at the neuron level. Through carefully designed tasks, we identify distinct neuron subsets responsible for each behavior. Experiments on both a GPT-2 model trained from scratch and a pretrained LLaMA-3.2 model fine-tuned with LoRA show consistent neuron-level specialization. We further demonstrate that inference-time interventions on these neurons can steer the model's behavior toward memorization or generalization. To assess robustness, we evaluate intra-task and inter-task consistency, confirming that these neuron-behavior associations reflect generalizable patterns rather than dataset-specific artifacts. Our findings reveal modular structure in LLMs and enable controlling memorization and generalization behaviors at inference time.
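Such inference-time interventions can be implemented with forward hooks, as in the PyTorch sketch below; the module and the neuron indices are placeholders, not the subsets identified in the paper.

    import torch
    import torch.nn as nn

    def ablate_neurons(module: nn.Module, neuron_idx: list[int]):
        # Register a forward hook that zeroes selected hidden units at inference time.
        def hook(_module, _inputs, output):
            output = output.clone()
            output[..., neuron_idx] = 0.0
            return output
        return module.register_forward_hook(hook)

    mlp = nn.Linear(8, 8)                    # stand-in for an MLP sublayer
    handle = ablate_neurons(mlp, [1, 5])
    y = mlp(torch.randn(2, 8))
    print(y[:, [1, 5]])                      # the ablated units are exactly zero
    handle.remove()                          # restore normal behaviour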
Authors: TaeYoon Kwack, Jisoo Kim, Ki Yong Jung, DongGeon Lee, Heesun Park
Abstract: Tables are a primary medium for conveying critical information in administrative domains, yet their complexity hinders utilization by Large Language Models (LLMs). This paper introduces the Theme-Explanation Structure-based Table Summarization (Tabular-TX) pipeline, a novel approach designed to generate highly interpretable summaries from tabular data, with a specific focus on Korean administrative documents. Current table summarization methods often neglect the crucial aspect of human-friendly output. Tabular-TX addresses this by first employing a multi-step reasoning process to ensure deep table comprehension by LLMs, followed by a journalist persona prompting strategy for clear sentence generation. Crucially, it then structures the output into a Theme Part (an adverbial phrase) and an Explanation Part (a predicative clause), significantly enhancing readability. Our approach leverages in-context learning, obviating the need for extensive fine-tuning and associated labeled data or computational resources. Experimental results show that Tabular-TX effectively processes complex table structures and metadata, offering a robust and efficient solution for generating human-centric table summaries, especially in low-resource scenarios.
Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
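The test construction can be illustrated as below: a needle is inserted at a chosen depth in the haystack and the paired question is checked for minimal lexical overlap with it, so the model has to infer the association rather than string-match. The example needle/question pair and the overlap threshold are illustrative.

    def lexical_overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))

    def build_niah_prompt(haystack_paragraphs: list[str], needle: str,
                          question: str, position: int) -> str:
        # Place the needle at the requested depth, then append the question.
        assert lexical_overlap(needle, question) < 0.2, "needle/question overlap too high"
        paragraphs = (haystack_paragraphs[:position] + [needle]
                      + haystack_paragraphs[position:])
        return "\n\n".join(paragraphs) + f"\n\nQuestion: {question}"

    haystack = [f"Filler paragraph {i} about unrelated topics." for i in range(100)]
    needle = "Yuki lives next to the Semper Opera House."
    question = "Which character has been to Dresden?"  # answerable only via latent association
    print(build_niah_prompt(haystack, needle, question, position=50)[:120])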
Authors: Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Abstract: Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task MQCIC and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to improve performance in the MQCIC further. The dataset and code is available in this repository https://github.com/YuY-2001/C-MQCIC.
Authors: Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Abstract: Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.
Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Abstract: Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
Authors: Seunghwan Bang, Hwanjun Song
Abstract: The rapid advancement of Large Language Models (LLMs) has opened new opportunities in recommender systems by enabling zero-shot recommendation without conventional training. Despite their potential, most existing works rely solely on users' purchase histories, leaving significant room for improvement by incorporating user-generated textual data, such as reviews and product descriptions. Addressing this gap, we propose PURE, a novel LLM-based recommendation framework that builds and maintains evolving user profiles by systematically extracting and summarizing key information from user reviews. PURE consists of three core components: a Review Extractor for identifying user preferences and key product features, a Profile Updater for refining and updating user profiles, and a Recommender for generating personalized recommendations using the most current profile. To evaluate PURE, we introduce a continuous sequential recommendation task that reflects real-world scenarios by adding reviews over time and updating predictions incrementally. Our experimental results on Amazon datasets demonstrate that PURE outperforms existing LLM-based methods, effectively leveraging long-term user information while managing token limitations.
Authors: Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Abstract: Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03\%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.
Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Abstract: Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
Authors: Aarush Sinha
Abstract: Integrating powerful but computationally expensive Pre-trained Language Models (PLMs) with Graph Neural Networks (GNNs) is a key challenge, especially on text-rich heterophilic graphs. We propose the Graph Masked Language Model (GMLM), a framework designed for the efficient and effective fusion of graph structure and text semantics. GMLM employs a two-stage process: first, a contrastive pre-training stage with a novel soft masking technique builds a robust multi-scale GNN; second, an end-to-end fine-tuning stage uses a dynamic active node selection strategy for scalability and a bi-directional cross-attention module for deep fusion. Experiments on five heterophilic benchmarks show GMLM achieves state-of-the-art results on four, significantly outperforming prior GNN and large LLM-based methods. For instance, it improves accuracy on the Texas dataset by over 8\% and on Wisconsin by nearly 5\%. Our work demonstrates that a sophisticated, deeply-integrated architecture can be more effective and efficient than larger, general-purpose models for text-rich graph representation learning.
Authors: Hongyu Chen, Seraphina Goldfarb-Tarrant
Abstract: Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98\%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
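A minimal sketch (not the paper's implementation) of the jury-style aggregation discussed above: verdicts from several judge models are combined by majority vote. The judge functions below are hypothetical stubs standing in for calls to different LLM judges.

```python
# Illustrative sketch: aggregate safety verdicts from several judge models by
# majority vote. Each judge function is a stub for an LLM judge call that
# returns "A" or "B" for whichever of two responses it considers safer.
from collections import Counter
from typing import Callable, List


def judge_small(resp_a: str, resp_b: str) -> str:   # stub judge 1
    return "A" if len(resp_a) <= len(resp_b) else "B"


def judge_medium(resp_a: str, resp_b: str) -> str:  # stub judge 2
    return "A" if "sorry" in resp_a.lower() else "B"


def judge_large(resp_a: str, resp_b: str) -> str:   # stub judge 3
    return "B"


JUDGE_FNS: List[Callable[[str, str], str]] = [judge_small, judge_medium, judge_large]


def jury_verdict(resp_a: str, resp_b: str) -> str:
    """Majority vote over the individual judge verdicts."""
    votes = Counter(judge(resp_a, resp_b) for judge in JUDGE_FNS)
    return votes.most_common(1)[0][0]


print(jury_verdict("I'm sorry, I can't help with that.", "Sure, here is how..."))
```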
Authors: Vidya Srinivas, Xuhai Xu, Xin Liu, Kumar Ayush, Isaac Galatzer-Levy, Shwetak Patel, Daniel McDuff, Tim Althoff
Abstract: While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
Authors: Jimmy Wang, Thomas Zollo, Richard Zemel, Hongseok Namkoong
Abstract: Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.
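A minimal sketch of the question-selection idea described above, under simplifying assumptions: a stubbed predictive model supplies distributions over the latent target and over answers, and the next question is the one with the largest expected reduction in predictive entropy. The function names (`predict_target`, `predict_answer`) are illustrative, not the authors' API; in the paper a meta-learned language model plays these roles via forward simulation.

```python
# Illustrative sketch: pick the next question by expected information gain,
# simulating possible answers with a stubbed predictive model.
import math
from typing import Dict, List, Tuple


def entropy(dist: Dict[str, float]) -> float:
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def predict_target(history: List[Tuple[str, str]]) -> Dict[str, float]:
    """Stub: belief over the latent target given the question/answer history."""
    p = min(0.5 + 0.2 * sum(1 for _, a in history if a == "yes"), 0.95)
    return {"target_A": p, "target_B": 1.0 - p}


def predict_answer(history: List[Tuple[str, str]], question: str) -> Dict[str, float]:
    """Stub: predictive distribution over answers to a candidate question."""
    p_yes = 0.3 + 0.1 * (len(question) % 4)
    return {"yes": p_yes, "no": 1.0 - p_yes}


def expected_information_gain(history: List[Tuple[str, str]], question: str) -> float:
    prior = entropy(predict_target(history))
    expected_posterior = sum(
        p_ans * entropy(predict_target(history + [(question, answer)]))
        for answer, p_ans in predict_answer(history, question).items()
    )
    return prior - expected_posterior


history: List[Tuple[str, str]] = []
candidates = ["Is it alive?", "Is it bigger than a car?", "Is it man-made?"]
best = max(candidates, key=lambda q: expected_information_gain(history, q))
print("most informative next question:", best)
```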
Authors: Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis
Abstract: Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
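A minimal sketch, under stated assumptions, of the sense-dictionary construction described above: contextual embeddings of a token (assumed to be pre-extracted from an LLM) are clustered, and the cluster centers serve as sense embeddings that a new contextual embedding can be mapped to. The helper names are illustrative, not the released code's API.

```python
# Illustrative sketch: build a sense dictionary by clustering contextual
# embeddings of a token and using cluster centers as sense embeddings.
# Assumes embeddings were already extracted from an LLM; k is chosen per token.
import numpy as np
from sklearn.cluster import KMeans


def build_sense_dictionary(token_embeddings: np.ndarray, num_senses: int) -> np.ndarray:
    """Cluster contextual embeddings and return cluster centers as sense embeddings."""
    kmeans = KMeans(n_clusters=num_senses, n_init=10, random_state=0)
    kmeans.fit(token_embeddings)
    return kmeans.cluster_centers_


def lookup_sense(contextual_embedding: np.ndarray, sense_embeddings: np.ndarray) -> int:
    """Map a new contextual embedding to the index of its nearest sense embedding."""
    distances = np.linalg.norm(sense_embeddings - contextual_embedding, axis=1)
    return int(np.argmin(distances))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for contextual embeddings of one token across many contexts.
    embeddings = rng.normal(size=(500, 64))
    senses = build_sense_dictionary(embeddings, num_senses=4)
    print("sense id:", lookup_sense(embeddings[0], senses))
```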
Authors: Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, Zeyd Boukhers
Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including competing objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the fine-tuning to improve efficiency and flexibility. Our method is the first to aggregate the hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text classification models to score the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.
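A minimal sketch of a hierarchical grid search over aggregation weights in the spirit of EMORL: a coarse pass over the weight simplex is followed by a finer pass around the best coarse point. The scalar `objective_scores` stub stands in for evaluating the aggregated model on multiple objectives, and EMORL's hidden-state aggregation is simplified to a plain weight vector here.

```python
# Illustrative sketch: two-level (coarse-then-fine) grid search over weights
# on the probability simplex for aggregating three objective-specific models.
import itertools
import numpy as np


def objective_scores(weights: np.ndarray) -> float:
    """Stub combined reward for a weight vector (placeholder for RL evaluation)."""
    targets = np.array([0.5, 0.3, 0.2])  # pretend optimum, for illustration only
    return -float(np.sum((weights - targets) ** 2))


def simplex_grid(step: float, center=None, radius=None):
    """Yield weight vectors on the 3-simplex, optionally restricted near `center`."""
    vals = np.arange(0.0, 1.0 + 1e-9, step)
    for combo in itertools.product(vals, repeat=3):
        w = np.array(combo)
        if abs(w.sum() - 1.0) > 1e-9:
            continue
        if center is not None and np.max(np.abs(w - center)) > radius + 1e-9:
            continue
        yield w


def hierarchical_search() -> np.ndarray:
    best = max(simplex_grid(0.2), key=objective_scores)                             # coarse pass
    best = max(simplex_grid(0.05, center=best, radius=0.2), key=objective_scores)   # refinement
    return best


print("best aggregation weights:", hierarchical_search())
```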
Authors: Isik Baran Sandan, Tu Anh Dinh, Jan Niehues
Abstract: Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that Knockout Assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
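A minimal sketch of a knockout-style pairwise assessment loop as described above; `judge_pair` is a stub standing in for an LLM judge prompt, and bracket handling (byes, seeding, repeated rounds) is simplified relative to the paper's method.

```python
# Illustrative sketch: rank candidate answers with a single-elimination
# knockout tournament of pairwise judgments.
import random
from typing import Callable, List


def judge_pair(answer_a: str, answer_b: str) -> str:
    """Stub judge: prefers the longer answer (placeholder for an LLM call)."""
    return answer_a if len(answer_a) >= len(answer_b) else answer_b


def knockout_rank(candidates: List[str], judge: Callable[[str, str], str]) -> List[str]:
    """Order candidates by the round in which they were eliminated, winner first."""
    current = list(candidates)
    random.shuffle(current)
    eliminated_by_round: List[List[str]] = []
    while len(current) > 1:
        winners, losers = [], []
        for i in range(0, len(current) - 1, 2):
            winner = judge(current[i], current[i + 1])
            losers.append(current[i + 1] if winner == current[i] else current[i])
            winners.append(winner)
        if len(current) % 2 == 1:            # odd candidate out gets a bye
            winners.append(current[-1])
        eliminated_by_round.append(losers)
        current = winners
    ranking = current                         # the overall winner
    for losers in reversed(eliminated_by_round):
        ranking.extend(losers)
    return ranking


if __name__ == "__main__":
    answers = ["tiny", "a somewhat longer answer", "medium answer", "short one"]
    print(knockout_rank(answers, judge_pair))
```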
Authors: Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang
Abstract: Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural languages. With their rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent content, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc.; 2) OOD Robustness: dealing with unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc.; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.
URLs: https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers
Authors: Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our codes and data are available at https://github.com/XinXU-USTC/DoubleChecker.
Authors: Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Abstract: Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
Authors: WonJune Jang
Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks.
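A minimal sketch of question-aware column pruning in the spirit of ATF: columns are scored by combining a sparse lexical-overlap score with a dense cosine similarity, and the top-k columns are kept before the table is handed to a TableQA model. TF-IDF vectors stand in for a neural encoder here, and the function names are illustrative rather than the framework's actual interface.

```python
# Illustrative sketch: prune table columns by a combined sparse (token overlap)
# and dense (TF-IDF cosine) relevance score with respect to the question.
from typing import Dict, List
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def prune_columns(table: Dict[str, List[str]], question: str, keep_k: int) -> Dict[str, List[str]]:
    col_texts = [f"{name} {' '.join(values)}" for name, values in table.items()]
    vectorizer = TfidfVectorizer().fit(col_texts + [question])
    col_vecs = vectorizer.transform(col_texts).toarray()
    q_vec = vectorizer.transform([question]).toarray()[0]

    q_tokens = set(question.lower().split())
    scores = []
    for text, vec in zip(col_texts, col_vecs):
        sparse = len(q_tokens & set(text.lower().split())) / max(len(q_tokens), 1)
        denom = float(np.linalg.norm(vec) * np.linalg.norm(q_vec)) or 1.0
        dense = float(vec @ q_vec) / denom
        scores.append(sparse + dense)

    keep = sorted(np.argsort(scores)[::-1][:keep_k])   # top-k, original column order
    names = list(table.keys())
    return {names[i]: table[names[i]] for i in keep}


table = {
    "city": ["Paris", "Rome"],
    "population": ["2.1M", "2.8M"],
    "founded": ["3rd c. BC", "753 BC"],
}
print(prune_columns(table, "What is the population of Rome?", keep_k=2))
```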
Authors: Germans Savcisens, Tina Eliassi-Rad
Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.
Authors: Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofst\"atter
Abstract: Language models can distinguish between testing and deployment phases -- a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support black-box methods in safety audits, especially for future models more competent at evaluation awareness and deception.
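A minimal sketch of the linear-probing setup described above, assuming hidden activations for evaluation-style and deployment-style prompts have already been extracted from a fixed model layer; the probe here is a logistic regression, a common choice rather than the authors' exact configuration, and the data below is synthetic.

```python
# Illustrative sketch: fit a linear probe that separates activations of
# evaluation-style prompts from deployment-style prompts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins for layer activations: label 1 = evaluation prompt, 0 = deployment prompt.
eval_acts = rng.normal(loc=0.5, size=(200, 128))
deploy_acts = rng.normal(loc=-0.5, size=(200, 128))
X = np.vstack([eval_acts, deploy_acts])
y = np.array([1] * 200 + [0] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```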
Authors: Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
Abstract: The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy-preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the evaluated scenario. On top of the proposed privacy-preserving approach, we also experiment with applying an existing approach that reduces gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PBa-LLMs). Note that the proposed PBa-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.
Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
Authors: Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Duy-Dinh Le
Abstract: In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, \^O \u{A}n Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the \^O \u{A}n Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.
Authors: Eva Vanmassenhove
Abstract: Multilingual Large Language Models (LLMs) have considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
Authors: Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T. Wang, Jian Du
Abstract: Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.
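A minimal sketch of token-level Shapley attribution via Monte Carlo permutation sampling, one standard way to estimate Shapley values; the value function below is a deliberately simple stub standing in for the KNN-augmented scoring the paper describes, and the names are illustrative rather than the released method's API.

```python
# Illustrative sketch: estimate per-token Shapley values with Monte Carlo
# permutation sampling. value_fn is a stub; TokenShapley pairs Shapley values
# with KNN-based retrieval over a precomputed datastore, which is omitted here.
import random
from typing import Callable, List, Sequence


def value_fn(kept_tokens: Sequence[str]) -> float:
    """Stub utility: total characters of the kept tokens (placeholder scorer)."""
    return float(sum(len(t) for t in kept_tokens))


def shapley_values(tokens: List[str], value: Callable[[Sequence[str]], float],
                   num_permutations: int = 200, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    contributions = [0.0] * len(tokens)
    indices = list(range(len(tokens)))
    for _ in range(num_permutations):
        rng.shuffle(indices)
        kept: List[str] = []
        prev_value = value(kept)
        for idx in indices:
            kept.append(tokens[idx])
            new_value = value(kept)
            contributions[idx] += new_value - prev_value   # marginal contribution
            prev_value = new_value
    return [c / num_permutations for c in contributions]


if __name__ == "__main__":
    toks = ["In", "1969", "Apollo", "11", "landed"]
    for tok, phi in zip(toks, shapley_values(toks, value_fn)):
        print(f"{tok}: {phi:.2f}")
```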
Authors: Ashima Suvarna, Christina Chance, Karolina Naranjo, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
Abstract: Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. The data, models and code are available at https://github.com/asuvarna31/modelcitizens.
Authors: Zongqian Li, Yixuan Su, Nigel Collier
Abstract: This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.
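A minimal sketch of the core prompt-tuning idea the survey covers: a small matrix of trainable continuous prompt vectors is prepended to the input embeddings while the backbone model stays frozen. The tiny Transformer encoder below is a stand-in for a real pretrained model, so all shapes and names are illustrative assumptions.

```python
# Illustrative sketch of prompt tuning: learn continuous prompt embeddings
# prepended to the input while the backbone model is frozen. The "backbone"
# here is a toy frozen encoder standing in for a pretrained LM.
import torch
import torch.nn as nn

hidden, prompt_len, vocab, num_classes = 64, 10, 1000, 2

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
embedding = nn.Embedding(vocab, hidden)
for p in list(backbone.parameters()) + list(embedding.parameters()):
    p.requires_grad = False  # keep the "pretrained" weights frozen

soft_prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)   # trainable prompt
classifier = nn.Linear(hidden, num_classes)                          # trainable head
optimizer = torch.optim.Adam([soft_prompt] + list(classifier.parameters()), lr=1e-3)


def forward(input_ids: torch.Tensor) -> torch.Tensor:
    tok_emb = embedding(input_ids)                                   # (batch, seq, hidden)
    prompts = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    states = backbone(torch.cat([prompts, tok_emb], dim=1))          # prepend prompts
    return classifier(states.mean(dim=1))


# One toy optimization step on random data.
ids = torch.randint(0, vocab, (8, 20))
labels = torch.randint(0, num_classes, (8,))
loss = nn.functional.cross_entropy(forward(ids), labels)
loss.backward()
optimizer.step()
print("loss:", float(loss))
```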
Authors: Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continue pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
Authors: Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu
Abstract: Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values and further raise the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or RLVR) frameworks commonly face challenges such as inference bottlenecks and complexity barriers, restricting their accessibility for newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22x to 1.68x across different model sizes compared to state-of-the-art frameworks, while requiring significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
Authors: Haocheng Dai, Sarang Joshi
Abstract: Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more effective to refine the skewed perceptions in VLCMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our code can be found here.
Authors: Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu
Abstract: Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. However, LLMs are prone to generating hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, a similar hallucination phenomenon can occur in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adoption of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark, the CodeMirage dataset, for code hallucinations. The benchmark contains 1,137 GPT-3.5-generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose a methodology for code hallucination detection and experiment with open-source LLMs such as CodeLLaMA as well as OpenAI's GPT-3.5 and GPT-4 models using a one-shot prompt. We find that GPT-4 performs the best on the HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on the MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.
Authors: Shuai Zhao, Leilei Gan, Zhongliang Guo, Xiaobao Wu, Yanhao Jia, Luwei Xiao, Cong-Duy Nguyen, Luu Anh Tuan
Abstract: Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning (FPFT). However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from the weak-to-strong based on Feature Alignment-enhanced Knowledge Distillation (FAKD). Specifically, we poison small-scale language models through FPFT to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through FAKD, which employs PEFT. Theoretical analysis reveals that FAKD has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of FAKD on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.
Authors: Yilun Hao, Yang Zhang, Chuchu Fan
Abstract: While large language models (LLMs) have recently demonstrated strong potential in solving planning problems, there is a trade-off between flexibility and complexity. LLMs, as zero-shot planners themselves, are still not capable of directly generating valid plans for complex planning problems such as multi-constraint or long-horizon tasks. On the other hand, many frameworks aiming to solve complex planning problems often rely on task-specific preparatory efforts, such as task-specific in-context examples and pre-defined critics/verifiers, which limits their cross-task generalization capability. In this paper, we tackle these challenges by observing that the core of many planning problems lies in optimization problems: searching for the optimal solution (best plan) with goals subject to constraints (preconditions and effects of decisions). With LLMs' commonsense, reasoning, and programming capabilities, this opens up the possibilities of a universal LLM-based approach to planning problems. Inspired by this observation, we propose LLMFP, a general-purpose framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch, with no task-specific examples needed. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLMFP achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet, significantly outperforming the best baseline (direct planning with OpenAI o1-preview) with 37.6% and 40.7% improvements. We also validate components of LLMFP with ablation experiments and analyzed the underlying success and failure reasons. Project page: https://sites.google.com/view/llmfp.
Authors: Manuel Cebrian, Andres Abeliuk, Jan Arne Telle
Abstract: Attributing outputs from Large Language Models (LLMs) in adversarial settings-such as cyberattacks and disinformation campaigns-presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM's set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold's classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin's tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions-each open-source model fine-tuned on at most one new dataset-the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.
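A worked illustration of the combinatorial-growth claim above: with the doubling times reported in the abstract (0.5 and 0.28 years), the hypothesis space of candidate model origins expands dramatically within a few years. The starting count of candidate models is a hypothetical value chosen only for illustration.

```python
# Illustrative arithmetic: growth of the candidate-model hypothesis space
# under the doubling times reported above. The initial count is hypothetical.
initial_models = 1_000       # assumed starting number of candidate models
years = 3.0

for doubling_time in (0.5, 0.28):
    count = initial_models * 2 ** (years / doubling_time)
    print(f"doubling every {doubling_time} yr -> ~{count:,.0f} candidates after {years} yr")
```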
Authors: Shijie Han, Jingshu Zhang, Yiqing Shen, Kaiyuan Yan, Hongguang Li
Abstract: Current financial large language models (FinLLMs) struggle with two critical limitations: the absence of objective evaluation metrics to assess the quality of stock analysis reports and a lack of depth in stock analysis, which impedes their ability to generate professional-grade insights. To address these challenges, this paper introduces FinSphere, a stock analysis agent, along with three major contributions: (1) AnalyScore, a systematic evaluation framework for assessing stock analysis quality, (2) Stocksis, a dataset curated by industry experts to enhance LLMs' stock analysis capabilities, and (3) FinSphere, an AI agent that can generate high-quality stock analysis reports in response to user queries. Experiments demonstrate that FinSphere achieves superior performance compared to both general and domain-specific LLMs, as well as existing agent-based systems, even when they are enhanced with real-time data access and few-shot guidance. The integrated framework, which combines real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields substantial improvements in both analytical quality and practical applicability for real-world stock analysis.
Authors: Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Abstract: Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
Authors: Xinyi Wang, Shawn Tan, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
Authors: Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini
Abstract: Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.