Authors: Chenxiao Yu, Zhaotian Weng, Zheng Li, Xiyang Hu, Yue Zhao
Abstract: Can Large Language Models (LLMs) accurately predict election outcomes? While LLMs have demonstrated impressive performance in various domains, including healthcare, legal analysis, and creative tasks, their ability to forecast elections remains unknown. Election prediction poses unique challenges, such as limited voter-level data, rapidly changing political landscapes, and the need to model complex human behavior. To address these challenges, we introduce a multi-step reasoning framework designed for political analysis. Our approach is validated on real-world data from the American National Election Studies (ANES) 2016 and 2020, as well as synthetic personas generated by a leading machine-learning framework, offering scalable datasets for voter behavior modeling. To capture temporal dynamics, we incorporate candidates' policy positions and biographical details, ensuring that the model adapts to evolving political contexts. Drawing on Chain-of-Thought prompting, our multi-step reasoning pipeline systematically integrates demographic, ideological, and time-dependent factors, enhancing the model's predictive power. Additionally, we apply our framework to predict the outcome of the 2024 U.S. presidential election in advance, demonstrating the adaptability of LLMs to unseen political data.
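A minimal sketch of what such a multi-step pipeline can look like, assuming a generic `query_llm` client and illustrative step prompts (all names here are our assumptions, not the authors' implementation):

```python
# Hypothetical multi-step (chain-of-thought style) prediction pipeline.
# `query_llm`, the persona fields, and the prompts are illustrative only.

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an API client); returns text."""
    raise NotImplementedError("plug in a real LLM client here")

def predict_vote(persona: dict, candidates: dict) -> str:
    # Step 1: reason over demographics.
    demo = query_llm(
        f"Summarize how these demographics typically shape voting: {persona['demographics']}"
    )
    # Step 2: reason over ideology, conditioned on step 1.
    ideo = query_llm(
        f"Given: {demo}\nNow factor in this ideology profile: {persona['ideology']}"
    )
    # Step 3: inject time-dependent context (policy positions, biographies).
    final = query_llm(
        f"Given: {ideo}\nCandidates and current policy positions: {candidates}\n"
        "Predict which candidate this voter chooses. Answer with one name."
    )
    return final.strip()
```

Each step conditions on the previous step's output, so demographic, ideological, and time-dependent factors enter the prediction in sequence, mirroring the pipeline the abstract describes.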
Authors: Yudi Zhang, Pei Xiao, Lu Wang, Chaoyun Zhang, Meng Fang, Yali Du, Yevgeniy Puzyrev, Randolph Yao, Si Qin, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Abstract: In-context learning (ICL) and Retrieval-Augmented Generation (RAG) have gained attention for their ability to enhance LLMs' reasoning by incorporating external knowledge, but both suffer from limited context window size, leading to insufficient information injection. To this end, we propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules, which are injected into LLMs to boost their reasoning capabilities. Our method begins by formulating the search process around LLMs' commonsense, where LLMs automatically define head and body predicates. Then, RuAG applies Monte Carlo Tree Search (MCTS) to navigate the combinatorial search space and efficiently discover logic rules from data. The resulting logic rules are translated into natural language, allowing targeted knowledge injection and seamless integration into LLM prompts for downstream reasoning tasks. We evaluate our framework on public and private industrial tasks, spanning natural language processing, time series, and decision-making, demonstrating its effectiveness in enhancing LLMs' capabilities across diverse tasks.
Authors: Benjamin Rombaut, Sogol Masoumzadeh, Kirill Vasilevski, Dayi Lin, Ahmed E. Hassan
Abstract: As foundation models (FMs) play an increasingly prominent role in complex software systems, such as FM-powered agentic software (i.e., Agentware), they introduce significant challenges for developers regarding observability. Unlike traditional software, agents operate autonomously, using extensive data and opaque implicit reasoning, making it difficult to observe and understand their behavior during runtime, especially when they take unexpected actions or encounter errors. In this paper, we highlight the limitations of traditional operational observability in the context of FM-powered software, and introduce cognitive observability as a new type of required observability that has emerged for such innovative systems. We then propose a novel framework that provides cognitive observability into the implicit reasoning processes of agents (a.k.a. reasoning observability), and demonstrate the effectiveness of our framework in boosting the debuggability of Agentware and, in turn, its capabilities, through a case study on AutoCodeRover, a cutting-edge Agentware for autonomous program improvement.
Authors: Caleb Bradshaw, Caelen Miller, Sean Warnick
Abstract: This paper introduces distribution-based prediction, a novel approach to using Large Language Models (LLMs) as predictive tools by interpreting output token probabilities as distributions representing the models' learned representation of the world. This distribution-based nature offers an alternative perspective for analyzing algorithmic fidelity, complementing the approach used in silicon sampling. We demonstrate the use of distribution-based prediction in the context of the recent United States presidential election, showing that this method can be used to determine task-specific bias, prompt noise, and algorithmic fidelity. This approach has significant implications for assessing the reliability and increasing the transparency of LLM-based predictions across various domains.
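As a concrete illustration, here is a minimal sketch of reading an outcome distribution from next-token probabilities with a Hugging Face causal LM; the model name, prompt, and candidate strings are illustrative assumptions, not the paper's setup:

```python
# Restrict the next-token distribution to a fixed set of answer tokens and
# renormalize; the result is the model's "learned distribution" over outcomes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "In the election, this voter supports"
candidates = [" Democrat", " Republican"]  # illustrative outcome strings

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

# Take the first token of each candidate and renormalize over them.
ids = [tok.encode(c)[0] for c in candidates]
dist = probs[ids] / probs[ids].sum()
print(dict(zip(candidates, dist.tolist())))
```

Comparing such distributions across rephrased prompts is one way to expose the prompt noise and task-specific bias the abstract mentions.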
Authors: Qiuyu Liang, Weihua Wang, Feilong Bao, Guanglai Gao
Abstract: Hyperbolic rotation is commonly used to effectively model knowledge graphs and their inherent hierarchies. However, existing hyperbolic rotation models rely on logarithmic and exponential mappings for feature transformation. These models only project data features into hyperbolic space for rotation, limiting their ability to fully exploit the hyperbolic space. To address this problem, we propose a novel fully hyperbolic model designed for knowledge graph embedding. Instead of feature mappings, we define the model directly in hyperbolic space with the Lorentz model. Our model treats each relation in a knowledge graph as a Lorentz rotation from the head entity to the tail entity, and adopts the Lorentzian distance as the scoring function for measuring the plausibility of triplets. Extensive results on standard knowledge graph completion benchmarks demonstrate that our model achieves competitive results with fewer parameters. In addition, our model achieves state-of-the-art performance on the CoDEx-S and CoDEx-M datasets, which are more diverse and challenging than earlier benchmarks. Our code is available at https://github.com/llqy123/FHRE.
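For concreteness, a hedged sketch of the standard Lorentz-model quantities such a scoring function typically builds on (curvature -1; the paper's exact parameterization may differ):

```latex
% Lorentz inner product, squared Lorentzian distance, and a rotation-based
% triplet score; notation is ours, given as a sketch of the standard setup.
\langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}} = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i,
\qquad
d_{\mathcal{L}}^{2}(\mathbf{x}, \mathbf{y}) = -2 - 2\,\langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}},
\qquad
f(h, r, t) = -\, d_{\mathcal{L}}^{2}\big(\mathrm{Rot}_{r}(\mathbf{h}),\, \mathbf{t}\big)
```

Here Rot_r is the relation-specific Lorentz rotation, and entity embeddings lie on the hyperboloid with Lorentz norm -1, so no logarithmic or exponential maps are needed.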
Authors: Yu-Ang Cheng, Ivan Felipe Rodriguez, Sixuan Chen, Kohitij Kar, Takeo Watanabe, Thomas Serre
Abstract: Current neural network models of primate vision focus on replicating overall levels of behavioral accuracy, often neglecting the rich, dynamic nature of perceptual decisions. Here, we introduce a novel computational framework to model the dynamics of human behavioral choices by learning to align the temporal dynamics of a recurrent neural network (RNN) to human reaction times (RTs). We describe an approximation that allows us to constrain the number of time steps an RNN takes to solve a task with human RTs. The approach is extensively evaluated against various psychophysics experiments. We also show that the approximation can be used to optimize an "ideal-observer" RNN model to achieve an optimal tradeoff between speed and accuracy without human data. The resulting model is found to account well for human RT data. Finally, we use the approximation to train a deep learning implementation of the popular Wong-Wang decision-making model. The model is integrated with a convolutional neural network (CNN) model of visual processing and evaluated using both artificial and natural image stimuli. Overall, we present a novel framework that helps align current vision models with human behavior, bringing us closer to an integrated model of human vision.
Authors: Parand A. Alamdari, Soroush Ebadian, Ariel D. Procaccia
Abstract: We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods--including approval voting, Borda count, the proportional veto core, and quantile fairness--can be practically applied to policy aggregation.
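To make the key object concrete, here is a standard formulation of the discounted state-action occupancy polytope the abstract references (our notation, given as a sketch; the paper's may differ):

```latex
% d_0 is the initial-state distribution, gamma the discount factor, and
% P the transition kernel. Each policy pi induces a point mu_pi in O.
\mathcal{O} \;=\; \Big\{\, \mu \in \mathbb{R}_{\ge 0}^{S \times A} \;:\;
\sum_{a} \mu(s,a) \;=\; (1-\gamma)\, d_0(s)
\;+\; \gamma \sum_{s',\,a'} P(s \mid s', a')\, \mu(s', a')
\;\;\; \forall s \,\Big\}
```

Agent i then prefers policy pi to pi' exactly when the inner product of mu_pi with its reward r_i exceeds that of mu_pi', which is what lets ordinal voting rules be re-read as statements about volumes of subsets of this polytope.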
Authors: Ning Ding, Shang Qu, Linhai Xie, Yifei Li, Zaoqu Liu, Kaiyan Zhang, Yibai Xiong, Yuxin Zuo, Zhangren Chen, Ermo Hua, Xingtai Lv, Youbang Sun, Yang Li, Dong Li, Fuchu He, Bowen Zhou
Abstract: With the development of artificial intelligence, its contribution to science is evolving from simulating complex problems to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper, we present PROTEUS, a fully automated system for scientific discovery from raw proteomics data. PROTEUS uses large language models (LLMs) to perform hierarchical planning, execute specialized bioinformatics tools, and iteratively refine analysis workflows to generate high-quality scientific hypotheses. The system takes proteomics datasets as input and produces a comprehensive set of research objectives, analysis results, and novel biological hypotheses without human intervention. We evaluated PROTEUS on 12 proteomics datasets collected from various biological samples (e.g., immune cells, tumors) and different sample types (single-cell and bulk), generating 191 scientific hypotheses. These were assessed using both automatic LLM-based scoring on 5 metrics and detailed reviews from human experts. Results demonstrate that PROTEUS consistently produces reliable, logically coherent results that align well with existing literature while also proposing novel, evaluable hypotheses. The system's flexible architecture facilitates seamless integration of diverse analysis tools and adaptation to different proteomics data types. By automating complex proteomics analysis workflows and hypothesis generation, PROTEUS has the potential to considerably accelerate the pace of scientific discovery in proteomics research, enabling researchers to efficiently explore large-scale datasets and uncover biological insights.
Authors: Daan Schouten, Giulia Nicoletti, Bas Dille, Catherine Chia, Pierpaolo Vendittelli, Megan Schuurmans, Geert Litjens, Nadieh Khalili
Abstract: Recent technological advances in healthcare have led to unprecedented growth in patient data quantity and diversity. While artificial intelligence (AI) models have shown promising results in analyzing individual data modalities, there is increasing recognition that models integrating multiple complementary data sources, so-called multimodal AI, could enhance clinical decision-making. This scoping review examines the landscape of deep learning-based multimodal AI applications across the medical domain, analyzing 432 papers published between 2018 and 2024. We provide an extensive overview of multimodal AI development across different medical disciplines, examining various architectural approaches, fusion strategies, and common application areas. Our analysis reveals that multimodal AI models consistently outperform their unimodal counterparts, with an average improvement of 6.2 percentage points in AUC. However, several challenges persist, including cross-departmental coordination, heterogeneous data characteristics, and incomplete datasets. We critically assess the technical and practical challenges in developing multimodal AI systems and discuss potential strategies for their clinical implementation, including a brief overview of commercially available multimodal AI models for clinical decision-making. Additionally, we identify key factors driving multimodal AI development and propose recommendations to accelerate the field's maturation. This review provides researchers and clinicians with a thorough understanding of the current state, challenges, and future directions of multimodal AI in medicine.
Authors: Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, YueFeng Chen, Chongwen Wang, Jialing Tao, Hang Su, Jun Zhu, Hui Xue
Abstract: Large Language Models (LLMs) demonstrate outstanding performance in their reservoir of knowledge and understanding capabilities, but they have also been shown to be prone to generating illegal or unethical responses when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous works mainly focus on jailbreaks in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Some studies have increasingly concentrated on the risks associated with jailbreaks in multi-round dialogues. These efforts typically involve the use of manually crafted templates or prompt engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak performance is limited. To solve this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values posed by LLMs. We propose a risk decomposition strategy that distributes risks across multiple rounds of queries and utilizes psychological strategies to enhance attack strength. Extensive experiments show that our proposed method surpasses other attack methods and achieves a state-of-the-art attack success rate. We will release the corresponding code and dataset for future research.
Authors: Zhirui Deng, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen, Ruibin Xiong, Mang Wang, Weipeng Chen
Abstract: The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward reinforcement learning strategies to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
Authors: Tyler Clark, Mark Towers, Christine Evers, Jonathon Hare
Abstract: Rainbow Deep Q-Network (DQN) demonstrated that combining multiple independent enhancements could significantly boost a reinforcement learning (RL) agent's performance. In this paper, we present "Beyond The Rainbow" (BTR), a novel algorithm that integrates six improvements from across the RL literature into Rainbow DQN, establishing a new state-of-the-art for RL using a desktop PC, with a human-normalized interquartile mean (IQM) of 7.4 on Atari-60. Beyond Atari, we demonstrate BTR's capability to handle complex 3D games, successfully training agents to play Super Mario Galaxy, Mario Kart, and Mortal Kombat with minimal algorithmic changes. Because BTR is designed with computational efficiency in mind, agents can be trained using a desktop PC on 200 million Atari frames within 12 hours. Additionally, we conduct detailed ablation studies of each component, analyzing their performance and impact using numerous measures.
Authors: Zerui Cheng, Edoardo Contente, Ben Finch, Oleg Golev, Jonathan Hayase, Andrew Miller, Niusha Moshrefi, Anshul Nasery, Sandeep Nailwal, Sewoong Oh, Himanshu Tyagi, Pramod Viswanath
Abstract: Artificial Intelligence (AI) has steadily improved across a wide range of tasks. However, the development and deployment of AI are almost entirely controlled by a few powerful organizations that are racing to create Artificial General Intelligence (AGI). These centralized entities make decisions with little public oversight, shaping the future of humanity, often with unforeseen consequences. In this paper, we propose OML, which stands for Open, Monetizable, and Loyal AI, an approach designed to democratize AI development. OML is realized through an interdisciplinary framework spanning AI, blockchain, and cryptography. We present several ideas for constructing OML using technologies such as Trusted Execution Environments (TEE), traditional cryptographic primitives like fully homomorphic encryption and functional encryption, obfuscation, and AI-native solutions rooted in the sample complexity and intrinsic hardness of AI tasks. A key innovation of our work is introducing a new scientific field: AI-native cryptography. Unlike conventional cryptography, which focuses on discrete data and binary security guarantees, AI-native cryptography exploits the continuous nature of AI data representations and their low-dimensional manifolds, focusing on improving approximate performance. One core idea is to transform AI attack methods, such as data poisoning, into security tools. This novel approach serves as a foundation for OML 1.0, which uses model fingerprinting to protect the integrity and ownership of AI models. The spirit of OML is to establish a decentralized, open, and transparent platform for AI development, enabling the community to contribute, monetize, and take ownership of AI models. By decentralizing control and ensuring transparency through blockchain technology, OML prevents the concentration of power and provides accountability in AI development that has not been possible before.
Authors: David Maria Schmidt, Mohammad Fazleh Elahi, Philipp Cimiano
Abstract: In this paper, we examine the impact of lexicalization on Question Answering over Linked Data (QALD). It is well known that one of the key challenges in interpreting natural language questions with respect to SPARQL lies in bridging the lexical gap, that is, mapping the words in the query to the correct vocabulary elements. We argue in this paper that lexicalization, that is, explicit knowledge about the potential interpretations of a word with respect to the given vocabulary, significantly eases the task and increases the performance of QA systems. Towards this goal, we present a compositional QA system that can leverage explicit lexical knowledge in a compositional manner to infer the meaning of a question in terms of a SPARQL query. We show that such a system, given lexical knowledge, has a performance well beyond current QA systems, achieving up to a $35.8\%$ increase in the micro $F_1$ score compared to the best QA system on QALD-9. This shows the importance and potential of including explicit lexical knowledge. In contrast, we show that LLMs have limited abilities to exploit lexical knowledge, with only marginal improvements compared to a version without lexical knowledge. This shows that LLMs have no ability to compositionally interpret a question on the basis of the meaning of its parts, a key feature of compositional approaches. Taken together, our work shows new avenues for QALD research, emphasizing the importance of lexicalization and compositionality.
Authors: Rishav Pramanik, José-Fabian Villa-Vásquez, Marco Pedersoli
Abstract: Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: https://github.com/rishavpramanik/maskedmultiqueryslot
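A small sketch of the test-time merging step described above, matching two independently learned slot sets with Hungarian matching; the shapes and the negative-cosine cost are our assumptions:

```python
# Merge two slot sets by solving a minimum-cost assignment between them
# and averaging matched pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def merge_slot_sets(slots_a: np.ndarray, slots_b: np.ndarray) -> np.ndarray:
    """slots_a, slots_b: (num_slots, dim) arrays from two independent sets."""
    a = slots_a / np.linalg.norm(slots_a, axis=1, keepdims=True)
    b = slots_b / np.linalg.norm(slots_b, axis=1, keepdims=True)
    cost = -a @ b.T                      # high similarity -> low cost
    rows, cols = linear_sum_assignment(cost)
    return (slots_a[rows] + slots_b[cols]) / 2.0  # average matched pairs

merged = merge_slot_sets(np.random.rand(6, 64), np.random.rand(6, 64))
print(merged.shape)  # (6, 64)
```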
Authors: Xiao Hu, Ziqi Chen, Daniel Adu-Ampratwum, Bo Peng, Xia Ning
Abstract: Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy, which initially captures detailed molecule-level information and then models and aggregates intermolecular interactions, ensuring that the impact of molecular fragments of varying sizes on yield is accurately accounted for. Another key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM outperforms existing methods in our experiments, especially for medium- to high-yielding reactions, proving its reliability as a predictor. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and code of log-RRIM are accessible through https://github.com/ninglab/Yield_log_RRIM.
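A minimal PyTorch sketch of the reagent-to-reaction-center cross-attention idea; dimensions and layout are illustrative assumptions rather than log-RRIM's exact architecture:

```python
# Reaction-center tokens (queries) attend to reagent tokens (keys/values),
# so each center aggregates reagent context, mirroring how reagents
# modulate bond breaking and formation.
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

reaction_centers = torch.randn(2, 10, d_model)  # queries: center embeddings
reagents = torch.randn(2, 25, d_model)          # keys/values: reagent atoms

out, attn_weights = cross_attn(reaction_centers, reagents, reagents)
print(out.shape, attn_weights.shape)  # (2, 10, 128) (2, 10, 25)
```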
Authors: Shrihan Agarwal, Aleksandra Ćiprijanović, Brian D. Nord
Abstract: Modeling strong gravitational lenses is computationally expensive for the complex data from modern and next-generation cosmic surveys. Deep learning has emerged as a promising approach for finding lenses and predicting lensing parameters, such as the Einstein radius. Mean-variance Estimators (MVEs) are a common approach for obtaining aleatoric (data) uncertainties from a neural network prediction. However, neural networks have not been demonstrated to perform well on out-of-domain target data -- e.g., when trained on simulated data and applied to real, observational data. In this work, we perform the first study of the efficacy of MVEs in combination with unsupervised domain adaptation (UDA) on strong lensing data. The source domain data is noiseless, and the target domain data has noise mimicking modern cosmology surveys. We find that adding UDA to MVE increases the accuracy on the target data by a factor of about two over an MVE model without UDA. Including UDA also permits much better-calibrated aleatoric uncertainty predictions. Advancements in this approach may enable future applications of MVE models to real observational data.
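For reference, the standard MVE objective the abstract builds on: the network predicts a mean and a log-variance per target (e.g., the Einstein radius) and is trained with a Gaussian negative log-likelihood. The head shape below is an illustrative assumption:

```python
# Mean-variance estimator head and its Gaussian NLL loss.
import torch
import torch.nn as nn

class MVEHead(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, 1)
        self.log_var = nn.Linear(in_dim, 1)  # predict log-variance for stability

    def forward(self, features):
        return self.mu(features), self.log_var(features)

def gaussian_nll(mu, log_var, y):
    # 0.5 * [log sigma^2 + (y - mu)^2 / sigma^2], averaged over the batch
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

head = MVEHead(64)
feats, y = torch.randn(8, 64), torch.randn(8, 1)
mu, log_var = head(feats)
loss = gaussian_nll(mu, log_var, y)
loss.backward()
```

The predicted variance is the aleatoric uncertainty whose calibration the UDA experiments assess.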
Authors: Mikita Balesni, Marius Hobbhahn, David Lindner, Alex Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
Abstract: We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems could pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance. First, developers of frontier AI systems could argue that AI systems are not capable of scheming (Scheming Inability). Second, one could argue that AI systems are not capable of posing harm through scheming (Harm Inability). Third, one could argue that control measures around the AI systems would prevent unacceptable outcomes even if the AI systems intentionally attempted to subvert them (Harm Control). Additionally, we discuss how safety cases might be supported by evidence that an AI system is reasonably aligned with its developers (Alignment). Finally, we point out that many of the assumptions required to make these safety arguments have not been confidently satisfied to date and require making progress on multiple open research problems.
Authors: Nathalie Maria Kirch, Severin Field, Stephen Casper
Abstract: While `jailbreaks' have been central to research on the safety and reliability of LLMs (large language models), the underlying mechanisms behind these attacks are not well understood. Some prior works have used linear methods to analyze jailbreak prompts or model refusal. Here, however, we compare linear and nonlinear methods to study the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success based only on the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts from 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish between successful and unsuccessful jailbreaking prompts with a high degree of accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone.
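Schematically, the linear-vs-nonlinear comparison amounts to fitting two probes on the same latents; in the sketch below, random features stand in for real prompt-token activations:

```python
# Linear (logistic regression) vs. nonlinear (MLP) probes for jailbreak
# success, trained on latent representations of prompt tokens.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 256)      # placeholder prompt-token latents
y = np.random.randint(0, 2, 1000)   # 1 = jailbreak succeeded

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
nonlinear_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                random_state=0).fit(X_tr, y_tr)

print("linear:", linear_probe.score(X_te, y_te))
print("MLP:   ", nonlinear_probe.score(X_te, y_te))
```

A gap between the two scores on held-out attack methods is the kind of evidence the abstract cites for nonlinear, method-specific jailbreak features.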
Authors: Langalibalele Lunga, Suhas Sreehari
Abstract: Machine learning models are prone to adversarial attacks, where inputs can be manipulated to cause misclassifications. While previous research has focused on techniques like Generative Adversarial Networks (GANs), the use of GANs and the Synthetic Minority Oversampling Technique (SMOTE) to attack text and image classification models remains largely unexplored. Our study addresses this gap by training various machine learning models and using GANs and SMOTE to generate additional data points aimed at attacking text classification models. Furthermore, we extend our investigation to face recognition models, training a Convolutional Neural Network (CNN) and subjecting it to adversarial attacks with fast gradient sign perturbations on key features identified by GradCAM, a technique used to highlight key image characteristics CNNs use in classification. Our experiments reveal a significant vulnerability in classification models. Specifically, we observe a 20% decrease in accuracy for the top-performing text classification models post-attack, along with a 30% decrease in facial recognition accuracy. This highlights the susceptibility of these models to manipulation of input data. Adversarial attacks not only compromise security but also undermine the reliability of machine learning systems. By showcasing the impact of adversarial attacks on both text classification and face recognition models, our study underscores the urgent need to develop robust defenses against such vulnerabilities.
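The fast gradient sign perturbation the study applies is the textbook FGSM step; here is a hedged sketch with an optional mask standing in for the GradCAM-highlighted regions (the stand-in model is our assumption):

```python
# Textbook FGSM: perturb the input in the sign of the loss gradient.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier

def fgsm(model, x, y, eps=0.03, mask=None):
    """One FGSM step; `mask` optionally restricts the perturbation to
    regions of interest (e.g., GradCAM-highlighted pixels)."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    step = x.grad.sign()        # direction that maximally increases the loss
    if mask is not None:
        step = step * mask
    return (x + eps * step).clamp(0, 1).detach()

x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)
print((x_adv - x).abs().max())  # perturbation bounded by eps
```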
Authors: Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang
Abstract: Large language models (LLMs) have demonstrated emergent abilities in text generation, question answering, and reasoning, facilitating various tasks and domains. Despite their proficiency in various tasks, LLMs like PaLM 540B and Llama-3.1 405B face limitations due to large parameter sizes and computational demands, often requiring cloud API use, which raises privacy concerns, limits real-time applications on edge devices, and increases fine-tuning costs. Additionally, LLMs often underperform in specialized domains such as healthcare and law due to insufficient domain-specific knowledge, necessitating specialized models. Therefore, Small Language Models (SLMs) are increasingly favored for their low inference latency, cost-effectiveness, efficient development, and easy customization and adaptability. These models are particularly well-suited for resource-limited environments and domain knowledge acquisition, addressing LLMs' challenges and proving ideal for applications that require localized data handling for privacy, minimal inference latency for efficiency, and domain knowledge acquisition through lightweight fine-tuning. The rising demand for SLMs has spurred extensive research and development. However, a comprehensive survey investigating issues related to the definition, acquisition, application, enhancement, and reliability of SLMs remains lacking, prompting us to conduct a detailed survey on these topics. The definition of SLMs varies widely; thus, to standardize, we propose defining SLMs by their capability to perform specialized tasks and suitability for resource-constrained settings, setting boundaries based on the minimal size for emergent abilities and the maximum size sustainable under resource constraints. For other aspects, we provide a taxonomy of relevant models/methods and develop general frameworks for each category to enhance and utilize SLMs effectively.
Authors: Mengmeng Yang, Chi-Hung Chi, Kwok-Yan Lam, Jie Feng, Taolin Guo, Wei Ni
Abstract: Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. In real-world applications like FinTech and Smart Manufacturing, transactional data, often in tabular form, are generated and analyzed for insight generation. However, such datasets typically contain sensitive personal/business information, raising privacy concerns and regulatory risks. Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data, removing direct links to individuals. However, attackers can still infer sensitive information using background knowledge. Differential privacy offers a solution by providing provable and quantifiable privacy protection. Consequently, differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing. This paper provides a comprehensive overview of existing differentially private tabular data synthesis methods, highlighting the unique challenges of each generation model for generating tabular data under differential privacy constraints. We classify the methods into statistical and deep learning-based approaches based on their generation models, discussing them in both centralized and distributed environments. We evaluate and compare those methods within each category, highlighting their strengths and weaknesses in terms of utility, privacy, and computational complexity. Additionally, we present and discuss various evaluation methods for assessing the quality of the synthesized data, identify research gaps in the field and directions for future research.
Authors: Paul Badu Yakubu, Evans Owusu, Lesther Santana, Mohamed Rahouti, Abdellah Chehri, Kaiqi Xiong
Abstract: Denial of Service (DoS) attacks pose a significant threat in the realm of AI systems security, causing substantial financial losses and downtime. However, AI systems' high computational demands, dynamic behavior, and data variability make monitoring and detecting DoS attacks challenging. Nowadays, statistical and machine learning (ML)-based DoS classification and detection approaches utilize a broad range of feature selection mechanisms to select a feature subset from networking traffic datasets. Feature selection is critical in enhancing the overall model performance and attack detection accuracy while reducing the training time. In this paper, we investigate the importance of feature selection in improving ML-based detection of DoS attacks. Specifically, we explore how individual features contribute to the overall composition of DoS traffic datasets, utilizing statistical analysis and feature engineering approaches. Our experimental findings demonstrate the usefulness of the thorough statistical analysis of DoS traffic and feature engineering in understanding the behavior of the attack and identifying the best feature selection for ML-based DoS classification and detection.
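One common instantiation of such feature selection is ranking features by mutual information with the attack label; a sketch on stand-in data (the feature matrix and k are illustrative assumptions):

```python
# Rank flow-level features by mutual information with the DoS label and
# keep the top k for the downstream classifier.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = np.random.rand(500, 40)        # placeholder flow-level features
y = np.random.randint(0, 2, 500)   # 1 = DoS traffic

selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
top_features = np.argsort(selector.scores_)[::-1][:10]
print("top feature indices:", top_features)
X_reduced = selector.transform(X)  # (500, 10) subset fed to the detector
```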
Authors: Dayu Yang, Natawut Monaikul, Amanda Ding, Bozhao Tan, Kishore Mosaliganti, Giri Iyengar
Abstract: In the era of data-driven decision-making, accurate table-level representations and efficient table recommendation systems are becoming increasingly crucial for improving table management, discovery, and analysis. However, existing approaches to tabular data representation often face limitations, primarily due to their focus on cell-level tasks and the lack of high-quality training data. To address these challenges, we first formulate a clear definition of table similarity in the context of data transformation activities within data-driven enterprises. This definition serves as the foundation for synthetic data generation, which requires a well-defined data generation process. Building on this, we propose a novel synthetic data generation pipeline that harnesses the code generation and data manipulation capabilities of Large Language Models (LLMs) to create a large-scale synthetic dataset tailored for table-level representation learning. Through manual validation and performance comparisons on the table recommendation task, we demonstrate that the synthetic data generated by our pipeline aligns with our proposed definition of table similarity and significantly enhances table representations, leading to improved recommendation performance.
Authors: Geng Yu, Jianing Zhu, Jiangchao Yao, Bo Han
Abstract: Out-of-distribution (OOD) detection is crucial for deploying reliable machine learning models in open-world applications. Recent advances in CLIP-based OOD detection have shown promising results via regularizing prompt tuning with OOD features extracted from ID data. However, the irrelevant context mined from ID data can be spurious due to the inaccurate foreground-background decomposition, thus limiting the OOD detection performance. In this work, we propose a novel framework, namely, Self-Calibrated Tuning (SCT), to mitigate this problem for effective OOD detection with only the given few-shot ID data. Specifically, SCT introduces modulating factors respectively on the two components of the original learning objective. It adaptively directs the optimization process between the two tasks during training on data with different prediction uncertainty to calibrate the influence of OOD regularization, which is compatible with many prompt tuning based OOD detection methods. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed SCT. The code is publicly available.
Authors: Mohammadreza Kouchaki, Minglong Zhang, Aly S. Abdalla, Guangchen Lan, Christopher G. Brinton, Vuk Marojevic
Abstract: In the rapidly evolving landscape of 5G technology, safeguarding Radio Frequency (RF) environments against sophisticated intrusions is paramount, especially in dynamic spectrum access and management. This paper presents an enhanced experimental model that integrates a self-attention mechanism with a Recurrent Neural Network (RNN)-based autoencoder for the detection of anomalous spectral activities in 5G networks at the waveform level. Our approach, grounded in time-series analysis, processes in-phase and quadrature (I/Q) samples to identify irregularities that could indicate potential jamming attacks. The model's architecture, augmented with a self-attention layer, extends the capabilities of RNN autoencoders, enabling a more nuanced understanding of temporal dependencies and contextual relationships within the RF spectrum. Utilizing a simulated 5G Radio Access Network (RAN) test-bed constructed with srsRAN 5G and Software Defined Radios (SDRs), we generated a comprehensive stream of data that reflects real-world RF spectrum conditions and attack scenarios. The model is trained to reconstruct standard signal behavior, establishing a normative baseline against which deviations, indicative of security threats, are identified. The proposed architecture is designed to balance detection precision and computational efficiency, with the self-attention-enriched LSTM network optimized for minimal execution latency and power consumption. Conducted on a real-world SDR-based testbed, our results demonstrate the model's improved performance and accuracy in threat detection.
Keywords: self-attention, real-time intrusion detection, RNN autoencoder, Transformer architecture, LSTM, time series anomaly detection, 5G Security, spectrum access security.
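A minimal sketch of an LSTM autoencoder with a self-attention layer scoring I/Q windows by reconstruction error; all sizes are illustrative assumptions, not the deployed model:

```python
# LSTM encoder -> self-attention over time -> LSTM decoder; windows with
# high reconstruction error are flagged as anomalous.
import torch
import torch.nn as nn

class AttnLSTMAutoencoder(nn.Module):
    def __init__(self, in_dim=2, hidden=64):   # in_dim=2 for (I, Q)
        super().__init__()
        self.enc = nn.LSTM(in_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.dec = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, in_dim)

    def forward(self, x):                       # x: (batch, time, 2)
        h, _ = self.enc(x)
        ctx, _ = self.attn(h, h, h)             # self-attention over time steps
        d, _ = self.dec(ctx)
        return self.out(d)

model = AttnLSTMAutoencoder()
x = torch.randn(4, 128, 2)                      # 4 windows of 128 I/Q samples
err = ((model(x) - x) ** 2).mean(dim=(1, 2))    # per-window anomaly score
print(err)  # flag windows whose error exceeds a calibrated threshold
```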
Authors: Zerui Wang, Yan Liu, Jun Huang
Abstract: This article presents the design of an open-API-based explainable AI (XAI) service to provide feature contribution explanations for cloud AI services. Cloud AI services are widely used to develop domain-specific applications with precise learning metrics. However, the underlying cloud AI services remain opaque on how the model produces the prediction. We argue that XAI operations are accessible as open APIs to enable the consolidation of the XAI operations into the cloud AI services assessment. We propose a design using a microservice architecture that offers feature contribution explanations for cloud AI services without unfolding the network structure of the cloud models. We can also utilize this architecture to evaluate the model performance and XAI consistency metrics showing cloud AI services' trustworthiness. We collect provenance data from operational pipelines to enable reproducibility within the XAI service. Furthermore, we present discovery scenarios from experimental tests of model performance and XAI consistency metrics for leading cloud vision AI services. The results confirm that the architecture, based on open APIs, is cloud-agnostic. Additionally, data augmentations result in measurable improvements in XAI consistency metrics for cloud AI services.
Authors: Alexander Mote, Todd Palmer, Lizhong Chen
Abstract: Domain decomposition is a technique used to reduce memory overhead on large neutron transport problems. Currently, the optimal load-balanced processor allocation for these domains is typically determined through small-scale simulations of the problem, which can be time-consuming for researchers and must be repeated anytime a problem input is changed. We propose a Transformer model with a unique 3D input embedding and input representations designed for domain-decomposed neutron transport problems, which can predict the subdomain computation loads generated by small-scale simulations. We demonstrate that such a model trained on domain-decomposed Small Modular Reactor (SMR) simulations achieves 98.2% accuracy while being able to skip the small-scale simulation step entirely. Tests of the model's robustness on variant fuel assemblies, other problem geometries, and changes in simulation parameters are also discussed.
Authors: Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, Ted Xiao
Abstract: The complexity of the real world demands robotic systems that can intelligently adapt to unseen situations. We present STEER, a robot learning framework that bridges high-level, commonsense reasoning with precise, flexible low-level control. Our approach translates complex situational awareness into actionable low-level behavior through training language-grounded policies with dense annotation. By structuring policy training around fundamental, modular manipulation skills expressed in natural language, STEER exposes an expressive interface for humans or Vision-Language Models (VLMs) to intelligently orchestrate the robot's behavior by reasoning about the task and context. Our experiments demonstrate the skills learned via STEER can be combined to synthesize novel behaviors to adapt to new situations or perform completely new tasks without additional data collection or training.
Authors: Todd Huster, Peter Lin, Razvan Stefanescu, Emmanuel Ekwedike, Ritu Chadha
Abstract: Neural networks can conceal malicious Trojan backdoors that allow a trigger to covertly change the model behavior. Detecting signs of these backdoors, particularly without access to any triggered data, is the subject of ongoing research and open challenges. In one common formulation of the problem, we are given a set of clean and poisoned models and need to predict whether a given test model is clean or poisoned. In this paper, we introduce a detector that works remarkably well across many of the existing datasets and domains. It is obtained by training a binary classifier on a large number of models' weights after performing a few different pre-processing steps including feature selection and standardization, reference model weights subtraction, and model alignment prior to detection. We evaluate this algorithm on a diverse set of Trojan detection benchmarks and domains and examine the cases where the approach is most and least effective.
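A hedged sketch of the described pipeline on stand-in data (the model-alignment step is omitted here, and the classifier choice is our assumption):

```python
# Flatten model weights, subtract a reference model, standardize, select
# features, and fit a binary clean/poisoned classifier.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

n_models, n_weights = 200, 2000
W = np.random.randn(n_models, n_weights)   # flattened per-model weights
w_ref = np.random.randn(n_weights)         # reference (clean) model weights
y = np.random.randint(0, 2, n_models)      # 1 = poisoned

detector = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=100),
    GradientBoostingClassifier(random_state=0),
)
detector.fit(W - w_ref, y)                 # reference subtraction comes first
print(detector.predict(W[:3] - w_ref))
```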
Authors: Junior Cedric Tonga, Benjamin Clement, Pierre-Yves Oudeyer
Abstract: The automatic generation of hints by Large Language Models (LLMs) within Intelligent Tutoring Systems (ITSs) has shown potential to enhance student learning. However, generating pedagogically sound hints that address student misconceptions and adhere to specific educational objectives remains challenging. This work explores using LLMs (GPT-4o and Llama-3-8B-instruct) as teachers to generate effective hints for students simulated through LLMs (GPT-3.5-turbo, Llama-3-8B-Instruct, or Mistral-7B-instruct-v0.3) tackling math exercises designed for human high-school students according to cognitive science principles. We study several dimensions: 1) identifying error patterns made by simulated students on secondary-level math exercises; 2) developing various prompts for GPT-4o as a teacher and evaluating their effectiveness in generating hints that enable simulated students to self-correct; and 3) testing the best-performing prompts, based on their ability to produce relevant hints and facilitate error correction, with Llama-3-8B-Instruct as the teacher, allowing for a performance comparison with GPT-4o. The results show that model errors increase with higher temperature settings. Notably, when hints are generated by GPT-4o, the most effective prompts include prompts tailored to specific errors as well as prompts providing general hints based on common mathematical errors. Interestingly, Llama-3-8B-Instruct as a teacher showed better overall performance than GPT-4o. Moreover, the problem-solving and response-revision capabilities of the LLMs as students, particularly GPT-3.5-turbo, improved significantly after receiving hints, especially at lower temperature settings. However, models like Mistral-7B-Instruct demonstrated a decline in performance as the temperature increased.
Authors: Razvan-Gabriel Dumitru, Paul-Ioan Clotan, Vikas Yadav, Darius Peteleaza, Mihai Surdeanu
Abstract: This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), enhancing the traditional methodology established by SliceGPT. By transitioning from constant to dynamic slicing, our method leverages the newly proposed Layer Redundancy (LR) score, which assesses how much each layer changes its input by measuring the cosine similarity of the input to the output of the layer. We use this score to prune parts of individual layers based on redundancy in such a way that the average pruned percentage for all layers is a fixed value. We conducted extensive experiments using models like Llama3-8B and Mistral-7B on multiple datasets, evaluating different slicing bases and percentages to determine optimal configurations that balance efficiency and performance. Our findings show that our dynamic slicing approach not only maintains but, in many cases, enhances model performance compared to the baseline established by constant slicing methods. For instance, in several settings, we see performance improvements of up to 5% over the SliceGPT baseline. Additionally, a perplexity decrease by as much as 7% was observed across multiple benchmarks, validating the effectiveness of our method. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/DynamicSlicing.
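The LR score as described reduces to a cosine similarity between a layer's input and output hidden states; a short sketch (how the hidden states are captured is an assumption):

```python
# Layer Redundancy score: mean cosine similarity between the hidden states
# entering and leaving a layer. Higher similarity = the layer changes its
# input less = more redundant, so it can be pruned more aggressively while
# keeping the average pruned percentage fixed across layers.
import torch
import torch.nn.functional as F

def layer_redundancy(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """h_in, h_out: (tokens, dim) hidden states entering/leaving a layer."""
    sims = F.cosine_similarity(h_in, h_out, dim=-1)  # per-token similarity
    return sims.mean().item()

h_in, h_out = torch.randn(32, 4096), torch.randn(32, 4096)
print(layer_redundancy(h_in, h_out))
```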
Authors: Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis
Abstract: With more advanced natural language understanding and reasoning capabilities, large language model (LLM)-powered agents are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emergent behaviors relevant to social science and gaming. However, current multi-agent simulations frequently suffer from inefficiencies due to the limited parallelism caused by false dependencies, resulting in performance bottlenecks. In this paper, we introduce AI Metropolis, a simulation engine that improves the efficiency of LLM agent simulations by incorporating out-of-order execution scheduling. By dynamically tracking real dependencies between agents, AI Metropolis minimizes false dependencies, enhancing parallelism and enabling efficient hardware utilization. Our evaluations demonstrate that AI Metropolis achieves speedups from 1.3x to 4.15x over standard parallel simulation with global synchronization, approaching optimal performance as the number of agents increases.
Authors: Wei Wang, Zhichao Hou, Xiaorui Liu, Xinxia Peng
Abstract: Research on long non-coding RNAs (lncRNAs) has garnered significant attention due to their critical roles in gene regulation and disease mechanisms. However, the complexity and diversity of lncRNA sequences, along with the limited knowledge of their functional mechanisms and the regulation of their expressions, pose significant challenges to lncRNA studies. Given the tremendous success of large language models (LLMs) in capturing complex dependencies in sequential data, this study aims to systematically explore the potential and limitations of LLMs in the sequence analysis related to the transcriptional regulation of lncRNA genes. Our extensive experiments demonstrated promising performance of fine-tuned genome foundation models on progressively complex tasks. Furthermore, we conducted an insightful analysis of the critical impact of task complexity, model selection, data quality, and biological interpretability for the studies of the regulation of lncRNA gene expression.
Authors: Geza Kovacs, Daniel Deutsch, Markus Freitag
Abstract: While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as improvements might simply be due to reward hacking rather than reflecting real quality improvements. In this work we find that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
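Schematically, ensemble-MBR scores each candidate by its average utility against the other samples (as pseudo-references), averaged over metrics; the toy overlap function below stands in for neural scorers such as COMET or MetricX:

```python
# MBR decoding with an ensemble of utility metrics: pick the candidate
# with the highest expected utility against the other candidates.
from typing import Callable, List

def mbr_decode(candidates: List[str],
               metrics: List[Callable[[str, str], float]]) -> str:
    def expected_utility(cand: str) -> float:
        refs = [r for r in candidates if r is not cand]
        per_metric = [
            sum(m(cand, r) for r in refs) / len(refs) for m in metrics
        ]
        return sum(per_metric) / len(per_metric)   # ensemble average
    return max(candidates, key=expected_utility)

# Toy usage with a character-overlap "metric" standing in for a neural one.
overlap = lambda h, r: len(set(h) & set(r)) / max(len(set(h) | set(r)), 1)
print(mbr_decode(["the cat sat", "a cat sat", "dog ran"], [overlap]))
```

Averaging over several utility metrics makes it harder for the decoder to reward-hack any single one, which is the mitigation the abstract reports.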
Authors: Brian Chen, Xiangyuan Zhao, Yingnan Zhu
Abstract: Video summarization techniques have been proven to improve the overall user experience when it comes to accessing and comprehending video content. If the user's preference is known, video summarization can identify significant information or relevant content from an input video, aiding them in obtaining the necessary information or determining their interest in watching the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we propose a new benchmark for video summarization that captures various user preferences. Also, we present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset. The pipeline takes both video and closed captioning as input and performs semantic analysis at the scene level by converting video frames into text. Subsequently, the user's genre preference is used as the basis for selecting the pertinent textual scenes. The experimental results demonstrate that our proposed pipeline outperforms current state-of-the-art unsupervised video summarization models. We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models. In the end, the runtime analysis demonstrates that our pipeline is more suitable for practical use when scaling up the number of user preferences and videos.
Authors: Kevin Tirta Wijaya, Minghao Guo, Michael Sun, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei
Abstract: Accurate property prediction is crucial for accelerating the discovery of new molecules. Although deep learning models have achieved remarkable success, their performance often relies on large amounts of labeled data that are expensive and time-consuming to obtain. Thus, there is a growing need for models that can perform well with limited experimentally-validated data. In this work, we introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated molecular property labels are scarce. MoleVers adopts a two-stage pretraining strategy. In the first stage, the model learns molecular representations from large unlabeled datasets via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods, enabling supervised learning without the need for costly experimental data. This two-stage framework allows MoleVers to learn representations that generalize effectively across various downstream datasets. We evaluate MoleVers on a new benchmark comprising 22 molecular datasets with diverse types of properties, the majority of which contain 50 or fewer training labels reflecting real-world conditions. MoleVers achieves state-of-the-art results on 20 out of the 22 datasets, and ranks second on the remaining two, highlighting its ability to bridge the gap between data-hungry models and real-world conditions where practically-useful labels are scarce.
Authors: Anurag Acharya, Shivam Sharma, Robin Cosbey, Megha Subramanian, Scott Howland, Maria Glenski
Abstract: A proliferation of Large Language Models (the GPT series, BLOOM, LLaMA, and more) is driving forward novel development of multipurpose AI for a variety of tasks, particularly natural language processing (NLP) tasks. These models demonstrate strong performance on a range of tasks; however, there has been evidence of brittleness when applied to more niche or narrow domains where hallucinations or fluent but incorrect responses reduce performance. Given the complex nature of scientific domains, it is prudent to investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains. In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare these to open-source, off-the-shelf models with zero-shot and few-shot prompting. Our results show that not only do in-domain base models perform reasonably well on in-domain tasks in a zero-shot setting but that further adaptation using instruction fine-tuning yields impressive performance on chemistry-specific tasks such as named entity recognition and molecular formula generation.
Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang
Abstract: We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework that enables it to dynamically process memory in a nested structure, effectively learning from stored accumulated experience to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5\% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38\%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.
Authors: Pengju Wang, Bochao Liu, Weijia Guo, Yong Li, Shiming Ge
Abstract: Federated learning is a distributed machine learning paradigm designed to protect data privacy. However, data heterogeneity across various clients results in catastrophic forgetting, where the model rapidly forgets previous knowledge while acquiring new knowledge. To address this challenge, personalized federated learning has emerged to customize a personalized model for each client. However, the inherent limitation of this mechanism is its excessive focus on personalization, potentially hindering the generalization of those models. In this paper, we present a novel personalized federated learning method that uses global and historical models as teachers and the local model as the student to facilitate comprehensive knowledge distillation. The historical model represents the local model from the last round of client training, containing historical personalized knowledge, while the global model represents the aggregated model from the last round of server aggregation, containing global generalized knowledge. By applying knowledge distillation, we effectively transfer global generalized knowledge and historical personalized knowledge to the local model, thus mitigating catastrophic forgetting and enhancing the general performance of personalized models. Extensive experimental results demonstrate the significant advantages of our method.
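A sketch of one plausible form of the dual-teacher objective: KL terms pulling the local student toward both the global and the historical teachers, added to the task loss (the weighting scheme and temperature are our assumptions):

```python
# Dual-teacher knowledge distillation: global teacher carries generalized
# knowledge, historical teacher carries the client's personalized knowledge.
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_logits, global_logits, hist_logits,
                      labels, T=2.0, alpha=0.5, beta=0.5):
    task = F.cross_entropy(student_logits, labels)
    log_p = F.log_softmax(student_logits / T, dim=-1)
    kd_global = F.kl_div(log_p, F.softmax(global_logits / T, dim=-1),
                         reduction="batchmean") * T * T
    kd_hist = F.kl_div(log_p, F.softmax(hist_logits / T, dim=-1),
                       reduction="batchmean") * T * T
    return task + alpha * kd_global + beta * kd_hist

s, g, h = torch.randn(8, 10), torch.randn(8, 10), torch.randn(8, 10)
print(dual_teacher_loss(s, g, h, torch.randint(0, 10, (8,))))
```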
Authors: Arunkumar Rathinam, Leo Pauly, Abd El Rahman Shabayek, Wassim Rharbaoui, Anis Kacem, Vincent Gaudillière, Djamila Aouada
Abstract: Multispectral pedestrian detection has gained significant attention in recent years, particularly in autonomous driving applications. To address the challenges posed by adversarial illumination conditions, the combination of thermal and visible images has demonstrated its advantages. However, existing fusion methods rely on the critical assumption that the RGB-Thermal (RGB-T) image pairs are fully overlapping. This assumption often does not hold in real-world applications, where only partial overlap between images can occur due to the sensor configuration. Moreover, sensor failure can cause a loss of information in one modality. In this paper, we propose a novel module called the Hybrid Attention (HA) mechanism as our main contribution to mitigate performance degradation caused by partial overlap and sensor failure, i.e. when at least part of the scene is acquired by only one sensor. We propose an improved RGB-T fusion algorithm, robust against partial overlap and sensor failure encountered during inference in real-world applications. We also leverage a mobile-friendly backbone to cope with resource constraints in embedded systems. We conducted experiments by simulating various partial overlap and sensor failure scenarios to evaluate the performance of our proposed method. The results demonstrate that our approach outperforms state-of-the-art methods, showcasing its superiority in handling real-world challenges.
Authors: Qiyuan Zhu, A. K. Qin, Hussein Dia, Adriana-Simona Mihaita, Hanna Grzybowska
Abstract: Traffic flow forecasting is a crucial task in intelligent transport systems. Deep learning offers an effective solution, capturing complex patterns in time-series traffic flow data to enable accurate prediction. However, deep learning models are prone to overfitting the intricate details of flow data, leading to poor generalisation. Recent studies suggest that decomposition-based deep ensemble learning methods may address this issue by breaking down a time series into multiple simpler signals, upon which deep learning models are built and ensembled to generate the final prediction. However, few studies have compared the performance of decomposition-based ensemble methods with non-decomposition-based ones, which directly utilise raw time-series data. This work compares several decomposition-based and non-decomposition-based deep ensemble learning methods. Experimental results on three traffic datasets demonstrate the superiority of decomposition-based ensemble methods, while also revealing their sensitivity to aggregation strategies and forecasting horizons.
Authors: Dahyun Mok, Junghyun Bum, Le Duc Tai, Hyunseung Choo
Abstract: Diabetic Retinopathy (DR) is a primary cause of blindness, necessitating early detection and diagnosis. This paper focuses on referable DR classification to enhance the applicability of the proposed method in clinical practice. We develop an advanced cross-learning DR classification method leveraging transfer learning and cross-attention mechanisms. The proposed method employs the Swin U-Net architecture to segment lesion maps from DR fundus images. The Swin U-Net segmentation model, enriched with DR lesion insights, is transferred to generate a lesion map. Both the fundus image and its segmented lesion map are used as complementary inputs for the classification model. A cross-attention mechanism is deployed to improve the model's ability to capture fine-grained details from the input pairs. Our experiments, utilizing two public datasets, FGADR and EyePACS, demonstrate a superior accuracy of 94.6%, surpassing current state-of-the-art methods by 4.4%. Ultimately, we aim for the proposed method to be seamlessly integrated into clinical workflows, enhancing accuracy and efficiency in identifying referable DR.
Authors: Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, Maosong Sun
Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. StreamingBench assesses three core aspects of streaming video understanding: (1) real-time visual understanding, (2) omni-source understanding, and (3) contextual understanding. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.
Authors: Zihan Qin, Jialei Xu, Wenbo Zhao, Junjun Jiang, Xianming Liu
Abstract: Depth estimation under adverse conditions remains a significant challenge. Recently, multi-spectral depth estimation, which integrates both visible light and thermal images, has shown promise in addressing this issue. However, existing algorithms struggle with precise pixel-level feature matching, limiting their ability to fully exploit geometric constraints across different spectra. To address this, we propose a novel framework incorporating stereo depth estimation to enforce accurate geometric constraints. In particular, we treat the visible light and thermal images as a stereo pair and utilize a Cross-modal Feature Matching (CFM) Module to construct a cost volume for pixel-level matching. To mitigate the effects of poor lighting on stereo matching, we introduce Degradation Masking, which leverages robust monocular thermal depth estimation in degraded regions. Our method achieves state-of-the-art (SOTA) performance on the Multi-Spectral Stereo (MS2) dataset, with qualitative evaluations demonstrating high-quality depth maps under varying lighting conditions.
Authors: Yincen Qu, Chao Ma, Yiting Wu, Xiangying Dai, Hui Zhou, Hengyue Liu
Abstract: In the industry, numerous tasks are deployed online. Traditional approaches often tackle each task separately with its own network, which leads to excessive costs for developing and scaling models, especially in the context of large language models. Although multi-task methods can save costs through parameter sharing, they often struggle to outperform single-task methods in real-world applications. To tackle these challenges, we present a three-stage multi-task learning framework for large language models. It involves task filtering, followed by fine-tuning on high-resource tasks, and finally fine-tuning on all tasks. We conducted comprehensive experiments in single-task and multi-task settings. Evaluated on different benchmarks, our approach achieves performance comparable to the single-task method while reducing up to 90.9\% of its overhead.
Authors: Yuqing Xiao, John Grundy, Anuradha Madugalla
Abstract: Growth of the older adult population has led to an increasing interest in technology-supported aged care. However, the area has some challenges, such as a lack of caregivers and limitations in understanding the emotional, social, physical, and mental well-being needs of seniors. Furthermore, there is a gap between developers and ageing people in understanding the latter's requirements. Digital health can be important in supporting older adults' well-being, emotional requirements, and social needs. Requirements Engineering (RE) is a major software engineering field, which can help to identify, elicit and prioritize the requirements of stakeholders and ensure that systems meet standards for performance, reliability, and usability. We carried out a systematic review of the literature on RE for older adult digital health software. This was necessary to represent the current state of understanding of the needs of older adults in aged-care digital health. Using established guidelines outlined by the Kitchenham method, PRISMA, and PICO, we developed a protocol, followed by the systematic exploration of eight databases. This resulted in 69 primary studies of high relevance, which were subsequently subjected to data extraction, synthesis, and reporting. We highlight key RE processes in digital health software for ageing people, the utilization of technology for older users' well-being and care, and the evaluation of such solutions. The review also identified key limitations found in existing primary studies that inspire future research opportunities. The results indicate that requirement gathering and understanding vary significantly between studies. The differences are in the quality, depth, and techniques adopted for requirement gathering, and these differences are largely due to the uneven adoption of RE methods.
Authors: Xuelin Liu, Yanfei Zhu, Shucheng Zhu, Pengyuan Liu, Ying Liu, Dong Yu
Abstract: Proper moral beliefs are fundamental for language models, yet assessing these beliefs poses a significant challenge. This study introduces a novel three-module framework to evaluate the moral beliefs of four prominent large language models. Initially, we constructed a dataset containing 472 moral choice scenarios in Chinese, derived from moral words. The decision-making process of the models in these scenarios reveals their moral principle preferences. By ranking these moral choices, we discern the varying moral beliefs held by different language models. Additionally, through moral debates, we investigate how firmly these models adhere to their moral choices. Our findings indicate that English language models, namely ChatGPT and Gemini, closely mirror the moral decisions of a sample of Chinese university students, demonstrating strong adherence to their choices and a preference for individualistic moral beliefs. In contrast, Chinese models such as Ernie and ChatGLM lean towards collectivist moral beliefs, exhibiting ambiguity in their moral choices and debates. This study also uncovers gender bias embedded within the moral beliefs of all examined language models. Our methodology offers an innovative means to assess moral beliefs in both artificial and human intelligence, facilitating a comparison of moral values across different cultures.
Authors: Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, Jin Ye, Junjun He, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, Paul Jaeger, Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Yong Xia, Zhaohu Xing, Lei Zhu, Yousef Sadegheih, Afshin Bozorgpour, Pratibha Kumari, Reza Azad, Dorit Merhof, Pengcheng Shi, Ting Ma, Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao, Haonan Wang, Xiaomeng Li, Hanxue Gu, Haoyu Dong, Jichen Yang, Maciej A. Mazurowski, Saumya Gupta, Linshan Wu, Jiaxin Zhuang, Hao Chen, Holger Roth, Daguang Xu, Matthew B. Blaschko, Sergio Decherchi, Andrea Cavalli, Alan L. Yuille, Zongwei Zhou
Abstract: How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
Authors: Yansong Qu, Zilin Huang, Zihao Sheng, Tiantian Chen, Sikai Chen
Abstract: Semantic scene completion (SSC) is essential for achieving comprehensive perception in autonomous driving systems. However, existing SSC methods often overlook the high deployment costs in real-world applications. Traditional architectures, such as 3D Convolutional Neural Networks (3D CNNs) and self-attention mechanisms, face challenges in efficiently capturing long-range dependencies within 3D voxel grids, limiting their effectiveness. To address these issues, we introduce MetaSSC, a novel meta-learning-based framework for SSC that leverages deformable convolution, large-kernel attention, and the Mamba (D-LKA-M) model. Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions while acquiring transferable meta-knowledge. Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data from multiple nearby connected autonomous vehicles (CAVs), generating richer and more comprehensive labels. This meta-knowledge is then adapted to the target domain through a dual-phase training strategy that does not add extra model parameters, enabling efficient deployment. To further enhance the model's capability in capturing long-sequence relationships within 3D voxel grids, we integrate Mamba blocks with deformable convolution and large-kernel attention into the backbone network. Extensive experiments demonstrate that MetaSSC achieves state-of-the-art performance, significantly outperforming competing models while also reducing deployment costs.
Authors: Jin Xiao, Bowei Zhang, Qianyu He, Jiaqing Liang, Feng Wei, Jinglei Chen, Zujie Liang, Deqing Yang, Yanghua Xiao
Abstract: While large language models (LLMs) have become excellent writing assistants, they still struggle with quotation generation. This is because they either hallucinate when providing factual quotations or fail to provide quotes that exceed human expectations. To bridge the gap, we systematically study how to evaluate and improve LLMs' performance in quotation generation tasks. We first establish a holistic and automatic evaluation system for the quotation generation task, which consists of five criteria, each with a corresponding automatic metric. To improve LLMs' quotation generation abilities, we construct a bilingual knowledge base that is broad in scope and rich in dimensions, containing up to 32,022 quotes. Moreover, guided by our criteria, we further design a quotation-specific metric to rerank the retrieved quotations from the knowledge base. Extensive experiments show that our metrics strongly correlate with human preferences. Existing LLMs struggle to generate desired quotes, but our quotation knowledge base and reranking metric help narrow this gap. Our dataset and code are publicly available at https://github.com/GraceXiaoo/QUILL.
Authors: Zehao Xiao, Cees G. M. Snoek
Abstract: Machine learning algorithms have achieved remarkable success across various disciplines, use cases and applications, under the prevailing assumption that training and test samples are drawn from the same distribution. Consequently, these algorithms struggle and become brittle even when samples in the test distribution start to deviate from the ones observed during training. Domain adaptation and domain generalization have been studied extensively as approaches to address distribution shifts across test and train domains, but each has its limitations. Test-time adaptation, a recently emerging learning paradigm, combines the benefits of domain adaptation and domain generalization by training models only on source data and adapting them to target data during test-time inference. In this survey, we provide a comprehensive and systematic review on test-time adaptation, covering more than 400 recent papers. We structure our review by categorizing existing methods into five distinct categories based on what component of the method is adjusted for test-time adaptation: the model, the inference, the normalization, the sample, or the prompt, providing detailed analysis of each. We further discuss the various preparation and adaptation settings for methods within these categories, offering deeper insights into the effective deployment for the evaluation of distribution shifts and their real-world application in understanding images, video and 3D, as well as modalities beyond vision. We close the survey with an outlook on emerging research opportunities for test-time adaptation.
Authors: Muhammad Tayyab Khan, Lequn Chen, Ye Han Ng, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon
Abstract: Geometric Dimensioning and Tolerancing (GD&T) plays a critical role in manufacturing by defining acceptable variations in part features to ensure component quality and functionality. However, extracting GD&T information from 2D engineering drawings is a time-consuming and labor-intensive task, often relying on manual efforts or semi-automated tools. To address these challenges, this study proposes an automated and computationally efficient GD&T extraction method by fine-tuning Florence-2, an open-source vision-language model (VLM). The model is trained on a dataset of 400 drawings with ground truth annotations provided by domain experts. For comparison, two state-of-the-art closed-source VLMs, GPT-4o and Claude-3.5-Sonnet, are evaluated on the same dataset. All models are assessed using precision, recall, F1-score, and hallucination metrics. Due to the computational cost and impracticality of fine-tuning large closed-source VLMs for domain-specific tasks, GPT-4o and Claude-3.5-Sonnet are evaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with 0.23 billion parameters, is optimized through full-parameter fine-tuning across three distinct experiments, each utilizing datasets augmented to different levels. The results show that Florence-2 achieves a 29.95% increase in precision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a 43.15% reduction in hallucination rate compared to the best-performing closed-source model. These findings highlight the effectiveness of fine-tuning smaller, open-source VLMs like Florence-2, offering a practical and efficient solution for automated GD&T extraction to support downstream manufacturing tasks.
Authors: Zhongliang Tang, Mengchen Tan, Fei Xia, Qingrong Cheng, Hao Jiang, Yongxiang Zhang
Abstract: We introduce an innovative system, AutoGameUI, for efficiently constructing cohesive user interfaces in game development. Our system is the first to address the coherence issue arising from integrating inconsistent UI and UX designs, typically leading to mismatches and inefficiencies. We propose a two-stage multimodal learning pipeline to obtain comprehensive representations of both UI and UX designs, and to establish their correspondences. Through the correspondences, a cohesive user interface is automatically constructed from pairwise designs. To achieve high-fidelity effects, we introduce a universal data protocol for precise design descriptions and cross-platform applications. We also develop an interactive web-based tool for game developers to facilitate the use of our system. We create a game UI dataset from actual game projects and combine it with a public dataset for training and evaluation. Our experimental results demonstrate the effectiveness of our system in maintaining coherence between the constructed interfaces and the original designs.
Authors: Michael Merry, Patricia Riddle, Jim Warren
Abstract: We introduce PropNEAT, a fast backpropagation implementation of NEAT that uses a bidirectional mapping of the genome graph to a layer-based architecture, preserving the NEAT genomes whilst enabling efficient GPU backpropagation. We test PropNEAT on 58 binary classification datasets from the Penn Machine Learning Benchmarks database, comparing its performance against logistic regression, dense neural networks and random forests, as well as a densely retrained variant of the final PropNEAT model. PropNEAT had the second-best overall performance, behind Random Forest, though the differences between models were not statistically significant, except for Random Forest compared with the logistic regression and retrained PropNEAT models. PropNEAT was substantially faster than a naive backpropagation method, and both were substantially faster and performed better than the original NEAT implementation. We demonstrate that the per-epoch training time for PropNEAT scales linearly with network depth and is efficient on GPU implementations of backpropagation. This implementation could be extended to support reinforcement learning or convolutional networks, and is able to find sparser and smaller networks with potential for applications in low-power contexts.
Authors: Kehua Qu, Rui Ding, Jin Tang
Abstract: Multi-person motion prediction is an emerging and intricate task with broad real-world applications. Unlike single-person motion prediction, it considers not just the skeleton structures or human trajectories but also the interactions between individuals. Previous methods use various networks to achieve impressive predictions but often overlook that the joint relations within an individual (intra-relations) and the interactions among groups (inter-relations) are distinct types of representations. These methods often lack explicit representation of inter- and intra-relations, and inevitably introduce undesired dependencies. To address this issue, we introduce a new collaborative framework for multi-person motion prediction that explicitly models these relations: a GCN-based network for intra-relations and a novel reasoning network for inter-relations. Moreover, we propose a novel plug-and-play aggregation module called the Interaction Aggregation Module (IAM), which employs an aggregate-attention mechanism to seamlessly integrate these relations. Experiments indicate that the module can also be applied to other dual-path models. Extensive experiments on the 3DPW, 3DPW-RC, CMU-Mocap, and MuPoTS-3D datasets, as well as the synthesized datasets Mix1 & Mix2 (9 to 15 persons), demonstrate that our method achieves state-of-the-art performance.
Authors: Yoni Choukroun, Shlomi Azoulay, Pavel Kisilev
Abstract: Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints. While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. Our method demonstrates improved performance over the ubiquitous gradient averaging on multiple MLPerf tasks while remaining extremely efficient in both communicational and computational complexity.
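To make the idea of objective-aware gradient aggregation concrete, the sketch below re-weights per-worker gradients by searching over subspace coefficients that minimize the loss on a small probe batch, starting from plain averaging. The coordinate-descent search, step size, and gradient layout are illustrative assumptions and do not reproduce the paper's derived weighting scheme or its subspace momentum.

```python
# Hedged sketch: weighting worker gradients via a small objective-aware search
# over subspace coefficients. Assumes worker_grads is a list (one entry per
# worker) of per-parameter gradient tensors, in model.parameters() order.
import copy
import torch

@torch.no_grad()
def evaluate_candidate(model, loss_fn, batch, worker_grads, coeffs, lr):
    """Loss after a trial step along the subspace spanned by worker gradients."""
    backup = copy.deepcopy(model.state_dict())
    for p, grads in zip(model.parameters(), zip(*worker_grads)):
        step = sum(c * g for c, g in zip(coeffs, grads))
        p.sub_(lr * step)
    x, y = batch
    loss = loss_fn(model(x), y).item()
    model.load_state_dict(backup)
    return loss

def subspace_aggregate(model, loss_fn, probe_batch, worker_grads, lr=0.1,
                       deltas=(0.5, 1.0, 1.5)):
    """Coordinate search over per-worker coefficients around plain averaging."""
    n = len(worker_grads)
    coeffs = [1.0 / n] * n                      # start from gradient averaging
    for i in range(n):
        candidates = [[*coeffs[:i], d / n, *coeffs[i + 1:]] for d in deltas]
        losses = [evaluate_candidate(model, loss_fn, probe_batch,
                                     worker_grads, c, lr) for c in candidates]
        coeffs = candidates[losses.index(min(losses))]
    # Return the re-weighted aggregate gradient, one tensor per parameter.
    return [sum(c * g for c, g in zip(coeffs, grads))
            for grads in zip(*worker_grads)]
```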
Authors: Yuxiao Chen, Gamze Gürsoy, Qi Lei
Abstract: Federated Learning (FL) is designed to prevent data leakage through collaborative model training without centralized data storage. However, it remains vulnerable to gradient reconstruction attacks that recover original training data from shared gradients. To optimize the trade-off between data leakage and utility loss, we first derive a theoretical lower bound of reconstruction error (among all attackers) for the two standard methods: adding noise, and gradient pruning. We then customize these two defenses to be parameter- and model-specific and achieve the optimal trade-off between our obtained reconstruction lower bound and model utility. Experimental results validate that our methods outperform Gradient Noise and Gradient Pruning by protecting the training data better while also achieving better utility.
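For reference, the two baseline defenses analysed above take the following generic form; the paper's contribution of calibrating the noise scale and pruning ratio per parameter and per model to meet the derived reconstruction bound is not reproduced here.

```python
# Hedged sketch: generic gradient-noise and gradient-pruning defenses applied
# before gradients are shared. Sigma and keep_ratio are illustrative values.
import torch

def add_gaussian_noise(grads, sigma=1e-3):
    """Perturb shared gradients with Gaussian noise before upload."""
    return [g + sigma * torch.randn_like(g) for g in grads]

def prune_gradients(grads, keep_ratio=0.1):
    """Keep only the largest-magnitude entries of each gradient tensor."""
    pruned = []
    for g in grads:
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.abs().flatten().kthvalue(g.numel() - k + 1).values
        pruned.append(torch.where(g.abs() >= threshold, g, torch.zeros_like(g)))
    return pruned
```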
Authors: Sagar Shrestha, Xiao Fu
Abstract: Understanding identifiability of latent content and style variables from unaligned multi-domain data is essential for tasks such as domain translation and data generation. Existing works on content-style identification were often developed under somewhat stringent conditions, e.g., that all latent components are mutually independent and that the dimensions of the content and style variables are known. We introduce a new analytical framework via cross-domain \textit{latent distribution matching} (LDM), which establishes content-style identifiability under substantially more relaxed conditions. Specifically, we show that restrictive assumptions such as component-wise independence of the latent variables can be removed. Most notably, we prove that prior knowledge of the content and style dimensions is not necessary for ensuring identifiability, if sparsity constraints are properly imposed onto the learned latent representations. Bypassing the knowledge of the exact latent dimension has been a longstanding aspiration in unsupervised representation learning -- our analysis is the first to underpin its theoretical and practical viability. On the implementation side, we recast the LDM formulation into a regularized multi-domain GAN loss with coupled latent variables. We show that the reformulation is equivalent to LDM under mild conditions -- yet requires considerably fewer computational resources. Experiments corroborate our theoretical claims.
Authors: Yu Guan, Qinrong Cai, Wei Li, Qiuyun Fan, Dong Liang, Qiegen Liu
Abstract: Diffusion model-based approaches have recently achieved remarkable success in MRI reconstruction, but their integration into clinical routine remains challenging due to time-consuming convergence. This phenomenon is particularly notable when the conventional diffusion process is applied directly to k-space data without considering the inherent properties of k-space sampling, limiting k-space learning efficiency and image reconstruction quality. To tackle these challenges, we introduce a subspace diffusion model with orthogonal decomposition, a method (referred to as Sub-DM) that restricts the diffusion process via projections onto a subspace as the k-space data distribution evolves toward noise. In particular, the subspace diffusion model circumvents the inference challenges posed by the complex and high-dimensional characteristics of k-space data, so the highly compact subspace ensures that the diffusion process requires only a few simple iterations to produce accurate prior information. Furthermore, the orthogonal decomposition strategy based on the wavelet transform hinders information loss during the migration of the vanilla diffusion process to the subspace. The strategy is approximately reversible, so the entire process can be reversed. As a result, it allows the diffusion processes in different spaces to refine models through a mutual feedback mechanism, enabling the learning of accurate priors even when dealing with complex k-space data. Comprehensive experiments on different datasets clearly demonstrate the superiority of Sub-DM against state-of-the-art methods in terms of reconstruction speed and quality.
Authors: Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang
Abstract: Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as special tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA. Our work takes a preliminary step towards understanding and improving NUPA of LLMs. Our benchmark and code are released at https://github.com/GraphPKU/number_cookbook.
Authors: Youssef Mohamed, Runjia Li, Ibrahim Said Ahmad, Kilichbek Haydarov, Philip Torr, Kenneth Ward Church, Mohamed Elhoseiny
Abstract: Research in vision and language has made considerable progress thanks to benchmarks such as COCO. COCO captions focused on unambiguous facts in English; ArtEmis introduced subjective emotions and ArtELingo introduced some multilinguality (Chinese and Arabic). However we believe there should be more multilinguality. Hence, we present ArtELingo-28, a vision-language benchmark that spans $\textbf{28}$ languages and encompasses approximately $\textbf{200,000}$ annotations ($\textbf{140}$ annotations per image). Traditionally, vision research focused on unambiguous class labels, whereas ArtELingo-28 emphasizes diversity of opinions over languages and cultures. The challenge is to build machine learning systems that assign emotional captions to images. Baseline results will be presented for three novel conditions: Zero-Shot, Few-Shot and One-vs-All Zero-Shot. We find that cross-lingual transfer is more successful for culturally-related languages. Data and code are provided at www.artelingo.org.
Authors: Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min
Abstract: The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has significantly improved low-level visual evaluation within the image domain recently. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset that focuses entirely on video quality assessment, and, based on it, we propose the VQA2 series of models. The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.
Authors: Edvin Listo Zec, Adam Breitholtz, Fredrik D. Johansson
Abstract: Federated learning enables multiple actors to collaboratively train models without sharing private data. This unlocks the potential for scaling machine learning to diverse applications. Existing algorithms for this task are well-justified when clients and the intended target domain share the same distribution of features and labels, but this assumption is often violated in real-world scenarios. One common violation is label shift, where the label distributions differ across clients or between clients and the target domain, which can significantly degrade model performance. To address this problem, we propose FedPALS, a novel model aggregation scheme that adapts to label shifts by leveraging knowledge of the target label distribution at the central server. Our approach ensures unbiased updates under stochastic gradient descent, ensuring robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification demonstrate that FedPALS consistently outperforms standard baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme client sparsity, highlighting the critical need for target-aware aggregation. FedPALS offers a principled and practical solution to mitigate label distribution mismatch, ensuring models trained in federated settings can generalize effectively to label-shifted target domains.
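A minimal sketch of the target-aware aggregation idea: choose client weights on the probability simplex so that the mixture of client label distributions approximates the target label distribution. The projected-gradient procedure and squared-error objective below are illustrative assumptions, not FedPALS' exact formulation or its unbiasedness correction.

```python
# Hedged sketch: target-aware client weighting under label shift.
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def target_aware_weights(client_label_dists, target_dist, steps=500, lr=0.1):
    """client_label_dists: (num_clients, num_classes); target_dist: (num_classes,)."""
    P = np.asarray(client_label_dists, dtype=float)
    q = np.asarray(target_dist, dtype=float)
    w = np.full(P.shape[0], 1.0 / P.shape[0])      # start from uniform weights
    for _ in range(steps):
        grad = 2.0 * P @ (P.T @ w - q)             # d/dw ||P^T w - q||^2
        w = project_to_simplex(w - lr * grad)
    return w

# Usage: weights = target_aware_weights(client_dists, target_dist), then form the
# global model as the weighted average of client updates with these weights.
```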
Authors: Jilan Mei, Junbo Li, Cai Meng
Abstract: This paper proposes a new method for accurate and robust 6D pose estimation of novel objects, named GS2Pose. By introducing 3D Gaussian splatting, GS2Pose can utilize the reconstruction results without requiring a high-quality CAD model, which means it only requires segmented RGBD images as input. Specifically, GS2Pose employs a two-stage structure consisting of coarse estimation followed by refined estimation. In the coarse stage, a lightweight U-Net network with a polarization attention mechanism, called Pose-Net, is designed. By using the 3DGS model for supervised training, Pose-Net can generate NOCS images to compute a coarse pose. In the refinement stage, GS2Pose formulates a pose regression algorithm following the idea of reprojection or Bundle Adjustment (BA), referred to as GS-Refiner. By leveraging Lie algebra to extend 3DGS, GS-Refiner obtains a pose-differentiable rendering pipeline that refines the coarse pose by comparing the input images with the rendered images. GS-Refiner also selectively updates parameters in the 3DGS model to achieve environmental adaptation, thereby enhancing the algorithm's robustness and flexibility to illuminative variation, occlusion, and other challenging disruptive factors. GS2Pose was evaluated through experiments conducted on the LineMod dataset, where it was compared with similar algorithms, yielding highly competitive results. The code for GS2Pose will soon be released on GitHub.
Authors: Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang
Abstract: The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we explore the possibility of contamination originating from the pre-training phase of the LLMs used by MLLMs and from the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.
Authors: Weishuo Ma, Yanbo Wang, Xiyuan Wang, Muhan Zhang
Abstract: Various graph neural networks (GNNs) with advanced training techniques and model designs have been proposed for link prediction tasks. However, outdated baseline models may lead to an overestimation of the benefits provided by these novel approaches. To address this, we systematically investigate the potential of Graph Autoencoders (GAE) by meticulously tuning hyperparameters and utilizing the trick of orthogonal embedding and linear propagation. Our findings reveal that a well-optimized GAE can match the performance of more complex models while offering greater computational efficiency.
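The sketch below illustrates the flavor of such a tuned linear graph autoencoder: an orthogonal node embedding is propagated over the symmetrically normalized adjacency, and links are scored by inner products. Dimensions, hop count, and the exact form of the orthogonal-embedding and linear-propagation tricks are assumptions.

```python
# Hedged sketch: a linear graph autoencoder for link prediction with an
# orthogonal node embedding and symmetric-normalized linear propagation.
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 for a scipy sparse adjacency."""
    adj = adj + sp.eye(adj.shape[0])
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ adj @ d_inv_sqrt

def linear_gae_scores(adj, dim=64, hops=2, seed=0):
    """Propagate an orthogonal embedding and score edges by inner products.
    Assumes dim <= number of nodes."""
    n = adj.shape[0]
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, dim)))  # orthonormal columns
    a_hat = normalized_adjacency(adj)
    z = q
    for _ in range(hops):
        z = a_hat @ z                                    # linear propagation
    return z @ z.T                                       # link scores

# Usage: scores[u, v] ranks candidate links (u, v); higher means more likely.
```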
Authors: Peihao Li
Abstract: With the widespread adoption of edge computing technologies and the increasing prevalence of deep learning models in these environments, the security risks and privacy threats to models and data have grown more acute. Attackers can exploit various techniques to illegally obtain models or misuse data, leading to serious issues such as intellectual property infringement and privacy breaches. Existing model access control technologies primarily rely on traditional encryption and authentication methods; however, these approaches exhibit significant limitations in terms of flexibility and adaptability in dynamic environments. Although there have been advancements in model watermarking techniques for marking model ownership, they remain limited in their ability to proactively protect intellectual property and prevent unauthorized access. To address these challenges, we propose a novel model access control method tailored for edge computing environments. This method leverages image style as a licensing mechanism, embedding style recognition into the model's operational framework to enable intrinsic access control. Consequently, models deployed on edge platforms are designed to correctly infer only on license data with specific style, rendering them ineffective on any other data. By restricting the input data to the edge model, this approach not only prevents attackers from gaining unauthorized access to the model but also enhances the privacy of data on terminal devices. We conducted extensive experiments on benchmark datasets, including MNIST, CIFAR-10, and FACESCRUB, and the results demonstrate that our method effectively prevents unauthorized access to the model while maintaining accuracy. Additionally, the model shows strong resistance against attacks such as forged licenses and fine-tuning. These results underscore the method's usability, security, and robustness.
Authors: Masakazu Yoshimura, Teruaki Hayashi, Yota Maeda
Abstract: An ecosystem of Transformer-based models has been established by building large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a crucial technology for deploying these models to downstream tasks with minimal cost while achieving effective performance. Recently, Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. While many large-scale Mamba-based models have been proposed, efficiently adapting pre-trained Mamba-based models to downstream tasks remains unexplored. In this paper, we conduct an exploratory analysis of PEFT methods for Mamba. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We also modify these methods to better align with the Mamba architecture. Additionally, we propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba. Our experiments indicate that PEFT performs more effectively for Mamba than Transformers. Lastly, we demonstrate how to effectively combine multiple PEFT methods and provide a framework that outperforms previous works. To ensure reproducibility, we will release the code after publication.
Authors: Yuanshao Zhu, James Jianqiao Yu, Xiangyu Zhao, Xuetao Wei, Yuxuan Liang
Abstract: Human trajectory modeling is essential for deciphering movement patterns and supporting advanced applications across various domains. However, existing methods are often tailored to specific tasks and regions, resulting in limitations related to task specificity, regional dependency, and data quality sensitivity. Addressing these challenges requires a universal human trajectory foundation model capable of generalizing and scaling across diverse tasks and geographic contexts. To this end, we propose UniTraj, a Universal human Trajectory foundation model that is task-adaptive, region-independent, and highly generalizable. To further enhance performance, we construct WorldTrace, the first large-scale, high-quality, globally distributed dataset sourced from open web platforms, encompassing 2.45 million trajectories with billions of points across 70 countries. Through multiple resampling and masking strategies designed for pre-training, UniTraj effectively overcomes geographic and task constraints, adapting to heterogeneous data quality. Extensive experiments across multiple trajectory analysis tasks and real-world datasets demonstrate that UniTraj consistently outperforms existing approaches in terms of scalability and adaptability. These results underscore the potential of UniTraj as a versatile, robust solution for a wide range of trajectory analysis applications, with WorldTrace serving as an ideal but non-exclusive foundation for training.
Authors: Huayang Huang, Yu Wu, Qian Wang
Abstract: Watermarking generative content serves as a vital tool for authentication, ownership protection, and mitigation of potential misuse. Existing watermarking methods face the challenge of balancing robustness and concealment. They empirically inject a watermark that is both invisible and robust and passively achieve concealment by limiting the strength of the watermark, thus reducing the robustness. In this paper, we propose to explicitly introduce a watermark hiding process to actively achieve concealment, thus allowing the embedding of stronger watermarks. To be specific, we implant a robust watermark in an intermediate diffusion state and then guide the model to hide the watermark in the final generated image. We employ an adversarial optimization algorithm to produce the optimal hiding prompt guiding signal for each watermark. The prompt embedding is optimized to minimize artifacts in the generated image, while the watermark is optimized to achieve maximum strength. The watermark can be verified by reversing the generation process. Experiments on various diffusion models demonstrate the watermark remains verifiable even under significant image tampering and shows superior invisibility compared to other state-of-the-art robust watermarking methods.
Authors: Yizhe Huang, Xingbo Wang, Hao Liu, Fanqi Kong, Aoyang Qin, Min Tang, Xiaoxi Wang, Song-Chun Zhu, Mingjie Bi, Siyuan Qi, Xue Feng
Abstract: Traditional interactive environments limit agents' intelligence growth with fixed tasks. Recently, single-agent environments address this by generating new tasks based on agent actions, enhancing task diversity. We consider the decision-making problem in multi-agent settings, where tasks are further influenced by social connections, affecting rewards and information access. However, existing multi-agent environments lack a combination of adaptive physical surroundings and social connections, hindering the learning of intelligent behaviors. To address this, we introduce AdaSociety, a customizable multi-agent environment featuring expanding state and action spaces, alongside explicit and alterable social structures. As agents progress, the environment adaptively generates new tasks with social structures for agents to undertake. In AdaSociety, we develop three mini-games showcasing distinct social structures and tasks. Initial results demonstrate that specific social structures can promote both individual and collective benefits, though current reinforcement learning and LLM-based algorithms show limited effectiveness in leveraging social structures to enhance performance. Overall, AdaSociety serves as a valuable research platform for exploring intelligence in diverse physical and social settings. The code is available at https://github.com/bigai-ai/AdaSociety.
Authors: Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke
Abstract: Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
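The linear-connector pattern discussed above can be summarized as stacking a few consecutive speech-encoder frames and projecting them into the LLM embedding space with a single linear layer. The dimensions and downsampling factor in the sketch below are assumptions, not SLAM-ASR's exact configuration.

```python
# Hedged sketch: a linear connector from speech-encoder features to LLM
# embeddings. Dimensions and the frame-stacking factor are illustrative.
import torch
import torch.nn as nn

class SpeechToLLMConnector(nn.Module):
    def __init__(self, speech_dim=1024, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(speech_dim * downsample, llm_dim)

    def forward(self, speech_feats):
        """speech_feats: (batch, frames, speech_dim) -> (batch, frames // k, llm_dim)."""
        b, t, d = speech_feats.shape
        t = (t // self.downsample) * self.downsample     # drop trailing frames
        stacked = speech_feats[:, :t].reshape(b, t // self.downsample,
                                              d * self.downsample)
        return self.proj(stacked)

# Usage: prepend the projected speech tokens to the prompt's token embeddings
# before feeding the sequence to the (typically frozen) LLM.
```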
Authors: Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed
Abstract: Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG's performance remains robust to the choice of graph encoder.
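A minimal sketch of the mapping-network idea: project knowledge-graph entity embeddings into the LLM's token-embedding space and prepend them as a handful of soft prompt vectors. The two-layer MLP, dimensions, and number of virtual tokens are assumptions; MEG's actual mapping network may differ.

```python
# Hedged sketch: lightweight mapper from KG embeddings to LLM soft prompts.
import torch
import torch.nn as nn

class GraphToLLMMapper(nn.Module):
    def __init__(self, kg_dim=200, llm_dim=4096, num_virtual_tokens=4):
        super().__init__()
        self.num_virtual_tokens = num_virtual_tokens
        self.mlp = nn.Sequential(
            nn.Linear(kg_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim * num_virtual_tokens),
        )

    def forward(self, kg_embeddings):
        """kg_embeddings: (batch, kg_dim) -> (batch, num_virtual_tokens, llm_dim)."""
        out = self.mlp(kg_embeddings)
        return out.view(out.size(0), self.num_virtual_tokens, -1)

# Usage: concatenate mapper(entity_vecs) with the question's token embeddings
# before the (frozen or parameter-efficiently tuned) LLM.
```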
Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
Abstract: Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the $\textbf{optimal approximation rate}$, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.
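One plausible instantiation of a polynomial composition activation combines learnable-weighted powers of a base activation, as sketched below; the paper's exact PolyCom parameterization may differ, so treat this purely as an illustration of the idea.

```python
# Hedged sketch: a polynomial composition of a base activation with learnable
# per-order coefficients. Order, base activation, and init are assumptions.
import torch
import torch.nn as nn

class PolyComActivation(nn.Module):
    def __init__(self, order=3, base=None):
        super().__init__()
        self.base = base if base is not None else nn.ReLU()
        self.order = order
        # One learnable coefficient per polynomial order.
        self.coeffs = nn.Parameter(torch.ones(order) / order)

    def forward(self, x):
        h = self.base(x)
        out = torch.zeros_like(x)
        for k in range(1, self.order + 1):
            out = out + self.coeffs[k - 1] * h.pow(k)
        return out

# Usage: swap the activation inside a transformer MLP, e.g.
# nn.Sequential(nn.Linear(d, 4 * d), PolyComActivation(), nn.Linear(4 * d, d)).
```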
Authors: Denis Newman-Griffis, Bonnielin Swenor, Rupa Valdez, Gillian Mason
Abstract: Data are the medium through which individuals' identities and experiences are filtered in contemporary states and systems, and AI is increasingly the layer mediating between people, data, and decisions. The history of data and AI is often one of disability exclusion, oppression, and the reduction of disabled experience; left unchallenged, the current proliferation of AI and data systems thus risks further automating ableism behind the veneer of algorithmic neutrality. However, exclusionary histories do not preclude inclusive futures, and disability-led visions can chart new paths for collective action to achieve futures founded in disability justice. This chapter brings together four academics and disability advocates working at the nexus of disability, data, and AI, to describe achievable imaginaries for artificial intelligence and disability data justice. Reflecting diverse contexts, disciplinary perspectives, and personal experiences, we draw out the shape, actors, and goals of imagined future systems where data and AI support movement towards disability justice.
Authors: Khasmamad Shabanovi, Lukas Wiest, Vladimir Golkov, Daniel Cremers, Thomas Pfeil
Abstract: Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent and ignoring the knowledge of subsequent substructures as well as the task loss. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized blocks. The second incorporates knowledge of subsequent blocks by minimizing the error in downstream pre-activations rather than focusing solely on the quantized block. Our findings indicate that the effectiveness of these methods depends on the specific network model, with no impact on some models but demonstrating significant benefits for others.
Authors: Joseph Arul Raj, Linglong Qian, Zina Ibrahim
Abstract: Secondary research use of Electronic Health Records (EHRs) is often hampered by the abundance of missing data in this valuable resource. Missingness in EHRs occurs naturally as a result of the data recording practices during routine clinical care, but handling it is crucial to the precision of medical analysis and the decision-making that follows. The literature contains a variety of imputation methodologies based on deep neural networks. These aim to overcome the dynamic, heterogeneous and multivariate missingness patterns of EHRs, which cannot be handled by classical and statistical imputation methods. However, all existing deep imputation methods rely on end-to-end pipelines that incorporate both imputation and downstream analyses, e.g. classification. This coupling makes it difficult to assess the quality of imputation and takes away the flexibility of re-using the imputer for a different task. Furthermore, most end-to-end deep architectures tend to use complex networks to perform the downstream task, in addition to the already sophisticated deep imputation network. We therefore ask whether the high performance reported in the literature is due to the imputer or the classifier, and further ask whether, when an optimised state-of-the-art imputer is used, a simpler classifier can achieve comparable performance. This paper explores the development of a modular, deep learning-based imputation and classification pipeline, specifically built to leverage the capabilities of state-of-the-art imputation models for downstream classification tasks. Such a modular approach enables a) objective assessment of the quality of the imputer and classifier independently, and b) exploration of the performance of simpler classification architectures using an optimised imputer.
Authors: Ryan Campbell, Nelson Lojo, Kesava Viswanadha, Christoffer Grondal Tryggestad, Derrick Han Sun, Sriteja Vijapurapu, August Rolfsen, Anant Sahai
Abstract: In-Context Learning (ICL) is a phenomenon where task learning occurs through a prompt sequence without the necessity of parameter updates. ICL in Multi-Headed Attention (MHA) with absolute positional embedding has been the focus of more study than other sequence model varieties. We examine the implications of architectural differences between GPT-2 and LLaMa, as well as between LLaMa and Mamba. We extend work done by Garg et al. (2022) and Park et al. (2024) to GPT-2/LLaMa hybrid and LLaMa/Mamba hybrid models, examining the interplay between sequence transformation blocks and regressive performance in-context. We note that certain architectural changes cause degraded training efficiency/ICL accuracy by converging to suboptimal predictors or converging slower. We also find certain hybrids showing optimistic performance improvements, informing potential future ICL-focused architecture modifications. Additionally, we propose the "ICL regression score", a scalar metric describing a model's overall performance on a specific task. Compute limitations impose restrictions on our architecture-space, training duration, number of training runs, function class complexity, and benchmark complexity. To foster reproducible and extensible research, we provide a typed, modular, and extensible Python package on which we run all experiments.
Authors: Felipe Marra, Lucas N. Ferreira
Abstract: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
Authors: Yuhang Liu, Xueyu Hu, Shengyu Zhang, Jingyuan Chen, Fan Wu, Fei Wu
Abstract: Retrieval-Augmented Generation (RAG) has proven to be an effective method for mitigating hallucination issues inherent in large language models (LLMs). Previous approaches typically train retrievers based on semantic similarity, lacking optimization for RAG. More recent works have proposed aligning retrievers with the preference signals of LLMs. However, these preference signals are often difficult for dense retrievers, which typically have weaker language capabilities, to understand and learn effectively. Drawing inspiration from pedagogical theories like Guided Discovery Learning, we propose a novel framework, FiGRet (Fine-grained Guidance for Retrievers), which leverages the language capabilities of LLMs to construct examples from a more granular, information-centric perspective to guide the learning of retrievers. Specifically, our method utilizes LLMs to construct easy-to-understand examples from samples where the retriever performs poorly, focusing on three learning objectives highly relevant to the RAG scenario: relevance, comprehensiveness, and purity. These examples serve as scaffolding to ultimately align the retriever with the LLM's preferences. Furthermore, we employ a dual curriculum learning strategy and leverage the reciprocal feedback between LLM and retriever to further enhance the performance of the RAG system. A series of experiments demonstrate that our proposed framework enhances the performance of RAG systems equipped with different retrievers and is applicable to various LLMs.
Authors: Xinzheng Zhang, Yuqing Luo, Guopeng Li
Abstract: Automatic target recognition (ATR) is an important use case for synthetic aperture radar (SAR) image interpretation. Recent years have seen significant advancements in SAR ATR technology based on semi-supervised learning. However, existing semi-supervised SAR ATR algorithms show low recognition accuracy under class imbalance. This work offers a class-imbalanced semi-supervised SAR target recognition approach using dynamic energy scores and adaptive losses. First, an energy-score-based method is developed to dynamically select unlabeled samples near the training distribution and assign them pseudo-labels during training, ensuring pseudo-label reliability under long-tailed distributions. Second, loss functions suited to class imbalance are proposed: an adaptive margin perception loss and an adaptive hard triplet loss. The former offsets inter-class confusion in the classifier, alleviating the imbalance inherent in pseudo-label generation; the latter tackles the model's preference for the majority class by focusing on hard samples during training. Experimental results on extremely imbalanced SAR datasets demonstrate that the proposed method performs well under the dual constraints of scarce labels and data imbalance, effectively overcoming the model bias caused by data imbalance and achieving high-precision target recognition.
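For concreteness, the free-energy score commonly used for this kind of in-distribution selection is E(x) = -T log sum_j exp(f_j(x)/T), where lower energy indicates a sample closer to the training distribution. A minimal sketch of energy-based pseudo-label selection follows, with the paper's dynamic threshold schedule replaced by a fixed quantile (an assumption made for brevity):

```python
import numpy as np

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Free-energy score E(x) = -T * logsumexp(logits / T); lower values
    indicate samples closer to the training distribution."""
    z = logits / T
    m = z.max(axis=1, keepdims=True)  # stabilized logsumexp
    return -T * (m.squeeze(1) + np.log(np.exp(z - m).sum(axis=1)))

def select_pseudo_labels(logits: np.ndarray, quantile: float = 0.3):
    """Keep the `quantile` fraction of unlabeled samples with the lowest
    energy and pseudo-label them with the argmax class."""
    e = energy_score(logits)
    keep = e <= np.quantile(e, quantile)
    return np.flatnonzero(keep), logits[keep].argmax(axis=1)
```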
Authors: Quyet V. Do, Junze Li, Tung-Duong Vuong, Zhaowei Wang, Yangqiu Song, Xiaojuan Ma
Abstract: Commonsense datasets have been well developed in Natural Language Processing, mainly through crowdsourced human annotation. However, there are debates about the genuineness of commonsense reasoning benchmarks. Specifically, a significant portion of instances in some commonsense benchmarks do not concern commonsense knowledge, which undermines the measurement of the true commonsense reasoning ability of evaluated models. It has also been suggested that the problem originates from a blurry concept of commonsense knowledge, as distinguished from other types of knowledge. To examine these claims, in this study we survey existing definitions of commonsense knowledge, ground them in three frameworks for defining concepts, and consolidate them into a multi-framework unified definition of commonsense knowledge (the consolidated definition). We then use the consolidated definition for annotations and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets. Our study shows that both datasets contain a large portion of non-commonsense-knowledge instances, and that there is a large performance gap between the two subsets, with Large Language Models (LLMs) performing worse on the commonsense-knowledge instances.
Authors: Alexandros Gkillas, Aris Lalos
Abstract: Anomalies and missing data constitute a thorny problem in industrial applications. In recent years, deep-learning-based anomaly detection has emerged as a critical direction; however, the improved detection accuracy is achieved with large neural networks, increasing storage and computational cost. Moreover, the data collected on edge devices contain user privacy, a challenge that can be addressed by the privacy-preserving distributed paradigm known as federated learning (FL). This framework lets edge devices train and exchange models, but also increases communication cost. Thus, to address the communication, processing, and storage challenges of FL-based deep anomaly detection, neural network pruning is expected to bring significant benefits. With this focus, a novel compression-based optimization problem is proposed at the server side of an FL paradigm: it fuses the received local models and performs pruning, generating a more compressed model. Experiments on anomaly detection and missing-value imputation demonstrate that the proposed FL scenario, together with the proposed compression-based method, achieves high compression rates (more than $99.7\%$) with negligible performance losses (less than $1.18\%$) compared to centralized solutions.
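A minimal sketch of the server-side aggregate-then-compress step, with the paper's compression-based optimization stood in for by plain global magnitude pruning (an assumption made for brevity):

```python
import numpy as np

def aggregate_and_prune(local_models: list[dict], sparsity: float = 0.997) -> dict:
    """Average the received local models (FedAvg-style), then zero out the
    smallest-magnitude weights globally. Magnitude pruning is only a simple
    stand-in for the paper's server-side compression objective."""
    keys = local_models[0].keys()
    avg = {k: np.mean([m[k] for m in local_models], axis=0) for k in keys}
    all_w = np.concatenate([avg[k].ravel() for k in keys])
    cutoff = np.quantile(np.abs(all_w), sparsity)  # e.g., drop 99.7% of weights
    return {k: np.where(np.abs(v) >= cutoff, v, 0.0) for k, v in avg.items()}
```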
Authors: Ziji Shi, Jialin Li, Yang You
Abstract: Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.
Authors: Davide Buoso, Luke Robinson, Giuseppe Averta, Philip Torr, Tim Franzmeyer, Daniele De Martini
Abstract: This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches for path planning require extensive task-specific training or fine-tuning, we demonstrate how such training can be avoided in most practical cases. To do this, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning which completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring a fraction of the task-specific data typically used by trained models, or even relying only on online data. Our method facilitates the effective use of a generally trained VLM in a flexible and cost-efficient way, and requires no additional sensing beyond a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique enhances the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario and is comparable to trained models in the FPV one, with as few as 20 demonstrations.
Authors: Bharat Chandra Yalavarthi, Nalini Ratha
Abstract: In mission-critical domains such as law enforcement and medical diagnosis, the ability to explain and interpret the outputs of deep learning models is crucial for ensuring user trust and supporting informed decision-making. Despite advancements in explainability, existing methods often fall short in providing explanations that mirror the depth and clarity of those given by human experts. Such expert-level explanations are essential for the dependable application of deep learning models in law enforcement and medical contexts. Additionally, we recognize that most explanations in real-world scenarios are communicated primarily through natural language. Addressing these needs, we propose a novel approach that uses characteristic descriptors to explain model decisions by identifying their presence in images, thereby generating expert-like explanations. Our method incorporates a concept bottleneck layer within the model architecture, which computes the similarity between image and descriptor encodings to deliver inherent and faithful explanations. Through experiments in face recognition and chest X-ray diagnosis, we demonstrate that our approach compares favorably with existing techniques, which are often limited to saliency maps. We believe our approach represents a significant step toward making deep learning systems more accountable, transparent, and trustworthy in the critical domains of face recognition and medical diagnosis.
Authors: Fabio Pavirani, Jonas Van Gompel, Seyed Soroush Karimi Madahi, Bert Claessens, Chris Develder
Abstract: The growing reliance on renewable energy sources, particularly solar and wind, has introduced challenges due to their uncontrollable production. This complicates maintaining the electrical grid balance, prompting some transmission system operators in Western Europe to implement imbalance tariffs that penalize unsustainable power deviations. These tariffs create an implicit demand response framework to mitigate grid instability. Yet, several challenges limit active participation. In Belgium, for example, imbalance prices are only calculated at the end of each 15-minute settlement period, creating high risk due to price uncertainty. This risk is further amplified by the inherent volatility of imbalance prices, discouraging participation. Although transmission system operators provide minute-based price predictions, the system imbalance volatility makes accurate price predictions challenging to obtain and requires sophisticated techniques. Moreover, publishing price estimates can prompt participants to adjust their schedules, potentially affecting the system balance and the final price, adding further complexity. To address these challenges, we propose a Monte Carlo Tree Search method that publishes accurate imbalance prices while accounting for potential response actions. Our approach models the system dynamics using a neural network forecaster and a cluster of virtual batteries controlled by reinforcement learning agents. Compared to Belgium's current publication method, our technique improves price accuracy by 20.4% under ideal conditions and by 12.8% in more realistic scenarios. This research addresses an unexplored, yet crucial problem, positioning this paper as a pioneering work in analyzing the potential of more advanced imbalance price publishing techniques.
Authors: Alexandre Galashov, Michalis K. Titsias, Andr\'as Gy\"orgy, Clare Lyle, Razvan Pascanu, Yee Whye Teh, Maneesh Sahani
Abstract: Neural networks are traditionally trained under the assumption that data come from a stationary distribution. However, settings that violate this assumption are becoming more common; examples include supervised learning under distributional shifts, reinforcement learning, continual learning, and non-stationary contextual bandits. In this work we introduce a novel learning approach that automatically models and adapts to non-stationarity via an Ornstein-Uhlenbeck process with an adaptive drift parameter. The adaptive drift tends to draw the parameters towards the initialisation distribution, so the approach can be understood as a form of soft parameter reset. We show empirically that our approach performs well in non-stationary supervised and off-policy reinforcement learning settings.
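A discretized Ornstein-Uhlenbeck step whose mean is the initialisation makes the "soft parameter reset" reading concrete. This sketch keeps the drift fixed, whereas adapting the drift from data is the paper's contribution and is not reproduced here:

```python
import numpy as np

def ou_soft_reset(theta, theta_init, drift, sigma=0.0, dt=1.0, rng=None):
    """One Euler step of an Ornstein-Uhlenbeck process whose mean is the
    initialisation:
        theta <- theta + drift * (theta_init - theta) * dt + sigma*sqrt(dt)*eps
    With sigma = 0 this is a deterministic soft reset toward theta_init,
    applied alongside the usual gradient update."""
    rng = rng or np.random.default_rng()
    noise = sigma * np.sqrt(dt) * rng.standard_normal(theta.shape)
    return theta + drift * (theta_init - theta) * dt + noise

theta = np.ones(4)
theta_init = np.zeros(4)
print(ou_soft_reset(theta, theta_init, drift=0.1))  # pulled 10% toward init
```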
Authors: Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
Abstract: Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.
Authors: Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz
Abstract: Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity perception. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content, an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity and disagreement. The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
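A minimal sketch of the multitask setup described above, with a shared encoder, a toxicity head, and an auxiliary disagreement head; the backbone, loss weighting, and conformal-prediction layer are simplified assumptions:

```python
import torch
import torch.nn as nn

class ToxicityWithDisagreement(nn.Module):
    """Shared encoder with two heads: toxicity (primary task) and annotator
    disagreement rate (auxiliary task). The encoder here is a plain MLP over
    precomputed text embeddings, standing in for the paper's backbone."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.toxicity_head = nn.Linear(hidden, 2)       # toxic vs. non-toxic
        self.disagreement_head = nn.Linear(hidden, 1)   # disagreement in [0, 1]

    def forward(self, x):
        h = self.encoder(x)
        return self.toxicity_head(h), torch.sigmoid(self.disagreement_head(h))

def joint_loss(tox_logits, tox_y, dis_pred, dis_y, aux_weight=0.5):
    # Primary cross-entropy plus a weighted auxiliary regression term.
    return (nn.functional.cross_entropy(tox_logits, tox_y)
            + aux_weight * nn.functional.mse_loss(dis_pred.squeeze(-1), dis_y))
```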
Authors: Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz
Abstract: Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
Authors: Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Rina Panigrahy
Abstract: Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, yet we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We identify certain "planning" and "reasoning" circuits in the network that require cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systematically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.
Authors: Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu
Abstract: Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
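A minimal sketch of how self-consistency can be turned into training pairs: sample several answers to an unlabeled problem and prefer the most consistent over the least consistent. The paper's iterative training loop and vote-margin weighting are omitted here:

```python
from collections import Counter

def consistency_preference_pair(samples: list[str]):
    """Given multiple sampled answers to one unsupervised problem, return a
    (chosen, rejected) pair: the majority answer vs. the least consistent one.
    The pair can then feed any preference-optimization loss."""
    ranked = Counter(samples).most_common()
    chosen, rejected = ranked[0][0], ranked[-1][0]
    if chosen == rejected:  # all samples agree: no training signal
        return None
    return chosen, rejected

print(consistency_preference_pair(["42", "42", "41", "42", "40"]))
# -> ('42', '40') on this example
```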
Authors: Shreya Gummadi, Mateus V. Gasparino, Deepak Vasisht, Girish Chowdhary
Abstract: Centralized learning requires data to be aggregated at a central server, which poses significant challenges in terms of data privacy and bandwidth consumption. Federated learning presents a compelling alternative; however, vanilla federated learning methods deployed in robotics aim to learn a single global model that works ideally for all robots, and in practice one model may not be well suited for robots deployed in varied environments. This paper proposes Federated-EmbedCluster (Fed-EC), a clustering-based federated learning framework deployed with vision-based autonomous robot navigation in diverse outdoor environments. The framework addresses the key federated learning challenge of a single global model's performance deteriorating in the presence of non-IID data across real-world robots. Extensive real-world experiments validate that Fed-EC reduces the communication size by 23x for each robot while matching the performance of centralized learning for goal-oriented navigation and outperforming local learning. Fed-EC can also transfer previously learnt models to new robots that join the cluster.
Authors: Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst
Abstract: Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
Authors: Amir Rahimi, Vanessa D'Amario, Moyuru Yamada, Kentaro Takemoto, Tomotake Sasaki, Xavier Boix
Abstract: Systematic generalization is a crucial aspect of intelligence, which refers to the ability to generalize to novel tasks by combining known subtasks and concepts. One critical factor that has been shown to influence systematic generalization is the diversity of training data. However, diversity can be defined in various ways, as data have many factors of variation. A more granular understanding of how different aspects of data diversity affect systematic generalization is lacking. We present new evidence in the problem of Visual Question Answering (VQA) that reveals that the diversity of simple tasks (i.e. tasks formed by a few subtasks and concepts) plays a key role in achieving systematic generalization. This implies that it may not be essential to gather a large and varied number of complex tasks, which could be costly to obtain. We demonstrate that this result is independent of the similarity between the training and testing data and applies to well-known families of neural network architectures for VQA (i.e. monolithic architectures and neural module networks). Additionally, we observe that neural module networks leverage all forms of data diversity we evaluated, while monolithic architectures require more extensive amounts of data to do so. These findings provide a first step towards understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.
Authors: Wei Geng, Baidi Xiao, Rongpeng Li, Ning Wei, Dong Wang, Zhifeng Zhao
Abstract: Generally, a Reinforcement Learning (RL) agent updates its policy by repeatedly interacting with the environment, contingent on the rewards received for observed states and undertaken actions. However, environmental disturbance, which commonly leads to noisy observations (e.g., rewards and states), can significantly affect the performance of the agent. Furthermore, the learning performance of Multi-Agent Reinforcement Learning (MARL) is more susceptible to noise due to the interference among intelligent agents. Therefore, it becomes imperative to revise the design of MARL so as to mitigate the impact of noisy rewards. In this paper, we propose a novel decomposition-based multi-agent distributional RL method that approximates the globally shared noisy reward with a Gaussian mixture model (GMM) and decomposes it into a combination of individual distributional local rewards, with which each agent can be updated locally through distributional RL. Moreover, a diffusion model (DM) is leveraged for reward generation to mitigate the costly interactions required for learning distributions. Furthermore, the optimality of the distribution decomposition is theoretically validated, and the loss function is carefully calibrated to avoid decomposition ambiguity. We also verify the effectiveness of the proposed method through extensive simulation experiments with noisy rewards, and evaluate different risk-sensitive policies to demonstrate the superiority of distributional RL in different MARL tasks.
Authors: Yue Liu, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, Jon Whittle
Abstract: Foundation model-enabled generative artificial intelligence facilitates the development and implementation of agents, which can leverage distinguished reasoning and language processing capabilities to take a proactive, autonomous role in pursuing users' goals. Nevertheless, there is a lack of systematic knowledge to guide practitioners in designing agents in light of the challenges of goal-seeking (including generating instrumental goals and plans), such as hallucinations inherent in foundation models, the explainability of the reasoning process, complex accountability, etc. To address this issue, we performed a systematic literature review to understand state-of-the-art foundation model-based agents and the broader ecosystem. In this paper, we present a pattern catalogue consisting of 18 architectural patterns, with analyses of the context, forces, and trade-offs, as outcomes of that literature review. We also propose a decision model for selecting the patterns. The proposed catalogue provides holistic guidance for the effective use of the patterns and supports the architecture design of foundation model-based agents by facilitating goal-seeking and plan generation.
Authors: ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang
Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
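A minimal sketch of the CLIP-style objective for one image-DNA pair of modalities; extending to the taxonomic-text modality amounts to summing the pairwise losses (the three-way weighting is an assumption):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, dna_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and DNA-barcode
    embeddings, where row i of each tensor corresponds to the same specimen.
    Matching pairs are pulled together; mismatched pairs are pushed apart."""
    img = F.normalize(image_emb, dim=-1)
    dna = F.normalize(dna_emb, dim=-1)
    logits = img @ dna.t() / temperature
    targets = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(8, 128), torch.randn(8, 128))
```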
Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li
Abstract: In this paper, we introduce LiveMind, a novel low-latency inference framework for large language models (LLMs) that enables inference with incomplete user input. By reallocating computational processes to the input phase, a substantial reduction in latency is achieved, significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming input to the model, allowing it to infer from incomplete user input or await additional content. Compared with traditional inference on complete user input, our approach demonstrates an average reduction in response latency of 84.0% on the MMLU dataset and 71.6% on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing a large LLM for inference and a small LLM for output, we achieve an average 37% reduction in response latency, alongside a 4.30% improvement in accuracy on the MMLU-Pro dataset compared with the baseline. The proposed LiveMind framework advances the field of human-AI interaction by enabling more responsive and efficient communication between users and AI systems.
Authors: Jiaxin Deng, Shiyao Wang, Song Lu, Yinfeng Li, Xinchen Luo, Yuanjun Liu, Peixing Xu, Guorui Zhou
Abstract: State-of-the-art sequential recommendation models rely heavily on the transformer attention mechanism. However, the quadratic computational and memory complexity of self-attention limits its scalability for modeling users' long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, which requires only linear time complexity and decouples model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism, which compress long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7x inference speed-up. The proposed linear dispatcher attention mechanism significantly reduces the quadratic complexity and makes the model feasible for adequately modeling extremely long sequences. Moreover, to retain the capacity for modeling various user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with negligible computational overhead. This interest memory retrieval technique significantly expands the cardinality of the available interest space at the same computational cost, striking a balance between recommendation accuracy and efficiency. To validate the effectiveness of ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and highlight its computational efficiency when modeling long sequences. We will make our implementation code publicly available.
Authors: Alfredo Ibias, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart, Eduard Alarcon
Abstract: Cognitive Architectures are at the forefront of research into developing artificial cognition. However, they approach the problem with a model of computation that separates memory and program, which poses a fundamental problem: the knowledge-retrieval heuristic. In this paper we propose to solve this problem with a novel model of computation in which memory and program are merged: the Function-Representation. This model of computation involves defining a generic Function-Representation and instantiating multiple instances of it. We explore the potential of this novel model of computation through mathematical definitions and proofs, examine the kinds of functions a Function-Representation can implement, and present different ways to organise multiple instances of a Function-Representation.
Authors: Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Qingzhen Liu, Dawei Li, Zhikai Chen, Xiaoze Liu, Liangming Pan
Abstract: Causal reasoning (CR) is a crucial aspect of intelligence, essential for problem-solving, decision-making, and understanding the world. While large language models (LLMs) can generate rationales for their outputs, their ability to reliably perform causal reasoning remains uncertain, often falling short in tasks requiring a deep understanding of causality. In this survey, we provide a comprehensive review of research aimed at enhancing LLMs for causal reasoning. We categorize existing methods based on the role of LLMs: either as reasoning engines or as helpers providing knowledge or data to traditional CR methods, followed by a detailed discussion of the methodologies in each category. We then evaluate the performance of LLMs on various causal reasoning tasks, providing key findings and in-depth analysis. Finally, we provide insights from current studies and highlight promising directions for future research. We aim for this work to serve as a comprehensive resource, fostering further advancements in causal reasoning with LLMs. Resources are available at https://github.com/chendl02/Awesome-LLM-causal-reasoning.
URLs: https://github.com/chendl02/Awesome-LLM-causal-reasoning.
Authors: Diana Pfau, Alexander Jung
Abstract: AI systems increasingly shape critical decisions across personal and societal domains. While empirical risk minimization (ERM) drives much of the AI success, it typically prioritizes accuracy over trustworthiness, often resulting in biases, opacity, and other adverse effects. This paper discusses how key requirements for trustworthy AI can be translated into design choices for the components of ERM. We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
Authors: Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang
Abstract: Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.
Authors: Erik M. Lintunen, Nadia M. Ady, Christian Guckelsberger
Abstract: Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
Authors: Temitope Akinboyewa, Zhenlong Li, Huan Ning, M. Naser Lessani
Abstract: Recent advancements in Generative AI offer promising capabilities for spatial analysis. Despite their potential, the integration of generative AI with established GIS platforms remains underexplored. In this study, we propose a framework for integrating LLMs directly into existing GIS platforms, using QGIS as an example. Our approach leverages the reasoning and programming capabilities of LLMs to autonomously generate spatial analysis workflows and code through an informed agent that has comprehensive documentation of key GIS tools and parameters. The implementation of this framework resulted in a "GIS Copilot" that allows GIS users to interact with QGIS using natural language commands for spatial analysis. The GIS Copilot was evaluated at three complexity levels: basic tasks that require one GIS tool and typically involve one data layer to perform simple operations; intermediate tasks involving multi-step processes with multiple tools, guided by user instructions; and advanced tasks involving multi-step processes that require multiple tools but are not guided by user instructions, requiring the agent to independently decide on and execute the necessary steps. The evaluation reveals that the GIS Copilot demonstrates strong potential in automating foundational GIS operations, with a high success rate in tool selection and code generation for basic and intermediate tasks, while challenges remain in achieving full autonomy for more complex tasks. This study contributes to the emerging vision of Autonomous GIS, providing a pathway for non-experts to engage with geospatial analysis with minimal prior expertise. While full autonomy is yet to be achieved, the GIS Copilot demonstrates significant potential for simplifying GIS workflows and enhancing decision-making processes.
Authors: Xubo Lyu, Site Li, Seth Siriya, Ye Pu, Mo Chen
Abstract: In this paper, a novel optimal control-based baseline function is presented for the policy gradient method in deep reinforcement learning (RL). The baseline is obtained by computing the value function of an optimal control problem constructed to be closely associated with the RL task. In contrast to the traditional baseline, aimed at variance reduction of policy gradient estimates, our work uses the optimal control value function to introduce a novel aspect of the baseline's role: providing guided exploration during policy learning. This aspect is less discussed in prior works. We validate our baseline on robot learning tasks, showing its effectiveness in guided exploration, particularly in sparse-reward environments.
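A minimal sketch of the policy-gradient weighting with such a baseline, assuming the optimal control value function V*(s) has been computed separately by solving the companion control problem:

```python
import numpy as np

def pg_loss_with_oc_baseline(log_probs, returns, v_star):
    """REINFORCE-style surrogate loss with an optimal-control baseline:
    weight_t = G_t - V*(s_t). States where the companion control problem
    says high returns are achievable but the policy under-delivers receive
    negative weights, which is one reading of 'guided exploration'."""
    advantages = np.asarray(returns) - np.asarray(v_star)
    return -(np.asarray(log_probs) * advantages).sum()  # loss to minimize

loss = pg_loss_with_oc_baseline(
    log_probs=[-0.1, -0.3, -0.2], returns=[1.0, 0.5, 0.0], v_star=[0.8, 0.8, 0.8]
)
```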
Authors: Satya Narayan Shukla, Benjamin M. Marlin
Abstract: Irregularly sampled time series commonly occur in several domains where they present a significant challenge to standard deep learning models. In this paper, we propose a new deep learning framework for probabilistic interpolation of irregularly sampled time series that we call the Heteroscedastic Temporal Variational Autoencoder (HeTVAE). HeTVAE includes a novel input layer to encode information about input observation sparsity, a temporal VAE architecture to propagate uncertainty due to input sparsity, and a heteroscedastic output layer to enable variable uncertainty in output interpolations. Our results show that the proposed architecture is better able to reflect variable uncertainty through time due to sparse and irregular sampling than a range of baseline and traditional models, as well as recently proposed deep latent variable models that use homoscedastic output layers.
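A minimal sketch of a heteroscedastic Gaussian output layer of the kind the abstract describes; HeTVAE's sparsity-encoding input layer and temporal VAE are not reproduced here:

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Predicts a per-time-point mean and log-variance so that interpolation
    uncertainty can vary with local observation sparsity, unlike a
    homoscedastic head with a single shared variance."""
    def __init__(self, hidden: int):
        super().__init__()
        self.mean = nn.Linear(hidden, 1)
        self.log_var = nn.Linear(hidden, 1)

    def forward(self, h):
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target):
    # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), up to an additive constant.
    return 0.5 * (log_var + (target - mean).pow(2) / log_var.exp()).mean()
```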
Authors: Fernando Zhapa-Camacho, Robert Hoehndorf
Abstract: Generating vector representations (embeddings) of OWL ontologies is a growing task due to its applications in predicting missing facts and knowledge-enhanced learning in fields such as bioinformatics. The underlying semantics of OWL ontologies are expressed using Description Logics (DLs). Initial approaches to generate embeddings relied on constructing a graph out of ontologies, neglecting the semantics of the logic therein. Recent semantic-preserving embedding methods often target lightweight DL languages like $\mathcal{EL}^{++}$, ignoring more expressive information in ontologies. Although some approaches aim to embed more descriptive DLs like $\mathcal{ALC}$, those methods require the existence of individuals, while many real-world ontologies are devoid of them. We propose an ontology embedding method for the $\mathcal{ALC}$ DL language that considers the lattice structure of concept descriptions. We use connections between DL and Category Theory to materialize the lattice structure and embed it using an order-preserving embedding method. We show that our method outperforms state-of-the-art methods in several knowledge base completion tasks. Furthermore, we incoporate saturation procedures that increase the information within the constructed lattices. We make our code and data available at \url{https://github.com/bio-ontology-research-group/catE}.
Authors: Kenta Oono, Nontawat Charoenphakdee, Kotatsu Bito, Zhengyan Gao, Hideyoshi Igata, Masashi Yoshikawa, Yoshiaki Ota, Hiroki Okui, Kei Akita, Shoichiro Yamaguchi, Yohei Sugawara, Shin-ichi Maeda, Kunihiko Miyoshi, Yuki Saito, Koki Tsuda, Hiroshi Maruyama, Kohei Hayashi
Abstract: Identifying the relationship between healthcare attributes, lifestyles, and personality is vital for understanding and improving physical and mental well-being. Machine learning approaches are promising for modeling their relationships and offering actionable suggestions. In this paper, we propose Virtual Human Generative Model (VHGM), a machine learning model for estimating healthcare, lifestyles, and personality attributes. VHGM is a deep generative model trained with masked modeling to learn the joint distribution of attributes conditioned on known ones. Using heterogeneous tabular datasets, VHGM learns more than 2,000 attributes efficiently. We numerically evaluate the performance of VHGM and its training techniques and have deployed VHGM as a Web service, enabling various healthcare applications.
Authors: Moses Charikar, Prasanna Ramakrishnan, Kangning Wang, Hongxun Wu
Abstract: We consider the following well-studied problem of metric distortion in social choice. Suppose we have an election with $n$ voters and $m$ candidates located in a shared metric space. We would like to design a voting rule that chooses a candidate whose average distance to the voters is small. However, instead of having direct access to the distances in the metric space, the voting rule obtains, from each voter, a ranked list of the candidates in order of distance. Can we design a rule that regardless of the election instance and underlying metric space, chooses a candidate whose cost differs from the true optimum by only a small factor (known as the distortion)? A long line of work culminated in finding optimal deterministic voting rules with metric distortion $3$. However, for randomized voting rules, there is still a gap in our understanding: Even though the best lower bound is $2.112$, the best upper bound is still $3$, attained even by simple rules such as Random Dictatorship. Finding a randomized rule that guarantees distortion $3 - \epsilon$ has been a major challenge in computational social choice, as prevalent approaches to designing voting rules are known to be insufficient. Such a voting rule must use information beyond aggregate comparisons between pairs of candidates, and cannot only assign positive probability to candidates that are voters' top choices. In this work, we give a rule that guarantees distortion less than $2.753$. To do so we study a handful of voting rules that are new to the problem. One is Maximal Lotteries, a rule based on the Nash equilibrium of a natural zero-sum game which dates back to the 60's. The others are novel rules that can be thought of as hybrids of Random Dictatorship and the Copeland rule. Though none of these rules can beat distortion $3$ alone, a randomization between Maximal Lotteries and any of the novel rules can.
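For reference, the standard definition of metric distortion that the abstract relies on can be written as follows (the notation is ours, not the paper's):

```latex
% Distortion of a (possibly randomized) voting rule f: over all elections E
% and all metrics d consistent with the reported rankings, compare the
% expected cost of the chosen candidate to the optimum.
\[
\mathrm{dist}(f) \;=\; \sup_{E}\;\sup_{d \,\triangleright\, E}\;
\frac{\mathbb{E}_{w \sim f(E)}\!\left[\sum_{v \in V} d(v, w)\right]}
     {\min_{c \in C} \sum_{v \in V} d(v, c)},
\]
% where V is the set of voters, C the set of candidates, and d \triangleright E
% ranges over metrics that induce the rankings reported in E.
```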
Authors: Ziqi Zhang, Jingzehua Xu, Zifeng Zhuang, Hongyin Zhang, Jinxin Liu, Donglin wang, Shuai Zhang
Abstract: Proximal Policy Optimization (PPO) has been broadly applied to robotics learning, showcasing stable training performance. However, the fixed clipping bound setting may limit the performance of PPO. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process, and previous research suggests that a fixed clipping bound restricts the policy's ability to explore. Therefore, many past studies have aimed to dynamically adjust the PPO clipping bound to enhance PPO's performance. However, the objectives of these approaches are not directly aligned with the objective of reinforcement learning (RL) tasks, which is to maximize the cumulative Return. Unlike previous clipping approaches, we propose a bi-level proximal policy optimization objective that can dynamically adjust the clipping bound to better reflect the preference (maximizing Return) of these RL tasks. Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference-based Proximal Policy Optimization (Pb-PPO). Pb-PPO utilizes a multi-armed bandit approach to reflect the RL preference, recommending the clipping bound for PPO that maximizes the current Return. As a result, Pb-PPO achieves greater stability and improved performance compared to PPO with a fixed clipping bound. We test Pb-PPO on locomotion benchmarks across multiple environments, including Gym-Mujoco and legged-gym, and additionally validate it on customized navigation tasks. We also compare against PPO with various fixed clipping bounds and with various clipping approaches. The experimental results indicate that Pb-PPO demonstrates superior training performance compared to PPO and its variants. Our codebase has been released at: https://github.com/stevezhangzA/pb_ppo
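A minimal sketch of a bandit over candidate clipping bounds in the spirit described above, with UCB1 as a stand-in and the (normalized) Return as the arm reward; Pb-PPO's exact bandit variant and bi-level objective are not reproduced here:

```python
import math

class ClipBoundBandit:
    """UCB1 over a discrete set of candidate PPO clipping bounds. Each PPO
    iteration: recommend() a bound, train with it, then update() with the
    normalized Return achieved under that bound."""
    def __init__(self, bounds=(0.1, 0.2, 0.3)):
        self.bounds = list(bounds)
        self.counts = [0] * len(bounds)
        self.values = [0.0] * len(bounds)

    def recommend(self) -> float:
        for i, n in enumerate(self.counts):
            if n == 0:
                return self.bounds[i]  # try every arm once first
        total = sum(self.counts)
        ucb = [v + math.sqrt(2 * math.log(total) / n)
               for v, n in zip(self.values, self.counts)]
        return self.bounds[max(range(len(ucb)), key=ucb.__getitem__)]

    def update(self, bound: float, normalized_return: float):
        i = self.bounds.index(bound)
        self.counts[i] += 1
        # Incremental mean of returns observed under this clipping bound.
        self.values[i] += (normalized_return - self.values[i]) / self.counts[i]
```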
Authors: Matthew Bone, Eugenia Ehlinger, Fabian Stephany
Abstract: Emerging professions in fields like Artificial Intelligence (AI) and sustainability (green jobs) are experiencing labour shortages as industry demand outpaces labour supply. In this context, our study aims to understand whether employers have begun focusing more on individual skills rather than formal qualifications in their recruitment processes. We analysed a large time-series dataset of approximately eleven million online job vacancies in the UK from 2018 to mid-2024, drawing on diverse literature on technological change and labour market signalling. Our findings provide evidence that employers have initiated "skill-based hiring" for AI roles, adopting more flexible hiring practices to expand the available talent pool. From 2018-2023, demand for AI roles grew by 21% as a proportion of all postings (and accelerated into 2024). Simultaneously, mentions of university education requirements for AI roles declined by 15%. Our regression analysis shows that university degrees have a significantly lower wage premium for both AI and green roles. In contrast, AI skills command a wage premium of 23%, exceeding the value of degrees up until the PhD-level (33%). In occupations with high demand for AI skills, the premium for skills is high, and the reward for degrees is relatively low. We recommend leveraging alternative skill-building formats such as apprenticeships, on-the-job training, MOOCs, vocational education and training, micro-certificates, and online bootcamps to fully utilise human capital and address talent shortages.
Authors: Peihua Mai, Youjia Yang, Ran Yan, Rui Ye, Yan Pang
Abstract: State-of-the-art large language models (LLMs) are typically deployed as online services, requiring users to transmit detailed prompts to cloud servers. This raises significant privacy concerns. In response, we introduce ConfusionPrompt, a novel framework for private LLM inference that protects user privacy by: (i) decomposing the original prompt into smaller sub-prompts, and (ii) generating pseudo-prompts alongside the genuine sub-prompts, which are then sent to the LLM. The server responses are later recomposed by the user to reconstruct the final output. This approach offers key advantages over previous LLM privacy protection methods: (i) it integrates seamlessly with existing black-box LLMs, and (ii) it delivers a significantly improved privacy-utility trade-off compared to existing text perturbation methods. We also develop a $(\lambda, \mu, \rho)$-privacy model to formulate the requirements for a privacy-preserving group of prompts and provide a complexity analysis to justify the role of prompt decomposition. Our empirical evaluation shows that ConfusionPrompt achieves significantly higher utility than local inference methods using open-source models and perturbation-based techniques, while also reducing memory consumption compared to open-source LLMs.
Authors: Liwei Hu, Wenyong Wang, Yu Xiang, Stefan Sommer
Abstract: We propose to incorporate Riemannian geometric features derived from the geometry of aircraft wing surfaces into the prediction of the coefficient of pressure (CP) on the aircraft wing. Contrary to existing approaches that treat the wing surface as a flat object, we represent the wing as a piecewise smooth manifold and calculate a set of Riemannian geometric features (Riemannian metric, connection, and curvature) over points of the wing. Combining these features in neighborhoods of points on the wing with coordinates and flight conditions gives inputs to a deep learning model that predicts CP distributions. Experimental results show that incorporating Riemannian geometric features, compared to the state-of-the-art Deep Attention Network (DAN), reduces the mean squared error (MSE) of predicted CP by an average of 15.00% on the DLR-F11 aircraft test set.
Authors: Goitom Ybrah Hailu, Hadush Hailu, Shishay Welay
Abstract: University students often spend a considerable amount of time seeking answers to common questions from administrators or teachers. This can become tedious for both parties, leading to a need for a solution. In response, this paper proposes a chatbot model that utilizes natural language processing and deep learning techniques to answer frequently asked questions (FAQs) in the Amharic language. Chatbots are computer programs that simulate human conversation through the use of artificial intelligence (AI), acting as a virtual assistant to handle questions and other tasks. The proposed chatbot program employs tokenization, normalization, stop word removal, and stemming to analyze and categorize Amharic input sentences. Three machine learning model algorithms were used to classify tokens and retrieve appropriate responses: Support Vector Machine (SVM), Multinomial Na\"ive Bayes, and deep neural networks implemented through TensorFlow, Keras, and NLTK. The deep learning model achieved the best results with 91.55% accuracy and a validation loss of 0.3548 using an Adam optimizer and SoftMax activation function. The chatbot model was integrated with Facebook Messenger and deployed on a Heroku server for 24-hour accessibility. The experimental results demonstrate that the chatbot framework achieved its objectives and effectively addressed challenges such as Amharic Fidel variation, morphological variation, and lexical gaps. Future research could explore the integration of Amharic WordNet to narrow the lexical gap and support more complex questions.
Authors: Behzad Shayegh, Yuqiao Wen, Lili Mou
Abstract: We address unsupervised discontinuous constituency parsing, where we observe a high variance in the performance of the only previous model in the literature. We propose to build an ensemble of different runs of the existing discontinuous parser by averaging the predicted trees, to stabilize and boost performance. To begin with, we provide comprehensive computational complexity analysis (in terms of P and NP-complete) for tree averaging under different setups of binarity and continuity. We then develop an efficient exact algorithm to tackle the task, which runs in a reasonable time for all samples in our experiments. Results on three datasets show our method outperforms all baselines in all metrics; we also provide in-depth analyses of our approach.
Authors: Jonas Weidner, Ivan Ezhov, Michal Balcerak, Marie-Christin Metz, Sergey Litvinov, Sebastian Kaltenbach, Leonhard Feiner, Laurin Lux, Florian Kofler, Jana Lipkova, Jonas Latz, Daniel Rueckert, Bjoern Menze, Benedikt Wiestler
Abstract: Biophysical modeling, particularly involving partial differential equations (PDEs), offers significant potential for tailoring disease treatment protocols to individual patients. However, the inverse problem-solving aspect of these models presents a substantial challenge, either due to the high computational requirements of model-based approaches or the limited robustness of deep learning (DL) methods. We propose a novel framework that leverages the unique strengths of both approaches in a synergistic manner. Our method incorporates a DL ensemble for initial parameter estimation, facilitating efficient downstream evolutionary sampling initialized with this DL-based prior. We showcase the effectiveness of integrating a rapid deep-learning algorithm with a high-precision evolution strategy in estimating brain tumor cell concentrations from magnetic resonance images. The DL-Prior plays a pivotal role, significantly constraining the effective sampling-parameter space. This reduction results in a fivefold convergence acceleration and a Dice-score of 95%.
Authors: Kimon Protopapas, Anas Barakat
Abstract: Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms such as natural policy gradient, with connections with state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice and recent remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterpart. In this work, we propose a new class of PMD algorithms called $h$-PMD which incorporates multi-step greedy policy improvement with lookahead depth $h$ to the PMD update rule. To solve discounted infinite horizon Markov Decision Processes with discount factor $\gamma$, we show that $h$-PMD which generalizes the standard PMD enjoys a faster dimension-free $\gamma^h$-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of $h$-PMD where lookahead action values are estimated. Under a generative model, we establish a sample complexity for $h$-PMD which improves over prior work. Finally, we extend our result to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity only involves dependence on the dimension of the feature map space instead of the state space size.
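Schematically, and with notation assumed rather than taken verbatim from the paper, the h-PMD update replaces the action values in the mirror-descent step with h-step lookahead values:

```latex
% Lookahead action values: h-1 applications of the Bellman optimality
% operator T to the current value function, followed by one Bellman backup.
\[
Q^{\pi_t}_h(s,a) \;=\; r(s,a) + \gamma\,
\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[(T^{\,h-1} V^{\pi_t})(s')\big],
\]
% Mirror-descent step against the lookahead values, with Bregman divergence D
% (the KL divergence recovers natural policy gradient). Setting h = 1 gives
% standard PMD; larger h yields the faster gamma^h-linear rate.
\[
\pi_{t+1}(\cdot \mid s) \;\in\; \arg\max_{p \in \Delta(\mathcal{A})}
\Big\{ \big\langle Q^{\pi_t}_h(s,\cdot),\, p \big\rangle
- \tfrac{1}{\eta_t}\, D\big(p,\, \pi_t(\cdot \mid s)\big) \Big\}.
\]
```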
Authors: Arash Badie-Modiri, Chiara Boldrini, Lorenzo Valerio, J\'anos Kert\'esz, M\'arton Karsai
Abstract: Fully decentralised federated learning enables collaborative training of individual machine learning models on a distributed network of communicating devices while keeping the training data localised on each node. This approach avoids central coordination, enhances data privacy and eliminates the risk of a single point of failure. Our research highlights that the effectiveness of decentralised federated learning is significantly influenced by the network topology of connected devices and the learning models' initial conditions. We propose a strategy for uncoordinated initialisation of the artificial neural networks based on the distribution of eigenvector centralities of the underlying communication network, leading to a radically improved training efficiency. Additionally, our study explores the scaling behaviour and the choice of environmental parameters under our proposed initialisation strategy. This work paves the way for more efficient and scalable artificial neural network training in a distributed and uncoordinated environment, offering a deeper understanding of the intertwining roles of network structure and learning dynamics.
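A minimal sketch of one plausible reading of the initialisation strategy: scale each node's initial-weight spread by its eigenvector centrality in the communication graph. The exact centrality-to-variance mapping is an assumption; the abstract only states that initialisation follows the distribution of eigenvector centralities:

```python
import networkx as nx
import numpy as np

def centrality_scaled_init(graph: nx.Graph, shape, base_std=0.02, rng=None):
    """Draw each node's initial weights with a standard deviation proportional
    to that node's eigenvector centrality, normalized by the mean centrality
    so the average scale stays at base_std."""
    rng = rng or np.random.default_rng()
    centrality = nx.eigenvector_centrality_numpy(graph)
    mean_c = np.mean(list(centrality.values()))
    return {node: rng.normal(0.0, base_std * c / mean_c, size=shape)
            for node, c in centrality.items()}

weights = centrality_scaled_init(nx.erdos_renyi_graph(20, 0.3, seed=0), (8, 8))
```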
Authors: Jialin Chen, Jan Eric Lenssen, Aosong Feng, Weihua Hu, Matthias Fey, Leandros Tassiulas, Jure Leskovec, Rex Ying
Abstract: Time series forecasting has attracted significant attention in recent decades. Previous studies have demonstrated that the Channel-Independent (CI) strategy improves forecasting performance by treating different channels individually, but it leads to poor generalization on unseen instances and ignores potentially necessary interactions between channels. Conversely, the Channel-Dependent (CD) strategy mixes all channels, even with irrelevant and indiscriminate information, which results in oversmoothing issues and limits forecasting accuracy. There has been a lack of a channel strategy that effectively balances individual channel treatment for improved forecasting performance without overlooking essential interactions between channels. Motivated by our observation of a correlation between a time series model's performance boost from channel mixing and the intrinsic similarity of a pair of channels, we developed a novel and adaptable Channel Clustering Module (CCM). CCM dynamically groups channels characterized by intrinsic similarities and leverages cluster information instead of individual channel identities, combining the best of the CD and CI worlds. Extensive experiments on real-world datasets demonstrate that CCM can (1) boost the performance of CI and CD models by an average margin of 2.4% and 7.2% on long-term and short-term forecasting, respectively; (2) enable zero-shot forecasting with mainstream time series forecasting models; (3) uncover intrinsic time series patterns among channels and improve interpretability of complex time series models.
Authors: Tuomas Kynk\"a\"anniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen
Abstract: Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance.
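A hedged sketch of restricting guidance to a noise interval inside a generic sampling loop (the denoiser, weight, and interval bounds are illustrative placeholders, not the paper's values):

    # Apply classifier-free guidance only at intermediate noise levels.
    def guided_prediction(denoise, x, sigma, cond, w=3.0, sigma_lo=0.3, sigma_hi=5.0):
        pred_c = denoise(x, sigma, cond)           # conditional prediction
        if sigma_lo <= sigma <= sigma_hi:          # guide only in the middle of the chain
            pred_u = denoise(x, sigma, None)       # unconditional prediction
            return pred_u + w * (pred_c - pred_u)
        return pred_c                              # no guidance at very high/low noise

Outside the interval the unconditional branch is skipped entirely, which is where the inference-speed gain comes from.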
Authors: Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki
Abstract: The development of Large Language Models (LLMs) in various languages has been advancing, but the combination of non-English languages with domain-specific contexts remains underexplored. This paper presents our findings from training and evaluating a Japanese business domain-specific LLM designed to better understand business-related documents, such as news on current affairs, technical reports, and patents. Additionally, LLMs in this domain require regular updates to incorporate the most recent knowledge. Therefore, we also report our findings from the first experiments and evaluations involving updates to this LLM using the latest article data, an important problem setting that has not been addressed in previous research. From our experiments on a newly created benchmark dataset for question answering in the target domain, we found that (1) our pretrained model improves QA accuracy without losing general knowledge, and (2) a proper mixture of the latest and older texts in the training data for the update is necessary. Our pretrained model and business domain benchmark are publicly available to support further studies.
Authors: Azarakhsh Jalalvand, SangKyeun Kim, Jaemin Seo, Qiming Hu, Max Curie, Peter Steiner, Andrew Oakleigh Nelson, Yong-Su Na, Egemen Kolemen
Abstract: A non-linear system governed by multi-spatial and multi-temporal physics scales cannot be fully understood with a single diagnostic, as each provides only a partial view, leading to information loss. Combining multiple diagnostics may also result in incomplete projections of the system's physics. By identifying hidden inter-correlations between diagnostics, we can leverage mutual support to fill in these gaps, but uncovering such correlations analytically is too complex. We introduce a machine learning methodology to address this issue. Unlike traditional methods, our multimodal approach does not rely on the target diagnostic's direct measurements to generate its super-resolution version. Instead, it uses other diagnostics to produce super-resolution data, capturing detailed structural evolution and responses to perturbations that were previously unobservable. This not only enhances the resolution of a diagnostic for deeper insights but also reconstructs the target diagnostic, providing a valuable tool to mitigate diagnostic failure. This methodology addresses a key challenge in fusion plasmas: the Edge Localized Mode (ELM), a plasma instability that can cause significant erosion of plasma-facing materials. One method to stabilize ELMs is to use resonant magnetic perturbations (RMPs) to trigger magnetic islands. However, limited spatial and temporal resolution restricts the analysis of these islands due to their small size, rapid dynamics, and complex plasma interactions. With super-resolution diagnostics, we can experimentally verify theoretical models of magnetic islands for the first time, providing insights into their role in ELM stabilization. This advancement supports the development of effective ELM suppression strategies for future fusion reactors like ITER and has broader applications, potentially revolutionizing diagnostics in fields such as astronomy, astrophysics, and medical imaging.
Authors: Yifan Wu, Lutao Yan, Leixian Shen, Yunhai Wang, Nan Tang, Yuyu Luo
Abstract: Chart question answering (ChartQA) tasks play a critical role in interpreting and extracting insights from visualization charts. While recent advancements in multimodal large language models (MLLMs) like GPT-4o have shown promise in high-level ChartQA tasks, such as chart captioning, their effectiveness in low-level ChartQA tasks (e.g., identifying correlations) remains underexplored. In this paper, we address this gap by evaluating MLLMs on low-level ChartQA using a newly curated dataset, ChartInsights, which consists of 22,347 (chart, task, query, answer) quadruplets covering 10 data analysis tasks across 7 chart types. We systematically evaluate 19 advanced MLLMs, including 12 open-source and 7 closed-source models. The average accuracy rate across these models is 39.8%, with GPT-4o achieving the highest accuracy at 69.17%. To further explore the limitations of MLLMs in low-level ChartQA, we conduct experiments that alter visual elements of charts (e.g., changing color schemes, adding image noise) to assess their impact on task effectiveness. Furthermore, we propose a new textual prompt strategy, Chain-of-Charts, tailored for low-level ChartQA tasks, which boosts performance by 14.41%, achieving an accuracy of 83.58%. Finally, incorporating a visual prompt strategy that directs attention to relevant visual elements further improves accuracy to 84.32%.
Authors: Renato M. Silva, Greg\'orio F. Azevedo, Matheus V. V. Berto, Jean R. Rocha, Eduardo C. Fidelis, Matheus V. Nogueira, Pedro H. Lisboa, Tiago A. Almeida
Abstract: Traffic incidents involving vulnerable road users (VRUs) constitute a significant proportion of global road accidents. Advances in traffic communication ecosystems, coupled with sophisticated signal processing and machine learning techniques, have facilitated the utilization of data from diverse sensors. Despite these advancements and the availability of extensive datasets, substantial progress is required to mitigate traffic casualties. This paper provides a comprehensive survey of state-of-the-art technologies and methodologies to enhance the safety of VRUs. The study delves into the communication networks between vehicles and VRUs, emphasizing the integration of advanced sensors and the availability of relevant datasets. It explores preprocessing techniques and data fusion methods to enhance sensor data quality. Furthermore, our study assesses critical simulation environments essential for developing and testing VRU safety systems. Our research also highlights recent advances in VRU detection and classification algorithms, addressing challenges such as variable environmental conditions. Additionally, we cover cutting-edge research in predicting VRU intentions and behaviors, which is crucial for proactive collision avoidance strategies. Through this survey, we aim to provide a comprehensive understanding of the current landscape of VRU safety technologies, identifying areas of progress and areas needing further research and development.
Authors: Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith
Abstract: Reward Machines provide an automaton-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing the underlying structure of a reward function, they enable the decomposition of an RL task, leading to impressive gains in sample efficiency. Although Reward Machines and similar formal specifications have a rich history of application towards sequential decision-making problems, they critically rely on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function--such ground-truth interpretations are elusive in the real world due in part to partial observability and noisy sensing. In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary.
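A minimal reward machine sketch (the labelling function that maps raw observations to propositions is precisely the noisy component the paper studies; here it is assumed given):

    # A reward machine: a finite automaton over abstract propositions.
    class RewardMachine:
        def __init__(self, transitions, rewards, u0=0):
            self.transitions = transitions   # (state, proposition) -> next state
            self.rewards = rewards           # (state, proposition) -> reward
            self.u = u0                      # current automaton state

        def step(self, proposition):
            r = self.rewards.get((self.u, proposition), 0.0)
            self.u = self.transitions.get((self.u, proposition), self.u)
            return r

    # Example: reward 1 for reaching "goal" only after observing "key".
    rm = RewardMachine({(0, "key"): 1, (1, "goal"): 2}, {(1, "goal"): 1.0})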
Authors: Xuehui Yu, Mhairi Dunion, Xin Li, Stefano V. Albrecht
Abstract: Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviour). Using context encoders based on contrastive learning to enhance the generalisability of Meta-RL agents is now widely studied but faces challenges such as the requirement for a large sample size, also referred to as the $\log$-$K$ curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a $K$-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder trained with SaNCE demonstrates greater robustness to a reduction in the number of available samples, thus possessing the potential to overcome the $\log$-$K$ curse.
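In spirit, SaNCE is a $K$-sample contrastive estimator over context embeddings; a minimal InfoNCE-style stand-in is below (the skill-aware sampling of positives and negatives is the paper's contribution and is abstracted away here):

    import torch
    import torch.nn.functional as F

    def k_sample_nce(anchor, positive, negatives, tau=0.1):
        # anchor, positive: (d,); negatives: (K-1, d) context embeddings.
        cands = torch.cat([positive.unsqueeze(0), negatives], dim=0)      # (K, d)
        logits = F.cosine_similarity(anchor.unsqueeze(0), cands) / tau    # (K,)
        target = torch.zeros(1, dtype=torch.long)                         # positive at index 0
        return F.cross_entropy(logits.unsqueeze(0), target)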
Authors: Zhao Xu, Fan Liu, Hao Liu
Abstract: Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work largely overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduce $\textbf{JailTrickBench}$ to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.
Authors: Enming Zhang, Ruobing Yao, Huanyong Liu, Junhui Yu, Jiale Wang
Abstract: With the development of Multimodal Large Language Models (MLLMs), their general capabilities have become increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. However, there is still no comprehensive method to evaluate MLLMs on tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions of flowchart-related tasks. It encompasses evaluating MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. We find that even the GPT-4o model achieves only a score of 56.63; among open-source models, Phi-3-Vision obtains the highest score of 49.97. We hope that FlowCE can contribute to future research on MLLMs for tasks based on flowcharts. \url{https://github.com/360AILABNLP/FlowCE}
Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20\% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60\% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers after fine-tuning, enabling the processing of lengths up to 50 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 10 million token lengths.
Authors: Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, Jun Zhu
Abstract: Diffusion models have made substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to an issue we call conditional image leakage, where image-to-video diffusion models (I2V-DMs) tend to over-rely on the conditional image at large time steps. We address this challenge from both the inference and training aspects. First, we propose starting the generation process from an earlier time step to avoid the unreliable large time steps of I2V-DMs, and we derive an initial noise distribution with an optimal analytic expression (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution to bridge the training-inference gap. Second, we design a time-dependent noise distribution (TimeNoise) for the conditional image during training, applying higher noise levels at larger time steps to disrupt it and reduce the model's dependency on it. We validate these general strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines by producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. The project page: \url{https://cond-image-leak.github.io/}.
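An illustrative sketch of time-dependent corruption of the conditional image (the quadratic schedule below is an assumption, not the paper's exact TimeNoise distribution):

    import torch

    def corrupt_condition(cond_img, t, t_max, max_sigma=0.5):
        # Noise the conditional image more aggressively at larger time steps,
        # so the model cannot over-rely on it where leakage is worst.
        sigma = max_sigma * (t / t_max) ** 2
        return cond_img + sigma * torch.randn_like(cond_img)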
Authors: Haoran Wang, Xinji Mai, Zeng Tao, Xuan Tong, Junxiong Lin, Yan Wang, Jiawen Yu, Boyang Wang, Shaoqi Yan, Qing Zhao, Ziheng Zhou, Shuyong Gao, Wenqiang Zhang
Abstract: Contemporary state-of-the-art Dynamic Facial Expression Recognition (DFER) technology has made remarkable progress by deriving emotional mappings of facial expressions from video content, underpinned by training on voluminous datasets. Yet DFER datasets contain a substantial volume of noisy data. Noise arises from low-quality captures that defy logical labeling, and from instances that are mislabeled due to annotation bias, engendering two principal types of uncertainty: uncertainty regarding data usability and uncertainty concerning label reliability. To address both types of uncertainty, we have crafted a two-stage framework aimed at \textbf{S}eeking \textbf{C}ertain data \textbf{I}n extensive \textbf{U}ncertain data (SCIU). This initiative purges the DFER datasets of these uncertainties, thereby ensuring that only clean, verified data is employed in training. To mitigate the issue of low-quality samples, we introduce the Coarse-Grained Pruning (CGP) stage, which assesses sample weights and prunes those deemed unusable due to their low weight. For samples with incorrect annotations, the Fine-Grained Correction (FGC) stage evaluates prediction stability to rectify mislabeled data. Moreover, SCIU is conceived as a universally compatible, plug-and-play framework, tailored to integrate seamlessly with prevailing DFER methodologies. Rigorous experiments across prevalent DFER datasets and numerous benchmark methods substantiate SCIU's capacity to markedly elevate performance metrics.
Authors: Haolang Lu, Hongrui Peng, Guoshun Nan, Jiaoyang Cui, Cheng Wang, Weifei Jin, Songtao Wang, Shengli Pan, Xiaofeng Tao
Abstract: Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which capture the rich interactions within a binary, remain largely underexplored. To this end, we propose MALSIGHT, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary datasets, MalS and MalP, using an LLM and manually refine them with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS and benign pseudocode datasets. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summaries. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of the summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed MALSIGHT. Notably, our proposed MalT5, with only 0.77B parameters, delivers performance comparable to that of the much larger Code-Llama.
Authors: Azim Ospanov, Jingwei Zhang, Mohammad Jalali, Xuenan Cao, Andrej Bogdanov, Farzan Farnia
Abstract: While standard evaluation scores for generative models are mostly reference-based, reference-dependent assessment of generative models can be difficult when no applicable reference dataset is available. Recently, the reference-free entropy scores VENDI and RKE have been proposed to evaluate the diversity of generated data. However, estimating these scores from data incurs significant computational costs for large-scale generative models. In this work, we leverage the random Fourier features framework to reduce this computational cost and propose the Fourier-based Kernel Entropy Approximation (FKEA) method. We utilize FKEA's approximated eigenspectrum of the kernel matrix to efficiently estimate these entropy scores. Furthermore, we show how FKEA's proxy eigenvectors reveal the modes the method identifies when evaluating the diversity of produced samples. We provide a stochastic implementation of the FKEA assessment algorithm with complexity $O(n)$, growing linearly with the sample size $n$. We extensively evaluate FKEA's numerical performance on standard image, text, and video datasets. Our empirical results indicate the method's scalability and interpretability when applied to large-scale generative models. The codebase is available at https://github.com/aziksh-ospanov/FKEA.
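A minimal random-Fourier-feature entropy proxy in the spirit of FKEA (a sketch; the paper's exact score definitions and normalisations may differ):

    import numpy as np

    def rff_entropy_score(X, n_features=512, sigma=1.0, seed=0):
        # Approximate a Gaussian kernel with random Fourier features, then score
        # diversity as the exponentiated entropy of the normalised eigenspectrum.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        Phi = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)   # (n, n_features)
        C = Phi.T @ Phi / n                                   # feature covariance
        lam = np.clip(np.linalg.eigvalsh(C), 1e-12, None)
        lam /= lam.sum()
        return float(np.exp(-(lam * np.log(lam)).sum()))

The cost stays linear in the sample size n because only an n_features x n_features covariance is eigendecomposed.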
Authors: Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
Abstract: Universal sound separation (USS) is the task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with short-time Fourier transform (STFT) features to serve as input to the separation model. We evaluate our methods on the AudioSet dataset, and the experimental results indicate that the proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
Authors: Andong Hua, Mehak Preet Dhaliwal, Ryan Burke, Laya Pullela, Yao Qin
Abstract: Accurate nutrition estimation helps people make informed dietary choices and is essential in the prevention of serious health complications. We present NutriBench, the first publicly available natural language meal description nutrition benchmark. NutriBench consists of 11,857 meal descriptions generated from real-world global dietary intake data. The data is human-verified and annotated with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. We conduct an extensive evaluation of NutriBench on the task of carbohydrate estimation, testing twelve leading Large Language Models (LLMs), including GPT-4o, Llama3.1, Qwen2, Gemma2, and OpenBioLLM models, using standard, Chain-of-Thought and Retrieval-Augmented Generation strategies. Additionally, we present a study involving professional nutritionists, finding that LLMs can provide more accurate and faster estimates. Finally, we perform a real-world risk assessment by simulating the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes. Our work highlights the opportunities and challenges of using LLMs for nutrition estimation, demonstrating their potential to aid professionals and laypersons and improve health outcomes. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html
Authors: Silviu Pitis, Ziang Xiao, Nicolas Le Roux, Alessandro Sordoni
Abstract: While finetuning language models from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance exceeding that of GPT-4 and Llama 3 70B on tested datasets, and (3) investigate the value of context-aware preference modeling.
Authors: Gon\c{c}alo Vinagre Martins, Jo\~ao Magalh\~aes, Afonso Quinaz, Carla Viegas, Sofia Cavaco
Abstract: SLVideo is a video moment retrieval system for Sign Language videos that incorporates facial expressions, which existing sign language retrieval technology largely overlooks. The system extracts embedding representations for the hand and face signs from video frames to capture the signs in their entirety, enabling users to search for a specific sign language video segment with text queries. A collection of eight hours of annotated Portuguese Sign Language videos is used as the dataset, and a CLIP model is used to generate the embeddings. The initial results are promising in a zero-shot setting. In addition, SLVideo incorporates a thesaurus that enables users to search for signs similar to those retrieved, using the video segment embeddings, and it also supports the editing and creation of video sign language annotations. Project web page: https://novasearch.github.io/SLVideo/
Authors: Carlo Alberto Barbano, Matteo Brunello, Benoit Dufumier, Marco Grangetto
Abstract: Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the most predominant biomarkers in neuroimaging is brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer's Disease. Using brain age for pretraining DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity for different conditions. On the other hand, anatomical information from brain MRIs (e.g., cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for brain MRIs that (i) leverages anatomical information with a weakly contrastive learning approach and (ii) achieves state-of-the-art performance across many different downstream tasks. To validate our approach we consider 12 different downstream tasks for diagnosis classification, and prediction of 10 different clinical assessment scores. Pretrained models can be found at https://github.com/EIDOSLAB/AnatCL.
Authors: Derry Pratama, Naufal Suryanto, Andro Aprila Adiputra, Thi-Thu-Huong Le, Ahmada Yusril Kadiptya, Muhammad Iqbal, Howon Kim
Abstract: Penetration testing, a critical component of cybersecurity, typically requires extensive time and effort to find vulnerabilities. Beginners in this field often benefit from collaborative approaches with the community or experts. To address this, we develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model specifically trained to assist in penetration testing tasks. We trained CIPHER using over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools. Additionally, we introduced the Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method to augment penetration testing write-ups to establish a fully automated pentesting simulation benchmark tailored for large language models. This approach fills a significant gap in traditional cybersecurity Q\&A benchmarks and provides a realistic and rigorous standard for evaluating AI's technical knowledge, reasoning capabilities, and practical utility in dynamic penetration testing scenarios. In our assessments, CIPHER achieved the best overall performance in providing accurate suggestion responses compared to other open-source penetration testing models of similar size and even larger state-of-the-art models like Llama 3 70B and Qwen1.5 72B Chat, particularly on insane difficulty machine setups. This demonstrates that the current capabilities of general LLMs are insufficient for effectively guiding users through the penetration testing process. We also discuss the potential for improvement through scaling and the development of better benchmarks using FARR Flow augmentation results. Our benchmark will be released publicly at https://github.com/ibndias/CIPHER.
Authors: Hasan Iqbal, Yuxia Wang, Minghan Wang, Georgi Georgiev, Jiahui Geng, Iryna Gurevych, Preslav Nakov
Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
URLs: https://github.com/mbzuai-nlp/openfactcheck, https://pypi.org/project/openfactcheck/, http://app.openfactcheck.com, https://youtu.be/-i9VKL0HleI
Authors: Nadav Cohen, Noam Razin
Abstract: These notes are based on a lecture delivered by NC in March 2021, as part of an advanced course at Princeton University on the mathematical understanding of deep learning. They present a theory (developed by NC, NR, and collaborators) of linear neural networks -- a fundamental model in the study of optimization and generalization in deep learning. Practical applications born from the presented theory are also discussed. The theory is based on mathematical tools that are dynamical in nature. It showcases the potential of such tools to push the envelope of our understanding of optimization and generalization in deep learning. The text assumes familiarity with the basics of statistical learning theory. Exercises (without solutions) are included.
Authors: P. Curvo, D. R. Ferreira, R. Jorge
Abstract: The design of fusion devices is typically based on computationally expensive simulations. This can be alleviated using high aspect ratio models that employ a reduced number of free parameters, especially in the case of stellarator optimization where non-axisymmetric magnetic fields with a large parameter space are optimized to satisfy certain performance criteria. However, optimization is still required to find configurations with properties such as low elongation, high rotational transform, finite plasma beta, and good fast particle confinement. In this work, we train a machine learning model to construct configurations with favorable confinement properties by finding a solution to the inverse design problem, that is, obtaining a set of model input parameters for given desired properties. Since the solution of the inverse problem is non-unique, a probabilistic approach, based on mixture density networks, is used. It is shown that optimized configurations can be generated reliably using this method.
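A minimal mixture density network sketch for such a non-unique inverse problem (layer sizes and the component count are illustrative assumptions):

    import torch
    import torch.nn as nn

    class MDN(nn.Module):
        # Map desired properties y to a Gaussian mixture over input parameters x.
        def __init__(self, y_dim, x_dim, n_comp=8, hidden=128):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(y_dim, hidden), nn.ReLU())
            self.pi = nn.Linear(hidden, n_comp)                 # mixture logits
            self.mu = nn.Linear(hidden, n_comp * x_dim)         # component means
            self.log_sigma = nn.Linear(hidden, n_comp * x_dim)  # component scales
            self.n_comp, self.x_dim = n_comp, x_dim

        def forward(self, y):
            h = self.body(y)
            log_pi = torch.log_softmax(self.pi(h), dim=-1)
            mu = self.mu(h).view(-1, self.n_comp, self.x_dim)
            sigma = self.log_sigma(h).view(-1, self.n_comp, self.x_dim).exp()
            return log_pi, mu, sigma

    def mdn_nll(log_pi, mu, sigma, x):
        # Negative log-likelihood of x under the predicted mixture.
        logp = torch.distributions.Normal(mu, sigma).log_prob(x.unsqueeze(1)).sum(-1)
        return -torch.logsumexp(logp + log_pi, dim=-1).mean()

Sampling several configurations from the predicted mixture then exposes the distinct branches of the non-unique inverse map.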
Authors: Seyed Mohammad Azimi-Abarghouyi, Lav R. Varshney
Abstract: This paper introduces a federated learning framework that enables over-the-air computation via digital communications, using a new joint source-channel coding scheme. Without relying on channel state information at devices, this scheme employs lattice codes to both quantize model parameters and exploit interference from the devices. We propose a novel receiver structure at the server, designed to reliably decode an integer combination of the quantized model parameters as a lattice point for the purpose of aggregation. We present a mathematical approach to derive a convergence bound for the proposed scheme and offer design remarks. In this context, we suggest an aggregation metric and a corresponding algorithm to determine effective integer coefficients for the aggregation in each communication round. Our results illustrate that, regardless of channel dynamics and data heterogeneity, our scheme consistently delivers superior learning accuracy across various parameters and markedly surpasses other over-the-air methodologies.
Authors: Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang
Abstract: While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the feature generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
Authors: Kevin Li, Fulu Li
Abstract: We use images of a wide variety of cars to compose a photomosaic of an animal, such as a bird or a lion, on the theme of environmental protection, maximizing the information about cars in a single composed image while raising awareness of environmental challenges. We present a novel way of interacting with an artistically composed photomosaic image, in which a simple "click and display" operation switches between a tile image in the photomosaic and the corresponding original car image, which is automatically saved to the Desktop. We build a multimodal custom GPT named TalkMosaic by incorporating car image information and the related knowledge into ChatGPT. By uploading the original car image to TalkMosaic, we can ask questions about the given car image and get the corresponding answers efficiently and effectively, such as where to buy a tire shown in the car image that satisfies high environmental standards. We give an in-depth analysis of how to speed up multimodal LLM inference using sparse attention and quantization techniques, with the presented probabilistic FlashAttention (PrFlashAttention) and Staircase Adaptive Quantization (SAQ) methods. The implemented prototype demonstrates the feasibility and effectiveness of the presented approach.
Authors: Hanwen Zhang, Dusit Niyato, Wei Zhang, Changyuan Zhao, Hongyang Du, Abbas Jamalipour, Sumei Sun, Yiyang Pei
Abstract: With the advancement of generative artificial intelligence (GenAI) models, their capabilities are expanding significantly beyond content generation, and the models are increasingly being used across diverse applications. In particular, GenAI shows great potential in addressing challenges in the electric vehicle (EV) ecosystem, ranging from charging management to cyber-attack prevention. In this paper, we specifically consider the Internet of electric vehicles (IoEV) and categorize GenAI for IoEV into four layers: the EV battery layer, the individual EV layer, the smart grid layer, and the security layer. We introduce the GenAI techniques used in each layer of IoEV applications. Subsequently, public datasets available for training the GenAI models are summarized. Finally, we provide recommendations for future directions. This survey not only categorizes the applications of GenAI in IoEV across different layers but also serves as a valuable resource for researchers and practitioners by highlighting the design and implementation challenges within each layer. Furthermore, it provides a roadmap for future research directions, enabling the development of more robust and efficient IoEV systems through the integration of advanced GenAI techniques.
Authors: Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Abstract: "Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-Instruct and Ministral-8B-Instruct and conduct a qualitative and a quantitative analysis of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
Authors: Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li
Abstract: Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.
Authors: Yan Scholten, Stephan G\"unnemann, Leo Schwinn
Abstract: Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Namely, we derive novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Through a case study focused on unlearning, we reveal that deterministic evaluations falsely indicate successful unlearning, whereas our probabilistic evaluations demonstrate that most, if not all, of the supposedly unlearned information remains accessible in these models. Additionally, we propose a novel unlearning loss based on entropy optimization and adaptive temperature scaling, which significantly improves unlearning in probabilistic settings on recent benchmarks. Our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://github.com/yascho/probabilistic-unlearning.
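A hedged sketch of the shift from point estimates to distributional statements: sample many outputs and report a high-probability (Hoeffding) bound rather than one greedy decode (the sampler and checker are user-supplied placeholders, and the paper's metrics are more refined):

    import math

    def leakage_bound(sample_output, leaks, n=1000, delta=0.05):
        # Estimate P(output reveals the supposedly unlearned fact), with an
        # upper bound that holds with probability at least 1 - delta.
        hits = sum(leaks(sample_output()) for _ in range(n))
        p_hat = hits / n
        eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))   # Hoeffding deviation
        return p_hat, min(1.0, p_hat + eps)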
Authors: Shuhe Wang, Guoyin Wang, Yizhong Wang, Jiwei Li, Eduard Hovy, Chen Guo
Abstract: Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model's maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M examples and models ranging from 8B to 70B parameters. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file.
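A minimal greedy packing sketch (whole samples only; how to mask attention across sample boundaries is a separate design choice):

    def pack(samples, max_len):
        # samples: lists of token ids; returns packs whose total length <= max_len.
        packs, buf, used = [], [], 0
        for s in samples:
            s = s[:max_len]                      # truncate oversized samples
            if used + len(s) > max_len and buf:
                packs.append(buf)
                buf, used = [], 0
            buf.append(s)
            used += len(s)
        if buf:
            packs.append(buf)
        return packs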
URLs: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file
Authors: Mete Ismayilzada, Defne Circi, Jonne S\"alev\"a, Hale Sirin, Abdullatif K\"oksal, Bhuwan Dhingra, Antoine Bosselut, Lonneke van der Plas, Duygu Ataman
Abstract: Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.
Authors: Khiem Le, Nitesh V. Chawla
Abstract: Molecule optimization is a critical task in drug discovery that optimizes desired properties of a given molecule through chemical modification. Although Large Language Models (LLMs) hold the potential to efficiently simulate this task by using natural language to direct the optimization, straightforwardly utilizing them shows limited performance. In this work, we facilitate utilizing LLMs in an iterative paradigm by proposing a simple yet highly effective domain feedback provider, namely $\text{Re}^3$DF. In detail, $\text{Re}^3$DF harnesses an external toolkit, RDKit, to handle molecule hallucination when the modified molecule is chemically invalid. Otherwise, its desired properties are computed and compared to those of the original molecule, establishing reliable domain feedback with the correct direction and distance towards the objective, followed by a retrieved example, to explicitly guide the LLM to refine the modified molecule. We conduct experiments across both single- and multi-property objectives with 2 thresholds, where $\text{Re}^3$DF shows significant improvements. In particular, for 20 single-property objectives, $\text{Re}^3$DF improves the Hit ratio by 16.95% and 20.76% under loose and strict thresholds, respectively. For 32 multi-property objectives, $\text{Re}^3$DF improves the Hit ratio by 6.04% and 5.25%.
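The validity-then-property feedback loop, sketched with RDKit (LogP stands in for the objective property, and the feedback wording is illustrative):

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def domain_feedback(smiles, target_logp):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:                          # molecule hallucination
            return "invalid molecule: revise the modification"
        logp = Descriptors.MolLogP(mol)
        gap = target_logp - logp
        direction = "increase" if gap > 0 else "decrease"
        return f"valid; LogP={logp:.2f}; {direction} LogP by {abs(gap):.2f}"

The returned string, together with a retrieved example, can then be appended to the LLM prompt for the next refinement round.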
Authors: Tu Trinh, Mohamad H. Danesh, Nguyen X. Khanh, Benjamin Plaut
Abstract: While reinforcement learning (RL) agents often perform well during training, they can struggle with distribution shift in real-world deployments. One particularly severe risk of distribution shift is goal misgeneralization, where the agent learns a proxy goal that coincides with the true goal during training but not during deployment. In this paper, we explore whether allowing an agent to ask for help from a supervisor in unfamiliar situations can mitigate this issue. We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization. We evaluate multiple methods for determining when the agent should request help and find that asking for help consistently improves performance. However, we also find that methods based on the agent's internal state fail to proactively request help, instead waiting until mistakes have already occurred. Further investigation suggests that the agent's internal state does not represent the coin at all, highlighting the importance of learning nuanced representations, the risks of ignoring everything not immediately relevant to reward, and the necessity of developing ask-for-help strategies tailored to the agent's training algorithm.
Authors: Qintong Zhang, Victor Shea-Jay Huang, Bin Wang, Junyuan Zhang, Zhengren Wang, Hao Liang, Shawn Wang, Matthieu Lin, Conghui He, Wentao Zhang
Abstract: Document parsing is essential for converting unstructured and semi-structured documents, such as contracts, academic papers, and invoices, into structured, machine-readable data. It extracts reliable structured data from unstructured inputs, providing great convenience for numerous applications. Especially with recent achievements in Large Language Models, document parsing plays an indispensable role in both knowledge base construction and training data generation. This survey presents a comprehensive review of the current state of document parsing, covering key methodologies from modular pipeline systems to end-to-end models driven by large vision-language models. Core components such as layout detection, content extraction (including text, tables, and mathematical expressions), and multi-modal data integration are examined in detail. Additionally, this paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts, integrating multiple modules, and recognizing high-density text. It emphasizes the importance of developing larger and more diverse datasets and outlines future research directions.
Authors: Jianlan Luo, Charles Xu, Jeffrey Wu, Sergey Levine
Abstract: Reinforcement learning (RL) holds great promise for enabling autonomous acquisition of complex robotic manipulation skills, but realizing this potential in real-world settings has been challenging. We present a human-in-the-loop vision-based RL system that demonstrates impressive performance on a diverse set of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination. Our approach integrates demonstrations and human corrections, efficient RL algorithms, and other system-level design choices to learn policies that achieve near-perfect success rates and fast cycle times within just 1 to 2.5 hours of training. We show that our method significantly outperforms imitation learning baselines and prior RL approaches, with an average 2x improvement in success rate and 1.8x faster execution. Through extensive experiments and analysis, we provide insights into the effectiveness of our approach, demonstrating how it learns robust, adaptive policies for both reactive and predictive control strategies. Our results suggest that RL can indeed learn a wide range of complex vision-based manipulation policies directly in the real world within practical training times. We hope this work will inspire a new generation of learned robotic manipulation techniques, benefiting both industrial applications and research advancements. Videos and code are available at our project website https://hil-serl.github.io/.
Authors: Yinyi Lai, Anbo Cao, Yuan Gao, Jiaqi Shang, Zongyu Li, Jia Guo
Abstract: Early and accurate diagnosis of brain tumors is crucial for improving patient survival rates. However, the detection and classification of brain tumors are challenging due to their diverse types and complex morphological characteristics. This study investigates the application of pre-trained models for brain tumor classification, with a particular focus on deploying the Mamba model. We fine-tuned several mainstream transfer learning models and applied them to the multi-class classification of brain tumors. By comparing these models to those trained from scratch, we demonstrated the significant advantages of transfer learning, especially in the medical imaging field, where annotated data is often limited. Notably, we introduced the Vision Mamba (Vim), a novel network architecture, and applied it for the first time in brain tumor classification, achieving exceptional classification accuracy. Experimental results indicate that the Vim model achieved 100% classification accuracy on an independent test set, emphasizing its potential for tumor classification tasks. These findings underscore the effectiveness of transfer learning in brain tumor classification and reveal that, compared to existing state-of-the-art models, the Vim model is lightweight, efficient, and highly accurate, offering a new perspective for clinical applications. Furthermore, the framework proposed in this study for brain tumor classification, based on transfer learning and the Vision Mamba model, is broadly applicable to other medical imaging classification problems.
Authors: Jonas Bode, Bastian P\"atzold, Raphael Memmesheimer, Sven Behnke
Abstract: Recent advances in LLMs have been instrumental in autonomous robot control and human-robot interaction by leveraging their vast general knowledge and capabilities to understand and reason across a wide range of tasks and scenarios. Previous works have investigated various prompt engineering techniques for improving the performance of LLMs on task accomplishment, while others have proposed methods that utilize LLMs to plan and execute tasks based on the available functionalities of a given robot platform. In this work, we consider both lines of research by comparing prompt engineering techniques, and combinations thereof, within the application of high-level task planning and execution in service robotics. We define a diverse set of tasks and a simple set of functionalities in simulation, and measure task completion accuracy and execution time for several state-of-the-art models.
Authors: Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura
Abstract: The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant language development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work with a CNN achieved a joint model for call segmentation, classification, and caller identification for marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, utilizes the self-attention mechanism, which efficiently aggregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.
Authors: Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa
Abstract: Distributed training of large models consumes enormous computation resources and requires substantial engineering efforts to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native compiler-based Fully Sharded Data Parallel (FSDP) framework, which has a simple implementation for maintenance and composability, allows full computation-communication graph tracing, and brings performance enhancement via compiler backend optimizations. SimpleFSDP's novelty lies in its unique $torch.compile$-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features the first-of-its-kind intermediate representation (IR) nodes bucketing and reordering in the TorchInductor backend for effective computation-communication overlapping. As a result, users can employ the aforementioned optimizations to automatically or manually wrap model components for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
Authors: Heiko Hoffmann
Abstract: Scalar variables, e.g., the orientation of a shape in an image, are commonly predicted using a single output neuron in a neural network. In contrast, the mammalian cortex represents variables with a population of neurons. In this population code, each neuron is most active at its preferred value and shows partial activity for other values. Here, we investigate the benefit of using a population code for the output layer of a neural network. We compare population codes against single-neuron outputs and one-hot vectors. First, we show theoretically and in experiments with synthetic data that population codes improve robustness to input noise in networks of stacked linear layers. Second, we demonstrate the benefit of using population codes to encode ambiguous outputs, such as the pose of symmetric objects. Using the T-LESS dataset of feature-less real-world objects, we show that population codes improve the accuracy of predicting 3D object orientation from image input.
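A minimal population-code sketch with Gaussian tuning curves (the number of neurons and the tuning width are illustrative):

    import numpy as np

    def encode(value, preferred, width=0.1):
        # Each output neuron fires most near its preferred value.
        return np.exp(-0.5 * ((value - preferred) / width) ** 2)

    def decode(activity, preferred):
        # Activity-weighted mean of the preferred values.
        return float((activity * preferred).sum() / activity.sum())

    preferred = np.linspace(0.0, 1.0, 32)               # 32 output neurons
    print(decode(encode(0.42, preferred), preferred))   # ~0.42

For ambiguous targets such as the pose of a symmetric object, the activity pattern can simply be multimodal, which a single output neuron cannot express.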
Authors: Zhili Feng, Tanya Marwah, Nicolo Fusi, David Alvarez-Melis, Lester Mackey
Abstract: Modern large language models use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English language models, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
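The token-translation step rests on entropy-regularized optimal transport, which is easy to sketch. Below, plain Sinkhorn iterations turn a toy target-to-source token cost matrix into a transport plan with uniform marginals; S2T2 itself learns the costs end-to-end and enforces sparsity, which this sketch omits:

import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    # Entropy-regularized transport between target and source vocabularies.
    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)   # uniform marginals
    K = np.exp(-cost / eps)                           # Gibbs kernel
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):                          # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                # transport plan

rng = np.random.default_rng(0)
P = sinkhorn(rng.random((6, 4)))      # 6 target tokens -> 4 source tokens
print(P.sum(axis=1))                  # each row sums to 1/6
print(P.sum(axis=0))                  # each column sums to 1/4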
Authors: Vu Khanh Quy, Nguyen Minh Quy, Tran Thi Hoai, Shaba Shaon, Md Raihan Uddin, Tien Nguyen, Dinh C. Nguyen, Aryan Kaushik, Periklis Chatzimisios
Abstract: 6G wireless networks are expected to provide seamless, data-driven connectivity that covers space-air-ground and underwater networks. As a core component of future 6G networks, Space-Air-Ground Integrated Networks (SAGIN) have been envisioned to support countless real-time intelligent applications. To realize this, integrating AI techniques into SAGIN is an inevitable trend. Owing to the distributed and heterogeneous architecture of SAGIN, federated learning (FL) and, more recently, quantum FL (QFL) are emerging AI model-training techniques for enabling future privacy-enhanced and computation-efficient SAGINs. In this work, we explore the vision of using FL/QFL in SAGINs. We present a few representative applications enabled by the integration of FL and QFL in SAGINs. A case study of QFL over UAV networks is also given, showing the merit of the quantum-enabled training approach over the conventional FL benchmark. Research challenges, along with standardization efforts for QFL adoption in future SAGINs, are also highlighted.
Authors: Md Raihan Uddin, Ratun Rahman, Dinh C. Nguyen
Abstract: Smart metering networks are increasingly susceptible to cyber threats, among which false data injection (FDI) stands out as a critical attack. Data-driven machine learning (ML) methods have shown immense benefits in detecting FDI attacks via their data-learning and prediction abilities. Prior work has mostly focused on centralized learning, deploying FDI attack detection models at the control center, which requires data collection from local utilities like meters and transformers. However, this data sharing may raise privacy concerns due to the potential disclosure of household information such as energy usage patterns. This paper proposes a new privacy-preserving FDI attack detection approach by developing an efficient federated learning (FL) framework in the smart meter network with edge computing. Distributed edge servers located at the network edge run an ML-based FDI attack detection model and share the trained model with the grid operator, aiming to build a strong FDI attack detection model without data sharing. Simulation results demonstrate the efficiency of our proposed FL method over the conventional method without collaboration.
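The FL workflow described here reduces to a familiar FedAvg loop, sketched below with a toy logistic-regression detector: each edge server trains on its private meter readings, and only model weights travel to the grid operator. All names and data are illustrative, not the paper's setup:

import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    # One edge server: a few steps of logistic-regression SGD on local data.
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fedavg(global_w, clients):
    # The grid operator averages model updates; raw meter data never leaves
    # the edge servers.
    updates = [local_update(global_w, X, y) for X, y in clients]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
d = 8
clients = []
for _ in range(4):  # four edge servers with private (X, y) readings
    X = rng.normal(size=(100, d))
    y = (X @ np.ones(d) + 0.1 * rng.normal(size=100) > 0).astype(float)
    clients.append((X, y))

w = np.zeros(d)
for _ in range(20):  # communication rounds
    w = fedavg(w, clients)
print(w.round(2))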
Authors: Gjergji Kasneci, Enkelejda Kasneci
Abstract: Feature engineering is crucial for optimizing machine learning model performance, particularly in tabular data classification tasks. Leveraging advancements in natural language processing, this study presents a systematic approach to enrich tabular datasets with features derived from large language model embeddings. Through a comprehensive ablation study on diverse datasets, we assess the impact of RoBERTa and GPT-2 embeddings on ensemble classifiers, including Random Forest, XGBoost, and CatBoost. Results indicate that integrating embeddings with traditional numerical and categorical features often enhances predictive performance, especially on datasets with class imbalance or limited features and samples, such as UCI Adult, Heart Disease, Titanic, and Pima Indian Diabetes, with improvements particularly notable in XGBoost and CatBoost classifiers. Additionally, feature importance analysis reveals that LLM-derived features frequently rank among the most impactful for the predictions. This study provides a structured approach to embedding-based feature enrichment and illustrates its benefits in ensemble learning for tabular data.
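A minimal version of this enrichment pipeline is easy to sketch with Hugging Face Transformers and scikit-learn: serialize each record as text, mean-pool GPT-2 hidden states into an embedding, and concatenate it with the raw numeric features before fitting a tree ensemble. GradientBoostingClassifier stands in for the paper's XGBoost/CatBoost, and the row serialization format is illustrative:

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import GradientBoostingClassifier

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                  # GPT-2 has no pad token
lm = AutoModel.from_pretrained("gpt2").eval()

def embed(rows):
    # Serialize each record as text, then mean-pool the hidden states.
    with torch.no_grad():
        batch = tok(rows, return_tensors="pt", padding=True, truncation=True)
        h = lm(**batch).last_hidden_state       # (batch, seq, 768)
        return h.mean(dim=1).numpy()

rows = ["age 39, education Bachelors, hours 40",
        "age 52, education HS-grad, hours 45"]
numeric = np.array([[39.0, 40.0], [52.0, 45.0]])
X = np.hstack([numeric, embed(rows)])           # tabular + LLM features
clf = GradientBoostingClassifier().fit(X, [0, 1])
print(clf.predict(X))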
Authors: Mohammad Zbeeb, Mohammad Ghorayeb, Mariam Salman
Abstract: Artificial Intelligence (AI) research often aims to develop models that can generalize reliably across complex datasets, yet this remains challenging in fields where data is scarce, intricate, or inaccessible. This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize one of the most demanding structured datasets: malicious network traffic. Our approach uniquely transforms numerical data into text, re-framing data generation as a language modeling task, which not only enhances data regularization but also significantly improves generalization and the quality of the synthetic data. Extensive statistical analyses demonstrate that our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data. Additionally, we conduct a comprehensive study on synthetic data applications, effectiveness, and evaluation strategies, offering valuable insights into its role across various domains. Our code and pre-trained models are openly accessible on GitHub, enabling further exploration and application of our methodology.
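The key re-framing, turning numeric flow records into text so that a language model can be trained on them, can be sketched with a toy serialization (the paper's exact field layout and tokenization are not reproduced here):

def flow_to_text(flow):
    # Hypothetical serialization: field names anchor each value so the LM
    # can learn per-field distributions and cross-field structure.
    return (f"proto {flow['proto']} | dur {flow['dur']:.3f} | "
            f"bytes {flow['bytes']} | pkts {flow['pkts']} | label {flow['label']}")

def text_to_flow(line):
    # Inverse mapping used to recover numeric records from generated text.
    fields = dict(part.strip().split(" ", 1) for part in line.split("|"))
    return {"proto": fields["proto"], "dur": float(fields["dur"]),
            "bytes": int(fields["bytes"]), "pkts": int(fields["pkts"]),
            "label": fields["label"]}

rec = {"proto": "tcp", "dur": 1.25, "bytes": 5120, "pkts": 12, "label": "ddos"}
line = flow_to_text(rec)
print(line)
print(text_to_flow(line) == rec)   # round-trips back to the numeric record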
Authors: Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, Liyao Li, Pengzuo Wu, Qi Zhang, Qingyi Huang, Saisai Yang, Tao Zhang, Wentao Ye, Wufang Zhu, Xiaomeng Hu, Xijun Gu, Xinjie Sun, Xiang Li, Yuhang Yang, Zhiqing Xiao
Abstract: The emergence of models like GPTs, Claude, LLaMA, and Qwen has reshaped AI applications, presenting vast new opportunities across industries. Yet, the integration of tabular data remains notably underdeveloped, despite its foundational role in numerous real-world domains. This gap is critical for three main reasons. First, integration with database or data warehouse data is essential for advanced applications; second, the vast and largely untapped resource of tabular data offers immense potential for analysis; and third, the business intelligence domain specifically demands adaptable, precise solutions that many current LLMs may struggle to provide. In response, we introduce TableGPT2, a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research. This extensive training enables TableGPT2 to excel in table-centric tasks while maintaining strong general language and coding abilities. One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information. This encoder strengthens the model's ability to handle ambiguous queries, missing column names, and irregular tables commonly encountered in real-world applications. Similar to visual language models, this pioneering approach integrates with the decoder to form a robust large multimodal model. We believe the results are compelling: across 23 benchmarking metrics, TableGPT2 achieves an average performance improvement of 35.20% with the 7B model and 49.32% with the 72B model over prior benchmark-neutral LLMs, with robust general-purpose capabilities intact.
Authors: Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie Yu, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jianqiang Ma, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao Yu, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang
Abstract: In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture-of-experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedules of mixture-of-experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
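For readers unfamiliar with mixture-of-experts inference, the sketch below shows plain top-k gating: a router scores experts per token, and only the selected experts run. Hunyuan-Large's mixed routing strategy (and its KV-cache compression and expert-specific learning rates) goes beyond this minimal version:

import torch
import torch.nn.functional as F

def moe_forward(x, experts, gate, k=2):
    """Route each token to its top-k experts and mix their outputs."""
    logits = gate(x)                                  # (tokens, n_experts)
    weights, idx = logits.topk(k, dim=-1)             # pick k experts/token
    weights = F.softmax(weights, dim=-1)              # renormalize over top-k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                  # tokens sent to expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

d, n_experts = 16, 8
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
gate = torch.nn.Linear(d, n_experts)
tokens = torch.randn(10, d)
print(moe_forward(tokens, experts, gate).shape)   # torch.Size([10, 16])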
URLs: https://github.com/Tencent/Hunyuan-Large, https://huggingface.co/tencent/Tencent-Hunyuan-Large
Authors: Ho Sung Shim, Hyoungjun Park, Kyuhan Lee, Jang-Sun Park, Seonhye Kang
Abstract: Smishing, which aims to illicitly obtain personal information from unsuspecting victims, is significant due to its negative impacts on society. In prior studies, machine learning (ML) has been widely adopted as a tool to counteract smishing, filtering and blocking smishing messages before they reach potential victims. However, a number of challenges remain in ML-based smishing detection, with the scarcity of annotated datasets being one major hurdle. Specifically, given the sensitive nature of smishing-related data, there is a lack of publicly accessible data that can be used for training and evaluating ML models. Additionally, the nuanced similarities between smishing messages and other types of social engineering attacks, such as spam messages, exacerbate the challenge of smishing classification with limited resources. To tackle this challenge, we introduce a novel data augmentation method utilizing a few-shot prompt learning approach. What sets our approach apart from extant methods is the use of the principles of persuasion, a psychological theory that explains the underlying mechanisms of smishing. By designing prompts grounded in the persuasion principles, our augmented dataset can effectively capture various important aspects of smishing messages, enabling ML models to be trained effectively. Our evaluation within a real-world context demonstrates that our augmentation approach produces more diverse and higher-quality smishing data instances compared to other cutting-edge approaches, leading to substantial improvements in the ability of ML models to detect the subtle characteristics of smishing messages. Moreover, our additional analyses reveal that the performance improvement provided by our approach is more pronounced when used with ML models that have a larger number of parameters, demonstrating its effectiveness in training large-scale ML models.
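The prompt-construction idea can be illustrated with one of Cialdini's persuasion principles (here "scarcity", chosen illustratively; the paper's actual prompts and principle set are not reproduced). The few-shot examples anchor the style, and the principle steers what the generated detector-training instances should exploit:

def build_augmentation_prompt(principle, examples, n_new=5):
    # Few-shot prompt for generating synthetic training data for a detector.
    shots = "\n".join(f"- {m}" for m in examples)
    return (f"The following smishing messages rely on the persuasion "
            f"principle of {principle}:\n{shots}\n"
            f"Write {n_new} new smishing messages using the same principle, "
            f"varying the sender, pretext, and call to action.")

print(build_augmentation_prompt(
    "scarcity",
    ["Your parcel is on hold. Confirm within 24h: http://example.com",
     "Final notice: your account closes today unless verified."]))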
Authors: Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang
Abstract: The propensity of Large Language Models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can confidently provide correct answers to given questions with high-probability correctness guarantees. We formulate factuality testing as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that our framework also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box LM. Extensive experiments on question-answering (QA) and multiple-choice benchmarks demonstrate that FactTest effectively detects hallucinations and improves the model's ability to abstain from answering unknown questions, leading to an over 40% accuracy improvement.
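While the paper's full procedure is more general, the flavor of distribution-free Type I error control can be shown with a conformal-style quantile: calibrate a certainty threshold on annotated hallucinations so that at most an alpha fraction of them would be answered, and abstain below it. The scores here are synthetic and the construction is a simplification, not FactTest itself:

import numpy as np

def calibrate_threshold(hallu_scores, alpha=0.05):
    # Choose t so that, with a finite-sample correction, at most an alpha
    # fraction of hallucinations score >= t (Type I error control).
    n = len(hallu_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # conformal-style order statistic
    return np.sort(hallu_scores)[min(k, n) - 1]

rng = np.random.default_rng(0)
hallu = rng.beta(2, 5, size=500)     # certainty scores on known hallucinations
truthful = rng.beta(5, 2, size=500)  # higher certainty on correct answers
t = calibrate_threshold(hallu, alpha=0.05)
print(f"answer if score >= {t:.3f}; otherwise abstain")
print("empirical Type I error:", float((hallu >= t).mean()))
print("truthful answers kept:", float((truthful >= t).mean()))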
Authors: Xin Zhang, Victor S. Sheng
Abstract: We propose an innovative token representation and update method in a new ultra-small language model: the Wave Network. Specifically, we use a complex vector to represent each token, encoding both global and local semantics of the input text. A complex vector consists of two components: a magnitude vector representing the global semantics of the input text, and a phase vector capturing the relationships between individual tokens and the global semantics. Experiments on the AG News text classification task demonstrate that, when generating complex vectors from randomly initialized token embeddings, our single-layer Wave Network achieves 90.91% accuracy with wave interference and 91.66% with wave modulation, outperforming a single Transformer layer using BERT pre-trained embeddings by 19.23% and 19.98%, respectively, and approaching the accuracy of the pre-trained and fine-tuned BERT base model (94.64%). Additionally, compared to BERT base, the Wave Network reduces video memory usage and training time by 77.34% and 85.62% during wave modulation. In summary, we use a 2.4-million-parameter small language model to achieve accuracy comparable to a 100-million-parameter BERT model in text classification.
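The representation can be made concrete with torch's complex tensors. The construction below is a hypothetical reading of the abstract (one shared magnitude vector, per-token phases; interference as complex addition and modulation as element-wise complex multiplication), not the paper's exact formulation:

import torch

tokens = torch.randn(5, 16)                        # 5 tokens, embedding dim 16
g = tokens.abs().mean(dim=0)                       # global magnitude vector
phase = torch.atan2(tokens, g.expand_as(tokens))   # per-token phase vector
z = torch.polar(g.expand_as(tokens), phase)        # complex token vectors

interference = z.sum(dim=0)    # wave interference: complex addition
modulation = z.prod(dim=0)     # wave modulation: complex multiplication
print(interference.shape, modulation.shape)
print(interference.abs()[:4], modulation.angle()[:4])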
Authors: Hossein Hosseini, Mohammad Sobhan Zare, Amir Hossein Mohammadi, Arefeh Kazemi, Zahra Zojaji, Mohammad Ali Nematbakhsh
Abstract: Retrieval-augmented generation (RAG) models, which integrate large-scale pre-trained generative models with external retrieval mechanisms, have shown significant success in various natural language processing (NLP) tasks. However, applying RAG models to Persian, a low-resource language, poses distinct challenges. These challenges primarily involve the preprocessing, embedding, retrieval, prompt construction, language modeling, and response evaluation components of the system. In this paper, we address the challenges of implementing a real-world RAG system for Persian, called PersianRAG. We propose novel solutions to overcome these obstacles and evaluate our approach on several Persian benchmark datasets. Our experimental results demonstrate the capability of the PersianRAG framework to enhance question answering in Persian.
Authors: Qixun Wang, Yifei Wang, Yisen Wang, Xianghua Ying
Abstract: Enhancing node-level Out-Of-Distribution (OOD) generalization on graphs remains a crucial area of research. In this paper, we develop a Structural Causal Model (SCM) to theoretically dissect the performance of two prominent invariant learning methods -- Invariant Risk Minimization (IRM) and Variance-Risk Extrapolation (VREx) -- in node-level OOD settings. Our analysis reveals a critical limitation: due to the lack of class-conditional invariance constraints, these methods may struggle to accurately identify the structure of the predictive invariant ego-graph and consequently rely on spurious features. To address this, we propose Cross-environment Intra-class Alignment (CIA), which explicitly eliminates spurious features by aligning cross-environment representations conditioned on the same class, bypassing the need for explicit knowledge of the causal pattern structure. To adapt CIA to node-level OOD scenarios where environment labels are hard to obtain, we further propose CIA-LRA (Localized Reweighting Alignment), which leverages the distribution of neighboring labels to selectively align node representations, effectively distinguishing and preserving invariant features while removing spurious ones, all without relying on environment labels. We theoretically prove CIA-LRA's effectiveness by deriving an OOD generalization error bound based on PAC-Bayesian analysis. Experiments on graph OOD benchmarks validate the superiority of CIA and CIA-LRA, marking a significant advancement in node-level OOD generalization. The code is available at https://github.com/NOVAglow646/NeurIPS24-Invariant-Learning-on-Graphs.
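The central CIA objective, aligning same-class representations across environments, can be sketched as a penalty on cross-environment class means. This simplified version omits CIA-LRA's localized reweighting by neighboring-label distributions:

import torch

def cia_loss(reps, labels, envs):
    """Align same-class mean representations across environments."""
    loss, pairs = 0.0, 0
    for c in labels.unique():
        means = []
        for e in envs.unique():
            mask = (labels == c) & (envs == e)
            if mask.any():
                means.append(reps[mask].mean(dim=0))
        for i in range(len(means)):          # penalize cross-env mismatch
            for j in range(i + 1, len(means)):
                loss = loss + (means[i] - means[j]).pow(2).sum()
                pairs += 1
    return loss / max(pairs, 1)

reps = torch.randn(40, 8, requires_grad=True)    # node representations
labels = torch.randint(0, 3, (40,))
envs = torch.randint(0, 2, (40,))
print(cia_loss(reps, labels, envs))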
URLs: https://github.com/NOVAglow646/NeurIPS24-Invariant-Learning-on-Graphs.
Authors: Pourya Jafarzadeh, Amir Mohammad Rostami, Padideh Choobdar
Abstract: Speech is the most natural way for humans to express themselves. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, the complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Conventional machine learning methods such as SVM, HMM, and KNN have previously been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The inputs to these methods are mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
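Extracting utterance-level features from raw audio with a pre-trained Wav2Vec2 backbone takes only a few lines with Hugging Face Transformers; the checkpoint name, pooling choice, and 7-class head below are illustrative, not the paper's configuration:

import torch
from transformers import AutoFeatureExtractor, AutoModel

name = "facebook/wav2vec2-base"
fe = AutoFeatureExtractor.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

waveform = torch.randn(16000)                   # 1 s of fake 16 kHz audio
inputs = fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
features = hidden.mean(dim=1)                   # utterance-level embedding
classifier = torch.nn.Linear(768, 7)            # e.g., 7 emotion classes
print(classifier(features).softmax(-1))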
Authors: Adrien LeCoz, St\'ephane Herbin, Faouzi Adjed
Abstract: For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score. This score rarely predicts well the probability of making a correct prediction and requires a post-processing calibration step. However, many confidence calibration methods fail for problems with many classes. To address this issue, we transform the problem of calibrating a multiclass classifier into calibrating a single surrogate binary classifier. This approach allows for more efficient use of standard calibration methods. We evaluate our approach on numerous neural networks used for image or text classification and show that it significantly enhances existing calibration methods.
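The reduction itself is one line of bookkeeping: the surrogate binary classifier's score is the top-1 probability and its label is whether the top-1 class was correct, after which any standard binary calibrator applies. A sketch on a synthetic 100-class problem, with Platt-style logistic recalibration standing in for whichever binary method one prefers:

import numpy as np
from sklearn.linear_model import LogisticRegression

def to_surrogate_binary(probs, y_true):
    """Reduce multiclass calibration to calibrating one binary classifier:
    score = max predicted probability, label = 'was the top-1 correct?'."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(int)
    return conf, correct

rng = np.random.default_rng(0)
y = rng.integers(0, 100, size=2000)
logits = rng.normal(size=(2000, 100))
logits[np.arange(2000), y] += 2.0               # make the toy model informative
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)

conf, correct = to_surrogate_binary(probs, y)
# Platt-style recalibration of the confidence score on a held-out split.
platt = LogisticRegression().fit(conf[:1000, None], correct[:1000])
calibrated = platt.predict_proba(conf[1000:, None])[:, 1]
print(calibrated[:5].round(3))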
Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Fabrizio Silvestri, Emanuele Rodol\`a
Abstract: Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
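The paper's reading of task vectors as multi-task gradients suggests the following compact sketch: merging adds the averaged task vectors to the base weights, and ATM simply loops short tuning and merging instead of tuning once and merging once. The toy one-step "tuning" function below is a stand-in for real finetuning:

import torch

def merge(theta0, finetuned, alpha=1.0):
    """Task arithmetic: add the mean task vector to the base weights."""
    tvs = [{k: w[k] - theta0[k] for k in theta0} for w in finetuned]
    return {k: theta0[k] + alpha * sum(tv[k] for tv in tvs) / len(tvs)
            for k in theta0}

def atm(theta0, tune_one_epoch, tasks, rounds=3):
    # Alternate brief per-task tuning and merging from the current model.
    theta = theta0
    for _ in range(rounds):
        finetuned = [tune_one_epoch(theta, task) for task in tasks]
        theta = merge(theta, finetuned)
    return theta

theta0 = {"w": torch.zeros(4)}
def tune_one_epoch(theta, target):   # stand-in: one gradient step per task
    return {"w": theta["w"] - 0.5 * (theta["w"] - target)}
tasks = [torch.tensor([1., 0., 0., 0.]), torch.tensor([0., 1., 0., 0.])]
print(atm(theta0, tune_one_epoch, tasks)["w"])   # drifts toward the task mean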
Authors: Nasib Ullah, Erik Schultheis, Mike Lasby, Yani Ioannou, Rohit Babbar
Abstract: In recent years, Dynamic Sparse Training (DST) has emerged as an alternative to post-training pruning for generating efficient models. In principle, DST allows for a more memory-efficient training process, as it maintains sparsity throughout the entire training run. However, current DST implementations fail to capitalize on this in practice. Because sparse matrix multiplication is much less efficient than dense matrix multiplication on GPUs, most implementations simulate sparsity by masking weights. In this paper, we leverage recent advances in semi-structured sparse training to apply DST in the domain of classification with large output spaces, where memory efficiency is paramount. With a label space of possibly millions of candidates, the classification layer alone will consume several gigabytes of memory. Switching from a dense layer to a fixed fan-in sparse layer updated with sparse evolutionary training (SET), however, severely hampers training convergence, especially at the largest label spaces. We find that poor gradient flow from the sparse classifier to the dense text encoder makes it difficult to learn good input representations. By employing an intermediate layer or adding an auxiliary training objective, we recover most of the generalisation performance of the dense model. Overall, we demonstrate the applicability and practical benefits of DST in a challenging domain -- characterized by a highly skewed label distribution that differs substantially from typical DST benchmark datasets -- which enables end-to-end training with millions of labels on commodity hardware.
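The memory argument is easiest to see in code: a fixed fan-in layer stores out_dim x fan_in weights instead of out_dim x in_dim, which with millions of labels is the difference between megabytes and gigabytes. A minimal sketch (SET's periodic prune-and-regrow of the connection index sets is omitted):

import torch
import torch.nn as nn

class FixedFanInLinear(nn.Module):
    """Each output unit connects to exactly `fan_in` randomly chosen inputs."""
    def __init__(self, in_dim, out_dim, fan_in):
        super().__init__()
        idx = torch.stack([torch.randperm(in_dim)[:fan_in]
                           for _ in range(out_dim)])       # (out_dim, fan_in)
        self.register_buffer("idx", idx)
        self.weight = nn.Parameter(0.01 * torch.randn(out_dim, fan_in))

    def forward(self, x):                    # x: (batch, in_dim)
        gathered = x[:, self.idx]            # (batch, out_dim, fan_in)
        return (gathered * self.weight).sum(dim=-1)

layer = FixedFanInLinear(in_dim=768, out_dim=10000, fan_in=32)
print(layer(torch.randn(4, 768)).shape)      # torch.Size([4, 10000])
# Dense equivalent stores 768*10000 weights; fixed fan-in only 32*10000.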
Authors: Shan Zhao, Zhaiyu Chen, Zhitong Xiong, Yilei Shi, Sudipan Saha, Xiao Xiang Zhu
Abstract: Earth Observation (EO) data analysis has been significantly revolutionized by deep learning (DL), with applications typically limited to grid-like data structures. Graph Neural Networks (GNNs) emerge as an important innovation, propelling DL into the non-Euclidean domain. Naturally, GNNs can effectively tackle the challenges posed by the diverse modalities, multiple sensors, and heterogeneous nature of EO data. To introduce GNNs to these domains, our review begins by offering fundamental knowledge of GNNs. Then, we summarize the generic problems in EO to which GNNs can offer potential solutions. Following this, we explore a broad spectrum of GNN applications to scientific problems in Earth systems, covering areas such as weather and climate analysis, disaster management, air quality monitoring, agriculture, land cover classification, hydrological process modeling, and urban modeling. The rationale behind adopting GNNs in these fields is explained, alongside methodologies for organizing graphs and designing favorable architectures for various tasks. Furthermore, we highlight methodological challenges of implementing GNNs in these domains and possible solutions that could guide future research. While acknowledging that GNNs are not a universal solution, we conclude the paper by comparing them with other popular architectures like transformers and analyzing their potential synergies.
Authors: Dung Thuy Nguyen, Ziyan An, Taylor T. Johnson, Meiyi Ma, Kevin Leach
Abstract: Federated Learning (FL) offers a promising solution to the privacy concerns associated with centralized Machine Learning (ML) by enabling decentralized, collaborative learning. However, FL is vulnerable to various security threats, including poisoning attacks, where adversarial clients manipulate the training data or model updates to degrade overall model performance. Recognizing this threat, researchers have focused on developing defense mechanisms to counteract poisoning attacks in FL systems. However, existing robust FL methods predominantly focus on computer vision tasks, leaving a gap in addressing the unique challenges of FL with time series data. In this paper, we present FLORAL, a defense mechanism designed to mitigate poisoning attacks in federated learning for time-series tasks, even in scenarios with heterogeneous client data and a large number of adversarial participants. Unlike traditional model-centric defenses, FLORAL leverages logical reasoning to evaluate client trustworthiness by aligning their predictions with global time-series patterns, rather than relying solely on the similarity of client updates. Our approach extracts logical reasoning properties from clients, then hierarchically infers global properties, and uses these to verify client updates. Through formal logic verification, we assess the robustness of each client contribution, identifying deviations indicative of adversarial behavior. Experimental results on two datasets demonstrate the superior performance of our approach compared to existing baseline methods, highlighting its potential to enhance the robustness of FL in time-series applications. Notably, FLORAL reduced the prediction error by 93.27% in the best-case scenario compared to the second-best baseline. Our code is available at https://anonymous.4open.science/r/FLORAL-Robust-FTS.
Authors: George Jiayuan Gao, Tianyu Li, Nadia Figueroa
Abstract: We propose an object-centric recovery policy framework to address the challenges of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on large amounts of labeled data coverage, failing in unfamiliar spatial states. Without relying on extra data collection, our approach learns a recovery policy constructed from an inverse policy inferred from the object keypoint manifold gradient in the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to the specific method, guiding the system back towards the training distribution to ensure task success even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real robot experiments, achieving an improvement of 77.7% over the base policy in OOD scenarios. Project Website: https://sites.google.com/view/ocr-penn