Authors: Vanessa Figueiredo
Abstract: We study how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding mechanism paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which architectural scaffolds can reliably shape emergent instructional strategies in LLMs.
Authors: Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman
Abstract: In the past two years, large language model (LLM)-based chatbots, such as ChatGPT, have revolutionized various domains by enabling diverse task completion and question-answering capabilities. However, their application in scientific research remains constrained by challenges such as hallucinations, limited domain-specific knowledge, and lack of explainability or traceability for the response. Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising approach to improving chatbot reliability by integrating domain-specific contextual information before response generation, addressing some limitations of standard LLMs. Despite its potential, there are only limited studies that evaluate GraphRAG on specific domains that require intensive knowledge, like Alzheimer's disease or other biomedical domains. In this paper, we assess the quality and traceability of two popular GraphRAG systems. We compile a database of 50 papers and 70 expert questions related to Alzheimer's disease, construct a GraphRAG knowledge base, and employ GPT-4o as the LLM for answering queries. We then compare the quality of responses generated by GraphRAG with those from a standard GPT-4o model. Additionally, we discuss and evaluate the traceability of several Retrieval-Augmented Generation (RAG) and GraphRAG systems. Finally, we provide an easy-to-use interface with a pre-built Alzheimer's disease database for researchers to test the performance of both standard RAG and GraphRAG.
Authors: Sri Ram Macharla, Sridhar Murthy J, Anjaneyulu Pasala
Abstract: MultiFluxAI is an innovative AI platform developed to address the challenges of managing and integrating vast, disparate data sources in product engineering across application domains. It addresses both current and new service related queries that enhance user engagement in the digital ecosystem. This platform leverages advanced AI techniques, such as Generative AI, vectorization, and agentic orchestration to provide dynamic and context-aware responses to complex user queries.
Authors: Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Dongjie Wang, Zijun Yao
Abstract: Medical ontology graphs map external knowledge to medical codes in electronic health records via structured relationships. By leveraging domain-approved connections (e.g., parent-child), predictive models can generate richer medical concept representations by incorporating contextual information from related concepts. However, existing literature primarily focuses on incorporating domain knowledge from a single ontology system, or from multiple ontology systems (e.g., diseases, drugs, and procedures) in isolation, without integrating them into a unified learning structure. Consequently, concept representation learning often remains limited to intra-ontology relationships, overlooking cross-ontology connections. In this paper, we propose LINKO, a large language model (LLM)-augmented integrative ontology learning framework that leverages multiple ontology graphs simultaneously by enabling dual-axis knowledge propagation both within and across heterogeneous ontology systems to enhance medical concept representation learning. Specifically, LINKO first employs LLMs to provide a graph-retrieval-augmented initialization for ontology concept embedding, through an engineered prompt that includes concept descriptions, and is further augmented with ontology context. Second, our method jointly learns the medical concepts in diverse ontology graphs by performing knowledge propagation in two axes: (1) intra-ontology vertical propagation across hierarchical ontology levels and (2) inter-ontology horizontal propagation within every level in parallel. Last, through extensive experiments on two public datasets, we validate the superior performance of LINKO over state-of-the-art baselines. As a plug-in encoder compatible with existing EHR predictive models, LINKO further demonstrates enhanced robustness in scenarios involving limited data availability and rare disease prediction.
Authors: Yi Liao, Yu Gu, Yuan Sui, Zining Zhu, Yifan Lu, Guohua Tang, Zhongqian Sun, Wei Yang
Abstract: Large language models (LLMs) excel at complex reasoning tasks such as mathematics and coding, yet they frequently struggle with simple interactive tasks that young children perform effortlessly. This discrepancy highlights a critical gap between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). Although traditional reinforcement learning (RL) agents can acquire procedural knowledge through environmental interaction, they often operate as black boxes and require substantial training data. In contrast, LLMs possess extensive world knowledge and reasoning capabilities, but are unable to effectively convert this static knowledge into dynamic decision-making in interactive settings. To address this challenge, we propose Think in Games (TiG), a novel framework that empowers LLMs to develop procedural understanding through direct interaction with game environments, while retaining their inherent reasoning and explanatory abilities. Specifically, TiG reformulates RL-based decision-making as a language modeling task: LLMs generate language-guided policies, which are refined iteratively through online reinforcement learning based on environmental feedback. Our experimental results show that TiG successfully bridges the gap between declarative and procedural knowledge, achieving competitive performance with dramatically lower data and computational demands compared to conventional RL methods. Moreover, TiG provides step-by-step natural language explanations for its decisions, greatly improving transparency and interpretability in complex interactive tasks.
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
Abstract: Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
Authors: Bor-Sung Liang
Abstract: The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article will attempt to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. This article proposes a seven-layer model for AI compute architecture, including Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains how AI computing has evolved into this 7-layer architecture through the three-stage evolution on large-scale language models (LLMs). For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compares it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolution from a single AI agent to an AI-based ecosystem, and their impact on the AI industry. Furthermore, AI development involves not only technical challenges but also the economic issues to build self-sustainable ecosystem. This article analyzes the internet industry to provide predictions on the future trajectory of AI development.
Authors: Leonard Frank Neis, Andre Antakli, Matthias Klusch
Abstract: User-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents such as pedestrians, cyclists and autonomous vehicles remains a challenge. We present CARJAN, a novel tool for semi-automated generation and simulation of such scenarios based on the multi-agent engineering framework AJAN and the driving simulator CARLA. CARJAN provides a visual user interface for the modeling, storage and maintenance of traffic scenario layouts, and leverages SPARQL Behavior Tree-based decision-making and interactions for agents in dynamic scenario simulations in CARLA. CARJAN provides a first integrated approach for interactive, intelligent agent-based generation and simulation of virtual traffic scenarios in CARLA.
Authors: Christoph Beierle, Alexander Hahn, Diana Howey, Gabriele Kern-Isberner, Kai Sauerwald
Abstract: Forgetting as a knowledge management operation deliberately ignores parts of the knowledge and beliefs of an agent, for various reasons. Forgetting has many facets, one may want to forget parts of the syntax, a proposition, or a conditional. In the literature, two main operators suitable for performing forgetting have been proposed and investigated in depth: First, variable elimination is a syntactical method that blends out certain atomic variables to focus on the rest of the language. It has been mainly used in the area of logic programming and answer set programming. Second, contraction in AGM belief revision theory effectively removes propositions from belief sets under logical deduction. Both operations rely mainly on classical logics. In this article, we take an epistemic perspective and study forgetting operations in epistemic states with richer semantic structures, but with clear links to propositional logic. This allows us to investigate what forgetting in the epistemic background means, thereby lifting well-known and novel forgetting operations to the epistemic level. We present five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn's ranking functions. We take inspiration from postulates of forgetting both from logic programming and AGM theory to propose a rich landscape of axioms for evaluating forgetting operations. Finally, we evaluate all concrete forgetting operations according to all postulates, leading to a novel comprehensive overview highlighting differences and commonalities among the forgetting operators.
Authors: Niklas Jansen, Jonas G\"osgens, Hector Geffner
Abstract: Consider the problem of learning a lifted STRIPS model of the sliding-tile puzzle from random state-action traces where the states represent the location of the tiles only, and the actions are the labels up, down, left, and right, with no arguments. Two challenges are involved in this problem. First, the states are not full STRIPS states, as some predicates are missing, like the atoms representing the position of the ``blank''. Second, the actions are not full STRIPS either, as they do not reveal all the objects involved in the actions effects and preconditions. Previous approaches have addressed different versions of this model learning problem, but most assume that actions in the traces are full STRIPS actions or that the domain predicates are all observable. The new setting considered in this work is more ``realistic'', as the atoms observed convey the state of the world but not full STRIPS states, and the actions reveal the arguments needed for selecting the action but not the ones needed for modeling it in STRIPS. For formulating and addressing the learning problem, we introduce a variant of STRIPS, which we call STRIPS+, where certain STRIPS action arguments can be left implicit in preconditions which can also involve a limited form of existential quantification. The learning problem becomes the problem of learning STRIPS+ models from STRIPS+ state-action traces. For this, the proposed learning algorithm, called SYNTH, constructs a stratified sequence (conjunction) of precondition expressions or ``queries'' for each action, that denote unique objects in the state and ground the implicit action arguments in STRIPS+. The correctness and completeness of SYNTH is established, and its scalability is tested on state-action traces obtained from STRIPS+ models derived from existing STRIPS domains.
Authors: Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong
Abstract: Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text-masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that highly demand multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% without search and 36.0% accuracy with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.
Authors: Sweta Kaman, Ankita Sharma, Romi Banerjee
Abstract: Background: Wisdom is a superordinate construct that embraces perspective taking, reflectiveness, prosocial orientation, reflective empathetic action, and intellectual humility. Unlike conventional models of reasoning that are rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity, requiring both graded evaluation and self-reflective humility. Current measures depend on self-reports and seldom reflect the humility and uncertainty inherent in wise reasoning. A computational framework that takes into account both multidimensionality and confidence has the potential to improve psychological science and allow humane AI. Method: We present a fuzzy inference system with Z numbers, each of the decisions being expressed in terms of a wisdom score (restriction) and confidence score (certainty). As part of this study, participants (N = 100) were exposed to culturally neutral pictorial moral dilemma tasks to which they generated think-aloud linguistic responses, which were mapped into five theoretically based components of wisdom. The scores of each individual component were combined using a base of 21 rules, with membership functions tuned via Gaussian kernel density estimation. Results: In a proof of concept study, the system produced dual attribute wisdom representations that correlated modestly but significantly with established scales while showing negligible relations with unrelated traits, supporting convergent and divergent validity. Contribution: The contribution is to formalize wisdom as a multidimensional, uncertainty-conscious construct, operationalized in the form of Z-numbers. In addition to progressing measurement in psychology, it calculates how fuzzy Z numbers can provide AI systems with interpretable, confidence-sensitive reasoning that affords a safe, middle ground between rigorous computation and human-like judgment.
Authors: Nicola Gigante, Francesco Leofante, Andrea Micheli
Abstract: Counterfactual Explanations (CEs) are a powerful technique used to explain Machine Learning models by showing how the input to a model should be minimally changed for the model to produce a different output. Similar proposals have been made in the context of Automated Planning, where CEs have been characterised in terms of minimal modifications to an existing plan that would result in the satisfaction of a different goal. While such explanations may help diagnose faults and reason about the characteristics of a plan, they fail to capture higher-level properties of the problem being solved. To address this limitation, we propose a novel explanation paradigm that is based on counterfactual scenarios. In particular, given a planning problem $P$ and an \ltlf formula $\psi$ defining desired properties of a plan, counterfactual scenarios identify minimal modifications to $P$ such that it admits plans that comply with $\psi$. In this paper, we present two qualitative instantiations of counterfactual scenarios based on an explicit quantification over plans that must satisfy $\psi$. We then characterise the computational complexity of generating such counterfactual scenarios when different types of changes are allowed on $P$. We show that producing counterfactual scenarios is often only as expensive as computing a plan for $P$, thus demonstrating the practical viability of our proposal and ultimately providing a framework to construct practical algorithms in this area.
Authors: Eduardo Illueca-Fernandez, Kaile Chen, Fernando Seoane, Farhad Abtahi
Abstract: Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. To address unfamiliarity and improve accessibility, the framework integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLM models through the OpenRouter platform. To test its functionality, the framework successfully processed sepsis data across four proof-of-concept scenarios, demonstrating robust technical performance and its capability to generate reports through automated LLM analysis. LLM evaluation using five independent LLMs as automated evaluators revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0) when evaluated by automated LLM assessors. By integrating multiple Large Language Models (LLMs) for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, making them more accessible to clinicians, data scientists, and researchers. This structured analytics and AI-driven interpretation combination represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications.
Authors: Issa Hanou, Sebastijan Duman\v{c}i\'c, Mathijs de Weerdt
Abstract: We propose a new framework for discovering landmarks that automatically generalize across a domain. These generalized landmarks are learned from a set of solved instances and describe intermediate goals for planning problems where traditional landmark extraction algorithms fall short. Our generalized landmarks extend beyond the predicates of a domain by using state functions that are independent of the objects of a specific problem and apply to all similar objects, thus capturing repetition. Based on these functions, we construct a directed generalized landmark graph that defines the landmark progression, including loop possibilities for repetitive subplans. We show how to use this graph in a heuristic to solve new problem instances of the same domain. Our results show that the generalized landmark graphs learned from a few small instances are also effective for larger instances in the same domain. If a loop that indicates repetition is identified, we see a significant improvement in heuristic performance over the baseline. Generalized landmarks capture domain information that is interpretable and useful to an automated planner. This information can be discovered from a small set of plans for the same domain.
Authors: Yang You, Alex Schutz, Zhikun Li, Bruno Lacerda, Robert Skilton, Nick Hawes
Abstract: Many high-level multi-agent planning problems, including multi-robot navigation and path planning, can be effectively modeled using deterministic actions and observations. In this work, we focus on such domains and introduce the class of Deterministic Decentralized POMDPs (Det-Dec-POMDPs). This is a subclass of Dec-POMDPs characterized by deterministic transitions and observations conditioned on the state and joint actions. We then propose a practical solver called Iterative Deterministic POMDP Planning (IDPP). This method builds on the classic Joint Equilibrium Search for Policies framework and is specifically optimized to handle large-scale Det-Dec-POMDPs that current Dec-POMDP solvers are unable to address efficiently.
Authors: Saravanan Venkatachalam
Abstract: This paper presents an integrated framework that combines traditional network optimization models with large language models (LLMs) to deliver interactive, explainable, and role-aware decision support for supply chain planning. The proposed system bridges the gap between complex operations research outputs and business stakeholder understanding by generating natural language summaries, contextual visualizations, and tailored key performance indicators (KPIs). The core optimization model addresses tactical inventory redistribution across a network of distribution centers for multi-period and multi-item, using a mixed-integer formulation. The technical architecture incorporates AI agents, RESTful APIs, and a dynamic user interface to support real-time interaction, configuration updates, and simulation-based insights. A case study demonstrates how the system improves planning outcomes by preventing stockouts, reducing costs, and maintaining service levels. Future extensions include integrating private LLMs, transfer learning, reinforcement learning, and Bayesian neural networks to enhance explainability, adaptability, and real-time decision-making.
Authors: Ramkumar Natarajan, Muhammad Suhail Saleem, William Xiao, Sandip Aine, Howie Choset, Maxim Likhachev
Abstract: Designing good heuristic functions for graph search requires adequate domain knowledge. It is often easy to design heuristics that perform well and correlate with the underlying true cost-to-go values in certain parts of the search space but these may not be admissible throughout the domain thereby affecting the optimality guarantees of the search. Bounded suboptimal search using several such partially good but inadmissible heuristics was developed in Multi-Heuristic A* (MHA*). Although MHA* leverages multiple inadmissible heuristics to potentially generate a faster suboptimal solution, the original version does not improve the solution over time. It is a one shot algorithm that requires careful setting of inflation factors to obtain a desired one time solution. In this work, we tackle this issue by extending MHA* to an anytime version that finds a feasible suboptimal solution quickly and continually improves it until time runs out. Our work is inspired from the Anytime Repairing A* (ARA*) algorithm. We prove that our precise adaptation of ARA* concepts in the MHA* framework preserves the original suboptimal and completeness guarantees and enhances MHA* to perform in an anytime fashion. Furthermore, we report the performance of A-MHA* in 3-D path planning domain and sliding tiles puzzle and compare against MHA* and other anytime algorithms.
Authors: Farhad Abtahi, Mehdi Astaraki, Fernando Seoane
Abstract: Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Abstract: We present a novel training-free framework, \textit{PosterForest}, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the \textit{Poster Tree}, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.
Authors: Fabrizio Fagiolo, Nicolo' Vescera
Abstract: In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which reduces the qubit requirement too, (ii) an optimize-freeze-reuse strategy: where the circuit topology (``Ansatz'') is first optimized on a training instance by Simulated Annealing (SA), then ``frozen'' and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural research in testing, making the procedure immediately implementable on NISQ hardware. On a set of $40$ randomly generated symmetric instances that span $4 - 7$ cities, the resulting Ansatz achieves an average optimal trip sampling probability of $100\%$ for 4 city cases, $90\%$ for 5 city cases and $80\%$ for 6 city cases. With 7 cities the success rate drops markedly to an average of $\sim 20\%$, revealing the onset of scalability limitations of the proposed method. The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of ``warm-start'' initialization of parameters, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
Authors: Timoth\'ee Loranchet, Charles K. Assaad
Abstract: Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph, which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming having access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.
Authors: Hyeonseong Jeon, Cheolhong Min, Jaesik Park
Abstract: Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.
Authors: Yeawon Lee, Xiaoyang Wang, Christopher C. Yang
Abstract: Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team's reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.
Authors: Allen Wang, Gavin Tao
Abstract: We address vision-guided quadruped motion control with reinforcement learning (RL) and highlight the necessity of combining proprioception with vision for robust control. We propose QuadKAN, a spline-parameterized cross-modal policy instantiated with Kolmogorov-Arnold Networks (KANs). The framework incorporates a spline encoder for proprioception and a spline fusion head for proprioception-vision inputs. This structured function class aligns the state-to-action mapping with the piecewise-smooth nature of gait, improving sample efficiency, reducing action jitter and energy consumption, and providing interpretable posture-action sensitivities. We adopt Multi-Modal Delay Randomization (MMDR) and perform end-to-end training with Proximal Policy Optimization (PPO). Evaluations across diverse terrains, including both even and uneven surfaces and scenarios with static or dynamic obstacles, demonstrate that QuadKAN achieves consistently higher returns, greater distances, and fewer collisions than state-of-the-art (SOTA) baselines. These results show that spline-parameterized policies offer a simple, effective, and interpretable alternative for robust vision-guided locomotion. A repository will be made available upon acceptance.
Authors: Hao Xu, Zhichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira
Abstract: Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment's probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
Authors: Kyohoon Jin, Juhwan Choi, Jungmin Yun, Junho Lee, Soojin Jang, Youngbin Kim
Abstract: Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
Authors: Nazanin Siavash, Armin Moin
Abstract: This paper introduces a novel research direction for model-to-text/code transformations by leveraging Large Language Models (LLMs) that can be enhanced with Retrieval-Augmented Generation (RAG) pipelines. The focus is on quantum and hybrid quantum-classical software systems, where model-driven approaches can help reduce the costs and mitigate the risks associated with the heterogeneous platform landscape and lack of developers' skills. We validate one of the proposed ideas regarding generating code out of UML model instances of software systems. This Python code uses a well-established library, called Qiskit, to execute on gate-based or circuit-based quantum computers. The RAG pipeline that we deploy incorporates sample Qiskit code from public GitHub repositories. Experimental results show that well-engineered prompts can improve CodeBLEU scores by up to a factor of four, yielding more accurate and consistent quantum code. However, the proposed research direction can go beyond this through further investigation in the future by conducting experiments to address our other research questions and ideas proposed here, such as deploying software system model instances as the source of information in the RAG pipelines, or deploying LLMs for code-to-code transformations, for instance, for transpilation use cases.
Authors: Zezhong Jin, Shubhang Desai, Xu Chen, Biyi Fang, Zhuoyi Huang, Zhe Li, Chong-Xin Gan, Xiao Tu, Man-Wai Mak, Yan Lu, Shujie Liu
Abstract: In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56\% reduction in character error rate (CER) and an 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide an demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
Authors: Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
Abstract: Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
Authors: Dilruk Perera, Gousia Habib, Qianyi Xu, Daniel J. Tan, Kai He, Erik Cambria, Mengling Feng
Abstract: Reinforcement learning (RL) marks a fundamental shift in how artificial intelligence is applied in healthcare. Instead of merely predicting outcomes, RL actively decides interventions with long term goals. Unlike traditional models that operate on fixed associations, RL systems learn through trial, feedback, and long-term reward optimization, introducing transformative possibilities and new risks. From an information fusion lens, healthcare RL typically integrates multi-source signals such as vitals, labs clinical notes, imaging and device telemetry using temporal and decision-level mechanisms. These systems can operate within centralized, federated, or edge architectures to meet real-time clinical constraints, and naturally span data, features and decision fusion levels. This survey explore RL's rise in healthcare as more than a set of tools, rather a shift toward agentive intelligence in clinical environments. We first structure the landscape of RL techniques including model-based and model-free methods, offline and batch-constrained approaches, and emerging strategies for reward specification and uncertainty calibration through the lens of healthcare constraints. We then comprehensively analyze RL applications spanning critical care, chronic disease, mental health, diagnostics, and robotic assistance, identifying their trends, gaps, and translational bottlenecks. In contrast to prior reviews, we critically analyze RL's ethical, deployment, and reward design challenges, and synthesize lessons for safe, human-aligned policy learning. This paper serves as both a a technical roadmap and a critical reflection of RL's emerging transformative role in healthcare AI not as prediction machinery, but as agentive clinical intelligence.
Authors: Abdul Rehman, Ilona Heldal, Jerry Chun-Wei Lin
Abstract: Recent advancements in EEG-based emotion recognition have shown promising outcomes using both deep learning and classical machine learning approaches; however, most existing studies focus narrowly on binary valence prediction or subject-specific classification, which limits generalizability and deployment in real-world affective computing systems. To address this gap, this paper presents a unified, multigranularity EEG emotion classification framework built on the GAMEEMO dataset, which consists of 14-channel EEG recordings and continuous self-reported emotion ratings (boring, horrible, calm, and funny) from 28 subjects across four emotion-inducing gameplay scenarios. Our pipeline employs a structured preprocessing strategy that comprises temporal window segmentation, hybrid statistical and frequency-domain feature extraction, and z-score normalization to convert raw EEG signals into robust, discriminative input vectors. Emotion labels are derived and encoded across three complementary axes: (i) binary valence classification based on the averaged polarity of positive and negative emotion ratings, and (ii) Multi-class emotion classification, where the presence of the most affective state is predicted. (iii) Fine-grained multi-label representation via binning each emotion into 10 ordinal classes. We evaluate a broad spectrum of models, including Random Forest, XGBoost, and SVM, alongside deep neural architectures such as LSTM, LSTM-GRU, and CNN-LSTM. Among these, the LSTM-GRU model consistently outperforms the others, achieving an F1-score of 0.932 in the binary valence task and 94.5% and 90.6% in both multi-class and Multi-Label emotion classification.
Authors: Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
Abstract: Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to rollout in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
Authors: Tatyana Matveeva, Aleksandr Katrutsa, Evgeny Frolov
Abstract: Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods, approximating the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we utilize fast symmetric factorization for computing the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of a preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster or matches the performance of diagonal adaptive optimizers when using rank five and smaller rank approximations. This demonstrates AdaGram's potential as a scalable solution for adaptive optimization in large models.
Authors: Dongjun Lee, Changho Hwang, Kimin Lee
Abstract: Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator's solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.
Authors: Georgios Vamvouras, Konstantinos Braimakis, Christos Tzivanidis
Abstract: This paper presents a Deep Learning (DL) framework for 48-hour forecasting of temperature, solar irradiance, and relative humidity to support Model Predictive Control (MPC) in smart HVAC systems. The approach employs a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with attention, capturing temporal and cross-feature dependencies by jointly predicting all three variables. Historical meteorological data (2019-2022) with encoded cyclical time features were used for training, while 2023 data evaluated generalization. The model achieved Mean Absolute Errors of 1.3 degrees Celsius (temperature), 31 W/m2 (irradiance), and 6.7 percentage points (humidity), outperforming state-of-the-art numerical weather prediction and machine learning benchmarks. Integrated Gradients quantified feature contributions, and attention weights revealed temporal patterns, enhancing interpretability. By combining multivariate forecasting, attention-based DL, and explainability, this work advances data-driven weather prediction. The demonstrated accuracy and transparency highlight the framework's potential for energy-efficient building control through reliable short-term meteorological forecasting.
Authors: Evan J. Chou (University of California San Diego, Pasadena City College), Lisa S. Locke (Jet Propulsion Laboratory California Institute of Technology), Harvey M. Soldan (Jet Propulsion Laboratory California Institute of Technology)
Abstract: The Deep Space Network (DSN) is NASA's largest network of antenna facilities that generate a large volume of multivariate time-series data. These facilities contain DSN antennas and transmitters that undergo degradation over long periods of time, which may cause costly disruptions to the data flow and threaten the earth-connection of dozens of spacecraft that rely on the Deep Space Network for their lifeline. The purpose of this study was to experiment with different methods that would be able to assist JPL engineers with directly pinpointing anomalies and equipment degradation through collected data, and continue conducting maintenance and operations of the DSN for future space missions around our universe. As such, we have researched various machine learning techniques that can fully reconstruct data through predictive analysis, and determine anomalous data entries within real-time datasets through statistical computations and thresholds. On top of the fully trained and tested machine learning models, we have also integrated the use of a reinforcement learning subsystem that classifies identified anomalies based on severity level and a Large Language Model that labels an explanation for each anomalous data entry, all of which can be improved and fine-tuned over time through human feedback/input. Specifically, for the DSN transmitters, we have also implemented a full data pipeline system that connects the data extraction, parsing, and processing workflow all together as there was no coherent program or script for performing these tasks before. Using this data pipeline system, we were able to then also connect the models trained from DSN antenna data, completing the data workflow for DSN anomaly detection. This was all wrapped around and further connected by an agentic AI system, where complex reasoning was utilized to determine the classifications and predictions of anomalous data.
Authors: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
Abstract: The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, introduce EO-Robotics, consists of EO-1 model and EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
Authors: Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization~(BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
Authors: Harris Song, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and na\"ive fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
Authors: Kevin Putra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi
Abstract: High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log-Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state-of-the-art methods in terms of perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
Authors: Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggles with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts the RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate the state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Abstract: Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the "Claude" label consistently boosts scores, while the "Gemini" label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini's self-scores collapsed under true labels, while Claude's self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.
Authors: Matteo Pinna, Andrea Ceni, Claudio Gallicchio
Abstract: Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.
Authors: Ziheng Chen, Jin Huang, Jiali Cheng, Yuchan Guo, Mengjie Wang, Lalitesh Morishetti, Kaushiki Nag, Hadi Amiri
Abstract: Tree ensembles are widely recognized for their effectiveness in classification tasks, achieving state-of-the-art performance across diverse domains, including bioinformatics, finance, and medical diagnosis. With increasing emphasis on data privacy and the \textit{right to be forgotten}, several unlearning algorithms have been proposed to enable tree ensembles to forget sensitive information. However, existing methods are often tailored to a particular model or rely on the discrete tree structure, making them difficult to generalize to complex ensembles and inefficient for large-scale datasets. To address these limitations, we propose FUTURE, a novel unlearning algorithm for tree ensembles. Specifically, we formulate the problem of forgetting samples as a gradient-based optimization task. In order to accommodate non-differentiability of tree ensembles, we adopt the probabilistic model approximations within the optimization framework. This enables end-to-end unlearning in an effective and efficient manner. Extensive experiments on real-world datasets show that FUTURE yields significant and successful unlearning performance.
Authors: Deepro Choudhury, Sinead Williamson, Adam Goli\'nski, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
Authors: Christopher R. Lee-Jenkins
Abstract: Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common ``manifold traversal'' intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
Authors: Arash Ahmadi, Sarah Sharif, Yaser Banad
Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-fiash, on key metrics. This research also proposes exact match accuracy in multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
Authors: Han Yang, Jian Lan, Yihong Liu, Hinrich Sch\"utze, Thomas Seidl
Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while an extension of compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
Authors: Zhizhong Huang, Xiaoming Liu
Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture \textit{identity-sensitive} features critical for ReID. This paper proposes Visual In-Context Prompting~(VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only \textit{in-context examples} as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models~(VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (\eg, DINO) to extract ID-discriminative features via \textit{dynamic visual prompts}. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
Authors: Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Abstract: Automatic Speech Recognition (ASR) systems often struggle to accurately process children's speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children's speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT, Data2Vec, and WavLM in improving the performance of ASR for children's speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
Authors: Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
Authors: Aditya Makineni, Baocheng Geng, Qing Tian
Abstract: Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training, and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure, and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied on both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
Authors: Ahmad Alomari, Sathish A. P. Kumar
Abstract: This study proposes an HCQA for designing optimal Quantum Sensor Circuits (QSCs) to address complex quantum physics problems. The HCQA integrates computational intelligence techniques by leveraging a Deep Q-Network (DQN) for learning and policy optimization, enhanced by a quantum-based action selection mechanism based on the Q-values. A quantum circuit encodes the agent current state using Ry gates, and then creates a superposition of possible actions. Measurement of the circuit results in probabilistic action outcomes, allowing the agent to generate optimal QSCs by selecting sequences of gates that maximize the Quantum Fisher Information (QFI) while minimizing the number of gates. This computational intelligence-driven HCQA enables the automated generation of entangled quantum states, specifically the squeezed states, with high QFI sensitivity for quantum state estimation and control. Evaluation of the HCQA on a QSC that consists of two qubits and a sequence of Rx, Ry, and S gates demonstrates its efficiency in generating optimal QSCs with a QFI of 1. This work highlights the synergy between AI-driven learning and quantum computation, illustrating how intelligent agents can autonomously discover optimal quantum circuit designs for enhanced sensing and estimation tasks.
Authors: Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
Abstract: Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, a MTWV score of 0.7003 and probability of false alarm and probability of miss of 0.0164 and 0.0547 respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated for an additional CMU dataset. Overall the results highlight the significant contribution of SSL features in enhancing Zero-Shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
Authors: Mohammad Amin Nabian, Sanjay Choudhry
Abstract: The computational cost associated with high-fidelity CFD simulations remains a significant bottleneck in the automotive design and optimization cycle. While ML-based surrogate models have emerged as a promising alternative to accelerate aerodynamic predictions, the field is characterized by a diverse and rapidly evolving landscape of specialized neural network architectures, with no single model demonstrating universal superiority. This paper introduces a novel meta-learning framework that leverages this architectural diversity as a strength. We propose a Mixture of Experts (MoE) model that employs a dedicated gating network to dynamically and optimally combine the predictions from three heterogeneous, state-of-the-art surrogate models: DoMINO, a decomposable multi-scale neural operator; X-MeshGraphNet, a scalable multi-scale graph neural network; and FigConvNet, a factorized implicit global convolution network. The gating network learns a spatially-variant weighting strategy, assigning credibility to each expert based on its localized performance in predicting surface pressure and wall shear stress fields. To prevent model collapse and encourage balanced expert contributions, we integrate an entropy regularization term into the training loss function. The entire system is trained and validated on the DrivAerML dataset, a large-scale, public benchmark of high-fidelity CFD simulations for automotive aerodynamics. Quantitative results demonstrate that the MoE model achieves a significant reduction in L-2 prediction error, outperforming not only the ensemble average but also the most accurate individual expert model across all evaluated physical quantities. This work establishes the MoE framework as a powerful and effective strategy for creating more robust and accurate composite surrogate models by synergistically combining the complementary strengths of specialized architectures.
Authors: Laxmisha Ashok Attisara, Sathish Kumar
Abstract: In the rapidly evolving field of quantum computing, optimizing quantum circuits for specific tasks is crucial for enhancing performance and efficiency. More recently, quantum sensing has become a distinct and rapidly growing branch of research within the area of quantum science and technology. The field is expected to provide new opportunities, especially regarding high sensitivity and precision. Entanglement is one of the key factors in achieving high sensitivity and measurement precision [3]. This paper presents a novel approach utilizing quantum machine learning techniques to optimize entanglement distribution in quantum sensor circuits. By leveraging reinforcement learning within a quantum environment, we aim to optimize the entanglement layout to maximize Quantum Fisher Information (QFI) and entanglement entropy, which are key indicators of a quantum system's sensitivity and coherence, while minimizing circuit depth and gate counts. Our implementation, based on Qiskit, integrates noise models and error mitigation strategies to simulate realistic quantum environments. The results demonstrate significant improvements in circuit performance and sensitivity, highlighting the potential of machine learning in quantum circuit optimization by measuring high QFI and entropy in the range of 0.84-1.0 with depth and gate count reduction by 20-86%.
Authors: Laxmisha Ashok Attisara, Sathish Kumar
Abstract: As the number of qubits in a sensor increases, the complexity of designing and controlling the quantum circuits grows exponentially. Manually optimizing these circuits becomes infeasible. Optimizing entanglement distribution in large-scale quantum circuits is critical for enhancing the sensitivity and efficiency of quantum sensors [5], [6]. This paper presents an engineering integration of reinforcement learning with tensor-network-based simulation (MPS) for scalable circuit optimization for optimizing quantum sensor circuits with up to 60 qubits. To enable efficient simulation and scalability, we adopt tensor network methods, specifically the Matrix Product State (MPS) representation, instead of traditional state vector or density matrix approaches. Our reinforcement learning agent learns to restructure circuits to maximize Quantum Fisher Information (QFI) and entanglement entropy while reducing gate counts and circuit depth. Experimental results show consistent improvements, with QFI values approaching 1, entanglement entropy in the 0.8-1.0 range, and up to 90% reduction in depth and gate count. These results highlight the potential of combining quantum machine learning and tensor networks to optimize complex quantum circuits under realistic constraints.
Authors: Minda Zhao
Abstract: Recommender systems struggle to provide accurate suggestions to new users with limited interaction history, a challenge known as the cold-user problem. This paper proposes a reinforcement learning approach using Double and Dueling Deep Q-Networks (DQN) to dynamically learn user preferences from sparse feedback, enhancing recommendation accuracy without relying on sensitive demographic data. By integrating these advanced DQN variants with a matrix factorization model, we achieve superior performance on a large e-commerce dataset compared to traditional methods like popularity-based and active learning strategies. Experimental results show that our method, particularly Dueling DQN, reduces Root Mean Square Error (RMSE) for cold users, offering an effective solution for privacy-constrained environments.
Authors: Roy M. Gabriel, Mohammadreza Zandehshahvar, Marly van Assen, Nattakorn Kittisut, Kyle Peters, Carlo N. De Cecco, Ali Adibi
Abstract: To reduce the amount of required labeled data for lung disease severity classification from chest X-rays (CXRs) under class imbalance, this study applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function. This retrospective study collected 2,319 CXRs from 963 patients (mean age, 59.2 $\pm$ 16.6 years; 481 female) at Emory Healthcare affiliated hospitals between January and November 2020. All patients had clinically confirmed COVID-19. Each CXR was independently labeled by 3 to 6 board-certified radiologists as normal, moderate, or severe. A deep neural network with Monte Carlo Dropout was trained using active learning to classify disease severity. Various acquisition functions were used to iteratively select the most informative samples from an unlabeled pool. Performance was evaluated using accuracy, area under the receiver operating characteristic curve (AU ROC), and area under the precision-recall curve (AU PRC). Training time and acquisition time were recorded. Statistical analysis included descriptive metrics and performance comparisons across acquisition strategies. Entropy Sampling achieved 93.7% accuracy (AU ROC, 0.91) in binary classification (normal vs. diseased) using 15.4% of the training data. In the multi-class setting, Mean STD sampling achieved 70.3% accuracy (AU ROC, 0.86) using 23.1% of the labeled data. These methods outperformed more complex and computationally expensive acquisition functions and significantly reduced labeling needs. Deep active learning with BNN approximation and weighted loss effectively reduces labeled data requirements while addressing class imbalance, maintaining or exceeding diagnostic performance.
Authors: Hui Chen, Antoine Didisheim, Luciano Somoza, Hanqing Tian
Abstract: Emerging techniques in computer science make it possible to "brain scan" large language models (LLMs), identify the plain-English concepts that guide their reasoning, and steer them while holding other factors constant. We show that this approach can map LLM-generated economic forecasts to concepts such as sentiment, technical analysis, and timing, and compute their relative importance without reducing performance. We also show that models can be steered to be more or less risk-averse, optimistic, or pessimistic, which allows researchers to correct or simulate biases. The method is transparent, lightweight, and replicable for empirical research in the social sciences.
Authors: Daria Kryvosheieva, Saba Sturua, Michael G\"unther, Scott Martens, Han Xiao
Abstract: jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
Authors: Jo\~ao Guilherme Alves Santos, Giovana Kerche Bon\'as, Thales Sales Almeida
Abstract: With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
Authors: Shihao Ji, Zihui Song
Abstract: Continual or Lifelong Learning aims to develop models capable of acquiring new knowledge from a sequence of tasks without catastrophically forgetting what has been learned before. Existing approaches often rely on storing samples from previous tasks (experience replay) or employing complex regularization terms to protect learned weights. However, these methods face challenges related to data privacy, storage limitations, and performance degradation when tasks are dissimilar. To address these challenges, we introduce MyGO (Memory Yielding Generative Offline-consolidation), a novel lifelong learning framework inspired by the biological wake-sleep cycle. During the "wake" phase, the system rapidly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture its data distribution. During the "sleep" phase, the system enters an offline state, using all learned G-mem models to generate pseudo-data ("dreams") and consolidate new and old knowledge into a core feature extractor via knowledge distillation. This approach obviates the need to store any raw data, retaining only compact generative models, which offers significant advantages in privacy and storage efficiency. We evaluate MyGO on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks, comparing it against a sequential fine-tuning baseline. The results demonstrate that MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks, proving the framework's effectiveness and domain-generality.
Authors: Jie Zhu, Chihao Shen, Ziyang Li, Jiahao Yu, Yizheng Chen, Kexin Pei
Abstract: Directed fuzzing aims to find program inputs that lead to specified target program states. It has broad applications, such as debugging system crashes, confirming reported bugs, and generating exploits for potential vulnerabilities. This task is inherently challenging because target states are often deeply nested in the program, while the search space manifested by numerous possible program inputs is prohibitively large. Existing approaches rely on branch distances or manually-specified constraints to guide the search; however, the branches alone are often insufficient to precisely characterize progress toward reaching the target states, while the manually specified constraints are often tailored for specific bug types and thus difficult to generalize to diverse target states and programs. We present Locus, a novel framework to improve the efficiency of directed fuzzing. Our key insight is to synthesize predicates to capture fuzzing progress as semantically meaningful intermediate states, serving as milestones towards reaching the target states. When used to instrument the program under fuzzing, they can reject executions unlikely to reach the target states, while providing additional coverage guidance. To automate this task and generalize to diverse programs, Locus features an agentic framework with program analysis tools to synthesize and iteratively refine the candidate predicates, while ensuring the predicates strictly relax the target states to prevent false rejections via symbolic execution. Our evaluation shows that Locus substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, achieving an average speedup of 41.6x. So far, Locus has found eight previously unpatched bugs, with one already acknowledged with a draft patch.
Authors: Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Abstract: Generative models have been successfully used in the field of time series generation. However, when dealing with long-term time series, which span over extended periods and exhibit more complex long-term temporal patterns, the task of generation becomes significantly more challenging. Long-term time series exhibit long-range temporal dependencies, but their data distribution also undergoes gradual changes over time. Finding a balance between these long-term dependencies and the drift in data distribution is a key challenge. On the other hand, long-term time series contain more complex interrelationships between different feature sequences, making the task of effectively capturing both intra-sequence and inter-sequence dependencies another important challenge. To address these issues, we propose Stage-Diff, a staged generative model for long-term time series based on diffusion models. First, through stage-wise sequence generation and inter-stage information transfer, the model preserves long-term sequence dependencies while enabling the modeling of data distribution shifts. Second, within each stage, progressive sequence decomposition is applied to perform channel-independent modeling at different time scales, while inter-stage information transfer utilizes multi-channel fusion modeling. This approach combines the robustness of channel-independent modeling with the information fusion advantages of multi-channel modeling, effectively balancing the intra-sequence and inter-sequence dependencies of long-term time series. Extensive experiments on multiple real-world datasets validate the effectiveness of Stage-Diff in long-term time series generation tasks.
Authors: Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Falk Scholer, Christina Lioma
Abstract: Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: https://github.com/theresiavr/stairway-to-fairness.
Authors: Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Abstract: Time series synthesis is an effective approach to ensuring the secure circulation of time series data. Existing time series synthesis methods typically perform temporal modeling based on random sequences to generate target sequences, which often struggle to ensure the temporal dependencies in the generated time series. Additionally, directly modeling temporal features on random sequences makes it challenging to accurately capture the feature information of the original time series. To address the above issues, we propose a simple but effective generative model \textbf{D}ual-\textbf{L}ayer \textbf{G}enerative \textbf{A}dversarial \textbf{N}etworks, named \textbf{DLGAN}. The model decomposes the time series generation process into two stages: sequence feature extraction and sequence reconstruction. First, these two stages form a complete time series autoencoder, enabling supervised learning on the original time series to ensure that the reconstruction process can restore the temporal dependencies of the sequence. Second, a Generative Adversarial Network (GAN) is used to generate synthetic feature vectors that align with the real-time sequence feature vectors, ensuring that the generator can capture the temporal features from real time series. Extensive experiments on four public datasets demonstrate the superiority of this model across various evaluation metrics.
Authors: Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux
Abstract: In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations - the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of injected noise into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.
Authors: Yulin Liu, Mocca Schweitzer
Abstract: The Decentralized Physical Infrastructure (DePIN) market is revolutionizing the sharing economy through token-based economics and smart contracts that govern decentralized operations. By 2024, DePIN projects have exceeded \$10 billion in market capitalization, underscoring their rapid growth. However, the unregulated nature of these markets, coupled with the autonomous deployment of AI agents in smart contracts, introduces risks such as inefficiencies and potential misalignment with human values. To address these concerns, we introduce EconAgentic, a Large Language Model (LLM)-powered framework designed to mitigate these challenges. Our research focuses on three key areas: 1) modeling the dynamic evolution of DePIN markets, 2) evaluating stakeholders' actions and their economic impacts, and 3) analyzing macroeconomic indicators to align market outcomes with societal goals. Through EconAgentic, we simulate how AI agents respond to token incentives, invest in infrastructure, and adapt to market conditions, comparing AI-driven decisions with human heuristic benchmarks. Our results show that EconAgentic provides valuable insights into the efficiency, inclusion, and stability of DePIN markets, contributing to both academic understanding and practical improvements in the design and governance of decentralized, tokenized economies.
Authors: Shubham Sharma, Sneha Tuli, Narendra Badam
Abstract: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.
Authors: Chenduo Ying, Linkang Du, Peng Cheng, Yuanchao Shu
Abstract: Large language models (LLMs) demonstrate remarkable capabilities in reasoning and code generation, enabling robotic manipulation to be initiated with just a single instruction. The LLM carries out various tasks by generating policy code required to control the robot. Despite advances in LLMs, achieving reliable policy code generation remains a significant challenge due to the diverse requirements of real-world tasks and the inherent complexity of user instructions. In practice, different users may provide distinct instructions to drive the robot for the same task, which may cause the unreliability of policy code generation. To bridge this gap, we design RoboInspector, a pipeline to unveil and characterize the unreliability of the policy code for LLM-enabled robotic manipulation from two perspectives: the complexity of the manipulation task and the granularity of the instruction. We perform comprehensive experiments with 168 distinct combinations of tasks, instructions, and LLMs in two prominent frameworks. The RoboInspector identifies four main unreliable behaviors that lead to manipulation failure. We provide a detailed characterization of these behaviors and their underlying causes, giving insight for practical development to reduce unreliability. Furthermore, we introduce a refinement approach guided by failure policy code feedback that improves the reliability of policy code generation by up to 35% in LLM-enabled robotic manipulation, evaluated in both simulation and real-world environments.
Authors: Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek
Abstract: Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. We find strong monotonic trends in playing strength and puzzle-solving ability across layers, yet policy distributions frequently follow non-smooth trajectories. Evidence for this includes correct puzzle solutions that are discovered early but subsequently discarded, move rankings that remain poorly correlated with final outputs, and high policy divergence until late in the network. These findings contrast with the smooth distributional convergence typically observed in language models.
Authors: Alexandre Kabbach
Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal--understood as the average both in the normative and mathematical sense of the term--proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind--a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.
Authors: Tanguy Herserant, Vincent Guigue
Abstract: This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.
Authors: Guofu Liao, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, Dacheng Tao
Abstract: Fine-tuning large language models (LLMs) is crucial for adapting them to specific tasks, yet it remains computationally demanding and raises concerns about correctness and privacy, particularly in untrusted environments. Although parameter-efficient methods like Low-Rank Adaptation (LoRA) significantly reduce resource requirements, ensuring the security and verifiability of fine-tuning under zero-knowledge constraints remains an unresolved challenge. To address this, we introduce zkLoRA, the first framework to integrate LoRA fine-tuning with zero-knowledge proofs (ZKPs), achieving provable security and correctness. zkLoRA employs advanced cryptographic techniques -- such as lookup arguments, sumcheck protocols, and polynomial commitments -- to verify both arithmetic and non-arithmetic operations in Transformer-based architectures. The framework provides end-to-end verifiability for forward propagation, backward propagation, and parameter updates during LoRA fine-tuning, while safeguarding the privacy of model parameters and training data. Leveraging GPU-based implementations, zkLoRA demonstrates practicality and efficiency through experimental validation on open-source LLMs like LLaMA, scaling up to 13 billion parameters. By combining parameter-efficient fine-tuning with ZKPs, zkLoRA bridges a critical gap, enabling secure and trustworthy deployment of LLMs in sensitive or untrusted environments.
Authors: Cheng-Yeh Yang, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
Abstract: A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely-used average pooling approach.
Authors: Felix Simon Reimers, Carl-Hendrik Peters, Stefano Nichele
Abstract: Using data from mobile network utilization in Norway, we showcase the possibility of monitoring the state of communication and mobility networks with a non-invasive, low-cost method. This method transforms the network data into a model within the framework of reservoir computing and then measures the model's performance on proxy tasks. Experimentally, we show how the performance on these proxies relates to the state of the network. A key advantage of this approach is that it uses readily available data sets and leverages the reservoir computing framework for an inexpensive and largely agnostic method. Data from mobile network utilization is available in an anonymous, aggregated form with multiple snapshots per day. This data can be treated like a weighted network. Reservoir computing allows the use of weighted, but untrained networks as a machine learning tool. The network, initialized as a so-called echo state network (ESN), projects incoming signals into a higher dimensional space, on which a single trained layer operates. This consumes less energy than deep neural networks in which every weight of the network is trained. We use neuroscience inspired tasks and trained our ESN model to solve them. We then show how the performance depends on certain network configurations and also how it visibly decreases when perturbing the network. While this work serves as proof of concept, we believe it can be elevated to be used for near-real-time monitoring as well as the identification of possible weak spots of both mobile communication networks as well as transportation networks.
Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
Authors: Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, Yaroslav Zharov
Abstract: Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering ( SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these strategies within SWE-agent on SWE-bench Verified across five diverse model configurations. We find that a simple observation-masking strategy halves cost relative to a raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. For example, with Qwen3-Coder 480B, masking improves solve rate from 53.8% (raw agent) to 54.8%, while remaining competitive with summarization at a lower cost. These results suggest that, at least within SWE-agent on SWE-bench Verified, the most effective and efficient context management can be the simplest. We release code and data for reproducibility
Authors: Francisco Caetano, Christiaan Viviers, Peter H. H. de With, Fons van der Sommen
Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
Authors: Xiaoxi Cui, Weihai Lu, Yu Tong, Yiheng Li, Zhejun Zhao
Abstract: In click-through rate prediction, click-through rate prediction is used to model users' interests. However, most of the existing CTR prediction methods are mainly based on the ID modality. As a result, they are unable to comprehensively model users' multi-modal preferences. Therefore, it is necessary to introduce multi-modal CTR prediction. Although it seems appealing to directly apply the existing multi-modal fusion methods to click-through rate prediction models, these methods (1) fail to effectively disentangle commonalities and specificities across different modalities; (2) fail to consider the synergistic effects between modalities and model the complex interactions between modalities. To address the above issues, this paper proposes the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN) framework for click-through prediction. This framework introduces three innovative modules: the Multi-modal Feature Enhancement (MFE) Module Synergistic Relationship Capture (SRC) Module, and the Feature Dynamic Adaptive Fusion (FDAF) Module. The MFE Module and SRC Module extract synergistic, common, and special information among different modalities. They effectively enhances the representation of the modalities, improving the overall quality of the fusion. To encourage distinctiveness among different features, we design a Knowledge Decoupling method. Additionally, the FDAF Module focuses on capturing user preferences and reducing fusion noise. To validate the effectiveness of the Diff-MSIN framework, we conducted extensive experiments using the Rec-Tmall and three Amazon datasets. The results demonstrate that our approach yields a significant improvement of at least 1.67% compared to the baseline, highlighting its potential for enhancing multi-modal recommendation systems. Our code is available at the following link: https://github.com/Cxx-0/Diff-MSIN.
Authors: Seungyeon Choi, Hwanhee Kim, Chihyun Park, Dahyeon Lee, Seungyong Lee, Yoonju Kim, Hyoungjoon Park, Sein Kwon, Youngwan Jo, Sanghyun Park
Abstract: Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose CByG, a novel framework extending Bayesian Flow Network into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed CByG framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world drug discovery applications.
Authors: Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
URLs: https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
Authors: Sara B. Coutinho, Rafael M. O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti
Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers-selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity, also extended by performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. The most diverse pool is identified and selected for ensemble construction from these. The selection process incorporates an evaluation metric reflecting each classifiers's performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project's repository: https://github.com/SaraBCoutinho/HSFN .
Authors: Pascal R. van der Vaart, Neil Yorke-Smith, Matthijs T. J. Spaan
Abstract: Uncertainty quantification in reinforcement learning can greatly improve exploration and robustness. Approximate Bayesian approaches have recently been popularized to quantify uncertainty in model-free algorithms. However, so far the focus has been on improving the accuracy of the posterior approximation, instead of studying the accuracy of the prior and likelihood assumptions underlying the posterior. In this work, we demonstrate that there is a cold posterior effect in Bayesian deep Q-learning, where contrary to theory, performance increases when reducing the temperature of the posterior. To identify and overcome likely causes, we challenge common assumptions made on the likelihood and priors in Bayesian model-free algorithms. We empirically study prior distributions and show through statistical tests that the common Gaussian likelihood assumption is frequently violated. We argue that developing more suitable likelihoods and priors should be a key focus in future Bayesian reinforcement learning research and we offer simple, implementable solutions for better priors in deep Q-learning that lead to more performant Bayesian algorithms.
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination-producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short-videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that positional encoding strategy contributes to alleviating SAH, and further adopt DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
Authors: Geri Skenderi
Abstract: Graph Neural Networks (GNNs) have recently shown promise as solvers for Boolean Satisfiability Problems (SATs) by operating on graph representations of logical formulas. However, their performance degrades sharply on harder instances, raising the question of whether this reflects fundamental architectural limitations. In this work, we provide a geometric explanation through the lens of graph Ricci Curvature (RC), which quantifies local connectivity bottlenecks. We prove that bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that this curvature decreases with instance difficulty. Building on this, we show that GNN-based SAT solvers are affected by oversquashing, a phenomenon where long-range dependencies become impossible to compress into fixed-length representations. We validate our claims empirically across different SAT benchmarks and confirm that curvature is both a strong indicator of problem complexity and can be used to predict performance. Finally, we connect our findings to design principles of existing solvers and outline promising directions for future work.
Authors: Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, Michael Firman
Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
Authors: Jens Leysen, Marco Favier, Bart Goethals
Abstract: Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data `necessity' difficult to implement.
Authors: Yujin Park, Haejun Chung, Ikbeom Jang
Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
Authors: Julen Cestero, Carmine Delle Femine, Kenji S. Muro, Marco Quartulli, Marcello Restelli
Abstract: Physics-Informed Neural Networks (PINNs) present a transformative approach for smart grid modeling by integrating physical laws directly into learning frameworks, addressing critical challenges of data scarcity and physical consistency in conventional data-driven methods. This paper evaluates PINNs' capabilities as surrogate models for smart grid dynamics, comparing their performance against XGBoost, Random Forest, and Linear Regression across three key experiments: interpolation, cross-validation, and episodic trajectory prediction. By training PINNs exclusively through physics-based loss functions (enforcing power balance, operational constraints, and grid stability) we demonstrate their superior generalization, outperforming data-driven models in error reduction. Notably, PINNs maintain comparatively lower MAE in dynamic grid operations, reliably capturing state transitions in both random and expert-driven control scenarios, while traditional models exhibit erratic performance. Despite slight degradation in extreme operational regimes, PINNs consistently enforce physical feasibility, proving vital for safety-critical applications. Our results contribute to establishing PINNs as a paradigm-shifting tool for smart grid surrogation, bridging data-driven flexibility with first-principles rigor. This work advances real-time grid control and scalable digital twins, emphasizing the necessity of physics-aware architectures in mission-critical energy systems.
Authors: Wuque Cai, Hongze Sun, Jiayi He, Qianqian Liao, Yunliang Zang, Duo Chen, Dezhong Yao, Daqing Guo
Abstract: Spiking neural networks (SNNs) are artificial neural networks based on simulated biological neurons and have attracted much attention in recent artificial intelligence technology studies. The dendrites in biological neurons have efficient information processing ability and computational power; however, the neurons of SNNs rarely match the complex structure of the dendrites. Inspired by the nonlinear structure and highly sparse properties of neuronal dendrites, in this study, we propose an efficient, lightweight SNN method with nonlinear pruning and dendritic integration (NSPDI-SNN). In this method, we introduce nonlinear dendritic integration (NDI) to improve the representation of the spatiotemporal information of neurons. We implement heterogeneous state transition ratios of dendritic spines and construct a new and flexible nonlinear synaptic pruning (NSP) method to achieve the high sparsity of SNN. We conducted systematic experiments on three benchmark datasets (DVS128 Gesture, CIFAR10-DVS, and CIFAR10) and extended the evaluation to two complex tasks (speech recognition and reinforcement learning-based maze navigation task). Across all tasks, NSPDI-SNN consistently achieved high sparsity with minimal performance degradation. In particular, our method achieved the best experimental results on all three event stream datasets. Further analysis showed that NSPDI significantly improved the efficiency of synaptic information transfer as sparsity increased. In conclusion, our results indicate that the complex structure and nonlinear computation of neuronal dendrites provide a promising approach for developing efficient SNN methods.
Authors: Tobias Deu{\ss}er, Lorenz Sparrenberg, Armin Berger, Max Hahnb\"uck, Christian Bauckhage, Rafet Sifa
Abstract: The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.
Authors: Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Abstract: Supervised Fine-Tuning (SFT) Large Language Models (LLM) fundamentally rely on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our \method consistently enhances the quality of seed data and boosts LLM's performance with improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.
Authors: Zuzanna Gawrysiak, Krzysztof Krawiec
Abstract: We present PhISM, a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions. \mname outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights thanks to interpretable latent representation.
Authors: Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Abstract: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (August 27 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
Authors: Imran S. A. Khan, Emmanuel G. Blanchard, S\'ebastien George
Abstract: This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.
Authors: Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to sequence translation in last decade led to modern techniques first detecting words and then inputting one word at a time to a model to directly output full words as sequence of characters. This allowed better utilization of language models and bypass error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word level OCR to line-level OCR. The proposal allows to bypass errors in word detection, and provides larger sentence context for better utilization of language models. We show that the proposed technique not only improves the accuracy but also efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such shift from word to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
Authors: Amirhossein Nazeri, Wael Hafez
Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positives and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
Authors: Jiazheng Xing, Hai Ci, Hongbin Xu, Hangjie Yuan, Yong Liu, Mike Zheng Shou
Abstract: Watermarking diffusion-generated images is crucial for copyright protection and user tracking. However, current diffusion watermarking methods face significant limitations: zero-bit watermarking systems lack the capacity for large-scale user tracking, while multi-bit methods are highly sensitive to certain image transformations or generative attacks, resulting in a lack of comprehensive robustness. In this paper, we propose OptMark, an optimization-based approach that embeds a robust multi-bit watermark into the intermediate latents of the diffusion denoising process. OptMark strategically inserts a structural watermark early to resist generative attacks and a detail watermark late to withstand image transformations, with tailored regularization terms to preserve image quality and ensure imperceptibility. To address the challenge of memory consumption growing linearly with the number of denoising steps during optimization, OptMark incorporates adjoint gradient methods, reducing memory usage from O(N) to O(1). Experimental results demonstrate that OptMark achieves invisible multi-bit watermarking while ensuring robust resilience against valuemetric transformations, geometric transformations, editing, and regeneration attacks.
Authors: Jo\~ao Valente, Atabak Dehban, Rodrigo Ventura
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur; common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA's of these models with CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
Authors: Maya Guhan (Center for Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA), Meghan E. Hurley (Center for Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA), Eric A. Storch (Department of Psychiatry and Behavioral Sciences, Baylor College of Medicine, Houston, TX, USA), John Herrington (Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children's Hospital of Philadelphia, Philadelphia, PA, USA), Casey Zampella (Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children's Hospital of Philadelphia, Philadelphia, PA, USA), Julia Parish-Morris (Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children's Hospital of Philadelphia, Philadelphia, PA, USA), Gabriel L\'azaro-Mu\~noz (Center for Bioethics, Harvard Medical School, Boston, MA, USA), Kristin Kostick-Quenet (Center for Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA)
Abstract: Artificial intelligence (AI)-based computer perception (CP) technologies use mobile sensors to collect behavioral and physiological data for clinical decision-making. These tools can reshape how clinical knowledge is generated and interpreted. However, effective integration of these tools into clinical workflows depends on how developers balance clinical utility with user acceptability and trustworthiness. Our study presents findings from 20 in-depth interviews with developers of AI-based CP tools. Interviews were transcribed and inductive, thematic analysis was performed to identify 4 key design priorities: 1) to account for context and ensure explainability for both patients and clinicians; 2) align tools with existing clinical workflows; 3) appropriately customize to relevant stakeholders for usability and acceptability; and 4) push the boundaries of innovation while aligning with established paradigms. Our findings highlight that developers view themselves as not merely technical architects but also ethical stewards, designing tools that are both acceptable by users and epistemically responsible (prioritizing objectivity and pushing clinical knowledge forward). We offer the following suggestions to help achieve this balance: documenting how design choices around customization are made, defining limits for customization choices, transparently conveying information about outputs, and investing in user training. Achieving these goals will require interdisciplinary collaboration between developers, clinicians, and ethicists.
Authors: Hamza Ezzaoui Rahali, Abhilasha Dave, Larry Ruckman, Mohammad Mehdi Rahimifar, Audrey C. Therrien, James J. Russel, Ryan T. Herbst
Abstract: The LCLS-II Free Electron Laser (FEL) will generate X-ray pulses for beamline experiments at rates of up to 1~MHz, with detectors producing data throughputs exceeding 1 TB/s. Managing such massive data streams presents significant challenges, as transmission and storage infrastructures become prohibitively expensive. Machine learning (ML) offers a promising solution for real-time data reduction, but conventional implementations introduce excessive latency, making them unsuitable for high-speed experimental environments. To address these challenges, SLAC developed the SLAC Neural Network Library (SNL), a specialized framework designed to deploy real-time ML inference models on Field-Programmable Gate Arrays (FPGA). SNL's key feature is the ability to dynamically update model weights without requiring FPGA resynthesis, enhancing flexibility for adaptive learning applications. To further enhance usability and accessibility, we introduce Auto-SNL, a Python extension that streamlines the process of converting Python-based neural network models into SNL-compatible high-level synthesis code. This paper presents a benchmark comparison against hls4ml, the current state-of-the-art tool, across multiple neural network architectures, fixed-point precisions, and synthesis configurations targeting a Xilinx ZCU102 FPGA. The results showed that SNL achieves competitive or superior latency in most tested architectures, while in some cases also offering FPGA resource savings. This adaptation demonstrates SNL's versatility, opening new opportunities for researchers and academics in fields such as high-energy physics, medical imaging, robotics, and many more.
Authors: Diane Tchuindjo, Omar Khattab
Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
Authors: Nattapong Kurpukdee, Adrian G. Bors
Abstract: We propose a realistic scenario for the unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent a complex and rich spatio-temporal media information, widely used in many applications, but which have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while having labeled data is costly and not practical. To address this gap, we study the unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage the use of transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, without using any labels or class boundaries.
Authors: Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Abstract: Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' \k{appa}. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' \k{appa} 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
Authors: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
Authors: In\'es Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80\% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance--most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
Authors: Xiaoyang Wang, Christopher C. Yang
Abstract: Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. MoE-Health architecture is specifically developed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality prediction, long length of stay, and hospital readmission prediction. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments with heterogeneous data availability.
Authors: Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
Authors: Navid Aftabi, Abhishek Hanchate, Satish Bukkapatnam, Dan Li
Abstract: Industry 4.0's highly networked Machine Tool Controllers (MTCs) are prime targets for replay attacks that use outdated sensor data to manipulate actuators. Dynamic watermarking can reveal such tampering, but current schemes assume linear-Gaussian dynamics and use constant watermark statistics, making them vulnerable to the time-varying, partly proprietary behavior of MTCs. We close this gap with DynaMark, a reinforcement learning framework that models dynamic watermarking as a Markov decision process (MDP). It learns an adaptive policy online that dynamically adapts the covariance of a zero-mean Gaussian watermark using available measurements and detector feedback, without needing system knowledge. DynaMark maximizes a unique reward function balancing control performance, energy consumption, and detection confidence dynamically. We develop a Bayesian belief updating mechanism for real-time detection confidence in linear systems. This approach, independent of specific system assumptions, underpins the MDP for systems with linear dynamics. On a Siemens Sinumerik 828D controller digital twin, DynaMark achieves a reduction in watermark energy by 70% while preserving the nominal trajectory, compared to constant variance baselines. It also maintains an average detection delay equivalent to one sampling interval. A physical stepper-motor testbed validates these findings, rapidly triggering alarms with less control performance decline and exceeding existing benchmarks.
Authors: Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Abstract: Context recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR that is carefully designed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we futher develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3\% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
Authors: Haichao Zhang, We Xu, Haonan Yu
Abstract: Pre-training with offline data and online fine-tuning using reinforcement learning is a promising strategy for learning control policies by leveraging the best of both worlds in terms of sample efficiency and performance. One natural approach is to initialize the policy for online learning with the one trained offline. In this work, we introduce a policy expansion scheme for this task. After learning the offline policy, we use it as one candidate policy in a policy set. We then expand the policy set with another policy which will be responsible for further learning. The two policies will be composed in an adaptive manner for interacting with the environment. With this approach, the policy previously learned offline is fully retained during online learning, thus mitigating the potential issues such as destroying the useful behaviors of the offline policy in the initial stage of online learning while allowing the offline policy participate in the exploration naturally in an adaptive manner. Moreover, new useful behaviors can potentially be captured by the newly added policy through learning. Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach. Code is available at https://github.com/Haichao-Zhang/PEX
Authors: Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu
Abstract: Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
Authors: Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
Abstract: Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique that enhances Large Language Model (LLM) inference in tasks like Question Answering (QA) by retrieving relevant information from knowledge graphs (KGs). However, real-world KGs are often incomplete, meaning that essential information for answering questions may be missing. Existing benchmarks do not adequately capture the impact of KG incompleteness on KG-RAG performance. In this paper, we systematically evaluate KG-RAG methods under incomplete KGs by removing triples using different methods and analyzing the resulting effects. We demonstrate that KG-RAG methods are sensitive to KG incompleteness, highlighting the need for more robust approaches in realistic settings.
Authors: Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang
Abstract: Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reliable benchmarks and systematic methodologies. A critical challenge is the inherent hallucination in LLMs, which leads to synthetic GPS datasets that are often noisy, unverified, and self-contradictory. To address this, we introduce TrustGeoGen, a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our \textit{GeoExplore} series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking. Using this engine, we create the GeoTrust-200K dataset and the corresponding GeoTrust-test benchmark, both with guaranteed cross-modal integrity. Experiments reveal that state-of-the-art models achieve only 45.83\% accuracy on GeoTrust-test, highlighting its significant challenge. Furthermore, training on our synthesized data substantially improves model performance on GPS tasks, with strong generalization to out-of-domain (OOD) benchmarks. Our code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen
Authors: Jan Speller, Malte Luttermann, Marcel Gehrke, Tanya Braun
Abstract: Probabilistic graphical models that encode indistinguishable objects and relations among them use first-order logic constructs to compress a propositional factorised model for more efficient (lifted) inference. To obtain a lifted representation, the state-of-the-art algorithm Advanced Colour Passing (ACP) groups factors that represent matching distributions. In an approximate version using $\varepsilon$ as a hyperparameter, factors are grouped that differ by a factor of at most $(1\pm \varepsilon)$. However, finding a suitable $\varepsilon$ is not obvious and may need a lot of exploration, possibly requiring many ACP runs with different $\varepsilon$ values. Additionally, varying $\varepsilon$ can yield wildly different models, leading to decreased interpretability. Therefore, this paper presents a hierarchical approach to lifted model construction that is hyperparameter-free. It efficiently computes a hierarchy of $\varepsilon$ values that ensures a hierarchy of models, meaning that once factors are grouped together given some $\varepsilon$, these factors will be grouped together for larger $\varepsilon$ as well. The hierarchy of $\varepsilon$ values also leads to a hierarchy of error bounds. This allows for explicitly weighing compression versus accuracy when choosing specific $\varepsilon$ values to run ACP with and enables interpretability between the different models.
Authors: Xiaoran Liu, Istvan David
Abstract: Insufficient data volume and quality are particularly pressing challenges in the adoption of modern subsymbolic AI. To alleviate these challenges, AI simulation uses virtual training environments in which AI agents can be safely and efficiently developed with simulated, synthetic data. Digital twins open new avenues in AI simulation, as these high-fidelity virtual replicas of physical systems are equipped with state-of-the-art simulators and the ability to further interact with the physical system for additional data collection. In this article, we report on our systematic survey of digital twin-enabled AI simulation. By analyzing 22 primary studies, we identify technological trends and derive a reference framework to situate digital twins and AI components. Based on our findings, we derive a reference framework and provide architectural guidelines by mapping it onto the ISO 23247 reference architecture for digital twins. Finally, we identify challenges and research opportunities for prospective researchers.
Authors: Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, approximately generate similar results as the standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
Authors: Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
Abstract: Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
Authors: Kaouther Mouheb, Mobina Ghojogh Nejad, Lavsen Dahal, Ehsan Samei, Kyle J. Lafata, W. Paul Segars, Joseph Y. Lo
Abstract: Accurate 3D modeling of human organs is critical for constructing digital phantoms in virtual imaging trials. However, organs such as the large intestine remain particularly challenging due to their complex geometry and shape variability. We propose CLAP, a novel Conditional LAtent Point-diffusion model that combines geometric deep learning with denoising diffusion models to enhance 3D representations of the large intestine. Given point clouds sampled from segmentation masks, we employ a hierarchical variational autoencoder to learn both global and local latent shape representations. Two conditional diffusion models operate within this latent space to refine the organ shape. A pretrained surface reconstruction model is then used to convert the refined point clouds into meshes. CLAP achieves substantial improvements in shape modeling accuracy, reducing Chamfer distance by 26% and Hausdorff distance by 36% relative to the initial suboptimal shapes. This approach offers a robust and extensible solution for high-fidelity organ modeling, with potential applicability to a wide range of anatomical structures.
Authors: Ricardo Cannizzaro, Michael Groom, Jonathan Routley, Robert Osazuwa Ness, Lars Kunze
Abstract: Manipulation tasks require robots to reason about cause and effect when interacting with objects. Yet, many data-driven approaches lack causal semantics and thus only consider correlations. We introduce COBRA-PPM, a novel causal Bayesian reasoning architecture that combines causal Bayesian networks and probabilistic programming to perform interventional inference for robot manipulation under uncertainty. We demonstrate its capabilities through high-fidelity Gazebo-based experiments on an exemplar block stacking task, where it predicts manipulation outcomes with high accuracy (Pred Acc: 88.6%) and performs greedy next-best action selection with a 94.2% task success rate. We further demonstrate sim2real transfer on a domestic robot, showing effectiveness in handling real-world uncertainty from sensor noise and stochastic actions. Our generalised and extensible framework supports a wide range of manipulation scenarios and lays a foundation for future work at the intersection of robotics and causality.
Authors: Simone Scardapane
Abstract: Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming. This primer is an introduction to this fascinating field imagined for someone, like Alice, who has just ventured into this strange differentiable wonderland. I overview the basics of optimizing a function via automatic differentiation, and a selection of the most common designs for handling sequences, graphs, texts, and audios. The focus is on a intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.
Authors: John T. Halloran, Manbir Gulati, Paul F. Roysdon
Abstract: Mamba state-space models (SSMs) have recently outperformed state-of-the-art (SOTA) Transformer large language models (LLMs) in various tasks and been widely adapted. However, a major concern for stable learning in recurrent-based deep models (such as SSMs) is the sensitivity of their recurrent dynamics. Despite widespread adaptation, the sensitivity of Mamba's recurrent dynamics under common fine-tuning methods-e.g., mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT)-remains unexplored. Empirically, we show that Mamba LLMs are extremely stable to changes introduced by combinations of MPFT and PEFT, in stark contrast to Transformer LLMs, which we demonstrate may drastically diverge from their respective full-precision counterparts under different combinations of MPFT and PEFT (despite the near-ubiquitous adaptation of these fine-tuning frameworks for attention-based models). The demonstrated robustness of Mamba LLMs are due to their recurrent dynamics, which we prove are guaranteed to be stable using dynamical systems theory (in particular, Lyapunov stability). We conclude by using MPFT and PEFT to novelly study Mamba LLMs' in-context learning (ICL) abilities on natural language tasks, thus supplementing other recent work.
Authors: Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz
Abstract: Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity perception. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content,an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity and disagreement.The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
Authors: Nikolas Adaloglou, Tim Kaiser, Damir Iagudin, Markus Kollmann
Abstract: Guidance is a widely used technique for diffusion models to enhance sample quality. Technically, guidance is realised by using an auxiliary model that generalises more broadly than the primary model. Using a 2D toy example, we first show that it is highly beneficial when the auxiliary model exhibits similar but stronger generalisation errors than the primary model. Based on this insight, we introduce \emph{masked sliding window guidance (M-SWG)}, a novel, training-free method. M-SWG upweights long-range spatial dependencies by guiding the primary model with itself by selectively restricting its receptive field. M-SWG requires neither access to model weights from previous iterations, additional training, nor class conditioning. M-SWG achieves a superior Inception score (IS) compared to previous state-of-the-art training-free approaches, without introducing sample oversaturation. In conjunction with existing guidance methods, M-SWG reaches state-of-the-art Frechet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL. The code is available at https://github.com/HHU-MMBS/swg_bmvc2025_official.
Authors: Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung
Abstract: Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since the ubiquitous categorical data naturally lack a well-defined metric space such as the Euclidean distance space of numerical data, the distribution of categorical data is usually under-represented, and thus valuable information can be easily twisted in clustering. This paper, therefore, introduces a novel order distance metric learning approach to intuitively represent categorical attribute values by learning their optimal order relationship and quantifying their distance in a line similar to that of the numerical attributes. Since subjectively created qualitative categorical values involve ambiguity and fuzziness, the order distance metric is learned in the context of clustering. Accordingly, a new joint learning paradigm is developed to alternatively perform clustering and order distance metric learning with low time complexity and a guarantee of convergence. Due to the clustering-friendly order learning mechanism and the homogeneous ordinal nature of the order distance and Euclidean distance, the proposed method achieves superior clustering accuracy on categorical and mixed datasets. More importantly, the learned order distance metric greatly reduces the difficulty of understanding and managing the non-intuitive categorical data. Experiments with ablation studies, significance tests, case studies, etc., have validated the efficacy of the proposed method. The source code is available at https://github.com/DAJ0612/OCL_Source_Code.
Authors: Xue Tan, Hao Luan, Mingyu Luo, Xiaoyan Sun, Ping Chen, Jun Dai
Abstract: Retrieval-Augmented Generation (RAG) enriches the input to LLMs by retrieving information from the relevant knowledge database, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%.
Authors: Yang Wu, Huayi Zhang, Yizheng Jiao, Lin Ma, Xiaozhong Liu, Jinhong Yu, Dongyu Zhang, Dezhi Yu, Wei Xu
Abstract: Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on the crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a novel Reward-Oriented inStruction data sElection method which leverages pairwise preference loss as a reward signal to optimize data selection for task-specific instruction tuning. Specifically, ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points. Experimental results show that by selecting just 5\% of the training data using ROSE, our approach can achieve competitive results compared to fine-tuning with the full training dataset, and it surpasses other state-of-the-art data selection methods for task-specific instruction tuning. Our qualitative analysis further confirms the robust generalizability of our method across multiple benchmark datasets and diverse model architectures.
Authors: Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Abstract: Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance MT models. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs' retrieval-augmented MT ability. RAGtrans contains 169K MT samples collected via GPT-4o and human translators. Besides, documents from various languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET scores in En-Zh, and 1.7-2.9 BLEU and 2.1-2.7 COMET scores in En-De. We also conclude the critical difficulties that current LLMs face with this task.
Authors: Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee
Abstract: Tackling toxic behavior in digital communication continues to be a pressing concern for both academics and industry professionals. While significant research has explored toxicity on platforms like social networks and discussion boards, podcasts despite their rapid rise in popularity remain relatively understudied in this context. This work seeks to fill that gap by curating a dataset of political podcast transcripts and analyzing them with a focus on conversational structure. Specifically, we investigate how toxicity surfaces and intensifies through sequences of replies within these dialogues, shedding light on the organic patterns by which harmful language can escalate across conversational turns. Warning: Contains potentially abusive/toxic contents.
Authors: Sijia Gu, Noor Nashid, Ali Mesbah
Abstract: Automating unit test generation remains a significant challenge, particularly for complex methods in real-world projects. While Large Language Models (LLMs) have made strides in code generation, they struggle to achieve high branch coverage due to their limited ability to reason about intricate control flow structures. To address this limitation, we introduce Panta, a technique that emulates the iterative process human developers follow when analyzing code and constructing test cases. Panta integrates static control flow analysis and dynamic code coverage analysis to systematically guide LLMs in identifying uncovered execution paths and generating better test cases. By incorporating an iterative feedback-driven mechanism, our technique continuously refines test generation based on static and dynamic path coverage insights, ensuring more comprehensive and effective testing. Our empirical evaluation, conducted on classes with high cyclomatic complexity from open-source projects, demonstrates that Panta achieves 26% higher line coverage and 23% higher branch coverage compared to the state-of-the-art.
Authors: Chen Gong, Kecen Li, Zinan Lin, Tianhao Wang
Abstract: Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advancements, we find that inconsistent--and sometimes flawed--evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future advancements. To address the issue, this paper introduces DPImageBench for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each based on model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to thoroughly assess them. Notably, we find that a common practice of selecting downstream classifiers based on the highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects for these mistakes. (3) Platform. Despite the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we have several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between pretraining and sensitive images significantly impacts the performance of the synthetic images and does not always yield improvements. In addition, adding noise to low-dimensional features, such as the high-level characteristics of sensitive images, is less affected by the privacy budget compared to adding noise to high-dimensional features, like weight gradients. The former methods perform better than the latter under a low privacy budget.
Authors: Ziheng Chen, Jiali Cheng, Hadi Amiri, Kaushiki Nag, Lu Lin, Xiangguo Sun, Gabriele Tolomei
Abstract: With growing emphasis on privacy regulations, machine unlearning has become increasingly critical in real-world applications such as social networks and recommender systems, many of which are naturally represented as graphs. However, existing graph unlearning methods often modify nodes or edges indiscriminately, overlooking their impact on fairness. For instance, forgetting links between users of different genders may inadvertently exacerbate group disparities. To address this issue, we propose a novel framework that jointly optimizes both the graph structure and the model to achieve fair unlearning. Our method rewires the graph by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. We further introduce a worst-case evaluation mechanism to assess robustness under challenging scenarios. Experiments on real-world datasets show that our approach achieves more effective and fair unlearning than existing baselines.
Authors: Shahryar Zehtabi, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
Abstract: Much of federated learning (FL) focuses on settings where local dataset statistics remain the same between training and testing. However, this assumption often does not hold in practice due to distribution shifts, motivating the development of domain generalization (DG) approaches that leverage source domain data to train models capable of generalizing to unseen target domains. In this paper, we are motivated by two major gaps in existing work on FL and DG: (1) the lack of formal mathematical analysis of DG objectives; and (2) DG research in FL being limited to the star-topology architecture. We develop Decentralized Federated Domain Generalization with Style Sharing ($\textit{StyleDDG}$), a decentralized DG algorithm which allows devices in a peer-to-peer network to achieve DG based on sharing style information inferred from their datasets. Additionally, we provide the first systematic approach to analyzing style-based DG training in decentralized networks. We cast existing centralized DG algorithms within our framework, and employ their formalisms to model $\textit{StyleDDG}$. We then obtain analytical conditions under which convergence of $\textit{StyleDDG}$ can be guaranteed. Through experiments on popular DG datasets, we demonstrate that $\textit{StyleDDG}$ can obtain significant improvements in accuracy across target domains with minimal communication overhead compared to baseline decentralized gradient methods.
Authors: Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the mode's decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
URLs: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
Authors: Jiaan Wang, Fandong Meng, Jie Zhou
Abstract: Recently, deep reasoning LLMs (e.g., OpenAI o1 and DeepSeek-R1) have shown promising performance in various downstream tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation. However, the task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning (RL). Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought processes. The reward model teaches DeepTrans how to think and free-translate the given sentences during RL. Besides, our RL training does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning LLMs. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.
Authors: Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity
Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
Authors: Georgios Syros, Anshuman Suri, Jacob Ginesin, Cristina Nita-Rotaru, Alina Oprea
Abstract: Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents, mitigating potential damage from malicious agents. Several proposed agentic system designs address agent identity, authorization, and delegation, but remain purely theoretical, without concrete implementation and evaluation. Most importantly, they do not provide user-controlled agent management. To address this gap, we propose SAGA, a scalable Security Architecture for Governing Agentic systems, that offers user oversight over their agents' lifecycle. In our design, users register their agents with a central entity, the Provider, that maintains agent contact information, user-defined access control policies, and helps agents enforce these policies on inter-agent communication. We introduce a cryptographic mechanism for deriving access control tokens, that offers fine-grained control over an agent's interaction with other agents, providing formal security guarantees. We evaluate SAGA on several agentic tasks, using agents in different geolocations, and multiple on-device and cloud LLMs, demonstrating minimal performance overhead with no impact on underlying task utility in a wide range of conditions. Our architecture enables secure and trustworthy deployment of autonomous agents, accelerating the responsible adoption of this technology in sensitive environments.
Authors: Junsheng Huang (May), Zhitao He (May), Yucheng Huang (May), Sandeep Polisetty (May), Qingyun Wang (May), Yi. R (May), Fung
Abstract: The hallucination of non-existent facts by LLMs is an important problem given its widespread adoption across various applications. Previous research addresses this problem by analyzing the internal parameterized knowledge boundaries to estimate confidence. However, these studies focus on the single-problem setting and have not explored the more challenging multi-problem setting, which requires accurately answering multiple questions simultaneously. We introduce a novel method for the multi-problem setting, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25\% in average precision.
Authors: Donghun Noh, Deqian Kong, Minglu Zhao, Andrew Lizarraga, Jianwen Xie, Ying Nian Wu, Dennis Hong
Abstract: We present the Latent Adaptive Planner (LAP), a trajectory-level latent-variable policy for dynamic nonprehensile manipulation (e.g., box catching) that formulates planning as inference in a low-dimensional latent space and is learned effectively from human demonstration videos. During execution, LAP achieves real-time adaptation by maintaining a posterior over the latent plan and performing variational replanning as new observations arrive. To bridge the embodiment gap between humans and robots, we introduce a model-based proportional mapping that regenerates accurate kinematic-dynamic joint states and object positions from human demonstrations. Through challenging box catching experiments with varying object properties, LAP demonstrates superior success rates, trajectory smoothness, and energy efficiency by learning human-like compliant motions and adaptive behaviors. Overall, LAP enables dynamic manipulation with real-time adaptation and successfully transfer across heterogeneous robot platforms using the same human demonstration videos.
Authors: Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li
Abstract: Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.
Authors: Wenqing Peng, Zhi-Song Liu, Michael Boy
Abstract: Estimating rate coefficients from complex chemical reactions is essential for advancing detailed chemistry. However, the stiffness inherent in real-world atmospheric chemistry systems poses severe challenges, leading to training instability and poor convergence, which hinder effective rate coefficient estimation using learning-based approaches. To address this, we propose a Stiff Physics-Informed Neural ODE framework (SPIN-ODE) for chemical reaction modelling. Our method introduces a three-stage optimisation process: first, a black-box neural ODE is trained to fit concentration trajectories; second, a Chemical Reaction Neural Network (CRNN) is pre-trained to learn the mapping between concentrations and their time derivatives; and third, the rate coefficients are fine-tuned by integrating with the pre-trained CRNN. Extensive experiments on both synthetic and newly proposed real-world datasets validate the effectiveness and robustness of our approach. As the first work addressing stiff neural ODE for chemical rate coefficient discovery, our study opens promising directions for integrating neural networks with detailed chemistry.
Authors: Bo Ai, Liu Dai, Nico Bohlinger, Dichen Li, Tongzhou Mu, Zhanxin Wu, K. Fay, Henrik I. Christensen, Jan Peters, Hao Su
Abstract: Cross-embodiment generalization underpins the vision of building generalist embodied agents for any robot, yet its enabling factors remain poorly understood. We investigate embodiment scaling laws, the hypothesis that increasing the number of training embodiments improves generalization to unseen ones, using robot locomotion as a test bed. We procedurally generate ~1,000 embodiments with topological, geometric, and joint-level kinematic variations, and train policies on random subsets. We observe positive scaling trends supporting the hypothesis, and find that embodiment scaling enables substantially broader generalization than data scaling on fixed embodiments. Our best policy, trained on the full dataset, transfers zero-shot to novel embodiments in simulation and the real world, including the Unitree Go2 and H1. These results represent a step toward general embodied intelligence, with relevance to adaptive control for configurable robots, morphology co-design, and beyond.
Authors: Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
Authors: Zishuai Zhang, Hainan zhang, Weihua Li, Qinnan zhang, jin Dong, Yongxin Tong, Zhiming Zheng
Abstract: Private data holds promise for improving LLMs due to its high quality, but its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based federated split models are proposed, which offload most model parameters to the server (or distributed clients) while retaining only a small portion on the client to ensure data privacy. Despite this design, they still face three challenges: 1) Peer-to-peer key encryption struggles to secure transmitted vectors effectively; 2) The auto-regressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) Fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FedSEA-LLaMA, a Secure, Efficient, and Adaptive Federated splitting framework based on LLaMA2. First, we inject Gaussian noise into forward-pass hidden states to enable secure end-to-end vector transmission. Second, we employ attention-mask compression and KV cache collaboration to reduce communication costs, accelerating training and inference. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements. Experiments on natural language understanding, summarization, and conversational QA tasks show that FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 8x speedups in training and inference. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FedSEA-LLaMA in security and adaptability.
Authors: Jatin Kumar Arora, Soutrik Bandyopadhyay, Shubhendu Bhasin
Abstract: Path planning for autonomous robots presents a fundamental trade-off between optimality and safety. While conventional algorithms typically prioritize one of these objectives, we introduce the Unified Path Planner (UPP), a unified framework that simultaneously addresses both. UPP is a graph-search-based algorithm that employs a modified heuristic function incorporating a dynamic safety cost, enabling an adaptive balance between path length and obstacle clearance. We establish theoretical sub-optimality bounds for the planner and demonstrate that its safety-to-optimality ratio can be tuned via adjustable parameters, with a trade-off in computational complexity. Extensive simulations show that UPP achieves a high success rate, generating near-optimal paths with only a negligible increase in cost over traditional A*, while ensuring safety margins that closely approach those of the classical Voronoi planner. Finally, the practical efficacy of UPP is validated through a hardware implementation on a TurtleBot, confirming its ability to navigate cluttered environments by generating safe, sub-optimal paths.
Authors: Joydeep Chandra, Aleksandr Algazinov, Satyam Kumar Navneet, Rim El Filali, Matt Laing, Andrew Hanna
Abstract: In the age of open and free information, a concerning trend of reliance on AI is emerging. However, existing AI tools struggle to evaluate the credibility of information and to justify their assessments. Hence, there is a growing need for systems that can help users evaluate the trustworthiness of online information. Although major search engines incorporate AI features, they often lack clear reliability indicators. We present TrueGL, a model that makes trustworthy search results more accessible. The model is a fine-tuned version of IBM's Granite-1B, trained on the custom dataset and integrated into a search engine with a reliability scoring system. We evaluate the system using prompt engineering and assigning each statement a continuous reliability score from 0.1 to 1, then instructing the model to return a textual explanation alongside the score. Each model's predicted scores are measured against real scores using standard evaluation metrics. TrueGL consistently outperforms other small-scale LLMs and rule-based approaches across all experiments on key evaluation metrics, including MAE, RMSE, and R2. The model's high accuracy, broad content coverage, and ease of use make trustworthy information more accessible and help reduce the spread of false or misleading content online. Our code is publicly available at https://github.com/AlgazinovAleksandr/TrueGL, and our model is publicly released at https://huggingface.co/JoydeepC/trueGL.
URLs: https://github.com/AlgazinovAleksandr/TrueGL,, https://huggingface.co/JoydeepC/trueGL.
Authors: Jie Zhang, Qinghua Zhao, Chi-ho Lin, Zhongfeng Kang, Lei Li
Abstract: Memorization in large language models poses critical risks for privacy and fairness as these systems scale to billions of parameters. While previous studies established correlations between memorization and factors like token frequency and repetition patterns, we revealed distinct response patterns: frequency increases minimally impact memorized samples (e.g. 0.09) while substantially affecting non-memorized samples (e.g., 0.25), with consistency observed across model scales. Through counterfactual analysis by perturbing sample prefixes and quantifying perturbation strength through token positional changes, we demonstrate that redundancy correlates with memorization patterns. Our findings establish that: about 79% of memorized samples are low-redundancy, these low-redundancy samples exhibit 2-fold higher vulnerability than high-redundancy ones, and consequently memorized samples drop by 0.6 under perturbation while non-memorized samples drop by only 0.01, indicating that more redundant content becomes both more memorable and more fragile. These findings suggest potential redundancy-guided approaches for data preprocessing, thereby reducing privacy risks and mitigating bias to ensure fairness in model deployments.
Authors: Javad Enayati, Pedram Asef, Alexandre Benoit
Abstract: This paper introduces a novel hybrid AI method combining H filtering and an adaptive linear neuron network for flicker component estimation in power distribution systems.The proposed method leverages the robustness of the H filter to extract the voltage envelope under uncertain and noisy conditions followed by the use of ADALINE to accurately identify flicker frequencies embedded in the envelope.This synergy enables efficient time domain estimation with rapid convergence and noise resilience addressing key limitations of existing frequency domain approaches.Unlike conventional techniques this hybrid AI model handles complex power disturbances without prior knowledge of noise characteristics or extensive training.To validate the method performance we conduct simulation studies based on IEC Standard 61000 4 15 supported by statistical analysis Monte Carlo simulations and real world data.Results demonstrate superior accuracy robustness and reduced computational load compared to Fast Fourier Transform and Discrete Wavelet Transform based estimators.
Authors: Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes
Abstract: Understanding how carbon flows through the soil is crucial for mitigating the effects of climate change. While soils have potential to sequester carbon from the atmosphere, the soil carbon cycle remains poorly understood. Scientists have developed mathematical process-based models of the soil carbon cycle based on existing knowledge, but they contain numerous unknown parameters that must be set in an ad-hoc manner, and often fit observations poorly. On the other hand, neural networks can learn patterns from data, but do not respect known scientific laws, nor can they reveal novel scientific relationships due to their black-box nature. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN leverages Kolmogorov-Arnold networks (KAN) to ensure the encoder is fully interpretable and reveals relationships between input features and latent parameters; it uses novel smoothness penalties to balance expressivity and simplicity. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge. While the process-based decoder enforces established scientific knowledge, the KAN-based encoder reveals new scientific relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability -- it can infer latent scientific mechanisms and their relationships with input features.
Authors: Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du
Abstract: Rotations have become essential to state-of-the-art quantization pipelines for large language models (LLMs) by effectively smoothing outliers in weights and activations. However, further optimizing the rotation parameters offers only limited performance gains and introduces significant training overhead: due to rotation parameter sharing, full-model must be loaded simultaneously to enable backpropagation, resulting in substantial memory consumption and limited practical utility. In this work, we identify two fundamental limitations of current rotational quantization methods: (i) rotation fails to align channel means, resulting in wider quantization bounds and increased rounding errors; and (ii) rotation makes the activation distribution more Gaussian-like, increasing energy loss caused by clipping errors. To address these issues, we introduce \textbf{BASE-Q}, a simple yet powerful approach that combines bias correction and asymmetric scaling to effectively reduce rounding and clipping errors. Furthermore, BASE-Q enables blockwise optimization, eliminating the need for memory-intensive full-model backpropagation. Extensive experiments on various LLMs and benchmarks demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5\%, 42.9\%, and 29.2\% compared to QuaRot, SpinQuant, and OSTQuant, respectively. The code will be released soon.
Authors: Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
Authors: Jaewook Lee, Alexander Scarlatos, Andrew Lan
Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
Authors: Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
Authors: Zezhen Xiang, Jingzhi Gong, Tao Chen
Abstract: Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, the workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real-time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both the local and global drifts using dually hierarchical adaptation: at the upper level, we redivide the data into different divisions, within each of which the local model is retrained, to handle global drifts only when necessary. At the lower level, the local models of the divisions can detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Through evaluating eight software systems and against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvements, while incurring reasonable overhead and is able to improve different local models in handling concept drift.
Authors: Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen
Abstract: Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning. The code (https://github.com/yyysjz1997/Time-RA) and dataset (https://huggingface.co/datasets/Time-RA/RATs40K) have been fully open-sourced to support and accelerate future research in this area.
URLs: https://github.com/yyysjz1997/Time-RA), https://huggingface.co/datasets/Time-RA/RATs40K)
Authors: Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at https://github.com/zjukg/SKA-Bench.
Authors: Junjie Cao
Abstract: Speech-to-text alignment is a critical component of neural text to speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line, while non-autoregressive end to end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that can give promising phoneme-level duration distribution with given text. In our experiments, the proposed duration model has more precise prediction and adaptation ability to conditions, compared to previous baseline models. Specifically, it makes a considerable improvement on phoneme-level alignment accuracy and makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
Authors: Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu
Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LATEX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
Authors: Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu
Abstract: Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in Pass at 1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout tokens budget. This highlights our method's ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.
Authors: Hongliang Wei, Xianqi Zhang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Abstract: Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding -- an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects' visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC\&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at https://github.com/hongliang-wei/RC-MLLM
Authors: Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, Yang Qin, Yuan Wang, Quanxing Zha, Sunhao Dai, Changhua Meng
Abstract: Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
Authors: Mackenzie Jorgensen, Kendall Brogle, Katherine M. Collins, Lujain Ibrahim, Arina Shah, Petra Ivanovic, Noah Broestl, Gabriel Piles, Paul Dongha, Hatim Abdulhussein, Adrian Weller, Jillian Powers, Umang Bhatt
Abstract: Artificial intelligence (AI) is increasingly integrated into society, from financial services and traffic management to creative writing. Academic literature on the deployment of AI has mostly focused on the risks and harms that result from the use of AI. We introduce Fabric, a publicly available repository of deployed AI use cases to outline their governance mechanisms. Through semi-structured interviews with practitioners, we collect an initial set of 20 AI use cases. In addition, we co-design diagrams of the AI workflow with the practitioners. We discuss the oversight mechanisms and guardrails used in practice to safeguard AI use. The Fabric repository includes visual diagrams of AI use cases and descriptions of the deployed systems. Using the repository, we surface gaps in governance and find common patterns in human oversight of deployed AI systems. We intend for Fabric to serve as an extendable, evolving tool for researchers to study the effectiveness of AI governance.
Authors: Hamza A. Abushahla, Dara Varam, Ariel J. N. Panopio, Mohamed I. AlHajri
Abstract: The deployment of Quantized Neural Networks (QNNs) on resource-constrained devices, such as microcontrollers, has introduced significant challenges in balancing model performance, computational complexity, and memory constraints. Tiny Machine Learning (TinyML) addresses these issues by integrating advancements across machine learning algorithms, hardware acceleration, and software optimization to efficiently run deep neural networks on embedded systems. This survey presents a hardware-centric introduction to quantization, systematically reviewing essential quantization techniques employed to accelerate deep learning models for embedded applications. In particular, further emphasis is placed on the critical trade-offs between model performance and hardware capabilities. The survey further evaluates existing software frameworks and hardware platforms designed specifically for supporting QNN execution on microcontrollers. Moreover, we provide an analysis of the current challenges and an outline of promising future directions in the rapidly evolving domain of QNN deployment.
Authors: V Venktesh, Mandeep Rathee, Avishek Anand
Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
URLs: https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
Authors: Hoyoung Lee, Wonbin Ahn, Suhwan Park, Jaehoon Lee, Minjae Kim, Sungdong Yoo, Taeyoon Lim, Woohyung Lim, Yongjae Lee
Abstract: Thematic investing, which aims to construct portfolios aligned with structural trends, remains a challenging endeavor due to overlapping sector boundaries and evolving market dynamics. A promising direction is to build semantic representations of investment themes from textual data. However, despite their power, general-purpose LLM embedding models are not well-suited to capture the nuanced characteristics of financial assets, since the semantic representation of investment assets may differ fundamentally from that of general financial text. To address this, we introduce THEME, a framework that fine-tunes embeddings using hierarchical contrastive learning. THEME aligns themes and their constituent stocks using their hierarchical relationship, and subsequently refines these embeddings by incorporating stock returns. This process yields representations effective for retrieving thematically aligned assets with strong return potential. Empirical results demonstrate that THEME excels in two key areas. For thematic asset retrieval, it significantly outperforms leading large language models. Furthermore, its constructed portfolios demonstrate compelling performance. By jointly modeling thematic relationships from text and market dynamics from returns, THEME generates stock embeddings specifically tailored for a wide range of practical investment applications.
Authors: Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, Shifat E. Arman
Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
URLs: https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
Authors: Mirza Mumtaz Zahoor (Faculty of Computer Sciences, Ibadat International University, Islamabad, Pakistan), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences)
Abstract: Brain tumors remain among the most lethal human diseases, where early detection and accurate classification are critical for effective diagnosis and treatment planning. Although deep learning-based computer-aided diagnostic (CADx) systems have shown remarkable progress. However, conventional convolutional neural networks (CNNs) and Transformers face persistent challenges, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies in MRI data. Therefore, a novel hybrid framework, CE-RS-SBCIT, is introduced, integrating residual and spatial learning-based CNNs with transformer-driven modules. The proposed framework exploits local fine-grained and global contextual cues through four core innovations: (i) a smoothing and boundary-based CNN-integrated Transformer (SBCIT), (ii) tailored residual and spatial learning CNNs, (iii) a channel enhancement (CE) strategy, and (iv) a novel spatial attention mechanism. The developed SBCIT employs stem convolution and contextual interaction transformer blocks with systematic smoothing and boundary operations, enabling efficient global feature modeling. Moreover, Residual and spatial CNNs, enhanced by auxiliary transfer-learned feature maps, enrich the representation space, while the CE module amplifies discriminative channels and mitigates redundancy. Furthermore, the spatial attention mechanism selectively emphasizes subtle contrast and textural variations across tumor classes. Extensive evaluation on challenging MRI datasets from Kaggle and Figshare, encompassing glioma, meningioma, pituitary tumors, and healthy controls, demonstrates superior performance, achieving 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.
Authors: Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Abstract: Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware reward function to optimize for both accuracy and budget adherence. We demonstrate that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets on challenging mathematical benchmarks. Our method provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.
Authors: Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng
Abstract: We introduce CMPhysBench, designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics, as a novel Benchmark. CMPhysBench is composed of more than 520 graduate-level meticulously curated questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, strongly correlated systems, etc. To ensure a deep understanding of the problem-solving process,we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best models, Grok-4, reach only 36 average SEED score and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code anddataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
Authors: Hejia Liu, Mochen Yang, Gediminas Adomavicius
Abstract: Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting -- making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both close-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
Authors: Maha Shatta, Konstantinos Balaskas, Paula Carolina Lozano Duarte, Georgios Panagopoulos, Mehdi B. Tahoori, Georgios Zervakis
Abstract: Flexible Electronics (FE) offer a promising alternative to rigid silicon-based hardware for wearable healthcare devices, enabling lightweight, conformable, and low-cost systems. However, their limited integration density and large feature sizes impose strict area and power constraints, making ML-based healthcare systems-integrating analog frontend, feature extraction and classifier-particularly challenging. Existing FE solutions often neglect potential system-wide solutions and focus on the classifier, overlooking the substantial hardware cost of feature extraction and Analog-to-Digital Converters (ADCs)-both major contributors to area and power consumption. In this work, we present a holistic mixed-signal feature-to-classifier co-design framework for flexible smart wearable systems. To the best of our knowledge, we design the first analog feature extractors in FE, significantly reducing feature extraction cost. We further propose an hardware-aware NAS-inspired feature selection strategy within ML training, enabling efficient, application-specific designs. Our evaluation on healthcare benchmarks shows our approach delivers highly accurate, ultra-area-efficient flexible systems-ideal for disposable, low-power wearable monitoring.
Authors: Vojtech Mrazek, Konstantinos Balaskas, Paula Carolina Lozano Duarte, Zdenek Vasicek, Mehdi B. Tahoori, Georgios Zervakis
Abstract: Printed electronics offer a promising alternative for applications beyond silicon-based systems, requiring properties like flexibility, stretchability, conformality, and ultra-low fabrication costs. Despite the large feature sizes in printed electronics, printed neural networks have attracted attention for meeting target application requirements, though realizing complex circuits remains challenging. This work bridges the gap between classification accuracy and area efficiency in printed neural networks, covering the entire processing-near-sensor system design and co-optimization from the analog-to-digital interface-a major area and power bottleneck-to the digital classifier. We propose an automated framework for designing printed Ternary Neural Networks with arbitrary input precision, utilizing multi-objective optimization and holistic approximation. Our circuits outperform existing approximate printed neural networks by 17x in area and 59x in power on average, being the first to enable printed-battery-powered operation with under 5% accuracy loss while accounting for analog-to-digital interfacing costs.