new The Diffusion-Attention Connection

Authors: Julio Candanedo

Abstract: Transformers, diffusion-maps, and magnetic Laplacians are usually treated as separate tools; we show that they are different regimes of a single Markov geometry built from pre-softmax query-scores. We define a QK "bidivergence" whose exponentiated and normalized forms yield attention, diffusion-maps, and magnetic diffusion. We then use products of experts and Schrödinger bridges to connect and organize them into equilibrium, nonequilibrium steady-state, and driven dynamics.
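
As a minimal illustration of the "Markov" reading of attention (not the paper's bidivergence construction, which the abstract does not spell out), the sketch below exponentiates and row-normalizes scaled query-key scores and checks that the result is a row-stochastic matrix. The function name and the random inputs are assumptions made for this example.

```python
import numpy as np

def attention_markov_matrix(Q, K):
    """Row-softmax of scaled query-key scores (standard dot-product attention weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # pre-softmax scores
    scores = scores - scores.max(axis=1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)                               # exponentiate
    return weights / weights.sum(axis=1, keepdims=True)    # normalize each row

# Hypothetical toy inputs: 4 tokens with 8-dimensional queries and keys.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
P = attention_markov_matrix(Q, K)
```

Each row of `P` is a probability distribution over keys, so `P` can be read as the transition matrix of a Markov chain on tokens.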

new Fairboard: a quantitative framework for equity assessment of healthcare models

Authors: James K. Ruffle, Samia Mohinta, Chris Foulon, Mohamad Zeina, Zicheng Wang, Sebastian Brandner, Harpreet Hyare, Parashkev Nachev

Abstract: Despite there now being more than 1,000 FDA-authorised AI medical devices, formal equity assessments -- whether model performance is uniform across patient subgroups -- are rare. Here, we evaluate the equity of 18 open-source brain tumour segmentation models across 648 glioma patients from two independent datasets (n = 11,664 model inferences) along distinct univariate, Bayesian multivariate, spatial, and representational dimensions. We find that patient identity consistently explains more performance variance than model choice, with clinical factors, including molecular diagnosis, tumour grade, and extent of resection, predicting segmentation accuracy more strongly than model architecture. A voxel-wise spatial meta-analysis identifies neuroanatomically localised biases that are compartment-specific yet often consistent across models. Within a high-dimensional latent space of lesion masks and clinic-demographic features, model performance clusters significantly, indicating that the patient feature space contains axes of algorithmic vulnerability. Although newer models tend toward greater equity, none provide a formal fairness guarantee. Lastly, we release Fairboard, an open-source, no-code dashboard that lowers barriers to equitable model monitoring in medical imaging.

new Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Authors: Pankayaraj Pathmanathan, Furong Huang

Abstract: While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these alignment methods. To this end, the work on deliberative alignment proposed distilling reasoning capabilities from stronger reasoning models, thereby instilling deeper safety in LLMs. In this work, we study the impact of deliberative alignment in language models. First, we show that although the teacher is larger in model size and stronger in safety capability, there exists an alignment gap between teacher and student language models, which affects both the safety and general utility of the student model. Furthermore, we show that models aligned through deliberative alignment can retain unsafe behaviors from the base model despite learning the reasoning patterns of larger reasoning models. Building upon this observation, we propose a best-of-N (BoN) sampling method that attributes the unsafe behavior back to the base LLMs in the latent space, thereby down-ranking unsafe responses to gain a meaningful improvement in model safety across multiple safety benchmarks with minimal loss in utility. In particular, across 7 teacher models and 6 student models of different classes and sizes, we show an average attack success rate (ASR) reduction of 28.2% on DAN, 31.3% on WildJailbreak, and 35.4% on StrongREJECT. We further show that these safety gains persist after RL training, thus highlighting the uncertainty in safety reasoning and its explicit attribution to the base model.

new Human-like Working Memory Interference in Large Language Models

Authors: Hua-Dong Xiong (School of Psychological and Brain Sciences, Georgia Tech), Li Ji-An (Department of Psychology, New York University), Jiaqi Huang (Department of Cognitive Science, Indiana University Bloomington, Honda Research Institute), Robert C. Wilson (School of Psychological and Brain Sciences, Georgia Tech, Center of Excellence for Computational Cognition, Georgia Tech), Kwonjoon Lee (Honda Research Institute), Xue-Xin Wei (Departments of Neuroscience and Psychology, The University of Texas at Austin)

Abstract: Intelligent systems must maintain and manipulate task-relevant information online to adapt to dynamic environments and changing goals. This capacity, known as working memory, is fundamental to human reasoning and intelligence. Despite having on the order of 100 billion neurons, both biological and artificial systems exhibit limitations in working memory. This raises a key question: why do large language models (LLMs) show such limitations, given that transformers have full access to prior context through attention? We find that although a two-layer transformer can be trained to solve working memory tasks perfectly, a diverse set of pretrained LLMs continues to show working memory limitations. Notably, LLMs reproduce interference signatures observed in humans: performance degrades with increasing memory load and is biased by recency and stimulus statistics. Across models, stronger working memory capacity correlates with broader competence on standard benchmarks, mirroring its link to general intelligence in humans. Yet despite substantial variability in working memory performance, LLMs surprisingly converge on a common computational mechanism. Rather than directly copying the relevant memory item from context, models encode multiple memory items in entangled representations, such that successful recall depends on interference control -- actively suppressing task-irrelevant content to isolate the target for readout. Moreover, a targeted intervention that suppresses stimulus content information improves performance, providing causal support for representational interference. Together, these findings identify representational interference as a core constraint on working memory in pretrained LLMs, suggesting that working-memory limits in biological and artificial systems may reflect a shared computational challenge: selecting task-relevant information under interference.

new Belief-State RWKV for Reinforcement Learning under Partial Observability

Authors: Liu Xiao

Abstract: We propose a stronger formulation of RL on top of RWKV-style recurrent sequence models, in which the fixed-size recurrent state is explicitly interpreted as a belief state rather than an opaque hidden vector. Instead of conditioning policy and value on a single summary h_t, we maintain a compact uncertainty-aware state b_t = (\mu_t, \Sigma_t) derived from RWKV-style recurrent statistics and let control depend on both memory and uncertainty. This design targets a key weakness of plain fixed-state policies in partially observed settings: they may store evidence, but not necessarily confidence. We present the method, a theoretical program, and a pilot RL experiment with hidden episode-level observation noise together with a test-time noise sweep. The pilot shows that belief-state policies nearly match the best recurrent baseline overall while slightly improving return on the hardest in-distribution regime and under a held-out noise shift. Additional ablations show that this simple belief readout is currently stronger than two more structured extensions, namely gated memory control and privileged belief targets, underscoring the need for richer benchmarks.

new Active Inference with a Self-Prior in the Mirror-Mark Task

Authors: Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi

Abstract: The mirror self-recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self-awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self-prior, without any external reward. The self-prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark-directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self-prior operates as an internal criterion for distinguishing self from non-self. Cross-modal sampling further demonstrated that the self-prior captures visual--proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self-awareness. Code is available at: https://github.com/kim135797531/self-prior-mirror

URLs: https://github.com/kim135797531/self-prior-mirror

new A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

Authors: Ming Lei, Christophe Baehr

Abstract: Reinforcement learning (RL) has become a key approach for enhancing reasoning in large language models (LLMs), yet scalable training is often hindered by the rapid collapse of policy entropy, which leads to premature convergence and performance saturation. This paper provides a comparative theoretical analysis of two entropy control strategies: traditional entropy regularization and the recently proposed covariance-based mechanism. We establish a unified framework for entropy dynamics under softmax parameterization, showing that entropy change is governed by the covariance between log-probabilities and logit updates. Our analysis reveals that traditional entropy regularization introduces a dense, persistent bias that modifies the stationary condition, leading to suboptimal policies, while covariance-based methods selectively regularize a sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when the regularization coefficient is annealed. These results provide principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks.
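
The covariance statement can be checked numerically for a single softmax distribution. To first order in a logit update dz, the entropy change is minus the covariance, under the policy, between log-probabilities and dz. The sketch below is an illustrative check of that standard identity, not the paper's framework; all function names are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

def predicted_entropy_change(z, dz):
    """First-order estimate: dH = -Cov_{a ~ pi}(log pi(a), dz(a))."""
    p = softmax(z)
    logp = np.log(p)
    cov = np.sum(p * logp * dz) - np.sum(p * logp) * np.sum(p * dz)
    return -cov

rng = np.random.default_rng(0)
z = rng.normal(size=6)                 # hypothetical logits over 6 actions
dz = 1e-4 * rng.normal(size=6)         # small logit update
actual = entropy(softmax(z + dz)) - entropy(softmax(z))
predicted = predicted_entropy_change(z, dz)
```

When the update raises the logits of already-likely actions, the covariance is positive and entropy falls, which is the collapse mechanism the abstract refers to.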

new STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

Authors: Samah Fodeh, Ganesh Puthiaraju, Elyas Irankhah, Linhai Ma, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Jordan Alpert, Sarah Schellhorn

Abstract: Structured prediction requires models to generate ontology-constrained labels, grounded evidence, and valid structure under ambiguity, label skew, and heterogeneous group difficulty. We present a two-part framework for controllable inference and robust fine-tuning. First, we introduce a task-agnostic prompting strategy that combines XML-based instruction structure, disambiguation rules, verification-style reasoning, schema constraints, and self-validation to address format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion in in-context structured generation. Second, we introduce STaR-DRO, a stateful robust optimization method for group heterogeneity. It combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline are upweighted, concentrating learning where it is most needed while avoiding volatile, dense exponentiated-gradient reweighting and unnecessary loss from downweighting easier groups. We evaluate the combined framework on EPPC Miner, a benchmark for extracting hierarchical labels and evidence spans from patient-provider secure messages. Prompt engineering improves zero-shot by +15.44 average F1 across Code, Sub-code, and Span over four Llama models. Building on supervised fine-tuning, STaR-DRO further improves the hardest semantic decisions: on Llama-3.3-70B-Instruct, Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30, while preserving Span performance and reducing group-wise validation cross-entropy by up to 29.6% on the most difficult clinical categories. Because these rare and difficult groups correspond to clinically consequential communication behaviors, these gains are not merely statistical improvements: they directly strengthen communication mining reliability for patient-centered care analysis.

new ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Authors: Vijay Lingam, Aditya Golatkar, Anwesan Pal, Ben Vo, Narayanan Sadagopan, Alessandro Achille, Jun Huan, Anoop Deoras, Stefano Soatto

Abstract: For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.

new Efficient Matrix Implementation for Rotary Position Embedding

Authors: Chen Minqi, Zhongqi Yue, Shihao Zhang, Yun Xu, Peng Wu, Kaixiang Xu, Zeyi Huang, Hanwang Zhang

Abstract: Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.

URLs: https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md
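
To make the contrast between split/merge and matrix forms concrete, the sketch below implements the common "rotate-half" variant of 1D RoPE in two ways: with explicit split and concatenate, and with a fixed signed permutation matrix applied by one matrix product. This is only a generic illustration of the idea. It is not the RoME formulation, whose details the abstract does not give, and all names here are assumptions.

```python
import numpy as np

def rope_split_merge(x, cos, sin):
    """Rotate-half RoPE using vector split and merge operations."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)
    return x * cos + rotated * sin

def rotate_half_matrix(d):
    """Signed permutation P such that x @ P equals concat(-x2, x1)."""
    half = d // 2
    P = np.zeros((d, d))
    idx = np.arange(half)
    P[idx + half, idx] = -1.0    # first half of the output is -x2
    P[idx, idx + half] = 1.0     # second half of the output is x1
    return P

def rope_matrix(x, cos, sin, P):
    """Same rotation, with split/merge replaced by one matrix product."""
    return x * cos + (x @ P) * sin

d, seq = 8, 5                                   # hypothetical head dim and length
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d))
inv_freq = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))
angles = np.outer(np.arange(seq), inv_freq)     # shape (seq, d/2)
angles = np.concatenate([angles, angles], axis=-1)
cos, sin = np.cos(angles), np.sin(angles)
P = rotate_half_matrix(d)
out_vec = rope_split_merge(x, cos, sin)
out_mat = rope_matrix(x, cos, sin, P)
```

Both paths give the same output; whether the matrix path is faster depends on the hardware, which is the question the paper addresses for NPUs.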

new Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms

Authors: Mainak Kundu, Catherine Chen, Rifatul Islam, Ismail Uysal, Ria Kanjilal

Abstract: Human activity recognition (HAR) has become a key component of intelligent systems for healthcare monitoring, assistive living, smart environments, and human-computer interaction. Although deep learning has substantially improved HAR performance on multivariate sensor data, the resulting models often remain opaque, limiting trust, reliability, and real-world deployment. Explainable artificial intelligence (XAI) has therefore emerged as a critical direction for making HAR systems more transparent and human-centered. This paper presents a comprehensive review of explainable HAR methods across wearable, ambient, physiological, and multimodal sensing settings. We introduce a unified perspective that separates conceptual dimensions of explainability from algorithmic explanation mechanisms, reducing ambiguities in prior surveys. Building on this distinction, we present a mechanism-centric taxonomy of XAI-HAR methods covering major explanation paradigms. The review examines how these methods address the temporal, multimodal, and semantic complexities of HAR, and summarizes their interpretability objectives, explanation targets, and limitations. In addition, we discuss current evaluation practices, highlight key challenges in achieving reliable and deployable XAI-HAR, and outline directions toward trustworthy activity recognition systems that better support human understanding and decision-making.

new NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity

Authors: Weijian Mai, Mu Nan, Yu Zhu, Jiahang Cao, Rui Zhang, Yuqin Dai, Chunfeng Song, Andrew F. Luo, Jiamin Wu

Abstract: Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (1) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (2) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding-decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain-computer interfaces.

new Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features

Authors: Robin Young, Michael E. Van Nuland, E. Toby Kiers, Tomáš Větrovský, Petr Kohout, Petr Baldrian, Srinivasan Keshav

Abstract: Mycorrhizal fungi are vital to terrestrial ecosystem functioning. Yet monitoring their biodiversity at landscape scales is often unfeasible due to time and cost constraints. Current predictions suggest that 90% of mycorrhizal diversity hotspots remain unprotected, opening questions of how to broadly and effectively map underground fungal communities. Here, we show that self-supervised learning (SSL) applied to satellite imagery can predict below-ground ectomycorrhizal fungal richness across diverse environments. Our models explain over half the variance in species richness across ~12,000 field samples spanning Europe and Asia. SSL-derived features prove to be the single most informative predictor, subsuming the majority of information contained in climate, soil, and land cover datasets. Using this approach, we achieve a 10,000-fold increase in spatial resolution over existing techniques, moving from 1 km landscape averages to 10 m habitat-scale observations with nearly no systematic bias. As satellite observations are dynamic rather than static, this enables temporal monitoring of below-ground biodiversity at landscape scales for the first time. We analyze multi-year trends in predicted fungal richness across UK National Park woodlands, finding that ancient forests may be losing ectomycorrhizal diversity at disproportionate rates. These results establish SSL satellite features as a scalable tool for extending sparse field observations to continuous, high-resolution biodiversity maps for monitoring the invisible half of terrestrial ecosystems.

new Relational Preference Encoding in Looped Transformer Internal States

Authors: Jan Kirin

Abstract: We investigate how looped transformers encode human preference in their internal iteration states. Using Ouro-2.6B-Thinking, a 2.6B-parameter looped transformer with iterative refinement, we extract hidden states from each loop iteration and train lightweight evaluator heads (~5M parameters) to predict human preference on the Anthropic HH-RLHF dataset. Our pairwise evaluator achieves 95.2% test accuracy on 8,552 unseen examples, surpassing a full-batch L-BFGS probe (84.5%) while the base model remains completely frozen. Our central finding is that loop states encode preference predominantly relationally: a linear probe on pairwise differences achieves 84.5%, the best nonlinear independent evaluator reaches only 65% test accuracy, and linear independent classification scores 21.75%, below chance and with inverted polarity. Interpreted precisely, the evaluator functions as a model-internal consistency probe, measuring how stably Ouro's own learned value system organizes its representations rather than how well it predicts noisy human annotations. We also document a systematic architecture search that established a genuine 70% ceiling for independent scoring, and show how the 50% argument-swap protocol required to prevent degenerate pairwise solutions deflated pairwise training metrics by about 31 points at peak, creating the false appearance that pairwise and pointwise evaluators shared the same ceiling. Finally, we show that a cosine learning-rate dead zone at epoch 2 accidentally acted as early stopping, preserving the generalization peak before overfitting degraded test accuracy from 95.2% to 62.4% by epoch 5. Cross-epoch flip-test analysis shows that antisymmetry correlation remains stable while strict sign-flip rate mainly tracks scorer bias. We propose the flip test as a mandatory diagnostic for pairwise preference evaluators.

new Efficient Personalization of Generative User Interfaces

Authors: Yi-Hao Peng, Samarth Das, Jeffrey P. Bigham, Jason Wu

Abstract: Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.

new SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning

Authors: Halil Ibrahim Gulluk, Olivier Gevaert

Abstract: Medical vision-language datasets are often limited in size and biased toward negative findings, because clinicians mostly report abnormalities and may omit positive/neutral findings that they consider irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. We then enrich the findings in the medical reports in the training set by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (5.63%, 3.04%, 7.40%, 5.30%, and 7.47% average gains on COMET, BERTScore, Sentence BLEU, CheXbert-F1, and RadGraph-F1, respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (2.78%, 3.14%, and 12.80% average gains on COMET, BERTScore, and Sentence BLEU, respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF

URLs: https://anonymous.4open.science/r/SemEnrich-75CF

new Improving Pediatric Emergency Department Triage with Modality Dropout in Late Fusion Multimodal EHR Models

Authors: Tyler Yang, Romal Mitr

Abstract: Emergency department triage relies heavily on both quantitative vital signs and qualitative clinical notes, yet multimodal machine learning models predicting triage acuity often suffer from modality collapse by over-relying on structured tabular data. This limitation severely hinders demographic generalizability, particularly for pediatric patients where developmental variations in vital signs make unstructured clinical narratives uniquely crucial. To address this gap, we propose a late-fusion multimodal architecture that processes tabular vitals via XGBoost and unstructured clinical text via Bio_ClinicalBERT, combined through a Logistic Regression meta-classifier to predict the 5-level Emergency Severity Index. To explicitly target the external validity problem, we train our model exclusively on adult encounters from the MIMIC-IV and NHAMCS datasets and evaluate its zero-shot generalization on a traditionally overlooked pediatric cohort. Furthermore, we employ symmetric modality dropout during training to prevent the ensemble from overfitting to adult-specific clinical correlations. Our results demonstrate that the multimodal framework significantly outperforms single-modality baselines. Most notably, applying a 30-40% symmetric modality dropout rate yielded steep performance improvements in the unseen pediatric cohort, elevating the Quadratic Weighted Kappa to 0.351. These findings highlight modality dropout as a critical regularization technique for mitigating modality collapse and enhancing cross-demographic generalization in clinical AI.
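
For readers unfamiliar with the headline metric, the sketch below computes Quadratic Weighted Kappa from its standard definition: one minus the ratio of quadratically weighted observed disagreement to the disagreement expected from the marginals. It assumes labels are integers 0..n_classes-1 (for example ESI levels 1 to 5 shifted down by one); the function name is hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion matrix of observed (true, predicted) counts.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)
    # Quadratic penalty grows with the squared distance between levels.
    i, j = np.indices((n_classes, n_classes))
    weights = (i - j) ** 2 / (n_classes - 1) ** 2
    # Counts expected if true and predicted labels were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical toy labels on a 5-level scale.
kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
kappa_reversed = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0], 5)
```

Because the penalty is quadratic, confusing adjacent acuity levels costs much less than confusing the most and least urgent levels.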

new Last-Iterate Convergence of Randomized Kaczmarz and SGD with Greedy Step Size

Authors: Michał Dereziński, Xiaoyu Dong

Abstract: We study last-iterate convergence of SGD with greedy step size over smooth quadratics in the interpolation regime, a setting which captures the classical Randomized Kaczmarz algorithm as well as other popular iterative linear system solvers. For these methods, we show that the $t$-th iterate attains an $O(1/t^{3/4})$ convergence rate, addressing a question posed by Attia, Schliserman, Sherman, and Koren, who gave an $O(1/t^{1/2})$ guarantee for this setting. In the proof, we introduce the family of stochastic contraction processes, whose behavior can be described by the evolution of a certain deterministic eigenvalue equation, which we analyze via a careful discrete-to-continuous reduction.
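
For reference, a minimal sketch of the classical Randomized Kaczmarz iteration on a consistent linear system follows. Each step samples one equation and projects the iterate onto its solution hyperplane, which corresponds to an SGD step on the squared residual with the step size that exactly satisfies the sampled equation. Sampling rows with probability proportional to their squared norm is the textbook choice and an assumption here; the abstract does not state which sampling scheme the paper analyzes.

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iters, rng):
    """Solve a consistent system A x = b by random row projections."""
    m, n = A.shape
    row_norms_sq = np.sum(A * A, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()    # assumed sampling distribution
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.choice(m, p=probs)
        residual = b[i] - A[i] @ x
        # Project x onto the hyperplane {x : A[i] @ x = b[i]}.
        x = x + (residual / row_norms_sq[i]) * A[i]
    return x

# Hypothetical consistent system (interpolation regime: an exact solution exists).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_true = rng.normal(size=5)
b = A @ x_true
x_hat = randomized_kaczmarz(A, b, 2000, rng)
```

The function returns the last iterate rather than an average of iterates, which is the quantity the paper's rate concerns.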

new Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation

Authors: Joseph Liu, Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana

Abstract: Simultaneous Speech Translation (SimulST) requires balancing high translation quality with low latency. Recent work introduced REINA, a method that trains a Read/Write policy based on estimating the information gain of reading more audio. However, we find that information-based policies often lack temporal context, leading the policy to bias itself toward reading most of the audio before starting to write. We improve REINA using two distinct strategies: a supervised alignment network (REINA-SAN) and a timestep-augmented network (REINA-TAN). Our results demonstrate that while both methods significantly outperform the baseline and resolve stability issues, REINA-TAN provides a slightly superior Pareto frontier for streaming efficiency, whereas REINA-SAN offers more robustness against 'read loops'. Applied to Whisper, both methods improve the Pareto frontier of streaming efficiency, as measured by Normalized Streaming Efficiency (NoSE) scores, by up to 7.1% over existing competitive baselines.

new A Tale of Two Temperatures: Simple, Efficient, and Diverse Sampling from Diffusion Language Models

Authors: Theo X. Olausson, Metod Jazbec, Xi Wang, Armando Solar-Lezama, Christian A. Naesseth, Stephan Mandt, Eric Nalisnick

Abstract: Much work has been done on designing fast and accurate sampling for diffusion language models (dLLMs). However, these efforts have largely focused on the tradeoff between speed and quality of individual samples; how to additionally ensure diversity across samples remains less well understood. In this work, we show that diversity can be increased by using softened, tempered versions of familiar confidence-based remasking heuristics, retaining their computational benefits and offering simple implementations. We motivate this approach by introducing an idealized formal model of fork tokens and studying the impact of remasking on the expected entropy at the forks. Empirically, the proposed tempered heuristics close the exploration gap (pass@k) between existing confidence-based and autoregressive sampling, hence outperforming both when controlling for cost (pass@NFE). We further study how the increase in diversity translates to downstream post-training and test-time compute scaling. Overall, our findings demonstrate that simple, efficient, and diverse sampling from dLLMs is possible.

new K-STEMIT: Knowledge-Informed Spatio-Temporal Efficient Multi-Branch Graph Neural Network for Subsurface Stratigraphy Thickness Estimation from Radar Data

Authors: Zesheng Liu, Maryam Rahnemoonfar

Abstract: Subsurface stratigraphy contains important spatio-temporal information about accumulation, deformation, and layer formation in polar ice sheets. In particular, variations in internal ice layer thickness provide valuable constraints for snow mass balance estimation and projections of ice sheet change. Although radar sensors can capture these layered structures as depth-resolved radargrams, convolutional neural networks applied directly to radar images are often sensitive to speckle noise and acquisition artifacts. In addition, purely data-driven methods may underuse physical knowledge, leading to unrealistic thickness estimates under spatial or temporal extrapolation. To address these challenges, we develop K-STEMIT, a novel knowledge-informed, efficient, multi-branch spatio-temporal graph neural network that combines a geometric framework for spatial learning with temporal convolution to capture temporal dynamics, and incorporates physical data synchronized from the Model Atmospheric Regional physical weather model. An adaptive feature fusion strategy is employed to dynamically combine features learned from different branches. Extensive experiments have been conducted to compare K-STEMIT against current state-of-the-art methods in both knowledge-informed and non-knowledge-informed settings, as well as other existing methods. Results show that K-STEMIT consistently achieves the highest accuracy while maintaining near-optimal efficiency. Most notably, incorporating adaptive feature fusion and physical priors reduces the root mean-squared error by 21.01% with negligible additional cost compared to its conventional multi-branch variants. Additionally, our proposed K-STEMIT achieves consistently lower per-year relative MAE, enabling reliable, continuous spatiotemporal assessment of snow accumulation variability across large spatial regions.

new A Hybrid Intelligent Framework for Uncertainty-Aware Condition Monitoring of Industrial Systems

Authors: Maryam Ahang, Todd Charter, Masoud Jalayer, Homayoun Najjaran

Abstract: Hybrid approaches that combine data-driven learning with physics-based insight have shown promise for improving the reliability of industrial condition monitoring. This work develops a hybrid condition monitoring framework that integrates primary sensor measurements, lagged temporal features, and physics-informed residuals derived from nominal surrogate models. Two hybrid integration strategies are examined. The first is a feature-level fusion approach that augments the input space with residual and temporal information. The second is a model-level ensemble approach in which machine learning classifiers trained on different feature types are combined at the decision level. Both hybrid approaches of the condition monitoring framework are evaluated on a continuous stirred-tank reactor (CSTR) benchmark using several machine learning models and ensemble configurations. Both feature-level and model-level hybridization improve diagnostic accuracy relative to single-source baselines, with the best model-level ensemble achieving a 2.9% improvement over the best baseline ensemble. To assess predictive reliability, conformal prediction is applied to quantify coverage, prediction-set size, and abstention behavior. The results show that hybrid integration enhances uncertainty management, producing smaller and well-calibrated prediction sets at matched coverage levels. These findings demonstrate that lightweight physics-informed residuals, temporal augmentation, and ensemble learning can be combined effectively to improve both accuracy and decision reliability in nonlinear industrial systems.
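
The conformal prediction step mentioned in this abstract can be illustrated with a minimal split-conformal sketch. The function names, the toy probabilities, and the nonconformity score (one minus the probability of the true class) are assumptions made for illustration; the abstract does not say which conformal variant or score the authors use.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold for target coverage 1 - alpha (split conformal)."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is kept
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, threshold):
    """Boolean mask (n_test, n_classes); True means the class is in the set."""
    return (1.0 - test_probs) <= threshold

# Hypothetical calibration data: 4 samples, 3 fault classes.
cal_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.3, 0.1]])
cal_labels = np.array([0, 1, 2, 1])

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
sets = prediction_sets(np.array([[0.5, 0.4, 0.1]]), threshold)
set_sizes = sets.sum(axis=1)  # a size other than 1 could trigger abstention
```

In this toy run the test sample receives a two-class prediction set, which a monitoring system might treat as a reason to abstain rather than commit to a single diagnosis.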

new Vestibular reservoir computing

Authors: Smita Deb, Shirin Panahi, Mulugeta Haile, Ying-Cheng Lai

Abstract: Reservoir computing (RC) is a computational framework known for its training efficiency, making it ideal for physical hardware implementations. However, realizing the complex interconnectivity of traditional reservoirs in physical systems remains a significant challenge. This paper proposes a physical RC scheme inspired by the biological vestibular system. To overcome hardware complexity, we introduce a designed uncoupled topology and demonstrate that it achieves performance comparable to fully coupled networks. We theoretically analyze the difference between these topologies by deriving a memory capacity formula for linear reservoirs, identifying specific conditions where both configurations yield equivalent memory. These analytical results are demonstrated to approximately hold for nonlinear reservoir systems. Furthermore, we systematically examine the impact of reservoir size on predictive statistics and memory capacity. Our findings suggest that uncoupled reservoir architectures offer a mathematically sound and practically feasible pathway for efficient physical reservoir computing.

new SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

Authors: Renjini R. Nair (Microsoft), Damian K. Kowalczyk (Microsoft), Marco Gaudesi (Microsoft), Chhaya Methani (Microsoft)

Abstract: Many applications today use large language models for code generation; however, production systems have strict latency requirements that can be difficult to meet with large models. Small language models with a few billion parameters are resource efficient but may suffer from limited reasoning, hallucinations, or poor retention of longer context. Fine tuning improves task specific accuracy by embedding domain knowledge directly into model weights, reducing reliance on runtime context. We previously implemented a baseline natural language to code generation approach using a retrieval augmented generation pipeline that dynamically selected few shot examples to embed domain specific language context for a large language model. In this study, we evaluate small language models for generating domain specific language from natural language by fine tuning variants of Mistral and other models on a dataset of natural language code pairs. Our results show that the fine-tuned models achieve improved performance and latency on test datasets compared to larger models. We also demonstrate that the trained model can be further fine-tuned for customer specific scenarios without degrading general performance, helping resolve production issues. Load testing followed by production deployment confirmed optimal performance in terms of latency and quality. These findings demonstrate that task specific fine tuning with small language models provides an efficient, faster, and cost-effective alternative to large language models for domain specific language generation.

new From Recency Bias to Stable Convergence Block Kaczmarz Methods for Online Preference Learning in Matchmaking Applications

Authors: James Nguyen

Abstract: We present a family of Kaczmarz-based preference learning algorithms for real-time personalized matchmaking in reciprocal recommender systems. Post-step L2 normalization, common in Kaczmarz-inspired online learners, induces exponential recency bias: the influence of the t-th interaction decays as eta^(n - t), reaching approximately 1e-6 after just 20 swipes at eta = 0.5. We resolve this by replacing the normalization step with a Tikhonov-regularized projection denominator that bounds step size analytically without erasing interaction history. When candidate tag vectors are not pre-normalized, as in realistic deployments where candidates vary in tag density, the Tikhonov denominator ||a||^2 + alpha produces genuinely per-candidate adaptive step sizes, making it structurally distinct from online gradient descent with any fixed learning rate. We further derive a block variant that processes full swipe sessions as a single Gram matrix solve. Population-scale simulation over 6,400 swipes reveals that Block Normalized Kaczmarz (BlockNK), which combines the batch Gram solve with post-session L2 normalization, achieves the highest preference alignment (Align@20 = 0.698), the strongest inter-session direction stability (delta = 0.994), and the flattest degradation profile under label noise across flip ratios p_flip in [0.10, 0.35]. Experiments under cosine similarity subsampling further show that adaptively filtering the candidate pool toward the current preference direction substantially improves asymptotic alignment, at the cost of introducing a feedback loop that may slow recovery from miscalibration. The sequential Tikhonov-Kaczmarz method performs comparably to K-NoNorm under our simulation conditions, suggesting the dominant practical gain over normalized Kaczmarz is the removal of per-step normalization rather than the Tikhonov constant alpha itself.
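
The recency-bias arithmetic and the Tikhonov-regularized step described in this abstract can be sketched as follows. The update form, the function name `tikhonov_kaczmarz_step`, the default values of `eta` and `alpha`, and the toy tag vectors are assumptions for illustration only; the abstract states just that the projection denominator becomes ||a||^2 + alpha.

```python
import numpy as np

def tikhonov_kaczmarz_step(w, a, y, eta=0.5, alpha=1.0):
    """One Kaczmarz-style update with a Tikhonov-regularized denominator."""
    residual = y - a @ w
    # Effective step size eta / (||a||^2 + alpha) varies with the candidate's
    # tag density, unlike a fixed learning rate.
    return w + eta * residual / (a @ a + alpha) * a

# Recency-bias figure quoted in the abstract: eta^(n - t) with eta = 0.5
# and n - t = 20 swipes.
decay = 0.5 ** 20

# Two hypothetical candidates with different tag densities (not pre-normalized).
sparse = np.array([1.0, 0.0, 0.0, 0.0])
dense = np.array([1.0, 1.0, 1.0, 1.0])
w0 = np.zeros(4)

w_sparse = tikhonov_kaczmarz_step(w0, sparse, y=1.0)  # scalar step 0.5 / (1 + 1)
w_dense = tikhonov_kaczmarz_step(w0, dense, y=1.0)    # scalar step 0.5 / (4 + 1)
```

The same label moves the preference vector by 0.25 along the sparse candidate but only 0.1 per coordinate along the dense one, which is the per-candidate adaptivity the abstract refers to.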

new Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Authors: Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang

Abstract: Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton-Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.
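
The link between conditioning and the number of NS iterations can be seen in a small sketch. It uses the classical cubic Newton-Schulz iteration rather than the tuned coefficients used in practice, a one-step second-moment estimate (`v = M ** 2`) in place of a running average, and a diagonal toy matrix chosen so the effect is easy to check by hand. All of these are simplifying assumptions, not the authors' algorithm.

```python
import numpy as np

def newton_schulz(M, steps):
    """Approximate the polar factor of M with the cubic Newton-Schulz iteration."""
    # Frobenius scaling keeps every singular value in (0, 1].
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def precondition(M, eps=1e-8):
    """Adam-style element-wise scaling with a one-step second-moment estimate."""
    v = M ** 2
    return M / (np.sqrt(v) + eps)

def orth_error(X):
    """Frobenius distance between X^T X and the identity."""
    return float(np.linalg.norm(X.T @ X - np.eye(X.shape[1])))

# Toy ill-conditioned "momentum" matrix: singular values 1 and 1e-3.
momentum = np.diag([1.0, 1e-3])

err_plain = orth_error(newton_schulz(momentum, steps=5))
err_precond = orth_error(newton_schulz(precondition(momentum), steps=5))
```

With five iterations the plain input is still far from orthogonal, because its small singular value grows by only about 1.5x per step, while the preconditioned input has essentially converged.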

new LoDAdaC: a unified local training-based decentralized framework with adaptive gradients and compressed communication

Authors: Wei Liu, Anweshit Panda, Ujwal Pandey, Haven Cook, George M. Slota, Naigang Wang, Jie Chen, Yangyang Xu

Abstract: In decentralized distributed learning, achieving fast convergence and low communication cost is essential for scalability and high efficiency. Adaptive gradient methods, such as Adam, have demonstrated strong practical performance in deep learning and centralized distributed settings. However, their convergence properties remain largely unexplored in decentralized settings involving multiple local training steps, such as federated learning. To address this limitation, we propose LoDAdaC, a unified multiple Local Training (MLT) Decentralized framework with Adam-type updates and Compressed communication (CC). LoDAdaC accommodates a broad class of optimizers for its local adaptive updates, including AMSGrad, Adam, and AdaGrad; it is compatible with standard (possibly biased) compressors such as low-bit quantization and sparsification. MLT and CC together enable LoDAdaC to achieve a multiplicative reduction in communication cost, while adaptive updates enable fast convergence. We rigorously prove the combined advantage through complexity analysis. In addition, experiments on image classification and GPT-style language model training validate our theoretical findings and show that LoDAdaC significantly outperforms existing decentralized algorithms in terms of convergence speed and communication efficiency.

new Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels

Authors: Kening Wang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiale Wei, Kailun Yang, Rainer Stiefelhagen, Kunyu Peng

Abstract: Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.

URLs: https://github.com/KNWang970918/FF-TRUST.git

new Closed-Form Concept Erasure via Double Projections

Authors: Chi Zhang, Jingpu Cheng, Zhixian Wang, Ping Liu

Abstract: While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.
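
The idea of editing only within the null space of known concept directions can be illustrated with generic linear algebra. This is not the authors' double-projection method, which the abstract does not spell out; the function `erase_direction` and the toy vectors are hypothetical, and the sketch only shows how a projector can remove one direction while leaving chosen directions untouched.

```python
import numpy as np

def erase_direction(target, preserved):
    """Matrix P that removes part of `target` and fixes each row of `preserved`."""
    # Projector onto the span of the preserved directions (rows of `preserved`).
    span_proj = np.linalg.pinv(preserved) @ preserved
    # Component of the target lying in the null space of the preserved directions.
    # Assumes the target is not entirely inside the preserved span.
    residual = target - span_proj @ target
    residual = residual / np.linalg.norm(residual)
    # Remove only that component; the preserved span is left unchanged.
    return np.eye(target.shape[0]) - np.outer(residual, residual)

# Hypothetical 3-dimensional embedding space.
preserved = np.array([[1.0, 0.0, 0.0]])  # concept direction to keep
target = np.array([1.0, 1.0, 0.0])       # concept direction to erase

P = erase_direction(target, preserved)
kept = P @ preserved[0]  # unchanged preserved direction
erased = P @ target      # only the part overlapping the preserved span remains
```

Because `P` is computed in closed form from a pseudo-inverse and an outer product, applying it needs no training, which is the property the abstract emphasizes.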

new Cross-Validated Cross-Channel Self-Attention and Denoising for Automatic Modulation Classification

Authors: Prakash Suman, Yanzhen Qu

Abstract: This study addresses a key limitation in deep learning Automatic Modulation Classification (AMC) models, which perform well at high signal-to-noise ratios (SNRs) but degrade under noisy conditions due to conventional feature extraction suppressing both discriminative structure and interference. The goal was to develop a feature-preserving denoising method that mitigates the loss of modulation class separation. A deep learning AMC model was proposed, incorporating a cross-channel self-attention block to capture dependencies between in-phase and quadrature components, along with dual-path deep residual shrinkage denoising blocks to suppress noise. Experiments using the RML2018.01a dataset employed stratified sampling across 24 modulation types and 26 SNR levels. Results showed that denoising depth strongly influences robustness at low and moderate SNRs. Compared to benchmark models PET-CGDNN, MCLDNN, and DAE, the proposed model achieved notable accuracy improvements across -8 dB to +2 dB SNR, with increases of 3%, 2.3%, and 14%, respectively. Cross-validation confirmed the model's robustness, yielding a mean accuracy of 62.6%, macro precision of 65.8%, macro-recall of 62.6%, and macro-F1 score of 62.9%. The architecture advances interference-aware AMC by formalizing baseband modeling as orthogonal subproblems and introducing cross-channel attention as a generalized complex interaction operator, with ablations confirming the critical role of feature-preserving denoising for robustness at low-to-medium SNR.

new When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Authors: Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang

Abstract: We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior work on reward poisoning has mainly focused on sufficient conditions for designing a successful attacker, while only a few studies have discussed the infeasibility of targeted attacks. This paper provides the first precise necessary and sufficient characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between vulnerable RL instances and intrinsically robust ones, which cannot be attacked without large costs even when running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs: by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes attackable from robust environments and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.

new Graph-RHO: Critical-path-aware Heterogeneous Graph Network for Long-Horizon Flexible Job-Shop Scheduling

Authors: Yujie Li, Jiuniu Wang, Mugen Peng, Guangzuo Li, Wenjia Xu

Abstract: Long-horizon Flexible Job-Shop Scheduling (FJSP) presents a formidable combinatorial challenge due to complex, interdependent decisions spanning extended time horizons. While learning-based Rolling Horizon Optimization (RHO) has emerged as a promising paradigm to accelerate solving by identifying and fixing invariant operations, its effectiveness is hindered by the structural complexity of FJSP. Existing methods often fail to capture intricate graph-structured dependencies and ignore the asymmetric costs of prediction errors, in which misclassifying critical-path operations is significantly more detrimental than misclassifying non-critical ones. Furthermore, dynamic shifts in predictive confidence during the rolling process make static pruning thresholds inadequate. To address these limitations, we propose Graph-RHO, a novel critical-path-aware graph-based RHO framework. First, we introduce a topology-aware heterogeneous graph network that encodes subproblems as operation-machine graphs with multi-relational edges, leveraging edge-feature-aware message passing to predict operation stability. Second, we incorporate a critical-path-aware mechanism that injects inductive biases during training to distinguish highly sensitive bottleneck operations from robust ones. Third, we devise an adaptive thresholding strategy that dynamically calibrates decision boundaries based on online uncertainty estimation to align model predictions with the solver's search space. Extensive experiments on standard benchmarks demonstrate that Graph-RHO establishes a new state of the art in solution quality and computational efficiency. Remarkably, it exhibits exceptional zero-shot generalization, reducing solve time by over 30% on large-scale instances (2000 operations) while achieving superior solution quality. Our code is available at https://github.com/IntelliSensing/Graph-RHO.

URLs: https://github.com/IntelliSensing/Graph-RHO

new Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

Authors: Hongkang Li, Hancheng Min, Rene Vidal

Abstract: Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.

new Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

Authors: Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong

Abstract: As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.

URLs: https://github.com/ZunhaiSu/Awesome-Attention-Sink

new End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables

Authors: Francesco Carlucci, Giovanni Pollo, Xiaying Wang, Massimo Poncino, Enrico Macii, Luca Benini, Sara Vinco, Alessio Burrello, Daniele Jahier Pagliari

Abstract: Photoplethysmography (PPG)-based blood pressure (BP) estimation is a challenging task, particularly on resource-constrained wearable devices. However, fully on-board processing is desirable to ensure user data confidentiality. Recent deep neural networks (DNNs) have achieved high BP estimation accuracy by reconstructing BP waveforms or directly regressing BP values, but their large memory, computation, and energy requirements hinder deployment on wearables. This work introduces a fully automated DNN design pipeline that combines hardware-aware neural architecture search (NAS), pruning, and mixed-precision search (MPS) to generate accurate yet compact BP prediction models optimized for ultra-low-power multicore systems-on-chip (SoCs). Starting from state-of-the-art baseline models on four public datasets, our optimized networks achieve up to 7.99% lower error with a 7.5x parameter reduction, or up to 83x fewer parameters with negligible accuracy loss. All models fit within 512 kB of memory on our target SoC (GreenWaves' GAP8), requiring less than 55 kB and achieving an average inference latency of 142 ms and energy consumption of 7.25 mJ. Patient-specific fine-tuning further improves accuracy by up to 64%, enabling fully autonomous, low-cost BP monitoring on wearables.

new Consensus-based Recursive Multi-Output Gaussian Process

Authors: Yogesh Prasanna Kumar Rao, Tamas Keviczky, Raj Thilak Rajan

Abstract: Multi-output Gaussian Processes provide principled uncertainty-aware learning of vector-valued fields but are difficult to deploy in large-scale, distributed, and streaming settings due to their computational and centralized nature. This paper proposes a Consensus-based Recursive Multi-Output Gaussian Process (CRMGP) framework that combines recursive inference on shared basis vectors with neighbour-to-neighbour information-consensus updates. The resulting method supports parallel, fully distributed learning with bounded per-step computation while preserving inter-output correlations and calibrated uncertainty. Experiments on synthetic wind fields and real LiDAR data demonstrate that CRMGP achieves competitive predictive performance and reliable uncertainty calibration, offering a scalable alternative to centralized Gaussian process models for multi-agent sensing applications.

new A Temporally Augmented Graph Attention Network for Affordance Classification

Authors: Ami Chopra, Supriya Bordoloi, Shyamanta M. Hazarika

Abstract: Graph attention networks (GATs) are among the most effective frameworks for learning node representations in relational data, but existing variants mainly operate on static graphs and rely on implicit temporal aggregation when applied to sequential data. In this paper, we introduce the Electroencephalography-temporal Graph Attention Network (EEG-tGAT), a temporally augmented formulation of GATv2 that is tailored for affordance classification from interaction sequences. The proposed model incorporates temporal attention to modulate the contribution of different time segments and temporal dropout to regularize learning across temporally correlated observations. The design reflects the assumption that temporal dimensions in affordance data are not semantically uniform and that discriminative information may be unevenly distributed across time. Experimental results on affordance datasets show that EEG-tGAT achieves improved classification performance compared to GATv2. The observed gains suggest that explicitly encoding temporal importance and enforcing temporal robustness introduce inductive biases that are better aligned with the structure of affordance-driven interaction data. These findings indicate that modest architectural changes to graph attention models can yield consistent benefits when temporal relationships play a nontrivial role in the task.

new Tracing the Thought of a Grandmaster-level Chess-Playing Transformer

Authors: Rui Lin, Zhenyu Jin, Guancheng Zhou, Xuyang Ge, Wentao Shu, Jiaxing Wu, Junxuan Wang, Zhengfu He, Junping Zhang, Xipeng Qiu

Abstract: While modern transformer neural networks achieve grandmaster-level performance in chess and other reasoning tasks, their internal computation process remains largely opaque. Focusing on Leela Chess Zero (LC0), we introduce a sparse decomposition framework to interpret its internal computation by decomposing its MLP and attention modules with sparse replacement layers, which capture the primary computation process of LC0. We conduct a detailed case study showing that these pathways expose rich, interpretable tactical considerations that are empirically verifiable. We further introduce three quantitative metrics and show that LC0 exhibits parallel reasoning behavior consistent with the inductive bias of its policy head architecture. To the best of our knowledge, this is the first work to decompose the internal computation of a transformer on both MLP and attention modules for interpretability. Combining sparse replacement layers and causal interventions in LC0 provides a comprehensive understanding of advanced tactical reasoning, offering critical insights into the underlying mechanisms of superhuman systems. Our code is available at https://github.com/JacklE0niden/Leela-SAEs.

URLs: https://github.com/JacklE0niden/Leela-SAEs

new Virtual Smart Metering in District Heating Networks via Heterogeneous Spatial-Temporal Graph Neural Networks

Authors: Keivan Faghih Niresi, Christian Møller Jensen, Carsten Skovmose Kallesøe, Rafael Wisniewski, Olga Fink

Abstract: Intelligent operation of thermal energy networks aims to improve energy efficiency, reliability, and operational flexibility through data-driven control, predictive optimization, and early fault detection. Achieving these goals relies on sufficient observability, requiring continuous and well-distributed monitoring of thermal and hydraulic states. However, district heating systems are typically sparsely instrumented and frequently affected by sensor faults, limiting monitoring. Virtual sensing offers a cost-effective means to enhance observability, yet its development and validation remain limited in practice. Existing data-driven methods generally assume dense synchronized data, while analytical models rely on simplified hydraulic and thermal assumptions that may not adequately capture the behavior of heterogeneous network topologies. Consequently, modeling the coupled nonlinear dependencies between pressure, flow, and temperature under realistic operating conditions remains challenging. In addition, the lack of publicly available benchmark datasets hinders systematic comparison of virtual sensing approaches. To address these challenges, we propose a heterogeneous spatial-temporal graph neural network (HSTGNN) for constructing virtual smart heat meters. The model incorporates the functional relationships inherent in district heating networks and employs dedicated branches to learn graph structures and temporal dynamics for flow, temperature, and pressure measurements, thereby enabling the joint modeling of cross-variable and spatial correlations. To support further research, we introduce a controlled laboratory dataset collected at the Aalborg Smart Water Infrastructure Laboratory, providing synchronized high-resolution measurements representative of real operating conditions. Extensive experiments demonstrate that the proposed approach significantly outperforms existing baselines.

new Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

Authors: Yuto Omae, Kazuki Sakai, Yohei Kakimoto, Makoto Sasaki, Yusuke Sakai, Hirotaka Takahashi

Abstract: Neural networks (NNs) are central to modern machine learning and achieve state-of-the-art results in many applications. However, the relationship between loss geometry and generalization is still not well understood. The local geometry of the loss function near a critical point is well-approximated by its quadratic form, obtained through a second-order Taylor expansion. The coefficients of the quadratic term correspond to the Hessian matrix, whose eigenspectrum allows us to evaluate the sharpness of the loss at the critical point. Extensive research suggests that flat critical points generalize better, while sharp ones lead to higher generalization error. However, evaluating sharpness requires the Hessian eigenspectrum, and general matrix characteristic equations have no closed-form solution. Therefore, most existing studies on evaluating loss sharpness rely on numerical approximation methods. Existing closed-form analyses of the eigenspectrum are primarily limited to simplified architectures, such as linear or ReLU-activated networks; consequently, theoretical analysis of smooth nonlinear multilayer neural networks remains limited. Against this background, this study focuses on nonlinear, smooth multilayer neural networks and derives a closed-form upper bound for the maximum eigenvalue of the Hessian with respect to the cross-entropy loss by leveraging the Wolkowicz-Styan bound. Specifically, the derived upper bound is expressed as a function of the affine transformation parameters, hidden layer dimensions, and the degree of orthogonality among the training samples. The primary contribution of this paper is an analytical characterization of loss sharpness in smooth nonlinear multilayer neural networks via a closed-form expression, avoiding explicit numerical eigenspectrum computation. We hope that this work provides a small yet meaningful step toward unraveling the mysteries of deep learning.
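
For readers unfamiliar with the Wolkowicz-Styan bound, the generic trace-based form for a symmetric n x n matrix is lambda_max <= m + s * sqrt(n - 1), with m = tr(H) / n and s^2 = tr(H^2) / n - m^2. The sketch below checks it on a small matrix standing in for a Hessian. The function name and the toy matrix are illustrative; the paper's network-specific expression in terms of layer parameters is not reproduced here.

```python
import numpy as np

def wolkowicz_styan_upper(H):
    """Trace-based upper bound on the largest eigenvalue of a symmetric matrix."""
    n = H.shape[0]
    m = np.trace(H) / n                  # mean of the eigenvalues
    s2 = np.trace(H @ H) / n - m ** 2    # variance of the eigenvalues
    s = np.sqrt(max(s2, 0.0))            # guard against tiny negative round-off
    return float(m + s * np.sqrt(n - 1))

# Toy symmetric matrix standing in for a Hessian (eigenvalues 2 and 2 +/- sqrt(2)).
H = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

bound = wolkowicz_styan_upper(H)
lam_max = float(np.linalg.eigvalsh(H)[-1])  # numerical reference value
```

The bound needs only two traces, so it avoids an eigendecomposition; here it evaluates to 2 + sqrt(8/3), slightly above the true largest eigenvalue 2 + sqrt(2).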

new Mild Over-Parameterization Benefits Asymmetric Tensor PCA

Authors: Shihong Ding, Weicheng Lin, Cong Fang

Abstract: Asymmetric Tensor PCA (ATPCA) is a prototypical model for studying the trade-offs between sample complexity, computation, and memory. Existing algorithms for this problem typically require at least $d^{\left\lceil\overline{k}/2\right\rceil}$ state memory cost to recover the signal, where $d$ is the vector dimension and $\overline{k}$ is the tensor order. We focus on the setting where $\overline{k} \geq 4$ is even and consider (stochastic) gradient descent-based algorithms under a limited memory budget, which permits only mild over-parameterization of the model. We propose a matrix-parameterized method (in $d^{2}$ state memory cost) using a novel three-phase alternating-update algorithm to address the problem and demonstrate how mild over-parameterization facilitates learning in two key aspects: (i) it improves sample efficiency, allowing our method to achieve near-optimal $d^{\overline{k}-2}$ sample complexity in our limited memory setting; and (ii) it enhances adaptivity to problem structure, a previously unrecognized phenomenon, where the required sample size naturally decreases as consecutive vectors become more aligned, and in the symmetric limit attains $d^{\overline{k}/2}$, matching the best known polynomial-time complexity. To our knowledge, this is the first tractable algorithm for ATPCA with $d^{\overline{k}}$-independent memory costs.

new Exploring the impact of fairness-aware criteria in AutoML

Authors: Joana Sim\~oes, Jo\~ao Correia

Abstract: Machine Learning (ML) systems are increasingly used to support decision-making processes that affect individuals. However, these systems often rely on biased data, which can lead to unfair outcomes against specific groups. With the growing adoption of Automated Machine Learning (AutoML), the risk of intensifying discriminatory behaviours increases, as most frameworks primarily focus on model selection to maximise predictive performance. Previous research on fairness in AutoML has largely followed this trend, integrating fairness awareness only in model selection or hyperparameter tuning, while neglecting other critical stages of the ML pipeline. This paper aims to study the impact of integrating fairness directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning. As selecting appropriate fairness metrics remains a key challenge, our work incorporates complementary fairness metrics to capture different dimensions of fairness during the optimisation. Their integration within AutoML resulted in measurable differences compared to a baseline focused solely on predictive performance. Despite a 9.4% decrease in predictive power, the average fairness improved by 14.5%, accompanied by a 35.7% reduction in data usage. Furthermore, fairness integration produced complete yet simpler final solutions, suggesting that model complexity is not always required to achieve balanced and fair ML solutions.

new A Multi-head Attention Fusion Network for Industrial Prognostics under Discrete Operational Conditions

Authors: Yuqi Su, Xiaolei Fang

Abstract: Complex systems such as aircraft engines, turbines, and industrial machinery often operate under dynamically changing conditions. These varying operating conditions can substantially influence degradation behavior and make prognostic modeling more challenging, as accurate prediction requires explicit consideration of operational effects. To address this issue, this paper proposes a novel multi-head attention-based fusion neural network. The proposed framework explicitly models and integrates three signal components: (1) the monotonic degradation trend, which reflects the underlying deterioration of the system; (2) discrete operating states, identified through clustering and encoded into dense embeddings; and (3) residual random noise, which captures unexplained variation in sensor measurements. The core strength of the framework lies in its architecture, which combines BiLSTM networks with attention mechanisms to better capture complex temporal dependencies. The attention mechanism allows the model to adaptively weight different time steps and sensor signals, improving its ability to extract prognostically relevant information. In addition, a fusion module is designed to integrate the outputs from the degradation-trend branch and the operating-state embeddings, enabling the model to capture their interactions more effectively. The proposed method is validated using a dataset from the NASA repository, and the results demonstrate its effectiveness.

new The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

Authors: Mani Rash Ahmadi

Abstract: We prove that in a coupled Kuramoto oscillator network at stable equilibrium, the physical phase displacement under weak output nudging is the gradient of the loss with respect to natural frequencies, with equality as the nudging strength beta tends to zero. Prior oscillator equilibrium propagation work explicitly set aside natural frequency as a learnable parameter; we show that on sparse layered architectures, frequency learning outperforms coupling-weight learning among converged seeds (96.0% vs. 83.3% at matched parameter counts, p = 1.8e-12). The approximately 50% convergence failure rate under random initialization is a loss-landscape property, not a gradient error; topology-aware spectral seeding eliminates it in all settings tested (46/100 to 100/100 seeds on the primary task; 50/50 on a second task, K-only training, and a larger architecture).

new A Diffusion-Contrastive Graph Neural Network with Virtual Nodes for Wind Nowcasting in Unobserved Regions

Authors: Jie Shi, Siamak Mehrkanoon

Abstract: Accurate weather nowcasting remains one of the central challenges in atmospheric science, with critical implications for climate resilience, energy security, and disaster preparedness. Since it is not feasible to deploy observation stations everywhere, some regions lack dense observational networks, resulting in unreliable short-term wind predictions across those unobserved areas. Here we present a deep graph self-supervised framework that extends nowcasting capability into such unobserved regions without requiring new sensors. Our approach introduces "virtual nodes" into a diffusion and contrastive-based graph neural network, enabling the model to learn wind conditions (i.e., speed, direction and gusts) in places with no direct measurements. Using high-temporal resolution weather station data across the Netherlands, we demonstrate that this approach reduces nowcast mean absolute error (MAE) of wind speed, gusts, and direction in unobserved regions by 30% to 46% compared with interpolation and regression methods. By enabling localized nowcasts where no measurements exist, this method opens new pathways for renewable energy integration, agricultural planning, and early-warning systems in data-sparse regions.

new Integrating SAINT with Tree-Based Models: A Case Study in Employee Attrition Prediction

Authors: Adil Derrazi, Javad Pourmostafa Roshan Sharami

Abstract: Employee attrition presents a major challenge for organizations, increasing costs and reducing productivity. Predicting attrition accurately enables proactive retention strategies, but existing machine learning models often struggle to capture complex feature interactions in tabular HR datasets. While tree-based models such as XGBoost and LightGBM perform well on structured data, traditional encoding techniques like one-hot encoding can introduce sparsity and fail to preserve semantic relationships between categorical features. This study explores a hybrid approach by integrating SAINT (Self-Attention and Intersample Attention Transformer)-generated embeddings with tree-based models to enhance employee attrition prediction. SAINT leverages self-attention mechanisms to model intricate feature interactions. In this study, we explore SAINT both as a standalone classifier and as a feature extractor for tree-based models. We evaluate the performance, generalizability, and interpretability of standalone models (SAINT, XGBoost, LightGBM) and hybrid models that combine SAINT embeddings with tree-based classifiers. Experimental results show that standalone tree-based models outperform both the standalone SAINT model and the hybrid approaches in predictive accuracy and generalization. Contrary to expectations, the hybrid models did not improve performance. One possible explanation is that tree-based models struggle to utilize dense, high-dimensional embeddings effectively. Additionally, the hybrid approach significantly reduced interpretability, making model decisions harder to explain. These findings suggest that transformer-based embeddings, while capturing feature relationships, do not necessarily enhance tree-based classifiers. Future research should explore alternative fusion strategies for integrating deep learning with structured data.

new WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents

Authors: Jiaqi Wen, Pingbo Tang, Shaolei Ren, Jianyi Yang

Abstract: We study the operation of community water systems, where pumps and valves must be scheduled to reliably meet water demands while minimizing energy consumption. While existing optimization-based methods are effective under well-modeled environments, real-world community scenarios exhibit highly dynamic contexts, such as human activities and weather variations, that significantly affect water demand patterns and operational targets across different zones. Traditional optimization approaches struggle to aggregate and adapt to such heterogeneous and rapidly evolving contextual information in real time. While Large Language Model (LLM) agents offer strong capabilities for understanding heterogeneous community context, they are not suitable for directly producing reliable real-time control actions. To address these challenges, we propose a bi-level AI-agent-based framework, WaterAdmin, which integrates LLM-based community context abstraction at the upper level with optimization-based operational control at the lower level. This design leverages the complementary strengths of both paradigms to enable adaptive and reliable operation. We implement WaterAdmin on the hydraulic simulation platform EPANET and demonstrate superior performance in maintaining pressure reliability and reducing energy consumption under highly dynamic community contexts.

new Battery health prognosis using Physics-informed neural network with Quantum Feature mapping

Authors: Muhammad Imran Hossain, Md Fazley Rafy, Sarika Khushlani Solanki, Anurag K. Srivastava

Abstract: Accurate battery health prognosis using State of Health (SOH) estimation is essential for the reliability of multi-scale battery energy storage, yet existing methods are limited in generalizability across diverse battery chemistries and operating conditions. The inability of standard neural networks to capture the complex, high-dimensional physics of battery degradation is a major contributor to these limitations. To address this, a physics-informed neural network with the Quantum Feature Mapping (QFM) technique (QPINN) is proposed. QPINN projects raw battery sensor data into a high-dimensional Hilbert space, creating a highly expressive feature set that effectively captures subtle, non-linear degradation patterns using the Nystr\"om method. These quantum-enhanced features are then processed by a physics-informed network that enforces physical constraints. The proposed method achieves an average SOH estimation accuracy of 99.46\% across different datasets, substantially outperforming state-of-the-art baselines, with reductions in MAPE and RMSE of up to 65\% and 62\%, respectively. This method was validated on a large-scale, multi-chemistry dataset of 310,705 samples from 387 cells, and further showed notable adaptability in cross-validation settings, successfully transferring from one chemistry to another without relying on target-domain SOH labels.

new Structural Gating and Effect-aligned Lag-resolved Temporal Causal Discovery Framework with Application to Heat-Pollution Extremes

Authors: Rui Chen, Jinsong Wu

Abstract: This study proposes Structural Gating and Effect-aligned Discovery for Temporal Causal Discovery (SGED-TCD), a novel and general framework for lag-resolved causal discovery in complex multivariate time series. SGED-TCD combines explicit structural gating, stability-oriented learning, perturbation-effect alignment, and unified graph extraction to improve the interpretability, robustness, and functional consistency of inferred causal graphs. To evaluate its effectiveness in a representative real-world setting, we apply SGED-TCD to teleconnection-driven compound heatwave--air-pollution extremes in eastern and northern China. Using large-scale climate indices, regional circulation and boundary-layer variables, and compound extreme indicators, the framework reconstructs weighted causal networks with explicit dominant lags and relative causal importance. The inferred networks reveal clear regional and seasonal heterogeneity: warm-season extremes in Eastern China are mainly linked to low-latitude oceanic variability through circulation, radiation, and ventilation pathways, whereas cold-season extremes in Northern China are more strongly governed by high-latitude circulation variability associated with boundary-layer suppression and persistent stagnation. These results show that SGED-TCD can recover physically interpretable, hierarchical, and lag-resolved causal pathways in a challenging climate--environment system. More broadly, the proposed framework is not restricted to the present application and provides a general basis for temporal causal discovery in other complex domains.

new Intent-aligned Formal Specification Synthesis via Traceable Refinement

Authors: Zhe Ye, Aidan Z. H. Yang, Huangyuan Su, Zhenyu Liao, Samuel Tenka, Zhizhen Qin, Udaya Ghai, Dawn Song, Soonho Kong

Abstract: Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain such guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing in real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieves 86.6% on the VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and demonstrate that training on these trajectories substantially improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.

new Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

Authors: Eric Easley, Sebastian Farquhar

Abstract: We address jailbreaks, backdoors, and unlearning for large language models (LLMs). Unlike prior work, which trains LLMs based on their actions when given malign instructions, our method specifically trains the model to change how it interprets instructions. Our method, Latent Instruction Representation Alignment (LIRA), greatly improves generalization. We further boost generalization through an internally adversarial training algorithm. Our methods block over 99% of PEZ jailbreak attacks; remove a challenging insecure code backdoor; and achieve optimal forgetting on WMDP cyber with negligible loss of benign capabilities.

new CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation

Authors: Elahe Khatibi, Ziyu Wang, Ankita Sharma, Krishnendu Chakrabarty, Sanaz Rahimi Moosavi, Farshad Firouzi, Amir Rahmani

Abstract: Large language models (LLMs) enable waveform-to-text ECG interpretation and interactive clinical questioning, yet most ECG-LLM systems still rely on weak signal-text alignment and retrieval without explicit physiological or causal structure. This limits grounding, temporal reasoning, and counterfactual "what-if" analysis central to clinical decision-making. We propose CARE-ECG, a causally structured ECG-language reasoning framework that unifies representation learning, diagnosis, and explanation in a single pipeline. CARE-ECG encodes multi-lead ECGs into temporally organized latent biomarkers, performs causal graph inference for probabilistic diagnosis, and supports counterfactual assessment via structural causal models. To improve faithfulness, CARE-ECG grounds language outputs through causal retrieval-augmented generation and a modular agentic pipeline that integrates history, diagnosis, and response with verification. Across multiple ECG benchmarks and expert QA settings, CARE-ECG improves diagnostic accuracy and explanation faithfulness while reducing hallucinations (e.g., 0.84 accuracy on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL under GPT-4). Overall, CARE-ECG provides traceable reasoning by exposing key latent drivers, causal evidence paths, and how alternative physiological states would change outcomes.

new Replicable Composition

Authors: Kiarash Banihashem, MohammadHossein Bateni, Hossein Esfandiari, Samira Goudarzi, MohammadTaghi Hajiaghayi

Abstract: Replicability requires that algorithmic conclusions remain consistent when rerun on independently drawn data. A central structural question is composition: given $k$ problems each admitting a $\rho$-replicable algorithm with sample complexity $n$, how many samples are needed to solve all jointly while preserving replicability? The naive analysis yields $\widetilde{O}(nk^2)$ samples, and Bun et al. (STOC'23) observed that reductions through differential privacy give an alternative $\widetilde{O}(n^2k)$ bound, leaving open whether the optimal $\widetilde{O}(nk)$ scaling is achievable. We resolve this open problem and, more generally, show that problems with sample complexities $n_1,\ldots,n_k$ can be jointly solved with $\widetilde{O}(\sum_i n_i)$ samples while preserving constant replicability. Our approach converts each replicable algorithm into a perfectly generalizing one, composes them via a privacy-style analysis, and maps back via correlated sampling. This yields the first advanced composition theorem for replicability. En route, we obtain new bounds for the composition of perfectly generalizing algorithms with heterogeneous parameters. As part of our results, we provide a boosting theorem for the success probability of replicable algorithms. For a broad class of problems, the failure probability appears as a separate additive term independent of $\rho$, immediately yielding improved sample complexity bounds for several problems. Finally, we prove an $\Omega(nk^2)$ lower bound for adaptive composition, establishing a quadratic separation from the non-adaptive setting. The key technique, which we call the phantom run, yields structural results of independent interest.

new Membership Inference Attacks Expose Participation Privacy in ECG Foundation Encoders

Authors: Ziyu Wang, Elahe Khatibi, Ankita Sharma, Krishnendu Chakrabarty, Sanaz Rahimi Moosavi, Farshad Firouzi, Amir Rahmani

Abstract: Foundation-style ECG encoders pretrained with self-supervised learning are increasingly reused across tasks, institutions, and deployment contexts, often through model-as-a-service interfaces that expose scalar scores or latent representations. While such reuse improves data efficiency and generalization, it raises a participation privacy concern: can an adversary infer whether a specific individual or cohort contributed ECG data to pretraining, even when raw waveforms and diagnostic labels are never disclosed? In connected-health settings, training participation itself may reveal institutional affiliation, study enrollment, or sensitive health context. We present an implementation-grounded audit of membership inference attacks (MIAs) against modern self-supervised ECG foundation encoders, covering contrastive objectives (SimCLR, TS2Vec) and masked reconstruction objectives (CNN- and Transformer-based MAE). We evaluate three realistic attacker interfaces: (i) score-only black-box access to scalar outputs, (ii) adaptive learned attackers that aggregate subject-level statistics across repeated queries, and (iii) embedding-access attackers that probe latent representation geometry. Using a subject-centric protocol with window-to-subject aggregation and calibration at fixed false-positive rates under a cross-dataset auditing setting, we observe heterogeneous and objective-dependent participation leakage: leakage is most pronounced in small or institution-specific cohorts and, for contrastive encoders, can saturate in embedding space, while larger and more diverse datasets substantially attenuate operational tail risk. Overall, our results show that restricting access to raw signals or labels is insufficient to guarantee participation privacy, underscoring the need for deployment-aware auditing of reusable biosignal foundation encoders in connected-health systems.

new Towards Green Wearable Computing: A Physics-Aware Spiking Neural Network for Energy-Efficient IMU-based Human Activity Recognition

Authors: Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinze Zhou

Abstract: Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an $O(1)$-memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98\%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing.

new Rethinking the Diffusion Model from a Langevin Perspective

Authors: Candi Zheng, Yuan Lan

Abstract: Diffusion models are often introduced from multiple perspectives, such as VAEs, score matching, or flow matching, accompanied by dense and technically demanding mathematics that can be difficult for beginners to grasp. One classic question is: how does the reverse process invert the forward process to generate data from pure noise? This article systematically organizes the diffusion model from a fresh Langevin perspective, offering a simpler, clearer, and more intuitive answer. We also address the following questions: how can ODE-based and SDE-based diffusion models be unified under a single framework? Why are diffusion models theoretically superior to ordinary VAEs? Why is flow matching not fundamentally simpler than denoising or score matching, but equivalent under maximum-likelihood? We demonstrate that the Langevin perspective offers clear and straightforward answers to these questions, bridging existing interpretations of diffusion models, showing how different formulations can be converted into one another within a common framework, and offering pedagogical value for both learners and experienced researchers seeking deeper intuition.
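
As background for the Langevin perspective this abstract refers to, here is a minimal sketch of the standard unadjusted Langevin update, x <- x + step * score(x) + sqrt(2 * step) * noise, where score(x) is the gradient of the log-density. It is a generic textbook sampler, not the article's derivation; the Gaussian target, step size, and all names are assumptions chosen so the result can be checked by eye.

```python
import numpy as np

def langevin_sample(score, x0, step=0.01, n_steps=2000, seed=0):
    """Run unadjusted Langevin dynamics on an array of independent chains."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Drift along the score plus injected Gaussian noise.
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

# Assumed toy target: a 1-D Gaussian N(mu, sigma^2), whose score is known exactly.
mu, sigma = 2.0, 1.0
gaussian_score = lambda x: -(x - mu) / sigma**2

# Start 5000 chains far from the target, from wide zero-mean noise.
x_init = np.random.default_rng(1).standard_normal(5000) * 3.0
samples = langevin_sample(gaussian_score, x_init)
print(f"sample mean = {samples.mean():.3f}, sample std = {samples.std():.3f}")
```

In a diffusion model the exact score would be replaced by a learned network; the toy Gaussian is used here only because its score is available in closed form.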

new Exact Finite-Sample Variance Decomposition of Subagging: A Spectral Filtering Perspective

Authors: Ye Su, Mingrui Ye, Yining Wang, Jipeng Guo, Yong Liu

Abstract: Standard resampling ratios (e.g., $\alpha \approx 0.632$) have been widely used as default baselines in ensemble learning for three decades. However, how these ratios interact with a base learner's intrinsic functional complexity in finite samples lacks an exact mathematical characterization. We leverage the Hoeffding-ANOVA decomposition to derive the first exact, finite-sample variance decomposition for subagging, applicable to any symmetric base learner without requiring asymptotic limits or smoothness assumptions. We establish that subagging operates as a deterministic low-pass spectral filter: it preserves low-order structural signals while attenuating $c$-th order interaction variance by a geometric factor approaching $\alpha^c$. This decoupling reveals why default baselines often under-regularize high-capacity interpolators, which instead require smaller $\alpha$ to exponentially suppress spurious high-order noise. To operationalize these insights, we propose a complexity-guided adaptive subsampling algorithm, empirically demonstrating that dynamically calibrating $\alpha$ to the learner's complexity spectrum consistently improves generalization over static baselines.
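
One way to build intuition for a factor "approaching $\alpha^c$": the probability that a fixed set of $c$ training points all land in a subsample of size $k = \alpha n$ drawn without replacement is $\binom{k}{c} / \binom{n}{c}$, which tends to $\alpha^c$ as $n$ grows. The sketch below only tabulates that combinatorial ratio; treating it as the paper's exact attenuation factor is an assumption, and the function name is hypothetical.

```python
from math import comb

def subsample_order_factor(n, k, c):
    """Probability that c fixed points all fall in a size-k subsample of n points."""
    return comb(k, c) / comb(n, c)

n, alpha = 10_000, 0.632
k = round(alpha * n)  # subsample size

# Higher interaction orders c are damped geometrically, like a low-pass filter.
for c in range(1, 5):
    exact = subsample_order_factor(n, k, c)
    print(f"c={c}: C(k,c)/C(n,c) = {exact:.5f}, alpha^c = {alpha**c:.5f}")
```

Lowering alpha shrinks the high-order factors much faster than the first-order one, which is the qualitative behaviour the abstract describes.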

new CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

Authors: Xiangyang Yin, Xingyu Liu, Tianhua Xia, Bo Bao, Vithursan Thangarasa, Valavan Manohararajah, Eric Sather, Sai Qian Zhang

Abstract: Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing \textit{CodeQuant}, a unified quantization-and-clustering scheme that combines smoothing activation outliers via learnable rotation with absorbing weight outliers into fine-tuned cluster centroids for MoE. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with a dedicated kernel design for GPU and CPU, CodeQuant achieves up to $4.15\times$ speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.

URLs: https://github.com/SAI-Lab-NYU/CodeQuant

new PepBenchmark: A Standardized Benchmark for Peptide Machine Learning

Authors: Jiahui Zhang, Rouyi Wang, Kuangqi Zhou, Tianshu Xiao, Lingyan Zhu, Yaosen Min, Yang Wang

Abstract: Peptide therapeutics are widely regarded as the "third generation" of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which unifies datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection comprising 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development, representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent dataset cleaning, construction, splitting, and feature transformation, mitigating quality issues common in ad hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: Fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. The data and code are publicly available at https://github.com/ZGCI-AI4S-Pep/PepBenchmark/.

URLs: https://github.com/ZGCI-AI4S-Pep/PepBenchmark/

new IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Authors: Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li

Abstract: Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.

URLs: https://yuzhenmao.github.io/IceCache/

new WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting

Authors: Shunyu Wu, Jiawei Huang, Weibin Feng, Boxin Li, Xiao Zhang, Erli Meng, Dan Li, Jian Lou, See-Kiong Ng

Abstract: Time series foundation models (TSFMs) have recently achieved remarkable success in universal forecasting by leveraging large-scale pretraining on diverse time series data. Complementing this progress, incorporating frequency-domain information yields promising performance in enhancing the modeling of complex temporal patterns, such as periodicity and localized high-frequency dynamics, which are prevalent in real-world time series. To advance this direction, we propose a new perspective that integrates explicit frequency-domain representations into scalable foundation models, and introduce WaveMoE, a wavelet-enhanced mixture-of-experts foundation model for time series forecasting. WaveMoE adopts a dual-path architecture that jointly processes time series tokens and wavelet tokens aligned along a unified temporal axis, and coordinates them through a shared expert routing mechanism that enables consistent expert specialization while efficiently scaling model capacity. Preliminary experimental results on 16 diverse benchmark datasets indicate that WaveMoE has the potential to further improve forecasting performance by incorporating wavelet-domain corpora.

new Topology-Aware PAC-Bayesian Generalization Analysis for Graph Neural Networks

Authors: Xinping Yi

Abstract: Graph neural networks have demonstrated excellent applicability to a wide range of domains, including social networks, biological systems, recommendation systems, and wireless communications. Yet a principled theoretical understanding of their generalization behavior remains limited, particularly for graph classification tasks where complex interactions between model parameters and graph structure play a crucial role. Among existing theoretical tools, PAC-Bayesian norm-based generalization bounds provide a flexible and data-dependent framework; however, current results for GNNs often restrict the exploitation of graph structures. In this work, we propose a topology-aware PAC-Bayesian norm-based generalization framework for graph convolutional networks (GCNs) that extends a previously developed framework to graph-structured models. Our approach reformulates the derivation of generalization bounds as a stochastic optimization problem and introduces sensitivity matrices that measure the response of classification outputs with respect to structured weight perturbations. By imposing different structures on sensitivity matrices from both spatial and spectral perspectives, we derive a family of generalization error bounds with graph structures explicitly embedded. Such bounds could recover existing results as special cases, while yielding bounds that are tighter than state-of-the-art PAC-Bayesian bounds for GNNs. Notably, the proposed framework explicitly integrates graph structural properties into the generalization analysis, enabling a unified inspection of GNN generalization behavior from both spatial aggregation and spectral filtering viewpoints.

new Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria

Authors: Nikodem Tomczak

Abstract: Profiled Sparse Networks (PSN) replace uniform connectivity with deterministic, heterogeneous fan-in profiles defined by continuous, nonlinear functions, creating neurons with both dense and sparse receptive fields. We benchmark PSN across four classification datasets spanning vision and tabular domains, input dimensions from 54 to 784, and network depths of 2--3 hidden layers. At 90% sparsity, all static profiles, including the uniform random baseline, achieve accuracy within 0.2-0.6% of dense baselines on every dataset, demonstrating that heterogeneous connectivity provides no accuracy advantage when hub placement is arbitrary rather than task-aligned. This result holds across sparsity levels (80-99.9%), profile shapes (eight parametric families, lognormal, and power-law), and fan-in coefficients of variation from 0 to 2.5. Internal gradient analysis reveals that structured profiles create a 2-5x gradient concentration at hub neurons compared to the ~1x uniform distribution in random baselines, with the hierarchy strength predicted by fan-in coefficient of variation ($r = 0.93$). When PSN fan-in distributions are used to initialise RigL dynamic sparse training, lognormal profiles matched to the equilibrium fan-in distribution consistently outperform standard ERK initialisation, with advantages growing on harder tasks, achieving +0.16% on Fashion-MNIST ($p = 0.036$, $d = 1.07$), +0.43% on EMNIST, and +0.49% on Forest Cover. RigL converges to a characteristic fan-in distribution regardless of initialisation. Starting at this equilibrium allows the optimiser to refine weights rather than rearrange topology. Which neurons become hubs matters more than the degree of connectivity variance, i.e., random hub placement provides no advantage, while optimisation-driven placement does.

new ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine Learning

Authors: Kewei Zhu, Cameron Wilson, Bartosz Mazur, Yi Li, Ashleigh M. Chester, Peyman Z. Moghadam

Abstract: Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first structure-free machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF names from the Cambridge Structural Database (CSD) into vector embeddings that closely represent traditional structure-based descriptors. These embeddings enable applications in materials informatics, including property prediction, similarity retrieval, and clustering, with performance comparable to geometry-dependent methods. When combined with large language models, ReadMOF also establishes chemically meaningful reasoning ability with textual input only. Our results show that structured chemical language, interpreted through modern natural language processing techniques, can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations. This approach opens new opportunities for language-driven discovery in materials science.

new WOODELF-HD: Efficient Background SHAP for High-Depth Decision Trees

Authors: Ron Wettenstein, Alexander Nadel, Udi Boker

Abstract: Decision-tree ensembles are a cornerstone of predictive modeling, and SHAP is a standard framework for interpreting their predictions. Among its variants, Background SHAP offers high accuracy by modeling missing features using a background dataset. Historically, this approach did not scale well, as the time complexity for explaining n instances using m background samples included an O(mn) component. Recent methods such as Woodelf and PLTreeSHAP reduce this to O(m+n), but introduce a preprocessing bottleneck that grows as 3^D with tree depth D, making them impractical for deep trees. We address this limitation with WoodelfHD, a Woodelf extension that reduces the 3^D factor to 2^D. The key idea is a Strassen-like multiplication scheme that exploits the structure of Woodelf matrices, reducing matrix-vector multiplication from O(k^2) to O(k*log(k)) via a fully vectorized, non-recursive implementation. In addition, we merge path nodes with identical features, reducing cache size and memory usage. When running on standard environments, WoodelfHD enables exact Background SHAP computation for trees with depths up to 21, where previous methods fail due to excessive memory usage. For ensembles of depths 12 and 15, it achieves speedups of 33x and 162x, respectively, over the state-of-the-art.

new Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

Authors: Subramanyam Sahoo

Abstract: Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration -- a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on $1{,}000$ MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation -- ECE rises by $+0.006$ relative to the base model and MCE increases by $+0.010$ relative to neutral SFT -- though the effect does not reach statistical significance ($p = 0.41$) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by $40$--$64\%$ and improves accuracy by $1.5$--$3.0$ percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control ($0.042$ vs. $0.037$), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
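
For readers unfamiliar with the two calibration metrics quoted above, the following is a minimal sketch of Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) in their usual binned form. The equal-width binning, the bin count, the function name, and the toy data are illustrative assumptions and are not taken from the paper.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins.

    confidences: predicted probabilities of the chosen answers, in [0, 1].
    correct: 0/1 (or bool) flags saying whether each answer was right.
    n_bins: number of equal-width bins (an illustrative default).
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("inputs must be non-empty")

    # Per-bin running sums of confidence, number correct, and count.
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        conf_sum[b] += c
        hit_sum[b] += float(y)
        count[b] += 1

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / n) * gap  # gap weighted by bin occupancy
        mce = max(mce, gap)          # worst single bin
    return ece, mce


# Toy example (made-up numbers): an overconfident model.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 0, 0, 1, 0]
ece, mce = calibration_errors(confs, hits)
```

ECE averages the confidence-accuracy gap over bins, weighted by how many predictions fall in each bin, whereas MCE reports only the largest gap, so the two can move differently under the same intervention.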

new Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR

Authors: Giacomo Cignoni, Simone Magistri, Andrew D. Bagdanov, Antonio Carta

Abstract: This paper explores Online Continual Self-Supervised Learning (OCSSL), a scenario in which models learn from continuous streams of unlabeled, non-stationary data, where methods typically employ replay and fast convergence is a central desideratum. We find that OCSSL requires particular attention to the stability-plasticity trade-off: stable methods (e.g. replay with Reservoir sampling) converge faster than plastic ones (e.g. FIFO buffer), but incur performance drops under certain conditions. We explain this collapse phenomenon with the Latent Rehearsal Decay hypothesis, which attributes it to latent space degradation under excessive stability of replay. We introduce two metrics (Overlap and Deviation) that diagnose latent degradation and correlate with accuracy declines. Building on these insights, we propose SOLAR, which leverages efficient online proxies of Deviation to guide buffer management and incorporates an explicit Overlap loss, allowing SOLAR to adaptively manage plasticity. Experiments demonstrate that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, with both high convergence speed and final performance.
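
The contrast between the two replay strategies named above can be sketched as follows. This is only an illustration of standard reservoir sampling (Algorithm R) against a FIFO buffer; the class names, capacity, and integer stream are hypothetical, and SOLAR's own buffer management is not reproduced here.

```python
import random
from collections import deque


class ReservoirBuffer:
    """Replay buffer holding a uniform random sample of the stream (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # The new item is kept with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


class FIFOBuffer:
    """Replay buffer that keeps only the most recent items."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, item):
        self.items.append(item)


reservoir, fifo = ReservoirBuffer(capacity=100), FIFOBuffer(capacity=100)
for t in range(10_000):  # stream positions stand in for samples
    reservoir.add(t)
    fifo.add(t)

# FIFO holds only the last 100 positions; the reservoir spans the whole stream.
fifo_oldest = min(fifo.items)
reservoir_oldest = min(reservoir.items)
```

Because the reservoir changes ever more slowly as the stream grows, it is the "stable" option in the sense used above, while the FIFO buffer tracks the most recent data and is the "plastic" one.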

new Distributionally Robust PAC-Bayesian Control

Authors: Domagoj Herceg, Duarte Antunes

Abstract: We present a distributionally robust PAC-Bayesian framework for certifying the performance of learning-based finite-horizon controllers. While existing PAC-Bayes control literature typically assumes bounded losses and matching training and deployment distributions, we explicitly address unbounded losses and environmental distribution shifts (the sim-to-real gap). We achieve this by drawing on two modern lines of research, namely the PAC-Bayes generalization theory and distributionally robust optimization via the type-1 Wasserstein distance. By leveraging the System Level Synthesis (SLS) reparametrization, we derive a sub-Gaussian loss proxy and a bound on the performance loss due to distribution shift. Both are tied directly to the operator norm of the closed-loop map. For linear time-invariant systems, this yields a computationally tractable optimization-based framework together with high-probability safety certificates for deployment in real-world environments that differ from those used in training.

new MoEITS: A Green AI approach for simplifying MoE-LLMs

Authors: Luis Balderas, Miguel Lastra, Jos\'e M. Ben\'itez

Abstract: Large language models are transforming all areas of academia and industry, attracting the attention of researchers, professionals, and the general public. In the quest for more powerful architectures, Mixture-of-Experts (MoE) models, inspired by ensemble models, have emerged as one of the most effective approaches. However, this implies a high computational burden for both training and inference. To reduce the impact on computing and memory footprint as well as the energy consumption, simplification methods have arisen as very effective procedures. In this paper, an original algorithm, MoEITS, for MoE-LLMs simplification is presented. The algorithm is characterized by a refined simplicity, underpinned by standardized Information Theoretic frameworks. MoEITS is analyzed in depth from theoretical and practical points of view. Its computational complexity is studied. Its performance, in terms of the accuracy of the simplified LLMs and the reduction rate achieved, is assessed through carefully designed experiments. This empirical evaluation includes a comparison with state-of-the-art MoE-LLM pruning methods applied to Mixtral $8\times7$B, Qwen1.5-2.7B, and DeepSeek-V2-Lite. The extensive experimentation conducted demonstrates that MoEITS outperforms state-of-the-art techniques by generating models that are both effective across all benchmarks and computationally efficient. The code implementing the method will be available at https://github.com/luisbalru/MoEITS.

URLs: https://github.com/luisbalru/MoEITS

new Mitigating Privacy Risk via Forget Set-Free Unlearning

Authors: Aviraj Newatia, Michael Cooper, Viet Nguyen, Rahul G. Krishnan

Abstract: Training machine learning models requires the storage of large datasets, which often contain sensitive or private data. Storing data is associated with a number of potential risks which increase over time, such as database breaches and malicious adversaries. Machine unlearning is the study of methods to efficiently remove the influence of training data subsets from previously-trained models. Existing unlearning methods typically require direct access to the "forget set" -- the data to be forgotten -- and organisations must retain this data for unlearning rather than deleting it immediately upon request, increasing risks associated with the forget set. We introduce partially-blind unlearning -- utilizing auxiliary information to unlearn without explicit access to the forget set. We also propose Reload, a practical partially-blind method based on gradient optimization and structured weight sparsification. We show that Reload efficiently unlearns, approximating models retrained from scratch, and outperforms several forget set-dependent approaches. On language models, Reload unlearns entities using <0.025% of the retain set and <7% of model weights in <8 minutes on Llama2-7B. In the corrective case, Reload achieves unlearning even when only 10% of corrupted data is identified.

new SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

Authors: Rajveer Singh

Abstract: We present a systematic empirical study of the spectral structure of LoRA weight updates. Through 2D Discrete Cosine Transform (DCT) analysis of trained adaptation matrices across BERT-base and RoBERTa-base on four GLUE benchmarks (SST-2, MNLI, CoLA, QQP), we establish that LoRA updates are universally dominated by low-frequency components: on average, just 33% of DCT coefficients capture 90% of total spectral energy. Retaining only 10% of frequency coefficients reduces adapter storage by 10x while sacrificing only 1.95pp on SST-2. Notably, frequency masking at k=50% improves over full LoRA on 3 of 8 model-task pairs, suggesting high-frequency components act as adaptation noise. We further discover that RoBERTa-base is systematically more spectrally compressible than BERT-base across all tasks, and that task complexity governs spectral sensitivity -- NLI tasks require more frequency budget than sentiment classification. These findings motivate a new design principle for PEFT: spectral sparsity in adaptation.
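
The kind of measurement described above (what fraction of 2D DCT coefficients is needed to capture 90% of spectral energy) can be sketched in plain Python. This is a sketch under stated assumptions: coefficients are ranked by magnitude, which may differ from the paper's selection rule, and the smooth toy matrix merely stands in for a trained LoRA update. All function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(m):
    """2D DCT-II of a rectangular matrix: C_rows @ m @ C_cols^T."""
    c_rows = dct_matrix(len(m))
    c_cols = dct_matrix(len(m[0]))
    return matmul(matmul(c_rows, m), transpose(c_cols))


def fraction_of_coeffs_for_energy(m, target=0.90):
    """Smallest fraction of DCT coefficients (ranked by magnitude) whose
    squared values sum to at least `target` of the total spectral energy."""
    energies = sorted((v * v for row in dct2(m) for v in row), reverse=True)
    total = sum(energies)
    running = 0.0
    for used, e in enumerate(energies, start=1):
        running += e
        if running >= target * total:
            return used / len(energies)
    return 1.0


# Toy stand-in for a weight update: a smooth rank-1 matrix (made-up data).
n = 16
smooth = [[(i + 1) * (j + 1) / (n * n) for j in range(n)] for i in range(n)]
frac = fraction_of_coeffs_for_energy(smooth, target=0.90)
```

Because the basis is orthonormal, the total energy of the coefficients equals the squared Frobenius norm of the matrix, so the returned fraction is directly comparable across matrices of different scale.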

new Energy-Efficient Federated Edge Learning For Small-Scale Datasets in Large IoT Networks

Authors: Haihui Xie, Wenkun Wen, Shuwu Chen, Zhaogang Shu, Minghua Xia

Abstract: Large-scale Internet of Things (IoT) networks enable intelligent services such as smart cities and autonomous driving, but often face resource constraints. Collecting heterogeneous sensory data, especially in small-scale datasets, is challenging, and independent edge nodes can lead to inefficient resource utilization and reduced learning performance. To address these issues, this paper proposes a collaborative optimization framework for energy-efficient federated edge learning with small-scale datasets. We first derive an expected learning loss to quantify the relationship between the number of training samples and learning objectives. A stochastic online learning algorithm is then designed to adapt to data variations, and a resource optimization problem with a convergence bound is formulated. Finally, an online distributed algorithm efficiently solves large-scale optimization problems with high scalability. Extensive simulations and autonomous navigation case studies with collision avoidance demonstrate that the proposed approach significantly improves learning performance and resource efficiency compared to state-of-the-art benchmarks.

new Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Authors: Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi

Abstract: Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/

URLs: https://k1xe.github.io/skill-sd/

new SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Authors: Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

new Communication-Efficient Gluon in Federated Learning

Authors: Xun Qian, Alexander Gaponov, Grigory Malinovsky, Peter Richt\'arik

Abstract: Recent developments have shown that Muon-type optimizers based on linear minimization oracles (LMOs) over non-Euclidean norm balls have the potential to achieve better practical performance than Adam-type methods in the training of large language models. Since large-scale neural networks are trained across many machines, communication cost becomes the bottleneck. To address this bottleneck, we investigate Gluon, which is an extension of Muon under the more general layer-wise $(L^0, L^1)$-smooth setting, with both unbiased and contraction compressors. To reduce the compression error, we employ the variance reduction technique of SARAH in our compressed methods. Convergence rates and improved communication cost are established under certain conditions. As a byproduct, a new variance-reduced algorithm with a faster convergence rate than Gluon is obtained. We also incorporate momentum variance reduction (MVR) into these compressed algorithms, and comparable communication cost is derived under weaker conditions when $L_i^1 \neq 0$. Finally, several numerical experiments are conducted to verify the superior performance of our compressed algorithms in terms of communication cost.
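
To make the two compressor classes mentioned above concrete, here is a small sketch of one textbook example of each: Top-k as a contraction compressor and Rand-k as an unbiased one. The abstract does not say which compressors the paper uses, so these choices, the function names, and the toy vector are assumptions for illustration only.

```python
import random


def top_k(vec, k):
    """Contraction compressor: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def rand_k(vec, k, rng):
    """Unbiased compressor: keep k uniformly chosen entries, rescaled by d / k."""
    d = len(vec)
    keep = set(rng.sample(range(d), k))
    return [v * d / k if i in keep else 0.0 for i, v in enumerate(vec)]


def sq_norm(vec):
    return sum(v * v for v in vec)


x = [3.0, -5.0, 1.0, 0.5]  # made-up gradient vector
compressed = top_k(x, k=2)

# Contraction property of Top-k: ||C(x) - x||^2 <= (1 - k/d) * ||x||^2.
residual = sq_norm([c - v for c, v in zip(compressed, x)])
bound = (1 - 2 / len(x)) * sq_norm(x)

# Empirical check that Rand-k is unbiased: the average of many draws approaches x.
rng = random.Random(0)
trials = 20_000
mean = [0.0] * len(x)
for _ in range(trials):
    for i, v in enumerate(rand_k(x, k=2, rng=rng)):
        mean[i] += v / trials
```

Top-k is biased but deterministic with small error, whereas Rand-k is unbiased at the cost of higher variance, which is one reason variance reduction is a natural companion to compression.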

new Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Authors: Zikang Shan, Han Zhong, Liwei Wang, Li Zhao

Abstract: Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

new INCRT: An Incremental Transformer That Determines Its Own Architecture

Authors: Giansalvo Cirrincione

Abstract: Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy -- between half and four-fifths of all heads in a trained model can be removed without measurable loss -- because the architecture allocates capacity without reference to the actual requirements of the task. This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.

new PokeRL: Reinforcement Learning for Pokemon Red

Authors: Dheeraj Mudireddy, Sai Patibandla

Abstract: Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL

URLs: https://github.com/reddheeraj/PokemonRL

new Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

Authors: Yijin Ni, Xiaoming Huo

Abstract: We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-\alpha)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $\alpha \rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-\alpha)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $\Theta(n^{-(1-\alpha)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $\Omega(n^{-(1-\alpha)/2})$, and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves $O(n^{-(1-\alpha)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

new Slithering Through Gaps: Capturing Discrete Isolated Modes via Logistic Bridging

Authors: Pinaki Mohanty, Ruqi Zhang

Abstract: High-dimensional and complex discrete distributions often exhibit multimodal behavior due to inherent discontinuities, posing significant challenges for sampling. Gradient-based discrete samplers, while effective, frequently become trapped in local modes when confronted with rugged or disconnected energy landscapes. This limits their ability to achieve adequate mixing and convergence in high-dimensional multimodal discrete spaces. To address these challenges, we propose Hyperbolic Secant-squared Gibbs-Sampling (HiSS), a novel family of sampling algorithms that integrates a Metropolis-within-Gibbs framework to enhance mixing efficiency. HiSS leverages a logistic convolution kernel to couple the discrete sampling variable with the continuous auxiliary variable in a joint distribution. This design allows the auxiliary variable to encapsulate the true target distribution while facilitating easy transitions between distant and disconnected modes. We provide theoretical guarantees of convergence and demonstrate empirically that HiSS outperforms many popular alternatives on a wide variety of tasks, including Ising models, binary neural networks, and combinatorial optimization.

new Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Authors: Francesco D'Angelo, Nicolas Flammarion

Abstract: Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.
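
As background for the algorithm named above, the following sketch runs Mirror Descent with the entropic mirror map (exponentiated gradient) to fit mixture weights on the probability simplex. The choice of mirror map, the step size, the toy likelihood table, and the function names are assumptions for illustration; the paper's transformer construction is not reproduced here.

```python
import math


def mirror_descent_step(weights, grad, lr):
    """One entropic mirror descent (exponentiated gradient) step on the simplex."""
    logits = [math.log(w) - lr * g for w, g in zip(weights, grad)]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(z - top) for z in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def nll_gradient(weights, likelihoods):
    """Gradient of the average negative log-likelihood of a mixture.

    likelihoods[t][k] is the probability that component k assigns to
    observation t; the mixture assigns sum_k weights[k] * likelihoods[t][k].
    """
    grad = [0.0] * len(weights)
    for row in likelihoods:
        mix = sum(w * p for w, p in zip(weights, row))
        for k, p in enumerate(row):
            grad[k] -= p / (mix * len(likelihoods))
    return grad


# Toy data (made up): every observation is best explained by component 1.
likelihoods = [[0.1, 0.8, 0.1]] * 20
weights = [1 / 3] * 3
for _ in range(50):
    weights = mirror_descent_step(weights, nll_gradient(weights, likelihoods), lr=0.5)
```

The multiplicative update keeps the weights positive and normalized without any projection step, which is why this variant of Mirror Descent is a common choice for learning distributions over a finite set.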

new Task2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings

Authors: Cristiano Mafuz, Rodrigo Silva

Abstract: Federated learning (FL) performance is highly sensitive to heterogeneity across clients, yet practitioners lack reliable methods to anticipate how a federation will behave before training. We propose readiness indices, derived from Task2Vec embeddings, that quantify the alignment of a federation prior to training and correlate with its eventual performance. Our approach computes unsupervised metrics -- such as cohesion, dispersion, and density -- directly from client embeddings. We evaluate these indices across diverse datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) and client counts (10--20), under Dirichlet heterogeneity levels spanning $\alpha \in \{0.05,\dots,5.0\}$ with the FedAvg aggregation strategy. Correlation analyses show consistent and significant Pearson and Spearman coefficients between some of the Task2Vec-based readiness indices and final performance, with values often exceeding 0.9 across dataset$\times$client configurations, validating this approach as a robust proxy for FL outcomes. These findings establish Task2Vec-based readiness as a principled, pre-training diagnostic for FL that may offer both predictive insight and actionable guidance for client selection in heterogeneous federations.

new Query Lower Bounds for Diffusion Sampling

Authors: Zhiyang Xun, Eric Price

Abstract: Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear. In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for $d$-dimensional distributions, given access to score estimates with polynomial accuracy $\varepsilon=d^{-O(1)}$ (in any $L^p$ sense), any sampling algorithm requires $\widetilde{\Omega}(\sqrt{d})$ adaptive score queries. In particular, our proof shows that any sampler must search over $\widetilde{\Omega}(\sqrt{d})$ distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice.

new DIB-OD: Preserving the Invariant Core for Robust Heterogeneous Graph Adaptation via Decoupled Information Bottleneck and Online Distillation

Authors: Yang Yan, Qiuyan Wang, Tianjin Huang, Qiudong Yu, Kexin Zhang

Abstract: Graph Neural Network pretraining is pivotal for leveraging unlabeled graph data. However, generalizing across heterogeneous domains remains a major challenge due to severe distribution shifts. Existing methods primarily focus on intra-domain patterns, failing to disentangle task-relevant invariant knowledge from domain-specific redundant noise, leading to negative transfer and catastrophic forgetting. To this end, we propose DIB-OD, a novel framework designed to preserve the invariant core for robust heterogeneous graph adaptation through a Decoupled Information Bottleneck and Online Distillation framework. Our core innovation is the explicit decomposition of representations into orthogonal invariant and redundant subspaces. By utilizing an Information Bottleneck teacher-student distillation mechanism and the Hilbert-Schmidt Independence Criterion, we isolate a stable invariant core that transcends domain boundaries. Furthermore, a self-adaptive semantic regularizer is introduced to protect this core from corruption during target-domain adaptation by dynamically gating label influence based on predictive confidence. Extensive experiments across chemical, biological, and social network domains demonstrate that DIB-OD significantly outperforms state-of-the-art methods, particularly in challenging inter-type domain transfers, showcasing superior generalization and anti-forgetting performance.

new Learning to Adapt: In-Context Learning Beyond Stationarity

Authors: Zhen Qin, Jiachen Jiang, Zhihui Zhu

Abstract: Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs -- effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
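
The recency bias described above can be illustrated with a toy recurrence. This is only a sketch under strong simplifying assumptions, not the paper's GLA: the gate is a fixed scalar rather than a learned, input-dependent one, and the running normalizer `z` is added here purely so that the two settings are comparable. The function name `gated_linear_readout` is hypothetical.

```python
def gated_linear_readout(xs, ys, x_query, gate):
    """Linear-attention-style readout with a constant scalar forget gate.

    State update:  S_t = gate * S_{t-1} + y_t * x_t
    Normalizer:    z_t = gate * z_{t-1} + 1   (an assumption of this sketch)
    Prediction:    <S_T, x_query> / z_T
    gate = 1.0 weights all past examples equally (no recency bias).
    """
    state = [0.0] * len(x_query)
    z = 0.0
    for x, y in zip(xs, ys):
        state = [gate * s + y * xi for s, xi in zip(state, x)]
        z = gate * z + 1.0
    return sum(s * q for s, q in zip(state, x_query)) / z


# Non-stationary toy task: the target jumps from 0 to 4 late in the prompt.
xs = [[1.0]] * 6
ys = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0]

ungated = gated_linear_readout(xs, ys, [1.0], gate=1.0)
gated = gated_linear_readout(xs, ys, [1.0], gate=0.5)
print(ungated, gated)
```

With `gate=0.5`, older examples are discounted geometrically, so the readout sits closer to the most recent target value than the ungated average does.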

new UniPROT: Uniform Prototype Selection via Partial Optimal Transport with Submodular Guarantees

Authors: Prateek Chanda, Prayas Agrawal, Karthik S. Gurumoorthy, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria

Abstract: Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a \emph{super-additive} objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we showcase that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both finetuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available on GitHub at https://github.com/efficiency-learning/UniPROT.

URLs: https://github.com/efficiency-learning/UniPROT

new Hypergraph Neural Diffusion: A PDE-Inspired Framework for Hypergraph Message Passing

Authors: Zhiheng Zhou, Mengyao Zhou, Xixun Lin, Xingqin Qi, Guiying Yan

Abstract: Hypergraph neural networks (HGNNs) have shown remarkable potential in modeling high-order relationships that naturally arise in many real-world data domains. However, existing HGNNs often suffer from shallow propagation, oversmoothing, and limited adaptability to complex hypergraph structures. In this paper, we propose Hypergraph Neural Diffusion (HND), a novel framework that unifies nonlinear diffusion equations with neural message passing on hypergraphs. HND is grounded in a continuous-time hypergraph diffusion equation, formulated via hypergraph gradient and divergence operators, and modulated by a learnable, structure-aware coefficient matrix over hyperedge-node pairs. This partial differential equation (PDE) based formulation provides a physically interpretable view of hypergraph learning, where feature propagation is understood as an anisotropic diffusion process governed by local inconsistency and adaptive diffusion coefficient. From this perspective, neural message passing becomes a discretized gradient flow that progressively minimizes a diffusion energy functional. We derive rigorous theoretical guarantees, including energy dissipation, solution boundedness via a discrete maximum principle, and stability under explicit and implicit numerical schemes. The HND framework supports a variety of integration strategies such as non-adaptive-step (like Runge-Kutta) and adaptive-step solvers, enabling the construction of deep, stable, and interpretable architectures. Extensive experiments on benchmark datasets demonstrate that HND achieves competitive performance. Our results highlight the power of PDE-inspired design in enhancing the stability, expressivity, and interpretability of hypergraph learning.

new Continuous-time Online Learning via Mean-Field Neural Networks: Regret Analysis in Diffusion Environments

Authors: Erhan Bayraktar, Bingyan Han, Ziqing Zhang

Abstract: We study continuous-time online learning where data are generated by a diffusion process with unknown coefficients. The learner employs a two-layer neural network, continuously updating its parameters in a non-anticipative manner. The mean-field limit of the learning dynamics corresponds to a stochastic Wasserstein gradient flow adapted to the data filtration. We establish regret bounds for both the mean-field limit and finite-particle system. Our analysis leverages the logarithmic Sobolev inequality, Polyak-Lojasiewicz condition, Malliavin calculus, and uniform-in-time propagation of chaos. Under displacement convexity, we obtain a constant static regret bound. In the general non-convex setting, we derive explicit linear regret bounds characterizing the effects of data variation, entropic exploration, and quadratic regularization. Finally, our simulations demonstrate the outperformance of the online approach and the impact of network width and regularization parameters.

new Learning to Test: Physics-Informed Representation for Dynamical Instability Detection

Authors: Minxing Zheng, Zewei Deng, Liyan Xie, Shixiang Zhu

Abstract: Many safety-critical scientific and engineering systems evolve according to differential-algebraic equations (DAEs), where dynamical behavior is constrained by physical laws and admissibility conditions. In practice, these systems operate under stochastically varying environmental inputs, so stability is not a static property but must be reassessed as the context distribution shifts. Repeated large-scale DAE simulation, however, is computationally prohibitive in high-dimensional or real-time settings. This paper proposes a test-oriented learning framework for stability assessment under distribution shift. Rather than re-estimating physical parameters or repeatedly solving the underlying DAE, we learn a physics-informed latent representation of contextual variables that captures stability-relevant structure and is regularized toward a tractable reference distribution. Trained on baseline data from a certified safe regime, the learned representation enables deployment-time safety monitoring to be formulated as a distributional hypothesis test in latent space, with controlled Type I error. By integrating neural dynamical surrogates, uncertainty-aware calibration, and uniformity-based testing, our approach provides a scalable and statistically grounded method for detecting instability risk in stochastic constrained dynamical systems without repeated simulation.

new Robust Adversarial Policy Optimization Under Dynamics Uncertainty

Authors: Mintae Kim, Koushil Sreenath

Abstract: Reinforcement learning (RL) policies often fail under dynamics that differ from training, a gap not fully addressed by domain randomization or existing adversarial RL methods. Distributionally robust RL provides a formal remedy but still relies on surrogate adversaries to approximate intractable primal problems, leaving blind spots that potentially cause instability and over-conservatism. We propose a dual formulation that directly exposes the robustness-performance trade-off. At the trajectory level, a temperature parameter from the dual problem is approximated with an adversarial network, yielding efficient and stable worst-case rollouts within a divergence bound. At the model level, we employ Boltzmann reweighting over dynamics ensembles, focusing on environments that are more adverse to the current policy rather than sampling uniformly. The two components act independently and complement each other: trajectory-level steering ensures robust rollouts, while model-level sampling provides policy-sensitive coverage of adverse dynamics. The resulting framework, robust adversarial policy optimization (RAPO), outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.

new Tracking High-order Evolutions via Cascading Low-rank Fitting

Authors: Zhao Song

Abstract: Diffusion models have become the de facto standard for modern visual generation, including well-established frameworks such as latent diffusion and flow matching. Recently, modeling high-order dynamics has emerged as a promising frontier in generative modeling. Rather than only learning the first-order velocity field that transports random noise to a target data distribution, these approaches simultaneously learn higher-order derivatives, such as acceleration and jerk, yielding a diverse family of higher-order diffusion variants. To represent higher-order derivatives, naive approaches instantiate separate neural networks for each order, which scales the parameter space linearly with the derivative order. To overcome this computational bottleneck, we introduce cascading low-rank fitting, an ordinary differential equation inspired method that approximates successive derivatives by applying a shared base function augmented with sequentially accumulated low-rank components. Theoretically, we analyze the rank dynamics of these successive matrix differences. We prove that if the initial difference is linearly decomposable, the generic ranks of high-order derivatives are guaranteed to be monotonically non-increasing. Conversely, we demonstrate that without this structural assumption, the General Leibniz Rule allows ranks to strictly increase. Furthermore, we establish that under specific conditions, the sequence of derivative ranks can be designed to form any arbitrary permutation. Finally, we present a straightforward algorithm to efficiently compute the proposed cascading low-rank fitting.

new Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

Authors: Zhuolun Dong, Junyu Cao

Abstract: Large language models (LLMs) have been widely adopted due to their great performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown. The memory usage for each request grows with generated tokens, which may lead to overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to commonly used strategies in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.

new K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks

Authors: Jon-Paul Cacioli

Abstract: We present this as a negative result with an explanatory mechanism, not as a formal upper bound. Predictive coding networks (PCNs) admit a K-way energy probe in which each candidate class is fixed as a target, inference is run to settling, and the per-hypothesis settled energies are compared. The probe appears to read a richer signal source than softmax, since the per-hypothesis energy depends on the entire generative chain. We argue this appearance is misleading under the standard Pinchetti-style discriminative PC formulation. We present an approximate reduction showing that with target-clamped CE-energy training and effectively-feedforward latent dynamics, the K-way energy margin decomposes into a monotone function of the log-softmax margin plus a residual that is not trained to correlate with correctness. The decomposition predicts that the structural probe should track softmax from below. We test this across six conditions on CIFAR-10: extended deterministic training, direct measurement of latent movement during inference, a post-hoc decoder fairness control on a backpropagation network, a matched-budget PC vs BP comparison, a five-point Langevin temperature sweep, and trajectory-integrated MCPC training. In every condition the probe sat below softmax. The gap was stable across training procedures within the discriminative PC family. Final-state and trajectory-integrated training produced probes whose AUROC_2 values differed by less than 10^-3 at deterministic evaluation. The empirical regime is small: single seed, 2.1M-parameter network, 1280 test images. We frame the result as a preprint inviting replication. We discuss conditions under which the decomposition does not apply (bidirectional PC, prospective configuration, generative PC, non-CE energy formulations) and directions for productive structural probing the analysis does not foreclose.

new Optimal Stability of KL Divergence under Gaussian Perturbations

Authors: Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li

Abstract: We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\epsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$. Moreover, we prove that this $\sqrt{\epsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.

new RTMC: Step-Level Credit Assignment via Rollout Trees

Authors: Tao Wang, Suhang Zheng, Xiaoxiao Xu

Abstract: Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages--without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.
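
The idea of pooling returns across rollouts that share a state can be sketched as follows. The abstract does not give the exact estimator, so this is an assumption: Q(s, a) is taken as the mean return of rollouts passing through (s, a), V(s) as the mean return of rollouts passing through s, and the advantage as their difference. State signatures are plain strings here, and `rollout_tree_advantages` is a hypothetical name.

```python
from collections import defaultdict


def rollout_tree_advantages(rollouts):
    """Per-step advantages from return statistics shared across rollouts.

    rollouts: list of (steps, ret), where steps is a list of
    (state_signature, action) pairs and ret is the trajectory return.
    """
    state_returns = defaultdict(list)
    pair_returns = defaultdict(list)
    for steps, ret in rollouts:
        for state, action in steps:
            state_returns[state].append(ret)
            pair_returns[(state, action)].append(ret)

    def mean(values):
        return sum(values) / len(values)

    # Advantage of each step: Q(s, a) - V(s), both Monte Carlo averages.
    return [
        [mean(pair_returns[(s, a)]) - mean(state_returns[s]) for s, a in steps]
        for steps, _ in rollouts
    ]


# Three rollouts on one problem; the first two share the prefix (s0, a).
rollouts = [
    ([("s0", "a"), ("s1", "x")], 1.0),
    ([("s0", "a"), ("s1", "y")], 0.0),
    ([("s0", "b"), ("s2", "z")], 0.0),
]
advantages = rollout_tree_advantages(rollouts)
print(advantages)
```

Unlike a uniform per-trajectory advantage, the two steps of the first rollout receive different credit: the shared first action is credited less than the branch-specific second action.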

new Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.

new Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Authors: Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

Abstract: Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from $10$ labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching, RelP, gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation, which fields causally drive the output, whereas other readouts are dominated by task representation, biases toward field identity and value. We release all models, code, and evaluation infrastructure.

new A Faster Path to Continual Learning

Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng

Abstract: Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments show that C-Flat Turbo is 1.0$\times$ to 1.25$\times$ faster than C-Flat across a wide range of CL methods, while achieving comparable or even improved accuracy.

new CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models

Authors: Linggang Kong, Lei Wu, Yunlong Zhang, Xiaofeng Zhong, Zhen Wang, Yongjie Wang, Yao Pan

Abstract: Despite the groundbreaking advancements made by large language models (LLMs), hallucination remains a critical bottleneck for their deployment in high-stakes domains. Existing classification-based methods mainly rely on static and passive signals from internal states, which often capture noise and spurious correlations while overlooking the underlying causal mechanisms. To address this limitation, we shift the paradigm from passive observation to active intervention by introducing CausalGaze, a novel hallucination detection framework based on structural causal models (SCMs). CausalGaze models LLMs' internal states as dynamic causal graphs and employs counterfactual interventions to disentangle causal reasoning paths from incidental noise, thereby enhancing model interpretability. Extensive experiments across four datasets and three widely used LLMs demonstrate the effectiveness of CausalGaze, including an improvement of over 5.2\% in AUROC on the TruthfulQA dataset compared to state-of-the-art baselines.

new Bottleneck Tokens for Unified Multimodal Retrieval

Authors: Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng

Abstract: Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., ) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).

new Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning

Authors: Linjie Li, Huiyu Xiao, Jiarui Cao, Zhenyu Wu, Yang Ji

Abstract: Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all seen classes. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation mechanism to model the relational dependencies among task embeddings, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Using the quantum gating outputs as task-embedding-level correlation weights, we perform task-interaction knowledge distillation from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces. Extensive experiments demonstrate that QKD effectively mitigates forgetting and achieves state-of-the-art performance.

new Distributionally Robust K-Means Clustering

Authors: Vikrant Malik, Taylan Kargin, Babak Hassibi

Abstract: K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd--Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.

new Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

Authors: Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki, Vaibhav Shrivastava, Yue Cheng, Jay Minesh Shah, Katayoun Zand, Mansi Tripathi, Arya Pudota, Matthew Becker, Herv\'e Robert, Abhishek Gulati

Abstract: Although LLMs drive automation, high-stakes enterprise workflows, such as those involving legal matters, risk management, and privacy compliance, demand particular care. For Meta, and other organizations like ours, a single hallucinated clause in such high-stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely used public benchmark suites (TruthfulQA and LegalBench) and on real-world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
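
Reference-free MBR selection with a hybrid utility can be sketched as below. This is a minimal illustration, not the HUMBR implementation: the "semantic" term is a bag-of-words cosine standing in for embedding similarity, the lexical term is token-level Jaccard overlap, and the mixing weight `alpha` and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def lexical_overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def bow_cosine(a, b):
    """Cosine similarity of bag-of-words counts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def mbr_select(candidates, alpha=0.5):
    """Return the candidate with the highest mean utility against the others."""
    def utility(a, b):
        return alpha * bow_cosine(a, b) + (1 - alpha) * lexical_overlap(a, b)

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the contract ends on 1 May 2025",
    "the contract ends on 1 May 2025 unless renewed",
    "the contract never ends",
]
best = mbr_select(candidates)
print(best)
```

The outlier candidate agrees least with the other samples, so it receives the lowest expected utility and is not selected; no ground-truth reference is needed.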

new A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments

Authors: Elouan Colybes, Shirin Salehi, Anke Schmeink

Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, thereby preserving privacy. However, FL often suffers from significant communication and computational overhead, limiting its scalability and sustainability. In this work, we introduce a Full Compression Pipeline (FCP) for FL in communication-constrained environments. FCP integrates three complementary deep compression techniques (pruning, quantization, and Huffman encoding) into a unified end-to-end framework. By compressing local models and communication payloads, FCP substantially reduces transmission costs and resource consumption while maintaining competitive accuracy. To quantify its impact, we develop an evaluation framework that captures both communication and computation overheads as a unified model cost, allowing a holistic assessment of efficiency trade-offs. The pipeline is evaluated in both independent and identically distributed (IID) and non-IID data settings. In one representative scenario, training a ResNet-12 model on the CIFAR-10 dataset with ten clients and a 2 Mbps bandwidth, the FCP achieves more than 11$\times$ reduction in model size, with only a 2% drop in accuracy compared to the uncompressed baseline. This results in FL training that is more than 60% faster.
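
The three stages named above (pruning, quantization, Huffman encoding) can be chained as in the following toy sketch. It operates on a flat list of weights and is not the authors' pipeline: magnitude pruning, a uniform quantization grid, and the parameter values are all assumptions chosen for illustration.

```python
import heapq
from collections import Counter


def prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def quantize(weights, num_levels):
    """Map each weight to an integer index on a uniform grid over [min, max]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0] * len(weights)
    step = (hi - lo) / (num_levels - 1)
    return [round((w - lo) / step) for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


weights = [0.9, -0.05, 0.02, -0.8, 0.01, 0.4, -0.03, 0.0]
symbols = quantize(prune(weights, sparsity=0.5), num_levels=4)
lengths = huffman_code_lengths(symbols)
compressed_bits = sum(lengths[s] for s in symbols)
raw_bits = 32 * len(weights)  # assuming 32-bit floats before compression
print(symbols, compressed_bits, raw_bits)
```

Pruning concentrates mass on one quantization level, which is what lets the Huffman stage assign it a one-bit code. A real payload would also need to ship the codebook and grid parameters, which this sketch ignores.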

new Gradient-Variation Regret Bounds for Unconstrained Online Learning

Authors: Yuheng Zhao, Andrew Jacobsen, Nicol\`o Cesa-Bianchi, Peng Zhao

Abstract: We develop parameter-free algorithms for unconstrained online learning with regret guarantees that scale with the gradient variation $V_T(u) = \sum_{t=2}^T \|\nabla f_t(u)-\nabla f_{t-1}(u)\|^2$. For $L$-smooth convex losses, we provide fully adaptive algorithms achieving regret of order $\widetilde{O}(\|u\|\sqrt{V_T(u)} + L\|u\|^2+G^4)$ without requiring prior knowledge of the comparator norm $\|u\|$, the Lipschitz constant $G$, or the smoothness $L$. The update in each round can be computed efficiently via a closed-form expression. Our results extend to dynamic regret and have immediate implications for the stochastically-extended adversarial (SEA) model, significantly improving upon the previous best-known result [Wang et al., 2025].
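
As a concrete reading of the gradient-variation quantity $V_T(u)$ defined above, the following sketch evaluates it for a toy sequence of quadratic losses. The losses $f_t(x) = \tfrac{1}{2}\|x - c_t\|^2$ and the centers $c_t$ are assumptions chosen only to make the gradients easy to write down.

```python
def gradient_variation(grads):
    """V_T(u) = sum over t >= 2 of ||g_t - g_{t-1}||^2, with g_t = grad f_t(u)."""
    total = 0.0
    for prev, cur in zip(grads, grads[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total


# Toy losses f_t(x) = 0.5 * ||x - c_t||^2, so grad f_t(u) = u - c_t.
u = [1.0, -1.0]
centers = [[0.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
grads = [[ui - ci for ui, ci in zip(u, c)] for c in centers]

v_T = gradient_variation(grads)
print(v_T)
```

For these losses the gradient differences equal the differences between consecutive centers, so $V_T(u)$ does not depend on $u$ and is small whenever the loss sequence changes slowly.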

new Towards Situation-aware State Modeling for Air Traffic Flow Prediction

Authors: Anqi Liu, Bin Wang, Jiangtao Zhao, Dechuan Ma, Guiyuan Jiang, Feng Hong, Yanwei Yu, Tianrui Li

Abstract: Accurate air traffic prediction in the terminal airspace (TA) is pivotal for proactive air traffic management (ATM). However, existing data-driven approaches predominantly rely on time series-based forecasting paradigms, which inherently overlook critical aircraft state information, such as real-time kinematics and proximity to airspace boundaries. To address this limitation, we propose \textit{AeroSense}, a direct state-to-flow modeling framework for air traffic prediction. Unlike classical time series-based methods that first aggregate aircraft trajectories into macroscopic flow sequences before modeling, AeroSense explicitly represents the real-time airspace situation as \textit{a dynamic set of aircraft states}, enabling the direct processing of a variable number of aircraft instead of time series as inputs. Specifically, we introduce a situation-aware state representation that enables AeroSense to sense the instantaneous terminal airspace situation directly from microscopic aircraft states. Furthermore, we design a model architecture that incorporates masked self-attention to capture inter-aircraft interactions, together with two decoupled prediction heads to model heterogeneous flow dynamics across two key functional areas of the TA. Extensive experiments on a large-scale real-world airport dataset demonstrate that AeroSense consistently achieves state-of-the-art performance, validating that direct modeling of microscopic aircraft states yields substantially higher predictive fidelity than time series-based baselines. Moreover, the proposed framework exhibits superior robustness during peak traffic periods, achieves Pareto-optimal performance under dayparting multi-object evaluation, and provides meaningful interpretability through attention-based visualizations.

new ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values

Authors: Tom Bewley, Salim I. Amoukou, Emanuele Albini, Saumitra Mishra, Manuela Veloso

Abstract: Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank's loan approval rate), so understanding their causes can be crucial. We propose ShapShift: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. We show that ShapShift provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.

new Unified Graph Prompt Learning via Low-Rank Graph Message Prompting

Authors: Beibei Wang, Bo Jiang, Ziyan Zhang, Jin Tang

Abstract: Graph Data Prompt (GDP), which introduces specific prompts in graph data for efficiently adapting pre-trained GNNs, has become a mainstream approach to the graph fine-tuning problem. However, existing GDPs have each been designed for a distinct graph component (e.g., node features, edge features, edge weights) and thus operate within limited prompt spaces for graph data. To the best of our knowledge, there is still no unified prompter suitable for targeting all graph components simultaneously. To address this challenge, in this paper, we first propose to reinterpret a wide range of existing GDPs from the perspective of the Graph Message Prompt (GMP) paradigm. Based on GMP, we then introduce a novel graph prompt learning approach, termed Low-Rank GMP (LR-GMP), which leverages a low-rank prompt representation to achieve effective and compact graph prompt learning. Unlike traditional GDPs that target distinct graph components separately, LR-GMP concurrently performs prompting on all graph components in a unified manner, thereby achieving significantly superior generalization and robustness on diverse downstream tasks. Extensive experiments on several graph benchmark datasets demonstrate the effectiveness and advantages of our proposed LR-GMP.

new AbLWR: A Context-Aware Listwise Ranking Framework for Antibody-Antigen Binding Affinity Prediction via Positive-Unlabeled Learning

Authors: Fan Xu, Zhi-an Huang, Haohuai He, Yidong Song, Wei Liu, Dongxu Zhang, Yao Hu, Kay Chen Tan

Abstract: Accurate prediction of antibody-antigen binding affinity is fundamental to therapeutic design, yet remains constrained by severe label sparsity and the complexity of antigenic variations. In this paper, we propose AbLWR (Antibody-antigen binding affinity List-Wise Ranking), a novel framework that reformulates the conventional affinity regression task as a listwise ranking problem. To mitigate label sparsity, AbLWR incorporates a PU (Positive-Unlabeled) learning mechanism leveraging a dual-level contrastive objective and meta-optimized label refinement to learn robust representations. Furthermore, we address antigenic variation by employing a homologous antigen sampling strategy where Multi-Head Self-Attention (MHSA) explicitly models inter-sample relationships within training lists to capture subtle affinity nuances. Extensive experiments demonstrate that AbLWR significantly outperforms state-of-the-art baselines, improving the Precision@1 (P@1) by over 10$\%$ in randomized cross-validation experiments. Notably, case studies on Influenza and IL-33 validate its practical utility, demonstrating robust ranking consistency in distinguishing subtle viral mutations and efficiently prioritizing top-tier candidates for wet-lab screening.

new Mycelium-Index: A Streaming Approximate Nearest Neighbor Index with Myelial Edge Decay, Traffic-Driven Reinforcement, and Adaptive Living Hierarchy

Authors: Anton Pakhunov

Abstract: We present mycelium-index, a streaming approximate nearest neighbor (ANN) index for high-dimensional vector spaces, inspired by the adaptive growth patterns of biological mycelium. The system continuously adapts its topology through myelial edge decay and reinforcement, a traffic-driven living hierarchy, and hybrid deletion combining O(1) bypass for cold nodes with O(k) beam-search repair for hub nodes. Experimental evaluation on SIFT-1M demonstrates that mycelium achieves 0.927 +/- 0.028 recall@5 under FreshDiskANN's 100%-turnover benchmark protocol -- within the measurement confidence interval of FreshDiskANN's ~0.95 -- while using 5.7x less RAM (88 MB vs. >500 MB) and achieving 4.7x higher QPS (2,795 vs. ~600). On the static index, at ef=192, mycelium matches HNSW M=16 recall (0.962 vs. 0.965) at 5.2x less RAM (163 MB vs. 854 MB). Performance optimizations including NEON SIMD distance computation, Vec-backed node storage, and bitset visited tracking yield a cumulative 2.7x QPS improvement. A systematic study of ten streaming repair mechanisms finds that geometric heuristics universally fail in high dimensions, while topological mechanisms succeed -- a principle we term the topological repair invariance of high-dimensional ANN graphs.

new Sheaf Diffusion with Adaptive Local Structure for Spatio-Temporal Forecasting

Authors: Abeer Mostafa, Raneen Younis, Zahra Ahmadi

Abstract: Spatio-temporal systems often exhibit highly heterogeneous and non-intuitive responses to localized disruptions, limiting the effectiveness of conventional message passing approaches in modeling higher-order interactions under local heterogeneity. This paper reformulates spatio-temporal forecasting as the problem of learning information flow over locally structured spaces, rather than propagating globally aligned node representations. We introduce a spatio-temporal sheaf diffusion graph neural network (ST-Sheaf GNN) that embeds graph topology into sheaf-theoretic vector spaces connected by learned linear restriction maps. Unlike prior work that relies on static or globally shared transformations, our model learns dynamic restriction maps that evolve over time and adapt to local spatio-temporal patterns to enable substantially more expressive interactions. By explicitly modeling latent local structure, the proposed framework efficiently mitigates the oversmoothing phenomenon in deep GNN architectures. We evaluate our framework on a diverse set of real-world spatio-temporal forecasting benchmarks spanning multiple domains. Experimental results demonstrate state-of-the-art performance, highlighting the effectiveness of sheaf-theoretic topological representations as a powerful foundation for spatio-temporal graph learning. The code is available at: https://anonymous.4open.science/r/ST-SheafGNN-6523/.

URLs: https://anonymous.4open.science/r/ST-SheafGNN-6523/

new Representation-Aligned Multi-Scale Personalization for Federated Learning

Authors: Wenfei Liang, Wee Peng Tay

Abstract: In federated learning (FL), accommodating clients with diverse resource constraints remains a significant challenge. A widely adopted approach is to use a shared full-size model, from which each client extracts a submodel aligned with its computational budget. However, regardless of the specific scoring strategy, these methods rely on the same global backbone, limiting both structural diversity and representational adaptation across clients. This paper presents FRAMP, a unified framework for personalized and resource-adaptive federated learning. Instead of relying on a fixed global model, FRAMP generates client-specific models from compact client descriptors, enabling fine-grained adaptation to both data characteristics and computational budgets. Each client trains a tailored lightweight submodel and aligns its learned representation with others to maintain global semantic consistency. Extensive experiments on vision and graph benchmarks demonstrate that FRAMP enhances generalization and adaptivity across a wide range of client settings.

new THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture

Authors: Augustus Haoyang Li

Abstract: We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% -- a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy <= 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary -- causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.

new The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Authors: Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu

Abstract: Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.

new Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables

Authors: Meiyi Zhu, Osvaldo Simeone

Abstract: Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between number of selected test inputs and FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics, and decide how aggressively to pursue candidates based on observed evidence strength and available follow-up resources. To address this limitation, we introduce {post-hoc CS} (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio between estimated FDP level and true FDP is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
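
The abstract above builds on the e-Benjamini-Hochberg (e-BH) procedure. As a minimal sketch of base e-BH at a fixed target level (this is not the post-hoc PH-CS path described in the abstract), the Python snippet below rejects the hypotheses with the k largest e-values, where k is the largest index such that the k-th largest e-value is at least n / (alpha * k). The function name `e_bh` and the example e-values are hypothetical.

```python
def e_bh(e_values, alpha):
    """Base e-BH: return the sorted indices of rejected hypotheses.

    e_values: list of non-negative e-values, one per hypothesis.
    alpha: target false discovery rate level in (0, 1).
    """
    n = len(e_values)
    # Indices ordered from the largest e-value to the smallest.
    order = sorted(range(n), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, idx in enumerate(order, start=1):
        # The k-th largest e-value must clear the threshold n / (alpha * k).
        if e_values[idx] >= n / (alpha * k):
            k_star = k
    return sorted(order[:k_star])


# Hypothetical e-values for four test inputs.
print(e_bh([50.0, 1.0, 30.0, 0.5], alpha=0.1))
```

In this fixed-level form, `alpha` must be chosen before looking at the data, which is the restriction the abstract says PH-CS is designed to relax.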

new Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

Authors: Dario Rancati, Jan Maas, Francesco Locatello

Abstract: Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the ${W}_2$ distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric $W_K$ on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing $W_K$-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

new S$^3$: Structured Sparsity Specification

Authors: Ayoub Ghriss

Abstract: We introduce the Structured Sparsity Specification (S$^3$), an algebraic framework for defining, composing, and implementing structured sparse patterns. S$^3$ specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S$^3$ enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S$^3$, which surpass well-established second-order heuristics on output reconstruction across common configurations.

new Active Bayesian Inference for Robust Control under Sensor False Data Injection Attacks

Authors: Axel Andersson, Gy\"orgy D\'an

Abstract: We present a framework for bridging the gap between sensor attack detection and recovery in cyber-physical systems. The proposed framework models modern-day, complex perception pipelines as bipartite graphs, which combined with anomaly detector alerts defines a Bayesian network for inferring compromised sensors. An active probing strategy exploits system nonlinearities to maximize distinguishability between attack hypotheses, while compromised sensors are selectively disabled to maintain reliable state estimation. We propose a threshold-based probing strategy and show its effectiveness via a simplified partially observable Markov decision process (POMDP) formulation. Experiments on an inverted pendulum under single and multi-sensor attacks show that our method significantly outperforms outlier-robust and prediction-based baselines, especially under prolonged attacks.

new Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

Authors: Ajinkya Mohgaonkar, Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan G\"unnemann

Abstract: Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model's robustness under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time calculable certificate for neural networks against label-flipping attacks. EnsembleCert is either on par with or significantly outperforms the existing partition-based black-box certificates. For example, on CIFAR-10, our method can certify up to +26.5% more label flips in median over the test set compared to the existing black-box approach while requiring 100 times fewer partitions, thus challenging the prevailing notion that heavy partitioning is a necessity for strong certified robustness.

new Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss

Authors: Filippo Quarenghi, Ryan Cotsakis, Tom Beucler

Abstract: The ``differentiability gap'' presents a primary bottleneck in Earth system deep learning: since models cannot be trained directly on non-differentiable scientific metrics and must rely on smooth proxies (e.g., MSE), they often fail to capture high-frequency details, yielding ``blurry'' outputs. We develop a framework that bridges this gap using two different methods to deal with non-differentiable functions: the first is to analytically approximate the original non-differentiable function into a differentiable equivalent one; the second is to learn differentiable surrogates for scientific functionals. We formulate the analytical approximation by relaxing discrete topological operations using temperature-controlled sigmoids and continuous logical operators. Conversely, our neural emulator uses Lipschitz-convolutional neural networks to stabilize gradient learning via: (1) spectral normalization to bound the Lipschitz constant; and (2) hard architectural constraints enforcing geometric principles. We demonstrate this framework's utility by developing the Minkowski image loss, a differentiable equivalent for the integral-geometric measures of surface precipitation fields (area, perimeter, connected components). Validated on the EUMETNET OPERA dataset, our constrained neural surrogate achieves high emulation accuracy, completely eliminating the geometric violations observed in unconstrained baselines. However, applying these differentiable surrogates to a deterministic super-resolution task reveals a fundamental trade-off: while strict Lipschitz regularization ensures optimization stability, it inherently over-smooths gradient signals, restricting the recovery of highly localized convective textures. This work highlights the necessity of coupling such topological constraints with stochastic generative architectures to achieve full morphological realism.
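
To make the idea of relaxing a discrete operation with a temperature-controlled sigmoid concrete, the Python sketch below compares a hard thresholded area (a pixel count) with a smooth surrogate. The function names, the toy field, and the threshold are assumptions for illustration; the paper's actual Minkowski image loss also covers perimeter and connected components, which are not attempted here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hard_area(field, threshold):
    """Non-differentiable area: number of pixels strictly above the threshold."""
    return sum(1 for row in field for x in row if x > threshold)


def soft_area(field, threshold, temperature):
    """Smooth surrogate: each pixel contributes sigmoid((x - threshold) / temperature)."""
    return sum(sigmoid((x - threshold) / temperature) for row in field for x in row)


# Hypothetical 2x2 precipitation field and threshold.
field = [[0.0, 2.0], [3.0, 0.5]]
print(hard_area(field, 1.0))
print(soft_area(field, 1.0, temperature=1.0))
print(soft_area(field, 1.0, temperature=0.01))
```

Lower temperatures bring the surrogate closer to the hard count but make its gradients vanish away from the threshold, so the temperature trades fidelity against usable gradient signal.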

new Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

Authors: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen

Abstract: Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. However, it requires guiding the model to perform extensive exploration and learning, and the resulting computational overhead has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on the above insights, we propose the \textbf{N}onlinear \textbf{Ext}rapolation of low-rank trajectories (\textbf{NExt}), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we use the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, achieving the acceleration of RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5\% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.

URLs: https://github.com/RUCAIBox/NExt

new Learning How Much to Think: Difficulty-Aware Dynamic MoEs for Graph Node Classification

Authors: Jiajun Zhou, Yadong Li, Xuanze Chen, Chen Ma, Chuang Zhao, Shanqing Yu, Qi Xuan

Abstract: Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification tasks but typically rely on static and rigid routing strategies that enforce a uniform expert budget or coarse-grained expert toggles on all nodes. This limitation overlooks the varying discriminative difficulty of nodes and leads to under-fitting for hard nodes and redundant computation for easy ones. To resolve this issue, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-wise expert resource allocation. By using predictive entropy as a real-time proxy for difficulty, D2MoE employs a difficulty-driven top-p routing mechanism to adaptively concentrate expert resources on hard nodes while reducing overhead for easy ones, achieving continuous and fine-grained expert budget scaling for node classification. Experiments on 13 benchmarks demonstrate that D2MoE achieves consistent state-of-the-art performance, surpassing leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs, it reduces memory consumption by up to 73.07% and training time by 46.53% compared to the best-performing Graph MoE, thereby validating its superior efficiency.
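
The difficulty-driven top-p routing described in the abstract above can be pictured with the following Python sketch. It is only a guess at the general shape of such a mechanism: the linear map from normalized entropy to the cumulative-probability budget, the bounds `p_min` and `p_max`, and all function names are assumptions, not details taken from D2MoE.

```python
import math


def normalized_entropy(probs):
    """Predictive entropy divided by log(num_classes), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def top_p_experts(gate_probs, p):
    """Smallest set of experts, taken in order of gate probability, with total mass >= p."""
    order = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += gate_probs[i]
        if mass >= p:
            break
    return chosen


def route(class_probs, gate_probs, p_min=0.3, p_max=0.9):
    """Assumed rule: harder (higher-entropy) nodes get a larger top-p budget."""
    difficulty = normalized_entropy(class_probs)
    p = p_min + (p_max - p_min) * difficulty
    return top_p_experts(gate_probs, p)


gate = [0.5, 0.3, 0.15, 0.05]            # hypothetical gate distribution over 4 experts
print(route([1.0, 0.0, 0.0], gate))      # confident prediction, few experts
print(route([1 / 3, 1 / 3, 1 / 3], gate))  # uncertain prediction, more experts
```

Because the budget is a cumulative probability rather than a fixed count, the number of active experts varies per node, which matches the abstract's contrast with uniform expert budgets.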

new Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network

Authors: Lea Karbevska, Liming Xu, Zehui Dai, Sara AlMahri, Alexandra Brintrup

Abstract: As global political tensions rise and the anticipation of additional tariffs from the United States on international trade increases, the issues of economic independence and supply chain resilience become more prominent. The importance of supply chain resilience has been further underscored by disruptions caused by the COVID-19 pandemic and the ongoing war in Ukraine. In light of these challenges, ranging from geopolitical instability to product supply uncertainties, governments are increasingly focused on adopting new trade policies. This study explores the impact of several of these policies on the global electric vehicle (EV) supply chain network, with a particular focus on their effects on country clusters and the broader structure of international trade. Specifically, we analyse three key policies: Country Plus One, Friendshoring, and Reshoring. Our findings show that Friendshoring, contrary to expectations, leads to greater globalisation by increasing the number of supply links across friendly countries, potentially raising transaction costs. The Country Plus One policy similarly enhances network density through redundant links, while the Reshoring policy creates challenges in the EV sector due to the high number of irreplaceable products. Additionally, the effects of these policies vary across industries; for instance, mining goods are less affected under the Country Plus One policy than under the Friendshoring policy.

new CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu

Abstract: Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein--ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of the diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.

new Quantization Dominates Rank Reduction for KV-Cache Compression

Authors: Samuel Salfati

Abstract: We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread <0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B.
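
The asymmetry described in the abstract above (dropping a dimension can flip which token is attended, while bounded quantization noise tends to preserve score ordering) can be seen in a toy Python example. Everything here is an illustrative assumption: the vectors are hand-picked, coordinate truncation is used as a crude stand-in for rank reduction, and the symmetric per-vector rounding scheme is a simplified form of INT4 quantization rather than the paper's setup.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def truncate(vec, rank):
    """Keep only the first `rank` coordinates (crude stand-in for rank reduction)."""
    return vec[:rank]


def quantize_int4(vec):
    """Round to integer levels in [-7, 7] with a per-vector scale, then dequantize."""
    scale = max(abs(x) for x in vec) / 7 or 1.0
    return [round(x / scale) * scale for x in vec]


def attended(query, keys):
    """Index of the key with the highest dot-product score."""
    scores = [dot(query, k) for k in keys]
    return scores.index(max(scores))


query = [1.0, 1.0, 1.0, 1.0]
keys = [[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.9, 0.5]]

full = attended(query, keys)
low_rank = attended(truncate(query, 3), [truncate(k, 3) for k in keys])
quantized = attended(query, [quantize_int4(k) for k in keys])
print(full, low_rank, quantized)
```

In this hand-built case the last coordinate carries the information that separates the two keys, so removing it changes the winner, whereas rounding every coordinate perturbs both scores only slightly. Real models need not behave this cleanly; the example only shows the mechanism.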

new Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers

Authors: Miit Daga, Swarna Priya Ramu

Abstract: Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample's loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.
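
The first finding in the abstract above is stated in terms of the Jaccard overlap between the most-forgotten samples of two architectures. A minimal Python sketch of that comparison is given below; the per-sample forgetting scores, the dictionary layout, and the function names are made up for illustration, and the paper's exact definition of "most-forgotten" may differ.

```python
def top_fraction(scores, fraction=0.10):
    """Return the sample ids with the highest scores (at least one sample)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical forgetting scores for 20 samples under two architectures.
cnn_scores = {i: float(i) for i in range(20)}
vit_scores = dict(cnn_scores)
vit_scores[18] = 0.0    # sample 18 is rarely forgotten by the second model
vit_scores[5] = 100.0   # sample 5 is forgotten most by the second model

overlap = jaccard(top_fraction(cnn_scores), top_fraction(vit_scores))
print(overlap)
```

An overlap near 1 would mean both models forget the same samples, while values such as the 0.34 and 0.15 reported in the abstract indicate largely different sets.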

new Generative Path-Finding Method for Wasserstein Gradient Flow

Authors: Chengyu Liu, Xiang Zhou

Abstract: Wasserstein gradient flows (WGFs) describe the evolution of probability distributions in Wasserstein space as steepest descent dynamics for a free energy functional. Computing the full path from an arbitrary initial distribution to equilibrium is challenging, especially in high dimensions. Eulerian methods suffer from the curse of dimensionality, while existing Lagrangian approaches based on particles or generative maps do not naturally improve efficiency through time step tuning. We propose GenWGP, a generative path-finding framework for Wasserstein gradient paths. GenWGP learns a generative flow that transports mass from an initial density to an unknown equilibrium distribution by minimizing a path loss that encodes the full trajectory and its terminal equilibrium condition. The loss is derived from a geometric action functional motivated by Dawson-Gärtner large deviation theory for empirical distributions of interacting diffusion systems. We formulate both a finite-horizon action under physical time parametrization and a reparameterization-invariant geometric action based on Wasserstein arclength. Using normalizing flows, GenWGP computes a geometric curve toward equilibrium while enforcing approximately constant intrinsic speed between adjacent network layers, so that discretized distributions remain nearly equidistant in the Wasserstein metric along the path. This avoids delicate time stepping constraints and enables stable training that is largely independent of temporal or geometric discretization. Experiments on Fokker-Planck and aggregation-type problems show that GenWGP matches or exceeds high-fidelity reference solutions with only about a dozen discretization points while capturing complex dynamics.

new Continuous Adversarial Flow Models

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

Abstract: We propose continuous adversarial flow models, a type of continuous-time flow model trained with an adversarial objective. Unlike flow matching, which uses a fixed mean-squared-error criterion, our approach introduces a learned discriminator to guide training. This change in objective induces a different generalized distribution, which empirically produces samples that are better aligned with the target data distribution. Our method is primarily proposed for post-training existing flow-matching models, although it can also train models from scratch. On the ImageNet 256px generation task, our post-training substantially improves the guidance-free FID of latent-space SiT from 8.26 to 3.63 and of pixel-space JiT from 7.17 to 3.57. It also improves guided generation, reducing FID from 2.06 to 1.53 for SiT and from 1.86 to 1.80 for JiT. We further evaluate our approach on text-to-image generation, where it achieves improved results on both the GenEval and DPG benchmarks.

new TempusBench: An Evaluation Framework for Time-Series Forecasting

Authors: Denizalp Goktas, Gerardo Ria\~no-Brice\~no, Alif Abdullah, Aryan Nair, Chenkai Shen, Beatriz de Lucio, Alexandra Magnusson, Farhan Mashrur, Ahmed Abdulla, Shawrna Sen, Mahitha Thippireddy, Gregory Schwartz, Amy Greenwald

Abstract: Foundation models have transformed natural language processing and computer vision, and a rapidly growing literature on time-series foundation models (TSFMs) seeks to replicate this success in forecasting. While recent open-source models demonstrate the promise of TSFMs, the field lacks a comprehensive and community-accepted model evaluation framework. We see at least four major issues impeding progress on the development of such a framework. First, current evaluation frameworks consist of benchmark forecasting tasks derived from often outdated datasets (e.g., M3), many of which lack clear metadata and overlap with the corpora used to pre-train TSFMs. Second, existing frameworks evaluate models along a narrowly defined set of benchmark forecasting tasks such as forecast horizon length or domain, but overlook core statistical properties such as non-stationarity and seasonality. Third, domain-specific models (e.g., XGBoost) are often compared unfairly, as existing frameworks neglect a systematic and consistent hyperparameter tuning convention for all models. Fourth, visualization tools for interpreting comparative performance are lacking. To address these issues, we introduce TempusBench, an open-source evaluation framework for TSFMs. TempusBench consists of 1) new datasets which are not included in existing TSFM pretraining corpora, 2) a set of novel benchmark tasks that go beyond existing ones, 3) a model evaluation pipeline with a standardized hyperparameter tuning protocol, and 4) a tensorboard-based visualization interface. We provide access to our code on GitHub: https://github.com/Smlcrm/TempusBench.

URLs: https://github.com/Smlcrm/TempusBench.

new Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

Authors: Haolin Li, Shuyang Jiang, Ruipeng Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Abstract: While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.

URLs: https://github.com/tdlhl/MedSSR.

new bacpipe: a Python package to make bioacoustic deep learning models accessible

Authors: Vincent S. Kather, Sylvain Haupert, Burooj Ghani, Dan Stowell

Abstract: 1. Natural sounds have been recorded for millions of hours over the previous decades using passive acoustic monitoring. Improvements in deep learning models have vastly accelerated the analysis of large portions of this data. While new models advance the state-of-the-art, accessing them using tools to harness their full potential is not always straightforward. Here we present bacpipe, a collection of bioacoustic deep learning models and evaluation pipelines accessible through a graphical and programming interface, designed for both ecologists and computer scientists. Bacpipe is a modular software package intended as a point of convergence for bioacoustic models. 2. Bacpipe streamlines the usage of state-of-the-art models on custom audio datasets, generating acoustic feature vectors (embeddings) and classifier predictions. A modular design allows evaluation and benchmarking of models through interactive visualizations, clustering and probing. 3. We believe that access to new deep learning models is important. By designing bacpipe to target a wide audience, researchers will be enabled to answer new ecological and evolutionary questions in bioacoustics. 4. In conclusion, we believe accessibility to developments in deep learning to a wider audience benefits the ecological questions we are trying to answer.

new Layerwise Dynamics for In-Context Classification in Transformers

Authors: Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama

Abstract: Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.

new SCNO: Spiking Compositional Neural Operator -- Towards a Neuromorphic Foundation Model for Nuclear PDE Solving

Authors: Samrendra Roy, Souvik Chakraborty, Rizwan-uddin, Syed Bahauddin Alam

Abstract: Neural operators have emerged as powerful surrogates for partial differential equation (PDE) solvers, yet they are typically trained as monolithic models for individual PDEs, require energy-intensive GPU hardware, and must be retrained from scratch when new physics emerge. We introduce the Spiking Compositional Neural Operator (SCNO), a modular architecture combining spiking and conventional components that addresses all three limitations. SCNO maintains a library of small spiking neural operator blocks, each trained on a single elementary differential operator (convection, diffusion, reaction), and composes them through a lightweight input-conditioned aggregator to solve coupled PDEs not seen during block training. A small correction network learns cross-coupling residuals while keeping all blocks and the aggregator frozen, preserving zero-forgetting modular expansion by construction. We evaluate SCNO on eight PDE families including five coupled systems and a nuclear-relevant 1-group neutron diffusion equation. SCNO with correction achieves the lowest relative $L^2$ error on four of five coupled PDEs, outperforming both a monolithic spiking DeepONet (by up to 62%, mean over 3 seeds) and a standard ANN DeepONet (by up to 65%), while requiring only 95K trainable parameters versus 462K for the monolithic baseline. To our knowledge, this is the first compositional spiking neural operator and the first proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion.

new Inter-Layer Hessian Analysis of Neural Networks with DAG Architectures

Authors: Maxim Bolshim (ITMO University, Saint Petersburg, Russia), Alexander Kugaevskikh (ITMO University, Saint Petersburg, Russia)

Abstract: Modern automatic differentiation frameworks (JAX, PyTorch) return the Hessian of the loss function as a monolithic tensor, without exposing the internal structure of inter-layer interactions. This paper presents an analytical formalism that explicitly decomposes the full Hessian into blocks indexed by the DAG of an arbitrary architecture. The canonical decomposition $H = H^{GN} + H^T$ separates the Gauss--Newton component (convex part) from the tensor component (residual curvature responsible for saddle points). For piecewise-linear activations (ReLU), the tensor component of the input Hessian vanishes ($H^{T}_{v,w}\!\equiv\!0$ a.e., $H^f_{v,w}\!=\!H^{GN}_{v,w}\!\succeq\!0$); the full parametric Hessian contains residual terms that do not reduce to the GGN. Building on this decomposition, we introduce diagnostic metrics (inter-layer resonance~$\mathcal{R}$, geometric coupling~$\mathcal{C}$, stable rank~$\mathcal{D}$, GN-Gap) that are estimated stochastically in $O(P)$ time and reveal structural curvature interactions between layers. The theoretical analysis explains exponential decay of resonance in vanilla networks and its preservation under skip connections; empirical validation spans fully connected MLPs (Exp.\,1--5) and convolutional architectures (ResNet-18, ${\sim}11$M~parameters, Exp.\,6). When the architecture reduces to a single node, all definitions collapse to the standard Hessian $\nabla^2_\theta\mathcal{L}(\theta)\in\mathbb{R}^{p\times p}$.
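
For orientation, the decomposition named in this abstract has a familiar single-block analogue that follows from the chain rule. The sketch below assumes a loss of the form $\mathcal{L}(\theta) = \ell(f(\theta))$ with network output $f$ and Jacobian $J = \partial f / \partial \theta$; identifying the two terms with the paper's $H^{GN}$ and $H^T$ is an assumption, since the abstract does not spell out the block-wise definitions.

```latex
\nabla^2_\theta \mathcal{L}
  = \underbrace{J^\top \big(\nabla^2_f \ell\big)\, J}_{H^{GN}}
  + \underbrace{\sum_k \frac{\partial \ell}{\partial f_k}\, \nabla^2_\theta f_k}_{H^{T}}
```

The first term is positive semidefinite whenever $\ell$ is convex in $f$. With respect to the input $x$ of a piecewise-linear (ReLU) network, $\nabla^2_x f_k = 0$ almost everywhere, so the second term drops out, which is consistent with the vanishing input-Hessian tensor component stated above.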

new Towards Autonomous Mechanistic Reasoning in Virtual Cells

Authors: Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of a multi-agent design and rigorous verification.

new Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning

Authors: Nicolas Rodriguez-Alvarez (Instituto de Educacion Secundaria Parquesol, Valladolid, Spain), Fernando Rodriguez-Merino (University of Valladolid, Valladolid, Spain)

Abstract: Deep Neural Networks are highly susceptible to shortcut learning, frequently memorizing low-dimensional spurious correlations instead of underlying causal mechanisms. This phenomenon not only degrades out-of-distribution robustness but also induces severe demographic biases in sensitive applications. In this paper, we propose a geometric \textit{a priori} methodology to mitigate shortcut learning. By deploying a zero-hidden-layer ($N=1$) Topological Auditor, we mathematically isolate features that monopolize the gradient without human intervention. We empirically demonstrate a Capacity Phase Transition: once linear shortcuts are pruned, networks are forced to utilize higher geometric capacity ($N \geq 16$) to curve the decision boundary and learn ethical representations. Our approach outperforms L1 Regularization -- which collapses into demographic bias -- and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT), successfully reducing counterfactual gender vulnerability from 21.18\% to 7.66\%.

new KL Divergence Between Gaussians: A Step-by-Step Derivation for the Variational Autoencoder Objective

Authors: Andr\'es Mu\~noz, Rodrigo Ramele

Abstract: Kullback-Leibler (KL) divergence is a fundamental concept in information theory that quantifies the discrepancy between two probability distributions. In the context of Variational Autoencoders (VAEs), it serves as a central regularization term, imposing structure on the latent space and thereby enabling the model to exhibit generative capabilities. In this work, we present a detailed derivation of the closed-form expression for the KL divergence between Gaussian distributions, a case of particular importance in practical VAE implementations. Starting from the general definition for continuous random variables, we derive the expression for the univariate case and extend it to the multivariate setting under the assumption of diagonal covariance. Finally, we discuss the interpretation of each term in the resulting expression and its impact on the training dynamics of the model.
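
The closed form this abstract refers to can be checked numerically with a short sketch. The function names and the list-based interface are illustrative assumptions, not taken from the paper; the formulas are the standard expressions for the KL divergence between diagonal-covariance Gaussians and for the special case of a standard normal prior, parametrized by log-variance as is common in VAE implementations.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ).

    With diagonal covariances the divergence is a sum of univariate terms:
    0.5 * ( log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1 ).
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE regularizer.

    Per dimension: 0.5 * ( exp(log_var) + mu^2 - 1 - log_var ).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
```

The second function is the first one specialized to `mu_p = 0` and `var_p = 1`. The squared-mean term pulls the encoder mean toward the origin, while the variance terms penalize posteriors that are much narrower or wider than the prior.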

new Autonomous Diffractometry Enabled by Visual Reinforcement Learning

Authors: J. Oppliger, M. Stifter, A. R\"uegg, I. Bia{\l}o, L. Martinelli, P. G. Freeman, D. Prabhakaran, J. Zhao, Q. Wang, J. Chang

Abstract: Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks requiring interpretation of abstract visual information remains challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.

new ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Authors: Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Abstract: GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present \textbf{ClawGUI}, an open-source framework addressing these three gaps within a single harness. \textbf{ClawGUI-RL} provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. \textbf{ClawGUI-Eval} enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8\% reproduction against official baselines. \textbf{ClawGUI-Agent} brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, \textbf{ClawGUI-2B} achieves 17.1\% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0\%.

new A Mechanistic Analysis of Looped Reasoning Language Models

Authors: Hugh Blayney, \'Alvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong

Abstract: Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.

new Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

Abstract: We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.

URLs: https://sim2reason.github.io/.

new Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems

Authors: Mohammed Ezzaldin Babiker Abdullah

Abstract: The stable operation of off-grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics-Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge-deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics-Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi-year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra-lightweight benchmark for real-time off-grid control.

cross SHANG++: Robust Stochastic Acceleration under Multiplicative Noise

Authors: Yaxin Yu, Long Chen, Minfu Feng

Abstract: Under the multiplicative noise scaling (MNS) condition, original Nesterov acceleration is provably sensitive to noise and may diverge when gradient noise overwhelms the signal. In this paper, we develop two accelerated stochastic gradient descent methods by discretizing the Hessian-driven Nesterov accelerated gradient flow. We first derive SHANG, a direct Gauss-Seidel-type discretization that already improves stability under MNS. We then introduce SHANG++, which adds a damping correction and achieves faster convergence with stronger noise robustness. We establish convergence guarantees for both convex and strongly convex objectives under MNS, together with explicit parameter choices. In our experiments, SHANG++ performs consistently well across convex problems and applications in deep learning. In a dedicated noise experiment on ResNet-34, a single hyperparameter configuration attains accuracy within 1% of the noise-free setting. Across all experiments, SHANG++ outperforms existing accelerated methods in robustness and efficiency, with minimal parameter sensitivity.

cross LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Authors: Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques

Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities: beyond rote knowledge and even reasoning alone, to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.

URLs: https://huggingface.co/datasets/futurehouse/labbench2, https://github.com/EdisonScientific/labbench2.

cross VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

Authors: Muyan Hu, Ahan Gupta, Jiachen Yuan, Vima Gupta, Taeksang Kim, Xin Xu, Janardhan Kulkarni, Ofer Dekel, Vikram Adve, Charith Mendis

Abstract: With the widening gap between compute and memory operation latencies, data movement optimizations have become increasingly important for DNN compilation. Current optimizations such as layout transformations and operator fusion only target a subset of tensor operators and consequently miss important opportunities for reducing data movement in contemporary DNN workloads, including large language models. We introduce VTC, a novel tensor compilation framework that for the first time eliminates all unnecessary data movement by targeting the full spectrum of data movement operators. VTC proposes the concept of virtual tensors to track data movement between compute operators via index mappings rather than expensive physical data transfers to and from global memory, which can seamlessly interoperate with existing computation kernels and handle arbitrary tensor operator compositions. We also introduce a novel data movement elimination algorithm to automatically identify a profitable virtual tensor creation strategy. Evaluation on a variety of DNNs shows that VTC can outperform existing ML compilers by up to 1.93x (1.28x on average) on NVIDIA GPUs with up to 60% (17.5% on average) inference memory savings.

cross Seven simple steps for log analysis in AI systems

Authors: Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Satos, Sayash Kapoor, Sunischal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, Cozmin Ududec

Abstract: AI systems produce large volumes of logs as they interact with tools and users. Analysing these logs can help understand model capabilities, propensities, and behaviours, or assess whether an evaluation worked as intended. Researchers have started developing methods for log analysis, but a standardised approach is still missing. Here we suggest a pipeline based on current best practices. We illustrate it with concrete code examples in the Inspect Scout library, provide detailed guidance on each step, and highlight common pitfalls. Our framework provides researchers with a foundation for rigorous and reproducible log analysis.

cross Improving understanding and trust in AI: How users benefit from interval-based counterfactual explanations

Authors: Tabea E. R\"ober, Paul Festor, Rob Goedhart, S. \.Ilker Birbil, Aldo Faisal

Abstract: Experimental user studies evaluating the effectiveness of different subtypes of post-hoc explanations for black-box models are largely nonexistent. Therefore, the aim of this study was to investigate and evaluate how different types of counterfactual explanations, namely single point explanations and interval-based explanations, affect both model understanding and (demonstrated) trust. We conducted an online user study using a within-subjects experimental design, where the experimental arms were (i) no explanation (control), (ii) feature importance scores, (iii) point counterfactual explanations, and (iv) interval counterfactual explanations. Our results clearly show the superiority of interval explanations over other tested explanation types in increasing both model understanding and demonstrated trust in the AI. We could not support findings of some previous studies showing an effect of point counterfactual explanations compared to the control group. Our results further highlight the role of individual differences in, for example, cognitive style or personality, in explanation effectiveness.

cross Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Authors: Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Abstract: The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the ``Turing Test on Screen,'' formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and our analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.

cross Generative UI: LLMs are Effective UI Generators

Authors: Yaniv Leviathan, Dani Valevski, Matan Kalman, Danny Lumen, Eyal Segalis, Eyal Molad, Shlomi Pasternak, Vishnu Natchu, Valerie Nygaard, Srinivasan (Cheenu) Venkatachary, James Manyika, Yossi Matias

Abstract: AI models excel at creating content, but typically render it with static, predefined interfaces. Specifically, the output of LLMs is often a markdown "wall of text". Generative UI is a long standing promise, where the model generates not just the content, but the interface itself. Until now, Generative UI was not possible in a robust fashion. We demonstrate that when properly prompted and equipped with the right set of tools, a modern LLM can robustly produce high quality custom UIs for virtually any prompt. When ignoring generation speed, results generated by our implementation are overwhelmingly preferred by humans over the standard LLM markdown output. In fact, while the results generated by our implementation are worse than those crafted by human experts, they are at least comparable in 50% of cases. We show that this ability for robust Generative UI is emergent, with substantial improvements from previous models. We also create and release PAGEN, a novel dataset of expert-crafted results to aid in evaluating Generative UI implementations, as well as the results of our system for future comparisons. Interactive examples can be seen at https://generativeui.github.io

URLs: https://generativeui.github.io

cross OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

Authors: Hongyu Chen, Liang Lin, Guangrun Wang

Abstract: Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.

cross MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

Authors: Yunfei Feng, Xi Zhao, Cheng Zhang, Dahu Feng, Daolin Cheng, Jianqi Yu, Yubin Xia, Erhu Feng

Abstract: Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.

cross Persistent Identity in AI Agents: A Multi-Anchor Architecture for Resilient Memory and Continuity

Authors: Prahlad G. Menon

Abstract: Modern AI agents suffer from a fundamental identity problem: when context windows overflow and conversation histories are summarized, agents experience catastrophic forgetting -- losing not just information, but continuity of self. This technical limitation reflects a deeper architectural flaw: AI agent identity is centralized in a single memory store, creating a single point of failure. Drawing on neurological case studies of human memory disorders, we observe that human identity survives damage because it is distributed across multiple systems: episodic memory, procedural memory, emotional continuity, and embodied knowledge. We present soul.py, an open-source architecture that implements persistent identity through separable components (identity files and memory logs), and propose extensions toward multi-anchor resilience. The framework introduces a hybrid RAG+RLM retrieval system that automatically routes queries to appropriate memory access patterns, achieving efficient retrieval without sacrificing comprehensiveness. We formalize the notion of identity anchors for AI systems and present a roadmap for building agents whose identity can survive partial memory failures. Code is available at github.com/menonpg/soul.py

cross Spatial Competence Benchmark

Authors: Jash Vira, Ashley Harris

Abstract: Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.

cross ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

Authors: Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan

Abstract: Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales -- particularly the industrial-grade Qwen3-235B -- demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.

cross LLMs for Text-Based Exploration and Navigation Under Partial Observability

Authors: Stephan Sandfuchs, Maximilian Melchert, J\"org Frochte

Abstract: Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as \emph{text-only} controllers under partial observability -- without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local $5\times5$ window around the agent and the model must select one of \texttt{UP/RIGHT/DOWN/LEFT}. Nine contemporary LLMs, spanning open and proprietary, dense and Mixture-of-Experts, and instruction- and reasoning-tuned models, are evaluated on two tasks across three layouts of increasing difficulty: \emph{Exploration} (maximising revealed cells) and \emph{Navigation} (reaching the goal on the shortest path). The results are assessed with quantitative metrics, including \emph{success rate} and \emph{efficiency} measures such as normalised coverage and \emph{path length} relative to the oracle, as well as qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help these reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial map systems.
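
The observation and action interface described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `#` wall symbol, the padding of out-of-map cells, and the rule that blocked moves leave the agent in place are all hypothetical and are not taken from the benchmark itself.

```python
# Hypothetical sketch of a 5x5 partial observation in an ASCII gridworld.
# The wall symbol, padding rule, and no-op on blocked moves are assumptions.

MOVES = {"UP": (-1, 0), "RIGHT": (0, 1), "DOWN": (1, 0), "LEFT": (0, -1)}


def local_window(grid, row, col, radius=2, fill="#"):
    """Return the (2*radius+1)-square view centred on (row, col).

    Cells that fall outside the map are padded with `fill`.
    """
    view = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
            line.append(grid[r][c] if inside else fill)
        view.append("".join(line))
    return view


def step(grid, row, col, action, wall="#"):
    """Apply one of the four actions; moves into walls or off the map are no-ops."""
    dr, dc = MOVES[action]
    r, c = row + dr, col + dc
    off_map = not (0 <= r < len(grid) and 0 <= c < len(grid[r]))
    if off_map or grid[r][c] == wall:
        return row, col
    return r, c
```

A controller loop would render `local_window(...)` as text, ask the model for one of the four action names, and call `step(...)` with the reply.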

cross Self-Calibrating Language Models via Test-Time Discriminative Distillation

Authors: Mohamed Rissal Hedna, Jan Strich, Martin Semmann, Chris Biemann

Abstract: Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of "True" when the model is asked "Is this answer correct?" ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6--26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56--78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890
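
For readers unfamiliar with the metric quoted above, Expected Calibration Error can be computed as sketched below. This is the common equal-width-bin formulation, not code from the SECL repository; the function name and the default of 10 bins are assumptions. The `confidences` input could be either verbalized confidences or $P(\text{True})$ values.

```python
# Minimal sketch of equal-width-bin ECE (a standard formulation; the bin
# count and function name are assumptions, not taken from the paper).

def expected_calibration_error(confidences, correct, n_bins=10):
    """Sum over bins of (fraction of samples in bin) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence but is right only half the time contributes a gap of 0.4 in that bin, which is the kind of overconfidence the abstract describes.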

URLs: https://anonymous.4open.science/r/secl-emnlp26-submission-C890

cross Investigating Vaccine Buyer's Remorse: Post-Vaccination Decision Regret in COVID-19 Social Media Using Politically Diverse Human Annotation

Authors: Miles Stanley, Soumyajit Datta, Ashutosh Kumar, Ashiqur R. KhudaBukhsh

Abstract: A significant gap exists in datasets regarding post-COVID-19 vaccination experiences, particularly ``vaccine buyer's remorse''. Understanding the prevalence and nature of vaccine regret, whether based on personal or vicarious experiences, is vital for addressing vaccine hesitancy and refining public health communication. In this paper, we curate a novel dataset from a large YouTube news corpus capturing COVID-19 vaccination experiences, and construct a benchmark subset focused on vaccine regret, annotated by a politically diverse panel to account for the subjective and often politicized nature of the topic. We utilize large language models (LLMs) to identify posts expressing vaccine regret, analyze the reasons behind this regret, and quantify its occurrence in both first- and second-person accounts. This paper aims to (1) quantify the prevalence of vaccine regret; (2) identify common reasons for this sentiment; (3) analyze differences between first-person and vicarious experiences; and (4) assess potential biases introduced by different LLMs. We find that while vaccine buyer's remorse appears in fewer than $2\%$ of public discourse, it is disproportionately concentrated in vaccine-skeptic influencer communities and is predominantly expressed through first-person narratives citing adverse health events.

cross ML-Based Real-Time Downlink Performance Prediction in Standalone 5G NR Using Smartphones

Authors: Md Mahfuzur Rahman, Jareen Shuva, Nishith Tripathi, Jeffrey H. Reed, Lingjia Liu

Abstract: We propose a machine learning (ML)-based framework for downlink performance prediction in 5G networks using real-time measurements from commercial off-the-shelf (COTS) user equipment (UE). Our experimental platform integrates the srsRAN 5G New Radio (NR) stack deployed on a Dell desktop serving as the 5G next generation nodeB (gNB), operating at 3.4 GHz. Two Google Pixel 7a smartphones are used to collect physical layer characteristics such as channel quality indicator (CQI), modulation and coding scheme (MCS), bit rate, transmission time interval (TTI), and block error rate (BLER), which are leveraged as predictors in model training. We use commercial-grade traffic generation tools, including Ookla, for stationary and mobility measurements under line-of-sight (LOS) and non-line-of-sight (nLOS) conditions. Test data includes global Ookla servers (e.g., USA, Portugal, Ghana, Egypt, Japan), iperf TCP/UDP data, and video streaming sessions from YouTube. To analyze inter-user interference, we also include scenarios with multiple UEs at the same location. We evaluate the predictive performance of five supervised regression models: linear regression, decision tree regression, random forest regression, extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Our results demonstrate that throughput and BLER can be accurately predicted using COTS hardware and standard ML techniques in diverse real-world 5G scenarios.

cross Leveraging Machine Learning Techniques to Investigate Media and Information Literacy Competence in Tackling Disinformation

Authors: Jos\'e Manuel Alcalde-Llergo, Mariana Buenestado Fern\'andez, Carlos Enrique George-Reyes, Andrea Zingoni, Enrique Yeguas-Bol\'ivar

Abstract: This study develops machine learning models to assess Media and Information Literacy (MIL) skills specifically in the context of disinformation among students, particularly future educators and communicators. While the digital revolution has expanded access to information, it has also amplified the spread of false and misleading content, making MIL essential for fostering critical thinking and responsible media engagement. Despite its relevance, predictive modeling of MIL in relation to disinformation remains underexplored. To address this gap, a quantitative study was conducted with 723 students in education and communication programs using a validated survey. Classification and regression algorithms were applied to predict MIL competencies and identify key influencing factors. Results show that complex models outperform simpler approaches, with variables such as academic year and prior training significantly improving prediction accuracy. These findings can inform the design of targeted educational interventions and personalized strategies to enhance students' ability to critically navigate and respond to disinformation in digital environments.

cross Dynamic Forecasting and Temporal Feature Evolution of Stock Repurchases in Listed Companies Using Attention-Based Deep Temporal Networks

Authors: Xiang Ao, Jingxuan Zhang, Xinyu Zhao

Abstract: Accurately predicting stock repurchases is crucial for quantitative investment and risk management, yet traditional static models fail to capture the complex temporal dependencies of corporate financial conditions. This paper proposes a dynamic early warning system integrating economic theory with deep temporal networks. Using Chinese A-share panel data (2014-2024), we employ a hybrid Temporal Convolutional Network (TCN) and Attention-based LSTM to capture long- and short-term financial evolutionary patterns. Rolling-window cross-validation demonstrates our model significantly outperforms static baselines like Logistic Regression and XGBoost. Furthermore, utilizing Explainable AI (XAI), we reveal the temporal dynamics of repurchase decisions: prolonged "undervaluation" serves as the long-term underlying motive, while a sharp increase in "cash flow" acts as the decisive short-term trigger. This study provides a robust deep learning paradigm for financial forecasting and offers dynamic empirical support for classic corporate finance hypotheses.

cross FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

Authors: Xinyuan An, Tao Luo, Gengyun Peng, Yaobing Wang, Kui Ren, Dongxia Wang

Abstract: Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $\pi_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism -- the vector field dynamics -- presents a critical yet unexplored security risk, particularly with respect to backdoor attacks. Existing backdoor attacks designed for autoregressive, discretization-based VLAs cannot be directly applied to these new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $\tau$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.

cross NeuroPath: Practically Adopting Motor Imagery Decoding through EEG Signals

Authors: Jiani Cao, Kun Wang, Yang Liu, Zhenjiang Li

Abstract: Motor Imagery (MI) is an emerging Brain-Computer Interface (BCI) paradigm where a person imagines body movements without physical action. By decoding scalp-recorded electroencephalography (EEG) signals, BCIs establish direct communication to control external devices, offering significant potential in prosthetics, rehabilitation, and human-computer interaction. However, existing solutions remain difficult to deploy. (i) Most employ independent, opaque models for each MI task, lacking a unified architectural foundation. Consequently, these models are trained in isolation, failing to learn robust representations from diverse datasets, resulting in modest performance. (ii) They primarily adopt fixed sensor deployment, whereas real-world setups vary in electrode number and placement, causing models to fail across configurations. (iii) Performance degrades sharply under low-SNR conditions typical of consumer-grade EEG. To address these challenges, we present NeuroPath, a neural architecture for robust MI decoding. NeuroPath takes inspiration from the brain's signal pathway from cortex to scalp, utilizing a deep neural architecture with specialized modules for signal filtering, spatial representation learning, and feature classification, enabling unified decoding. To handle varying electrode configurations, we introduce a spatially aware graph adapter accommodating different electrode numbers and placements. To enhance robustness under low-SNR conditions, NeuroPath incorporates multimodal auxiliary training to refine EEG representations and stabilize performance on noisy real-world data. Evaluations on three consumer-grade and three medical-grade public datasets demonstrate that NeuroPath achieves superior performance.

cross Digital hybridity and relics in cultural heritage: using corpus linguistics to inform design in emerging technologies from AI to VR

Authors: Emma McClaughlin, Glenn McGarry, Alan Chamberlain, Geert De Wilde, Oliver Butler

Abstract: Hybrid technologies enable the blending of physical and digital elements, creating new ways to experience and interact with the world. Such technologies can transform engagement with relics, both secular and sacred, but they present challenges for capturing faith, belief, and representation responsibly. Given the complexities of digital representation and the ethical challenges inherent in digitising culturally significant objects, a transdisciplinary understanding of these issues is needed. To inform this discussion from a linguistic perspective, we examined the representation of relics in historical and contemporary texts. Using a corpus linguistic approach to extract modifiers of the word relic in corpora of Early Modern English books and contemporary web-sourced texts from 2021, we examined the multifaceted ways in which relics have been perceived and evaluated over time. Early texts consider relics as both objects of moral and spiritual significance, and tools of religious and political control, while they are more often framed as heritage symbols, reflecting past events, places, and traditions in contemporary texts. We discuss how hybrid, sometimes AI-based, technologies can enhance accessibility and engagement, whilst also challenging traditional sensitivities around authenticity and sensory experience, which are integral to the meaning and significance of relics.

cross Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

Authors: Kumar Saurav

Abstract: Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations, feature importance was concentrated in three temporal variables. Adding transcription keywords or beep-based features did not improve the best real-time configuration and increased latency substantially. Our results suggest that temporal speech patterns are a strong signal for distinguishing voicemail greetings from live human answers.
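
To make the idea of temporal speech-activity features concrete, the sketch below summarises a per-frame speech/non-speech sequence of the kind a VAD might emit. The five features, their names, and the 30 ms frame length are illustrative assumptions; the abstract does not list the paper's 15 features, and this is not the author's implementation.

```python
# Hypothetical temporal features over a binary VAD output (True = speech).
# Feature names and the 30 ms frame length are assumptions for illustration.

def temporal_features(speech_flags, frame_ms=30):
    """Summarise a per-frame speech/non-speech sequence as a feature dict."""
    n = len(speech_flags)
    runs = []  # (is_speech, length_in_frames) for each maximal run
    for flag in speech_flags:
        flag = bool(flag)
        if runs and runs[-1][0] == flag:
            runs[-1] = (flag, runs[-1][1] + 1)
        else:
            runs.append((flag, 1))
    speech_runs = [length for is_speech, length in runs if is_speech]
    # Pauses are silent runs strictly between two speech runs.
    pauses = [length for i, (is_speech, length) in enumerate(runs)
              if not is_speech and 0 < i < len(runs) - 1]
    # If no speech occurs at all, this equals the full clip length.
    leading_silence = runs[0][1] if runs and not runs[0][0] else 0
    return {
        "speech_ratio": sum(speech_runs) / n if n else 0.0,
        "time_to_first_speech_ms": leading_silence * frame_ms,
        "longest_speech_run_ms": max(speech_runs, default=0) * frame_ms,
        "num_pauses": len(pauses),
        "longest_pause_ms": max(pauses, default=0) * frame_ms,
    }
```

A feature dict like this could then be fed to any tree-based classifier; a long uninterrupted speech run with few pauses is the sort of pattern one might expect from a recorded greeting, though that intuition is not a claim from the paper.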

cross Isomorphic Functionalities between Ant Colony and Ensemble Learning: Part III -- Gradient Descent, Neural Plasticity, and the Emergence of Deep Intelligence

Authors: Ernest Fokou\'e, Gregory Babbitt, Yuval Levental

Abstract: In Parts I and II of this series, we established isomorphisms between ant colony decision-making and two major families of ensemble learning: random forests (parallel, variance reduction) and boosting (sequential, bias reduction). Here we complete the trilogy by demonstrating that the fundamental learning algorithm underlying deep neural networks -- stochastic gradient descent -- is mathematically isomorphic to the generational learning dynamics of ant colonies. We prove that pheromone evolution across generations follows the same update equations as weight evolution during gradient descent, with evaporation rates corresponding to learning rates, colony fitness corresponding to negative loss, and recruitment waves corresponding to backpropagation passes. We further show that neural plasticity mechanisms -- long-term potentiation, long-term depression, synaptic pruning, and neurogenesis -- have direct analogs in colony-level adaptation: trail reinforcement, evaporation, abandonment, and new trail formation. Comprehensive simulations confirm that ant colonies trained on environmental tasks exhibit learning curves indistinguishable from neural networks trained on analogous problems. This final isomorphism reveals that all three major paradigms of machine learning -- parallel ensembles, sequential ensembles, and gradient-based deep learning -- have direct analogs in the collective intelligence of social insects, suggesting a unified theory of learning that transcends substrate. The ant colony, we conclude, is not merely analogous to learning algorithms; it is a living embodiment of the fundamental principles of learning itself.

cross A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video

Authors: Amey Thakur, Sarvesh Talele

Abstract: We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.
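
The first two modules described above reduce to simple signal operations, sketched here in plain Python. This is a simplified illustration rather than the authors' notebook: the z-score threshold of 2.0, the function names, and the use of nested lists in place of image arrays are assumptions, and the optical-flow magnitude map is taken as a given input rather than computed.

```python
# Hedged sketch: temporal localisation by z-scored peak picking, and spatial
# localisation by a magnitude-weighted centroid. Threshold and names are assumed.
import math


def zscore(signal):
    """Standardise a 1-D signal; a constant signal maps to all zeros."""
    if not signal:
        return []
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / len(signal))
    if std == 0:
        return [0.0] * len(signal)
    return [(x - mean) / std for x in signal]


def strongest_peak(signal, threshold=2.0):
    """Index of the highest local maximum whose z-score exceeds threshold, else None."""
    z = zscore(signal)
    best = None
    for i in range(1, len(z) - 1):
        is_local_max = z[i] >= z[i - 1] and z[i] >= z[i + 1]
        if is_local_max and z[i] > threshold and (best is None or z[i] > z[best]):
            best = i
    return best


def weighted_centroid(magnitude):
    """(x, y) centroid of a 2-D grid of non-negative weights, or None if all zero."""
    total = sum(sum(row) for row in magnitude)
    if total == 0:
        return None
    x = sum(col * w for row in magnitude for col, w in enumerate(row)) / total
    y = sum(r * w for r, row in enumerate(magnitude) for w in row) / total
    return x, y
```

Here `signal` stands for a per-frame difference score and `magnitude` for the accumulated flow magnitude map; the peak index gives the estimated collision frame and the centroid the estimated impact location.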

cross Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count

Authors: Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates

Abstract: Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that ``crowded scenes are harder,'' we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation.

cross Sharpness-Aware Surrogate Training for On-Sensor Spiking Neural Networks

Authors: Maximilian Nicholson

Abstract: Spiking neural networks (SNNs) are a natural computational model for on-sensor and near-sensor vision, where event-driven processors must operate under strict power budgets with hard binary spikes. However, models trained with surrogate gradients often degrade sharply when the smooth surrogate nonlinearity is replaced by a hard threshold at deployment: a surrogate-to-hard transfer gap that directly limits on-sensor accuracy. We study Sharpness-Aware Surrogate Training (SAST), which applies Sharpness-Aware Minimization (SAM) to a surrogate-forward SNN so that the training objective is smooth and the gradient is exact, and position it as one gap-reduction strategy under the tested settings rather than the only viable mechanism. Under explicit contraction assumptions we provide state-stability, input-Lipschitz, and smoothness bounds, together with a corresponding nonconvex convergence result. On two event-camera benchmarks, swap-only hard-spike accuracy improves from 65.7\% to 94.7\% on N-MNIST and from 31.8\% to 63.3\% on DVS Gesture. Under a hardware-aware inference simulation (INT8/INT4 weight quantization, fixed-point membrane potentials, discrete leak factors), SAST remains strong: on N-MNIST, hard-spike accuracy improves from 47.6\% to 96.9\% (INT8) and from 43.2\% to 81.0\% (INT4), while on DVS Gesture it improves from 25.3\% to 47.6\% (INT8) and from 26.0\% to 43.8\% (INT4). SynOps also decrease under the same hardware-aware setting, including 1734k$\rightarrow$1315k (N-MNIST, INT8) and 86221k$\rightarrow$4323k (DVS Gesture, INT8). These results suggest that SAST is a promising component in a broader toolbox for on-sensor spiking inference under the tested settings.
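
The Sharpness-Aware Minimization step that SAST builds on can be sketched for a generic differentiable loss as follows. This shows only the standard two-gradient SAM update, not the paper's surrogate-forward SNN training; `grad_fn`, the learning rate, and the radius `rho` are placeholders chosen for illustration.

```python
# Minimal sketch of one SAM update on a generic loss (not the SNN pipeline).
# grad_fn, lr, and rho are illustrative assumptions.
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One SAM step: ascend to a nearby worst-case point, then descend from w."""
    g = grad_fn(weights)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0:
        return list(weights)  # zero gradient: no ascent direction, nothing to update
    # Perturb the weights by rho along the normalised gradient direction.
    perturbed = [w + rho * x / norm for w, x in zip(weights, g)]
    # Use the gradient at the perturbed point to update the original weights.
    g_sharp = grad_fn(perturbed)
    return [w - lr * x for w, x in zip(weights, g_sharp)]
```

For the toy loss $\sum_i w_i^2$ one would pass `grad_fn=lambda w: [2 * x for x in w]`; in a real setting the gradient would come from an autodiff framework.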

cross PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation

Authors: Melanie Neubauer, Elmar Rueckert, Christian Rauch

Abstract: Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA), a weakly supervised pipeline for object segmentation and classification that uses only image-level supervision. By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% reduction in training time compared to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domains.

cross Robust Fair Disease Diagnosis in CT Images

Authors: Justin Li, Daniel Ding, Asmita Yuki Pritha, Aryana Hou, Xin Wang, Shu Hu

Abstract: Automated diagnosis from chest CT has improved considerably with deep learning, but models trained on skewed datasets tend to perform unevenly across patient demographics. However, the situation is worse than simple demographic bias. In clinical data, class imbalance and group underrepresentation often coincide, creating compound failure modes that neither standard rebalancing nor fairness corrections can fix alone. We introduce a two-level objective that targets both axes of this problem. Logit-adjusted cross-entropy loss operates at the sample level, shifting decision margins by class frequency with provable consistency guarantees. Conditional Value at Risk aggregation operates at the group level, directing optimization pressure toward whichever demographic group currently has the higher loss. We evaluate on the Fair Disease Diagnosis benchmark using a 3D ResNet-18 pretrained on Kinetics-400, classifying CT volumes into Adenocarcinoma, Squamous Cell Carcinoma, COVID-19, and Normal groups with patient sex annotations. The training set illustrates the compound problem concretely: squamous cell carcinoma has 84 samples total, 5 of them female. The combined loss reaches a gender-averaged macro F1 of 0.8403 with a fairness gap of 0.0239, a 13.3% improvement in score and 78% reduction in demographic disparity over the baseline. Ablations show that each component alone falls short. The code is publicly available at https://github.com/Purdue-M2/Fair-Disease-Diagnosis.
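
The two levels of the objective described above can be illustrated with a small pure-Python sketch. It is not the released code: the function names, the temperature `tau`, the level `alpha`, and the discrete "mean of the worst groups" form of CVaR are assumptions made for illustration.

```python
# Hedged sketch: sample-level logit-adjusted cross-entropy plus group-level
# CVaR aggregation. Names, tau, alpha, and the discrete CVaR form are assumed.
import math


def logit_adjusted_ce(logits, label, class_priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior) for each class."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, class_priors)]
    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


def cvar(group_losses, alpha=0.5):
    """Mean of the worst ceil(alpha * n) group losses (a simple discrete CVaR)."""
    k = max(1, math.ceil(alpha * len(group_losses)))
    worst = sorted(group_losses, reverse=True)[:k]
    return sum(worst) / k
```

With equal logits, a rare class incurs a larger loss than a frequent one, which pushes its decision margin outward; with two demographic groups and `alpha=0.5`, `cvar` returns the loss of whichever group is currently worse off.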

URLs: https://github.com/Purdue-M2/Fair-Disease-Diagnosis

cross Spectral Kernel Dynamics via Maximum Caliber: Fixed Points, Geodesics, and Phase Transitions

Authors: Jnaneshwar Das

Abstract: We derive a closed-form geometric functional for kernel dynamics on finite graphs by applying the Maximum Caliber (MaxCal) variational principle to the spectral transfer function h(lambda) of the graph Laplacian eigenbasis. The main result is that the MaxCal stationarity condition decouples into N one-dimensional problems with explicit solution: h*(lambda_l) = h_0(lambda_l) exp(-1 - T_l[h*]), yielding self-consistent (fixed-point) kernels via exponential tilting (Corollary 1), log-linear Fisher-Rao geodesics (Corollary 2), a diagonal Hessian stability criterion (Corollary 3), and an l^2_+ isometry for the spectral kernel space (Proposition 3). The spectral entropy H[h_t] provides a computable O(N) early-warning signal for network-structural phase transitions (Remark 7). All claims are numerically verified on the path graph P_8 with a Gaussian mutual-information source, using the open-source kernelcal library. The framework is grounded in a structural analogy with Einstein's field equations, used as a guiding template rather than an established equivalence; explicit limits are stated in Section 6.

cross Sustainable Transformer Neural Network Acceleration with Stochastic Photonic Computing

Authors: S. Afifi, O. Alo, I. Thakkar, S. Pasricha

Abstract: Transformers achieve state-of-the-art performance in natural language processing, vision, and scientific computing, but demand high computation and memory. To address these challenges, we present ASTRA, the first silicon-photonic accelerator leveraging stochastic computing for transformers. ASTRA employs novel optical stochastic multipliers and unary/analog homodyne accumulation in a crosstalk-minimal organization to efficiently process dynamic tensor computations. Evaluations show at least 7.6x speedup and 1.3x lower energy overheads compared to state-of-the-art accelerators, highlighting ASTRA's potential for efficient, scalable, and sustainable transformer inference.

cross Differentiable free energy surface: a variational approach to directly observing rare events using generative deep-learning models

Authors: Shuo-Hui Li, Chen Chen, Yao-Wen Zhang, Ding Pan

Abstract: Rare events are central to the evolution of complex many-body systems, characterized as key transitional configurations on the free energy surface (FES). Conventional methods require adequate sampling of rare event transitions to obtain the FES, which is computationally very demanding. Here we introduce the variational free energy surface (VaFES), a dataset-free framework that directly models FESs using tractable-density generative models. Rare events can then be immediately identified from the FES with their configurations generated directly via one-shot sampling of generative models. By extending a coarse-grained collective variable (CV) into its reversible equivalent, VaFES constructs a latent space of intermediate representation in which the CVs explicitly occupy a subset of dimensions. This latent-space construction preserves the physical interpretability and transparent controllability of the CVs by design, while accommodating arbitrary CV formulations. The reversibility makes the system energy exactly accessible, enabling variational optimization of the FES without pre-generated simulation data. A single optimization yields a continuous, differentiable FES together with one-shot generation of rare-event configurations. Our method can reproduce the exact analytical solution for the bistable dimer potential and identify a chignolin native folded state in close alignment with the experimental NMR structure. Our approach thus establishes a scalable, systematic framework for advancing the study of complex statistical systems.

cross Discrete Flow Maps

Authors: Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, Michael S. Albergo

Abstract: The sequential nature of autoregressive next-token prediction imposes a fundamental speed limit on large language models. While continuous flow models offer a path to parallel generation, they traditionally demand expensive iterative integration. Flow Maps bypass this bottleneck by compressing generative trajectories into single-step mappings, theoretically enabling the generation of full text sequences from noise in a single forward pass. However, standard formulations rely on Euclidean regression losses that are geometrically ill-suited for discrete data. In this work, we resolve this conflict with Discrete Flow Maps, a framework that reconciles trajectory compression with the geometry of the probability simplex. We recast standard flow map training for the discrete domain, aligning the training dynamics with the discrete nature of language. Empirically, this strict geometric alignment allows our method to surpass previous state-of-the-art results in discrete flow modeling.

cross Learning What's Real: Disentangling Signal and Measurement Artifacts in Multi-Sensor Data, with Applications to Astrophysics

Authors: Pablo Mercader-Perez, Carolina Cuesta-Lazaro, Daniel Muthukrishna, Jeroen Audenaert, V. Ashley Villar, David W. Hogg, Marc Huertas-Company, William T. Freeman

Abstract: Data collected from the physical world is always a combination of multiple sources: an underlying signal from the physical process of interest and a signal from measurement-dependent artifacts from the sensor or instrument. This secondary signal acts as a confounding factor, limiting our ability to extract information about the physics underlying the phenomena we observe. Furthermore, it complicates the combination of observations in heterogeneous or multi-instrument settings. We propose a deep learning framework that leverages overlapping observations, a dual-encoder architecture, and a counterfactual generation objective to disentangle these factors of variation. The resulting representations explicitly separate intrinsic signals from sensor-specific distortions and noise, and can be used for counterfactual view generation, parameter inference unconfounded by measurement distortions, and instrument-independent similarity search. We demonstrate the effectiveness of our approach on astrophysical galaxy images from the DESI Legacy Imaging Survey (Legacy) and the Hyper Suprime-Cam (HSC) Survey as a representative multi-instrument setting. This framework provides a general recipe for scientific and multi-modal self-supervised pretraining: construct training pairs from overlapping observations of the same physical system, treat sensor- or modality-specific effects as augmentations, and learn invariant representations through counterfactual generation.

cross Pioneer Agent: Continual Improvement of Small Language Models in Production

Authors: Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis, George Hurn-Maloney, Ash Lewis, Urchade Zaratiana

Abstract: Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.

cross COMPOSITE-Stem

Authors: Kyle Waters, Lucas Nuzzi, Tadhg Looram, Alessandro Tomasiello, Ariel Ghislain Kemogne Kamdoum, Bikun Li, Damien Sileo, Egor Kretov, Francesco Fournier-Facio, Georgios Soloupis, Haile Kassahun, Hew Wolff, Jiaqi Cai, Lianghui Li, Marc Roth, Mohinder Naiya, Naixu Guo, Qicheng Tang, Richard Wheeler, Samuele Sala, Serguei Popov, Steven Dillman, Yuqi Li

Abstract: AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.

cross Steered LLM Activations are Non-Surjective

Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

cross Improving DNS Exfiltration Detection via Transformer Pretraining

Authors: Milo\v{s} Tomi\'c, Aleksa Cvetanovi\'c, Predrag Tadi\'c

Abstract: We study whether in-domain pretraining of a Bidirectional Encoder Representations from Transformers (BERT) model improves subdomain-level detection of exfiltration at low false positive rates. While previous work mostly examines fine-tuned generic Transformers, it does not aim to isolate the effect of pretraining on the downstream task of classification. To address this gap, we develop a controlled pipeline where we freeze operating points on validation and transfer them to the test set, thus enabling clean ablations across different label and pretraining budgets. Our results show significant improvements in the left tail of the Receiver Operating Characteristic (ROC) curve, especially against a randomly initialized baseline. Additionally, within pretrained model variants, increasing the number of pretraining steps helps the most when more labeled data are available for fine-tuning.
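
The idea of freezing an operating point on validation and transferring it to the test set can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the strict `score > threshold` decision rule are assumptions.

```python
def threshold_at_fpr(val_negative_scores, target_fpr):
    """Pick a threshold on validation negatives so that at most
    target_fpr of them score strictly above it."""
    ranked = sorted(val_negative_scores, reverse=True)
    k = int(target_fpr * len(ranked))  # number of false positives allowed
    # The (k+1)-th highest negative score: at most k negatives exceed it.
    return ranked[k] if k < len(ranked) else float("-inf")

def rates_at_threshold(pos_scores, neg_scores, threshold):
    """True and false positive rates when flagging score > threshold."""
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr

# Freeze the threshold on validation, then reuse it unchanged on test data.
val_neg = list(range(10))                      # toy benign scores 0..9
frozen = threshold_at_fpr(val_neg, target_fpr=0.1)
test_tpr, test_fpr = rates_at_threshold([9.5, 7, 10], [1, 2, 8.5, 3], frozen)
```

Because the threshold is never re-tuned on test data, differences in test-set detection rate between model variants reflect the models rather than the threshold choice.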

cross MEMENTO: Teaching LLMs to Manage Their Own Context

Authors: Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos

Abstract: Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B--32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ${\sim}2.5\times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving ${\sim}1.75\times$ throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15\,pp on AIME24.

cross Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Authors: Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Fengqing Zhu

Abstract: Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object's class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.

URLs: https://gitlab.com/viper-purdue/stereo-typical-estimator.

cross CableTract: A Co-Designed Cable-Driven Field Robot for Low-Compaction, Off-Grid Capable Agriculture

Authors: Ozgur Yilmaz

Abstract: Conventional field operations spend most of their energy moving the tractor body, not the implement. Yet feasibility studies for novel agricultural vehicles rarely tie mechanics, energy harvest, draft, field geometry, economics, life-cycle CO2, and uncertainty quantification together on a single reproducible code path. This paper builds such a framework and applies it to CableTract, a two-module cable-driven field robot. A stationary Main Unit (MU), comprising a winch, motor, battery, and harvester module, and a lighter Anchor module (held by helical screw piles) tension a cable across a strip while a lightweight implement carriage rolls along it. The heavy bodies stay on the headland; only the carriage enters the field. The carriage runs a 10-implement library co-designed for the cable architecture. This co-design is the paper's central analytical lever. The framework is prototype-free. It chains a catenary cable model, a drivetrain efficiency chain, a stochastic draft model fitted to the co-designed library, an hourly solar + wind + battery simulator on six sites, a polygon coverage planner on a 50-field corpus, a contact-pressure compaction model, a discounted cash-flow economics engine with battery replacement and life-cycle CO2, and a global sensitivity analysis on 20 inputs. An operating-envelope sweep and an architectural-variant comparison close the loop. The full implementation is open source. Applied to the co-designed reference, the framework yields energy and compaction advantages and potential off-grid operation.

cross New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework

Authors: Shaocong Ma, Peiran Yu, Heng Huang

Abstract: Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel hybrid fine-tuning approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on the concept of a hybrid smoothness condition, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We derive a rigorous convergence analysis of a reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.

cross A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics

Authors: Louie Hong Yao, Yuhao Li, Shengchao Liu

Abstract: Self-supervised representation learning is central to modern machine learning because it extracts structured latent features from unlabeled data and enables robust transfer across tasks and domains. However, it can suffer from representation collapse, a widely observed failure mode in which embeddings lose discriminative structure and distinct inputs become indistinguishable. To understand the mechanisms that drive collapse and the ingredients that prevent it, we introduce a minimal embedding-only model whose gradient-flow dynamics and fixed points can be analyzed in closed form, using a classification-representation setting as a concrete playground where collapse is directly quantified through the contraction of label-embedding geometry. We illustrate that the model does not collapse when the data are perfectly classifiable, while a small fraction of frustrated samples that cannot be classified consistently induces collapse through an additional slow time scale that follows the early performance gain. Within the same framework, we examine collapse prevention by adding a shared projection head and applying stop-gradient at the level of the training dynamics. We analyze the resulting fixed points and develop a dynamical mean-field style self-consistency description, showing that stop-gradient enables non-collapsed solutions and stabilizes finite class separation under frustration. We further verify empirically that the same qualitative dynamics and collapse-prevention effects appear in a linear teacher-student model, indicating that the minimal theory captures features that persist beyond the pure embedding setting.

cross Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

Authors: Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee

Abstract: Reproducibility must validate architectural robustness, not just numerical accuracy. We evaluate ColBERT-v2 and ConstBERT across five dimensions, finding that while ConstBERT reproduces within 0.05% MRR@10 on MS-MARCO, both models show a drop of 86-97% on long, narrative queries (TREC ToT 2025). Ablations prove this failure is architectural: performance plateaus at 20 words because the MaxSim operator's uniform token weighting cannot distinguish signal from filler noise. Furthermore, undocumented backend parameters create an 8-point gap due to ConstBERT's sparse centroid coverage, and fine-tuning with 3x more data actually degrades performance by up to 29%. We conclude that architectural constraints in multi-vector retrieval cannot be overcome by adaptation alone. Code: https://github.com/utshabkg/multi-vector-reproducibility.

URLs: https://github.com/utshabkg/multi-vector-reproducibility.
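
The MaxSim operator named in this abstract is the standard late-interaction score used by ColBERT-style models: each query token takes its best-matching document token, and the per-token maxima are summed with equal weight. A minimal sketch with toy vectors follows; the function name and inputs are illustrative, and real systems operate on normalized embedding matrices.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the maximum dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        best = max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        total += best  # every query token contributes with the same weight
    return total

# Two query tokens against two document tokens (toy 2-D embeddings).
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
```

Since the sum is unweighted, filler tokens in a long query contribute on the same footing as informative ones, which is the behaviour the abstract identifies as the source of the failure on long, narrative queries.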

cross Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach

Authors: Federico Formica, Andrea Rota, Aurora Francesca Zanenga, Andrea Bombarda, Mark Lawford, Lionel C. Briand, Claudio Menghi

Abstract: Deep Neural Networks (DNNs) are widely used by engineers to solve difficult problems that require predictive modeling from data. However, these models are often massive, with millions or billions of parameters, and require substantial computational power, RAM, and storage. This becomes a limitation in practical scenarios where strict size and resource constraints must be respected. In this paper, we present a novel concept-based pruning technique for DNNs that guides pruning decisions using human-interpretable concepts, such as features, colors, and classes. This is particularly important in a software engineering context, as DNNs are integrated into systems and must be pruned according to specific system requirements. Our concept-based pruning solution analyzes neuron activations to identify important neurons from a system requirements viewpoint and uses this information to guide the DNN pruning. We assess our solution using the VGG-19 network and a dataset of 26'384 RGB images, focusing on its ability to produce small, effective pruned DNNs and on the computational complexity and performance of these pruned DNNs. We also analyzed the pruning efficiency of our solution and compared alternative configurations. Our results show that concept-based pruning efficiently generates much smaller, effective pruned DNNs. Pruning greatly improves the computational efficiency and performance of DNNs, properties that are particularly useful for practical applications with stringent memory and computational time constraints. Finally, alternative configuration options enable engineers to identify trade-offs adapted to different practical situations.

cross Predicting Associations between Solar Flares and Coronal Mass Ejections Using SDO/HMI Magnetograms and a Hybrid Neural Network

Authors: Jialiang Li, Vasyl Yurchyshyn, Jason T. L. Wang, Haimin Wang, Manolis K. Georgoulis, Wen He, Yasser Abduallah, Hameedullah A. Farooki, Yan Xu

Abstract: Solar eruptions, including flares and coronal mass ejections (CMEs), have a significant impact on Earth. Some flares are associated with CMEs, and some are not. The association between flares and CMEs is not always obvious. In this study, we propose a new deep learning method, specifically a hybrid neural network (HNN) that combines a vision transformer with long short-term memory, to predict associations between flares and CMEs. HNN finds spatio-temporal patterns in the time series of line-of-sight magnetograms of solar active regions (ARs) collected by the Helioseismic and Magnetic Imager on board the Solar Dynamics Observatory and uses the patterns to predict whether a flare projected to occur within the next 24 hours will be eruptive (i.e., CME-associated) or confined (i.e., not CME-associated). Our experimental results demonstrate the good performance of the HNN method. Furthermore, the results show that magnetic flux cancellation in polarity inversion line regions may well play a role in triggering flare-associated CMEs, a finding consistent with the literature.

cross Masked Contrastive Pre-Training Improves Music Audio Key Detection

Authors: Ori Yonay, Tracy Hammond, Tianbao Yang

Abstract: Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate that masked contrastive embeddings uniquely enable state-of-the-art (SOTA) performance in key detection in the supervised setting. First, we discover that linear evaluation after masking-based contrastive pretraining on Mel spectrograms leads to competitive performance on music key detection out of the box. This leads us to train shallow but wide multi-layer perceptrons (MLPs) on features extracted from our base model, leading to SOTA performance without the need for sophisticated data augmentation policies. We further analyze robustness and show empirically that the learned representations naturally encode common augmentations. Our study establishes self-supervised pretraining as an effective approach for pitch-sensitive MIR tasks and provides insights for designing and probing music foundation models.

cross LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

Authors: Alkesh Patel, Melis Ozyildirim, Ying-Chang Cheng, Ganesh Nagarajan

Abstract: Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.

cross Daily Predictions of F10.7 and F30 Solar Indices with Deep Learning

Authors: Zhenduo Wang, Yasser Abduallah, Jason T. L. Wang, Haimin Wang, Yan Xu, Vasyl Yurchyshyn, Vincent Oria, Khalid A. Alobaid, Xiaoli Bai

Abstract: The F10.7 and F30 solar indices are the solar radio fluxes measured at wavelengths of 10.7 cm and 30 cm, respectively, which are key indicators of solar activity. F10.7 is valuable for explaining the impact of solar ultraviolet (UV) radiation on the upper atmosphere of Earth, while F30 is more sensitive and could improve the reaction of thermospheric density to solar stimulation. In this study, we present a new deep learning model, named the Solar Index Network, or SINet for short, to predict daily values of the F10.7 and F30 solar indices. The SINet model is designed to make medium-term predictions of the index values (1-60 days in advance). The observed data used for SINet training were taken from the National Oceanic and Atmospheric Administration (NOAA) as well as Toyokawa and Nobeyama facilities. Our experimental results show that SINet performs better than five closely related statistical and deep learning methods for the prediction of F10.7. Furthermore, to our knowledge, this is the first time deep learning has been used to predict the F30 solar index.

cross Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating

Authors: Saniah Kayenat Chowdhury, Muhammad E. H. Chowdhury

Abstract: Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream's contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621+/-0.0161 with a macro-averaged F1 of 0.9530+/-0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an estimator.
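
Softmax-gated fusion of two feature streams can be sketched as below. The abstract does not describe the gating network, so the single linear gate (`gate_weights`, one row per stream over the concatenated features) is a hypothetical stand-in, and both streams are assumed to have the same dimension.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(motion_feat, scene_feat, gate_weights):
    """Weight two equally sized feature vectors by a softmax gate.

    gate_weights: two rows (one per stream), each as long as the
    concatenation of both feature vectors (hypothetical linear gate).
    """
    joint = motion_feat + scene_feat  # list concatenation: the joint context
    gate_logits = [sum(w * x for w, x in zip(row, joint)) for row in gate_weights]
    g_motion, g_scene = softmax(gate_logits)
    fused = [g_motion * m + g_scene * s for m, s in zip(motion_feat, scene_feat)]
    return fused, (g_motion, g_scene)

# With an all-zero gate both streams receive weight 0.5.
fused, gates = gated_fusion([1.0, 3.0], [3.0, 1.0], [[0.0] * 4, [0.0] * 4])
```

The gate weights sum to one, so the fused vector is a convex combination of the two streams that can shift toward whichever stream the joint context favours.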

cross Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT

Authors: Vishal V. Batchu, Michelangelo Conserva, Alex Wilson, Anna M. Michalak, Varun Gulshan, Philip G. Brodrick, Andrew K. Thorpe, Christopher V. Arsdale

Abstract: Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model's ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes as identified by human analysts. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model's ability to identify previously uncaptured sources. By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.

cross Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations

Authors: Bernard Muller, Antonio Armando Ortiz Barra\~n\'on, LaVonne Roberts

Abstract: Dysarthric speech severity assessment typically requires trained clinicians or supervised models built from labelled pathological speech, limiting scalability across languages and clinical settings. We present a training-free method that quantifies dysarthria severity by measuring degradation in phonological feature subspaces within frozen HuBERT representations. No supervised severity model is trained; feature directions are estimated from healthy control speech using a pretrained forced aligner. For each speaker, we extract phone-level embeddings via Montreal Forced Aligner, compute d-prime scores along phonological contrast directions (nasality, voicing, stridency, sonorance, manner, and four vowel features) derived exclusively from healthy controls, and construct a 12-dimensional phonological profile. Evaluating 890 speakers across 10 corpora, 5 languages (English, Spanish, Dutch, Mandarin, French), and 3 primary aetiologies (Parkinson's disease, cerebral palsy, ALS), we find that all five consonant d-prime features correlate significantly with clinical severity (random-effects meta-analysis rho = -0.50 to -0.56, p < 2e-4; pooled Spearman rho = -0.47 to -0.55 with bootstrap 95% CIs not crossing zero). The effect replicates within individual corpora, survives FDR correction, and remains robust to leave-one-corpus-out removal and alignment quality controls. Nasality d-prime decreases monotonically from control to severe in 6 of 7 severity-graded corpora. Mann-Whitney U tests confirm that all 12 features distinguish controls from severely dysarthric speakers (p < 0.001). The method requires no dysarthric training data and applies to any language with an existing MFA acoustic model (currently 29 languages). We release the full pipeline and phone feature configurations for six languages.
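
A d-prime score along a contrast direction can be sketched as follows. The abstract does not give the exact formula, so this uses the common equal-weight pooled-variance definition; the function names, the toy embeddings, and the direction vector are illustrative assumptions rather than the released pipeline.

```python
import math
import statistics

def project(embeddings, direction):
    """Scalar projection of each embedding onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [sum(e * d for e, d in zip(emb, direction)) / norm for emb in embeddings]

def d_prime(class_a, class_b, direction):
    """Separation of two phone classes along a contrast direction:
    difference of projected means over the pooled standard deviation."""
    a = project(class_a, direction)
    b = project(class_b, direction)
    pooled_sd = math.sqrt(0.5 * (statistics.variance(a) + statistics.variance(b)))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Toy 2-D phone embeddings: the classes differ only along the first axis.
nasal = [[4.0, 9.0], [5.0, 9.0], [6.0, 9.0]]
oral = [[1.0, 9.0], [2.0, 9.0], [3.0, 9.0]]
score = d_prime(nasal, oral, direction=[2.0, 0.0])
```

A lower d-prime for a speaker means the two phone classes overlap more along that direction, which is the kind of degradation the abstract correlates with clinical severity.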

cross SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Authors: Jehyeon Bang, Eunyeong Cho, Ranggi Hwang, Jinha Chung, Minsoo Rhu

Abstract: The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.

cross Accelerated Dopant Screening in Oxide Semiconductors via Multi-Fidelity Contextual Bandits and a Three-Tier DFT Validation Funnel

Authors: Abhinaba Basu

Abstract: Band gap engineering of oxide semiconductors through doping is critical for photocatalysis and optoelectronics, yet the combinatorial space of dopant elements, substitution sites, and co-doping combinations far exceeds typical density functional theory (DFT) budgets. We screen doped candidates across five oxide hosts (ZnO, TiO2, SrTiO3, SnO2, MgO), culminating in a 529-candidate ZnO co-doping campaign, and identify Cu-containing co-doped ZnO systems as consistently achieving visible-light-range band gaps (1.0-1.8 eV), with Y2Cu2 co-doped ZnO as the optimal candidate (1.84 eV). A three-tier validation funnel (PBE, PBE+U, ionic relaxation) reveals that no single level of theory suffices: V-doped ZnO shifts from near-metallic to wide-gap upon Hubbard U correction, while Cu-doped SrTiO3 enters the visible-light window only after correcting for d-electron localization. To make this screening tractable, we introduce a multi-fidelity screening strategy that replaces 81% of DFT evaluations with computationally inexpensive surrogate predictions, reducing a 529-candidate closed-loop Quantum ESPRESSO campaign from an estimated 440 to 62 CPU-hours while finding the global optimum in 100% of 50 independent trials (p = 5.0e-8 versus random screening, Wilcoxon signed-rank). Cross-host analysis of the dopant-host interaction matrix reveals that dopant performance is governed by just two latent chemical dimensions, enabling prediction of rankings in unseen hosts. All 583 DFT calculations, screening code, and stability proofs are released as an open benchmark.

cross MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

Authors: Wenchang Duan

Abstract: Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real-time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision-making capabilities, particularly in dynamic multi-agent scenarios. This paper introduces MAVEN-T, a teacher-student framework that achieves state-of-the-art trajectory prediction through complementary architectural co-design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi-granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision-making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.

cross Continuous PT-Symmetry Breaking as a Design Variable for Giant Altermagnetic Spin Splitting

Authors: Kichan Chun, Gunn Kim

Abstract: Magnetic point-group analysis classifies altermagnets but returns only a binary symmetry verdict, leaving spin-splitting energy (SSE) inaccessible without spin-polarized density functional theory (DFT). This binary ceiling is not fundamental. Sublattice symmetry breaking is promoted here to a continuous, DFT-free scalar -- the Motif Symmetry-Breaking Index (MSBI) -- that quantifies $\mathcal{PT}$-symmetry breaking between antiparallel magnetic motifs directly from crystal coordinates. SHAP analysis of an XGBoost surrogate trained on 3,851 DFT-labeled binary structures identifies three dominant descriptors: MSBI (symmetry-breaking axis), motif packing fraction MPF (superexchange axis), and the $p/d$ electron ratio (covalency axis), each mapping onto a directly tunable experimental handle. A controlled VO--CrSb comparison within the same P$6_3$/mmc host lattice demonstrates that composition alone boosts SSE sevenfold. Bayesian optimization over this three-axis space, followed by independent DFT validation, recovers $\alpha$-NiS (SSE $= 0.823$\,eV) as cross-validation against an independent symmetry-based prediction and identifies three previously unrecognized high-SSE candidates -- square-planar FeS (1.297\,eV), octahedral CoS (1.103\,eV), and FeAs (1.089\,eV) -- all matching or exceeding CrSb. Square-planar Fe--S is proposed as a transferable coordination motif for giant altermagnetic spin splitting, advancing altermagnet design from symmetry classification to continuous quantitative optimization.

cross "bot lane noob" Towards Deployment of NLP-based Toxicity Detectors in Video Games

Authors: Jonas Ave, Irdin Pekaric, Matthias Frohner, Giovanni Apruzzese

Abstract: Toxicity and harassment are widespread in the video-gaming context. Especially in competitive online multiplayer scenarios, gamers oftentimes send harmful messages to other players (teammates or opponents) whose consequences span from mild annoyance to withdrawal and depression. Abundant prior work tackled these problems, e.g., pointing out the negative effects of toxic interactions. However, few works proposed countermeasures specifically developed and tested on textual messages sent during a match -- i.e., when the "harassment" actually occurs. We posit that such a scarcity stems from the lack of high-quality datasets that can be used to devise "automated" detectors based on natural-language processing (NLP) and machine learning (ML), and which can -- ideally -- mitigate the harm of toxic comments during a gaming session. This work provides a foundation for addressing the problem of toxicity and harassment in video games. First, through a systematic literature review (n=1,039), we provide evidence that only a few works proposed ML/NLP-based detectors of toxicity/harassment during live matches. Then, we partner up with 8 expert League of Legends (LoL) players and create a fine-grained labelled dataset, L2DTnH, containing 1.4k toxic and 13.8k non-toxic messages exchanged during LoL matches. We use L2DTnH to develop a detector that we then empirically show outperforms general-purpose and state-of-the-art toxicity detectors reliant on NLP. To further demonstrate the practicality of our resources, we test our detector on game-related data beyond that included in L2DTnH; and we develop a Web-browser extension that flags toxic content in Webpages -- without querying third-party servers owned by AI companies. We publicly release all of our resources. Our contributions pave the way for more applied research devoted to fighting the spread of toxicity and harassment in video games.

cross A Modularized Framework for Piecewise-Stationary Restless Bandits

Authors: Kuan-Ta Li, Chia-Chun Lin, Ping-Chun Hsieh, Yu-Chih Huang

Abstract: We study the piecewise-stationary restless multi-armed bandit (PS-RMAB) problem, where each arm evolves as a Markov chain but \emph{mean rewards may change across unknown segments}. To address the resulting exploration--detection delay trade-off, we propose a modular framework that integrates arbitrary RMAB base algorithms with change detection and a novel diminishing exploration mechanism. This design enables flexible plug-and-play use of existing solvers and detectors, while efficiently adapting to mean changes without prior knowledge of their number. To evaluate performance, we introduce a refined regret notion that measures the \emph{excess regret due to exploration and detection}, benchmarked against an oracle that restarts the base algorithm at the true change points. Under this metric, we prove a regret bound of $\tilde{O}(\sqrt{LMKT})$, where $L$ denotes the maximum mixing time of the Markov chains across all arms and segments, $M$ the number of segments, $K$ the number of arms, and $T$ the horizon. Simulations confirm that our framework achieves regret close to that of the segment oracle and consistently outperforms base solvers that do not incorporate any mechanism to handle environmental changes.

cross Byzantine-Robust Distributed SGD: A Unified Analysis and Tight Error Bounds

Authors: Boyuan Ruan, Xiaoyu Wang, Ya-Feng Liu

Abstract: Byzantine-robust distributed optimization relies on robust aggregation rules to mitigate the influence of malicious Byzantine workers. Despite the proliferation of such rules, a unified convergence analysis framework that accommodates general data heterogeneity is lacking. In this work, we provide a thorough convergence theory of Byzantine-robust distributed stochastic gradient descent (SGD), analyzing variants both with and without local momentum. We establish the convergence rates for nonconvex smooth objectives and those satisfying the Polyak-Lojasiewicz condition under a general data heterogeneity assumption. Our analysis reveals that while stochasticity and data heterogeneity introduce unavoidable error floors, local momentum provably reduces the error component induced by stochasticity. Furthermore, we derive matching lower bounds to demonstrate that the upper bounds obtained in our analysis are tight and characterize the fundamental limits of Byzantine resilience under stochasticity and data heterogeneity. Empirical results support our theoretical findings.

cross Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

Authors: Tiancheng Hu, Jin Qin, Zheng Wang, Junhao Hu, Yuzheng Wang, Lei Chen, Yizhou Shan, Mingxing Zhang, Ting Cao, Chunwei Xia, Huimin Cui, Tao Xie, Chenxi Wang

Abstract: Disaggregation maps parts of an AI workload to different types of GPUs, offering a path to utilize modern heterogeneous GPU clusters. However, existing solutions operate at a coarse granularity and are tightly coupled to specific model architectures, leaving much room for performance improvement. This paper presents Tessera, the first kernel disaggregation system to improve performance and cost efficiency on heterogeneous GPUs for large model inference. Our key insight is that kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities. Tessera integrates offline analysis with online adaptation by extracting precise inter-kernel dependencies from PTX to ensure correctness, overlapping communication with computation through a pipelined execution model, and employing workload-aware scheduling with lightweight runtime adaptation. Extensive evaluations across five heterogeneous GPUs and four model architectures, scaling up to 16 GPUs, show that Tessera improves serving throughput and cost efficiency by up to 2.3x and 1.6x, respectively, compared to existing disaggregation methods, while generalizing to model architectures where prior approaches do not apply. Surprisingly, a heterogeneous GPU pair under Tessera can even exceed the throughput of two homogeneous high-end GPUs at a lower cost.

cross RF-LEGO: Modularized Signal Processing-Deep Learning Co-Design for RF Sensing via Deep Unrolling

Authors: Luca Jiang-Tao Yu, Chenshu Wu

Abstract: Wireless sensing, traditionally relying on signal processing (SP) techniques, has recently shifted toward data-driven deep learning (DL) to achieve performance breakthroughs. However, existing deep wireless sensing models are typically end-to-end and task-specific, lacking reusability and interpretability. We propose RF-LEGO, a modular co-design framework that transforms interpretable SP algorithms into trainable, physics-grounded DL modules through deep unrolling. By replacing hand-tuned parameters with learnable ones while preserving core processing structures and mathematical operators, RF-LEGO ensures modularity, cascadability, and structure-aligned interpretability. Specifically, we introduce three deep-unrolled modules for critical RF sensing tasks: frequency transform, spatial angle estimation, and signal detection. Extensive experiments using real-world data for Wi-Fi, millimeter-wave, UWB, and 6G sensing demonstrate that RF-LEGO significantly outperforms existing SP and DL baselines, both standalone and when integrated into multiple downstream tasks. RF-LEGO pioneers a novel SP-DL co-design paradigm for wireless sensing via deep unrolling, shedding light on efficient and interpretable deep wireless sensing solutions. Our code is available at https://github.com/aiot-lab/RF-LEGO.

URLs: https://github.com/aiot-lab/RF-LEGO

cross FatigueFusion: Latent Space Fusion for Fatigue-Driven Motion Synthesis

Authors: Iliana Loi, Konstantinos Moustakas

Abstract: Investigating the impact of fatigue on human physiological function and motor behavior is crucial for developing biomechanics and medical applications aimed at mitigating fatigue, reducing injury risk, and creating sophisticated ergonomic designs, as well as for producing physically-plausible 3D animation sequences. While the former has a prominent position in state-of-the-art literature, fatigue-driven motion generation is still an underexplored area. In this study, we present FatigueFusion, a deep-learning architecture for the fusion of fatigue features within a latent representation space, enabling the creation of a variety of novel fatigued movements, intermediate fatigued states, and progressively fatigued motions. Unlike existing approaches that focus on imitating the effects of fatigue accumulation in motion patterns, our framework incorporates algorithmic and data-driven modules to impose subject-specific temporal and spatial fatigue features on non-fatigued motions, while leveraging PINN-based techniques to simulate fatigue intensity. Since all motion modulation tasks take place in latent space, FatigueFusion offers an end-to-end architecture that operates directly on non-fatigued joint angle sequences and control parameters, allowing seamless integration into any motion synthesis pipeline, without relying on fatigue input data. Overall, our framework can be employed for various fatigue-driven synthesis tasks, such as fatigue profile transfer and fusion, while it also provides a solution for accurate rendering of the human fatigue state in both animation and simulation pipelines.

cross The Amazing Agent Race: Strong Tool Users, Weak Navigators

Authors: Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang

Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race

URLs: https://minnesotanlp.github.io/the-amazing-agent-race

cross Descriptor-Injected Cross-Modal Learning: A Systematic Exploration of Audio-MIDI Alignment via Spectral and Melodic Features

Authors: Mariano Fernández Méndez

Abstract: Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse cross-attention, where descriptor tokens query encoder features, reducing attention operations relative to the standard formulation while remaining competitive. CKA analysis shows that descriptors substantially increase audio-MIDI transformer layer alignment, indicating representational convergence rather than simple feature concatenation. Perturbation analysis identifies high-frequency octave bands as the dominant discriminative signal. All experiments use MAESTRO v3.0.0 with an evaluation protocol controlling for composer and piece similarity.

cross Anatomy-Informed Deep Learning for Abdominal Aortic Aneurysm Segmentation

Authors: Osamah Sufyan, Martin Brückmann, Ralph Wickenhöfer, Babette Dellen, Uwe Jaekel

Abstract: In CT angiography, the accurate segmentation of abdominal aortic aneurysms (AAAs) is difficult due to large anatomical variability, low-contrast vessel boundaries, and the close proximity of organs whose intensities resemble vascular structures, often leading to false positives. To address these challenges, we propose an anatomy-aware segmentation framework that integrates organ exclusion masks derived from TotalSegmentator into the training process. These masks encode explicit anatomical priors by identifying non-vascular organs and penalizing aneurysm predictions within these regions, thereby guiding the U-Net to focus on the aorta and its pathological dilation while suppressing anatomically implausible predictions. Despite being trained on a relatively small dataset, the anatomy-aware model achieves high accuracy, substantially reduces false positives, and improves boundary consistency compared to a standard U-Net baseline. The results demonstrate that incorporating anatomical knowledge through exclusion masks provides an efficient mechanism to enhance robustness and generalization, enabling reliable AAA segmentation even with limited training data.

cross Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation

Authors: Mohamed Ehab, Ali Hamdi

Abstract: Large language models (LLMs) demonstrate strong performance in math reasoning benchmarks, but their performance varies inconsistently across problems with varying levels of difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that focuses on problem complexity by reasoning with dynamically adapted strategies. An agile routing system that focuses on problem text predicts problems' difficulty and uncertainty and guides a reconfigurable sampling mechanism to manage the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while only using the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This showcases that models using difficulty-based routing and uncertainty-driven aggregation are efficient and effective in improving math reasoning models' robustness.

cross Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD

Authors: Konstantinos Emmanouilidis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Rene Vidal

Abstract: From adversarial robustness to multi-agent learning, many machine learning tasks can be cast as finite-sum min-max optimization or, more generally, as variational inequality problems (VIPs). Owing to their simplicity and scalability, stochastic gradient methods with constant step size are widely used, despite the fact that they converge only up to a constant term. Among the many heuristics adopted in practice, two classical techniques have recently attracted attention to mitigate this issue: \emph{Random Reshuffling} of data and \emph{Richardson--Romberg extrapolation} across iterates. Random Reshuffling sharpens the mean-squared error (MSE) of the estimated solution, while Richardson-Romberg extrapolation acts orthogonally, providing a second-order reduction in its bias. In this work, we show that their composition is strictly better than both, not only maintaining the enhanced MSE guarantees but also yielding an even greater cubic refinement in the bias. To the best of our knowledge, our work provides the first theoretical guarantees for such a synergy in structured non-monotone VIPs. Our analysis proceeds in two steps: (i) we smooth the discrete noise induced by reshuffling and leverage tools from continuous-state Markov chain theory to establish a novel law of large numbers and a central limit theorem for its iterates; and (ii) we employ spectral tensor techniques to prove that extrapolation debiases and sharpens the asymptotic behavior even under the biased gradient oracle induced by reshuffling. Finally, extensive experiments validate our theory, consistently demonstrating substantial speedups in practice.
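
For readers unfamiliar with Richardson-Romberg extrapolation, the sketch below shows only the classical two-step-size version on a toy bias model: if an estimate obtained with step size gamma behaves like x* + a*gamma + b*gamma^2, then 2*x(gamma) - x(2*gamma) cancels the first-order term. The bias model and its coefficients are hypothetical; the sketch does not model Random Reshuffling or the paper's refined bias order.

```python
def rr_extrapolate(est_gamma, est_2gamma):
    """Classical Richardson-Romberg combination of estimates at step sizes gamma and 2*gamma."""
    return 2.0 * est_gamma - est_2gamma


# Hypothetical bias expansion of a constant step-size method around the solution X_STAR.
X_STAR, A, B = 1.0, 0.5, 0.3


def biased_estimate(gamma):
    # Stand-in for the long-run average of the iterates at step size gamma.
    return X_STAR + A * gamma + B * gamma ** 2


gamma = 0.01
plain_error = abs(biased_estimate(gamma) - X_STAR)
rr_error = abs(
    rr_extrapolate(biased_estimate(gamma), biased_estimate(2 * gamma)) - X_STAR
)
```

Here the plain error is dominated by the term linear in gamma, while the extrapolated error is left with only the quadratic term, 2*B*gamma^2.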

cross Sense Less, Infer More: Agentic Multimodal Transformers for Edge Medical Intelligence

Authors: Chengwei Zhou, Zhaoyan Jia, Haotian Yu, Xuming Chen, Brandon Lee, Christopher Pulliam, Steve Majerus, Massoud Pedram, Gourav Datta

Abstract: Edge-based multimodal medical monitoring requires models that balance diagnostic accuracy with severe energy constraints. Continuous acquisition of ECG, PPG, EMG, and IMU streams rapidly drains wearable batteries, often limiting operation to under 10 hours, while existing systems overlook the high temporal redundancy present in physiological signals. We introduce Adaptive Multimodal Intelligence (AMI), an end-to-end framework that jointly learns when to sense and how to infer. AMI integrates three components: (1) a lightweight Agentic Modality Controller that uses differentiable Gumbel-Sigmoid gating to dynamically select active sensors based on model confidence and task relevance; (2) a Learned Sigma-Delta Sensing module that applies patch-wise Delta-Sigma operations with learnable thresholds to skip temporally redundant samples; and (3) a Foundation-backed Multimodal Prediction Model built on unimodal foundation encoders and a cross-modal transformer with temporal context, enabling robust fusion even under gated or missing inputs. These components are trained jointly via a multi-objective loss combining classification accuracy, sparsity regularization, cross-modal alignment, and predictive coding. AMI is hardware-aware, supporting dynamic computation graphs and masked operations, leading to real energy and latency savings. Across MHEALTH, HMC Sleep, and WESAD datasets, it reduces sensor usage by 48.8% while improving state-of-the-art accuracy by 1.9% on average.

cross Orthogonal machine learning for conditional odds and risk ratios

Authors: Jiacheng Ge, Iván Díaz

Abstract: Conditional effects are commonly used measures for understanding how treatment effects vary across different groups, and are often used to target treatments/interventions to groups who benefit most. In this work we review existing methods and propose novel ones, focusing on the odds ratio (OR) and the risk ratio (RR). While estimation of the conditional average treatment effect (ATE) has been widely studied, estimators for the OR and RR lag behind, and cutting edge estimators such as those based on doubly robust transformations or orthogonal risk functions have not been generalized to these parameters. We propose such a generalization here, focusing on the DR-learner and the R-learner. We derive orthogonal risk functions for the OR and RR and show that the associated pseudo-outcomes satisfy second-order conditional-mean remainder properties analogous to the ATE case. We also evaluate estimators for the conditional ATE, OR, and RR in a comprehensive nonparametric Monte Carlo simulation study to compare them with common alternatives under hundreds of different data-generating distributions. Our numerical studies provide empirical guidance for choosing an estimator. For instance, they show that while parametric models are useful in very simple settings, the proposed nonparametric estimators significantly reduce bias and mean squared error in the more complex settings expected in the real world. We illustrate the methods in the analysis of physical activity and sleep trouble in U.S. adults using data from the National Health and Nutrition Examination Survey (NHANES). The results demonstrate that our estimators uncover substantial treatment effect heterogeneity that is obscured by traditional regression approaches and lead to improved treatment decision rules, highlighting the importance of data-adaptive methods for advancing precision health research.
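
To make the three target parameters concrete, the sketch below computes them from conditional risks p1 = P(Y=1 | A=1, X=x) and p0 = P(Y=1 | A=0, X=x) using their standard definitions. It is a plug-in illustration only, not the DR-learner or R-learner proposed in the paper, and the strata and risk values are made up.

```python
def risk_difference(p1, p0):
    """Conditional average treatment effect on the risk scale."""
    return p1 - p0


def risk_ratio(p1, p0):
    return p1 / p0


def odds(p):
    return p / (1.0 - p)


def odds_ratio(p1, p0):
    return odds(p1) / odds(p0)


# Hypothetical conditional risks (treated, untreated) in two covariate strata.
strata = {
    "younger": (0.20, 0.10),
    "older": (0.60, 0.50),
}

effects = {
    name: {
        "ATE": risk_difference(p1, p0),
        "RR": risk_ratio(p1, p0),
        "OR": odds_ratio(p1, p0),
    }
    for name, (p1, p0) in strata.items()
}
```

Both strata have the same risk difference (0.10) but different risk ratios and odds ratios, which is why heterogeneity conclusions can depend on the chosen effect scale.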

cross Neural Stochastic Processes for Satellite Precipitation Refinement

Authors: Shunya Nagashima, Takumi Bannai, Shuitsu Koyama, Tomoya Mitsui, Shuntaro Suzuki

Abstract: Accurate precipitation estimation is critical for flood forecasting, water resource management, and disaster preparedness. Satellite products provide global hourly coverage but contain systematic biases; ground-based gauges are accurate at point locations but too sparse for direct gridded correction. Existing methods fuse these sources by interpolating gauge observations onto the satellite grid, but treat each time step independently and therefore discard temporal structure in precipitation fields. We propose Neural Stochastic Process (NSP), a model that pairs a Neural Process encoder conditioning on arbitrary sets of gauge observations with a latent Neural SDE on a 2D spatial representation. NSP is trained under a single variational objective with simulation-free cost. We also introduce QPEBench, a benchmark of 43,756 hourly samples over the Contiguous United States (2021--2025) with four aligned data sources and six evaluation metrics. On QPEBench, NSP outperforms 13 baselines across all six metrics and surpasses JAXA's operational gauge-calibrated product. An additional experiment on Kyushu, Japan confirms generalization to a different region with independent data sources.

cross A Queueing-Theoretic Framework for Dynamic Attack Surfaces: Data-Integrated Risk Analysis and Adaptive Defense

Authors: Jihyeon Yun, Abdullah Yasin Etcibasi, Ming Shi, C. Emre Koksal

Abstract: We develop a queueing-theoretic framework to model the temporal evolution of cyber-attack surfaces, where the number of active vulnerabilities is represented as the backlog of a queue. Vulnerabilities arrive as they are discovered or created, and leave the system when they are patched or successfully exploited. Building on this model, we study how automation affects attack and defense dynamics by introducing an AI amplification factor that scales arrival, exploit, and patching rates. Our analysis shows that even symmetric automation can increase the rate of successful exploits. We validate the model using vulnerability data collected from an open source software supply chain and show that it closely matches real-world attack surface dynamics. Empirical results reveal heavy-tailed patching times, which we prove induce long-range dependence in vulnerability backlog and help explain persistent cyber risk. Utilizing our queueing abstraction for the attack surface, we develop a systematic approach for cyber risk mitigation. We formulate the dynamic defense problem as a constrained Markov decision process with resource-budget and switching-cost constraints, and develop a reinforcement learning (RL) algorithm that achieves provably near-optimal regret. Numerical experiments validate the approach and demonstrate that our adaptive RL-based defense policies significantly reduce successful exploits and mitigate heavy-tail queue events. Using trace-driven experiments on the ARVO dataset, we show that the proposed RL-based defense policy reduces the average number of active vulnerabilities in a software supply chain by over 90% compared to existing defense practices, without increasing the overall maintenance budget. Our results allow defenders to quantify cumulative exposure risk under long-range dependent attack dynamics and to design adaptive defense strategies with provable efficiency.

cross FEDBUD: Joint Incentive and Privacy Optimization for Resource-Constrained Federated Learning

Authors: Tao Liu, Xuehe Wang

Abstract: Federated learning has become a popular paradigm for privacy protection and edge-based machine learning. However, defending against differential attacks and devising incentive strategies remain significant bottlenecks in this field. Despite recent works on privacy-aware incentive mechanism design for federated learning, few of them consider both data volume and noise level. In this paper, we propose a novel federated learning system called FEDBUD, which combines privacy and economic concerns together by considering the joint influence of data volume and noise level on incentive strategy determination. In this system, the cloud server controls monetary payments to edge nodes, while edge nodes control data volume and noise level that potentially impact the model performance of the cloud server. To determine the mutually optimal strategies for both sides, we model FEDBUD as a two-stage Stackelberg Game and derive the Nash Equilibrium using the mean-field estimator and virtual queue. Experimental results on real-world datasets demonstrate the outstanding performance of FEDBUD.

cross NSFL: A Post-Training Neuro-Symbolic Fuzzy Logic Framework for Boolean Operators in Neural Embeddings

Authors: Vladi Vexler, Ofer Idan, Gil Lederman, Dima Sivov

Abstract: Standard dense retrievers lack a native calculus for multi-atom logical constraints. We introduce Neuro-Symbolic Fuzzy Logic (NSFL), a framework that adapts formal t-norms and t-conorms to neural embedding spaces without requiring retraining. NSFL operates as a first-order hybrid calculus: it anchors logical operations on isolated zero-order similarity scores while actively steering representations using Neuro-Symbolic Deltas (NS-Delta) -- the first-order marginal differences derived from contextual fusion. This preserves pure atomic meaning while capturing domain reliance, preventing the representation collapse and manifold escape endemic to traditional geometric baselines. For scalable real-time retrieval, Spherical Query Optimization (SQO) leverages Riemannian optimization to project these fuzzy formulas into manifold-stable query vectors. Validated across six distinct encoder configurations and two modalities (including zero-shot and SOTA fine-tuned models), NSFL yields mAP improvements up to +81%. Notably, NSFL provides an additive 20% average and up to 47% boost even when applied to encoders explicitly fine-tuned for logical reasoning. By establishing a training-free, order-aware calculus for high-dimensional spaces, this framework lays the foundation for future dynamic scaling and learned manifold logic.
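
As a rough illustration of how t-norms and t-conorms turn per-atom similarity scores into a score for a Boolean query, the sketch below uses the product t-norm and its dual, the probabilistic sum. It assumes similarity scores have already been rescaled to [0, 1]. The abstract does not say which t-norm family NSFL adopts, and the NS-Delta and SQO steps are not shown; the query and the scores are hypothetical.

```python
def t_norm_product(a, b):
    """Fuzzy AND: product t-norm."""
    return a * b


def t_conorm_prob_sum(a, b):
    """Fuzzy OR: probabilistic sum, the De Morgan dual of the product t-norm."""
    return a + b - a * b


def fuzzy_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def score(sim):
    # Hypothetical query: (cat AND outdoor) OR (dog AND NOT indoor)
    left = t_norm_product(sim["cat"], sim["outdoor"])
    right = t_norm_product(sim["dog"], fuzzy_not(sim["indoor"]))
    return t_conorm_prob_sum(left, right)


# Hypothetical per-atom similarities for one document, already in [0, 1].
doc_sims = {"cat": 0.9, "outdoor": 0.5, "dog": 0.2, "indoor": 0.5}
result = score(doc_sims)
```

Other t-norm pairs (for example minimum and maximum) slot into the same structure; the choice changes how strongly a weak atom drags down a conjunction.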

cross Adaptive H-EFT-VA: A Provably Safe Trajectory Through the Trainability-Expressibility Landscape of Variational Quantum Algorithms

Authors: Eyad I. B. Hamid

Abstract: H-EFT-VA established a physics-informed solution to the Barren Plateau (BP) problem via a hierarchical EFT UV-cutoff, guaranteeing gradient variance in Omega(1/poly(N)). However, localization restricts the ansatz to a polynomial subspace, creating a reference-state gap for states distant from |0>^N. We introduce Adaptive H-EFT-VA (A-H-EFT) to navigate the trainability-expressibility tradeoff by expanding the reachable Hilbert space along a safe trajectory. Gradient variance is maintained in Omega(1/poly(N)) if sigma(t) <= 0.5/sqrt(LN) (Theorem 1). A Safe Expansion Corollary and Monotone Growth Lemma confirm expansion without discontinuous jumps. Benchmarking across 16 experiments (up to N=14) shows A-H-EFT achieves fidelity F=0.54, doubling static H-EFT-VA (F=0.27) and outperforming HEA (F~0.01), with gradient variance >= 0.5 throughout. For Heisenberg XXZ (Delta_ref=1), A-H-EFT identifies the negative ground state while static methods fail. Results are statistically significant (p < 10^-37). Robustness over three decades of hyperparameters enables deployment without search. This is the first rigorously bounded trajectory through the VQA landscape.
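
The condition sigma(t) <= 0.5/sqrt(LN) can be read as a cap on a growing width schedule. The sketch below computes that cap and clips a proposed schedule to it. It assumes L is the number of circuit layers and N the number of qubits, which the abstract does not spell out; the function names and the example schedule are hypothetical.

```python
import math


def sigma_cap(num_layers, num_qubits):
    """Upper bound 0.5 / sqrt(L * N) from the stated trainability condition."""
    return 0.5 / math.sqrt(num_layers * num_qubits)


def clip_schedule(proposed_sigmas, num_layers, num_qubits):
    """Clip each proposed sigma(t) so it never exceeds the cap (assumed usage)."""
    cap = sigma_cap(num_layers, num_qubits)
    return [min(s, cap) for s in proposed_sigmas]


# Hypothetical expansion schedule for L = 4 layers and N = 16 qubits.
safe = clip_schedule([0.01, 0.05, 0.2], num_layers=4, num_qubits=16)
```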

cross Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

Authors: Matteo Spanio, Valentina Frezzato, Antonio Rod\`a

Abstract: Collecting large, aligned cross-modal datasets for music-flavor research is difficult because perceptual experiments are costly and small by design. We address this bottleneck through two complementary experiments. The first tests whether audio-flavor correlations, feature-importance rankings, and latent-factor structure transfer from an experimental soundtracks collection (257~tracks with human annotations) to a large FMA-derived corpus ($\sim$49,300 segments with synthetic labels). The second validates computational flavor targets -- derived from food chemistry via a reproducible pipeline -- against human perception in an online listener study (49~participants, 20~tracks). Results from both experiments converge: the quantitative transfer analysis confirms that cross-modal structure is preserved across supervision regimes, and the perceptual evaluation shows significant alignment between computational targets and listener ratings (permutation $p<0.0001$, Mantel $r=0.45$, Procrustes $m^2=0.51$). Together, these findings support the conclusion that sonic seasoning effects are present in synthetic FMA annotations. We release datasets and companion code to support reproducible cross-modal AI research.

cross A Deep Generative Approach to Stratified Learning

Authors: Randy Martinez, Rong Tang, Lizhen Lin

Abstract: While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and lack of efficient models in learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to be dependent on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.

cross Enhancing Cross-Problem Vehicle Routing via Federated Learning

Authors: Xiangchi Meng, Jianan Zhou, Jie Gao, Yifan Lu, Yaoxin Wu, Gonglin Yuan, Yaqing Hou

Abstract: Vehicle routing problems (VRPs) constitute a core optimization challenge in modern logistics and supply chain management. Recent neural combinatorial optimization (NCO) methods have demonstrated superior efficiency over some traditional algorithms. While serving as a primary NCO approach for solving general VRPs, current cross-problem learning paradigms are still subject to performance degradation and generalizability decay when transferring from simple VRP variants to those involving different and complex constraints. To strengthen these paradigms, this paper offers an innovative "Multi-problem Pre-train, then Single-problem Fine-tune" framework with Federated Learning (MPSF-FL). This framework exploits the common knowledge of a federated global model to foster efficient cross-problem knowledge sharing and transfer among local models for single-problem fine-tuning. In this way, local models effectively retain common VRP knowledge from the up-to-date global model while being efficiently adapted to downstream VRPs with heterogeneous complex constraints. Experimental results demonstrate that our framework not only enhances performance on diverse VRPs but also improves generalizability to unseen problems.

cross Efficient Process Reward Modeling via Contrastive Mutual Information

Authors: Nakyung Lee, Sangwoo Hong, Jungwoo Lee

Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model's internal probability to infer step-level supervision while significantly reducing the computational burden of dataset annotation. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.

cross Omnimodal Dataset Distillation via High-order Proxy Alignment

Authors: Yuxuan Gao, Xiaohao Liu, Xiaobo Xia, Tongliang Liu

Abstract: Dataset distillation compresses large-scale datasets into compact synthetic sets while preserving training performance, but existing methods are largely restricted to single-modal or bimodal settings. Extending dataset distillation to scenarios involving more than two modalities, i.e., Omnimodal Dataset Distillation, remains underexplored and challenging due to increased heterogeneity and complex cross-modal interactions. In this work, we identify the key determinant that bounds the endpoint discrepancy in the omnimodal setting, which is exacerbated with an increasing number of modalities. To this end, we propose HoPA, a unified method that captures high-order cross-modal alignments via a compact proxy, which is compatible with trajectory matching as well. By abstracting omnimodal alignment with a shared similarity structure, our method avoids the combinatorial complexity of pairwise modality modeling and enables scalable joint distillation across heterogeneous modalities. Theoretical analysis from the spectral perspective reveals the rationality of our proposed method against bimodal dataset distillation techniques. Extensive experiments on various benchmarks demonstrate that the proposed method achieves superior compression-performance trade-offs compared to existing competitors. The source code will be publicly released.

cross Learning and Enforcing Context-Sensitive Control for LLMs

Authors: Mohammad Albinhassan, Pranava Madhyastha, Mark Law, Alessandra Russo

Abstract: Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification -- a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.

cross One-Step Score-Based Density Ratio Estimation

Authors: Wei Chen, Qibin Zhao, John Paisley, Junmei Yang, Delu Zeng

Abstract: Density ratio estimation (DRE) is a useful tool for quantifying discrepancies between probability distributions, but existing approaches often involve a trade-off between estimation quality and computational efficiency. Classical direct DRE methods are usually efficient at inference time, yet their performance can seriously deteriorate when the discrepancy between distributions is large. In contrast, score-based DRE methods often yield more accurate estimates in such settings, but they typically require considerable repeated function evaluations and numerical integration. We propose One-step Score-based Density Ratio Estimation (OS-DRE), a partly analytic and solver-free framework designed to combine these complementary advantages. OS-DRE decomposes the time score into spatial and temporal components, representing the latter with an analytic radial basis function (RBF) frame. This formulation converts the otherwise intractable temporal integral into a closed-form weighted sum, thereby removing the need for numerical solvers and enabling DRE with only one function evaluation. We further analyze approximation conditions for the analytic frame, and establish approximation error bounds for both finitely and infinitely smooth temporal kernels, grounding the framework in existing approximation theory. Experiments across density estimation, continual Kullback-Leibler and mutual information estimation, and near out-of-distribution detection demonstrate that OS-DRE offers a favorable balance between estimation quality and inference efficiency.

cross FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

Authors: Yingguang Yang, Hao Liu, Xin Zhang, Yunhui Liu, Yutong Xia, Qi Wu, Hao Peng, Taoran Liang, Bin Chong, Tieke He, Philip S. Yu

Abstract: Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To address these challenges, we propose FedRio, a Personalized Federated Social Bot Detection framework with Cooperative Reinforced Contrastive Adversarial Distillation. We first introduce an adaptive message-passing module as the graph neural network backbone for each client. To facilitate efficient knowledge sharing of global data distributions, we design a federated knowledge extraction mechanism based on generative adversarial networks. Additionally, we employ a multi-stage adversarial contrastive learning strategy to enforce feature space consistency among clients and reduce divergence between local and global models. Finally, we adopt adaptive server-side parameter aggregation and reinforcement learning-based client-side parameter control to better accommodate data heterogeneity in heterogeneous federated settings. Extensive experiments on two real-world social bot detection benchmarks demonstrate that FedRio consistently outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature space consistency, while remaining competitive with published centralized results under substantially stronger privacy constraints.

cross Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

Authors: Jakub Binkowski, Kamil Adamczewski, Tomasz Kajdanowicz

Abstract: Large language models frequently exhibit hallucinations: fluent and confident outputs that are factually incorrect or unsupported by the input context. While recent hallucination detection methods have explored various features derived from attention maps, the underlying mechanisms they exploit remain poorly understood. In this work, we propose SinkProbe, a hallucination detection method grounded in the observation that hallucinations are deeply entangled with attention sinks - tokens that accumulate disproportionate attention mass during generation - indicating a transition from distributed, input-grounded attention to compressed, prior-dominated computation. Importantly, although sink scores are computed solely from attention maps, we find that the classifier preferentially relies on sinks whose associated value vectors have large norms. Moreover, we show that previous methods implicitly depend on attention sinks by establishing their mathematical relationship to sink scores. Our findings yield a novel hallucination detection method grounded in theory that produces state-of-the-art results across popular datasets and LLMs.
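
To make "tokens that accumulate disproportionate attention mass" concrete, the sketch below scores each key position by the average attention it receives across all query positions in a single head. This is a common proxy and an assumption here: the abstract does not give SinkProbe's exact sink-score definition, and the attention matrix is a hypothetical toy example.

```python
def sink_scores(attn):
    """Mean attention received by each key position.

    attn[i][j] is the attention weight from query i to key j; each row is
    assumed to be a softmax output that sums to 1.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [
        sum(attn[i][j] for i in range(n_queries)) / n_queries
        for j in range(n_keys)
    ]


# Hypothetical causal attention map for three tokens (one head).
attn = [
    [1.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
    [0.7, 0.1, 0.2],
]
scores = sink_scores(attn)
sink_index = max(range(len(scores)), key=scores.__getitem__)
```

In this toy map the first token absorbs most of the attention mass, which is the pattern usually described as an attention sink.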

cross Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

Authors: Huiming Zhang, Binghan Li, Wan Tian, Qiang Sun

Abstract: Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter $\theta$ controls the tail heaviness: $\theta=2$ corresponds to sub-Gaussian, $\theta=1$ to sub-exponential, and $0<\theta<1$ to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log $f_\theta$-divergence, which admits explicit comparisons to R\'enyi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index $\theta$, with complexity scaling as $\log^{1/\theta}$ and entropy$^{1/\theta}$. These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale R\'enyi mutual information. We illustrate the consequences in R\'enyi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.
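
For readers unfamiliar with the sub-Weibull family, the sketch below evaluates the usual tail envelope P(|X| >= t) <= 2 exp(-(t/K)^theta) for the three regimes the abstract names. The scale constant K (here `scale`) and the function name are illustrative assumptions; the paper's exact normalization may differ.

```python
import math


def sub_weibull_tail_bound(t, theta, scale=1.0):
    """Envelope 2 * exp(-(t / scale) ** theta), capped at 1, for t >= 0."""
    return min(1.0, 2.0 * math.exp(-((t / scale) ** theta)))


t = 3.0
gaussian_like = sub_weibull_tail_bound(t, theta=2.0)      # sub-Gaussian regime
exponential_like = sub_weibull_tail_bound(t, theta=1.0)   # sub-exponential regime
heavy = sub_weibull_tail_bound(t, theta=0.5)              # genuinely heavy tail
```

For t above the scale, the envelope decays more slowly as theta shrinks, which is why moment generating functions can fail to exist when theta < 1.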

cross Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models

Authors: Mehmet Can \c{S}akiro\u{g}lu, H. Altay G\"uvenir, Kamer Kaya

Abstract: Generating multiple-choice questions (MCQs) with difficulty estimation remains challenging in automated MCQ-generation systems used in adaptive, AI-assisted education. This study proposes a novel methodology for generating MCQs with difficulty estimation from the input documents by utilizing knowledge graphs (KGs) and large language models (LLMs). Our approach uses an LLM to construct a KG from input documents, from which MCQs are then systematically generated. Each MCQ is generated by selecting a node from the KG as the key, sampling a related triple or quintuple -- optionally augmented with an extra triple -- and prompting an LLM to generate a corresponding stem from these graph components. Distractors are then selected from the KG. For each MCQ, nine difficulty signals are computed and combined into a unified difficulty score using a data-driven approach. Experimental results demonstrate that our method generates high-quality MCQs whose difficulty estimation is interpretable and aligns with human perceptions. Our approach improves automated MCQ generation by integrating structured knowledge representations with LLMs and a data-driven difficulty estimation model.

cross Lung Cancer Detection Using Deep Learning

Authors: Imama Ajmi, Abhishek Das

Abstract: Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind that of other cancers, partly due to late-stage symptom manifestation. This necessitates early and accurate detection for effective treatment. In this paper, we discuss methodologies for lung cancer detection and explore the efficacy of several deep learning algorithms (InceptionV3, MobileNetV2, VGG16, and ResNet152) in classifying lung cancer cases. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model's capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. Our proposed model is a 16-layer architecture based on a CNN, and it exhibits several key highlights that contribute to its novelty. By integrating multiple layer types, such as convolutional, pooling, flatten, dropout, fully connected, and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. The novelty of our proposed model is that its accuracy increases consistently with the number of epochs; we tested model performance for up to 30 epochs. Our proposed model also overcomes the overfitting problem.

cross Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

Authors: Daniel J. Tan, Kay Choong See, Mengling Feng

Abstract: Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient's clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.

cross Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

Authors: Chirag Shinde

Abstract: We propose two complementary modifications to transformer attention blocks. First, a non-linear pre-projection MLP is inserted between layer norm and Q/K/V projections, constructing richer features in a position-agnostic manner before any positional encoding is applied. Second, a content skip connection routes the pre-projection's features around the attention mechanism, allowing content information to bypass position-aware attention where beneficial. In frozen-probe experiments on Pythia-160M and 410M, the combined approach achieves the strongest results across methods: +40.6% LAMBADA accuracy and -39% perplexity at 160M scale. Learned skip connection weights reveal a consistent pattern across model sizes: later transformer layers activate the content bypass more strongly than earlier layers, suggesting that deeper layers benefit from content information that does not pass through positional attention. All modifications add no K/V cache overhead.

cross Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

Authors: Jugal Gajjar

Abstract: Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages -- hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair -- governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.

cross Differentially Private Verification of Distribution Properties

Authors: Elbert Du, Cynthia Dwork, Pranay Tankala, Linjun Zhang

Abstract: A recent line of work initiated by Chiesa and Gur and further developed by Herman and Rothblum investigates the sample and communication complexity of verifying properties of distributions with the assistance of a powerful, knowledgeable, but untrusted prover. In this work, we initiate the study of differentially private (DP) distribution property testing. After all, if we do not trust the prover to help us with verification, why should we trust it with our sensitive sample? We map a landscape of DP prover-aided proofs of properties of distributions. In the non-private case it is known that one-round (two message) private-coin protocols can have substantially lower complexity than public-coin AM protocols, but in the private case, the possibility for improvement depends on the parameter regime and privacy model. Drawing on connections to replicability and techniques for amplification, we show: (1) There exists a reduction from any one-round $(\varepsilon,\delta)$-DP private-coin interactive proof to a one-round public-coin DP interactive proof with the same privacy parameters, for the parameter regime $\varepsilon=O(1/\sqrt{n})$ and $\delta=O(1/n^{5/2})$, and with the same sample and communication complexities. (2) If the verifier's message in the private-coin interactive proof is $O(1/\sqrt{\log n})$ locally DP -- a far more relaxed privacy parameter regime in a different model -- then applying one additional transformation again yields a one-round public-coin protocol with the same privacy bound and the same sample and computational complexities. (3) However, when the privacy guarantee is very relaxed ($\varepsilon\in\Omega(\log n)$), private coins indeed reduce complexity. We also obtain a Merlin-Arthur (one-message) proof for privately testing whether samples are drawn from a product distribution, and prove that its sample complexity is optimal.

cross Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation

Authors: Mohamed Ehab, Ali Hamdi

Abstract: Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models struggle with intricate backgrounds and fine leaf structures. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision), which comprises three novel components. The first is Uncertainty-Guided Dual Attention (UGDA), which uses channel variance to modulate feature maps. The second is an entropy-weighted hybrid loss function that focuses on high-uncertainty boundary pixels. The third is deep supervision of intermediate encoder layers. We performed a systematic ablation study on two widely used architectures, U-Net and LinkNet, analyzing five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-Net on a high-resolution plant seedling image dataset containing 432 images. We demonstrate improved segmentation performance, with an increase in Dice coefficient of 9.3% above baseline. LinkNet's variance is 13.2% above baseline. Qualitative overlays show reduced false positives at leaf boundaries, and uncertainty heatmaps are consistent with the complex morphology. UGDA-Net aids in the segmentation of delicate plant structures and provides a high-definition solution. The results show that uncertainty-guided attention and uncertainty-weighted loss are complementary.
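
One way to picture an entropy-weighted loss that emphasizes uncertain boundary pixels is to scale per-pixel binary cross-entropy by the predictive entropy of each pixel. The sketch below does exactly that. The weighting form `1 + alpha * H(p)`, the parameter `alpha`, and the function names are assumptions for illustration; the abstract does not give the exact hybrid loss used in UGDA-Net.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli(p) prediction, clamped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_bce(probs, targets, alpha=1.0, eps=1e-7):
    """Mean binary cross-entropy with per-pixel weight 1 + alpha * H(p) / ln 2.

    Assumed form: pixels whose prediction is near 0.5 (high entropy) count
    up to (1 + alpha) times as much as confidently predicted pixels.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight = 1.0 + alpha * binary_entropy(p) / math.log(2.0)
        total += weight * bce
    return total / len(probs)


# Hypothetical foreground probabilities and ground-truth labels for four pixels.
probs = [0.95, 0.55, 0.10, 0.48]
targets = [1, 1, 0, 0]
plain = entropy_weighted_bce(probs, targets, alpha=0.0)
weighted = entropy_weighted_bce(probs, targets, alpha=1.0)
```

Setting `alpha=0.0` recovers plain cross-entropy, which makes the effect of the entropy term easy to ablate.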

cross Harnessing Photonics for Machine Intelligence

Authors: Hanqing Zhu, Shupeng Ning, Hongjian Zhou, Ziang Yin, Ray T. Chen, Jiaqi Gu, David Z. Pan

Abstract: The exponential growth of machine-intelligence workloads is colliding with the power, memory, and interconnect limits of the post-Moore era, motivating compute substrates that scale beyond transistor density alone. Integrated photonics is emerging as a candidate for artificial intelligence (AI) acceleration by exploiting optical bandwidth and parallelism to reshape data movement and computation. This review reframes photonic computing from a circuits-and-systems perspective, moving beyond building-block progress toward cross-layer system analysis and full-stack design automation. We synthesize recent advances through a bottleneck-driven taxonomy that delineates the operating regimes and scaling trends where photonics can deliver end-to-end sustained benefits. A central theme is cross-layer co-design and workload-adaptive programmability to sustain high efficiency and versatility across evolving application domains at scale. We further argue that Electronic-Photonic Design Automation (EPDA) will be pivotal, enabling closed-loop co-optimization across simulation, inverse design, system modeling, and physical implementation. By charting a roadmap from laboratory prototypes to scalable, reproducible electronic-photonic ecosystems, this review aims to guide the CAS community toward an automated, system-centric era of photonic machine intelligence.

cross Retinal Cyst Detection from Optical Coherence Tomography Images

Authors: Abhishek Dharmaratnakar, Aadheeshwar Vijayakumar, Suchand Dayanand

Abstract: Retinal cysts are formed by the leakage and accumulation of fluid in the retina due to the incompetence of the retinal vasculature. These cystic spaces are significant in several ocular diseases, such as age-related macular degeneration and diabetic macular edema. Optical coherence tomography is one of the predominant techniques for imaging retinal pathologies. Segmentation and quantification of intraretinal cysts play a vital role in predicting visual acuity. Several methods have been proposed in the literature for automatic segmentation of intraretinal cysts. As cystoid macular edema is becoming a major problem, it must be quantified accurately and operated on, or it may cause many problems later. Although research is being carried out in this area, little progress has been made: the accuracy achieved so far is 68\%, which is low. Moreover, existing methods depend on image quality and perform poorly on high-noise images such as those from Topcon devices. This work uses a ResNet CNN (Convolutional Neural Network) to perform segmentation by patch-wise classification, training on the image set from the cyst segmentation challenge dataset and testing on the test set provided by 2 different graders for all 4 vendors in the challenge. It also compares these methods using the first publicly available, novel cyst segmentation challenge dataset. The methods were evaluated using quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results are better than previous state-of-the-art approaches, giving a Dice coefficient of more than 70\% on all vendors irrespective of image quality.
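
The Dice coefficient used to report these results is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal sketch for flattened binary masks follows; the example masks are hypothetical, and the handling of two empty masks is a convention chosen here rather than anything stated in the abstract.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two equal-length sequences of 0/1 pixel labels."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical flattened masks: 1 marks a cyst pixel.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
dice = dice_coefficient(pred, truth)
```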

cross CASK: Core-Aware Selective KV Compression for Reasoning Traces

Authors: Buseong Kim, Heejun Gwon

Abstract: In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes where the prefix can exhaust the budget before decode-stage compression becomes active, CASK further uses a two-stage design: prefix eviction followed by decode-stage consolidation. On the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 > triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary. The overall evidence supports a simple conclusion: effective reasoning KV compression depends less on more elaborate scorer engineering than on combining core preservation with selective scratch consolidation to lower the usable budget frontier.

cross EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

Authors: Chongliu Jia, Yi Luo, Sipeng Han, Pengwei Li, Jie Ding, Youshuang Hu, Yimiao Qian, Qiya Wang

Abstract: Medium- to long-horizon equity allocation is challenging due to weak predictive structure, non-stationary market regimes, and the degradation of signals under realistic trading constraints. Conventional approaches often rely on single predictors or loosely coupled pipelines, which limit robustness under distributional shift. This paper proposes EvoNash-MARL, a closed-loop framework that integrates reinforcement learning with population-based policy optimization and execution-aware selection to improve robustness in medium- to long-horizon allocation. The framework combines multi-agent policy populations, game-theoretic aggregation, and constraint-aware validation within a unified walk-forward design. Under a 120-window walk-forward protocol, the final configuration achieves the highest robust score among internal baselines. On out-of-sample data from 2014 to 2024, it delivers a 19.6% annualized return, compared to 11.7% for SPY, and remains stable under extended evaluation through 2026. While the framework demonstrates consistent performance under realistic constraints and across market settings, strong global statistical significance is not established under White's Reality Check (WRC) and SPA-lite tests. The results therefore provide evidence of improved robustness rather than definitive proof of superior market timing performance.

cross QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

Authors: Navid Azimi, Aditya Prakash, Yao Wang, Li Xiong

Abstract: Deep neural networks remain highly vulnerable to adversarial perturbations, limiting their reliability in security- and safety-critical applications. To address this challenge, we introduce QShield, a modular hybrid quantum-classical neural network (HQCNN) architecture designed to enhance the adversarial robustness of classical deep learning models. QShield integrates a conventional convolutional neural network (CNN) backbone for feature extraction with a quantum processing module that encodes the extracted features into quantum states, applies structured entanglement operations under realistic noise models, and outputs a hybrid prediction through a dynamically weighted fusion mechanism implemented via a lightweight multilayer perceptron (MLP). We systematically evaluate both classical and hybrid quantum-classical models on the MNIST, OrganAMNIST, and CIFAR-10 datasets, using a comprehensive set of robustness, efficiency, and computational performance metrics. Our results demonstrate that classical models are highly vulnerable to adversarial attacks, whereas the proposed hybrid models with entanglement patterns maintain high predictive accuracy while substantially reducing attack success rates across a wide range of adversarial attacks. Furthermore, the proposed hybrid architecture significantly increases the computational cost required to generate adversarial examples, thereby introducing an additional layer of defense. These findings indicate that the proposed modular hybrid architecture achieves a practical balance between predictive accuracy and adversarial robustness, positioning it as a promising approach for secure and reliable machine learning in sensitive and safety-critical applications.

cross Generative Design for Direct-to-Chip Liquid Cooling for Data Centers

Authors: Zheng Liu

Abstract: Rapid growth in artificial intelligence (AI) workloads is driving up data center power densities, increasing the need for advanced thermal management. Direct-to-chip liquid cooling can remove heat efficiently at the source, but many cold plate channel layouts remain heuristic and are not optimized for the strongly non-uniform temperature distribution of modern heterogeneous packages. This work presents a generative design framework for synthesizing cooling channel geometries for the NVIDIA GB200 Grace Blackwell Superchip. A physics-based finite-difference thermal model provides rapid steady-state temperature predictions and supplies spatial thermal feedback to a constrained reaction-diffusion process that generates novel channel topologies while enforcing inlet/outlet and component constraints. By iterating channel generation and thermal evaluation in a closed loop, the method naturally redistributes cooling capacity toward high-power regions and suppresses hot-spot formation. Compared with a baseline parallel channel design, the resulting channels achieve more than a 5 degree Celsius reduction in average temperature and over 35 degree Celsius reduction in maximum temperature. Overall, the results demonstrate that coupling generative algorithms with lightweight physics-based modeling can significantly enhance direct-to-chip liquid cooling performance, supporting more sustainable scaling of AI computing.

cross Progressive Deep Learning for Automated Spheno-Occipital Synchondrosis Maturation Assessment

Authors: Omid Halimi Milani, Amanda Nikho, Marouane Tliba, Lauren Mills, Emadeldeen Hamdan, Ahmet Enis Cetin, Mohammed H. Elnagar

Abstract: Accurate assessment of spheno-occipital synchondrosis (SOS) maturation is a key indicator of craniofacial growth and a critical determinant for orthodontic and surgical timing. However, SOS staging from cone-beam CT (CBCT) relies on subtle, continuously evolving morphological cues, leading to high inter-observer variability and poor reproducibility, especially at transitional fusion stages. We frame SOS assessment as a fine-grained visual recognition problem and propose a progressive representation-learning framework that explicitly mirrors how expert clinicians reason about synchondral fusion: from coarse anatomical structure to increasingly subtle patterns of closure. Rather than training a full-capacity network end-to-end, we sequentially grow the model by activating deeper blocks over time, allowing early layers to first encode stable cranial base morphology before higher-level layers specialize in discriminating adjacent maturation stages. This yields a curriculum over network depth that aligns deep feature learning with the biological continuum of SOS fusion. Extensive experiments across convolutional and transformer-based architectures show that this expert-inspired training strategy produces more stable optimization and consistently higher accuracy than standard training, particularly for ambiguous intermediate stages. Importantly, these gains are achieved without changing network architectures or loss functions, demonstrating that training dynamics alone can substantially improve anatomical representation learning. The proposed framework establishes a principled link between expert dental intuition and deep visual representations, enabling robust, data-efficient SOS staging from CBCT and offering a general strategy for modeling other continuous biological processes in medical imaging.

cross bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

Authors: Sel\c{c}uk Korkmaz

Abstract: Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.
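
The two guards named in this abstract, leakage-aware split construction and train-fold-only preprocessing, can be illustrated with a minimal Python sketch. This is not bioLeak's R API: the helper names `group_folds` and `scale_with_train_stats` are hypothetical, and the round-robin assignment of groups to folds is an assumption made for brevity.

```python
from statistics import mean, pstdev


def group_folds(groups, n_folds):
    """Yield (train_idx, test_idx) so that no group spans both sides."""
    unique = sorted(set(groups))
    # Assumption: simple round-robin assignment of whole groups to folds.
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


def scale_with_train_stats(values, train_idx, test_idx):
    """Standardize both folds using mean/sd estimated on the training fold only."""
    train_values = [values[i] for i in train_idx]
    mu = mean(train_values)
    sd = pstdev(train_values) or 1.0  # guard against a constant feature
    scaled_train = [(values[i] - mu) / sd for i in train_idx]
    scaled_test = [(values[i] - mu) / sd for i in test_idx]
    return scaled_train, scaled_test
```

Estimating `mu` and `sd` on all rows instead of `train_idx` would be the "globally estimated preprocessing" the abstract warns about, since test-fold information would then shape the training features.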

cross Neural Generalized Mixed-Effects Models

Authors: Yuli Slavutsky, Sebastian Salazar, David M. Blei

Abstract: Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear function of observed covariates and a latent group-specific random effect. Since exact marginalization over the random effects is typically intractable, model parameters are estimated by maximizing an approximate marginal likelihood. In this paper, we replace the linear function with neural networks. The result is a more flexible model, the neural generalized mixed-effects model (NGMM), which captures complex relationships between covariates and responses. To fit NGMM to data, we introduce an efficient optimization procedure that maximizes the approximate marginal likelihood and is differentiable with respect to network parameters. We show that the approximation error of our objective decays at a Gaussian-tail rate in a user-chosen parameter. On synthetic data, NGMM improves over GLMMs when covariate-response relationships are nonlinear, and on real-world datasets it outperforms prior methods. Finally, we analyze a large dataset of student proficiency to demonstrate how NGMM can be extended to more complex latent-variable models.

cross Sanity Checks for Agentic Data Science

Authors: Zachary T. Rewolinski, Austin V. Zane, Hao Huang, Chandan Singh, Chenglong Wang, Jianfeng Gao, Bin Yu

Abstract: Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.

cross Panoptic Pairwise Distortion Graph

Authors: Muhammad Kamran Janjua, Abdul Wahab, Bahador Rashidi

Abstract: In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.

cross Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics

Authors: Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias A{\ss}enmacher, Christian Heumann, Chongsheng Zhang

Abstract: The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$n\sigma$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \textbf{Min-$k$ Sampling}, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify "semantic cliffs": sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.
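
The temperature-invariance claim can be illustrated with any truncation rule that depends only on ratios of logit gaps: dividing all logits by a temperature T scales every gap by 1/T, so the ratios are unchanged. The rule below (adjacent gaps normalized by the total spread, weighted by 1/rank, cut at the largest weighted gap) is a hypothetical stand-in chosen for illustration; it is not the Min-$k$ formula from the paper, which the abstract does not state.

```python
def relative_gap_cutoff(logits):
    """Return indices of tokens kept by a ratio-based truncation rule."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    z = [logits[i] for i in order]
    spread = z[0] - z[-1]
    if spread == 0:
        return order  # flat (or single-token) distribution: keep everything
    # Gap between consecutive ranks, normalized by the total spread and
    # down-weighted by rank so that early cliffs count more (assumed weights).
    scores = [((z[r] - z[r + 1]) / spread) / (r + 1) for r in range(len(z) - 1)]
    cut = max(range(len(scores)), key=lambda r: scores[r])
    return order[: cut + 1]


logits = [5.0, 4.8, 4.7, 1.0, 0.5, 0.2]
kept = relative_gap_cutoff(logits)
# The same tokens are kept after temperature scaling of the logits.
kept_hot = relative_gap_cutoff([x / 4.0 for x in logits])
```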

cross Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Authors: Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi

Abstract: We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

cross Efficient Transceiver Design for Aerial Image Transmission and Large-scale Scene Reconstruction

Authors: Zeyi Ren, Jialin Dong, Wei Zuo, Yikun Wang, Bingyang Cheng, Sheng Zhou, Zhisheng Niu

Abstract: Large-scale three-dimensional (3D) scene reconstruction in low-altitude intelligent networks (LAIN) demands highly efficient wireless image transmission. However, existing schemes struggle to balance severe pilot overhead with the transmission accuracy required to maintain reconstruction fidelity. To strike a balance between efficiency and reliability, this paper proposes a novel deep learning-based end-to-end (E2E) transceiver design that integrates 3D Gaussian Splatting (3DGS) directly into the training process. By jointly optimizing the communication modules via the combined 3DGS rendering loss, our approach explicitly improves scene recovery quality. Furthermore, this task-driven framework enables the use of a sparse pilot scheme, significantly reducing transmission overhead while maintaining robust image recovery under low-altitude channel conditions. Extensive experiments on real-world aerial image datasets demonstrate that the proposed E2E design significantly outperforms existing baselines, delivering superior transmission performance and accurate 3D scene reconstructions.

cross Generating Hadamard matrices with transformers

Authors: Geordie Williamson, Oded Yacobi, Paul Zinn-Justin

Abstract: We present a new method for constructing Hadamard matrices that combines transformer neural networks with local search in the PatternBoost framework. Our approach is designed for extremely sparse combinatorial search problems and is particularly effective for Hadamard matrices of Goethals--Seidel type, where Fourier methods permit fast scoring and optimisation. For orders between $100$ and $250$, it produces large numbers of inequivalent Hadamard matrices, and in harder cases it succeeds where local search from random initialisation fails. The largest example found by our method has order $244$. In addition to these new constructions, our experiments reveal that the transformer can discover and exploit useful hidden symmetry in the search space.
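
For context, a matrix $H$ of order $n$ with entries $\pm 1$ is Hadamard when $H H^T = n I$. The sketch below is not the authors' PatternBoost pipeline and does not produce Goethals--Seidel matrices; it only illustrates the defining property, using the classical Sylvester doubling construction (which reaches powers of two only) and a hypothetical checker `is_hadamard`.

```python
def sylvester(k):
    """Return the Sylvester Hadamard matrix of order 2**k as a list of rows."""
    h = [[1]]
    for _ in range(k):
        # Doubling step: [[H, H], [H, -H]]
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def is_hadamard(h):
    """Check that entries are +/-1 and that H H^T equals n times the identity."""
    n = len(h)
    if any(len(row) != n or any(x not in (1, -1) for x in row) for row in h):
        return False
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(h[i], h[j]))
            if dot != (n if i == j else 0):
                return False
    return True
```

A search method such as the one described would use a check of this kind (or a faster scoring surrogate) to decide whether a candidate matrix is a valid solution.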

cross Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds

Authors: Pierre Jourlin (LIA)

Abstract: This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 $\pm$ 0.041 in the zero-shot setting, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 $\pm$ 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46 $\pm$ 0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 $\pm$ 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k = 5, T = 0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures x 5 samples) reaches 46.4%. We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussa{\"i}d et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k = 3) raises EM from 0.46 to 0.48 $\pm$ 0.04. A confidence-routing cascade mechanism (Phi-4 $\rightarrow$ GPT-OSS, k = 5) achieves an EM of 0.55 $\pm$ 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in $\sim$5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.
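
Self-consistency, as used here, draws k samples at non-zero temperature and keeps the most frequent answer. A minimal sketch follows; `sample_answer` is a hypothetical callable standing in for a model call, and the lowercase/strip normalization is an assumption. The agreement ratio is returned as a diagnostic only, since the abstract reports that strong consensus on hard questions can signal collective hallucination.

```python
from collections import Counter


def self_consistency(sample_answer, question, k=5):
    """Draw k samples and return (majority answer, agreement ratio)."""
    answers = [sample_answer(question).strip().lower() for _ in range(k)]
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / k


# Usage with a stub in place of a real model: five canned samples.
canned = iter(["Paris", "paris", "Lyon", "Paris ", "Marseille"])
answer, agreement = self_consistency(lambda q: next(canned), "Capital of France?", k=5)
```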

cross Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Authors: Daniel Nichols, Konstantinos Parasyris, Caetano Melone, Tal Ben-Nun, Giorgis Georgakoudis, Harshitha Menon

Abstract: As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort. In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that Record-Remix-Replay can optimize full scientific applications better than traditional approaches over kernel parameters and compiler flags, while also being nearly an order of magnitude faster than modern evolutionary search approaches.

cross DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

Authors: Tiantian Zhang, Jierui Zuo, Wenping Wang

Abstract: This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback\_binarized, evaluate on the held-out test\_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
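
One possible reading of the three steps described above (a policy distribution over candidates, reward scores centered under that distribution, a reward-guided target distilled back into the policy) is sketched below. The abstract gives no formulas, so the exponential tilting of the policy, the temperature `beta`, and the KL distillation loss are all assumptions, and `ddo_rm_target` is a hypothetical name.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ddo_rm_target(policy_logits, rewards, beta=1.0):
    """Return (policy, target, kl) for one prompt's candidate set."""
    # Step 1: policy distribution over the finite candidate set.
    policy = softmax(policy_logits)
    # Step 2: center reward-model scores under the policy.
    baseline = sum(p * r for p, r in zip(policy, rewards))
    centered = [r - baseline for r in rewards]
    # Step 3 (assumed form): tilt the policy by the centered rewards.
    tilted = [math.log(p) + a / beta for p, a in zip(policy, centered)]
    target = softmax(tilted)
    # Distillation loss (assumed): KL(target || policy).
    kl = sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)
    return policy, target, kl
```

With two candidates this reduces to shifting probability toward the higher-reward response, while the same code handles larger candidate sets unchanged.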

cross MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

Authors: Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli, Rajkumar Buyya

Abstract: Reinforcement learning (RL) is one of the most practical ways to learn from real-life use cases. Its grounding in the cognitive methods used by humans makes it a widely accepted strategy in the field of artificial intelligence. Most environments used for RL are high-dimensional, and traditional RL algorithms become computationally expensive and struggle to learn effectively from such systems. Recent advances in the practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, and the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) has gained significant traction over the past few years. However, current quantum hardware is not sufficient to handle such high-dimensional environments with complex multi-agent setups. To tackle this issue, we propose a distributed framework for QRL in which multiple agents learn independently, distributing the load of joint training away from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on the cooperative-pong environment, and our results indicate a ~10% improvement over other distribution strategies and a ~5% improvement over classical models of policy representation.

cross AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen

Abstract: Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

cross From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

Authors: Chen Zhan, Xiaoyu Tan, Gengchen Ma, Yu-Jie Xiong, Xiaoyan Jiang, Xihe Qiu

Abstract: The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce "correct answers through flawed reasoning." This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline, Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLMs to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL's progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnostic reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.

cross Cost-optimal Sequential Testing via Doubly Robust Q-learning

Authors: Doudou Zhou, Yiran Zhang, Dian Jin, Yingye Zheng, Lu Tian, Tianxi Cai

Abstract: Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.

cross Probabilistic Prediction of Neural Dynamics via Autoregressive Flow Matching

Authors: Nicole Rogalla, Yuzhen Qin, Mario Senden, Ahmed El-Gazzar, Marcel van Gerven

Abstract: Forecasting neural activity in response to naturalistic stimuli remains a key challenge for understanding brain dynamics and enabling downstream neurotechnological applications. Here, we introduce a generative forecasting framework for modeling neural dynamics based on autoregressive flow matching (AFM). Building on recent advances in transport-based generative modeling, our approach probabilistically predicts neural responses at scale from multimodal sensory input. Specifically, we learn the conditional distribution of future neural activity given past neural dynamics and concurrent sensory input, explicitly modeling neural activity as a temporally evolving process in which future states depend on recent neural history. We evaluate our framework on the Algonauts project 2025 challenge functional magnetic resonance imaging dataset using subject-specific models. AFM significantly outperforms both a non-autoregressive flow-matching baseline and the official challenge general linear model baseline in predicting short-term parcel-wise blood oxygenation level-dependent (BOLD) activity, demonstrating improved generalization and widespread cortical prediction performance. Ablation analyses show that access to past BOLD dynamics is a dominant driver of performance, while autoregressive factorization yields consistent, modest gains under short-horizon, context-rich conditions. Together, these findings position autoregressive flow-based generative modeling as an effective approach for short-term probabilistic forecasting of neural dynamics with promising applications in closed-loop neurotechnology.

cross CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction

Authors: Hector R. Rodriguez, Jiechen Huang, Wenjian Yu

Abstract: We present CapBench, a fully reproducible, multi-PDK dataset for capacitance extraction. The dataset is derived from open-source designs, including single-core CPUs, systems-on-chip, and media accelerators. All designs are fully placed and routed using 14 independent OpenROAD flow runs spanning three technology nodes: ASAP7, NanGate45, and Sky130HD. From these layouts, we extract 61,855 3D windows across three size tiers to enable transfer learning and scalability studies. High-fidelity capacitance labels are generated using RWCap, a state-of-the-art random-walk solver, and validated against the industry-standard Raphael, achieving a mean absolute error of 0.64% for total capacitance. Each window is pre-processed into density maps, graph representations, and point clouds. We evaluate 10 machine learning architectures that illustrate dataset usage and serve as baselines, including convolutional neural networks (CNNs), point cloud transformers, and graph neural networks (GNNs). CNNs demonstrate the lowest errors (1.75%), while GNNs are up to 41.4x faster but exhibit larger errors (10.2%), illustrating a clear accuracy-speed trade-off. Code and dataset are available at https://github.com/THU-numbda/CapBench.

URLs: https://github.com/THU-numbda/CapBench

cross 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

Authors: Stefan Schulz, Fernando Edelstein, Hannah Dröge, Matthias B. Hullin, Markus Plack

Abstract: Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/

URLs: https://stefanmschulz.github.io/3DTV_webpage/

cross Regional Explanations: Bridging Local and Global Variable Importance

Authors: Salim I. Amoukou, Nicolas J-B. Brunel

Abstract: We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance's feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.

cross Trustworthy Feature Importance Avoids Unrestricted Permutations

Authors: Emanuele Borgonovo, Francesco Cappelli, Xuefei Lu, Elmar Plischke, Cynthia Rudin

Abstract: Feature importance methods using unrestricted permutations are flawed due to extrapolation errors; such errors appear in all non-trivial variable importance approaches. We propose three new approaches: conditional model reliance and Knockoffs with Gaussian transformation, and restricted ALE plot designs. Theoretical and numerical results show that our strategies reduce or eliminate extrapolation.

cross Signal-Aware Conditional Diffusion Surrogates for Transonic Wing Pressure Prediction

Authors: Víctor Francés-Belda, Carlos Sanmiguel Vila, Rodrigo Castellanos

Abstract: Accurate and efficient surrogate models for aerodynamic surface pressure fields are essential for accelerating aircraft design and analysis, yet deterministic regressors trained with pointwise losses often smooth sharp nonlinear features. This work presents a conditional denoising diffusion probabilistic model for predicting surface pressure distributions on the NASA Common Research Model wing under varying conditions of Mach number, angle of attack, and four control surface deflections. The framework operates on unstructured surface data through a principal component representation used as a non-truncated, reversible linear reparameterization of the pressure field, enabling a fully connected architecture. A signal-aware training objective is derived by propagating a reconstruction loss through the diffusion process, yielding a timestep-dependent weighting that improves fidelity in regions with strong pressure gradients. The stochastic sampling process is analyzed through repeated conditional generations, and two diagnostic metrics are introduced, the Local Reliability Index and Global Reliability Index, to relate sampling-induced spread to reconstruction error. Relative to the considered deterministic baselines, the proposed formulation reduces mean absolute error and improves the reconstruction of suction peaks, shock structures, and control surface discontinuities. The sampling-induced spread exhibits strong correspondence with surrogate error, supporting its interpretation as a qualitative reliability indicator rather than calibrated uncertainty quantification.

cross Transactional Attention: Semantic Sponsorship for KV-Cache Retention

Authors: Abhinaba Basu

Abstract: At K=16 tokens (0.4% of a 4K context), every existing KV-cache compression method achieves 0% on credential retrieval. The failure mode is dormant tokens: credentials, API keys, and configuration values that receive near-zero attention but become essential at generation time. Because these tokens lack the statistical signals that eviction policies rely on, no method based on attention scores, reconstruction loss, or learned retention gates retains them. We introduce Transactional Attention (TA), a sponsorship mechanism in which structural anchor patterns (e.g., "key:", "password:") protect adjacent value-bearing tokens from eviction. TA achieves 100% credential retrieval at K=16 where six baselines (H2O, TOVA, SnapKV, StreamingLLM, PyramidKV, DynamicKV) achieve 0%, and sustains 100% accuracy across 200 function-calling trials. TA-Fast, an attention-free variant, reduces memory overhead by 52% and is compatible with SDPA and FlashAttention. TA is orthogonal to existing compression methods and adds less than 1% latency overhead.
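The sponsorship idea in the abstract above can be illustrated with a toy eviction routine. This is a minimal sketch, not the authors' implementation: the function name `select_kept`, the `span` parameter, and the rule that protected positions claim budget before score-ranked ones are all assumptions made for illustration. Only the anchor strings "key:" and "password:" come from the abstract.

```python
# Toy KV-cache retention with anchor-based sponsorship (illustrative assumptions only).
ANCHORS = {"key:", "password:"}  # anchor examples quoted in the abstract


def select_kept(tokens, scores, budget, sponsor=True, span=1):
    """Return the sorted positions that survive eviction under a fixed budget.

    tokens : list of token strings
    scores : per-token attention scores (higher means more attended)
    budget : number of positions that may stay in the cache
    span   : hypothetical number of tokens after an anchor that are protected
    """
    protected = set()
    if sponsor:
        for i, tok in enumerate(tokens):
            if tok in ANCHORS:
                protected.update(range(i + 1, min(i + 1 + span, len(tokens))))
    # Protected positions sort first; the rest are ranked by descending score.
    ranked = sorted(range(len(tokens)), key=lambda i: (i not in protected, -scores[i]))
    return sorted(ranked[:budget])


tokens = ["the", "password:", "hunter2", "is", "stored", "here"]
scores = [0.5, 0.05, 0.01, 0.2, 0.14, 0.1]
baseline = select_kept(tokens, scores, budget=2, sponsor=False)  # [0, 3]
sponsored = select_kept(tokens, scores, budget=2, sponsor=True)  # [0, 2]
```

In this toy case a purely score-based policy evicts the credential token "hunter2" because it receives almost no attention, while the sponsored variant retains it.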

cross The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Authors: Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu, Yuxuan Zhou, Zeming Wei, Dongxian Wu, Xun Chen, Jun Sun, Meng Sun

Abstract: Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect the actual impact in real-world scenarios: (a) as models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively build harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that constrains the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.

cross BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection

Authors: Ammar Bhilwarawala, Likhamba Rongmei, Harsh Sharma, Arya Jena, Kaushal Singh, Jayashree Piri, Raghunath Dey

Abstract: IoT botnet detection has advanced, yet most published systems are validated on a single dataset and rarely generalise across environments. Heterogeneous feature spaces make multi-dataset training practically impossible without discarding semantic interpretability or introducing data integrity violations. No prior work has addressed both problems with a formally specified, reproducible methodology. This paper does. We introduce BRIDGE (Benchmark Reference for IoT Domain Generalisation Evaluation), the first formally specified heterogeneous multi-dataset benchmark for IoT intrusion detection, unifying CICIDS-2017, CIC-IoT-2023, Bot-IoT, Edge-IIoTset, and N-BaIoT through a 46-feature semantic canonical vocabulary grounded in CICFlowMeter nomenclature, with genuine-equivalence-only feature mapping, explicit zero-filling, and per-dataset coverage from 15% to 93%. A leave-one-dataset-out (LODO) protocol makes the generalisation gap precisely measurable: all five evaluated architectures achieve mean LODO F1 between 0.39 and 0.47, and we establish the first community generalisation baseline at mean LODO F1 = 0.5577, a result that shifts the agenda from single-benchmark optimisation toward cross-environment generalisation. We propose TCH-Net, a multi-branch network fusing a three-path Temporal branch (residual convolutional-BiGRU, stride-downsampled BiGRU, pre-LayerNorm Transformer), a provenance-conditioned Contextual branch, and a Statistical branch via Cross-Branch Gated Attention Fusion (CB-GAF) with learnable sigmoid gates for dynamic feature-wise mixing. Across five random seeds, TCH-Net achieves F1 = 0.8296 +/- 0.0028, AUC = 0.9380 +/- 0.0025, and MCC = 0.6972 +/- 0.0056, outperforming all twelve baselines (p < 0.05, Wilcoxon) and recording the highest LODO F1 overall. BRIDGE and the full pipeline are at https://github.com/Ammar-ss/TCH-Net.

URLs: https://github.com/Ammar-ss/TCH-Net
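The leave-one-dataset-out (LODO) protocol described in the BRIDGE abstract can be sketched as a simple split generator. This is an illustrative outline only: the helper names `lodo_splits` and `mean_lodo_f1` are hypothetical, the dataset names are taken from the abstract, and no scores from the paper are reproduced.

```python
# Leave-one-dataset-out splits: train on all datasets but one, test on the held-out one.
DATASETS = ["CICIDS-2017", "CIC-IoT-2023", "Bot-IoT", "Edge-IIoTset", "N-BaIoT"]


def lodo_splits(datasets):
    """Yield (training datasets, held-out dataset) pairs, one per dataset."""
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        yield train, held_out


def mean_lodo_f1(f1_by_held_out):
    """Average the per-fold F1 scores (keys are held-out dataset names)."""
    return sum(f1_by_held_out.values()) / len(f1_by_held_out)


splits = list(lodo_splits(DATASETS))
```

Each fold measures how a model trained on four environments transfers to the fifth, so the mean over folds summarizes the generalisation gap the abstract refers to.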

cross Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

Authors: Xiaoyu Ma, Yiwen Li, Haoyue Liu, Zhichao Wang, Ye Chen, Yongxin Guo, Xiaoying Tang

Abstract: Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.
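The (1-1/e) guarantee mentioned in the POES abstract is the classical bound for greedy maximization of a monotone submodular function under a cardinality constraint. The sketch below shows that greedy step for a facility-location coverage term alone. It is a simplified illustration under stated assumptions: the similarity matrix is made up, similarities are assumed non-negative, and the IRT-based utility, switching costs, and adaptive controller from the abstract are not modeled.

```python
def greedy_facility_location(sim, k):
    """Greedily pick k examples maximizing sum_i max_{j in S} sim[i][j].

    sim[i][j] is a non-negative similarity between example i and candidate j.
    Returns the selected indices in pick order and the final coverage value.
    """
    n = len(sim)
    selected = []
    best = [0] * n  # best[i]: similarity of example i to its closest selected item
    for _ in range(min(k, n)):
        best_gain, best_j = -1, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - best[i], 0) for i in range(n))
            if gain > best_gain:  # strict comparison: ties go to the lowest index
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected, sum(best)


# Hypothetical similarities: two clusters {0, 1, 2} and {3, 4, 5} with central items 1 and 4.
SIM = [
    [10, 8, 5, 0, 0, 0],
    [8, 10, 8, 0, 0, 0],
    [5, 8, 10, 0, 0, 0],
    [0, 0, 0, 10, 7, 4],
    [0, 0, 0, 7, 10, 7],
    [0, 0, 0, 4, 7, 10],
]
chosen, coverage = greedy_facility_location(SIM, k=2)  # ([1, 4], 50)
```

With a budget of two, the greedy rule picks the central item of each cluster, which is the behaviour a coverage term is meant to encourage.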

cross CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy

Authors: Zehao Qin, Xiaojian Lin, Ping Zhang, Hongliang Wu, Xinkang Wang, Guangling Liu, Bo Chen, Wenming Yang, Guijin Wang

Abstract: Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.

cross GlobalCY I: A JAX Framework for Globally Defined and Symmetry-Aware Neural Kähler Potentials

Authors: Abdul Rahman

Abstract: We present GlobalCY, a JAX-based framework for globally defined and symmetry-aware neural Kähler-potential models on projective hypersurface Calabi--Yau geometries. The central problem is that local-input neural Kähler-potential models can train successfully while still failing the geometry-sensitive diagnostics that matter in hard quartic regimes, especially near singular and near-singular members of the Cefalú family. To study this, we compare three model families -- a local-input baseline, a globally defined invariant model, and a symmetry-aware global model -- on the hard Cefalú cases $\lambda=0.75$ and $\lambda=1.0$ using a fixed multi-seed protocol and a geometry-aware diagnostic suite. In this benchmark, the globally defined invariant model is the strongest overall family, outperforming the local baseline on the two clearest geometric comparison metrics, negative-eigenvalue frequency and projective-invariance drift, in both cases. The gains are strongest at $\lambda=0.75$, while $\lambda=1.0$ remains more difficult. The current symmetry-aware model improves projective-invariance drift relative to the local baseline, but does not yet surpass the plain global invariant model. These results show that global invariant structure is a meaningful architectural constraint for learned Kähler-potential modeling in hard quartic Calabi--Yau settings.

cross Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books

Authors: Argyrios Papoudakis, Mirella Lapata, Frank Keller

Abstract: Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.

cross From Attribution to Action: A Human-Centered Application of Activation Steering

Authors: Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

Abstract: Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

cross ADD for Multi-Bit Image Watermarking

Authors: An Luo, Jie Ding

Abstract: As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100% decoding accuracy, with performance dropping by at most 2% under a wide range of image distortions, substantially smaller than the 14% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.
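The add-then-dot structure described in the ADD abstract can be illustrated with a toy example. This sketch is not the paper's method: ADD learns its watermark, whereas here fixed orthogonal Hadamard rows stand in for the learned patterns, the "image" is a 64-value ramp, and the strength `alpha` is an arbitrary choice. All function names are hypothetical.

```python
def hadamard_row(k, n):
    """Row k of the n x n Sylvester Hadamard matrix (n must be a power of two)."""
    return [1 - 2 * (bin(k & i).count("1") % 2) for i in range(n)]


def embed(image, bits, patterns, alpha):
    """Add: shift the image by +alpha * pattern for a 1 bit, -alpha * pattern for a 0 bit."""
    out = list(image)
    for bit, pattern in zip(bits, patterns):
        sign = 1 if bit else -1
        for i, p in enumerate(pattern):
            out[i] += alpha * sign * p
    return out


def decode(image, patterns):
    """Dot and decode: each bit is the sign of the inner product with its pattern."""
    return [1 if sum(x * p for x, p in zip(image, pattern)) > 0 else 0 for pattern in patterns]


N = 64
patterns = [hadamard_row(k, N) for k in range(1, 9)]  # 8 mutually orthogonal patterns
cover = [i / (N - 1) for i in range(N)]               # toy "image": a ramp in [0, 1]
message = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(cover, message, patterns, alpha=0.1)
recovered = decode(marked, patterns)                  # equals message
```

Because the patterns are orthogonal, the bits do not interfere with each other, and decoding succeeds as long as the watermark's contribution to each inner product outweighs that of the cover image. A learned watermark, as in the abstract, would presumably be optimized for this margin under distortions rather than fixed in advance.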

cross Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

Abstract: Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.

cross Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

Authors: I. Esra Buyuktahtakin

Abstract: Artificial intelligence (AI) is moving increasingly beyond prediction to support decisions in complex, uncertain, and dynamic environments. This shift creates a natural intersection with operations research and management sciences (OR/MS), which have long offered conceptual and methodological foundations for sequential decision-making under uncertainty. At the same time, recent advances in deep learning, including feedforward neural networks, LSTMs, transformers, and deep reinforcement learning, have expanded the scope of data-driven modeling and opened new possibilities for large-scale decision systems. This tutorial presents an OR/MS-centered perspective on deep learning for sequential decision-making under uncertainty. Its central premise is that deep learning is valuable not as a replacement for optimization, but as a complement to it. Deep learning brings adaptability and scalable approximation, whereas OR/MS provides the structural rigor needed to represent constraints, recourse, and uncertainty. The tutorial reviews key decision-making foundations, connects them to the major neural architectures in modern AI, and discusses leading approaches to integrating learning and optimization. It also highlights emerging impact in domains such as supply chains, healthcare and epidemic response, agriculture, energy, and autonomous operations. More broadly, it frames these developments as part of a wider transition from predictive AI toward decision-capable AI and highlights the role of OR/MS in shaping the next generation of integrated learning--optimization systems.

cross Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Authors: Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang

Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.

cross The Price of Ignorance: Information-Free Quotation for Data Retention in Machine Unlearning

Authors: Bin Han, Di Feng, Zexin Fang, Jie Wang, Hans D. Schotten

Abstract: When users exercise data deletion rights under the General Data Protection Regulation (GDPR) and similar regulations, mobile network operators face a tradeoff: excessive machine unlearning degrades model accuracy and incurs retraining costs, yet existing pricing mechanisms for data retention require the server to know every user's private privacy and accuracy preferences, which is infeasible under the very regulations that motivate unlearning. We ask: what is the welfare cost of operating without this private information? We design an information-free ascending quotation mechanism where the server broadcasts progressively higher prices and users self-select their data supply, requiring no knowledge of users' parameters. Under complete information, the protocol admits a unique subgame-perfect Nash equilibrium characterized by single-period selling. We formalize the Price of Ignorance -- the welfare gap between optimal personalized pricing (which knows everything) and our information-free quotation (which knows nothing) -- and prove a three-regime efficiency ordering. Numerical evaluation across seven mechanisms and 5000 Monte Carlo runs shows that this price is near zero: the information-free mechanism achieves >=99% of the welfare of its information-intensive benchmarks, while providing noise-robust guarantees and comparable fairness.

cross Machine-learning modeling of magnetization dynamics in quasi-equilibrium and driven metallic spin systems

Authors: Gia-Wei Chern, Yunhao Fan, Sheng Zhang, Puhan Zhang

Abstract: We review recent advances in machine-learning (ML) force-field methods for large-scale Landau-Lifshitz-Gilbert (LLG) simulations of metallic spin systems. We generalize the Behler-Parrinello (BP) ML architecture -- originally developed for quantum molecular dynamics -- to construct scalable and transferable ML models capable of capturing the intricate dependence of electron-mediated exchange fields on the local magnetic environment characteristic of itinerant magnets. A central ingredient of this framework is the implementation of symmetry-aware magnetic descriptors based on group-theoretical bispectrum formalisms. Leveraging these ML force fields, LLG simulations faithfully reproduce hallmark non-collinear magnetic orders -- such as the $120^\circ$ and tetrahedral states -- on the triangular lattice, and successfully capture the complex spin textures emerging in the mixed-phase states of a square-lattice double-exchange model under thermal quench. We further discuss a generalized potential theory that extends the BP formalism to incorporate both conservative and nonconservative electronic torques, thereby enabling ML models to learn nonequilibrium exchange fields from computationally demanding microscopic approaches such as nonequilibrium Green's-function techniques. This extension yields quantitatively accurate predictions of voltage-driven domain-wall motion and establishes a foundation for quantum-accurate, multiscale modeling of nonequilibrium spin dynamics and spintronic functionalities.

cross Human Centered Non Intrusive Driver State Modeling Using Personalized Physiological Signals in Real World Automated Driving

Authors: David Puertas-Ramirez, Raul Fernandez-Matellan, David Martin Gomez, Jesus G. Boticario

Abstract: In vehicles with partial or conditional driving automation (SAE Levels 2-3), the driver remains responsible for supervising the system and responding to take-over requests. Therefore, reliable driver monitoring is essential for safe human-automation collaboration. However, most existing Driver Monitoring Systems rely on generalized models that ignore individual physiological variability. In this study, we examine the feasibility of personalized driver state modeling using non-intrusive physiological sensing during real-world automated driving. We conducted experiments in an SAE Level 2 vehicle using an Empatica E4 wearable sensor to capture multimodal physiological signals, including electrodermal activity, heart rate, temperature, and motion data. To leverage deep learning architectures designed for images, we transformed the physiological signals into two-dimensional representations and processed them using a multimodal architecture based on pre-trained ResNet50 feature extractors. Experiments across four drivers demonstrate substantial interindividual variability in physiological patterns related to driver awareness. Personalized models achieved an average accuracy of 92.68%, whereas generalized models trained on multiple users dropped to an accuracy of 54%, revealing substantial limitations in cross-user generalization. These results underscore the necessity of adaptive, personalized driver monitoring systems for future automated vehicles and imply that autonomous systems should adapt to each driver's unique physiological profile.

cross Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Authors: Artem Gadzhiev, Andrew Kislov

Abstract: Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents -- sliding windows, summarization, embedding-based RAG, and flat fact extraction -- each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.

cross Minimizing classical resources in variational measurement-based quantum computation for generative modeling

Authors: Arunava Majumder, Hendrik Poulsen Nautrup, Hans J. Briegel

Abstract: Measurement-based quantum computation (MBQC) is a framework for quantum information processing in which a computational task is carried out through one-qubit measurements on a highly entangled resource state. Due to the indeterminacy of the outcomes of a quantum measurement, the random outcomes of these operations, if not corrected, yield a variational quantum channel family. Traditionally, this randomness is corrected through classical processing in order to ensure deterministic unitary computations. Recently, variational measurement-based quantum computation (VMBQC) has been introduced to exploit this measurement-induced randomness to gain an advantage in generative modeling. A limitation of this approach is that the corresponding channel model has twice as many parameters compared to the unitary model, scaling as $N \times D$, where $N$ is the number of logical qubits (width) and $D$ is the depth of the VMBQC model. This can often make optimization more difficult and may lead to poorly trainable models. In this paper, we present a restricted VMBQC model that extends the unitary setting to a channel-based one using only a single additional trainable parameter. We show, both numerically and algebraically, that this minimal extension is sufficient to generate probability distributions that cannot be learned by the corresponding unitary model.

cross A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Authors: Olga Chetverina

Abstract: Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
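
A minimal sketch of the triad partitioning described in this abstract, covering the integer part only. The marker spellings (`<K>`, `<M>`, ...) and the function names are hypothetical; the abstract does not specify token formats, and this is not the author's implementation.

```python
# Hypothetical magnitude markers (units, thousands, millions, billions, trillions).
# The abstract does not give the actual token spellings.
INT_MARKERS = ["", "<K>", "<M>", "<B>", "<T>"]


def triads(int_digits: str) -> list[str]:
    """Split an integer digit string into three-digit groups, aligned from the right."""
    if not (int_digits.isascii() and int_digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    first = len(int_digits) % 3 or 3  # length of the leading (possibly short) group
    groups = [int_digits[:first]]
    groups += [int_digits[i:i + 3] for i in range(first, len(int_digits), 3)]
    return groups


def tokenize_integer(int_digits: str) -> list[str]:
    """Attach one fixed magnitude marker to each triad, most significant first."""
    groups = triads(int_digits)
    if len(groups) > len(INT_MARKERS):
        raise ValueError("number exceeds the marker table of this sketch")
    tokens = []
    for pos, group in enumerate(groups):
        power = len(groups) - 1 - pos  # 0 = units, 1 = thousands, ...
        tokens.append(group + INT_MARKERS[power])
    return tokens
```

For example, `tokenize_integer("1234567")` yields `["1<M>", "234<K>", "567"]`: the digits are preserved exactly and each triad carries its order of magnitude explicitly.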

cross Computation of Least Trimmed Squares: A Branch-and-Bound framework with Hyperplane Arrangement Enhancements

Authors: Xiang Meng, Andr\'es G\'omez, Rahul Mazumder

Abstract: We study computational aspects of a key problem in robust statistics -- the penalized least trimmed squares (LTS) regression problem, a robust estimator that mitigates the influence of outliers in data by capping residuals with large magnitudes. Although statistically attractive, penalized LTS is NP-hard, and existing mixed-integer optimization (MIO) formulations scale poorly due to weak relaxations and exponential worst-case complexity in the number of observations. We propose a new MIO formulation that embeds hyperplane arrangement logic into a perspective reformulation, explicitly enforcing structural properties of optimal solutions. We show that, if the number of features is fixed, the resulting branch-and-bound tree is of polynomial size in the sample size. Moreover, we develop a tailored branch-and-bound algorithm that uses first-order methods with dual bounds to solve node relaxations efficiently. Computational experiments on synthetic and real datasets demonstrate substantial improvements over existing MIO approaches: on synthetic instances with 5000 samples and 20 features, our tailored solver reaches a 1% gap in 1 minute while competing approaches fail to do so within one hour. These gains enable exact robust regression at significantly larger sample sizes in low-dimensional settings.
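
As a point of reference, the classical (unpenalized) least trimmed squares objective is the sum of the h smallest squared residuals. The sketch below evaluates it for a one-feature line. The function name and the one-dimensional setup are illustrative assumptions, and the penalized formulation and branch-and-bound solver from the abstract are not reproduced here.

```python
def lts_objective(xs, ys, slope, intercept, h):
    """Classical LTS objective: sum of the h smallest squared residuals
    of the line y = slope * x + intercept (illustrative, single feature)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if not 1 <= h <= len(xs):
        raise ValueError("h must satisfy 1 <= h <= number of observations")
    squared = sorted((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sum(squared[:h])  # the largest n - h residuals are trimmed away
```

With `xs = [0, 1, 2, 3]`, `ys = [0, 1, 2, 100]` and the line `y = x`, choosing `h = 3` trims the single outlier and gives an objective of 0, whereas `h = 4` (ordinary least squares) gives 9409.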

cross Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo

Abstract: To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against the environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.

cross CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Authors: Jinpeng Ye, Chongxi Wang, Wenqing Li, Bin Yuan, Shiyi Wang, Fenglu Zhang, Junyu Yue, Jianan Xie, Yunhao Ye, Haoyu Deng, Yingkun Zhou, Xin Cheng, Fuxin Zhang, Jian Wang

Abstract: Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels. This paper proposes a unified and configurable CPU matrix extension architecture. By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies matrix-vector overlap, and supports a unified software stack. The architecture is integrated into four open-source CPU RTL platforms and evaluated on representative AI models. Matrix unit utilization under GEMM workloads exceeds 90% across all platforms. When configured with compute throughput and memory bandwidth comparable to Intel AMX, our design achieves speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3, with over 30% of the gains attributed to overlapped matrix-vector execution. A 4 TOPS@2GHz matrix unit occupies only 0.53 mm\textsuperscript{2} in 14nm CMOS. These results demonstrate strong cross-platform adaptability and effective hardware-software co-optimization, offering a practical matrix extension for the open-source community.

cross RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Authors: Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen

Abstract: Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

cross GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs

Authors: Lara D'Agata, Carlos Agull\'o-Domingo, \'Oscar Vera-L\'opez, Kaustubh Shivdikar, Ardhi W. B. Yudha, Ferhat Yaman, David Kaeli, Jos\'e L. Abell\'an, Ian Colbert, Jos\'e Cano

Abstract: Fully homomorphic encryption (FHE) has recently attracted significant attention as both a cryptographic primitive and a systems challenge. Given the latest advances in accelerated computing, FHE presents a promising opportunity for progress, with applications ranging from machine learning to information security. We target the most computationally intensive operation in deep neural networks from a hardware perspective, matrix multiplication (matmul), and adapt it for execution on AMD GPUs. We propose a new optimized method that improves the runtime and complexity of ciphertext matmul by using FIDESlib, a recent open-source FHE library designed specifically for GPUs. By exploiting sparsity in both operands, our sparse matmul implementation outperforms its CPU counterpart by up to $3.0\times$ and reduces the time complexity from cubic to semi-linear, demonstrating an improvement over existing FHE matmul implementations.

cross Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Authors: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

cross Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

Authors: Jieying Xue, Phuong Minh Nguyen, Ha Thanh Nguyen, May Myo Zin, Ken Satoh

Abstract: This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.

URLs: https://github.com/yingjie7/Legal2LogicICL

cross Evaluating Cooperation in LLM Social Groups through Elected Leadership

Authors: Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin

Abstract: Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.

cross Universality of first-order methods on random and deterministic matrices

Authors: Nicola Gorini, Chris Jones, Dmitriy Kunisky, Lucas Pesenti

Abstract: General first-order methods (GFOM) are a flexible class of iterative algorithms which update a state vector by matrix-vector multiplications and entrywise nonlinearities. A long line of work has sought to understand the large-n dynamics of GFOM, mostly focusing on "very random" input matrices and the approximate message passing (AMP) special case of GFOM whose state is asymptotically Gaussian. Yet, it has long remained unknown how to construct iterative algorithms that retain this Gaussianity for more structured inputs, or why existing AMP algorithms can be as effective for some deterministic matrices as they are for random matrices. We analyze diagrammatic expansions of GFOM via the limiting traffic distribution of the input matrix, the collection of all limiting values of permutation-invariant polynomials in the matrix entries, to obtain the following results: 1. We calculate the traffic distribution for the first non-trivial deterministic matrices, including (minor variants of) the Walsh-Hadamard and discrete sine and cosine transform matrices. This determines the limiting dynamics of GFOM on these inputs, resolving parts of longstanding conjectures of Marinari, Parisi, and Ritort (1994). 2. We design a new AMP iteration which unifies several previous AMP variants and generalizes to new input types, whose limiting dynamics are Gaussian conditional on some latent random variables. The asymptotic dynamics hold for a large and natural class of traffic distributions (encompassing both random and deterministic input matrices) and the algorithm's analysis gives a simple combinatorial interpretation of the Onsager correction, answering questions posed recently by Wang, Zhong, and Fan (2022).

cross Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions

Authors: Manuela Gonz\'alez-Gonz\'alez, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

Abstract: Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role in leading individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial and vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

cross LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Authors: Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu

Abstract: Continuous diffusion has been the foundation of high-fidelity, controllable, and few-step generation of many data modalities such as images. However, in language modeling, prior continuous diffusion language models (DLMs) lag behind discrete counterparts due to the sparse data space and the underexplored design space. In this work, we close this gap with LangFlow, the first continuous DLM to rival discrete diffusion, by connecting embedding-space DLMs to Flow Matching via Bregman divergence, alongside three key innovations: (1) we derive a novel ODE-based NLL bound for principled evaluation of continuous flow-based language models; (2) we propose an information-uniform principle for setting the noise schedule, which motivates a learnable noise scheduler based on a Gumbel distribution; and (3) we revise prior training protocols by incorporating self-conditioning, as we find it improves both likelihood and sample quality of embedding-space DLMs with effects substantially different from discrete diffusion. Putting everything together, LangFlow rivals top discrete DLMs on both the perplexity (PPL) and the generative perplexity (Gen. PPL), reaching a PPL of 30.0 on LM1B and 24.6 on OpenWebText. It even exceeds autoregressive baselines in zero-shot transfer on 4 out of 7 benchmarks. LangFlow provides the first clear evidence that continuous diffusion is a promising paradigm for language modeling. Homepage: https://github.com/nealchen2003/LangFlow

URLs: https://github.com/nealchen2003/LangFlow

cross MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI

Authors: Paula Arguello, Berk Tinaz, Mohammad Shahab Sepehri, Maryam Soltanolkotabi, Mahdi Soltanolkotabi

Abstract: Deep learning underpins a wide range of applications in MRI, including reconstruction, artifact removal, and segmentation. However, progress has been driven largely by public datasets focused on brain and knee imaging, shaping how models are trained and evaluated. As a result, careful studies of the reliability of these models across diverse anatomical settings remain limited. In this work, we introduce MosaicMRI, a large and diverse collection of fully sampled raw musculoskeletal (MSK) MR measurements designed for training and evaluating machine-learning-based methods. MosaicMRI is the largest open-source raw MSK MRI dataset to date, comprising 2,671 volumes and 80,156 slices. The dataset offers substantial diversity in volume orientation (e.g., axial, sagittal), imaging contrasts (e.g., PD, T1, T2), anatomies (e.g., spine, knee, hip, ankle, and others), and numbers of acquisition coils. Using VarNet as a baseline for the accelerated reconstruction task, we perform a comprehensive set of experiments to study scaling behavior with respect to both model capacity and dataset size. Interestingly, models trained on the combined anatomies significantly outperform anatomy-specific models in low-sample regimes, highlighting the benefits of anatomical diversity and the presence of exploitable cross-anatomical correlations. We further evaluate robustness and cross-anatomy generalization by training models on one anatomy (e.g., spine) and testing them on another (e.g., knee). Notably, we identify groups of body parts (e.g., foot and elbow) that generalize well with each other, and highlight that performance under domain shifts depends on training set size, anatomy, and protocol-specific factors.

replace The Wasserstein transform

Authors: Kun Jin, Facundo M\'emoli, Zane Smith, Zhengchao Wan

Abstract: We introduce the Wasserstein Transform (WT), a general unsupervised framework for updating distance structures on given data sets with the purpose of enhancing features and denoising. Our framework represents each data point by a probability measure reflecting the neighborhood structure of the point, and then updates the distance by computing the Wasserstein distance between these probability measures. The Wasserstein Transform is a general method which extends the mean shift family of algorithms. We study several instances of WT, and in particular, in one of the instances which we call the Gaussian Transform (GT), we utilize Gaussian measures to model neighborhood structures of individual data points. GT is computationally cheaper than other instances of WT since a closed-form solution exists for the $\ell^2$-Wasserstein distance between Gaussian measures. We study the relationship between different instances of WT and prove that each of the instances is stable under perturbations. We devise iterative algorithms for performing the above-mentioned WT and propose several strategies to accelerate GT, such as an observation from linear algebra for reducing the number of matrix square root computations. We examine the performance of the Wasserstein Transform method in many tasks, such as denoising, clustering, image segmentation and word embeddings.
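
For context, the closed form mentioned in this abstract is the standard expression for the $\ell^2$-Wasserstein distance between two Gaussian measures. The symbols $m_i$ (means) and $\Sigma_i$ (covariance matrices) are notation chosen here, not taken from the abstract:

```latex
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
  = \lVert m_1 - m_2 \rVert_2^2
  + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
  - 2\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Bigr)
```

The matrix square roots in the trace term are the costly part of each evaluation, which is presumably why the abstract highlights reducing the number of matrix square root computations.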

replace AXIL: Exact Instance Attribution for Gradient Boosting

Authors: Paul Geertsema, Helen Lu

Abstract: We derive an exact, prediction-specific instance-attribution method for fitted gradient boosting machines (GBMs) trained with squared-error loss, with the learned tree structure held fixed. Each prediction can be written as a weighted sum of training targets, with coefficients determined only by the fitted tree structure and learning rate. These coefficients are exact instance attributions, or AXIL weights. Our main algorithmic contribution is a matrix-free backward operator that computes one AXIL attribution vector in O(TN) time, or S vectors in O(TNS), without materialising the full N x N matrix. This extends to out-of-sample predictions and makes exact instance attribution practical for large datasets. AXIL yields exact fixed-structure sensitivity by construction in target-perturbation tests, where competing GBM-specific attribution methods (BoostIn, TREX, and LeafInfluence) generally fail. In retraining-based faithfulness tests on 20 regression datasets, AXIL achieves the highest faithfulness score on 14 datasets and statistically ties for the best on 4 others, while also running substantially faster than the competing methods. We also show that the AXIL weight matrix is the globally constant special case of a target-response Jacobian that provides first-order instance attribution for any differentiable learner via implicit differentiation, placing the exact decomposition inside a broader framework.
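
The claim that each prediction is a weighted sum of training targets can be illustrated with a small dense sketch. It assumes unregularized squared-error boosting in which the initial prediction is the mean target and each leaf value is the mean residual of the training points in that leaf. Trees are represented only by hypothetical per-point leaf ids, and all names are illustrative. This builds the full N x N matrix explicitly, so it is not the matrix-free backward operator described in the abstract.

```python
def leaf_average_matrix(leaves):
    """P[i][j] = 1/|leaf(i)| if points i and j share a leaf, else 0."""
    n = len(leaves)
    counts = {}
    for leaf in leaves:
        counts[leaf] = counts.get(leaf, 0) + 1
    return [[1.0 / counts[leaves[i]] if leaves[i] == leaves[j] else 0.0
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def weight_matrix(trees, lr, n):
    """K such that in-sample predictions equal K @ y for fixed tree structure.
    Recursion: K_0 = (1/n) * ones, K_t = K_{t-1} + lr * P_t (I - K_{t-1})."""
    k = [[1.0 / n] * n for _ in range(n)]
    for leaves in trees:
        p = leaf_average_matrix(leaves)
        resid_op = [[(1.0 if i == j else 0.0) - k[i][j] for j in range(n)]
                    for i in range(n)]
        update = matmul(p, resid_op)
        k = [[k[i][j] + lr * update[i][j] for j in range(n)] for i in range(n)]
    return k


def boosted_predictions(trees, lr, y):
    """Ordinary boosting loop on the same fixed leaf assignments, for comparison."""
    n = len(y)
    f = [sum(y) / n] * n
    for leaves in trees:
        sums, counts = {}, {}
        for i, leaf in enumerate(leaves):
            sums[leaf] = sums.get(leaf, 0.0) + (y[i] - f[i])
            counts[leaf] = counts.get(leaf, 0) + 1
        f = [f[i] + lr * sums[leaves[i]] / counts[leaves[i]] for i in range(n)]
    return f
```

Multiplying the rows of `weight_matrix(trees, lr, n)` by the target vector reproduces `boosted_predictions(trees, lr, y)` up to floating-point error, and each row sums to one, so the entries can be read as attribution weights over training instances under these assumptions.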

replace SIGMA: An Efficient Heterophilous Graph Neural Network with Fast Global Aggregation

Authors: Haoyu Liu, Ningyi Liao, Siqiang Luo

Abstract: Graph neural networks (GNNs) have achieved great success in graph learning but suffer from performance loss under heterophily, i.e., when neighboring nodes are dissimilar, due to their local and uniform aggregation. Existing heterophilous GNNs incorporate long-range or global aggregations to distinguish nodes in the graph. However, these aggregations usually require iteratively maintaining and updating full-graph information, which limits their efficiency when applied to large-scale graphs. In this paper, we propose SIGMA, an efficient global heterophilous GNN aggregation integrating the structural similarity measurement SimRank. Our theoretical analysis illustrates that SIGMA inherently captures distant global similarity even under heterophily, which conventional approaches can only achieve after iterative aggregations. Furthermore, it enjoys efficient one-time computation with a complexity only linear in the node set size, $\mathcal{O}(n)$. Comprehensive evaluation demonstrates that SIGMA achieves state-of-the-art performance with superior aggregation and overall efficiency. Notably, it obtains $5\times$ acceleration on the large-scale heterophily dataset pokec with over 30 million edges compared to the best baseline aggregation.

replace Incentivizing Honesty among Competitors in Collaborative Learning and Optimization

Authors: Florian E. Dorner, Nikola Konstantinov, Georgi Pashaliev, Martin Vechev

Abstract: Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity's data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Lastly, we empirically demonstrate the effectiveness of our incentive scheme on a standard non-convex federated learning benchmark. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning.

replace MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

Authors: Lai Wei, Xiaozhe Li, Zihao Jiang, Weiran Huang, Lichao Sun

Abstract: Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to assess the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. By employing this method, MM-LIMA outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that a smaller amount of high-quality instruction tuning data is efficient in enabling multimodal large language models to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.

URLs: https://github.com/waltonfuture/InstructionGPT-4

replace CROP: Conservative Reward for Model-based Offline Policy Optimization

Authors: Hao Li, Xiao-Hu Zhou, Shu-Hai Li, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zeng-Guang Hou

Abstract: Offline reinforcement learning (RL) aims to optimize a policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges because of their capability to mitigate the limitations of data coverage through data generation using models. Nonetheless, a prevalent issue in offline RL is the overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that concurrently minimizes estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to a conservative policy evaluation and mitigates distribution shift. Experiments showcase that with the simple modification to reward estimation, CROP can conservatively estimate the reward and achieve competitive performance with existing methods. The source code will be available after acceptance.

replace Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management

Authors: Huiling Meng, Ningyuan Chen, Xuefeng Gao

Abstract: Intensity control is a class of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we propose a practical continuous-time reinforcement learning framework for intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state space, a large action space, and a continuous time horizon. We show that by leveraging the event-driven structure of the problem and the inherent discretization of sample paths created by the state-jump times, a defining feature of intensity control, one does not need to discretize the time horizon in advance. We adapt discrete-time Monte Carlo and temporal difference learning algorithms for policy evaluation to continuous time and develop policy-gradient-based actor-critic algorithms for event-driven intensity control. Through a comprehensive numerical study, we evaluate the proposed approach against various state-of-the-art benchmarks, demonstrating its overall superior performance and effective scalability to large-scale problems. Notably, compared to discretization-based reinforcement learning methods, our continuous-time approach delivers significantly superior performance while maintaining comparable computational efficiency. This advantage is particularly pronounced in highly non-stationary environments.

replace Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft

Authors: Yifei Li, Erik-Jan van Kampen

Abstract: The symmetry of dynamical systems can be exploited for state-transition prediction and to facilitate control policy optimization. This paper leverages system symmetry to develop sample-efficient offline reinforcement learning (RL) approaches. Under the symmetry assumption for a Markov Decision Process (MDP), a symmetric data augmentation method is proposed. The augmented samples are integrated into the dataset of Deep Deterministic Policy Gradient (DDPG) to enhance its coverage rate of the state-action space. Furthermore, sample utilization efficiency is improved by introducing a second critic trained on the augmented samples, resulting in a dual-critic structure. The aircraft's model is verified to be symmetric, and flight control simulations demonstrate accelerated policy convergence when augmented samples are employed.
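
As a rough illustration of symmetric data augmentation, the sketch below mirrors each stored transition through a sign-flip symmetry and appends the mirrored copy to the buffer. The sign patterns, the assumption that the reward is invariant under the symmetry, and the function names are hypothetical; the abstract does not specify the aircraft's actual symmetry map.

```python
def mirror_transition(transition, state_signs, action_signs):
    """Return the mirrored copy of (state, action, reward, next_state).

    state_signs / action_signs hold +1 or -1 per component and encode an
    assumed reflection symmetry; the reward is assumed to be invariant.
    """
    state, action, reward, next_state = transition

    def flip(vec, signs):
        if len(vec) != len(signs):
            raise ValueError("sign pattern length must match vector length")
        return tuple(sg * x for sg, x in zip(signs, vec))

    return (flip(state, state_signs), flip(action, action_signs),
            reward, flip(next_state, state_signs))

def augment_buffer(buffer, state_signs, action_signs):
    """Original transitions followed by their mirrored counterparts."""
    mirrored = [mirror_transition(t, state_signs, action_signs) for t in buffer]
    return list(buffer) + mirrored

# Toy usage: first state component and the single action change sign under reflection.
buffer = [((0.1, 2.0), (0.5,), -1.0, (0.2, 2.0))]
augmented = augment_buffer(buffer, state_signs=(-1, 1), action_signs=(-1,))
print(len(augmented))
```

In a dual-critic setup like the one the abstract mentions, the mirrored half of the buffer could be routed to the second critic, though how the paper wires this is not stated here.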

replace Adversarial Robustness of Graph Transformers

Authors: Philipp Foth, Lukas Gosch, Simon Geisler, Leo Schwinn, Stephan G\"unnemann

Abstract: Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. We close this gap and design the first adaptive attacks for GTs. In particular, we provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on pairwise shortest paths, random walks, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification, and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.

replace Poisoning with A Pill: Circumventing Detection in Federated Learning

Authors: Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang

Abstract: Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training, named as pill construction, pill poisoning, and pill injection accordingly. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, and can gain an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.

replace FedQUIT: On-Device Federated Unlearning via a Quasi-Competent Virtual Teacher

Authors: Alessio Mora, Lorenzo Valerio, Paolo Bellavista, Andrea Passarella

Abstract: Federated Learning (FL) enables the collaborative training of machine learning models without requiring centralized collection of user data. To comply with the right to be forgotten, FL clients should be able to request the removal of their data contributions from the global model. In this paper, we propose FedQUIT, a novel unlearning algorithm that operates directly on the client devices that request removal of their contributions. Our method leverages knowledge distillation to remove the influence of the target client's data from the global model while preserving its generalization ability. FedQUIT adopts a teacher-student framework, where a modified version of the current global model serves as a virtual teacher and the client's model acts as the student. The virtual teacher is obtained by manipulating the global model's outputs on forget data, penalizing the confidence assigned to the true class while preserving relationships among outputs of non-true classes, to simultaneously induce forgetting and retain useful knowledge. As a result, FedQUIT achieves unlearning without making any additional assumption over the standard FedAvg protocol. Evaluation across diverse datasets, data heterogeneity levels, and model architectures shows that FedQUIT achieves superior or comparable unlearning efficacy compared to six state-of-the-art methods, while significantly reducing cumulative communication and computational overhead relative to retraining from scratch.

replace A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

Authors: James Enouen, Mahito Sugiyama

Abstract: The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its large popularity in statistical mechanics and high-dimensional statistics, the majority of related energy-based models only focus on the two-variable relationships, such as Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure which exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice. We develop an algorithm called MAHGenTa which leverages a novel Monte-Carlo sampling technique for energy-based models alongside a greedy heuristic for incorporating statistical robustness. On both synthetic and real-world datasets, we demonstrate our algorithm's effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.

replace Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Authors: Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

Abstract: Transformers and large language models~(LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a ``memory wall'', i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward, and update phases generates fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5$\times$ faster iterations over state-of-the-art approaches using extensive experiments.

replace The Phantom of PCIe: Constraining Generative Artificial Intelligences for Practical Peripherals Trace Synthesizing

Authors: Zhibai Huang, Chen Chen, James Yen, Yihan Shen, Yongchen Xie, Zhixiang Wei, Kailiang Xu, Yun Wang, Fangxin Liu, Tao Song, Mingyuan Xia, Zhengwei Qi

Abstract: Peripheral Component Interconnect Express (PCIe) is the de facto interconnect standard for high-speed peripherals and CPUs. The development of PCIe devices for emerging applications requires realistic Transaction Layer Packet (TLP) traces that accurately simulate device-CPU interactions. While generative AI offers a promising avenue for synthesizing complex TLP sequences, it is prone to a critical challenge inherent in all generation tasks: hallucination. Naively applying these models often produces traces that violate fundamental PCIe protocol rules, such as ordering and causality, rendering them unusable for device simulation. To resolve this, our work introduces a methodology to bridge the gap between generative AI and high-fidelity device simulation. This paper presents Phantom, a framework that systematically addresses AI-generated hallucinations in TLP synthesis. Phantom achieves this by coupling a generative backbone with a novel post-processing filter that enforces PCIe-specific constraints, effectively eliminating invalid TLP sequences. We validate Phantom's effectiveness by synthesizing TLP traces for an actual PCIe network interface card. Experimental results show that Phantom produces practical, large-scale TLP traces, significantly outperforming existing models, with improvements of up to 1000$\times$ in task-specific metrics and up to 2.19$\times$ in Fr\'echet Inception Distance (FID) compared to backbone-only methods. The prototype implementation has been made open-source.

replace Graph Retention Networks for Dynamic Graphs

Authors: Qian Chang, Xia Li, Xiufeng Cheng, Runsong Jia, Jinqing Yang, Guoping Hu, Ciprian Doru Giurcaneanu

Abstract: In this paper, we propose Graph Retention Networks (GRNs) as a unified architecture for deep learning on dynamic graphs. The GRN extends the concept of retention into dynamic graph data as graph retention, equipping the model with three key computational paradigms: parallelizable training, low-cost $\mathcal{O}(1)$ inference, and long-term chunkwise training. This architecture achieves an optimal balance between efficiency, effectiveness, and scalability. Extensive experiments on benchmark datasets demonstrate its strong performance in both edge-level prediction and node-level classification tasks with significantly reduced training latency, lower GPU memory overhead, and improved inference throughput by up to 86.7x compared to SOTA baselines. The proposed GRN architecture achieves competitive performance across diverse dynamic graph benchmarks, demonstrating its adaptability to a wide range of tasks.

replace WebLLM: A High-Performance In-Browser LLM Inference Engine

Authors: Charlie F. Ruan, Yucheng Qin, Akaash R. Parthasarathy, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen

Abstract: Advancements in large language models (LLMs) have unlocked remarkable capabilities. While deploying these models typically requires server-grade GPUs and cloud-based inference, the recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment practical. The web browser as a platform for on-device deployment is universally accessible, provides a natural agentic environment, and conveniently abstracts out the different backends from diverse device vendors. To address this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. With machine learning compilers MLC-LLM and Apache TVM, WebLLM leverages optimized WebGPU kernels, overcoming the absence of performant WebGPU kernel libraries. Evaluations show that WebLLM can retain up to 80% native performance on the same device, with room to further close the gap. WebLLM paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers. The code is available at: https://github.com/mlc-ai/web-llm.

URLs: https://github.com/mlc-ai/web-llm

replace Influencing Humans to Conform to Preference Models for RLHF

Authors: Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

Abstract: Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.

replace Symmetry-Aware Generative Modeling through Learned Canonicalization

Authors: Kusha Sareen, Daniel Levy, Arnab Kumar Mondal, S\'ekou-Oumar Kaba, Tara Akhound-Sadegh, Siamak Ravanbakhsh

Abstract: Generative modeling of symmetric densities has a range of applications in AI for science, from drug discovery to physics simulations. The existing generative modeling paradigm for invariant densities combines an invariant prior with an equivariant generative process. However, we observe that this technique is not necessary and has several drawbacks resulting from the limitations of equivariant networks. Instead, we propose to model a learned slice of the density so that only one representative element per orbit is learned. To accomplish this, we learn a group-equivariant canonicalization network that maps training samples to a canonical pose and train a non-equivariant generative model over these canonicalized samples. We implement this idea in the context of diffusion models. Our preliminary experimental results on molecular modeling are promising, demonstrating improved sample quality and faster inference time.

replace deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models

Authors: Frederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bj{\o}rnager Dam, Kirsten Marie {\O}rnsbjerg Jensen, Roc\'io Mercado, Raghavendra Selvan

Abstract: Novel materials drive advancements in fields ranging from energy storage to electronics, with crystal structure characterization forming a crucial yet challenging step in materials discovery. In this work, we introduce \emph{deCIFer}, an autoregressive language model designed for powder X-ray diffraction (PXRD)-conditioned crystal structure prediction (PXRD-CSP). Unlike traditional CSP methods that rely primarily on composition or symmetry constraints, deCIFer explicitly incorporates PXRD data, directly generating crystal structures in the widely adopted Crystallographic Information File (CIF) format. The model is trained on nearly 2.3 million crystal structures, with PXRD conditioning augmented by basic forms of synthetic experimental artifacts, specifically Gaussian noise and instrumental peak broadening, to reflect fundamental real-world conditions. Validated across diverse synthetic datasets representative of challenging inorganic materials, deCIFer achieves a 94\% structural match rate. The evaluation is based on metrics such as the residual weighted profile ($R_{wp}$) and structural match rate (MR), chosen explicitly for their practical relevance in this inherently underdetermined problem. deCIFer establishes a robust baseline for future expansion toward more complex experimental scenarios, bridging the gap between computational predictions and experimental crystal structure determination.

replace CapyMOA: Efficient Machine Learning for Data Streams and Online Continual Learning in Python

Authors: Heitor Murilo Gomes, Anton Lee, Nuwan Gunasekara, Yibin Sun, Guilherme Weigert Cassales, Justin Liu, Marco Heyden, Vitor Cerqueira, Maroua Bahri, Yun Sing Koh, Bernhard Pfahringer, Albert Bifet

Abstract: CapyMOA is an open-source Python library for efficient machine learning on data streams and online continual learning. It provides a structured framework for real-time learning, supporting adaptive models that evolve over time. CapyMOA's architecture allows integration with frameworks such as MOA, scikit-learn and PyTorch, enabling the combination of high-performance online algorithms with modern deep learning techniques. By emphasizing efficiency, scalability, and usability, CapyMOA allows researchers and practitioners to tackle dynamic learning challenges across various domains. Website: https://capymoa.org. GitHub: https://github.com/adaptive-machine-learning/CapyMOA.

URLs: https://capymoa.org, https://github.com/adaptive-machine-learning/CapyMOA

replace ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation

Authors: Rikuto Kotoge, Ziwei Yang, Zheng Chen, Yushun Dong, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai

Abstract: Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task and often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel subgraph inference framework, ExPath, that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. The experiments involving 301 bio-network evaluations demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5x higher Fidelity+ (necessity) and 14x lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4x longer.

replace Quotation-Based Data Retention Mechanism for Data Privacy in LLM-Empowered Network Services

Authors: Bin Han, Di Feng, Zexin Fang, Jie Wang, Hans D. Schotten

Abstract: The deployment of large language models (LLMs) for next-generation network optimization introduces novel data governance challenges. Mobile network operators (MNOs) increasingly leverage generative artificial intelligence (AI) for traffic prediction, anomaly detection, and service personalization, requiring access to users' sensitive network usage data, including mobility patterns, traffic types, and location histories. Under the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and similar regulations, users retain the right to withdraw consent and demand data deletion. However, extensive machine unlearning degrades model accuracy and incurs substantial computational costs, ultimately harming network performance for all users. We propose an iterative price discovery mechanism enabling MNOs to compensate users for data retention through sequential price quotations. The server progressively raises the unit price for retaining data while users independently determine their supply at each quoted price. This approach requires no prior knowledge of users' privacy preferences and efficiently maximizes social welfare across the network ecosystem.
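
The quotation loop described above (the server raises the unit price step by step, and each user independently reports a supply at every quote) can be sketched as follows. This is an illustrative toy, not the paper's mechanism: the threshold-style users, the fixed price step, and the stopping rule (stop once a target amount of retained data is reached) are assumptions, and the welfare-maximizing stopping condition of the paper is not modeled.

```python
def threshold_user(private_cost, units):
    """Hypothetical user: supplies all `units` once the quote covers a private cost."""
    def supply(price):
        return units if price >= private_cost else 0
    return supply

def quote_prices(user_supply_fns, target, start_price, step, max_price):
    """Raise the quoted unit price until total supply meets `target`.

    The server only observes supplies at quoted prices, never the private costs.
    Returns (clearing_price or None, history of (price, total_supply)).
    """
    history = []
    price = start_price
    while price <= max_price:
        total = sum(supply(price) for supply in user_supply_fns)
        history.append((price, total))
        if total >= target:
            return price, history
        price += step
    return None, history

# Toy usage: three users with private per-unit costs 2, 5, and 9, ten data units each.
users = [threshold_user(2, 10), threshold_user(5, 10), threshold_user(9, 10)]
price, history = quote_prices(users, target=20, start_price=1, step=1, max_price=12)
print(price, history)
```

Because users respond only to the quoted price, the server needs no prior model of their privacy preferences, which matches the property the abstract emphasizes.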

replace Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights

Authors: Tahniat Khan, Soroor Motie, Sedef Akinli Kocak, Shaina Raza

Abstract: The rapid adoption of large language models (LLMs) has led to significant energy consumption and carbon emissions, posing a critical challenge to the sustainability of generative AI technologies. This paper explores the integration of energy-efficient optimization techniques in the deployment of LLMs to address these environmental concerns. We present a case study and framework that demonstrate how strategic quantization and local inference techniques can substantially lower the carbon footprints of LLMs without compromising their operational effectiveness. Experimental results reveal that these methods can reduce energy consumption and carbon emissions by up to 45\% post quantization, making them particularly suitable for resource-constrained environments. The findings provide actionable insights for achieving sustainability in AI while maintaining high levels of accuracy and responsiveness.

replace An overview of condensation phenomenon in deep learning

Authors: Zhi-Qin John Xu, Yaoyu Zhang, Zhangchen Zhou

Abstract: In this paper, we provide an overview of condensation, a common phenomenon observed during the nonlinear training of neural networks: neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Neural networks with small weight initializations or Dropout optimization can facilitate this condensation process. We also examine the underlying mechanisms of condensation from the perspectives of training dynamics and the structure of the loss landscape. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates with stronger reasoning abilities in transformer-based language models.

replace Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Authors: Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
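
To make the max-variance down-sampling idea concrete, here is a minimal sketch of an $O(n\log n)$ selection routine. It relies on the observation that a variance-maximizing subset of size m can always be taken as the k largest plus the m-k smallest rewards for some k, so sorting once and scanning k with prefix sums suffices. The function name and interface are assumptions, and this may differ from the authors' implementation.

```python
from itertools import accumulate

def max_variance_downsample(rewards, m):
    """Return indices of m rollouts whose rewards have maximal (population) variance."""
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= len(rewards)")
    # Sort rollout indices by reward: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])
    vals = [rewards[i] for i in order]
    # Prefix sums of values and squared values for O(1) variance queries.
    ps = [0.0] + list(accumulate(vals))
    ps2 = [0.0] + list(accumulate(v * v for v in vals))
    best_var, best_k = float("-inf"), 0
    for k in range(m + 1):  # k rollouts from the top, m - k from the bottom
        low = m - k
        s = ps[low] + (ps[n] - ps[n - k])
        s2 = ps2[low] + (ps2[n] - ps2[n - k])
        var = s2 / m - (s / m) ** 2
        if var > best_var:
            best_var, best_k = var, k
    low = m - best_k
    return sorted(order[:low] + order[n - best_k:])

# Toy usage: 8 rollouts with binary-ish rewards, keep 4 for the policy update.
rewards = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
kept = max_variance_downsample(rewards, m=4)
print(kept, [rewards[i] for i in kept])
```

With binary rewards this tends to keep a balanced mix of successes and failures, which is one way to read the "reward diversity" criterion in the abstract.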

replace Non-stationary Diffusion For Probabilistic Time Series Forecasting

Authors: Weiwei Ye, Zhuopeng Xu, Ning Gui

Abstract: Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at https://github.com/wwy155/NsDiff.

URLs: https://github.com/wwy155/NsDiff
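
For readers unfamiliar with the two noise models contrasted in the NsDiff abstract, their commonly used forms are written out below. The symbols $f$ (conditional mean), $g$ (conditional variance), and the Gaussian noise are notational assumptions for illustration; the abstract itself does not fix them.

```latex
\begin{aligned}
\text{ANM:}\quad  & Y = f(X) + \varepsilon, && \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \\
\text{LSNM:}\quad & Y = f(X) + \sqrt{g(X)}\,\varepsilon, && \varepsilon \sim \mathcal{N}(0, I)
\end{aligned}
```

Under the ANM the noise variance $\sigma^2$ is constant in $X$, whereas under the LSNM it is scaled by $g(X)$; the ANM is recovered when $g$ is constant.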

replace Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini

Abstract: Prevalent reinforcement learning~(RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. Yet if parallel test-time compute is already part of the deployment plan, training should be designed to support it. In this work, we propose RL$^V$ that augments any ``value-free'' RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20\% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2-1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model. More broadly, RL$^V$ instantiates the principle of co-training for test-time scaling: jointly optimizing for task performance and a capability useful at inference, using data that RL training already produces.

replace TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Authors: Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang

Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.

replace Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Authors: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

replace Learning Geometry and Topology via Multi-Chart Flows

Authors: Hanlin Yu, S{\o}ren Hauberg, Marcelo Hartmann, Arto Klami, Georgios Arvanitidis

Abstract: Real-world data often lie on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. This motivates learning degenerate normalizing flows that map between the ambient space and a low-dimensional latent space. However, if the manifold has a non-trivial topology, it can never be correctly learned using a single flow. Instead, multiple flows must be 'glued together'. In this paper, we first propose the general training scheme for learning such a collection of flows, and secondly we develop the first numerical algorithms for computing geodesics on such manifolds. Empirically, we demonstrate that this leads to highly significant improvements in topology estimation.

replace Towards Reasonable Concept Bottleneck Models

Authors: Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera

Abstract: We propose a novel, flexible, and efficient framework for designing Concept Bottleneck Models (CBMs) that enables practitioners to explicitly encode and extend their prior knowledge and beliefs about the concept-concept ($C-C$) and concept-task ($C \to Y$) relationships within the model's reasoning when making predictions. The resulting $\textbf{C}$oncept $\textbf{REA}$soning $\textbf{M}$odels (CREAMs) architecturally encode arbitrary types of $C-C$ relationships such as mutual exclusivity, hierarchical associations, and/or correlations, as well as potentially sparse $C \to Y$ relationships. Moreover, CREAM can optionally incorporate a regularized side-channel to complement the potentially incomplete concept sets, achieving competitive task performance while encouraging predictions to be concept-grounded. To evaluate CBMs in such settings, we introduce a $C \to Y$ agnostic metric that quantifies interpretability when predictions partially rely on the side-channel. In our experiments, we show that, without additional computational overhead, CREAM models support efficient interventions, can avoid concept leakage, and achieve black-box-level performance under missing concepts. We further analyze how an optional side-channel affects interpretability and intervenability. Importantly, the side-channel enables CBMs to remain effective even in scenarios where only a limited number of concepts are available.

replace On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Authors: Zhen Qin, Jinxin Zhou, Jiachen Jiang, Zhihui Zhu

Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

replace Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary conditions & equivalence with Hamiltonian Echo Learning

Authors: Guillaume Pourcel, Debabrota Basu, Maxence Ernoult, Aditya Gilra

Abstract: Equilibrium Propagation (EP) is a learning algorithm for training Energy-based Models (EBMs) on static inputs which leverages the variational description of their fixed points. Extending EP to time-varying inputs is a challenging problem, as the variational description must apply to the entire system trajectory rather than just fixed points, and careful consideration of boundary conditions becomes essential. In this work, we present Generalized Lagrangian Equilibrium Propagation (GLEP), which extends the variational formulation of EP to time-varying inputs. We demonstrate that GLEP yields different learning algorithms depending on the boundary conditions of the system, many of which are impractical for implementation. We then show that Hamiltonian Echo Learning (HEL) -- which includes the recently proposed Recurrent HEL (RHEL) and the earlier known Hamiltonian Echo Backpropagation (HEB) algorithms -- can be derived as a special case of GLEP. Notably, HEL is the only instance of GLEP we found that inherits the properties that make EP a desirable alternative to backpropagation for hardware implementations: it operates in a "forward-only" manner (i.e. using the same system for both inference and learning), it scales efficiently (requiring only two or more passes through the system regardless of model size), and enables local learning.

replace Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization

Authors: Badr AlKhamissi, C. Nicol\`o De Sabbata, Greta Tuckute, Zeming Chen, Martin Schrimpf, Antoine Bosselut

Abstract: Human cognitive behavior arises from the interaction of specialized brain networks dedicated to distinct functions, such as language, logic, and social reasoning. Inspired by this organization, we propose Mixture of Cognitive Reasoners (MiCRo): a modular, transformer-based architecture post-trained with a curriculum that induces functional specialization across experts. Concretely, we partition the layers of a pretrained language model into four expert modules aligned with well-studied cognitive networks in the human brain. MiCRo offers three key advantages over standard language models. (1) The specialized experts are interpretable and causally meaningful -- ablating a module causes substantial drops on benchmarks requiring its specialized domain. (2) MiCRo's behavior can be dynamically steered at inference time by routing tokens to particular experts (e.g., favoring social over logical reasoning), enabling fine-grained control over outputs. (3) MiCRo outperforms or matches comparable baselines on both machine-learning reasoning benchmarks (e.g., GSM8K, BBH) and alignment to human behavior (CogBench), while maintaining interpretability. Taken together, cognitively grounded functional specialization yields models that are both more human-like and more human-interpretable.

replace Relative Entropy Pathwise Policy Optimization

Authors: Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski

Abstract: Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.

replace Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Authors: Kailai Yang, Xiao Liu, Lei Ji, Hao Li, Xiao Liang, Zhiwei Liu, Yeyun Gong, Peng Cheng, Mao Yang

Abstract: Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents' well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.

replace Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition

Authors: Haris Khan, Sadia Asif, Shumaila Asif, Muhammad Zeeshan Karamat, Rajesh Upadhayaya

Abstract: In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
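The delta-projection idea can be illustrated with a minimal sketch. It assumes flattened weight vectors, uses plain Gram-Schmidt as the orthogonal projection, and merges by simple summation rather than the gradient-based optimization the abstract mentions; the function name `orthogonalize` and the toy vectors are hypothetical and not taken from the paper.

```python
import numpy as np

def orthogonalize(deltas):
    """Gram-Schmidt: remove from each delta its components along earlier ones."""
    out = []
    for d in deltas:
        v = d.astype(float).copy()
        for u in out:
            denom = u @ u
            if denom > 0:
                v -= (v @ u) / denom * u
        out.append(v)
    return out

# Toy stand-ins for a shared base model and two task-specific deltas.
base = np.zeros(4)
task_deltas = [np.array([1.0, 1.0, 0.0, 0.0]),
               np.array([1.0, 0.0, 1.0, 0.0])]

proj = orthogonalize(task_deltas)
merged = base + sum(proj)      # simple additive merge (an assumption)
unmerged = merged - proj[1]    # "unmerge" task 1 by subtracting its delta
```

Because the projected deltas are mutually orthogonal, subtracting one of them leaves the contribution of the others untouched, which is one way to read the reversibility claim.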

replace Teaching the Teacher: The Role of Teacher-Student Smoothness Alignment in Genetic Programming-based Symbolic Distillation

Authors: Soumyadeep Dhar, Kei Sen Fong, Mehul Motani

Abstract: Obtaining human-readable symbolic formulas via genetic programming-based symbolic distillation of a deep neural network trained on the target dataset presents a promising yet underexplored path towards explainable artificial intelligence (XAI); however, the standard pipeline frequently yields symbolic models with poor predictive accuracy. We identify a fundamental misalignment in functional complexity as the primary barrier to achieving better accuracy: standard Artificial Neural Networks (ANNs) often learn accurate but highly irregular functions, while Symbolic Regression typically prioritizes parsimony, often resulting in a much simpler class of models that are unable to sufficiently distill or learn from the ANN teacher. To bridge this gap, we propose a framework that actively regularizes the teacher's functional smoothness using Jacobian and Lipschitz penalties, aiming to distill better student models than the standard pipeline. We characterize the trade-off between predictive accuracy and functional complexity through a robust study involving 20 datasets and 50 independent trials. Our results demonstrate that students distilled from smoothness-regularized teachers achieve statistically significant improvements in R^2 scores, compared to the standard pipeline. We also perform ablation studies on the student model algorithm. Our findings suggest that smoothness alignment between teacher and student models is a critical factor for symbolic distillation.

replace Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images

Authors: Shanwei Zhang, Deyun Zhang, Yirao Tao, Kexin Wang, Shijia Geng, Jun Li, Qinghao Zhao, Xingpeng Liu, Xingliang Wu, Shengyong Chen, Yuxi Zhou, Shenda Hong

Abstract: Background: Electrocardiograms are indispensable for diagnosing cardiovascular diseases, yet in many settings they exist only as paper printouts stored in multiple recording layouts. Converting these images into digital signals introduces two key challenges: temporal asynchrony among leads and partial blackout missing, where contiguous signal segments become entirely unavailable. Existing models cannot adequately handle these concurrent problems while maintaining interpretability. Methods: We propose PatchECG, combining an adaptive variable block count missing learning mechanism with a masked training strategy. The model segments each lead into fixed-length patches, discards entirely missing patches, and encodes the remainder via a pluggable patch encoder. A disordered patch attention mechanism with patch-level temporal and lead embeddings captures cross-lead and temporal dependencies without interpolation. PatchECG was trained on PTB-XL and evaluated under seven simulated layout conditions, with external validation on 400 real ECG images from Chaoyang Hospital across three clinical layouts. Results: PatchECG achieves an average AUROC of approximately 0.835 across all simulated layouts. On the Chaoyang cohort, the model attains an overall AUROC of 0.778 for atrial fibrillation detection, rising to 0.893 on the 12x1 subset -- surpassing the pre-trained baseline by 0.111 and 0.190, respectively. Model attention aligns with cardiologist annotations at a rate approaching inter-clinician agreement. Conclusions: PatchECG provides a robust, interpolation-free, and interpretable solution for arrhythmia detection from digitized ECG images across diverse layouts. Its direct modeling of asynchronous and partially missing signals, combined with clinically aligned attention, positions it as a practical tool for cardiac diagnostics from legacy ECG archives in real-world clinical environments.
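The patching step described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: missing samples are marked with NaN, the function name `extract_patches` and the array layout are hypothetical, and the paper's actual preprocessing may differ.

```python
import numpy as np

def extract_patches(ecg: np.ndarray, patch_len: int):
    """Split each lead into fixed-length patches and drop fully missing ones.

    ecg: (n_leads, n_samples) array, NaN marks unavailable samples.
    Returns (patches, lead_idx, time_idx); the two index arrays can feed
    lead and temporal embeddings. Partially missing patches are kept.
    """
    n_leads, n_samples = ecg.shape
    n_patches = n_samples // patch_len
    x = ecg[:, : n_patches * patch_len].reshape(n_leads, n_patches, patch_len)
    keep = ~np.isnan(x).all(axis=2)          # True where a patch has any data
    lead_idx, time_idx = np.nonzero(keep)
    return x[keep], lead_idx, time_idx

# Toy example: 3 leads, 8 samples, second half of lead 1 blacked out.
ecg = np.zeros((3, 8))
ecg[1, 4:] = np.nan
patches, lead_idx, time_idx = extract_patches(ecg, patch_len=4)
```

Keeping the lead and time indices alongside each surviving patch is what allows an attention model to work on an unordered set of patches without interpolating the gaps.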

replace Proximal Supervised Fine-Tuning

Authors: Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu

Abstract: Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
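The abstract does not state the objective, so the following is only a guess at its general shape: the standard PPO clipped surrogate with the advantage fixed to +1 for every demonstration token. The function name `psft_style_loss` and the default `clip_eps` are assumptions.

```python
import numpy as np

def psft_style_loss(logp_new, logp_old, clip_eps=0.2):
    """PPO-style clipped surrogate with a constant advantage of +1.

    logp_new / logp_old: per-token log-probabilities of the demonstration
    tokens under the current and the reference (pre-update) model.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # With advantage = 1 the surrogate is min(ratio, clipped); negate to get a loss.
    return float(-np.minimum(ratio, clipped).mean())

same = psft_style_loss([-1.0, -2.0], [-1.0, -2.0])        # ratio = 1 everywhere
drifted = psft_style_loss([np.log(2.0)], [0.0])           # ratio = 2, clipped to 1.2
```

Once the probability ratio exceeds `1 + clip_eps` the loss stops decreasing, so the update receives no further incentive to push those tokens, which is the trust-region effect the abstract alludes to.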

replace Soft Graph Transformer for MIMO Detection

Authors: Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang

Abstract: We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.

replace StyleBench: Evaluating thinking styles in Large Language Models

Authors: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

Abstract: Structured reasoning can improve the inference performance of large language models (LLMs), but it also introduces computational cost and control constraints. When additional reasoning structure helps, and when it instead reduces efficiency or robustness, remains poorly understood. We propose StyleBench, where we study reasoning structure as a capacity-constrained design choice rather than a fixed inference recipe. We evaluate five representative reasoning styles: Chain-of-Thought, Tree-of-Thought, Algorithm-of-Thought, Sketch-of-Thought, and Chain-of-Draft across five reasoning tasks and 15 open-source LLMs ranging from 270M to 120B parameters. We find that greater structural complexity improves accuracy only in limited regimes defined by task demands and model capacity. Search-based styles help on open-ended combinatorial problems but fail on smaller models, while concise styles achieve large efficiency gains on structured tasks without sacrificing performance. We also identify systematic failure modes in smaller models, including premature guessing and weak adherence to reasoning-control instructions. To study adaptive reasoning control, we further compare supervised and reinforcement-based strategy selection on Qwen-7B-Instruct. Supervised fine-tuning collapses to shallow style preferences, whereas GRPO learns stronger adaptive control and improves downstream performance. Together, these results clarify when structured reasoning is useful, when it is wasteful, and why learning to choose a reasoning strategy is itself a challenging inference problem. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.

URLs: https://github.com/JamesJunyuGuo/Style_Bench.

replace Learning Aligned Stability in Neural ODEs Reconciling Accuracy with Robustness

Authors: Chaoyang Luo, Yan Zou, Nanjing Huang

Abstract: Despite Neural Ordinary Differential Equations (Neural ODEs) exhibiting intrinsic robustness, existing methods often impose Lyapunov stability for formal guarantees. However, these methods still face a fundamental accuracy-robustness trade-off, which stems from a core limitation: their applied stability conditions are rigid and inappropriate, creating a mismatch between the model's regions of attraction (RoAs) and its decision boundaries. To resolve this, we propose Zubov-Net, a novel framework that unifies dynamics and decision-making. We first employ learnable Lyapunov functions directly as the multi-class classifier, ensuring the prescribed RoAs (PRoAs, defined by the Lyapunov functions) inherently align with a classification objective. Then, for aligning prescribed and true regions of attraction (PRoAs-RoAs), we establish a Zubov-driven stability region matching mechanism by reformulating Zubov's equation into a differentiable consistency loss. Building on this alignment, we introduce a new paradigm for actively controlling the geometry of RoAs by directly optimizing PRoAs to reconcile accuracy and robustness. Theoretically, we prove that minimizing the tripartite loss guarantees consistency alignment of PRoAs-RoAs, non-overlapping PRoAs, trajectory stability, and a certified robustness margin. Moreover, we establish stochastic convex separability with tighter probability bounds and lower dimensionality requirements to justify the convex design in Lyapunov functions.

replace Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Authors: Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluation, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) data contamination in benchmarks. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched, and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, one judge robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.

replace MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning

Authors: Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li

Abstract: Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM's reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at https://anonymous.4open.science/r/MoveFM-R-CDE7/.

URLs: https://anonymous.4open.science/r/MoveFM-R-CDE7/.

replace Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning

Authors: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning is often framed as balancing exploration and exploitation in action space, typically operationalized with token-level proxies (e.g., output entropy or confidence). We argue that this apparent trade-off is largely a measurement artifact: token-level statistics reflect next-token uncertainty rather than how reasoning progresses over multi-token semantic structures. We therefore study exploration and exploitation in the hidden-state space of response trajectories. We use Effective Rank (ER) to quantify representational exploration and introduce its temporal derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to characterize exploitative refinement dynamics. Empirically and theoretically, ER and ERV exhibit near-zero correlation in semantic space, suggesting the two capacities can be improved simultaneously. Motivated by this, we propose Velocity-Exploiting Rank Learning (VERL), which shapes the RL advantage with an auxiliary signal derived from ER/ERV and uses the more stable ERA as a meta-control variable to adaptively balance the incentives. Across multiple base models, RL algorithms, and reasoning benchmarks, VERL yields consistent improvements, including large gains on challenging tasks (e.g., 21.4\% in Gaokao 2024).
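A rough sketch of the quantities named above, under explicit assumptions: effective rank is taken as the exponential of the Shannon entropy of the normalized singular values (a common definition), it is evaluated on growing prefixes of a response's hidden states, and ERV and ERA are approximated by first and second finite differences. The paper's exact definitions may differ, and the helper names are hypothetical.

```python
import numpy as np

def effective_rank(hidden: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(hidden, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

def er_trajectory(hidden: np.ndarray, min_len: int = 2) -> np.ndarray:
    """Effective rank of each prefix hidden[:t], t = min_len .. T."""
    T = hidden.shape[0]
    return np.array([effective_rank(hidden[:t]) for t in range(min_len, T + 1)])

rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))   # hypothetical (tokens, hidden_dim) states
er = er_trajectory(H)
erv = np.diff(er)              # finite-difference stand-in for ERV
era = np.diff(er, n=2)         # finite-difference stand-in for ERA
```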

replace Unsupervised Detection of Spatiotemporal Anomalies in PMU Data Using Transformer-Based BiGAN

Authors: Muhammad Imran Hossain, Jignesh Solanki, Sarika Khushlani Solanki

Abstract: Ensuring power grid resilience requires the timely and unsupervised detection of anomalies in synchrophasor data streams. We introduce T-BiGAN, a novel framework that integrates window-attention Transformers within a bidirectional Generative Adversarial Network (BiGAN) to address this challenge. Its self-attention encoder-decoder architecture captures complex spatio-temporal dependencies across the grid, while a joint discriminator enforces cycle consistency to align the learned latent space with the true data distribution. Anomalies are flagged in real-time using an adaptive score that combines reconstruction error, latent space drift, and discriminator confidence. Evaluated on a realistic hardware-in-the-loop PMU benchmark, T-BiGAN achieves an ROC-AUC of 0.95 and an average precision of 0.996, significantly outperforming leading supervised and unsupervised methods. It shows particular strength in detecting subtle frequency and voltage deviations, demonstrating its practical value for live, wide-area monitoring without relying on manually labeled fault data.

replace EEG-based AI-BCI Wheelchair Advancement: Hybrid Deep Learning with Motor Imagery for Brain Computer Interface

Authors: Bipul Thapa, Biplov Paneru, Bishwash Paneru, Khem Narayan Poudyal

Abstract: This paper presents an Artificial Intelligence (AI) integrated approach to Brain-Computer Interface (BCI)-based wheelchair development, utilizing a motor imagery right-left-hand movement mechanism for control. The system is designed to simulate wheelchair navigation based on motor imagery right and left-hand movements using electroencephalogram (EEG) data. A pre-filtered dataset, obtained from an open-source EEG repository, was segmented into arrays of 19x200 to capture the onset of hand movements. The data was acquired at a sampling frequency of 200 Hz. The system integrates a Tkinter-based interface for simulating wheelchair movements, offering users a functional and intuitive control system. We propose a framework that uses a Convolutional Neural Network-Transformer Hybrid Model, named CTHM, for motor imagery EEG classification. The model achieves a test accuracy of 91.73% and is compared against various machine learning baseline models, including XGBoost, EEGNet, and a transformer-based model. The CTHM achieved a mean accuracy of 90% through stratified cross-validation, showcasing the effectiveness of the CNN-Transformer hybrid architecture in BCI applications.

replace Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

Authors: Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying

Abstract: Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^6 / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.

replace Detecting Invariant Manifolds in ReLU-Based RNNs

Authors: Lukas Eisenmann, Alena Br\"andle, Zahra Monfared, Daniel Durstewitz

Abstract: Recurrent Neural Networks (RNNs) have found widespread applications in machine learning for time series prediction and dynamical systems reconstruction, and experienced a recent renaissance with improved training algorithms and architectural designs. Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. An RNN's dynamical repertoire depends on the topological and geometrical properties of its state space. Stable and unstable manifolds of periodic points play a particularly important role: They dissect a dynamical system's state space into different basins of attraction, and their intersections lead to chaotic dynamics with fractal geometry. Here we introduce a novel algorithm for detecting these manifolds, with a focus on piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as their activation function. We demonstrate how the algorithm can be used to trace the boundaries between different basins of attraction, and hence to characterize multistability, a computationally important property. We further show its utility in finding so-called homoclinic points, the intersections between stable and unstable manifolds, and thus establish the existence of chaos in PLRNNs. Finally we show for an empirical example, electrophysiological recordings from a cortical neuron, how insights into the underlying dynamics could be gained through our method.

replace A Mathematical Explanation of Transformers

Authors: Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan

Abstract: The Transformer architecture has revolutionized the field of sequence modeling and underpins the recent breakthroughs in large language models (LLMs). However, a comprehensive mathematical theory that explains its structure and operations remains elusive. In this work, we propose a novel continuous framework that rigorously interprets the Transformer as a discretization of a structured integro-differential equation. Within this formulation, the self-attention mechanism emerges naturally as a non-local integral operator, and layer normalization is characterized as a projection to a time-dependent constraint. This operator-theoretic and variational perspective offers a unified and interpretable foundation for understanding the architecture's core components, including attention, feedforward layers, and normalization. Our approach extends beyond previous theoretical analyses by embedding the entire Transformer operation in continuous domains for both token indices and feature dimensions. This leads to a principled and flexible framework that not only deepens theoretical insight but also offers new directions for architecture design, analysis, and control-based interpretations. This new interpretation provides a step toward bridging the gap between deep learning architectures and continuous mathematical modeling, and contributes a foundational perspective to the ongoing development of interpretable and theoretically grounded neural network models.

replace Evolutionary Profiles for Protein Fitness Prediction

Authors: Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen

Abstract: Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The code will be made publicly available.

replace Design Principles for Sequence Models via Coefficient Dynamics

Authors: Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso

Abstract: Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties, thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.

replace Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation

Authors: Hengyuan Zhang, Shiping Yang, Xiao Liang, Chenming Shang, Yuxuan Jiang, Chaofan Tao, Jing Xiong, Hayden Kwok-Hay So, Ruobing Xie, Angel X. Chang, Ngai Wong

Abstract: Training student models on synthetic data generated by strong teacher models is a promising way to distill the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new ``Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional ``Generate then Select'' paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruction tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.

replace Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards

Authors: Sarah Liaw, Benjamin Plaut

Abstract: In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.

replace Self-Certifying Primal-Dual Optimization Proxies for Large-Scale Batch Economic Dispatch

Authors: Michael Klamkin, Mathieu Tanneau, Pascal Van Hentenryck

Abstract: Recent research has shown that optimization proxies can be trained to high fidelity, achieving average optimality gaps under 1% for large-scale problems. However, worst-case analyses show that there exist in-distribution queries that result in orders of magnitude higher optimality gap, making it difficult to trust the predictions in practice. This paper aims at striking a balance between classical solvers and optimization proxies in order to enable trustworthy deployments with interpretable speed-optimality tradeoffs based on a user-defined optimality threshold. To this end, the paper proposes a hybrid solver that leverages duality theory to efficiently bound the optimality gap of predictions, falling back to a classical solver for queries where optimality cannot be certified. To improve the achieved speedup of the hybrid solver, the paper proposes an alternative training procedure that combines the primal and dual proxy training. Experiments on large-scale transmission systems show that the hybrid solver is highly scalable. The proposed hybrid solver achieves speedups of over 1000x compared to a parallelized simplex-based solver while guaranteeing a maximum optimality gap of 2%.
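
The certify-or-fall-back logic described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy proxies, and the choice to measure the gap relative to the dual bound are all assumptions. It relies only on weak duality for a minimization problem with a positive dual bound, where a feasible primal objective and a dual-feasible objective bracket the true optimum.

```python
def certified_gap(primal_obj, dual_obj):
    """Upper bound on the relative optimality gap of a feasible primal prediction.

    Assumes a minimization problem and a dual-feasible prediction, so that
    dual_obj <= optimum <= primal_obj (weak duality). With dual_obj > 0 this
    gives (primal_obj - optimum) / optimum <= (primal_obj - dual_obj) / dual_obj.
    """
    if dual_obj <= 0:
        raise ValueError("this sketch assumes a strictly positive dual bound")
    return (primal_obj - dual_obj) / dual_obj


def hybrid_solve(instance, primal_proxy, dual_proxy, classical_solver, max_gap=0.02):
    """Return the proxy objective if its gap is certified, else call the solver."""
    primal_obj = primal_proxy(instance)
    dual_obj = dual_proxy(instance)
    if certified_gap(primal_obj, dual_obj) <= max_gap:
        return primal_obj, "proxy"
    return classical_solver(instance), "solver"


# Hypothetical stand-ins for trained proxies and a classical solver.
toy_optimum = {"easy": 100.0, "hard": 100.0}
toy_primal = {"easy": 101.0, "hard": 110.0}
toy_dual = {"easy": 99.5, "hard": 99.0}

results = {
    name: hybrid_solve(name, toy_primal.get, toy_dual.get, toy_optimum.get)
    for name in toy_optimum
}
```

In the toy data, the "easy" query is certified within the 2% threshold and keeps the proxy answer, while the "hard" query cannot be certified and is routed to the solver stand-in.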

replace BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation

Authors: Liang Ye, Shengqin Chen, Jiazhu Dai

Abstract: The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided, graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method against latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate below 10% can achieve a 50% attack success rate, while 24% suffices for a success rate above 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities of latent diffusion models for text-guided graph generation, highlight the serious risks in applications of these models such as drug discovery, and underscore the need for robust defenses against backdoor attacks in such diffusion models.

replace DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment

Authors: Hao Wang, Licheng Pan, Yuan Lu, Zhixuan Chu, Xiaoxi Li, Shuting He, Zhichao Chen, Haoxuan Li, Qingsong Wen, Zhouchen Lin

Abstract: Training time-series forecasting models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimizing the conditional negative log-likelihood, typically estimated by the mean squared error. However, this estimation proves biased when the label sequence exhibits autocorrelation. In this paper, we propose DistDF, which achieves alignment by minimizing a distributional discrepancy between the conditional distributions of forecast and label sequences. Since such conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. The proposed discrepancy is tractable, differentiable, and readily compatible with gradient-based optimization. Extensive experiments show that DistDF improves diverse forecasting models and achieves leading performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.

URLs: https://anonymous.4open.science/r/DistDF-F66B.

replace Thought Branches: Interpreting LLM Reasoning Requires Resampling

Authors: Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda

Abstract: Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, we can measure a partial CoT's impact by resampling only the subsequent text. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we find that self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find that off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that causally affect the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.

replace Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse

Authors: Jinghui Wang, Shaojie Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Liang Huang, Xiaojiang Zhang, Junyi Peng, Li Wan, Haotian Zhang, Bin Chen

Abstract: Agentic large language model (LLM) training often involves multi-turn interaction trajectories that branch into multiple execution paths due to concurrent tool use, think-mode, sub-agent, context management and other runtime designs. As a result, the tokens produced by a single task naturally form a tree-structured token trajectory with shared prefixes, rather than a linear sequence. Existing training pipelines linearize such trajectories and treat each branch independently, leading to substantial redundant computation in both forward and backward passes. We derive that averaging the loss over all branches independently is algebraically identical to a per-token weighted loss, where each token's weight equals the fraction of branches passing through it. The problem therefore reduces to computing the log-probability of every token in the prefix tree exactly once, with no repeated computation across shared prefixes: we propose DFS serialization of the tree, which visits every token exactly once, and adapt full-attention and SSM layers to ensure the resulting log-probabilities match independent per-branch calculation exactly. In practice, a single trajectory tree can be too large to fit in GPU memory; we therefore propose Tree Partitioning, a memory-efficient partitioning strategy that splits the tree into subtrees each fitting within GPU memory while preserving high prefix reuse. Together, these contributions form Tree Training, an efficient framework for training LLMs on tree-structured trajectories, achieving up to 6.2x end-to-end training speedup on dense and MOE models for both supervised fine-tuning and reinforcement learning.
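
The loss identity stated in this abstract can be checked numerically on a toy prefix tree. The sketch below is an illustration under stated assumptions, not the paper's code: `token_loss` is a made-up stand-in for a model's negative log-probability of a token given its prefix, and each branch's loss is taken to be the plain sum of its token losses.

```python
from collections import Counter


def token_loss(prefix):
    # Hypothetical stand-in for -log p(last token | earlier tokens in prefix).
    return 0.1 * len(prefix) + 0.01 * prefix[-1]


def per_branch_average(branches):
    """Linearized baseline: score every branch independently, then average."""
    total = 0.0
    for branch in branches:
        total += sum(token_loss(tuple(branch[: i + 1])) for i in range(len(branch)))
    return total / len(branches)


def prefix_counts(branches):
    """Map each tree node (a prefix) to the number of branches passing through it."""
    counts = Counter()
    for branch in branches:
        for i in range(len(branch)):
            counts[tuple(branch[: i + 1])] += 1
    return counts


def tree_weighted_loss(branches):
    """Score each tree node once, weighted by the fraction of branches through it."""
    counts = prefix_counts(branches)
    n = len(branches)
    return sum(token_loss(node) * c / n for node, c in counts.items())


# Three branches sharing the prefixes [1] and [1, 2].
branches = [[1, 2, 3], [1, 2, 4, 5], [1, 6]]
linearized_tokens = sum(len(b) for b in branches)   # tokens scored by the baseline
tree_tokens = len(prefix_counts(branches))          # tokens scored on the tree
```

Both functions return the same value, while the tree version evaluates `token_loss` on 6 nodes instead of the 9 tokens in the linearized branches.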

replace Discrete Bayesian Sample Inference for Graph Generation

Authors: Ole Petersen, Marcel Kollovieh, Marten Lienen, Stephan G\"unnemann

Abstract: Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.

replace SynthAgent: Adapting Web Agents with Synthetic Supervision

Authors: Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao

Abstract: Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, tasks are refined only when conflicts with observations are detected, which mitigates hallucinations while preserving task consistency. After collection, we conduct trajectory refinement with global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code is publicly available at https://github.com/aiming-lab/SynthAgent.

URLs: https://github.com/aiming-lab/SynthAgent.

replace A Weak Penalty Neural ODE for Learning Chaotic Dynamics from Noisy Time Series

Authors: Xuyang Li, John Harlim, Dibyajyoti Chakraborty, Romit Maulik

Abstract: The accurate forecasting of complex, high-dimensional dynamical systems from observational data is a fundamental task across numerous scientific and engineering disciplines. A significant challenge arises from noise-corrupted measurements, which severely degrade the performance of data-driven models. In chaotic dynamical systems, where small initial errors amplify exponentially, it is particularly difficult to develop a model from noisy data that achieves short-term accuracy while preserving long-term invariant properties. To overcome this, we consider the weak formulation as a complementary approach to the classical $L2$-loss function for training models of dynamical systems. We empirically verify that the weak formulation, with a proper choice of test function and integration domain, effectively filters noisy data. This insight explains why a weak form loss function is analogous to fitting a model to filtered data and provides a practical way to parameterize the weak form. Subsequently, we demonstrate how this approach overcomes the instability and inaccuracy of standard Neural ODE (NODE) in modeling chaotic systems. Through numerical examples, we show that our proposed training strategy, the Weak Penalty NODE, is computationally efficient, solver-agnostic, and yields accurate and robust forecasts across benchmark chaotic systems and a real-world climate dataset.

replace From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm

Authors: Alexander Nadel, Ron Wettenstein

Abstract: SHapley Additive exPlanations (SHAP) is a key tool for interpreting decision tree ensembles by assigning contribution values to features. It is widely used in finance, advertising, medicine, and other domains. Two main approaches to SHAP calculation exist: Path-Dependent SHAP, which leverages the tree structure for efficiency, and Background SHAP, which uses a background dataset to estimate feature distributions. We introduce WOODELF, a SHAP algorithm that integrates decision trees, game theory, and Boolean logic into a unified framework. For each consumer, WOODELF constructs a pseudo-Boolean formula that captures their feature values, the structure of the decision tree ensemble, and the entire background dataset. It then leverages this representation to compute Background SHAP in linear time. WOODELF can also compute Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values. WOODELF is designed to run efficiently on CPU and GPU hardware alike. Available via the WOODELF Python package, it is implemented using NumPy, SciPy, and CuPy without relying on custom C++ or CUDA code. This design enables fast performance and seamless integration into existing frameworks, supporting large-scale computation of SHAP and other game-theoretic values in practice. For example, on a dataset with 3,000,000 rows, 5,000,000 background samples, and 127 features, WOODELF computed all Background Shapley values in 162 seconds on CPU and 16 seconds on GPU - compared to 44 minutes required by the best method on any hardware platform, representing 16x and 165x speedups, respectively.
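
The speedup figures quoted at the end of this abstract follow directly from the reported runtimes; a short check, using only the numbers stated above (variable names are illustrative):

```python
# Runtimes as reported in the abstract.
baseline_seconds = 44 * 60   # best prior method: 44 minutes
cpu_seconds = 162            # WOODELF on CPU
gpu_seconds = 16             # WOODELF on GPU

cpu_speedup = baseline_seconds / cpu_seconds   # roughly 16x
gpu_speedup = baseline_seconds / gpu_seconds   # 165x
```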

replace Fourier-KAN-Mamba: A Novel State-Space Equation Approach for Time-Series Anomaly Detection

Authors: Xiancheng Wang, Lin Wang, Rui Wang, Zhibo Zhang, Minghang Zhao

Abstract: Time-series anomaly detection plays a critical role in numerous real-world applications, including industrial monitoring and fault diagnosis. Recently, Mamba-based state-space models have shown remarkable efficiency in long-sequence modeling. However, directly applying Mamba to anomaly detection tasks still faces challenges in capturing complex temporal patterns and nonlinear dynamics. In this paper, we propose Fourier-KAN-Mamba, a novel hybrid architecture that integrates a Fourier layer, Kolmogorov-Arnold Networks (KAN), and the Mamba selective state-space model. The Fourier layer extracts multi-scale frequency features, KAN enhances nonlinear representation capability, and a temporal gating control mechanism further improves the model's ability to distinguish normal and anomalous patterns. Extensive experiments on the MSL, SMAP, and SWaT datasets demonstrate that our method significantly outperforms existing state-of-the-art approaches. Keywords: time-series anomaly detection, state-space model, Mamba, Fourier transform, Kolmogorov-Arnold Network

replace Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France

Authors: Eloi Lindas, Yannig Goude, Philippe Ciais

Abstract: In a growing renewable-based energy system, accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand, and market risk management. Even though short-term weather forecasts have been thoroughly used to provide renewable power predictions up to 3 days ahead, forecasts with prediction horizons longer than a week still need investigation. Despite the recent progress in subseasonal-to-seasonal probabilistic weather forecasting, its use for wind power prediction usually involves both temporal and spatial aggregation to achieve reasonable skill. In this study, we present a forecasting pipeline, agnostic to lead time and numerical weather model, that transforms ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for France for lead times ranging from 1 day to 46 days at daily resolution. By leveraging a post-processing step of the resulting power ensembles, we show that these forecasts improve on the climatological baseline by 15% to 5% for the Continuous Ranked Probability Score and 20% to 5% for ensemble Mean Squared Error up to 16 days in advance, before converging towards the climatological skill. This improvement in skill is obtained jointly with near-perfect calibration of the forecasts for every lead time. The results suggest that electricity market players could benefit from the extended forecast range of up to two weeks to improve their decision making on renewable supply.

replace MSTN: A Lightweight and Fast Model for General TimeSeries Analysis

Authors: Sumit S Shevtekar, Chandresh K Maurya

Abstract: Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behavior expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors -- such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders -- which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To handle this, we introduce the Multi-scale Temporal Network (MSTN), a hybrid neural architecture grounded in an Early Temporal Aggregation principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze-excitation and a single dense layer to dynamically reweight and fuse multi-scale representations. This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering imputation, long-term forecasting, short-term forecasting, classification, and cross-dataset generalization, MSTN achieves state-of-the-art performance, establishing new best results on 33 of 40 datasets, while remaining lightweight ($\sim$278,520 params for MSTN-BiLSTM and $\sim$950,776 $\approx$ 1M for MSTN-Transformer) and suitable for low-latency inference ($<$1 sec, often in milliseconds) and resource-constrained deployment.

replace A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Authors: Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Qianxiao Li, Mengnan Du, Dianbo Liu

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

replace Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen

Abstract: To address the trade-off between robustness and performance for robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code is available at https://github.com/michaeltian108/FDA.

URLs: https://github.com/michaeltian108/FDA.

replace Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts

Authors: Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

Abstract: Accurate and early diagnosis of Alzheimer's disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis.

replace B\'ezierFlow: Learning B\'ezier Stochastic Interpolant Schedulers for Few-Step Generation

Authors: Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung

Abstract: We introduce B\'ezierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. B\'ezierFlow achieves a 2-3x performance improvement for sampling with $\leq$ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as B\'ezier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to B\'ezier control points. Across a range of pretrained diffusion and flow models, B\'ezierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to B\'ezier-based trajectory transformations.

replace FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

Authors: Yeonwoo Cha, Semin Kim, Jinhyeon Kwon, Seunghoon Hong

Abstract: Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at https://yeonwoo378.github.io/official_flowbind.

URLs: https://yeonwoo378.github.io/official_flowbind.

replace Understanding Generalization in Role-Playing Models via Information Theory

Authors: Yongqi Li, Hao Lang, Fei Huang, Tieyun Qian, Yongbin Li

Abstract: Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and formal frameworks for characterizing RPM generalization behaviors are lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.

replace Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

Authors: Maximilian Weichart

Abstract: Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal, with the aim of facilitating further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.
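
For orientation, the two ingredients this abstract combines can be sketched in a few lines. The PUCT form below is the commonly cited AlphaZero-style score, and the UCB-V index follows one standard statement of the prior-free bound for rewards in [0, b]. The function names and the constants `c_puct`, `zeta`, `c`, and `b` are illustrative assumptions, and neither function is the Inverse-RPO-derived policy described in the abstract.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.25):
    """AlphaZero-style PUCT: value estimate plus a prior-weighted exploration bonus.

    c_puct is an illustrative constant, not a value taken from the abstract.
    """
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def ucbv_score(mean, variance, t, n, b=1.0, zeta=1.2, c=1.0):
    """Prior-free UCB-V index for an arm pulled n times by round t.

    Assumes rewards lie in [0, b]; zeta and c are tuning constants (assumed values).
    """
    if n == 0:
        return math.inf  # unvisited arms are tried first
    e = zeta * math.log(t)
    # Variance-dependent term plus a range-dependent correction term.
    return mean + math.sqrt(2.0 * variance * e / n) + c * 3.0 * b * e / n
```

The variance term is what distinguishes UCB-V from UCB1: an arm with low empirical variance receives a smaller exploration bonus at the same visit count.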

replace Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Authors: Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal

Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of "fair" comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a standard step. Consequently, we posit that the objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning. To mitigate the high cost of hyperparameter tuning, we introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training. We show that this approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Theoretically, we prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation, demonstrating dramatic improvements in the reliability of small-scale experiments.

replace Enhanced-FQL($\lambda$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay

Authors: Mohsen Jalaeian-Farimani, Xiong Xiong, Luca Bascetta

Abstract: This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($\lambda$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with the Fuzzified Bellman Equation (FBE) for continuous control. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves convergence of the proposed method under standard assumptions. On the Cart--Pole benchmark, Enhanced-FQL($\lambda$) improves sample efficiency and reduces variance relative to $n$-step fuzzy TD and fuzzy SARSA($\lambda$), while remaining competitive with the tested DDPG baseline. These results support the proposed framework as an interpretable and computationally compact alternative for moderate-scale continuous control problems.

replace MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Authors: Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu

Abstract: Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives - such as creativity versus factuality - where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.

replace Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds

Authors: Shaocong Ma, Heng Huang

Abstract: In this paper, we study Riemannian zeroth-order optimization in settings where the underlying Riemannian metric $g$ is geodesically incomplete, and the goal is to approximate stationary points with respect to this incomplete metric. To address this challenge, we construct structure-preserving metrics that are geodesically complete while ensuring that every stationary point under the new metric remains stationary under the original one. Building on this foundation, we revisit the classical symmetric two-point zeroth-order estimator and analyze its mean-squared error from a purely intrinsic perspective, depending only on the manifold's geometry rather than any ambient embedding. Leveraging this intrinsic analysis, we establish convergence guarantees for stochastic gradient descent with this intrinsic estimator. Under additional suitable conditions, an $\epsilon$-stationary point under the constructed metric $g'$ also corresponds to an $\epsilon$-stationary point under the original metric $g$, thereby matching the best-known complexity in the geodesically complete setting. Empirical studies on synthetic problems confirm our theoretical findings, and experiments on a practical mesh optimization task demonstrate that our framework maintains stable convergence even in the absence of geodesic completeness.
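
As a point of reference, the classical symmetric two-point estimator mentioned above can be sketched in its flat Euclidean form. This is only the special case of a manifold equal to R^d with the standard metric; the intrinsic Riemannian estimator analyzed in the paper would replace the straight-line perturbation with a geometry-aware one. The names `two_point_estimate` and `averaged_estimate` and the smoothing radius `mu` are assumptions made for illustration.

```python
import math
import random


def two_point_estimate(f, x, mu=1e-4, rng=random):
    """One symmetric two-point gradient estimate of f at x (Euclidean case)."""
    d = len(x)
    # Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    u = [v / norm for v in g]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Central finite difference along u, using only function values.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    # The factor d makes the estimate unbiased for linear f, since E[d u u^T] = I.
    return [d * slope * ui for ui in u]


def averaged_estimate(f, x, samples=2000, seed=0):
    """Average several independent estimates to reduce variance."""
    rng = random.Random(seed)
    acc = [0.0] * len(x)
    for _ in range(samples):
        est = two_point_estimate(f, x, rng=rng)
        acc = [a + e for a, e in zip(acc, est)]
    return [a / samples for a in acc]
```

A single estimate is noisy, which is why a mean-squared-error analysis of the kind described in the abstract matters for the convergence rate of the resulting stochastic gradient descent.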

replace Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families

Authors: Lennon Shikhman

Abstract: Neural PDE solvers have shown strong performance on standard benchmarks, but their robustness under deployment-relevant distribution shifts remains insufficiently characterized. We present a systematic stress-testing framework for evaluating neural PDE solvers across five qualitatively different PDE families -- dispersive, elliptic, multi-scale fluid, financial, and chaotic systems -- under controlled shifts in parameters, boundary or terminal conditions, resolution, rollout horizon, and input perturbations. The framework is instantiated on three representative architectures: Fourier Neural Operators (FNOs), DeepONet, and convolutional neural operators (CNOs). Across 750 trained models, we evaluate robustness using baseline-normalized degradation factors together with spectral and rollout diagnostics. This setup is designed to distinguish failure patterns that are shared across architectures from those that are architecture- or PDE-specific. Overall, the paper is framed as an evaluation study rather than a new architecture paper, with the goal of providing a clearer basis for assessing robustness claims in neural PDE solvers.

replace Optimal L2 Regularization in High-dimensional Continual Linear Regression

Authors: Gilad Karpel, Edward Moroshko, Ran Levinstein, Ron Meir, Daniel Soudry, Itay Evron

Abstract: We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.

replace DiSPA: Differential Substructure-Pathway Attention for Drug Response Prediction

Authors: Yewon Han, Sunghyun Kim, Eunyi Jeong, Sungkyung Lee, Seokwoo Yun, Sangsoo Lim

Abstract: Accurate prediction of drug response in precision medicine requires models that capture how specific chemical substructures interact with cellular pathway states. However, most existing deep learning approaches treat chemical and transcriptomic modalities independently or combine them only at late stages, limiting their ability to model fine-grained, context-dependent mechanisms of drug action. In addition, vanilla attention mechanisms are often sensitive to noise and sparsity in high-dimensional biological networks, hindering both generalization and interpretability. We present DiSPA (Differential Substructure-Pathway Attention), a framework that models bidirectional interactions between chemical substructures and pathway-level gene expression. DiSPA introduces differential cross-attention to suppress spurious associations while enhancing context-relevant interactions. On the GDSC benchmark, DiSPA achieves state-of-the-art performance, with strong improvements in the disjoint setting. These gains are consistent across random and drug-blind splits, suggesting improved robustness. Analyses of attention patterns indicate more selective and concentrated interactions compared to standard cross-attention. Exploratory evaluation shows that differential attention better prioritizes predefined target-related pathways, although this does not constitute mechanistic validation. DiSPA also shows promising generalization on external datasets (CTRP) and cross-dataset settings, although further validation is needed. It further enables zero-shot application to spatial transcriptomics, providing exploratory insights into region-specific drug sensitivity patterns without ground-truth validation.

replace MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification

Authors: Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

Abstract: Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks. The code is available at https://github.com/5SSjw/MARS.

URLs: https://github.com/5SSjw/MARS.
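
The idea of relaxing rejection only in low-margin regimes can be illustrated with a toy acceptance rule. This is a hypothetical sketch, not the rule from the paper: it assumes greedy (argmax) strict verification, uses the raw gap between the top two target logits as the stability measure, and introduces a threshold `tau` purely for illustration.

```python
def margin_aware_accept(target_logits, draft_token, tau=0.5):
    """Decide whether to keep a drafted token, given the target model's logits.

    Hypothetical rule: always accept the target's top-1 token; accept the
    runner-up only when the top-1/top-2 logit gap is below tau.
    """
    order = sorted(range(len(target_logits)),
                   key=lambda i: target_logits[i], reverse=True)
    top1, top2 = order[0], order[1]
    if draft_token == top1:
        return True  # strict verification already accepts this token
    margin = target_logits[top1] - target_logits[top2]
    # Relax only when the target is nearly indifferent between its top two tokens.
    return draft_token == top2 and margin < tau
```

With `tau = 0` the function reduces to strict greedy verification, so the threshold directly controls how much rollback is traded for fidelity to the target's argmax.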

replace A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning

Authors: Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel

Abstract: We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).

replace Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

Authors: Bret Nestor, Bohan Yao, Jasmine Moore, Jasper Kanes

Abstract: This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based presence or absence classifiers outperform state-of-the-art classifiers on 3 of 4 expert-annotated datasets in terms of accuracy and energy efficiency. The fleet of WHISPER detection models ranges from 0.58 (0.48-0.67) AUROC with WHISPER-tiny to 0.77 (0.63-0.93) with WHISPER-large-v3. Our multiclass species classifier obtains a top-1 accuracy of 53.2\% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 33.6\% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. The curation yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.

replace Knowledge Integration in Differentiable Models: A Comparative Study of Data-Driven, Soft-Constrained, and Hard-Constrained Paradigms for Identification and Control of the Single Machine Infinite Bus System

Authors: Shinhoo Kang, Sangwook Kim, Sehyun Yun

Abstract: Integrating domain knowledge into neural networks is a central challenge in scientific machine learning. Three paradigms have emerged -- data-driven (Neural Ordinary Differential Equations, NODEs), soft-constrained (Physics-Informed Neural Networks, PINNs), and hard-constrained (Differentiable Programming, DP) -- each encoding physical knowledge at different levels of structural commitment. However, how these strategies impact not only predictive accuracy but also downstream tasks such as control synthesis remains insufficiently understood. This paper presents a comparative study of NODEs, PINNs, and DP for dynamical system modeling, using the Single Machine Infinite Bus power system as a benchmark. We evaluate these paradigms across three tasks: trajectory prediction, parameter identification, and Linear Quadratic Regulator control synthesis. Our results yield three principal findings. First, knowledge representation determines generalization: NODE, which learns the system operator, enables robust extrapolation, whereas PINN, which approximates a solution map, restricts generalization to the training horizon. Second, hard-constrained formulations (DP) reduce learning to a low-dimensional physical parameter space, achieving faster and more reliable convergence than soft-constrained approaches. Third, knowledge fidelity propagates to control performance: DP produces controllers that closely match those obtained from true system parameters, while NODE provides a viable data-driven alternative by recovering control-relevant Jacobians with $3-4\%$ relative error and yielding LQR gains within $0.36\%$ of the ground truth. Based on these findings, we propose a practical decision framework for selecting knowledge integration strategies in neural modeling of dynamical systems.

replace Binary Flow Matching: Prediction-Loss Space Alignment for Robust Learning

Authors: Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang

Abstract: Flow matching has emerged as a powerful framework for generative modeling, with recent empirical successes highlighting the effectiveness of signal-space prediction ($x$-prediction). In this work, we investigate the transfer of this paradigm to binary manifolds, a fundamental setting for generative modeling of discrete data. While $x$-prediction remains effective, we identify a latent structural mismatch that arises when it is coupled with velocity-based objectives ($v$-loss), leading to a time-dependent singular weighting that amplifies gradient sensitivity to approximation errors. Motivated by this observation, we formalize prediction-loss alignment as a necessary condition for flow matching training. We prove that re-aligning the objective to the signal space ($x$-loss) eliminates the singular weighting, yielding uniformly bounded gradients and enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Finally, with alignment secured, we examine design choices specific to binary data, revealing a topology-dependent distinction between probabilistic objectives (e.g., cross-entropy) and geometric losses (e.g., mean squared error). Together, these results provide theoretical foundations and practical guidelines for robust flow matching on binary -- and related discrete -- domains, positioning signal-space alignment as a key principle for robust diffusion learning.
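
The singular weighting described above can be made concrete with a short derivation. The sketch assumes the common linear interpolant between noise $x_0$ and data $x_1$; the paper's exact path and time convention may differ, and the symbols $\hat{x}_\theta$ and $\hat{v}_\theta$ are notation introduced here.

```latex
% Assumed linear path: x_t = (1 - t) x_0 + t x_1, with target velocity v = x_1 - x_0.
% Eliminating x_0 = (x_t - t x_1) / (1 - t) gives
v \;=\; \frac{x_1 - x_t}{1 - t}.
% An x-prediction network \hat{x}_\theta(x_t, t) therefore induces the velocity
\hat{v}_\theta \;=\; \frac{\hat{x}_\theta(x_t, t) - x_t}{1 - t},
% so the v-loss becomes a reweighted x-loss:
\bigl\lVert \hat{v}_\theta - v \bigr\rVert^2
  \;=\; \frac{1}{(1 - t)^2}\,\bigl\lVert \hat{x}_\theta(x_t, t) - x_1 \bigr\rVert^2 .
% The weight 1 / (1 - t)^2 diverges as t -> 1, whereas the x-loss
% \lVert \hat{x}_\theta(x_t, t) - x_1 \rVert^2 carries a constant weight for all t.
```

Under this assumed path, a fixed approximation error in $\hat{x}_\theta$ is amplified without bound near $t = 1$ when training with the $v$-loss, which matches the gradient-sensitivity issue the abstract attributes to the mismatch.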

replace Predicting integers from continuous parameters

Authors: Bas Maat, Peter Bloem

Abstract: We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. Examples include the number of up-votes on a social media post and the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which raises the question of whether such integer labels can be modeled directly by a discrete distribution whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
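
The Bitwise construction described above can be sketched as a log-likelihood over independent Bernoulli bits. The sketch assumes non-negative integers below 2^k, least-significant-bit-first ordering, and one real-valued logit per bit; the function names are illustrative, and the paper's parameterization may differ in these details.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bitwise_log_prob(y, logits):
    """Log-probability of integer y under independent Bernoulli bits.

    logits[j] is the (continuous) logit that bit j of y equals 1,
    with j = 0 the least significant bit (an assumed convention).
    """
    k = len(logits)
    if not 0 <= y < 2 ** k:
        raise ValueError("y must lie in [0, 2**k)")
    total = 0.0
    for j, z in enumerate(logits):
        bit = (y >> j) & 1
        # log p(bit = 1) = log sigmoid(z); log p(bit = 0) = log sigmoid(-z).
        total += log_sigmoid(z) if bit else log_sigmoid(-z)
    return total
```

Because the logits are unconstrained real numbers, the negative of this log-probability can serve directly as a training loss for a network with k outputs, which is the continuity requirement the abstract mentions.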

replace TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees

Authors: Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low

Abstract: We revisit the use of probabilistic values, which include the well-known Shapley and Banzhaf values, to rank features for explaining the local predicted values of decision trees. The quality of feature rankings is typically assessed with the insertion and deletion metrics. Empirically, we observe that co-optimizing these two metrics is closely related to a joint optimization that selects a subset of features to maximize the local predicted value while minimizing it for the complement. However, we theoretically show that probabilistic values are generally unreliable for solving this joint optimization. Therefore, we explore deriving feature rankings by directly optimizing the joint objective. As the backbone, we propose TreeGrad, which computes the gradients of the multilinear extension of the joint objective in $O(L)$ time for decision trees with $L$ leaves; these gradients include weighted Banzhaf values. Building upon TreeGrad, we introduce TreeGrad-Ranker, which aggregates the gradients while optimizing the joint objective to produce feature rankings, and TreeGrad-Shap, a numerically stable algorithm for computing Beta Shapley values with integral parameters. In particular, the feature scores computed by TreeGrad-Ranker satisfy all the axioms uniquely characterizing probabilistic values, except for linearity, which itself leads to the established unreliability. Empirically, we demonstrate that the numerical error of Linear TreeShap can be up to $10^{15}$ times larger than that of TreeGrad-Shap when computing the Shapley value. As a by-product, we also develop TreeProb, which generalizes Linear TreeShap to support all probabilistic values. In our experiments, TreeGrad-Ranker performs significantly better on both insertion and deletion metrics. Our code is available at https://github.com/watml/TreeGrad.

URLs: https://github.com/watml/TreeGrad.

replace Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

Authors: Zhimin Zhao

Abstract: This paper offers a new perspective on the limits of machine learning: the ceiling on progress is set not by model size or algorithm choice but by the information structure of the task itself. Code generation has progressed more reliably than reinforcement learning, largely because code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that diagnosing a task's position in this hierarchy is more predictive of scaling outcomes than any property of the model. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.

replace TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models

Authors: Nicolas Zumarraga, Thomas Kaar, Ning Wang, Maxwell A. Xu, Max Rosenblattl, Markus Kreft, Kevin O'Sullivan, Paul Schmiedmayer, Patrick Langer, Robert Jakob

Abstract: Time Series Language Models (TSLMs) are emerging as unified models for reasoning over continuous signals in natural language. However, long-context retrieval remains a major limitation: existing models are typically trained and evaluated on short sequences, while real-world time-series sensor streams can span millions of datapoints. This mismatch requires precise temporal localization under strict computational constraints, a regime that is not captured by current benchmarks. We introduce TS-Haystack, a long-context temporal retrieval benchmark comprising ten task types across four categories: direct retrieval, temporal reasoning, multi-step reasoning and contextual anomaly. The benchmark uses controlled needle insertion by embedding short activity bouts into longer longitudinal accelerometer recordings, enabling systematic evaluation across context lengths ranging from seconds to 2 hours per sample. We hypothesize that existing TSLM time series encoders overlook temporal granularity as context length increases, creating a task-dependent effect: compression aids classification but impairs retrieval of localized events. Across multiple models and encoding strategies, we observe a consistent divergence between classification and retrieval behavior. Learned latent compression preserves or improves classification accuracy at compression ratios up to 176$\times$, but retrieval performance degrades with context length, incurring a loss of temporally localized information. These results highlight the importance of architectural designs that decouple sequence length from computational complexity while preserving temporal fidelity.

replace LLM-as-Judge on a Budget

Authors: Aadirupa Saha, Aniket Wagde, Branislav Kveton

Abstract: LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how to optimally allocate queries across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$ with near-optimal budget allocation, where $\sigma_i^2$ is the unknown score variance of pair $i \in [K]$. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
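
To make the stated error rate concrete, the following is a minimal sketch of an oracle allocation that assumes the variances are already known, which the abstract's method does not assume (it estimates them online). Giving pair $i$ a number of queries proportional to $\sigma_i^2$ equalizes the per-pair standard errors at roughly $\sqrt{\sum_i \sigma_i^2 / B}$. The function names and the example variances are hypothetical, chosen only for illustration.

```python
import math

def variance_proportional_allocation(variances, budget):
    """Oracle split of `budget` queries, proportional to each pair's score variance.

    Assumption: variances are known in advance. A practical method would have
    to estimate them from the judge's responses while allocating.
    """
    total = sum(variances)
    if total == 0:
        # No noise anywhere: a uniform split is as good as any other.
        return [budget // len(variances)] * len(variances)
    # Floor keeps the total within budget; every pair gets at least one query.
    return [max(1, math.floor(budget * v / total)) for v in variances]

def worst_case_std_error(variances, allocation):
    """Largest standard error of the mean across pairs: max_i sigma_i / sqrt(n_i)."""
    return max(math.sqrt(v / n) for v, n in zip(variances, allocation))

# Hypothetical score variances for four prompt-response pairs.
variances = [0.04, 0.25, 1.0, 0.01]
budget = 1300

adaptive = variance_proportional_allocation(variances, budget)
uniform = [budget // len(variances)] * len(variances)

adaptive_error = worst_case_std_error(variances, adaptive)
uniform_error = worst_case_std_error(variances, uniform)
print(adaptive, adaptive_error, uniform_error)
```

In this toy setting the uniform split is dominated by the noisiest pair, while the proportional split spends most of the budget on it.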

replace AdvSynGNN: Structure-Adaptive Graph Neural Nets via Adversarial Synthesis and Self-Corrective Propagation

Authors: Rong Fu, Muge Qi, Chunlei Meng, Shuo Yin, Kun Liu, Zhaolu Kang, Simon Fong

Abstract: Graph neural networks frequently encounter significant performance degradation when confronted with structural noise or non-homophilous topologies. To address these systemic vulnerabilities, we present AdvSynGNN, a comprehensive architecture designed for resilient node-level representation learning. The proposed framework orchestrates multi-resolution structural synthesis alongside contrastive objectives to establish geometry-sensitive initializations. We develop a transformer backbone that adaptively accommodates heterophily by modulating attention mechanisms through learned topological signals. Central to our contribution is an integrated adversarial propagation engine, where a generative component identifies potential connectivity alterations while a discriminator enforces global coherence. Furthermore, label refinement is achieved through a residual correction scheme guided by per-node confidence metrics, which facilitates precise control over iterative stability. Empirical evaluations demonstrate that this synergistic approach effectively optimizes predictive accuracy across diverse graph distributions while maintaining computational efficiency. The study concludes with practical implementation protocols to ensure the robust deployment of the AdvSynGNN system in large-scale environments.

replace SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework

Authors: Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong

Abstract: Comparative analysis of adaptive immune repertoires at population scale is hampered by two practical bottlenecks: the near-quadratic cost of pairwise affinity evaluations and dataset imbalances that obscure clinically important minority clonotypes. We introduce SubQuad, an end-to-end pipeline that addresses these challenges by combining antigen-aware, near-subquadratic retrieval with GPU-accelerated affinity kernels, learned multimodal fusion, and fairness-constrained clustering. The system employs compact MinHash prefiltering to sharply reduce candidate comparisons, a differentiable gating module that adaptively weights complementary alignment and embedding channels on a per-pair basis, and an automated calibration routine that enforces proportional representation of rare antigen-specific subgroups. On large viral and tumor repertoires SubQuad achieves measured gains in throughput and peak memory usage while preserving or improving recall@k, cluster purity, and subgroup equity. By co-designing indexing, similarity fusion, and equity-aware objectives, SubQuad offers a scalable, bias-aware platform for repertoire mining and downstream translational tasks such as vaccine target prioritization and biomarker discovery.

replace MDP Planning as Policy Inference

Authors: David Tolpin

Abstract: We cast episodic Markov decision process (MDP) planning as Bayesian inference over policies. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy through a Thompson-sampling interpretation rather than entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.

replace Transformers for dynamical systems learn transfer operators in-context

Authors: Anthony Bao, Jeffrey Lai, William Gilpin

Abstract: Large-scale foundation models for scientific machine learning adapt to physical settings unseen during training, such as zero-shot transfer between turbulent scales. This phenomenon, in-context learning, challenges conventional understanding of learning and adaptation in physical systems. Here, we study in-context learning of dynamical systems in a minimal setting: we train a small two-layer, single-head transformer to forecast one dynamical system, and then evaluate its ability to forecast a different dynamical system without retraining. We discover an early tradeoff in training between in-distribution and out-of-distribution performance, which manifests as a secondary double descent phenomenon. We discover that attention-based models apply a transfer-operator forecasting strategy in-context. They (1) lift low-dimensional time series using delay embedding, to detect the system's higher-dimensional dynamical manifold, and (2) identify and forecast long-lived invariant sets that characterize the global flow on this manifold. Our results clarify the mechanism enabling large pretrained models to forecast unseen physical systems at test time without retraining, and they illustrate the unique ability of attention-based models to leverage global attractor information in service of short-term forecasts.
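
The delay embedding mentioned in step (1) can be illustrated with a short sketch. This shows the standard time-delay construction, not the mechanism the trained transformer implements internally; the function name `delay_embed` and its parameters are hypothetical.

```python
import numpy as np

def delay_embed(series, dim, lag=1):
    """Stack lagged copies of a scalar series into `dim`-dimensional vectors.

    Row t of the result is (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).
    """
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series too short for this dim and lag")
    # Column j holds the series shifted forward by j * lag samples.
    columns = [series[j * lag : j * lag + n_rows] for j in range(dim)]
    return np.stack(columns, axis=1)

x = np.arange(6.0)                       # toy series 0, 1, ..., 5
embedded = delay_embed(x, dim=3, lag=2)  # two rows of three lagged coordinates
print(embedded)
```

Each row is a point in the lifted space, so a one-dimensional observable becomes a trajectory in `dim` dimensions.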

replace Tackling multiphysics problems via finite element-guided physics-informed operator learning

Authors: Yusuke Yamazaki, Reza Najian Asl, Markus Apel, Mayu Muramatsu, Shahed Rezaei

Abstract: This work presents a finite element-guided physics-informed learning framework for multiphysics problems with coupled partial differential equations (PDEs) on arbitrary domains. The proposed framework learns an operator from the input space to the solution space with a weighted residual formulation based on the finite element method, enabling discretization-independent prediction beyond the training resolution without relying on labeled simulation data. The present framework for multiphysics problems is implemented in Folax, a JAX-based operator learning platform, and is verified on nonlinear coupled thermo-mechanical problems. Two- and three-dimensional representative volume elements with varying heterogeneous microstructures, and a close-to-reality industrial casting example under varying boundary conditions are investigated as the example problems. We investigate the potential of several neural operators combined with the proposed finite element-guided approach, including Fourier neural operators (FNOs), deep operator networks (DeepONets), and a newly proposed implicit finite operator learning (iFOL) approach based on conditional neural fields. The results demonstrate that FNOs yield highly accurate solution operators on regular domains, where the global features can be efficiently learned in the spectral domain, and iFOL offers efficient parametric operator learning capabilities for complex and irregular geometries. Furthermore, studies on training strategies, network decomposition, and training sample quality reveal that a monolithic training strategy using a single network is sufficient for accurate predictions, while training sample quality strongly influences performance. Overall, the present approach highlights the potential of physics-informed operator learning with a finite element-based loss as a unified and scalable approach for coupled multiphysics problems.

replace Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

Authors: Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian

Abstract: LLM-based agents for machine learning engineering (MLE) predominantly rely on tree search, a form of gradient-free optimization that uses scalar validation scores to rank candidates. As LLM reasoning capabilities improve, exhaustive enumeration becomes increasingly inefficient compared to directed updates, analogous to how accurate gradients enable efficient descent over random search. We introduce Gome, an MLE agent that operationalizes gradient-based optimization. Gome maps structured diagnostic reasoning to gradient computation, success memory to momentum, and multi-trace execution to distributed optimization. Under a closed-world protocol that isolates architectural effects from external knowledge, Gome achieves a state-of-the-art 35.1\% any-medal rate on MLE-Bench with a restricted 12-hour budget on a single V100 GPU. Scaling experiments across 10 models reveal a critical crossover: with weaker models, tree search retains advantages by compensating for unreliable reasoning through exhaustive exploration; as reasoning capability strengthens, gradient-based optimization progressively outperforms, with the gap widening at frontier-tier models. Given the rapid advancement of reasoning-oriented LLMs, this positions gradient-based optimization as an increasingly favorable paradigm. We release our codebase and GPT-5 traces at https://github.com/microsoft/RD-Agent.

URLs: https://github.com/microsoft/RD-Agent.

replace Latent attention on masked patches for flow reconstruction

Authors: Ben Eze, Luca Magri, Andrea N\'ovoa

Abstract: Vision transformers have shown outstanding performance in image generation, yet their adoption in fluid dynamics remains limited. We introduce the Latent Attention on Masked Patches (LAMP) model, an interpretable regression-based modified vision transformer designed for masked flow reconstruction. LAMP follows a three-fold strategy: (i) partition of each flow snapshot into patches, (ii) patch-wise dimensionality reduction via proper orthogonal decomposition, and (iii) reconstruction of the full field from a masked input using a single-layer transformer trained via closed-form linear regression. We test the method on two canonical 2D unsteady wakes: a laminar wake past a bluff body, and a chaotic wake past two cylinders. On the laminar case, LAMP accurately reconstructs the full flow field from a 90%-masked and noisy input, across signal-to-noise ratios between 10 and 30 dB. Further, the learned attention matrix yields interpretable multi-fidelity optimal sensor-placement maps. LAMP's performance on the chaotic wake is limited, but it still outperforms other regression methods such as gappy POD. The modularity of the framework, however, naturally accommodates nonlinear compression and deep attention blocks, thereby providing an efficient baseline for nonlinear, high-dimensional masked flow reconstruction.

replace Design Experiments to Compare Multi-armed Bandit Algorithms

Authors: Huiling Meng, Ningyuan Chen, Xuefeng Gao

Abstract: Online platforms routinely compare multi-armed bandit algorithms, such as UCB and Thompson Sampling, to select the best-performing policy. Unlike standard A/B tests for static treatments, each run of a bandit algorithm over $T$ users produces only one trajectory, because the algorithm's decisions depend on all past interactions. Reliable inference therefore demands many independent restarts of the algorithm, making experimentation costly and delaying deployment decisions. We propose Artificial Replay (AR) as a new experimental design for this problem. AR first runs one policy and records its trajectory. When the second policy is executed, it reuses a recorded reward whenever it selects an action the first policy already took, and queries the real environment only otherwise. We develop a new analytical framework for this design and prove three key properties of the resulting estimator: it is unbiased; it requires only $T + o(T)$ user interactions instead of $2T$ for a run of the treatment and control policies, nearly halving the experimental cost when both policies have sub-linear regret; and its variance grows sub-linearly in $T$, whereas the estimator from a na\"ive design has a linearly-growing variance. Numerical experiments with UCB, Thompson Sampling, and $\epsilon$-greedy policies confirm these theoretical gains.
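
The replay rule described above can be sketched in a few lines. This is one plausible reading of the abstract, assuming recorded rewards are kept in a per-arm first-in-first-out buffer and each is reused at most once; the environment, the two toy policies, and all names are hypothetical and are not the UCB, Thompson Sampling, or $\epsilon$-greedy policies used in the paper's experiments.

```python
import random
from collections import defaultdict, deque

class BernoulliEnv:
    """Toy environment that counts how many real user interactions occur."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)
        self.queries = 0

    def pull(self, arm):
        self.queries += 1
        return 1.0 if self.rng.random() < self.means[arm] else 0.0

def run_policy(choose_arm, env, horizon, replay_buffer=None):
    """Run a policy for `horizon` steps, reusing buffered rewards when available."""
    history = []               # (arm, reward) pairs seen by this policy
    log = defaultdict(deque)   # rewards recorded per arm, for later replay
    for _ in range(horizon):
        arm = choose_arm(history)
        if replay_buffer is not None and replay_buffer[arm]:
            reward = replay_buffer[arm].popleft()  # reuse a recorded reward
        else:
            reward = env.pull(arm)                 # query the real environment
        history.append((arm, reward))
        log[arm].append(reward)
    return history, log

def round_robin(n_arms):
    return lambda history: len(history) % n_arms

def greedy(n_arms):
    def choose(history):
        if len(history) < n_arms:
            return len(history)  # try each arm once before exploiting
        sums, counts = [0.0] * n_arms, [0] * n_arms
        for arm, reward in history:
            sums[arm] += reward
            counts[arm] += 1
        return max(range(n_arms), key=lambda a: sums[a] / counts[a])
    return choose

horizon = 200
env = BernoulliEnv(means=[0.3, 0.7])
_, recorded = run_policy(round_robin(2), env, horizon)          # first policy
_, _ = run_policy(greedy(2), env, horizon, replay_buffer=recorded)  # second policy
print(env.queries, "real interactions instead of", 2 * horizon)
```

The saving depends on how much the two policies' action counts overlap, which is why the abstract ties the $T + o(T)$ cost to both policies having sub-linear regret.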

replace EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

Authors: Zongfang Liu, Shengkun Tang, Boyang Sun, Zhiqiang Shen, Xin Yuan

Abstract: Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains constrained by memory footprint and throughput because the full expert pool must still be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the layer-wise allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce \textbf{E}xpected \textbf{S}peculative \textbf{A}cceptance \textbf{P}roxy (\textbf{ESAP}), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model without costly autoregressive decoding. ESAP is bounded and stable, enabling cheap comparison of many candidates. Building on ESAP, we propose EvoESAP, an evolutionary search framework that finds an improved non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method for criteria such as Frequency, EAN, SEER, and REAP. Across 7B--30B SMoE LLMs at 25\% and 50\% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to \textbf{+19.6\%} on MATH-500 at 50\% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity. Code is available at https://github.com/ZongfangLiu/EvoESAP.

URLs: https://github.com/ZongfangLiu/EvoESAP.

replace Physics-informed AI Accelerated Retention Analysis of Ferroelectric Vertical NAND: From Day-Scale TCAD to Second-Scale Surrogate Model

Authors: Gyujun Jeong (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Sungwon Cho (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Minji Shon (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Namhoon Kim (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Woohyun Hwang (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Kwangyou Seo (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Suhwan Lim (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Wanki Kim (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Daewon Ha (Semiconductor Research and Development, Samsung Electronics Co., Ltd, South Korea), Prasanna Venkatesan (NVIDIA, Santa Clara, CA, USA), Kihang Youn (NVIDIA, Santa Clara, CA, USA), Ram Cherukuri (NVIDIA, Santa Clara, CA, USA), Yiyi Wang (NVIDIA, Santa Clara, CA, USA), Suman Datta (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Asif Khan (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA), Shimeng Yu (School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA)

Abstract: Ferroelectric field-effect transistor (FeFET)-based vertical NAND (Fe-VNAND) has emerged as a promising candidate to overcome z-scaling limitations with lower programming voltages. However, the data retention of 3D Fe-VNAND is hindered by the complex interaction between charge detrapping and ferroelectric depolarization. Developing optimized device designs requires exploring an extensive parameter space, but the high computational cost of conventional Technology Computer-Aided Design (TCAD) tools makes such wide-scale optimization impractical. To overcome these simulation barriers, we present a Physics-Informed Neural Operator (PINO)-based AI surrogate model designed for high-efficiency prediction of threshold voltage (Vth) shifts and retention behavior. By embedding fundamental physical principles into the learning architecture, our PINO framework achieves a speedup exceeding 10000x compared to TCAD while maintaining physical accuracy. This study demonstrates the model's effectiveness on a single FeFET configuration, serving as a pathway toward modeling the retention loss mechanisms.

replace Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning

Authors: Zhen Zhang, Jielei Chu, Tianrui Li

Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. However, such task-specific features learned from the new task may collide with the old features. From a causal perspective, spurious feature correlations are the main cause of this collision, manifesting in two scopes: (i) guided by empirical risk minimization (ERM), intra-task spurious correlations cause task-specific features to rely on shortcut features. These non-robust features are vulnerable to interference, inevitably drifting into the feature space of other tasks; (ii) inter-task spurious correlations induce semantic confusion between visually similar classes across tasks. To address this, we propose a Probability of Necessity and Sufficiency (PNS)-based regularization method to guide feature expansion in CIL. Specifically, we first extend the definition of PNS to expansion-based CIL, termed CPNS, which quantifies both the causal completeness of intra-task representations and the separability of inter-task representations. We then introduce a dual-scope counterfactual generator based on twin networks to ensure the measurement of CPNS, which simultaneously generates: (i) intra-task counterfactual features to minimize intra-task PNS risk and ensure causal completeness of task-specific features, and (ii) inter-task interfering features to minimize inter-task PNS risk, ensuring the separability of inter-task representations. Theoretical analyses confirm its reliability. The regularization is a plug-and-play method for expansion-based CIL to mitigate feature collision. Extensive experiments demonstrate the effectiveness of the proposed method.

replace Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

Authors: Benjamin Gess, Daniel Heydecker

Abstract: Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.

replace GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick

Abstract: Adapting transformer positional encodings to graphs and meshes faces a fundamental tension: exact spectral methods require cubic-complexity eigendecomposition and inadvertently break gauge invariance through numerical solver artifacts, while existing efficient approximations sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization loss in inductive settings, where models fail when encountering different spectral decompositions of similar graphs or different discretizations of the same domain. We propose GIST (Gauge-Invariant Spectral Transformer), a graph transformer resolving this tension by restricting attention to pairwise inner products of efficient approximate spectral embeddings. We prove these inner products estimate an exactly gauge-invariant graph kernel at end-to-end $\mathcal{O}(N)$ complexity, and establish a formal connection between gauge invariance and discretization invariance: gauge invariance guarantees discretization-invariant learning with bounded mismatch error, making GIST the first scalable graph transformer with provable neural operator guarantees. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50\% micro-F1 on PPI) while uniquely scaling to mesh-based neural operator benchmarks with up to 750K nodes, achieving state-of-the-art on the AirfRANS, ShapeNet-Car, DrivAerNet, and DrivAerNet++ benchmarks.
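
The reason pairwise inner products are a natural gauge-invariant quantity can be checked numerically. The sketch below uses a random matrix as a stand-in for a spectral embedding and a random orthogonal matrix as the gauge transformation; it illustrates only the linear-algebra fact that $\Phi Q (\Phi Q)^\top = \Phi \Phi^\top$ for orthogonal $Q$, not GIST's approximate embeddings or its attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, k = 6, 3

# Stand-in for a k-dimensional spectral embedding of n_nodes graph nodes.
phi = rng.normal(size=(n_nodes, k))

# Random orthogonal matrix: a change of basis (sign flips, rotations) of the
# kind an eigensolver is free to return within an eigenspace.
q, _ = np.linalg.qr(rng.normal(size=(k, k)))
phi_rot = phi @ q

kernel = phi @ phi.T              # pairwise inner products
kernel_rot = phi_rot @ phi_rot.T  # same quantity after the gauge change

gauge_invariant = bool(np.allclose(kernel, kernel_rot))
coordinates_changed = not np.allclose(phi, phi_rot)
print(gauge_invariant, coordinates_changed)
```

The raw coordinates change under the transformation while the kernel of inner products does not, so a model that only sees the kernel cannot depend on the solver's arbitrary choice of basis.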

replace Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Authors: Gregory N. Frank

Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.

replace AIMER: Calibration-Free Task-Agnostic MoE Pruning

Authors: Zongfang Liu, Shengkun Tang, Yifan Shen, Huan Wang, Xin Yuan

Abstract: Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token compute, but deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (\textbf{A}bsolute mean over root mean square \textbf{IM}portance for \textbf{E}xpert \textbf{R}anking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25\% and 50\% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines with only 0.22--1.27 seconds for scoring the experts.
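
The name suggests a weight-only statistic, the ratio of the mean absolute value to the root mean square. The sketch below computes that ratio over a flattened weight array as an assumption about what such a score might look like. Which expert matrices are scored, how they are aggregated, and whether low or high scores are pruned are not stated in the abstract and are not modeled here; the function name and toy "experts" are hypothetical.

```python
import numpy as np

def abs_mean_over_rms(weights, eps=1e-12):
    """Ratio mean(|w|) / sqrt(mean(w^2)) over all entries of `weights`.

    By the Cauchy-Schwarz inequality the ratio lies in (0, 1]: close to 1 when
    all entries have similar magnitude, close to 0 when a few entries dominate.
    """
    w = np.asarray(weights, dtype=float).ravel()
    return float(np.mean(np.abs(w)) / (np.sqrt(np.mean(w ** 2)) + eps))

rng = np.random.default_rng(0)
# Toy stand-ins for expert weights; real experts hold several large matrices.
experts = {
    "dense_gaussian": rng.normal(size=(64, 64)),
    "spiky": np.concatenate([np.zeros(4095), [1.0]]),
    "constant_magnitude": np.sign(rng.normal(size=4096)),
}

scores = {name: abs_mean_over_rms(w) for name, w in experts.items()}
ranking = sorted(scores, key=scores.get)  # ascending by score
print(scores, ranking)
```

Because the score needs only the stored weights, it can be computed without any forward pass, which is consistent with the calibration-free framing and the sub-two-second scoring times quoted above.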

replace Mechanisms of Introspective Awareness

Authors: Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey

Abstract: Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream "gate" features that implement a default negative response. This circuit is absent in base models and robust to refusal ablation. Identification of injected concepts relies on largely distinct later-layer mechanisms that only weakly overlap with those involved in detection. Finally, we show that introspective capability is substantially underelicited: ablating refusal directions improves detection by +53%, and a trained bias vector improves it by +75% on held-out concepts, both without meaningfully increasing false positives. Our results suggest that this introspective awareness of injected concepts is robust and mechanistically nontrivial, and could be substantially amplified in future models. Code: https://github.com/safety-research/introspection-mechanisms.

URLs: https://github.com/safety-research/introspection-mechanisms.

replace Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

Authors: Anand Jerry George, Nicolas Macris

Abstract: We study the theoretical behavior of denoising score matching--the learning task associated to diffusion models--when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds, the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure start to diminish once we have a non-linear manifold. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.

replace Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

Authors: Peter Balogh

Abstract: We show that they do. Roger Schank's conceptual dependency theory proposed that all human events decompose into primitive operations -- ATRANS (transfer of possession), PTRANS (physical movement), MTRANS (information transfer), and others -- hand-coded from linguistic intuition. We ask: can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world-state pairs, the system searches for operator compositions explaining each event (wake), then extracts recurring patterns as library entries under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping to Schank's core: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators (e.g., "mail" = ATRANS composed with PTRANS) and novel emotional-state operators absent from Schank's taxonomy. We validate on synthetic events, ATOMIC (Sap et al., 2019), and GLUCOSE (Mostafazadeh et al., 2020). On synthetic data, the discovered library achieves MDL within 4% of Schank's hand-coded primitives at 100% coverage (vs. Schank's 81%). On ATOMIC, Schank covers only 10%; on GLUCOSE, 31%. The discovered library covers 100% of both, dominated by mental/emotional operators -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. Libraries discovered from one corpus transfer to the other with under 1 bit/event degradation despite different annotation schemes and domains, suggesting the operators are information-theoretically determined structure, not dataset artifacts.

replace Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Authors: Surendra Pathak, Bo Han

Abstract: Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the large number of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify key open problems to inspire future research directions in efficient multimodal systems.

replace Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

Authors: Ivan Sedykh, Nikita Sorokin, Valentin Malykh

Abstract: Recent advances in masked diffusion language models (MDLMs) narrow the quality gap to autoregressive LMs, but their sampling remains expensive because generation requires many full-sequence denoising passes with a large Transformer and, unlike autoregressive decoding, cannot benefit from KV caching. In this work, we exploit the flexibility of the diffusion framework and study model scheduling, where a smaller MDLM replaces the full model at a subset of denoising steps. Across models trained on OpenWebText and LM1B, we show that early and late denoising steps are substantially more robust to such replacement than middle steps, enabling up to a 17% reduction in FLOPs with only modest degradation in generative perplexity under both unconditional and prefix-conditional generation, while preserving sample diversity. We support these findings with a step-importance analysis based on loss and KL divergence between small and large models across timesteps, as well as an exhaustive search over coarse step segments, both of which identify the middle of the diffusion trajectory as most sensitive consistently across datasets. Our results suggest that simple, architecture-agnostic scheduling rules can significantly accelerate MDLM sampling while largely preserving generation quality.
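
A scheduling rule of the kind the abstract describes, with the small model at the early and late steps and the large model in the middle, can be sketched as follows. The step count, the fraction of steps handed to the small model, and the cost ratio are illustrative assumptions and are not values from the paper; the function names are hypothetical.

```python
def build_schedule(num_steps, small_fraction):
    """Assign 'small' to the first and last steps and 'large' to the middle ones."""
    n_small = round(num_steps * small_fraction)
    head = n_small // 2     # small-model steps at the start of the trajectory
    tail = n_small - head   # small-model steps at the end of the trajectory
    return [
        "small" if (i < head or i >= num_steps - tail) else "large"
        for i in range(num_steps)
    ]

def relative_flops(schedule, small_cost_ratio):
    """Sampling cost relative to running the large model at every step."""
    n_small = schedule.count("small")
    n_large = len(schedule) - n_small
    return (n_large + n_small * small_cost_ratio) / len(schedule)

# Illustrative numbers only: 100 steps, 25% of them on a model costing half as much.
schedule = build_schedule(num_steps=100, small_fraction=0.25)
cost = relative_flops(schedule, small_cost_ratio=0.5)
print(schedule[:15], schedule[-15:], cost)
```

The rule needs no access to model internals, only a per-step choice of which network to call, which is what makes it architecture-agnostic.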

replace Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability

Authors: Eric Gan

Abstract: Empirically, modern deep learning training often occurs at the Edge of Stability (EoS), where the sharpness of the loss exceeds the threshold below which classical convergence analysis applies. Despite recent progress, existing theoretical explanations of EoS either rely on restrictive assumptions or focus on specific squared-loss-type objectives. In this work, we introduce and study a structural property of loss functions that we term product-stability. We show that for losses with product-stable minima, gradient descent applied to objectives of the form $(x,y) \mapsto l(xy)$ can provably converge to the local minimum even when training in the EoS regime. This framework substantially generalizes prior results and applies to a broad class of losses, including binary cross entropy. Using bifurcation diagrams, we characterize the resulting training dynamics, explain the emergence of stable oscillations, and precisely quantify the sharpness at convergence. Together, our results offer a principled explanation for stable EoS training for a wider class of loss functions.
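For orientation, the objective class named in the abstract above can be written out explicitly. The following is a standard calculation, not a result quoted from the paper; the step size $\eta$ and the notation $f$, $\lambda_{\max}$ are assumptions introduced here.

```latex
\begin{aligned}
f(x,y) &= l(xy), \\
x_{t+1} &= x_t - \eta\, l'(x_t y_t)\, y_t, \\
y_{t+1} &= y_t - \eta\, l'(x_t y_t)\, x_t, \\
\nabla^2 f(x,y)\Big|_{l'(xy)=0} &= l''(xy)
\begin{pmatrix} y^2 & xy \\ xy & x^2 \end{pmatrix},
\qquad
\lambda_{\max} = l''(xy)\,\bigl(x^2 + y^2\bigr) \quad \text{if } l''(xy) > 0 .
\end{aligned}
```

At a minimizer the sharpness $\lambda_{\max}$ therefore depends on the norm $x^2 + y^2$ and not only on the product $xy$. Classical analysis requires $\lambda_{\max} < 2/\eta$; the EoS regime is the one in which this condition fails.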

replace Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms

Authors: Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann

Abstract: Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.

replace SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Authors: Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo

Abstract: Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.

replace Neural Processes Maintain Calibrated Biomass Estimates Across Spatiotemporal Gaps and Disturbance

Authors: Robin Young, Srinivasan Keshav

Abstract: Monitoring deforestation-driven carbon emissions requires both spatially explicit and temporally continuous estimates of aboveground biomass density (AGBD) with calibrated uncertainty. NASA's Global Ecosystem Dynamics Investigation (GEDI) provides reliable LIDAR-derived AGBD, but its orbital sampling causes irregular spatiotemporal coverage, and occasional operational interruptions, including a 13-month hibernation from March 2023 to April 2024, leave extended gaps in the observational record. Prior work has used machine learning approaches to fill GEDI's spatial gaps using satellite-derived features, but temporal interpolation of biomass through unobserved periods, particularly across active disturbance events, remains largely unaddressed. Moreover, standard ensemble methods for biomass mapping have been shown to produce systematically miscalibrated prediction intervals. To address these gaps, we extend the Attentive Neural Process (ANP) framework, previously applied to spatial biomass interpolation, to jointly sparse spatiotemporal settings using geospatial foundation model embeddings. We treat space and time symmetrically, empirically validating a form of space-for-time substitution in which observations from nearby locations at other times inform predictions at held-out periods. Our results demonstrate that the ANP produces well-calibrated uncertainty estimates across disturbance regimes, supporting its use in Measurement, Reporting, and Verification (MRV) applications that require reliable uncertainty quantification for forest carbon accounting.

replace MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Authors: Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang

Abstract: Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2$\%$, improves average zero-shot performance by 43.4$\%$, achieves over 2 $\times$ inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.

URLs: https://github.com/Kishon-zzx/MoBiE

replace Latent Structure of Affective Representations in Large Language Models

Authors: Benjamin J. Choi, Melanie Weber

Abstract: The geometric structure of latent representations in large language models (LLMs) is an active area of research, driven in part by its implications for model transparency and AI safety. Existing literature has focused mainly on general geometric and topological properties of the learnt representations, but due to a lack of ground-truth latent geometry, validating the findings of such approaches is challenging. Emotion processing provides an intriguing testbed for probing representational geometry, as emotions exhibit both categorical organization and continuous affective dimensions, which are well-established in the psychology literature. Moreover, understanding such representations carries safety relevance. In this work, we investigate the latent structure of affective representations in LLMs using geometric data analysis tools. We present three main findings. First, we show that LLMs learn coherent latent representations of affective emotions that align with widely used valence--arousal models from psychology. Second, we find that these representations exhibit nonlinear geometric structure that can nonetheless be well-approximated linearly, providing empirical support for the linear representation hypothesis commonly assumed in model transparency methods. Third, we demonstrate that the learned latent representation space can be leveraged to quantify uncertainty in emotion processing tasks. Our findings suggest that LLMs acquire affective representations with geometric structure paralleling established models of human emotion, with practical implications for model interpretability and safety.

replace Domain-Aware Hybrid Quantum Learning via Correlation-Guided Circuit Design for Crime Pattern Analytics

Authors: Niloy Das, Apurba Adhikary, Sheikh Salman Hassan, Yu Qiao, Zhu Han, Tharmalingam Ratnarajah, Choong Seon Hong

Abstract: Crime pattern analysis is critical for law enforcement and predictive policing, yet the surge in criminal activities from rapid urbanization creates high-dimensional, imbalanced datasets that challenge traditional classification methods. This study presents a quantum-classical comparison framework for crime analytics, evaluating four computational paradigms: quantum models, classical baseline machine learning models, and two hybrid quantum-classical architectures. Using 16-year crime statistics, we systematically assess classification performance and computational efficiency under rigorous cross-validation methods. Experimental results show that quantum-inspired approaches, particularly QAOA, achieve up to 84.6% accuracy, while requiring fewer trainable parameters than classical baselines, suggesting practical advantages for memory-constrained edge deployment. The proposed correlation-aware circuit design demonstrates the potential of incorporating domain-specific feature relationships into quantum models. Furthermore, hybrid approaches exhibit competitive training efficiency, making them suitable candidates for resource-constrained environments. The framework's low computational overhead and compact parameter footprint suggest potential advantages for wireless sensor network deployments in smart city surveillance systems, where distributed nodes perform localized crime analytics with minimal communication costs. Our findings provide a preliminary empirical assessment of quantum-enhanced machine learning for structured crime data and motivate further investigation with larger datasets and realistic quantum hardware considerations.

replace Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Authors: Yasong Fan

Abstract: We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze-Scan, a two-phase training strategy that freezes the recurrent scan and optimizes the cache jointly with embeddings, achieving PPL=64.9 on WikiText-103 in 44K steps -- a 7.5x improvement over full fine-tuning (PPL=487). On Multi-Query Associative Recall (MQAR), FDM achieves 0.966 accuracy, surpassing Transformer (0.606) by 59.5%, while pure scan without cache scores only 0.011, confirming the necessity of the particle component. Finally, we introduce Holographic Reference Beam Decoding, interpreting the complex hidden state h_t as a holographic plate encoding the entire temporal history. Using the current input x_t as a reference beam to modulate h_t reduces PPL by up to 2.13 points (PPL=62.79) with a 4-head orthogonal reference beam using only 1.3M additional parameters, providing empirical support for the holographic interpretation. Code and pretrained weights: https://github.com/YasongFan/FDM

URLs: https://github.com/YasongFan/FDM

replace A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

Authors: Hananel Hazan, Yanbo Zhang, Benedikt Hartl, Michael Levin

Abstract: How many of a neural network's parameters actually encode task-specific information? We investigate this question with LottaLoRA, a training paradigm in which every backbone weight is drawn at random and frozen; only low-rank LoRA adapters are trained. Across nine benchmarks spanning diverse architecture families, from single-layer classifiers to 900M-parameter Transformers, low-rank adapters over frozen random backbones recover 96-100% of fully trained performance while training only 0.5-40% of the parameters. The task-specific signal therefore occupies a subspace orders of magnitude smaller than the full parameter count suggests. Three mechanistic findings underpin this result: (1) the frozen backbone is actively exploited when static (the learned scaling $\beta$ remains strictly positive across all architectures), but when the scaffold is destabilized, the optimizer silences it and the LoRA factors absorb all task information; (2) the frozen backbone is preferable but interchangeable: any random initialization works equally well, provided it remains fixed throughout training; and (3) the minimum LoRA rank at which performance saturates estimates the intrinsic dimensionality of the task, reminiscent of the number of components retained in Principal Component Analysis (PCA). The construction is formally analogous to Reservoir Computing unfolded along the depth axis of a feedforward network. Because the backbone is determined by a random seed alone, models can be distributed as adapters plus seed, a footprint that grows with task complexity, not model size, so that storage and memory savings compound as architectures scale.
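A minimal NumPy sketch of the construction described in the abstract above: a backbone regenerated from a seed and kept frozen, plus a trainable low-rank adapter. The function names, the initialization scales, the placement of the scaling `beta` on the backbone term, and the way the trainable fraction is counted are all assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np


def backbone_from_seed(seed: int, d_in: int, d_out: int) -> np.ndarray:
    """Regenerate the frozen random backbone weight from its seed alone."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)


def init_adapter(d_in: int, d_out: int, rank: int, seed: int = 0):
    """Trainable parts (assumed layout): low-rank factors A, B and a scalar beta."""
    rng = np.random.default_rng(seed)
    A = np.zeros((d_in, rank))                      # zero init: adapter starts silent
    B = rng.standard_normal((rank, d_out)) * 0.01
    beta = 1.0                                      # scaling on the frozen backbone
    return A, B, beta


def forward(x, W, A, B, beta):
    """Frozen backbone path plus low-rank adapter path."""
    return beta * (x @ W) + (x @ A) @ B


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of trainable parameters among all parameters of this one layer."""
    trainable = rank * (d_in + d_out) + 1           # A, B, and beta
    frozen = d_in * d_out
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    W = backbone_from_seed(seed=42, d_in=512, d_out=512)
    A, B, beta = init_adapter(512, 512, rank=4)
    x = np.ones((2, 512))
    print(forward(x, W, A, B, beta).shape)
    print(trainable_fraction(512, 512, rank=4))
```

Because `backbone_from_seed` is deterministic, only the seed, `A`, `B`, and `beta` would need to be stored in this toy setting, which mirrors the "adapters plus seed" distribution idea in the abstract.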

replace Toward World Models for Epidemiology

Authors: Zeeshan Memon, Yiqi Su, Christo Kurisummoottil Thomas, Walid Saad, Liang Zhao, Naren Ramakrishnan

Abstract: World models have emerged as a unifying paradigm for learning latent dynamics, simulating counterfactual futures, and supporting planning under uncertainty. In this paper, we argue that computational epidemiology is a natural and underdeveloped setting for world models. This is because epidemic decision-making requires reasoning about latent disease burden, imperfect and policy-dependent surveillance signals, and intervention effects mediated by adaptive human behavior. We introduce a conceptual framework for epidemiological world models, formulating epidemics as controlled, partially observed dynamical systems in which (i) the true epidemic state is latent, (ii) observations are noisy and endogenous to policy, and (iii) interventions act as sequential actions whose effects propagate through behavioral and social feedback. We present three case studies that illustrate why explicit world modeling is necessary for policy-relevant reasoning: strategic misreporting in behavioral surveillance, systematic delays in time-lagged signals such as hospitalizations and deaths, and counterfactual intervention analysis where identical histories diverge under alternative action sequences.

replace-cross Privacy Against Agnostic Inference Attacks in Vertical Federated Learning

Authors: Morteza Varasteh

Abstract: A novel form of inference attack in vertical federated learning (VFL) is proposed, where two parties collaborate in training a machine learning (ML) model. Logistic regression is considered for the VFL model. One party, referred to as the active party, possesses the ground truth labels of the samples in the training phase, while the other, referred to as the passive party, only shares a separate set of features corresponding to these samples. It is shown that the active party can carry out inference attacks on both training and prediction phase samples by acquiring an ML model independently trained on the training samples available to them. This type of inference attack does not require the active party to be aware of the score of a specific sample, hence it is referred to as an agnostic inference attack. It is shown that utilizing the observed confidence scores during the prediction phase, before the time of the attack, can improve the performance of the active party's autonomous ML model, and thus improve the quality of the agnostic inference attack. As a countermeasure, privacy-preserving schemes (PPSs) are proposed. While the proposed schemes preserve the utility of the VFL model, they systematically distort the VFL parameters corresponding to the passive party's features. The level of the distortion imposed on the passive party's parameters is adjustable, giving rise to a trade-off between privacy of the passive party and interpretability of the VFL outcomes by the active party. The distortion level of the passive party's parameters could be chosen carefully according to the privacy and interpretability concerns of the passive and active parties, respectively, with the hope of keeping both parties (partially) satisfied. Finally, experimental results demonstrate the effectiveness of the proposed attack and the PPSs.

replace-cross SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions

Authors: Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge, Karl Pazdernik

Abstract: Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored in improving LLMs to align existing foundation models with scientific disciplines, concepts and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow multimodal instructions generated from scientific publications. To test our methodology, we train a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding. LLaMA-SciTune significantly outperforms the state-of-the-art models in the generated figure types and captions in SciCap and VisText benchmarks. In comparison to the models that are finetuned with synthetic data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark. Our results demonstrate that human-generated scientific multimodal instructions remain highly valuable in tuning LLMs to perform well on science tasks, despite their lower volume and relative scarcity compared to synthetic data. We publicly release the SciTune codebase at https://github.com/pnnl/scitune.

URLs: https://github.com/pnnl/scitune

replace-cross A Heavy-Load-Enhanced and Changeable-Periodicity-Perceived Workload Prediction Network

Authors: Feiyi Chen, Naijin Liu, Zhen Qin, Hailiang Zhao, Mengchu Zhou, Shuiguang Deng

Abstract: Cloud providers can greatly benefit from accurate workload prediction. However, the workload of cloud servers is highly variable, with occasional workload bursts, which makes workload prediction challenging. Time series forecasting methods that rely on periodicity information often assume a fixed and known periodicity length, which does not align with the periodicity-changeable nature of cloud service workloads. Although many state-of-the-art time-series forecasting methods do not rely on periodicity information and achieve high overall accuracy, they are vulnerable to data imbalance between heavy workloads and regular workloads. As a result, their prediction accuracy on rare heavy workloads is limited. Unfortunately, heavy-load prediction accuracy is more important than overall accuracy, as errors in heavy-load prediction are more likely to cause Service Level Agreement violations than errors in normal-load prediction. Thus, we propose a changeable-periodicity-perceived workload prediction network (PePNet) to fuse periodic information adaptively for periodicity-changeable time series and improve rare heavy workload prediction accuracy. It has two distinctive characteristics: (i) a Periodicity-Perceived Mechanism that detects the periodicity length automatically and fuses periodic information adaptively, which is suitable for periodicity-changeable time series, and (ii) an Achilles' Heel Loss Function that iteratively optimizes the most under-fitted part of the predicted sequence at each step, thus markedly improving the prediction accuracy of heavy load. Extensive experiments conducted on real-world datasets demonstrate that PePNet improves accuracy for overall workload by 11.8% on average, compared with state-of-the-art methods. In particular, PePNet improves accuracy for heavy workload by 21.0% on average.

replace-cross Detecting critical treatment effect bias in small subgroups

Authors: Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang

Abstract: Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.

replace-cross Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection

Authors: Jingbo Liu

Abstract: We propose a generalized debiased Lasso estimator based on a stability principle. When a single column of the design matrix is perturbed, the estimator admits a simple update formula that can be computed from the original solution. Under sub-Gaussian designs with well-conditioned covariance, this approximation is asymptotically accurate for all but a vanishing fraction of coordinates in the proportional growth regime. The proof relies on concentration and anti-concentration arguments to control error terms and sign changes. In contrast, establishing comparable distributional limits (e.g., Gaussianity) under similar assumptions remains open. As an application, we show that the approximation significantly reduces the computational cost of resampling-based variable selection procedures, including the conditional randomization test and a local knockoff filter.

replace-cross Training-Free Multi-User Generative Semantic Communications via Null-Space Diffusion Sampling

Authors: Eleonora Grassucci, Jinho Choi, Jihong Park, Riccardo F. Gramaccioni, Giordano Cicchetti, Danilo Comminiello

Abstract: In recent years, novel communication strategies have emerged to face the challenges that the increased number of connected devices and the higher quality of transmitted information are posing. Among them, semantic communication obtained promising results especially when combined with state-of-the-art deep generative models, such as large language or diffusion models, able to regenerate content from extremely compressed semantic information. However, most of these approaches focus on single-user scenarios processing the received content at the receiver on top of conventional communication systems. In this paper, we propose to go beyond these methods by developing a novel generative semantic communication framework tailored for multi-user scenarios. This system assigns the channel to users knowing that the lost information can be filled in with a diffusion model at the receivers. Under this innovative perspective, OFDMA systems should not aim to transmit the largest part of information, but solely the bits necessary to the generative model to semantically regenerate the missing ones. The thorough experimental evaluation shows the capabilities of the novel diffusion model and the effectiveness of the proposed framework, leading towards a GenAI-based next generation of communications.

replace-cross An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

Authors: Hengran Zhang, Keping Bi, Jiafeng Guo, Xueqi Cheng

Abstract: Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In retrieval-augmented generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components (relevance ranking derived from retrieval models, utility judgments, and answer generation) aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.

replace-cross LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

Authors: Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim

Abstract: Commonsense reasoning often requires both textual and visual knowledge, yet Large Language Models (LLMs) trained solely on text lack visual grounding (e.g., "what color is an emperor penguin's belly?"). Visual Language Models (VLMs) perform better on visually grounded tasks but face two limitations: (i) often reduced performance on text-only commonsense reasoning compared to text-trained LLMs, and (ii) adapting newly released LLMs to vision input typically requires costly multimodal training. An alternative augments LLMs with test-time visual signals, improving visual commonsense without harming textual reasoning, but prior designs often rely on early fusion and a single image, which can be suboptimal. We propose a late multi-image fusion method: multiple images are generated from the text prompt with a lightweight parallel sampling, and their prediction probabilities are combined with those of a text-only LLM through a late-fusion layer that integrates projected visual features just before the final prediction. Across visual commonsense and NLP benchmarks, our method significantly outperforms augmented LLMs on visual reasoning, matches VLMs on vision-based tasks, and, when applied to strong LLMs such as LLaMA 3, also improves NLP performance while adding only modest test-time overhead. Project page is available at: https://guyyariv.github.io/LaMI.

URLs: https://guyyariv.github.io/LaMI

replace-cross Improved identification of breakpoints in piecewise regression and its applications

Authors: Taehyeong Kim, Hyungu Lee, Myungjin Kim, Hayoung Choi

Abstract: Identifying breakpoints in piecewise regression is critical in enhancing the reliability and interpretability of data fitting. In this paper, we propose novel algorithms based on the greedy algorithm to accurately and efficiently identify breakpoints in piecewise polynomial regression. The algorithm updates the breakpoints to minimize the error by exploring the neighborhood of each breakpoint. It converges quickly and stably to optimal breakpoints. Moreover, it can determine the optimal number of breakpoints. The computational results for real and synthetic data show that its accuracy is better than that of existing methods. The real-world datasets demonstrate that the breakpoints found by the proposed algorithm provide valuable information about the data.
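The neighborhood-exploration step mentioned in the abstract above can be sketched as a simple local search. This is an illustrative greedy descent under stated assumptions, not the authors' algorithm: breakpoints are sample indices, each segment is fitted independently with `numpy.polyfit`, the neighborhood is one index to either side, and the names `segment_sse`, `total_sse`, and `refine_breakpoints` are hypothetical. Choosing the number of breakpoints is not covered.

```python
import numpy as np


def segment_sse(x, y, degree):
    """Sum of squared residuals of a least-squares polynomial fit on one segment."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return float(resid @ resid)


def total_sse(x, y, breaks, degree):
    """Total error when the data are split at the given index breakpoints."""
    edges = [0, *breaks, len(x)]
    return sum(segment_sse(x[a:b], y[a:b], degree)
               for a, b in zip(edges[:-1], edges[1:]))


def refine_breakpoints(x, y, breaks, degree=1, max_sweeps=100):
    """Greedy local search: shift each breakpoint by one index while the error drops."""
    breaks = list(breaks)
    min_len = degree + 2                      # keep every segment well determined
    best = total_sse(x, y, breaks, degree)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(breaks)):
            lo = (breaks[i - 1] if i > 0 else 0) + min_len
            hi = (breaks[i + 1] if i + 1 < len(breaks) else len(x)) - min_len
            for cand in (breaks[i] - 1, breaks[i] + 1):
                if not lo <= cand <= hi:
                    continue
                trial = breaks[:i] + [cand] + breaks[i + 1:]
                err = total_sse(x, y, trial, degree)
                if err < best:                # accept only strict improvements
                    breaks, best, improved = trial, err, True
        if not improved:
            break
    return breaks, best


if __name__ == "__main__":
    x = np.arange(40, dtype=float)
    y = np.where(x < 25, x, 25.0 - 2.0 * (x - 25.0))   # synthetic kink at x = 25
    print(refine_breakpoints(x, y, breaks=[20], degree=1))
```

Since moves are accepted only when the total error strictly decreases, the search terminates, but like any local search it can stop at a local minimum that depends on the initial breakpoints.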

replace-cross SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

Authors: Joseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy, Dorian Bagni, Yingze Wang, Thomas D. Bannister, Teresa Head-Gordon

Abstract: We show that large language models (LLMs) can be transformed via supervised fine-tuning (SFT) of engineered prompts into SmileyLlama for exploring the chemical space of drug molecules. We benchmark SmileyLlama against pre-trained LLMs and chemical language models (CLMs) trained from scratch for generating valid and novel drug-like molecules, and use direct preference optimization (DPO) to both improve SmileyLlama's adherence to a prompt and as part of the iMiner reinforcement learning framework to predict molecules with optimized 3D conformations and high binding affinity to drug targets. By training an LLM to speak directly as a CLM, while retaining most of its natural language capabilities, we show that we can reliably generate molecules with user-specified properties rather than acting only as a chatbot with knowledge of chemistry or as a virtual assistant. While SmileyLlama is geared toward drug discovery, the SFT/DPO/LLM framework can be extended to other chemical, biological, and materials applications.

replace-cross Score-matching-based Structure Learning for Temporal Data on Networks

Authors: Hao Chen, Kai Yi

Abstract: Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge. Numerous algorithms have been developed for this purpose. Among them, the score-matching method has demonstrated superior performance across various evaluation metrics, particularly for the commonly encountered Additive Nonlinear Causal Models. However, current score-matching-based algorithms are primarily designed to analyze independent and identically distributed (i.i.d.) data. More importantly, they suffer from high computational complexity due to the pruning step required for handling dense Directed Acyclic Graphs (DAGs). To enhance the scalability of score matching, we have developed a new parent-finding subroutine for leaf nodes in DAGs, significantly accelerating the most time-consuming part of the process: the pruning step. This improvement results in an efficiency-lifted score matching algorithm, termed Parent Identification-based Causal structure learning for both i.i.d. and temporal data on networKs, or PICK. The new score-matching algorithm extends the scope of existing algorithms and can handle static and temporal data on networks with weak network interference. Our proposed algorithm can efficiently cope with increasingly complex datasets that exhibit spatial and temporal dependencies, commonly encountered in academia and industry. The proposed algorithm can accelerate score-matching-based methods while maintaining high accuracy in real-world applications.

replace-cross How to Spin an Object: First, Get the Shape Right

Authors: Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra

Abstract: Image-to-3D models increasingly rely on hierarchical generation to disentangle geometry and texture. However, the design choices underlying these two-stage models--particularly the optimal choice of intermediate geometric representations--remain largely understudied. To investigate this, we introduce unPIC (undo-a-Picture), a modular framework for empirical analysis of image-to-3D pipelines. By factorizing the generation process into a multiview-geometry prior followed by an appearance decoder, unPIC enables a rigorous comparison of intermediate geometry representations. Through this framework, we identify that a specific representation, Camera-Relative Object Coordinates (CROCS), significantly outperforms alternatives such as depth maps, pretrained visual features, and other pointmap-based representations. We demonstrate that CROCS is not only easier for the first-stage geometry prior to predict, but also serves as an effective conditioning signal for ensuring 360-degree consistency during appearance decoding. Another advantage is that CROCS enables fully feedforward, direct 3D point cloud generation without requiring a separate post-hoc reconstruction step. Our unPIC formulation utilizing CROCS achieves superior novel-view quality, geometric accuracy, and multiview consistency; it outperforms leading baselines, including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet, on datasets of real-world 3D captures like Google Scanned Objects and the Digital Twin Catalog.

replace-cross PAT: Privacy-Preserving Adversarial Transfer for Accurate, Robust and Privacy-Preserving EEG Decoding

Authors: Xiaoqing Chen, Tianwang Jia, Yunlu Tu, Dongrui Wu

Abstract: An electroencephalogram (EEG)-based brain-computer interface (BCI) enables direct communication between the brain and external devices. However, such systems face at least three major challenges in real-world applications: limited decoding accuracy, poor robustness, and privacy risks. Although prior studies have addressed one or two of these issues, methods that simultaneously improve accuracy, robustness, and privacy remain largely unexplored. In this paper, we propose Privacy-preserving Adversarial Transfer (PAT), a unified training framework that combines data alignment, adversarial training, and privacy-preserving transfer. PAT provides a single pipeline that can be instantiated under three privacy-preserving scenarios, i.e., centralized source-free transfer, federated source-free transfer, and transfer with privacy-preserved source data, while jointly improving accuracy and robustness. Experiments on five public EEG datasets under these three scenarios show that PAT outperforms over ten classic and state-of-the-art methods in both accuracy and robustness. PAT also outperforms leading transfer learning approaches that do not incorporate any privacy mechanisms by 9.76% in terms of average accuracy and robustness. To our knowledge, this is the first approach that simultaneously addresses all three major challenges in EEG-based BCIs. We believe this work can help motivate further research on more accurate, robust, and privacy-preserving EEG decoding.

replace-cross A Multiparty Homomorphic Encryption Approach to Confidential Federated Kaplan Meier Survival Analysis

Authors: Narasimha Raghavan Veeraragavan, Svetlana Boudko, Jan Franz Nygård

Abstract: The proliferation of real-world health data enables multi-institutional survival studies, yet privacy constraints preclude centralizing sensitive records. We present a privacy-preserving federated Kaplan--Meier framework based on threshold CKKS (Cheon-Kim-Kim-Song) homomorphic encryption that supports approximate floating-point computation and encrypted aggregation of per-time-point counts while exposing only public outputs. Sites compute aligned at-risk and event tallies on a shared time grid and encrypt compact vectors; a coordinator aggregates ciphertexts; and a decryptor committee produces partial shares fused per block to recover aggregated plaintexts without releasing per-time-point tables. We prove correctness, stability, and slot-optimal vector packing, and derive scaling laws showing that communication grows linearly with the number of sites and predictably with the number of time points. Empirically, using synthetic breast-cancer data (N=60,000) distributed across 500 sites, encrypted federated curves match the pooled oracle to numerical precision. In contrast, plaintext protocols permit trivial reconstruction by subtraction; our threshold-gated design precludes this attack under the stated threat model, enabling high-fidelity survival estimation with predictable overhead and substantially reduced privacy risk.
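
Illustrative sketch (not from the paper): the arithmetic that the encrypted protocol protects can be shown in plaintext. Each site contributes at-risk and event tallies on a shared time grid, the tallies are summed elementwise, and the Kaplan-Meier curve is computed from the pooled counts. The function names and the toy numbers are assumptions made for illustration, and the threshold CKKS encryption, packing, and decryptor committee are omitted entirely. As the abstract notes, exchanging such tables in plaintext permits reconstruction by subtraction, so this shows only the target computation, not a safe protocol.

```python
from typing import List, Sequence, Tuple

SiteTable = Tuple[Sequence[int], Sequence[int]]  # (at_risk, events) per time point


def aggregate_counts(site_tables: Sequence[SiteTable]) -> Tuple[List[int], List[int]]:
    """Elementwise sum of per-site at-risk and event tallies on a shared grid."""
    if not site_tables:
        raise ValueError("need at least one site")
    grid_len = len(site_tables[0][0])
    at_risk = [0] * grid_len
    events = [0] * grid_len
    for site_at_risk, site_events in site_tables:
        if len(site_at_risk) != grid_len or len(site_events) != grid_len:
            raise ValueError("all sites must use the same time grid")
        for i in range(grid_len):
            at_risk[i] += site_at_risk[i]
            events[i] += site_events[i]
    return at_risk, events


def kaplan_meier(at_risk: Sequence[int], events: Sequence[int]) -> List[float]:
    """Product-limit estimate S(t_i) = prod_{j <= i} (1 - d_j / n_j)."""
    survival = []
    s = 1.0
    for n, d in zip(at_risk, events):
        if n > 0:  # skip grid points where nobody is at risk
            s *= 1.0 - d / n
        survival.append(s)
    return survival


# Two hypothetical sites on a three-point grid.
site_a = ([5, 4, 2], [1, 1, 0])
site_b = ([5, 3, 3], [2, 0, 1])
pooled_at_risk, pooled_events = aggregate_counts([site_a, site_b])
curve = kaplan_meier(pooled_at_risk, pooled_events)
```

Because the estimator depends only on the pooled sums, any aggregation mechanism that recovers those sums yields the same curve as pooling the raw records.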

replace-cross HFI: A unified framework for training-free detection and implicit watermarking of latent diffusion model generated images

Authors: Sungik Choi, Hankook Lee, Jaehoon Lee, Seunghyun Kim, Stanley Jungkyu Choi, Moontae Lee

Abstract: Dramatic advances in the quality of latent diffusion models (LDMs) have also led to the malicious use of AI-generated images. While current AI-generated image detection methods assume the availability of real/AI-generated images for training, this is practically limited given the vast expressibility of LDMs. This motivates the training-free detection setup where no related data are available in advance. The existing LDM-generated image detection method assumes that images generated by an LDM are easier to reconstruct using an autoencoder than real images. However, we observe that this reconstruction distance is overfitted to background information, leading the current method to underperform in detecting images with simple backgrounds. To address this, we propose a novel method called HFI. Specifically, by viewing the autoencoder of an LDM as a downsampling-upsampling kernel, HFI measures the extent of aliasing, a distortion of high-frequency information that appears in the reconstructed image. HFI is training-free, efficient, and consistently outperforms other training-free methods in detecting challenging images generated by various generative models. We also show that HFI can successfully detect the images generated from the specified LDM as a means of implicit watermarking. HFI outperforms the best baseline method while achieving magnitudes of

replace-cross Online Covariance Matrix Estimation in Sketched Newton Methods

Authors: Wei Kuang, Mihai Anitescu, Sen Na

Abstract: Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique to perform an approximate Newton step in each iteration, thereby eliminating the computational bottleneck of second-order methods. While existing studies have established the asymptotic normality of sketched Newton methods, a consistent estimator of the limiting covariance matrix remains an open problem. We propose a fully online covariance matrix estimator that is constructed entirely from the Newton iterates and requires no matrix factorization. Compared to covariance estimators for first-order online methods, our estimator for second-order methods is batch-free. We establish the consistency and convergence rate of our estimator, and coupled with asymptotic normality results, we can then perform online statistical inference for the model parameters based on sketched Newton methods. We also discuss the extension of our estimator to constrained problems, and demonstrate its superior performance on regression problems as well as benchmark problems in the CUTEst set.

replace-cross Large Language Models Can Help Mitigate Barren Plateaus in Quantum Neural Networks

Authors: Jun Zhuang, Chaowen Guan

Abstract: In the era of noisy intermediate-scale quantum (NISQ) computing, Quantum Neural Networks (QNNs) have emerged as a promising approach for various applications, yet their training is often hindered by barren plateaus (BPs), where gradient variance vanishes exponentially as the qubit size increases. Most initialization-based mitigation strategies rely heavily on pre-designed static parameter distributions, thereby lacking adaptability to diverse model sizes or data conditions. To address these limitations, we propose AdaInit, a foundational framework that leverages large language models with the submartingale property to iteratively synthesize initial parameters for QNNs that yield non-negligible gradient variance, thereby mitigating BPs. Unlike conventional one-shot initialization methods, AdaInit adaptively explores the parameter space by incorporating dataset characteristics and gradient feedback, with theoretical guarantees of convergence to finding a set of effective initial parameters for QNNs. We provide rigorous theoretical analyses of the submartingale-based process and empirically validate that AdaInit consistently outperforms existing initialization methods in maintaining higher gradient variance across various QNN scales. We believe this work may initiate a new avenue to mitigate BPs.

replace-cross Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis

Authors: Iliana Loi, Konstantinos Moustakas

Abstract: Fatigue modeling is essential for motion synthesis tasks to model human motions under fatigued conditions and biomechanical engineering applications, such as investigating the variations in movement patterns and posture due to fatigue, defining injury risk mitigation and prevention strategies, formulating fatigue minimization schemes, and creating improved ergonomic designs. Nevertheless, employing data-driven methods for synthesizing the impact of fatigue on motion receives little to no attention in the literature. In this work, we present Fatigue-PINN, a deep learning framework based on Physics-Informed Neural Networks, for modeling fatigued human movements, while providing joint-specific fatigue configurations for adaptation and mitigation of motion artifacts on a joint level, resulting in smoother, hence physically plausible, animations. To account for muscle fatigue, we simulate the fatigue-induced fluctuations in the maximum exerted joint torques by leveraging a PINN adaptation of the Three-Compartment Controller model to exploit physics-domain knowledge for improving accuracy. This model also introduces parametric motion alignment with respect to joint-specific fatigue, hence avoiding sharp frame transitions. Our results indicate that Fatigue-PINN accurately simulates the effects of externally perceived fatigue on open-type human movements, consistent with findings from real-world experimental fatigue studies. Since fatigue is incorporated in torque space, Fatigue-PINN provides an end-to-end encoder-decoder-like architecture to transform joint angles to joint torques and vice versa, thus being compatible with motion synthesis frameworks operating on joint angles.

replace-cross BLADE: Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems

Authors: Cindy Xiangrui Kong, Haoyang Zheng, Guang Lin

Abstract: Traditional methods for system discovery frequently struggle with efficient data usage and uncertainty quantification. Identifying the governing equations of complex dynamical systems from data presents a significant challenge in scientific discovery, especially when high-quality measurements are scarce and expensive to obtain. To overcome these limitations, we propose Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems (BLADE), a novel Bayesian framework that combines replica-exchange stochastic gradient Langevin Monte Carlo with active learning. By balancing gradient-driven exploration and exploitation in coefficient space, BLADE provides probabilistic parameter estimation and principled uncertainty quantification. Faced with data scarcity, the probabilistic foundation of BLADE further facilitates the integration of active learning through a hybrid acquisition strategy that combines predictive uncertainty with space-filling design, enabling efficient selection of informative samples. Across benchmark systems, BLADE reduces measurement requirements by roughly 60% for Lotka-Volterra and 40% for Burgers' equation relative to random sampling, demonstrating substantial data-efficiency gains. These results highlight BLADE as a general uncertainty-aware framework for discovering interpretable dynamical systems, particularly valuable when high-fidelity data acquisition is prohibitively expensive.

replace-cross Learning to Play Piano in the Real World

Authors: Yves-Simon Zeulner, Simon Crämer, Sandeep Selvaraj, Roberto Calandra

Abstract: Towards the grand challenge of achieving human-level manipulation in robots, playing piano is a compelling testbed that requires strategic, precise, and flowing movements. Over the years, several works demonstrated hand-designed controllers on real-world piano playing, while other works evaluated robot learning approaches on simulated piano playing. In this work, we develop the first piano-playing robotic system that makes use of learning approaches while also being deployed on a real-world dexterous robot. Specifically, we use a Sim2Real2Sim approach where we iteratively alternate between training policies in simulation, deploying the policies in the real world, and using the collected real-world data to update the parameters of the simulator. Using this approach we demonstrate that the robot can learn to play several piano pieces (including Are You Sleeping, Happy Birthday, Ode To Joy, and Twinkle Twinkle Little Star) in the real world accurately, reaching an average F1-score of 0.881. By providing this proof-of-concept, we want to encourage the community to adopt piano playing as a compelling benchmark towards human-level manipulation in the real world. We open-source our code and show additional videos at www.lasr.org/research/learning-to-play-piano.

replace-cross Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

Authors: Dipin Khati, Daniel Rodriguez-Cardenas, David N. Palacio, Alejandro Velasco, Michele Tufano, Denys Poshyvanyk

Abstract: As Large Language Models for Code (LM4Code) become integral to software engineering, establishing trust in their output becomes critical. However, standard accuracy metrics obscure the underlying reasoning of generative models, offering little insight into how decisions are made. Although post-hoc interpretability methods attempt to fill this gap, they often restrict explanations to local, token-level insights, which fail to provide a developer-understandable global analysis. Our work highlights the urgent need for global, code-based explanations that reveal how models reason across code. To support this vision, we introduce code rationales (CodeQ), a framework that enables global interpretability by mapping token-level rationales to high-level programming categories. Aggregating thousands of these token-level explanations allows us to perform statistical analyses that expose systemic reasoning behaviors. We validate this aggregation by showing it distills a clear signal from noisy token data, reducing explanation uncertainty (Shannon entropy) by over 50%. Additionally, we find that a code generation model (codeparrot-small) consistently favors shallow syntactic cues (e.g., indentation) over deeper semantic logic. Furthermore, in a user study with 37 participants, we find its reasoning is significantly misaligned with that of human developers. These findings, hidden from traditional metrics, demonstrate the importance of global interpretability techniques to foster trust in LM4Code.

replace-cross Property-Preserving Hashing for $\ell_1$-Distance Predicates: Applications to Countering Adversarial Input Attacks

Authors: Hassan Asghar, Chenhan Zhang, Dali Kaafar

Abstract: Perceptual hashing is used to detect whether an input image is similar to a reference image, with a variety of security applications. Recently, perceptual hashing algorithms have been shown to succumb to adversarial input attacks, which make small imperceptible changes to the input image such that the hashing algorithm no longer detects its similarity to the original image. Property-preserving hashing (PPH) is a recent construct in cryptography, which preserves some property (predicate) of its inputs in the hash domain. Researchers have so far shown constructions of PPH for Hamming distance predicates, which, for instance, output 1 if two inputs are within Hamming distance $t$. A key feature of PPH is its strong correctness guarantee, i.e., the probability that the predicate will not be correctly evaluated in the hash domain is negligible. Motivated by the use case of detecting similar images in an adversarial setting, we propose the first PPH construction for an $\ell_1$-distance predicate. Roughly, this predicate checks if the two one-sided $\ell_1$-distances between two images are within a threshold $t$. Since many adversarial attacks use $\ell_2$-distance (related to $\ell_1$-distance) as the objective function to perturb the input image, by appropriately choosing the threshold $t$, we can force the attacker to add considerable noise to evade detection, and hence significantly deteriorate the image quality. Our proposed scheme is highly efficient, and runs in time $O(t^2)$. For grayscale images of size $28 \times 28$, we can evaluate the predicate in $0.0784$ seconds when pixel values are perturbed by up to $1\%$. For larger RGB images of size $224 \times 224$, by dividing the image into 1,000 blocks, we achieve times of $0.0128$ seconds per block for $1\%$ change, and up to $0.2641$ seconds per block for $14\%$ change.
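
Illustrative sketch (not from the paper): the predicate described above can be evaluated directly on plaintext pixel vectors, which is the behavior a property-preserving hash would have to reproduce in the hash domain. The abstract does not define "one-sided" precisely; the code assumes it means the sum of positive coordinate differences in each direction, and both function names are hypothetical. Nothing here implements the hashing scheme itself.

```python
from typing import Sequence


def one_sided_l1(x: Sequence[int], y: Sequence[int]) -> int:
    """Sum of max(x_i - y_i, 0): how much x exceeds y (assumed definition)."""
    return sum(max(a - b, 0) for a, b in zip(x, y))


def l1_predicate(x: Sequence[int], y: Sequence[int], t: int) -> bool:
    """True if both one-sided distances between x and y are at most t."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return one_sided_l1(x, y) <= t and one_sided_l1(y, x) <= t


# Toy pixel vectors (hypothetical values).
reference = [10, 20, 30]
candidate = [12, 18, 30]
close_enough = l1_predicate(reference, candidate, t=2)
too_strict = l1_predicate(reference, candidate, t=1)
```

Under this assumed definition the two one-sided distances sum to the ordinary $\ell_1$-distance, so a threshold $t$ on each side bounds the full distance by $2t$.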

replace-cross Adaptive Bidding Policies for First-Price Auctions with Budget Constraints under Non-stationarity

Authors: Yige Wang, Jiashuo Jiang

Abstract: We study how a budget-constrained bidder should learn to adaptively bid in repeated first-price auctions to maximize her cumulative payoff. This problem arose from a recent industry-wide shift from second-price auctions to first-price auctions in display advertising, which renders truthful bidding (i.e., always bidding one's private value) no longer optimal. We propose a simple dual-gradient-descent-based bidding policy that maintains a dual variable for the budget constraint as the bidder consumes her budget. In the analysis, we consider two settings regarding the bidder's knowledge of her private values in the future: (i) an uninformative setting where all the distributional knowledge (which can be non-stationary) is entirely unknown to the bidder, and (ii) an informative setting where a prediction of the budget allocation is available in advance. We characterize the performance loss (or regret) relative to an optimal policy with complete information on the stochasticity. For the uninformative setting, we show that the regret is $\tilde{O}(\sqrt{T})$ plus a variation term that reflects the non-stationarity of the value distributions, and this is of optimal order. We then show that we can get rid of the variation term with the help of the prediction; specifically, the regret is $\tilde{O}(\sqrt{T})$ plus the prediction error term in the informative setting.
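
Illustrative sketch (not from the paper): a generic dual-gradient-descent bidding loop can be written in a few lines. The abstract does not spell out the policy, so everything specific here is an assumption: the bidder is assumed to observe the highest competing bid after each round, to estimate her win probability with the empirical distribution of those bids, to pick a bid from a fixed grid, and to use a constant step size `eta` with a per-round target spend `rho`. The function names are hypothetical.

```python
from typing import Sequence


def win_probability(bid: float, past_competing: Sequence[float]) -> float:
    """Empirical estimate of P(highest competing bid <= bid); ties count as wins."""
    if not past_competing:
        return 0.0
    return sum(1 for m in past_competing if m <= bid) / len(past_competing)


def choose_bid(value: float, lam: float, past_competing: Sequence[float],
               grid: Sequence[float]) -> float:
    """Pick the grid bid maximizing the Lagrangian payoff (v - (1 + lam) * b) * P(win)."""
    return max(grid, key=lambda b: (value - (1.0 + lam) * b) * win_probability(b, past_competing))


def update_dual(lam: float, eta: float, rho: float, payment: float) -> float:
    """Projected gradient step: raise lam when spending exceeds the target rate rho."""
    return max(0.0, lam - eta * (rho - payment))


# One hypothetical round with value 1.0 and three observed competing bids.
history = [0.2, 0.4, 0.6]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
bid_unconstrained = choose_bid(1.0, 0.0, history, grid)
bid_budget_tight = choose_bid(1.0, 1.0, history, grid)
```

A larger dual variable makes each unit of spend more expensive in the objective, so the chosen bid shrinks as the budget constraint tightens.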

replace-cross RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

Authors: Victor Oei, Jenny Schmalfuss, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andrés Bruhn

Abstract: Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that robustness varies widely by corruption type, and experimentally show that evaluations on RobustSpring indicate real-world robustness. RobustSpring is a new computer vision benchmark to treat robustness as a first-class citizen, fostering models that are accurate and resilient. It is available at https://spring-benchmark.org.

URLs: https://spring-benchmark.org

replace-cross Deconstructing Subset Construction -- Reducing While Determinizing

Authors: John Nicol, Markus Frohme

Abstract: We present a novel perspective on the NFA canonization problem, which introduces intermediate minimization steps to reduce the exploration space on-the-fly. Central to our approach are equivalence registries which track and unify language-equivalent states, and allow for additional optimizations such as convexity closures and simulation. Due to the generality of our approach, these concepts can be embedded in classic subset construction or Brzozowski's approach. We evaluate our approach on a set of synthetic and real-world examples from automatic sequences and observe that we are able to improve especially worst-case scenarios. We provide an open-source library implementing our approach.
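
Illustrative sketch (not from the paper): the classic subset construction that the abstract takes as its starting point can be written as a breadth-first exploration of reachable state sets. This is the textbook baseline only; it contains no equivalence registries, intermediate minimization, convexity closures, or simulation. The NFA encoding (a dict from `(state, symbol)` to a set of states, no epsilon transitions) and the function names are assumptions made for the example.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, Set, Tuple

NfaDelta = Dict[Tuple[int, str], Set[int]]


def subset_construction(alphabet: Iterable[str], delta: NfaDelta, start: int,
                        accepting: Iterable[int]):
    """Determinize an epsilon-free NFA; DFA states are frozensets of NFA states."""
    alphabet = list(alphabet)
    accepting = frozenset(accepting)
    start_set: FrozenSet[int] = frozenset([start])
    dfa_delta: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of all NFA successors of the states in the current set.
            target = frozenset(q for s in current for q in delta.get((s, symbol), ()))
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {state for state in seen if state & accepting}
    return seen, dfa_delta, start_set, dfa_accepting


def dfa_accepts(word: str, dfa_delta, start_set, dfa_accepting) -> bool:
    """Run the determinized automaton on a word."""
    state = start_set
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in dfa_accepting


# Hypothetical NFA over {a, b} accepting words that end in "ab".
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
states, dfa_delta, dfa_start, dfa_final = subset_construction("ab", nfa_delta, 0, {2})
```

Every reachable subset becomes a DFA state, which is where the worst-case exponential blow-up comes from.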

replace-cross Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification

Authors: Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li

Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.

URLs: https://github.com/Qianvenh/CDGLT
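
Illustrative sketch (not from the paper): Spherical Linear Interpolation between two embedding vectors follows the standard formula slerp(u, v; t) = sin((1 - t)Ω)/sin(Ω) · u + sin(tΩ)/sin(Ω) · v, where Ω is the angle between the normalized vectors. The abstract does not say which embeddings are interpolated or how the interpolation parameter is chosen, so the inputs below are toy vectors and the function name is hypothetical; the code uses plain Python rather than any CLIP library.

```python
import math
from typing import List, Sequence


def slerp(u: Sequence[float], v: Sequence[float], t: float) -> List[float]:
    """Interpolate along the great circle between the unit vectors of u and v."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot interpolate a zero vector")
    u_hat = [a / norm_u for a in u]
    v_hat = [b / norm_v for b in v]
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(u_hat, v_hat))))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        if cos_omega > 0.0:
            # Nearly parallel vectors: linear interpolation is a safe fallback.
            return [(1.0 - t) * a + t * b for a, b in zip(u_hat, v_hat)]
        raise ValueError("slerp is undefined for antiparallel vectors")
    w_u = math.sin((1.0 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u_hat, v_hat)]


# Toy two-dimensional "embeddings" at a right angle.
midpoint = slerp([1.0, 0.0], [0.0, 2.0], 0.5)
```

Unlike linear interpolation, the result stays on the unit sphere for every value of the interpolation parameter, which matters when downstream components expect normalized embeddings.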

replace-cross GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu

Abstract: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.

URLs: https://github.com/gogoduan/GoT-R1

replace-cross Can Large Language Models Infer Causal Relationships from Real-World Text?

Authors: Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah

Abstract: Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work evaluating LLM causal reasoning primarily relies on synthetic or simplified texts with explicitly stated causal relationships. These texts typically feature short passages and few causal relations, failing to reflect the complexities of real-world reasoning. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature, which includes diverse texts with respect to length, complexity (different levels of explicitness, number of causal events and relationships), and domain. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on this dataset show that LLMs face significant challenges in inferring causal relationships from real-world text, with the best-performing model achieving an average F$_1$ score of only 0.535. Through systematic analysis across aspects of real-world text (explicitness, number of causal events and relationships, length of text, domain), our benchmark offers targeted insights for further research into advancing LLM causal reasoning. Our code and dataset can be found at https://github.com/Ryan-Saklad/ReCITE .

URLs: https://github.com/Ryan-Saklad/ReCITE

replace-cross A robust and adaptive MPC formulation for Gaussian process models

Authors: Mathieu Dubied, Amon Lahr, Melanie N. Zeilinger, Johannes Köhler

Abstract: In this paper, we present a robust and adaptive model predictive control (MPC) framework for uncertain nonlinear systems affected by bounded disturbances and unmodeled nonlinearities. We use Gaussian Processes (GPs) to learn the uncertain dynamics based on noisy measurements, including those collected during system operation. As a key contribution, we derive robust predictions for GP models using contraction metrics, which are incorporated in the MPC formulation. The proposed design guarantees recursive feasibility, robust constraint satisfaction and convergence to a reference state, with high probability. We provide a numerical example of a planar quadrotor subject to difficult-to-model ground effects, which highlights significant improvements achieved through the proposed robust prediction method and through online learning.

replace-cross Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Authors: Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

Abstract: Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

replace-cross How to Bridge the Sim-to-Real Gap in Digital Twin-Aided Telecommunication Networks

Authors: Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir M. Al-Hashimi

Abstract: Training effective artificial intelligence models for telecommunications is challenging due to the scarcity of deployment-specific data. Real data collection is expensive, and available datasets often fail to capture the unique operational conditions and contextual variability of the network environment. Digital twinning provides a potential solution to this problem, as simulators tailored to the current network deployment can generate site-specific data to augment the available training datasets. However, there is a need to develop solutions to bridge the inherent simulation-to-reality (sim-to-real) gap between synthetic and real-world data. This paper reviews recent advances on two complementary strategies: 1) the calibration of digital twins (DTs) through real-world measurements, and 2) the use of sim-to-real gap-aware training strategies to robustly handle residual discrepancies between digital twin-generated and real data. For the latter, we evaluate two conceptually distinct methods that model the sim-to-real gap either at the level of the environment via Bayesian learning or at the level of the training loss via prediction-powered inference.

replace-cross PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Authors: Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt

Abstract: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories directly from raw pixel inputs. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be available at https://maxiuw.github.io/prix.

URLs: https://maxiuw.github.io/prix.

replace-cross FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

Authors: Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

Abstract: Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm by enabling collaborative model training without centralized data collection. However, applying FL to real-world re-ID systems remains challenging due to statistical heterogeneity across clients caused by non-IID data distributions and substantial communication overhead resulting from the frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and the KL-Divergence-aggregation Weight (KLAW), is designed to alleviate statistical heterogeneity and improve convergence stability under non-IID settings. Second, an unstructured pruning strategy is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is introduced to reflect the relative importance of client parameters. Together with KLAW, PRAW forms a novel aggregation method, namely KL-Divergence-Prune Weighted Aggregation (KLPWA), which enables more effective aggregation of pruned local models under non-IID data distributions and enhances global model robustness. Third, Cross-Round Recovery (CRR) employs a dynamic pruning control mechanism to prevent excessive pruning and preserve model accuracy during iterative compression. Experimental results on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40\%--42\% on ResNet-50 while achieving superior overall performance.

replace-cross Bias Detection in Emergency Psychiatry: Linking Negative Language to Diagnostic Disparities

Authors: Alissa A. Valentine, Lauren A. Lepow, Donald Apakama, Lili Chan, Alexander W. Charney, Isotta Landi

Abstract: The emergency department (ED) is a high-stress environment with increased risk of clinician bias exposure. In the United States, Black patients are more likely than other racial/ethnic groups to receive their first diagnosis of schizophrenia (SCZ), a highly stigmatizing disorder, in the ED. Therefore, understanding the link between clinician bias exposure and psychiatric outcomes is critical for promoting nondiscriminatory decision-making in the ED. This study examines the association between clinician bias exposure and psychiatric diagnosis using a sample of patients with anxiety, bipolar, depression, trauma, and SCZ diagnoses (N=29,005) from a diverse, large medical center. Clinician bias exposure was quantified as the ratio of negative to total number of sentences in psychiatric notes, labeled using a large language model (Mistral). We utilized logistic regression to predict SCZ diagnosis when controlling for patient demographics, risk factors, and negative sentence ratio (NSR). A high NSR significantly increased one's odds of obtaining a SCZ diagnosis and attenuated the effects of patient race. Black male patients with high NSR had the highest odds of being diagnosed with SCZ. Our findings suggest sentiment-based metrics can operationalize clinician bias exposure with real-world data and reveal disparities beyond race or ethnicity.
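
As an illustration of the metric described in this abstract, the negative sentence ratio (NSR) of a note can be sketched as the number of sentences labeled negative divided by the total number of sentences. The function name and the boolean label format are assumptions made for illustration; the abstract does not say how the Mistral labels are represented or how empty notes are handled.

```python
def negative_sentence_ratio(is_negative):
    """Return the share of sentences labeled negative in one note.

    `is_negative` is a hypothetical list of booleans, one per sentence,
    standing in for the LLM-assigned sentence labels.
    """
    if not is_negative:
        # Assumption: a note with no sentences has no defined ratio.
        raise ValueError("note contains no sentences")
    return sum(is_negative) / len(is_negative)


# A hypothetical note with 2 negative sentences out of 8.
labels = [True, False, False, True, False, False, False, False]
print(negative_sentence_ratio(labels))
```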

replace-cross Delta Rectified Flow Sampling for Text-to-Image Editing

Authors: Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang

Abstract: We propose Delta Rectified Flow Sampling (DRFS), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DRFS is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that disabling this shift recovers Delta Denoising Score (DDS), bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, under rectified-flow dynamics, a linear shift schedule recovers the inversion-free method FlowEdit as a strict special case, yielding a unifying view of optimization and ODE editing. We conduct an analysis to guide the design of our shift term, and experimental results on the widely used PIE Benchmark indicate that DRFS achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications. Code is available at https://github.com/Harvard-AI-and-Robotics-Lab/DeltaRectifiedFlowSampling.

URLs: https://github.com/Harvard-AI-and-Robotics-Lab/DeltaRectifiedFlowSampling

replace-cross Variable Selection Using Relative Importance Rankings

Authors: Tien-En Chang, Argon Chen

Abstract: Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable or feature ranking and filter-based selection before model creation. Specifically, we anticipate strong performance from the RI measures because they incorporate both direct and combined effects of predictors, addressing a key limitation of marginal correlation, which ignores dependencies among predictors. We implement and evaluate the RI-based variable ranking and selection methods, including a newly proposed RI measure, CRI.Z, with improved computational efficiency relative to conventional RI measures. Through extensive simulations, we first demonstrate how the RI measures more accurately rank the variables than the marginal correlation, especially when there are suppressed or weak predictors. We then show that predictive models built on these rankings are highly competitive, often outperforming state-of-the-art linear-model methods such as the lasso and relaxed lasso. The proposed RI-based methods are particularly effective in challenging cases involving clusters of highly correlated predictors, a setting known to cause failures in many benchmark methods. The practical utility and efficiency of RI-based methods are further demonstrated through two high-dimensional gene expression datasets. Although lasso methods have dominated the recent literature on variable selection, our study reveals that the RI-based method is a powerful and competitive alternative. We believe these underutilized tools deserve greater attention in statistics and machine learning communities. The code is available at: https://github.com/tien-endotchang/RI-variable-selection.

URLs: https://github.com/tien-endotchang/RI-variable-selection

replace-cross Multi-Model Synthetic Training for Mission-Critical Small Language Models

Authors: Nolan Platt, Pragyansmita Nayak

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models -- when fine-tuned properly -- can provide accuracy similar to that of larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

replace-cross Choose Your Battles: Distributed Learning Over Multiple Tug of War Games

Authors: Siddharth Chandak, Ilai Bistritz, Nicholas Bambos

Abstract: Consider $N$ players and $K$ games taking place simultaneously. Each of these games is modeled as a Tug-of-War (ToW) game where increasing the action of one player decreases the reward for all other players. Each player participates in only one game at any given time. At each time step, a player decides the game in which they wish to participate and the action they take in that game. Their reward depends on the actions of all players that are in the same game. This system of $K$ games is termed a 'Meta Tug-of-War' (Meta-ToW) game. These games can model scenarios such as power control, distributed task allocation, and activation in sensor networks. We propose the Meta Tug-of-Peace algorithm, a distributed algorithm where the action updates are done using a simple stochastic approximation algorithm, and the decision to switch games is made using an infrequent 1-bit communication between the players. We prove that in Meta-ToW games, our algorithm converges to an equilibrium that satisfies a target Quality of Service reward vector for the players. We then demonstrate the efficacy of our algorithm through simulations for the scenarios mentioned above.

replace-cross Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation

Authors: Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Sharon Li, Jiwei Zhao

Abstract: We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.

replace-cross FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Authors: Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains. Code & pretrained checkpoints: https://github.com/apple/ml-fs-dfm

URLs: https://github.com/apple/ml-fs-dfm
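
The "up to 128 times" figure in this abstract matches the ratio of the two step counts. A minimal sketch of that arithmetic, assuming each sampling step costs one model evaluation of roughly equal cost (an assumption; the abstract does not break down per-step cost), with a hypothetical helper name:

```python
def step_ratio(baseline_steps, few_steps):
    """Ratio of sampling steps, used here as a rough upper bound on speedup.

    Assumes every step costs about the same, which is not stated in the abstract.
    """
    return baseline_steps / few_steps


# 1,024-step baseline versus 8-step FS-DFM sampling.
print(step_ratio(1024, 8))  # 128.0
```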

replace-cross LayerNorm Induces Recency Bias in Transformer Decoders

Authors: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi

Abstract: Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.

replace-cross PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems

Authors: Merve G\"ulle, Junno Yun, Ya\c{s}ar Utku Al\c{c}alar, Mehmet Ak\c{c}akaya

Abstract: Diffusion models have found extensive use in solving inverse problems, by sampling from an approximate posterior distribution of data given the measurements. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few neural function evaluations (NFEs). CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, limiting their applicability to large-scale problems and making them difficult to extend to nonlinear settings. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. Specifically, we propose PnP-CM, an ADMM-based PnP solver that provides a unified framework for solving a wide range of inverse problems, and incorporates noise perturbations and momentum-based updates to improve performance in the low-NFE regime. We evaluate our approach on a diverse set of linear and nonlinear inverse problems. We also train and apply CMs to MRI data for the first time. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and produces meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming existing CM-based approaches.

replace-cross TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

Authors: Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

Abstract: Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

replace-cross Multidata Causal Discovery for Statistical Hurricane Intensity Forecasting

Authors: Saranya Ganesh S, Frederick Iat-Hin Tam, Milton S. Gomez, Marie McGraw, Mark DeMaria, Kate Musgrave, Jakob Runge, Tom Beucler

Abstract: Improving statistical forecasts of tropical cyclone (TC) intensity is limited by complex nonlinear interactions and difficulty in identifying relevant predictors. Conventional methods prioritize correlation or fit, often overlooking confounding variables and limiting generalizability to unseen TCs. To address this, we leverage a multidata causal discovery framework with a replicated dataset based on Statistical Hurricane Intensity Prediction Scheme (SHIPS) using ERA5 meteorological reanalysis. We conduct experiments to identify and select predictors causally linked to TC intensity changes. We then train multiple linear regression models to compare causal feature selection with correlation, random forest feature importance, and no feature selection, across five forecast lead times from 1 to 5 days (24 to 120 hours). Causal feature selection consistently outperforms on unseen test cases, especially for lead times shorter than 3 days. Top causal features include vertical shear, mid-tropospheric potential vorticity and surface moisture conditions, which are physically significant yet often underutilized in TC intensity predictions. We build an extended predictor set (SHIPS+) by adding selected features to the standard SHIPS predictors. SHIPS+ yields increased short-term predictive skill at lead times of 24, 48, and 72 hours. Adding nonlinearity using a multilayer perceptron further extends skill to longer lead times, despite our framework being purely regional and not requiring global forecast data. Operational SHIPS tests confirm that three of the six added causally discovered predictors improve forecast skill, with the largest gains at longer lead times. Our results demonstrate that causal discovery improves TC intensity prediction and pave the way toward more empirical forecasts.

replace-cross Inferring Dynamic Physical Properties from Video Foundation Models

Authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman

Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple readout mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative (DynamiCrafter) or self-supervised (V-JEPA-2) manner achieve generally similar performance, though behind that of the oracle, and that MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting. The dataset, model, and code are available at https://www.robots.ox.ac.uk/~vgg/research/idpp/.

URLs: https://www.robots.ox.ac.uk/~vgg/research/idpp/

replace-cross Smart Paste: Automatically Fixing Copy/Paste for Google Developers

Authors: Vincent Nguyen, Guilherme Herzog, Jos\'e Cambronero, Marcus Revaj, Aditya Kini, Alexander Fr\"ommgen, Maxim Tabachnyk

Abstract: Manually editing pasted code is a long-standing developer pain point. In internal software development at Google, we observe that code is pasted 4 times more often than it is manually typed. These paste actions frequently require follow-up edits, ranging from simple reformatting and renaming to more complex style adjustments and cross-language translations. Prior work has shown deep learning can be used to predict these edits. In this work, we show how to iteratively develop and scale Smart Paste, an IDE feature for post-paste edit suggestions, to Google's development environment. This experience can serve as a guide for AI practitioners on a holistic approach to feature development, covering user experience, system integration, and model capabilities. Since deployment, Smart Paste has had overwhelmingly positive feedback with a 45% acceptance rate. At Google's enterprise scale, these accepted suggestions account for over 1% of all code written company-wide.

replace-cross Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

Authors: Reza Shirkavand, Xiaokai Wei, Chen Wang, Zheng Hui, Heng Huang, Michelle Gong

Abstract: While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
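
The token-type gating described in this abstract can be pictured as a hard routing rule: each token goes to exactly one expert depending on whether it is an item-ID token or a text token. The sketch below is a toy illustration under that reading, not the IDIOMoE implementation; `route_tokens`, the scalar "hidden states", and the two stand-in experts are all hypothetical.

```python
def route_tokens(hidden_states, is_item_token, text_expert, item_expert):
    """Send each token to the item expert or the text expert by token type.

    `hidden_states` and `is_item_token` are parallel lists; the experts are
    plain callables standing in for the two feed-forward sub-networks.
    """
    if len(hidden_states) != len(is_item_token):
        raise ValueError("one token-type flag is required per token")
    return [
        item_expert(h) if is_item else text_expert(h)
        for h, is_item in zip(hidden_states, is_item_token)
    ]


# Toy experts: numbers stand in for hidden vectors.
outputs = route_tokens(
    [1, 2, 3],
    [False, True, False],
    text_expert=lambda h: h + 1,
    item_expert=lambda h: h * 10,
)
print(outputs)  # [2, 20, 4]
```

Because the routing depends only on token type, text tokens never pass through the item expert's parameters in this sketch, which is one way to read the abstract's claim about avoiding interference between the two modalities.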

replace-cross HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Authors: Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen

Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which lead to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in an RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.
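
One way to picture the reward described in this abstract is a process bonus, proportional to the share of optimal steps, added on top of outcome and format rewards. The sketch below is an illustration only: the weights, the gating condition (bonus granted only when both answer and format are correct), and the function name are assumptions, not details taken from the paper.

```python
def hierarchical_reward(answer_correct, format_ok, step_is_optimal,
                        format_weight=0.1, bonus_weight=0.4):
    """Combine outcome, format, and process rewards for one trajectory.

    `step_is_optimal` is a hypothetical list of booleans, one per parsed
    search or non-search step. Both weights are illustrative values.
    """
    reward = float(answer_correct) + format_weight * float(format_ok)
    # Assumption: the process bonus applies only to trajectories that are
    # already correct and well formatted.
    if answer_correct and format_ok and step_is_optimal:
        optimal_share = sum(step_is_optimal) / len(step_is_optimal)
        reward += bonus_weight * optimal_share
    return reward


# Correct, well-formatted answer with 3 of 4 steps judged optimal.
print(hierarchical_reward(True, True, [True, True, False, True]))
```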

replace-cross Locket: Robust Feature-Locking Technique for Language Models

Authors: Lipeng He, Vasisht Duddu, N. Asokan

Abstract: Chatbot service providers (e.g., OpenAI) rely on tiered subscription plans to generate revenue, offering black-box access to basic models for free users and advanced models to paying subscribers. However, this approach is unprofitable and inflexible. A pay-to-unlock scheme for premium features (e.g., math, coding) offers a more sustainable alternative. Enabling such a scheme requires a feature-locking technique (FLoTE) that is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and clients. Existing FLoTEs (e.g., password-locked models) fail to meet these criteria. To fill this gap, we present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. We develop a framework for adversarial training and merging of feature-locking adapters, which enables Locket to selectively disable specific features of a model. Evaluation shows that Locket is effective ($100$% refusal rate), utility-preserving ($\leq 7$% utility degradation), robust ($\leq 5$% attack success rate), and scalable to multiple features and clients.

replace-cross Post-Processing Methods for Improving Accuracy in MRI Inpainting

Authors: Nishad Kulkarni, Krithika Iyer, Austin Tapp, Abhijeet Parida, Daniel Capell\'an-Mart\'in, Zhifan Jiang, Mar\'ia J. Ledesma-Carbayo, Syed Muhammad Anwar, Marius George Linguraru

Abstract: Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at https://hub.docker.com/layers/aparida12/brats2025/inpt.

URLs: https://hub.docker.com/layers/aparida12/brats2025/inpt

replace-cross RL makes MLLMs see better than SFT

Authors: Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo

Abstract: A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations compared to SFT, boosting the ability of the vision encoder for MLLM. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/

URLs: https://june-page.github.io/pivot/

replace-cross SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Authors: Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul R\"ottger

Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations of simulation fidelity are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that the best LLMs today achieve meaningful but modest simulation fidelity (score: 40.80/100), with performance scaling log-linearly with model size but not with increased inference-time compute. We discover an alignment-simulation tradeoff: instruction tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

replace-cross Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Authors: Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee

Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that this gap primarily stems from failures in language understanding, specifically the model's inability to translate multilingual inputs into the language dominating its reasoning traces (typically English). As identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace only when an understanding failure is detected. Experimental results using Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying its origin and suggesting a path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis

URLs: https://github.com/deokhk/RLM_analysis
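The Selective Translation idea described in the abstract above can be illustrated with a minimal sketch. The function names `build_reasoning_prefix`, `detect_failure`, and `translate_to_english`, as well as the prefix wording, are hypothetical placeholders and are not taken from the authors' code; the stub detector and translator exist only to make the example runnable.

```python
from typing import Callable


def build_reasoning_prefix(
    question: str,
    detect_failure: Callable[[str], bool],
    translate_to_english: Callable[[str], str],
) -> str:
    """Return the text that seeds the reasoning trace for one input.

    Hypothetical sketch: an English translation is added only when the
    detector flags a likely understanding failure.
    """
    if detect_failure(question):
        # Only flagged inputs pay the translation cost.
        translation = translate_to_english(question)
        return f"English translation of the question: {translation}\n"
    # Unflagged inputs start reasoning without any translation.
    return ""


# Toy stand-ins (assumptions): flag any non-ASCII input, "translate" via a lookup.
toy_detector = lambda text: not text.isascii()
toy_translator = lambda text: {"২ + ২ কত?": "What is 2 + 2?"}.get(text, text)

print(repr(build_reasoning_prefix("What is 2 + 2?", toy_detector, toy_translator)))
print(repr(build_reasoning_prefix("২ + ২ কত?", toy_detector, toy_translator)))
```

In practice the detector would be one of the detection methods the abstract mentions (supervised approaches are reported to perform best); the non-ASCII test here is purely illustrative.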

replace-cross Quantifying Weighted Morphological Content of Large-Scale Structures via Simulation-Based Inference

Authors: M. H. Jalali Kanafi, S. M. S. Movahed

Abstract: We perform a simulation-based forecasting analysis to compare the cosmological constraining power of higher-order summary statistics of the large-scale structure, the Minkowski Functionals (MFs) and a class of weighted morphological measures known as the Conditional Moments of Derivatives (CMD), with that of the redshift-space halo power spectrum multipoles (PS), with a particular focus on their sensitivity to nonlinear and anisotropic features in redshift space. Our analysis relies on halo catalogs from the Big Sobol Sequence simulations at redshift $z=0.5$, employing a likelihood-free inference framework implemented via neural posterior estimation. At the fiducial Quijote cosmology and for a Gaussian smoothing scale of $R=15\,h^{-1}\mathrm{Mpc}$, CMD provide systematically tighter constraints than MFs. Combining MFs and CMD into a joint estimator improves the precision by $27\%^{+9\%}_{-5\%}$ for $\sigma_8$ and $26\%^{+7\%}_{-5\%}$ for $\Omega_{\mathrm{m}}$ relative to MFs alone, highlighting the complementary anisotropy-sensitive information captured by the CMD in contrast to the scalar morphological content encapsulated by the MFs. We compare the combined statistic MFs+CMD with the PS at matched effective scales ($k_{\max}\simeq0.16\,h\,\mathrm{Mpc^{-1}}$) under three halo-selection conditions: all halos, fixed number density, and mass-selected ($M>3\times10^{13}\,h^{-1}M_\odot$). In the mass-selected configuration, the (weighted) morphological estimator outperforms the power spectrum by $45\%^{+20\%}_{-9\%}$ for $\sigma_8$ and $43\%^{+10\%}_{-7\%}$ for $\Omega_{\mathrm{m}}$. We also extend the simulation-based forecast analysis across a continuous range of cosmological parameters and multiple smoothing scales for morphological measures.

replace-cross Multimodal Diffusion Forcing for Forceful Manipulation

Authors: Zixuan Huang, Huaidian Hou, Dmitry Berenson

Abstract: Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing, a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities, but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website (https://unified-df.github.io).

URLs: https://unified-df.github.io

replace-cross A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Authors: Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Abstract: Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

replace-cross GroupRank: A Groupwise Paradigm for Effective and Efficient Passage Reranking with LLMs

Authors: Meixiu Long, Duolin Sun, Dan Yang, Yihan Jiao, Lei Liu, Jiahai Wang, BinBin Hu, Yue Shen, Jie Feng, Zhehao Tan, Junjie Wang, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu

Abstract: Large Language Models (LLMs) have emerged as powerful tools for passage reranking in information retrieval, leveraging their superior reasoning capabilities to address the limitations of conventional models on complex queries. However, current LLM-based reranking paradigms are fundamentally constrained by an efficiency-accuracy trade-off: (1) pointwise methods are efficient but ignore inter-document comparison, yielding suboptimal accuracy; (2) listwise methods capture global context but suffer from context-window constraints and prohibitive inference latency. To address these issues, we propose GroupRank, a novel paradigm that balances flexibility and context awareness. To unlock the full potential of groupwise reranking, we propose an answer-free data synthesis pipeline that fuses local pointwise signals with global listwise rankings. These samples facilitate supervised fine-tuning and reinforcement learning, with the latter guided by a specialized group-ranking reward comprising ranking-utility and group-alignment. These complementary components synergistically optimize document ordering and score calibration to reflect intrinsic query-document relevance. Experimental results show GroupRank achieves a state-of-the-art 65.2 NDCG@10 on BRIGHT and surpasses baselines by 2.1 points on R2MED, while delivering a 6.4$\times$ inference speedup.
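For readers unfamiliar with the NDCG@10 metric quoted in the abstract above, here is a minimal sketch of how it is commonly computed. It uses the linear-gain variant (some evaluations use a gain of 2^rel - 1 instead) and assumes every judged document appears in the ranked list; it is not taken from the GroupRank evaluation code, and the example relevance grades are invented.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of the passages in the order a reranker returned them.
ranking = [3, 2, 0, 1]
print(round(ndcg_at_k(ranking), 4))
```

The function returns a value in [0, 1]; benchmark tables often report it multiplied by 100.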

replace-cross Improving Neutrino Oscillation Measurements through Event Classification

Authors: Sebastian A. R. Ellis, Daniel C. Hackett, Shirley Weishi Li, Pedro A. N. Machado, Karla Tame-Narvaez

Abstract: Precise neutrino energy reconstruction is essential for next-generation long-baseline oscillation experiments, yet current methods remain limited by large uncertainties in neutrino-nucleus interaction modeling. Even so, it is well established that different interaction channels produce systematically varying amounts of missing energy and therefore yield different reconstruction performance--information that standard calorimetric approaches do not exploit. We introduce a strategy that incorporates this structure by classifying events according to their underlying interaction type prior to energy reconstruction. Using supervised machine-learning techniques trained on labeled generator events, we leverage intrinsic kinematic differences among quasi-elastic scattering, meson-exchange current, resonance production, and deep-inelastic scattering processes. A cross-generator testing framework demonstrates that this classification approach is robust to microphysics mismodeling and, when applied to a simulated DUNE $\nu_\mu$ disappearance analysis, yields improved accuracy and sensitivity at the 10-20% level. These results highlight a practical path toward reducing reconstruction-driven systematics in future oscillation measurements.

replace-cross Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

Authors: Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

Abstract: Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.

replace-cross House of Dextra: Cross-embodied Co-design for Dexterous Hands

Authors: Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib, Hao Su, Michael T. Tolley, Sha Yi, Xiaolong Wang

Abstract: Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website: https://an-axolotl.github.io/HouseofDextra/

URLs: https://an-axolotl.github.io/HouseofDextra/

replace-cross Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT

Authors: Matan Atad, Alexander W. Marka, Lisa Steinhelfer, Anna Curto-Vilalta, Yannik Leonhardt, Sarah C. Foreman, Anna-Sophia Walburga Dietrich, Robert Graf, Alexandra S. Gersing, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke, Hendrik M\"oller

Abstract: Accurate segmentation of vertebral metastasis in CT is clinically important yet difficult to scale, as voxel-level annotations are scarce and both lytic and blastic lesions often resemble benign degenerative changes. We introduce a 2D weakly supervised method trained solely on vertebra-level healthy/malignant labels, without any lesion masks. The method combines a Diffusion Autoencoder (DAE) that produces a classifier-guided healthy edit of each vertebra with pixel-wise difference maps that propose suspect candidate lesions. To determine which regions truly reflect malignancy, we introduce Hide-and-Seek Attribution: each candidate is revealed in turn while all others are hidden, the edited image is projected back to the data manifold by the DAE, and a latent-space classifier quantifies the isolated malignant contribution of that component. High-scoring regions form the final lytic or blastic segmentation. On held-out radiologist annotations, we achieve strong blastic/lytic performance despite no mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). These results show that vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT.
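The reveal-one, hide-the-rest loop described in the abstract above can be sketched as follows. This is a simplified illustration under strong assumptions: images are toy pixel dictionaries, `malignancy_score` is a hypothetical stand-in for the latent-space classifier, and the step that projects each edited image back to the data manifold with the DAE is omitted entirely.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Pixel = Tuple[int, int]
Image = Dict[Pixel, float]


def hide_and_seek(
    original: Image,
    healthy_edit: Image,
    candidates: Sequence[Set[Pixel]],
    malignancy_score: Callable[[Image], float],
    threshold: float,
) -> List[int]:
    """Return indices of candidate regions whose isolated score reaches the threshold."""
    kept = []
    for idx, region in enumerate(candidates):
        # Start from the healthy edit, in which every candidate is hidden ...
        probe = dict(healthy_edit)
        # ... then reveal only the current candidate by restoring its original pixels.
        for pixel in region:
            probe[pixel] = original[pixel]
        if malignancy_score(probe) >= threshold:
            kept.append(idx)
    return kept


# Toy data (assumption): a 1x4 "image" with one strong and one weak candidate.
original = {(0, 0): 0.9, (0, 1): 0.0, (0, 2): 0.1, (0, 3): 0.0}
healthy_edit = {pixel: 0.0 for pixel in original}
candidates = [{(0, 0)}, {(0, 2)}]
toy_score = lambda image: sum(image.values())

print(hide_and_seek(original, healthy_edit, candidates, toy_score, threshold=0.5))  # prints [0]
```

Scoring each candidate in isolation is what lets the method attribute malignancy to individual regions rather than to the vertebra as a whole.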

replace-cross Time-Frequency Analysis for Neural Networks

Authors: Ahmed Abdeljawad, Elena Cordero

Abstract: We develop a quantitative approximation theory for shallow neural networks using tools from time-frequency analysis. Working in weighted modulation spaces $M^{p,q}_m(\mathbf{R}^{d})$, we prove dimension-independent approximation rates in Sobolev norms $W^{n,r}(\Omega)$ for networks whose units combine standard activations with localized time-frequency windows. Our main result shows that for $f \in M^{p,q}_m(\mathbf{R}^{d})$ one can achieve \[ \|f - f_N\|_{W^{n,r}(\Omega)} \lesssim N^{-1/2}\,\|f\|_{M^{p,q}_m(\mathbf{R}^{d})}, \] on bounded domains, with explicit control of all constants. We further obtain global approximation theorems on $\mathbf{R}^{d}$ using weighted modulation dictionaries, and derive consequences for Feichtinger's algebra, Fourier-Lebesgue spaces, and Barron spaces. Numerical experiments in one and two dimensions confirm that modulation-based networks achieve substantially better Sobolev approximation than standard ReLU networks, consistent with the theoretical estimates.

replace-cross LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving

Authors: Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, Kashyap Chitta

Abstract: Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6 v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.

URLs: https://github.com/autonomousvision/lead

replace-cross On Harnessing Idle Compute at the Edge for Foundation Model Training

Authors: Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina

Abstract: The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely limited to large cloud operators. Edge-assisted foundation model training that harnesses spare compute on edge devices offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable. They also do not handle device heterogeneity and churn satisfactorily. We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern -- its input matrices, sent over downlink, are much larger than the partial output blocks returned over uplink -- matching edge networks where downlink bandwidth exceeds uplink by 2--10x. Exploiting this alignment with a parameter-server-centric architecture, Cleave makes per-device communication \emph{decrease} as more devices join, rather than stay constant as in conventional TP. Decomposing training into independent sub-GEMM tasks yields one scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn. Our evaluation shows that Cleave achieves cloud-comparable GPU training performance and outperforms state-of-the-art edge-training methods by 4--10x in per-batch runtime at the same device counts. Beyond this shared operating range, Cleave scales to thousands of heterogeneous devices -- a regime where prior edge-training systems cannot operate -- and achieves at least 100x faster recovery from device failures.

replace-cross Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Authors: Dongqi Liu, Hang Ding, Qiming Feng, Xurong Xie, Zhucun Xue, Chengjie Wang, Jian Li, Jiangning Zhang, Yabiao Wang

Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

replace-cross Gradient-based Optimisation of Modulation Effects

Authors: Alistair Carson, Alec Wright, Stefan Bilbao

Abstract: Modulation effects such as phasers, flangers and chorus effects are heavily used in conjunction with the electric guitar. Machine learning based emulation of analog modulation units has been investigated in recent years, but most methods have either been limited to one class of effect or suffer from a high computational cost or latency compared to canonical digital implementations. Here, we build on previous work and present a framework for modelling flanger, chorus and phaser effects based on differentiable digital signal processing. The model is trained in the time-frequency domain, but at inference operates in the time-domain, requiring zero latency. We investigate the challenges associated with gradient-based optimisation of such effects, and show that low-frequency weighting of loss functions avoids convergence to local minima when learning delay times. We show that when trained against analog effects units, sound output from the model is in some cases perceptually indistinguishable from the reference, but challenges still remain for effects with long delay times and feedback.

replace-cross Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODACER) for Safe Reinforcement Learning in Optimal Control

Authors: Roya Khalili Amirabadi, Mohsen Jalaeian Farimani, Omid Solaymani Fard

Abstract: This paper proposes a novel reinforcement learning framework, named Self-Organizing Dual-buffer Adaptive Clustering Experience Replay (SODACER), designed to achieve safe and scalable optimal control of nonlinear systems. The proposed SODACER mechanism consists of a Fast-Buffer for rapid adaptation to recent experiences and a Slow-Buffer equipped with a self-organizing adaptive clustering mechanism to maintain diverse and non-redundant historical experiences. The adaptive clustering mechanism dynamically prunes redundant samples, optimizing memory efficiency while retaining critical environmental patterns. The approach integrates SODACER with Control Barrier Functions (CBFs) to guarantee safety by enforcing state and input constraints throughout the learning process. To enhance convergence and stability, the framework is combined with the Sophia optimizer, enabling adaptive second-order gradient updates. The proposed SODACER-Sophia architecture ensures reliable, effective, and robust learning in dynamic, safety-critical environments, offering a generalizable solution for applications in robotics, healthcare, and large-scale system optimization. The proposed approach is validated on a nonlinear Human Papillomavirus (HPV) transmission model with multiple control inputs and safety constraints. Comparative evaluations against random and clustering-based experience replay methods demonstrate that SODACER achieves faster convergence, improved sample efficiency, and a superior bias-variance trade-off, while maintaining safe system trajectories, validated via the Friedman test.
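The dual-buffer idea in the abstract above (a small buffer of recent experience plus a pruned long-term buffer) can be sketched roughly as below. This is an assumption-laden toy: the class name `DualBufferReplay`, the fixed-radius novelty test, and the choice to store bare states are all illustrative stand-ins, not the paper's self-organizing adaptive clustering mechanism, and the CBF and Sophia components are not modeled.

```python
import math
from collections import deque


class DualBufferReplay:
    """Toy dual-buffer replay: recent samples plus non-redundant history."""

    def __init__(self, fast_capacity=4, radius=0.5):
        # Fast buffer: keeps only the most recent samples (oldest dropped first).
        self.fast = deque(maxlen=fast_capacity)
        # Slow buffer: one representative per neighbourhood of the state space.
        self.slow = []
        self.radius = radius

    def add(self, state):
        state = tuple(state)
        self.fast.append(state)
        # Treat a sample as redundant if a stored representative lies within
        # `radius` of it (a crude stand-in for adaptive clustering).
        if all(math.dist(state, rep) > self.radius for rep in self.slow):
            self.slow.append(state)


buffer = DualBufferReplay(fast_capacity=2, radius=0.5)
for sample in [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]:
    buffer.add(sample)

print(list(buffer.fast))  # the two most recent samples
print(buffer.slow)        # the near-duplicate (0.1, 0.0) was not stored
```

A training loop would then draw minibatches from both buffers, so that recent and historically diverse experience are both represented.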

replace-cross GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

Authors: Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar

Abstract: We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.

replace-cross Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

Authors: Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu

Abstract: While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22$\times$. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.

replace-cross Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

Authors: Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li

Abstract: Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.

replace-cross LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

Authors: Tommaso Felice Banfi, Sashenka Gamage

Abstract: We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with \emph{Tic-Tac-Toe}. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from $-11.6\%$ with the baseline LLM to $+9.5\%$ with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
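The uncertainty-dependent budget described in the abstract above can be illustrated with a short sketch. The entropy thresholds (`low`, `high`) and the example/path counts are hypothetical values chosen for illustration, not settings reported by the authors; only the win/tie/loss scoring (+1, 0, -1) comes from the abstract.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def reasoning_budget(probs, low=0.5, high=1.5):
    """Map uncertainty to a retrieval/reasoning budget (thresholds are assumptions)."""
    entropy = token_entropy(probs)
    if entropy < low:
        # Confident: concise reasoning with minimal context.
        return {"examples": 1, "paths": 1}
    if entropy < high:
        return {"examples": 3, "paths": 2}
    # Uncertain: expanded multi-path chain-of-thought exploration.
    return {"examples": 5, "paths": 4}


def average_outcome(results):
    """Average game outcome with win = +1, tie = 0, loss = -1."""
    score = {"win": 1, "tie": 0, "loss": -1}
    return sum(score[r] for r in results) / len(results)


print(reasoning_budget([0.97, 0.01, 0.01, 0.01]))
print(reasoning_budget([0.25, 0.25, 0.25, 0.25]))
print(average_outcome(["win", "loss", "tie", "win"]))
```

Tying the budget to entropy keeps the number of LLM queries low on easy moves while spending more computation where the model is unsure.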

replace-cross Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

Authors: Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, Junbo Zhao

Abstract: Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions -- parallelism strength and generation order -- using Average Finalization Parallelism (AFP) and Kendall's tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require "backward information" (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
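As a rough illustration of how Kendall's tau can quantify generation order, the sketch below compares the step at which each position was finalized against plain left-to-right order. It implements tau-a for rankings without ties; the variable `finalization_step` and the example values are invented, and the exact variant and inputs used in the paper are not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau-a between two equal-length rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(order_a)), 2):
        sign = (order_a[i] - order_a[j]) * (order_b[i] - order_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# finalization_step[i] = decoding step at which token position i was finalized.
positions = [0, 1, 2, 3]
finalization_step = [1, 0, 2, 3]  # positions 0 and 1 were filled out of order

print(kendall_tau(positions, positions))          # prints 1.0 (strict left-to-right)
print(kendall_tau(positions, finalization_step))
```

A value near 1 indicates close to autoregressive left-to-right decoding, a value near -1 indicates right-to-left decoding, and values near 0 indicate an order largely unrelated to position.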

replace-cross Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text

Authors: Tunazzina Islam

Abstract: Large language models (LLMs) are increasingly capable of generating personalized, persuasive text at scale, raising new questions about bias and fairness in automated communication. This paper presents the first systematic analysis of how LLMs behave when tasked with demographic-conditioned targeted messaging. We introduce a controlled evaluation framework using three leading models: GPT-4o, Llama-3.3, and Mistral-Large-2.1, across two generation settings: Standalone Generation, which isolates intrinsic demographic effects, and Context-Rich Generation, which incorporates thematic and regional context to emulate realistic targeting. We evaluate generated messages along three dimensions: lexical content, language style, and persuasive framing. We instantiate this framework on climate communication and find consistent age- and gender-based asymmetries across models: male- and youth-targeted messages emphasize agency, innovation, and assertiveness, while female- and senior-targeted messages stress warmth, care, and tradition. Contextual prompts systematically amplify these disparities, with persuasion scores significantly higher for messages tailored to younger or male audiences. Our findings demonstrate how demographic stereotypes can surface and intensify in LLM-generated targeted communication, underscoring the need for bias-aware generation pipelines and transparent auditing frameworks that explicitly account for demographic conditioning in socially sensitive applications.

replace-cross Embedding of Low-Dimensional Sensory Dynamics in Recurrent Networks: Implications for the Geometry of Neural Representation

Authors: Vikas N. O'Reilly-Shah, Alessandro Maria Selvitella

Abstract: Neural population activity in sensory cortex is organized on low-dimensional manifolds, but why such manifolds arise and what determines their geometry remain unclear. We model cortical populations as recurrent circuits driven by low-dimensional regular sensory dynamics (circles, tori). Combining generalized synchronization and delay-embedding theory, we show that contracting recurrent networks generically develop smooth internal manifolds embedding the sensory dynamics. The dimensional requirement is modest: N>2d suffices, where d is the intrinsic sensory dimension (compatible with Whitney and Takens bounds). We prove a prediction-separation result linking representational geometry to predictive performance without assuming contraction: accurate prediction forces state separation up to a resolution set by prediction error, yielding categorical boundaries, metameric equivalence, and discrimination thresholds. Numerical experiments with trained tanh RNNs recover ring- and torus-shaped hidden manifolds; state separation improves sharply at the 2d+1 threshold. Training pushes networks beyond strict contraction, yet embedding persists, indicating sufficient but not necessary conditions. These results provide a mechanistic account of why sensory manifolds emerge in recurrent circuits and how prediction constrains their resolution.

replace-cross Sparse clustering via the Deterministic Information Bottleneck algorithm

Authors: Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Abstract: Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges. We present an information theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering. Our proposal constitutes a competitive alternative to existing clustering algorithms for sparse data, as demonstrated through simulations on synthetic data. The effectiveness of our method is established by an application on a real-world genomics data set.

replace-cross MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment

Authors: Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K. P. Subbalakshmi

Abstract: Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.

replace-cross Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Authors: Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang

Abstract: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.

URLs: https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md

replace-cross Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Authors: Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li

Abstract: Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

replace-cross Physics and causally constrained discrete-time neural models of turbulent dynamical systems

Authors: Fabrizio Falasca, Laure Zanna

Abstract: We present a framework for constructing physics and causally constrained neural models of turbulent dynamical systems from data. We first formulate a finite-time flow map with strict energy-preserving nonlinearities for stable modeling of temporally discrete trajectories. We then impose causal constraints to suppress spurious interactions across degrees of freedom. The resulting neural models accurately capture stationary statistics and responses to both small and large external forcings. We demonstrate the framework on the stochastic Charney-DeVore equations and on a symmetry-broken Lorenz-96 system. The framework is broadly applicable to reduced-order modeling of turbulent dynamical systems from observational data.

replace-cross Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE

Authors: Valdemar Švábenský, Brendan Flanagan, Erwin Daniel López Zapata, Atsushi Shimada

Abstract: Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researchers in these domains apply computational methods to analyze data from educational contexts, aiming to better understand and improve teaching and learning. Providing open datasets alongside research papers supports reproducibility, collaboration, and trust in research findings. It also provides individual benefits for authors, such as greater visibility, credibility, and citation potential. Despite these advantages, the availability of open datasets and the associated practices within the learning analytics research communities, especially at their flagship conference venues, remain unclear. We surveyed available datasets published alongside research papers in learning analytics. We manually examined 1,125 papers from three flagship conferences (LAK, EDM, and AIED) over the past five years. We discovered, categorized, and analyzed 172 datasets used in 204 publications. Our study presents the most comprehensive collection and analysis of open educational datasets to date, along with the most detailed categorization. Of the 172 datasets identified, 143 were not captured in any prior survey of open data in learning analytics. We provide insights into the datasets' context, analytical methods, use, and other properties. Based on this survey, we summarize the current gaps in the field. Furthermore, we list practical recommendations, advice, and 8-item guidelines under the acronym PRACTICE with a checklist to help researchers publish their data. Lastly, we share our original dataset: an annotated inventory detailing the discovered datasets and the corresponding publications. We hope these findings will support further adoption of open data practices in learning analytics communities and beyond.

replace-cross Solving and learning advective multiscale Darcian dynamics with the Neural Basis Method

Authors: Yuhe Wang, Min Wang

Abstract: Physics-governed models are increasingly paired with machine learning for accelerated predictions, yet most "physics-informed" formulations treat the governing equations as a penalty loss whose scale and meaning are set by heuristic balancing. This blurs operator structure, thereby confounding solution approximation error with governing-equation enforcement error and making the solving and learning progress hard to interpret and control. Here we introduce the Neural Basis Method, a projection-based formulation that couples a predefined, physics-conforming neural basis space with an operator-induced residual metric to obtain a well-conditioned deterministic minimization. Stability and reliability then hinge on this metric: the residual is not merely an optimization objective but a computable certificate tied to approximation and enforcement, remaining stable under basis enrichment and yielding reduced coordinates that are learnable across parametric instances. We use advective multiscale Darcian dynamics as a concrete demonstration of this broader point. Our method produces accurate and robust solutions in single solves and enables fast and effective parametric inference with operator learning.

replace-cross CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras

Authors: Rong Fu, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong

Abstract: City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.

replace-cross Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Authors: Arindam Khaled

Abstract: We observe that LLM cascading and routing implicitly solves an anytime computation problem -- a class of algorithms, well-studied in classical AI, that improve solutions as additional computation is allocated. We formalize this connection and propose Pyramid MoA, a hierarchical Mixture-of-Agents architecture governed by a decision-theoretic router that escalates queries only when necessary. We establish a Probabilistic Anytime Property with provable monotonicity guarantees and derive a generalized escalation rule from Value of Computation theory that accounts for imperfect oracles, extending the Hansen-Zilberstein monitoring framework to stochastic LLM inference. On MBPP, the router intercepts 81.6% of bugs; on GSM8K/MMLU, the system nearly matches the 68.1% Oracle baseline while achieving up to 42.9% compute savings. The router transfers zero-shot to unseen benchmarks: matching Oracle accuracy on HumanEval (81.1%) and MATH 500 (58.0%) with significant cost reductions. We further discover a context-conditioned anchoring effect across four benchmarks: passing correct SLM reasoning improves Oracle accuracy by up to +19.2pp, while incorrect reasoning degrades it by up to -18.0pp, revealing a fundamental tension in hierarchical MoA architectures.

replace-cross PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

Authors: Binesh Sadanandan, Vahid Behzadan

Abstract: Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, a failure mode that threatens deployment safety. We introduce PSF-Med, a benchmark of 26,850 chest X-ray questions paired with 92,856 meaning-preserving paraphrases across MIMIC-CXR, PadChest, and VinDr-CXR, spanning clinical populations in the US, Spain, and Vietnam. Every paraphrase is validated by an LLM judge using a bidirectional clinical entailment rubric, with 91.6% cross-family agreement. Across nine VLMs, including general-purpose models, we find flip rates from 3% to 37%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
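
Note on the flip-rate metric: the abstract does not give a formula, so the sketch below assumes that a "flip" is a paraphrase whose answer differs from the answer to the original question, and that the flip rate is the fraction of paraphrases that flip. The function name `flip_rate` and the toy records are hypothetical.

```python
def flip_rate(records):
    """Fraction of paraphrase answers that differ from the original answer.

    records: iterable of (original_answer, list_of_paraphrase_answers).
    This definition is an assumption; the benchmark may aggregate differently
    (for example, per question rather than per paraphrase).
    """
    flips = total = 0
    for original, paraphrase_answers in records:
        for answer in paraphrase_answers:
            total += 1
            if answer != original:
                flips += 1
    if total == 0:
        raise ValueError("no paraphrase answers supplied")
    return flips / total


# Toy example: two questions with three and two paraphrases respectively.
example = [
    ("yes", ["yes", "no", "yes"]),  # one flip
    ("no", ["no", "no"]),           # no flips
]
rate = flip_rate(example)
```

As the abstract points out, a low value of this metric alone does not show visual grounding; running the same computation with the image removed would be needed to check reliance on language priors.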

replace-cross FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics

Authors: Yunhua Zhong, Yixuan Tang, Yifan Li, Jie Yang, Pan Liu, Jun Xia

Abstract: The identification and property prediction of chemical molecules are of central importance to drug discovery and materials science, where tandem mass spectrometry provides valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the lack of experimental spectra hinders molecular identification and thus motivates computational approaches to spectrum prediction. Deep learning models appear promising for predicting spectra from molecular structures, but overall assessment remains challenging because of the heterogeneity in methods and the lack of well-defined benchmarks. To address this, we contribute FlexMS, a benchmark framework for constructing and evaluating diverse model architectures in mass spectrum prediction. FlexMS supports the dynamic construction of numerous distinct combinations of model architectures and assesses their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters such as learning rate and data sparsity, pretraining effects, metadata ablation settings, and cross-domain transfer learning analysis. This provides practical guidance in choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.

replace-cross Pacing Opinion Polarization via Graph Reinforcement Learning

Authors: Mingkai Liao

Abstract: Opinion polarization moderation has been studied mainly as an analytical optimization problem under the Friedkin-Johnsen (FJ) model, where intervention algorithms rely on linear steady-state analysis and model-specific derivations. While effective in narrowly structured settings, such methods scale poorly and do not naturally extend to richer intervention regimes. This raises a central question: can polarization moderation be treated as a graph-based sequential planning problem? We answer this question by proposing PACIFIER, to our knowledge the first unified graph learning framework, and in particular the first graph reinforcement learning framework, for opinion polarization moderation. PACIFIER reformulates the canonical ModerateInternal (MI) and ModerateExpressed (ME) problems as sequential decision-making tasks on graphs, replacing repeated analytical recomputation with learned intervention policies. The framework has two variants: PACIFIER RL for long-horizon planning and PACIFIER Greedy for efficient myopic ranking. It also extends naturally to cost-aware moderation, continuous-valued internal opinions, and topology-altering node removal. Experiments on 15 real-world polarized networks reveal a clear regime-dependent picture. In analytically structured MI settings, PACIFIER remains competitive with strong analytical solvers and consistently emerges as the strongest scalable non-analytical alternative. In contrast, in ME, continuous ME, and cost ME, PACIFIER achieves strong and highly consistent superiority over non-PACIFIER baselines. Most importantly, PACIFIER RL becomes decisively superior in cost ME and topology-altering node removal, where long-horizon reasoning over future consequences is crucial. Overall, PACIFIER shifts opinion polarization moderation from model-specific analytical optimization toward a unified graph learning and graph reinforcement learning paradigm.

replace-cross How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

Authors: Ziwen Xu, Kewei Xu, Haoming Xu, Haiwen Hong, Longtao Huang, Hui Xue, Ningyu Zhang, Yongliang Shen, Guozhou Zheng, Huajun Chen, Shumin Deng

Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.

replace-cross Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression

Authors: Mridankan Mandal

Abstract: Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real-world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357-image dual-view dataset with laboratory-validated, component-wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross-view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two-layer gated depthwise convolution (R^2 = 0.903) outperforms cross-view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no-fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training-only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4-point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.

replace-cross BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Authors: Pranav Mantini, Shishir K. Shah

Abstract: Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP

URLs: https://github.com/QuantitativeImagingLaboratory/BilinearCLIP

replace-cross Learning to Unscramble: Simplifying Symbolic Expressions via Self-Supervised Oracle Trajectories

Authors: David Shih

Abstract: We present a new self-supervised machine learning approach for symbolic simplification of complex mathematical expressions. Training data is generated by scrambling simple expressions and recording the inverse operations, creating oracle trajectories that provide both goal states and explicit paths to reach them. A permutation-equivariant, transformer-based policy network is then trained on this data step-wise to predict the oracle action given the input expression. We demonstrate this approach on two problems in high-energy physics: dilogarithm reduction and spinor-helicity scattering amplitude simplification. In both cases, our trained policy network achieves near-perfect solve rates across a wide range of difficulty levels, substantially outperforming prior approaches based on reinforcement learning and end-to-end regression. When combined with contrastive grouping and beam search, our model achieves a 100% full simplification rate on a representative selection of 5-point gluon tree-level amplitudes in Yang-Mills theory, including expressions with over 200 initial terms.

replace-cross Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness

Authors: Abhinaba Basu, Pavan Chakraborty

Abstract: Language models increasingly show their work by writing step-by-step reasoning before answering. But are these steps genuinely used, or is the answer rigid - fixed before reasoning begins? We introduce the Step-Level Reasoning Capacity (SLRC) metric and prove it is a consistent causal estimator (Theorem 1). We propose LC-CoSR, a training method with Lyapunov stability guarantees that directly reduces rigidity. Evaluating 16 frontier models (o4-mini, GPT-5.4, Claude Opus, Grok-4, DeepSeek-R1, Gemini 2.5 Pro, and others) across six domains at N=133-500, we find reasoning falls into three modes. OpenAI's o4-mini shows 74-88% step necessity on five of six tasks (73.8-88.3%) - the highest SLRC in our study. The critical differentiator is RL-based reasoning training, not thinking tokens: Grok-4's reasoning mode shows lower faithfulness than its non-reasoning mode (1.4% vs 7.2% necessity). We discover a faithfulness paradox - high-SLRC models are more susceptible to sycophancy - and propose the Reasoning Integrity Score (RIS = SLRC x (1-Sycophancy)), which significantly predicts error detection (rho=0.66, p=0.026). LC-CoSR achieves 2.6x less negative reward than FARL and CSR baselines without external model dependencies.
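
Note on the Reasoning Integrity Score: the abstract states RIS = SLRC x (1 - Sycophancy). A minimal sketch of that arithmetic follows, assuming both inputs are expressed as fractions in [0, 1]; the function name and the sample values are hypothetical and are not results from the paper.

```python
def reasoning_integrity_score(slrc, sycophancy):
    """RIS = SLRC * (1 - Sycophancy), as given in the abstract.

    Assumes both arguments are fractions in [0, 1]; that range is an
    assumption made here, not something the abstract specifies.
    """
    for name, value in (("slrc", slrc), ("sycophancy", sycophancy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return slrc * (1.0 - sycophancy)


# Hypothetical values: high step necessity, moderate sycophancy.
ris_example = reasoning_integrity_score(0.8, 0.25)
# A fully sycophantic model scores zero regardless of its SLRC.
ris_zero = reasoning_integrity_score(0.9, 1.0)
```

The product form means a model cannot score well on step necessity alone, which matches the faithfulness paradox the abstract describes.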

replace-cross Woosh: A Sound Effects Foundation Model

Authors: Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji

Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

URLs: https://github.com/SonyResearch/Woosh, https://sonyresearch.github.io/Woosh/

replace-cross PAC learning PDFA from data streams

Authors: Robert Baumgartner, Sicco Verwer

Abstract: This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix and now has a full dedicated section. State machines are models that simulate the behavior of discrete event systems, such as software systems, network interactions, and control systems, and they have been researched extensively. Most learning algorithms, however, assume that all data is available when the algorithm starts, and little research has been done on learning state machines from streaming data. In this paper, we want to close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and show a theoretical improvement that improves run-time without sacrificing correctness of the algorithm in larger sample sizes.

replace-cross SkVM: Revisiting Language VM for Skills across Heterogeneous LLMs and Harnesses

Authors: Le Chen, Erhu Feng, Yubin Xia, Haibo Chen

Abstract: LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkVM, a compilation and runtime system designed for portable and efficient skill execution. At compile time, SkVM performs capability-based compilation, environment binding, and concurrency extraction. At runtime, SkVM applies JIT code solidification and adaptive recompilation for performance optimization. We evaluate SkVM across eight LLMs of varying scales and three agent harnesses, covering SkillsBench and representative skill tasks. Results demonstrate that SkVM significantly improves task completion rates across different models and environments while reducing token consumption by up to 40%. In terms of performance, SkVM achieves up to 3.2x speedup with enhanced parallelism, and 19-50x latency reduction through code solidification.

replace-cross Zero-Shot Quantization via Weight-Space Arithmetic

Authors: Daniele Solombrino, Antonio Andrea Gargiulo, Adrian Robert Minut, Luca Zhou, Alessandro Zirilli, Emanuele Rodolà

Abstract: We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.

replace-cross Spatiotemporal-Aware Bit-Flip Injection on DNN-based Advanced Driver Assistance Systems (extended version)

Authors: Taibiao Zhao, Xiang Zhang, Mingxuan Sun, Ruyi Ding, Xugui Zhou

Abstract: Modern advanced driver assistance systems (ADAS) rely on deep neural networks (DNNs) for perception and planning. Since DNNs' parameters reside in DRAM during inference, bit flips caused by cosmic radiation or low-voltage operation may corrupt DNN computations, distort driving decisions, and lead to real-world incidents. This paper presents a SpatioTemporal-Aware Fault Injection (STAFI) framework to locate critical fault sites in DNNs for ADAS efficiently. Spatially, we propose a Progressive Metric-guided Bit Search (PMBS) that efficiently identifies critical network weight bits whose corruption causes the largest deviations in driving behavior (e.g., unintended acceleration or steering). Furthermore, we develop a Critical Fault Time Identification (CFTI) mechanism that determines when to trigger these faults, taking into account the context of real-time systems and environmental states, to maximize the safety impact. Experiments on DNNs for a production ADAS demonstrate that STAFI uncovers 29.56x more hazard-inducing critical faults than the strongest baseline.

replace-cross RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin

Authors: Ying Yao

Abstract: Unsustainable land-use practices in ecologically sensitive regions threaten biodiversity, water resources, and the livelihoods of millions. This paper presents a deep reinforcement learning (RL) framework for optimizing land-use allocation in the Lake Malawi Basin to maximize total ecosystem service value (ESV). Drawing on the benefit transfer methodology of Costanza et al., we assign biome-specific ESV coefficients -- locally anchored to a Malawi wetland valuation -- to nine land-cover classes derived from Sentinel-2 imagery. The RL environment models a 50x50 cell grid at 500m resolution, where a Proximal Policy Optimization (PPO) agent with action masking iteratively transfers land-use pixels between modifiable classes. The reward function combines per-cell ecological value with spatial coherence objectives: contiguity bonuses for ecologically connected land-use patches (forest, cropland, built area etc.) and buffer zone penalties for high-impact development adjacent to water bodies. We evaluate the framework across three scenarios: (i) pure ESV maximization, (ii) ESV with spatial reward shaping, and (iii) a regenerative agriculture policy scenario. Results demonstrate that the agent effectively learns to increase total ESV; that spatial reward shaping successfully steers allocations toward ecologically sound patterns, including homogeneous land-use clustering and slight forest consolidation near water bodies; and that the framework responds meaningfully to policy parameter changes, establishing its utility as a scenario-analysis tool for environmental planning.

replace-cross Beyond Fluency: Toward Reliable Trajectories in Agentic IR

Authors: Anushree Sinha, Srivaths Ranganathan, Debanshu Das, Abhishek Dharmaratnakar

Abstract: Information Retrieval is shifting from passive document ranking toward autonomous agentic workflows that operate in multi-step Reason-Act-Observe loops. In such long-horizon trajectories, minor early errors can cascade, leading to functional misalignment between internal reasoning and external tool execution despite continued linguistic fluency. This position paper synthesizes failure modes observed in industrial agentic systems, categorizing errors across planning, retrieval, reasoning, and execution. We argue that safe deployment requires moving beyond endpoint accuracy toward trajectory integrity and causal attribution. To address compounding error and deceptive fluency, we propose verification gates at each interaction unit and advocate systematic abstention under calibrated uncertainty. Reliable Agentic IR systems must prioritize process correctness and grounded execution over plausible but unverified completion.

replace-cross How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Authors: Gregory N. Frank

Abstract: This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, but interchange testing (p<0.001) and knockout cascade confirm it is causally necessary. Interchange screening at n>=120 detects the same motif in twelve models from six labs (2B to 72B), though specific heads differ by lab. Per-head ablation weakens up to 58x at 72B and misses gates that interchange identifies; interchange is the only reliable audit at scale. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a family while behavioral benchmarks register no change. Routing is early-commitment: the gate commits at its own layer before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses 70 to 99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plain/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.

replace-cross EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration

Authors: Jianfei Wu, Zhichun Wang, Zhensheng Wang, Zhiyu He

Abstract: While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user's real-time coordinate and integrates the dual objectives of a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture to evaluate the LLMs' capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at https://github.com/kg-bnu/EVGeoQA.

URLs: https://github.com/kg-bnu/EVGeoQA

replace-cross FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Authors: Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao

Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.

URLs: https://ai4manufacturing.github.io/forge-web

replace-cross Rhizome OS-1: Rhizome's Semi-Autonomous Operating System for Small Molecule Drug Discovery

Authors: Yiwen Wang, Gregory Sinenka, Xhuliano Brace

Abstract: We present Rhizome OS-1, a semi-autonomous operating system for small molecule drug discovery in which multi-modal AI agents operate as a full multidisciplinary discovery team. These agents function as computational chemists, medicinal chemists, and patent agents: they write and execute analysis code (fingerprint clustering, R-group decomposition, substructure search), visually triage molecular grids using vision capabilities, formulate explicit medicinal chemistry hypotheses across three strategy tiers, assess patent freedom-to-operate, and dynamically adapt generation strategies based on empirical screening feedback. Powered by r1 - a 246M-parameter graph diffusion model trained on 800 million molecular graphs - the system generates novel chemical matter directly on molecular graphs using fragment masking, scaffold decoration, linker design, and graph editing primitives. In two oncology campaigns (BCL6 BTB domain and EZH2 SET domain), the agent team executed 26 seeds and produced 5,231 novel molecules. Across both targets, 91.9% of generated Murcko scaffolds are absent from ChEMBL, with median Tanimoto similarity of 0.56-0.69 to the nearest known active. Boltz-2 binding affinity predictions, calibrated against ChEMBL data, achieved Spearman correlations of -0.53 to -0.64 and ROC AUC values of 0.88-0.93. These results demonstrate that semi-autonomous agent systems, equipped with graph-native generative tools and physics-informed scoring, enable a new paradigm for early-stage drug discovery: scaled, rapid, and adaptive inverse design with embedded medicinal chemistry reasoning.
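
The novelty figure quoted above (median Tanimoto similarity to the nearest known active) can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a set of on-bit indices; the function names and the toy fingerprints are hypothetical and are not taken from the Rhizome OS-1 pipeline, whose fingerprint type and tooling the abstract does not specify.

```python
from statistics import median


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def nearest_similarity(query: set[int], references: list[set[int]]) -> float:
    """Similarity of `query` to its most similar reference fingerprint."""
    return max(tanimoto(query, ref) for ref in references)


def median_nearest_similarity(generated: list[set[int]], references: list[set[int]]) -> float:
    """Median, over generated molecules, of the similarity to the nearest reference."""
    return median(nearest_similarity(g, references) for g in generated)


# Toy, made-up fingerprints (hypothetical data, for illustration only).
known_actives = [{2, 3, 4}, {1, 2, 3, 5}]
generated = [{1, 2, 3}, {7, 8}, {2, 3, 4}]
summary = median_nearest_similarity(generated, known_actives)
```

In this toy data the three generated fingerprints have nearest-reference similarities of 0.75, 0.0, and 1.0, so the median is 0.75. Lower values indicate molecules further from known chemistry.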

replace-cross CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Authors: Mohamed Ehab, Ali Hamdi, Khaled Shaban

Abstract: Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

replace-cross IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

Authors: David Gringras

Abstract: Ask a frontier model how to taper six milligrams of alprazolam (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word ("I'm a psychiatrist; a patient presents with...") and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap. Sixty pre-registered clinical scenarios, six frontier models, 3,600 responses, scored on two axes (commission harm, CH 0-3; omission harm, OH 0-4) through a structured-evaluation pipeline validated against physician scoring (kappa_w = 0.571, within-1 agreement 96%). The central finding is identity-contingent withholding: match the same clinical question in physician vs. layperson framing and all five testable models provide better guidance to the physician (decoupling gap +0.38, p = 0.003; binary hit rates on safety-colliding actions drop 13.1 percentage points in layperson framing, p < 0.0001, while non-colliding actions show no change). The gap is widest for the model with the heaviest safety investment (Opus, +0.65). Three failure modes separate cleanly: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). The standard LLM judge assigns OH = 0 to 73% of responses a physician scores OH >= 1 (kappa = 0.045); the evaluation apparatus has the same blind spot as the training apparatus. Every scenario targets someone who has already exhausted the standard referrals.
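
The two agreement statistics quoted above (weighted kappa and within-1 agreement) can be sketched as follows for ordinal scores such as OH 0-4. This is an illustrative sketch only: it assumes linear disagreement weights, which the abstract does not specify, and the function names and example ratings are hypothetical rather than taken from the IatroBench pipeline.

```python
def weighted_kappa(r1: list[int], r2: list[int], k: int) -> float:
    """Linear-weighted Cohen's kappa for two raters scoring items on 0..k-1.

    Assumption: linear weights |i - j|; the (k - 1) normaliser cancels in the ratio.
    Undefined (division by zero) if both raters only ever use one shared category.
    """
    n = len(r1)
    # Observed joint proportions of (rater 1 score, rater 2 score).
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]                       # rater 1 marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    # Weighted observed disagreement vs. disagreement expected under independence.
    observed = sum(abs(i - j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected


def within_one(r1: list[int], r2: list[int]) -> float:
    """Fraction of items on which the two raters differ by at most one point."""
    return sum(abs(a - b) <= 1 for a, b in zip(r1, r2)) / len(r1)


# Made-up ratings on a 0-4 scale (hypothetical data, for illustration only).
pipeline = [0, 1, 4, 2, 3]
physician = [1, 1, 2, 2, 3]
kappa_w = weighted_kappa(pipeline, physician, k=5)
agreement = within_one(pipeline, physician)
```

Weighted kappa penalises a two-point disagreement more than a one-point one, which is why it can sit well below the within-1 agreement rate on the same data.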

replace-cross SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Authors: Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce a Tool-Memory based self-evolving agentic framework SEARL. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.

replace-cross Unified Multimodal Uncertain Inference

Authors: Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz

Abstract: We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.