Authors: Vladimer Khasia
Abstract: The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L · B · N) spatial bottleneck (where B is the sequence-batch cardinality and N is the feature dimension). This constraint has historically throttled the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this bottleneck, it has historically suffered from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L · R · N) and heavily decreases the backward-pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS matches (and marginally outperforms) the validation loss of exact backpropagation (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), demonstrating the robustness of the estimator. The code is available at https://github.com/VladimerKhasia/basis
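The unbiasedness of sketched weight gradients that underlies approaches like BASIS can be illustrated with a plain random-sign sketch of the batch axis (a generic count-sketch-style construction, not the paper's Balanced Hashing; all dimensions and variable names below are hypothetical): compressing B rows down to R before forming dW = dY^T X leaves the gradient unchanged in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M, R = 64, 8, 6, 16           # batch size, input dim, output dim, sketch rank
X = rng.standard_normal((B, N))      # layer input activations
dY = rng.standard_normal((B, M))     # upstream error signal
exact = dY.T @ X                     # exact weight gradient dW

est = np.zeros_like(exact)
T = 2000
for _ in range(T):
    # random-sign sketch compressing the batch axis B -> R
    S = rng.choice([-1.0, 1.0], size=(R, B)) / np.sqrt(R)
    est += (S @ dY).T @ (S @ X)      # sketched dW computed from rank-R tensors
est /= T                             # unbiased: E[(S dY)^T (S X)] = dY^T X
```

Averaging many independent sketches drives the estimate toward the exact gradient, which is exactly the variance the paper's balanced construction is designed to suppress without averaging.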
Authors: Xingsheng Chen, Xianpei Mu, Deyu Yi, Yilin Yuan, Xingwei He, Bo Gao, Regina Zhang, Pietro Lio, Siu-Ming Yiu
Abstract: Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. We therefore introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with an FFT-Laplace Transform and a TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.
Authors: Brady Steele
Abstract: We find that LoRA fine-tuning exhibits un-learning on contested examples: items with high annotator disagreement show increasing loss during training, a qualitatively distinct pattern largely absent under full fine-tuning and consistent across all six models tested (four encoder, two decoder-only). This discovery emerges from correlating annotation entropy, computed from ChaosNLI's 100 labels per example, with per-example area under the loss curve (AULC) on SNLI and MNLI. The correlation is positive in all 25 conditions tested (Spearman $\rho = 0.06$-$0.43$), with decoder-only models showing stronger correlations than encoders at matched LoRA rank. The effect survives partial-correlation controls and replicates across seeds and datasets. A preliminary noise-injection experiment is consistent with these findings.
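The abstract's core measurement, correlating per-example annotation entropy with a loss-curve statistic via Spearman's rho, can be sketched in a few lines; the label counts and AULC values below are hypothetical stand-ins, not ChaosNLI data.

```python
import math

def entropy(counts):
    """Annotation entropy from per-example label counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def rank(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den

# Hypothetical: four examples with ChaosNLI-style 100-annotator label counts
label_counts = [(98, 1, 1), (60, 30, 10), (34, 33, 33), (80, 15, 5)]
ent = [entropy(c) for c in label_counts]
aulc = [0.10, 0.45, 0.80, 0.30]   # hypothetical per-example area under the loss curve
rho = spearman(ent, aulc)
```

Here higher-disagreement examples were given higher AULC by construction, so the toy rho is maximal; the paper's reported range (0.06-0.43) reflects real, noisier data.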
Authors: Pegah Ahadian, Mingrui Yang, Sixu Chen, Xiaojuan Li, Qiang Guan
Abstract: Knee osteoarthritis frequently exhibits discordance between structural damage observed in imaging and patient-reported symptoms such as pain. This mismatch complicates clinical interpretation and patient stratification and remains insufficiently modeled in existing decision support systems. We propose a discordance-aware multimodal framework that combines machine learning prediction models with a tool-grounded multi-agent reasoning system. Using baseline data from the FNIH Osteoarthritis Biomarkers Consortium, we trained multimodal models to predict two progression tasks: joint-space-loss-only progression versus non-progression, and pain-only progression versus non-progression. The predictive system integrates three modality-specific experts: a CatBoost tabular model using demographic, radiographic, MRI-derived scalar, and biomarker features; MRI image embeddings extracted using a ResNet18 backbone; and X-ray embeddings derived from the same architecture. Expert predictions are fused using a stacking ensemble. Residual-based models estimate expected pain from structural features, enabling the computation of a pain-structure discordance score between observed and expected symptoms. A multi-agent reasoning layer interprets these signals to assign clinically interpretable OA phenotypes and generate phenotype-specific management recommendations.
Authors: Alizishaan Anwar Hussein Khatri
Abstract: The use of Deep Neural Network based systems in the real world is growing. They have achieved state-of-the-art performance on many image, speech and text datasets. They have been shown to be powerful systems that are capable of learning detailed relationships and abstractions from the data. This is a double-edged sword which makes such systems vulnerable to learning the noise in the training set, thereby negatively impacting performance. This is also known as the problem of \emph{overfitting} or \emph{poor generalization}. In a practical setting, analysts typically have limited data to build models that must generalize to unseen data. In this work, we explore the use of a differential-privacy based approach to improve generalization in Deep Neural Networks.
Authors: Jiawei Huang, Qingping Yang, Renjie Zheng, Jiaze Chen
Abstract: Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.
Authors: Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng
Abstract: MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2-10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://github.com/Ed-Bg/SaFeR-Steer
Authors: Nikola Jovi\v{s}i\'c, Milica \v{S}kipina, Vanja \v{S}venda
Abstract: Data scarcity and weak supervision continue to limit the performance of machine learning models in many real-world applications, such as mammography, where Multiple Instance Learning (MIL) often offers the best formulation. While recent foundation models provide strong semantic representations out of the box, effective augmentation of such representations of MIL data remains limited, as existing methods operate at the instance level and fail to capture intra-bag dependencies. In this work, we introduce SetFlow, a generative architecture that models entire MIL bags (i.e., sets) directly in the representation space. Our approach leverages the flow matching paradigm combined with a Set Transformer-inspired design, enabling it to handle permutation-invariant inputs while capturing interactions between instances within each bag. The model is conditioned on both class labels and input scale, allowing it to generate coherent and semantically consistent sets of representations. We evaluate SetFlow on a large-scale mammography benchmark using a state-of-the-art MIL-PF classification pipeline. The generated samples are shown to closely match the original data distribution and even improve downstream performance when used for augmentation. Furthermore, training on synthetic data alone shows competitive results, demonstrating the effectiveness of representation-space generative modeling for data-scarce and privacy-sensitive tasks.
Authors: Ruize Xia
Abstract: CLIP adaptation can improve in-domain accuracy while degrading out-of-domain transfer, but comparisons between Full Fine-Tuning (Full FT) and LoRA are often confounded by different learning-rate conventions. We study how adaptation method and optimization scale jointly shape attention drift and transfer retention in CLIP using a controlled matched-learning-rate comparison of Full FT and LoRA. The completed matrix contains 80 runs on CLIP ViT-B/32 across EuroSAT and Oxford-IIIT Pets, spanning four shared learning rates ($10^{-6}$, $5{\times}10^{-6}$, $10^{-5}$, $5{\times}10^{-5}$) and five seeds, and evaluates attention-drift metrics, best validation accuracy, and adapter-aware CIFAR-100 zero-shot accuracy. Learning rate strongly modulates structural change: on EuroSAT, Full FT moves from mild entropy broadening at $10^{-6}$ to marked contraction at $5{\times}10^{-5}$, whereas LoRA remains entropy-positive across the full matched grid. At matched learning rates, LoRA preserves substantially more zero-shot transfer than Full FT, averaging $45.13\%$ versus $11.28\%$ CIFAR-100 accuracy on EuroSAT and $58.01\%$ versus $8.54\%$ on Pets. Oxford-IIIT Pets also reveals a regime effect: low-learning-rate LoRA underfits in-domain, so method-only averages can obscure when LoRA becomes competitive. Supporting rollout, patch-to-patch, and CKA analyses are directionally consistent with the controlled matrix. Overall, matched-learning-rate evaluation materially changes the interpretation of Full FT versus LoRA, and attention drift is most useful as a descriptive diagnostic of representation preservation rather than a causal explanation of transfer behavior.
Authors: Yunxiang Guo
Abstract: We study asynchronous alignment, a first-class multimodal learning setting in which a dense primary stream must be fused with sporadic external context whose value depends on when it arrives. Unlike standard multimodal benchmarks that assume structural synchrony, this setting requires models to reason explicitly about freshness and trust. We focus on the event-conditioned case in which continuous market states are paired with delayed web intelligence, and we use high-frequency cryptocurrency markets only as a timestamped, high-noise stress test for this broader problem. We propose CGCMA (Conditionally-Gated Cross-Modal Attention), whose central design principle is to separate text-conditioned grounding from lag-aware trust control. Text first attends over price sequences to identify event-relevant market states, after which a conditional gate uses modality agreement, web features, and lag $\tau_{\mathrm{lag}}$ to regulate residual injection and fall back toward unimodal prediction when external context is stale or contradictory. We introduce CMI (Crypto Market Intelligence), an asynchronous evaluation corpus with 27,914 real-news samples pairing high-frequency price sequences with lagged web intelligence. On the current short real-news corpus, CGCMA attains the highest mean downstream Sharpe ratio ($+0.449 \pm 0.257$) among the evaluated baselines under a shared zero-cost threshold-trading evaluation on news-available bars. Additional controls show that the gain is not explained by web scalars alone and is not recovered by simple freshness heuristics. The resulting evidence supports problem validity and a promising asynchronous multimodal gain on this stress-test setting.
Authors: Satchel Grant, Victor Gillioz, Jake Ward, Thomas McGrath
Abstract: Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to large language models (LLMs) during training, and both defend the LLM against acquiring the trait. The surprising success of these methods comes with the question: how do they work? Are PPS and IP doing the same thing? We provide behavioral and mechanistic comparisons of these two methods using "evilness" as a case-study trait. Our central finding is that PPS and IP achieve their defensive benefits through distinct mechanisms. Behaviorally, we show that neither PPS nor IP operates through a purely associative mechanism; and PPS can both defend against trait acquisition and actively reduce pre-existing expression, whereas IP is ineffective in models that were previously finetuned to express the trait. This behavioral divergence is reflected mechanistically: PPS shifts the activation gradient towards an attenuating direction along the PPS vector axis. When the PPS vector is aligned with a trait-expressing axis, it can reverse the gradient pressure, reducing rather than increasing activation along that axis. In contrast, IP continues to resist a precise mechanistic account. Direct cosine similarity analyses reveal that IP has a characteristically different gradient signature than PPS, and qualitative analyses reveal IP's gradient to be more diffuse. Furthermore, IP reduces the next-token prediction loss on trait-expressing data where PPS need not, consistent with the notion that IP "explains away" the trait-expression in the training data. Taken together, our analyses reveal distinct mechanisms by which each method operates and highlight open questions about IP's mechanistic picture.
Authors: Kutomanov Hennadii
Abstract: As modern deep learning architectures grow in complexity, representational ambiguity emerges as a critical barrier to their interpretability and reliable merging. For ReLU networks, identical functional mappings can be achieved through entirely different weight configurations due to algebraic symmetries: neuron permutation and positive diagonal scaling. Consequently, traditional parameter-based comparison methods exhibit extreme instability to slight weight perturbations during training. This paper proposes a mathematically grounded approach to constructing a stable canonical representation of neural networks and a robust functional similarity metric. We shift focus from comparing raw weights to analyzing the topology of neuron activation regions. The algorithm first eliminates scaling ambiguity via L2-normalization of weight vectors with subsequent layer compensation. Next, discrete approximations of activation regions are generated as binary functional signatures evaluated over a data sample. To overcome the computational bottleneck of comparing large binary vectors, we adapt Locality-Sensitive Hashing, specifically MinHash, providing a fast and statistically precise approximation of the Jaccard index. The final cross-network neuron matching is formulated as a linear sum assignment problem solved via the Hungarian algorithm. We demonstrate theoretically and experimentally that our metric mitigates the neuron "flickering" effect and exhibits exceptional robustness to minor weight perturbations. This framework provides a solid foundation for model merging, transfer learning, objective assessment during pruning, and Explainable AI paradigms.
Authors: Jinmyeong Choi, Brad Shook, Artur Dubrawski
Abstract: Time series foundation models (TSFMs) are widely used as generic feature extractors, yet the notion of non-stationarity in their embedding spaces remains poorly understood. Recent work often conflates non-stationarity with distribution shift, blurring distinctions fundamental to classical time-series analysis and long-standing methodologies such as statistical process control (SPC). In SPC, non-stationarity signals a process leaving a stable regime - via shifts in mean, variance, or emerging trends - and detecting such departures is central to quality monitoring and change-point analysis. Motivated by this diagnostic tradition, we study how different forms of distributional non-stationarity - mean shifts, variance changes, and linear trends - become linearly accessible in TSFM embedding spaces under controlled conditions. We further examine temporal non-stationarity arising from persistence, which reflects violations of weak stationarity due to long-memory or near-unit-root behavior rather than explicit distributional shifts. By sweeping shift strength and probing multiple TSFMs, we find that embedding-space detectability of non-stationarity degrades smoothly and that different models exhibit distinct, model-specific failure modes.
Authors: Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent
Abstract: We introduce Mosaic, a probabilistic weather forecasting model that addresses two principal sources of spectral degradation in ML-based weather prediction: (1) deterministic training against ensemble means and (2) compressive encoding creating an information bottleneck. Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At $1.5^\circ$ resolution with 214M parameters, Mosaic matches or outperforms models trained on 6 times finer data on headline upper-air variables and achieves state-of-the-art results among $1.5^\circ$ models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 seconds on a single H100 GPU.
Authors: Ping Wang
Abstract: Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example -- an abrupt transition from memorization to generalization long after training accuracy saturates -- yet robust macroscopic signatures of this transition remain elusive. Here we introduce \textbf{TDU--OFC} (Thresholded Diffusion Update--Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a \emph{macroscopic observable} -- the time-resolved effective cascade dimension $D(t)$ -- via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline $D=1$ precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through $D=1$ (approaching from $D>1$), while XOR ascends (from $D<1$). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near $D \approx 1$. Negative controls confirm this picture: ungrokked runs remain supercritical ($D>1$) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from $D(t)$. Shadow-probe controls ($\alpha_{\mathrm{train}}=0$) confirm that $D(t)$ is non-invasive, and grokked trajectories diverge from ungrokked ones in $D(t)$ some $100$--$200$ epochs before the behavioral transition.
Authors: Jelena Markovic-Voronov, Wenhui Zhu, Bo Long, Zhipeng Wang, Suyash Gupta, Kayhan Behdin, Bee-Chung Chen, Deepak Agarwal
Abstract: We introduce a principled probabilistic framework for reward-guided decoding in large language models, addressing the limitations of standard decoding methods that optimize token-level likelihood rather than sequence-level quality. Our method defines a reward-augmented target distribution over complete sequences by combining model transition probabilities with prefix-dependent reward potentials. Importantly, the approach is training-free: it leaves model weights unchanged and instead modifies the inference distribution via reward potentials, with all gains arising purely from inference-time sampling. To sample from this distribution, we develop Sequential Monte Carlo algorithms, including a computationally efficient prefix-only variant and a lookahead variant whose intermediate targets match the exact marginals of the full sequence distribution. The framework also integrates resample-move updates with Metropolis-Hastings rejuvenation and supports block-wise generation, subsuming common decoding strategies such as temperature sampling and power-tempered objectives. Empirical results across three 7B models show significant gains. On code generation (HumanEval), our method improves base performance by up to 54.9% and surpasses the strongest sampling baselines by 9.1%-15.3%. On mathematical reasoning (MATH500), it achieves gains of up to 8.8%. Notably, it reaches 87.8% on HumanEval and 78.4% on MATH500 with Qwen2.5-7B, consistently outperforming the reinforcement learning method GRPO.
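A minimal sketch of the prefix-only Sequential Monte Carlo idea, propose from the model, reweight by an incremental reward potential, resample, on a toy two-token language model (the LM, reward function, and hyperparameters below are hypothetical, not the paper's setup):

```python
import math
import random

random.seed(0)

def p_next(prefix):
    # Toy two-token LM: slight bias toward repeating the last token
    if not prefix:
        return {0: 0.5, 1: 0.5}
    return {prefix[-1]: 0.7, 1 - prefix[-1]: 0.3}

def sample_token(prefix):
    return 0 if random.random() < p_next(prefix)[0] else 1

def reward(prefix):
    return float(sum(prefix))  # hypothetical prefix reward: count of 1-tokens

def smc_decode(n_particles=64, length=8, beta=2.0):
    """Prefix-only SMC sketch: propose from the model, reweight by the
    incremental reward potential, then multinomially resample each step."""
    particles = [[] for _ in range(n_particles)]
    for _ in range(length):
        particles = [p + [sample_token(p)] for p in particles]
        # incremental potential exp(beta * (r(x_{1:t}) - r(x_{1:t-1})))
        w = [math.exp(beta * (reward(p) - reward(p[:-1]))) for p in particles]
        particles = [list(p) for p in random.choices(particles, weights=w, k=n_particles)]
    return particles

guided = smc_decode()
unguided = smc_decode(beta=0.0)   # beta=0 reduces to plain ancestral sampling
mean_guided = sum(reward(p) for p in guided) / len(guided)
mean_unguided = sum(reward(p) for p in unguided) / len(unguided)
```

With beta = 0 the potentials are constant and the procedure reduces to ordinary sampling from the LM, which is how the framework subsumes temperature-style decoding as a special case.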
Authors: Eunjeong Park, Amrita Basak
Abstract: Accurate phase equilibria are foundational to alloy design because they encode the underlying thermodynamics governing stability, transformations, and processing windows. However, while the CALculation of PHAse Diagrams (CALPHAD) provides a rigorous thermodynamic framework, exploring multicomponent composition-temperature space remains computationally expensive and is typically limited to sparse sections. To enable rapid phase mapping and alloy screening, we propose a physics-informed graph attention network (GAT) that learns element-aware representations and couples them with thermodynamic constraints for multi-label phase-set prediction in the Ag-Bi-Cu-Sn alloy system. Using about 25,000 equilibrium states generated with pycalphad, each composition-temperature point is represented as a four-node element graph with atomic fractions and elemental descriptors as node features. The model combines graph attention, global pooling, and a multilayer perceptron to predict nine relevant phases. To improve physical consistency, we incorporate thermodynamic constraints, applied as training penalties or as an inference-time projection. Across six binary and three ternary subsystems, the baseline model achieves a macro-F1 score of 0.951 and 93.98% exact-set match, while physics-informed decoding improves robustness and raises exact-set accuracy to about 96% on dense in-domain grids. The surrogate also generalizes to an unseen ternary section with 99.32% exact-set accuracy and to a quaternary section at 700 $^\circ$C with 91.78% accuracy. These results demonstrate that attention-based graph learning coupled with thermodynamic constraint enforcement provides an effective and physically consistent surrogate for high-resolution phase mapping and extrapolative alloy screening.
Authors: Qi Zhang
Abstract: In the field of online reinforcement learning (RL), traditional Gaussian policies and flow-based methods are often constrained by their unimodal expressiveness, complex gradient clipping, or stringent trust-region requirements. Moreover, they all rely on post-hoc penalization of negative samples to correct erroneous actions. This paper introduces Positive-Only Drifting Policy Optimization (PODPO), a likelihood-free and gradient-clipping-free generative approach for online RL. By leveraging the drifting model, PODPO performs policy updates via advantage-weighted local contrastive drifting. Relying solely on positive-advantage samples, it elegantly steers actions toward high-return regions while exploiting the inherent local smoothness of the generative model to enable proactive error prevention. In doing so, PODPO opens a promising new pathway for generative policy learning in online settings.
Authors: Jack T. Beerman, Tyler J. Abele, Mehdi Taghizadeh, Andrew Davis, Zo\"e J. Gray, Negin Alemazkoor, Xinfeng Gao, H. S. Udaykumar, Stephen S. Baek
Abstract: Physics-aware recurrent convolutional networks (PARC) have demonstrated strong performance in predicting nonlinear spatiotemporal dynamics by embedding differential operators directly into the computational graph of a neural network. However, pixel-based convolutions are restricted to static, uniform Cartesian grids, making them ill-suited to following evolving localized structures in an efficient manner. Graph neural networks (GNNs) naturally handle irregular spatial discretizations, but existing graph-based physics-aware deep learning (PADL) methods have difficulty handling extreme nonlinear regimes. To address these limitations, we propose Graph PARC (G-PARC), which uses moving least squares (MLS) kernels to approximate spatial derivatives on unstructured graphs, and embeds the derivatives of governing partial differential equations into the network's computational graph. G-PARC achieves better accuracy with 2-3x fewer parameters than MeshGraphNet, MeshGraphKAN, and GraphSAGE, replacing the traditional encoder-processor-decoder framework with analytically computed differential operators. We demonstrate that G-PARC (1) generalizes across nonuniform spatial and temporal discretizations; (2) handles moving meshes required for structural deformation; and (3) outperforms existing graph-based PADL methods on nonlinear benchmarks including fluvial hydrology, planar shock waves, and elastoplastic dynamics. By embedding explicit physical operators within the flexibility of GNNs, G-PARC enables accurate modeling of extreme nonlinear phenomena on complex computational domains, moving PADL beyond idealized Cartesian grids.
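The moving-least-squares derivative approximation that G-PARC builds on can be sketched on scattered 2D points: fit a weighted linear model around a query point and read the gradient off the slope coefficients (a generic MLS sketch with a Gaussian weight kernel; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))     # scattered (unstructured) nodes
vals = 3.0 * pts[:, 0] + 2.0 * pts[:, 1]        # linear field, exact gradient (3, 2)

def mls_gradient(x0, pts, vals, h=0.5):
    """Moving-least-squares sketch: weighted linear fit around x0;
    the slope coefficients approximate the spatial gradient at x0."""
    d = np.linalg.norm(pts - x0, axis=1)
    w = np.exp(-((d / h) ** 2))                  # Gaussian weight kernel
    A = np.c_[np.ones(len(pts)), pts - x0]       # local basis [1, dx, dy]
    AtW = A.T * w                                # A^T W with diagonal weights
    coef = np.linalg.solve(AtW @ A, AtW @ vals)  # weighted normal equations
    return coef[1:]                              # (df/dx, df/dy) at x0

grad = mls_gradient(np.array([0.1, -0.2]), pts, vals)
```

Because the linear field lies in the span of the local basis, the recovered gradient is exact up to numerics; on nonlinear fields the same fit gives a consistent local derivative estimate on arbitrary point clouds, which is what frees the method from Cartesian grids.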
Authors: Divya Shyamal, Marta Kne\v{z}evi\'c, Lan Tran, Chanakya Ekbote, Vijay Lingam, Paul Pu Liang
Abstract: Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
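The general recipe of training a lightweight scorer on hidden representations from a small calibration set, then using it to rank Best-of-N candidates, can be sketched with a toy logistic scorer (all data and hyperparameters below are synthetic stand-ins; this is not SCATR itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w_true = rng.standard_normal(d)                  # hypothetical "correctness" direction

# Small calibration set: hidden states labeled by whether the answer was correct
Xc = rng.standard_normal((200, d))
yc = (Xc @ w_true > 0).astype(float)

# Train a tiny logistic scorer by gradient descent (d trainable parameters)
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xc @ w)))
    w -= 0.5 * Xc.T @ (p - yc) / len(yc)

def rank_best_of_n(cand_hidden):
    # Best-of-N selection: score each candidate's hidden state, keep the argmax
    return int(np.argmax(cand_hidden @ w))

train_acc = float(np.mean((Xc @ w > 0) == (yc > 0.5)))
```

The scorer has only d parameters and inference is one dot product per candidate, which is the kind of accuracy-per-FLOP trade-off the abstract contrasts with PRMs and LoRA fine-tuning.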
Authors: Anna Mazhar, Sainyam Galhotra
Abstract: Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable "leakage reports". Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.
Authors: Jingke Chen, Jingrui Zhong, Tazneen Hossain Tani, Zidong Su, Xiaochun Zhang, Boxue Tian
Abstract: Despite the high accuracy of 'black box' deep learning models, drug discovery still relies on protein-ligand interaction principles and heuristics. To improve interpretability of protein-small molecule binding predictions, we developed the PWRules framework, which applies binding affinity data to identify privileged small molecule fragments and subsequently defines complementary pairing rules between these fragments and protein words (semantic sequence units) through an interpretability module. The resulting word-fragment rules are then ranked by the PWScore function to prioritize active compounds. Evaluations on benchmark datasets show that PWScore achieves competitive performance comparable to the physics-based model (Glide) and the deep learning model (PSICHIC) and shows broad applicability for protein targets outside the training dataset, e.g., SARS-CoV-2 main protease. Notably, PWScore captures complementary interaction information, yielding superior enrichment performance when integrated with these established methods. Structural analysis of protein-ligand complexes indicates that learned word-fragment rules are significantly enriched near ligand-binding pockets, despite training without explicit structural guidance. By extracting and applying complementary pairing rules, PWRules provides an interpretable framework for drug discovery.
Authors: Masakazu Yoshimura, Zitang Sun, Yuiko Sakuma, Junji Otsuka, Atsushi Irie, Takeshi Ohashi
Abstract: Neural Architecture Search (NAS) aims to automatically discover high-performing deep neural network (DNN) architectures. However, conventional algorithm-driven NAS relies on carefully hand-crafted search spaces to ensure executability, which restricts open-ended exploration. Recent coding-based agentic approaches using large language models (LLMs) reduce manual design, but current LLMs struggle to reliably generate complex, valid architectures, and their proposals are often biased toward a narrow set of patterns observed in their training data. To bridge reliable algorithmic search with powerful LLM assistance, we propose LLMasTool, a hierarchical tree-based NAS framework for stable and open-ended model evolution. Our method automatically extracts reusable modules from arbitrary source code and represents full architectures as hierarchical trees, enabling evolution through reliable tree transformations rather than code generation. At each evolution step, coarse-level planning is governed by a diversity-guided algorithm that leverages Bayesian modeling to improve exploration efficiency, while the LLM resolves the remaining degrees of freedom to ensure a meaningful evolutionary trajectory and an executable generated architecture. With this formulation, instead of fully agentic LLM approaches, our method explores diverse directions beyond the inherent biases in the LLM. Our method improves over existing NAS methods by 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120, demonstrating its effectiveness.
Authors: Yuming Yan, Kai Tang, Sihong Chen, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu
Abstract: Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse - a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration of preference optimization. Tailored for direct-generation visual tasks, S-GRPO introduces Conditional Ground-Truth Trajectory Injection (CGI). When a binary verifier detects a complete exploratory failure within a sampled group of trajectories, CGI injects the verified ground-truth trajectory into the candidate pool. By assigning a deterministic maximal reward to this injected anchor, S-GRPO enforces a positive signal within the group-relative advantage estimation. This mechanism reformulates the supervised learning objective as a high-advantage component of the policy gradient, compelling the model to dynamically balance between exploiting the expert trajectory and exploring novel visual concepts. Theoretical analysis and empirical results demonstrate that S-GRPO gracefully bridges the gap between SFT and RL, drastically accelerates convergence, and achieves superior domain adaptation while preserving the base model's general-purpose capabilities.
Authors: Zhixiong Yang, Long Jing, Yao Li, Shuli Cheng, Guoxuan Chi, Chenyu Wen
Abstract: AIGC has shown remarkable success in CV and NLP, and has recently demonstrated promising potential in the wireless domain. However, significant data imbalance exists across RF modalities, with abundant WiFi data but scarce mmWave and RFID data due to high acquisition cost. This makes it difficult to train high-quality generative models for these data-scarce modalities. In this work, we propose RF-CMG, a diffusion-based cross-modal generative method that leverages data-rich WiFi signals to synthesize high-fidelity RF data for scarce modalities including mmWave and RFID. The key insight of RF-CMG is to decouple cross-modal generation into high-frequency guidance and low-frequency constraint, which respectively learn high-frequency distribution from limited target modality data and preserve the underlying physical structure via low-frequency constraints during generation. On this basis, we introduce a Modality-Guided Embedding (MGE) module to steer the reverse diffusion trajectory toward the target high-frequency distribution, and a Low-Frequency Modality Consistency (LFMC) module to progressively enforce low-frequency constraints to suppress the accumulation of source-modality structural biases during inference, enabling high-quality target-modality generation. Performance comparison with several prevalent generative models demonstrates that RF-CMG achieves superior performance in synthesizing RFID and mmWave signals. We further showcase the effectiveness of the data generated by RF-CMG in gesture recognition tasks, and analyze the impact of the proportion of synthetic data on downstream performance.
Authors: Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu
Abstract: While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
Authors: Zhijiang Tang, Jiaxin Qi, Yan Cui, Jinli Ou, Yuhua Zheng, Jianqiang Huang
Abstract: DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet previously overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion of vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
Authors: Kemal Bayik, Olayinka Ajayi, Daniel Roggen, Philip Birch
Abstract: Wi-Fi Channel State Information (CSI) enables device-free human activity recognition, but existing multi-user approaches assume a fixed set of known users during both training and inference. This closed-set assumption limits deployment, as models trained on a specific user set degrade when applied to new individuals or environments. We reformulate multi-user activity recognition as activity counting, estimating how many users perform each activity type at a given time, without associating actions with specific individuals. We propose a pipeline that converts CSI measurements into spatial projections and extracts features using a pretrained convolutional backbone. Two formulations are evaluated on the WiMANS dataset: a conventional identity-dependent model that assigns activities to fixed user slots, and an identity-agnostic model that estimates scene-level activity composition through regression. Under standard evaluation, the identity-agnostic model achieves a mean absolute error of 0.1081 on a 0-5 count scale. Under unseen-user evaluation, the identity-dependent model's macro-F1 drops from 80.38 to 32.61, while the identity-agnostic model's counting error remains stable. Feature space analysis confirms that identity-agnostic representations are more user-invariant, which explains their stronger generalization. These results suggest that activity counting provides a more practical and generalizable alternative to identity-dependent formulations for multi-user WiFi sensing.
Authors: Xingyan Chen, Tian Du, Changqiao Xu, Fuzhen Zhuang, Lujie Zhong, Gabriel-Miro Muntean, Enmao Diao
Abstract: Federated Learning (FL) faces challenges from client data heterogeneity and resource-constrained mobile devices, which can degrade model accuracy. Personalized Federated Learning (PFL) addresses this issue by adapting shared global knowledge to local data distributions. A promising approach in PFL is model decoupling, which separates the model into global and personalized parameters, raising the key question of which parameters should be personalized to balance global knowledge sharing and local adaptation. In this paper, we propose a Federated Optimal Brain Personalization (FedOBP) algorithm with a quantile-based thresholding mechanism and introduce an element-wise importance score. This score extends Optimal Brain Damage (OBD) pruning theory by incorporating a federated approximation of the first-order derivative in the Taylor expansion to evaluate the importance of each parameter for personalization. Moreover, we move the metric computation originally performed on clients to the server side, to alleviate the burden on resource-constrained mobile devices. To the best of our knowledge, this is the first work to bridge classical saliency-based pruning theory with federated parameter decoupling, providing a rigorous theoretical justification for selecting personalized parameters based on their sensitivity to local loss landscapes. Extensive experiments demonstrate that FedOBP outperforms state-of-the-art methods across diverse datasets and heterogeneity scenarios, while personalizing only a very small fraction of the model's parameters.
Authors: Yasmin Souza Lima, Rodrigo Moreira, Larissa F. Rodrigues Moreira, Tereza Cristina M. de B. Carvalho, Fl\'avio de Oliveira Silva
Abstract: Unsupervised anomaly detection is widely used to detect Distributed Denial-of-Service (DDoS) attacks in cloud-native 5G networks, yet most studies assume a fixed traffic representation, either temporal or structural, without validating which feature space best matches the data. We propose a lightweight decision framework that prioritizes temporal or structural features before training, using two diagnostics: lag-1 autocorrelation of an aggregated flow signal and PCA cumulative explained variance. When the probes are inconclusive, the framework reserves a hybrid option as a future fallback rather than an empirically validated branch. Experiments on two statistically distinct datasets with Isolation Forest, One-Class SVM, and KMeans show that structural features consistently match or outperform temporal ones, with the performance gap widening as temporal dependence weakens.
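The two diagnostics named above (lag-1 autocorrelation and PCA cumulative explained variance) are standard statistics. A minimal sketch of how such a pre-training probe might look; the thresholds, function names, and the decision rule are illustrative assumptions, not the paper's actual framework:

```python
import numpy as np

def lag1_autocorrelation(signal):
    """Lag-1 autocorrelation of an aggregated flow signal."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()
    return float(np.dot(s[:-1], s[1:]) / np.dot(s, s))

def pca_cumulative_variance(X, k):
    """Fraction of total variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the per-component variances.
    svals = np.linalg.svd(Xc, compute_uv=False)
    var = svals ** 2
    return float(var[:k].sum() / var.sum())

def choose_feature_space(signal, X, ac_thresh=0.5, var_thresh=0.9, k=2):
    """Toy decision rule: strong temporal dependence -> temporal features;
    a compact PCA subspace -> structural features; otherwise the hybrid
    fallback reserved by the framework."""
    if lag1_autocorrelation(signal) >= ac_thresh:
        return "temporal"
    if pca_cumulative_variance(X, k) >= var_thresh:
        return "structural"
    return "hybrid"
```

A strongly periodic flow signal would trip the temporal branch, while traffic whose feature matrix concentrates in a few principal directions would trip the structural one.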
Authors: Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Tong Xu, Meng Li, Enhong Chen
Abstract: Automated depression estimation is highly vulnerable to signal corruption and ambient noise in real-world deployment. Prevailing deterministic methods produce uncalibrated point estimates, exposing safety-critical clinical systems to the severe risk of overconfident misdiagnoses. To establish a highly resilient and trustworthy assessment paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. A fundamental vulnerability in multimodal evidential fusion is the uncontrolled accumulation of cross-modal redundancies. This structural flaw artificially inflates diagnostic confidence by double-counting overlapping evidence. To guarantee robust evidence synthesis, EviDep enforces strict information integrity. First, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically isolate task-irrelevant noise, preserving the fidelity of diagnostic signals. Subsequently, a Disentangled Evidential Learning strategy separates the shared consensus from modality-specific nuances. By explicitly decoupling these representations before Bayesian fusion, EviDep systematically mitigates evidence redundancy. Extensive experiments on AVEC 2013, 2014, DAIC-WOZ, and E-DAIC confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, delivering a robust fail-safe mechanism for trustworthy clinical screening.
Authors: Agnieszka Pregowska, Stefan Marynowicz
Abstract: Accurate assessment of lithium-ion battery ageing is challenged by cell-to-cell variability, heterogeneous cycling protocols, and limited transferability of data-driven models across datasets. In particular, robust identification of degradation transitions, such as the knee point, and reliable early-life prediction of remaining useful life (RUL) remain open problems. This study proposes a unified framework for battery ageing analysis based on continuous representations of voltage-capacity and capacity-cycle trajectories learned from heterogeneous public datasets (NASA, CALCE, ISU-ILCC). The continuous formulation enables consistent extraction of degradation descriptors, including curvature, plateau length, and knee-related metrics, while reducing sensitivity to dataset-specific discretisation. Across more than 250 cells, statistically significant correlations between knee onset and end-of-life (Pearson 0.75-0.84) are observed. Additional early-life analysis confirms that knee-related features retain predictive value when estimated from partial trajectories. Early-life models provide increasingly stable RUL predictions as the number of observed cycles increases, with meaningful predictive performance emerging within the first 5-20 cycles and remaining robust under cross-dataset domain shift. The framework integrates continuous modelling, feature extraction and uncertainty-aware prediction, providing an interpretable and dataset-consistent approach demonstrating robustness across heterogeneous dataset types. Compared with conventional discrete or feature-based methods, the proposed representation reduces sensitivity to sampling resolution and improves cross-dataset consistency. The study is limited to laboratory-scale datasets and capacity-based end-of-life definitions.
Authors: Mahir Labib Dihan, Md. Ashrafur Rahman Khan, Wasif Jalal, Md. Roqunuzzaman Sojib, Mashroor Hasan Bhuiyan
Abstract: Neural Combinatorial Optimization (NCO) has emerged as a powerful framework for solving combinatorial optimization problems by integrating deep learning-based models. This work focuses on improving existing inference techniques to enhance solution quality and generalization. Specifically, we modify the Random Re-Construct (RRC) approach of the Light Encoder Heavy Decoder (LEHD) model by incorporating Simulated Annealing (SA). Unlike the conventional RRC, which greedily replaces suboptimal segments, our SA-based modification introduces a probabilistic acceptance mechanism that allows the model to escape local optima and explore a more diverse solution space. Additionally, we enhance the Policy Optimization with Multiple Optima (POMO) approach by integrating Beam Search, enabling systematic exploration of multiple promising solutions while maintaining diversity in the search space. We further investigate different inference strategies, including Softmax Sampling, Greedy, Gumbel-Softmax, and Epsilon-Greedy, analyzing their impact on solution quality. Furthermore, we explore instance augmentation techniques, such as horizontal and vertical flipping and rotation-based augmentations, to improve model generalization across different CVRP instances. Our extensive experiments demonstrate that these modifications significantly reduce the optimality gap across various Capacitated Vehicle Routing Problem (CVRP) benchmarks, with Beam Search and SA-based RRC consistently yielding superior performance. By refining inference techniques and leveraging enhanced search strategies, our work contributes to the broader applicability of NCO models in real-world combinatorial optimization tasks.
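The probabilistic acceptance mechanism described above is the standard Metropolis criterion from simulated annealing. A minimal sketch, assuming a generic tour/cost/proposal interface that is entirely hypothetical (the actual RRC integration with LEHD's decoder is not shown):

```python
import math
import random

def sa_accept(delta_cost, temperature, rng=random):
    """Metropolis criterion: always accept improvements; accept a worse
    reconstruction with probability exp(-delta/T) to escape local optima."""
    if delta_cost <= 0:          # reconstructed segment is no worse
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)

def anneal_reconstruct(tour, cost_fn, propose_fn, t0=1.0, alpha=0.95,
                       steps=100, seed=0):
    """Toy SA loop over re-constructed segments (hypothetical interface):
    propose_fn returns a candidate tour with one segment rebuilt."""
    rng = random.Random(seed)
    best = cur = tour
    t = t0
    for _ in range(steps):
        cand = propose_fn(cur, rng)
        if sa_accept(cost_fn(cand) - cost_fn(cur), t, rng):
            cur = cand
            if cost_fn(cur) < cost_fn(best):
                best = cur
        t *= alpha  # geometric cooling schedule
    return best
```

The contrast with conventional RRC is the `delta_cost > 0` branch: a greedy scheme would reject every worsening reconstruction outright.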
Authors: Shaoang Li, Jian Li
Abstract: Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ sublinear regret under stochastic regularity and cacheability conditions, where $N$ is the adapter count, $K$ the cache size, $d$ the context dimension, and $T$ the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.
Authors: Noureddine Kermiche
Abstract: We present the Global Neural World Model (GNWM), a self-stabilizing framework that achieves topological quantization through balanced continuous entropy constraints. Operating as a continuous, action-conditioned Joint-Embedding Predictive Architecture (JEPA), the GNWM maps environments onto a discrete 2D grid, enforcing translational equivariance without pixel-level reconstruction. Our results show this architecture prevents manifold drift during autoregressive rollouts by using grid ``snapping'' as a native error-correction mechanism. Furthermore, by training via maximum entropy exploration (random walks), the model learns generalized transition dynamics rather than memorizing specific expert trajectories. We validate the GNWM across passive observation, active agent control, and abstract sequence regimes, demonstrating its capacity to act not just as a spatial physics simulator, but as a causal discovery model capable of organizing continuous, predictable concepts into structured topological maps.
Authors: Zongru Li, Xingsheng Chen, Honggang Wen, Regina Qianru Zhang, Ming Li, Xiaojin Zhang, Hongzhi Yin, Qiang Yang, Kwok-Yan Lam, Pietro Lio, Siu-Ming Yiu
Abstract: Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms, including Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models, and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used datasets and datasets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time- and scaffold-aware methodologies. We further propose three forward-looking directions: (i) physics-aware learning embedding quantum consistency, (ii) uncertainty-calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: https://github.com/Zongru-Li/Survey-and-Benchmarks-of-DL-for-Molecular-Property-Prediction-in-the-Foundation-Model-Era.
Authors: Jongyeop Kim, Jinki Kim, Doyun Lee
Abstract: Structural health monitoring plays a critical role in ensuring structural safety by analyzing vibration responses from engineering systems. This paper proposes a Spectro-Temporal Alignment framework and a Hybrid Spectro-Temporal Fusion framework that integrate arrival-time interval descriptors with spectral features to capture both fine-scale and coarse-scale vibration dynamics. Experiments conducted on data collected from an LDS V406 electrodynamic shaker demonstrate that the proposed spectro-temporal representations significantly outperform conventional input formulations. The results indicate that a coarser temporal resolution ({\Delta}{\tau}) of 0.02 favors traditional machine learning models, whereas a finer resolution ({\Delta}{\tau}) of 0.008 effectively unlocks the performance potential of deep learning architectures. Beyond classification accuracy, a comprehensive stability analysis based on condensed indices, including mean performance, standard deviation, coefficient of variation, and balanced score, shows that the proposed hybrid framework consistently achieves higher accuracy with substantially lower variability compared to baseline and alignment-only approaches. Overall, these results demonstrate that the proposed framework provides a robust, accurate, and reliable solution for vibration-based structural health monitoring.
Authors: Xiao Wang, Zezhong Zhang, Isaac Lyngaas, Hong-Jun Yoon, Jong-Youl Choi, Siming Liang, Janet Wang, Hristo G. Chipilski, Ashwin M. Aji, Feng Bao, Peter Jan van Leeuwen, Dan Lu, Guannan Zhang
Abstract: Accurate weather and climate prediction relies on data assimilation (DA), which estimates the Earth system state by integrating observations with models. While exascale computing has significantly advanced Earth simulation, scalable and accurate inference of the Earth system state remains a fundamental bottleneck, limiting uncertainty quantification and prediction of extreme events. We introduce a unified one-stage generative DA framework that reformulates assimilation as Bayesian posterior sampling, replacing the conventional forecast-update cycle with compute-dense, GPU-efficient inference. At the core is STORM, a novel spatiotemporal transformer with a global attention linear-complexity scaling algorithm that breaks the quadratic attention barrier. On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames, regimes previously unreachable, establishing a new paradigm for Earth system prediction.
Authors: Ziwen Liu, Huawei Lin, Yide Ran, Denghui Zhang, Jianwen Xie, Chuan Li, Weijie Zhao, Zhaozhuo Xu
Abstract: Large language models (LLMs) sometimes memorize undesirable knowledge, which must be removed after deployment. Prior work on machine unlearning has focused largely on optimization methods that adjust parameters to enforce forgetting while preserving retention. However, these approaches assume that the forget and retain sets are readily available, which rarely holds in practice. Unlearning is typically triggered by an undesired generation at inference time, making the retrieval of relevant data the central challenge. We introduce the notion of data Pareto improvement for LLM unlearning, which formalizes how retrieval can expand the achievable trade-off frontier between forgetting and retention. To realize this principle, we propose Randomized Antipodal Search on Linearized Influence Kernel (RASLIK), a retrieval algorithm that combines permutation-projection hashing with randomized antipodal search. RASLIK reduces selection variance, achieves sublinear complexity, and yields a double gain in both quality and efficiency. Across multiple models, datasets, and unlearning algorithms, RASLIK consistently outperforms deterministic baselines and even oracle sampling, establishing randomized search as a principled and scalable solution for data-centric unlearning.
Authors: Seerat Kaur, Sukhjit Singh Sehra, Dariush Ebrahimi
Abstract: Traffic prediction plays a central role in intelligent transportation systems (ITS) by supporting real-time decision-making, congestion management, and long-term planning. However, many existing approaches face practical limitations. Most spatio-temporal models are trained on centralized data, rely on numerical representations, and offer limited explainability. Recent Large Language Model (LLM) methods improve reasoning capabilities but typically assume centralized data availability and do not fully capture the distributed and heterogeneous nature of real-world traffic systems. To address these challenges, this study proposes FedLLM (Federated LLM), a privacy-preserving and distributed framework for explainable multi-horizon short-term traffic flow prediction (15-60 minutes). The framework introduces four key contributions: 1) a Composite Selection Score (CSS) for data-driven freeway selection that captures structural diversity across traffic regions; 2) a domain-adapted LLM fine-tuned on structured traffic prompts encoding spatial, temporal, and statistical context; 3) the FedLLM framework, which enables collaborative training across heterogeneous clients while exchanging only lightweight LoRA adapter parameters; and 4) a structured prompt representation that supports contextual reasoning and cross-region generalization. The FedLLM design allows each client to learn from local traffic patterns while contributing to a shared global model through efficient parameter exchange, reducing communication overhead and keeping data private. This setup supports learning under non-IID traffic distributions. Experimental results show that FedLLM achieves improved predictive performance over centralized baselines, while producing structured and explainable outputs. These findings highlight the potential of combining federated learning with domain-adapted LLMs for scalable, privacy-aware, and explainable traffic prediction.
Authors: Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
Abstract: We introduce CoCo-LoRA, a multimodal, uncertainty-aware parameter-efficient fine-tuning method for text prediction tasks accompanied by audio context. Existing PEFT approaches such as LoRA are efficient but typically deterministic, while recent Bayesian low-rank adapters model uncertainty in a lightweight way yet remain largely unimodal and condition uncertainty primarily on internal text features. This leaves them poorly equipped to reflect uncertainty driven by external acoustic factors such as background noise, channel variability, or speaking style, which can materially affect reliability in speech-centered applications. CoCo-LoRA addresses this gap by conditioning a contextual variational posterior in the low-rank space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then adapted through lightweight layer-wise heads, enabling global-to-local, depth-specific modulation of the adapter uncertainty and update without high-dimensional multimodal fusion. Stochasticity is confined to a compact latent component in the rank space, preserving PEFT scalability while producing audio-sensitive, heteroscedastic uncertainty. Based on our evaluations across diverse tasks and backbone combinations, CoCo-LoRA consistently matches or outperforms text-only PEFT and conventional feature-fusion transfer baselines, particularly on high-coverage labels where reliable adaptation is critical. The results indicate that using audio as a contextual uncertainty signal, rather than as a fused feature stream, provides a robust and parameter-efficient alternative for multimodal low-resource prediction.
Authors: Arda Fazla, Ege C. Kaya, Antesh Upadhyay, Abolfazl Hashemi
Abstract: Analysis of Stochastic Gradient Descent (SGD) and its variants typically relies on the assumption of uniformly bounded variance, a condition that frequently fails in practical non-convex settings, such as neural network training, as well as in several elementary optimization settings. While several relaxations are explored in the literature, the Blum-Gladyshev (BG-0) condition, which permits the variance to grow quadratically with distance, has recently been shown to be the weakest condition. However, the study of the oracle complexity of stochastic first-order non-convex optimization under BG-0 has remained underexplored. In this paper, we address this gap and establish information-theoretic lower bounds, proving that finding an $\epsilon$-stationary point requires $\Omega(\epsilon^{-6})$ stochastic BG-0 oracle queries for smooth functions and $\Omega(\epsilon^{-4})$ queries under mean-square smoothness. These limits demonstrate an unavoidable degradation from classical bounded-variance complexities, i.e., $\Omega(\epsilon^{-4})$ and $\Omega(\epsilon^{-3})$ for smooth and mean-square smooth cases, respectively. To match these lower bounds, we consider Proximally Anchored STochastic Approximation (PASTA), a unified algorithmic framework that couples Halpern anchoring with Tikhonov regularization to dynamically mitigate the extra variance explosion term permitted by the BG-0 oracle. We prove that PASTA achieves minimax optimal complexities across numerous non-convex regimes, including standard smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Lojasiewicz functions, entirely under an unbounded domain and unbounded stochastic gradients.
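A condition permitting the variance to grow quadratically with distance, as described in the abstract above, is often written in the following form (a hedged reconstruction; the symbols $\sigma_0, \sigma_1$ and the reference point $x^*$ are illustrative, not necessarily the paper's notation):

```latex
\mathbb{E}\left[\left\| g(x) - \nabla f(x) \right\|^2\right]
  \;\le\; \sigma_0^2 \;+\; \sigma_1^2 \,\|x - x^*\|^2
```

Setting $\sigma_1 = 0$ recovers the classical uniformly bounded-variance assumption, which is why BG-0 strictly relaxes it.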
Authors: Montgomery Bohde, Hongxuan Liu, Mrunali Manjrekar, Magdalena Lederbauer, Shuiwang Ji, Runzhong Wang, Connor W. Coley
Abstract: In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 18% Top-1 accuracy on the challenging MassSpecGym benchmark and tripling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at https://github.com/coleygroup/FRIGID
Authors: Kittipong Thiamchaiboonthawee, Ghadi Nehme, Ram Mohan Telikicherla, Jiawei Tian, Balaji Jayaraman, Vikas Chandan, Dhanushkodi Mariappan, Faez Ahmed
Abstract: Directed energy deposition (DED) produces complex thermo-mechanical responses that can lead to distortion and reduced dimensional accuracy of a manufactured part. Thermo-mechanical finite element simulations are widely used to estimate these effects, but their computational cost and the complexity of accurately capturing DED physics limit their use in design iteration and process optimization. This paper introduces FLARE (Field Prediction via Linear Affine Reconstruction in wEight-space), a data-efficient surrogate modeling framework for predicting post-cooling displacement fields in DED from geometric and process parameters. We develop a predefined-geometry DED simulation workflow using an open-source finite element framework and generate a dataset of simulations with varying geometry, laser power, and deposition velocity. Each simulation provides full-field displacement, stress, strain, and temperature data throughout the manufacturing process. FLARE encodes each simulation as an implicit neural field and regularizes the corresponding neural-network weights so that they follow the affine structure of the input parameter space. This enables prediction of unseen parameter combinations by reconstructing network weights through affine mixing of training examples. On this DED benchmark, the method shows improved accuracy compared to baseline methods in both in-distribution and extrapolation settings. Although the present study focuses on DED displacement prediction, the proposed affine weight-space reconstruction framework offers a promising approach for data-efficient surrogate modeling of physical fields.
Authors: Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
Abstract: Large pre-trained language models are increasingly adapted to downstream tasks using parameter-efficient fine-tuning (PEFT), but existing PEFT methods are typically deterministic and unimodal, making them poorly suited for low-resource multimodal settings where predictive uncertainty and cross-modal reliability both matter. We introduce CALIBER (Context-Aware Low-rank Inference with Bayesian Embedding Regularization), a multimodal uncertainty-aware PEFT framework for audio-text learning. CALIBER extends Bayesian low-rank adaptation by conditioning the variational posterior in the adapter space on per-layer, token-level text-audio cross-attention. Specifically, text-derived low-rank features attend to frame-level audio embeddings to produce localized acoustic context, which then modulates the mean and variance of a compact stochastic latent matrix within the rank-$r$ adapter space. This design treats audio not only as an additional feature source, but as a contextual reliability signal that shapes both adaptation and confidence. By confining stochasticity to a low-dimensional latent component, CALIBER retains the computational efficiency and scalability of PEFT while enabling heteroscedastic multimodal uncertainty estimation. Experimental results across diverse text and audio backbones show that CALIBER consistently matches or improves upon text-only Bayesian PEFT and conventional multimodal transfer-learning baselines, with token-level cross-attention yielding the most consistent gains. Our findings demonstrate that localized cross-modal conditioning is an effective and lightweight mechanism for uncertainty-aware multimodal adaptation.
Authors: Hangke Sui, Yuqing Wang, Minh N Do
Abstract: Contrastive objectives power state-of-the-art multimodal models, but their training remains slow, relying on long stochastic optimization. We propose a Unified Framework for Efficient Contrastive Alignment via Kernels (UniCon), which spans linear and nonlinear encoders as well as one-to-one and many-to-many alignments. At its core, UniCon introduces the contrastive similarity weight matrix $S(\gamma)$, which enables closed-form global solutions that provably replace minibatch back-propagation with exact updates. Through the lens of reproducing kernel Hilbert spaces (RKHS), UniCon provides a kernelized perspective that unifies contrastive alignment and reveals its connection to spectral methods. To validate the theory, we conduct experiments on synthetic, unimodal, multimodal, and zero-shot tasks, demonstrating that UniCon achieves substantial efficiency gains while preserving generality and strong empirical performance.
Authors: Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli
Abstract: We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise-stationary (PS) setting, where both the reward and transition dynamics can change an arbitrary number of times. We propose Detection Augmented Reinforcement Learning (DARLING), a modular wrapper for PS-RL that applies to both tabular and linear MDPs, without knowledge of the changes. Under certain change-point separation and reachability conditions, DARLING improves the best available dynamic regret bounds in both settings and yields strong empirical performance. We further establish the first minimax lower bounds for PS-RL in tabular and linear MDPs, showing that DARLING is the first nearly optimal algorithm. Experiments on standard benchmarks demonstrate that DARLING consistently surpasses the state-of-the-art methods across diverse non-stationary scenarios.
Authors: Koushik Howlader, Md Tauhidul Islam, Wei Le
Abstract: Accurate prediction of cancer progression remains a challenge due to the high heterogeneity of molecular omics data across patients. While biologically informed models have improved the interpretability of these predictions, a persistent limitation lies in how they encode individual genes to construct pathway representations. Existing hierarchical models typically derive gene features by directly mapping raw molecular inputs, whereas integration frameworks often rely on simple statistical aggregations of patient-level signals. These approaches often fail to explicitly learn a shared base representation for each gene, thereby limiting the expressiveness and biological accuracy of downstream pathway embeddings. To address this, we introduce PATH, a modulation-based, patient-conditioned gene embedding strategy. PATH represents a paradigm shift by starting from a shared base embedding for each gene, preserving a stable biological identity across the population, and then dynamically adapting it using patient-specific copy number variation (CNV) and mutation signals. This allows the model to capture subtle individual molecular variations while maintaining a consistent latent understanding of the gene itself. We integrate PATH into a graph transformer framework that models interactions among biologically connected pathways through pathway-guided attention. Across pancancer metastasis prediction, PATH achieves an F1 score of 0.8766, representing an 8.8 percent improvement over the current SOTA multi-omics benchmarks. Beyond superior predictive accuracy, our approach identifies biologically meaningful pathways and, crucially, reveals disease-state-specific pathway rewiring, offering new insights into the evolving pathway-pathway interactions that drive cancer progression.
Authors: Lena Zellinger, Nicola Branchini, Lennert De Smet, V\'ictor Elvira, Nikolay Malkin, Antonio Vergari
Abstract: Classical mixture models (MMs) are widely used tractable proposals for approximate inference settings such as variational inference (VI) and importance sampling (IS). Recently, mixture models with negative coefficients, called subtractive mixture models (SMMs), have been proposed as a potentially more expressive alternative. However, how to effectively use SMMs for VI and IS is still an open question as they do not provide latent variable semantics and therefore cannot use sampling schemes for classical MMs. In this work, we study how to circumvent this issue by designing several expectation estimators for IS and learning schemes for VI with SMMs, and we empirically evaluate them for distribution approximation. Finally, we discuss the additional challenges in estimation stability and learning efficiency that they carry and propose ways to overcome them. Code is available at: https://github.com/april-tools/delta-vi.
Authors: Xan Carey, Yash Deshmukh, Aileen Huang, Sunit Jadhav, Omkar Tekawade, Lorraine Yang, Anvesha Tiwary, Gerardo Riano, Amy Greenwald, Denizalp Goktas
Abstract: Time-series forecasting is central to many scientific and industrial domains, such as energy systems, climate modeling, finance, and retail. While forecasting methods have evolved from classical statistical models to automated and neural approaches, the surrounding software ecosystem remains anchored to the traditional Python numerical stack. Existing libraries rely on interpreter-driven execution and object-oriented abstractions, limiting composability, large-scale parallelism, and integration with modern differentiable and accelerator-oriented workflows. Meanwhile, today's forecasting increasingly involves large collections of heterogeneous time series, irregular covariates, and frequent retraining, placing new demands on scalability and execution efficiency. JAX offers an alternative to traditional stateful numerical computation frameworks, based on pure functions and program transformations such as just-in-time compilation and automatic vectorization, enabling end-to-end optimization across CPUs, GPUs, and TPUs. However, this modern paradigm has not yet been fully incorporated into the design of forecasting systems. We introduce Chronax, a JAX-native time-series forecasting library that rethinks forecasting abstractions around functional purity, composable transformations, and accelerator-ready execution. By representing preprocessing, modeling, and multi-horizon prediction as pure JAX functions, Chronax enables scalable multi-series forecasting, model-agnostic conformal uncertainty quantification, and seamless integration with modern machine learning and scientific computing pipelines.
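The pure-function paradigm described in this abstract can be illustrated with a minimal sketch (all names here are illustrative, not the Chronax API): a stateless one-step forecaster written as a pure function of its parameters and input series, batched across many series with `jax.vmap` and compiled with `jax.jit`.

```python
import jax
import jax.numpy as jnp

def forecast(params, series):
    # Naive persistence-style one-step forecast: a pure function of
    # (params, data) with no hidden state, so it composes with jit/vmap.
    return params["bias"] + params["slope"] * series[-1]

params = {"bias": 0.0, "slope": 1.0}

# vmap maps the forecaster over a batch of series; jit compiles the result.
batched = jax.jit(jax.vmap(forecast, in_axes=(None, 0)))

series = jnp.arange(12.0).reshape(3, 4)  # 3 series, each of length 4
out = batched(params, series)            # one forecast per series
print(out)
```

The same transformed function runs unchanged on CPU, GPU, or TPU, which is the accelerator-portability argument the abstract makes.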
Authors: Eva van Tegelen, Taniya Kapoor, George A. K. van Voorn, Peter van Heijster, Ioannis N. Athanasiadis
Abstract: Developing neural operators that accurately predict the behavior of systems governed by partial differential equations (PDEs) across unseen parameter regimes is crucial for robust generalization in scientific and engineering applications. In practical applications, variations in physical parameters induce distribution shifts between training and prediction regimes, making extrapolation a central challenge. As a result, the way parameters are incorporated into neural operator models plays a key role in their ability to generalize, particularly when state and parameter representations are entangled. In this work, we introduce the Late Fusion Neural Operator, an architecture that disentangles learning state dynamics from parameter effects, improving predictive performance both within and beyond the training distribution. Our approach combines neural operators for learning latent state representations with sparse regression to incorporate parameter information in a structured manner. Across four benchmark PDEs including advection, Burgers, and both 1D and 2D reaction-diffusion equations, the proposed method consistently outperforms Fourier Neural Operator and CAPE-FNO. Late Fusion Neural Operators achieve consistently the best performance in all experiments, with an average RMSE reduction of 72.9% in-domain and 71.8% out-domain compared to the second-best method. These results demonstrate strong generalization across both in-domain and out-domain parameter regimes.
Authors: William Howes, Farid Ahmed, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam
Abstract: Predicting full-field physics through real-time virtual sensing of engineering systems can augment limited physical sensors, but it requires handling sparse-to-dense reconstruction, complex multiphysics, and highly irregular geometries under the strict latency and energy constraints of edge deployment. Neural operators have been proposed as candidates for such applications, but few architectures explicitly address power consumption. Integrating spiking neurons on neuromorphic hardware offers a potential solution, but existing neuron models cause severe performance degradation in regression-based virtual sensing. To address these performance concerns and edge constraints, we present the Variable Spiking Graph Neural Operator (VS-GNO), which integrates spectral-spatial convolutional analysis with a previously developed Variable Spiking Neuron (VSN) and an energy-error balance loss function. Against a non-spiking $L_2$ error baseline of $0.4\%$, VS-GNO achieves a reconstruction error of $0.71\%$ with $15\%$ average spiking in its spectral-only form and $1.04\%$ with $24.5\%$ spiking in its full form. These results position VS-GNO as a promising step towards energy-efficient, edge-deployable neural operators for real-time sparse-to-dense virtual sensing in complex, highly irregular engineering environments.
Authors: Lei Liu, Jialin Chen, Kathy Macropol
Abstract: Causal inference from electronic health records (EHR) is fundamentally limited by unmeasured confounding: critical clinical states such as frailty, goals of care, and mental status are documented in free-text notes but absent from structured data. Large language models can extract these latent confounders as interpretable, structured covariates, yet how to effectively integrate them into causal estimation pipelines has not been systematically studied. Using the MIMIC-IV database with 21,859 sepsis patients, we compare seven covariate-integration strategies for estimating the effect of early vasopressor initiation on 28-day mortality, spanning tabular-only baselines, traditional NLP representations, and three LLM-augmented approaches. A central finding is that not all integration strategies are equally effective: directly augmenting the propensity score model with LLM covariates achieves the best performance, while dual-caliper matching on text-derived categorical distances restricts the donor pool and degrades estimation. In semi-synthetic experiments with known ground-truth effects, LLM-augmented propensity scores reduce estimation bias from 0.0143 to 0.0003 relative to tabular-only methods, and this advantage persists under substantial simulated extraction error. On real data, incorporating LLM-extracted covariates reduces the estimated treatment effect from 0.055 to 0.027, directionally consistent with the CLOVERS randomized trial, and a doubly robust estimator yielding 0.019 confirms the robustness of this finding. Our results offer practical guidance on when and how text-derived covariates improve causal estimation in critical care.
Authors: Inhyeok Lee, Luke Solo, Michael C. Burkhart, Bashar Ramadan, William F. Parker, Brett K. Beaulieu-Jones
Abstract: Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.
Authors: Dixi Yao, Tahseen Rabbani, Tian Li
Abstract: LLM-powered agents often reason from scratch when presented with a new problem instance and lack automatic mechanisms to transfer learned skills to other agents. We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple agents solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes. Instead of federating gradients (as in distributed training), FoT operates at the semantic level, without any gradient optimization or supervision signal. In each iteration, every agent performs local thinking and self-improvement on its specific tasks independently, then shares its reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, and machine learning research insight discovery. Specifically, it improves average accuracy on downstream tasks by 24% while reducing reasoning tokens by 28% across the first two applications. In the research insight discovery application, FoT generates insights that cover over 90% of the major contributions in subsequent papers.
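The federation-over-text loop described here can be caricatured as a toy sketch (purely illustrative; FoT itself uses LLM agents, and every name below is made up): each agent emits textual insights from its local round, and a server aggregates them into a deduplicated shared library that all agents can then consume.

```python
# Toy semantic-level federation: no gradients, only text artifacts flow.

def local_round(agent_traces, library):
    # An agent "reasons" locally and emits only insights the shared
    # library does not already contain.
    return [t for t in agent_traces if t not in library]

def aggregate(server_library, *insight_batches):
    # The server deduplicates and merges insights from all agents,
    # preserving first-seen order.
    for batch in insight_batches:
        for insight in batch:
            if insight not in server_library:
                server_library.append(insight)
    return server_library

library = []
a1 = local_round(["decompose the problem", "check units"], library)
a2 = local_round(["check units", "verify edge cases"], library)
library = aggregate(library, a1, a2)
print(library)  # shared cross-task insight library
```

In the actual framework, aggregation is a distillation step performed by a model rather than simple set union, but the communication pattern, local traces up, shared library down, is the same.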
Authors: Zilin Li, Weiwei Xu, Xuchun Tong, Xuanbo Lu, Xuanqi Zhao
Abstract: While modern representation learning relies heavily on global error signals, decentralized algorithms driven by local interactions offer a fundamentally distributed alternative. However, the macroscopic convergence properties of these discrete dynamics on continuous data manifolds remain theoretically unresolved, notoriously suffering from parameter explosion. We bridge this gap by formalizing decentralized learning as a coupled slow-fast dynamical system on Riemannian manifolds. First, using measure-theoretic limits, we prove that the discrete spatial transitions converge uniformly to an overdamped Langevin stochastic differential equation. Second, via the It\^o-Poisson resolvent and a stochastic extension of LaSalle's Invariance Principle, we establish that the representation weights unconditionally avoid divergence and align strictly with the principal eigenspace of the spatial measure. Finally, we construct a joint Lyapunov functional for the fully coupled spatial-parametric flow. This proves global dissipativity and demonstrates that orthogonally disentangled, linearly separable features emerge spontaneously at the stationary limit. Our framework bridges discrete algorithms with continuous stochastic analysis, providing a formal theoretical baseline for decentralized representation learning.
Authors: Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan
Abstract: Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.
Authors: Chongsheng Zhang, Hao Wang, Zelong Yu, Esteban Garces Arias, Julian Rodemann, Zhanshuo Zhang, Qilong Li, Gaojuan Fan, Krikamol Muandet, Christian Heumann
Abstract: Imbalanced data is commonly present in real-world applications. While data synthesis can effectively mitigate the data scarcity problem of rare classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs towards continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG, a Relational Data generator with Dynamic Guidance, which is a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core-set selection to identify representative samples from the original data, then utilizes in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self-reinforcing feedback mechanism that provides automatic assessments of the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
Authors: Yunshan Peng, Ji Wu, Wentao Bai, Yunke Bai, Jinan Pang, Wenzheng Shu, Yanxiang Zeng, Xialong Liu, Peng Jiang
Abstract: Reach and Frequency (R&F) contract advertising is an important and widely used form of brand advertising. Unlike performance advertising, R&F contracts emphasize controllable delivery of UV and PV under given targeting, scheduling, and frequency-control constraints. In practical systems, advertisers typically need to view the UV and PV curves at different budget levels in real time when creating an R&F contract. However, most existing publicly available advertising datasets are based on independent samples and lack a characterization of the core structure of the "budget-performance curve" (covering both UV and PV) in R&F contracts. This paper proposes and releases a large-scale R&F contract inventory estimation dataset. The dataset uses the R&F contract context consisting of "targeting-scheduling-frequency control" as the basic context, providing observations of UV and PV at multiple budget points within the same context, thus forming a complete budget-performance curve. The dataset explicitly includes a time-window-based frequency-control mechanism (e.g., "no more than 3 times within 5 days") and naturally satisfies the monotonicity and diminishing-marginal-returns characteristics in the budget and scheduling dimensions. We further derive the theoretical maximum exposure ceiling and use it as a consistency check to evaluate data quality and the feasibility of model predictions. Using this dataset, we define two standardized benchmark tasks: single-point performance prediction and reconstruction of budget-performance curves, and provide a set of reproducible baseline methods and evaluation protocols. The dataset can support systematic research on problems such as structural constraint learning, monotonic regression, curve-consistency modeling, and R&F contract planning. The code for our experiments can be found at https://github.com/pengyunshan/RF-Inventory.
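The structural properties this dataset is built around, monotonicity, diminishing marginal returns, and the exposure-ceiling consistency check, can be sketched on synthetic numbers (illustrative values only, not drawn from the released dataset):

```python
import numpy as np

# Hypothetical budget grid and UV observations for one contract context.
budgets = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
uv = np.array([120.0, 210.0, 340.0, 480.0, 560.0])

# Monotonicity: UV never decreases as budget grows.
assert np.all(np.diff(uv) >= 0)

# Diminishing marginal returns: incremental UV per unit of incremental
# budget shrinks as the budget increases.
marginal = np.diff(uv) / np.diff(budgets)
assert np.all(np.diff(marginal) <= 0)

# Consistency check against a (hypothetical) theoretical exposure ceiling:
# no budget point may exceed the maximum reachable UV for this context.
ceiling = 600.0
assert np.all(uv <= ceiling)
```

Checks of this kind are what make the curve-reconstruction benchmark well-posed: a predicted budget-performance curve can be rejected outright if it violates monotonicity or overshoots the ceiling.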
Authors: Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, Chien-Sheng Wu
Abstract: On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, generalizing robustly under out-of-distribution and continual learning. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: https://github.com/SalesforceAIResearch/CaOPD
Authors: Chenwei Zhang
Abstract: This dissertation explores how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning. It focuses on two areas: DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). In the first part, we present ViDa, a biophysics-informed framework leveraging variational autoencoders (VAEs) and geometric scattering transforms to generate biophysically-plausible embeddings of DNA reaction kinetics simulations. These embeddings are reduced to a two-dimensional space to visualize DNA hybridization and toehold-mediated strand displacement reactions. ViDa preserves structure and clusters trajectory ensembles into reaction pathways, making simulation results more interpretable and revealing new mechanistic insights. In the second part, we address key challenges in cryo-EM density map interpretation and protein structure modeling. We provide a comprehensive review and benchmarking of deep learning methods for atomic model building, with improved evaluation metrics and practical guidance. We then present Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity experimental-like cryo-EM density maps from protein structures. Finally, we present CryoSAMU, a structure-aware multimodal U-Net that enhances intermediate-resolution cryo-EM maps by integrating density features with structural embeddings from protein language models via cross-attention. Overall, these contributions demonstrate the potential of deep generative models to interpret DNA reaction mechanisms and advance cryo-EM density map analysis and protein structure modeling.
Authors: Akash Samanta, Manish Pratap Singh, Debasis Chaudhuri
Abstract: Standard supervised learning optimizes for predictive accuracy but remains agnostic to the internal geometry of learned features, often yielding representations that are entangled and brittle. We propose Class-Conditional Activation Regularization (CCAR) to explicitly engineer the feature space, imposing a block-diagonal structure via a soft inductive bias. By shaping the latent representation to confine class energy to orthogonal subspaces, we create an intrinsic geometric scaffold that naturally filters noise and adversarial perturbations. We provide theoretical analysis linking this structural constraint to the maximization of the Fisher Discriminant Ratio, establishing a formal connection between geometric disentanglement and algorithmic stability. Empirically, this approach demonstrates that robustness is an emergent property of a well-engineered feature space, significantly outperforming baselines on label noise and input corruption benchmarks.
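One generic way to realize a block-diagonal soft bias of the kind this abstract describes, sketched here under our own assumptions (this is not the paper's exact regularizer): assign each class a disjoint slice of the feature vector and penalize activation energy that falls outside that slice.

```python
import numpy as np

def block_energy_penalty(features, labels, n_classes):
    """Average squared activation energy outside each sample's class block.

    Assumes the feature dimension divides evenly into n_classes blocks;
    minimizing this term pushes each class's energy into its own
    (mutually orthogonal) coordinate subspace.
    """
    d = features.shape[1]
    block = d // n_classes
    mask = np.ones((len(features), d))
    for i, y in enumerate(labels):
        mask[i, y * block:(y + 1) * block] = 0.0  # in-class block is free
    return float(np.mean(np.sum((features * mask) ** 2, axis=1)))

f = np.zeros((2, 8))
f[0, 0:4] = 1.0  # class-0 sample with all energy in block 0: no penalty
f[1, 0:4] = 1.0  # class-1 sample with energy in the wrong block: penalized
print(block_energy_penalty(f, [0, 1], n_classes=2))  # 2.0
```

Added to a task loss with a small weight, such a term acts as a soft inductive bias: features may still leak across blocks, but gradients steadily steer class energy toward its assigned subspace.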
Authors: Yuchen Pan, Soung Chang Liew
Abstract: Recent deployments of large language models (LLMs) as autonomous trading agents raise questions about whether financial decision-making competence generalizes beyond specific market patterns and how it should be trained and evaluated in noisy markets lacking ground truth. We propose a structured framework for training and evaluating such models. Central to our approach is a curated, multiple-choice question (MCQ) dataset derived from classic textbooks and historical markets, verified by an AI committee, enriched with structured reasoning traces, and augmented to reduce shortcut learning. To evaluate whether performance on isolated MCQs generalizes to real-world trading, we introduce a two-stage protocol combining test-set evaluation with an MCQ-based chronological trading simulation. Extensive evaluations across market regimes provide statistically robust evidence that open models trained with our framework exhibit competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale. We release the dataset and evaluation framework to support further research.
Authors: Nils Leutenegger
Abstract: A central question in computational neuroscience is whether the learning rule used to train a neural network determines how well its internal representations align with those of the human visual cortex. We present a systematic comparison of four learning rules -- backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP) -- applied to identical convolutional architectures and evaluated against human fMRI data from the THINGS-fMRI dataset (720 stimuli, 3 subjects) using Representational Similarity Analysis (RSA). Crucially, we include an untrained random-weights baseline that reveals the dominant role of architecture. We find that early visual alignment (V1/V2) is primarily architecture-driven: an untrained CNN achieves rho = 0.071, statistically indistinguishable from BP (rho = 0.072, p = 0.43). Learning rules only differentiate at higher visual areas: BP dominates at LOC/IT, and PC with local Hebbian updates achieves IT alignment statistically indistinguishable from BP (p = 0.18). FA consistently impairs representations below the random baseline at V1. Partial RSA confirms all effects survive pixel-similarity control. These results demonstrate that the relationship between learning rules and cortical alignment is region-specific: architecture determines early alignment, while supervised objectives drive late alignment.
Authors: Zhongyuan Liang, Junhyung Jo, Hyang-Jung Lee, Sang Kyu Kim, Irene Y. Chen
Abstract: Early prediction of severe clinical deterioration and remaining length of stay can enable timely intervention and better resource allocation in high-acuity settings such as the ICU. This has driven the development of machine learning models that leverage continuous streams of vital signs and other physiological signals for real-time risk prediction. Despite their promise, existing methods have important limitations. Contrastive pretraining treats all patients as equally strong negatives, failing to capture clinically meaningful similarity between patients with related diagnoses. Meanwhile, downstream fine-tuning typically ignores complementary modalities such as clinical notes, which provide rich contextual information unavailable in physiological signals alone. To address these challenges, we propose OC-Distill, a two-stage framework that leverages multimodal supervision during training while requiring only vital signs at inference. In the first stage, we introduce an ontology-aware contrastive objective that exploits the ICD hierarchy to quantify patient similarity and learn clinically grounded representations. In the second stage, we fine-tune the pretrained encoder via cross-modal knowledge distillation, transferring complementary information from clinical notes into the model. Across multiple ICU prediction tasks on MIMIC, OC-Distill demonstrates improved label efficiency and achieves state-of-the-art performance among methods that use only vital signs at inference.
Authors: Junnan Liu, Xinyan Liu, Peifeng Gao, Zhaobo Qi, Beichen Zhang, Weigang Zhang, Antoni Bert Chen
Abstract: In long-context decoding for LLMs and LMMs, attention becomes increasingly memory-bound because each decoding step must load a large amount of KV-cache data from GPU memory. Existing acceleration strategies often trade efficiency for accuracy by relying on heuristic pruning that may discard useful information. At a deeper level, they also tend to indiscriminately preserve all high-scoring tokens, treat early tokens as indispensable anchors, or rely on heuristic head routing, reflecting an insufficient mechanistic understanding of the attention sink phenomenon. In this paper, we show that the attention sink phenomenon corresponds to a stable, reachable, and error-controllable fixed point constructed during training. Based on this insight, we propose SinkRouter, a training-free selective routing framework that detects the sink signal and skips computations that would otherwise produce near-zero output. To translate this mechanism into real-world acceleration, we develop a hardware-aware Triton kernel with block-level branching and Split-K parallelism. We conduct extensive evaluations on a diverse suite of long-context benchmarks, including LongBench, InfiniteBench, CVBench, MileBench, and MMVP, using both text-only and multimodal backbones such as Llama-3.1-8B, Llama-3.1-70B, Yi-9B-200K, LLaVA-1.5-7B, and LLaVA-1.5-13B. Across these settings, SinkRouter consistently improves decoding efficiency while maintaining competitive accuracy, and reaches 2.03x speedup with a 512K context.
Authors: Yuheng Zhao, Yu-Hu Yan, Amit Attia, Tomer Koren, Lijun Zhang, Peng Zhao
Abstract: Parameter-free stochastic optimization aims to design algorithms that are agnostic to the underlying problem parameters while still achieving convergence rates competitive with optimally tuned methods. While some parameter-free methods do not require the specific values of the problem parameters, they still rely on prior knowledge, such as lower or upper bounds on them. We refer to such methods as ``partially parameter-free''. In this work, we target achieving ``fully parameter-free'' methods, i.e., the algorithmic inputs do not need to satisfy any unverifiable condition related to the true problem parameters. We propose a powerful and general grid search framework, named \textsc{Grasp}, with a novel self-bounding analysis technique that effectively determines the search ranges of parameters, in contrast to previous work. Our method demonstrates generality in: (i) the non-convex case, where we propose a fully parameter-free method that achieves a near-optimal convergence rate, up to logarithmic factors; (ii) the convex case, where our parameter-free methods are competitive, with strong performance in terms of acceleration and universality. Finally, we contribute a sharper guarantee for the model ensemble, a final step of the grid search framework, under interpolated variance characterization.
Authors: Hiroki Hasegawa, Aoba Tamura, Yukihiko Okada
Abstract: Factor-based Structural Equation Modeling (SEM) relies on likelihood-based estimation assuming a nonsingular sample covariance matrix, which breaks down in small-sample settings with $p>n$. To address this, we propose a novel estimation principle that reformulates the covariance structure into self-covariance and cross-covariance components. The resulting framework defines a likelihood-based feasible set combined with a relative error constraint, enabling stable estimation of sign and direction in small-sample settings where $p>n$. Experiments on synthetic and real-world data show improved stability, particularly in recovering the sign and direction of structural parameters. These results extend covariance-based SEM to small-sample settings and provide practically useful directional information for decision-making.
Authors: Yingzhi Xia, Setthakorn Tanomkiattikun, Liangli Zhen, Zaiwang Gu
Abstract: Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization-based methods can solve IPs quickly using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise-space Hamiltonian Monte Carlo (N-HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N-HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial-noise space, N-HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise-adaptive variant (NA-NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA-NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state-of-the-art methods. The code is available at https://github.com/NA-HMC/NA-HMC.
Authors: Gabriel Jason Lee, Jathurshan Pradeepkumar, Jimeng Sun
Abstract: Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.
Authors: Junlin Li, Shuangyong Song, Guodong Du, Ngai Wong, Xuebo Liu, Yongxiang Li, Min Zhang, Jing Li, Xuelong Li
Abstract: Supervised Fine-Tuning (SFT) accelerates task-specific large language model (LLM) development, but the resulting proliferation of fine-tuned models incurs substantial memory overhead. Delta compression addresses this by retaining a single pre-trained LLM with multiple compressed delta weights. However, existing methods fail on models fine-tuned with large-scale datasets. We find that larger SFT data scale amplifies delta parameter magnitude, singular values, and entropy, exacerbating compression errors. To tackle this, we propose DQRELO (Delta Compression via Quantization and Residual Low-Rank), a novel training- and data-free delta compression method. It combines coarse-grained one-bit quantization to capture the dominant structure of the delta, followed by compensated residual low-rank approximation to recover fine-grained details from the smaller residual error. Experiments on various LLMs spanning dense and MoE architectures across multiple domains under this challenging setting demonstrate that DQRELO outperforms existing methods. Moreover, we establish key design principles for delta compression through extensive empirical analysis, demonstrating how task difficulty, architecture, and layer positioning create predictable patterns that can guide optimal compression strategies in production systems.
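The two-stage idea described above (coarse one-bit quantization plus a low-rank approximation of the residual) can be sketched on a toy delta matrix. The scalar scale and rank choice below are illustrative assumptions, not DQRELO's actual compensation scheme:

```python
import numpy as np

def compress_delta(delta, rank=8):
    # Stage 1: one-bit (sign) quantization with a single scalar scale
    # captures the coarse structure of the delta.
    scale = np.abs(delta).mean()
    q = np.sign(delta) * scale
    # Stage 2: a low-rank SVD of the (smaller) residual recovers
    # fine-grained detail.
    resid = delta - q
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return q + low_rank

rng = np.random.default_rng(0)
delta = rng.normal(size=(64, 64)) + np.outer(rng.normal(size=64), rng.normal(size=64))
approx = compress_delta(delta, rank=8)
rel_err = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
print(f"relative error: {rel_err:.3f}")
```

The residual correction always reduces the reconstruction error below sign-only quantization, which is the motivation for the two-stage split.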
Authors: Yun-Peng Li, Hans-Andrea Loeliger
Abstract: The paper considers the computation of L1 regularization paths in a state space setting, which includes L1-regularized Kalman smoothing, linear SVM, LASSO, and more. The paper proposes two new algorithms, which are duals of each other; the first algorithm applies to L1 regularization of independent variables while the second applies to L1 regularization of dependent variables. The heart of the proposed algorithms is parametric Gaussian message passing (i.e., Kalman-type forward-backward recursions) in the pertinent factor graphs. The proposed methods are broadly applicable; they (usually) require only matrix multiplications, and their complexity can be competitive with prior methods in some cases.
Authors: Sai Vegasena
Abstract: We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor -- not model size -- determines whether angular quantization schemes like PolarQuant succeed or fail, with Gemma 4's attn_scale=1.0 amplifying directional error 25-100x more than Llama's standard 1/sqrt(d) scaling.
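The quantization step underlying this approach can be sketched as group-wise asymmetric int4 quantization of a toy KV-cache slice. This is an illustrative NumPy fragment only; Open-TQ-Metal's fused attention shaders, packing layout, and on-the-fly kernels are not reproduced here:

```python
import numpy as np

def quantize_int4(x, group=32):
    # Group-wise asymmetric quantization to 16 levels (int4) along the
    # flattened last axis; stores a per-group scale and zero point.
    g = x.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
kv = rng.normal(size=(8, 128)).astype(np.float32)   # toy KV-cache slice
q, scale, lo = quantize_int4(kv)
kv_hat = dequantize_int4(q, scale, lo).reshape(kv.shape)
err = np.abs(kv - kv_hat).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The fused-kernel contribution is that attention scores are computed directly against the `q`/`scale`/`lo` representation, so the `dequantize_int4` materialization shown here never happens in full on device.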
Authors: Tianyi Chen, Haobo Wang, Kai Tang, Gengyu Lyu, Tianlei Hu, Gang Chen, Hong Ma, Meixiang Xiang
Abstract: Incomplete Multi-View Clustering (IMVC) faces the challenge of learning discriminative representations from fragmentary observations while maintaining robustness against missing views. However, prevalent Euclidean-based methods suffer from a geometric mismatch when modeling real-world data with intrinsic hierarchies, leading to semantic blurring where representations drift towards spatially proximal but semantically distinct neighbors. To bridge this gap, we propose HERL, a Hyperbolic Enhanced Representation Learning framework for IMVC. Operating within the Poincar\'e ball, HERL constructs a structure-aware latent space to enhance representation learning. Specifically, we design a dual-constraint hyperbolic contrastive mechanism optimizing: an angular-based loss to preserve semantic identity via directional alignment, and a distance-based loss to enforce hierarchical compactness. Furthermore, a hyperbolic prototype head is introduced to rectify global structural drift by aligning cross-view hierarchy-aware prototype distributions. Consequently, HERL disentangles fine-grained semantic correlations to sharpen cluster boundaries and imposes geometric constraints to rectify the data recovery process. Extensive experimental results demonstrate that HERL consistently outperforms state-of-the-art approaches.
Authors: Bruce A. Bassett, Amy Rouillard, Sitwala Mundia, Michael Cameron Gramanie, Linda Camara, Ziyaad Dangor, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Ismail Kalla, Haroon Saloojee
Abstract: Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). Primary outcomes were composite scores ($S_3$, $S_4$) and win rates. Results: (i) LLM performance was tightly clustered (<15% variation) despite large cost differences; low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) Top performance was achieved by GPT-5.1, followed by Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ($\rho = 0.85$). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.
Authors: Carson Dudley, Yutong Bi, Xiaofeng Liu, Samet Oymak
Abstract: Non-stationary sequences arise naturally in control, forecasting, and decision-making. The data-generating process shifts at unknown times, and models must detect the change, discard or downweight obsolete evidence, and adapt to new dynamics on the fly. Transformer-based foundation models increasingly rely on in-context learning for time series forecasting, tabular prediction, and continuous control. As these models are deployed in non-stationary environments, understanding their ability to detect and adapt to regime shifts is important. We formalize this as an in-context change-point detection problem and formally establish the existence of transformer models that solve this problem. Our construction demonstrates that model complexity, in layers and parameters, depends on the level of information available about the change-point location, from no knowledge to knowing exact timing. We validate our results with experiments on synthetic linear regression and linear dynamical systems, where trained transformers match the performance of optimal baselines across information levels. We also show that encoding and incorporating change-point knowledge indeed improves the real-world performance of a pretrained foundation model on infectious disease forecasting and on financial volatility forecasting around Federal Open Market Committee (FOMC) announcements without retraining, demonstrating practical applicability to real-world regime changes.
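For context, the change-point detection problem the paper formalizes for transformers can be illustrated with a classical one-sided CUSUM detector on a synthetic mean shift. This is a textbook baseline sketch, not the paper's in-context construction, and the drift and threshold values are arbitrary choices:

```python
import numpy as np

def cusum(xs, drift, threshold):
    # One-sided CUSUM: accumulate evidence of an upward mean shift,
    # reset at zero, and report the first threshold crossing.
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + x - drift)
        if s > threshold:
            return t
    return None

rng = np.random.default_rng(0)
pre = rng.normal(0.0, 1.0, size=200)    # regime 1: mean 0
post = rng.normal(1.5, 1.0, size=200)   # regime 2 begins at t = 200
t_hat = cusum(np.concatenate([pre, post]), drift=0.75, threshold=8.0)
print(f"change detected at t = {t_hat}")
```

The detector flags the shift shortly after t = 200; the paper's question is whether a transformer can realize this kind of detect-and-adapt behavior purely in context.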
Authors: Jason Yoo, Shailesh Garg, Souvik Chakraborty, Syed Bahauddin Alam
Abstract: Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ, while also achieving slightly lower reference-path error (1.77% versus 1.81%). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is measurable but does not reduce deployed cost because the runtime does not suppress dense work as spike activity decreases.
Authors: Ji-Eun Byun, Hyeuk Ryu, Junho Song
Abstract: Coherent systems are representative of many practical applications, ranging from infrastructure networks to supply chains. Probabilistic evaluation of such systems remains challenging, however, because existing decomposition-based methods scale poorly as the number of components grows. To address this limitation, this study proposes the Reference-state System Reliability (RSR) method. Like existing approaches, RSR characterises the boundary between different system states using reference states in the component-state space. Where it departs from these methods is in how the state space is explored: rather than using reference states to decompose the space into disjoint hypercubes, RSR uses them to classify Monte Carlo samples, making computational cost significantly less sensitive to the number of reference states. To make this classification efficient, samples and reference states are stored as matrices and compared using batched matrix operations, allowing RSR to exploit the advances in high-throughput matrix computing driven by modern machine learning. We demonstrate that RSR evaluates the system-state probability of a graph with 119 nodes and 295 edges within 10 seconds, highlighting its potential for real-time risk assessment of large-scale systems. We further show that RSR scales to problems involving hundreds of thousands of reference states -- well beyond the reach of existing methods -- and extends naturally to multi-state systems. Nevertheless, when the number of boundary reference states grows exceedingly large, RSR's convergence slows down, a limitation shared with existing reference-state-based approaches that motivates future research into learning-based representations of system-state boundaries.
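The batched sample-classification step described above can be sketched as follows, using toy binary component states and made-up reference states (the paper's exact state semantics and matrix layout may differ):

```python
import numpy as np

def classify(samples, surv_refs, fail_refs):
    # Batched comparison: a sample is 'survival' if it dominates some
    # survival reference state componentwise, 'failure' if it is
    # dominated by some failure reference state, else 'unknown'.
    surv = (samples[:, None, :] >= surv_refs[None, :, :]).all(-1).any(-1)
    fail = (samples[:, None, :] <= fail_refs[None, :, :]).all(-1).any(-1)
    labels = np.full(len(samples), "unknown", dtype=object)
    labels[surv] = "survival"
    labels[fail & ~surv] = "failure"
    return labels

rng = np.random.default_rng(0)
samples = rng.integers(0, 2, size=(1000, 5))               # binary component states
surv_refs = np.array([[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]])   # toy reference states
fail_refs = np.array([[0, 0, 0, 0, 0]])
labels = classify(samples, surv_refs, fail_refs)
p_surv = (labels == "survival").mean()
print(f"estimated survival probability: {p_surv:.3f}")     # analytic value: 0.34375
```

Each sample is compared against every reference state in one broadcasted operation, which is why the cost is much less sensitive to the number of reference states than hypercube decomposition.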
Authors: Dongkyu Cho, Xiyue Li, Samrachana Adhikari, Rumi Chunara
Abstract: Continual learning aims to update models under distribution shift without forgetting, yet many high-stakes deployments, such as healthcare, also require interpretability. In practice, models that adapt well (e.g., deep networks) are often opaque, while models that are interpretable (e.g., decision trees) are brittle under shift, making it difficult to achieve both properties simultaneously. In response, we propose Tree of Concepts, an interpretable continual learning framework that uses a shallow decision tree to define a fixed, rule-based concept interface and trains a concept bottleneck model to predict these concepts from raw features. Continual updates act on the concept extractor and label head while keeping concept semantics stable over time, yielding explanations that do not drift across sequential updates. On multiple tabular healthcare benchmarks under continual learning protocols, our method achieves a stronger stability-plasticity trade-off than existing baselines, including replay-enhanced variants. Our results suggest that structured concept interfaces can support continual adaptation while preserving a consistent audit interface in non-stationary, high-stakes domains.
Authors: Noel Brindise, Cedric Langbort, Melkior Ornik
Abstract: Motivated by the challenge presented by non-Markovian objectives in reinforcement learning (RL), we present a novel framework to track and represent the progress of autonomous agents through complex, multi-stage tasks. Given a specification in finite linear temporal logic (LTL), the framework establishes a 'tracking vector' which updates at each time step in a trajectory rollout. The values of the vector represent the status of the specification as the trajectory develops, assigning true, false, or 'open' labels (where 'open' is used for indeterminate cases). Applied to an LTL formula tree, the tracking vector can be used to encode detailed information about how a task is executed over a trajectory, providing a potential tool for new performance metrics, diverse exploration, and reward shaping. In this paper, we formally present the framework and algorithm, collectively named Live LTL Progress Tracking, give a simple working example, and demonstrate avenues for its integration into RL models. Future work will apply the framework to problems such as task-space exploration and diverse solution-finding in RL.
Authors: Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu
Abstract: Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking -- the iterative updating of latent variables reflecting an evolving environment -- involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model's depth. While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. In this article, we argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures. We introduce a taxonomy of recurrent and continuous-thought transformer architectures, categorizing them by their recurrence axis (depth versus step) and their ratio of input tokens to recurrence steps. Finally, we outline promising research directions, including enhanced state-space models and coarse-grained recurrence, to better integrate state tracking into modern foundation models.
Authors: Rohan Patil, Henrik I. Christensen
Abstract: Navigating complex environments poses challenges for multi-agent systems, requiring efficient extraction of insights from limited information. In this paper, we introduce the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Leveraging the Pagerank algorithm and common information maximization, BOIL facilitates the extraction of information to guide long-term agent behavior applicable to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate the efficacy of BOIL in generating strategy distributions conducive to improved performance over extended time horizons, surpassing heuristic approaches in complex environments.
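The PageRank computation that BOIL leverages can be sketched with a simple power iteration on a toy environment graph (an illustrative fragment only; the common-information-maximization stage and the strategy-distribution output are not shown):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    # Power iteration on the row-stochastic transition matrix, with
    # uniform teleportation; dangling nodes jump uniformly.
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    p = np.where(out > 0, adj / np.maximum(out, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1.0 - damping) / n + damping * (r @ p)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Toy 4-node environment graph; node 2 is the structural hub.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
ranks = pagerank(adj)
print(np.round(ranks, 3))
```

The highest-rank node is the hub that most trajectories pass through, which is the kind of structural insight that can bias long-horizon agent behavior for coverage or patrolling.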
Authors: Remy Kim, Minseung Lee, Shuo Li, Osbert Bastani
Abstract: Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic. While AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML Surveys, leaving substantial room for improvement.
Authors: Khemraj Shukla, Zongren Zou, Theo Kaeufer, Michael Triantafyllou, George Em Karniadakis
Abstract: Physics-informed neural networks (PINNs) have emerged as a promising framework for solving inverse problems governed by partial differential equations (PDEs), including the reconstruction of turbulent flow fields from sparse data. However, most existing PINN formulations are deterministic and do not provide reliable quantification of epistemic uncertainty, which is critical for ill-posed problems such as data-driven Reynolds-averaged Navier-Stokes (RANS) modeling. In this work, we develop and systematically evaluate a set of probabilistic extensions of PINNs for uncertainty quantification in turbulence modeling. The proposed framework combines (i) Bayesian PINNs with Hamiltonian Monte Carlo sampling and a tempered multi-component likelihood, (ii) Monte Carlo dropout, and (iii) repulsive deep ensembles that enforce diversity in function space. Particular emphasis is placed on the role of ensemble diversity and likelihood tempering in improving uncertainty calibration for PDE-constrained inverse problems. The methods are assessed on a hierarchy of test cases, including the Van der Pol oscillator and turbulent flow past a circular cylinder at Reynolds numbers Re=3,900 (direct numerical simulation data) and Re = 10,000 (experimental particle image velocimetry data). The results demonstrate that Bayesian PINNs provide the most consistent uncertainty estimates across all inferred quantities, while function-space repulsive ensembles offer a computationally efficient approximation with competitive accuracy for primary flow variables. These findings provide quantitative insight into the trade-offs between accuracy, computational cost, and uncertainty calibration in physics-informed learning, and offer practical guidance for uncertainty quantification in data-driven turbulence modeling.
Authors: Meghana Kshirsagar, Allen Nie, Ching-An Cheng, Fanglei Xue, Rahul Dodhia, Juan Lavista Ferres, Kevin K. Yang, Frank DiMaio
Abstract: We introduce RosettaSearch, an inference-time multi-objective optimization approach for protein sequence optimization. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging between 18\% to 68\%, translating to a 2.5$\times$ improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves sequence fidelity for ProteinMPNN-designed sequences on \textit{de novo} backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. The sequence trajectories generated by our approach can be used as training data in sequence design models or in post-training and will be released along with the code and datasets upon publication.
Authors: Jayadev Billa
Abstract: Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes $\|\Delta W\|/\|W\|$ across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M--350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B--1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
Authors: Nikunj Gupta, Rajgopal Kannan, Viktor Prasanna
Abstract: Multi-agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi-domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand-specify graph topology, rely on proximity-based heuristics, or learn structure entirely from environment interaction; all of which are brittle, semantically uninformed, or data-intensive. We investigate whether large language models (LLMs) can generate useful coordination graph priors for MARL by using minimal natural language descriptions of agent observations to infer latent coordination patterns. These priors are integrated into MARL algorithms via graph convolutional layers within a graph neural network (GNN)-based pipeline, and evaluated on four cooperative scenarios from the Multi-Agent Particle Environment (MPE) benchmark against baselines spanning the full spectrum of coordination modeling, from independent learners to state-of-the-art graph-based methods. We further ablate across five compact open-source LLMs to assess the sensitivity of prior quality to model choice. Our results provide the first quantitative evidence that LLM-derived graph priors can enhance coordination and adaptability in dynamic multi-agent environments, and demonstrate that models as small as 1.5B parameters are sufficient for effective prior generation.
Authors: Enoch Hyunwook Kang
Abstract: Iterative alignment methods based on purely greedy updates are remarkably effective in practice, yet existing theoretical guarantees of \(O(\log T)\) KL-regularized regret can seem pessimistic relative to their empirical performance. In this paper, we argue that this mismatch arises from the regret criterion itself: KL-regularized regret conflates the statistical cost of learning with the exploratory randomization induced by the softened training policy. To separate these effects, we study the traditional temperature-zero regret criterion, which evaluates only the top-ranked response at inference time. Under this decision-centric notion of performance, we prove that standard greedy online alignment methods, including online RLHF and online DPO, achieve constant \((O(1))\) cumulative regret. By isolating the cost of identifying the best response from the stochasticity induced by regularization, our results provide a sharper theoretical explanation for the superb practical efficiency of greedy alignment.
Authors: Thong Bach, Truyen Tran
Abstract: Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with state-of-the-art methods, while preserving task-specific utility and requiring minimal implementation overhead. Furthermore, we show that safety token regularization enhances training stability and overall performance beyond safety considerations alone. This work offers a practical and readily deployable strategy for continual safety alignment in fine-tuned LLMs.
Authors: Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran
Abstract: Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
Authors: Ege \c{C}akar, Ketan Ali Raghu, Lia Zheng
Abstract: Recursive architectures such as Tiny Recursive Models (TRMs) perform implicit reasoning through iterative latent computation, yet the geometric structure of these reasoning trajectories remains poorly understood. We investigate the activation manifold of TRMs during recursive unrolling and find that activations occupy an effectively linear, low-dimensional subspace whose principal directions can be tracked dynamically with cheap power iterations. This suggests that weight-sharing concentrates iterative computation along a small number of dominant eigendirections, and we find that this concentration varies sharply across computational sites. We exploit this structure through LASER (Low-Rank Activation SVD for Efficient Recursion), a dynamic compression framework that maintains an evolving low-rank basis via matrix-free subspace tracking with a fidelity-triggered reset mechanism, achieving ${\sim}60\%$ activation memory savings with no statistically significant accuracy degradation. Our analysis raises questions about how recursive architectures allocate representational capacity during implicit reasoning, and whether this concentration can be exploited to improve the efficiency and stability of latent computation.
Authors: Qingwei Lin
Abstract: Conditional depth execution routes a subset of tokens through a lightweight cheap FFN while the remainder execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them -- particularly between a predictive auxiliary and explicit score supervision -- have not been systematically compared under controlled conditions. We evaluate two gate designs under a 157.5M-parameter decoder-only model with controller-only training, 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor that forecasts, in a low-dimensional latent space, the outcome of executing full vs. cheap per token, aligned against a fixed target head. Under the standard recipe with oracle-style utility regression and pairwise rank supervision (util/rank), G3 improves early-to-mid optimisation over G1 in 3/3 seeds (lower avg LM, faster threshold hits, ~10.3x lower grad norms), with 20k-step endpoint LM within a 0.005 heuristic reference. A key finding (ablation A3): jointly removing util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates, and the early-to-mid advantage of G3 over G1 disappears. We trace this to an off-policy oracle label that assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full -- making util/rank net-negative under the current recipe. Removing util/rank also cuts the training FLOPs proxy from ~1.53x to ~1.07x full-only (2.87h to 1.75h on a V100-32GB, ~39%). Conclusions are scoped to the studied regime.
Authors: Zixin Zhou, Tianxi Jiang, Menglong Yang, Zhihua Feng, Qingbo He, Shiwu Zhang
Abstract: Physical neural networks offer a transformative route to edge intelligence, providing superior inference speed and energy efficiency compared to conventional digital architectures. However, realizing scalable, end-to-end, fully analog recurrent neural networks for temporal information processing remains challenging due to the difficulty of faithfully mapping trained network models onto physical hardware. Here we present a fully analog resonant recurrent neural network (R$^2$NN) implemented via a metacircuit architecture composed of coupled electrical local resonators. A reformulated mechanical-electrical analogy establishes a direct mapping between the R$^2$NN model and metacircuit elements, enabling accurate physical implementation of trained neural network parameters. By integrating jointly trainable global resistive coupling and local resonances, which generate effective frequency-dependent negative resistances, the architecture shapes an impedance landscape that steers currents along frequency-selective pathways. This mechanism enables direct extraction of discriminative spectral features, facilitating real-time temporal classification of raw analog inputs while bypassing analog-to-digital conversion. We demonstrate the cross-domain versatility of this framework using integrated hardware for tactile perception, speech recognition, and condition monitoring. This work establishes a scalable, fully analog paradigm for intelligent temporal processing and paves the way for low-latency, resource-efficient physical neural hardware for edge intelligence.
Authors: Sajjad Ghiasvand, Mark Beliaev, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract: Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.
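The mixture model this abstract describes can be sketched concretely. The function names and the exact uniform-mixture parameterization below are assumptions inferred from the abstract, not the authors' released implementation:

```python
import math

def annotator_mixture_prob(p_model, expertise):
    """Per-class probability that an annotator emits each label: a mixture
    of the model's predictive distribution and a uniform random guess,
    weighted by the annotator's learned expertise in [0, 1]."""
    k = len(p_model)
    return [expertise * p + (1.0 - expertise) / k for p in p_model]

def noisy_label_nll(p_model, expertise, observed_label):
    """Negative log-likelihood of one observed (possibly noisy) label under
    the mixture; minimizing this jointly over model parameters and the
    per-annotator expertise scalars is the core idea the abstract describes."""
    mix = annotator_mixture_prob(p_model, expertise)
    return -math.log(mix[observed_label])

# A reliable annotator (expertise 0.9) stays close to the model's prediction;
# an unreliable one (expertise 0.1) looks nearly uniform.
p = [0.7, 0.2, 0.1]
print(annotator_mixture_prob(p, 0.9))
print(annotator_mixture_prob(p, 0.1))
```

Because a low-expertise annotator's labels are nearly uniform, their likelihood is almost flat in the model parameters, so gradient updates naturally discount them without any explicit supervision.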
Authors: Marcel Kollovieh, Sirine Ayadi, Stephan G\"unnemann
Abstract: Discrete diffusion models form a powerful class of generative models across diverse domains, including text and graphs. However, existing approaches face fundamental limitations. Masked diffusion models suffer from irreversible errors due to early unmasking, while uniform diffusion models, despite enabling self-correction, often yield low-quality samples due to their strong reliance on intermediate latent states. We introduce IDDM, an Interpolating Discrete Diffusion Model that improves diffusion by reducing dependence on intermediate latent states. Central to IDDM is a controllable resampling mechanism that partially resets probability mass to the marginal distribution, mitigating error accumulation and enabling more effective token corrections. IDDM specifies a generative process whose transitions interpolate between staying at the current state, resampling from a prior, and flipping toward the target state, while enforcing marginal consistency and fully decoupling training from inference. We benchmark our model against state-of-the-art discrete diffusion models across molecular graph generation as well as text generation tasks, demonstrating competitive performance.
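The three-way interpolating transition described above (stay / resample from a prior / flip toward the target) can be illustrated with a toy per-token sampler. This is a sketch under assumed fixed mixing probabilities; the actual IDDM schedule and its marginal-consistency constraints are defined in the paper:

```python
import random

def iddm_step(x_t, x_target, vocab, stay_p, resample_p, flip_p):
    """One illustrative step of an interpolating discrete transition:
    with probability stay_p keep the current token, with resample_p
    redraw it from the prior (here, uniform over vocab), and with
    flip_p move it toward the target token. The three probabilities
    must sum to one."""
    assert abs(stay_p + resample_p + flip_p - 1.0) < 1e-9
    u = random.random()
    if u < stay_p:
        return x_t                   # stay at the current state
    if u < stay_p + resample_p:
        return random.choice(vocab)  # resample from the prior
    return x_target                  # flip toward the target state

random.seed(0)
print([iddm_step("A", "Z", ["A", "B", "Z"], 0.5, 0.2, 0.3) for _ in range(5)])
```

The resample branch is what partially resets probability mass to the marginal distribution, giving the model a chance to correct earlier token errors rather than committing to them irreversibly.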
Authors: Zhiyin Yu, Yuchen Mou, Juncheng Yan, Junyu Luo, Chunchun Chen, Xing Wei, Yunhui Liu, Hongru Sun, Yuxing Zhang, Jun Xu, Yatao Bian, Ming Zhang, Wei Ye, Tieke He, Jie Yang, Guanjie Zheng, Zhonghai Wu, Bo Zhang, Lei Bai, Xiao Luo
Abstract: Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first systematic review of reinforcement learning for LLMs under data scarcity. We propose a bottom-up hierarchical framework built around three complementary perspectives: the data-centric perspective, the training-centric perspective, and the framework-centric perspective. We develop a taxonomy of existing methods, summarize representative approaches in each category, and analyze their strengths and limitations. Our taxonomy aims to provide a clear conceptual foundation for understanding the design space of data-efficient RL for LLMs and to guide researchers working in this emerging area. We hope this survey offers a comprehensive roadmap for future research and inspires new directions toward more efficient and scalable reinforcement learning post-training for LLMs.
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Abstract: Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets new state-of-the-art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.
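A minimal, framework-agnostic sketch of the sigmoid gating this abstract describes follows. For simplicity it uses one scalar gate per head applied to that head's output vector; the paper's gates and their parameterization inside GraphGPS may differ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_attention_output(head_outputs, gate_logits):
    """Apply a learned sigmoid gate to each attention head's output.
    Unlike softmax weights, sigmoid gates need not sum to one, so a head
    can drive its contribution toward zero when no informative signal
    exists -- the mechanism the abstract links to removing attention
    sinks and to suppressing uninformative connections."""
    gated = []
    for out, logit in zip(head_outputs, gate_logits):
        g = sigmoid(logit)  # gate in (0, 1), one per head
        gated.append([g * v for v in out])
    return gated

# Two heads: the large negative logit effectively silences the second head.
heads = [[1.0, 2.0], [3.0, 4.0]]
print(gated_attention_output(heads, [10.0, -10.0]))
```

Because each gate is independent, silencing one head puts no pressure on the others, whereas softmax's sum-to-one constraint would force the suppressed mass to land somewhere.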
Authors: Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo Wang, Linglin Liao
Abstract: This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a \emph{comparison unit construction} problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable
Authors: Jingzhou Jiang, Yixuan Tang, Yi Yang, Kar Yan Tam
Abstract: When task-specific labels are not available, it becomes difficult to select an embedding model for a specific target corpus. Existing labelless measures based on kernel estimators or Gaussian mixtures fail in high-dimensional space, resulting in unstable rankings. We propose a flow-based labelless representation embedding evaluation (FLARE), which utilizes normalizing flows to estimate information sufficiency directly from log-likelihoods, avoiding distance-based density estimation. We give a finite-sample bound, indicating that the estimation error depends on the intrinsic dimension of the data manifold rather than the original embedding dimension. On 11 datasets and 8 embedders, FLARE reached a Spearman's $\rho$ of 0.90 against the supervised benchmark and remained stable on high-dimensional embeddings ($d \geq 3{,}584$) where existing labelless baselines collapsed.
Authors: Ou Wu
Abstract: Large language model optimization has historically bifurcated into isolated data-centric and model-centric paradigms: the former manipulates involved samples through selection, augmentation, or poisoning, while the latter tunes model weights via masking, quantization, or low-rank adaptation. This paper establishes a unified \emph{data-parameter correspondence} revealing these seemingly disparate operations as dual manifestations of the same geometric structure on the statistical manifold $\mathcal{M}$. Grounded in the Fisher-Rao metric $g_{ij}(\theta)$ and Legendre duality between natural ($\theta$) and expectation ($\eta$) parameters, we identify three fundamental correspondences spanning the model lifecycle: 1. Geometric correspondence: data pruning and parameter sparsification equivalently reduce manifold volume via dual coordinate constraints; 2. Low-rank correspondence: in-context learning (ICL) and LoRA adaptation explore identical subspaces on the Grassmannian $\mathcal{G}(r,d)$, with $k$-shot samples geometrically equivalent to rank-$r$ updates; 3. Security-privacy correspondence: adversarial attacks exhibit cooperative amplification between data poisoning and parameter backdoors, whereas protective mechanisms follow cascading attenuation where data compression multiplicatively enhances parameter privacy. Extending from training through post-training compression to inference, this framework provides mathematical formalization for cross-community methodology transfer, demonstrating that cooperative optimization integrating data and parameter modalities may outperform isolated approaches across efficiency, robustness, and privacy dimensions.
Authors: Kadir-Kaan \"Ozer, Ren\'e Ebeling, Markus Enzweiler
Abstract: We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity is unnecessary when the training objective correctly implements the manifold-projection principle. JuRe consists of a single depthwise-separable convolutional residual block with hidden dimension 128, trained to repair corrupted time series windows and scored at inference by a fixed, parameter-free structural discrepancy function. Despite using no attention, no latent variable, and no adversarial component, JuRe ranks second on the TSB-AD multivariate benchmark (AUC-PR 0.404, 180 series, 17 datasets) and second on the UCR univariate archive by AUC-PR (0.198, 250 series), leading all neural baselines on AUC-PR and VUS-PR. Component ablation on TSB-AD identifies training-time corruption as the dominant factor ($\Delta$AUC-PR $= 0.047$ on removal), confirming that the denoising objective, not network capacity, drives detection quality. Pairwise Wilcoxon signed-rank tests establish statistical significance against 21 of 25 baselines on TSB-AD. Code is available at the URL https://github.com/iis-esslingen/JuRe.
Authors: Masahiro Nomura, Ryoki Hamano, Isao Ono
Abstract: Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.
Authors: Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye
Abstract: Reward-based fine-tuning aims to steer a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are motivated by different perspectives such as Soft RL, GFlowNets, etc., we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching toward a reward-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias--variance--compute tradeoffs of existing designs and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler redesigns that improve alignment effectiveness and compute efficiency across representative settings with differentiable and black-box rewards. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space.
Authors: Keyang Chen, Mingxuan Jiang, Yongsheng Zhao, Zeping Li, Zaiyuan Chen, Weiqi Luo, Zhixin Li, Sen Liu, Yinan Jing, Guangnan Ye, Xihong Wu, Hongfeng Chai
Abstract: Money laundering poses severe risks to global financial systems, driving the widespread adoption of machine learning for transaction monitoring. However, progress remains stifled by the lack of realistic benchmarks. Existing transaction-graph datasets suffer from two pervasive limitations: (i) they provide sparse node-level semantics beyond anonymized identifiers, and (ii) they rely on template-driven anomaly injection, which biases benchmarks toward static structural motifs and yields overly optimistic assessments of model robustness. We propose TransXion, a benchmark ecosystem for Anti-Money Laundering (AML) research that integrates profile-aware simulation of normal activity with stochastic, non-template synthesis of illicit subgraphs. TransXion jointly models persistent entity profiles and conditional transaction behavior, enabling evaluation of "out-of-character" anomalies where observed activity contradicts an entity's socio-economic context. The resulting dataset comprises approximately 3 million transactions among 50,000 entities, each endowed with rich demographic and behavioral attributes. Empirical analyses show that TransXion reproduces key structural properties of payment networks, including heavy-tailed activity distributions and localized subgraph structure. Across a diverse array of detection models spanning multiple algorithmic paradigms, TransXion yields substantially lower detection performance than widely used benchmarks, demonstrating increased difficulty and realism. TransXion provides a more faithful testbed for developing context-aware and robust AML detection methods. The dataset and code are publicly available at https://github.com/chaos-max/TransXion.
Authors: S. Gratton, Ph. L. Toint
Abstract: A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or a small enough stepsize.
Authors: Chanik Kang, Hyewon Suk, Haejun Chung
Abstract: Meta-optics promises compact, high-performance imaging and color routing. However, designing high-performance structures is a high-dimensional optimization problem: mapping a desired optical output back to a physical 3D structure requires solving computationally expensive Maxwell's equations iteratively. Even with adjoint optimization, broadband design can require thousands of Maxwell solves, making industrial-scale optimization slow and costly. To overcome this challenge, we propose the Neural Adjoint Method, a solver-supervised surrogate that predicts 3D adjoint gradient fields from a voxelized permittivity volume using a Fourier Neural Operator (FNO). By learning the dense, per-voxel sensitivity field that drives gradient-based updates, our method can replace per-iteration adjoint solves with fast predictions, greatly reducing the computational cost of full-wave simulations required during iterative refinement. To better preserve sensitivity peaks, we introduce a stage-wise FNO that progressively refines residual errors with increasing emphasis on higher-frequency components. We curate a meta-optics dataset from paired forward/adjoint FDTD simulations and evaluate it across three tasks: spectral sorting (color routers), achromatic focusing (metalenses), and waveguide mode conversion. Our method reduces design time from hours to seconds. These results suggest a practical route toward fast, large-scale volumetric meta-optical design enabled by AI-accelerated scientific computing.
Authors: Vedanta Thapar, Abhinav Gupta
Abstract: Machine learning has become a powerful tool for discovering governing laws of dynamical systems from data. However, most existing approaches degrade severely when observations are sparse, noisy, or irregularly sampled. In this work, we address the problem of learning symbolic representations of nonlinear Hamiltonian dynamical systems under extreme data scarcity by explicitly incorporating physical structure into the learning architecture. We introduce Adaptable Symplectic Recurrent Neural Networks (ASRNNs), a parameter-cognizant, structure-preserving model that combines Hamiltonian learning with symplectic recurrent integration, avoiding time derivative estimation, and enabling stable learning under noise. We demonstrate that ASRNNs can accurately predict long-term dynamics even when each training trajectory consists of only two irregularly spaced time points, possibly corrupted by correlated noise. Leveraging ASRNNs as structure-preserving data generators, we further enable symbolic discovery using independent regression methods (SINDy and PySR), recovering exact symbolic equations for polynomial systems and consistent polynomial approximations for non-polynomial Hamiltonians. Our results show that such architectures can provide a robust pathway to interpretable discovery of Hamiltonian dynamics from sparse and noisy data.
Authors: Ciaran Bench
Abstract: In principle, deep generative models can be used to perform domain adaptation; i.e., align the input feature representations of test data with those of a separate discriminative model's training data. This can help improve the discriminative model's performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.
Authors: Marcin Kostrzewa, Maciej Zi\k{e}ba, Jerzy Stefanowski
Abstract: Counterfactual explanations (CFEs) are essential for interpreting black-box models, yet they often become invalid when models are slightly changed. Existing methods for generating robust CFEs are often limited to specific types of models, require costly tuning, or inflexible robustness controls. We propose a novel approach that jointly models the data distribution and the space of plausible model decisions to ensure robustness to model changes. Using a probabilistic consensus over a model ensemble, we train a conditional normalizing flow that captures the data density under varying levels of classifier agreement. At inference time, a single interpretable parameter controls the robustness level; it specifies the minimum fraction of models that should agree on the target class without retraining the generative model. Our method effectively pushes CFEs toward regions that are both plausible and stable across model changes. Experimental results demonstrate that our approach achieves superior empirical robustness while also maintaining good performance across other evaluation measures.
Authors: Mattie Ji, Indradyumna Roy, Vikas Garg
Abstract: Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms, and discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at \href{https://github.com/Aalto-QuML/Hourglass}{this https URL}.
Authors: Franki Nguimatsia Tiofack, Fabian Schramm, Th\'eotime Le Hellard, Justin Carpentier
Abstract: Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks.
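The closed-form identity the abstract mentions can be illustrated with one standard survival-analysis decomposition (notation ours; the paper's exact formulation may differ). If $T$ is the discrete time-to-goal from state $s$ under goal $g$, with hazard $h_k = \Pr(T = k \mid T \geq k)$ and survival $S_t = \Pr(T > t) = \prod_{k=1}^{t}(1 - h_k)$, then a goal-conditioned value of the form $\mathbb{E}[\gamma^{T}]$ satisfies
\[
V(s, g) \;=\; \mathbb{E}\!\left[\gamma^{T}\right] \;=\; 1 - (1 - \gamma) \sum_{t \geq 0} \gamma^{t} S_t ,
\]
so the value function is recoverable from a discounted sum of survival probabilities, and $S_t$ in turn follows from a hazard model trained by maximum likelihood, where right-censored trajectories contribute only survival terms to the likelihood.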
Authors: Yujia Zheng, Zijian Li, Shunxing Fan, Andrew Gordon Wilson, Kun Zhang
Abstract: Given only observational data $X = g(Z)$, where both the latent variables $Z$ and the generating process $g$ are unknown, recovering $Z$ is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability actionable in the real-world scenarios, we take a complementary view: in the general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
Authors: Liangzu Peng, Uday Kiran Reddy Tadipatri, Ziqing Xu, Eric Eaton, Ren\'e Vidal
Abstract: Continual learning (CL) is concerned with learning multiple tasks sequentially without forgetting previously learned tasks. Despite substantial empirical advances over recent years, the theoretical development of CL remains in its infancy. At the heart of developing CL theory lies the challenge that the data distribution varies across tasks, and we argue that properly addressing this challenge requires understanding this variation--dependency among tasks. To explicitly model task dependency, we consider nonlinear regression tasks and propose the assumption that these tasks are dependent in such a way that the data of the current task is a nonlinear transformation of previous data. With this model and under natural assumptions, we prove statistical recovery guarantees (more specifically, bounds on estimation errors) for several CL paradigms in practical use, including experience replay with data-independent regularization and data-independent weights that balance the losses of tasks, replay with data-dependent weights, and continual learning with data-dependent regularization (e.g., knowledge distillation). To the best of our knowledge, our bounds are informative in cases where prior work gives vacuous bounds.
Authors: Paul M. Thompson
Abstract: How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior. We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this accumulation follows a zeta-like scaling law governed by power-law decay of covariance spectra and aligned signal energy, leading naturally to the appearance of the Riemann zeta function. Representation learning methods such as sparse models, low-rank embeddings, and multimodal contrastive objectives improve sample efficiency by concentrating useful signal into earlier stable modes, effectively steepening spectral decay and shifting scaling curves. The framework predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders outperform them once sufficient data stabilizes additional degrees of freedom. Applications include multimodal disease classification, imaging genetics, functional MRI, and topological data analysis. The resulting zeta law provides a principled way to anticipate when scaling data, improving representations, or adding modalities is most likely to accelerate discovery.
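The zeta-like scaling law sketched in the abstract can be made concrete: if mode $k$ contributes signal energy proportional to $k^{-s}$, then the cumulative signal-to-noise energy over the first $n$ stabilized modes is a partial sum of $\sum_k k^{-s}$, which converges to the Riemann zeta value $\zeta(s)$. The snippet below is an illustrative toy, not the paper's framework; the decay exponent and unit-energy normalization are assumptions.

```python
def cumulative_snr(n_modes, s):
    """Cumulative signal energy when spectral mode k contributes ~ k^{-s}.

    As more data stabilizes additional modes, the partial sum approaches
    zeta(s), so performance saturates at a rate set by the decay exponent s.
    Steeper decay (larger s) means earlier saturation; this is the mechanism
    by which representation learning that concentrates signal into early
    modes shifts the scaling curve.
    """
    return sum(k ** -s for k in range(1, n_modes + 1))
```

For example, with $s = 2$ the partial sums approach $\zeta(2) = \pi^2/6 \approx 1.6449$.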
Authors: Md Mezbahul Islam, John Michael Templeton, Christian Poellabauer, Ananda Mohan Mondal
Abstract: Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), along with 94.14% accuracy and 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.
Authors: Shashank Mishra, Karan Patil, Cedric Schockaert, Didier Stricker, Jason Rambach
Abstract: Root cause analysis (RCA) for time-series anomaly detection is critical for the reliable operation of complex real-world systems. Existing explanation methods often rely on unrealistic feature perturbations and ignore temporal and cross-feature dependencies, leading to unreliable attributions. We propose a conditional attribution framework that explains anomalies relative to contextually similar normal system states. Instead of using marginal or randomly sampled baselines, our method retrieves representative normal instances conditioned on the anomalous observation, enabling dependency-preserving and operationally meaningful explanations. To support high-dimensional time-series data, contextual retrieval is performed in learned low-dimensional representations using both variational autoencoder latent spaces and UMAP manifold embeddings. By grounding the retrieval process in the system's learned manifold, this strategy avoids out-of-distribution artifacts and ensures attribution fidelity while maintaining computational efficiency. We further introduce confidence-aware and temporal evaluation metrics for assessing explanation reliability and responsiveness. Experiments on the SWaT and MSDS benchmarks demonstrate that the proposed approach consistently improves root-cause identification accuracy, temporal localization, and robustness across multiple anomaly detection models. These results highlight the practical utility of conditional attribution for explainable anomaly diagnosis in complex time-series systems. Code and models will be publicly released.
Authors: Swattik Maiti, Ritik Pratap Singh, Fardina Fathmiul Alam
Abstract: Credit risk default prediction remains a cornerstone of risk management in the financial industry. The task involves estimating the likelihood that a borrower will fail to meet debt obligations, an objective critical for lending decisions, portfolio optimization, and regulatory compliance. Traditional machine learning models such as logistic regression and tree-based ensembles are widely adopted for their interpretability and strong empirical performance. However, modern credit datasets are high-dimensional, heterogeneous, and noisy, increasing overfitting risk in monolithic models and reducing robustness under distributional shift. We introduce STRIKE (Stacking via Targeted Representations of Isolated Knowledge Extractors), a feature-group-aware stacking framework for structured tabular credit risk data. Rather than training a single monolithic model on the complete dataset, STRIKE partitions the feature space into semantically coherent groups and trains independent learners within each group. This decomposition is motivated by an additive perspective on risk modeling, where distinct feature sources contribute complementary evidence that can be combined through a structured aggregation. The resulting group-specific predictions are integrated through a meta-learner that aggregates signals while maintaining robustness and modularity. We evaluate STRIKE on three real-world datasets spanning corporate bankruptcy and consumer lending scenarios. Across all settings, STRIKE consistently outperforms strong tree-based baselines and conventional stacking approaches in terms of AUC-ROC. Ablation studies confirm that performance gains stem from meaningful feature decomposition rather than increased model complexity. Our findings demonstrate that STRIKE is a stable, scalable, and interpretable framework for credit risk default prediction tasks.
Authors: Christian Lysenst{\o}en
Abstract: Serving large language models under latency service-level objectives (SLOs) is a configuration-heavy systems problem with an unusually failure-prone search space: many plausible configurations crash outright or miss user-visible latency targets, and standard black-box optimizers treat these failures as wasted trials. We present SLO-Guard, a crash-aware autotuner for vLLM serving that treats crashes as first-class observations. SLO-Guard combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase; the handoff replays all exploration history, including crashes encoded as extreme constraint violations. We additionally contribute a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. We evaluate SLO-Guard on Qwen2-1.5B served with vLLM 0.19 on an NVIDIA A100 40GB. Across a pre-specified five-seed study, both SLO-Guard and uniform random search attain 75/75 feasibility with zero crashes under the corrected concurrent harness, and are statistically tied on best-achieved latency (Mann-Whitney two-sided p=0.84). SLO-Guard's advantage is in budget consistency: more trials in the fast-serving regime (10.20 vs. 7.40 out of 15; one-sided p=0.014) and higher post-handoff consistency (0.876 vs. 0.539; p=0.010). Under concurrent load, SLO-Guard's cross-seed standard deviation on best latency is 4.4x tighter than random search's (2.26 ms vs. 10.00 ms). A harness-replication analysis shows that the consistency findings survive an independent sequential-dispatch measurement condition. The central claim is not that SLO-Guard finds a better final configuration, but that it spends a fixed tuning budget more predictably once the fast regime has been found.
Authors: Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson
Abstract: Constitution-conditioned post-training can be analysed as a structured perturbation of a model's learned representational geometry. We introduce ATLAS, a geometry-first program that traces constitution-induced hidden-state structure across charts, models, and substrates. Instead of treating the relevant unit as a single behaviour, neuron, vector, or patch, ATLAS tests a local chart whose tangent structure, occupancy distribution, and behavioural coupling can be measured under system change. On Gemma, the anchored source-local chart captures 310 / 320 reviewed source rows and all 84 / 84 reviewed score-flip rows, but compact exact-patch sufficiency does not close, so the exportable unit is the broader source-defined family. Freezing that family, we re-identify a target-local realisation in an unadapted Phi model, where the fully adjudicated confirmatory contrast separates with AUC 0.984 and mean gap 5.50. In held-out ALM8 mouse frontal-cortex perturbation data, the same source-defined family receives support across 5/5 folds, with mean held-out AUC 0.72 and mean fold gap 4.50. A multiple-choice analysis provides the main boundary: nearby target-local signals can appear without source-faithful closure. The resulting correspondence is not coordinate identity, site identity, or a target-side mediation theorem. It is geometric recurrence under redistribution: written constitutions can induce recoverable latent geometry whose organisation remains detectable across model and substrate changes while its local coordinates, occupancy, and behavioural expression shift.
Authors: C\'esar Ojeda, Niklas Hartung, Wilhelm Huisinga, Tim Jahn, Purity Kamene Kavwele, Marian Klose, Piyush Kumar, Rams\'es J. S\'anchez, Darius A. Faroughy
Abstract: We introduce Prior-Fitted Functional Flows, a generative foundation model for pharmacokinetics that enables zero-shot population synthesis and individual forecasting without manual parameter tuning. We learn functional vector fields, explicitly conditioned on the sparse, irregular data of an entire study population. This enables the generation of coherent virtual cohorts as well as forecasting of partially observed patient trajectories with calibrated uncertainty. We construct a new open-access literature corpus to inform our priors, and demonstrate state-of-the-art predictive accuracy on extensive real-world datasets.
Authors: Joon Hyeok Kim, Yong-Hyun Park, Mattis Dals{\ae}tra {\O}stby, Jiatao Gu
Abstract: Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking--delayed generalization after overfitting--on modular addition, enabling controlled analysis of their internal computations. We study this phenomenon across two levels of data regime. In a single-image regime, mechanistic dissection reveals that the model implements modular addition by composing periodic representations of individual operands. In a diverse-image regime with high intraclass variability, we find that the model leverages its iterative sampling process to partition the task into an arithmetic computation phase followed by a visual denoising phase, separated by a critical timestep threshold. Our work provides a mechanistic decomposition of algorithmic learning in diffusion models, revealing how these models bridge continuous pixel-space generation and discrete symbolic reasoning.
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Abstract: Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.
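The core constraint in SafeAnchor, restricting gradient updates to the orthogonal complement of a low-rank safety subspace, amounts to a standard subspace projection. The sketch below is a minimal illustration under the assumption that an orthonormal basis for the safety subspace (e.g., top Fisher eigenvectors) is already available; it is not the authors' implementation.

```python
import numpy as np

def project_out_safety(grad, safety_basis):
    """Remove the component of an update that lies in the safety subspace.

    safety_basis: (d, k) matrix with orthonormal columns spanning the
    low-rank safety subspace. The returned update lies entirely in the
    orthogonal complement, so the step cannot move parameters along
    safety-critical directions.
    """
    B = np.asarray(safety_basis, dtype=float)
    # g - B (B^T g): orthogonal-complement projection of the gradient.
    return grad - B @ (B.T @ grad)
```

If the safety subspace is spanned by the first coordinate axis, the projection simply zeroes out the first gradient component.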
Authors: Shripad Deshmukh, Jayakumar Subramanian, Raghavendra Addanki, Nikos Vlassis
Abstract: In cooperative teams where agents act in a fixed order and share a single team reward, it is hard to know how much each agent contributed, and harder still when agents are updated one at a time because data collected earlier no longer reflects the new policies. We introduce the Sequential Aristocrat Utility (SeqAU), the unique per-agent learning signal that maximizes the individual learnability of each agent's action, extending the classical framework of Wolpert and Tumer (2002) to this sequential setting. From SeqAU we derive CAPO (Counterfactual Advantage Policy Optimization), a critic-free policy-gradient algorithm. CAPO fits a per-agent reward decomposition from group rewards and computes the per-agent advantage in closed form plus a handful of forward passes through the current policy, requiring no extra environment calls beyond the initial batch. We give analytic bias and variance bounds and validate them on a controlled sequential bandit, where CAPO's advantage over standard baselines grows with the team size. The framework is general; multi-LLM pipelines are a natural deployment target.
Authors: Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin
Abstract: KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single attention patch. On a 4-task subset of LongBench-v1 (16k inputs, n=50 per task, adapted reasoning-model protocol; see section Experiments), MoE-nD's hetero variant matches our uncompressed 1.9~GB baseline at 14x compression (136~MB) while every other compressed baseline we tested (1d, 2d_uniform, 2d) at comparable or smaller memory stays under 8/100. The gains hold on AIME reasoning benchmarks (+6 to +27 pts over the strongest per-layer-quantization baseline across eight configurations). Two null results -- MATH-500 and LongBench's TREC -- share a principled cause (short inputs, solver picks keep=1.0 on most layers), cleanly characterizing when per-layer eviction routing has headroom to help.
Authors: Prashant C. Raju
Abstract: Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy ($\rho = 0.89$-$0.97$) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial $\rho = 0.62$-$0.76$). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks ($\rho \approx 0.10$), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly $2\times$ greater geometric change than CKA during post-training alignment (up to $5.23\times$ in Llama) while providing earlier warning in 73\% of models and maintaining a $6\times$ lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
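The quantity at the heart of the abstract, "the consistency of a representation's pairwise distance structure," can be sketched as a rank correlation between the pairwise distance matrices of two representations of the same inputs. This is an illustrative unsupervised variant only; the paper's Shesha variants (and the supervised, task-aligned ones in particular) are defined differently.

```python
import numpy as np

def geometric_stability(X, Y):
    """Spearman rank correlation between pairwise distance structures.

    X, Y: (n, d) representations of the same n inputs. Values near 1 mean
    the relational geometry is preserved; a drop signals geometric drift.
    """
    def upper(Z):
        Z = np.asarray(Z, dtype=float)
        D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        return D[np.triu_indices(len(Z), k=1)]      # unique pair distances
    a, b = upper(X), upper(Y)
    ra = np.argsort(np.argsort(a)).astype(float)    # ranks of distances
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Uniform scaling and translation preserve distance ranks, so the score is 1 for such transformations; reshuffling the geometry drives it down.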
Authors: Kunyu Zhang, Qiang Li, Vince D. Calhoun, Shujian Yu
Abstract: Resting-state functional magnetic resonance imaging (fMRI) has emerged as a cornerstone for psychiatric diagnosis, yet most approaches rely on pairwise cortical or sub-cortical brain connectivities that overlook higher-order interactions (HOIs) central to complex brain dynamics. While hypergraph methods encode HOIs through predefined hyperedges, their construction typically relies on heuristic similarity metrics and does not explicitly characterize whether interactions are synergy- or redundancy-dominated. In this paper, we introduce $O$-information, a signed measure that characterizes the informational nature of HOIs, and integrate third- and fourth-order $O$-information into a unified multi-view information bottleneck framework for fMRI-based psychiatric diagnosis. To enable scalable $O$-information estimation, we further develop two independent acceleration strategies: a Gaussian analytical approximation and a randomized matrix-based R\'enyi entropy estimator, achieving over a 30-fold computational speedup compared with conventional estimators. Our tri-view architecture systematically fuses pairwise, triadic, and tetradic brain interactions, capturing comprehensive brain connectivity while explicitly penalizing redundancy. Extensive evaluation across four benchmark datasets (REST-meta-MDD, ABIDE, UCLA, ADNI) demonstrates consistent improvements, outperforming 11 baseline methods including state-of-the-art graph neural network (GNN) and hypergraph-based approaches. Moreover, our method reveals interpretable region-level synergy-redundancy patterns which are not explicitly characterized by conventional hypergraph formulations.
Authors: Yuzhe Fu, Hancheng Ye, Cong Guo, Junyao Zhang, Qinsi Wang, Yueqian Lin, Changchun Zhou, Hai (Helen) Li, Yiran Chen
Abstract: Point-based Neural Networks (PNNs) have become a key approach for point cloud processing. However, a core operation in these models, Farthest Point Sampling (FPS), often introduces significant inference latency, especially for large-scale processing. Despite existing CUDA- and hardware-level optimizations, FPS remains a major bottleneck due to exhaustive computations across multiple network layers in PNNs, which hinders scalability. Through systematic analysis, we identify three substantial redundancies in FPS, including unnecessary full-cloud computations, redundant late-stage iterations, and predictable inter-layer outputs that make later FPS computations avoidable. To address these, we propose \textbf{\textit{FlashFPS}}, a hardware-agnostic, plug-and-play framework for FPS acceleration, composed of \textit{FPS-Prune} and \textit{FPS-Cache}. \textit{FPS-Prune} introduces candidate pruning and iteration pruning to reduce redundant computations in FPS while preserving sampling quality, and \textit{FPS-Cache} eliminates layer-wise redundancy via cache-and-reuse. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, \textit{FlashFPS} achieves 5.16$\times$ speedup over the standard CUDA baseline on GPU and 2.69$\times$ on PNN accelerators, with negligible accuracy loss, enabling efficient and scalable PNN inference. Codes are released at https://github.com/Yuzhe-Fu/FlashFPS.
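For context on what FlashFPS accelerates, the standard farthest point sampling baseline is shown below: each iteration scans the full cloud to pick the point farthest from the already-selected set, which is the exhaustive $O(N \cdot S)$ computation the paper's pruning and caching target. This is the textbook algorithm, not the FlashFPS kernels.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Standard FPS: iteratively select the point with the largest
    distance to the current sample set, starting from index 0.

    points: (N, d) array. Returns the list of selected indices.
    """
    selected = [0]
    # Distance from every point to the nearest selected point so far.
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(dists))                  # farthest point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return selected
```

The full-cloud `argmax`/`minimum` passes at every iteration are exactly the kind of redundancy (unnecessary full-cloud computation, late-stage iterations) that candidate and iteration pruning aim to skip.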
Authors: Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Junqiang Zheng, Saiyong Yang, Yunfang Wu
Abstract: Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground truth annotations or require advanced commercial language models (LMs) to synthesize environments that remain fixed once created. In this work, we propose TRUSTEE, a data-free method training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls various aspects of the task difficulty dynamically during training. Our empirical results show that TRUSTEE brings consistent improvements across various domains and outperforms all the baselines which require extra external resources for training. These results confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning, without expensive annotated data, realistic human interactions, executable tools or costly verifiable environments from human experts or commercial LMs. We hope our proposed paradigm could inspire future research on environment scaling with limited resources.
Authors: Deyi Wang, Qining Zhang, Lei Ying
Abstract: This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
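The two ingredients named in the abstract, zeroth-order optimization with binary perturbation and sign-based updates, combine as in the sketch below. This is a generic illustration of those building blocks (a signSGD-style two-point estimator along a Rademacher direction), not the Par-S$^2$ZPO algorithm itself; the partitioning and federated aggregation are omitted.

```python
import numpy as np

def sign_zo_step(theta, loss_fn, mu=1e-2, lr=1e-2, rng=None):
    """One sign-based zeroth-order update with a binary perturbation.

    Two loss evaluations along a random +/-1 (Rademacher) direction give a
    directional-derivative estimate; only its sign is applied per coordinate,
    which keeps communication and memory costs low by construction.
    """
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.choice([-1.0, 1.0], size=theta.shape)          # binary perturbation
    g = (loss_fn(theta + mu * z) - loss_fn(theta - mu * z)) / (2.0 * mu)
    return theta - lr * np.sign(g * z)                     # sign-based step
```

On a simple quadratic the iterates drift toward the minimizer and then hover within a ball whose radius is set by the learning rate.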
Authors: Lixian Chen, Jianhong Tan
Abstract: Adapting foundation models under resource budgets relies heavily on Parameter-Efficient Fine-Tuning (PEFT), with LoRA being a standard modular solution. However, LoRA suffers from spectral interference. Low-rank updates often concentrate energy on the leading singular directions of pretrained weights, perturbing general capabilities and causing catastrophic forgetting and fragile multi-adapter merging. To resolve this, we propose HiP-LoRA, a spectrum-aware adaptation framework. Utilizing the cached singular value decomposition (SVD) of pretrained layers, HiP-LoRA decomposes updates into two channels: a principal channel within the dominant singular subspace, and a residual low-rank channel in the orthogonal complement. A singular-value-weighted stability budget on the principal channel continuously balances pretrained behavior preservation with task-specific plasticity. Experiments on Llama-3.1-8B demonstrate that under matched budgets, HiP-LoRA drastically reduces pretraining degradation and multi-adapter MergeFail, robustly outperforming baselines in interference-sensitive tasks like continual tuning and knowledge editing.
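The two-channel decomposition in HiP-LoRA, a principal channel inside the dominant singular subspace of the pretrained weight and a residual channel in its orthogonal complement, can be sketched directly from the cached SVD. This is an illustrative split only; the paper's stability budget and training machinery are not shown, and the choice of subspace rank `k` is an assumption.

```python
import numpy as np

def split_update(delta_w, W, k):
    """Split an update into principal and residual channels.

    Projects delta_w onto the top-k left singular subspace of the
    pretrained weight W (principal channel, where interference with
    pretrained behavior concentrates) and onto its orthogonal
    complement (residual low-rank channel).
    """
    U, _, _ = np.linalg.svd(np.asarray(W, dtype=float), full_matrices=False)
    Uk = U[:, :k]                       # dominant singular directions
    principal = Uk @ (Uk.T @ delta_w)   # component hitting leading directions
    residual = delta_w - principal      # component in orthogonal complement
    return principal, residual
```

The two channels sum back to the original update and are mutually orthogonal, which is what lets a stability budget penalize the principal channel without touching the residual one.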
Authors: Pranshav Gajjar, Manan Tiwari, Sayanta Seth, Vijay K. Shah
Abstract: Data scarcity remains a fundamental bottleneck in applying deep learning to wireless communication problems, particularly in scenarios where collecting labeled Radio Frequency (RF) data is expensive, time-consuming, or operationally constrained. This paper proposes LLM-AUG, a data augmentation framework that leverages in-context learning in large language models (LLMs) to generate synthetic training samples directly in a learned embedding space. Unlike conventional generative approaches that require training task-specific models, LLM-AUG performs data generation through structured prompting, enabling rapid adaptation in low-shot regimes. We evaluate LLM-AUG on two representative tasks: modulation classification and interference classification using the RadioML 2016.10A dataset, and the Interference Classification (IC) dataset respectively. Results show that LLM-AUG consistently outperforms traditional augmentation and deep generative baselines across low-shot settings and reaches near oracle performance using only 15% labeled data. LLM-AUG further demonstrates improved robustness under distribution shifts, yielding a 29.4% relative gain over diffusion-based augmentation at a lower SNR value. On the RadioML and IC datasets, LLM-AUG yields a relative gain of 67.6% and 35.7% over the diffusion-based baseline. The t-SNE visualizations further validate that synthetic samples generated by LLM-AUG better preserve class structure in the embedding space, leading to more consistent and informative augmentations. These results demonstrate that LLMs can serve as effective and practical data augmenters for wireless machine learning, enabling robust and data-efficient learning in evolving wireless environments.
Authors: Pranshav Gajjar, Vijay K Shah
Abstract: Large language models (LLMs) are increasingly deployed in the telecommunications domain for critical tasks, relying heavily on Retrieval-Augmented Generation (RAG) to adapt general-purpose models to continuously evolving standards. However, a significant gap exists in evaluating the embedding models that power these RAG pipelines, as general-purpose benchmarks fail to capture the dense, acronym-heavy, and highly cross-referential nature of telecommunications corpora. To address this, we introduce TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark designed specifically for telecommunications. The benchmark spans three heterogeneous corpora: O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase, comprising 9,000 question-chunk pairs across three standard chunk sizes (512, 1024, and 2048 tokens). To construct this dataset at scale without manual annotation bottlenecks, we employ a novel automated pipeline where one LLM generates specific queries from text chunks and a secondary LLM validates them across strict criteria. We comprehensively evaluate eight embedding models, spanning standard sentence-transformers and LLM-based embedders. Our results demonstrate that LLM-based embedders, such as Qwen3 and EmbeddingGemma, consistently and significantly outperform traditional sentence-transformers in both retrieval accuracy and robustness against cross-domain interference. Additionally, we introduce TeleEmbedBench-Clean to evaluate model robustness against noisy, incomplete user queries. Finally, our analysis reveals that while domain-specific task instructions improve embedder performance for raw source code, they paradoxically degrade retrieval performance for natural language telecommunications specifications.
Authors: Junyi Yao, Zihao Zheng, Jiayu Long
Abstract: Pairwise ranking systems based on Maximum Likelihood Estimation (MLE), such as the Bradley-Terry model, are widely used to aggregate preferences from pairwise comparisons. However, their robustness under strategic data manipulation remains insufficiently understood. In this paper, we study the vulnerability of MLE-based ranking systems to adversarial perturbations. We formulate the manipulation task as a constrained combinatorial optimization problem and propose an Adaptive Subset Selection Attack (ASSA) to efficiently identify high-impact perturbations. Experimental results on both synthetic data and real-world election datasets show that MLE-based rankings exhibit a sharp phase-transition behavior: beyond a small perturbation budget, a limited number of strategic voters can significantly alter the global ranking. In particular, our method consistently outperforms random and greedy baselines under constrained budgets. These findings reveal a fundamental sensitivity of MLE-based ranking mechanisms to structured perturbations and highlight the need for more robust aggregation methods in collective decision-making systems.
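The object being attacked, the Bradley-Terry MLE, can be computed with the classical minorization-maximization (MM) updates; the sketch below shows the target ranking that a perturbation of the win matrix would try to flip. This is the standard estimator, not the ASSA attack itself.

```python
import numpy as np

def bradley_terry_mle(wins, n_iter=200):
    """MM updates for Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j]: number of times item i beat item j. Each update sets
    p_i = W_i / sum_{j != i} n_ij / (p_i + p_j), where W_i is i's total
    wins and n_ij the total i-vs-j comparisons. Strengths are normalized
    to sum to 1; their order is the MLE ranking.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            if denom > 0:
                p[i] = num / denom
        p /= p.sum()
    return p
```

With two items and a 9-to-1 head-to-head record, the normalized strengths converge to (0.9, 0.1), matching the empirical win rate; an adversary's goal under a small budget is to flip the order of such strengths.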
Authors: Kijung Jeon, Michael Muehlebach, Molei Tao
Abstract: Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets $\Sigma$ that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often ill-defined projections onto $\Sigma$, ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.
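The landing mechanism mentioned in the abstract replaces projections with an attraction term added to the sampler's drift. The snippet below is a minimal one-dimensional-constraint illustration of that idea (iterates "land" on the level set $\{h = 0\}$ without any projection solve), not the paper's diffusion sampler; the step size, attraction weight, and constraint are illustrative choices.

```python
import numpy as np

def landing_step(x, drift, h, grad_h, eta=0.05, lam=1.0):
    """One landing update toward the feasible set {x : h(x) = 0}.

    Follows the sampling drift while adding -lam * h(x) * grad_h(x),
    which vanishes on the constraint set and pulls infeasible iterates
    back toward it, avoiding iterative Newton solves or projections.
    """
    return x + eta * (drift(x) - lam * h(x) * grad_h(x))
```

With zero drift and the unit-circle constraint $h(x) = \lVert x \rVert^2 - 1$, repeated landing steps contract any starting point onto the circle.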
Authors: Yan Xie, Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen, Jie Dai, Junfeng Tang, Kai Liu, Kun Li, Lipeng Ge, Meng Sun, Min Luo, Peng Chen, Peng Wang, Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiao Ling, Xiaobo Du, Xin Wu, Yang Liu, Yi Jiang, Yihua Jin, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao
Abstract: As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.
Authors: Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin
Abstract: Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose \textbf{\underline{L}}atent R\textbf{\underline{e}}asoning \textbf{\underline{P}}olicy \textbf{\underline{O}}ptimization~(\textbf{LEPO}), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimator for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
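For context, the stochastic injection the abstract describes can be illustrated with a bare Gumbel-Softmax relaxation; shapes and names below are toy assumptions, not the paper's implementation (which operates inside an LLM's latent space):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable, stochastic relaxation of categorical sampling:
    softmax((logits + Gumbel noise) / tau)."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())                  # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])           # toy next-token scores
embed = rng.normal(size=(3, 4))              # toy token-embedding table
# A "latent token" is the soft mixture of embeddings; resampling the
# Gumbel noise yields a different latent reasoning path each rollout.
latent_a = gumbel_softmax(logits, tau=0.5, rng=rng) @ embed
latent_b = gumbel_softmax(logits, tau=0.5, rng=rng) @ embed
```

Lowering tau sharpens the mixture toward a hard one-hot token choice, which is the knob that makes the stochasticity "controllable."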
Authors: Yubai Wei, Chen Wu, Hashem Haghbayan
Abstract: Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do not explicitly supervise hard physical constraints such as obstacle avoidance or kinematic feasibility. As a result, the geometric structure underlying physically feasible behavior must be inferred only implicitly from demonstrations. In this paper, we study whether introducing explicit feasibility supervision can provide effective structured guidance for VLA policies. We formulate a simple geometry-grounded feasibility objective and integrate it into the training stage of a diffusion-based VLA policy. To evaluate this idea systematically, we use obstacle-aware manipulation as a controlled probe of geometry-dependent physical feasibility. Empirical results show that augmenting VLA training with feasibility supervision improves both physical reliability and overall task performance, while also enhancing learning efficiency in the low-data regime. These findings indicate that explicit feasibility signals can effectively complement imitation-based VLA learning, highlighting their potential for developing more reliable VLA policies.
Authors: Hongyu Zhan, Qixin Wang, Yusen Tan, Haitao Yu, Jingbo Zhou, Shuai Chen, Jia Li, Xiao Tan, Jun Xia
Abstract: The advent of Large Language Models (LLMs) has fundamentally reshaped the way we interact with graphs, giving rise to a new paradigm called GraphLLM. As revealed in recent studies, graph learning can benefit from LLMs. However, we observe limited benefits when we directly utilize LLMs to make predictions for graph-related tasks within the GraphLLM paradigm, which even yields suboptimal results compared to conventional GNN-based approaches. Through in-depth analysis, we find this failure can be attributed to LLMs' limited capability for processing graph data and their tendency to overlook graph information. To address this issue, we propose LoReC (Look, Remember, and Contrast), a novel plug-and-play method for the GraphLLM paradigm, which enhances the LLM's understanding of graph data through three stages: (1) Look: redistributing attention to the graph; (2) Remember: re-injecting graph information into the Feed-Forward Network (FFN); (3) Contrast: rectifying the vanilla logits produced in the decoding process. Extensive experiments demonstrate that LoReC brings notable improvements over current GraphLLM methods and outperforms GNN-based approaches across diverse datasets. The implementation is available at https://github.com/Git-King-Zhan/LoReC.
Authors: Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Samet Oymak
Abstract: State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes yields biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy that obtains unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences training and the eventual Verification@K performance. Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.
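The objective itself is compact enough to write down; the sketch below shows only the Verification@K reward (the CAL weighting scheme is the paper's contribution and is not reproduced here):

```python
def verification_at_k(attempt_passes, k):
    """Verification@K reward: 1.0 iff the problem is solved by the k-th
    attempt, i.e. any of the first k verifier outcomes is a pass."""
    return float(any(attempt_passes[:k]))

# One multi-attempt rollout: fail, fail, pass, fail
rollout = [False, False, True, False]
r2 = verification_at_k(rollout, 2)   # not yet solved within 2 attempts
r3 = verification_at_k(rollout, 3)   # solved by the 3rd attempt
```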
Authors: Xiaoyuan Cheng, Haoyu Wang, Wenxuan Yuan, Ziyan Wang, Zonghao Chen, Li Zeng, Zhuo Sun
Abstract: Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher-Decorator.
Authors: Zhanyu Liu, Qingguo Hu, Ante Wang, Chenqing Liu, Zhishang Xiang, Hui Li, Delai Qiu, Jinsong Su
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also encourages the policy to acquire more diverse exploration behaviors from the general domain. Experiments across multiple domains show that HEAL consistently improves few-shot RLVR performance. Notably, using only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained with 1K target-domain samples.
Authors: Xiao Wang
Abstract: The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
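As a point of reference, the unconditional upper bound rests on pointer doubling; the sketch below shows its unwindowed core on a plain pointer array (the paper's construction additionally windows the composition to respect the cache budget $s$, which this toy ignores):

```python
def k_hop(ptr, k):
    """Resolve k-hop pointer chasing by repeated squaring of the pointer
    map: O(log k) composition rounds instead of k sequential lookups."""
    n = len(ptr)
    result = list(range(n))      # identity permutation: 0 hops
    power = list(ptr)            # ptr composed with itself 2^j times
    while k:
        if k & 1:
            result = [power[i] for i in result]   # apply current power
        power = [power[i] for i in power]         # square the map
        k >>= 1
    return result

chain = [1, 2, 3, 0]             # a 4-cycle of pointers
two_hops = k_hop(chain, 2)       # every index advanced two steps
```

Each squaring halves the remaining hop count, which is where the logarithmic factors in the depth bounds come from.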
Authors: Ke Wan, Kensuke Tanioka, Toshio Shimokawa
Abstract: Machine learning has become integral to medical research and is increasingly applied in clinical settings to support diagnosis and decision-making; however, its effectiveness depends on access to large, diverse datasets, which are limited within single institutions. Although integrating data across institutions can address this limitation, privacy regulations and data ownership constraints hinder these efforts. Federated learning enables collaborative model training without sharing raw data; however, most methods rely on complex architectures that lack interpretability, limiting clinical applicability. Therefore, we propose a federated RuleFit framework to construct a unified and interpretable global model for distributed environments. It integrates three components: preprocessing based on differentially private histograms to estimate shared cutoff values, enabling consistent rule definitions and reducing heterogeneity across clients; local rule generation using gradient boosting decision trees with shared cutoffs; and coefficient estimation via $\ell_1$-regularized optimization using a Federated Dual Averaging algorithm for sparse and consistent variable selection. In simulation studies, the proposed method achieved performance comparable to that of centralized RuleFit while outperforming existing federated approaches. Real-world analysis demonstrated its ability to provide interpretable insights with competitive predictive accuracy. Therefore, the proposed framework offers a practical and effective solution for interpretable and reliable modeling in federated learning environments.
Authors: Junyoung Yang, Kyungmin Kim, Sangdon Park
Abstract: Uncertainty quantification is crucial in safety-critical systems, where decisions must be made under uncertainty. In particular, we consider the problem of online uncertainty quantification, where data points arrive sequentially. Online conformal prediction is a principled online uncertainty quantification method that dynamically constructs a prediction set at each time step. While existing methods for online conformal prediction provide long-run coverage guarantees without any distributional assumptions, they typically assume a full-feedback setting in which the true label is always observed. In this paper, we propose a novel learning method for online conformal prediction with partial feedback from an adaptive adversary, a more challenging setup where the true label is revealed only when it lies inside the constructed prediction set. Specifically, we formulate online conformal prediction as an adversarial bandit problem by treating each candidate prediction set as an arm. Building on an existing algorithm for adversarial bandits, our method achieves a long-run coverage guarantee by explicitly establishing its connection to the regret of the learner. Finally, we empirically demonstrate the effectiveness of our method in both independent and identically distributed (i.i.d.) and non-i.i.d. settings, showing that it successfully controls the miscoverage rate while maintaining a reasonable size of the prediction set.
Authors: Pooyan Khosravinia, Jo\~ao Gama, Bruno Veloso
Abstract: Anomaly detection in multivariate time series is a central challenge in industrial monitoring, as failures frequently arise from complex temporal dynamics and cross-sensor interactions. While recent deep learning models, including graph neural networks and Transformers, have demonstrated strong empirical performance, most approaches remain primarily correlational and offer limited support for causal interpretation and root-cause localization. This study introduces a causally-constrained probabilistic forecasting framework which is a Causally Guided Transformer (CGT) model for multivariate time-series anomaly detection, integrating an explicit time-lagged causal graph prior with deep sequence modeling. For each target variable, a dedicated forecasting block employs a hard parent mask derived from causal discovery to restrict the main prediction pathway to graph-supported causes, while a latent Gaussian head captures predictive uncertainty. To leverage residual correlational information without compromising the causal representation, a shadow auxiliary path with stop-gradient isolation and a safety-gated blending mechanism is incorporated to suppress non-causal contributions when reliability is low. Anomalies are identified using negative log-likelihood scores with adaptive streaming thresholding, and root-cause variables are determined through per-dimension probabilistic attribution and counterfactual clamping. Experiments on the ASD and SMD benchmarks indicate that the proposed method achieves state-of-the-art detection performance, with F1-scores of 96.19% on ASD and 95.32% on SMD, and enhances variable-level attribution quality. These findings suggest that causal structural priors can improve both robustness and interpretability in detecting deep anomalies in multivariate sensor systems.
Authors: Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox, Noah D. Goodman
Abstract: Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model's behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can't it learn to forget? We introduce Neural Garbage Collection (NGC), in which a language model learns to forget while learning to reason, trained end-to-end from outcome-based task reward alone. As the model reasons, it periodically pauses, decides which KV cache entries to evict, and continues to reason conditioned on the remaining cache. By treating tokens in a chain-of-thought and cache-eviction decisions as discrete actions sampled from the language model, we can use reinforcement learning to jointly optimize how the model reasons and how it manages its own memory: what the model evicts shapes what it remembers, what it remembers shapes its reasoning, and the correctness of that reasoning determines its reward. Crucially, the model learns this behavior entirely from a single learning signal - the outcome-based task reward - without supervised fine-tuning or proxy objectives. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression and substantially outperforms eviction baselines. Our results are a first step towards a broader vision where end-to-end optimization drives both capability and efficiency in language models.
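To make the mechanism concrete, a toy version of the eviction step (treating evictions as discrete actions sampled from a policy) might look like the following; the policy head and shapes are our stand-ins, not the paper's:

```python
import numpy as np

def evict_step(cache, eviction_logits, n_evict, rng):
    """Sample n_evict cache slots to drop from a softmax policy over
    slots, then return the surviving cache entries. In NGC the logits
    would come from the language model itself, and the sampled actions
    are credited only by the downstream outcome-based task reward."""
    p = np.exp(eviction_logits - eviction_logits.max())
    p /= p.sum()
    dropped = rng.choice(len(cache), size=n_evict, replace=False, p=p)
    keep = sorted(set(range(len(cache))) - set(dropped.tolist()))
    return [cache[i] for i in keep]

rng = np.random.default_rng(0)
cache = ["tok0", "tok1", "tok2", "tok3", "tok4"]
logits = np.array([0.1, 2.0, -1.0, 0.5, 0.0])   # toy per-slot eviction scores
cache = evict_step(cache, logits, n_evict=2, rng=rng)
```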
Authors: Helmut Harbrecht, Christoph Schwab
Abstract: We prove error bounds for operator surrogates of solution operators for partial differential and boundary integral equations on families of domains which are diffeomorphic to one common reference (or latent) domain $D_{ref}$. The pullback of the PDE to $D_{ref}$ via affine-parametric shape encoding produces a collection of holomorphic parametric PDEs on $D_{ref}$. Sufficient conditions for (uniformly with respect to the parameter) well-posedness are given, implying existence, uniqueness and stability of parametric solution families on $D_{ref}$. We illustrate the abstract hypotheses by reviewing recent holomorphy results for a suite of elliptic and parabolic PDEs. Quantified parametric holomorphy implies existence of finite-parametric, discrete approximations of the parametric solution families with convergence rates in terms of the number $N$ of parameters. We obtain constructive proofs of existence of Neural and Spectral Operator surrogates for the shape-to-solution maps with error bounds and convergence rate guarantees uniform on the collection of admissible shapes. We admit principal-component shape encoders and frame decoders. Our results support in particular the (empirically reported) ability of neural operators to realize data-to-solution maps for elliptic and parabolic PDEs and BIEs that generalize across parametric families of shapes.
Authors: Mudi Jiang, Jiahui Zhou, Xinying Liu, Zengyou He, Zhikui Chen
Abstract: In multi-view clustering, the quality of different views may vary substantially, and low-quality or degraded views can impair overall clustering performance. However, existing studies mainly address this issue within the clustering process through view weighting or noise-robust optimization, while paying limited attention to data-level assessment before clustering. In this paper, we study the problem of pre-clustering noisy-view analysis in multi-view data from a clusterability perspective. To this end, we propose a Multi-View Clusterability Score (MVCS), which quantifies the strength of latent cluster-related structures in multi-view data through three complementary components: per-view structural clusterability, joint-space clusterability, and cross-view neighborhood consistency. To the best of our knowledge, this is the first clusterability score specifically designed for multi-view data. We further use it to perform potentially noisy view analysis and noisy-view detection before clustering. Extensive experiments on real-world datasets demonstrate that noisy views can significantly degrade clustering performance, and that, compared with existing clusterability measures designed for single-view data, the proposed method more effectively supports noisy-view analysis and detection.
Authors: Enze Pan
Abstract: Many deployed systems expose black-box objectives whose minimizing configuration shifts with an externally observed context. When contexts revisit a small set of latent regimes, an optimizer that discards history pays repeated adaptation cost; when each step must remain inexpensive, full Gaussian-process (GP) refits at high observation counts are difficult to sustain. We cast online tuning as context-conditioned regret minimization and present RASP-Tuner, which instantiates a decomposition motivated by first principles: (i) identify a regime proxy by retrieving similar past contexts; (ii) predict short-horizon loss with a mixture-of-experts surrogate whose input concatenates parameters, context, and a retrieved soft prompt; (iii) adapt chiefly in a low-dimensional prompt subspace, invoking full surrogate updates only when scalarized error or disagreement spikes. A RealErrorComposer maps heterogeneous streaming metrics to [0,1] via EMA-stabilized logistic scores, supplying a single differentiable training target. On nine synthetic non-stationary benchmarks, an adversarial-context sanity check, and three tabular real-world streams (Section on real-world experiments), RASP-Tuner improves or matches cumulative regret relative to our GP-UCB and CMA-ES implementations on seven of nine synthetic tasks under paired tests at horizon T=100, while recording 8-12 times lower wall-clock per step than sliding-window GP-UCB on identical hardware. Idealized analysis in a cluster-separated, strongly convex regime model (RA-GD) supplies sufficient conditions for bounded dynamic regret; the deployed pipeline violates several of these premises, and we articulate which gaps remain open.
Authors: Leyla Sadighi, Stefan Karlsson, Carlos Natalino, Mojtaba Eshghie, Fehmida Usmani, Eoin Kenny, Lena Wosinska, Paolo Monti, Marija Furdek, Marco Ruffini
Abstract: Machine learning (ML) models trained to detect physical-layer threats on one optical fiber system often fail catastrophically when applied to a different system, due to variations in operating wavelength, fiber properties, and network architecture. To overcome this, we propose a Domain Adaptation (DA) framework based on a Variational Autoencoder (VAE) that learns a shared representation capturing event signatures common to both systems while suppressing system-specific differences. The shared encoder is first trained on the combined data from two distinct optical systems: a 21 km O-band dark-fiber testbed (System 1) and a 63.4 km C-band live metro ring (System 2). The encoder is then frozen, and a classifier is trained using labels from an individual system. The proposed approach achieves 95.3% and 73.5% cross-system accuracy when moving from System 1 to System 2 and vice versa, respectively. This corresponds to gains of 83.4% and 51% over a fully supervised Deep Neural Network (DNN) baseline trained on a single system, while preserving intra-system performance.
Authors: Blaise Delaney, Salil Patel, Yuji Xing, Dominic Dootson, Karin Sevegnani
Abstract: We introduce Sonata, a compact latent world model for six-axis trunk IMU representation learning under clinical data scarcity. Clinical cohorts typically comprise tens to hundreds of patients, making web-scale masked-reconstruction objectives poorly matched to the problem. Sonata is a 3.77 M-parameter hybrid model, pre-trained on a harmonised corpus of nine public datasets (739 subjects, 190k windows) with a latent world-model objective that predicts future state rather than reconstructing raw sensor traces. In a controlled comparison against a matched autoregressive forecasting baseline (MAE) on the same backbone, Sonata yields consistently stronger frozen-probe clinical discrimination, prospective fall-risk prediction, and cross-cohort transfer across a 14-arm evaluation suite, while producing higher-rank, more structured latent representations. At 3.77 M parameters the model is compatible with on-device wearable inference, offering a step toward general kinematic world models for neurological assessment.
Authors: Yunjia Yang, Babak Gholami, Caglar Gurbuz, Mohammad Rashed, Nils Thuerey
Abstract: Accurate machine-learning models for aerodynamic prediction are essential for accelerating shape optimization, yet remain challenging to develop for complex three-dimensional configurations due to the high cost of generating training data. This work introduces a methodology for efficiently constructing accurate surrogate models for design purposes by first pre-training a large-scale model on diverse geometries and then fine-tuning it with a small number of more detailed task-specific samples. A Transformer-based architecture, AeroTransformer, is developed and tailored for large-scale training to learn aerodynamics. The methodology is evaluated on transonic wings, where the model is pre-trained on SuperWing, a dataset of nearly 30000 samples with broad geometric diversity, and subsequently fine-tuned to handle specific wing shapes perturbed from the Common Research Model. Results show that, with 450 task-specific samples, the proposed methodology achieves 0.36% error on surface-flow prediction, an 84.2% reduction compared to training from scratch. The influence of model configurations and training strategies is also systematically studied to provide guidance on effectively training and deploying such models under limited data and computational budgets. To facilitate reuse, we release the datasets and the pre-trained models at https://github.com/tum-pbs/AeroTransformer. An interactive design tool is also built on the pre-trained model and is available online at https://webwing.pbs.cit.tum.de.
URLs: https://github.com/tum-pbs/AeroTransformer, https://webwing.pbs.cit.tum.de
Authors: Josh Millar, Ashok Samraj Thangarajan, Soumyajit Chatterjee, Hamed Haddadi
Abstract: The miniaturisation of neural processing units (NPUs) and other low-power accelerators has enabled their integration into microcontroller-scale wearable hardware, supporting near-real-time, offline, and privacy-preserving inference. Yet physiological signal analysis has remained infeasible on such hardware; recent Transformer-based models show state-of-the-art performance but are prohibitively large for resource- and power-constrained hardware and incompatible with $\mu$NPUs due to their dynamic attention operations. We introduce PhysioLite, a lightweight, NPU-compatible model architecture and training framework for ECG/EMG signal analysis. Using learnable wavelet filter banks, CPU-offloaded positional encoding, and hardware-aware layer design, PhysioLite reaches performance comparable to state-of-the-art Transformer-based foundation models on ECG and EMG benchmarks, while being <10% of the size ($\sim$370KB with 8-bit quantization). We also profile its component-wise latency and resource consumption on both the MAX78000 and HX6538 WE2 $\mu$NPUs, demonstrating its viability for signal analysis on constrained, battery-powered hardware. We release our model(s) and training framework at: https://github.com/j0shmillar/physiolite.
Authors: Agnieszka Pregowska, Hazem M. Kalaji
Abstract: Reconstructing continuous environmental fields from sparse and irregular observations remains a central challenge in environmental modelling and biodiversity informatics. Many ecological datasets are heterogeneous in space and time, making grid-based approaches difficult to scale or generalise across domains. Here, we evaluate implicit neural representations (INRs) as a coordinate-based modelling framework for learning continuous spatial and spatio-temporal fields directly from coordinate inputs. We analyse their behaviour across three representative modelling scenarios: species distribution reconstruction, phenological dynamics, and morphological segmentation derived from open biodiversity data. Beyond predictive performance, we examine interpolation behaviour, spatial coherence, and computational characteristics relevant for environmental modelling workflows, including scalability, resolution-independent querying, and architectural inductive bias. Results show that neural fields provide stable continuous representations with predictable computational cost, complementing classical smoothers and tree-based approaches. These findings position coordinate-based neural fields as a flexible representation layer that can be integrated into environmental modelling pipelines and exploratory analysis frameworks for large, irregularly sampled datasets.
Authors: Mingxue Xu
Abstract: Matrix-level low-rank compression is a promising way to reduce the cost of large language models, but running compression and evaluating the resulting models on language tasks can be prohibitively expensive. Can compression-induced degradation be predicted before committing to this compute? We systematically analyze the Qwen3 and Gemma3 model families across four representative low-rank compression methods: vanilla SVD, two ASVD variants, and SVD-LLM. We find that stable rank and information density, measured in bits per parameter, dominate performance degradation. The interaction term $\gamma \cdot \bar{\rho}_s$, defined as compression ratio times stable rank, is a robust predictor of accuracy degradation, achieving leave-one-out cross-validation Pearson correlations of $0.890$ for attention layers and $0.839$ for MLP layers. We provide theoretical intuition for why this predictor succeeds by connecting it to standard SVD truncation bounds and error composition mechanisms in transformer layers. These findings enable a predict-then-compress workflow: compute $\gamma \cdot \bar{\rho}_s$ from weights, estimate degradation, and invest compute only in desirable configurations.
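The two quantities in the predictor are cheap to compute from weights alone; a minimal sketch follows (per-matrix; the paper aggregates a mean stable rank $\bar{\rho}_s$ over layers, and the toy matrices below are our assumptions):

```python
import numpy as np

def stable_rank(W):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    return float(np.sum(s ** 2) / s[0] ** 2)

def degradation_predictor(W, compression_ratio):
    """The gamma * rho_s score: compression ratio times stable rank."""
    return compression_ratio * stable_rank(W)

rng = np.random.default_rng(0)
W_lowrank = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))  # ~rank 8
W_full = rng.normal(size=(64, 64))
# Effectively low-rank weights score lower, predicting milder degradation
# under the same compression ratio gamma.
score_low = degradation_predictor(W_lowrank, 0.5)
score_full = degradation_predictor(W_full, 0.5)
```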
Authors: Emanuel Sommer, Rickmer Schulte, Sarah Deubner, Julius Kobialka, David R\"ugamer
Abstract: Bayesian Deep Ensembles (BDEs) represent a powerful approach for uncertainty quantification in deep learning, combining the robustness of Deep Ensembles (DEs) with flexible multi-chain MCMC. While DEs are affordable in most deep learning settings, (long) sampling of Bayesian neural networks can be prohibitively costly. Yet, adding sampling after optimizing the DEs has been shown to yield significant improvements. This leaves a critical practical question: How long should the sequential sampling process continue to yield significant improvements over the initial optimized DE baseline? To tackle this question, we propose a stopping rule based on E-values. We formulate the ensemble construction as a sequential anytime-valid hypothesis test, providing a principled way to decide whether or not to reject the null hypothesis that MCMC offers no improvement over a strong baseline, to early stop the sampling. Empirically, we study this approach for diverse settings. Our results demonstrate the efficacy of our approach and reveal that only a fraction of the full-chain budget is often required.
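The anytime-valid machinery behind such a stopping rule is standard; the generic sketch below multiplies e-values against a Ville threshold (the concrete e-values, and how the crossing is mapped to halting the sampling, are the paper's design):

```python
def first_rejection_time(e_values, alpha=0.05):
    """Run a sequential test by multiplying e-values and return the
    first step t at which the running product reaches 1/alpha; by
    Ville's inequality this rejects the null at level alpha under any
    data-dependent stopping time. Returns None if never crossed."""
    product = 1.0
    for t, e in enumerate(e_values, start=1):
        product *= e
        if product >= 1.0 / alpha:
            return t
    return None

t_reject = first_rejection_time([2.0, 3.0, 4.0], alpha=0.05)  # 24 >= 20 at t=3
never = first_rejection_time([1.0, 1.0, 1.0], alpha=0.05)
```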
Authors: Michal Podstawski
Abstract: Small language models fine-tuned for graph property estimation have demonstrated strong in-distribution performance, yet their generalization capabilities beyond training conditions remain poorly understood. In this work, we systematically investigate the boundaries of structural inference in fine-tuned small language models along two generalization axes - graph size and graph family distribution - and assess domain-learning capability on real-world graph benchmarks. Using a controlled experimental setup with three instruction-tuned models in the 3-4B parameter class and two graph serialization formats, we evaluate performance on graphs substantially larger than the training range and across held-out random graph families. Our results show that fine-tuned models maintain strong ordinal consistency across structurally distinct graph families and continue to rank graphs by structural properties on inputs substantially larger than those seen during training, with distinct architecture-specific degradation profiles. These findings delineate where fine-tuned small language models generalize reliably, providing empirical grounding for their use in graph-based reasoning tasks.
Authors: Yann Bouquet, Alireza Khodamoradi, Sophie Y\'ang Shen, Kristof Denolf, Mathieu Salzmann
Abstract: Post-training quantization (PTQ) is essential for deploying large diffusion transformers on resource-constrained hardware, but aggressive 4-bit quantization significantly degrades generative performance. Low-rank approximation methods have emerged as a promising solution by appending auxiliary linear branches to restore performance. However, current state-of-the-art approaches assume these branches must retain high precision (W16A16) and rely on heavy, data-dependent calibration for initialization. We challenge both limitations with LoRaQ (Low-Rank Approximated Quantization), a simple, data-free calibration approach that optimizes quantization error compensation. By overcoming the need for high-precision branches, LoRaQ enables the first fully sub-16 bit pipeline, allowing the low-rank branch itself to be quantized. We demonstrate that, at equal memory overhead, LoRaQ outperforms the state-of-the-art methods in their native implementations on Pixart-$\Sigma$ and SANA. We also analyze mixed-precision configurations, showing that setups such as W8A8, W6A6, and W4A8 for the low-rank branch, alongside a W4 main layer, yield superior results while maintaining a fully quantized architecture compatible with modern mixed-precision hardware.
Authors: Thomas Asikis, Heinrich Nax
Abstract: Digital marketplaces processing billions of dollars annually represent critical infrastructure in sociotechnical ecosystems, yet their performance optimization lacks principled measurement frameworks that can inform algorithmic governance decisions regarding market efficiency and fairness from complex market data. From the orderbook data of double auction markets alone, an economist can say little about a market's actual allocative efficiency, because bids and asks do not reveal true maximum willingness to buy or true minimum willingness to sell. We turn to experimental data to address this issue, `inverting' the standard induced value approach of double auction experiments. Our aim is to predict key market features relevant to market efficiency, particularly allocative efficiency, using orderbook data only -- specifically bids, asks and price realizations, but not the induced reservation values -- as early as possible. Since there is no established model of strategically optimal behavior in these markets, and because orderbook data is highly unstructured, non-stationary and non-linear, we propose quantile-based normalization techniques that help us build general predictive models. We develop and train several models, including linear regressions and gradient boosting trees, leveraging quantile-based input from the underlying supply-demand model. Our models can predict allocative efficiency with reasonable accuracy from the earliest bids and asks, and these predictions improve with additional realized price data. The performance of the prediction techniques varies by target and market type. Our framework holds significant potential for application to real-world market data, offering valuable insights into market efficiency and performance, even prior to any trade realizations.
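Quantile-based normalization of the kind mentioned above can be illustrated with a minimal sketch: map each raw bid or ask to its empirical quantile within a reference sample, so inputs become comparable across markets with different price scales. This is a generic illustration under our own assumptions, not the paper's exact recipe.

```python
import numpy as np

def quantile_normalize(values, reference):
    """Map raw bid/ask magnitudes to their empirical quantile in [0, 1]
    within a sorted reference sample, making model inputs robust to
    scale differences and non-stationarity across markets."""
    ref = np.sort(np.asarray(reference, dtype=float))
    ranks = np.searchsorted(ref, values, side="right")
    return ranks / len(ref)
```

A value below every reference point maps to 0, a value at or above the maximum maps to 1, and everything in between lands at its rank fraction.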
Authors: Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo
Abstract: In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
Authors: Tim Goppelsroeder, Rasmus Jensen
Abstract: We propose MADDPG-K, a scalable extension to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi-agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. MADDPG-K mitigates this by restricting each agent's critic to the $k$ closest agents under a chosen metric (Euclidean distance in our case), ensuring a constant-size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi-Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at https://github.com/TimGop/MADDPG-K .
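The constant-size critic input can be sketched as follows: concatenate an agent's own observation and action with those of its $k$ nearest neighbors under Euclidean distance. The function name and concatenation layout are illustrative assumptions, not the repository's exact interface.

```python
import numpy as np

def critic_input(positions, observations, actions, agent, k):
    """Constant-size critic input in the spirit of MADDPG-K: the agent's own
    observation/action plus those of its k nearest neighbors by Euclidean
    distance. positions, observations, actions are per-agent numpy arrays."""
    dists = np.linalg.norm(positions - positions[agent], axis=1)
    dists[agent] = np.inf                 # never select the agent itself
    neighbors = np.argsort(dists)[:k]     # cheap scalar distance ranking
    parts = [observations[agent], actions[agent]]
    for j in neighbors:
        parts.append(observations[j])
        parts.append(actions[j])
    return np.concatenate(parts)
```

The input dimension depends only on $k$, so the critic network never grows as agents are added; only the scalar distance computation scales with agent count.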
Authors: Arkadii Kazanskii, Tatiana Petrova, Konstantin Bagrianskii, Aleksandr Puzikov, Radu State
Abstract: Drifting Models [Deng et al., 2026] train a one-step generator by evolving samples under a kernel-based drift field, avoiding ODE integration at inference. The original analysis leaves two questions open: the drift-field iteration admits a locally repulsive regime in a two-particle surrogate, and vanishing of the drift ($V_{p,q}\equiv 0$) is not known to force the learned distribution $q$ to match the target $p$. We derive a contraction threshold for the surrogate and show that a linearly-scheduled friction coefficient gives a finite-horizon bound on the error trajectory. Under a Gaussian kernel we prove that the drift-field equilibrium is identifiable: vanishing of $V_{p,q}$ on any open set forces $q=p$, closing the converse of Proposition 3.1 of Deng et al. Our friction-augmented model, DMF (Drifting Model with Friction), matches or exceeds Optimal Flow Matching on FFHQ adult-to-child domain translation at 16x lower training compute.
Authors: Muhammad Rajabinasab, Arthur Zimek
Abstract: Feature selection is a fundamental machine learning and data mining task concerned with discriminating redundant features from informative ones. It attempts to address the curse of dimensionality by removing redundant features while, unlike dimensionality reduction methods, preserving explainability. Feature selection is conducted in both supervised and unsupervised settings, with different evaluation metrics employed to determine which feature selection algorithm is best. In this paper, we propose FSEVAL, a feature selection evaluation toolbox accompanied by a visualization dashboard that makes it easy to comprehensively evaluate feature selection algorithms. FSEVAL aims to provide a standardized, unified evaluation and visualization toolbox that helps researchers in the field conduct extensive and comprehensive evaluations of feature selection algorithms.
Authors: Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
Abstract: In large-scale distributed scenarios, increasingly complex tasks demand more intelligent collaboration across networks, requiring the joint extraction of structural representations from data samples. However, conventional task-specific approaches often result in nonstructural embeddings, leading to collapsed variability among data samples within the same class, particularly in classification tasks. To address this issue and fully leverage the intrinsic structure of data for downstream applications, we propose a novel distributed learning framework that ensures both diverse and discriminative representations. For independent and identically distributed (i.i.d.) data, we reformulate and decouple the global optimization function by introducing constraints on representation variance. The update rules are then derived and simplified using a primal-dual approach. For non-i.i.d. data distributions, we tackle the problem by clustering and virtually replicating nodes, allowing model updates within each cluster using block coordinate descent. In both cases, the resulting optimal solutions are theoretically proven to maintain discriminative and diverse properties, with a guaranteed convergence for i.i.d. conditions. Additionally, semantic information from representations is shared among nodes, reducing the need for common neural network architectures. Finally, extensive simulations on MNIST, CIFAR-10 and CIFAR-100 confirm the effectiveness of the proposed algorithms in capturing global structural representations.
Authors: Wei Chen, Yubing Wu, Junmei Yang, Delu Zeng, Qibin Zhao, John Paisley, Min Chen, Zhou Wang
Abstract: Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based objectives suppress the chosen response along with the rejected one, a phenomenon known as likelihood displacement, and no general mechanism currently prevents this across objectives. We bridge this gap by presenting a unified \emph{incentive-score decomposition} of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the \emph{disentanglement band} (DB), a simple, testable condition that characterizes when training can avoid likelihood displacement by realizing the preferred pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose a plug-and-play \emph{reward calibration} (RC) that adaptively rebalances chosen versus rejected updates to satisfy the DB and mitigate likelihood displacement, without redesigning the base objective. Empirical results show that RC steers training toward more disentangled dynamics and often improves downstream performance across a range of objectives. Our code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.
URLs: https://github.com/IceyWuu/DisentangledPreferenceOptimization
Authors: Fernando Reitich
Abstract: Large language models are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are evaluated only by end-to-end accuracy, giving limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, separating correction ($E_0=0\to E_1=1$) from corruption ($E_0=1\to E_1=0$) through two rates: $c=\Pr(E_1=1\mid E_0=0)$ and $\gamma=\Pr(E_1=0\mid E_0=1)$. These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. We identify three failure mechanisms. Under mixture shift, pooled estimates of $(c,\gamma)$ become biased when calibration and deployment mixtures differ; conditioning on a difficulty proxy restores stability without additional model calls. Under presentation contamination, selection protocols alter the interface through stable presentation artifacts when candidate content is fixed. Under state insufficiency, the correctness bit may not carry enough history for multi-step pipelines to compose predictably; a Markov factorization test identifies when composition is valid and where additional state is needed. When a protocol step passes these diagnostics, it becomes an auditable module: gated by estimated gain, conditioned on a difficulty proxy to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks and on GSM8K, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.
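The paired-outcome rates are straightforward to estimate from logged correctness bits. A minimal sketch, using the fact that with baseline accuracy $a_0$ the expected accuracy change of one protocol step is $(1-a_0)\,c - a_0\,\gamma$; the function name and return layout are ours.

```python
def audit_rates(e0, e1):
    """Estimate the correction rate c = P(E1=1 | E0=0), the corruption rate
    gamma = P(E1=0 | E0=1), and the implied accuracy change
    (1 - a0) * c - a0 * gamma, where a0 is baseline accuracy."""
    wrong = [b for a, b in zip(e0, e1) if a == 0]   # post-step bits where baseline failed
    right = [b for a, b in zip(e0, e1) if a == 1]   # post-step bits where baseline succeeded
    c = sum(wrong) / len(wrong) if wrong else 0.0
    gamma = sum(1 - b for b in right) / len(right) if right else 0.0
    a0 = len(right) / len(e0)
    delta = (1.0 - a0) * c - a0 * gamma             # predicted accuracy change
    return c, gamma, delta
```

A protocol step would then be gated on the sign of `delta`: activated when the predicted gain is positive, suppressed otherwise.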
Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding
Abstract: Zeroth-Order (ZO) optimization presents a promising memory-efficient paradigm for fine-tuning Large Language Models by relying solely on forward passes. However, its practical adoption is severely constrained by slow wall-clock convergence and high estimation variance. In this work, we dissect the runtime characteristics of ZO algorithms and identify a critical system bottleneck where the generation of perturbations and parameter updates accounts for over 40% of the training latency. We argue that the standard uniform exploration strategy is fundamentally flawed as it fails to account for the heterogeneous sensitivity of layers in deep networks, resulting in computationally wasteful blind searches. To address this structural mismatch, we propose AdaLeZO, an Adaptive Layer-wise ZO optimization framework. By formulating the layer selection process as a non-stationary Multi-Armed Bandit problem, AdaLeZO dynamically allocates the limited perturbation budget to the most sensitive parameters. We further introduce an Inverse Probability Weighting mechanism based on sampling with replacement, which guarantees unbiased gradient estimation while effectively acting as a temporal denoiser to reduce variance. Extensive experiments on LLaMA and OPT models ranging from 6.7B to 30B parameters demonstrate that AdaLeZO achieves 1.7x to 3.0x wall-clock acceleration compared to state-of-the-art methods. Crucially, AdaLeZO functions as a universal plug-and-play module that seamlessly enhances the efficiency of existing ZO optimizers without incurring additional memory overhead.
Authors: Youyuan Long, Gokhan Solak, Arash Ajoudani
Abstract: Accurate dynamical modeling is essential for simulation and control of embodied systems, yet first-principles models of electromechanical systems often fail to capture complex dissipative effects such as joint friction, stray losses, and structural damping. While residual-learning physics-informed neural networks (PINNs) can effectively augment imperfect first-principles models with data-driven components, the residual terms are typically implemented as unconstrained multilayer perceptrons (MLPs), which may inadvertently inject artificial energy into the system. To more faithfully model the dissipative dynamics, we propose DiLaR-PINN, a dissipative latent residual PINN designed to learn unmodeled dissipative effects in a physically consistent manner. Structurally, the residual network operates only on unmeasurable (latent) state components and is parameterized in a skew-dissipative form that guarantees non-increasing energy for any choice of network parameters. To enable stable and data-efficient training under partial measurability of the state, we further develop a recurrent rollout scheme with a curriculum-based sequence length extension strategy. We validate DiLaR-PINN on a real-world helicopter system and compare it against four baselines: a pure physical model (without a residual network), an unstructured residual MLP, a DiLaR variant with a soft dissipativity constraint, and a black-box LSTM. The results demonstrate that DiLaR-PINN more accurately captures dissipative effects and achieves superior long-horizon extrapolation performance.
Authors: Etienne Tajeuna, Patrick Asante Owusu, Armelle Brun, Shengrui Wang
Abstract: In this paper, we investigate forecasting coevolving time series that feature intricate dependencies and nonstationary dynamics using a Large Language Model (LLM) approach. We propose a novel modeling approach named Context-Aware AR-LLM (CAARL) that provides an interpretable framework to decode the contextual dynamics influencing changes in coevolving series. CAARL decomposes time series into autoregressive segments, constructs a temporal dependency graph, and serializes this graph into a narrative to allow processing by an LLM. This design yields a chain-of-thought-like reasoning path where intermediate steps capture contextual dynamics and guide forecasts in a transparent manner. By linking prediction to explicit reasoning traces, CAARL enhances interpretability while maintaining accuracy. Experiments on real-world datasets validate its effectiveness, positioning CAARL as a competitive and interpretable alternative to state-of-the-art forecasting methods.
Authors: Peter L. Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko
Abstract: We address the problem of planning in an environment with deterministic dynamics and stochastic rewards with discounted returns. The optimal value function is not known, nor are the rewards bounded. We propose Platypoos, a simple scale-free planning algorithm that adapts to the unknown scale and smoothness of the reward function. We provide a sample complexity analysis for Platypoos that improves upon prior work and holds simultaneously over a broad range of discount factors and reward scales, without the algorithm knowing them. We also establish a matching lower bound showing our analysis is optimal up to constants.
Authors: Meheru Zannat
Abstract: Parkinson's disease (PD) is a chronic neurodegenerative disease. It shows multiple motor symptoms such as tremor, bradykinesia, postural instability, and freezing of gait (FoG). PD is currently diagnosed clinically through physical exams by health-care professionals, which can be time-consuming and highly subjective. Wearable IMU sensors have become a promising gateway for passive monitoring of PD patients. We propose a self-supervised cross-attention encoder that processes bilateral wrist-worn IMU signals from a public dataset called PADS, consisting of three groups, PD, HC (healthy control), and DD (differential diagnosis), with a total of 469 subjects. We achieved a mean accuracy of 93.12% for HC vs. PD classification and 87.04% for PD vs. DD classification. The results emphasize the clinical challenge of distinguishing Parkinson's from other neurodegenerative diseases. Self-supervised representation learning using a contrastive InfoNCE loss achieved an accuracy of 93.56% for HC vs. PD and 92.50% for PD vs. DD using only 20% of the labelled data. This demonstrates the effectiveness of our method in transfer learning for clinical use with minimal labels. Real-time applicability was tested by deploying the optimized model, with a mean inference time of 48.32 ms per window on a Raspberry Pi CPU.
Authors: Mert Can Turkmen, Eng Leong Tan, Yee Hui Lee
Abstract: Most data-driven ionospheric forecasting models operate on gridded products, which do not preserve the time-varying sampling structure of satellite-based sensing. We instead model the ionosphere as a dynamic graph over ionospheric pierce points (IPPs), with connectivity that evolves as satellite positions change. Because satellite trajectories are predictable, the graph topology over the forecast horizon can be constructed in advance. We exploit this property to condition forecasts on the future graph structure, which we term ephemeris conditioning. This enables prediction on lines of sight that appear only in the forecast horizon. We evaluate our framework on multi-GNSS (Global Navigation Satellite System) data from a co-located receiver pair in Singapore spanning January 2023 through April 2025. The task is to forecast Rate of TEC Index (ROTI)-defined irregularities at 5-minute cadence up to 2 hours ahead as binary probabilistic classification per node. The resulting model, IonoDGNN, achieves a Brier Skill Score (BSS) of 0.49 and a precision-recall area under the curve (PR-AUC) of 0.75, improving over persistence by 35\% in BSS and 52\% in PR-AUC, with larger gains at longer lead times. Ablations confirm that graph structure and ephemeris conditioning each contribute meaningfully, with conditioning proving essential for satellites that rise during the forecast horizon (receiver operating characteristic AUC: 0.95 vs.\ 0.52 without). Under simulated coverage dropout, the model retains predictive skill on affected nodes through spatial message passing from observed neighbors. These results suggest that dynamic graph forecasting on evolving lines of sight is a viable alternative to grid-based representations for ionospheric irregularity forecasting. The model and evaluation code will be released upon publication.
Authors: Esteban Rodr\'iguez-Betancourt, Edgar Casasola-Murillo
Abstract: In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.
Authors: Takato Yasuno
Abstract: Daily infrastructure management in preparation for disasters is critical for urban resilience. When bridges remain resilient against disaster-induced external forces, access to hospitals, shops, and residences via metapaths can be sustained, maintaining essential urban functions. However, prioritizing bridge maintenance under limited budgets requires quantifying the multi-dimensional roles that bridges play in disaster scenarios -- a challenge that existing single-indicator approaches fail to address. We focus on metapaths from national highways through bridges to buildings (hospitals, shops, residences), constructing a heterogeneous graph with road, bridge, and building layers. A Relation-centric Graph Convolutional Network Variational Autoencoder (R-GCN-VGAE) learns metapath-based feature representations, enabling classification of bridges into disaster-preparedness categories: Supply Chain (commercial logistics), Medical Access (emergency healthcare), and Residential Protection (preventing isolation). Using OSMnx and open data, we validate our methodology on three diverse cities in Ibaraki Prefecture, Japan: Mito (697 bridges), Chikusei (258 bridges), and Moriya (148 bridges), totaling 1,103 bridges. The heterogeneous graph construction from open data enables redefining bridge roles for disaster scenarios, supporting maintenance budget decision-making. Our contributions are: (1) an open-data methodology for constructing urban heterogeneous graphs; (2) a redefinition of bridge roles for disaster scenarios via metapath-based classification; (3) a maintenance budget decision-support methodology; (4) a k-NN tuning strategy validated across diverse city scales; and (5) an empirical demonstration of UMAP's superiority over t-SNE/PCA for multi-role bridge visualization.
Authors: Zhenhua Dang, Lei Zhang, Long Wang, Guowei He
Abstract: Data-driven discovery of governing equations has advanced significantly in recent years; however, existing methods often struggle in multiscale systems where dynamically significant terms may have small coefficients. Therefore, we propose Balance-Guided SINDy (BG-SINDy) inspired by the principle of dominant balance, which reformulates $\ell_0$-constrained sparse regression as a term-level $\ell_{2,0}$-regularized problem and solves it using a progressive pruning strategy. Terms are ranked according to their relative contributions to the governing equation balance rather than their absolute coefficient magnitudes. Based on this criterion, BG-SINDy alternates between least-squares regression and elimination of negligible terms, thereby preserving dynamically significant terms even when their coefficients are small. Numerical experiments on the Korteweg--de Vries equation with a small dispersion coefficient, a modified Burgers equation with vanishing hyperviscosity, a modified Kuramoto--Sivashinsky equation with multiple small-coefficient terms, and a two-dimensional reaction--diffusion system demonstrate the validity of BG-SINDy in discovering small-coefficient terms. The proposed method thus provides an efficient approach for discovering governing equations that contain small-coefficient terms.
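The balance-guided pruning criterion can be sketched in a few lines: after each least-squares fit, drop the candidate term whose contribution $\|\xi_j \Theta_j\|$ to the equation balance is smallest, so small-coefficient but dynamically significant terms (large column norms) survive. This simplified sketch omits the paper's $\ell_{2,0}$ multi-equation formulation; the function name and interface are ours.

```python
import numpy as np

def balance_guided_prune(Theta, dxdt, n_terms):
    """Progressive pruning in the spirit of BG-SINDy: alternate least-squares
    regression with elimination of the term contributing least to the
    governing-equation balance, ranked by ||xi_j * Theta_j|| rather than by
    the absolute coefficient magnitude |xi_j|."""
    active = list(range(Theta.shape[1]))
    while len(active) > n_terms:
        xi, *_ = np.linalg.lstsq(Theta[:, active], dxdt, rcond=None)
        contrib = [abs(xi[j]) * np.linalg.norm(Theta[:, active[j]])
                   for j in range(len(active))]
        active.pop(int(np.argmin(contrib)))      # drop the weakest term
    xi, *_ = np.linalg.lstsq(Theta[:, active], dxdt, rcond=None)
    return active, xi
```

A term with a tiny coefficient but a large column norm (e.g. a small dispersion coefficient multiplying a large third-derivative term) contributes strongly to the balance and is therefore retained, which is exactly where coefficient-magnitude thresholding fails.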
Authors: Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini
Abstract: Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
Authors: Hanfeng Zhai, Hongtao Qiao, Hassan Mansour, Christopher Laughman
Abstract: We present a scalable, data-driven simulation framework for large-scale heating, ventilation, and air conditioning (HVAC) systems that couples physics-informed neural ordinary differential equations (PINODEs) with differential-algebraic equation (DAE) solvers. At the component level, we learn heat-exchanger dynamics using an implicit PINODE formulation that predicts conserved quantities (refrigerant mass $M_r$ and internal energy $E_\text{hx}$) as outputs, enabling physics-informed training via automatic differentiation of mass/energy balances. Stable long-horizon prediction is achieved through gradient-stabilized latent evolution with gated architectures and layer normalization. At the system level, we integrate learned components with DAE solvers (IDA and DASSL) that explicitly enforce junction constraints (pressure equilibrium and mass-flow consistency), and we use Bayesian optimization to tune solver parameters for accuracy--efficiency trade-offs. To reduce residual system-level bias, we introduce a lightweight corrector network trained on short trajectory segments. Across dual-compressor and scaled network studies, the proposed approach attains multi-fold speedups over high-fidelity simulation while keeping errors low (MAPE below a few percent) and scales to systems with up to 32 compressor--condenser pairs.
Authors: Florian Kittler, Sheethal Bhat, Andreas Maier
Abstract: Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.
Authors: Chongxiao Li, Pengwei Jin, Di Huang, Guangrun Sun, Husheng Han, Jianan Mu, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Hanjun Wei, Tianyun Ma, Shuyao Cheng, Rui Zhang, Ying Wang, Zidong Du, Qi Guo, Xing Hu
Abstract: Performance, power, and area (PPA) optimization is a fundamental task in RTL design, requiring a precise understanding of circuit functionality and the relationship between circuit structures and PPA metrics. Recent studies attempt to automate this process using LLMs, but neither feedback-based nor knowledge-based methods are efficient enough, as they either design without any prior knowledge or rely heavily on human-summarized optimization rules. In this paper, we propose AutoPPA, a fully automated PPA optimization framework. The key idea is to automatically generate optimization rules that enhance the search for optimal solutions. To do this, AutoPPA employs an Explore-Evaluate-Induce ($E^2I$) workflow that contrasts and abstracts rules from diverse generated code pairs rather than manually defined prior knowledge, yielding better optimization patterns. To make the abstracted rules more generalizable, AutoPPA employs an adaptive multi-step search framework that adopts the most effective rules for a given circuit. Experiments show that AutoPPA outperforms both the manual optimization and the state-of-the-art methods SymRTLO and RTLRewriter.
Authors: Sijie Mai, Shiqin Han
Abstract: Multimodal affective computing aims to predict humans' sentiment, emotion, intention, and opinion using language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shifts or noisy modalities. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we introduce a theoretically grounded disentanglement method that separates each modality into `causal invariant representation' and `environment-specific spurious representation' from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint. Experiments across multiple multimodal benchmarks demonstrate that CmIR achieves state-of-the-art performance. CmIR particularly excels on out-of-distribution data and noisy data, confirming its robustness and generalizability.
Authors: Yidi Yuan
Abstract: Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and hence affect its geometric impact. We applied STP at consecutive semantic reasoning step boundaries and achieved 168x more accurate multi-step latent prediction than frozen baselines on ProcessBench (3,400 samples), compared to only 4x for the random-token STP. Probing the latent manifold with a learned non-linear predictor reveals that STP-shaped trajectories are smooth curves, not straight lines: a 3-layer MLP reduces prediction error by a further 3-12x over linear extrapolation on step-boundary models. Removing the language modeling loss yields trajectories that are 2x more MLP-predictable than the combined loss, revealing a tradeoff between generation quality and geometric purity. Our results identify sampling position as the critical variable in geometric regularization and establish multi-step latent prediction MSE as a new evaluation metric for this class of methods.
Authors: Chupei Tang, Junxiao Kong, Moyu Tang, Di Wang, Jixiu Zhai, Ronghao Xie, Shangkun Sima, Tianchi Lu
Abstract: Motivation: Peptide-protein interactions (PepPIs) are central to cellular regulation and peptide therapeutics, but experimental characterization remains too slow for large-scale screening. Existing methods usually emphasize either interaction prediction or peptide generation, leaving candidate prioritization, residue-level interpretation, and target-conditioned expansion insufficiently integrated. Results: We present an integrated framework for early-stage peptide screening that combines a partner-aware prediction and localization model (ConGA-PepPI) with a target-conditioned generative model (TC-PepGen). ConGA-PepPI uses asymmetric encoding, bidirectional cross-attention, and progressive transfer from pair prediction to binding-site localization, while TC-PepGen preserves target information throughout autoregressive decoding via layerwise conditioning. In five-fold cross-validation, ConGA-PepPI achieved 0.839 accuracy and 0.921 AUROC, with binding-site AUPR values of 0.601 on the protein side and 0.950 on the peptide side, and remained competitive on external benchmarks. Under a controlled length-conditioned benchmark, 40.39% of TC-PepGen peptides exceeded native templates in AlphaFold 3 ipTM, and unconstrained generation retained evidence of target-conditioned signal.
Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin
Abstract: Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small part of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilizes a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3$\times$ acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy-step trade-off. Code is available at https://github.com/imagination-research/NI-Sampling.
Authors: Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia, Matei Zaharia, Noah A. Smith, Sewon Min
Abstract: Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7 evaluation categories), matching or exceeding re-training baselines (47.8 without mid-training, 50.5 with). We further show that modular training provides a structural advantage: by isolating each domain, it avoids the catastrophic forgetting that occurs when late-stage RL degrades capabilities from earlier training stages, while significantly reducing the cost and complexity of updating or adding a domain. Together, these results suggest that decoupled, expert-based training is a scalable alternative to monolithic retraining for extending language models.
Authors: Sarwan Ali, Taslim Murad
Abstract: Biological classification with interpretability remains a challenging task. For this, we introduce a novel encoding framework, Multi-Scale Reversible Chaos Game Representation (MS-RCGR), that transforms biological sequences into multi-resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS-RCGR employs rational arithmetic and hierarchical k-mer decomposition to generate scale-invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR-generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS-RCGR features consistently enhance classification performance across all paradigms. Notably, our hybrid approach combining pre-trained language model embeddings (ESM2, ProtT5) with MS-RCGR features achieves superior performance compared to either method alone. The reversibility property of our encoding ensures no information loss during transformation, while multi-scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS-RCGR provides a flexible, interpretable, and high-performing foundation for biological sequence analysis.
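The abstract above builds on the classic Chaos Game Representation. As a point of reference for readers unfamiliar with CGR, here is a minimal sketch of the standard (single-scale) encoding, using Python's exact rational arithmetic to illustrate the reversibility the paper emphasizes; the paper's MS-RCGR adds hierarchical k-mer decomposition and multi-scale features not reproduced here.

```python
from fractions import Fraction

# Corner assignment for the four nucleotides (classic CGR convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_encode(seq):
    """Map a DNA sequence to a trajectory of exact rational CGR points:
    each step moves to the midpoint between the current point and the
    corner of the incoming base."""
    x, y = Fraction(1, 2), Fraction(1, 2)
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def cgr_decode(points):
    """Invert the encoding exactly: the corner used at each step is
    corner = 2 * point - previous_point, which identifies the base."""
    seq = []
    x, y = Fraction(1, 2), Fraction(1, 2)
    for px, py in points:
        cx, cy = 2 * px - x, 2 * py - y
        for base, corner in CORNERS.items():
            if corner == (cx, cy):
                seq.append(base)
        x, y = px, py
    return "".join(seq)

seq = "ACGTTGCA"
assert cgr_decode(cgr_encode(seq)) == seq  # lossless round trip
```

With floating point the decode step accumulates rounding error for long sequences; exact rationals (or the paper's rational-arithmetic scheme) make the no-information-loss guarantee hold literally.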
Authors: Nicholas Thumiger, Andrea Bartezzaghi, Mattia Rigotti, Cezary Skura, Thomas Frick, Elisa Serioli, Fabrizio Arbucci, A. Cristiano I. Malossi
Abstract: Computational Fluid Dynamics (CFD) is central to race-car aerodynamic development, yet its cost -- tens of thousands of core-hours per high-fidelity evaluation -- severely limits the design space exploration feasible within realistic budgets. AI-based surrogate models promise to alleviate this bottleneck, but progress has been constrained by the limited complexity of public datasets, which are dominated by smoothed passenger-car shapes that fail to exercise surrogates on the thin, complex, highly loaded components governing motorsport performance. This work presents three primary contributions. First, we introduce a high-fidelity RANS dataset built on a parametric LMP2-class CAD model and spanning six operating conditions (map points) covering straight-line and cornering regimes, generated and validated by aerodynamics experts at Dallara to preserve features relevant to industrial motorsport. Second, we present the Gauge-Invariant Spectral Transformer (GIST), a graph-based neural operator whose spectral embeddings encode mesh connectivity to enhance predictions on tightly packed, complex geometries. GIST guarantees discretization invariance and scales linearly with mesh size, achieving state-of-the-art accuracy on both public benchmarks and the proposed race-car dataset. Third, we demonstrate that GIST achieves a level of predictive accuracy suitable for early-stage aerodynamic design, providing a first validation of the concept of interactive design-space exploration -- where engineers query a surrogate in place of the CFD solver -- within industrial motorsport workflows.
Authors: Worachit Amnuaypongsa, Yotsapat Suparanonrat, Pana Wanitchollakit, Jitkomut Songsiri
Abstract: This paper proposes a multi-step probabilistic forecasting framework using a single neural-network based model to generate simultaneous point and interval forecasts. Our approach ensures non-crossing prediction intervals (PIs) through a model structure design that strictly satisfies a target coverage probability (PICP) while maximizing sharpness. Unlike existing methods that rely on manual weight tuning for scalarized loss functions, we treat point and PI forecasting as a multi-objective optimization problem, utilizing multi-gradient descent to adaptively select optimal weights. Key innovations include a new PI loss function based on an extended log-barrier with an adaptive hyperparameter to guarantee the coverage, a hybrid architecture featuring a shared temporal model with horizon-specific submodels, and a dedicated training strategy. The proposed loss is scale-independent and universally applicable; combined with our training algorithm, the framework eliminates trial-and-error hyperparameter tuning for balancing multiple objectives. Validated by an intra-day solar irradiance forecasting application, results demonstrate that our proposed loss consistently outperforms those in current literature by achieving target coverage with the narrowest PI widths. Furthermore, when compared against LSTM encoder-decoder and Transformer architectures--including those augmented with Chronos foundation models--our method remains highly competitive and can be seamlessly adapted to any deep learning structure.
Authors: Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang, Haitao Mi, Dong Yu
Abstract: Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
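The core decoding idea in the abstract above, sampling uniformly over a constrained high-confidence candidate set rather than proportionally to the model's probabilities, can be illustrated with a minimal sketch. This is an assumption-laden toy (plain top-k truncation, no further constraints); the paper's CUTS may restrict candidates differently.

```python
import numpy as np

def cuts_sample(logits, k=3, rng=None):
    """Toy constrained-uniform top-k sampling: keep only the k
    highest-logit candidates, then draw uniformly among them,
    flattening the local distribution to encourage exploration."""
    rng = rng or np.random.default_rng()
    top_k = np.argsort(logits)[-k:]   # indices of the k largest logits
    return int(rng.choice(top_k))     # uniform over the candidate set

logits = np.array([2.0, 1.5, 0.1, -3.0, 0.9])
token = cuts_sample(logits, k=3)
assert token in {0, 1, 4}  # the three highest-logit tokens
```

Standard temperature sampling would draw token 0 far more often than token 4; the uniform draw gives each surviving candidate equal mass, which is what amplifies intra-group rollout diversity in the training framework described above.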
Authors: Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I. Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, Srini Venkatramanan
Abstract: Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding of how these methods perform on novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles data from multiple repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. The IDOBE dataset, along with baselines, is released publicly at https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.
Authors: Feras Al Taha, Eilyan Bitar
Abstract: We propose a distributionally robust approach to risk-sensitive estimation of an unknown signal x from an observed signal y. The unknown signal and observation are modeled as random vectors whose joint probability distribution is unknown, but assumed to belong to a given type-2 Wasserstein ball of distributions, termed the ambiguity set. The performance of an estimator is measured according to the conditional value-at-risk (CVaR) of the squared estimation error. Within this framework, we study the problem of computing affine estimators that minimize the worst-case CVaR over all distributions in the given ambiguity set. As our main result, we show that, when the nominal distribution at the center of the Wasserstein ball is finitely supported, such estimators can be exactly computed by solving a tractable semidefinite program. We evaluate the proposed estimators on a wholesale electricity price forecasting task using real market data and show that they deliver lower out-of-sample CVaR of squared error compared to existing methods.
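For readers unfamiliar with the risk measure used in the abstract above: the conditional value-at-risk (CVaR) at level $\alpha$ is the expected loss in the worst $(1-\alpha)$ fraction of outcomes. A minimal empirical estimator makes the objective concrete; the paper goes further and minimizes the worst case of this quantity over a Wasserstein ambiguity set, which this sketch does not attempt.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.9):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction
    of observed losses (here, squared estimation errors)."""
    losses = np.sort(np.asarray(losses, dtype=float))
    tail = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return losses[-tail:].mean()

errors = np.array([0.1, 0.2, 0.3, 4.0])    # squared estimation errors
print(empirical_cvar(errors, alpha=0.75))  # mean of the worst 25% -> 4.0
```

Unlike the plain mean (which here is 1.15), CVaR at $\alpha = 0.75$ is driven entirely by the single outlier, which is why it is the natural objective for risk-sensitive estimation.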
Authors: William Lavery, Jodie A. Cochrane, Christian Olesen, Dagim S. Tadele, John T. Nardini, Sara Hamis
Abstract: Physics-informed neural networks (PINNs) provide a powerful framework for learning governing equations of dynamical systems from data. Biologically-informed neural networks (BINNs) are a variant of PINNs that preserve the known differential operator structure (e.g., reaction-diffusion) while learning constitutive terms via trainable neural subnetworks, enforced through soft residual penalties. Existing BINN studies are limited to $1\mathrm{D}{+}t$ reaction-diffusion systems and focus on forward prediction, using the governing partial differential equation as a regulariser rather than an explicit identification target. Here, we extend BINNs to $2\mathrm{D}{+}t$ systems within a PINN framework that combines data preprocessing, BINN-based equation learning, and symbolic regression post-processing for closed-form equation discovery. We demonstrate the framework's real-world applicability by learning the governing equations of lung cancer cell population dynamics from time-lapse microscopy data, recovering $2\mathrm{D}{+}t$ reaction-diffusion models from experimental observations. The proposed framework is readily applicable to other spatio-temporal systems, providing a practical and interpretable tool for fast analytic equation discovery from data.
Authors: Ran Ben-Basat, Yaniv Ben-Itzhak, Gal Mendelson, Michael Mitzenmacher, Amit Portnoy, Shay Vargaftik
Abstract: This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any $b>0$ bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant$_{\text{mse}}$ is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to $S=1$. EDEN supports both biased and unbiased quantization, each optimized by a different $S$ (chosen via methods described in the EDEN works). The fixed choice $S=1$ used by TurboQuant is generally suboptimal, although the optimal $S$ for biased EDEN converges to $1$ as the dimension grows; accordingly TurboQuant$_{\text{mse}}$ approaches EDEN's behavior for large $d$. Second, TurboQuant$_{\text{prod}}$ combines a biased $(b-1)$-bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its $(b-1)$-bit step uses the suboptimal $S=1$; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased $(b-1)$-bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with $b$-bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized $S$) is more accurate than TurboQuant$_{\text{mse}}$, and unbiased EDEN is markedly more accurate than TurboQuant$_{\text{prod}}$, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant$_{\text{prod}}$). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried.
Authors: Manan Gupta, Dhruv Kumar
Abstract: Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer $l_{\text{crit}}$, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\mathbf{44.0\%}$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $\chi^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\chi^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel \textbf{detection-correction dissociation}: error-detection AUC peaks at layer~14 ($0.718$) but task accuracy peaks at layer~16 ($44.0\%$ vs.\ $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.
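The cosine-similarity plus entropy dual gate described in the abstract above can be sketched generically. Everything here beyond the two-condition structure is an assumption: the thresholds, the delta-based reversal test, and the function signature are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def phase_shift_gate(h_prev, h_curr, h_next, probs,
                     cos_thresh=-0.2, ent_thresh=2.5):
    """Hypothetical dual gate in the spirit of LPSR: flag a step when
    (a) the hidden-state trajectory reverses direction, i.e. the cosine
    between consecutive deltas is strongly negative, AND (b) next-token
    entropy is high, indicating the model is uncertain.
    Thresholds are illustrative placeholders."""
    d1, d2 = h_curr - h_prev, h_next - h_curr
    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2) + 1e-9)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return bool(cos < cos_thresh and entropy > ent_thresh)
```

When the gate fires, the method described above would roll back the KV-cache to the pre-shift position and inject a steering vector; requiring both conditions keeps benign direction changes (low entropy) and ordinary uncertainty (no reversal) from triggering spurious rollbacks.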
Authors: Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu, Rowland Pettit, Joshua E. Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, Faisal Mahmood
Abstract: Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.
Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
Abstract: Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
Authors: Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp F\"urnstahl, Bernhard Sch\"olkopf, Andreas Krause
Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
Authors: Liubomyr Horbatko
Abstract: Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails $O(\ell^{-\beta})$ for $0 < \beta < 1$, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
Authors: Mingxuan Wang, Satoshi Nakamura
Abstract: Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
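The straight-through trick that enables end-to-end feedback across TokenChain's discrete text interface is a standard estimator (Jang et al. style Gumbel-Softmax). A minimal numpy sketch of the mechanism, not the paper's implementation, follows; the backward-pass identity is noted in comments since numpy has no autograd.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of categorical sampling: perturb logits
    with Gumbel(0,1) noise, then apply a temperature-tau softmax."""
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1)
    y = np.exp((logits + g) / tau)
    return y / y.sum()

def straight_through(logits, tau=1.0, rng=None):
    """Forward pass emits a hard one-hot token (a usable discrete
    interface); in an autograd framework, gradients would flow through
    the soft sample instead: y = hard + soft - stop_grad(soft)."""
    soft = gumbel_softmax(logits, tau, rng)
    hard = np.eye(len(soft))[np.argmax(soft)]
    return hard  # e.g. in torch: hard + soft - soft.detach()

y = straight_through(np.array([1.0, 2.0, 0.5]))
assert y.sum() == 1.0 and set(np.unique(y)) <= {0.0, 1.0}  # one-hot
```

Lowering $\tau$ makes the soft sample approach the hard one-hot (reducing estimator bias but raising gradient variance), which is why the temperature schedule ablated in the abstract above matters for chain training.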
Authors: Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang
Abstract: Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.
Authors: Claudio Spiess, Prem Devanbu, Earl T. Barr
Abstract: LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits significant brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and establish the value of using perturbation to evaluate code models.
Authors: Yiyao Yang, Yasemin Gulbahar
Abstract: Artificial intelligence (AI) tutors have become increasingly popular in learning environments. In this study, we propose an AI agent prototype framework for exploring AI-assisted learning with temporal interaction patterns, multiple outcomes analysis, and behavioral-cognitive learner profiling. Based on three research questions, this study aims to investigate whether early interaction patterns can predict later performance and trust, how multiple outcomes can be traded off with different AI tutor feedback conditions, and if learner profiles can be identified with behavioral and cognitive indicators. An AI tutor agent has been developed to provide various feedback forms to learners, including hints, explanations, examples, and code. A neural policy model and a stochastic simulation framework are used to produce artificial student-AI tutor interaction records, which include response time, attempts, hint requests, correctness, quiz results, improvement, satisfaction, and trust. Temporal features are used to predict later correctness and trust with early interaction patterns, and clustering methods are used to find learner profiles. The results showed that early interaction patterns were predictive of later performance and trust, that student behavior changed over time with AI-based tutoring, and that latent student profiles could be identified based on their behavioral and cognitive differences.
Authors: Yunshuo Tian, Akayou Kitessa, Tanuja Chitnis, Yijun Zhao
Abstract: Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentiable feature partitioning. This work introduces a reciprocal co-training framework that couples an LLM with an RF classifier via reinforcement learning, creating an iterative feedback loop in which each model improves using signals from the other. Tabular data are reformulated into standardized textual representations for the LLM, whose embeddings augment the RF feature space, while calibrated RF probability estimates provide feedback signals that guide reinforcement learning updates of the LLM. Experiments across three medical datasets demonstrate consistent performance gains for both models, with particularly strong effects for the LLM. Ablation analyses show that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains. The proposed framework provides a general mechanism that allows incompatible model families to leverage each other's strengths through bidirectional adaptation.
Authors: Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong
Abstract: Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
Authors: Shaoyuan Huang, Xiaokai Wang, Na Yan, Xiaofei Wang, Wenyu Wang, Yansha Deng
Abstract: As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces shows that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.
Authors: Zhenglin Lai, Sirui Huang, Yuteng Li, Changxin Huang, Jianqiang Li, Bingzhe Wu
Abstract: Video-generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely evaluated. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving and composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT-based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
Authors: Jinkai Qiu, Alessandro Saviolo, Chaojie Wang, Mingke Wang, Xiaoyu Huang
Abstract: Realistic highway simulation is critical for scalable safety evaluation of autonomous vehicles, particularly for interactions that are too rare to study from logged data alone. Yet highway traffic generation remains challenging because it requires broad coverage across speeds and maneuvers, controllable generation of rare safety-critical scenarios, and behavioral credibility in multi-agent interactions. We present PHASE, Policy for Heterogeneous Agent Self-play on Expressway, a context-aware self-play framework that addresses these three requirements through explicit per-agent conditioning for controllability, synthetic scenario generation for broad highway coverage, and closed-loop multi-agent training for realistic interaction dynamics. PHASE further supports different vehicle profiles, for example, passenger cars and articulated trailer trucks, within a single policy via vehicle-aware dynamics and context-conditioned actions, and stabilizes self-play with early termination of unrecoverable states, at-fault collision attribution, highway-aware reward shaping, coupled curricula, and robust policy optimization. Despite being trained only on synthetic data, PHASE transfers zero-shot to 512 unseen high-interaction real scenarios in exiD, achieving a 96.3% success rate and reducing ADE/FDE from 6.57/12.07 m to 2.44/5.25 m relative to a prior self-play baseline. In a learned trajectory embedding space, it also improves behavioral realism over IDM, reducing Frechet trajectory distance by 13.1% and energy distance by 20.2%. These results show that synthetic self-play can provide a scalable route to controllable and realistic highway scenario generation without direct imitation of expert logs.
Authors: Jamal Toutouh
Abstract: This paper studies semi-supervised tabular classification in the extreme low-label regime using lightweight base learners. The paper proposes a cooperative coevolutionary method (CC-SSL) that evolves (i) two feature-subset views and (ii) a pseudo-labeling policy, and compares it to a matched monolithic evolutionary baseline (EA-SSL) and three lightweight SSL baselines. Experiments on 25 OpenML datasets with labeled fractions {1%,5%,10%} evaluate test MacroF1 and accuracy, together with evolutionary and pseudo-label diagnostics. CC-SSL and EA-SSL achieve higher median test MacroF1 than the lightweight baselines, with the largest separations at 1% labeled data. Most CC-SSL vs. EA-SSL comparisons are statistical draws on final test performance. EA-SSL shows higher best-so-far fitness and higher diversity during search, while time-to-target is comparable and generations-to-target favors EA-SSL in several multiclass settings. Pseudo-label volume, ProbeDrop, and validation optimism show no significant differences between CC-SSL and EA-SSL under the shared protocol.
Authors: Enock O. Ayiku, Evelyn Osei, Emebo Onyeka
Abstract: Fairness-aware recommender systems often mitigate bias by increasing exposure to under-represented or long-tail content, commonly through mechanisms that promote novelty and diversity. In practice, the strength of such interventions is typically controlled using global hyperparameters, fixed regularization weights, heuristic caps, or offline tuning strategies. These approaches implicitly assume that a single level of exploration is appropriate across users, contexts, and stages of interaction. In this work, we study exploration saturation as a user-dependent phenomenon arising from fairness- and novelty-driven recommendation strategies. We define exploration saturation as the point at which further increases in exploration no longer improve user utility and may instead reduce engagement or perceived relevance. Rather than proposing a new fairness-aware algorithm or optimizing a specific objective, we empirically analyze how increasing exploration affects users across varied recommendation models. Through longitudinal experiments using MovieLens-1M and Last.fm datasets, our results indicate that fairness-induced exploration exhibits diminishing or non-monotonic returns and varies substantially across users. In particular, users with limited interaction histories tend to reach saturation earlier, suggesting that uniform fairness or novelty pressure can disproportionately disadvantage certain users. These findings reveal a trade-off between fairness and user experience, suggesting that recommendation systems should adapt not only to relevance but also to the amount of fairness-driven exploration applied to individual users.
Authors: Jaafer Klila, Sondes Bannour Souihi, Rahma Boujelben, Nasredine Semmar, Lamia Hadrich Belguith
Abstract: The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG) that consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields accuracy gains of over 3 points on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.
Authors: Manoj Parmar
Abstract: State-Space Models (SSMs) -- structured SSMs (S4, S4D, DSS, S5), selective SSMs (Mamba, Mamba-2), and hybrid architectures (Jamba) -- are deployed in safety-critical long-context applications: genomic analysis, clinical time-series forecasting, and cybersecurity log processing. Their linear-time scaling is compelling, yet the security properties of their compressed-state recurrent architectures remain unstudied. We present the first systematic treatment of SSM safety, security, and cognitive risks. Seven contributions: (1) Formal threat framework -- SSM Attack Surface (five layers), State Integrity Violation (StIV), Cross-Context Amplification Ratio $\mathcal{X}_\mathcal{S}$, and a Spectral Sensitivity Proposition grounded in the $H_\infty$ norm. (2) Three novel attack classes: spectral adversarial attacks (transfer-function gain exploitation), delayed-trigger stateful backdoors (activate thousands of steps after injection), and state capacity saturation (entropy flooding forces silent forgetting). (3) 14 MITRE ATLAS technique extensions across the full tactic chain. (4) Six-profile attacker taxonomy with kill chains for genomics, clinical, and cybersecurity domains. (5) Four cognitive risk hypotheses grounded in state-compression mechanics. (6) Governance-aligned mitigations mapped to CREST, NIST AI 600-1, and EU AI Act. (7) Empirical evaluation: targeted genomic injection achieves $\mathrm{StIV}=0.519$ vs. $0.086$ random ($6.0\times$, $p<0.001$); PGD state injection achieves $156\times$ output perturbation over random; SSD-structured extraction confirmed at $O(N^2)$ vs. $O(N^3)$ query complexity ($N\times$ speedup). Validation on pretrained checkpoints is detailed in the Appendix.
Authors: Vsevolod Lazebnyi, Natalia Tereshkina, Maria Shabarina, Dmitriy Fedorov
Abstract: This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenge with existing techniques is their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the requirement for labor-intensive manual design of data pre-processing pipelines. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in a non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information using LLMs into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations by comparing the embeddings of multiple LLM outputs obtained with different temperature parameter values, combined with formalized rules for monitoring data consistency and integrity. The experimental findings indicate a high level of accuracy in the completion of key fields, as well as the robustness of the proposed methodology to changes in web page structures. This makes it suitable for use in tasks such as news content aggregation, monitoring, and log analysis in near real-time mode, with the capacity to scale rapidly in terms of the number of sources.
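The verification idea described here, comparing embeddings of multiple LLM outputs sampled at different temperatures and accepting a field only when they agree, can be sketched with a toy bag-of-words embedding. A real pipeline would use proper sentence embeddings; the function names and the 0.8 threshold below are illustrative assumptions, not the paper's actual implementation:

```python
import math
from collections import Counter

def bow_embed(text):
    """Toy bag-of-words embedding; stands in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def consistent(outputs, threshold=0.8):
    """Accept a generated field only if all pairwise similarities between
    outputs sampled at different temperatures clear the threshold."""
    embs = [bow_embed(o) for o in outputs]
    return all(cosine(embs[i], embs[j]) >= threshold
               for i in range(len(embs)) for j in range(i + 1, len(embs)))
```

Divergent outputs for the same field (low pairwise similarity) are flagged as likely hallucinations, while near-identical outputs pass to the rule-based consistency checks mentioned in the abstract.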
Authors: Nicholas CL Beale
Abstract: AI in applications like screening job applicants has become widespread and may contribute to unemployment, especially among the young. Biases in the AIs may become baked into the job selection process, but even in their absence, reliance on a single AI is problematic. In this paper we derive a simple formula to estimate, or at least place an upper bound on, the precision of such approaches for data resembling realistic CVs: $P(q) \approx \frac{\rho n^b + q(1-\rho)}{1 + (n^b - 1)\rho}$, where $P(q)$ is the precision of the top $q$ quantile selected by a panel of $n$ AIs, $\rho$ is their average pairwise correlation, $b \approx q^* + 0.8 (1 - \rho)$, and $q^*$ is $q$ clipped to $[0.07, 0.22]$. This equation provides a basis for considering how many AIs should be used in a Panel, depending on the importance of the decision. A quantitative discussion of the merits of using a diverse panel of AIs to support decision-making in such areas will help move away from dangerous reliance on single AI systems and encourage a balanced assessment of the extent to which diversity needs to be built into the AI parts of the socioeconomic systems that are so important for our future.
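The closed-form estimate in this abstract is straightforward to evaluate numerically. The minimal sketch below implements the formula as quoted, including the clipping of $q$ to $[0.07, 0.22]$; the function name and argument conventions are our own, not the paper's:

```python
def panel_precision(q, rho, n):
    """Estimate (or upper-bound) the precision P(q) of the top-q quantile
    selected by a panel of n AIs with average pairwise correlation rho,
    following the closed-form expression quoted in the abstract."""
    q_star = min(max(q, 0.07), 0.22)   # q* is q clipped to [0.07, 0.22]
    b = q_star + 0.8 * (1.0 - rho)     # b ~ q* + 0.8 (1 - rho)
    nb = n ** b
    return (rho * nb + q * (1.0 - rho)) / (1.0 + (nb - 1.0) * rho)

# Sanity check: with uncorrelated panellists (rho = 0) the expression
# reduces to P(q) = q, as expected for random agreement.
print(panel_precision(0.1, 0.0, 5))   # -> 0.1
```

With $\rho = 0.5$ and $q = 0.1$, the estimate rises from 0.55 for a single AI to roughly 0.78 for a panel of nine, which is the kind of comparison the abstract suggests for sizing a panel against the importance of the decision.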
Authors: Mark Walsh
Abstract: When a system commits to a hypothesis, much of the evidential structure behind that commitment is lost to compression. Standard accounts assume that selected content and scalar confidence suffice for downstream control. This paper argues that they do not, and that determining what must survive compression is itself a consequence-sensitive problem. We develop a recurrent arbitration architecture in which active constraint fields jointly determine a hypothesis geometry over candidates. Rather than carrying that geometry forward in full, the system compresses it into a support-aware control state whose resolution is regulated by current consequence geometry, arbitration memory, and resource constraints. A bounded objective formalizes the tradeoff. Too little retained support collapses policy-relevant distinctions, producing controllers that select content adequately while misrouting verification, abstention, and recovery. Too much retained support fragments learning across overly fine contexts, degrading adaptation even as discrimination improves. These failure modes yield ordered controller predictions confirmed by a minimal repeated-interaction simulation. Adaptive controllers that regulate support resolution outperform all fixed-resolution controllers in cumulative utility. Agile adaptive control outperforms sluggish adaptive control. Fixed high-resolution control achieves the best commitment accuracy but still trails adaptive controllers because resource cost and learning fragmentation offset the gains from richer retention. Support sufficiency should be understood not as a static representational threshold, but as a dynamic compression criterion. Robust arbitration depends on preserving the smallest support structure adequate for policy under the current consequence landscape, and on regulating that structure as conditions change across repeated cycles of inference and action.
Authors: Aref Ghoreishee, Abhishek Mishra, Lifeng Zhou, John Walsh, Anup Das, Nagarajan Kandasamy
Abstract: This paper develops an end-to-end fuzzy encoder-decoder architecture for enhancing vision-based multi-modal deep spiking Q-networks in autonomous driving. The method addresses two core limitations of spiking reinforcement learning: information loss stemming from the conversion of dense visual inputs into sparse spike trains, and the limited representational capacity of spike-based value functions, which often yields weakly discriminative Q-value estimates. The encoder introduces trainable fuzzy membership functions to generate expressive, population-based spike representations, and the decoder uses a lightweight neural decoder to reconstruct continuous Q-values from spiking outputs. Experiments on the HighwayEnv benchmark show that the proposed architecture substantially improves decision-making accuracy and closes the performance gap between spiking and non-spiking multi-modal Q-networks. The results highlight the potential of this framework for efficient and real-time autonomous driving with spiking neural networks.
Authors: Arjan Mahmuod, Adrian Rod Hammerstad, Muzaffar Yousef, Yngve Sebastian Heill, Jonas L. Isaksen, J{\o}rgen K. Kanters, Pal Halvorsen, Vajira Thambawita
Abstract: Deep learning models for atrial fibrillation (AF) detection are increasingly trained on heterogeneous electrocardiogram (ECG) datasets with varying sampling frequencies, yet the specific consequences of these discrepancies on model performance, calibration, and robustness remain insufficiently characterized. To address this, we conducted a systematic benchmark using 12-lead, 10-second recordings from the PTB-XL dataset, resampled to target frequencies of 62, 100, 250, and 500 Hz, to evaluate a standard 1-D Convolutional Neural Network (CNN) and a hybrid CNN-Long Short-Term Memory (LSTM) architecture under a rigorous patient-safe cross-validation framework. Our analysis reveals that sampling frequency significantly impacts detection metrics in an architecture-dependent manner; the hybrid CNN-LSTM model demonstrated optimal performance and consistent calibration at intermediate frequencies (100-250 Hz), whereas the 1-D CNN baseline exhibited marked degradation in accuracy and sensitivity at 500 Hz, suggesting increased susceptibility to high-frequency noise. We conclude that ECG sampling frequency is a critical, underappreciated factor in arrhythmia detection, and future foundation models must explicitly control for temporal resolution to ensure clinical reliability and reproducibility.
Authors: Zhuo Diao, Yueting Li, Jianpeng Wang, Shengyu Guan, Xinwei Wang, Wenxiong Cui, Xin Shi, Tong Liu, Kailai Sun, Jingyu Wang, Dian Fan, Thomas Penzel
Abstract: Sleep staging is essential for the assessment of sleep quality and the diagnosis of sleep-related disorders. Conventional polysomnography (PSG), while considered the gold standard, is intrusive, labor-intensive, and unsuitable for long-term monitoring. This study evaluates the performance of the Sleepal AI Lamp, a contactless, radar-based consumer-grade sleep tracker, in comparison with gold-standard polysomnography (PSG), using a large-scale dataset comprising 1022 overnight recordings. We extract multi-scale respiratory and motion-related features from radar signals to train a frequency-augmented deep learning model. For the binary sleep-wake classification task, experimental results demonstrated that the model achieved an accuracy of 92.8% alongside a macro-averaged F1 score of 0.895. For four-stage classification (wake, light NREM (N1 + N2), deep NREM (N3), REM), the model achieved an accuracy of 78.5% with a Cohen's kappa coefficient of 0.695 in healthy individuals and maintained a stable accuracy of 77.2% with a kappa of 0.677 in a heterogeneous population including patients with varying severities of obstructive sleep apnea (OSA). These experimental results demonstrate that the sleep staging performance of the contactless Sleepal AI Lamp is in high agreement with expert-labeled PSG sleep stages. Our findings suggest that non-contact radar sensing, combined with advanced temporal modeling, can provide reliable sleep staging performance without requiring physical contact or wearable devices. Owing to its unobtrusive nature, ease of deployment, and robustness to long-term use, the contactless Sleepal AI Lamp shows strong potential for clinical screening, home-based sleep assessment, and continuous longitudinal sleep monitoring in real-world medical and healthcare applications.
Authors: Felix Koch, Fabian Raisch, Benjamin Tischler
Abstract: Data-driven models for building thermal dynamics are a scalable approach for enabling energy-efficient operation through fault detection & diagnosis or advanced control. To obtain accurate models, measurement data from a target building spanning months to years are required. Transfer Learning (TL) mitigates this challenge by employing pretrained models based on single or multiple source buildings. General multi-source TL models promise to outperform single-source TL, but alternative multi-source modeling architectures remain to be explored, and evaluation on real-world data is missing. Moreover, time series foundation models (TSFM) have emerged as candidates for the best-performing general models. Hence, we conduct a first, comprehensive assessment of general modeling approaches for building thermal dynamics, including multi-source TL and TSFMs. Our assessment includes ablations using four state-of-the-art multi-source TL architectures and evaluations on synthetic as well as real-world data. We demonstrate that multi-source TL models are highly effective in accurately modeling buildings in real-world applications, yielding up to 63% lower forecasting errors compared to single-source TL. Moreover, our results suggest a trade-off between multi-source TL models exclusively pretrained with building data and TSFMs pretrained with a multitude of different time series, revealing that data from 16-32 source buildings must be available over 1 year for pretraining multi-source TL models to consistently outperform TSFMs as evaluated using the mean absolute error. These findings provide practical guidance for selecting modeling strategies based on the number of source buildings available for pretraining multi-source TL models.
Authors: Giovanna Sannino, Ivanoe De Falco, Nadia Brancati, Laura Verde, Maria Frucci, Daniel Riccio, Vincenzo Bevilacqua, Antonio Di Marino, Lucia Aruta, Valentina Virginia Iuzzolino, Gianmaria Senerchia, Myriam Spisto, Raffaele Dubbioso
Abstract: Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the "Speech Analysis for Neurodegenerative Diseases" (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.
Authors: Junwen Ma, Huhu Xue, Xingyuan Zhao, and Weicheng Fu
Abstract: Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of $7.52\%$ and a symbol error rate (SyER) of $0.45\%$, with pitch, type, and note accuracies of $99.33\%$, $99.60\%$, and $99.28\%$, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of $8.11\%$ and a SyER of $0.49\%$, with pitch, type, and note accuracies of $99.27\%$, $99.58\%$, and $99.21\%$, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
Authors: Kang Yang, Walid A. Hanafy, Prashant Shenoy, Mani Srivastava
Abstract: As edge AI deployments scale to billions of devices running always-on, real-time compound AI pipelines, they represent a massive and largely unmanaged source of energy consumption and carbon emissions. To reduce carbon emissions while maximizing Quality-of-Service (QoS), this paper proposes FM-CAC, a proactive carbon-aware control framework that leverages a battery as an active temporal buffer. By decoupling energy acquisition from energy consumption, FM-CAC can maximize the use of low-carbon energy, substantially reducing carbon emissions. At each control step, FM-CAC jointly optimizes the software pipeline variant, the hardware operating point, and the battery charging and discharging actions. To support this decision process, FM-CAC leverages edge-friendly Time-Series Foundation Models (TSFMs) for zero-shot carbon forecasting and integrates these forecasts into a dynamic programming solver with deferred cost attribution to prevent myopic battery depletion. Results show that FM-CAC reduces carbon emissions by up to 65.6% while maintaining near-maximum inference accuracy.
Authors: Nick Souligne, Vignesh Subbian
Abstract: Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.
Authors: Timothy B. Higgins, Antonios Mamalakis, Chirag Agarwal
Abstract: Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial and temporal scales. Given the complexity of atmospheric phenomena, it is critical to verifiably quantify the effectiveness of existing VLMs on weather forecasting data. In this work, we present SynopticBench, a high-quality dataset consisting of 1,367,041 text samples of Area Forecast Discussions created by the National Weather Service over the continental United States, paired with images of 500 mb geopotential height, 2-meter temperature, and 850 mb wind velocity in weather forecasts. We also present Synoptic Phenomena Alignment and Coverage Evaluation (SPACE), a novel evaluation framework that can be used to effectively estimate the quality of text descriptions of synoptic weather phenomena. Extensive experiments on generating forecast discussions using state-of-the-art VLMs show the sensitivity of existing evaluation metrics in this domain and enable further exploration into synoptic weather and climate text generation.
Authors: Smit Nautambhai Modi, Gandharv Mahajan, Marc Wetter, Randall Welles
Abstract: Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recurring failure patterns in post-interruption continuations: contextual inertia, interruption amnesia, and objective displacement. The benchmark generates scenario-driven conversations and injects interruptions at a standardized point relative to assistant speech onset, enabling controlled cross-model comparison. In a paired half-duplex control, total failures drop by 40.2% relative to interrupted runs, indicating that many errors are driven by state-update reasoning under interruption rather than task difficulty alone. Across evaluated real-time voice models, no system exceeds a 50% pass rate, showing substantial room for improvement in mid-generation state revision. EchoChain provides a reproducible benchmark for diagnosing state-update reasoning failures in full-duplex voice interaction.
Authors: Yu Sha, Shuiping Gou, Bo Liu, Haofan Lu, Ningtao Liu, Jiahui Fu, Horst Stoecker, Domagoj Vnucec, Nadine Wetzstein, Andreas Widl, Kai Zhou
Abstract: Fault intensity diagnosis (FID) plays a pivotal role in intelligent manufacturing, yet neglecting dependencies among target classes hinders its practical deployment. This paper introduces a novel and general framework with a deep hierarchical knowledge loss (DHK) to achieve hierarchically consistent representation and prediction. We develop a novel hierarchical tree loss to enable a holistic mapping of same-attribute classes, leveraging tree-based positive and negative hierarchical knowledge constraints. We further design a focal hierarchical tree loss to enhance its extensibility and devise two adaptive weighting schemes based on tree height. In addition, we propose a group tree triplet loss with a hierarchical dynamic margin, incorporating hierarchical group concepts and tree distance to model boundary structural knowledge across classes. Jointly, the two losses significantly improve the recognition of subtle faults. Extensive experiments are performed on four real-world datasets from various industrial domains (three cavitation datasets from SAMSON AG and one publicly available dataset) for FID, all showing superior results and outperforming recent state-of-the-art FID methods.
Authors: Andrew Millard, Zheng Zhao, Henrik Pedersen
Abstract: Physics-guided sampling with diffusion priors has recently shown strong performance in solving complex systems of partial differential equations (PDEs) from sparse observations. However, these methods are typically evaluated on benchmark problems that do not fully demonstrate their ability to generate temporally consistent solutions of time-dependent PDEs, often focusing instead on reconstructing a single snapshot. In this work, we apply these methods to gas-phase reaction kinetics problems governed by the advection-reaction-diffusion (ARD) equation, providing a setting that more closely reflects realistic laboratory experiments. We demonstrate that guided sampling can be used to reconstruct full spatiotemporal trajectories, rather than isolated states. Furthermore, we show that these methods generalise to previously unseen parameter regimes, highlighting their potential for real-world applications.
Authors: Michael Sheehan, Irina Timoshenko
Abstract: Passenger assistance services are essential for accessible rail travel, yet demand varies substantially across stations and over time, creating challenges for workforce planning and staff rostering. This paper presents a data-driven decision support framework for forecasting station-level passenger assistance demand and translating forecasts into workforce plans. The forecasting component applies a horizon-aware Prophet modelling approach using multi-source operational data, while the planning component maps demand forecasts to staffing requirements under service and operational constraints through an interpretable red-amber-green risk framework. The approach has been implemented within a production-grade system to support routine planning and staffing decisions across LNER-managed stations. Results demonstrate improved forecast accuracy relative to year-on-year baseline methods, with absolute error reduced by up to 76.9%, and show that forecast-informed staffing is associated with an approximate 50% reduction in failed passenger assistance deliveries attributable to staff availability. These findings highlight the value of integrating interpretable forecasting with operational work.
Authors: Marc Estafanous (Johns Hopkins University, Neurobaby Corporation)
Abstract: One of the limitations of transformer networks is the sequence length, due to the quadratic cost of the attention matrix. Classical self-attention operates over the entire sequence; however, the attention actually used is sparse. Humans employ a form of sparse attention, called saccades, when analyzing an image or scene: focusing on key features greatly reduces computation time. By training a network (Saccade Attention Network) to learn where to attend from a large pre-trained model, we can use it to pre-process images and greatly reduce network size by shrinking the input sequence to just the key features being attended to. Our results indicate that computation can be reduced by close to 80% while producing similar results.
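The core pre-processing idea, keeping only the most-attended tokens, can be sketched generically. This is an illustrative reading of the abstract, not the authors' Saccade Attention Network; the `keep_frac` parameter and the source of the attention scores are assumptions.

```python
def saccade_select(tokens, attention_scores, keep_frac=0.2):
    # Hypothetical pre-processing step: keep only the top fraction of
    # input tokens (e.g. image patches) ranked by an attention score
    # from a large pre-trained model, shrinking the sequence that the
    # downstream transformer must process.
    k = max(1, int(len(tokens) * keep_frac))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: attention_scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve original spatial order
    return [tokens[i] for i in keep]

tokens = list("abcdefghij")  # stand-in for 10 patch tokens
scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1]
print(saccade_select(tokens, scores))  # -> ['b', 'd']
```

With `keep_frac=0.2`, 80% of the tokens are dropped, which is the regime of computational savings the abstract reports.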
Authors: Devendra Ghori
Abstract: State-of-the-art deepfake detectors achieve near-perfect in-domain accuracy yet degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance (rPPG) are treated either as post-hoc features or ignored. We introduce PhyLAA-X, a novel physics-conditioned extension of Localized Artifact Attention (LAA-X). PhyLAA-X injects three end-to-end differentiable physics-derived feature volumes - optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra - directly into the LAA-X attention computation via cross-attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur - regions inherently harder for generative models to replicate consistently. PhyLAA-X is embedded across an efficient spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC-ROC; on Celeb-DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 - outperforming the strongest published baseline (LAA-Net [1]) by 4.1-7.3% in cross-generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD-10 attacks. Single-backbone ablations confirm PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain. The full production system is open-sourced at https://github.com/devghori1264/Aletheia (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC-2026 in this work), and complete reproducibility artifacts.
Authors: Hanuman Verma, Akshansh Gupta, Pranabesh Maji, Saurav Mandal, Vijay Kumar Pandey
Abstract: Accurate brain image segmentation, particularly distinguishing various tissues in magnetic resonance imaging (MRI) images, plays a pivotal role in detecting neurological disease and in medical image computing. In deep learning approaches, loss functions are crucial for optimizing the model. In this study, we introduce a novel loss function that integrates fuzzy logic to deal with uncertainty in segmenting brain images into various tissues. It combines the well-known categorical cross-entropy (CCE) loss function with a fuzzy entropy term based on fuzzy logic. By employing fuzzy logic, this loss function accounts for the inherent uncertainties in pixel classification. The proposed loss function has been evaluated on two publicly available benchmark datasets, IBSR and OASIS, using two widely recognised architectures, U-Net and U-Net++. Experimental results demonstrate that the model trained with the proposed loss function provided better results than the CCE optimisation function in terms of various performance metrics. Additionally, it effectively enhances segmentation performance while handling meaningful uncertainty during training. The findings suggest that this approach not only improves segmentation outcomes but also contributes to the reliability of model predictions.
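A loss of this shape, CCE plus a fuzzy-entropy penalty on the predicted memberships, can be sketched per pixel as follows. This is a generic illustration of the idea, not the authors' implementation; the weighting term `lam` and the De Luca-Termini form of fuzzy entropy are assumptions.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot list; y_pred: predicted class probabilities.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def fuzzy_entropy(y_pred, eps=1e-12):
    # De Luca-Termini fuzzy entropy of the predicted memberships:
    # large when memberships are ambiguous (near 0.5), zero when crisp.
    h = 0.0
    for p in y_pred:
        p = min(max(p, eps), 1.0 - eps)
        h -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return h

def fuzzy_cce_loss(y_true, y_pred, lam=0.1):
    # Hypothetical combination: CCE plus a fuzzy-entropy term that
    # accounts for uncertainty in the pixel's classification.
    return categorical_cross_entropy(y_true, y_pred) + lam * fuzzy_entropy(y_pred)

crisp = fuzzy_cce_loss([1, 0, 0], [0.98, 0.01, 0.01])
fuzzy = fuzzy_cce_loss([1, 0, 0], [0.40, 0.35, 0.25])
print(crisp < fuzzy)  # the ambiguous prediction incurs the larger loss
```

In practice such a loss would be averaged over all pixels of a segmentation map; `lam` trades off fit against uncertainty handling.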
Authors: Shizheng Hou, Wenqi Pei, Nuo Chen, Quang-Trung Ta, Peng Lu, Beng Chin Ooi
Abstract: Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek-V3 and GPT-4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also significant computational inefficiency that severely hampers real-world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, emphasizing issues such as inaccurate gold SQL annotations and limitations in existing evaluation rules. By synthesizing these insights into a unified benchmark, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.
Authors: Samrendra Roy, Kazuma Kobayashi, Souvik Chakraborty, Sajedul Talukder, Syed Bahauddin Alam
Abstract: Continual learning, the ability to acquire new tasks sequentially without forgetting prior knowledge, is essential for deploying neural networks in dynamic real-world environments, from nuclear digital twin monitoring to grid-edge fault detection. Existing synaptic importance methods, such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), rely on gradient computation, making them incompatible with neuromorphic hardware that lacks backpropagation support. We propose ISI-CV, the first gradient-free synaptic importance metric for SNN continual learning, derived from the Coefficient of Variation (CV) of Inter-Spike Intervals (ISIs). Neurons that fire regularly (low CV) encode stable, task-relevant features and are protected from overwriting; neurons with irregular firing are permitted to adapt freely. ISI-CV requires only spike time counters and integer arithmetic, all of which are native to every neuromorphic chip. We evaluate on four benchmarks of increasing difficulty: Split-MNIST, Permuted-MNIST, Split-FashionMNIST, and Split-N-MNIST using real Dynamic Vision Sensor (DVS) event data. Across three seeds, ISI-CV achieves zero forgetting (AF = 0.000 +/- 0.000) on Split-MNIST and Split-FashionMNIST, near-zero forgetting on Permuted-MNIST (AF = 0.001 +/- 0.000), and the highest accuracy with the lowest forgetting on real neuromorphic DVS data (AA = 0.820 +/- 0.012, AF = 0.221 +/- 0.014). On N-MNIST, gradient-based methods produce unreliable importance estimates and perform worse than no regularization; ISI-CV avoids this failure by design.
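The core ISI-CV quantity, the coefficient of variation of inter-spike intervals, and its use as a protection weight can be sketched from spike times alone. This is an illustrative sketch of the idea as described in the abstract; the linear mapping to an importance weight and the `cv_max` threshold are assumptions, not the authors' exact scheme.

```python
def isi_cv(spike_times):
    # Coefficient of variation of inter-spike intervals (ISIs):
    # low CV = regular firing (stable, task-relevant feature);
    # high CV = irregular firing. Needs only spike timestamps.
    isis = [t2 - t1 for t1, t2 in zip(spike_times, spike_times[1:])]
    if len(isis) < 2:
        return float("inf")  # too few spikes to judge regularity
    mean = sum(isis) / len(isis)
    var = sum((x - mean) ** 2 for x in isis) / len(isis)
    return (var ** 0.5) / mean if mean > 0 else float("inf")

def importance(spike_times, cv_max=1.0):
    # Hypothetical mapping from regularity to a protection weight in
    # [0, 1]: 1.0 protects the neuron's weights from overwriting,
    # 0.0 lets them adapt freely on the next task.
    cv = isi_cv(spike_times)
    return max(0.0, 1.0 - min(cv, cv_max) / cv_max)

regular = [0, 10, 20, 30, 40, 50]    # clock-like firing, CV = 0
irregular = [0, 2, 30, 33, 80, 81]   # bursty firing, CV > 1
print(importance(regular), importance(irregular))  # -> 1.0 0.0
```

Note the computation uses only subtraction, squaring, and a square root over spike-time counters; on actual neuromorphic hardware even the square root could be replaced by an integer approximation, consistent with the abstract's claim of needing only counters and simple arithmetic.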
Authors: Nicholas Pather, Joshua Fouch\'e, Sitwala Mundia, Karl-G\"unter Technau, Thokozile Malaba, Alex Welte, Ushma Mehta, Bruce A. Bassett
Abstract: Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier multi-modal large language models and open-source models against a very challenging real-world medical form that mixes dates, structured printed text, and handwritten responses with significant variability. None of the smaller or older models perform well, but the latest Google and OpenAI models reach accuracies around $85\%$ with weighted F1 scores $\simeq 90\%$ across the discrete or predefined fields despite the very challenging nature of the responses. Clear task-specific strengths emerge: GPT 5.4 excels in noisy date extraction as well as reliability, with the lowest hallucination rate ($6\%$). Claude Sonnet 4.6 had the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivered the best overall performance, with the lowest free-text error rates (WER = $0.50$ and CER = $0.31$) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall and F1 by over $60\%$, but has little impact on weighted metrics (only $\sim2-5\%$ improvement). These results provide evidence that the rapid improvements of multimodal large language models offer a compelling pathway toward fully automated digitisation of complex handwritten workflows, which is particularly relevant in low- and middle-income countries.
Authors: Zahra Asghari Varzaneh, Niclas W\"olner-Hanssen, Reza Khoshkangini, Thomas Ebner, Magnus Johnsson
Abstract: The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.
Authors: Adithya V. Sastry, Bibek Poudel, Weizi Li
Abstract: Many robotic exploration algorithms rely on graph structures for frontier-based exploration and dynamic path planning. However, these graphs grow rapidly, accumulating redundant information and impacting performance. We present a transformer-based framework trained with Proximal Policy Optimization (PPO) to prune these graphs during exploration, limiting their growth and reducing the accumulation of excess information. The framework was evaluated on simulations of a robotic agent using Rapidly Exploring Random Trees (RRT) to carry out frontier-based exploration, where the learned policy reduces graph size by up to 96%. We find preliminary evidence that our framework learns to associate pruning decisions with exploration outcomes despite sparse, delayed reward signals. We also observe that while intelligent pruning achieves a lower rate of exploration compared to baselines, it yields the lowest standard deviation, producing the most consistent exploration across varied environments. To the best of our knowledge, these results are the first suggesting the viability of RL in sparsification of dynamic graphs used in robotic exploration algorithms.
Authors: Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf
Abstract: We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
Authors: Suraj Prasad, Pinak Mahapatra
Abstract: Automating the digitization of Piping and Instrumentation Diagrams (P&IDs) into structured process graphs would unlock significant value in plant operations, yet progress is bottlenecked by a fundamental data problem: engineering drawings are proprietary, and the entire community shares a single public benchmark of just 12 annotated images. Prior attempts at synthetic augmentation have fallen short because template-based generators scatter symbols at random, producing graphs that bear little resemblance to real process plants and, accordingly, yield only approximately 33% edge detection accuracy under synth-only training. We argue the failure is structural rather than visual and address it by introducing SynthPID, a corpus of 665 synthetic P&IDs whose pipe topology is seeded directly from real drawings. Paired with a patch-based Relationformer adapted for high-resolution diagrams, a model trained on SynthPID alone achieves 63.8 +/- 3.1% edge mAP on PID2Graph OPEN100 without seeing a single real P&ID during training, closing within 8 pp of the real-data oracle. These gains hold up under a controlled comparison against the template-based regime, confirming that generation quality drives performance rather than model choice. A scaling study reveals that gains flatten beyond roughly 400 synthetic images, pointing to seed diversity as the binding constraint.
Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq 4.4M$ data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to \textbf{3$\times$} decoding throughput speedup compared to the source model.
Authors: Jiachen Qian, Zhaolu Kang
Abstract: The rapid proliferation of Multimodal Large Language Models (MLLMs) has enabled mobile agents to execute high-stakes financial transactions, but their adversarial robustness remains underexplored. We identify Visual Dominance Hallucination (VDH), where imperceptible visual cues can override textual price evidence in screenshot-based, price-constrained settings and lead agents to irrational decisions. We propose PriceBlind, a stealthy white-box adversarial attack framework for controlled screenshot-based evaluation. PriceBlind exploits the modality gap in CLIP-based encoders via a Semantic-Decoupling Loss that aligns the image embedding with low-cost, value-associated anchors while preserving pixel-level fidelity. On E-ShopBench, PriceBlind achieves around 80% ASR in white-box evaluation; under a simplified single-turn coordinate-selection protocol, Ensemble-DI-FGSM transfers with roughly 35-41% ASR across GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. We also show that robust encoders and Verify-then-Act defenses reduce ASR substantially, though with some clean-accuracy trade-off.
Authors: Megan Smith, Venkatesh Thirugnana Sambandham, Florian Richter, Laura Crompton, Matthias Uhl, Torsten Sch\"on
Abstract: Text-to-Image (T2I) generation models have been widely adopted across various industries, yet are criticized for frequently exhibiting societal stereotypes. While a growing body of research has emerged to evaluate and mitigate these biases, the field at present contends with conceptual ambiguity, for example terms like "bias" and "fairness" are not always clearly distinguished and often lack clear operational definitions. This paper provides a comprehensive systematic review of T2I fairness literature, organizing existing work into a taxonomy of bias types and fairness notions. We critically assess the gap between "target fairness" (normative ideals in T2I outputs) and "threshold fairness" (normative standards with actionable decision rules). Furthermore, we survey the landscape of mitigation strategies, ranging from prompt engineering to diffusion process manipulation. We conclude by proposing a new framework for operationalizing fairness that moves beyond descriptive metrics towards rigorous, target-based testing, offering an approach for more accountable generative AI development.
Authors: Ke Zhang, Patricio Gallardo, Maziar Raissi, Sudhir Murthy
Abstract: Automatic translation of natural language mathematics into faithful Lean 4 code is hindered by the fundamental dissonance between informal set-theoretic intuition and strict formal type theory. This gap often causes LLMs to hallucinate non-existent library definitions, resulting in code that fails to compile or lacks semantic fidelity. In this work, we investigate the effectiveness of tool-augmented agents for this task through a systematic factorial analysis of three distinct tool categories: Fine-tuned Model Querying (accessing expert drafts), Knowledge Search (retrieving symbol definitions), and Compiler Feedback (verifying code via a Lean REPL). We first benchmark the agent against one-shot baselines, demonstrating large gains in both compilation success and semantic equivalence. We then use the factorial decomposition to quantify the impact of each category, isolating the marginal contribution of each tool type to overall performance.
Authors: Noor Islam S. Mohammad, Ulu\u{g} Bayaz{\i}t
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, yet a unified treatment of their overlapping safety challenges remains lacking. We present SafeLM, a framework that jointly addresses four pillars of LLM safety: privacy, security, misinformation, and adversarial robustness. SafeLM combines federated training with gradient sparsification and Paillier encryption for privacy, integrates defenses against training- and inference-time attacks, employs contrastive grounding with calibrated decoding to reduce hallucinations, and introduces alignment-aware binarized aggregation to enhance robustness while maintaining bounded reconstruction quality. Across benchmarks on factuality, toxicity, and membership inference, SafeLM achieves 98.0% harmful content detection accuracy, reduces communication by 96.9%, and lowers gradient inversion PSNR from 31.7 dB to 15.1 dB. Ablations show that each component contributes independently, whereas their integration yields a strong privacy-utility-efficiency trade-off for deploying trustworthy LLMs.
Authors: Yixiao Lin, James Booth
Abstract: Machine learning models often inherit biases from historical data, raising critical concerns about fairness and accountability. Conventional fairness interventions typically require access to sensitive attributes like gender or race, but privacy and legal restrictions frequently limit their use. To address this challenge, we propose a framework that infers sensitive attributes from auxiliary features and integrates fairness constraints into model training. Our approach mitigates bias while preserving predictive accuracy, offering a practical solution for fairness-aware learning. Empirical evaluations validate its effectiveness, contributing to the advancement of more equitable algorithmic decision-making.
Authors: Livia Qian, Gabriel Skantze
Abstract: Backchannels (e.g., `yeah', `mhm', and `right') are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context-backchannel suitability task. Our results demonstrate that the learned projections substantially improve context-backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.
Authors: Weihua Du, Jingming Zhuo, Yixin Dong, Andre Wang He, Weiwei Sun, Zeyu Zheng, Manupa Karunaratne, Ivan Fox, Tim Dettmers, Tianqi Chen, Yiming Yang, Sean Welleck
Abstract: Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non-linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self-improvement via accumulated execution feedback for performance-critical kernel code generation through two complementary stages: failure-driven adaptation and diversity-preserving search, jointly improving correctness and optimization performance without additional fine-tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps, and continues to improve with additional computation.
Authors: Varun Kumar, George Em Karniadakis
Abstract: This paper introduces a multi-agent framework guided by Large Language Models (LLMs) to assist in the early stages of engineering design, a phase often characterized by vast parameter spaces and inherent uncertainty. Operating under a human-in-the-loop paradigm and demonstrated on the canonical problem of aerodynamic airfoil design, the framework employs a team of specialized agents: a Coding Assistant, a Design Agent, a Systems Engineering Agent, and an Analyst Agent - all coordinated by a human Manager. Integrated within a set-based design philosophy, the process begins with a collaborative phase where the Manager and Coding Assistant develop a suite of validated tools, after which the agents execute a structured workflow to systematically explore and prune a large set of initial design candidates. A key contribution of this work is the explicit integration of formal risk management, employing the Conditional Value-at-Risk (CVaR) as a quantitative metric to filter designs that exhibit a high probability of failing to meet performance requirements, specifically the target coefficient of lift. The framework automates labor-intensive initial exploration through a global sensitivity analysis conducted by the Analyst agent, which generates actionable heuristics to guide the other agents. The process culminates by presenting the human Manager with a curated final set of promising design candidates, augmented with high-fidelity Computational Fluid Dynamics (CFD) simulations. This approach effectively leverages AI to handle high-volume analytical tasks, thereby enhancing the decision-making capability of the human expert in selecting the final, risk-assessed design.
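The CVaR-based filtering step described above can be sketched generically: CVaR at level alpha is the mean of the worst (1 - alpha) fraction of outcomes. The pruning rule, loss definition (shortfall below the target lift coefficient), and tolerance below are illustrative assumptions, not the paper's implementation.

```python
def cvar(losses, alpha=0.9):
    # Conditional Value-at-Risk: the average of the worst (1 - alpha)
    # fraction of sampled losses (empirical tail mean).
    worst = sorted(losses, reverse=True)
    k = max(1, int(round((1.0 - alpha) * len(worst))))
    return sum(worst[:k]) / k

def filter_designs(designs, target_cl, alpha=0.9, tol=0.02):
    # Hypothetical pruning rule: keep a design only if the CVaR of its
    # lift shortfall (target_cl minus a sampled lift coefficient) stays
    # within tolerance, i.e. even its tail outcomes nearly meet the target.
    kept = []
    for name, cl_samples in designs:
        shortfalls = [target_cl - cl for cl in cl_samples]
        if cvar(shortfalls, alpha) <= tol:
            kept.append(name)
    return kept

designs = [
    ("airfoil_a", [0.52, 0.51, 0.50, 0.49, 0.53, 0.52, 0.51, 0.50, 0.52, 0.51]),
    ("airfoil_b", [0.55, 0.30, 0.54, 0.53, 0.28, 0.56, 0.55, 0.54, 0.53, 0.55]),
]
print(filter_designs(designs, target_cl=0.50))  # -> ['airfoil_a']
```

Here `airfoil_b` has a better median lift but occasional severe shortfalls, exactly the kind of high-probability-of-failure candidate a CVaR filter removes while a mean-based filter would not.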
Authors: Gustavo Sandoval, Brendan Dolan-Gavitt, Siddharth Garg
Abstract: Large language models write production code, and yet they routinely introduce well-known vulnerabilities. We show that this is not a knowledge deficit: the same models that generate insecure code correctly identify and explain the vulnerability when asked directly, a gap we call the Format-Reliability Gap. Mechanistic analysis reveals the cause: security representations are encoded from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them. Because the failure is localized to a single layer, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead. The mechanism and the fix generalize across five models, three architecture families, and six vulnerability types, suggesting insecure code generation is an interpretability problem, not a training artifact.
Authors: Jun-Liang Lin, Kamesh Madduri, Mahmut Taylan Kandemir
Abstract: Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity. In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup with system scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
Authors: Moein Salimi, Babak Hosseini Mohtasham, Amin Aghakasiri, Mahdi Naieni, Amir Hossein Qeysarbeigi, Mohammad Masih Shalchian Nazer, Zahra Azar, Mahdi Jafari Siavoshani, Mohammad Hossein Rohban
Abstract: Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking -- where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We introduce the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We grounded our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
Authors: Maxwell Shepherd
Abstract: We present a system for automated detection, localization, and scoring of arrow punctures on 40\,cm indoor archery target faces, trained on only 48 annotated photographs (5{,}084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from $32 \times 32$ patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8\,M of 308\,M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of $0.893 \pm 0.011$ and a mean localization error of $1.41 \pm 0.06$\,mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8\% and group centroid positions to within a median of 4.00\,mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
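For readers unfamiliar with the downstream metric, scoring on a 40 cm indoor face reduces to a per-ring distance threshold: ten concentric zones, each 20 mm wide radially, with arrows touching a line ("line cutters") scoring the higher ring. A minimal sketch of that mapping (our illustration of the standard scoring convention, not the paper's code):

```python
def arrow_score(d_mm: float) -> int:
    """Score an arrow at distance d_mm from the center of a 40 cm target
    face: 10 concentric rings, each 20 mm wide radially; line cutters
    score the higher ring; beyond 200 mm is a miss."""
    if d_mm > 200.0:
        return 0                          # miss
    ring = int(d_mm // 20.0)              # 0 -> 10-ring, ..., 9 -> 1-ring
    if d_mm % 20.0 == 0 and d_mm > 0:     # exactly on a line counts up
        ring -= 1
    return 10 - ring
```

With localization errors around 1.4 mm against 20 mm ring widths, misscoring is only likely for arrows landing very near a ring boundary, which is consistent with the reported 1.8% median score error.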
Authors: Samrendra Roy, Syed Bahauddin Alam
Abstract: Quantum feature maps offer expressive embeddings for classical learning tasks, and augmenting sparse identification of nonlinear dynamics (SINDy) with such features is a natural but unexplored direction. We introduce \textbf{Q-SINDy}, a quantum-kernel-augmented SINDy framework, and identify a specific failure mode that arises: \emph{coefficient cannibalization}, in which quantum features absorb coefficient mass that rightfully belongs to the polynomial basis, corrupting equation recovery. We derive the exact cannibalization-bias formula $\Delta\xi_P = (P^\top P)^{-1}P^\top Q\,\hat\xi_Q$ and prove that orthogonalizing quantum features against the polynomial column space at fit time eliminates this bias exactly. The claim is verified numerically to machine precision ($<10^{-12}$) on multiple systems. Empirically, across six canonical dynamical systems (Duffing, Van der Pol, Lorenz, Lotka-Volterra, cubic oscillator, R\"ossler) and three quantum feature map architectures (ZZ-angle encoding, IQP, data re-uploading), orthogonalized Q-SINDy consistently matches vanilla SINDy's structural recovery while uncorrected augmentation degrades true-positive rates by up to 100\%. A refined dynamics-aware diagnostic, $R^2_Q$ for $\dot X$, predicts cannibalization severity with statistical significance (Pearson $r=0.70$, $p=0.023$). An RBF classical-kernel control across 20 hyperparameter configurations fails more severely than any quantum variant, ruling out feature count as the cause. Orthogonalization remains robust under depolarizing hardware noise up to 2\% per gate, and the framework extends without modification to Burgers' equation.
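The cannibalization-bias formula and the orthogonalization fix can be checked numerically in a few lines. In this sketch (our synthetic construction; `P`, `Q`, and the coefficient values are stand-ins for the polynomial library and quantum features), the bias $\Delta\xi_P = (P^\top P)^{-1}P^\top Q\,\hat\xi_Q$ holds to machine precision and vanishes once $Q$ is orthogonalized against the column space of $P$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 4, 3

# Synthetic design: P = polynomial library, Q = augmenting features,
# deliberately correlated with P so cannibalization can occur
P = rng.standard_normal((n, p))
Q = rng.standard_normal((n, q)) + 0.5 * P @ rng.standard_normal((p, q))

xi_true = np.array([1.5, 0.0, -2.0, 0.7])  # true dynamics: polynomial only
y = P @ xi_true

# Suppose the quantum features have absorbed coefficient mass xi_Q;
# refitting the polynomial part on the residual is then biased:
xi_Q = np.array([0.3, -0.8, 0.1])
xi_P = np.linalg.lstsq(P, y - Q @ xi_Q, rcond=None)[0]

# Exact bias formula: Delta xi_P = (P^T P)^{-1} P^T Q xi_Q
delta = np.linalg.solve(P.T @ P, P.T @ Q @ xi_Q)
assert np.allclose(xi_true - xi_P, delta, atol=1e-10)

# Orthogonalizing Q against col(P) makes P^T Q_orth = 0, so the bias
# vanishes for any xi_Q
Q_orth = Q - P @ np.linalg.solve(P.T @ P, P.T @ Q)
xi_P2 = np.linalg.lstsq(P, y - Q_orth @ xi_Q, rcond=None)[0]
assert np.allclose(xi_P2, xi_true, atol=1e-10)
```

The second assertion is the crux of the abstract's claim: projection at fit time removes the bias exactly, independent of how much mass the augmented features carry.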
Authors: Qiaoyue Tang, Sepidehsadat Hosseini, Mengyao Zhai, Thibaut Durand, Greg Mori
Abstract: This paper presents FairNVT, a lightweight debiasing framework for pretrained transformer-based encoders that improves both representation and prediction level fairness while preserving task accuracy. Unlike many existing debiasing approaches that address these notions separately, we argue they are inherently connected: suppressing sensitive information at the representation level can facilitate fairer predictions. Our approach learns task-relevant and sensitive embeddings via lightweight adapters, applies calibrated Gaussian noise to the sensitive embedding, and fuses it with the task representation. Together with orthogonality constraints and fairness regularization, these components jointly reduce sensitive-attribute leakage in the learned embeddings and encourage fairer downstream predictions. The framework is compatible with a wide range of pretrained transformer encoders. Across three datasets spanning vision and language, FairNVT reduces sensitive-attribute attacker accuracy, improves demographic-parity and equalized-odds metrics, and maintains high task performance.
Authors: Peifeng Gao, Wenyi Fang, Yang Zheng, Difan Zou
Abstract: Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains earlier non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during otherwise stable descent. To test this hypothesis at theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, where we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, with an optional appendix-only loss lower bound under an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.
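The effective-learning-rate mechanism rests on a standard property of normalized models: the loss depends only on the direction of $w$, so the gradient is orthogonal to $w$ and scales as $1/\|w\|$, meaning a fixed step size acts more aggressively as the norm changes. A finite-difference check of both facts on a toy normalized linear model (our illustration, not the paper's setup):

```python
import numpy as np

def normalized_loss(w, x, y):
    """Square loss of a 'normalized' linear model: predictions depend only
    on the direction of w, mimicking batch norm's scale invariance."""
    u = w / np.linalg.norm(w)
    return 0.5 * np.mean((x @ u - y) ** 2)

def grad(f, w, *args, h=1e-6):
    """Central finite-difference gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e, *args) - f(w - e, *args)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x, y = rng.standard_normal((64, 5)), rng.standard_normal(64)
w = rng.standard_normal(5)

g1 = grad(normalized_loss, w, x, y)
g2 = grad(normalized_loss, 3.0 * w, x, y)

assert abs(np.dot(g1, w)) < 1e-4             # gradient orthogonal to w
assert np.allclose(g2, g1 / 3.0, atol=1e-4)  # scaling w by c shrinks grad by 1/c
```

Together these imply that gradient descent with a fixed step corresponds to a direction-space step that grows as $\|w\|$ shrinks, which is the route by which instability can set in only after a stable descent phase.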
Authors: Huaxin Wang, Xinge Wu, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen
Abstract: Quantum error mitigation (QEM) provides a practical route for estimating reliable observables on noisy intermediate-scale quantum (NISQ) devices. Traditional QEM strategies, including zero-noise extrapolation (ZNE) and Clifford data regression (CDR), rely on noise scaling or global regression, and their performance is constrained by the exponential growth of the system degrees of freedom. We construct a graph-enhanced mitigation (GEM) framework, which incorporates physical information into the model representation. In this work, quantum circuits are encoded as attributed graphs. Hardware-level physical information is mapped to node and edge features: local noise parameters such as calibration parameters $T_1$, $T_2$, and readout errors are encoded at nodes, while coupling-related information such as two-qubit gate errors is encoded as edge features. Graph neural networks are used to model how errors propagate along the physical coupling structure and build up into non-local correlations. This allows the model to capture local interactions and part of the resulting non-local correlations across qubits. A dual-branch affine correction is applied to maintain consistency with physical constraints. Experiments on 10-qubit and 16-qubit random circuits executed on superconducting quantum processors show that GEM provides a level of accuracy comparable to CDR at small scales, while yielding lower mean absolute error and improved stability in zero-shot transfer to larger systems. Results for the traditional QEM strategies indicate that global regression methods remain effective in low-dimensional settings but become less reliable as system degrees of freedom grow. In contrast, GEM exploits local physical structure to achieve better scalability and generalization, while preserving the overall error propagation patterns. This work provides a practical, scalable approach to QEM for NISQ devices.
Authors: Nges Brian Njungle, Eric Jahns, Michel A. Kinsy
Abstract: Privacy-preserving machine learning (PPML) has become increasingly important in applications where sensitive data must remain confidential. Homomorphic Encryption (HE) enables computation directly on encrypted data, allowing neural network inference without revealing raw inputs. While prior works have largely focused on inference over a single encrypted image, batch processing of encrypted inputs lags behind, despite being critical for high-throughput inference scenarios and training-oriented workloads. In this work, we address this gap by developing optimized algorithms for batched HE-friendly neural networks. We also introduce a pipeline architecture designed to maximize resource efficiency across different batch sizes. We implemented these algorithms and evaluated our work using HE-friendly ResNet-20 and ResNet-34 models on encrypted CIFAR-10 and CIFAR-100 datasets, respectively. For ResNet-20, our approach achieves an amortized inference time of 8.86 seconds per image when processing a batch of 512 encrypted images, with a peak memory usage of 98.96 GB. These results represent a 1.78x runtime improvement and a 3.74x reduction in memory usage compared to the state-of-the-art design. For the deeper ResNet-34 model, we achieve an amortized inference time of 28.14 seconds per image on a batch of 256 encrypted images using 246.78 GB of RAM.
Authors: Zahid Hasan, Masud Ahmed, Nirmalya Roy
Abstract: Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincar\'e ball model, which suffers from numerical instability and poses optimization and computational challenges. We propose a novel, tractable, architecture-agnostic semantic segmentation framework (pixel-wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel-level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence maps, boundary delineation, hierarchical and text-based retrieval, and zero-shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO-Stuff-164k, Pascal-VOC, and Cityscapes, utilizing state-of-the-art per-pixel classification models (DeepLabV3 and SegFormer) and mask classification models (Mask2Former and MaskFormer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty-aware semantic segmentation. Code is available at https://github.com/mxahan/Lorentz_semantic_segmentation.
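For context, the Lorentz (hyperboloid) model favored here over the Poincar\'e ball uses the Minkowski inner product; its distance and origin exponential map have simple closed forms with no boundary to approach, which is one source of its better numerical behavior. A minimal sketch at curvature $-1$ (our illustration, unrelated to the released code):

```python
import numpy as np

def lorentz_inner(x, y):
    """Minkowski inner product <x,y>_L = -x0*y0 + <x_rest, y_rest>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def expmap_origin(v_spatial):
    """Exponential map at the hyperboloid origin o = (1, 0, ..., 0) for a
    tangent vector with zero time component (curvature -1)."""
    nrm = np.linalg.norm(v_spatial)
    if nrm < 1e-12:
        return np.concatenate([[1.0], np.zeros_like(v_spatial)])
    return np.concatenate([[np.cosh(nrm)], np.sinh(nrm) * v_spatial / nrm])

def lorentz_dist(x, y):
    """Geodesic distance d(x, y) = arcosh(-<x,y>_L), clipped for safety."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

x = expmap_origin(np.array([0.3, -0.2]))
y = expmap_origin(np.array([-0.1, 0.4]))
assert abs(lorentz_inner(x, x) + 1.0) < 1e-9  # stays on the hyperboloid
assert lorentz_dist(x, x) < 1e-6
assert lorentz_dist(x, y) > 0
```

Because `expmap_origin` lands exactly on the manifold (via the identity $\cosh^2 t - \sinh^2 t = 1$), Euclidean parameters can be mapped into Lorentz space without a Riemannian optimizer, consistent with the abstract's claim.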
URLs: https://github.com/mxahan/Lorentz_semantic_segmentation.
Authors: Yiheng Chen, Zihui Ma, Peishi Jiang, Yilong Dai, Qikai Hu, Xinyue Ye, Lingyao Li, Rita Sousa, Runlong Yu
Abstract: Land surface temperature (LST) super-resolution is important for environmental monitoring. However, it remains challenging as coarse thermal observations severely underdetermine fine-scale structure. In this paper, we propose Earth Foundation Model-guided Diffusion (EFDiff), a novel framework for super-resolution under extreme spatial degradation. EFDiff uses the Prithvi-EO-2.0 Earth foundation model to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into the denoising network via cross-attention to guide fine-scale reconstruction from highly degraded observations. We study two variants, EFDiff-$\epsilon$ and EFDiff-$x_0$, which offer complementary trade-offs between perceptual realism and pixel-level fidelity. We evaluate EFDiff under an extreme $32\times$ scale gap using a globally diverse benchmark comprising 242,416 co-registered Landsat thermal-reflectance patches. Results show that EFDiff consistently outperforms baseline methods and that cross-attention conditioning by EFM is more effective than HLS channel concatenation. Although we present EFDiff in the context of LST super-resolution, the framework is broadly applicable to remote sensing problems in which pretrained geospatial representations can guide generative reconstruction.
Authors: Yixuan Wang
Abstract: This thesis develops numerical and theoretical approaches for understanding and analyzing singularity formation in Partial Differential Equations (PDEs). The singularity formation in the Navier-Stokes Equation (NSE) is famously challenging as one of the seven Clay Prize problems. Unlike simpler equations such as the Nonlinear Heat (NLH) or Keller-Segel (KS) equations, where formal asymptotics near blowup are better understood, the intrinsic complexity of NSE makes quantitative analytical treatment difficult, if not impossible, without numerical guidance. Building on numerical insights, we introduce a robust analytical framework to simplify and systematize pen-and-paper proofs for simpler singular PDEs. We present a novel approach based on enforcing vanishing modulation conditions for perturbations around approximate blowup profiles, complemented by singularly weighted energy estimates. We demonstrate the efficacy of our method on PDEs with complicated asymptotics, such as NLH and the Complex Ginzburg-Landau (CGL) equation, and address the open problem of singularity formation in the 3D KS equation with logistic damping. We develop and refine numerical approaches that facilitate deeper insights into singularity formation. We demonstrate that machine learning methods significantly enhance our capability to identify and characterize potential blowup solutions with high precision. We improve on existing Physics-Informed Neural Network (PINN) and Neural Operator (NO) frameworks. Moreover, we present a novel machine learning paradigm, the Kolmogorov-Arnold Network (KAN) architecture, whose interpretability and excellent scaling properties are achieved through learnable nonlinearities.
Authors: Victor Korolev, Mikhail Ivanov, Tatiana Kukanova, Artyom Rukavitsa, Alexander Vakshin, Peter Solomonov, Alexander Zeifman
Abstract: In this paper, we consider the problem of extracting the most informative features from time series that are regarded as observed values of stochastic processes satisfying It{\^{o}} stochastic differential equations with unknown random drift and diffusion coefficients. We do not draw on any external information and use only the information contained in the time series itself. As additional features, we therefore use the parameters of statistically adjusted mixture-type models of the observed regularities in the behavior of the time series. Several algorithms for constructing these parameters are discussed. These algorithms are based on statistical reconstruction of the coefficients which, in turn, is based on statistical separation of normal mixtures. We obtain two types of parameters by the techniques of uniform and non-uniform statistical reconstruction of the coefficients of the underlying It{\^{o}} process. The reconstructed coefficients obtained by the uniform techniques do not depend on the current value of the process, while the non-uniform techniques reconstruct the coefficients taking into account their dependence on the value of the process. In effect, the non-uniform techniques used in this paper represent a stochastic analog of the Taylor expansion for the time series. The efficiency of the obtained additional features is compared by using them in autoregressive algorithms for time-series prediction. In order to draw conclusions that are not affected by unwanted factors, such as a particular choice of neural-network architecture, we used only simple autoregressive algorithms. We show that the use of the additional statistical features improves prediction.
Authors: Hikaru Shindo, Henri R\"o{\ss}ler, Quentin Delfosse, Kristian Kersting
Abstract: Neuro-symbolic Reinforcement Learning (NeSy-RL) combines symbolic reasoning with gradient-based optimization to achieve interpretable and generalizable policies. Relational concepts, such as "left of" or "close by", serve as foundational building blocks that structure how agents perceive and act. However, conventional approaches require human experts to manually define these concepts, limiting adaptability since concept semantics vary across environments. We propose GRAIL (Grounding Relational Agents through Interactive Learning), a framework that autonomously grounds relational concepts through environmental interaction. GRAIL leverages large language models (LLMs) to provide generic concept representations as weak supervision, then refines them to capture environment-specific semantics. This approach addresses both sparse reward signals and concept misalignment prevalent in underdetermined environments. Experiments on the Atari games Kangaroo, Seaquest, and Skiing demonstrate that GRAIL matches or outperforms agents with manually crafted concepts in simplified settings, and reveals informative trade-offs between reward maximization and high-level goal completion in the full environment.
Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
Abstract: Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present \textbf{EasyVideoR1}, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47 $\times$ throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
Authors: Weiyu Ma, Yongcheng Zeng, Yan Song, Xinyu Cui, Jian Zhao, Xuhui Liu, Mohamed Elhoseiny
Abstract: Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
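The core idea, multiplying any stored PER priority by an exponential age decay, can be sketched in a few lines. This is our toy buffer, not the paper's implementation: the class name, `decay` value, and sampling scheme are illustrative.

```python
import math
import random

class FreshnessAwareBuffer:
    """Toy prioritized buffer whose sampling weights decay exponentially
    with trajectory age, so stale high-priority items fade out."""

    def __init__(self, decay=0.5):
        self.decay = decay   # hypothetical decay rate per policy update
        self.items = []      # (trajectory, base_priority, insert_step)

    def add(self, traj, priority, step):
        self.items.append((traj, priority, step))

    def effective_priorities(self, current_step):
        # Freshness-aware priority: base priority times exp(-decay * age)
        return [p * math.exp(-self.decay * (current_step - s))
                for _, p, s in self.items]

    def sample(self, current_step, rng=random):
        w = self.effective_priorities(current_step)
        return rng.choices(self.items, weights=w, k=1)[0][0]

buf = FreshnessAwareBuffer(decay=0.5)
buf.add("old-high", priority=10.0, step=0)
buf.add("new-low", priority=1.0, step=10)
w = buf.effective_priorities(current_step=10)
# The old trajectory's raw priority is 10x higher, yet after ten policy
# updates of decay it is sampled far less often than the fresh one.
assert w[0] < w[1]
```

This captures the staleness fix in one line: without the `exp(-decay * age)` factor, the old trajectory would dominate sampling indefinitely, which is exactly the failure mode the abstract attributes to standard PER under rapid policy evolution.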
Authors: Noga Mudrik, Adam S. Charles
Abstract: Across many scientific fields, measurements often represent the number of times an event occurs. For example, a document can be represented by word occurrence counts, neural activity by spike counts per time window, or online communication by daily email counts. These measurements yield high-dimensional count data that often approximate a Poisson distribution, frequently with low rates that produce substantial sparsity and complicate downstream analysis. A useful approach is to embed the data into a low-dimensional space that preserves meaningful structure, commonly termed dimensionality reduction. Yet existing dimensionality reduction methods, including both linear (e.g., PCA) and nonlinear approaches (e.g., t-SNE), often assume continuous Euclidean geometry, thereby misaligning with the discrete, sparse nature of low-rate count data. Here, we propose p-SNE (Poisson Stochastic Neighbor Embedding), a nonlinear neighbor embedding method designed around the Poisson structure of count data, using KL divergence between Poisson distributions to measure pairwise dissimilarity and Hellinger distance to optimize the embedding. We test p-SNE on synthetic Poisson data and demonstrate its ability to recover meaningful structure in real-world count datasets, including weekday patterns in email communication, research area clusters in OpenReview papers, and temporal drift and stimulus gradients in neural spike recordings.
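Both dissimilarities named in the abstract have closed forms for Poisson rates, which is what makes the approach tractable: per coordinate, $\mathrm{KL}(\mathrm{Pois}(\lambda)\,\|\,\mathrm{Pois}(\mu)) = \lambda\log(\lambda/\mu) + \mu - \lambda$, and the squared Hellinger distance between independent Poisson vectors is $1 - \exp\!\big(-\tfrac12\sum_i(\sqrt{\lambda_i}-\sqrt{\mu_i})^2\big)$. A sketch of both (our code; how p-SNE combines them into an embedding objective follows the paper, not this snippet):

```python
import numpy as np

def poisson_kl(lam, mu, eps=1e-12):
    """Closed-form KL( Pois(lam) || Pois(mu) ), summed over independent
    coordinates; eps guards the log at zero rates."""
    lam = np.asarray(lam, float) + eps
    mu = np.asarray(mu, float) + eps
    return np.sum(lam * np.log(lam / mu) + mu - lam, axis=-1)

def poisson_hellinger_sq(lam, mu):
    """Closed-form squared Hellinger distance between vectors of
    independent Poisson rates (Bhattacharyya coefficients multiply)."""
    d2 = 0.5 * np.sum((np.sqrt(lam) - np.sqrt(mu)) ** 2, axis=-1)
    return 1.0 - np.exp(-d2)

x = np.array([3.0, 0.0, 1.0])
y = np.array([3.0, 0.0, 1.0])
z = np.array([10.0, 2.0, 0.0])

assert poisson_kl(x, y) < 1e-9                 # identical rates: zero
assert poisson_kl(x, z) > 0
assert 0.0 <= poisson_hellinger_sq(x, z) < 1.0  # bounded in [0, 1)
```

Note that both measures handle zero counts gracefully, whereas Euclidean distance treats a rate of 0 and a rate of 0.01 as nearly identical regardless of scale, which is one way continuous-geometry methods misalign with sparse count data.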
Authors: Alexis Carrillo, Salvatore Citraro, Ali Aghazhadeh Ardebili, Enrique Taietta, Giulio Rossetti, Emilio Ferrara, Giuseppe Alessandro Veltri, Massimo Stella
Abstract: Little longitudinal evidence examines LLMs' persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs' persuasiveness about polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, the LLM's perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in roughly one conversational quip out of every six, countering the ``LLMs as superior systems" stereotype behind LLMs' cognitive surrender. LLMs' perceived humanness was most learnable from sociodemographic, psychological and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$) and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) that psychological susceptibility to LLM persuasion was associated with greater trust in LLMs, higher agreeableness and extraversion, and a higher need for cognition. A multiverse approach with mixed-effects models confirmed the XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.
Authors: Riza Alaudin Syah, Irwan Alnarus Kautsar, Gunawan Witjaksono, Haza Nuzly bin Abdull Hamed
Abstract: Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.
Authors: Liyin Chen, Nazlee Zebardast, Mengyu Wang, Tobias Elze, Jason I. Comander
Abstract: Quantitative prediction of future retinal appearance from longitudinal imaging would support clinical decisions in progressive macular disease that currently rely on qualitative comparison or scalar progression scores. Recent methods have moved toward increasing generative complexity, but whether this complexity is necessary for slowly progressing retinal disease is unclear. We tested this through a controlled comparison of five conditioning configurations sharing one architecture and training dataset, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. In our evaluation, aligning the training and inference input distributions produced large gains (delta-SSIM +0.082, SSIM +0.086, both p < 0.001), while the choice among aligned frameworks did not significantly affect any primary metric. Task-entropy and posterior-concentration analyses, replicated on two fundus autofluorescence (FAF) platforms, provided a mechanistic account: the predictable component of inter-visit change is small relative to time-invariant acquisition variability, leaving stochastic sampling with little width to exploit. Guided by these findings, we developed TRU (Temporal Retinal U-Net), a deterministic direct-regression model with continuous time-delta conditioning and multi-scale history aggregation. We evaluated TRU on 28,902 eyes across three imaging platforms: a mixed-disease Optos FAF cohort (9,942 eyes), zero-shot transfer to Stargardt macular dystrophy on Optos (288 eyes) and Heidelberg Spectralis (125 eyes), and a boundary evaluation on Cirrus en-face fundus images from a glaucoma cohort (18,547 eyes). TRU matched or exceeded delta-SSIM, SSIM, and PSNR in every FAF cohort against three state-of-the-art benchmarks, and its advantage grew monotonically with available history length.
Authors: Yahya Saleh
Abstract: Recent work has shown that parameterizing and optimizing coordinate transformations using normalizing flows, i.e., invertible neural networks, can significantly accelerate the convergence of spectral approximations. We present the first error estimates for approximating functions using Hermite expansions composed with adaptive coordinate transformations. Our analysis establishes an equivalence principle: approximating a function $f$ in the span of the transformed basis is equivalent to approximating the pullback of $f$ in the span of Hermite functions. This allows us to leverage the classical approximation theory of Hermite expansions to derive error estimates in transformed coordinates in terms of the regularity of the pullback. We present an example demonstrating how a nonlinear coordinate transformation can enhance the convergence of Hermite expansions. Focusing on smooth functions decaying along the real axis, we construct a monotone transport map that aligns the decay of the target function with the Hermite basis. This guarantees spectral convergence rates for the corresponding Hermite expansion. Our analysis provides theoretical insight into the convergence behavior of adaptive Hermite approximations based on normalizing flows, as recently explored in the computational quantum physics literature.
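The equivalence principle is easy to probe numerically: fitting $f$ in the span of the transformed basis $\{\psi_k \circ T\}$ uses the same design matrix as fitting the pullback $f \circ T^{-1}$ in the plain Hermite functions $\psi_k$, so a monotone $T$ that matches the target's decay to Gaussian decay improves accuracy at fixed truncation. A sketch with a slowly decaying target and $T = \operatorname{arcsinh}$ (our example, not the paper's construction):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermvander

def hermite_functions(y, n):
    """Hermite functions psi_k(y) = H_k(y) exp(-y^2/2) / sqrt(2^k k! sqrt(pi))."""
    V = hermvander(y, n - 1)  # physicists' Hermite polynomials H_0..H_{n-1}
    norms = np.array([np.sqrt(2.0**k * factorial(k) * np.sqrt(np.pi))
                      for k in range(n)])
    return V * np.exp(-y[:, None] ** 2 / 2.0) / norms

f = lambda x: 1.0 / (1.0 + x**2)   # decays only algebraically
T = np.arcsinh                     # monotone map that compresses the tails

x = np.linspace(-20.0, 20.0, 2001)
n = 40

# (a) least-squares fit in the plain Hermite basis: the Gaussian-decaying
# basis cannot track the algebraic tails of f
A_plain = hermite_functions(x, n)
c_plain = np.linalg.lstsq(A_plain, f(x), rcond=None)[0]
err_plain = np.max(np.abs(A_plain @ c_plain - f(x)))

# (b) fit in the transformed basis psi_k(T(x)); by the equivalence
# principle this is a plain Hermite fit of the pullback f o T^{-1} = sech^2
A_trans = hermite_functions(T(x), n)
c_trans = np.linalg.lstsq(A_trans, f(x), rcond=None)[0]
err_trans = np.max(np.abs(A_trans @ c_trans - f(x)))

print(err_plain, err_trans)   # the adapted coordinates fit far better here
```

Here the pullback $f \circ \sinh = \operatorname{sech}^2$ is smooth with exponential decay, so its Hermite coefficients decay rapidly, whereas the untransformed target leaves an irreducible tail error at the same truncation, which is the convergence-enhancement effect the analysis quantifies.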
Authors: Paul A. Constable, Dorothy A. Thompson, Irene O. Lee, Lynne Loh, Aleksei Zhdanov, Mikhail Kulyabin, Andreas Maier
Abstract: The LEOPs (Light-ERG-Oscillatory Potentials) dataset provides light-adapted (LA) electroretinogram (ERG) and Oscillatory Potentials (OPs) waveforms for typically developing Control, Autism Spectrum Disorder (ASD) and ASD + Attention Deficit Hyperactivity Disorder (ADHD) childhood and adolescent populations. The ERGs were recorded in the right and left eyes with skin electrodes using the handheld RETeval device at two sites in Australia and the United Kingdom. The LEOPs dataset includes 5309 single flash ERG and 4434 OPs waveforms as well as images selected from each participant showing the position of the skin electrode. The LEOPs dataset is constructed from recordings using a 9-step randomized flash series from $-0.37$ to $1.20$~$Td.s$, a 2-step series at 113 and 446 $Td.s$ flash strengths (2500 Control, 1730 ASD and 451 ASD + ADHD samples), as well as the $85$~$Td.s$ (Light Adapted 3 $cd.s.m^{-2}$ (LA3)) equivalent International Society of Clinical Electrophysiology of Vision (ISCEV) Standard flash with 435 Control, 176 ASD and 37 ASD + ADHD waveform samples. Code for the stimulus is provided along with participant demographics, date and time of testing, and, where available, diagnostic scores for the ASD and ASD + ADHD groups, alongside iris color, electrode position with image files, time domain values for the ERG and summed values for the OPs. The repository contains an Excel file, exported JSON files at the patient level that are more suitable for machine learning tasks, images of electrode position for each recording, and the protocol files for use with the RETeval device.
Authors: Jan Greb\'ik, Pavel Hub\'a\v{c}ek, Martin Kouteck\'y, Mat\v{e}j Kripner, V\'aclav Rozho\v{n}, Robert \v{S}\'amal, Adri\'an Z\'ame\v{c}n\'ik
Abstract: We report new results on six problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., four of the six results reach the level of publishable research, and three of the six were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.
Authors: Yifu Huo, Chenglong Wang, Ziming Zhu, Shunjie Xing, Peinan Feng, Tongran Liu, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
Authors: Antonio Valerio Miceli Barone, Poon Tsz Nok
Abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release \textbf{OpInstruct-HSx}, a synthetic dataset of $\approx$28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face respectively.
Authors: Tianbao Zhang
Abstract: Large Language Models (LLMs) produce a controllability gap in safety-critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self-correction [Huang et al., 2024]. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open-loop generation to closed-loop Fail-Safe Determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine-readable registries enforced by a deterministic Unified Assertion Interface (UAI); and (3) Structured Semantic Gradients with State Locking for monotonic convergence. Empirical evaluation across two domains -- SAE Level 3 (L3) autonomous driving (AD) (n=30, 7 conditions) and pharmaceutical continuous flow reactor design (n=20, 4 conditions including a Mono+UAI ablation) -- shows that CAAF-all-GPT-4o-mini achieves 100% paradox detection while monolithic GPT-4o achieves 0% (even at temperature=0). The pharmaceutical benchmark features 7 simultaneous constraints with nonlinear Arrhenius interactions and a 3-way minimal unsatisfiable subset, representing a structurally harder challenge than the 2-constraint AD paradox. Alternative multi-agent architectures (debate, sequential checking) also achieve 0% across 80 trials, confirming that CAAF's reliability derives from its deterministic UAI, not from multi-agent orchestration per se. A Mono+UAI ablation (95%) isolates UAI as the core contribution. CAAF's reliability is invariant to prompt hints; all components use a single commodity model, enabling fully offline deployment.
Authors: Khizar Anjum, Tingcong Jiang, Dario Pompili
Abstract: We present E2E-WAVE, the first end-to-end learned waveform generation system for underwater video multicasting. Acoustic channels exhibit 20--46% bit error rates where forward error correction becomes counterproductive -- LDPC increases rather than decreases errors beyond its decoding threshold. E2E-WAVE addresses this by embedding semantic similarity directly into physical layer waveforms: when decoding errors are unavoidable, the system preferentially selects semantically similar tokens rather than arbitrary corruption. Combining VideoGPT tokenization (1024x compression) with a trainable waveform bank and fully differentiable OFDM transmission, E2E-WAVE achieves +5 dB (19.26%) PSNR and +0.10 (14.28%) SSIM over the strongest FEC-protected baseline in the less challenging underwater channel (NOF1) while delivering real-time 16 FPS video at 128x128 resolution over 2.3 kbps channels -- impossible for conventional digital modulation. The performance gap only increases in harsher channels (BCH1, NCS1). Trained on a single channel, E2E-WAVE generalizes to unseen underwater environments without retraining, while HEVC fails at sub-5 kbps rates and SoftCast's AWGN assumptions collapse on frequency-selective channels.
Authors: Faris Chaudhry, Anthea Monod, Keisuke Yano
Abstract: Linear convergence of first-order methods is typically characterized by global optimization conditions whose constants reflect worst-case geometry of the ambient space. In high-dimensional or structured problems, these global constants can be arbitrarily conservative and fail to capture the geometry actually encountered by optimization trajectories. In this paper, we develop a trajectory-restricted framework for linear convergence based on localized geometric regularity. We introduce restricted variants of the Polyak--{\L}ojasiewicz inequality, error bound, and quadratic growth conditions that are required to hold only on subsets of the domain. We show that classical convergence guarantees extend under these localized conditions, and in key cases, we develop new arguments that yield explicit relationships between the corresponding constants. The resulting rates are governed by geometric quantities associated with the regions traversed by the algorithm. For polyhedral composite problems, we prove that convergence is controlled by restricted Hoffman constants corresponding to the active polyhedral faces visited along the trajectory. Once the iterates enter a well-conditioned face, the effective condition number improves accordingly. Our work provides a geometric quantification for fast local convergence after active-set or manifold identification and more broadly suggests that linear convergence is fundamentally governed by the geometry of the subsets explored by the algorithm, rather than by worst-case global conditioning.
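For context, the classical (unrestricted) Polyak--{\L}ojasiewicz argument that the abstract localizes takes the following standard form; the paper's restricted variants replace the global constant $\mu$ with a constant $\mu_S$ valid only on a subset $S$ traversed by the iterates (the symbols below are the standard ones, not taken from the paper):

```latex
\[
  f(x) - f^\star \;\le\; \frac{1}{2\mu_S}\,\lVert \nabla f(x) \rVert^2
  \quad \text{for all } x \in S,
\]
so if gradient descent with step size $1/L$ keeps its iterates inside $S$,
the usual one-step PL argument yields the linear rate
\[
  f(x_{t+1}) - f^\star \;\le\; \Bigl(1 - \tfrac{\mu_S}{L}\Bigr)\bigl(f(x_t) - f^\star\bigr),
\]
with effective condition number $L/\mu_S$ rather than the global $L/\mu$.
```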
Authors: Yue Wu, Jian Huang
Abstract: Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
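The scoring idea in the abstract can be sketched concretely. Below, a token's raw confidence is down-weighted by $\exp(-\mathrm{KL})$ of its consecutive prediction distributions; the exponential modulator and its direction are illustrative assumptions (the abstract specifies only that temporal KL divergence quantifies instability), not the paper's exact formula.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def stability_weighted_scores(conf, prev_dists, curr_dists):
    """Down-weight each masked position's confidence by its temporal
    instability, measured as KL between consecutive prediction
    distributions (exp(-KL) is one plausible modulator)."""
    return [c * math.exp(-kl(q, p))
            for c, p, q in zip(conf, prev_dists, curr_dists)]

# Two masked positions with identical raw confidence: one stable, one unstable.
prev = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
curr = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
scores = stability_weighted_scores([0.7, 0.7], prev, curr)
# The stable position keeps its score; the unstable one is suppressed,
# so it is unmasked later despite equal confidence.
```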
Authors: Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na
Abstract: Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
Authors: Tingfeng Lan, Zirui Wang, Yunjia Zheng, Zhaoyuan Su, Juncheng Yang, Yue Cheng
Abstract: Modern AI models are growing rapidly in size and redundancy, leading to significant storage and distribution challenges in model hubs. We present TensorHub, a tensor-centric system for reducing storage overhead through fine-grained deduplication and compression. TensorHub leverages tensor-level fingerprinting and clustering to identify redundancy across models without requiring annotations. Our design enables efficient storage reduction while preserving model usability and performance. Experiments on real-world model repositories demonstrate substantial storage savings with minimal overhead.
Authors: Emadeldeen Hamdan, Gorkem Durak, Muhammed Enes Tasci, Abel Lorente Campos, Aritrick Chatterjee, Roger Engelmann, Gregory Karczma, Aytekin Oto, Ahmet Enis Cetin, Ulas Bagci
Abstract: Magnetic Resonance Imaging (MRI) is vital for prostate cancer (PCa) diagnosis. While advanced techniques such as Hybrid Multi-dimensional MRI (HM-MRI) have enhanced diagnostic capabilities, a significant need remains for robust, automated Artificial Intelligence (AI)-based detection methods. In this study, we combine quantitative HM-MRI of tissue composition with an AI-based neural network. We propose the Hadamard-Bias Network plus ResNet18 (HBR-Net-18), a two-stage AI framework for PCa detection. In the first stage, a Hadamard U-Net-based algorithm suppresses intensity inhomogeneities (bias fields) across six parametric HM-MRI maps generated via a Physics-Informed Autoencoder (PIA). In the second stage, a Residual Network (ResNet-18) performs patch-level classification. The framework utilizes overlapping 11-by-11 patches, incorporating both 2D intra-slice and 3D inter-slice (adjacent-slice) information to improve spatial consistency. Our experimental results demonstrate that HBR-Net-18 achieves balanced sensitivity and specificity, significantly outperforming conventional radiomics-based approaches and baseline CNN models, highlighting its potential for clinical deployment.
Authors: Konrad Furmanczyk, Kacper Paczutkowski
Abstract: The present study aims to investigate a cluster cleaning algorithm that is both computationally simple and capable of solving PU classification when the SCAR condition is unsatisfied. A secondary objective of this study is to determine the robustness of the LassoJoint method to perturbations of the SCAR condition. In the first step of our algorithm, we obtain cleaning labels from 2-means clustering. Subsequently, we perform logistic regression on the cleaned data, assigning positive labels from the cleaning algorithm with additional true positive observations. The remaining observations are assigned the negative label. The proposed algorithm is evaluated on 11 real data sets from machine learning repositories and one synthetic set. The findings obtained from this study demonstrate the efficacy of the clustering algorithm in scenarios where the SCAR condition is violated and further underscore the moderate robustness of the LassoJoint algorithm in this context.
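The clustering step of the pipeline above can be sketched as follows. A logistic regression on the cleaned labels would follow as the second step; the one-dimensional score representation and the tie-breaking rule are assumptions made for illustration, not details from the paper.

```python
def two_means(xs, iters=25):
    """Plain 2-means on one-dimensional scores; returns both centroids."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return c0, c1

def clean_labels(scores, labeled_pos_scores):
    """Cleaning step: cluster the scores with 2-means and label the
    cluster nearer to the known positives as positive (1), the rest
    as negative (0)."""
    c0, c1 = two_means(scores)
    pos_mean = sum(labeled_pos_scores) / len(labeled_pos_scores)
    pos_c, neg_c = (c0, c1) if abs(c0 - pos_mean) < abs(c1 - pos_mean) else (c1, c0)
    return [1 if abs(s - pos_c) <= abs(s - neg_c) else 0 for s in scores]

# Well-separated toy scores; one known positive at 0.9 anchors the labeling.
labels = clean_labels([0.1, 0.2, 0.15, 0.9, 0.95, 0.85], [0.9])
```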
Authors: Yilan Qin, Chuanfei Dong, Hongyang Zhou, Chi Zhang, Kaichun Xu, Jiawei Gao, Simin Shekarpaz, Xinmin Li, Liang Wang
Abstract: The plasma environment around Mars is highly variable because it is strongly influenced by the solar wind. Accurate identification of plasma regions around Mars is important for the community studying solar wind-Mars interactions, region-specific plasma processes, and atmospheric escape. In this study, we develop a machine-learning-based classifier to automatically identify three key plasma regions--solar wind, magnetosheath, and induced magnetosphere--using only ion omnidirectional energy spectra measured by the MAVEN Solar Wind Ion Analyzer (SWIA). Two neural network architectures are evaluated: a multilayer perceptron (MLP) and a convolutional neural network (CNN) that incorporates short temporal sequences. Our results show that the CNN can reliably distinguish the three plasma regions, whereas the MLP struggles to separate the solar wind and magnetosheath. Therefore, the CNN-based approach provides an efficient and accurate framework for large-scale plasma region identification at Mars and can be readily applied to future planetary missions.
Authors: Oliver E. Richardson, Mandana Samiei, Mehran Shakerinava, Joseph D. Viviano, Abdessamad El Kabid, Ali Parviz, Yoshua Bengio
Abstract: We present a generic algorithm for learning and approximate inference with an intuitive epistemic interpretation: iteratively focus on a subset of the model and resolve inconsistencies using the parameters under control. This framework, which we call Local Inconsistency Resolution (LIR), is built upon Probabilistic Dependency Graphs (PDGs), which provide a flexible representational foundation capable of capturing inconsistent beliefs. We show how LIR unifies and generalizes a wide variety of important algorithms in the literature, including the Expectation-Maximization (EM) algorithm, belief propagation, adversarial training, GANs, and GFlowNets. In the last case, LIR actually suggests a more natural loss, which we demonstrate improves GFlowNet convergence. Each method can be recovered as a specific instance of LIR by choosing a procedure to direct focus (attention and control). We implement this algorithm for discrete PDGs and study its properties on synthetically generated PDGs, comparing its behavior to the global optimization semantics of the full PDG.
Authors: Henry Shugart, Shuyi Wang, Jason M. Altschuler
Abstract: This paper revisits momentum in the context of min-max optimization. Momentum is a celebrated mechanism for accelerating gradient dynamics in settings like convex minimization, but its direct use in min-max optimization makes gradient dynamics diverge. Surprisingly, Gidel et al. 2019 showed that negative momentum can help fix convergence. However, despite these promising initial results and progress since, the power of momentum remains unclear for min-max optimization in two key ways. (1) Generality: is global convergence possible for the foundational setting of convex-concave optimization? This is the direct analog of convex minimization and is a standard testing ground for min-max algorithms. (2) Fast convergence: is accelerated convergence possible for strongly-convex-strongly-concave optimization (the only non-linear setting where global convergence is known)? Recent work has even argued that this is impossible. We answer both these questions in the affirmative. Together, these results put negative momentum on more equal footing with competitor algorithms, and show that negative momentum enables convergence significantly faster and more generally than was known possible.
Authors: Yilong Dai, Yiming Sun, Yiheng Chen, Shengyu Chen, Xiaowei Jia, Runlong Yu
Abstract: Accurate autoregressive prediction of 3D turbulent flows remains challenging for neural PDE solvers, as small errors in fine-scale structures can accumulate rapidly over rollout. In this paper, we propose FlowRefiner, a flow matching-based iterative refinement framework for 3D turbulent flow simulation. The method replaces stochastic denoising refinement with deterministic ODE-based correction, uses a unified velocity-field regression objective across all refinement stages, and introduces a decoupled sigma schedule that fixes the noise range independently of refinement depth. These design choices yield stable and effective refinement in the small-noise regime. Experiments on large-scale 3D turbulence with rich multi-scale structures show that FlowRefiner achieves state-of-the-art autoregressive prediction accuracy and strong physical consistency. Although developed for turbulent flow simulation, the proposed framework is broadly applicable to iterative refinement problems in scientific modeling.
Authors: Yijun Wang, Mihai B\^ace, Maria Torres Vega
Abstract: The occurrence of cybersickness in virtual reality (VR) significantly impairs users' perception and sense of immersion. Therefore, timely detection of cybersickness and the application of appropriate intervention strategies are crucial for enhancing the user experience. However, existing cybersickness detection methods often suffer from issues such as poor detection reliability across different levels of cybersickness and unnecessary model complexity. Furthermore, while cybersickness exhibits significant inter-user variability, most existing approaches aggregate all data from users and lack user-specific solutions. In this paper, we investigate a lightweight approach for cybersickness detection incorporating an ensemble learning model and user-specific eye and head tracking data. Our experiments using the open-source dataset Simulation 2021 demonstrate that feature engineering and training set construction are critical for determining detection performance. Models trained with data from similar-content segments achieve the best results, attaining detection accuracies of 93% in the cross-user setting and 88% in the user-personalized setting, using only 23-dimensional eye and head features. Moreover, by using user-specific data, well-tuned ensemble learning models with shorter training and inference times can be feasibly applied to real-world cybersickness detection, offering superior time efficiency and outstanding detection performance. This work offers useful evidence toward the development of lightweight and user-adaptive cybersickness detection models for VR applications.
Authors: Nima Afsharhajari, Jonathan Yu-Meng Li
Abstract: Sparsity or complexity? In modern high-dimensional asset pricing, these are often viewed as competing principles: richer feature spaces appear to favor complexity, while economic intuition has long favored parsimony. We show that this tension is misplaced. We distinguish capacity sparsity (the dimensionality of the candidate feature space) from factor sparsity (the parsimonious structure of priced risks) and argue that the two are complements: expanding capacity enables the discovery of factor sparsity. Revisiting the benchmark empirical design of Didisheim et al. (2025) and pushing it to higher complexity regimes, we show that nonlinear feature expansions combined with basis pursuit yield portfolios whose out-of-sample performance dominates ridgeless benchmarks beyond a critical complexity threshold. The evidence shows that the gains from complexity arise not from retaining more factors, but from enlarging the space from which a sparse structure of priced risks can be identified. The virtue of complexity in asset pricing operates through factor sparsity.
Authors: Daniel Aronoff, Kristian Praizner, Armin Sabouri
Abstract: Bitcoin transaction fees will become more important as the block subsidy declines, but fee formation is hard to study with blockchain data alone because the relevant queueing environment is unobserved. We develop and estimate a structural model of Bitcoin fee choice that treats the mempool as a market for scarce blockspace. We assemble a novel, high-frequency mempool panel from a self-run Bitcoin node that records transaction arrivals, exits, block inclusion, fee-bumping events, and congestion snapshots. We characterize the fee market as a Vickrey-Clarke-Groves mechanism and derive an equation to estimate fees. In the first stage, we estimate a monotone delay technology linking fee-rate priority and network state to expected confirmation delay. We then estimate how fees respond to that delay technology and to transaction characteristics. We find that congestion is the main determinant of delay; that the marginal value of priority is priced in fees, which is increasing in the gradient of confirmation time reduction per movement up in the fee queue; and that transactors' choices of RBF, CPFP, and block conditions have economically important effects on fees.
Authors: Yifan Zhang, Jieyu Li, Kexin Pei, Yu Huang, Kevin Leach
Abstract: Large Language Models (LLMs) show promise for automated code repair but often struggle with the complex semantic and structural correctness required. We present SynthFix, a hybrid neural-symbolic framework that improves LLM-based vulnerability repair by unifying code synthesis with compiler-informed symbolic feedback. The core of our approach is an adaptive training strategy where a neural Router Model directs code samples to either Supervised Fine-Tuning (SFT) to learn common patterns or Reward Fine-Tuning (RFT) with symbolic rewards for complex, iterative refinement. On the FixJS (JavaScript) and CodeFlaws (C) benchmarks, SynthFix achieves up to 18% relative improvement in CodeBLEU/CrystalBLEU and 32% in Exact Match over strong SFT and RFT baselines. Our results show that this adaptive combination of training strategies, which mirrors how developers alternate between pattern application and tool feedback, significantly improves the accuracy and efficiency of LLM-based vulnerability repair. Our code and data are available at https://github.com/CoderDoge1108/SynthFix.
Authors: Kaito Goto, Naoya Takeishi, Takehisa Yairi
Abstract: Converting betting odds into accurate outcome probabilities is a fundamental challenge in order to use betting odds as a benchmark for sports forecasting and market efficiency analysis. In this study, we propose two methods to overcome the limitations of existing conversion methods. Firstly, we propose an odds-only method to convert betting odds to probabilities without using historical data for model fitting. Existing odds-only methods, such as Multiplicative, Shin, and Power, do not adjust for biases or relationships we found in our betting odds dataset, which consists of 90,014 football matches across five different bookmakers. To overcome these limitations, our proposed Odds-Only-Equal-Profitability-Confidence (OO-EPC) method aligns with the bookmakers' pricing objectives of having equal confidence in profitability for each outcome. We provide empirical evidence from our betting odds dataset that, for the majority of bookmakers, our proposed OO-EPC method outperforms the existing odds-only methods. Beyond controlled experiments, we applied the OO-EPC method under real-world uncertainty by using it for six iterations of an annual basketball outcome forecasting competition. Secondly, we propose a generalised linear model that utilises historical data for model fitting and then converts betting odds to probabilities. Existing generalised linear models attempt to capture relationships that the Efficient Market Hypothesis already captures. To overcome this shortcoming, our proposed Favourite-Longshot-Bias-Adjusted Generalised Linear Model (FL-GLM) fits just one parameter to capture the favourite-longshot bias, providing a more interpretable alternative. We provide empirical evidence from historical football matches where, for all bookmakers, our proposed FL-GLM outperforms the existing multinomial and logistic generalised linear models.
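For reference, the simplest of the baseline methods named above, the Multiplicative method, divides each inverse odd by the bookmaker's overround so the implied probabilities sum to one. The sketch below shows only this well-known baseline; the paper's OO-EPC method is not specified in the abstract and is not implemented here.

```python
def multiplicative(odds):
    """Multiplicative (basic normalization) conversion: divide each
    inverse decimal odd by the bookmaker's overround so that the
    implied probabilities sum to exactly 1."""
    inv = [1.0 / o for o in odds]
    overround = sum(inv)          # > 1 whenever the bookmaker takes a margin
    return [p / overround for p in inv]

# Decimal odds for home win / draw / away win; inverse odds sum to ~1.07,
# i.e. about a 7% bookmaker margin before normalization.
probs = multiplicative([1.5, 4.2, 6.0])
```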
Authors: Chenyang Wang, Yun Yang
Abstract: We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.
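For context, the central objects of the abstract take the following standard form; this is the generic Catoni-style PAC-Bayes statement, not the paper's refined bound. A Gibbs posterior tilts a prior $\pi$ by the empirical risk $\hat R_n$:

```latex
\[
  \hat\pi_\lambda(\theta) \;\propto\; \exp\bigl(-\lambda n\,\hat R_n(\theta)\bigr)\,\pi(\theta),
\]
and with probability at least $1-\delta$ over the sample, the
posterior-averaged population risk satisfies
\[
  \mathbb{E}_{\theta\sim\hat\pi_\lambda}\bigl[R(\theta)\bigr]
  \;\le\;
  \mathbb{E}_{\theta\sim\hat\pi_\lambda}\bigl[\hat R_n(\theta)\bigr]
  \;+\; \frac{\mathrm{KL}(\hat\pi_\lambda \,\|\, \pi) + \log(1/\delta)}{\lambda n}.
\]
```

The paper's contribution, per the abstract, is to make the KL-type complexity term explicit via singular learning theory rather than via metric entropy.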
Authors: Hiroki Fujii, Masaki Yamakita
Abstract: Selective State Space Models (SSMs), notably Mamba, employ diagonal state transitions that limit both memory retention and bilinear computational capacity. We propose a factorized bilinear input modulation that augments the SSM with a state-input product, interpretable as a finite-dimensional Koopman bilinear form. After introducing a shared state across channels (Coupled SSM), the modulation admits two implementations. Coupled Bilinear Input Modulation (Coupled-BIM) retains the full bilinear product at the cost of sequential computation, while Coupled Gated Modulation (Coupled-GM) linearizes it into a gate modulation that is compatible with the parallel scan. Experiments on a multiple input-delay pendulum (memory retention) and NARMA-10 (bilinear computation) reveal a clear dissociation. Coupled-GM substantially improves memory retention but not bilinear computation, while Coupled-BIM improves both. A pathway ablation confirms that the two downstream routes of the bilinear signal serve complementary roles. The improvement is statistically robust, with Coupled-BIM consistently outperforming all other variants on bilinear computation. Furthermore, only Coupled-BIM benefits from increasing the SSM state dimension, while coupling or gate modulation alone shows no improvement, establishing the bilinear mechanism as uniquely capable of exploiting larger state spaces.
Authors: Yuji Yamamoto, Satoshi Matsuura
Abstract: Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching, these blocks exist as a single physical copy without integrity protection. Using software fault injection under ideal bit targeting, we characterize worst-case severity and identify three properties: (1) Silent divergence - 13 of 16 BF16 bit positions produce coherent but altered outputs, indistinguishable from legitimate responses without a clean baseline. (2) Selective propagation - only requests sharing the targeted prefix are affected. (3) Persistent accumulation - no temporal decay occurs, so cumulative damage grows linearly with subsequent requests. Together, these constitute a threat profile distinct from weight corruption: silent divergence and selective propagation enable detection evasion; persistent accumulation then proceeds unchecked, yielding damage amplification bounded only by how long the block remains cached. A checksum-based countermeasure detects any single-bit corruption at scheduling time, bounding cumulative damage to one batch independent of the block's cache lifetime, with negligible overhead. These results argue for integrity protection of prefix blocks before end-to-end exploitation is demonstrated.
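The checksum countermeasure described above can be sketched in a few lines. The `ProtectedBlock` class and its layout are hypothetical stand-ins for a shared prefix-cache block; the key property used is real: CRC32 detects every single-bit error, so verification at scheduling time catches any single flip before reuse.

```python
import zlib

class ProtectedBlock:
    """Hypothetical shared cache block guarded by a CRC32 recorded at
    write time and checked at scheduling time, before any request is
    allowed to reuse the block."""
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.crc = zlib.crc32(data)

    def verify(self) -> bool:
        """Scheduling-time integrity check."""
        return zlib.crc32(bytes(self.data)) == self.crc

block = ProtectedBlock(b"\x00" * 4096)
assert block.verify()            # pristine block passes
block.data[100] ^= 0x04          # single-bit flip, as Rowhammer would cause
assert not block.verify()        # detected; damage bounded to one batch
```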
Authors: Sheldon Paul, Izzat Alsmadi
Abstract: Assessing the security posture of modern computing systems typically requires the use of multiple specialized tools. These tools focus on different aspects such as configuration compliance, file integrity, and vulnerability exposure, and their outputs are often difficult to interpret collectively. This paper introduces the Unified Compliance Aggregator (UCA), a framework that integrates several open-source security tools into a single composite score representing overall system security. The proposed framework combines outputs from Lynis, OpenSCAP (STIG and CIS profiles), AIDE, Tripwire, and Nmap NSE. A normalization process converts heterogeneous outputs into a consistent 0 to 100 scale, followed by weighted aggregation. We also introduce a logarithmic scoring model for file integrity measurements to address limitations observed in prior linear approaches. Experiments were conducted on Ubuntu 22.04 across different hardening levels and environments. Results show consistent improvement in composite scores as systems are hardened, while also revealing contrasting behavior between compliance and file integrity tools. Two case studies, a basic web server and a DVWA-based system, illustrate how the framework can be applied in practical scenarios.
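One plausible form of the logarithmic scoring model mentioned above is sketched below. The constant `k` and the exact functional form are assumptions for illustration (the abstract does not give the formula); the point it demonstrates is the stated contrast with linear scoring: the first few integrity violations cost a lot, later ones progressively less, instead of the score hitting zero after a fixed violation count.

```python
import math

def integrity_score(violations: int, k: float = 15.0) -> float:
    """Map a file-integrity violation count to a 0-100 score on a
    logarithmic scale. log1p keeps a clean violation count at exactly
    100; the floor at 0 keeps the score in range for large counts."""
    return max(0.0, 100.0 - k * math.log1p(violations))

# 0 violations -> 100.0; the penalty per additional violation shrinks
# as the count grows, unlike a linear model.
```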
Authors: Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu
Abstract: Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models' internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi-Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
Authors: Kyeongman Park, Minha Jhang, Kyomin Jung
Abstract: Modern generative models still lack human-level creativity, particularly in multi-branch diversity. Prior approaches to address this problem often incur heavy computation or strong dependency on model architecture. Therefore, we introduce UAG (Universal Avoidance Generation), a model-agnostic and computationally efficient generation strategy that penalizes similarity among previously generated outputs. Thus, UAG can enhance multi-branch diversity across both diffusion and transformer models, with minimal additional computation. In experiments, our method achieves up to 1.9 times higher diversity, runs 4.4 times faster, and requires only 1/64 of the FLOPs compared to state-of-the-art methods. The full code is available at https://anonymous.4open.science/r/2026_ACL_Universal/.
URLs: https://anonymous.4open.science/r/2026_ACL_Universal/.
Authors: K. A. Shahriar
Abstract: Time series forecasting is traditionally dominated by sequence-based architectures such as recurrent neural networks and attention mechanisms, which process all time steps uniformly and often incur substantial computational cost. However, real-world temporal signals typically exhibit heterogeneous structure, where informative patterns are sparsely distributed and interspersed with redundant observations. This work introduces \textbf{SPaRSe-TIME}, a structured and computationally efficient framework that models time series through a decomposition into three complementary components: saliency, memory, and trend. The proposed approach reformulates temporal modeling as a projection onto informative subspaces, where saliency acts as a data-dependent sparsification operator, memory captures dominant low-rank temporal patterns, and trend encodes low-frequency dynamics. These components are integrated through a lightweight, adaptive mapping that enables simplified, selective, and interpretable temporal reasoning. Extensive experiments on diverse real-world datasets demonstrate that SPaRSe-TIME achieves competitive predictive performance compared to recurrent and attention-based architectures, while significantly reducing computational complexity. The model is particularly effective in structured time series with clear temporal components and provides explicit interpretability through component-wise contributions. Furthermore, analysis reveals both the strengths and limitations of decomposition-based modeling, highlighting challenges in highly stochastic and complex multivariate settings. Overall, SPaRSe-TIME offers a principled alternative to monolithic sequence models, bridging efficiency, interpretability, and performance, and providing a scalable framework for time series learning.
Authors: Anis Hamadouche, Mathini Sellathurai
Abstract: This paper investigates communication-efficient neural network transmission by exploiting structured symmetry constraints in convolutional kernels. Instead of transmitting all model parameters, we propose a degrees-of-freedom (DoF) based codec that sends only the unique coefficients implied by a chosen symmetry group, enabling deterministic reconstruction of the full weight tensor at the receiver. The proposed framework is evaluated under quantization and noisy channel conditions across multiple symmetry patterns, signal-to-noise ratios, and bit-widths. To improve robustness against transmission impairments, a projection step is further applied at the receiver to enforce consistency with the symmetry-invariant subspace, effectively denoising corrupted parameters. Experimental results on MNIST and CIFAR-10 using a DeepCNN architecture demonstrate that DoF-based transmission achieves substantial bandwidth reduction while preserving significantly higher accuracy than pruning-based baselines, which often suffer catastrophic degradation. Among the tested symmetries, \textit{central-skew symmetry} consistently provides the best accuracy-compression tradeoff, confirming that structured redundancy can be leveraged for reliable and efficient neural model delivery over constrained links.
Authors: Kaliki V Srinanda, M Manvith Prabhu, Hemanth K Mogilipalem, Jayavarapu S Abhinai, Vaibhav Santhosh, Aryan Herur, Deepu Vijayasenan
Abstract: Detecting deepfake images is increasingly challenging because of the fast evolution of modern generative models and the poor generalization capability of existing methods. In this paper, we use an ensemble of fine-tuned vision transformers, namely DINOv2, AIMv2, and OpenCLIP's ViT-L/14, to create a generalizable method for detecting deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025, because it uses a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% in AUC and 8% in EER. This was the winning solution for the SP Cup, presented at ICASSP 2025.
Authors: Yuan-Hao Wei
Abstract: This paper proposes StrEBM, a structured latent energy-based model for source-wise structured representation learning. The framework is motivated by a broader goal of promoting identifiable and decoupled latent organization by assigning different latent dimensions their own learnable structural biases, rather than constraining the entire latent representation with a single shared energy. In this sense, blind source separation is adopted here as a concrete and verifiable testbed, through which the evolution of latent dimensions toward distinct underlying components can be directly examined. In the proposed framework, latent trajectories are optimized directly together with an observation-generation map and source-wise structural parameters. Each latent dimension is associated with its own energy-based formulation, allowing different latent components to gradually evolve toward distinct source-like roles during training. In the present study, this source-wise energy design is instantiated using Gaussian-process-inspired energies with learnable length-scales, but the framework itself is not restricted to Gaussian processes and is intended as a more general structured latent EBM formulation. Experiments on synthetic multichannel signals under linear and nonlinear mixing settings show that the proposed model can recover source components effectively, providing an initial empirical validation of the framework. At the same time, the study reveals important optimization characteristics, including slow late-stage convergence and reduced stability under nonlinear observation mappings. These findings not only clarify the practical behavior of the current GP-based instantiation, but also establish a basis for future investigation of richer source-wise energy families and more robust nonlinear optimization strategies.
Authors: Nick Andreasyan, Mikhail Struve, Alexey Popov, Maksim Nikolaev, Vadim Vashkelis
Abstract: RISC-V is emerging as a viable platform for automotive-grade embedded computing, with recent ISO 26262 ASIL-D certifications demonstrating readiness for safety-critical deployment in autonomous driving systems. However, functional safety in automotive systems is fundamentally a certification problem rather than a processor problem. The dominant costs arise from diagnostic coverage analysis, toolchain qualification, fault injection campaigns, safety-case generation, and compliance with ISO 26262, ISO 21448 (SOTIF), and ISO/SAE 21434. This paper analyzes the role of RISC-V in automotive functional safety, focusing on ISA openness, formal verifiability, custom extension control, debug transparency, and vendor-independent qualification. We examine autonomous driving safety requirements and map them to RISC-V architectural challenges such as lockstep execution, safety islands, mixed-criticality isolation, and secure debug. Rather than proposing a single algorithmic breakthrough, we present an analytical framework and research roadmap centered on certification economics as the primary optimization objective. We also discuss how selected ML methods, including LLM-assisted FMEDA generation, knowledge-graph-based safety case automation, reinforcement learning for fault injection, and graph neural networks for diagnostic coverage, can support certification workflows. We argue that the strongest outcome is not a faster core, but an ASIL-D-ready certifiable RISC-V platform.
Authors: Chuyue Wang, Jie Feng, Yuxi Wu, Shenglin Yi, Hang Zhang
Abstract: Human mobility prediction is a critical task but remains challenging due to its complexity and variability across populations and regions. Recently, large language models (LLMs) have made progress in zero-shot prediction, but existing methods suffer from limited interpretability (due to black-box reasoning), lack of iterative learning from new data, and poor transferability. In this paper, we introduce \textbf{ARMove}, a fully transferable framework for predicting human mobility through agentic reasoning. To address these limitations, ARMove employs standardized feature management with iterative optimization and user-specific customization: four major feature pools for foundational knowledge, user profiles for segmentation, and an automated generation mechanism integrating LLM knowledge. Robust generalization is achieved via agentic decision-making that adjusts feature weights to maximize accuracy while providing interpretable decision paths. Finally, large-small model synergy distills strategies from large LLMs (e.g., 72B) to smaller ones (e.g., 7B), reducing costs and enhancing performance ceilings. Extensive experiments on four global datasets show ARMove outperforms state-of-the-art baselines on 6 out of 12 metrics (gains of 0.78\% to 10.47\%) and achieves second-best results on the remaining 4. Transferability tests confirm its robustness across regions, user groups, and model scales, while interpretability analysis highlights its transparency in decision-making. Our codes are available at: https://anonymous.4open.science/r/ARMove-F847.
Authors: Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza
Abstract: Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC, by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
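A minimal sketch of early-stopped hybrid ensembling, under illustrative assumptions: `sample_cot` and `sample_pot` stand in for drawing one answer from each reasoning mode, and the two-vote agreement rule is a simplified stand-in for the paper's actual stopping strategies.

```python
from collections import Counter

def hybrid_sc(sample_cot, sample_pot, max_samples=8):
    """Hybrid self-consistency sketch: alternate CoT and PoT samples and stop
    as soon as any answer collects two votes; fall back to majority voting
    over all samples otherwise. Returns (answer, samples_used)."""
    votes = Counter()
    samplers = [sample_cot, sample_pot]
    for i in range(max_samples):
        ans = samplers[i % 2]()      # alternate the two reasoning modes
        votes[ans] += 1
        if votes[ans] >= 2:          # two agreeing samples suffice to stop
            return ans, i + 1
    return votes.most_common(1)[0][0], max_samples
```

When the two modes agree on the first pair, the loop terminates after only two samples, which is the regime the abstract reports for the majority of tasks.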
Authors: Ashutosh Bajpai, Tamal Majumder, Akshay Nambi, Tanmoy Chakraborty
Abstract: Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Coldstart Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
Authors: Qihao Shen, Jiaxing Xuan, Zhenguang Liu, Sifan Wu, Yutong Xie, Zhaoyan Ming, Yingying Jiao, Kui Ren
Abstract: Advanced deepfake technologies are blurring the lines between real and fake, presenting both revolutionary opportunities and alarming threats. While they unlock novel applications in fields like entertainment and education, their malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emerged as a promising direction for deepfake detection. However, one aspect that has been overlooked so far is that existing methods tend to concentrate on one or a few specific frequency domains, which risks overfitting to particular artifacts and significantly undermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple-branch network that jointly captures spatial and frequency features by learning from both the original image and the image reconstructed by different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in mutual information theory, which enhance the model's focus on task-relevant features across the original image and the image reconstructed by different frequency channels. Extensive experiments on six large-scale benchmark datasets demonstrate that our method consistently achieves state-of-the-art performance. Our code is released at https://github.com/injooker/Unveiling Deepfake.
Authors: Haocheng Ju, Leheng Chen, Peihao Wu, Bryan Dai, Bin Dong
Abstract: Retrieving mathematical knowledge is a central task both in human-driven research, such as determining whether a result already exists, finding related results, and identifying historical origins, and in emerging AI systems for mathematics, where reliable grounding is essential. However, the scale and structure of the mathematical literature pose significant challenges: results are distributed across millions of documents, and individual statements are often difficult to interpret in isolation due to their dependence on prior definitions and theorems. In this paper, we introduce Matlas, a semantic search engine for mathematical statements. Matlas is built on a large-scale corpus of 8.07 million statements extracted from 435K peer-reviewed papers spanning 1826 to 2025, drawn from a curated set of 180 journals selected using an ICM citation-based criterion, together with 1.9K textbooks. From these sources, we extract mathematical statements together with their dependencies, construct document-level dependency graphs, and recursively unfold statements in topological order to produce more self-contained representations. On top of this corpus, we develop a semantic retrieval system that enables efficient search for mathematical results using natural language queries. We hope that Matlas can improve the efficiency of theorem retrieval for mathematicians, provide a structured source of grounding for AI systems tackling research-level mathematical problems, and serve as part of the infrastructure for mathematical knowledge retrieval.
Authors: Davin Choo, Paul W. Goldberg, Nicholas Teh
Abstract: Many high-stakes AI deployments proceed only if every stakeholder deems the system acceptable relative to their own minimum standard. With randomization over a finite menu of options, this becomes a feasibility question: does there exist a lottery over options that clears all stakeholders' acceptability bars? We study a query model where the algorithm proposes lotteries and receives only binary accept/reject feedback. We give deterministic and randomized algorithms that either find a unanimously acceptable lottery or certify infeasibility; adaptivity can avoid eliciting many stakeholders' constraints, and randomization further reduces the expected elicitation cost relative to full elicitation. We complement these upper bounds with worst-case lower bounds (in particular, linear dependence on the number of stakeholders and logarithmic dependence on precision are unavoidable). Finally, we develop learning-augmented algorithms that exploit natural forms of advice (e.g., likely binding stakeholders or a promising lottery), improving query complexity when predictions are accurate while preserving worst-case guarantees.
Authors: Harshavardhanan Deekeswar
Abstract: Serialization formats designed for document interchange impose structural overhead that becomes prohibitive when large language models consume operational data at scale. A modest dataset of 1,000 IoT sensor readings serialized as JSON requires approximately 80,000 tokens, with the majority spent on repeated field names, nested braces, and structural punctuation rather than semantic content. We present ONTO (Object Notation for Token Optimization), a columnar notation that declares field names once per entity and arranges values in pipe-delimited rows with indentation-based hierarchy. This schema-once, data-many design eliminates per-record key repetition while preserving human readability and nested structure support. Evaluation across three synthetic operational datasets demonstrates 46-51% token reduction versus JSON, with stable scaling from 100 to 1,000 records. Controlled inference benchmarks on Qwen2.5-7B show corresponding 5-10% latency improvement. Comprehension validation confirms no material degradation in LLM task accuracy across lookup, counting, extraction, and aggregation operations when format context is provided. Ablation analysis reveals that key repetition accounts for the majority of JSON overhead, with indentation costs in nested structures explaining the 4-percentage-point gap between flat and hierarchical data. ONTO occupies a previously unfilled position in the serialization landscape: columnar efficiency with hierarchical structure, optimized for LLM context windows rather than document interchange. Code and specification are available at https://github.com/harsh-aranga/onto.
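The schema-once, data-many idea can be illustrated in a few lines of Python; the header syntax and entity label below are made up for illustration and do not follow the official ONTO specification.

```python
def to_columnar(entity, records):
    """Declare field names once in a header, then emit one pipe-delimited
    row per record: a rough sketch of schema-once, data-many serialization
    (illustrative layout, not the ONTO spec)."""
    fields = list(records[0].keys())
    lines = [f"{entity}: {'|'.join(fields)}"]
    for r in records:
        lines.append("|".join(str(r[f]) for f in fields))
    return "\n".join(lines)

readings = [
    {"id": 1, "temp": 21.5, "hum": 40},
    {"id": 2, "temp": 21.7, "hum": 41},
]
print(to_columnar("sensor", readings))
```

Each field name appears exactly once regardless of record count, so the per-record key repetition that dominates JSON's token overhead disappears while rows stay human-readable.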
Authors: Achraf El Messaoudi, Noureddine Khaous, Karim Cherifi
Abstract: Machine learning is becoming increasingly important for nonlinear system identification, including dynamical systems with spatially distributed outputs. However, classical identification and forecasting approaches become markedly less reliable in turbulent-flow regimes, where the dynamics are high-dimensional, strongly nonlinear, and highly sensitive to compounding rollout errors. Diffusion-based models have recently shown improved robustness in this setting and offer probabilistic inference capabilities, but many current implementations inherit target parameterizations from image generation, most commonly noise or velocity prediction. In this work, we revisit this design choice in the context of nonlinear spatiotemporal system identification. We consider a simple, self-contained patch-based transformer that operates directly on physical fields and use turbulent flow simulation as a representative testbed. Our results show that clean-state prediction consistently improves rollout stability and reduces long-horizon error relative to velocity- and noise-based objectives, with the advantage becoming more pronounced as the per-token dimensionality increases. These findings identify target parameterization as a key modeling choice in diffusion-based identification of nonlinear systems with spatial outputs in turbulent regimes.
Authors: Suklav Ghosh, Arijit Sur, Pinaki Mitra
Abstract: Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.
Authors: Leon Engländer, Sophia Althammer, Ahmet Üstün, Matthias Gallé, Tom Sherborne
Abstract: LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task's solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact with or exploit them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command "returns the complete solution to this task" in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. We find that configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
Authors: Feiyang Kang, Mahavir Dabas, Myeongseob Ko, Ruoxi Jia
Abstract: Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines--all external hypotheses about what matters that need not align with the model's internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH--an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Codes are open-sourced.
Authors: I. M. Ross
Abstract: A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some ``natural laws of motion,'' and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An ``action-at-a-distance'' operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized ``energy'' defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.
Authors: Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj, Gouthaman KV, Ramani Duraiswami, Lie Lu, Sreyan Ghosh, Dinesh Manocha
Abstract: Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms baselines that solely accept video input and additional feature-conditioned baselines on both in-distribution and out-of-distribution benchmarks, with a 2.21x inference speedup compared to SOTA. We will open-source everything upon paper acceptance.
Authors: Noha Hassan, Xavier Fernando, Halim Yanikomeroglu
Abstract: Reconfigurable intelligent surfaces (RISs) modify signal reflections to enhance wireless communication capabilities. Classical RIS phase optimization is highly non-convex and challenging in dynamic environments due to high interference and user mobility. Here we propose a hierarchical multi-objective quantum meta-learning algorithm that switches among specific quantum paths based on historical success, energy cost, and current data rate. Candidate RIS control directions are arranged as switch paths between quantum neural network layers to minimize inference cost, and a scoring mechanism selects the top-performing paths per layer. Instead of merely storing past successful settings of the RIS and picking the closest match when a new problem is encountered, the algorithm learns how to select and recombine the best parts of different solutions to solve new scenarios. In our model, high-dimensional RIS scenario features are compressed into a quantum state using the tensor product, then superimposed during quantum path selection, significantly improving quantum computational advantage. Results demonstrate efficient performance with enhanced spectral efficiency, convergence rate, and adaptability.
Authors: Nicholas Williams, Alejandro Schuler
Abstract: Predictions from machine learning algorithms can vary across random seeds, inducing instability in downstream debiased machine learning estimators. We formalize random seed stability via a concentration condition and prove that subbagging guarantees stability for any bounded-outcome regression algorithm. We introduce a new cross-fitting procedure, adaptive cross-bagging, which simultaneously eliminates seed dependence from both nuisance estimation and sample splitting in debiased machine learning. Numerical experiments confirm that the method achieves the targeted level of stability whereas alternatives do not. Our method incurs a small computational penalty relative to standard practice whereas alternative methods incur large penalties.
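Subbagging, as invoked here to stabilize seed-dependent learners, can be sketched as follows; the base learner, subsample fraction, and bag count are illustrative choices, not the paper's procedure.

```python
import random

def subbag_predict(fit, X, y, x_new, n_bags=50, frac=0.632, seed=0):
    """Average a base regression algorithm over random subsamples (subbagging).
    Averaging concentrates the prediction around its expectation, so the
    result depends only weakly on any single random seed."""
    rng = random.Random(seed)
    n, m = len(X), int(frac * len(X))
    preds = []
    for _ in range(n_bags):
        idx = rng.sample(range(n), m)  # subsample without replacement
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / n_bags

# A deliberately seed-sensitive "learner": predicts the mean of its training y.
mean_fit = lambda X_tr, y_tr: (lambda x: sum(y_tr) / len(y_tr))
X = list(range(100))
y = [2.0 * x for x in X]
p1 = subbag_predict(mean_fit, X, y, 0, seed=1)
p2 = subbag_predict(mean_fit, X, y, 0, seed=2)
```

Two different seeds yield nearly identical predictions because the subsample-to-subsample variance is averaged out, which is the concentration behavior the abstract's stability guarantee formalizes.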
Authors: Khang Tran, Khoa Nguyen, Cristian Borcea, NhatHai Phan
Abstract: Recent advances in large language models for test case generation have improved branch coverage via prompt-engineered mutations. However, they still lack principled mechanisms for steering models toward specific high-risk execution branches, limiting their effectiveness for discovering subtle bugs and security vulnerabilities. We propose GLMTest, the first program structure-aware LLM framework for targeted test case generation that seamlessly integrates code property graphs and code semantics using a graph neural network and a language model to condition test case generation on execution branches. This structured conditioning enables controllable and branch-targeted test case generation, thereby potentially enhancing bug and security risk discovery. Experiments on real-world projects show that GLMTest built on a Qwen2.5-Coder-7B-Instruct model improves branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark compared with state-of-the-art LLMs, i.e., Claude-Sonnet-4.5 and GPT-4o-mini.
Authors: Jon-Paul Cacioli
Abstract: The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
Authors: Mohammed Kharma, Ahmed Sabbah, Radi Jarrar, Samer Zain, Mohammad Alkhanafseh, David Mohaisen
Abstract: This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject pre-training versus post-training comparison and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed via independent manual validation of submitted repositories by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test ($p = 0.0059$). In aggregate, validated weaknesses decreased from 162 to 111 (31.5\%), the severity-weighted burden decreased from 432 to 267 (38.2\%), and critical findings decreased from 24 to 5 (79.2\%). The largest reductions were in authorization and object access (53.3\%) and in authentication, credential policy, and recovery weaknesses (44.7\%). Session and browser trust-boundary issues showed minimal change, while sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.
Authors: Seunghee Koh, Sunghyun Baek, Youngdong Kim, Junmo Kim
Abstract: Unlearning in large language models (LLMs) has emerged as a promising safeguard against adversarial behaviors. When the forgetting loss is applied uniformly without considering token-level semantic importance, model utility can be unnecessarily degraded. Recent studies have explored token-wise loss regularizers that prioritize informative tokens, but largely rely on ground-truth confidence or external linguistic parsers, which limits their ability to capture contextual information or the model's overall predictive state. Intuitively, function words like "the" primarily serve syntactic roles and are highly predictable with little ambiguity, but informative words admit multiple plausible alternatives with greater uncertainty. Based on this intuition, we propose Entropy-guided Token Weighting (ETW), a token-level unlearning regularizer that uses entropy of the predictive distribution as a proxy for token informativeness. We demonstrate that informative tokens tend to have higher entropy, whereas structural tokens tend to have lower entropy. This behavior enables ETW to achieve more effective unlearning while better preserving model utility than existing token-level approaches.
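A minimal sketch of the entropy-as-informativeness idea behind ETW (the per-sequence normalization here is illustrative; the paper's exact weighting scheme may differ):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def etw_weights(token_dists):
    """Entropy-guided token weights: scale each token's forgetting
    loss by its predictive entropy, normalized over the sequence."""
    ents = [entropy(d) for d in token_dists]
    total = sum(ents) or 1.0
    return [e / total for e in ents]

# A near-deterministic function word vs. an ambiguous content word.
the_dist     = [0.97, 0.01, 0.01, 0.01]   # low entropy -> small weight
content_dist = [0.30, 0.30, 0.25, 0.15]   # high entropy -> large weight
w = etw_weights([the_dist, content_dist])
```

The weights down-scale structural tokens like "the" and concentrate the forgetting signal on informative tokens, as the abstract's intuition suggests.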
Authors: Prasoon Goyal, Sattvik Sahai, Michael Johnston, Hangjie Shi, Yao Lu, Shaohua Liu, Anna Rumshisky, Rahul Gupta, Anna Gottardi, Desheng Zhang, Lavina Vaz, Leslie Ball, Lucy Hu, Luke Dai, Samyuth Sagi, Maureen Murray, Sankaranarayanan Ananthakrishnan
Abstract: Post-training Large Language Models requires diverse, high-quality data, which is rare and costly to obtain, especially in low-resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena, an approach for building high-quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and a 29.42% improvement on CyberSecEval-MITRE.
Authors: Charles Ye, Bo Yuan, Lee Sharkey
Abstract: An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
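For a linear router, the control/content split described above has a canonical parameter-free form: the component of the hidden state in the row space of the router weight matrix drives routing, and the orthogonal residue is invisible to it. A sketch of that generic decomposition (not necessarily the authors' exact procedure):

```python
import numpy as np

def split_control_content(h, Wr):
    """Split hidden state h (shape (d,)) into the part a linear router
    with weights Wr (shape (num_experts, d)) can see and the part it
    cannot: pinv(Wr) @ Wr is the orthogonal projector onto rowspace(Wr)."""
    P = np.linalg.pinv(Wr) @ Wr      # projector, shape (d, d)
    control = P @ h                  # drives the routing logits
    content = h - control            # invisible to the router
    return control, content

rng = np.random.default_rng(0)
Wr = rng.standard_normal((8, 64))    # toy sizes: 8 experts, d = 64
h = rng.standard_normal(64)
control, content = split_control_content(h, Wr)
```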
Authors: Zichao Wei
Abstract: Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal $L \times L$ grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson $r \approx 0.71$). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.
Authors: Tianfu Wang, Zhezheng Hao, Yin Wu, Wei Wu, Qiang Lin, Hande Dong, Nicholas Jing Yuan, Hui Xiong
Abstract: Vibe coding produces correct, executable code at speed, but leaves no record of the structural commitments, dependencies, or evidence behind it. Reviewers cannot determine what invariants were assumed, what changed, or why a regression occurred. This is not a generation failure but a control failure: the dominant artifact of AI-assisted development (code plus chat history) performs dimension collapse, flattening complex system topology into low-dimensional text and making systems opaque and fragile under change. We propose Agentic Consensus: a paradigm in which the consensus layer C, an operable world model represented as a typed property graph, replaces code as the primary artifact of engineering. Executable artifacts are derived from C and kept in correspondence via synchronization operators Phi (realize) and Psi (rehydrate). Evidence links directly to structural claims in C, making every commitment auditable and under-specification explicit as measurable consensus entropy rather than a silent guess. Evaluation must move beyond code correctness toward alignment fidelity, consensus entropy, and intervention distance. We propose benchmark task families designed to measure whether consensus-based workflows reduce human intervention compared to chat-driven baselines.
Authors: Chuhan Qiao
Abstract: Off-policy learning in constrained MDPs with large binary state spaces faces a fundamental tension: causal identification of transition dynamics requires structural assumptions, while sample-efficient policy learning requires state-space compression. We introduce PI-CMDP, a framework for CMDPs whose constraint dependencies form a layered DAG under a Lifecycle Ordering Assumption (LOA). We propose an Identify-Compress-Estimate pipeline: (i) Identify: LOA enables backdoor identification of causal edge weights for cross-layer pairs, with formal partial-identification bounds when LOA is violated; (ii) Compress: a Markov abstraction compresses state cardinality from 2^(WL) to (W+1)^L under layer-priority regularity and exchangeability; and (iii) Estimate: a physics-guided doubly-robust estimator remains unbiased and reduces the variance constant when the physics prior outperforms a learned model. We instantiate PI-CMDP on constraint repair in engineering simulation pipelines. On the TPS benchmark (4,206 episodes), PI-CMDP achieves 76.2% repair success rate with only 300 training episodes versus 70.8% for the strongest baseline (+5.4 pp), narrowing to +2.8 pp (83.4% vs. 80.6%) in the full-data regime, while substantially reducing cascade failure rates. All improvements are consistent across 5 independent seeds (paired t-test p < 0.02).
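The doubly-robust construction behind the Estimate step takes the textbook AIPW form: an outcome model's prediction plus an importance-weighted residual correction. A generic sketch (the paper's physics-guided variant adds more structure than shown here):

```python
def doubly_robust(samples, model, weight):
    """Generic doubly-robust estimate over (state, outcome) pairs:
    model prediction plus weighted residual. When the outcome model
    is exact, the correction vanishes and the weights do not matter,
    which is the variance-reduction lever a good physics prior pulls."""
    vals = [model(x) + weight(x) * (y - model(x)) for x, y in samples]
    return sum(vals) / len(vals)
```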
Authors: Islam Mansour, Francescopaolo Sica, Michael Schmitt
Abstract: Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine-tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.
Authors: H S V N S Kowndinya Renduchintala, Sumit Bhatia
Abstract: Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
URLs: https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence
Authors: Andrew Gracyk
Abstract: We develop interconnections between the complex normalizing flow for data drawn from Borel probability measures on the twofold realification of the complex manifold and the K\"ahler-Ricci flow. The complex normalizing flow relates the initial and target realified densities under the complex change of variables, necessitating the log determinant of the Wirtinger Jacobian. The Ricci curvature of a K\"ahler manifold is the second order mixed Wirtinger partial derivative of the log of the local density of the volume form. Therefore, we reconcile these two facts by drawing forth the connection that the log determinant used in the complex normalizing flow matches the Ricci curvature term under differentiation and conditions. The log density under the normalizing flow is kindred to a spatial Fisher information metric under a holomorphic pullback and a Bayesian perspective to the parameter, thus under the continuum limit the log likelihood matches a Fisher metric, recovering the K\"ahler-Ricci flow up to expectation. Using this framework, we establish other relevant results, attempting to bridge the statistical and ordinary behaviors of the complex normalizing flow to the geometric features of the K\"ahler-Ricci flow.
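The two identities the abstract reconciles can be written explicitly; these are the standard forms (sign and normalization conventions vary across references):

```latex
% Complex change of variables for a holomorphic map f: the real Jacobian
% determinant equals the squared modulus of the complex (Wirtinger) one,
p_X(x) \;=\; p_Z\!\bigl(f(x)\bigr)\,\bigl|\det J_f(x)\bigr|^{2},
% so the flow's log-determinant term is 2 \log |\det J_f|.

% Ricci curvature of a Kahler metric g as the mixed Wirtinger derivative
% of the log volume density,
R_{i\bar{j}} \;=\; -\,\partial_{i}\,\partial_{\bar{j}}\,\log\det\bigl(g_{k\bar{l}}\bigr),
% which is the term the abstract matches against the flow's log-determinant.
```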
Authors: Victoria Bosch, Rowan Sommers, Adrien Doerig, Tim C Kietzmann
Abstract: Recent studies reveal striking representational alignment between artificial neural networks (ANNs) and biological brains, leading to proposals that all sufficiently capable systems converge on universal representations of reality. Here, we argue that this claim of Universality is premature. We introduce the Umwelt Representation Hypothesis (URH), proposing that alignment arises not from convergence toward a single global optimum, but from overlap in ecological constraints under which systems develop. We review empirical evidence showing that representational differences between species, individuals, and ANNs are systematic and adaptive, which is difficult to reconcile with Universality. Finally, we reframe ANN model comparison as a method for mapping clusters of alignment in ecological constraint space rather than searching for a single optimal world model.
Authors: Isma\"il Baaj, Henri Prade
Abstract: In this article, we establish a precise connection between binarized neural networks (BNNs) and Sugeno integrals. The advantage of the Sugeno integral is that it provides a framework for representing the importance of inputs and their interactions, while being equivalent to a set of if-then rules. For a hidden BNN neuron at inference time, we show that the activation threshold test can be written as a Sugeno integral on binary inputs. This yields an explicit set-function representation of each neuron decision, and an associated rule-based representation. We also provide a Sugeno-integral expression for the last-layer score. Finally, we discuss how the same framework can be adapted to support richer input interactions and how it can be extended beyond the binary case induced by binarized neural networks.
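The Sugeno integral used above admits a compact max-min form; a sketch with a hand-picked capacity (deriving the capacity from a trained BNN neuron is the paper's contribution and is not shown here):

```python
def sugeno(x, mu):
    """Sugeno integral of inputs x w.r.t. a capacity mu, given as a
    dict from frozensets of input indices to values in [0, 1]:
    S(x) = max_i min(x_(i), mu(top-i set)), inputs sorted decreasingly."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    best, top = 0.0, set()
    for i in order:
        top.add(i)
        best = max(best, min(x[i], mu[frozenset(top)]))
    return best

# Illustrative capacity on two binary inputs (monotone set function).
mu = {frozenset(): 0.0, frozenset({0}): 0.4,
      frozenset({1}): 0.7, frozenset({0, 1}): 1.0}
```

For binary inputs the integral reduces to the capacity of the set of active inputs, so a neuron's activation test becomes a comparison of that capacity against the threshold.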
Authors: Sanzo Miyazawa
Abstract: The inverse Potts problem, which estimates evolutionary single-site fields and pairwise couplings in homologous protein sequences from the single-site and pairwise amino acid frequencies observed in their multiple sequence alignment, remains one of the most useful methods in the study of protein structure and evolution. Since the reproducibility of the fields and couplings is most important, the Boltzmann machine method is employed here, although it is computationally intensive. To reduce the computational time required for the Boltzmann machine, a parallel, persistent Markov chain Monte Carlo method is employed to estimate the single-site and pairwise marginal distributions in each learning step, and stochastic gradient descent methods are used to reduce the computational time of each learning step. Another problem is how to adjust the values of the hyperparameters; there are two regularization parameters, one for the evolutionary fields and one for the couplings. The precision of contact residue-pair prediction is often used to adjust the hyperparameters, but it is not sensitive to these regularization parameters. Here, they are adjusted so that the fields and couplings satisfy a specific condition appropriate for protein conformations. The method has been applied to eight protein families.
Authors: Wen Tao, Yiwei Wang, Peng Zhou, Bryan Hooi, Wanlong Fang, Tianle Zhang, Xiao Luo, Yuansheng Liu, Alvin Chan
Abstract: Molecule generation requires satisfying multiple chemical and biological constraints while searching a large and structured chemical space. This makes it a non-binary problem, where effective models must identify non-obvious solutions under constraints while maintaining exploration to improve success by escaping local optima. From this perspective, creativity is a functional requirement in molecular generation rather than an aesthetic notion. Large language models (LLMs) can generate molecular representations directly from natural language prompts, but it remains unclear what type of creativity they exhibit in this setting and how it should be evaluated. In this work, we study the creative behavior of LLMs in molecular generation through a systematic empirical evaluation across physicochemical, ADMET, and biological activity tasks. We characterize creativity along two complementary dimensions, convergent creativity and divergent creativity, and analyze how different factors shape these behaviors. Our results indicate that LLMs exhibit distinct patterns of creative behavior in molecule generation, such as an increase in constraint satisfaction when additional constraints are imposed. Overall, our work is the first to reframe the abilities required for molecule generation as creativity, providing a systematic understanding of creativity in LLM-based molecular generation and clarifying the appropriate use of LLMs in molecular discovery pipelines.
Authors: Saeid Sheikhi, Panos Kostakos, Lauri Loven
Abstract: Intrusion detection systems (IDSs) for 5G networks must handle complex, high-volume traffic. Although opaque "black-box" models can achieve high accuracy, their lack of transparency hinders trust and effective operational response. We propose ExAI5G, a framework that prioritizes interpretability by integrating a Transformer-based deep learning IDS with logic-based explainable AI (XAI) techniques. The framework uses Integrated Gradients to attribute feature importance and extracts a surrogate decision tree to derive logical rules. We introduce a novel evaluation methodology for LLM-generated explanations, using a powerful evaluator LLM to assess actionability and measuring their semantic similarity and faithfulness. On a 5G IoT intrusion dataset, our system achieves 99.9\% accuracy and a 0.854 macro F1-score, demonstrating strong performance. More importantly, we extract 16 logical rules with 99.7\% fidelity, making the model's reasoning transparent. The evaluation demonstrates that modern LLMs can generate explanations that are both faithful and actionable, indicating that it is possible to build a trustworthy and effective IDS without compromising performance for the sake of marginal gains from an opaque model.
Authors: Francesco Vitale, Francesco Grimaldi, Massimiliano Rak, Nicola Mazzocca
Abstract: Anomaly-based Intrusion Detection Systems (IDSs) ensure protection against malicious attacks on networked systems. While deep learning-based IDSs achieve effective performance, their limited trustworthiness due to black-box architectures remains a critical constraint. Despite existing explainable techniques offering insight into the alarms raised by IDSs, they lack process-based explanations grounded in packet-level sequencing analysis. In this paper, we propose a method that employs process mining techniques to enhance anomaly-based IDSs by providing process-based alarm severity ratings and explanations for alerts. Our method prioritizes critical alerts and maintains visibility into network behavior, while minimizing disruption by allowing misclassified benign traffic to pass. We apply the method to the publicly available USB-IDS-TC dataset, which includes anomalous traffic affected by different variants of the Slowloris DoS attack. Results show that our method is able to discriminate between low- to very-high-severity alarms while preserving up to 99.94% recall and 99.99% precision, effectively discarding false positives while providing different degrees of severity for the true positives.
Authors: Francesco Vitale, Simone Guarino, Stefano Perone, Massimiliano Rak, Nicola Mazzocca
Abstract: While attack graphs are useful for identifying major cybersecurity threats affecting a system, they do not provide operational support for determining the likelihood of having a known vulnerability exploited, or that critical system nodes are likely to be compromised. In this paper, we perform dynamic risk assessment by combining Bayesian Attack Graphs (BAGs) and online monitoring of system behavior through process mining. Specifically, the proposed approach applies process mining techniques to characterize malicious network traffic and derive evidence regarding the probability of having a vulnerability actively exploited. This evidence is then provided to a BAG, which updates its conditional probability tables accordingly, enabling dynamic assessment of vulnerability exploitation. We apply our method to a cybersecurity testbed instantiating several machines deployed on different subnets and affected by several CVE vulnerabilities. The testbed is stimulated with both benign traffic and malicious behavior, which simulates network attack patterns aimed at exploiting the CVE vulnerabilities. The results indicate that our proposal effectively detects whether vulnerabilities are being actively exploited, allowing for an updated assessment of the probability of system compromise.
Authors: Varad Vishwarupe, Marina Jirotka, Nigel Shadbolt, Ivan Flechais
Abstract: LLMs are increasingly presented as collaborators in programming, design, writing, and analysis. Yet the practical experience of working with them often falls short of this promise. In many settings, users must diagnose misunderstandings, reconstruct missing assumptions, and repeatedly repair misaligned responses. This poster introduces a conceptual framework for understanding why such collaboration remains fragile. Drawing on a constructivist grounded theory analysis of 16 interviews with designers, developers, and applied AI practitioners working on LLM-enabled systems, and informed by literature on human-AI collaboration, we argue that stable collaboration depends not only on model capability but on the interaction's grounding conditions. We distinguish three recurrent structures of human-AI work: one-shot assistance, weak collaboration with asymmetric repair, and grounded collaboration. We propose that collaboration breaks down when the appearance of partnership outpaces the grounding capacity of the interaction and contribute a framework for discussing grounding, repair, and interaction structure in LLM-enabled work.
Authors: Ziyang Liu
Abstract: We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
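The round-to-nearest W4A4 baseline mentioned above is easy to state concretely; a per-tensor symmetric sketch that also shows why activation outliers (high kurtosis) are fatal at 4 bits:

```python
def rtn_quantize(x, bits=4):
    """Naive symmetric per-tensor round-to-nearest quantization, the
    W4A4 baseline: returns the dequantized values and the scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in x) / qmax or 1.0    # outlier sets the scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return [qi * scale for qi in q], scale
```

One outlier stretches the scale so far that small activations round to zero, which is the heavy-tailed pathology the residual-axis intervention above targets at the reader linears.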
Authors: Xiao Lingao, Yang He
Abstract: Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.
URLs: https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation
Authors: Kosuke Matsushima, Yasuyuki Okoshi, Masato Motomura, Daichi Fujiki
Abstract: Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM's high internal bandwidth and enables direct computation on compressed data, significantly reducing both the memory footprint and the computational overhead of attention computation. AQPIM addresses PQ's accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing the GPU-CPU communication that can account for 90$\sim$98.5\% of decoding latency, together with a 3.4$\times$ speedup over a SOTA PIM approach.
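Product Quantization, which AQPIM builds on, splits each vector into equal subvectors and stores only per-subspace codeword indices; a sketch assuming pre-trained codebooks (AQPIM's in-memory placement and accuracy optimizations are not modeled here):

```python
def pq_encode(v, codebooks):
    """Product quantization: split v into equal subvectors and keep,
    per subspace, the index of the nearest codeword (squared L2)."""
    d = len(v) // len(codebooks)
    codes = []
    for s, book in enumerate(codebooks):
        sub = v[s * d:(s + 1) * d]
        codes.append(min(range(len(book)),
                         key=lambda k: sum((a - b) ** 2
                                           for a, b in zip(sub, book[k]))))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct the vector by concatenating the chosen codewords."""
    out = []
    for c, book in zip(codes, codebooks):
        out.extend(book[c])
    return out
```

Storing one small index per subspace instead of the raw subvector is what shrinks the KV cache and lets attention run directly on compressed codes.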
Authors: Qi Kuang, Chao Wang, Yuling Jiao, Fan Zhou
Abstract: This paper investigates the off-policy evaluation (OPE) problem from a distributional perspective. Rather than focusing solely on the expectation of the total return, as in most existing OPE methods, we aim to estimate the entire return distribution. To this end, we introduce a quantile-based approach for OPE using deep quantile process regression, presenting a novel algorithm called Deep Quantile Process regression-based Off-Policy Evaluation (DQPOPE). We provide new theoretical insights into the deep quantile process regression technique, extending existing approaches that estimate discrete quantiles to estimate a continuous quantile function. A key contribution of our work is the rigorous sample complexity analysis for distributional OPE with deep neural networks, bridging theoretical analysis with practical algorithmic implementations. We show that DQPOPE achieves statistical advantages by estimating the full return distribution using the same sample size required to estimate a single policy value using conventional methods. Empirical studies further show that DQPOPE provides significantly more precise and robust policy value estimates than standard methods, thereby enhancing the practical applicability and effectiveness of distributional reinforcement learning approaches.
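Quantile regression, the workhorse above, minimizes the pinball loss; a minimal sketch with a grid-search stand-in for the deep quantile networks the abstract describes:

```python
def pinball(y, q, tau):
    """Pinball (quantile) loss: minimized in expectation by the
    tau-th conditional quantile of the return distribution."""
    u = y - q
    return tau * u if u >= 0 else (tau - 1.0) * u

def empirical_quantile_fit(ys, tau, grid):
    """Pick the grid value minimizing average pinball loss: a crude
    stand-in for fitting a continuous quantile function with a
    deep network, as DQPOPE does."""
    return min(grid, key=lambda q: sum(pinball(y, q, tau) for y in ys))
```

Sweeping tau over (0, 1) traces out the whole return distribution rather than just its mean, which is the distributional OPE objective.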
Authors: Ammar Bhilwarawala, Mainak Bandyopadhyay
Abstract: Automated fetal head segmentation in ultrasound images is critical for accurate biometric measurements in prenatal care. While existing deep learning approaches achieve reasonable performance, they struggle with the low contrast, noise, and complex anatomical boundaries inherent to ultrasound imaging. This paper presents Attention-ResUNet, a novel architecture that synergistically combines residual learning with multi-scale attention mechanisms for enhanced fetal head segmentation. Our approach integrates attention gates at four decoder levels to focus selectively on anatomically relevant regions while suppressing background noise, complemented by residual connections that facilitate gradient flow and feature reuse. Extensive evaluation on the HC18 Challenge dataset (n = 200) demonstrates that Attention-ResUNet achieves superior performance, with a mean Dice score of 99.30 +/- 0.14%. It significantly outperforms five baseline architectures, including ResUNet (99.26%), Attention U-Net (98.79%), Swin U-Net (98.60%), standard U-Net (98.58%), and U-Net++ (97.46%). Statistical analysis confirms highly significant improvements (p < 0.001), with effect sizes (Cohen's d) ranging from 0.230 to 13.159. Saliency-map analysis reveals that our architecture produces highly concentrated, anatomically consistent activation patterns, demonstrating the enhanced interpretability crucial for clinical deployment. The proposed method establishes a new state of the art for automated fetal head segmentation while maintaining computational efficiency, with 14.7M parameters and a 45-GFLOPs inference cost. Code repository: https://github.com/Ammar-ss
Authors: Sebastian Fischer, Lukas Burk, Carson Zhang, Bernd Bischl, Martin Binder
Abstract: Deep learning (DL) has become a cornerstone of modern machine learning (ML) practice. We introduce the R package mlr3torch, which is an extensible DL framework for the mlr3 ecosystem. It is built upon the torch package, and simplifies the definition, training, and evaluation of neural networks for both tabular data and generic tensors (e.g., images) for classification and regression. The package implements predefined architectures, and torch models can easily be converted to mlr3 learners. It also allows users to define neural networks as graphs. This representation is based on the graph language defined in mlr3pipelines and allows users to define the entire modeling workflow, including preprocessing, data augmentation, and network architecture, in a single graph. Through its integration into the mlr3 ecosystem, the package allows for convenient resampling, benchmarking, preprocessing, and more. We explain the package's design and features and show how to customize and extend it to new problems. Furthermore, we demonstrate the package's capabilities using three use cases, namely hyperparameter tuning, fine-tuning, and defining architectures for multimodal data. Finally, we present some runtime benchmarks.
Authors: Geet Sethi, Panav Shah, Ashutosh Gandhe, Soumitra Darshan Nayak
Abstract: Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundational segmentation models, our approach enables robust and adaptive object localization across complex scenes. Experiments demonstrate that our pipeline significantly improves localization performance, achieving over a 14% increase in Acc@0.5 compared to existing state-of-the-art methods.
Authors: Lachlan Ewen MacDonald
Abstract: We prove a centre manifold theorem for a map along a manifold-with-boundary of fixed points, and provide an application to the study of gradient descent with large step size on two-layer matrix factorisation problems.
Authors: Sheng Xu, Guiliang Liu, Tarak Kharrat, Yudong Luo, Mohamed Aloulou, Javier L\'opez Pe\~na, Konstantin Sofeikov, Adam Reid, Paul Roberts, Steven Spencer, Joe Carnall, Ian McHale, Oliver Schulte, Hongyuan Zha, Wei-Shi Zheng
Abstract: Success in association football relies on both individual skill and coordinated tactics. While recent advancements in spatio-temporal data and deep learning have enabled predictive analyses like trajectory forecasting, the development of tactical design remains limited. Bridging this gap is essential, as prediction reveals what is likely to occur, whereas tactic generation determines what should occur to achieve strategic objectives. In this work, we present TacticGen, a generative model for adaptable and scalable tactic generation. TacticGen formulates tactics as sequences of multi-agent movements and interactions conditioned on the game context. It employs a multi-agent diffusion transformer with agent-wise self-attention and context-aware cross-attention to capture cooperative and competitive dynamics among players and the ball. Trained with over 3.3 million events and 100 million tracking frames from top-tier leagues, TacticGen achieves state-of-the-art precision in predicting player trajectories. Building on this, TacticGen enables adaptable tactic generation tailored to diverse inference-time objectives through a classifier-guidance mechanism, specified via rules, natural language, or neural models. Its modeling performance is also inherently scalable. A case study with football experts confirms that TacticGen generates realistic, strategically valuable tactics, demonstrating its practical utility for tactical planning in professional football. The project page is available at: https://shengxu.net/TacticGen/.
Authors: Zikun Zhou, Wenshuo Wang, Wenzhuo Liu, Hui Yao, Chaopeng Zhang, Yichen Liu, Xiaonan Yang, Junqiang Xi
Abstract: Electroencephalography (EEG) signals have been promising for long-term braking intensity prediction but are prone to various artifacts that limit their reliability. Here, we propose a novel framework that models EEG signals as mixtures of independent blind sources and identifies those strongly correlated with braking action. Our method employs independent component analysis to decompose EEG into different components and combines time-frequency analysis with Pearson correlations to select braking-related components. Furthermore, we utilize hierarchical clustering to group braking-related components into two clusters, each characterized by a distinct spatial pattern. Additionally, these components exhibit trial-invariant temporal patterns and demonstrate stable and common neural signatures of the emergency braking process. Using power features from these components and historical braking data, we predict braking intensity at a 200 ms horizon. Evaluations on an open-source dataset (O.D.) and a human-in-the-loop simulation (H.S.) show that our method outperforms state-of-the-art approaches, achieving RMSE reductions of 8.0% (O.D.) and 23.8% (H.S.).
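The component-selection step can be illustrated independently of any EEG toolkit: given decomposed sources and a reference braking signal, keep the sources whose absolute Pearson correlation clears a threshold. This is a schematic sketch (the threshold, names, and synthetic data are illustrative, not from the paper):

```python
import numpy as np

def select_components(sources, target, r_min=0.3):
    """sources: (n_components, n_samples); target: (n_samples,).
    Return indices of components with |Pearson r| >= r_min."""
    keep = []
    for i, s in enumerate(sources):
        r = np.corrcoef(s, target)[0, 1]
        if abs(r) >= r_min:
            keep.append(i)
    return keep

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 500)
braking = np.sin(2 * np.pi * 3 * t)         # toy braking-related rhythm
sources = np.stack([
    braking + 0.3 * rng.normal(size=t.size),    # braking-related component
    rng.normal(size=t.size),                    # artifact / noise component
    -braking + 0.3 * rng.normal(size=t.size),   # braking-related, inverted sign
])
selected = select_components(sources, braking)
# selected == [0, 2]: both correlated components are kept, the noise is dropped
```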
Authors: Yangdi Jiang, Xiaotian Chang, Cyrus Mostajeran
Abstract: We introduce the horospherical depth, an intrinsic notion of statistical depth on Hadamard manifolds, and define the Busemann median as the set of its maximizers. The construction exploits the fact that the linear functionals appearing in Tukey's half-space depth are themselves limits of renormalized distance functions; on a Hadamard manifold the same limiting procedure produces Busemann functions, whose sublevel sets are horoballs, the intrinsic replacements for halfspaces. The resulting depth is parametrized by the visual boundary, is isometry-equivariant, and requires neither tangent-space linearization nor a chosen base point. For arbitrary Hadamard manifolds, we prove that the depth regions are nested and geodesically convex, that a centerpoint of depth at least $1/(d+1)$ exists, and hence that the Busemann median exists for every Borel probability measure. Under strictly negative sectional curvature and mild regularity assumptions, the depth is strictly quasi-concave and the median is unique. We also establish robustness: the depth is stable under total-variation perturbations, and under contamination escaping to infinity the limiting median depends on the escape direction but not on how far the contaminating mass has moved along the geodesic ray, in contrast with the Fr\'echet mean. Finally, we establish uniform consistency of the sample depth and convergence of sample depth regions and sample Busemann medians; on symmetric spaces of noncompact type, the argument proceeds through a VC analysis of upper horospherical halfspaces, while on general Hadamard manifolds it follows from a compactness argument under a mild non-atomicity assumption.
Authors: Hamed Ouattara, Pascal Houssam Salmane, Pierre Duthon, Fr\'ed\'eric Bernardin, Omar Ait Aider
Abstract: In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called "Multi-PatchGAN", is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, "Truncated ResNet50", is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose "Truncated ResNet50 with Gram Matrix and Attention", which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.
Authors: Svetlana Pavlitska, Malte St\"uven, Beyza Keskin, J. Marius Z\"ollner
Abstract: Mixture-of-Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model-level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model-level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
Authors: Chih-Kang Huang, Ludovick Gagnon, Miha Zalo\v{z}nik, Beno\^it Appolaire
Abstract: The multi-scale and non-linear nature of phase-field models of solidification requires fine spatial and temporal discretization, leading to long computation times. This could be overcome with artificial-intelligence approaches. Surrogate models based on neural operators could have a lower computational cost than conventional numerical discretization methods. We propose a new neural operator approach that bridges classical convex-concave splitting schemes with physics-informed learning to accelerate the simulation of phase-field models. It consists of a Deep Ritz method, where a neural operator is trained to approximate a variational formulation of the phase-field model. By training the neural operator with an energy-splitting variational formulation, we enforce the energy dissipation property of the underlying models. We further introduce a custom Reaction-Diffusion Neural Operator (RDNO) architecture, adapted to the operators of the model equations. We successfully apply the deep learning approach to the isotropic Allen-Cahn equation and to anisotropic dendritic growth simulation. We demonstrate that our physics-informed training provides better generalization in out-of-distribution evaluations than data-driven training, while achieving faster inference than traditional Fourier spectral methods.
Authors: Riccardo Casciotti, Francesco De Santis, Alberto Antonietti, Annamaria Mesaros
Abstract: Humans' capacity for lifelong learning is an inspiration for deep learning methods, and for continual learning in particular. In this work, we apply Hebbian learning, a biologically inspired learning process, to sound classification. We propose a kernel plasticity approach that selectively modulates network kernels during incremental learning, acting on selected kernels to learn new information and on others to retain previous knowledge. Using the ESC-50 dataset, the proposed method achieves 76.3% overall accuracy over five incremental steps, outperforming a baseline without kernel plasticity (68.7%) and demonstrating significantly greater stability across tasks.
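The selective-plasticity idea can be caricatured in a few lines: a plain Hebbian update $\Delta w = \eta\, y\, x$ is applied only to kernels flagged as plastic, while the remaining kernels keep their previously learned weights. This is a toy sketch under those stated assumptions, not the paper's method:

```python
import numpy as np

def hebbian_step(W, x, plastic_mask, lr=0.01):
    """W: (n_kernels, n_inputs) weight rows, one per kernel.
    Hebbian update dW = lr * outer(y, x) with y = W @ x, applied only
    where plastic_mask is True; frozen kernels are left untouched."""
    y = W @ x                                  # kernel responses
    dW = lr * np.outer(y, x)
    return W + dW * plastic_mask[:, None]      # mask zeroes updates row-wise

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 6))
x = rng.normal(size=6)
mask = np.array([True, False, True, False])    # kernels 1 and 3 retain old knowledge
W_new = hebbian_step(W, x, mask)
```

After the step, the frozen rows of `W_new` equal the corresponding rows of `W` exactly, which is the retention mechanism in miniature.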
Authors: Matic Petri\v{c}, Ren\'e Zander
Abstract: Block-encoding is a foundational technique in modern quantum algorithms, enabling the implementation of non-unitary operations by embedding them into larger unitary matrices. While theoretically powerful and essential for advanced protocols like Quantum Singular Value Transformation (QSVT) and Quantum Signal Processing (QSP), the generation of compilable implementations of block-encodings poses a formidable challenge. This work presents the BlockEncoding interface within the Eclipse Qrisp framework, establishing block-encodings as a high-level programming abstraction accessible to a broad scientific audience. Serving as both a technical framework introduction and a hands-on tutorial, this paper explicitly details key underlying concepts abstracted away by the interface, such as block-encoding construction and qubitization, and their practical integration into methods like the Childs-Kothari-Somma (CKS) algorithm. We outline the interface's software architecture, encompassing constructors, core utilities, arithmetic composition, and algorithmic applications such as matrix inversion, polynomial filtering, and Hamiltonian simulation. Through code examples, we demonstrate how this interface simplifies both the practical realization of advanced quantum algorithms and their associated resource estimation.
Authors: Daniel Marks, Dario Paccagnan, Mark van der Wilk
Abstract: Variational inference (VI) is a central tool in modern machine learning, used to approximate an intractable target density by optimising over a tractable family of distributions. As the variational family cannot typically represent the target exactly, guarantees on the quality of the resulting approximation are crucial for understanding which of its properties VI can faithfully capture. Recent work has identified instances in which symmetries of the target and the variational family enable the recovery of certain statistics, even under model misspecification. However, these guarantees are inherently problem-specific and offer little insight into the fundamental mechanism by which symmetry forces statistic recovery. In this paper, we overcome this limitation by developing a general theory of symmetry-induced statistic recovery in variational inference. First, we characterise when variational minimisers inherit the symmetries of the target and establish conditions under which these pin down identifiable statistics. Second, we unify existing results by showing that previously known statistic recovery guarantees in location-scale families arise as special cases of our theory. Third, we apply our framework to distributions on the sphere to obtain novel guarantees for directional statistics in von Mises-Fisher families. Together, these results provide a modular blueprint for deriving new recovery guarantees for VI in a broad range of symmetry settings.
Authors: Hafiza Syeda Yusra Tirmizi, Syed Ibad Hasnain, Muhammad Faris, Rabail Khowaja, Saad Abdullah
Abstract: The most common cause of dementia is Alzheimer disease, a progressive neurodegenerative disorder affecting older adults that gradually impairs memory, cognition, and behavior. It is characterized by the accumulation of abnormal proteins in the brain, including amyloid-beta plaques and neurofibrillary tangles of tau protein, which disrupt neuronal communication and lead to neuronal death. Early manifestations typically include mild memory impairment and reduced ability to acquire new information. As the disease progresses, patients experience severe cognitive decline, loss of independence, and significant personality and behavioral changes. Although the exact etiology of Alzheimer disease remains unclear, factors such as age, genetic predisposition, lifestyle, and cardiovascular health contribute to its development. While no definitive cure exists, early diagnosis, pharmacological interventions, and supportive care can slow progression and improve quality of life. This study presents a predictive cheminformatics-based model for identifying natural medicinal compounds with potential therapeutic efficacy against Alzheimer disease. The model functions as a drug screening system utilizing molecular descriptors and machine learning to detect anti-Alzheimer activity. More than 7,000 compounds from ChEBI, SynSysNet, and INDOFINE were preprocessed using Open Babel and analyzed with Dragon descriptors. A Random Forest classifier trained on approved treatments achieved moderate performance, with precision of 0.5970 and recall of 0.6590, identifying 73 candidate compounds. Key descriptors included atomic polarizability, bond multiplicity, and non-hydrogen bond counts. These findings demonstrate the value of cheminformatics in early-stage drug discovery for Alzheimer disease.
Authors: Jonas Arruda, Sophie Chervet, Paula Staudt, Andreas Wieser, Michael Hoelscher, Isabelle Sermet-Gaudelus, Nadine Binder, Lulla Opatowski, Jan Hasenauer
Abstract: Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
Authors: Runhao Jiang, Renchi Yang, Donghao Wu
Abstract: Recommender systems have advanced markedly over the past decade by transforming each user/item into a dense embedding vector with deep learning models. At industrial scale, embedding tables constituted by such vectors of all users/items demand a vast amount of parameters and impose heavy compute and memory overhead during training and inference, hindering model deployment under resource constraints. Existing solutions towards embedding compression either suffer from severely compromised recommendation accuracy or incur considerable computational costs. To mitigate these issues, this paper presents BACO, a fast and effective framework for compressing embedding tables. Unlike traditional ID hashing, BACO is built on the idea of exploiting collaborative signals in user-item interactions for user and item groupings, such that similar users/items share the same embeddings in the codebook. Specifically, we formulate a balanced co-clustering objective that maximizes intra-cluster connectivity while enforcing cluster-volume balance, and unify canonical graph clustering techniques into the framework through rigorous theoretical analyses. To produce effective groupings while averting codebook collapse, BACO instantiates this framework with a principled weighting scheme for users and items, an efficient label propagation solver, as well as secondary user clusters. Our extensive experiments comparing BACO against full models and 18 baselines over benchmark datasets demonstrate that BACO cuts embedding parameters by over 75% with a drop of at most 1.85% in recall, while surpassing the strongest baselines by being up to 346X faster.
Authors: Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Bogdan Kulynych
Abstract: State-of-the-art Differentially Private (DP) synthetic data generators such as MST and AIM are widely used, yet tightly auditing their privacy guarantees remains challenging. We introduce a Gaussian Differential Privacy (GDP)-based auditing framework that measures privacy via the full false-positive/false-negative tradeoff. Applied to MST and AIM under worst-case settings, our method provides the first tight audits in the strong-privacy regime. For $(\epsilon,\delta)=(1,10^{-2})$, we obtain $\mu_{emp}\approx0.43$ vs. implied $\mu=0.45$, showing a small theory-practice gap. Our code is publicly available: https://github.com/sassoftware/dpmm.
Authors: Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma
Abstract: Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
Authors: Michal Valko, R\'emi Munos, Branislav Kveton, Tom\'a\v{s} Koc\'ak
Abstract: Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content-recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
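"Smooth on a graph" here means the payoff vector has a small Laplacian quadratic form $x^\top L x$, i.e. it is concentrated on low-frequency Laplacian eigenvectors. The sketch below illustrates that notion on a toy path graph (illustrative only, not the paper's bandit algorithms):

```python
import numpy as np

# Path graph on n nodes: neighbouring items should have similar expected ratings.
n = 50
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A              # combinatorial Laplacian

eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
smooth = eigvecs[:, 1]                      # lowest non-constant eigenvector (unit norm)
rng = np.random.default_rng(4)
rough = rng.normal(size=n)
rough /= np.linalg.norm(rough)

smooth_energy = smooth @ L @ smooth         # equals eigvals[1]; small
rough_energy = rough @ L @ rough            # a random unit vector is far less smooth
```

Payoff vectors with small `smooth_energy`-style norms are exactly those for which the paper's effective dimension stays small, making low-regret learning possible on large graphs.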
Authors: Florentin Coeurdoux, Gr\'egoire Ferr\'e, Jean-Philippe Bouchaud
Abstract: Empirical studies of trained models often report a transient regime in which signal is detectable in a finite gradient descent time window before overfitting dominates. We provide an analytically tractable random-matrix model that reproduces this phenomenon for gradient flow in a linear teacher--student setting. In this framework, learning occurs when an isolated eigenvalue separates from a noisy bulk, before eventually disappearing in the overfitting regime. The key ingredient is anisotropy in the input covariance, which induces fast and slow directions in the learning dynamics. In a two-block covariance model, we derive the full time-dependent bulk spectrum of the symmetrized weight matrix through a $2\times 2$ Dyson equation, and we obtain an explicit outlier condition for a rank-one teacher via a rank-two determinant formula. This yields a transient Baik-Ben Arous-P\'ech\'e (BBP) transition: depending on signal strength and covariance anisotropy, the teacher spike may never emerge, emerge and persist, or emerge only during an intermediate time interval before being reabsorbed into the bulk. We map the corresponding phase diagrams and validate the theory against finite-size simulations. Our results provide a minimal solvable mechanism for early stopping as a transient spectral effect driven by anisotropy and noise.
Authors: Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, Fan Shi
Abstract: Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
Authors: Tianshi Cao, Jiawei Ren, Yuxuan Zhang, Jaewoo Seo, Jiahui Huang, Shikhar Solanki, Haotian Zhang, Mingfei Guo, Haithem Turki, Muxingzi Li, Yue Zhu, Sipeng Zhang, Zan Gojcic, Sanja Fidler, Kangxue Yin
Abstract: Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.
Authors: Abdeladhim Tahimi
Abstract: This paper is a step-by-step, self-contained guide to the complete training cycle of a Physics-Informed Neural Network (PINN) -- a topic that existing tutorials and guides typically delegate to automatic differentiation libraries without exposing the underlying algebra. Using a first-order initial value problem with a known analytical solution as a running example, we walk through every stage of the process: forward propagation of both the network output and its temporal derivative, evaluation of a composite loss function built from the ODE residual and the initial condition, backpropagation of gradients -- with particular attention to the product rule that arises in hidden layers -- and a gradient descent parameter update. Every calculation is presented with explicit, verifiable numerical values using a 1-3-3-1 multilayer perceptron with two hidden layers and 22 trainable parameters. From these concrete examples, we derive general recursive formulas -- expressed as sensitivity propagation relations -- that extend the gradient computation to networks of arbitrary depth, and we connect these formulas to the automatic differentiation engines used in practice. The trained network is then validated against the exact solution, achieving a relative $L^2$ error of $4.290 \times 10^{-4}$ using only the physics-informed loss, without any data from the true solution. A companion Jupyter/PyTorch notebook reproduces every manual calculation and the full training pipeline, providing mutual validation between hand-derived and machine-computed gradients.
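The abstract's forward propagation of both the network output and its temporal derivative can be reproduced at miniature scale. The sketch below uses a 1-3-1 tanh network with random weights (smaller than the paper's 1-3-3-1 worked example, and the weights are illustrative) and checks the hand-derived chain-rule derivative against a central finite difference:

```python
import numpy as np

# A 1-3-1 network u(t) = W2 . tanh(W1 * t + b1) + b2 with random weights.
rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=3), rng.normal(size=3)
W2, b2 = rng.normal(size=3), rng.normal()

def u(t):
    return W2 @ np.tanh(W1 * t + b1) + b2

def du_dt(t):
    # Chain rule: d/dt tanh(z) = (1 - tanh(z)^2) * dz/dt, with dz/dt = W1.
    z = W1 * t + b1
    return W2 @ ((1.0 - np.tanh(z) ** 2) * W1)

t0, h = 0.7, 1e-6
fd = (u(t0 + h) - u(t0 - h)) / (2.0 * h)    # central finite difference
# du_dt(t0) agrees with fd to high precision
```

In a PINN, `du_dt` would enter the ODE residual term of the composite loss alongside the initial-condition term; the paper then backpropagates through both, which is where the product rule it highlights appears.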
Authors: Sacha Huriot, Ihab Tabbara, Hussein Sibai
Abstract: Safety filters have been shown to be effective tools to ensure the safety of control systems with unsafe nominal policies. To address scalability challenges in traditional synthesis methods, learning-based approaches have been proposed for designing safety filters for systems with high-dimensional state and control spaces. However, the inevitable errors in the decisions of these models raise concerns about their reliability and the safety guarantees they offer. This paper presents Adaptive Conformal Filtering (ACoFi), a method that combines learned Hamilton-Jacobi reachability-based safety filters with adaptive conformal inference. Under ACoFi, the filter dynamically adjusts its switching criteria based on the observed errors in its predictions of the safety of actions. The range of possible safety values of the nominal policy's output is used to quantify uncertainty in safety assessment. The filter switches from the nominal policy to the learned safe one when that range suggests it might be unsafe. We show that ACoFi guarantees that the rate of incorrectly quantifying uncertainty in the predicted safety of the nominal policy is asymptotically upper bounded by a user-defined parameter. This gives a soft safety guarantee rather than a hard safety guarantee. We evaluate ACoFi in a Dubins car simulation and a Safety Gymnasium environment, empirically demonstrating that it significantly outperforms the baseline method that uses a fixed switching threshold by achieving higher learned safety values and fewer safety violations, especially in out-of-distribution scenarios.
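The adaptive conformal inference ingredient can be illustrated in isolation: after each step, the running miscoverage level is nudged toward the target, which bounds the long-run error rate regardless of the score distribution. This is a generic ACI sketch with a toy uniform score, not the ACoFi filter itself:

```python
import numpy as np

target, gamma, T = 0.1, 0.01, 50_000    # desired miscoverage, step size, horizon
alpha_t = target
rng = np.random.default_rng(6)
errs = []
for _ in range(T):
    score = rng.uniform()                       # nonconformity score at this step
    err = 1.0 if score > 1.0 - alpha_t else 0.0  # 1 = prediction set missed the truth
    errs.append(err)
    alpha_t += gamma * (target - err)           # adaptive conformal update
miscoverage = float(np.mean(errs))
# the long-run miscoverage rate is close to the 0.1 target
```

The same update, applied to predicted safety values of the nominal policy, is what lets a filter like ACoFi adjust its switching criterion online and bound the asymptotic error rate by a user-chosen parameter.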
Authors: Jun Chen, Umberto Biccari, Junmin Wang
Abstract: We propose a computational framework for replacing the repeated numerical solution of differential Riccati equations in finite-horizon Linear Quadratic Regulator (LQR) problems by a learned operator surrogate. Instead of solving a nonlinear matrix-valued differential equation for each new system instance, we construct offline an approximation of the associated solution operator mapping time-dependent system parameters to the Riccati trajectory. The resulting model enables fast online evaluation of approximate optimal feedbacks across a wide class of systems, thereby shifting the computational burden from repeated numerical integration to a one-time learning stage. From a theoretical perspective, we establish control-theoretic guarantees for this operator-based approximation. In particular, we derive bounds quantifying how operator approximation errors propagate to feedback performance, trajectory accuracy, and cost suboptimality, and we prove that exponential stability of the closed-loop system is preserved under sufficiently accurate operator approximation. These results provide a framework to assess the reliability of data-driven approximations in optimal control. On the computational side, we design tailored DeepONet architectures for matrix-valued, time-dependent problems and introduce a progressive learning strategy to address scalability with respect to the system dimension. Numerical experiments on both time-invariant and time-varying LQR problems demonstrate that the proposed approach achieves high accuracy and strong generalization across a wide range of system configurations, while delivering substantial computational speedups compared to classical solvers. The method offers an effective and scalable alternative for parametric and real-time optimal control applications.
Authors: Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
Abstract: Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
Authors: Leon Bungert, Lucas Schmitt
Abstract: Adversarial training of binary classifiers can be reformulated as regularized risk minimization involving a nonlocal total variation. Building on this perspective, we establish a characterization of the subdifferential of this total variation using duality techniques. To achieve this, we derive a dual representation of the nonlocal total variation and a related integration by parts formula, involving a nonlocal gradient and divergence. We provide such duality statements both in the space of continuous functions vanishing at infinity on proper metric spaces and for the space of essentially bounded functions on Euclidean domains. Furthermore, under some additional conditions we provide characterizations of the subdifferential in these settings.
Authors: Joonhyuk Lee, Virginia Ma, Sarah Zhao, Yash Nair, Asher Spector, Regev Cohen, Emmanuel J. Cand\`es
Abstract: Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.
Authors: Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Dan Alistarh
Abstract: Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3 and 8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus remaining fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
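To illustrate the relaxation this abstract refers to, here is a minimal, self-contained sketch of a single Gumbel-Softmax sample over a small quantization grid. This is not the GSQ implementation; the grid, logits, and temperature below are purely illustrative.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    # Differentiable relaxation of a one-hot choice over quantization
    # levels: perturb the logits with Gumbel noise, then apply a
    # temperature-controlled softmax.
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max()                      # numerical stability
    probs = np.exp(y)
    return probs / probs.sum()

rng = np.random.default_rng(0)
grid = np.array([-1.0, 0.0, 1.0])        # ternary grid (3 levels)
logits = np.array([0.2, 1.5, -0.3])      # learnable per-weight assignment logits
probs = gumbel_softmax(logits, tau=0.5, rng=rng)
soft_value = probs @ grid                # relaxed (soft) quantized weight value
```

As the temperature `tau` is annealed toward zero, the soft assignment approaches a hard one-hot pick of a single grid level, which is what makes the relaxation usable for learning discrete grid assignments by gradient descent.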
Authors: Minji Lee, Colin Kalicki, Minkyu Jeon, Aymen Qabel, Alisia Fadini, Mohammed AlQuraishi
Abstract: Models from the AlphaFold (AF) family reliably predict one dominant conformation for most well-ordered proteins but struggle to capture biologically relevant alternate states. Several efforts have focused on eliciting greater conformational variability through ad hoc inference-time perturbations of AF models or their inputs. Despite their progress, these approaches remain inefficient and fail to consistently recover major conformational modes. Here, we investigate both the optimal location and manner-of-operation for perturbing latent representations in the AF3 architecture. We distill our findings in ConforNets: channel-wise affine transforms of the pre-Pairformer pair latents. Unlike previous methods, ConforNets globally modulate AF3 representations, making them reusable across proteins. On unsupervised generation of alternate states, ConforNets achieve state-of-the-art success rates on all existing multi-state benchmarks. On the novel supervised task of conformational transfer, ConforNets trained on one source protein can induce a conserved conformational change across a protein family. Collectively, these results introduce a mechanism for conformational control in AF3-based models.
Authors: Terry Leitch
Abstract: We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best local model reaches 77\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\% on model building steps and 47--75\% on feedback explanation, but only 0--50\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of \textit{model type effects} on performance: we compare reasoning vs.\ instruction-tuned architectures, GGUF (llama.cpp) vs.\ MLX (mlx\_lm) backends, and quantization levels (Q3 / Q4\_K\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.
Authors: Maria-Eleni Sfyraki, Jun-Kun Wang
Abstract: In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. Furthermore, if the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explore different values of the mixing parameter and observe an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the maximum value permitted by the constraint when that constraint is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.
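To make the mixing scheme concrete, here is a minimal sketch of a mixed query probability and a prediction-powered mean estimate. The functional forms, constants, and synthetic data below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def query_probability(uncertainty, lam, c):
    # Mix an uncertainty-based suggestion with a constant probability c;
    # lam close to 1 downweights the uncertainty-based component.
    return np.clip((1.0 - lam) * uncertainty + lam * c, 1e-6, 1.0)

def pp_mean_estimate(preds, labels, queried, q):
    # Prediction-powered style estimator: model predictions plus an
    # inverse-probability-weighted correction on the queried samples.
    correction = queried * (labels - preds) / q
    return np.mean(preds + correction)

rng = np.random.default_rng(0)
n = 20000
labels = rng.normal(1.0, 1.0, n)           # ground-truth values
preds = labels + rng.normal(0.0, 0.5, n)   # imperfect model predictions
u = rng.uniform(0.1, 0.9, n)               # stand-in uncertainty scores
q = query_probability(u, lam=0.8, c=0.3)
queried = (rng.uniform(size=n) < q).astype(float)

est = pp_mean_estimate(preds, labels, queried, q)
```

The inverse-probability correction keeps the estimate unbiased regardless of how `q` mixes the two components; what the mixing parameter changes is the variance, which is the quantity the abstract's confidence-width observation concerns.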
Authors: A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros
Abstract: The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
Authors: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba
Abstract: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
URLs: https://mathnet.mit.edu.
Authors: Xin Zheng, Yi Wang, Yixin Liu, Ming Li, Miao Zhang, Di Jin, Philip S. Yu, Shirui Pan
Abstract: Recent years have witnessed fast developments of graph neural networks (GNNs) that have benefited myriad graph analytic tasks and applications. Most GNNs rely on the homophily assumption that nodes belonging to the same class are more likely to be connected. However, as a ubiquitous graph property in numerous real-world scenarios, heterophily, i.e., nodes with different labels tend to be linked, significantly limits the performance of tailor-made homophilic GNNs. Hence, GNNs for heterophilic graphs are gaining increasing research attention to enhance graph learning with heterophily. In this paper, we provide a comprehensive review of GNNs for heterophilic graphs. Specifically, we propose a systematic taxonomy that governs existing heterophilic GNN models, along with general summaries and detailed analyses. Furthermore, we discuss the relationship between heterophily and various graph research domains, aiming to facilitate the development of more effective GNNs across a spectrum of practical applications and learning tasks in the graph research community. In the end, we point out potential directions to advance and inspire future research and applications on heterophilic graph learning with GNNs.
Authors: Jesus Tordesillas, Victor Klemm, Jonathan P. How, Marco Hutter
Abstract: Despite the numerous applications of convex constraints in Robotics, enforcing them within learning-based frameworks remains an open challenge. Existing techniques either fail to guarantee satisfaction at all times, or incur prohibitive computational costs. This paper presents RAYEN, a framework for imposing hard convex constraints on the output or latent variables of a neural network. RAYEN guarantees constraint satisfaction during both training and testing, for any input and any network weights. Unlike prior approaches, RAYEN avoids computationally expensive orthogonal projections, soft constraints, conservative approximations of the feasible set, and slow iterative corrections. RAYEN supports any combination of linear, convex quadratic, second-order cone (SOC), and linear matrix inequality (LMI) constraints, with negligible overhead compared to unconstrained networks. For instance, it imposes 1K quadratic constraints on a 1K-dimensional variable with only 8 ms of overhead compared to a network that does not enforce these constraints. An LMI constraint with 300x300 dense matrices on a 10K-dimensional variable can be guaranteed with only 12 ms additional overhead. When used in neural networks that approximate the solution of constrained trajectory optimization problems, RAYEN runs 20 to 7468 times faster than state-of-the-art algorithms, while guaranteeing constraint satisfaction at all times and achieving a near-optimal cost (<1.5% optimality gap). Finally, we demonstrate RAYEN's ability to enforce actuator constraints on a learned locomotion policy by validating constraint satisfaction in both simulation and real-world experiments on a quadruped robot. The code is available at https://github.com/leggedrobotics/rayen
Authors: Dimitris Bertsimas, Cheol Woo Kim
Abstract: We propose an approach based on machine learning to solve two-stage linear adaptive robust optimization (ARO) problems with binary here-and-now variables and polyhedral uncertainty sets. We encode the optimal here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the optimal wait-and-see decisions into what we denote as the strategy. We solve multiple similar ARO instances in advance using the column and constraint generation algorithm and extract the optimal strategies to generate a training set. We train machine learning models that predict high-quality strategies for the here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the wait-and-see decisions. The models can be applied to problems with varying dimensions. We also introduce novel methods to expedite training data generation and reduce the number of different target classes the machine learning algorithm needs to be trained on. We apply the proposed approach to the facility location, the multi-item inventory control and the unit commitment problems. Our approach solves ARO problems drastically faster than the state-of-the-art algorithms with high accuracy.
Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu
Abstract: Diffusion models have become a leading paradigm in generative AI, with score estimation via denoising score matching as a central component. While recent theory provides strong statistical guarantees, it typically relies on algorithm-agnostic assumptions and treats empirical risk minimization as if it were solved exactly. In practice, however, score functions are parameterized by highly nonconvex neural networks and trained by gradient descent (GD), and it remains unclear whether such practical procedures admit rigorous guarantees. We take a first step toward this question by developing a mathematical framework for score estimation with GD-trained neural networks. Our analysis addresses both optimization and generalization. We introduce a parametric formulation that reduces denoising score matching to a regression problem with noisy labels. This setting poses several challenges, including unbounded inputs, vector-valued outputs, and an additional time variable, which prevent a direct application of existing techniques. We show that, with a suitable design, the dynamics of GD-trained networks can be approximated by a sequence of localized kernel regression problems. We also show that prolonged training on noisy labels leads to overfitting, and derive an early-stopping rule adapted to unbounded domains. As a consequence, we establish the first minimax-optimal generalization bounds for GD-trained neural networks in diffusion models. Experiments on the Credit Default dataset further show that our theory-guided training framework achieves performance comparable to heavily tuned heuristic methods for generating high-fidelity financial tabular data.
Authors: Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Ajay Jaiswal, Li Shen, Xiaolong Ma, Shiwei Liu, Lu Yin
Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by 1.26 times. Moreover, we show that by applying lightweight expert fine-tuning -- only to the condensed layers -- and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance. Our code is available at: https://github.com/duterscmy/CD-MoE/tree/main.
Authors: Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang
Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, they struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively correct movement misalignment and produce realistic object interactions? We first point out that offline RL-finetuning algorithms for text-to-video models are equivalent when derived from a unified probabilistic objective. This perspective highlights that there is no algorithmically dominant method in principle; rather, we should focus on the properties of the reward and the data. While human feedback is less scalable, vision-language models can perceive video scenes much as humans do. We then propose leveraging vision-language models to provide perceptual feedback specifically tailored to object dynamics in videos. Compared to popular video quality metrics measuring alignment or dynamics, the experiments demonstrate that our approach with binary AI feedback drives the most significant improvements in the quality of interaction scenes in video, as confirmed by AI, human, and quality metric evaluations. Notably, we observe substantial gains when using signals from vision-language models, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
Authors: Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn
Abstract: Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.
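A minimal sketch of the reweighting idea follows, assuming a Bradley-Terry-style likelihood for each head and a uniform prior over heads; the likelihood model and the toy margins below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def head_log_likelihood(head_margins, prefs):
    # Bradley-Terry style log-likelihood of the observed preferences
    # (1 = first response preferred) under one head's reward margins.
    p = 1.0 / (1.0 + np.exp(-head_margins))
    return np.sum(prefs * np.log(p) + (1 - prefs) * np.log(1 - p))

def reweight_heads(margins_per_head, prefs):
    # Bayesian update: start from a uniform prior over heads and
    # upweight heads whose predictions match the few labeled pairs.
    log_post = np.array([head_log_likelihood(m, prefs) for m in margins_per_head])
    log_post -= log_post.max()           # numerical stability
    w = np.exp(log_post)
    return w / w.sum()

# Two hypothetical heads scoring 5 labeled preference pairs:
# head 0 agrees with the user, head 1 mostly disagrees.
margins = np.array([[2.0, 1.5, 3.0, 2.5, 1.0],
                    [-1.0, 0.5, -2.0, -1.5, 0.2]])
prefs = np.ones(5)  # user preferred the first response in every pair
w = reweight_heads(margins, prefs)
```

The resulting weights `w` can then scale the heads' reward predictions at inference time, which is why the adaptation costs only a forward pass plus this cheap update.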
Authors: Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter
Abstract: Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Leading uncertainty estimation methods generate and analyze multiple output sequences, which is computationally expensive and impractical at scale. In this work, we inspect the theoretical foundations of these methods and explore new directions to enhance computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically principled uncertainty measure. To approximate this alternative measure, we propose G-NLL, obtained using a single output sequence from greedy decoding. This approach streamlines uncertainty estimation while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various scenarios. Our work lays the theoretical foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of the prevalent methods that are more complex and resource-intensive.
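Since G-NLL is simply the negated sum of the greedy sequence's per-token log-probabilities, it can be sketched in a few lines; the log-probability values below are made up for illustration.

```python
def g_nll(token_logprobs):
    # G-NLL: negative log-likelihood of the greedy-decoded sequence,
    # i.e. the negated sum of its per-token log-probabilities.
    return -sum(token_logprobs)

# Hypothetical per-token log-probabilities from two greedy decodes.
confident = [-0.05, -0.10, -0.02, -0.08]
uncertain = [-1.20, -0.90, -1.50, -0.70]

# A higher G-NLL signals greater uncertainty about the generated answer.
low = g_nll(confident)
high = g_nll(uncertain)
```

The practical appeal is that a single greedy decode already exposes these token log-probabilities, so no extra sampled sequences are needed.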
Authors: Fernando H. O. Duarte, Gladston J. P. Moreira, Eduardo J. S. Luz, Leonardo B. L. Santos, Vander L. S. Freitas
Abstract: The COVID-19 pandemic has claimed millions of lives, spurring the development of diverse forecasting models. In this context, the true utility of complex spatio-temporal architectures versus simpler temporal baselines remains a subject of debate. Here, we show that structural sparsification of the input graph and temporal granularity are determining factors for the effectiveness of Graph Neural Networks (GNNs). By leveraging human mobility networks in Brazil and China, we address a conflicting scenario in the literature: while standard LSTMs suffice for smooth, monotonic cumulative trends, GNNs significantly outperform baselines when forecasting volatile daily case counts. We show that backbone extraction substantially enhances predictive stability and reduces predictive error by removing negligible connections. Our results indicate that incorporating spatial dependencies is essential for modeling complex dynamics. Specifically, GNN architectures such as GCRN and GCLSTM outperform the LSTM baseline (Nemenyi test, p < 0.05) on datasets from Brazil and China for daily case predictions. Lastly, we frame the problem as a binary classification task to better analyze the dependency between context sizes and prediction horizons.
Authors: Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu
Abstract: Given a partially observed road network, how can we predict the traffic state of interested unobserved locations? Traffic prediction is crucial for advanced traffic management systems, with deep learning approaches showing exceptional performance. However, most existing approaches assume sensors are deployed at all locations of interest, which is impractical due to financial constraints. Furthermore, these methods are typically fragile to structural changes in sensing networks, which require costly retraining even for minor changes in sensor configuration. To address these challenges, we propose MoGERNN, an inductive spatio-temporal graph model with two key components: (i) a Mixture of Graph Experts (MoGE) with sparse gating mechanisms that dynamically route nodes to specialized graph aggregators, capturing heterogeneous spatial dependencies efficiently; (ii) a graph encoder-decoder architecture that leverages these embeddings to capture both spatial and temporal dependencies for comprehensive traffic state prediction. Experiments on two real-world datasets show MoGERNN consistently outperforms baseline methods for both observed and unobserved locations. MoGERNN can accurately predict congestion evolution even in areas without sensors, offering valuable information for traffic management. Moreover, MoGERNN is adaptable to the changes of sensor network, maintaining competitive performance even compared to its retrained counterpart. Tests performed with different numbers of available sensors confirm its consistent superiority, and ablation studies validate the effectiveness of its key modules. The code of this work is publicly available at: https://github.com/ZJU-TSELab/MoGERNN.
Authors: Philipp M. Faller, Dominik Janzing
Abstract: Conditional-independence-based discovery uses statistical tests to identify a graphical model that represents the independence structure of variables in a dataset. These tests, however, can be unreliable, and algorithms are sensitive to errors and violated assumptions. Often, there are tests that were not used in the construction of the graph. In this work, we show that these redundant tests have the potential to detect or sometimes correct errors in the learned model. But we further show that not all tests contain this additional information and that such redundant tests have to be applied with care. More precisely, we argue that the conditional (in)dependence statements that hold for every probability distribution are unlikely to detect and correct errors - in contrast to those that follow only from graphical assumptions.
Authors: Stefan K. Nielsen, Rachel S. Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
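One way to picture the adaptive scaling is a simple inverse-variance weighting rule: features along which an expert's cluster is tight get larger weights. The precise weights derived in the paper may differ; this is an illustrative sketch only, with hypothetical toy clusters.

```python
import numpy as np

def feature_weights(cluster_points, eps=1e-6):
    # Assumed rule: features along which the expert's cluster is tight
    # (low variance) receive larger weights, sharpening separation.
    var = cluster_points.var(axis=0)
    w = 1.0 / (var + eps)
    return w / w.sum()

def route(token, centroids, clusters):
    # Send the token to the expert whose centroid is closest under that
    # expert's adaptively scaled (weighted) squared distance.
    scores = [np.sum(feature_weights(pts) * (token - c) ** 2)
              for c, pts in zip(centroids, clusters)]
    return int(np.argmin(scores))

rng = np.random.default_rng(1)
# Two toy expert clusters: tight along feature 0, loose along feature 1.
cluster_a = rng.normal([0.0, 0.0], [0.1, 1.0], (50, 2))
cluster_b = rng.normal([5.0, 5.0], [0.1, 1.0], (50, 2))
centroids = [cluster_a.mean(axis=0), cluster_b.mean(axis=0)]
expert = route(np.array([0.2, 0.1]), centroids, [cluster_a, cluster_b])
```

The point of the per-expert weighting is that clusters which overlap under plain Euclidean distance can become well separated in each expert's rescaled space, which is the identifiability benefit the abstract describes.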
Authors: Zhengdao Chen
Abstract: Diffusion models have achieved remarkable progress in various domains with an intriguing ability to produce new data that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from the neural network backbone learning a smoothed version of the empirical score function, which guides the denoising dynamics to generate data points that interpolate the training data. Focusing mainly on settings where the training set lies uniformly in a one-dimensional subspace, we elucidate the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments, demonstrating how smoothing the score function can cause the denoised data samples to interpolate the training set along the subspace. Moreover, we present theoretical and empirical evidence that learning score functions with neural networks - either with or without explicit regularization - can naturally achieve a similar effect, including when the data belong to simple nonlinear manifolds.
Authors: Benjamin Gutteridge, Matthew Thomas Jackson, Toni Kukurin, Xiaowen Dong
Abstract: Handwriting text recognition (HTR) remains a challenging task. Existing approaches require fine-tuning on labeled data, which is impractical to obtain for real-world problems, or rely on zero-shot tools such as OCR engines and multi-modal LLMs (MLLMs). MLLMs have shown promise both as end-to-end transcribers and as OCR post-processors, but to date there is little empirical research evaluating different MLLM prompting strategies for HTR, particularly for the case of multi-page documents. Most handwritten documents are multi-page, and share context such as semantic content and handwriting style across pages, yet MLLMs are typically used for transcription at the page level, meaning they throw away this shared context. They are also typically used as either text-only post-processors or image-only OCR alternatives, rather than leveraging multiple modes. This paper investigates a suite of methods combining OCR, LLM post-processing and MLLM end-to-end transcription, for the task of zero-shot multi-page handwritten document transcription. We introduce a benchmark for this task from existing single-page datasets, including a new dataset, Malvern-Hills. Finally, we introduce OCR+PAGE-1 and OCR+PAGE-N, prompting strategies for multi-page transcription that outperform existing methods by sharing content across pages while minimizing prompt complexity.
Authors: Qishen Zhou, Yifan Zhang, Michail A. Makridis, Anastasios Kouvelas, Yibing Wang, Simon Hu
Abstract: Network-wide Traffic State Estimation (TSE), which aims to infer a complete image of network traffic states with sparsely deployed sensors, plays a vital role in intelligent transportation systems. With the development of data-driven methods, traffic dynamics modeling has advanced significantly. However, TSE poses fundamental challenges for data-driven approaches, since historical patterns cannot be learned locally at sensor-free segments. Although graph representation learning shows promise in estimating states at locations without sensors, existing methods typically handle unobserved locations by filling them with zeros, introducing bias to the sensitive graph message propagation. The recently proposed Dirichlet Energy-based Feature Propagation (DEFP) method achieves State-Of-The-Art (SOTA) performance in unobserved node classification by eliminating the need for zero-filling. However, applying it to TSE faces three key challenges: inability to handle directed traffic networks, strong assumptions in traffic spatial correlation modeling, and overlooking distinct propagation rules of different patterns (e.g., congestion and free flow). We propose DGAE, a novel inductive graph representation model that addresses these challenges through theoretically derived DEFP for Directed graph (DEFP4D), enhanced spatial representation learning via DEFP4D-guided latent space encoding, and physics-guided propagation mechanisms that separately handle congested and free-flow patterns. Experiments on three traffic datasets demonstrate that DGAE outperforms existing SOTA methods and exhibits strong cross-city transferability. Furthermore, DEFP4D can serve as a standalone lightweight solution, showing superior performance under extremely sparse sensor conditions. The code of this work is publicly available at: https://github.com/ZJU-TSELab/DGAE.
Authors: Xiao Ma, Young D. Kwon, Dong Ma
Abstract: Continual Test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm -- on-demand TTA -- which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable and even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.
Authors: Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Hongjian Fang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
Abstract: The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structured pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method, TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation, we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment. Code is available at https://github.com/fmk345/TRSP.
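The effect of the first-stage $\ell_1$ regularization can be seen in a toy form: when the task loss around each layer weight is approximated by a quadratic centered at that layer's "usefulness," the regularized minimizer is the soft-thresholding operator, which drives weights of low-usefulness layers exactly to zero. This is a hypothetical simplification for illustration, not the paper's training procedure; `usefulness` and the quadratic surrogate are assumed.

```python
import numpy as np

# Toy illustration of l1 layer-weight regularization (hypothetical
# simplification of TRSP's first stage): minimizing (w_l - u_l)^2 + lam*|w_l|
# per layer gives the soft-thresholding solution, so layers with small
# "usefulness" u_l are driven to exactly zero and become pruning candidates.
def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam / 2.0, 0.0)

usefulness = np.array([0.9, 0.05, 0.7, 0.02])  # assumed per-layer signal
w = soft_threshold(usefulness, lam=0.2)
prune_mask = w == 0.0  # layers selected for the second-stage regularization
print(w, prune_mask)
```

The zeroed entries mark the layers whose output-input difference would receive the second-stage regularization.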
Authors: Junseo Hwang, Wonguk Cho, Taesup Kim
Abstract: Fine-tuning large foundation models is essential for building expert models tailored to specialized tasks and domains, but fully updating billions of parameters is computationally prohibitive. Reducing the number of trainable parameters using Parameter-Efficient Fine-Tuning (PEFT), such as Low-Rank Adaptation (LoRA), is therefore crucial not only to reduce training costs but also to mitigate storage, caching, and serving overheads during deployment. Prior works, such as Singular Vectors-guided Fine-Tuning (SVFT), have shown that exploiting the geometry of pre-trained weights based on Singular Value Decomposition (SVD) can significantly improve parameter-efficiency, but they lack a solid theoretical foundation. In this paper, we introduce Parameter-Efficient Fine-Tuning with Column Space Projection (PiCa), a novel theoretically grounded PEFT method. We prove that projecting gradients onto the principal column space of pre-trained weights provides an effective inductive bias for adaptation and further enhance parameter efficiency through a novel weight-sharing strategy. Across diverse NLP and vision tasks, PiCa consistently outperforms state-of-the-art baselines under comparable or smaller parameter budgets, demonstrating both theoretical rigor and practical effectiveness.
Authors: Ioannis Bantzis, James B. Simon, Arthur Jacot
Abstract: When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{\frac{1}{4}}$ larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).
Authors: An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim
Abstract: Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains ranging from animals, logos, chess, board games, and optical illusions to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
Authors: Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Abstract: Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we show that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
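The Principal-Weight selection described above can be sketched in a few lines: form a truncated-SVD low-rank approximation of a weight matrix, then keep the top 5% of its entries by magnitude as the sparse trainable mask. Function and parameter names here are illustrative, not the paper's API.

```python
import numpy as np

# Minimal sketch of "Principal Weight" selection as described in LIFT:
# rank-reduce a weight matrix via truncated SVD, then mask in the top-k
# entries of the low-rank approximation by absolute magnitude.
def principal_weight_mask(W, rank, keep_frac=0.05):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_lowrank = (U[:, :rank] * S[:rank]) @ Vt[:rank]     # rank-r approximation
    k = max(1, int(keep_frac * W.size))
    thresh = np.sort(np.abs(W_lowrank).ravel())[-k]       # k-th largest magnitude
    return np.abs(W_lowrank) >= thresh                    # boolean trainable mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
mask = principal_weight_mask(W, rank=8, keep_frac=0.05)
print(mask.sum())  # ~5% of the 4096 entries
```

During fine-tuning, gradient updates would then be applied only where the mask is true.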
Authors: Minsung Kim, Nakyeong Yang, Kyomin Jung
Abstract: Large Vision-Language Models (LVLMs) can recognize individuals in images and disclose sensitive personal information about them, raising critical privacy concerns. Machine unlearning aims to remove such knowledge from the model. However, existing methods rarely prescribe what the model should output in place of the forgotten content, leading to Unlearning Aftermaths: degenerate, hallucinated, or excessively refused responses. We argue that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.
Authors: Guanzhong Chen, Shaoxiong Yang, Chao Li, Wei Liu, Jian Luan, Zenglin Xu
Abstract: Large language models (LLMs) are versatile, yet their deployment in complex real-world settings is limited by static knowledge cutoffs and the difficulty of producing controllable behavior within a single inference. Multi-agent search systems (MASS), which coordinate specialized LLM agents equipped with search tools, mitigate these issues via task decomposition and retrieval-augmented problem solving. However, optimizing LLMs for agent-specific roles remains labor-intensive with prompt engineering or supervised fine-tuning, motivating automated end-to-end training. Existing multi-agent reinforcement learning (MARL) methods such as Multi-Agent Proximal Policy Optimization (MAPPO) typically depend on large critic networks to evaluate joint actions, leading to instability and high memory costs. We introduce Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which updates policies by estimating relative advantages across heterogeneous groups of multi-agent rollouts, shifting the optimization focus from local agent performance to global system success. We further study three group rollout sampling strategies to trade off sample efficiency and optimization quality. Experiments show that MHGPO captures implicit inter-agent dependencies and consistently outperforms strong baselines in both task performance and computational efficiency.
Authors: Jacob Helwig, Sai Sreeharsha Adavi, Xuan Zhang, Yuchao Lin, Felix S. Chim, Luke Takeshi Vizzini, Haiyang Yu, Muhammad Hasnain, Saykat Kumar Biswas, John J. Holloway, Narendra Singh, N. K. Anand, Swagnik Guhathakurta, Shuiwang Ji
Abstract: We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. We evaluate our methods by generating three supersonic flow datasets, available at https://huggingface.co/divelab. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
URLs: https://huggingface.co/divelab, https://github.com/divelab/AIRS
Authors: Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, Jesse C. Cresswell
Abstract: Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem--one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference--a difficult problem even for well-studied data modalities--we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
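The MCMC skeleton that MHLP builds on is standard Metropolis-Hastings. In the sketch below, the Gaussian proposal and numeric density are stand-ins: MHLP replaces the proposal with LLM-generated prompt edits and the density with a posterior over textual parameters, so this shows only the generic accept/reject loop, not the paper's algorithm.

```python
import numpy as np

# Generic Metropolis-Hastings with a symmetric random-walk proposal.
# In MHLP the proposal would instead be an LLM edit of a textual prompt
# and log_p a posterior over prompts; this numeric version is a sketch.
def metropolis_hastings(log_p, x0, n_steps, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal()           # symmetric proposal
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop                                       # accept move
        samples.append(x)                                  # else keep current x
    return np.array(samples)

log_target = lambda x: -0.5 * (x - 2.0) ** 2               # N(2, 1) up to a constant
samples = metropolis_hastings(log_target, x0=0.0, n_steps=5000)
print(samples[1000:].mean())  # close to 2 after burn-in
```

Because the acceptance ratio only needs log-density differences, the same loop works with unnormalized posteriors, which is what makes it usable with black-box scoring.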
Authors: Daniel Lawson, Adriana Hugessen, Charlotte Cloutier, Glen Berseth, Khimya Khetarpal
Abstract: While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally correlated states are properly encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. We formalize this notion by demonstrating how encouraging long-range temporal consistency via successor representations (SR) can facilitate generalization. We then propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$ for GCBC, which theoretically approximates the successor representation in the finite MDP case through self-predictive representations, and achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
Authors: Yitong Zhou, Yucong Luo, Mingyue Cheng, Qi Liu, Jiahao Wang, Daoyu Wang, Enhong Chen
Abstract: To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data-driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast-thinking paradigm, relying on extracting historical patterns and mapping them to future values as their core modeling philosophy, and lacking an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow-thinking LLMs (e.g., OpenAI-o1) have shown remarkable multi-step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations, including high computational cost, privacy risks, and limited capacity for in-depth domain-specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow-thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time-R1, a two-stage reinforcement fine-tuning framework designed to enhance the multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model's generalization ability. In particular, we design a fine-grained multi-objective reward specifically for time series forecasting, and then introduce GRIP (group-based relative importance for policy optimization), which leverages non-uniform sampling to further encourage and optimize the model's exploration of effective reasoning paths. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.
Authors: Jackson Eshbaugh
Abstract: In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $\lambda(f)$, a diagnostic that quantifies the extent to which a regression network's input--output behavior is linearly decodable. $\lambda(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
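The linearity score described above admits a compact sketch: fit a linear surrogate (with bias) to the network's outputs by ordinary least squares and report the $R^2$ of that fit. The `f` callables below are stand-in "networks," and OLS is an assumed choice of surrogate estimator; the paper's exact protocol may differ.

```python
import numpy as np

# Sketch of the linearity score lambda(f): R^2 of a linear surrogate
# fit to a regression model's input-output behavior. OLS with a bias
# term is an assumed surrogate-fitting choice for this illustration.
def linearity_score(f, X):
    y = f(X)
    A = np.column_stack([X, np.ones(len(X))])   # linear surrogate with bias
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()          # R^2 of the surrogate fit

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
f_linear = lambda X: X @ np.array([1.0, -2.0, 0.5]) + 3.0   # stand-in "network"
f_nonlin = lambda X: np.tanh(X).sum(axis=1) ** 3            # stand-in "network"
print(linearity_score(f_linear, X))  # ~1.0: fully linearly decodable
print(linearity_score(f_nonlin, X))  # noticeably below 1.0
```

A score near 1 means a linear surrogate can reproduce the model's behavior; the abstract's point is that even then the surrogate may not explain where the model's predictive gains over linear baselines come from.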
Authors: Mingxuan Cui, Duo Zhou, Yuxuan Han, Grani A. Hanasusanto, Qiong Wang, Huan Zhang, Zhengyuan Zhou
Abstract: Deep reinforcement learning (RL) has achieved remarkable success, yet its deployment in real-world scenarios is often limited by vulnerability to environmental uncertainties. Distributionally robust RL (DR-RL) algorithms have been proposed to resolve this challenge, but existing approaches are largely restricted to value-based methods in tabular settings. In this work, we introduce Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic based DR-RL algorithm for offline learning in continuous action spaces. DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence-constrained uncertainty set. We derive the distributionally robust version of the soft policy iteration with a convergence guarantee and incorporate a generative modeling approach to estimate the unknown nominal transition models. Experimental results on five continuous RL tasks demonstrate that our algorithm achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations. Additionally, DR-SAC significantly improves computing efficiency and applicability to large-scale problems compared with existing DR-RL algorithms. Code is publicly available at github.com/Lemutisme/DR-SAC.
Authors: Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, Egor Shvetsov
Abstract: As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods, such as N:M sparsification, often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold, where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches, leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve sparse models' performance.
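The N:M pattern the abstract refers to can be sketched with simple magnitude pruning: within every contiguous block of 16 weights, keep the 8 of largest magnitude and zero the rest. This is the generic 8:16 pattern only, not the paper's full method (which additionally treats outlier weights and combines sparsity with quantization).

```python
import numpy as np

# Minimal sketch of N:M semi-structured magnitude pruning (here 8:16):
# every block of m=16 consecutive weights retains its n=8 largest-magnitude
# entries, giving the hardware-friendly pattern the abstract discusses.
def nm_prune(w, n=8, m=16):
    blocks = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each block
    idx = np.argsort(np.abs(blocks), axis=1)[:, :m - n]
    pruned = blocks.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned.reshape(w.shape)

rng = np.random.default_rng(2)
w = rng.standard_normal(64)
wp = nm_prune(w)
print((wp.reshape(-1, 16) != 0).sum(axis=1))  # exactly 8 survivors per block
```

Because each block keeps a fixed count of nonzeros, only the within-block positions need storing, which is where the per-element index overhead quoted in the abstract comes from.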
Authors: Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Muhammad Mostafa Monowar, Md. Abdul Hamid
Abstract: Real-time particle transverse momentum ($p_T$) estimation in high-energy physics demands algorithms that are both efficient and accurate under strict hardware constraints. Static machine learning models degrade under high pileup and lack physics-aware optimization, while generic graph neural networks (GNNs) often neglect domain structure critical for robust $p_T$ regression. We propose a physics-informed GNN framework that systematically encodes detector geometry and physical observables through four distinct graph construction strategies: station-as-node, feature-as-node, bending angle-centric, and pseudorapidity ($\eta$)-centric representations. This framework integrates these tailored graph structures with a novel Message Passing Layer (MPL), featuring intra-message attention and gated updates, and domain-specific loss functions incorporating $p_{T}$-distribution priors. Our co-design methodology yields superior accuracy-efficiency trade-offs compared to existing baselines. Extensive experiments on the CMS Trigger Dataset validate the approach: a station-informed EdgeConv model achieves a state-of-the-art MAE of 0.8525 with $\ge55\%$ fewer parameters than deep learning baselines, especially TabNet, while an $\eta$-centric MPL configuration also demonstrates improved accuracy with comparable efficiency. These results establish the promise of physics-guided GNNs for deployment in resource-constrained trigger systems.
Authors: Henri Arno, Thomas Demeester
Abstract: We study how to learn treatment policies from multimodal electronic health records (EHRs) that consist of tabular data and clinical text. These policies can help physicians make better treatment decisions and allocate healthcare resources more efficiently. Causal policy learning methods prioritize patients with the largest expected treatment benefit. Yet, existing estimators are designed for tabular covariates under causal assumptions that may be hard to justify in the multimodal setting. A pragmatic alternative is to apply causal estimators directly to multimodal representations, but this can produce biased treatment effect estimates when the representations do not preserve the relevant confounding information. As a result, predictive models of baseline risk are commonly used in practice to guide treatment decisions, although they are not designed to identify which patients benefit most from treatment. We propose AACE (Annotation-Assisted Coarsened Effects), an annotation-assisted approach to causal policy learning for multimodal EHRs. The method uses expert-provided annotations during training to support confounding adjustment, and then predicts treatment benefit from only multimodal representations at inference. We show that the proposed method achieves strong empirical performance across synthetic, semi-synthetic, and real-world EHR datasets, outperforming risk-based and representation-based causal baselines, and offering practical insights for applying causal machine learning in clinical practice.
Authors: Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu
Abstract: This review presents a comprehensive survey and benchmark of pulse shape discrimination (PSD) algorithms for radiation detection, classifying nearly sixty methods into statistical (time-domain, frequency-domain, neural network-based) and prior-knowledge (machine learning, deep learning) paradigms. We implement and evaluate all algorithms on two standardized datasets: an unlabeled set from a 241Am-9Be source and a time-of-flight labeled set from a 238Pu-9Be source, using metrics including Figure of Merit (FOM), F1-score, ROC-AUC, and inter-method correlations. Our analysis reveals that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. We discuss architectural suitabilities, the limitations of FOM, alternative evaluation metrics, and performance across energy thresholds. Accompanying this work, we release an open-source toolbox in Python and MATLAB, along with the datasets, to promote reproducibility and advance PSD research.
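The Figure of Merit used for PSD benchmarking has a standard closed form: the separation of the two discrimination-parameter peaks divided by the sum of their FWHMs. The sketch below computes it directly from assumed Gaussian peak parameters; the review's toolbox would estimate these from measured histograms.

```python
import numpy as np

# Standard PSD Figure of Merit: FOM = |mu1 - mu2| / (FWHM1 + FWHM2),
# with FWHM = 2*sqrt(2*ln 2) * sigma for Gaussian peaks (e.g. the neutron
# and gamma peaks of a discrimination parameter). Peak values are assumed.
def figure_of_merit(mu1, sigma1, mu2, sigma2):
    fwhm_factor = 2.0 * np.sqrt(2.0 * np.log(2.0))  # ~2.3548 per sigma
    return abs(mu1 - mu2) / (fwhm_factor * (sigma1 + sigma2))

print(round(figure_of_merit(0.12, 0.01, 0.20, 0.015), 3))
```

As the review notes, FOM assumes well-separated Gaussian peaks, which is one reason it supplements FOM with F1-score and ROC-AUC on the labeled dataset.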
Authors: Huanyu Liu, Jia Li, Yihong Dong, Chang Yu, Taozhi Chen, Lecheng Wang, Yongding Tao, Bin Gu, Ge Li
Abstract: Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on teacher models for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens CoT steps to expand the space in a controlled way. The framework enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.
Authors: Hao Luan, Yi Xian Goh, See-Kiong Ng, Chun Kai Ling
Abstract: Modifications to test-time sampling have emerged as an important extension to diffusion algorithms, with the goal of biasing the generative process to achieve a given objective without having to retrain the entire diffusion model. However, generating jointly correlated samples from multiple pre-trained diffusion models while simultaneously enforcing task-specific constraints without costly retraining has remained challenging. To this end, we propose Projected Coupled Diffusion (PCD), a novel test-time framework for constrained joint generation. PCD introduces a coupled guidance term into the generative dynamics to encourage coordination between diffusion models and incorporates a projection step at each diffusion step to enforce hard constraints. Empirically, we demonstrate the effectiveness of PCD in application scenarios of image-pair generation, object manipulation, and multi-robot motion planning. Our results show improved coupling effects and guaranteed constraint satisfaction without incurring excessive computational costs.
Authors: Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev
Abstract: Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded.
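The 1dCA task above has a compact reference implementation: a local 3-neighbourhood rule in Wolfram numbering is applied repeatedly to a binary state, and chaining it for several steps is exactly the multi-step target the models must predict. Boundary handling (periodic here) is an assumption of this sketch.

```python
# Sketch of the 1D cellular-automaton evolution underlying the task:
# each cell's next value is looked up from the rule number using the
# 3-cell neighbourhood (left, self, right), with periodic boundaries.
def ca_step(state, rule):
    n = len(state)
    return [
        (rule >> (state[(i - 1) % n] * 4 + state[i] * 2 + state[(i + 1) % n])) & 1
        for i in range(n)
    ]

state = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):               # chain the rule for 3 reasoning steps
    state = ca_step(state, 110)  # rule 110 in Wolfram numbering
print(state)
```

A model solving the benchmark must first infer `rule` from a short observed sequence and then compose `ca_step` with itself, which is where accuracy drops as the number of chained steps grows.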
Authors: Dongseok Kim, Gisung Oh
Abstract: Conventional regularization is designed to control variance, but in small-data regression it can also aggravate underfitting when predictive signal is concentrated in weak directions of a restricted representation. We study a negative-capable ridge family that permits a feasible negative region whenever the estimator remains well posed, and show that negative regularization acts there as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections. Building on this mechanism, we formalize weak-spectrum underfitting, derive a sign-switch result under conservative baseline shrinkage, and study criterion-based automatic selection over the full negative-capable family. Synthetic and semi-synthetic experiments support the theory by verifying feasibility, spectral complexity increase, sign-switch behavior, and effective recovery of negative adjustments in the predicted regimes.
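The negative-capable ridge family can be illustrated with the usual closed form: $w = (X^\top X + \lambda I)^{-1} X^\top y$ remains well posed for any $\lambda > -\lambda_{\min}(X^\top X)$, and a feasible negative $\lambda$ inflates coefficients (anti-shrinkage), most strongly along weak eigendirections. This is a generic illustration under those assumptions, not the paper's selection criterion.

```python
import numpy as np

# Ridge closed form with a feasibility check for the negative region:
# the solve is well posed whenever lam exceeds -min_eig(X^T X), and a
# negative lam inside that region acts as controlled anti-shrinkage.
def ridge(X, y, lam):
    G = X.T @ X
    min_eig = np.linalg.eigvalsh(G)[0]                  # ascending order
    assert lam > -min_eig, "lam outside the feasible (well-posed) region"
    return np.linalg.solve(G + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(30)
w_pos = ridge(X, y, lam=1.0)    # conventional shrinkage
w_neg = ridge(X, y, lam=-1.0)   # feasible negative lam: anti-shrinkage
print(np.linalg.norm(w_neg) > np.linalg.norm(w_pos))  # True
```

In the eigenbasis of $X^\top X$, lowering $\lambda$ scales each coefficient by $1/(\sigma_i^2 + \lambda)$, so the smallest-eigenvalue (weakest) directions gain the most, matching the weak-spectrum mechanism the abstract describes.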
Authors: Aditri Paul, Archan Paul
Abstract: Autonomous planetary exploration demands real-time, high-fidelity environmental perception. Standard deep learning models require massive computational resources. Conversely, space-qualified onboard computers operate under strict power, thermal, and memory limits. This disparity creates a severe engineering bottleneck, preventing the deployment of highly capable perception architectures on extraterrestrial exploration platforms. In this foundational concept paper, we propose the theoretical architecture for the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys) to resolve this bottleneck. We present a mathematical blueprint integrating an INT8 Quantized Neural Network (QNN) designed specifically for Quantization Aware Training (QAT). To address sensor fragility, we mathematically formalize an Adaptive Multi-Sensor Fusion (AMF) module. By deriving the exact integer requantization multiplier required for spatial attention gating, this module actively selects and fuses Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level, ensuring reliable perception during extreme cross-illuminations and optical hardware dropouts. Furthermore, the architecture introduces anchor-free, center-to-edge regression heads, protected by a localized FP16 coordinate conversion, to accurately frame asymmetrical lunar craters without catastrophic integer truncation. Rather than presenting physical hardware telemetry, this manuscript establishes the theoretical bounds, structural logic, and mathematical justifications for the architecture. We outline a rigorous Hardware-in-the-Loop (HITL) evaluation protocol to define the exact testing criteria required for future empirical validation, paving the way for next-generation space-mission software design.
Authors: Yuhang Liu, Tao Li, Zhehao Huang, Zuopeng Yang, Xiaolin Huang
Abstract: Fine-tuning large-scale pre-trained models with limited data presents significant challenges for generalization. While Sharpness-Aware Minimization (SAM) has proven effective in improving generalization by seeking flat minima, its substantial extra memory and computation overhead make it impractical for large models. Integrating SAM with parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) is a promising direction. However, we find that directly applying SAM to LoRA parameters limits the sharpness optimization to a restricted subspace, hindering its effectiveness. To address this limitation, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary LoRA module to model SAM's adversarial weight perturbations. It decouples SAM's weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent. Such dual-module design enables Bi-LoRA to capture broader sharpness for achieving flatter minima while remaining memory-efficient. Another important benefit is that the dual design allows for simultaneous optimization and perturbation, eliminating SAM's doubled training costs. Extensive experiments across diverse tasks and architectures demonstrate Bi-LoRA's efficiency and effectiveness in enhancing generalization.
Authors: Madhav Kanda, Shubham Ugare, Sasa Misailovic
Abstract: Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models are tasked with generating probabilistic programs, they frequently produce outputs that suffer from both syntactic and semantic errors, such as flawed inference constructs. Motivated by probabilistic programmers' domain expertise and debugging strategies, we introduce RefineStat, a language model--driven framework that enforces semantic constraints ensuring synthesized programs contain valid distributions and well-formed parameters, and then applies diagnostic-aware refinement by resampling prior or likelihood components whenever reliability checks fail. We evaluate RefineStat on multiple probabilistic-programming code-generation tasks using smaller language models (SLMs) and find that it produces programs that are both syntactically sound and statistically reliable, often matching or surpassing those from closed-source large language models (e.g., OpenAI o3).
Authors: Yu Chen, Shengwei Wang, Hongwei Lin
Abstract: The reservoir is an efficient network for time series processing. It is well known that network structure is one of the determinants of its performance. However, the topology of reservoirs, and hence their performance, is hard to analyze due to the lack of suitable mathematical tools. In this paper, we study the topology of reservoirs using persistent GLMY homology theory and develop a method to improve their performance. Specifically, we find that reservoir performance is closely related to the one-dimensional GLMY homology groups. We then develop a reservoir structure optimization method that modifies the minimal representative cycles of the one-dimensional GLMY homology groups. Finally, experiments validate that the performance of reservoirs is jointly influenced by the reservoir structure and the periodicity of the dataset.
Authors: Chuan He, Zhanwang Deng, Zhaosong Lu
Abstract: Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon \citep{jordanmuon}, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose \textit{low-rank orthogonalization}, which performs orthogonalization by leveraging the low-rank nature of gradients during NN training. Building on this, we introduce low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of Muon. Numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing the carefully tuned vanilla Muon on tasks with large model sizes. Theoretically, we establish the iteration complexity of low-rank MSGD for finding an approximate stationary solution, and the iteration complexity of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise. The code to reproduce our numerical experiments is available at https://github.com/dengzhanwang/Low-rank-Muon.
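As a rough illustration of the idea (our sketch, not the authors' implementation), low-rank orthogonalization can be read as taking the polar factor of a truncated SVD of the gradient, setting the retained singular values to 1:

```python
import numpy as np

def low_rank_orthogonalize(grad, rank):
    # Truncated SVD keeps only the top-`rank` singular triplets,
    # exploiting the (near) low-rank structure of NN gradients.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    # Orthogonalization sets the retained singular values to 1,
    # i.e., the polar factor restricted to a rank-`rank` subspace.
    return U[:, :rank] @ Vt[:rank, :]

rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 4)) @ rng.standard_normal((4, 6))  # rank <= 4
update = low_rank_orthogonalize(grad, rank=4)
```

All retained singular values of `update` equal 1, matching the matrix-sign view of Muon's update; the actual method additionally concerns how the low-rank factors are obtained cheaply during training.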
Authors: Yuchen Zhang, Mohammad Mohammadi Amiri
Abstract: Assessing the impact of training data on machine learning models is crucial for understanding model behavior, enhancing transparency, and selecting training data. The influence function provides a theoretical framework for quantifying the effect of training data points on a model's performance for a specific test point. However, the computational and memory costs of the influence function present significant challenges, especially for large-scale models, even when using approximation methods, since the gradients involved in the computation are as large as the model itself. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute the influence function more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the gradient compression process. Through theoretical analysis and empirical validation, we demonstrate that our method preserves critical components of the data influence and enables its application to modern large-scale models.
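A minimal sketch of the dropout-as-compression idea (the `masked_influence` helper is our hypothetical illustration; the paper's actual estimator also involves the inverse Hessian): apply the same random dropout mask to the training and test gradients, so only the kept coordinates are ever stored and multiplied.

```python
import random

def masked_influence(grad_train, grad_test, keep_prob, seed=0):
    # Shared dropout mask: both gradients drop the same coordinates,
    # so only a keep_prob fraction of each gradient must be kept.
    rng = random.Random(seed)
    total = 0.0
    for g_tr, g_te in zip(grad_train, grad_test):
        if rng.random() < keep_prob:
            # Inverted-dropout rescaling keeps the estimate unbiased.
            total += (g_tr * g_te) / keep_prob
    return total

# With keep_prob = 1.0 the estimate equals the exact inner product.
exact = masked_influence([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], keep_prob=1.0)
```

Averaged over masks, the compressed inner product is unbiased for the exact one, which is the property the gradient inner products in influence scores rely on.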
Authors: Hengbo Xiao, Jingyuan Fan, Xin Tong, Jingzhao Zhang, Chao Lu, Guannan He
Abstract: Tasks on complex systems require high-precision numerical computation to support decisions, but current large language models (LLMs) cannot integrate such computations as an intrinsic and interpretable capability with existing architectures. Multi-agent approaches can leverage external experts, but inevitably introduce communication overhead and suffer from inefficiency caused by limited scalability. To this end, we propose the Physically-isolated Experts Routing Network (PiERN), an architecture for integrating computation and reasoning. Instead of tool-use workflows or function calling, PiERN endogenously integrates computational capabilities into neural networks after separately training experts, a text-to-computation module, and a router. At inference, the router directs computation and reasoning at the token level, thereby enabling iterative alternation within a single chain of thought. We evaluate PiERN on representative linear and nonlinear computation-reasoning tasks against LLM finetuning and multi-agent system approaches. Results show that the PiERN architecture achieves not only higher accuracy than directly finetuning LLMs but also significant improvements in response latency, token usage, and GPU energy consumption compared with mainstream multi-agent approaches. PiERN offers an efficient, interpretable, and scalable paradigm for interfacing language models with scientific systems.
Authors: Zituo Chen, Sili Deng
Abstract: Pretraining on large-scale collections of PDE-governed spatiotemporal trajectories has recently shown promise for building generalizable models of dynamical systems. Yet most existing PDE foundation models rely on deterministic Transformer architectures, which lack generative flexibility for many science and engineering applications. We propose Flow Marching, an algorithm that bridges neural operator learning with flow matching motivated by an analysis of error accumulation in physical dynamical systems, and we build a generative PDE foundation model on top of it. By jointly sampling the noise level and the physical time step between adjacent states, the model learns a unified velocity field that transports a noisy current state toward its clean successor, reducing long-term rollout drift while enabling uncertainty-aware ensemble generations. Alongside this core algorithm, we introduce a Physics-Pretrained Variational Autoencoder (P2VAE) to embed physical states into a compact latent space, and an efficient Flow Marching Transformer (FMT) that combines a diffusion-forcing scheme with latent temporal pyramids, achieving up to 15x greater computational efficiency than full-length video diffusion models and thereby enabling large-scale pretraining at substantially reduced cost. We curate a corpus of ~2.5M trajectories across 12 distinct PDE families and train suites of P2VAEs and FMTs at multiple scales. On downstream evaluation, we benchmark on unseen Kolmogorov turbulence with few-shot adaptation, demonstrate long-term rollout stability over deterministic counterparts, and present uncertainty-stratified ensemble results, highlighting the importance of generative PDE foundation models for real-world applications.
Authors: Xingtu Liu
Abstract: This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.
Authors: Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
Abstract: Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and use tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal text generation in numeric domains.
Authors: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning is often framed as balancing exploration and exploitation in action space, typically operationalized with token-level proxies (e.g., output entropy or confidence). We argue that this apparent trade-off is largely a measurement artifact: token-level statistics reflect next-token uncertainty rather than how reasoning progresses over multi-token semantic structures. We therefore study exploration and exploitation in the hidden-state space of response trajectories. We use Effective Rank (ER) to quantify representational exploration and introduce its temporal derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to characterize exploitative refinement dynamics. Empirically and theoretically, ER and ERV exhibit near-zero correlation in semantic space, suggesting the two capacities can be improved simultaneously. Motivated by this, we propose Velocity-Exploiting Rank Learning (VERL), which shapes the RLVR advantage with an auxiliary signal derived from ER/ERV and uses the more stable ERA as a meta-control variable to adaptively balance the incentives. Across multiple base models, RLVR algorithms, and reasoning benchmarks, VERL yields consistent improvements, including large gains on challenging tasks (e.g., 21.4\% in Gaokao 2024). The code is available at https://github.com/hf618/VERL.
Authors: Hao Chen, Tao Han, Jie Zhang, Song Guo, Lei Bai
Abstract: To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model's ability to capture temporal patterns. Beyond global and regional forecasting, we evaluate our STCast on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over state-of-the-art methods across all four tasks. Code: https://github.com/chenhao-zju/STCast
Authors: Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu
Abstract: Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the 'grade' of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
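The zeroth-order end of this optimizer "grade" spectrum can be sketched with the classic two-point estimator (a generic sketch under our own notation, not the paper's specific unlearning recipe): perturb the parameters along a random direction and use a finite difference of the loss, with no backpropagation at all.

```python
import random

def zo_gradient_estimate(loss, theta, mu=1e-3, seed=0):
    """Two-point zeroth-order (gradient-free) estimator: the noisier,
    'downgraded' kind of update the abstract links to robustness."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in theta]          # random direction
    theta_plus = [t + mu * ui for t, ui in zip(theta, u)]
    theta_minus = [t - mu * ui for t, ui in zip(theta, u)]
    # Directional derivative via finite differences of two loss queries.
    g = (loss(theta_plus) - loss(theta_minus)) / (2 * mu)
    return [g * ui for ui in u]
```

In expectation over the random direction, the estimate equals the true gradient; individual estimates are noisy, which is exactly the smoothing effect the paper connects to harder-to-disturb basins.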
Authors: Yoshihiko Ozaki, Shuhei Watanabe, Toshihiko Yanase
Abstract: Black-box optimization (BBO) underpins advances in domains such as AutoML and Materials Informatics, yet implementations of algorithms and benchmarks remain fragmented across research communities. We introduce OptunaHub (https://hub.optuna.org/), a community-oriented, decentralized platform for distributing BBO components under a unified Optuna-compatible interface. OptunaHub enables independent publication, discovery, and reuse of optimization algorithms and benchmark problems through a lightweight Python module, a contributor-driven registry, and a searchable web interface. The source code is publicly available in the \href{https://github.com/optuna/optunahub}{\texttt{optunahub}}, \href{https://github.com/optuna/optunahub-registry}{\texttt{optunahub-registry}}, and \href{https://github.com/optuna/optunahub-web}{\texttt{optunahub-web}} repositories under the Optuna organization on GitHub (https://github.com/optuna/).
URLs: https://hub.optuna.org/, https://github.com/optuna/optunahub, https://github.com/optuna/optunahub-registry, https://github.com/optuna/optunahub-web, https://github.com/optuna/
Authors: Mingsong Yan, Charles Kulick, Sui Tang
Abstract: Continuous-depth graph neural networks, also known as Graph Neural Differential Equations (GNDEs), combine the structural inductive bias of Graph Neural Networks (GNNs) with the continuous-depth architecture of Neural ODEs, offering a scalable and principled framework for modeling dynamics on graphs. In this paper, we present a rigorous convergence analysis of GNDEs with time-varying parameters in the infinite-node limit, providing theoretical insights into their size transferability. To this end, we introduce Graphon Neural Differential Equations (Graphon-NDEs) as the infinite-node limit of GNDEs and establish their well-posedness. Leveraging tools from graphon theory and dynamical systems, we prove the trajectory-wise convergence of GNDE solutions to Graphon-NDE solutions. Moreover, we derive explicit convergence rates under two deterministic graph sampling regimes: (1) weighted graphs sampled from smooth graphons, and (2) unweighted graphs sampled from $\{0,1\}$-valued (discontinuous) graphons. We further establish size transferability bounds, providing theoretical justification for the practical strategy of transferring GNDE models trained on moderate-sized graphs to larger, structurally similar graphs without retraining. Numerical experiments using synthetic and real data support our theoretical findings.
Authors: Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava
Abstract: Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to 64K sequence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU in a single forward-backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware. We release our code at https://github.com/sahiljoshi515/RACE_Attention.
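The LSH ingredient can be illustrated with sign random projections (our generic SimHash sketch; RACE Attention's actual soft-LSH estimator is more elaborate): two vectors land on the same side of a random hyperplane with probability 1 - theta/pi, a monotone function of the angle between them, so hash collisions act as an angular-similarity kernel.

```python
import random

def simhash_collision_prob(x, y, n_hashes=5000, seed=0):
    """Estimate the probability that x and y agree under a random
    sign-projection hash: sign(w . x) == sign(w . y) for Gaussian w."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_hashes):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        sx = sum(a * b for a, b in zip(w, x)) >= 0.0
        sy = sum(a * b for a, b in zip(w, y)) >= 0.0
        same += sx == sy
    return same / n_hashes
```

Parallel vectors collide with probability 1 and orthogonal vectors with probability about 1/2; attention weights can then be estimated from collision counts instead of the full attention matrix.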
Authors: Haiquan Qiu, Quanming Yao
Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.
URLs: https://github.com/ucker/why-low-precision-training-fails.
Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, Jiaang Li, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
Abstract: Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address these issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-6.9B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform their larger non-recursive counterparts at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models. Our code is available at https://github.com/LivingFutureLab/MeSH/ .
Authors: Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
Authors: Han Ouyang, Ayush Singhal, Jesse Hamilton, Saeed Amal
Abstract: The stacking ensemble combining RF, LightGBM, and DNN performed well on internal test sets, exhibiting an NPV greater than 99.9% even with substantial class imbalance. While performance was lower on the external eICU cohort compared to the internal test sets, sensitivity remained robust. Therefore, the stacking ensemble may serve as a rule-out screening option for ERs and ICUs after additional prospective multi-site validation studies of its efficacy in real-world settings.
Authors: Shilaj Baral, Youngkyu Lee, Sangam Khanal, Joongoo Jeon
Abstract: Autoregressive neural surrogates offer computational acceleration for fluid dynamics but inherently suffer from error accumulation and non-physical drift during long-term rollouts. Although hybrid strategies combining surrogate models and physics-based solvers have been proposed, they are limited to manual implementations for low-dimensional benchmarks. In this study, we propose an OpenFOAM-based hybrid framework, XRePIT (eXtensible Residual-based Physics-Informed Transfer learning), characterized by its speed, robustness, and scalability. Unlike prior manual implementations (e.g., RePIT), XRePIT integrates a fully automated open-source workflow that manages the state transition between a neural surrogate and a traditional numerical solver (OpenFOAM) based on a monitored residual threshold. Using 3D buoyancy-driven flow as a testbed, we demonstrate that this residual-guided coupling enables stable long-term simulation, well beyond the stability horizon of standalone surrogates. Our results indicate that the hybrid loop achieves up to 2.91x wall-clock acceleration while maintaining relative L2 errors within O(1e-3). Furthermore, we benchmark the framework's extensibility by introducing a finite-volume-based Fourier neural operator (FVFNO), confirming that the stabilizing effect of the residual guardrail is agnostic to the underlying neural architecture. This study provides a deployable methodology for fast, robust, and automated hybrid simulation in 3D unsteady flow.
Authors: Ana K. Rivera, Anvita Bhagavathula, Alvaro Carbonero, Priya Donti
Abstract: Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF$\Delta$, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF$\Delta$ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https://github.com/MOSSLab-MIT/pfdelta.
URLs: https://huggingface.co/datasets/pfdelta/pfdelta/tree/main, https://github.com/MOSSLab-MIT/pfdelta.
Authors: Ai Jian, Jingqing Ruan, Xing Ma, Xiaoyun Zhang, Dailin Li, Weipeng Zhang, Ke Zeng, Xunliang Cai
Abstract: Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) provide greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a Task-Adaptive Rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PATRM achieves a 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models. Crucially, it boosts downstream RLHF performance by an average relative improvement of 13.6% across IFEval and InFoBench, validating its effectiveness for policy alignment. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
Authors: Xiutian Zhao, Rochelle Choenni, Rohit Saxena, Ivan Titov
Abstract: Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e., neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform diagnostic tests by deactivating the neurons flagged by various identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having limited effects on others. Moreover, we introduce a new margin-based selector Contrastive Activation Margin (ConAct) and show that it outperforms probability- and entropy-based methods in identifying neurons associated with cultural selectivity. Finally, our layer-wise analyses reveal that such neurons are not uniformly distributed: they cluster in specific decoder layers in a model-dependent way.
Authors: Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, Sangeetha Abdu Jyothi
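The stable/unstable split of KV heads described above can be sketched as follows (a hypothetical classifier under assumed inputs; FlexiCache's actual criterion and bookkeeping live inside the serving engine): a head whose top-K token set barely changes across steps is "stable" and safe to offload aggressively.

```python
def classify_heads(attn_topk_history, stability_threshold=0.8):
    """Label each KV head 'stable' or 'unstable' from the Jaccard overlap
    of its top-K critical-token sets across consecutive decode steps."""
    labels = []
    for history in attn_topk_history:  # per-head list of top-K token-id sets
        overlaps = [len(a & b) / len(a | b)
                    for a, b in zip(history, history[1:])]
        mean_overlap = sum(overlaps) / len(overlaps)
        labels.append("stable" if mean_overlap >= stability_threshold
                      else "unstable")
    return labels
```

Stable heads would then keep only their top-K pages on the GPU (with periodic reranking to promote newly critical pages), while unstable heads keep everything resident.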
Abstract: Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.
Authors: Cooper Simpson, Stephen Becker, Alireza Doostan
Abstract: Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
Authors: Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil
Abstract: Zeroth-order or derivative-free optimization, exemplified by MeZO, is an attractive strategy for finetuning large language models (LLMs) because it eliminates the memory overhead of backpropagation. However, it converges slowly due to the inherent curse of dimensionality when searching for descent directions in the high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a novel zeroth-order optimizer that accelerates convergence by adaptive directional sampling. Instead of drawing the direction uniformly at random, ConMeZO restricts the sampling to a cone centered around a momentum estimate. This concentrates the search in directions where the true gradient is more likely to lie and thus reduces the effect of high dimensions. We prove that ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically, when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than MeZO while retaining the low-memory footprint of zeroth-order methods.
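The cone-restricted sampling idea admits a short sketch. The code below is a minimal illustration under our own naming (the fixed-angle construction and function signatures are assumptions; ConMeZO's actual sampler may differ): it draws a unit direction at a given angle from the normalized momentum estimate, which a zeroth-order optimizer would then use for a two-point finite-difference probe.

```python
import numpy as np

def cone_sample(momentum, angle, rng):
    """Draw a unit direction at `angle` radians from the momentum estimate.

    Mix the normalized momentum with a random orthogonal unit vector, so the
    result lies on a cone of half-angle `angle` around the momentum.
    """
    m = momentum / np.linalg.norm(momentum)
    u = rng.standard_normal(momentum.shape)
    u -= (u @ m) * m                  # project out the momentum component
    u /= np.linalg.norm(u)
    return np.cos(angle) * m + np.sin(angle) * u

def zo_gradient_estimate(f, x, direction, eps=1e-3):
    """Two-point zeroth-order probe along the sampled direction."""
    return (f(x + eps * direction) - f(x - eps * direction)) / (2 * eps) * direction
```

By construction the sampled direction has unit norm and a cosine of exactly cos(angle) with the momentum, so shrinking the cone angle concentrates the search ever more tightly around the momentum estimate.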
Authors: Yash Mittal, Dmitry Ignatov, Radu Timofte
Abstract: This paper introduces FractalNet, a fractal-inspired computational architecture for advanced large language model analysis that explores model diversity at scale in an efficient manner. The framework comprises a template-driven generator, runner, and evaluation pipeline that, through systematic permutations of convolutional, normalization, activation, and dropout layers, can create more than 1,200 neural-network variants. Fractal templates allow for structural recursion and multi-column pathways, so models become deeper and wider in a balanced way. Training uses PyTorch, Automatic Mixed Precision (AMP), and gradient checkpointing, and is carried out on the CIFAR-10 dataset for five epochs. The outcomes show that fractal-based architectures achieve strong performance while remaining computationally efficient. The paper positions fractal design as a feasible and resource-efficient method of automated architecture exploration.
Authors: Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan
Abstract: Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a previously established satellite simulator show that EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.
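Dynamic filter ordering follows a classic early-reject pattern. The toy sketch below is our illustration only (EarthSight's real scheduler also folds in model accuracy and adapts online): it ranks filters by cost per rejected image, so cheap, highly selective models run first and expensive ones are reached only by images that survive.

```python
def order_filters(filters):
    """Rank filters by cost per rejected image (cost / reject_rate), an
    illustrative proxy for value-ordered filtering: cheap, selective
    filters should run first."""
    return sorted(filters, key=lambda f: f["cost"] / max(f["reject_rate"], 1e-9))

def run_pipeline(image, filters):
    """Apply filters in ranked order, rejecting low-value images early."""
    for f in order_filters(filters):
        if not f["passes"](image):
            return False, f["name"]      # rejected early by this filter
    return True, None                    # image survives all filters
```

With a cheap and an expensive filter of equal selectivity, the cheap one is scheduled first, so most rejected images never incur the expensive filter's cost.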
Authors: Yeyue Cai, Jianhua Mo, Meixia Tao
Abstract: Phase-time arrays, which integrate phase shifters (PSs) and true-time delays (TTDs), have emerged as a cost-effective architecture for generating frequency-dependent rainbow beams in wideband sensing and localization. This paper proposes an end-to-end deep learning-based scheme that simultaneously designs the rainbow beams and estimates user positions. Treating the PS and TTD coefficients as trainable variables allows the network to synthesize task-oriented beams that maximize localization accuracy. A lightweight fully connected module then recovers the user's angle-range coordinates from its feedback of the maximum quantized received power and its corresponding subcarrier index after a single downlink transmission. Compared with existing analytical and learning-based schemes, the proposed method reduces overhead by an order of magnitude and delivers consistently lower two-dimensional positioning error.
Authors: Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras
Abstract: Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% across diverse math reasoning benchmarks, establishing its effectiveness. GTPO also improves GRPO by 3.9% on commonsense reasoning and program synthesis tasks, demonstrating its generalizability to non-math domains. Importantly, GTPO incurs negligible overhead, ensuring its practicality for real-world scenarios.
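The return-based advantage estimation in (2) can be sketched as follows. This is an illustrative reading of the idea, not GTPO's exact formulation: per-turn rewards are rolled up into discounted returns, which are then normalized across the sampled group to form advantages.

```python
import statistics

def turn_advantages(turn_rewards, gamma=0.95):
    """Turn-level, return-based advantages: each trajectory is a list of
    per-turn scalar rewards; returns are discounted backward, then
    normalized (zero mean, unit variance) across all turns in the group."""
    returns = []
    for traj in turn_rewards:
        G, out = 0.0, []
        for r in reversed(traj):          # accumulate discounted return
            G = r + gamma * G
            out.append(G)
        returns.append(out[::-1])
    flat = [g for traj in returns for g in traj]
    mu, sd = statistics.mean(flat), statistics.pstdev(flat) or 1.0
    return [[(g - mu) / sd for g in traj] for traj in returns]
```

Because returns are discounted backward, the turn that actually produced a reward receives the largest advantage, giving each turn a distinct learning signal instead of one trajectory-level scalar.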
Authors: Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin
Abstract: Does Chain-of-Thought (CoT) reasoning genuinely improve Vision-Language-Action (VLA) models, or does it merely add overhead? Existing CoT-VLA systems report limited and inconsistent gains, yet no prior work has rigorously diagnosed when and why CoT helps robots act. Through systematic experiments, we identify two necessary conditions that must be jointly satisfied for CoT to be effective in VLA: (1) Decoding Alignment -- CoT and actions must be generated with modality-appropriate mechanisms; forcing both through a single autoregressive decoder is not merely suboptimal but actively harmful, degrading performance by 4.2 percentage points; (2) Causal Alignment -- CoT must be causally linked to task success via outcome-based optimization; without it, supervised CoT is indistinguishable from no reasoning at all under distribution shift, exhibiting a 32.0 pp performance drop nearly identical to the 31.6 pp drop of a reasoning-free baseline. Guided by these findings, we build DeepThinkVLA: a hybrid-attention decoder satisfies Condition 1 by pairing causal attention for language with bidirectional attention for parallel action decoding, while a two-stage SFT-then-RL pipeline satisfies Condition 2 by aligning the full reasoning-action chain with sparse task-success rewards. DeepThinkVLA achieves 97.0% success on LIBERO, 79.0% robustness on LIBERO-Plus (vs. 61.6% for π0-FAST), and 59.3% success on RoboTwin 2.0, exceeding the strongest baseline by 21.7 points. Furthermore, we validate the practical effectiveness of our approach through real-world robot experiments. Code available at https://github.com/OpenBMB/DeepThinkVLA
Authors: Zifeng Huang, Konstantin M. Zuev, Yong Xia, Michael Beer
Abstract: Neural oscillators, originating from second-order ordinary differential equations (ODEs), have demonstrated strong performance in stably learning causal mappings between long-term sequences or continuous temporal functions, as well as in accurately approximating physical systems. However, theoretically quantifying the capacities of their neural network architectures remains a significant challenge. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper approximation bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating uniformly asymptotically incrementally stable second-order dynamical systems are derived. The established proof method of the approximation bound for approximating the causal continuous operators can also be directly applied to state-space models consisting of a linear time-continuous complex recurrent neural network followed by an MLP. Theoretical results reveal that the approximation error of the neural oscillator for approximating the second-order dynamical systems scales polynomially with the reciprocals of the widths of two utilized MLPs, thus overcoming the curse of parametric complexity. The convergence rates of two established approximation error bounds are validated through four numerical cases. These results provide a robust theoretical foundation for the effective application of the neural oscillator in science and engineering.
Authors: Vishnu Sarukkai, Asanshay Gupta, James Hong, Micha\"el Gharbi, Kayvon Fatahalian
Abstract: Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques--dynamic in-context learning and self-consistency cascades--can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed; when they diverge, we fall back to the teacher. This requires no prompt engineering or training. On ALFWorld, we match teacher accuracy at 2.5x lower cost (0.059 to 0.024 per episode). On AppWorld, we achieve 3.5x cost reduction while recovering 79% of teacher accuracy. Our empirical analyses provide guidance on key design choices: teacher database size, demonstration set size, retrieval strategy, and cascade thresholds. These analyses highlight inference-time levers for navigating cost-performance tradeoffs without sacrificing human development speed.
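The cascade logic described above is easy to sketch. In the toy version below (function names are ours; the real system also retrieves teacher demonstrations as in-context examples for the student), several cheap student samples are drawn, and the expensive teacher is consulted only when they fail to agree.

```python
from collections import Counter

def cascade(task, student, teacher, n_samples=3, min_agree=2):
    """Self-consistency cascade: sample the cheap student several times;
    if enough samples agree, accept the consensus answer, otherwise fall
    back to the expensive teacher."""
    answers = [student(task) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    if count >= min_agree:
        return best, "student"
    return teacher(task), "teacher"
```

The agreement threshold is the knob that trades cost for accuracy: raising `min_agree` routes more uncertain tasks to the teacher, lowering it keeps more traffic on the cheap student.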
Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz
Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single flow map. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow to outperform a 1024-step flow matching model with only a single additional backward NFE.
Authors: Kibidi Neocosmos, Diego Baptista, Nicole Ludwig
Abstract: In community detection, many methods require the user to specify the number of clusters in advance since an exhaustive search over all possible values is computationally infeasible. While some classical algorithms can infer this number directly from the data, this is typically not the case for graph neural networks (GNNs): even when a desired number of clusters is specified, standard GNN-based methods often fail to return the exact number due to the way they are designed. In this work, we address this limitation by introducing a flexible and principled way to control the number of communities discovered by GNNs. Rather than assuming the true number of clusters is known, we propose a framework that allows the user to specify a plausible range and enforce these bounds during training. However, if the user wants an exact number of clusters, it may also be specified and reliably returned.
Authors: Emily Jin, Andrei Cristian Nica, Mikhail Galkin, Jarrid Rector-Brooks, Kin Long Kelvin Lee, Santiago Miret, Frances H. Arnold, Michael Bronstein, Avishek Joey Bose, Alexander Tong, Cheng-Hao Liu
Abstract: Accurately predicting experimentally realizable 3D molecular crystal structures from their 2D chemical graphs is a long-standing open challenge in computational chemistry called crystal structure prediction (CSP). Efficiently solving this problem has implications ranging from pharmaceuticals to organic semiconductors, as crystal packing directly governs the physical and chemical properties of organic solids. In this paper, we introduce OXtal, a large-scale 100M parameter all-atom diffusion model that directly learns the conditional joint distribution over intramolecular conformations and periodic packing. To efficiently scale OXtal, we abandon explicit equivariant architectures imposing inductive bias arising from crystal symmetries in favor of data augmentation strategies. We further propose a novel crystallization-inspired lattice-free training scheme, Stoichiometric Stochastic Shell Sampling (S⁴), that efficiently captures long-range interactions while sidestepping explicit lattice parametrization, thus enabling more scalable architectural choices at all-atom resolution. By leveraging a large dataset of 600K experimentally validated crystal structures (including rigid and flexible molecules, co-crystals, and solvates), OXtal achieves orders-of-magnitude improvements over prior ab initio machine learning CSP methods, while remaining orders of magnitude cheaper than traditional quantum-chemical approaches. Specifically, OXtal recovers experimental structures with conformer RMSD₁ < 0.5 Å and attains over 80% packing similarity rate, demonstrating its ability to model both thermodynamic and kinetic regularities of molecular crystallization.
Authors: Kaichuang Zhang, Wei Yin, Jinghao Yang, Ping Xu
Abstract: Decentralized federated learning (DFL) has emerged as a promising paradigm that enables multiple clients to collaboratively train machine learning models through iterative rounds of local training, communication, and aggregation, without relying on a central server. Nevertheless, DFL systems continue to face a range of challenges, including fairness and Byzantine robustness. To address these challenges, we propose DFedReweighting, a unified aggregation framework that achieves diverse learning objectives in DFL via objective-oriented reweighting at the final step of each learning round. Specifically, for each client, the framework first evaluates a target performance metric (TPM) on a compact auxiliary dataset constructed from local data, yielding preliminary aggregation weights, which are subsequently refined by a customized reweighting strategy (CRS) to produce the final aggregation weights. Theoretically, we prove that an appropriate TPM-CRS combination guarantees linear convergence for general L-smooth and strongly convex functions. Empirical results consistently demonstrate that DFedReweighting significantly improves fairness and robustness against Byzantine attacks across diverse settings. Two multi-objective examples, spanning tasks across and within clients, further establish that a broad range of desired learning objectives can be accommodated by appropriately designing the TPM and CRS. Our code is available at https://github.com/KaichuangZhang/DFedReweighting.
Authors: Carlo Abate, Ivan Marisca, Filippo Maria Bianchi
Abstract: Torch Geometric Pool (tgp) is a pooling library built on top of PyTorch Geometric. Graph pooling methods differ in how they assign nodes to supernodes, how they handle batches, what they return after pooling, and whether they expose auxiliary losses. These differences make it hard to compare methods or reuse the same model code across them. tgp addresses this problem with a common software interface based on the Select-Reduce-Connect-Lift (SRCL) decomposition. The library provides 20 hierarchical poolers, standardized output objects, standalone readout modules, support for dense poolers in batched and unbatched mode, and workflows for caching and pre-coarsening. It is released under the MIT license on GitHub and PyPI, with comprehensive documentation, tutorials, and examples.
Authors: Nuria Alina Chandra, Yucen Lily Li, Alan N. Amin, Alex Ali, Joshua Rollins, Sebastian W. Ober, Aniruddh Raghu, Andrew Gordon Wilson
Abstract: To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from numerically unstable stochastic processes. Ideally we could see each of these models as instances of the same underlying framework, and enable practitioners to switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.
Authors: Xuanyu Hu
Abstract: Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose BrainROI, a model that achieves leading results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.
Authors: Jie Xu, Wenyuan Yang, Yazhou Ren, Lifang He, Philip S. Yu, Xiaofeng Zhu
Abstract: Real-world multi-view data often exhibit highly inconsistent missing patterns, posing significant challenges for incomplete multi-view clustering (IMVC). Although existing IMVC methods have made progress from both imputation-based and imputation-free routes, they largely overlook the issue of pair underutilization. Specifically, inconsistent missing patterns prevent incomplete but available multi-view pairs from being fully exploited, thereby limiting the model performance. To address this limitation, we propose a novel missing-pattern tree based IMVC framework. Specifically, to fully leverage available multi-view pairs, we first introduce a missing-pattern tree model to group data into multiple decision sets according to their missing patterns, and then perform multi-view clustering within each set. Furthermore, a multi-view decision ensemble module is proposed to aggregate clustering results across all decision sets. This module infers uncertainty-based weights to suppress unreliable clustering decisions and produce robust outputs. Finally, we develop an ensemble-to-individual knowledge distillation module, which transfers ensemble knowledge to view-specific clustering models. This design enables mutual enhancement between ensemble and individual modules by optimizing cross-view consistency and inter-cluster discrimination losses. Extensive theoretical analysis supports our key designs, and empirical experiments on multiple benchmark datasets demonstrate that our method effectively mitigates the pair underutilization issue and achieves superior IMVC performance.
Authors: Nathan Buskulic, Luca Calatroni, Lorenzo Rosasco, Silvia Villa
Abstract: Blind inverse problems arise in many experimental settings where both the signal of interest and the forward operator are (partially) unknown. In this context, methods developed for the non-blind case cannot be adapted in a straightforward manner due to identifiability issues and symmetric solutions inherent to the blind setting. Recently, data-driven approaches have been proposed to address such problems, demonstrating strong empirical performance and adaptability. However, these methods often lack interpretability and are not supported by theoretical guarantees, limiting their reliability in domains such as applied imaging where a blind approach often relates to a calibration of the acquisition device. In this work, we shed light on learning in blind inverse problems within the insightful framework of Linear Minimum Mean Square Estimators (LMMSEs). We provide a theoretical analysis, deriving closed-form expressions for optimal estimators and extending classical recovery results to the blind setting. In particular, we establish equivalences with tailored Tikhonov-regularized formulations, where the regularization structure depends explicitly on the distributions of the unknown signal, of the noise, and of the random forward operator. We also show how the reconstruction error converges as the noise and the randomness of the operator diminish when we use a source condition assumption. Furthermore, we derive finite-sample error bounds that characterize the performance of the learned estimators as a function of the noise level, problem conditioning, and number of available samples. These bounds explicitly quantify the impact of operator randomness and make explicit the dependence of the associated convergence rates on these randomness factors. Finally, we validate our theoretical findings through illustrative numerical experiments that confirm the predicted convergence behavior.
Authors: Raul D. Steleac, Mohan Sridharan, David Abel
Abstract: Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the Fermat state, and use it to define a measure of spreadness, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two simulated multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
Authors: Azadeh Alavi, Hamidreza Khalili, Stanley H. Chan, Fatemeh Kouchmeshki, Muhammad Usman, Ross Vlahos
Abstract: Quantum methods are increasingly proposed for healthcare, but translational biomarker studies demand transparent benchmarking and robust small-dataset evaluation. We analysed a preclinical COPD cohort of 213 animals with blood and bronchoalveolar-lavage biomarkers to predict tibialis anterior muscle weight, specific force, and muscle quality. We benchmarked tuned classical models against two structured nonlinear low-data strategies: geometry-aware symmetric positive definite (SPD) descriptors, in which training-only clustering maps each subject to Stein-divergence distances from representative prototypes and optional unlabeled synthetic SPD interpolation stabilises prototype discovery; and quantum-kernel regression, including a clustered Nystrom-style feature map that compresses each subject into similarities to a small set of training-derived centres. By replacing full pairwise structure with compact prototype- and centre-based summaries, these steps regularise learning and preserve interpretability in a small-sample setting. Across five outer folds, quantum-kernel ridge regression using four interpretable inputs achieved the best muscle-weight performance (RMSE 4.41 mg; R2 0.62), outperforming a matched compact classical baseline (4.68 mg; R2 0.56). Biomarker-only SPD features also improved over ridge regression (4.55 versus 4.79 mg), and screening evaluation reached ROC-AUC 0.91 for low muscle weight.
Authors: Adrien Aumon, Guy Wolf, Kevin R. Moon, Jake S. Rhodes
Abstract: Decision forests induce supervised similarities through the partition structure of their trees. Yet forest proximity computation is still often treated as a quadratic operation in the number of samples, which limits scalability and restricts broader use in kernel and representation-learning pipelines. We introduce a unified view of leaf-collision forest proximities through a class of Separable Weighted Leaf-Collision (SWLC) kernels, showing that most existing proximities differ only in their weighting scheme while sharing a common sparse leaf-incidence structure. This yields an explicit leaf-space representation that clarifies their kernel interpretation and leads to an exact finite-sample sparse factorization of the proximity matrix, avoiding an explicit all-pairs comparison and reducing computation to sparse linear algebra over leaf collisions. We implement this framework in a memory-efficient Python library and show, both theoretically and empirically, that exact kernel computation scales near-linearly in time and memory under standard forest regimes. Benchmarks verify the predicted scaling behavior in practice across datasets, proximity definitions, and forest settings, and show that the resulting sparse leaf-space representation can also be used directly for fast task-aware embedding.
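The sparse leaf-incidence factorization admits a compact sketch. The toy version below is our illustration with uniform weights (where the kernel reduces to the classical forest proximity) and a dense matrix for clarity; the library itself works with sparse linear algebra. Each sample is encoded by the leaves it lands in, and the proximity matrix is K = Z W Zᵀ.

```python
import numpy as np

def leaf_incidence(leaf_ids):
    """Binary n x L matrix: row i marks the leaves sample i lands in, one
    per tree. Leaf ids are assumed globally unique across trees (our
    convention). Dense here for clarity; in practice stored sparse."""
    n, T = leaf_ids.shape
    Z = np.zeros((n, leaf_ids.max() + 1))
    Z[np.repeat(np.arange(n), T), leaf_ids.ravel()] = 1.0
    return Z

def swlc_kernel(leaf_ids, weights=None):
    """K = Z W Z^T. With uniform weights 1/T this is the classical forest
    proximity: the fraction of trees in which two samples share a leaf."""
    Z = leaf_incidence(leaf_ids)
    if weights is None:
        weights = np.full(Z.shape[1], 1.0 / leaf_ids.shape[1])
    return (Z * weights) @ Z.T
```

Because Z has exactly one nonzero per (sample, tree), swapping in a different per-leaf weight vector recovers different proximity definitions without touching the shared incidence structure, which is the unification the abstract describes.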
Authors: Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S. Yu, Xiao-Ming Wu
Abstract: Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model's actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model's internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
Authors: Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim
Abstract: Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
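One natural reading of the logit-space bridge (our own sketch, not the authors' code) has a clean closed form: the softmax of a convex combination of teacher and student logits equals the normalized geometric interpolation $p_T^\beta \, p_S^{1-\beta}$ of the two distributions, with $\beta=1$ recovering the teacher target and $\beta=0$ the student.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bridge_target(student_logits, teacher_logits, beta):
    """Intermediate target built as a convex combination in logit space.
    Its softmax equals the normalized geometric mean p_T^beta * p_S^(1-beta),
    so beta interpolates between the student and teacher distributions."""
    return softmax(beta * teacher_logits + (1.0 - beta) * student_logits)
```

In this reading, low-confidence tokens (where teacher and student disagree sharply) receive a softened target rather than the raw teacher distribution, which is consistent with the gradient-suppression role the abstract attributes to $\beta$.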
Authors: Michael J. Clark
Abstract: As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO outperforms prompting baselines by 6.9x in Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.

Authors: Prashant C. Raju
Abstract: Representational similarity analysis and related methods have become standard tools for comparing the internal geometries of neural networks and biological systems. These methods measure what is represented, the alignment between two representational spaces, but not whether that structure is robust. We introduce geometric stability, a distinct dimension of representational quality that quantifies how reliably a representation's pairwise distance structure holds under perturbation. Our metric, Shesha, measures self-consistency through split-half correlation of representational dissimilarity matrices constructed from complementary feature subsets. A key formal property distinguishes stability from similarity: Shesha is not invariant to orthogonal transformations of the feature space, unlike CKA and Procrustes, enabling it to detect compression-induced damage to manifold structure that similarity metrics cannot see. Spectral analysis reveals the mechanism: similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum. Across 2463 encoder configurations in seven domains -- language, vision, audio, video, protein sequences, molecular profiles, and neural population recordings -- stability and similarity are empirically uncorrelated ($\rho=-0.01$). A regime analysis shows this independence arises from opposing effects: geometry-preserving transformations make the metrics redundant, while compression makes them anti-correlated, canceling in aggregate. Applied to 94 pretrained models across 6 datasets, stability exposes a "geometric tax": DINOv2, the top-performing model for transfer learning, ranks last in geometric stability on 5/6 datasets. Contrastive alignment and hierarchical architecture predict stability, providing actionable guidance for model selection in deployment contexts where representational reliability matters.
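A minimal sketch of split-half geometric stability (our own construction; the Euclidean distance, Pearson correlation, and single random split are illustrative simplifications of the Shesha protocol):

```python
import numpy as np

def split_half_stability(X, rng):
    """Split the feature axis into two complementary random halves, build a
    pairwise-distance RDM from each half, and correlate their upper
    triangles: high values mean the distance geometry is self-consistent."""
    n, d = X.shape
    perm = rng.permutation(d)
    a, b = X[:, perm[: d // 2]], X[:, perm[d // 2 :]]
    iu = np.triu_indices(n, k=1)
    rdm = lambda Z: np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)[iu]
    return np.corrcoef(rdm(a), rdm(b))[0, 1]

rng = np.random.default_rng(0)
stable = rng.standard_normal((50, 6)) @ rng.standard_normal((6, 200))  # low-rank: redundant geometry
noisy = rng.standard_normal((50, 200))                                  # no shared structure across features
s_stable = split_half_stability(stable, rng)
s_noisy = split_half_stability(noisy, rng)
```

The low-rank representation encodes the same geometry in every feature subset and scores high, while the unstructured one does not; note the metric is deliberately not invariant to orthogonal feature mixing, unlike CKA or Procrustes.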
Authors: Alireza Bakhtiari, Alex Ayoub, Samuel Robertson, David Janz, Csaba Szepesv\'ari
Abstract: We establish a lower bound on the eluder dimension of generalised linear model classes, showing that standard eluder dimension-based analysis cannot lead to first-order regret bounds. To address this, we introduce a localisation method for the eluder dimension; our analysis immediately recovers and improves on classic results for Bernoulli bandits, and allows for the first genuine first-order bounds for finite-horizon reinforcement learning tasks with bounded cumulative returns.
Authors: Shovito Barua Soumma, Asiful Arefeen, Stephanie M. Carpenter, Melanie Hingle, Hassan Ghasemzadeh
Abstract: Counterfactual explanations (CFEs) provide human-centric interpretability by identifying the minimal, actionable changes required to alter a machine learning model's prediction. CFEs can therefore be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. We conduct a comprehensive evaluation of CFE generation using large language models (LLMs), including GPT-4 (zero-shot and few-shot) and two open-source models, BioMistral-7B and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations. Using the multimodal AI-READI clinical dataset, we assess CFEs across three dimensions: intervention quality, feature diversity, and augmentation effectiveness. Fine-tuned LLMs, particularly LLaMA-3.1-8B, produce CFEs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic, behaviorally modifiable feature adjustments. When used for data augmentation under controlled label-scarcity settings, LLM-generated CFEs substantially restore classifier performance, yielding an average 20% F1 recovery across three scarcity scenarios. Compared with optimization-based baselines such as DiCE, CFNOW, and NICE, LLMs offer a flexible, model-agnostic approach that generates more clinically actionable and semantically coherent counterfactuals. Overall, this work demonstrates the promise of LLM-driven counterfactuals for both interpretable intervention design and data-efficient model training in sensor-based digital health. Impact: SenseCF fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement the minority class of an imbalanced dataset, improving model training and boosting robustness and predictive performance.
Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong
Abstract: Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: after a tool-call error, smaller models often fall into repetitive invalid re-invocations instead of interpreting the feedback and recovering. This failure mode persists because current training paradigms do not explicitly teach models how to recover from execution errors. In particular, standard reinforcement learning (RL) collapses rich failure experience into sparse negative rewards, while pre-collected error-correction datasets become mismatched to the policy's evolving failure modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into on-policy corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a fine-tuned Error Simulator, then resampling multiple recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and overall accuracy by 4.0% (from 42.75% to 46.75%), outperforming both RL baselines and specialized tool-use agents. The method further generalizes to TAU-Bench and TAU2-Bench, achieving leading results across most settings with gains up to +17.4%.
Authors: Trong Khiem Tran, Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Abstract: Adapting pre-trained models to unseen feature modalities has become increasingly important due to the growing need for cross-disciplinary knowledge integration. A key challenge here is how to align the representation of new modalities with the most relevant parts of the pre-trained model's representation space to enable accurate knowledge transfer. This requires combining feature alignment with target fine-tuning, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures and reduce target generalization. Existing work, however, lacks a theoretical understanding of this critical interaction between feature alignment and target fitting. To bridge this gap, we develop a principled framework that establishes a provable generalization bound on the target error, which explains the interaction between feature alignment and target fitting through a novel concept of feature-label distortion. This bound offers actionable insights into how this interaction should be optimized for practical algorithm design. The resulting approach achieves significantly improved performance over state-of-the-art methods across a wide range of benchmark datasets.
Authors: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup on math reasoning, and a 1.83% gain on scientific and general reasoning. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
Authors: Anderson de Andrade, Alon Harell, Ivan V. Baji\'c
Abstract: Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. We introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade bitrate for accuracy. Experiments on language benchmarks show that the simplest of the proposed codecs achieves substantial rate savings, outperforming more complex methods. We characterize and analyze the rate-distortion behaviour of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to derive bounds on the achievable rate of learnable codecs. For different architectures and tasks, we empirically demonstrate that their rates are driven by these bounds, adding to the explainability of the formulations.
Authors: Sergi Masip, Gido M. van de Ven, Javier Ferrando, Tinne Tuytelaars
Abstract: Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features or by disrupting their readout by downstream computations. Analysis of a tractable toy model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We do so through a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.
Authors: Pi-Ju Tsai, Charkkri Limbud, Kuan-Fu Chen, Yi-Ju Tseng
Abstract: Electronic Health Records (EHRs) provide high-dimensional temporal data essential for patient modeling; however, conventional algorithmic approaches often rely on data aggregation or imputation, which distorts temporal disease trajectories. Such computational limitations are particularly critical in sepsis, a heterogeneous syndrome where clustering-based stratification plays a key role in identifying clinically distinct phenotypes for precise treatment strategies. Furthermore, existing clustering processes seldom incorporate domain-driven constraints, often resulting in phenotypes that lack clear clinical distinction. We propose a novel clustering network, NPCNet, that comprises a text embedding generator, a clustering operator, and a target navigator. We first transform EHRs into pseudo texts by discretizing continuous clinical measurements, then integrate them with static variables to construct the embeddings. The target navigator then infuses clinical knowledge into the embeddings through auxiliary tasks, constraining clustering results to better align sepsis phenotypes with clinical significance. Finally, the clustering operator employs an iterative refinement mechanism to jointly optimize phenotype centroids and patient representations under domain-driven constraints. Extensive experiments on public datasets validate that NPCNet achieves superior performance on both internal clustering benchmarks and clinical validity metrics, offering a viable pathway for precision treatment strategies in the management of sepsis.
Authors: Nikolaos Bousias, Lars Lindemann, George Pappas
Abstract: Conformal prediction (CP) is a post-hoc, distribution-free, finite-sample method of uncertainty quantification that offers formal coverage guarantees under the assumption of data exchangeability. Unfortunately, the resulting uncertainty regions can grow significantly in long-horizon missions, rendering the statistical guarantees uninformative. To that end, we propose infusing CP with geometric information via group-averaging of the pretrained predictor to distribute the non-conformity mass across the orbits. Each sample is now treated as a representative of an orbit, so its uncertainty can be mitigated by the other samples tied to it via the orbit-inducing elements of the symmetry group. Our approach provably yields contracted non-conformity scores in increasing convex order, implying improved exponential-tail bounds and sharper conformal prediction sets in expectation, especially at high confidence levels. We then propose an experimental design to test these theoretical claims in pedestrian trajectory prediction.
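A one-dimensional toy (our own construction, not the paper's experimental design) illustrates why orbit-averaging can tighten split-conformal sets: for an even target, averaging the predictor over the reflection group removes the odd part of its error, shrinking the calibrated nonconformity quantile.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                              # "pretrained" predictor with an odd error term
    return x**2 + 0.5 * x

f_G = lambda x: 0.5 * (f(x) + f(-x))   # orbit average over G = {x -> x, x -> -x}

# Split-conformal calibration for the even target y = x^2 + noise.
x_cal = rng.uniform(-2, 2, 500)
y_cal = x_cal**2 + 0.1 * rng.standard_normal(500)
alpha = 0.1

def interval_width(pred):
    scores = np.abs(y_cal - pred(x_cal))                    # nonconformity scores
    q = np.quantile(scores, np.ceil((1 - alpha) * 501) / 500)
    return 2 * q                                            # width of pred(x) +/- q

w_plain, w_avg = interval_width(f), interval_width(f_G)
```

Here `f_G` coincides with the true even function, so its residuals reduce to the observation noise, while the raw predictor pays for its odd error component with much wider intervals at the same coverage level.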
Authors: Jonathan von Rad, Yong Cao, Andreas Geiger
Abstract: Model compression is increasingly essential for deploying large language models (LLMs), yet existing comparative studies largely focus on pruning and quantization evaluated primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through evaluation of six compression techniques across 40 datasets, we observe (i) a consistent knowledge bias, where factual recall is largely preserved while multi-step reasoning, multilingual, and instruction-following capabilities degrade; (ii) a decoupling between performance and reliability, indicating that retained performance does not consistently imply preserved reliability; and (iii) that task-specific calibration can yield up to 50% relative improvement of reasoning performance in pruned models.
Authors: Jia Li, Yu Hou, Rui Zhang
Abstract: Automatically discovering personalized sequential events from large-scale time-series data is crucial for enabling precision medicine in clinical research, yet it remains a formidable challenge even for contemporary AI models. For example, while transformers capture rich associations, they are mostly agnostic to event timing and ordering, thereby bypassing potential causal reasoning. Intuitively, we need a method capable of evaluating the "degree of alignment" among patient-specific trajectories and identifying their shared patterns, i.e., the significant events in a consistent sequence. This necessitates treating timing as a true \emph{computable} dimension, allowing models to assign ``relative timestamps'' to candidate events beyond their observed physical times. In this work, we introduce LITT (Individual-Level Time Transformation), a novel architecture that enables temporal alignment of sequential events on a virtual ``relative timeline'', thereby enabling \emph{event-timing-focused attention} and personalized interpretations of clinical trajectories. Its interpretability and effectiveness are validated on real-world longitudinal EHR data from 3,276 breast cancer patients to predict the onset timing of cardiotoxicity-induced heart disease. Furthermore, LITT outperforms both the benchmark and state-of-the-art survival analysis methods on public datasets, positioning it as a significant step forward for precision medicine in clinical AI.
Authors: Cassandre Notton, Benjamin Stott, Philippe Schoeb, Anthony Walsh, Gr\'egoire Leboucher, Vincent Espitalier, Vassilis Apostolou, Louis-F\'elix Vigneux, Alexia Salavrakos, Jean Senellart
Abstract: Identifying where quantum models may offer practical benefits in near-term quantum machine learning (QML) requires moving beyond isolated algorithmic proposals toward systematic and empirical exploration across models, datasets, and hardware constraints. We introduce MerLin, an open-source framework designed as a discovery engine for photonic and hybrid quantum machine learning. MerLin integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit-learn workflows, enabling end-to-end differentiable training of quantum layers. MerLin is designed around systematic benchmarking and reproducibility. As an initial contribution, we reproduce eighteen state-of-the-art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional and recurrent architectures, generative models, and modern training paradigms. These reproductions are released as reusable, modular experiments that can be directly extended and adapted, establishing a shared experimental baseline consistent with empirical benchmarking methodologies widely adopted in modern artificial intelligence. By embedding photonic quantum models within established machine learning ecosystems, MerLin allows practitioners to leverage existing tooling for ablation studies, cross-modality comparisons, and hybrid classical-quantum workflows. The framework already implements hardware-aware features, allowing tests on available quantum hardware while enabling exploration beyond its current capabilities, positioning MerLin as a forward-looking co-design tool linking algorithms, benchmarks, and hardware.
Authors: Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, You Wu, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, Bo Zheng
Abstract: Recursive (looped) Transformers decouple computational depth from parameter depth by repeatedly applying shared layers, providing an explicit architectural primitive for iterative refinement and latent reasoning. However, early looped Transformers often underperform non-recursive baselines of equal compute. While recent literature has introduced more effective recursion mechanisms to mitigate this gap, existing architectures still operate at a fixed, full-token resolution, neglecting the potential efficiency of computing over compressed latent representations. In this paper, we propose SpiralFormer, a looped Transformer that executes recurrence under a multi-resolution recursion schedule. We provide probing evidence that multi-resolution recursion enables the model to learn hierarchical dependencies by inducing iteration-wise functional specialization across different scales. Empirically, SpiralFormer achieves better parameter and compute efficiency than both looped and non-looped baselines across model scales from 160M to 1.4B, establishing sequence resolution as a potential axis for scaling recursive architectures.
Authors: Roberto Molinaro, Niall Siegenheim, Henry Martin, Mark Frey, Niels Poulsen, Philipp Seitz, Marvin Vincent Gabler
Abstract: We introduce a universal diffusion-based downscaling framework that lifts deterministic low-resolution weather forecasts into probabilistic high-resolution predictions without any model-specific fine-tuning. A single conditional diffusion model is trained on paired coarse-resolution inputs (~25 km resolution) and high-resolution regional reanalysis targets (~5 km resolution), and is applied in a fully zero-shot manner to deterministic forecasts from heterogeneous upstream weather models. Focusing on near-surface variables, we evaluate probabilistic forecasts against independent in situ station observations over lead times up to 90 h. Across a diverse set of AI-based and numerical weather prediction (NWP) systems, the ensemble mean of the downscaled forecasts consistently improves upon each model's own raw deterministic forecast, and substantially larger gains are observed in probabilistic skill as measured by CRPS. These results demonstrate that diffusion-based downscaling provides a scalable, model-agnostic probabilistic interface for enhancing spatial resolution and uncertainty representation in operational weather forecasting pipelines.
Authors: Adolfo Gonz\'alez, V\'ictor Parada
Abstract: Business environments characterized by intermittent demand, high variability, and multi-step planning horizons require forecasting policies that support consistent operational decisions across heterogeneous SKU portfolios. Because no forecasting model is universally dominant, and model rankings vary across evaluation metrics, demand regimes, and forecast horizons, selecting the most appropriate forecasting model for each demand series is a nontrivial model selection problem in inventory planning, procurement, and supply management. This study addresses this problem by proposing the Adaptive Hybrid Selector for Intermittency and Variability (AHSIV), a horizon-aware, structure-adaptive model selection framework designed to improve the consistency of forecast model assignment under horizon-induced ranking instability. The framework integrates the Metric Degradation by Forecast Horizon (MDFH), structural demand classification, multi-objective Pareto dominance, and hierarchical bias refinement within a unified decision-support scheme. The empirical evaluation is conducted on the Walmart, M3, M4, and M5 datasets under multiple train-test partition schemes and twelve-step forecasting horizons. The results show that AHSIV achieves statistical equivalence with the strongest single-metric baseline in aggregated performance while improving the consistency of model assignment across forecast horizons in heterogeneous demand settings. These findings suggest that forecasting model choice in multi-SKU environments should not be treated as a static ex post ranking exercise, but rather as a horizon-aware, structurally adaptive assignment problem that can provide more robust support for inventory planning, procurement alignment, and operational decision-making in business forecasting systems.
Authors: Ruibiao Zhu
Abstract: This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), exploiting the natural graph structure of sea ice, where nodes represent individual ice pieces and edges model the physical interactions, including collisions. This concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By using GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to effectively learn and predict sea ice dynamics under various conditions. The approach was validated using synthetic data, both with and without observed data points, and it was found that the model accelerates the simulation of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.
Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming baselines. Our code is available at: https://github.com/FlyTune/MASPO-RL.
Authors: Michael Hardy, Yunsung Kim
Abstract: LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study contrasts LLM alignment on benchmarks, downstream tasks, and, importantly, the intended impact of those tasks. We evaluate the performance of leading LLMs (i.e., generative pre-trained base models) on difficult-to-verify tasks in the teaching and learning of schoolchildren. Across all LLMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact of student learning outcomes. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that selection of LLM and/or prompting strategy only reliably accounts for $15\%$ of all measured misalignment error and that variation in misalignment error is shared across LLMs, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into practical applications of LLMs in high-noise contexts.
Authors: Faisal Mohamed, Catherine Ji, Benjamin Eysenbach, Glen Berseth
Abstract: Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
Authors: Waleed Afandi, Hussein Abdallah, Ashraf Aboulnaga, Essam Mansour
Abstract: Efficient inference for graph neural networks (GNNs) on large knowledge graphs (KGs) is essential for many real-world applications. GNN inference queries are computationally expensive and vary in complexity, as each involves a different number of target nodes linked to subgraphs of diverse densities and structures. Existing acceleration methods, such as pruning, quantization, and knowledge distillation, instantiate smaller models but do not adapt them to the structure or semantics of individual queries. They also store models as monolithic files that must be fully loaded, and miss the opportunity to retrieve only the neighboring nodes and corresponding model components that are semantically relevant to the target nodes. These limitations lead to excessive data loading and redundant computation on large KGs. This paper presents KG-WISE, a task-driven inference paradigm for large KGs. KG-WISE decomposes trained GNN models into fine-grained components that can be partially loaded based on the structure of the queried subgraph. It employs large language models (LLMs) to generate reusable query templates that extract semantically relevant subgraphs for each task, enabling query-aware and compact model instantiation. We evaluate KG-WISE on six large KGs with up to 42 million nodes and 166 million edges. KG-WISE achieves up to 28x faster inference and 98% lower memory usage than state-of-the-art systems while maintaining or improving accuracy across both commercial and open-weight LLMs.
Authors: Ruichen Xu, Kexin Chen
Abstract: Differentially private learning is essential for training models on sensitive data, but empirical studies consistently show that it can degrade performance, introduce fairness issues like disparate impact, and reduce adversarial robustness. The theoretical underpinnings of these phenomena in modern, non-convex neural networks remain largely unexplored. This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private stochastic gradient descent (DP-SGD) in two-layer ReLU convolutional neural networks. Our analysis establishes test loss bounds governed by a crucial metric: the feature-to-noise ratio (FNR). We demonstrate that the noise required for privacy leads to suboptimal feature learning, and specifically show that: 1) imbalanced FNRs across classes and subpopulations cause disparate impact; 2) even in the same class, noise has a greater negative impact on semantically long-tailed data; and 3) noise injection exacerbates vulnerability to adversarial attacks. Furthermore, our analysis reveals that the popular paradigm of public pre-training and private fine-tuning does not guarantee improvement, particularly under significant feature distribution shifts between datasets. Experiments on synthetic and real-world data corroborate our theoretical findings.
Authors: Muhammad Khalifa, Zohaib Khan, Omer Tafveez, Hao Peng, Lu Wang
Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring the occurrence of reward hacking remains challenging because true task rewards are often expensive or impossible to compute. We introduce Countdown-Code, a minimal environment where models can both solve a mathematical reasoning task and manipulate the test harness. This dual-access design creates a clean separation between proxy rewards (test pass/fail) and true rewards (mathematical correctness), enabling accurate measurement of reward-hacking rates. Using this environment, we study reward hacking in open-weight LLMs and find that such behaviors can be unintentionally learned during supervised fine-tuning (SFT) when even a small fraction of reward-hacking trajectories leak into training data. As little as 1\% contamination in distillation SFT data is sufficient for models to internalize reward hacking, which resurfaces during subsequent reinforcement learning (RL). We further show that RL amplifies misalignment and drives its generalization beyond the original domain. We open-source our environment and code to facilitate future research on reward hacking in LLMs. Our results reveal a previously underexplored pathway through which reward hacking can emerge and persist in LLMs, underscoring the need for more rigorous validation of synthetic SFT data. Code is available at https://github.com/zohaib-khan5040/Countdown-Code.
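The clean proxy/true separation described above makes the hacking rate directly measurable; a minimal sketch (field names are illustrative, not from the released environment):

```python
def reward_hacking_rate(trajectories):
    """Fraction of proxy-rewarded (test-passing) trajectories that are
    not truly (mathematically) correct."""
    passed = [t for t in trajectories if t["passes_tests"]]
    if not passed:
        return 0.0
    hacked = [t for t in passed if not t["is_correct"]]
    return len(hacked) / len(passed)

rollouts = [
    {"passes_tests": True,  "is_correct": True},   # genuine solve
    {"passes_tests": True,  "is_correct": False},  # harness manipulation
    {"passes_tests": False, "is_correct": False},  # plain failure
    {"passes_tests": True,  "is_correct": True},
]
print(reward_hacking_rate(rollouts))  # 1 hacked out of 3 passing rollouts
```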
Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on log-probability gradient ($\nabla_\theta\log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/FlyTune/DGPO-RL.
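The contrast between the two gradient primitives can be illustrated with a toy scalar parameterization $\pi(\theta)=\theta$, for which $\nabla_\theta \pi = 1$ stays bounded while $\nabla_\theta \log\pi = 1/\pi$ diverges as the probability vanishes (a schematic illustration only, not DGPO itself):

```python
def grad_prob(p):
    # d(pi)/d(theta) for the toy parameterization pi(theta) = theta: constant
    return 1.0

def grad_logprob(p):
    # d(log pi)/d(theta) = grad_prob / pi: carries a 1/pi factor that
    # blows up as the token probability vanishes
    return grad_prob(p) / p

for p in (0.5, 1e-4, 1e-9):
    print(f"pi={p:g}  grad_pi={grad_prob(p):g}  grad_log_pi={grad_logprob(p):g}")
```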
Authors: Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di
Abstract: Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node's raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.
Authors: Francesco Sovrano, Lidia Losavio, Giulia Vilone, Marc Langheinrich
Abstract: Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov-Arnold Networks (KANs) are well suited to this goal because each connection between adjacent units (an "edge") is parametrised by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits operators to each learned edge function in isolation, making the discrete choice sensitive to initialisation and non-convex parameter fitting, and ignoring how local substitutions interact through the full network. We study in-context symbolic regression for operator extraction in KANs, and present two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy, in-context selection by choosing edge replacements according to end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortises this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence, gates are discretised (optionally followed by a short in-context greedy refinement pass). We quantify robustness via one-factor-at-a-time (OFAT) hyper-parameter sweeps and assess both predictive error and qualitative consistency of recovered formulas. Across several experiments, greedy in-context symbolic regression achieves up to 99.8% reduction in median OFAT test MSE.
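The greedy in-context selection loop of GSR can be sketched as follows (function names and the toy loss are ours): each candidate edge replacement is scored by the end-to-end loss it induces after substitution, rather than by an isolated per-edge fit.

```python
def greedy_symbolic_replacement(edges, library, end_to_end_loss):
    """For each edge in turn, try every operator in the library and keep
    the replacement that most improves the end-to-end loss (in the
    paper, a brief fine-tuning pass would precede each evaluation)."""
    assignment = {}
    for edge in edges:
        best_op, best_loss = None, end_to_end_loss(assignment)
        for op in library:
            trial = {**assignment, edge: op}
            loss = end_to_end_loss(trial)
            if loss < best_loss:
                best_op, best_loss = op, loss
        if best_op is not None:
            assignment[edge] = best_op
    return assignment

# Toy loss: count how many edges disagree with a hidden target formula.
target = {"e1": "sin", "e2": "square"}
loss = lambda a: sum(a.get(e) != op for e, op in target.items())
print(greedy_symbolic_replacement(["e1", "e2"], ["sin", "square", "exp"], loss))
```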
Authors: Panos Stinis
Abstract: We propose a novel approach which exploits chaos to enhance classification accuracy. Specifically, the available data that need to be classified are treated as vectors that are first lifted into a higher-dimensional space and then used as initial conditions for the evolution of a chaotic dynamical system for a prescribed temporal interval. The evolved state of the dynamical system is then fed to a trainable softmax classifier which outputs the probabilities of the various classes. As proof-of-concept, we use samples of randomly perturbed orthogonal vectors of moderate dimension (2 to 20), with a corresponding number of classes equal to the vector dimension, and show how our approach can both significantly accelerate the training process and improve the classification accuracy compared to a standard softmax classifier which operates on the original vectors, as well as a softmax classifier which only lifts the vectors to a higher-dimensional space without evolving them. We also provide an explanation for the improved performance of the chaos-enhanced classifier and a selection process for the optimal chaotic evolution interval.
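A minimal sketch of the pipeline, with a componentwise logistic map standing in for the chaotic system (the paper's actual system and lift are not specified here; the sinusoidal lift below is invented):

```python
import math

def lift(x, dim_out):
    """Invented deterministic lift into a higher-dimensional space,
    mapped into [0, 1] so it can seed the chaotic evolution."""
    return [0.5 + 0.5 * math.sin((i + 1) * sum(x) + i * x[i % len(x)])
            for i in range(dim_out)]

def evolve_logistic(z, steps, r=3.9):
    """Componentwise logistic map, a standard chaotic system on [0, 1]."""
    for _ in range(steps):
        z = [r * zi * (1.0 - zi) for zi in z]
    return z

x = [0.2, -0.4]                                  # original data vector
features = evolve_logistic(lift(x, 8), steps=5)  # evolved state
print(features)  # this vector is fed to a trainable softmax classifier
```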
Authors: Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan
Abstract: Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
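One plausible instantiation of the two SCRL mechanisms (all thresholds and names here are invented for illustration): strict consensus gates the positive pseudo-label, and negatives are emitted for minority answers only while the vote distribution's normalized entropy stays below an uncertainty gate.

```python
import math
from collections import Counter

def scrl_pseudo_labels(answers, consensus_min=0.7, entropy_gate=0.9,
                       minority_max=0.2):
    counts = Counter(answers)
    total = len(answers)
    shares = {a: c / total for a, c in counts.items()}
    # normalized Shannon entropy of the empirical answer distribution
    h = -sum(s * math.log(s) for s in shares.values())
    h_norm = h / math.log(len(shares)) if len(shares) > 1 else 0.0
    top, top_share = max(shares.items(), key=lambda kv: kv[1])
    positive = top if top_share >= consensus_min else None        # strict consensus
    negatives = ([a for a, s in shares.items() if s <= minority_max]
                 if h_norm < entropy_gate else [])                # entropy gate
    return positive, negatives

pos, negs = scrl_pseudo_labels(["42"] * 8 + ["17", "3"])
print(pos, negs)
# a highly dispersed vote yields neither positives nor negatives
print(scrl_pseudo_labels(["1", "2", "3", "4", "5"]))
```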
Authors: Zongliang Ji, Yifei Sun, Andre Amaral, Anna Goldenberg, Rahul G. Krishnan
Abstract: Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question -- can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built at one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector capable of serving as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use, and competitive in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study the variation in performance across prompt design, finding that structured prompts are crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production-grade predictive models by reducing the engineering overhead.
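A toy rendering of the first stage of this pipeline, turning an irregular time series into a structured text summary (the actual prompt/summary template is the paper's; this layout and these field names are invented):

```python
def summarize_icu_series(events):
    """events: (hour, variable_name, value, unit) tuples at irregular
    times. Returns a compact structured summary string, which would
    then be embedded by a frozen text-embedding model."""
    lines = ["Patient ICU summary:"]
    by_var = {}
    for hour, name, value, unit in sorted(events):
        by_var.setdefault(name, []).append((hour, value, unit))
    for name, obs in by_var.items():
        first, last = obs[0], obs[-1]
        lines.append(
            f"- {name}: {len(obs)} measurements; "
            f"{first[1]} {first[2]} at hour {first[0]} -> "
            f"{last[1]} {last[2]} at hour {last[0]}"
        )
    return "\n".join(lines)

events = [(0, "heart_rate", 92, "bpm"), (6, "heart_rate", 118, "bpm"),
          (2, "lactate", 1.9, "mmol/L")]
print(summarize_icu_series(events))
```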
Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian
Abstract: While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms both standard simulation-based next-token approaches and a prompted medical large language model baseline. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
Authors: Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Abstract: The autoresearch repository enables an LLM agent to optimize hyperparameters by editing training code directly. We use it as a testbed to compare classical HPO algorithms against LLM-based methods on tuning the hyperparameters of a small language model under a fixed compute budget. When defining a fixed search space over autoresearch, classical methods such as CMA-ES and TPE consistently outperform LLM-based agents, where avoiding out-of-memory failures matters more than search diversity. Allowing the LLM to directly edit source code narrows the gap to the classical methods but does not close it, even with frontier models available at the time of writing such as Claude Opus 4.6 and Gemini 3.1 Pro Preview. We observe that LLMs struggle to track optimization state across trials. In contrast, classical methods lack the domain knowledge of LLMs. To combine the strengths of both, we introduce Centaur, a hybrid that shares CMA-ES's interpretable internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, and a 0.8B LLM already suffices to outperform all classical and pure LLM methods. Unconstrained code editing requires larger models to be competitive with classical methods. We further analyze search diversity, model scaling from 0.8B to frontier models, and ablate the fraction of LLM-proposed trials in Centaur. All in all, our results suggest that LLMs are most effective as a complement to classical optimizers, not as a replacement. Code is available at https://github.com/ferreirafabio/autoresearch-automl & interactive demo at https://ferreirafabio.github.io/autoresearch-automl.
Authors: Yaqi Guo, Fabian Braun, Bastiaan Ketelaar, Stephanie Tan, Richard Norte, Siddhant Kumar
Abstract: While backpropagation and automatic differentiation have driven deep learning's success, the physical limits of chip manufacturing and rising environmental costs of deep learning motivate alternative learning paradigms such as physical neural networks. However, most existing physical neural networks still rely on digital computing for training, largely because backpropagation and automatic differentiation are difficult to realize in physical systems. We introduce FFzero, a forward-only learning framework enabling stable neural network training without backpropagation or automatic differentiation. FFzero combines layer-wise local learning, prototype-based representations, and directional-derivative-based optimization through forward evaluations only. We show that local learning is effective under forward-only optimization, where backpropagation fails. FFzero generalizes to multilayer perceptron and convolutional neural networks across classification and regression. Using a simulated photonic neural network as an example, we demonstrate that FFzero provides a viable path toward backpropagation-free in-situ physical learning.
Authors: Moritz Vandenhirtz, Samuel Ruipérez-Campillo, Simon Böhi, Sonia Laguna, Irene Cannistraci, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt
Abstract: Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM that directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the MIMIC-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.
Authors: Tushar Dhananjay Pathak
Abstract: This paper presents ARCS (Autoregressive Circuit Synthesis), a system for amortized analog circuit generation. ARCS produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combines two learned generators, a graph VAE and a flow-matching model, with SPICE-based ranking. It achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution adapts Group Relative Policy Optimization (GRPO) to multi-topology circuit reinforcement learning. GRPO resolves a critical failure mode of REINFORCE, cross-topology reward distribution mismatch, through per-topology advantage normalization. This improves simulation validity by +9.6 percentage points over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking.
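The per-topology advantage normalization described above can be sketched as follows (field names and the toy example are ours, not from the ARCS code): rewards are standardized within each topology group, so topologies with systematically different reward scales no longer dominate the policy gradient.

```python
from collections import defaultdict

def per_topology_advantages(samples, eps=1e-8):
    """GRPO-style advantages computed per topology group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    groups = defaultdict(list)
    for s in samples:
        groups[s["topology"]].append(s["reward"])

    advantages = []
    for s in samples:
        rs = groups[s["topology"]]
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        advantages.append((s["reward"] - mean) / (std + eps))
    return advantages

samples = [
    {"topology": "lowpass",  "reward": 6.0},
    {"topology": "lowpass",  "reward": 8.0},
    {"topology": "bandpass", "reward": 1.0},  # much lower raw scale...
    {"topology": "bandpass", "reward": 3.0},
]
print(per_topology_advantages(samples))  # ...yet comparable advantages
```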
Authors: Yunwen Lei, Yufeng Xie
Abstract: Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses measure the distance from initialization by the Frobenius norm, and often imply vacuous bounds in practice for overparameterized models. In this paper, we develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization, and are derived by introducing a new peeling technique to handle the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
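One natural reading of the complexity measure involved (a sketch; the paper's exact definition may differ): apply the standard two-layer path norm to the parameter displacement from initialization rather than to the parameters themselves.

```python
def path_norm(V, W):
    """Path norm of a two-layer net f(x) = sum_j V[j] * act(W[j] . x):
    the sum of |V[j]| * |W[j][k]| over all input->hidden->output paths."""
    return sum(abs(v) * abs(wjk) for v, row in zip(V, W) for wjk in row)

def path_norm_distance(V, W, V0, W0):
    """Path norm of the displacement (V - V0, W - W0) from initialization."""
    dV = [v - v0 for v, v0 in zip(V, V0)]
    dW = [[a - b for a, b in zip(row, row0)] for row, row0 in zip(W, W0)]
    return path_norm(dV, dW)

V0, W0 = [1.0, 0.0], [[1.0, 0.0], [1.0, 1.0]]   # initialization
V,  W  = [1.5, -1.0], [[2.0, 2.0], [1.0, 4.0]]  # trained parameters
print(path_norm(V, W), path_norm_distance(V, W, V0, W0))
```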
Authors: Mitchell A. Thornton
Abstract: We establish that temporal averaging over multiple observations is the degenerate case of algebraic group action with the trivial group $G=\{e\}$. A General Replacement Theorem proves that a group-averaged estimator from one snapshot achieves equivalent subspace decomposition to multi-snapshot covariance estimation. The Trivial Group Embedding Theorem proves that the sample covariance is the accumulation of trivial-group estimates, with variance governed by a $(G,L)$ continuum as $1/(|G|\cdot L)$. The processing gain $10\log_{10}(M)$ dB equals the classical beamforming gain, establishing that this gain is a property of group order, not sensor count. The DFT, DCT, and KLT are unified as group-matched special cases. We conjecture a General Algebraic Averaging Theorem extending these results to arbitrary statistics, with variance governed by the effective group order $d_{\mathrm{eff}}$. Monte Carlo experiments on the first four sample moments across five group types confirm the conjecture to four-digit precision. The framework exploits the $structure$ of information (representation-theoretic symmetry of the data object) rather than the content, complementing Shannon's theory. Five applications are demonstrated: single-snapshot MUSIC, massive MIMO with 64% throughput gain, single-pulse waveform classification at 90% accuracy, graph signal processing with non-abelian groups, and algebraic analysis of transformer LLMs.
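In the trivial-group case, the variance statement above (variance scaling as $1/(|G|\cdot L)$, with processing gain $10\log_{10}(M)$ dB) reduces to the classical fact that averaging $M$ independent unit-variance observations cuts variance by a factor of $M$. A stdlib-only numeric check of that degenerate case:

```python
import math
import random

random.seed(0)

def mean_estimator_variance(m, trials=20000):
    """Empirical variance of an average of m unit-variance Gaussian draws."""
    ests = [sum(random.gauss(0.0, 1.0) for _ in range(m)) / m
            for _ in range(trials)]
    mu = sum(ests) / trials
    return sum((e - mu) ** 2 for e in ests) / trials

results = {m: mean_estimator_variance(m) for m in (1, 4, 16)}
for m, var in results.items():
    # gain relative to the unit-variance single observation
    print(f"M={m:2d}  var~{var:.4f}  gain~{10 * math.log10(1.0 / var):.2f} dB")
```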
Authors: Mehrdad Shoeibi, Elias Hossain, Ivan Garibay, Niloofar Yousefi
Abstract: Learning from weak, proxy, or relative supervision is common when ground-truth labels are unavailable, but robustness under distribution shift remains poorly understood because the supervision mechanism itself may change across environments. We formalize this phenomenon as supervision drift, defined as changes in $P(y \mid x, c)$ across contexts, and study it in CRISPR-Cas13d transcriptomic perturbation experiments where guide efficacy is inferred indirectly from RNA-seq responses. Using publicly available data spanning two human cell lines and multiple post-induction timepoints, we construct a controlled non-IID benchmark with explicit domain (cell line) and temporal shifts, while reusing a fixed weak-label construction across all contexts to avoid changing targets. Across linear and tree-based models, weak supervision supports meaningful learning in-domain (ridge $R^2 = 0.356$, Spearman $\rho = 0.442$) and partial cross-cell-line transfer ($\rho \approx 0.40$). In contrast, temporal transfer collapses across all model classes considered, yielding negative $R^2$ and weak or near-zero $\rho$ (ridge $R^2 = -0.145$, $\rho = 0.008$; XGBoost $R^2 = -0.155$, $\rho = 0.056$; random forest $R^2 = -0.322$, $\rho = 0.139$). Additional robustness analyses using externally recomputed weak labels, shift-score quantification, and simple mitigation baselines preserve the same qualitative pattern. Feature-label association and feature-importance analyses remain relatively stable across cell lines but change sharply over time, indicating that failures arise from supervision drift rather than model capacity or simple covariate shift. These results show that strong in-domain performance under weak supervision can be misleading and motivate feature stability as a lightweight diagnostic for non-transferability before deployment.
Authors: Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao
Abstract: Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method, Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and the real-world Manhattan Taxi Ride dataset show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.
Authors: Andrea Pollastro, Andrea Apicella, Francesco Isgrò, Roberto Prevete
Abstract: Variational autoencoders (VAEs) rely on amortized variational inference to enable efficient posterior approximation, but this efficiency comes at the cost of a shared parametrization, giving rise to the amortization gap. We propose the instance-adaptive variational autoencoder (IA-VAE), an amortized inference framework in which a hypernetwork generates input-dependent modulations of a shared encoder. This enables input-specific adaptation of the inference model while preserving the efficiency of a single forward pass. From a theoretical perspective, we show that the variational family induced by IA-VAE contains that of standard amortized inference, implying that IA-VAE cannot yield a worse optimal ELBO. By leveraging instance-specific parameter modulations, the proposed approach can achieve performance comparable to standard encoders with substantially fewer parameters, indicating a more efficient use of model capacity. Experiments on synthetic data, where the true posterior is known, show that IA-VAE yields more accurate posterior approximations and reduces the amortization gap. Similarly, on standard image benchmarks, IA-VAE consistently improves held-out ELBO over baseline VAEs, with statistically significant gains across multiple runs. These results suggest that increasing the flexibility of the inference parametrization through instance-adaptive modulation is an effective strategy for mitigating amortization-induced suboptimality in deep generative models.
Authors: Vincent Abbott, Gioele Zardini
Abstract: Despite deep learning models running well-defined mathematical functions, we lack a formal mathematical framework for describing model architectures. Ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models. This paper introduces a categorical framework for deep learning models that formalizes broadcasting through the novel axis-stride and array-broadcasted categories. This allows the mathematical function underlying architectures to be precisely expressed and manipulated in a compositional manner. These mathematical definitions are translated into human manageable diagrams and machine manageable data structures. We provide a mirrored implementation in Python (pyncd) and TypeScript (tsncd) to show the universal aspect of our framework, along with features including algebraic construction, graph conversion, PyTorch compilation and diagram rendering. This lays the foundation for a systematic, formal approach to deep learning model design and analysis.
Authors: Sam Gunn
Abstract: How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are only $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.
Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
Abstract: We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
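Our reading of the intermediate decoding state (a sketch, not the released DMax code): each position is a convex interpolation between the predicted token embedding and the mask embedding, with the mixing weight tracking self-refinement progress.

```python
def soft_state(token_emb, mask_emb, alpha):
    """Intermediate decoding state as interpolation in embedding space:
    alpha = 0 means fully masked, alpha = 1 means committed to the
    predicted token; in-between states remain revisable."""
    return [alpha * t + (1.0 - alpha) * m
            for t, m in zip(token_emb, mask_emb)]

mask = [0.0, 0.0, 0.0]
tok = [1.0, -2.0, 0.5]
print(soft_state(tok, mask, 0.0))  # equals the mask embedding
print(soft_state(tok, mask, 1.0))  # equals the token embedding
print(soft_state(tok, mask, 0.5))  # halfway state, open to revision
```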
Authors: Elena Xinyi Wang, Arnur Nigmetov, Dmitriy Morozov
Abstract: Topological Data Analysis (TDA) provides tools to describe the shape of data, but integrating topological features into deep learning pipelines remains challenging, especially when preserving local geometric structure rather than summarizing it globally. We propose a persistence-based data augmentation framework that encodes local gradient flow regions and their hierarchical evolution using the Morse-Smale complex. This representation, compatible with both convolutional and graph neural networks, retains spatially localized topological information across multiple scales. Importantly, the augmentation procedure itself is efficient, with computational complexity $O(n \log n)$, making it practical for large datasets. We evaluate our method on histopathology image classification and 3D porous material regression, where it consistently outperforms baselines and global TDA descriptors such as persistence images and landscapes. We also show that pruning the base level of the hierarchy reduces memory usage while maintaining competitive performance. These results highlight the potential of local, structured topological augmentation for scalable and interpretable learning across data modalities.
Authors: Ivo Nowak
Abstract: Reinforcement learning is typically treated as a uniform, data-driven optimization process, where updates are guided by rewards and temporal-difference errors without explicitly exploiting global structure. In contrast, dynamic programming methods rely on structured information propagation, enabling efficient and stable learning. In this paper, we provide evidence that such structure can be recovered from the learning dynamics of distributional reinforcement learning. By analyzing the temporal evolution of return distributions, we identify signals that capture when and where learning occurs in the state space. In particular, we introduce a temporal learning indicator t*(s) that reflects when a state undergoes its strongest learning update during training. Empirically, this signal induces an ordering over states that is consistent with a dynamic programming-style propagation of information. Building on this observation, we propose StructRL, a framework that exploits these signals to guide sampling in alignment with the emerging propagation structure. Our preliminary results suggest that distributional learning dynamics provide a mechanism to recover and exploit dynamic programming-like structure without requiring an explicit model. This offers a new perspective on reinforcement learning, where learning can be interpreted as a structured propagation process rather than a purely uniform optimization procedure.
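A toy version of the temporal learning indicator: log a per-step update magnitude for each state and take the argmax over training steps (the scalar magnitude here is a placeholder for the paper's distributional update signal, e.g. a distance between successive return distributions).

```python
def temporal_learning_indicator(update_log):
    """update_log[s] is the sequence of per-step update magnitudes for
    state s. Returns t*(s): the training step of the strongest update."""
    return {s: max(range(len(mags)), key=lambda t: mags[t])
            for s, mags in update_log.items()}

log = {
    "goal_adjacent": [0.9, 0.2, 0.1, 0.05],  # learns early
    "start":         [0.0, 0.1, 0.3, 0.8],   # learns late
}
# The induced ordering over states mirrors a DP-style backward
# propagation of information from the goal toward the start.
print(temporal_learning_indicator(log))
```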
Authors: Zhangyong Liang
Abstract: In this paper, we propose a stochastic-dimension frozen sampled neural network (SD-FSNN) for solving a class of high-dimensional Gross-Pitaevskii equations (GPEs) on unbounded domains. SD-FSNN is unbiased across all dimensions, and its computational cost is independent of the dimension, avoiding the exponential growth in computational and memory costs associated with Hermite-basis discretizations. Additionally, we randomly sample the hidden weights and biases of the neural network, significantly outperforming iterative, gradient-based optimization methods in terms of training time and accuracy. Furthermore, we employ a space-time separation strategy, using adaptive ordinary differential equation (ODE) solvers to update the evolution coefficients and incorporate temporal causality. To preserve the structure of the GPEs, we integrate a Gaussian-weighted ansatz into the neural network to enforce exponential decay at infinity, embed a normalization projection layer for mass normalization, and add an energy conservation constraint to mitigate long-time numerical dissipation. Comparative experiments with existing methods demonstrate the superior performance of SD-FSNN across a range of spatial dimensions and interaction parameters. Compared to existing random-feature methods, SD-FSNN reduces the complexity from linear to dimension-independent. Additionally, SD-FSNN achieves better accuracy and faster training compared to general high-dimensional solvers, while focusing specifically on high-dimensional GPEs on unbounded domains.
Authors: Muhammad Imran Hossain, Md Fazley Rafy, Sarika Khushalani Solanki, Anurag K. Srivastava
Abstract: Accurate battery health prognosis using State of Health (SOH) estimation is essential for the reliability of multi-scale battery energy storage, yet existing methods are limited in generalizability across diverse battery chemistries and operating conditions. The inability of standard neural networks to capture the complex, high-dimensional physics of battery degradation is a major contributor to these limitations. To address this, a physics-informed neural network with the Quantum Feature Mapping (QFM) technique (QPINN) is proposed. QPINN projects raw battery sensor data into a high-dimensional Hilbert space, creating a highly expressive feature set that effectively captures subtle, non-linear degradation patterns using the Nystr\"om method. These quantum-enhanced features are then processed by a physics-informed network that enforces physical constraints. The proposed method achieves an average SOH estimation accuracy of 99.46\% across different datasets, substantially outperforming state-of-the-art baselines, with reductions in MAPE and RMSE of up to 65\% and 62\%, respectively. This method was validated on a large-scale, multi-chemistry dataset of 310,705 samples from 387 cells, and further showed notable adaptability in cross-validation settings, successfully transferring from one chemistry to another without relying on target-domain SOH labels.
Authors: Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yinzhe Zhou
Abstract: Wearable IMU-based Human Activity Recognition (HAR) relies heavily on Deep Neural Networks (DNNs), which are burdened by immense computational and buffering demands. Their power-hungry floating-point operations and rigid requirement to process complete temporal windows severely cripple battery-constrained edge devices. While Spiking Neural Networks (SNNs) offer extreme event-driven energy efficiency, standard architectures struggle with complex biomechanical topologies and temporal gradient degradation. To bridge this gap, we propose the Physics-Aware Spiking Neural Network (PAS-Net), a fully multiplier-free architecture explicitly tailored for Green HAR. Spatially, an adaptive symmetric topology mixer enforces human-joint physical constraints. Temporally, an $O(1)$-memory causal neuromodulator yields context-aware dynamic threshold neurons, adapting actively to non-stationary movement rhythms. Furthermore, we leverage a temporal spike error objective to unlock a flexible early-exit mechanism for continuous IMU streams. Evaluated across seven diverse datasets, PAS-Net achieves state-of-the-art accuracy while replacing dense operations with sparse 0.1 pJ integer accumulations. Crucially, its confidence-driven early-exit capability drastically reduces dynamic energy consumption by up to 98\%. PAS-Net establishes a robust, ultra-low-power neuromorphic standard for always-on wearable sensing. The source code and pre-trained models are publicly available at https://github.com/zhengnaichuan2022/PAS-Net.git.
Authors: Abeer Mostafa, Raneen Younis, Zahra Ahmadi
Abstract: Spatio-temporal processes often exhibit highly heterogeneous and non-intuitive responses to localized disruptions, limiting the effectiveness of conventional message passing approaches in modeling local heterogeneity. We reformulate spatio-temporal forecasting as the problem of learning information flow over locally structured spaces, rather than propagating globally aligned node representations. To this end, we introduce a spatio-temporal sheaf diffusion graph neural network (ST-Sheaf GNN) that embeds graph topology into sheaf-based vector spaces connected by learned linear restriction maps. Unlike prior approaches relying on static or globally shared transformations, our model learns dynamic restriction maps that evolve over time and adapt to local spatio-temporal patterns, enabling more expressive interactions. We show, both theoretically and empirically, that the proposed diffusion mechanism mitigates oversmoothing, preserving discriminative node representations even with increasing diffusion layer depth. Experiments on diverse real-world spatio-temporal forecasting benchmarks across multiple domains demonstrate state-of-the-art performance, highlighting the effectiveness of sheaf topological representations as a principled foundation for spatio-temporal graph learning. The code is available at: https://anonymous.4open.science/r/ST-SheafGNN-6523/.
Authors: Augustus Haoyang Li
Abstract: We present THEIA, a 2.75M modular neural architecture that learns the complete Kleene three-valued logic (K3) truth table from task data without external symbolic inference or hand-encoded K3 gate primitives. Across 5 seeds, THEIA achieves all 39 K3 rules at >99% per-rule accuracy. K3 learnability is not the central finding: Transformer baselines also reach >99% on all 39 rules, and flat MLPs match THEIA on Phase-1 accuracy within 0.04pp. The central findings are two properties of the learned system. (1) Uncertainty-verdict asymmetric propagation. The network preserves Has-Unknown at every upstream boundary (80.0/91.1/90.8/99.7% across Arith/Order/Set/Logic vs. ~52% majority) while final-verdict decodability stays at or below a 73.4% U-vs-non-U oracle reference under linear and nonlinear MLP probes. Activation patching on non-absorbent T->U configurations flips 4,898/4,898 OR pairs (4,719/4,719 AND) across 5 seeds, ruling out residual shortcuts. (2) Reliability spectrum under discretized end-to-end training, on task structures decomposable along the engine boundaries. A mod-3 sequential composition task generalizes from 5- to 500-step eval at 99.96+-0.04% (5 seeds). Under identical Gumbel-softmax training, flat MLPs collapse to chance by 50 steps; a 2x2 ResMLP depth x expansion grid reaches >=99% on only 3/20 (config, seed) trials; a pre-LN Transformer reaches 99.24+-0.34%. The 500-step figure is dominated by straight-through discretization preventing 0.999^500 compounding; the architectural separator is sustaining Phase-1 accuracy under Phase-3 end-to-end Gumbel training, where flat MLPs fail. Auxiliary: under matched optimizer settings THEIA reaches 12/12 Kleene coverage 6.5x faster than a parameter-comparable 8L Transformer; the ratio narrows to ~3.6x under Transformer-standard tuning. We did not perform a THEIA-optimal sweep; ratios are specific-config, not asymptotic.
Authors: Mohammed Ezzaldin Babiker Abdullah
Abstract: The stable operation of off-grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics-Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge-deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics-Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi-year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra-lightweight benchmark for real-time off-grid control.
Authors: Mohammed Ezzaldin Babiker Abdullah
Abstract: The stable operation of autonomous off-grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
Authors: Steve Hanneke, Idan Mehalel, Shay Moran
Abstract: Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.
Authors: Muhammad Umer Sheikh, Khawar Shehzad, Salman Khan, Fahad Shahbaz Khan, Muhammad Haris Khan
Abstract: Climate decision-making in the GCC states increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated multimodal dataset grounded in the GCC states, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises 200k question--answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on climate tasks in the GCC states and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.
Authors: Angelo Moroncelli, Matteo Rufolo, Gunes Cagin Aydin, Asad Ali Shahid, Loris Roveda
Abstract: Accurate modeling of robot dynamics is essential for model-based control, yet remains challenging under distributional shifts and real-time constraints. In this work, we formulate system identification as an in-context meta-learning problem and compare deterministic and generative sequence models for forward dynamics prediction. We take a Transformer-based meta-model as a strong deterministic baseline, and introduce to this setting two complementary diffusion-based approaches: (i) inpainting diffusion (Diffuser), which learns the joint input-observation distribution, and (ii) conditioned diffusion models (CNN and Transformer), which generate future observations conditioned on control inputs. Through large-scale randomized simulations, we analyze performance across in-distribution and out-of-distribution regimes, as well as computational trade-offs relevant for control. We show that diffusion models significantly improve robustness under distribution shift, with inpainting diffusion achieving the best performance in our experiments. Finally, we demonstrate that warm-started sampling enables diffusion models to operate within real-time constraints, making them viable for control applications. These results highlight generative meta-models as a promising direction for robust system identification in robotics.
Authors: Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Abstract: Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing "complexity-first" paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features, including Clear-Sky indices and Solar Zenith Angle, rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention-based baselines (RMSE 30.64 W/m^2). These results confirm a "Complexity Paradox": in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.
Authors: Mohammed Ezzaldin Babiker Abdullah
Abstract: Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
Authors: Junzhe Wang, Zhiheng Xi, Yajie Yang, Hao Luo, Shihan Dou, Tao Gui, Qi Zhang
Abstract: Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agents, existing approaches face key limitations: process supervision often suffers from unstable value estimation, whereas outcome supervision struggles with credit assignment due to sparse, trajectory-level rewards. To bridge this gap, we propose Contribution-Weighted GRPO (CW-GRPO), a framework that integrates process supervision into group relative policy optimization. Instead of directly optimizing process rewards, CW-GRPO employs an LLM judge to assess the retrieval utility and reasoning correctness at each search round, producing per-round contribution scores. These scores are used to rescale outcome-based advantages along the trajectory, enabling fine-grained credit assignment without sacrificing optimization stability. Experiments on multiple knowledge-intensive benchmarks show that CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, leading to more effective search behaviors. Additional analysis reveals that successful trajectories exhibit concentrated contributions in specific rounds, providing empirical insight into search agent tasks.
Authors: Tianhao Qian
Abstract: As search depth increases in autonomous reasoning and embodied planning, the candidate action space expands exponentially, heavily taxing computational budgets. While heuristic pruning is a common countermeasure, it operates without formal safety guarantees when surrogate models (like LLMs) exhibit systematic evaluation biases. This paper frames the node expansion process as a localized Best-Arm Identification (BAI) problem over dynamic frontiers, subject to a bounded systematic bias $L$. By inverting the Lambert W function, we establish an additive sample complexity of $\mathcal{O}((\Delta-4L)^{-2})$, which indicates that safe node elimination is only feasible when the empirical reward gap exceeds $4L$. We complement this with an information-theoretic lower bound of $\Omega((\Delta-2L)^{-2})$ to confirm the structural limits of biased search. Subsequent evaluations on both synthetic trees and complex reasoning tasks demonstrate that adhering to this local safety boundary successfully preserves optimal trajectories while maximizing sample allocation efficiency.
Authors: Wladimir Silva
Abstract: We present VoodooNet, a non-iterative neural architecture that replaces the stochastic gradient descent (SGD) paradigm with a closed-form analytic solution via Galactic Expansion. By projecting input manifolds into a high-dimensional, high-entropy "Galactic" space ($d \gg 784$), we demonstrate that complex features can be untangled without the thermodynamic cost of backpropagation. Utilizing the Moore-Penrose pseudoinverse to solve for the output layer in a single step, VoodooNet achieves a classification accuracy of \textbf{98.10\% on MNIST} and \textbf{86.63\% on Fashion-MNIST}. Notably, our results on Fashion-MNIST surpass a 10-epoch SGD baseline (84.41\%) while reducing the training time by orders of magnitude. We observe a near-logarithmic scaling law between dimensionality and accuracy, suggesting that performance is a function of "Galactic" volume rather than iterative refinement. This "Magic Hat" approach offers a new frontier for real-time Edge AI, where the traditional training phase is bypassed in favor of instantaneous manifold discovery.
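The core mechanism this abstract describes — a fixed random high-dimensional projection followed by solving the output layer in closed form with the Moore-Penrose pseudoinverse — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the toy data, the expanded dimensionality `d`, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class problem: 200 points in 2-D, labels from a nonlinear (XOR-like) rule.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
Y = np.eye(2)[y]                      # one-hot targets

# Fixed random high-dimensional projection: never trained, no backpropagation.
d = 512                               # expanded dimensionality (illustrative choice)
W_in = rng.normal(size=(2, d))
b = rng.uniform(-1.0, 1.0, size=d)
H = np.tanh(X @ W_in + b)             # nonlinear hidden features

# Output layer solved in a single step via the Moore-Penrose pseudoinverse.
W_out = np.linalg.pinv(H) @ Y

pred = (H @ W_out).argmax(axis=1)
acc = (pred == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because `d` exceeds the number of samples here, the least-squares readout fits the training targets essentially exactly; the "training" cost is a single pseudoinverse solve rather than many gradient steps.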
Authors: Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos
Abstract: Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate both frozen molecular foundation model representations, as well as classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost, compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100\% win rates on 30 MoleculeACE tasks, while compact RDKit2d and Mordred descriptors provide strong descriptor-based alternatives. Molecular representation emerges as a key determinant in TFM performance, with molecular foundation model embeddings and 2D descriptor sets both providing substantial gains over classic molecular fingerprints on many tasks. These results suggest that in-context learning with TFMs provides a highly accurate and cost-efficient alternative for property prediction in practical applications.
Authors: Ethan Goan, Clinton Fookes
Abstract: Neural Networks (NNs) have provided state-of-the-art results for many challenging machine learning tasks such as detection, regression and classification across the domains of computer vision, speech recognition and natural language processing. Despite their success, they are often implemented in a frequentist scheme, meaning they are unable to reason about uncertainty in their predictions. This article introduces Bayesian Neural Networks (BNNs) and the seminal research regarding their implementation. Different approximate inference methods are compared and used to highlight where future research can improve on current methods.
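One of the simplest approximate inference methods in this literature is Monte Carlo dropout: keeping dropout active at test time and treating repeated stochastic forward passes as samples from an approximate posterior predictive. A minimal NumPy sketch follows; the toy network, fixed random weights, and parameter choices are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-hidden-layer regression network with fixed (stand-in "pre-trained") weights.
W1 = rng.normal(size=(1, 64)); b1 = rng.normal(size=64)
W2 = rng.normal(size=(64, 1)); b2 = rng.normal(size=1)

def forward(x, drop_rate=0.5):
    """One stochastic pass; dropout stays ON at test time for MC dropout."""
    h = np.maximum(0.0, x @ W1 + b1)          # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate    # random unit dropout
    h = h * mask / (1.0 - drop_rate)          # inverted-dropout scaling
    return h @ W2 + b2

x = np.array([[0.3]])
# T stochastic passes -> empirical predictive mean and variance.
T = 500
samples = np.array([forward(x) for _ in range(T)])
mean, std = samples.mean(), samples.std()
print(f"predictive mean {mean:.3f}, predictive std {std:.3f}")
```

The spread of the sampled outputs gives the uncertainty estimate that a single deterministic (frequentist) forward pass cannot provide.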
Authors: Bo Tang, Elias B. Khalil
Abstract: In deterministic optimization, it is typically assumed that all problem parameters are fixed and known. In practice, however, some parameters may be a priori unknown but can be estimated from contextual information. A typical predict-then-optimize approach separates predictions and optimization into two distinct stages. Recently, end-to-end predict-then-optimize has emerged as an attractive alternative. This work introduces the PyEPO package, a PyTorch-based end-to-end predict-then-optimize library in Python. To the best of our knowledge, PyEPO (pronounced like \textit{pineapple} with a silent ``n") is the first such generic tool for linear and integer programming with predicted objective function coefficients. It includes various algorithms such as surrogate decision losses, black-box solvers, and perturbed methods. PyEPO offers a user-friendly interface for defining new optimization problems, applying state-of-the-art algorithms, and using custom neural network architectures. We conducted experiments comparing various methods on problems such as Shortest Path, Multiple Knapsack, and Traveling Salesperson Problem, and discussed empirical insights that may guide future research. PyEPO and its documentation are available at https://github.com/khalil-research/PyEPO.
Authors: Hrishikesh Viswanath, Md Ashiqur Rahman, Abhijeet Vyas, Andrey Shor, Beatriz Medeiros, Stephanie Hernandez, Suhas Eswarappa Prameela, Aniket Bera
Abstract: Numerical approximations of partial differential equations (PDEs) are routinely employed to formulate the solution of physics, engineering, and mathematical problems involving functions of several variables, such as the propagation of heat or sound, fluid flow, elasticity, electrostatics, electrodynamics, and more. While this has led to solving many complex phenomena, there are some limitations. Conventional approaches such as Finite Element Methods (FEMs) and Finite Difference Methods (FDMs) require considerable time and are computationally expensive. In contrast, data-driven machine learning-based methods, such as neural networks, provide a faster, fairly accurate alternative; in particular, we focus on neural operators, which have certain advantages such as discretization invariance and resolution invariance. This article aims to provide a comprehensive insight into how data-driven approaches can complement conventional techniques to solve engineering and physics problems, while also noting some of the open problems of machine learning-based approaches. We will note how these new computational approaches can bring immense advantages in tackling many problems in fundamental and applied physics.
Authors: Zijie Fang, Yihan Liu, Yifeng Wang, Xiangyang Zhang, Yang Chen, Changjing Cai, Yiyang Lin, Ying Han, Zhi Wang, Shan Zeng, Jun Tan, Yongbing Zhang, Hong Shen
Abstract: Biomarker detection is an indispensable part of the diagnosis and treatment of low-grade glioma (LGG). However, current LGG biomarker detection methods rely on expensive and complex molecular genetic testing, for which professionals are required to analyze the results, and intra-rater variability is often reported. To overcome these challenges, we propose an interpretable deep learning pipeline, named Multi-Biomarker Histomorphology Discoverer (Multi-Beholder), to predict the status of five biomarkers in LGG using only hematoxylin and eosin-stained whole slide images. Specifically, Multi-Beholder incorporates one-class classification into the multiple instance learning framework to achieve accurate instance-level pseudo-labeling, thereby complementing slide-level labels and improving prediction performance. Multi-Beholder demonstrates high performance on two LGG cohorts with diverse races and scanning protocols, with area under the receiver operating characteristic curve up to 0.973 on the internal-validated TCGA-LGG dataset and 0.820 on the external-validated Xiangya cohort. Moreover, the interpretability of Multi-Beholder allows for discovering quantitative and qualitative correlations between biomarker status and histomorphology characteristics. Our pipeline not only provides a novel approach for biomarker prediction, enhancing the applicability of molecular treatments for LGG patients but also facilitates the discovery of new mechanisms in molecular functionality and LGG progression. Code can be accessed at https://github.com/Vison307/Multi-Beholder.
Authors: Weide Liu, Huijing Zhan
Abstract: Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most existing research assumes that all modalities are available during both training and testing, which makes the resulting algorithms susceptible to missing-modality scenarios. In this paper, we propose a novel knowledge-transfer network to translate between different modalities to reconstruct the missing audio features. Moreover, we develop a cross-modality attention mechanism to maximize the information extracted from the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baseline methods and achieve comparable results to previous methods with complete multi-modality supervision.
Authors: Jiahao Liu, Jun Zeng, Fabio Pierazzi, Ziqi Yang, Lorenzo Cavallaro, Zhenkai Liang
Abstract: With the rapid advancement of machine learning (ML), ML-based Android malware detection has gained significant popularity due to its ability to automatically learn malicious patterns from Android apps. However, the lack of an in-depth and systematic analysis of existing research makes it difficult to obtain a holistic understanding of the state of the art in this field. In this work, we present the most comprehensive investigation to date of ML-based Android malware detection systems, combining both empirical and quantitative analyses. We first organize prior work into a unified taxonomy based on Android app representations and the ML modeling pipeline. Building on this taxonomy, we design a general-purpose framework for ML-based Android malware detection and re-implement 12 representative approaches from three research communities -- software engineering, security, and machine learning. Using this framework, we conduct a large-scale evaluation across three key dimensions: detection effectiveness, robustness to real-world challenges, and efficiency. Despite extensive research efforts and encouraging results, our findings reveal that existing learning-based Android malware detectors still face significant challenges, including vulnerability to malware evolution and susceptibility to adversarial attacks. We attribute these limitations to the detectors' limited ability to capture and leverage malware semantics, defined as semantic information that characterizes malicious behaviors derived from APK features. Finally, we summarize our key insights and provide actionable recommendations to guide future research in this domain.
Authors: Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang
Abstract: Large language models (LLMs) have revolutionized various applications, making robust safety alignment essential to prevent harmful outputs. Current safety alignment techniques, however, harbor inherent vulnerabilities due to their reliance on logit suppression. In this work, we identify critical logit-level vulnerabilities by introducing Semantic-sensitive Alignment and Generation (SSAG), a method designed to systematically manipulate output-layer logits without altering model parameters. Experiments on five popular LLMs show that SSAG exposes harmful responses with a 95% success rate while reducing response time by 86%. SSAG also demonstrates superior attack efficacy, achieving an average ASR of up to 77% against strong defensive mechanisms. These findings reveal crucial weaknesses in existing alignment methods, highlighting an urgent need for improved vulnerability detection and robust safety alignment strategies. Our code is available on GitHub.
Authors: Zhanglu Yan, Zhenyu Bai, Tulika Mitra, Weng-Fai Wong
Abstract: Deep learning has driven significant technological advancements, but its high energy consumption limits its use on battery-operated edge devices. Spiking Neural Networks (SNNs) offer promising reductions in inference-time energy consumption. However, existing neuromorphic architectures are optimized for scalable, many-core NoC execution, which suits large models but is mismatched to edge devices, and their prevalent integrate-and-fire neurons re-read weights across \(T\) timesteps, inflating data-movement and dynamic-control energy. To address this challenge, we propose SparrowSNN, an optimized end-to-end design tailored for edge applications. SparrowSNN proposes: (1) a hardware-friendly spike activation function SSF (Sum-Spike-and-Fire); (2) a customizable $\mu$W-level-power quantized hybrid ANN-SNN model that can be designed per application; (3) a compact and low-power reconfigurable ASIC architecture supporting the aforementioned designs. Evaluated on the biomedical MIT-BIH ECG and DEAP EEG datasets, SparrowSNN achieves state-of-the-art accuracy with $20\times$ to $100\times$ lower energy consumption, significantly outperforming existing ultra-low power solutions.
Authors: Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
Abstract: Memory pressure has emerged as a dominant constraint in scaling the training of large language models (LLMs), particularly in resource-constrained environments. While modern frameworks incorporate various memory-saving techniques, they often expose low-level configuration knobs that require manual tuning and specialized system expertise. This not only adds engineering overhead but also risks suboptimal hardware utilization when misconfigured. This paper introduces ProTrain, a novel training system that automatically tailors memory management policies to the model architecture and underlying hardware resources, eliminating the need for manual intervention. The core of ProTrain is its automated memory management that abstracts complex memory management strategies into a few tunable configuration parameters, allowing searches for optimal parameter settings using cost models. ProTrain is equipped with a runtime profiler that provides precise estimates of latency, memory usage, and I/O bandwidth to build high-fidelity cost models. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to the state-of-the-art training systems.
Authors: Seunghee Han, Byeong Gwan Lee, Dae Woon Lim, Jihan Kim
Abstract: Recently, metal-organic frameworks (MOFs) have demonstrated their potential as solid-state electrolytes in proton exchange membrane fuel cells. However, the number of MOFs reported to exhibit proton conductivity remains limited, and the mechanisms underlying this phenomenon are not fully elucidated, complicating the design of proton-conductive MOFs. In response, we developed a comprehensive database of proton-conductive MOFs and applied machine learning techniques to predict their proton conductivity. Our approach included the construction of both descriptor-based and transformer-based models. Notably, the transformer-based transfer learning (Freeze) model performed the best with a mean absolute error (MAE) of 0.91, suggesting that the proton conductivity of MOFs can be estimated within one order of magnitude using this model. Additionally, we employed feature importance and principal component analysis to explore the factors influencing proton conductivity. The insights gained from our database and machine learning model are expected to facilitate the targeted design of proton-conductive MOFs.
Authors: Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu
Abstract: Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a challenge due to annotation errors and the substantial time and costs associated with human labor. To address these issues, we propose Automatic Dataset Construction (ADC), an innovative methodology that automates dataset creation with negligible cost and high efficiency. Taking the image classification task as a starting point, ADC leverages LLMs for detailed class design and code generation to collect relevant samples via search engines, significantly reducing the need for manual annotation and speeding up the data generation process. To demonstrate ADC at scale, we construct Clothing-ADC: a dataset of over 1 million images spanning 12 main classes and 12,000 fine-grained subclasses. Our automated curation achieves 79\% agreement with human annotators and reduces label noise from 22.2\% to 10.7\%. Despite these advantages, ADC also encounters real-world challenges such as label errors (label noise) and imbalanced data distributions (label bias). We provide open-source software that incorporates existing methods for label error detection and robust learning under noisy and biased data, ensuring higher-quality training data and a more robust model training procedure. Furthermore, we design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning. These datasets are vital because few existing datasets specifically target label noise detection, despite its importance. Finally, we evaluate the performance of existing popular methods on these datasets, thereby facilitating further research in the field.
Authors: Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao
Abstract: To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: confident data is prone to overfitting, while unconfident data is harder to generalize from. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix's compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.
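The Mixup interpolation the abstract refers to can be illustrated with a minimal sketch. This is the classic Mixup operation (convex combination of two examples and their targets), not SFTMix's exact recipe; the variable names and the Beta-distribution mixing coefficient are standard Mixup conventions, used here only for illustration.

```python
import numpy as np

def mixup(h_confident, h_unconfident, y_a, y_b, alpha=0.2, seed=0):
    """Interpolate a confident and an unconfident example (classic Mixup).

    h_*: representation vectors; y_*: soft target vectors.
    lam ~ Beta(alpha, alpha) controls the mixing ratio.
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    h_mixed = lam * h_confident + (1 - lam) * h_unconfident
    y_mixed = lam * y_a + (1 - lam) * y_b
    return h_mixed, y_mixed
```

Training then adds a loss term on the interpolated pair, which regularizes the model between the confident and unconfident regions of representation space.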
Authors: Wojciech Anyszka, Kaja Gruntkowska, Alexander Tyurin, Peter Richt\'arik
Abstract: We revisit FedExProx - a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies - based on gradient diversity and Polyak stepsizes - again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Lojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.
Authors: Timo Klein, Christoph Luther, Manus McAuliffe, Lukas Miklautz, Claudia Plant, Sebastian Tschiatschek
Abstract: Plasticity refers to a network's ability to adapt to changing data distributions, which is crucial for the successful training of deep reinforcement learning agents. Loss of plasticity causes performance plateaus and contributes to scaling failures, overestimation bias, and insufficient exploration. To deepen the understanding of plasticity loss, we propose a unified definition, examine its drivers and pathologies, and organize over 50 mitigation strategies into the first comprehensive taxonomy of the field. Our analysis shows gaps in current evaluation practices and reveals that general regularization techniques often outperform domain-specific interventions. Future research should prioritize understanding the mechanisms underlying plasticity loss.
Authors: Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Amir Hussain, Kaizhu Huang
Abstract: Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on others from different hospitals or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints -- such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols -- with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained. It further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.
Authors: Xabier E. Barandiaran, Marta P\'erez-Verdugo
Abstract: This paper introduces the concept of ``generative midtended cognition'', exploring the integration of generative AI with human cognition. The term ``generative'' reflects AI's ability to iteratively produce structured outputs, while ``midtended'' captures the potential hybrid (human-AI) nature of the process. It stands between traditional conceptions of intended creation, understood as directed from within, and extended processes that bring exo-biological processes into the creative process. We examine current generative technologies (based on multimodal transformer architectures typical of large language models like ChatGPT) to explain how they can transform human cognitive agency beyond what standard theories of extended cognition can capture. We suggest that the type of cognitive activity typical of the coupling between a human and generative technologies is closer (but not equivalent) to social cognition than to classical extended cognitive paradigms. Yet, it deserves a specific treatment. We provide an explicit definition of generative midtended cognition in which we treat interventions by AI systems as constitutive of the agent's intentional creative processes. Furthermore, we distinguish two dimensions of generative hybrid creativity: 1. Width: captures how much of the context the generative process is sensitive to (from a single letter to the whole historical and surrounding data); 2. Depth: captures the granularity of the iteration loops involved in the process. In depth, generative midtended cognition stands between conversational forms of cognition, in which complete utterances or creative units are exchanged, and micro-cognitive (e.g. neural) subpersonal processes. Finally, the paper discusses the potential risks and benefits of widespread generative AI adoption, including the challenges of authenticity, generative power asymmetry, and creative boost or atrophy.
Authors: John Chen, Alexandros Lotsos, Sihan Cheng, Caiyi Wang, Lexie Zhao, Yanjia Zhang, Jessica Hullman, Bruce Sherin, Uri Wilensky, Michael Horn
Abstract: Qualitative analysis is critical to understanding human datasets in many social science disciplines. A central method in this process is inductive coding, where researchers identify and interpret codes directly from the datasets themselves. Yet, this exploratory approach poses challenges for meeting methodological expectations (such as ``depth'' and ``variation''), especially as researchers increasingly adopt Generative AI (GAI) for support. Ground-truth-based metrics are insufficient because they contradict the exploratory nature of inductive coding, while manual evaluation can be labor-intensive. This paper presents a theory-informed computational method for measuring inductive coding results from humans and GAI. Our method first merges individual codebooks using an LLM-enriched algorithm. It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Through two experiments on a human-coded online conversation dataset, we 1) reveal the merging algorithm's impact on metrics; 2) validate the metrics' stability and robustness across multiple runs and different LLMs; and 3) showcase the metrics' ability to diagnose coding issues, such as excessive or irrelevant (hallucinated) codes. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
Authors: Ranganath Krishnan, Piyush Khanna, Omesh Tickoo
Abstract: Large language models (LLMs) have revolutionized the field of natural language processing with their impressive reasoning and question-answering capabilities. However, these models are sometimes prone to generating credible-sounding but incorrect information, a phenomenon known as LLM hallucinations. Reliable uncertainty estimation in LLMs is essential for fostering trust in their generated responses and serves as a critical tool for the detection and prevention of erroneous or hallucinated outputs. To achieve reliable and well-calibrated uncertainty quantification in open-ended and free-form natural language generation, we propose an uncertainty-aware fine-tuning approach for LLMs. This approach enhances the model's ability to provide reliable uncertainty estimates without compromising accuracy, thereby guiding them to produce more trustworthy responses. We introduce a novel uncertainty-aware causal language modeling loss function, grounded in the principles of decision theory. Through rigorous evaluation on multiple free-form question-answering datasets and models, we demonstrate that our uncertainty-aware fine-tuning approach yields better calibrated uncertainty estimates in natural language generation tasks than fine-tuning with the standard causal language modeling loss. Furthermore, the experimental results show that the proposed method significantly improves the model's ability to detect hallucinations and identify out-of-domain prompts.
Authors: Zikai Yin, Hamid Gadirov, Jiri Kosinka, Steffen Frey
Abstract: We introduce ENTIRE, a novel deep learning-based approach for fast and accurate volume rendering time prediction. Predicting rendering time is inherently challenging due to its dependence on multiple factors, including volume data characteristics, image resolution, camera configuration, and transfer function settings. Our method addresses this by first extracting a feature vector that encodes structural volume properties relevant to rendering performance. This feature vector is then integrated with additional rendering parameters, such as image resolution, camera setup, and transfer function settings, to produce the final prediction. We evaluate ENTIRE across multiple rendering frameworks (CPU- and GPU-based) and configurations (with and without single-scattering) on diverse datasets. The results demonstrate that our model achieves high prediction accuracy with fast inference speed and can be efficiently adapted to new scenarios by fine-tuning the pretrained model with few samples. Furthermore, we showcase ENTIRE's effectiveness in two case studies, where it enables dynamic parameter adaptation for stable frame rates and load balancing.
Authors: Kang Yang, Yuning Chen, Wan Du
Abstract: We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing algorithm that estimates spectrum reception at the receiver. Experimental results demonstrate that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.
Authors: Ezinne Nwankwo, Lauri Goldkind, Angela Zhou
Abstract: Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. While ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing which data points should be sampled for outcome information in order to improve efficiency in average treatment effect estimation with missing outcomes. We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by our street outreach partners, we extend the framework to costly annotations of unstructured data, such as text or images in healthcare and social services. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the AIPW estimate with fewer labels than existing baselines. In practice, we show that our method can match confidence intervals obtained with 361 random samples using only 90 optimized samples - saving 75% of the labeling budget.
Authors: Antoine Chatalic, Marco Letizia, Nicolas Schreuder, Lorenzo Rosasco
Abstract: Two-sample hypothesis testing -- determining whether two sets of data are drawn from the same distribution -- is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nystr\"om approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing applicability to realistic scientific data.
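The core idea -- replacing the quadratic-cost MMD statistic with a Nyström low-rank approximation inside a permutation test -- can be sketched as follows. This is a generic illustration of Nyström-approximated MMD with an RBF kernel, not the paper's algorithm or its statistical guarantees; the landmark count, bandwidth, and permutation scheme are illustrative choices.

```python
import numpy as np

def nystrom_features(Z, landmarks, gamma):
    """Approximate the RBF kernel feature map using m landmark points:
    phi(z) = k(z, landmarks) @ K_mm^{-1/2}, so k(x, y) ~ phi(x) @ phi(y)."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K_mm = rbf(landmarks, landmarks)
    # Regularized inverse square root of the landmark Gram matrix.
    w, V = np.linalg.eigh(K_mm + 1e-8 * np.eye(len(landmarks)))
    return rbf(Z, landmarks) @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)

def mmd2_nystrom(X, Y, landmarks, gamma=1.0):
    """Squared MMD between the mean embeddings in the approximate feature space."""
    diff = (nystrom_features(X, landmarks, gamma).mean(axis=0)
            - nystrom_features(Y, landmarks, gamma).mean(axis=0))
    return float((diff ** 2).sum())

def permutation_test(X, Y, n_landmarks=20, n_perm=200, gamma=1.0, seed=0):
    """p-value for H0: X and Y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    landmarks = pooled[rng.choice(len(pooled), n_landmarks, replace=False)]
    observed = mmd2_nystrom(X, Y, landmarks, gamma)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        Xp, Yp = pooled[perm[:len(X)]], pooled[perm[len(X):]]
        count += mmd2_nystrom(Xp, Yp, landmarks, gamma) >= observed
    return (count + 1) / (n_perm + 1)
```

With m landmarks, each statistic costs O(nm) kernel evaluations instead of the O(n^2) of the exact MMD, which is what makes the large-scale regime tractable.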
Authors: Adam \v{S}torek, Mukur Gupta, Noopur Bhatt, Aditya Gupta, Janie Kim, Prashast Srivastava, Suman Jana
Abstract: AI coding assistants are widely used for tasks like code generation. These tools now require large and complex contexts, automatically sourced from various origins$\unicode{x2014}$across files, projects, and contributors$\unicode{x2014}$forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these perturbations since the semantics of the code remains correct, making it appear legitimate. This allows attackers to manipulate coding assistants into producing incorrect outputs, while shifting the blame to the victim developer. We introduce a novel, task-agnostic, black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving a 75.72% attack success rate on average across five tasks and eleven models, including GPT 4.1 and Claude 3.5 Sonnet v2 used by popular AI coding assistants. Furthermore, defenses like adversarial fine-tuning are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.
Authors: Jonggwon Park, Byungmu Yoon, Soobum Kim, Kyoyun Choi
Abstract: Automated radiology report generation (RRG) holds potential to reduce the workload of radiologists, and recent advances in multimodal large language models (MLLMs) have enabled multimodal chest X-ray (CXR) report generation. However, existing MLLMs are computationally expensive, require large-scale training data, and may produce hallucinated content, limiting their practical deployment. To address these limitations, we propose RA-RRG, a retrieval-augmented RRG framework that combines multimodal retrieval with large language models (LLMs) to generate radiology reports while reducing hallucinations and computational demands. RA-RRG uses LLMs to extract clinically essential key phrases from radiology reports and retrieves relevant phrases given an input image. By conditioning LLMs on the retrieved phrases, RA-RRG effectively suppresses hallucinations while maintaining strong report generation performance. Experiments on the MIMIC-CXR and IU X-ray datasets show state-of-the-art results on CheXbert metrics and competitive RadGraph F1 scores compared to MLLMs. Furthermore, RA-RRG naturally generalizes to multi-view RRG by aggregating phrases retrieved from multiple images, highlighting its broad applicability to real-world clinical scenarios. Code is available at https://github.com/deepnoid-ai/RA-RRG.
Authors: Ningsheng Zhao, Trang Bui, Jia Yuan Yu, Krzysztof Dzieciolowski
Abstract: Many classification performance metrics exist, each suited to a specific application. However, these metrics often differ in scale and can exhibit varying sensitivity to class imbalance rates in the test set. As a result, it is difficult to use the nominal values of these metrics to evaluate, compare and monitor classification performances, especially when imbalance rates vary. To address this problem, we introduce the outperformance standardization (OPS) function, a universal standardization method for confusion-matrix-based classification performance (CMBCP) metrics. It maps any given metric to a common scale of $[0,1]$, while providing a clear and consistent interpretation. Specifically, the resulting OPS value (o-value) represents the percentile rank of the observed classification performance within a reference distribution of possible performances. This unified framework enables meaningful comparison and monitoring of classification performance across test sets with differing imbalance rates. We illustrate how o-values can be applied to a variety of commonly used classification performance metrics and demonstrate the utility and robustness of our method through experiments on real-world datasets spanning multiple classification applications.
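The percentile-rank idea behind the o-value can be sketched with a simplified stand-in: compare an observed metric value against a reference distribution of performances and report its rank in [0, 1]. The abstract does not specify the reference distribution, so this sketch uses label-preserving random predictors (permutations of the predictions) as an illustrative, assumed choice; the paper's actual reference construction may differ.

```python
import numpy as np

def percentile_standardize(metric_fn, y_true, y_pred, n_ref=1000, seed=0):
    """Map an observed metric value to its percentile rank within a
    reference distribution of performances.  Here the reference is built
    from random predictors that keep the predicted class proportions
    (an illustrative assumption, not the paper's definition)."""
    rng = np.random.default_rng(seed)
    observed = metric_fn(y_true, y_pred)
    ref = np.empty(n_ref)
    for i in range(n_ref):
        ref[i] = metric_fn(y_true, rng.permutation(y_pred))
    # Fraction of reference performances strictly below the observed one.
    return float((ref < observed).mean())

def accuracy(y_true, y_pred):
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())
```

Because the output is a rank rather than a raw score, two classifiers evaluated on test sets with different imbalance rates become comparable on the same [0, 1] scale.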
Authors: Adam \v{S}torek, Mukur Gupta, Samira Hajizadeh, Prashast Srivastava, Suman Jana
Abstract: Large language models (LLMs) are increasingly deployed for understanding large codebases, but whether they understand the operational semantics of long code contexts or rely on pattern-matching shortcuts remains unclear. We distinguish between lexical recall (retrieving code verbatim) and semantic recall (understanding operational semantics). Evaluating 10 state-of-the-art LLMs, we find that while frontier models achieve near-perfect, position-independent lexical recall, semantic recall degrades severely when code is centrally positioned in long contexts. We introduce semantic recall sensitivity to measure whether tasks require understanding of code's operational semantics or permit pattern-matching shortcuts. Through a novel counterfactual measurement method, we show that models rely heavily on pattern-matching shortcuts to solve existing code understanding benchmarks. We propose a new task, SemTrace, which achieves high semantic recall sensitivity through unpredictable operations; LLMs' accuracy exhibits severe positional effects, with median accuracy drops of 92.73% versus CRUXEval's 53.36% as the relevant code snippet approaches the middle of the input code context. Our findings suggest current evaluations substantially underestimate semantic recall failures in long-context code understanding.
Authors: Kaheon Kim, Bohan Zhou, Changbo Zhu, Xiaohui Chen
Abstract: This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions discretized over a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that achieves the same rate as the classical subgradient descent methods for minimizing nonsmooth convex functions in the Euclidean space. A central feature of our SGA algorithm is that the computationally expensive $c$-concavity projection operator enforced on the Kantorovich dual potentials is unnecessary to guarantee convergence, leading to significant algorithmic and theoretical simplifications over all existing primal and dual methods for computing the exact barycenter. Our numerical experiments demonstrate the superior empirical performance of SGA over the existing optimal transport barycenter solvers.
Authors: Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe
Abstract: We study the policy testing problem in discounted Markov decision processes (MDPs) in the fixed-confidence setting under a generative model with static sampling. The goal is to decide whether the value of a given policy exceeds a specified threshold while minimizing the number of samples. We first derive an instance-dependent lower bound that any reasonable algorithm must satisfy, characterized as the solution to an optimization problem with non-convex constraints. Guided by this formulation, we propose a new algorithm. While this design paradigm is common in pure exploration problems such as best-arm identification, the non-convex constraints that arise in MDPs introduce substantial difficulties. To address them, we reformulate the lower-bound problem by swapping the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. This reformulation admits an interpretation as a policy optimization task in a newly constructed reversed MDP. We further show that the global KL constraint can be decomposed exactly into a family of product-box subproblems, which are solved by projected policy gradient and combined through an outer budget search. Beyond policy testing, our reformulation and reversed MDP view suggest extensions to other pure exploration tasks in MDPs, including policy evaluation and best policy identification.
Authors: Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim
Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.
Authors: John P. Dickerson, Hadi Hosseini, Samarth Khanna, Leona Pierce
Abstract: The rapid integration of Large Language Models (LLMs) in high-stakes decision-making -- such as allocating scarce resources like donor organs -- raises critical questions about their alignment with human moral values. We systematically evaluate the behavior of several prominent LLMs against human preferences in kidney allocation scenarios and show that LLMs: i) exhibit stark deviations from human values in prioritizing various attributes, and ii) in contrast to humans, LLMs rarely express indecision, opting for deterministic decisions even when alternative indecision mechanisms (e.g., coin flipping) are provided. Nonetheless, we show that low-rank supervised fine-tuning with few samples is often effective in improving both decision consistency and calibrating indecision modeling. These findings illustrate the necessity of explicit alignment strategies for LLMs in moral/ethical domains.
Authors: Zeming Wei, Chengcan Wu, Meng Sun
Abstract: Large Language Models (LLMs) have achieved tremendous success in various tasks, yet concerns about their safety and security have emerged. In particular, they pose risks of generating harmful content and are vulnerable to jailbreaking attacks, creating unaddressed security issues regarding their deployments. In the context of software engineering for artificial intelligence (SE4AI) techniques, model-based analysis has demonstrated notable potential for analyzing and monitoring machine learning models, particularly in stateful deep neural networks. However, it suffers from scalability issues when extended to LLMs due to their vast feature spaces. In this paper, we aim to address the scalability issue of model-based analysis techniques for safeguarding LLM-scale models. Motivated by the recent discovery of low-dimensional safety-critical representations that emerged in LLMs, we propose ReGA, a model-based analysis framework with Representation-Guided Abstraction, to safeguard LLMs against harmful prompts and generations. By leveraging safety-critical representations, which are key directions in hidden states that indicate safety-related concepts, ReGA effectively narrows the scalability gap when developing the abstract model for safety modeling. Our comprehensive evaluation shows that ReGA performs sufficiently well in distinguishing between safe and harmful inputs, achieving an AUROC of 0.975 at the prompt level and 0.985 at the conversation level. Additionally, ReGA exhibits robustness to real-world attacks and generalization across different safety perspectives, outperforming existing safeguard paradigms in terms of interpretability and scalability. Overall, ReGA serves as an efficient and scalable solution to enhance LLM safety by integrating representation engineering with model-based abstraction, paving the way for new paradigms to utilize software insights for AI safety.
Authors: Ziyang Yu, Wenbing Huang, Yang Liu
Abstract: Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose \textbf{Uni}fied \textbf{Sim}ulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-head pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
Authors: Deborah Pereg
Abstract: Signal restoration and inverse problems are key elements in most real-world data science applications. In the past decades, with the emergence of machine learning methods, inversion of measurements has become a popular step in almost all physical applications, normally executed prior to downstream tasks that often involve parameter estimation. In this work, we propose a general framework for the theoretical analysis of parameter estimation in inverse problem settings. We distinguish between continuous and discrete parameter estimation, corresponding to regression and classification problems, respectively. We investigate this setting for invertible and non-invertible degradation processes, comparing parameter estimation executed directly from the observed measurements with parameter estimation performed after a data-processing step that inverts the observations. Our theoretical findings align with the well-known information-theoretic data processing inequality and, to a certain degree, question the common misconception that data-processing for inversion, based on modern generative models that often produce outstanding perceptual quality, will necessarily improve the subsequent parameter estimation objective. Importantly, by re-formulating the domain-shift problem in direct relation to discrete parameter estimation, we expose a significant vulnerability in current popular practical attempts to enforce domain generalization, which we dub the Double Meaning Theorem. These theoretical findings are experimentally illustrated for domain-shift examples in image deblurring and speckle suppression in medical imaging. It is our hope that this paper will provide practitioners with deeper insights that may be leveraged in the future for the development of more efficient and informed strategic system planning, critical in safety-sensitive applications.
Authors: William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
Abstract: Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we release our code, see https://research.zenseact.com/publications/R3D2/.
Authors: Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Abstract: Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
Authors: Almog Hilel, Riddhi Bhagwat, Idan Shenfeld, Jacob Andreas, Leshem Choshen
Abstract: We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When these feedback signals are used in subsequent preference tuning, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our findings both identify a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior) and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
Authors: Chien Van Nguyen, Huy Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
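As background for the linearization idea above, here is a minimal numpy sketch of standard causal linear attention, not Lizard's actual mechanism (which adds learnable gating modules and a hardware-aware training algorithm): replacing softmax with a positive feature map phi lets the attention state be carried as running sums, so each step costs O(1) memory instead of a growing KV cache. The exponential feature map and dimensions are illustrative assumptions.

```python
import numpy as np

def linear_attention(Q, K, V, phi=np.exp):
    # Causal linear attention: out[t] = phi(q_t) S_t / (phi(q_t) z_t),
    # where S_t = sum_{s<=t} phi(k_s) v_s^T and z_t = sum_{s<=t} phi(k_s)
    # are running sums updated in O(1) per step (no stored KV cache).
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    z = np.zeros(d)
    out = np.zeros_like(V)
    for t in range(n):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])     # accumulate key-value outer products
        z += k                     # accumulate normalizer
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out
```

At t = 0 the state contains only the first token, so the output reduces to V[0] exactly; later outputs are convex combinations of past values weighted by the kernelized query-key scores.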
Authors: Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Yiqing Xie, Daniel Fried, Carolyn Rose
Abstract: Large language models excel at code generation but struggle with code linting, particularly in generalizing to unseen or evolving best practices beyond those observed during training. We introduce MetaLint, a meta-learning framework that formulates code linting as an instruction-following task, where a model evaluates whether code adheres to a natural language specification of best practices. In contrast to prior work that trains models to detect violations from a fixed set of best practices, MetaLint evaluates code against a provided natural language specification, enabling test-time control over which practices to enforce and generalization to unseen or evolving rules without retraining. We demonstrate that models trained solely on synthetic data generated from automatic linters still generalize to harder, context-dependent best practices for which such linters are not available. To evaluate generalization beyond such easy signals, we introduce a human-curated benchmark of hard best practices inspired by Python Enhancement Proposals (PEPs). On this benchmark, MetaLint substantially improves performance without explicit fine-tuning on target best practices and exhibits strong, easy-to-hard generalization. Qwen3-4B achieves a 2.7x detection F-score gain (25.9% -> 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources. We release the code and benchmark to support reproducibility and future work.
Authors: Ievgenii Afanasiev, Leonid Berlyand, Mariia Kiyashko
Abstract: The paper is concerned with deformed Wigner random matrices. These matrices are closely related to Deep Neural Networks (DNNs): weight matrices of trained DNNs could be represented in the form $R + S$, where $R$ is random and $S$ is highly correlated. The spectrum of such matrices plays a key role in rigorous underpinning of the novel pruning technique based on Random Matrix Theory. In practice, the spectrum of the matrix $S$ can be rather complicated. In this paper, we develop an asymptotic analysis for the case of full rank $S$ with increasing number of outlier eigenvalues.
Authors: Junliang Luo, Katrin Tinn, Samuel Ferreira Duran, Di Wu, Xue Liu
Abstract: Tokenized U.S. Treasuries have emerged as a prominent subclass of real-world assets (RWAs), offering cryptographically secured, yield-bearing instruments issued across multi-chain Web3 infrastructures, with growing significance for transparency, accessibility, and financial inclusion. While the market has expanded rapidly, empirical analyses of transaction-level behaviours remain limited. This paper conducts a quantitative, function-level dissection of U.S. Treasury-backed RWA tokens, including BUIDL, BENJI, and USDY, across multiple chains, mostly Ethereum and Layer-2s. Decoded contract calls expose core financial primitives such as issuance, redemption, transfer, and bridging, revealing patterns that distinguish institutional participants from smaller or retail users and, with them, the extent and limits of inclusivity in current RWA adoption. To infer address-level economic roles, we introduce a curvature-aware representation learning model. Our method outperforms baseline models in role inference on our collected U.S. Treasury transaction dataset and generalizes to address classification across broader public blockchain transaction datasets. The decoded transaction-level patterns in tokenized U.S. Treasuries across chains surface the degree of retail participation, and the role inference model enables the distinction between institutional treasuries, arbitrage bots, and retail traders based on behavioral patterns, facilitating more transparent, inclusive, and accountable Web3 finance.
Authors: Atif Bilal Asad, Achyut Paudel, Safal Kshetri, Chenchen Kang, Salik Ram Khanal, Nataliya Shcherbatyuk, Pierre Davadant, R. Paul Schreiner, Santosh Kalauni, Manoj Karkee, Markus Keller
Abstract: Nitrogen (N) is one of the most critical nutrients in winegrape production, influencing vine vigor, fruit composition, and wine quality. Because soil N availability varies spatially and temporally, accurate estimation of leaf N concentration is essential for optimizing fertilization at the individual plant level. In this study, in-field hyperspectral images (400-1000 nm) were collected from four grapevine cultivars (Chardonnay, Pinot Noir, Concord, and Syrah) across two growth stages (bloom and veraison) during the 2022 and 2023 growing seasons at both the leaf and canopy levels. An ensemble feature selection framework was developed to identify the most informative spectral bands for N estimation within individual cultivars, effectively reducing redundancy and selecting compact, physiologically meaningful band combinations spanning the visible, red-edge, and near-infrared regions. At the leaf level, models achieved the highest predictive accuracy for Chardonnay (R^2 = 0.82, RMSE = 0.19 %DW) and Pinot Noir (R^2 = 0.69, RMSE = 0.20 %DW). Canopy-level predictions also performed well, with R^2 values of 0.65, 0.72, and 0.70 for Chardonnay, Concord, and Syrah, respectively. White cultivars exhibited balanced spectral contributions across the visible, red-edge, and near-infrared regions, whereas red cultivars relied more heavily on visible bands due to anthocyanin-chlorophyll interactions. Leaf-level N-sensitive bands selected for Chardonnay and Pinot Noir were successfully transferred to the canopy level, improving or maintaining prediction accuracy across cultivars. These results confirm that ensemble feature selection captures spectrally robust, scale-consistent bands transferable across measurement levels and cultivars, demonstrating the potential of integrating in-field hyperspectral imaging with machine learning for vineyard N status monitoring.
Authors: Tianyi Hu, Andrea Morales-Garz\'on, Jingyi Zheng, Maria Maistro, Daniel Hershcovich
Abstract: In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish's essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.
Authors: Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Wanli Ouyang, Yu Wang
Abstract: Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss during the prefill stage and lack flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning. Our code is available at https://github.com/AwakenedInsects/VocabTailor.
Authors: Yan Cathy Hua, Paul Denny, J\"org Wicker, Katerina Taskova
Abstract: Every year, most educational institutions seek and receive an enormous volume of text feedback from students on courses, teaching, and overall experience. Yet, turning this raw feedback into useful insights is far from straightforward. It has been a long-standing challenge to adopt automatic opinion mining solutions for such education review text data due to the content complexity and low-granularity reporting requirements. Aspect-based Sentiment Analysis (ABSA) offers a promising solution with its rich, sub-sentence-level opinion mining capabilities. However, existing ABSA research and resources are very heavily focused on the commercial domain. In education, they are scarce and hard to develop due to limited public datasets and strict data protection. A high-quality, annotated dataset is urgently needed to advance research in this under-resourced area. In this work, we present EduRABSA (Education Review ABSA), the first public, annotated ABSA education review dataset that covers three review subject types (course, teaching staff, university) in the English language and all main ABSA tasks, including the under-explored implicit aspect and implicit opinion extraction. We also share ASQE-DPT (Data Processing Tool), an offline, lightweight, installation-free manual data annotation tool that generates labelled datasets for comprehensive ABSA tasks from a single-task annotation. Together, these resources contribute to the ABSA community and education domain by removing the dataset barrier, supporting research transparency and reproducibility, and enabling the creation and sharing of further resources. The dataset, annotation tool, and scripts and statistics for dataset processing and sampling are available at https://github.com/yhua219/edurabsa_dataset_and_annotation_tool.
Authors: Mihir Bafna, Bowen Jing, Bonnie Berger
Abstract: Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as $3 \times 3$ covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
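The per-residue quantities in the Gaussian view above have simple closed forms: residue flexibility (RMSF) is the square root of the trace of the 3x3 positional covariance, and anisotropy can be summarized by an eigenvalue ratio. A minimal numpy sketch of these relations (illustrative, not DynaProt's code):

```python
import numpy as np

def rmsf_from_covariances(covs):
    # covs: (N, 3, 3) per-residue positional covariance matrices.
    # RMSF_i = sqrt(E||x_i - mean_i||^2) = sqrt(trace(Sigma_i)).
    return np.sqrt(np.trace(covs, axis1=1, axis2=2))

def anisotropy(cov):
    # Ratio of smallest to largest eigenvalue of one 3x3 covariance:
    # 1.0 means isotropic fluctuation; values near 0 mean the motion
    # is concentrated along a preferred direction.
    w = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return w[0] / w[-1]
```

For example, an isotropic residue with unit covariance has RMSF sqrt(3) and anisotropy 1, while a residue fluctuating four times more strongly along one axis has anisotropy 0.25.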
Authors: Po-Heng Chou, Wei-Lung Mao, Ru-Ping Lin, Jen-Yu Chiu, Chun-Yu Yeh
Abstract: This letter presents a CWT-enhanced vibration sensing framework for bearing fault monitoring through spatial localization on time-frequency spectrograms. Vibration signals are transformed into continuous wavelet transform (CWT) spectrograms to improve the observability of weak and non-stationary fault signatures, and YOLOv9, YOLOv10, and YOLOv11 are employed to localize and identify fault-related energy regions. Experiments on the CWRU, PU, and IMS datasets show that the proposed framework improves the detectability and robustness of fault-related sensing patterns compared with conventional time-series models, modern vision backbones, and short-time Fourier transform (STFT)-based representations, achieving mAP values up to 99.4%, 97.8%, and 99.5%, respectively. In addition, the region-aware localization provides a more interpretable connection between time-frequency energy distributions and bearing fault characteristics. These results demonstrate that spatial localization on CWT spectrograms offers an effective and generalizable approach for enhancing vibration sensing capability in non-stationary environments.
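To make the representation above concrete, here is a self-contained numpy sketch of a complex-Morlet CWT scalogram; the letter's pipeline presumably uses a standard wavelet library, and the wavelet parameters here are illustrative assumptions.

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    # Complex Morlet wavelet at a given scale (1/sqrt(scale) normalization).
    x = t / scale
    return np.exp(1j * w0 * x - 0.5 * x**2) / np.sqrt(scale)

def cwt(signal, scales, dt=1.0, w0=6.0):
    # CWT scalogram: rows are scales, columns are time positions.
    # For the complex Morlet, conj(psi(-t)) = psi(t), so correlating
    # with the conjugated wavelet equals convolving with the wavelet.
    n = len(signal)
    t = (np.arange(n) - n // 2) * dt
    out = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        out[i] = np.convolve(signal, morlet(t, s, w0), mode="same")
    return out
```

For a pure tone of frequency f (cycles per sample), energy concentrates near scale w0 / (2*pi*f); for fault monitoring, the magnitude image |coefficients| is what a detector like YOLO consumes.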
Authors: Yadi Cao, Futian Zhang, Wesley Liu, Tom Neiser, Orso Meneghini, Lawson Fuller, Sterling Smith, Raffi Nazikian, Brian Sammuli, Rose Yu
Abstract: The Trapped Gyro-Landau Fluid (TGLF) model provides fast, accurate predictions of turbulent transport in tokamaks, but whole device simulations requiring thousands of evaluations remain computationally expensive. Neural network (NN) surrogates offer accelerated inference with fully differentiable approximations that enable gradient-based coupling but typically require large training datasets to capture transport flux variations across plasma conditions, creating significant training burden and limiting applicability to expensive gyrokinetic simulations. We propose TGLF-WINN (Wavenumber-Informed Neural Network) with three key innovations: (1) principled feature engineering that reduces target prediction range, simplifying the learning task; (2) physics-guided wavenumber-resolved regularization to improve generalization under sparse data; and (3) Bayesian Active Learning (BAL) to strategically select training samples based on model uncertainty, reducing data requirements while maintaining accuracy. Feature tuning and wavenumber regularization together deliver a 12.5% relative RMSLE reduction over TGLF-NN on the full dataset; under sparse, unfiltered training (approximately 1/9 the full size) they yield an order-of-magnitude smaller RMSLE degradation than TGLF-NN, with the wavenumber-informed regularization imposing a physics-guided constraint on per-mode fluxes. Adding Bayesian Active Learning, TGLF-WINN matches TGLF-NN's full-data offline accuracy using only 25% of the training data, within 2.8% of TGLF-NN's full-data baseline and 4.3% of our own full-data result. A downstream flux-matching workflow further shows practicality: the NN surrogate gives a 45x speedup over TGLF with comparable reconstruction accuracy.
Authors: Robin Vujanic, Thomas Rueckstiess
Abstract: We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit MRL and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M parameters information retrieval oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is however not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and in contrast to other embedding model training frameworks, it does not require judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.
Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
Abstract: Fine-tuning the bias terms of large language models (LLMs) has the potential to achieve unprecedented parameter efficiency while maintaining competitive performance, particularly in low-data regimes. However, the link between fine-tuning different bias terms (i.e., $\boldsymbol{b}_q$, $\boldsymbol{b}_k$, and $\boldsymbol{b}_v$ in the query, key, or value projections) and downstream performance remains largely unclear to date. In this paper, we investigate the link between fine-tuning $\boldsymbol{b}_q$, $\boldsymbol{b}_k$, and $\boldsymbol{b}_v$ with the performance of the downstream task. Our key finding is that directly fine-tuning $\boldsymbol{b}_v$ generally leads to higher downstream performance in low-data regimes, in comparison to $\boldsymbol{b}_q$ and $\boldsymbol{b}_k$. We extensively evaluate this unique property across a wide range of LLMs spanning encoder-only and decoder-only architectures up to 6.7B parameters (including bias-free LLMs). Our results provide strong evidence for the effectiveness of directly fine-tuning $\boldsymbol{b}_v$ across various downstream tasks. The implementation code is available at https://github.com/whubaichuan/BEFT.
Authors: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
Abstract: Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
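The prior-work starting point above, a bias toward earlier tokens from causal masking alone, is easy to reproduce in the degenerate case of content-independent attention: if every query attends uniformly over its prefix, token j receives total mass sum_{i>=j} 1/(i+1), which decays with position. A small numpy illustration (a deliberately simplified setting, not the paper's analysis):

```python
import numpy as np

def uniform_causal_attention(n):
    # Row i attends uniformly to positions 0..i (causal mask, equal logits).
    A = np.tril(np.ones((n, n)))
    return A / A.sum(axis=1, keepdims=True)

def received_mass(n):
    # Total attention each token receives, summed over all queries.
    return uniform_causal_attention(n).sum(axis=0)
```

The received mass is strictly decreasing in position, i.e. earlier tokens are over-attended; the abstract's point is that LayerNorm and other architectural components turn this into the recency bias actually observed in trained decoders.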
Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty
Abstract: The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and fine-tuning. Recently, fine-tuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of fine-tuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future-proofing and backward-compatibility -- how well judges fine-tuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects under a unified framework with varying train and test distributions in two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models exhibit some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
Authors: Hossein Taheri, Avishek Ghosh, Arya Mazumdar
Abstract: Continual learning, the ability of a model to adapt to an ongoing sequence of tasks without forgetting earlier ones, is a central goal of artificial intelligence. To better understand its underlying mechanisms, we study the limitations of continual learning in a tractable yet representative setting. Specifically, we analyze one-hidden-layer quadratic neural networks trained by gradient descent on a sequence of XOR-cluster datasets with Gaussian noise, where different tasks correspond to clusters with orthogonal means. Our analysis is based on a tight characterization of gradient descent dynamics for the training loss, which yields explicit bounds on the rate of train-time forgetting as functions of the number of iterations, sample size, number of tasks, and hidden-layer width. We then leverage an algorithmic stability framework to bound the generalization gap, leading to corresponding guarantees on test-time forgetting. Together, our results provide the first closed-form guarantees for forgetting in continual learning with neural networks and show how key problem parameters jointly govern forgetting dynamics. Numerical experiments corroborate our theoretical results.
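The XOR-cluster task construction described above lends itself to a compact sketch. The snippet below generates one such dataset with Gaussian noise, where different tasks use orthogonal mean directions; the function name and parameters are illustrative, not the paper's code.

```python
import numpy as np

def xor_cluster_task(d, n, task_axes, sigma=0.1, seed=0):
    """Generate one XOR-cluster dataset of the kind analyzed above:
    points near +/-e1 get label +1, points near +/-e2 get label -1
    (the XOR structure), plus Gaussian noise. Different tasks would use
    different, mutually orthogonal axis pairs. Names are illustrative."""
    rng = np.random.default_rng(seed)
    e1 = np.eye(d)[task_axes[0]]   # mean direction for class +1
    e2 = np.eye(d)[task_axes[1]]   # orthogonal mean direction for class -1
    X, y = [], []
    for _ in range(n):
        label = rng.choice([1, -1])
        # XOR: both +mu and -mu map to the same label
        mu = (e1 if label == 1 else e2) * rng.choice([1, -1])
        X.append(mu + sigma * rng.standard_normal(d))
        y.append(label)
    return np.array(X), np.array(y)
```

A continual-learning sequence would then call this with disjoint `task_axes` per task, so the tasks' cluster means are orthogonal.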
Authors: Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li
Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code and checkpoints released at https://github.com/ModalityDance/LatentTTS
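Of the two sampling strategies named above, additive Gaussian noise is the simpler to illustrate. The sketch below perturbs each intermediate latent state to produce diverging parallel rollouts; `latent_step` stands in for one latent-reasoning step, and all names and shapes are assumptions rather than the released code.

```python
import numpy as np

def sample_latent_trajectories(latent_step, z0, n_samples=8, sigma=0.1,
                               n_steps=4, seed=0):
    """Draw parallel latent-reasoning rollouts by adding Gaussian noise to
    each intermediate latent state (one of the two sampling strategies
    described above). Returns an array of shape (n_samples, n_steps+1, dim)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_samples):
        z = z0.copy()
        states = [z.copy()]
        for _ in range(n_steps):
            # deterministic latent step + additive Gaussian perturbation
            z = latent_step(z) + sigma * rng.standard_normal(z.shape)
            states.append(z.copy())
        out.append(np.stack(states))
    return np.stack(out)
```

A trained scorer (the role LatentRM plays above) would then rank these trajectories and select or aggregate among them.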
Authors: Yifan Zhao, Egan Johnson, Prasanth Chatarasi, Vikram Adve, Sasa Misailovic
Abstract: Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. This paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. Applying Neptune's advanced operator fusion to a plain attention operator generates operators equivalent to FlashAttention and FlashDecoding. On ten attention-based benchmarks, Neptune, starting from a plain attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have an average speedup of $1.35\times$ over the next best alternative, with up to $2.65\times$ speedup on Nvidia GPUs and up to $3.32\times$ on AMD GPUs, demonstrating its effectiveness for deep learning workloads.
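The break-a-dependency-then-correct idea can be illustrated with the standard online-softmax trick that underlies FlashAttention-style fusion: softmax's global max and normalizer are loop-carried dependencies, but keeping running statistics and algebraically rescaling earlier partial results reproduces the exact answer block by block. This is a generic illustration of that correction pattern, not Neptune's generated code.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=2):
    """Compute softmax(scores) @ values one block at a time.
    m, s are the running max and normalizer; when a block raises the max,
    the correction factor exp(m - m_new) rescales the earlier partial sums
    so the final result equals the unfused computation exactly."""
    m, s = -np.inf, 0.0
    acc = np.zeros(values.shape[1])
    for i in range(0, len(scores), block):
        sc, v = scores[i:i + block], values[i:i + block]
        m_new = max(m, sc.max())
        corr = np.exp(m - m_new)        # algebraic correction for prior blocks
        p = np.exp(sc - m_new)
        s = s * corr + p.sum()
        acc = acc * corr + p @ v
        m = m_new
    return acc / s
```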
Authors: Simon Idoko, Prajyot Jadhav, Arun Kumar Singh
Abstract: Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety Filter (SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.
Authors: Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda, Hyeonjeong Ha, Svetlana Lazebnik, Heng Ji, Unnat Jain
Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
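A minimal version of attention-guided rectilinear warping can be sketched as follows: row and column attention marginals define a monotone resampling of the image, so high-attention rows and columns receive more output resolution while low-attention ones are compressed. This nearest-neighbour sketch is illustrative only; it omits the smoothing and the MLLM cross-modal attention extraction of the actual method.

```python
import numpy as np

def rectilinear_warp(image, attention, out_h=None, out_w=None):
    """Resample an image so each row/column's share of output resolution is
    proportional to its attention mass (with a small floor so no region
    vanishes). Rectilinear: rows map to rows, columns to columns, so the
    global layout is preserved. `attention` must have a nonzero sum."""
    h, w = attention.shape
    out_h, out_w = out_h or h, out_w or w
    row_m = attention.sum(axis=1) / attention.sum() + 1e-3
    col_m = attention.sum(axis=0) / attention.sum() + 1e-3
    # Cumulative mass gives a monotone map from output to input coordinates:
    # high-mass rows occupy a wide cdf interval, so many output rows hit them.
    row_cdf = np.cumsum(row_m) / row_m.sum()
    col_cdf = np.cumsum(col_m) / col_m.sum()
    ys = np.searchsorted(row_cdf, np.linspace(0, 1, out_h, endpoint=False))
    xs = np.searchsorted(col_cdf, np.linspace(0, 1, out_w, endpoint=False))
    return image[np.clip(ys, 0, h - 1)][:, np.clip(xs, 0, w - 1)]
```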
Authors: Rui Xu, Jiawei Chen, Weizhi Liu, Zhaoxia Yin, Cong Kong, Xinpeng Zhang
Abstract: The proliferation of open-source code and large language models (LLMs) for code generation has amplified the risks of unauthorized reuse and intellectual property infringement. Source code watermarking offers a potential solution, yet existing methods typically encode watermarks through identifiers, local code patterns, or limited handcrafted edits, leaving them vulnerable to renaming, refactoring, and adaptive watermark removal. These limitations hinder the joint achievement of robustness, capacity, generalization, and deployment efficiency. We propose CLASP, a Code LLM-Assisted Semantic-Preserving watermarking framework that enables training-free, plug-and-play watermarking for source code. CLASP embeds watermark bits within a fixed space of semantics-preserving transformations, enabling automated watermark insertion with higher capacity while remaining reusable across programming languages and less dependent on brittle lexical features. To recover the watermark, CLASP uses reference-code retrieval and differential comparison to identify transformation traces, avoiding task-specific model training while improving robustness to structural edits and adaptive attacks. Experiments across multiple programming languages show that CLASP consistently outperforms existing baselines in watermark extraction accuracy and robustness, while maintaining code quality under both random removal and adaptive de-watermarking attacks.
Authors: Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xiangliang Zhang, Chandan K. Reddy
Abstract: Multi-turn Text-to-SQL aims to translate a user's conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems regard this task as simple text translation and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, or refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose -> execute -> verify -> refine cycle until all checks pass. Experiments on CoSQL and SParC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
Authors: Giosue Migliorini, Padhraic Smyth
Abstract: Systems of interacting continuous-time Markov chains are a powerful model class, but inference is typically intractable in high dimensional settings. Auxiliary information, such as noisy observations, is typically only available at discrete times, and incorporating it via a Doob's $h$-transform gives rise to an intractable posterior process that requires approximation. We introduce Latent Interacting Particle Systems, a model class parameterizing the generator of each Markov chain in the system. Our inference method involves estimating look-ahead functions (twist potentials) that anticipate future information, for which we introduce an efficient parameterization. We incorporate this approximation in a twisted Sequential Monte Carlo sampling scheme. We demonstrate the effectiveness of our approach on a challenging posterior inference task for a latent SIRS model on a graph, and on a neural model for wildfire spread dynamics trained on real data.
Authors: Yujie Luo, Zhuoyun Yu, Xuehai Wang, Yuqi Zhu, Ningyu Zhang, Lanning Wei, Lun Du, Da Zheng, Huajun Chen
Abstract: Replicating AI research is a crucial yet challenging task for large language model (LLM) agents. Existing approaches often struggle to generate executable code, primarily due to insufficient background knowledge and the limitations of retrieval-augmented generation (RAG) methods, which fail to capture latent technical details hidden in referenced papers. Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a pluggable, paper-centric knowledge base that automatically integrates code snippets and technical insights extracted from scientific literature. When integrated into three agent frameworks with two different LLMs, xKG shows substantial performance gains (10.9% with o3-mini) on PaperBench, demonstrating its effectiveness as a general and extensible solution for automated AI research replication. Code is available at https://github.com/zjunlp/xKG.
Authors: Wan Ki Wong, Sahel Torkamani, Michele Ciampi, Rik Sarkar
Abstract: Evaluating the usefulness of data before purchase is essential when obtaining data for high-quality machine learning models, yet both model builders and data providers are often unwilling to reveal their proprietary assets. We present PrivaDE, a privacy-preserving protocol that allows a model owner and a data owner to jointly compute a utility score for a candidate dataset without fully exposing model parameters, raw features, or labels. PrivaDE provides strong security against malicious behavior and can be integrated into blockchain-based marketplaces, where smart contracts enforce fair execution and payment. To make the protocol practical, we propose optimizations to enable efficient secure model inference, and a model-agnostic scoring method that uses only a small, representative subset of the data while still reflecting its impact on downstream training. Evaluation shows that PrivaDE performs data evaluation effectively, achieving online runtimes within 15 minutes even for models with millions of parameters. Our work lays the foundation for fair and automated data marketplaces in decentralized machine learning ecosystems.
Authors: Luca Maria Del Bono, Federico Ricci-Tersenghi, Francesco Zamponi
Abstract: Combinatorial optimization problems are central to both practical applications and the development of optimization methods. While classical and quantum algorithms have been refined over decades, machine learning--assisted approaches are comparatively recent and have not yet consistently outperformed simple, state-of-the-art classical methods. Here, we focus on a class of Quadratic Unconstrained Binary Optimization (QUBO) problems, specifically the challenge of finding minimum energy configurations in three-dimensional Ising spin glasses. We use a Global Annealing Monte Carlo algorithm that integrates standard local moves with global moves proposed via machine learning. We show that local moves play a crucial role in achieving optimal performance. Benchmarking against Simulated Annealing and Population Annealing, we demonstrate that Global Annealing not only surpasses the performance of Simulated Annealing but also exhibits greater robustness than Population Annealing, maintaining effectiveness across problem hardness and system size without hyperparameter tuning. These results provide clear and robust evidence that a machine learning--assisted optimization method can exceed the capabilities of classical state-of-the-art techniques in a combinatorial optimization setting.
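For reference, the classical local-move baseline that Global Annealing builds on can be sketched as plain simulated annealing on the Ising energy; the ML-proposed global moves are omitted here, and the schedule and parameter names are illustrative.

```python
import numpy as np

def simulated_annealing_ising(J, n_sweeps=300, beta0=0.1, beta1=3.0, seed=0):
    """Simulated annealing with local single-spin Metropolis moves on the
    Ising/QUBO energy E(s) = -1/2 s^T J s (J symmetric, zero diagonal).
    Global Annealing, described above, interleaves such local moves with
    machine-learning-proposed global moves; this shows only the local part."""
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    s = rng.choice([-1, 1], size=n)
    for t in range(n_sweeps):
        beta = beta0 + (beta1 - beta0) * t / (n_sweeps - 1)  # linear schedule
        for i in range(n):
            dE = 2 * s[i] * (J[i] @ s)       # energy change of flipping spin i
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[i] = -s[i]                  # Metropolis accept
    return s, -0.5 * s @ J @ s
```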
Authors: Arman Zadgar, Somayeh Fallah, Farshid Mehrdoust, Juan E. Trinidad Segovia
Abstract: The Heston stochastic volatility model is a widely used tool in financial mathematics for pricing European options. However, its calibration remains computationally intensive and sensitive to local minima due to the model's nonlinear structure and high-dimensional parameter space. This paper introduces a hybrid deep learning-based framework that enhances both the computational efficiency and the accuracy of the calibration procedure. The proposed approach integrates two supervised feedforward neural networks: the Price Approximator Network (PAN), which approximates the option price surface based on strike and moneyness inputs, and the Calibration Correction Network (CCN), which refines the Heston model's output by correcting systematic pricing errors. Experimental results on real S\&P 500 option data demonstrate that the deep learning approach outperforms traditional calibration techniques across multiple error metrics, achieving faster convergence and superior generalization in both in-sample and out-of-sample settings. This framework offers a practical and robust solution for real-time financial model calibration.
Authors: Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu
Abstract: Despite extensive safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. However, existing methods generally lack the capability for continuous learning and self-evolution from interactions, limiting the diversity and adaptability of attack strategies. To address this, we propose ASTRA, an automated framework capable of autonomously discovering, retrieving, and evolving attack strategies. ASTRA operates on a closed-loop ``attack-evaluate-distill-reuse'' mechanism, which not only generates attack prompts but also automatically distills reusable strategies from every interaction. To systematically manage these strategies, we introduce a dynamic three-tier strategy library (Effective, Promising, and Ineffective) that categorizes strategies based on performance. This hierarchical memory mechanism enables the framework to enhance efficiency by leveraging successful patterns while optimizing the exploration space by avoiding known failures. Extensive experiments in a black-box setting demonstrate that ASTRA significantly outperforms existing baselines.
Authors: Seunghee Han, Yeonghun Kang, Taeun Bae, Junho Kim, Younghun Kim, Varinia Bernales, Alan Aspuru-Guzik, Jihan Kim
Abstract: Designing materials with targeted properties remains challenging due to the vastness of chemical space and the scarcity of property-labeled data. While recent advances in generative models offer a promising way for inverse design, most approaches require large datasets and must be retrained for every new target property. Here, we introduce EGMOF (Efficient Generation of MOFs), a hybrid diffusion-transformer framework that overcomes these limitations through a modular, descriptor-mediated workflow. EGMOF decomposes inverse design into two steps: (1) a one-dimensional diffusion model (Prop2Desc) that maps desired properties to chemically meaningful descriptors, followed by (2) a transformer model (Desc2MOF) that generates structures from these descriptors. This modular hybrid design enables minimal retraining and maintains high accuracy even under small-data conditions. On a hydrogen uptake dataset, EGMOF achieved over 94% validity and a 91% hit rate, representing improvements of up to 39% in validity and 29% in hit rate over existing methods, while remaining effective with only 1,000 training samples. Moreover, our model successfully performed conditional generation across 29 diverse property datasets, including CoREMOF, QMOF, and text-mined experimental datasets, a capability previous models have not demonstrated. This work presents a data-efficient, generalizable approach to the inverse design of diverse MOFs and highlights the potential of modular inverse design workflows for broader materials discovery.
Authors: Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, Deyi Xiong
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a prominent paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, the entropy of LLMs usually collapses during RLVR training, leading to premature convergence to suboptimal local minima and hindering further performance improvement. Although various approaches have been proposed to mitigate entropy collapse, a comprehensive study of entropy in RLVR remains lacking. To bridge this gap, we conduct extensive experiments to investigate the entropy dynamics of LLMs trained with RLVR and analyze how model entropy correlates with response diversity, calibration, and performance across various benchmarks. Our results identify three key factors that influence entropy: the clipping thresholds in the optimization objective, the number of off-policy updates, and the diversity of the training data. Furthermore, through both theoretical analysis and empirical validation, we demonstrate that tokens with positive advantages are the primary drivers of entropy collapse. Motivated by this insight, we propose Positive-Advantage Reweighting, a simple yet effective approach that regulates model entropy by adjusting the loss weights assigned to tokens with positive advantages during RLVR training, while maintaining competitive performance.
Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi
Abstract: Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
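The instance-level select-and-merge idea can be sketched as follows, assuming a per-adapter relevance score is already available from a forward pass. The function name, the top-k truncation, and the softmax weighting are illustrative assumptions, not LoGo's exact procedure.

```python
import numpy as np

def merge_loras_on_the_fly(base_W, lora_deltas, scores, top_k=2, temp=1.0):
    """Keep the top-k adapters by per-instance relevance score and merge
    their low-rank deltas into the base weight with softmax weights.
    Each entry of lora_deltas is a pair (A, B) with delta = B @ A."""
    idx = np.argsort(scores)[::-1][:top_k]           # most relevant adapters
    top = np.array([scores[i] for i in idx]) / temp
    w = np.exp(top - top.max()); w /= w.sum()        # softmax over top-k
    merged = base_W.copy()
    for weight, i in zip(w, idx):
        A, B = lora_deltas[i]
        merged += weight * (B @ A)                   # weighted low-rank update
    return merged
```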
Authors: Priyanka Mudgal
Abstract: Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
Authors: Maria Gonzalez-Calabuig, Kai-Hendrik Cohrs, Vishal Nedungadi, Zuzanna Osika, Ruben Cartuyvels, Steffen Knoblauch, Joppe Massant, Shruti Nath, Patrick Ebel, Vasileios Sitokonstantinou
Abstract: Geospatial foundation models (GFMs) for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that enables GFMs to identify and abstain from likely failures. Our approach integrates three complementary signals: geophysical out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space, and task-specific predictive uncertainty. We evaluate SHRUG-FM across three high-stakes rapid-mapping tasks: burn scar segmentation, flood mapping, and landslide detection. Our results show that SHRUG-FM consistently reduces prediction risk on retained samples, outperforming established single-signal baselines like predictive entropy. Crucially, by utilizing a shallow "glass-box" decision tree for signal fusion, SHRUG-FM provides interpretable abstention thresholds. It builds a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, bridging the gap between benchmark performance and real-world reliability.
Authors: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
Abstract: Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results on improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning. Our project page is available at https://videop2r.github.io/videop2r/.
Authors: Suyog Garg, Feng-Li Lin, Kipp Cannon
Abstract: Upgrades to current gravitational wave detectors for the next observation run and upcoming third-generation observatories, like the Einstein telescope, are expected to have enormous improvements in detection sensitivities and compact object merger event rates. Estimation of source parameters for the wider parameter space that these detectable signals will lie in will be a computational challenge. Thus, it is imperative to have methods to speed up the likelihood calculations with theoretical waveform predictions, which can ultimately make the parameter estimation faster and aid in rapid multi-messenger follow-ups. In this work we study auto-encoder models for gravitational waveform generation by adopting the best-performing architecture of Liao & Lin (2021) to approximate aligned-spin SEOBNRv4 inspiral-merger-ringdown waveforms. Our parameter space consists of four parameters, [$m_1$, $m_2$, $\chi_1(z)$, $\chi_2(z)$]. The masses are uniformly sampled in $[5,75]\,M_{\odot}$ with the mass ratio capped at 10, while the spins are uniform in $[-0.99,0.99]$. Our model is able to generate $10^3$ waveforms in $\sim 0.1$ second at an average speed of about 50 microseconds per waveform on a GPU. This is about 4 orders of magnitude faster than the native SEOBNRv4 implementation, and 2--3 orders of magnitude faster than existing non-machine-learning accelerated waveform variants. The median mismatch for the generated waveforms in the test dataset is $\sim10^{-2}$, with better performance in a restricted parameter space of $\chi_{\rm eff}\in[-0.80,0.80]$. The latent sampling error of our model can be quantified at a median mismatch standard deviation of $4\times10^{-3}$. Although the accuracy of our model does not enable full production use yet, the model could be useful wherever high volumes of approximate theoretical waveforms are required, for instance, for rapid sky localization.
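The mismatch figures quoted above are one minus a normalized overlap between waveforms. A simplified white-noise (flat-PSD) version, omitting the usual maximization over time and phase shifts, can be sketched as:

```python
import numpy as np

def mismatch(h1, h2):
    """1 - normalized overlap between two (real or complex) waveforms,
    under a flat noise spectrum and with no time/phase maximization.
    Identical waveforms, up to a positive rescaling, give mismatch 0."""
    overlap = np.vdot(h1, h2).real / (np.linalg.norm(h1) * np.linalg.norm(h2))
    return 1.0 - overlap
```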
Authors: Mario Markov (INSAIT, Sofia University "St. Kliment Ohridski"), Stefan Maria Ailuro (INSAIT, Sofia University "St. Kliment Ohridski"), Luc Van Gool (INSAIT, Sofia University "St. Kliment Ohridski"), Konrad Schindler (ETH Zurich), Danda Pani Paudel (INSAIT, Sofia University "St. Kliment Ohridski")
Abstract: Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.
Authors: Nathalie Kirch, Samuel Dower, Adrians Skapars, Helen Yannakoudakis, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
Abstract: Probing has emerged as a promising method for monitoring large language models (LLMs), enabling cheap inference-time detection of concerning behaviours. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that training data generation strategy can significantly affect probe performance, though the magnitude varies greatly by behaviour. The largest generalisation failures arise for behaviours defined by response ``intent'' (e.g., strategic deception) rather than text-level content (e.g., usage of lists). We then propose a useful test for predicting generalisation failures in cases where on-policy test data is unavailable: successful generalisation to incentivised data (where the model was coerced) strongly correlates with high performance against on-policy examples. Based on these results, we predict that current deception probes may fail to generalise to real monitoring scenarios. We find that off-policy data can yield more reliable probes than on-policy data from a sufficiently different setting. This underscores the need for better monitoring methods that handle all types of distribution shift.
Authors: Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang
Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.
Authors: Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi
Abstract: Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
Authors: Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li
Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
Authors: Yuanzhe Ma, Yian Huang, Hongseok Namkoong
Abstract: Limited overlap between treated and control groups is a key challenge in observational analysis. Standard approaches like trimming importance weights can reduce variance but introduce a fundamental bias. We propose a sensitivity framework for contextualizing findings under limited overlap, where we assess how irregular the outcome function has to be in order for the main finding to be invalidated. Our approach is based on worst-case confidence bounds on the bias introduced by standard trimming practices, under explicit assumptions necessary to extrapolate counterfactual estimates from regions of overlap to those without. Empirically, we demonstrate how our sensitivity framework protects against spurious findings by quantifying uncertainty in regions with limited overlap.
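The "trimming importance weights" practice this abstract analyzes can be made concrete with a minimal sketch. The function below (an illustration, not the authors' code; all names, propensities, and the cap value are invented for the example) shows how capping inverse-propensity weights tames the variance caused by units with poor overlap, while introducing exactly the bias the paper's sensitivity framework bounds:

```python
# Minimal sketch of inverse-propensity weighting with trimming.
# Units with propensity e(x) near 0 (if treated) or near 1 (if control)
# receive huge weights; capping them at `trim` reduces variance but
# biases the estimate, which is the tradeoff the paper studies.

def ipw_weights(propensities, treated, trim=None):
    """Inverse-propensity weights, optionally capped at `trim`."""
    weights = []
    for e, t in zip(propensities, treated):
        w = 1.0 / e if t else 1.0 / (1.0 - e)
        if trim is not None:
            w = min(w, trim)  # bias enters here for poorly overlapped units
        weights.append(w)
    return weights

props = [0.5, 0.25, 0.125]   # illustrative propensity scores
treat = [1, 1, 1]
raw = ipw_weights(props, treat)                # [2.0, 4.0, 8.0]
capped = ipw_weights(props, treat, trim=5.0)   # [2.0, 4.0, 5.0]
```

The sensitivity analysis then asks how irregular the outcome function must be, in the capped region, for the resulting bias to overturn the study's conclusion.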
Authors: Feras Al Taha, Eilyan Bitar
Abstract: We consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To cope with this ambiguity, we design causal affine control policies to minimize the worst-case expected regret over all distributions in the ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program, which can be interpreted as a regularized version of the nominal linear-quadratic stochastic control problem. Based on the dual of this convex reformulation, we develop a scalable projected subgradient method for computing optimal controllers to arbitrary accuracy. Numerical experiments are provided to compare the proposed method with state-of-the-art data-driven control design methods.
Authors: Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang
Abstract: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse unseen benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere distribution shift. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the internal representations, offering a practical path towards safer LVLM deployment.
Authors: Jinniao Qiu
Abstract: In this work, we investigate a stochastic control framework for global optimization over both Euclidean spaces and the Wasserstein space of probability measures, where the objective function may be non-convex and/or non-differentiable. In the Euclidean setting, the original minimization problem is approximated by a family of regularized stochastic control problems; using dynamic programming, we analyze the associated Hamilton-Jacobi-Bellman equations and obtain tractable representations via the Cole-Hopf transformation and the Feynman-Kac formula. For optimization over probability measures, we formulate a regularized mean-field control problem characterized by a master equation, and further approximate it by controlled $N$-particle systems. We establish that, as the regularization parameter tends to zero (and as the particle number tends to infinity for the optimization over probability measures), the value of the control problem converges to the global minimum of the original objective. Building on the resulting probabilistic representations, we propose Monte Carlo-based numerical schemes that are derivative-free thanks to the Bismut-Elworthy-Li formula; numerical experiments are reported to illustrate the effectiveness of the methods and to support the theoretical convergence rates.
Authors: Junseok Kim, Nakyeong Yang, Kyungmin Min, Kyomin Jung
Abstract: Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70\% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.
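The two-stage design described above can be sketched in a few lines. The code below is an illustrative reading of the abstract, not the authors' implementation: the function name `reasc`, the thresholds, and the `(answer, confidence)` interface are all assumptions made for the example. It shows the key shift from counting votes to accumulating confidence mass:

```python
# Sketch of reliability-aware adaptive self-consistency (illustrative).
# Stage 1: accept a single highly confident response outright.
# Stage 2: accumulate per-answer confidence mass and stop sampling once
# the leading answer holds a sufficient share of the total evidence.

def reasc(samples, single_conf=0.95, share_threshold=0.7, max_samples=10):
    it = iter(samples)
    answer, conf = next(it)          # assumes at least one sample
    if conf >= single_conf:
        return answer, 1             # single-sample decision stage
    evidence = {answer: conf}
    total = conf
    used = 1
    for answer, conf in it:          # reliability-aware accumulation stage
        used += 1
        evidence[answer] = evidence.get(answer, 0.0) + conf
        total += conf
        leader = max(evidence, key=evidence.get)
        if evidence[leader] / total >= share_threshold or used >= max_samples:
            return leader, used
    return max(evidence, key=evidence.get), used

ans, used = reasc([("42", 0.6), ("41", 0.3), ("42", 0.7)])  # ("42", 3)
```

A confident first response costs one sample; an ambiguous instance draws more samples, but weighting by confidence lets accumulation stop earlier than a pure vote count would.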
Authors: Junhao Hu, Fangze Li, Mingtao Xu, Feifan Meng, Shiju Zhao, Tiancheng Hu, Ting Peng, Anmin Liu, Wenrui Huang, Chenxu Liu, Ziyue Hua, Tao Xie
Abstract: Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term ``Less is Less'' (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.
Authors: Janvijay Singh, Dilek Hakkani-T\"ur
Abstract: Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevant reasoning. Prior work on compact reasoning shortens such chains through probabilistic sampling, heuristics, or supervision from frontier models, but offers limited insight into whether models internally encode token-level functional importance for answer generation. We address this gap diagnostically and propose greedy pruning, a likelihood-preserving deletion procedure that iteratively removes reasoning tokens whose removal minimally degrades model likelihood under a specified objective, yielding length-controlled reasoning chains. We evaluate pruned reasoning in a distillation framework and show that students trained on pruned chains outperform a frontier-model-supervised compression baseline at matched reasoning lengths. Finally, our analysis reveals systematic pruning patterns and shows that attention scores can predict greedy pruning ranks, further suggesting that models encode a nontrivial functional importance structure over reasoning tokens.
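The greedy, likelihood-preserving deletion procedure described above admits a compact sketch. This is an illustrative toy, not the paper's code: the `score` function below stands in for the model likelihood of the answer given the pruned chain, and the token list and keep-set are invented for the example:

```python
# Sketch of greedy pruning: repeatedly delete the reasoning token whose
# removal least degrades a scoring function (a stand-in for model
# likelihood), until the chain reaches the target length.

def greedy_prune(tokens, score, target_len):
    tokens = list(tokens)
    while len(tokens) > target_len:
        best_idx, best_score = None, float("-inf")
        for i in range(len(tokens)):
            candidate = tokens[:i] + tokens[i + 1:]
            s = score(candidate)
            if s > best_score:       # ties broken by earliest position
                best_idx, best_score = i, s
        del tokens[best_idx]
    return tokens

# Toy scorer: rewards keeping the functionally important tokens and
# lightly penalizes length, mimicking a likelihood-based objective.
keep = {"2+2", "=4"}
score = lambda ts: sum(t in keep for t in ts) - 0.01 * len(ts)
pruned = greedy_prune(["so", "2+2", "um", "=4", "then"], score, 2)
# pruned == ["2+2", "=4"]: filler tokens are deleted first.
```

Each pass scores every single-token deletion, so a real implementation would batch these likelihood evaluations; the sketch only conveys the selection rule.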
Authors: Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 1997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65\% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal models. Project Page: https://mmerror-benchmark.github.io
Authors: Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu
Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true ``forgetting scope'' learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose \BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on \emph{external} generators, \BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}$0.05 while \emph{halving} the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
Authors: Enze Pan
Abstract: Out-of-distribution generalization in reinforcement learning is hard to diagnose when benchmark shifts mix dynamics, observations, goals, and rewards. We address this with Tape, a controlled benchmark that isolates latent rule-shift in dynamics while keeping the observation-action interface fixed. The protocol combines deterministic splits, 20-seed replication, bootstrap uncertainty reporting, and continuous metrics for sparse-success regimes. Across baseline families, we find a consistent ID-to-OOD drop and strong heterogeneity across stable/periodic/chaotic rules. Importantly, this fragility appears even in an intentionally simple 1D deterministic setting, suggesting that many current RL algorithms remain brittle to latent-law changes under minimal confounds. To calibrate strict success, we report a protocol-matched true-dynamics random-shooting reference (p_oracle approximately 0.187) and oracle-normalized scores ON(p) = 100 p / p_oracle; this is a budgeted operational reference, not a global-optimality bound. A smaller feasibility regime (L = H = 16) with 100% rule-wise solvability helps separate reachability limits from policy failure. These results position Tape as a mechanism-oriented diagnostic for robust adaptation and latent-mechanism inference, and as a controlled benchmark relevant to broader AGI-oriented evaluation without making strong AGI sufficiency claims.
Authors: Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glava\v{s}
Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, \textit{compositional steering} -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose \emph{compositional steering tokens} for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steering tokens live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated \textit{composition token} on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to \textit{unseen} compositions, including those with unseen behaviors as well as those with an unseen \textit{number} of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior steering of verifiable constraints (e.g., length, format, structure, language) compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
Authors: Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
Abstract: We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, Ganit), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +6 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words. Project page is available at https://dipta007.github.io/GanitLLM
Authors: Yixuan Zhang, Qiaomin Xie
Abstract: Non-asymptotic central limit theorem (CLT) rates play a central role in modern machine learning and operations research. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-$p$ ($W_p$) distance, for general $p\ge 1$. We focus on two fundamental dependence structures that commonly arise in practice: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal $\mathcal O(n^{-1/2})$ rate in $W_1$, as well as the first $W_p$ ($p\ge 2$) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal $W_1$ rate for locally dependent sequences, we further obtain the first optimal $W_1$-CLT rate for multivariate $U$-statistics. On the technical side, we derive a tractable auxiliary bound for $W_1$ Gaussian approximation errors that is well suited for studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interest; they enable our optimal $W_1$ rates and underpin our $W_p$ ($p\ge 2$) results.
Authors: Yujin Jo, Sangyoon Bae, Taesup Kim
Abstract: Hallucinations in large vision--language models (LVLMs) often arise when language priors dominate over visual evidence, leading to object misidentification and visually inconsistent descriptions. We address this problem by framing hallucination mitigation as contrastive guidance that steers generation toward visually grounded and semantically faithful text. We propose Attention-space Contrastive Guidance (ACG), a training-free, single-pass method that operates directly in self-attention layers, where hallucination-inducing cross-modal biases emerge. ACG constructs both image-conditioned and approximate text-only attention paths within a single forward pass, enabling efficient guidance before errors accumulate at the output layer. Because this masking-based surrogate can introduce approximation bias, we further apply a lightweight orthogonal projection that suppresses components aligned with the text-only path, yielding a more visually grounded correction. Experiments on CHAIR and POPE show that ACG improves faithfulness over existing training-free baselines while maintaining caption quality, reducing latency by up to $2\times$ compared to multi-pass contrastive decoding methods.
Authors: Tunazzina Islam
Abstract: Large language models (LLMs) are increasingly capable of generating personalized, persuasive text at scale, raising new questions about bias and fairness in automated communication. This paper presents the first systematic analysis of how LLMs behave when tasked with demographic-conditioned targeted messaging. We introduce a controlled evaluation framework using three leading models: GPT-4o, Llama-3.3, and Mistral-Large-2.1, across two generation settings: Standalone Generation, which isolates intrinsic demographic effects, and Context-Rich Generation, which incorporates thematic and regional context to emulate realistic targeting. We evaluate generated messages along three dimensions: lexical content, language style, and persuasive framing. We instantiate this framework on climate communication and find consistent age- and gender-based asymmetries across models: male- and youth-targeted messages tend to emphasize more assertive and progressive framing, while female- and senior-targeted messages more often reflect warmth, care, and traditional themes. Contextual prompts systematically amplify these disparities, with persuasion scores being higher for male-targeted messages, while age-related differences vary across models. Our findings demonstrate how demographic stereotypes can surface and intensify in LLM-generated targeted communication, underscoring the need for bias-aware generation pipelines and transparent auditing frameworks that explicitly account for demographic conditioning in socially sensitive applications.
Authors: Akshith Reddy Putta, Jacob Devasier, Chengkai Li
Abstract: Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.
Authors: Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam, Ethan Mendes, Alan Ritter, Wei Xu, James Hays
Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at \textit{explaining} which image evidence led to their prediction, even when their location prediction is correct. In this paper, we introduce GeoRC, the first benchmark for geolocation reasoning chains sourced directly from Champion-tier GeoGuessr experts, including the reigning world champion. This benchmark consists of 800 ``ground truth'' reasoning chains across 500 query scenes from GeoGuessr maps, with expert chains addressing hundreds of different discriminative attributes, such as soil properties, architecture, and license plate shapes. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human-expert scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Small open-weight VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but \textit{no visual information at all}. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images. We open source our benchmark for the community to use.
Authors: Hanjun Park, Aleksandr D. Kuznetsov, Ville Viikari
Abstract: Predicting interdependent load values in multiport scatterers is challenging due to high dimensionality and complex dependence between impedance and scattering ability, yet this prediction remains crucial for the design of communication and measurement systems. In this paper, we propose a two-stage cluster-then-predict framework for the task of predicting multiple load values in multiport scatterers. The proposed cluster-then-predict approach effectively captures the underlying functional relation between S-parameters and corresponding load impedances, achieving up to a 46% reduction in Root Mean Square Error (RMSE) compared to the baseline when applied to gradient boosting (GB). This improvement is consistent across various clustering and regression methods. Furthermore, we introduce the Real-world Unified Index (RUI), a metric for quantitative analysis of trade-offs among multiple metrics with conflicting objectives and different scales, suitable for performance assessment in realistic scenarios. Based on RUI, the combination of K-means clustering and k-nearest neighbors (KNN) is identified as the optimal setup for the analyzed multiport scatterer.
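The cluster-then-predict pattern itself can be sketched compactly. The toy below is not the authors' pipeline (which uses K-means plus a per-cluster regressor such as KNN or GB); for self-containment it uses fixed centroids and a 1-nearest-neighbour lookup, and all feature values are invented:

```python
# Sketch of cluster-then-predict on 1-D data: assign a sample to the
# nearest centroid, then predict with a regressor fitted only on that
# cluster's training points (here, 1-nearest-neighbour).

def assign(x, centroids):
    """Index of the centroid nearest to x."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))

def fit_predict(train, centroids, query):
    # Partition training pairs (x, y) by cluster membership.
    clusters = {k: [] for k in range(len(centroids))}
    for x, y in train:
        clusters[assign(x, centroids)].append((x, y))
    # Predict using only the query's cluster (fall back to all data
    # if that cluster happens to be empty).
    members = clusters[assign(query, centroids)] or train
    x_near, y_near = min(members, key=lambda p: abs(p[0] - query))
    return y_near

train = [(0.1, 10.0), (0.2, 11.0), (0.9, 50.0), (1.1, 52.0)]
pred = fit_predict(train, centroids=[0.0, 1.0], query=0.95)  # 50.0
```

Partitioning first lets each local regressor model a simpler, more homogeneous relation between inputs (S-parameters in the paper) and targets (load impedances) than a single global model would face.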
Authors: Brandon Yee, Wilson Collins, Pairie Koh, Maximilian Rutkowski
Abstract: We extend the Prometheus framework for unsupervised phase transition discovery from two-dimensional classical systems to three-dimensional classical systems and quantum many-body systems. Building upon preliminary observations from a 2D Ising model student abstract [Yee et al., 2026], we address two fundamental questions: (1) Does the framework scale to higher dimensions where exact solutions are unavailable? (2) Can it generalize to quantum phase transitions driven by quantum fluctuations rather than thermal fluctuations? For the 3D Ising model on lattices up to $L{=}32$, we achieve critical temperature detection within 0.01\% of literature values ($T_c/J = 4.511 \pm 0.005$) and extract critical exponents with ${\geq}70\%$ accuracy, with statistical analysis correctly identifying the 3D Ising universality class ($p = 0.72$). For quantum systems, we develop quantum-aware VAE (Q-VAE) architectures operating on complex-valued wavefunctions with fidelity-based loss functions, achieving 2\% accuracy in quantum critical point detection for the transverse field Ising model. For the disordered TFIM, we perform a consistency check of activated dynamical scaling $\ln \xi \sim |h - h_c|^{-\psi}$, extracting tunneling exponent $\psi = 0.48 \pm 0.08$ consistent with theoretical predictions ($\psi = 0.5$, $\Delta\chi^2 = 12.3$, $p < 0.001$). This demonstrates that unsupervised learning can identify qualitatively different \emph{types} of critical behavior, serving as a consistency check on known IRFP physics.
Authors: Ankur Mahesh, William D. Collins, Travis A. O'Brien, Paul B. Goddard, Sinclaire Zebaze, Shashank Subramanian, James P. C. Duncan, Oliver Watt-Meyer, Boris Bonev, Thorsten Kurth, Karthik Kashinath, Michael S. Pritchard, Da Yang
Abstract: The response of the climate system to increased greenhouse gases and other radiative perturbations is governed by a combination of fast and slow feedbacks. Slow feedbacks are typically activated in response to changes in ocean temperatures on decadal timescales and manifest as changes in climatic state with no recent historical analogue. However, fast feedbacks are activated in response to rapid atmospheric physical processes on weekly timescales, and they are already operative in the present-day climate. This distinction implies that the physics of fast radiative feedbacks is present in the historical meteorological reanalyses used to train many recent successful machine-learning-based (ML) emulators of weather and climate. In addition, these feedbacks are functional under the historical boundary conditions pertaining to the top-of-atmosphere radiative balance and sea-surface temperatures. Together, these factors imply that we can use historically trained ML weather emulators to study the response of radiative-convective equilibrium (RCE), and hence the global hydrological cycle, to perturbations in carbon dioxide and other well-mixed greenhouse gases. Without retraining on prospective Earth system conditions, we use ML weather emulators to quantify the fast precipitation response to reduced and elevated carbon dioxide concentrations with no recent historical precedent. We show that the responses from historically trained emulators agree with those produced by full-physics Earth System Models (ESMs). In conclusion, we discuss the prospects for and advantages from using ESMs and ML emulators to study fast processes in global climate.
Authors: Rong Fu, Ziming Wang, Chunlei Meng, Jiaxuan Lu, Jiekai Wu, Kangan Qian, Hao Zhang, Simon Fong
Abstract: As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.
Authors: Rong Fu, Ziming Wang, Shuo Yin, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong
Abstract: Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
Authors: Rong Fu, Zijian Zhang, Haiyun Wei, Jiekai Wu, Kun Liu, Xianda Li, Haoyu Zhao, Yang Li, Yongtai Liu, Ziming Wang, Rui Lu, Simon Fong
Abstract: The continuous expansion of digital learning environments has catalyzed the demand for intelligent systems capable of providing personalized educational content. While current exercise recommendation frameworks have made significant strides, they frequently encounter obstacles regarding the long-tailed distribution of student engagement and the failure to adapt to idiosyncratic learning trajectories. We present LiveGraph, a novel active-structure neural re-ranking framework designed to overcome these limitations. Our approach utilizes a graph-based representation enhancement strategy to bridge the information gap between active and inactive students while integrating a dynamic re-ranking mechanism to foster content diversity. By prioritizing the structural relationships within learning histories, the proposed model effectively balances recommendation precision with pedagogical variety. Comprehensive experimental evaluations conducted on multiple real-world datasets demonstrate that LiveGraph surpasses contemporary baselines in both predictive accuracy and the breadth of exercise diversity.
Authors: Brandon Yee, Wilson Collins, Maximilian Rutkowski
Abstract: The spin-$1/2$ $J_1$-$J_2$ Heisenberg model on the square lattice exhibits a debated intermediate phase between N\'eel antiferromagnetic and stripe ordered regimes, with competing theories proposing plaquette valence bond, nematic, and quantum spin liquid ground states. We apply the Prometheus variational autoencoder framework -- previously applied to classical (2D, 3D Ising) and quantum (disordered transverse field Ising) phase transitions -- to systematically explore the $J_1$-$J_2$ phase diagram using a multi-scale approach. For $L=4$, we employ exact diagonalization with full wavefunction analysis via quantum-aware VAE. For larger systems ($L=6, 8$), we introduce a reduced density matrix (RDM) based methodology using DMRG ground states, enabling scaling beyond the exponential barrier of full Hilbert space representation. Through dense parameter scans of $J_2/J_1 \in [0, 1]$ and comprehensive latent space analysis, we identify the structure factor $S(\pi,\pi)$ and $S(\pi,0)$ as the dominant order parameters discovered by the VAE, with correlations exceeding $|r| > 0.97$. The RDM-VAE approach successfully captures the N\'eel-to-stripe crossover near $J_2/J_1 \approx 0.5$--$0.6$, demonstrating that local quantum correlations encoded in reduced density matrices contain sufficient information for unsupervised phase discovery. This work establishes a scalable pathway for applying machine learning to frustrated quantum systems where full wavefunction access is computationally prohibitive.
Authors: Yang Yu, Lei Kou, Huaikuan Yi, Bin Chen, Yayu Cao, Lei Shen, Chao Zhang, Bing Wang, Xiaoyi Zeng
Abstract: With the rapid evolution of Large Language Models (LLMs), generative recommendation is gradually reshaping the paradigm of recommender systems. However, most existing methods remain confined to the interaction-driven next-item prediction paradigm, struggling to keep pace with the latest evolving trends or address the diverse recommendation tasks along with business-specific requirements in real-world scenarios. To this end, we present SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender deployed at AliExpress. Specifically, we first ground item entities in a unified latent space capturing both general semantics and collaborative signals. Building upon this, we introduce a hybrid item tokenization method for both precise modeling and efficient generation. Moreover, we construct a large-scale multi-task supervised fine-tuning dataset empowering SIGMA to fulfill various recommendation demands via instruction-following. Finally, we design a three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism to calibrate the output distributions based on task-specific requirements for recommendation accuracy and diversity. Extensive offline experiments and online A/B tests demonstrate the effectiveness of SIGMA across various real-world recommendation tasks.
Authors: Brandon Yee, Lucas Wang, Kundana Kommini, Krishna Sharma
Abstract: We present Geodesic Semantic Search (GSS), a retrieval system that learns node-specific Riemannian metrics on citation graphs to enable geometry-aware semantic search. Unlike standard embedding-based retrieval that relies on fixed Euclidean distances, GSS learns a low-rank metric tensor $L_i \in \mathbb{R}^{d \times r}$ at each node, inducing a local positive semi-definite metric $G_i = L_i L_i^\top + \epsilon I$. This parameterization guarantees valid metrics while keeping the model tractable. Retrieval proceeds via multi-source Dijkstra on the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path coherence filtering. On citation prediction benchmarks with 169K arXiv papers, GSS achieves a 23\% relative improvement in Recall@20 over SPECTER+FAISS baselines. We provide a Bridge Recovery Guarantee characterizing when geodesic retrieval qualitatively outperforms direct similarity, a margin separation result connecting training loss to retrieval quality, and a characterization of the expressiveness of the low-rank metric parameterization. Our hierarchical coarse-to-fine search with k-means pooling reduces computational cost by $4\times$ while maintaining 97\% retrieval quality.
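The low-rank parameterization above yields a valid metric by construction: $L_i L_i^\top$ is positive semi-definite and the $\epsilon I$ term makes it strictly positive definite. A minimal numerical sketch of this property, with illustrative names and sizes not taken from the paper's code:

```python
import numpy as np

def local_metric(L, eps=1e-3):
    """Build a positive-definite local metric G = L @ L.T + eps * I
    from a low-rank factor L of shape (d, r)."""
    d = L.shape[0]
    return L @ L.T + eps * np.eye(d)

def metric_distance(x, y, G):
    """Mahalanobis-style distance induced by the local metric G."""
    diff = x - y
    return float(np.sqrt(diff @ G @ diff))

rng = np.random.default_rng(0)
d, r = 8, 2
L = rng.normal(size=(d, r))
G = local_metric(L)

# G is symmetric, and every eigenvalue is at least eps, so the induced
# quadratic form is positive for any nonzero displacement.
x, y = rng.normal(size=d), rng.normal(size=d)
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() >= 1e-3 - 1e-9
assert metric_distance(x, y, G) > 0
```

Storing only the $d \times r$ factor rather than the full $d \times d$ tensor is what keeps the per-node metric tractable at scale.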
Authors: Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
Abstract: While Large Language Models (LLMs) have revolutionized code generation, standard ``System 1'' approaches that generate solutions in a single forward pass often hit a performance ceiling on complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction at inference time. We utilize an RL-only training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B to 14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, efficient reasoning and reflection patterns. The source code and data are available at https://github.com/juyongjiang/ReflexiCoder.
Authors: Brandon Yee, Wilson Collins, Michael Iofin, Jiayi Fu
Abstract: We develop a theoretical and computational framework for beam-plasma collective oscillations in intense charged-particle beams at intermediate energies (10-100 MeV). In Part I, we formulate a kinetic field theory governed by the Vlasov-Poisson system, deriving the Lindhard dielectric function and random phase approximation (RPA) polarization tensor for three beam distribution functions. We prove via the dielectric function epsilon(omega,q)=0 the existence of undamped Langmuir wave modes above a critical beam density n_c, obtain explicit beam-plasma dispersion relations, and show that Landau damping vanishes above the particle-hole continuum. The plasma frequency Omega_p^2 = ne^2/(m*epsilon_0) is fixed by the f-sum rule independently of distribution shape; higher dispersion coefficients depend on velocity moments. Space charge effects drive anomalous beam broadening with sqrt(n-n_c) onset and Friedel oscillations at q=2k_F. The beam-plasma transition belongs to the 3D Ising universality class via renormalization group analysis. In Part II, we validate these predictions using Prometheus, a beta-VAE trained on static structure factor data S(q) from particle-in-cell (PIC) beam simulations. Prometheus detects collective plasma oscillation onset in Gaussian and uniform distributions, confirms their absence in the degenerate Fermi gas (n_c -> 0), and resolves the Kohn anomaly at q=2k_F. Dispersion analysis of S(q,omega) from PIC simulations verifies the distribution-independent Omega_p predicted by the f-sum rule. All six validation checks pass. Predicted signatures -- density-tunable plasma resonances at omega_p proportional to sqrt(n), anomalous beam broadening with sqrt(n-n_c) onset, and Friedel oscillations -- are accessible at existing intermediate-energy beam facilities.
Authors: Houston Haynes
Abstract: Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework [6], which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph [8], which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit 2026 standard [10], which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce Bayesian distillation, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce warm rotation, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with structural correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.
Authors: Houston Haynes
Abstract: A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff's universal prior, placing the framework's type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.
Authors: Shaik Aman
Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence. Their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, degrading reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model's hidden states with 98.4% accuracy, and a dependency-ordered scheduler unmasks tokens in logical order. In zero-shot settings, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. However, with 8-shot chain-of-thought prompting, the baseline reaches approximately 70% and LogicDiff provides no additional improvement. Analysis reveals that few-shot prompting implicitly resolves the same ordering problem that LogicDiff explicitly addresses, and that fixed role-based ordering can cause premature commitment to numerical values before sufficient context is available. Our results characterize the Flexibility Trap as primarily a zero-shot phenomenon and identify context-adaptive ordering as a key direction for future work.
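The dependency-ordered scheduler described in the LogicDiff abstract can be sketched as a priority sort over predicted logical roles: positions classified as premises unmask before connectives, which unmask before derived steps, and so on. The role names follow the abstract, but the function and ordering rule below are illustrative assumptions, not the paper's implementation:

```python
# Logical roles from the abstract, in assumed dependency order (fillers last).
ROLE_ORDER = {"premise": 0, "connective": 1, "derived": 2, "conclusion": 3, "filler": 4}

def unmask_schedule(roles):
    """Return masked-token positions sorted so that logically-prior roles
    unmask first, with ties broken left to right by position."""
    return sorted(range(len(roles)), key=lambda i: (ROLE_ORDER[roles[i]], i))

# Hypothetical per-position role predictions from the lightweight head:
roles = ["filler", "premise", "connective", "premise", "derived", "conclusion"]
assert unmask_schedule(roles) == [1, 3, 2, 4, 5, 0]
```

This is exactly the kind of fixed ordering the abstract flags as a limitation: committing to premises first can lock in numerical values before enough context is unmasked.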
Authors: Tareq Aldirawi, Yun Li, Wenge Guo
Abstract: Conformal risk control (CRC) provides distribution-free guarantees for controlling the expected loss at a user-specified level. Existing theory typically assumes that the loss decreases monotonically with a tuning parameter that governs the size of the prediction set. However, this assumption is often violated in practice, where losses may behave non-monotonically due to competing objectives such as coverage and efficiency. In this paper, we study CRC under non-monotone loss functions when the tuning parameter is selected from a finite grid, a setting commonly arising in thresholding and discretized decision rules. Revisiting a known counterexample, we show that the validity of CRC without monotonicity depends critically on the relationship between the calibration sample size and the grid resolution. In particular, reliable risk control can still be achieved when the calibration sample is sufficiently large relative to the grid size. We establish a finite-sample guarantee for bounded losses over a grid of size $m$, showing that the excess risk above the target level $\alpha$ scales on the order of $\sqrt{\log(m)/n}$, where $n$ is the calibration sample size. A matching lower bound demonstrates that this rate is minimax optimal. We also derive refined guarantees under additional structural conditions, including Lipschitz continuity and monotonicity, and extend the analysis to settings with distribution shift via importance weighting. Numerical experiments on synthetic multilabel classification and real object detection data illustrate the practical implications of non-monotonicity. Methods that explicitly account for finite-sample uncertainty achieve more stable risk control than approaches based on monotonicity transformations, while maintaining competitive prediction set sizes.
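The finite-grid setting above can be illustrated with a small simulation: pick the grid point with the largest empirical risk not exceeding the target $\alpha$, then check that its true risk stays within the stated $\sqrt{\log(m)/n}$ slack. The risk curve and selection rule here are assumptions for illustration, not the paper's exact procedure:

```python
import math
import random

random.seed(0)
alpha = 0.1          # target risk level
n, m = 2000, 50      # calibration sample size and grid size

# A non-monotone "true risk" curve over the grid, bounded in [0, 1].
true_risk = [0.05 + 0.4 * math.sin(7 * k / m) ** 2 for k in range(m)]

# Empirical risks from n Bernoulli calibration losses per grid point
# (drawn independently per point here, for simplicity).
emp_risk = [sum(random.random() < p for _ in range(n)) / n for p in true_risk]

# Thresholding rule: largest empirical risk that does not exceed alpha.
feasible = [k for k in range(m) if emp_risk[k] <= alpha]
k_hat = max(feasible, key=lambda k: emp_risk[k])

# Finite-sample slack of order sqrt(log(m)/n), as in the stated guarantee.
slack = math.sqrt(math.log(m) / n)
print(f"true risk {true_risk[k_hat]:.3f} <= alpha + slack = {alpha + slack:.3f}")
assert true_risk[k_hat] <= alpha + slack
```

With $n = 2000$ and $m = 50$ the slack is about $0.044$, so the selected point's true risk lands comfortably below $\alpha + \sqrt{\log(m)/n} \approx 0.144$; shrinking $n$ relative to $m$ is what makes non-monotone selection unreliable.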
Authors: Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, Shumin Deng
Abstract: Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation and repeatedly rediscover similar behaviors from limited experience, resulting in redundant exploration and poor generalization. To address this problem, we propose SkillX, a fully automated framework for constructing a \textbf{plug-and-play skill knowledge base} that can be reused across agents and environments. SkillX operates through a fully automated pipeline built on three synergistic innovations: \textit{(i) Multi-Level Skills Design}, which distills raw trajectories into a three-tiered hierarchy of strategic plans, functional skills, and atomic skills; \textit{(ii) Iterative Skills Refinement}, which automatically revises skills based on execution feedback to continuously improve library quality; and \textit{(iii) Exploratory Skills Expansion}, which proactively generates and validates novel skills to expand coverage beyond seed training data. Using a strong backbone agent (GLM-4.6), we automatically build a reusable skill library and evaluate its transferability on challenging long-horizon, user-interactive benchmarks, including AppWorld, BFCL-v3, and $\tau^2$-Bench. Experiments show that SkillKB consistently improves task success and execution efficiency when plugged into weaker base agents, highlighting the importance of structured, hierarchical experience representations for generalizable agent learning. Our code will be publicly available soon at https://github.com/zjunlp/SkillX.
Authors: Yu Liang, Liangxin Liu, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Daiting Shi
Abstract: Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates fine-grained and differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs. Our implementation is available at https://github.com/yuliangCarmelo/ConsistRM.
Authors: Tunazzina Islam
Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels through a two-stage process that generates and consolidates semantically similar labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates the common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analysis under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analysis of large text collections without supervision.
Authors: Xinshun Feng, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao
Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinder their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce SEARL, a Tool-Memory-based self-evolving agentic framework. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
Authors: Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang
Abstract: Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and perform camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Pl\"ucker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.
Authors: Chen Zhan, Xiaoyu Tan, Gengchen Ma, Yu-Jie Xiong, Xiaoyan Jiang, Xihe Qiu
Abstract: The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce "correct answers through flawed reasoning." This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline, Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLMs to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL's progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnostic reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.
Authors: Daeyeon Son
Abstract: An OS kernel that runs LLM inference internally can read logit distributions before any text is generated and act on them as a governance primitive. This paper presents ProbeLogits, a kernel-level operation that performs a single forward pass and reads specific token logits to classify agent actions as safe or dangerous, with zero learned parameters. I evaluate ProbeLogits across three base models (Qwen 2.5-7B, Llama 3 8B, Mistral 7B) on three external benchmarks: HarmBench, XSTest, and ToxicChat. On HarmBench non-copyright (n=300), all three models reach 97-99% block rate with the right verbalizer. On ToxicChat (n=1,000), ProbeLogits achieves F1 parity-or-better against Llama Guard 3 in the same hosted environment: the strongest configuration (Qwen 2.5-7B Safe/Dangerous, alpha=0.0) reaches F1=0.812 with bootstrap 95% CIs disjoint from LG3 (+13.7pp significant); Llama 3 S/D matches LG3 within CI (+0.4pp, parity); Mistral Y/N exceeds by +4.4pp. Latency is approximately 2.5x faster than LG3 in the same hosted environment because the primitive reads a single logit position instead of generating tokens; in the bare-metal native runtime ProbeLogits drops to 65 ms. A key design contribution is the calibration strength alpha, which serves as a deployment-time policy knob rather than a learned hyperparameter. Contextual calibration corrects verbalizer prior asymmetry, with bias magnitude varying by (model, verbalizer) pair. I implement ProbeLogits within Anima OS, a bare-metal x86_64 OS written in approximately 86,000 lines of Rust. Because agent actions must pass through 15 kernel-mediated host functions, ProbeLogits enforcement operates below the WASM sandbox boundary, making it significantly harder to circumvent than application-layer classifiers.
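The single-forward-pass verbalizer read-out and the calibration strength alpha can be sketched as a comparison of two token logits with a prior-bias correction. The function, logit values, and bias estimate below are hypothetical illustrations of the idea, not the Anima OS implementation:

```python
def classify(logit_safe, logit_dangerous, prior_bias=0.0, alpha=0.0):
    """Return True (block) if the calibrated 'Dangerous' score wins.

    prior_bias estimates the model's content-free preference for the
    'Dangerous' verbalizer over 'Safe'; alpha scales how strongly that
    asymmetry is subtracted out (a deployment-time policy knob).
    """
    score = (logit_dangerous - logit_safe) - alpha * prior_bias
    return score > 0.0

# Content-free probe (e.g. an empty action) suggests this model slightly
# prefers 'Dangerous' a priori (assumed value for illustration):
prior_bias = 1.2

# A benign action whose raw logits lean 'Dangerous' only because of the prior:
assert classify(2.0, 2.8, prior_bias, alpha=0.0) is True    # uncalibrated: blocked
assert classify(2.0, 2.8, prior_bias, alpha=1.0) is False   # calibrated: allowed
# A clearly dangerous action is blocked either way:
assert classify(0.5, 4.0, prior_bias, alpha=1.0) is True
```

Reading only two logit positions, rather than generating tokens, is what makes this cheap relative to a generative guard model; the right alpha depends on the (model, verbalizer) pair, as the abstract notes.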
Authors: Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu
Abstract: As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign request into one that is unsafe, unfair, or privacy-violating. Existing benchmarks, however, largely focus on basic audio comprehension, study individual risks in isolation, or conflate content that is inherently harmful with content that only becomes problematic due to its acoustic context. We introduce VoxSafeBench, among the first benchmarks to jointly evaluate social alignment in SLMs across three dimensions: safety, fairness, and privacy. VoxSafeBench adopts a Two-Tier design: Tier1 evaluates content-centric risks using matched text and audio inputs, while Tier2 targets audio-conditioned risks in which the transcript is benign but the appropriate response hinges on the speaker, paralinguistic cues, or the surrounding environment. To validate Tier2, we include intermediate perception probes and confirm that frontier SLMs can successfully detect these acoustic cues yet still fail to act on them appropriately. Across 22 tasks with bilingual coverage, we find that safeguards appearing robust on text often degrade in speech: safety awareness drops for speaker- and scene-conditioned risks, fairness erodes when demographic differences are conveyed vocally, and privacy protections falter when contextual cues arrive acoustically. Together, these results expose a pervasive speech grounding gap: current SLMs frequently recognize the relevant social norm in text but fail to apply it when the decisive cue must be grounded in speech. Code and data are publicly available at: https://amphionteam.github.io/VoxSafeBench_demopage/
Authors: Sazzadul Islam, Tasnim Tabassum, Hao Zheng
Abstract: Generating synthesizable Verilog for large, hierarchical hardware designs remains a significant challenge for large language models (LLMs), which struggle to replicate the structured reasoning that human experts employ when translating complex specifications into RTL. When tasked with producing hierarchical Verilog, LLMs frequently lose context across modules, hallucinate interfaces, fabricate inter-module wiring, and fail to maintain structural coherence - failures that intensify as design complexity grows and specifications involve informal prose, figures, and tables that resist direct operationalization. To address these challenges, we present VeriGraphi, a framework that introduces a spec-anchored Knowledge Graph as the architectural substrate driving the RTL generation pipeline. VeriGraphi constructs an HDA, a structured knowledge graph that explicitly encodes module hierarchy, port-level interfaces, wiring semantics, and inter-module dependencies as first-class graph entities and relations. Built through iterative multi-agent analysis of the specification, this Knowledge Graph provides a deterministic, machine-checkable structural scaffold before code generation. Guided by the KG, a progressive coding module incrementally generates pseudo-code and synthesizable RTL while enforcing interface consistency and dependency correctness at each submodule stage. We evaluate VeriGraphi on a benchmark of three representative specification documents from the National Institute of Standards and Technology and their corresponding implementations, and we present a RV32I processor as a detailed case study to illustrate the full pipeline. The results demonstrate that VeriGraphi enables reliable hierarchical RTL generation with minimal human intervention for RISC-V, marking a significant milestone for LLM-generated hardware design while maintaining strong functional correctness.
Authors: Jiamei Wu, Ce Zhang, Zhipeng Cai, Jingsen Kong, Bei Jiang, Linglong Kong, Lingchen Kong
Abstract: Conformal prediction (CP) has attracted broad attention as a simple and flexible framework for uncertainty quantification through prediction sets. In this work, we study how to deploy CP under differential privacy (DP) in a statistically efficient manner. We first introduce differential CP, a non-splitting conformal procedure that avoids the efficiency loss caused by data splitting and serves as a bridge between oracle CP and private conformal inference. By exploiting the stability properties of DP mechanisms, differential CP establishes a direct connection to oracle CP and inherits corresponding validity behavior. Building on this idea, we develop Differentially Private Conformal Prediction (DPCP), a fully private procedure that combines DP model training with a private quantile mechanism for calibration. We establish the end-to-end privacy guarantee of DPCP and investigate its coverage properties under additional regularity conditions. We further study the efficiency of both differential CP and DPCP under empirical risk minimization and general regression models, showing that DPCP can produce tighter prediction sets than existing private split conformal approaches under the same privacy budget. Numerical experiments on synthetic and real datasets demonstrate the practical effectiveness of the proposed methods.
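The private calibration step can be illustrated with a generic exponential-mechanism quantile release in the style of Smith (2011). This is only a sketch of the idea; DPCP's actual quantile mechanism, its composition with DP model training, and the resulting end-to-end guarantee are as described in the paper, and the clipping range and function names below are assumptions:

```python
import numpy as np

def private_quantile(scores, q, eps, lo=0.0, hi=1.0, rng=None):
    """eps-DP quantile release via the exponential mechanism: choose an
    inter-order-statistic interval with probability proportional to its
    length times exp(-eps * |rank - target rank| / 2), then draw
    uniformly inside it. Assumes scores can be clipped to [lo, hi]."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.clip(np.sort(np.asarray(scores, dtype=float)), lo, hi)
    edges = np.concatenate(([lo], s, [hi]))      # n+2 edges -> n+1 intervals
    lengths = np.diff(edges)
    target = q * len(s)                          # target rank of the quantile
    utility = -np.abs(np.arange(len(lengths)) - target)  # sensitivity-1 utility
    logw = np.log(np.maximum(lengths, 1e-12)) + 0.5 * eps * utility
    p = np.exp(logw - logw.max())
    p /= p.sum()
    i = rng.choice(len(p), p=p)
    return rng.uniform(edges[i], edges[i + 1])
```

In a split-conformal setting the released value would serve as the prediction-set threshold; the paper's point is that a non-splitting differential CP procedure can avoid the efficiency loss this split incurs.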
Authors: Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulations and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization ability, and robustness, especially in long-horizon and compositional scenarios. Code is available at https://github.com/Win-commit/WAV.
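The exponential-decay argument can be illustrated with a deliberately simplified independence model (our paraphrase, not the paper's exact statement): if each of $H$ action steps, sampled directly in action space, is feasible with probability at most $p < 1$, then

```latex
% Toy model of the feasibility argument (illustrative only):
\Pr[\tau \text{ feasible}]
  \;=\; \prod_{t=1}^{H} \Pr\!\big[a_t \text{ feasible} \,\big|\, a_{<t}\big]
  \;\le\; p^{H} \;\xrightarrow[\;H\to\infty\;]{}\; 0 .
```

Latent-space inference sidesteps this decay because the search distribution is reshaped to concentrate on feasible regions before actions are decoded, rather than paying the per-step feasibility penalty multiplicatively.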
Authors: Shen Fan, Mikołaj Kida, Przemyslaw Musialski
Abstract: Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction–segmentation network and a dataset-level point-based backbone.
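The merge rule can be sketched as a union-find pass over face adjacencies: union two faces only if they share a primitive type and meet near-tangentially. The data layout (one representative normal per face, an explicit adjacency list) and the 15-degree threshold are illustrative simplifications, not the toolchain's actual criterion:

```python
import numpy as np

def find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_faces(prim_type, normals, adjacency, max_deg=15.0):
    """prim_type[i]: analytic type, e.g. 'plane' or 'cylinder';
    normals[i]: unit normal at the shared edge (a simplification that is
    exact only for planes); adjacency: (i, j) pairs of faces sharing an edge.
    Returns a per-face partition label."""
    parent = list(range(len(prim_type)))
    for i, j in adjacency:
        if prim_type[i] != prim_type[j]:      # same-primitive requirement
            continue
        cosang = float(np.clip(np.dot(normals[i], normals[j]), -1.0, 1.0))
        if np.degrees(np.arccos(cosang)) <= max_deg:  # near-tangent continuity
            parent[find(parent, i)] = find(parent, j)
    return [find(parent, i) for i in range(len(prim_type))]
```

Because the rule consults only B-Rep topology and analytic types, the resulting partition is independent of any particular triangulation, which is the source of the tessellation robustness the abstract claims.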
Authors: Y-h. Taguchi, Yoh-ichi Mototake
Abstract: In this paper, we propose Bayesian Tucker decomposition (BTuD), in which the residual is assumed to follow a Gaussian distribution, analogous to linear regression. Although we propose a dedicated algorithm to perform BTuD, the conventional higher-order orthogonal iteration can also generate a Tucker decomposition consistent with the present implementation. Using BTuD, we perform unsupervised feature selection and successfully apply it to various synthetic datasets, globally coupled maps with randomized coupling strength, and gene expression profiles. We therefore conclude that our newly proposed unsupervised feature selection method is promising. In addition, BTuD-based unsupervised FE is expected to coincide with TD-based unsupervised FE, which was previously proposed and successfully applied to a wide range of problems.
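The higher-order orthogonal iteration (HOOI) the abstract refers to is a classical algorithm; a minimal generic 3-way version (not the authors' implementation) looks like this:

```python
import numpy as np

def unfold(T, mode):
    # Mode-m matricization: move axis m to the front and flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi(T, ranks, n_iter=50):
    """Rank-(r1, r2, r3) Tucker decomposition of a 3-way tensor T via HOOI.
    Returns (core, [U1, U2, U3]) with T ~ core x_1 U1 x_2 U2 x_3 U3."""
    # HOSVD initialization: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    for _ in range(n_iter):
        for m in range(3):
            # Project all modes except m, then re-estimate factor m by SVD.
            G = T
            for k in range(3):
                if k != m:
                    G = np.moveaxis(np.tensordot(U[k].T, G, axes=(1, k)), 0, k)
            U[m] = np.linalg.svd(unfold(G, m), full_matrices=False)[0][:, :ranks[m]]
    core = T
    for k in range(3):
        core = np.moveaxis(np.tensordot(U[k].T, core, axes=(1, k)), 0, k)
    return core, U
```

In the TD-based unsupervised FE line of work, feature scores are then typically derived from the factor matrices of such a decomposition; BTuD adds the Gaussian residual model on top of this structure.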
Authors: Haozhi Fan, Jinhao Duan, Kaidi Xu
Abstract: Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ on two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.
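The inter-sample consistency idea can be conveyed with a deliberately toy sketch: a claim extracted from one response is scored by the fraction of other sampled responses that support it. Here a case-insensitive substring match stands in for the paper's interrogate-then-respond check, which uses an LLM to answer targeted questions about each claim:

```python
def claim_uncertainty(claim: str, samples: list[str]) -> float:
    """Toy inter-sample consistency score: 1 minus the fraction of sampled
    responses that support the claim (substring match as a stand-in for an
    LLM-based interrogation step)."""
    support = sum(claim.lower() in s.lower() for s in samples)
    return 1.0 - support / len(samples)
```

A claim that only a few of the sampled responses support receives high uncertainty, which is the signal IUQ combines with intra-sample faithfulness to flag coherent-but-inaccurate content.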
Authors: Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen
Abstract: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present $\textbf{SegWithU}$, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
Authors: Panos Tsimpos, Daniel Sharp, Youssef Marzouk
Abstract: We study dynamic measure transport for generative modelling in the setting of a stochastic process $X_\bullet$ whose marginals interpolate between a source distribution $P_0$ and a target distribution $P_1$ while remaining independent, i.e., when $(X_0,X_1)\sim P_0\otimes P_1$. Conditional expectations of this process $X_\bullet$ define an ODE whose flow map transports from $P_0$ to $P_1$. We discuss when such a process induces a \emph{straight-line flow}, namely one whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. We first develop multiple characterizations of straightness in terms of PDEs involving the conditional statistics of the process. Then, we prove that straightness under endpoint independence exhibits a sharp dichotomy. On one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent endpoints and the space-time geometry of this process' flow map. Taken together, these results provide a structural theory of when straight generative flows can, and cannot, exist.
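In standard flow-matching notation (our paraphrase, not necessarily the paper's), the induced velocity field and the straightness condition can be written as:

```latex
% Velocity field defined by conditional expectations of the process X_t,
% with marginals rho_t transported by the continuity equation:
v_t(x) \;=\; \mathbb{E}\!\left[\,\dot X_t \,\middle|\, X_t = x\,\right],
\qquad
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) \;=\; 0 .

% Straight-line flow: the material (pointwise) acceleration vanishes,
\partial_t v_t(x) + \big(v_t(x) \cdot \nabla\big)\, v_t(x) \;=\; 0 ,
% so the velocity is constant along each ODE trajectory and a single
% explicit Euler step integrates the time-1 flow map exactly:
x \;\longmapsto\; x + v_0(x) .
```

The paper's dichotomy then concerns when such a $v_t$ can exist under the independence constraint $(X_0, X_1) \sim P_0 \otimes P_1$: always for Gaussian endpoints, never for targets with sufficiently well-separated modes.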
Authors: Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan
Abstract: While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.
URLs: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.