Results for "Men"

Showing 20 of ~2,330,498 results · from DOAJ, arXiv, CrossRef, Semantic Scholar

arXiv Open Access 2026
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Zihan Qiu, Zeyu Huang, Kaiyue Wen et al.

We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve more as rescaling factors than as contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (only 1.2 points of degradation under W4A4 quantization).

en cs.CL
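The rescaling role that the first abstract attributes to attention sinks follows from a softmax identity, which a small sketch can make concrete (a toy illustration with made-up logit values, not the paper's method): a sink token holding weight w scales every non-sink weight by a common factor (1 - w) while leaving their relative proportions unchanged.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention logits: token 0 acts as a "sink" with a large logit.
logits = np.array([6.0, 1.0, 0.5, 0.2])
weights = softmax(logits)

# Without the sink, the remaining logits share all the probability mass.
no_sink = softmax(logits[1:])

# The sink rescales the non-sink weights by the common factor (1 - w_sink),
# leaving their relative proportions unchanged -- the "rescaling" role.
scale = 1.0 - weights[0]
assert np.allclose(weights[1:], scale * no_sink)
```

The assertion holds exactly for any logit values, which is why clipping the sink (changing the factor) while keeping the softmax can hurt, as the abstract reports.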
arXiv Open Access 2026
RoboBrain 2.5: Depth in Sight, Time in Mind

Huajie Tan, Enshen Zhou, Zhiyu Li et al.

We introduce RoboBrain 2.5, a next-generation embodied AI foundation model that advances general perception, spatial reasoning, and temporal modeling through extensive training on high-quality spatiotemporal supervision. Building upon its predecessor, RoboBrain 2.5 introduces two major capability upgrades. Specifically, it unlocks Precise 3D Spatial Reasoning by shifting from 2D pixel-relative grounding to depth-aware coordinate prediction and absolute metric constraint comprehension, generating complete 3D manipulation traces as ordered keypoint sequences under physical constraints. Complementing this spatial precision, the model establishes Dense Temporal Value Estimation, which provides dense, step-aware progress prediction and execution state understanding across varying viewpoints, producing stable feedback signals for downstream learning. Together, these upgrades extend the framework toward more physically grounded and execution-aware embodied intelligence for complex, fine-grained manipulation. The code and checkpoints are available at the project website: https://superrobobrain.github.io

en cs.RO
arXiv Open Access 2026
Attention Residuals

Kimi Team, Guangyu Chen, Yu Zhang et al.

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.

en cs.CL
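The core mechanism the AttnRes abstract describes, replacing the fixed unit-weight sum of a residual stream with a softmax over preceding layer outputs, can be sketched in a few lines. The query parameterization and temperature below are illustrative guesses, not the paper's exact design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, query, temperature=1.0):
    """Aggregate preceding layer outputs with softmax attention instead of
    summing them with fixed unit weights (the standard PreNorm residual).

    layer_outputs: (L, d) array, one row per preceding layer's output.
    query: (d,) vector derived from the current input (this parameterization
    is a simplification for illustration).
    """
    scores = layer_outputs @ query / temperature   # (L,) one score per layer
    weights = softmax(scores)                      # input-dependent, sums to 1
    return weights @ layer_outputs                 # (d,) selective aggregate

rng = np.random.default_rng(0)
outs = rng.standard_normal((4, 8))   # 4 preceding layers, hidden width 8
q = rng.standard_normal(8)
agg = attn_residual(outs, q)
```

Because the weights sum to 1 rather than to the number of layers, the aggregate's magnitude does not grow with depth, which is the PreNorm-dilution issue the abstract says AttnRes mitigates.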
arXiv Open Access 2025
The flat tail in the burst energy distribution of FRB 20240114A

Yu-Xiang Huang, Jun-Shuo Zhang, Heng Xu et al.

Fast Radio Bursts (FRBs) are enigmatic millisecond-duration radio transients of extra-galactic origin, whose underlying mechanisms and progenitors remain poorly understood. FRBs are broadly classified into two categories: repeating FRBs, which emit multiple bursts over time, and one-off FRBs, which are detected as single events. A central question in FRB research is whether these two classes share a common origin. In this study, we present observations of FRB 20240114A, a repeating FRB that entered a hyperactive phase in January 2024. We conducted a 318-hour monitoring campaign using the Kunming 40-Meter Radio Telescope (KM40M) in the S-band (2.187-2.311 GHz), during which we detected eight radio bursts. We analyzed their properties, including dispersion measure (DM), bandwidth, pulse width, flux, fluence, and energy. Additionally, we searched for counterparts in overlapping data from the Five-hundred-meter Aperture Spherical Telescope (FAST) in the L-band (1.0-1.5 GHz). While no bursts were temporally aligned between the two telescopes, we identified one FAST burst that arrived approximately 6 ms after one of the KM40M bursts. The absence of FAST counterparts for the KM40M bursts suggests that individual bursts from FRB 20240114A are likely narrow-band, with fractional bandwidths less than 10%. By comparing the cumulative event rates from KM40M and FAST observations, we found that the two measurements are compatible, indicating a possible flattening of the event rate at higher energies. This feature aligns with observations of one-off FRBs, supporting the hypothesis that repeating and one-off FRBs may share a common origin.

en astro-ph.HE
arXiv Open Access 2025
Qwen2.5-1M Technical Report

An Yang, Bowen Yu, Chengyuan Li et al.

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that the Qwen2.5-1M models have been greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.

en cs.CL
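Chunked prefill, one of the deployment optimizations the Qwen2.5-1M abstract mentions, can be sketched generically (this is a standard illustration of the technique, not Qwen's implementation): the prompt is processed in fixed-size chunks against a growing KV cache, bounding peak attention-score memory while producing outputs identical to processing the whole prompt at once.

```python
import numpy as np

def causal_attention(q, k, v):
    """q: (Tq, d) queries for the LAST Tq positions of a sequence whose
    keys/values k, v are (Tk, d); standard causal masking."""
    Tq, Tk, d = q.shape[0], k.shape[0], q.shape[1]
    scores = q @ k.T / np.sqrt(d)
    # Query i sits at global position Tk - Tq + i and may see keys 0..Tk-Tq+i.
    mask = np.tril(np.ones((Tq, Tk)), k=Tk - Tq).astype(bool)
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def chunked_prefill(x, chunk=4):
    """Process the prompt in fixed-size chunks, attending each chunk over the
    KV cache accumulated so far. Peak score-matrix memory is O(chunk * T)
    instead of O(T^2), with outputs identical to full causal attention."""
    outs, cache = [], []
    for start in range(0, len(x), chunk):
        blk = x[start:start + chunk]
        cache.append(blk)
        kv = np.concatenate(cache)          # all keys/values seen so far
        outs.append(causal_attention(blk, kv, kv))
    return np.concatenate(outs)

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))            # toy prompt: 10 tokens, width 4
full = causal_attention(x, x, x)
assert np.allclose(chunked_prefill(x), full)
```

The equivalence holds because each attention row depends only on keys at or before its own position, so splitting the queries into chunks changes memory usage but not the result.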

Page 12 of 116525