Aerosol Composition and Extinction of the 2022 Hunga Plume Using CALIOP
C. Duchamp, B. Legras, A. Podglajen
et al.
<p>We use the CALIOP (Cloud-Aerosol Lidar with Orthogonal Polarization) instrument to determine the microphysical properties of the stratospheric aerosol plume produced by the 2022 Hunga eruption, the largest since the Pinatubo eruption in 1991. In the early stages, low depolarization (<span class="inline-formula"><2 <i>%</i></span>) is found everywhere except in patches of high depolarization (up to 35 %) detected within the plumes of sulfur compounds up to 3 d after the eruption. As standard CALIOP L2 products are not operational for the Hunga aerosol plume, we implement an iterative method of successive approximations to retrieve extinction profiles by estimating the aerosol optical depth (AOD) and then the lidar ratio (LR). The AOD of the plume at 532 nm is between 0.5 and 1.25 over the first four days, then decreases rapidly and stabilizes at <span class="inline-formula">0.047 ± 0.011</span> for March 2022. The LR is initially between 60 and 80 sr, consistent with the early growth of sulfate aerosol particles, and then decreases to <span class="inline-formula">48 ± 6 sr</span> between late January and late March 2022. Results are compared and validated with the solar occultation instrument SAGE III (Stratospheric Aerosol and Gas Experiment) on board the International Space Station (ISS) and with Mie calculations. A comparison with limb-viewing instruments highlights significant quantitative disagreements in extinction and AOD estimates, which we attribute, in part, to the unusual size distribution of the aerosols within the Hunga plume.</p>
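The successive-approximation idea, correcting the attenuated backscatter with the current lidar-ratio guess and then updating LR as AOD divided by the integrated backscatter, can be caricatured as a toy fixed-point loop. The discretization, convergence criterion, and profile values below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def retrieve_lidar_ratio(atten_beta, dz, aod, lr0=50.0, n_iter=50, tol=1e-6):
    """Toy successive-approximation retrieval (illustrative only):
    correct the attenuated backscatter with the current lidar-ratio
    guess, integrate the corrected profile, and update
    LR = AOD / integrated backscatter until the estimate converges.
    Profiles are ordered top-down, as seen by a spaceborne lidar."""
    atten_beta = np.asarray(atten_beta, dtype=float)
    beta = atten_beta.copy()
    lr = lr0
    for _ in range(n_iter):
        tau = np.cumsum(lr * beta * dz)          # optical depth from the top down
        beta = atten_beta * np.exp(2.0 * tau)    # undo two-way attenuation
        lr_new = aod / np.sum(beta * dz)         # LR = AOD / integrated backscatter
        converged = abs(lr_new - lr) < tol
        lr = lr_new
        if converged:
            break
    return lr
```

On synthetic profiles generated with a known lidar ratio, the loop converges back to that LR, which is the sanity check one would run before applying such a scheme to real signals.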
Environmental engineering, Earthwork. Foundations
TVWorld: Foundations for Remote-Control TV Agents
Zhantao Ma, Quanfeng Lu, Shuai Zhong
et al.
Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.
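The offline graph abstraction can be pictured as a labelled transition system: screens are nodes, remote-control buttons are edge labels, and a navigation episode succeeds if replaying the agent's button presses reaches the goal screen. The graph, screen names, and button set below are invented for illustration; TVWorld's actual schema is not described in this abstract.

```python
def run_episode(graph, start, goal, actions):
    """Replay a remote-control action sequence on a screen-transition
    graph; success means the goal screen is reached."""
    node = start
    for action in actions:
        node = graph.get(node, {}).get(action, node)  # invalid press: stay put
    return node == goal

# Hypothetical TV UI: screens as nodes, RC buttons as edge labels.
TV_GRAPH = {
    "home":     {"down": "apps", "ok": "home"},
    "apps":     {"up": "home", "right": "settings", "ok": "player"},
    "settings": {"left": "apps"},
    "player":   {"back": "apps"},
}
```

For example, `run_episode(TV_GRAPH, "home", "player", ["down", "ok"])` returns `True`, and a benchmark success rate is just the fraction of episodes for which this check passes.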
WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
Sicheng Fan, Qingyun Shi, Shengze Xu
et al.
Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.
Assessing the Detection of Methane Plumes in Offshore Areas Using High-Resolution Imaging Spectrometers
J. Roger, L. Guanter
et al.
<p>The offshore oil and gas industry is an important contributor to global anthropogenic methane emissions. Satellite-based, high-resolution imaging spectrometers show great potential for the detection of methane emissions over land. However, the use of the same methods over offshore oil and gas extraction basins is challenged by the low reflectance of water in the near- and shortwave infrared spectral windows used for methane retrievals. This limitation can be partly alleviated by data acquisitions under the so-called sun glint configuration, which enhances the at-sensor radiance. In this work, we assess the performance of two space-based imaging spectrometers, EnMAP and EMIT, for the detection of offshore methane plumes using the matched filter method. We use simulated plumes to generate parametric probability of detection (POD) models for a range of emission flux rates (<span class="inline-formula"><i>Q</i></span>), at-sensor radiances, and wind speeds. The POD models were then compared against real plume detections from the two instruments. Our analysis shows that the spatial resolution of the instrument and the at-sensor radiance (which drives the retrieval precision) are the two factors with the greatest impact on plume POD. We also evaluate the dependence of the at-sensor radiance on the illumination-observation geometry and the surface roughness. Our POD models properly represent the different trade-offs between spatial resolution and retrieval precision in EnMAP and EMIT. As an example, for most combinations of <span class="inline-formula"><i>Q</i></span> and wind speed values at POD <span class="inline-formula">=</span> 50 %, EMIT demonstrates better detection performance at <span class="inline-formula"><i>Q</i>>7</span> t h<span class="inline-formula"><sup>−1</sup></span>, whereas EnMAP performs better at <span class="inline-formula"><i>Q</i><7</span> t h<span class="inline-formula"><sup>−1</sup></span>. 
This study demonstrates the ability of these two satellite instruments to detect high-emitting offshore point sources under a range of conditions. By filtering data based on these conditions, methane emission detection and monitoring efforts can be optimized, reducing unnecessary searches and ultimately increasing the mitigation action taken on these emissions.</p>
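A parametric POD model of the kind described can be as simple as a sigmoid in the flux rate, with the 50 % point shifted by scene conditions. The functional form and parameter values here are assumptions for illustration, not the fitted models from the study.

```python
import math

def pod(q, q50, k=0.8):
    """Illustrative probability-of-detection curve: a sigmoid in the
    emission flux rate q (t/h), where q50 is the flux at which
    POD = 50 %. In practice q50 and the steepness k would depend on
    at-sensor radiance and wind speed; this functional form is an
    assumption, not the paper's fitted model."""
    return 1.0 / (1.0 + math.exp(-k * (q - q50)))
```

At `q == q50` the curve returns exactly 0.5, matching the POD = 50 % threshold used when comparing EnMAP and EMIT.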
Calibration of weather radars with a target simulator
M. Schneebeli, A. Leuenberger
et al.
<p>We present findings from radar calibration experiments involving three radars operated by Colorado State University (CSU) in the US and by the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. The experiments were based on the comparison between measured radar variables and the known properties of artificial point targets electronically generated with a polarimetric radar target simulator (RTS) from Palindrome Remote Sensing. The radars under test were CSU's two magnetron-based radars, CHILL and its mobile counterpart SPLASH, and EPFL's new solid-state radar StXPol.</p>
<p>For the CHILL and SPLASH calibration measurements in Colorado, a mobile lifting platform was employed that elevated the target simulator instrument to approximately 15 <span class="inline-formula">m</span> above ground. The creation of virtual targets with polarimetric signatures allowed for a direct calibration of polarimetric variables. While the SPLASH radar exhibited good <span class="inline-formula"><i>Z</i><sub>dr</sub></span> and sufficient <span class="inline-formula"><i>Z</i><sub>h</sub></span> accuracy, remarkable precision and stability were found in CHILL's reflectivity time series, with a bias relative to the virtual target of less than 0.2 <span class="inline-formula">dB</span> over 1 h.</p>
<p>Calibration issues that arise with solid-state radar systems were investigated with experiments conducted with the EPFL StXPol radar. This pulse compression system transmits a linear frequency-modulated long pulse as well as a non-modulated short pulse for observations at close ranges. The two pulses are separated in frequency by 50 <span class="inline-formula">MHz</span>, and consequently calibration targets were generated independently for the two channels. Excellent stability and accuracy were found for <span class="inline-formula"><i>Z</i><sub>dr</sub></span> in both channels. While <span class="inline-formula"><i>Z</i><sub>h</sub></span> stability was also very high, a large reflectivity bias in both the long and the short pulse channel was detected.</p>
<p>For the first time, this article introduces and analyzes a weather radar calibration procedure based on electronically generated radar targets. Experimental data suggest that precise absolute and differential calibrations can be achieved if data are obtained in an environment free from multipath effects and if the generated targets are precisely located in the center of the radar's range gate. Experimental shortcomings associated with the limited sampling resolution of the radar scan over the targets are also investigated.</p>
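The core of such an RTS comparison is computing the offset between measured radar variables and the known values of the virtual target. A minimal sketch (variable names and numbers are illustrative, not the authors' processing code):

```python
import numpy as np

def calibration_biases(zh_meas, zv_meas, zh_target, zdr_target):
    """Mean reflectivity and differential-reflectivity offsets (dB)
    of a radar time series against the known virtual target, the
    basic comparison behind an RTS calibration. Z_dr is formed here
    as Z_h - Z_v in dB; inputs are illustrative."""
    zh = np.asarray(zh_meas, dtype=float)
    zdr = zh - np.asarray(zv_meas, dtype=float)  # Z_dr = Z_h - Z_v (dB)
    return float(np.mean(zh - zh_target)), float(np.mean(zdr - zdr_target))
```

A bias below 0.2 dB over a one-hour series, as reported for CHILL, would correspond to the first returned value staying under that threshold.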
Retrieving cloud-base height and geometric thickness using the oxygen A-band channel of GCOM-C/SGLI
T. M. Nagao, K. Suzuki, M. Kuji
<p>Measurements from a 763 nm channel, located within the oxygen A-band and carried on the Second-generation Global Imager (SGLI) on board the Japan Aerospace Exploration Agency (JAXA) Global Change Observation Mission – Climate (GCOM-C) satellite, have the potential to retrieve cloud-base height (CBH) and cloud geometric thickness (CGT) through passive remote sensing. This study implemented an algorithm to retrieve the CBH using the SGLI 763 nm channel in combination with several other SGLI channels in the visible, shortwave infrared, and thermal infrared regions. In addition to CBH, the algorithm simultaneously retrieves other key cloud properties, including cloud optical thickness (COT), cloud effective radius, ice COT fraction as the cloud thermodynamic phase, cloud-top height (CTH), and CGT. Moreover, the algorithm can be seamlessly applied to global clouds composed of liquid, ice, and mixed phases. The SGLI-retrieved CBH exhibited quantitative consistency with CBH data obtained from the ground-based ceilometer network, shipborne ceilometer, and satellite-borne radar and lidar observations, as evidenced by sufficiently high correlations and small biases. These results provide practical evidence that the retrieval of CBH is indeed possible using the SGLI 763 nm channel. Moreover, the results lend credence to the future use of SGLI CBH data, including the estimation of the surface downward longwave radiative flux from clouds. Nevertheless, issues remain that must be addressed to enhance the value of SGLI-derived cloud retrieval products, including the bias of SGLI CTH related to cirrus clouds and the bias of SGLI CBH caused by multi-layer clouds.</p>
Foundation Models for Astrobiology: Paper I -- Workshop and Overview
Ryan Felton, Caleb Scharf, Stuart Bartlett
et al.
Advances in machine learning over the past decade have resulted in a proliferation of algorithmic applications for encoding, characterizing, and acting on complex data that may contain many high-dimensional features. Recently, the emergence of deep-learning models trained across very large datasets has created a new paradigm for machine learning in the form of Foundation Models. Foundation Models are programs trained on very large and broad datasets with an extensive number of parameters. Once built, these powerful and flexible models can be utilized in less resource-intensive ways to build many different downstream applications that can integrate previously disparate, multimodal data. Such applications can be developed rapidly and with a much lower demand for machine learning expertise, and the necessary infrastructure and models are already being established within agencies such as NASA and ESA. At NASA, this work spans several divisions of the Science Mission Directorate, including the NASA Goddard and INDUS Large Language Models and the Prithvi Geospatial Foundation Model; at ESA, initiatives to bring Foundation Models to Earth observation have led to the development of TerraMind. In February 2025, the NASA Ames Research Center and the SETI Institute held a workshop to investigate the potential of Foundation Models for astrobiological research and to determine what steps would be needed to build and utilize such a model or models. This paper shares the findings and recommendations of that workshop and describes clear near-term and future opportunities in the development of a Foundation Model (or Models) for astrobiology applications. These applications would include a biosignature, or life characterization, task; a mission development and operations task; and a natural language task for integrating and supporting astrobiology research needs.
astro-ph.IM, astro-ph.EP
Generative Artificial Intelligence in Medical Imaging: Foundations, Progress, and Clinical Translation
Xuanru Zhou, Cheng Li, Shuqiang Wang
et al.
Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling, including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures, and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering.
Are Foundational Atomistic Models Reliable for Finite-Temperature Molecular Dynamics?
Denan Li, Jiyuan Yang, Xiangkai Chen
et al.
Machine learning force fields have emerged as promising tools for molecular dynamics (MD) simulations, potentially offering quantum-mechanical accuracy with the efficiency of classical MD. Inspired by foundational large language models, recent years have seen considerable progress in developing foundational atomistic models, sometimes referred to as universal force fields, designed to cover most elements in the periodic table. This Perspective adopts a practitioner's viewpoint to ask a critical question: Are these foundational atomistic models reliable for one of their most compelling applications, namely simulating finite-temperature dynamics? Instead of a broad benchmark, we use the canonical ferroelectric-paraelectric phase transition in PbTiO$_3$ as a focused case study to evaluate prominent foundational atomistic models. Our findings suggest a potential disconnect between static accuracy and dynamic reliability. While 0 K properties are often well-reproduced, we observed that the models can struggle to consistently capture the correct phase transition, sometimes exhibiting simulation instabilities. We believe these challenges may stem from inherent biases in training data and a limited description of anharmonicity. These observed shortcomings, though demonstrated on a single system, appear to point to broader, systemic challenges that can be addressed with targeted fine-tuning. This Perspective serves not to rank models, but to initiate a crucial discussion on the practical readiness of foundational atomistic models and to explore future directions for their improvement.
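Post-processing a finite-temperature MD run for a ferroelectric-paraelectric transition typically reduces to tracking a polar order parameter against temperature. The toy locator below, including its threshold and the example numbers, is an illustrative assumption rather than the benchmarking protocol of this Perspective.

```python
import numpy as np

def transition_temperature(temps, polarizations, threshold=0.1):
    """Locate the ferroelectric-paraelectric transition as the first
    temperature at which the averaged polar order parameter falls
    below a threshold (toy analysis of an MD temperature ramp; a
    real study would also need finite-size and hysteresis checks).
    Returns None if the order parameter never drops below threshold."""
    temps = np.asarray(temps, dtype=float)
    p = np.abs(np.asarray(polarizations, dtype=float))
    below = np.nonzero(p < threshold)[0]
    return float(temps[below[0]]) if below.size else None
```

A model that fails to capture the transition would, in this picture, either never cross the threshold or cross it at a temperature far from the experimental value.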
physics.comp-ph, cond-mat.mtrl-sci
CoD: A Diffusion Foundation Model for Image Compression
Zhaoyang Jia, Zihan Zheng, Naifu Xue
et al.
Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce \textbf{CoD}, the first \textbf{Co}mpression-oriented \textbf{D}iffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: \textbf{High compression efficiency}, replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); \textbf{Low-cost and reproducible training}, 300$\times$ faster training than Stable Diffusion ($\sim$ 20 vs. $\sim$ 6,250 A100 GPU days) on entirely open image-only datasets; \textbf{Providing new insights}, e.g., we find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code is released at https://github.com/microsoft/GenCodec/tree/main/CoD.
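PSNR, the distortion metric invoked in the VTM comparison above, has a compact standard definition worth stating. This is a generic sketch, not code from the CoD release.

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE).
    This is the standard pixel-level fidelity metric used in codec
    comparisons; `peak` is the maximum possible pixel value."""
    err = np.asarray(reference, dtype=float) - np.asarray(reconstruction, dtype=float)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Rate, the other axis of the trade-off, is reported in bits per pixel (bpp), i.e. total compressed bits divided by the pixel count.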
A Survey of Foundation Models for Environmental Science
Runlong Yu, Shengyu Chen, Yiqun Xie
et al.
Modeling environmental ecosystems is essential for effective resource management, sustainable development, and understanding complex ecological processes. However, traditional methods frequently struggle with the inherent complexity, interconnectedness, and limited data of such systems. Foundation models, with their large-scale pre-training and universal representations, offer transformative opportunities by integrating diverse data sources, capturing spatiotemporal dependencies, and adapting to a broad range of tasks. This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in forward prediction, data generation, data assimilation, downscaling, model ensembling, and decision-making across domains. We also detail the development process of these models, covering data collection, architecture design, training, tuning, and evaluation. By showcasing these emerging methods, we aim to foster interdisciplinary collaboration and advance the integration of cutting-edge machine learning for sustainable solutions in environmental science.
Universality Frontier for Asynchronous Cellular Automata
Ivan Baburin, Matthew Cook, Florian Grötschla
et al.
In this work, we investigate the computational aspects of asynchronous cellular automata (ACAs), a modification of cellular automata in which cells update independently, following an asynchronous schedule. We introduce flip automata networks (FAN), a simple modification of automata networks that remain robust under any asynchronous update schedule. We show that asynchronous automata can efficiently simulate their synchronous counterparts with a linear memory overhead, which improves upon the previously established quadratic bound. Additionally, we address the universality gap for (a)synchronous cellular automata -- the boundary separating universal and non-universal automata, which is still not fully understood. We tighten this boundary by proving that all one-way asynchronous automata lack universal computational power. Conversely, we establish the existence of a universal 6-state first-neighbor automaton in one dimension and a 3-state von Neumann automaton in two dimensions, which represent the smallest known universal constructions to date.
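The asynchronous update model can be made concrete with a toy 1-D, first-neighbour automaton in which a single uniformly chosen cell fires per step. This sketch illustrates the update schedule only, not the paper's universal constructions or flip automata networks.

```python
import random

def async_steps(state, rule, n_updates, rng):
    """Fully asynchronous schedule: at each step one uniformly chosen
    cell applies the first-neighbour local rule (1-D, periodic
    boundary). `rule` maps (left, centre, right) to the cell's new
    state; `rng` is a seeded random.Random for reproducibility."""
    state = list(state)
    n = len(state)
    for _ in range(n_updates):
        i = rng.randrange(n)                       # pick one cell to fire
        left, centre, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        state[i] = rule(left, centre, right)
    return state
```

With, say, `rule90 = lambda l, c, r: l ^ r`, repeated calls realize one possible asynchronous trajectory of an elementary-CA-style rule; different schedules generally yield different trajectories, which is exactly the robustness question the paper's constructions address.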
Towards Self-Supervised Foundation Models for Critical Care Time Series
Katja Naasunnguaq Jagd, Rachael DeVries, Ole Winther
Domain-specific foundation models for healthcare have expanded rapidly in recent years, yet foundation models for critical care time series remain relatively underexplored due to the limited size and availability of datasets. In this work, we introduce an early-stage pre-trained foundation model for critical care time series based on the Bi-Axial Transformer (BAT), trained on pooled electronic health record datasets. We demonstrate effective transfer learning by fine-tuning the model on a dataset distinct from the training sources for mortality prediction, where it outperforms supervised baselines, particularly for small datasets ($<5,000$). These contributions highlight the potential of self-supervised foundation models for critical care time series to support generalizable and robust clinical applications in resource-limited settings.
HETEAC-Flex: an optimal estimation method for aerosol typing based on lidar-derived intensive optical properties
A. A. Floutsi, H. Baars, U. Wandinger
<p>This study introduces a novel methodology for the characterization of atmospheric aerosol based on lidar-derived intensive optical properties. The proposed aerosol-typing scheme is based on the optimal estimation method (OEM) and allows the identification of up to four different components of an aerosol mixture, as well as the quantification of their contribution to the mixture in terms of relative volume. The four aerosol components considered in this typing scheme are associated with the most commonly observed aerosol particles in nature and are assumed to be physically separated from each other; they can therefore form external mixtures. Two components represent absorbing and less-absorbing fine-mode particles, and the other two represent spherical and non-spherical coarse-mode particles. These components adequately reflect the most frequently observed aerosol types in the atmosphere: combustion- and pollution-related aerosol, sea salt, and desert dust, respectively. In addition, to consolidate the calibration and validation efforts for the upcoming EarthCARE mission, the typing scheme proposed here is in accordance with the Hybrid End-To-End Aerosol Classification (HETEAC) model of EarthCARE. The lidar-derived optical parameters used in this typing scheme are the lidar ratio and the particle linear depolarization ratio at two distinct wavelengths (355 and 532 <span class="inline-formula">nm</span>), the backscatter-related color ratio for the wavelength pair of 532/1064 <span class="inline-formula">nm</span>, and the extinction-related Ångström exponent for the wavelength pair of 355/532 <span class="inline-formula">nm</span>. These intensive optical properties can be combined in different ways, making the methodology flexible and thus applicable to lidar systems with different configurations (e.g., single wavelength or multiwavelength, Raman, high spectral resolution). 
The typing scheme was therefore named HETEAC-Flex due to its compatibility with EarthCARE's HETEAC and its methodological flexibility. The functionality of the typing scheme is demonstrated by its application to three case studies based on layer-averaged optical properties.</p>
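To make the component-mixture idea concrete, the sketch below retrieves volume fractions of four hypothetical components by brute force over the fraction simplex, assuming linear volume-weighted mixing of the intensive properties. Linear mixing is a strong simplification of the actual optimal estimation method, and every component signature in the table is invented for illustration, not a HETEAC-Flex value.

```python
import itertools
import numpy as np

# Hypothetical intensive-property signatures for four components
# (columns: lidar ratio 355 nm, lidar ratio 532 nm, particle linear
# depolarization ratio 532 nm, backscatter colour ratio). Invented numbers.
COMPONENTS = np.array([
    [70.0, 75.0, 0.05, 0.80],   # absorbing fine mode
    [55.0, 50.0, 0.03, 0.90],   # less-absorbing fine mode
    [22.0, 20.0, 0.02, 1.40],   # spherical coarse mode (sea salt)
    [48.0, 45.0, 0.30, 1.20],   # non-spherical coarse mode (dust)
])

def retrieve_fractions(obs, step=0.05):
    """Brute-force search over the volume-fraction simplex for the
    external mixture whose (linearly volume-weighted) intensive
    properties best match the observation, in a relative
    least-squares sense. A toy stand-in for the OEM retrieval."""
    obs = np.asarray(obs, dtype=float)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best, best_cost = None, np.inf
    for f in itertools.product(grid, repeat=3):
        rest = 1.0 - sum(f)           # fourth fraction closes the budget
        if rest < -1e-9:
            continue
        frac = np.array([f[0], f[1], f[2], max(rest, 0.0)])
        cost = np.sum(((frac @ COMPONENTS - obs) / obs) ** 2)
        if cost < best_cost:
            best, best_cost = frac, cost
    return best
```

An OEM solution additionally carries prior constraints and retrieval uncertainties, which this exhaustive search deliberately ignores.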
Model-based evaluation of cloud geometry and droplet size retrievals from two-dimensional polarized measurements of specMACS
L. Volkmer, V. Pörtge, F. Jakub
et al.
<p>Cloud radiative properties play a significant role in radiation and energy budgets and are influenced by both the cloud top height and the particle size distribution. Both cloud top heights and particle size distributions can be derived from 2-D intensity and polarization measurements by the airborne spectrometer of the Munich Aerosol Cloud Scanner (specMACS). The cloud top heights are determined using a stereographic method <span class="cit" id="xref_paren.1">(<a href="#bib1.bibx23">Kölling et al.</a>, <a href="#bib1.bibx23">2019</a>)</span>, and the particle size distributions are derived in terms of the cloud effective radius and the effective variance from multidirectional polarized measurements of the cloudbow <span class="cit" id="xref_paren.2">(<a href="#bib1.bibx37">Pörtge et al.</a>, <a href="#bib1.bibx37">2023</a>)</span>. In this study, the accuracy of the two methods is evaluated using realistic 3-D radiative transfer simulations of specMACS measurements of a synthetic field of shallow cumulus clouds, and possible error sources are determined. The simulations are performed with the 3-D Monte Carlo radiative transport model MYSTIC <span class="cit" id="xref_paren.3">(<a href="#bib1.bibx30">Mayer</a>, <a href="#bib1.bibx30">2009</a>)</span> using cloud data from highly resolved large-eddy simulations (LESs). Both retrieval methods are applied to the simulated data and compared to the respective properties of the underlying cloud field from the LESs. Moreover, the influence of the cloud development on both methods is evaluated by applying the algorithms to idealized simulated data where the clouds did not change during the simulated overflight of 1 min over the cloud field. For the cloud top height retrieval, an absolute mean difference of less than <span class="inline-formula">70</span> m with a standard deviation of about <span class="inline-formula">130</span> m compared to the expected heights from the model is found. 
The elimination of the cloud development as a possible error source results in mean differences of <span class="inline-formula">(46±140)</span> m. For the effective radius, an absolute average difference of about <span class="inline-formula">(-0.2±1.30)</span> µm from the expected effective radius from the LES model input is derived for the realistic simulation and <span class="inline-formula">(-0.03±1.28)</span> µm for the simulation without cloud development. The difference between the effective variance derived from the cloudbow retrieval and the expected effective variance is <span class="inline-formula">(0.02±0.05)</span> for both simulations. 
Additional analysis of the correlation between larger errors in the effective radius or variance and the optical thickness of the observed clouds revealed that low optical-thickness values do not degrade the accuracy of the retrieval.</p>
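The (mean ± standard deviation) statistics quoted for the cloud-top-height and effective-radius comparisons reduce to a simple difference summary between retrieved and model-truth values. This is a generic sketch, not the specMACS evaluation code.

```python
import numpy as np

def bias_and_spread(retrieved, truth):
    """Mean difference (bias) and standard deviation (spread) of
    retrieved minus truth, the 'bias ± spread' statistic used when
    comparing a retrieval against model input fields."""
    diff = np.asarray(retrieved, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.mean(diff)), float(np.std(diff))
```

Applied to retrieved cloud-top heights against the LES truth, the first value corresponds to the quoted mean difference and the second to the quoted standard deviation.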
Towards Graph Foundation Models for Personalization
Andreas Damianou, Francesco Fabbri, Paul Gigioli
et al.
In the realm of personalization, integrating diverse information sources such as consumption signals and content-based representations is becoming increasingly critical for building state-of-the-art solutions. In this regard, two of the biggest trends in research on this subject are Graph Neural Networks (GNNs) and Foundation Models (FMs). While GNNs emerged as a popular solution in industry for powering personalization at scale, FMs have only recently caught attention for their promising performance in personalization tasks like ranking and retrieval. In this paper, we present a graph-based foundation modeling approach tailored to personalization. Central to this approach is a Heterogeneous GNN (HGNN) designed to capture multi-hop content and consumption relationships across a range of recommendable item types. To ensure the generality required of a Foundation Model, we employ a Large Language Model (LLM) text-based featurization of nodes that accommodates all item types, and construct the graph using co-interaction signals, which inherently transcend content specificity. To facilitate practical generalization, we further couple the HGNN with an adaptation mechanism based on a two-tower (2T) architecture, which also operates agnostically to content type. This multi-stage approach ensures high scalability: while the HGNN produces general-purpose embeddings, the 2T component models the sheer volume of user-item interaction data in a continuous space. Our comprehensive approach has been rigorously tested and proven effective in delivering recommendations across a diverse array of products within a real-world, industrial audio streaming platform.
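The two-tower adaptation stage can be caricatured as inner-product scoring between a user embedding and candidate item embeddings (for instance, embeddings produced upstream by an HGNN). This generic 2T sketch is an assumption about the setup, not the paper's architecture.

```python
import numpy as np

def two_tower_scores(user_vec, item_matrix):
    """Retrieval-style scoring of a two-tower (2T) head: one tower
    yields the user embedding, the other yields candidate item
    embeddings (rows of item_matrix), and relevance is the inner
    product, so ranking reduces to a matrix-vector multiply."""
    return np.asarray(item_matrix, dtype=float) @ np.asarray(user_vec, dtype=float)
```

Because scoring is a single matrix-vector product, candidate ranking stays cheap even when the item catalogue, and the interaction data behind it, is very large, which is the scalability argument for this design.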
Foundations and Recent Trends in Multimodal Mobile Agents: A Survey
Biao Wu, Yanda Li, Zhiwei Zhang
et al.
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demands for agents that can adapt in real time and process multimodal data have grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advancements that enhance real-time adaptability and multimodal interaction. Recent evaluation benchmarks have been developed to better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agents' performance. We then categorize these advancements into two main approaches: prompt-based methods, which utilize large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. Additionally, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at https://github.com/aialt/awesome-mobile-agents
Leveraging Foundation Models for Crafting Narrative Visualization: A Survey
Yi He, Ke Xu, Shixiong Cao
et al.
Narrative visualization transforms data into engaging stories, making complex information accessible to a broad audience. Foundation models, with their advanced capabilities such as natural language processing, content generation, and multimodal integration, hold substantial potential for enriching narrative visualization. Recently, a collection of techniques has been introduced for crafting narrative visualizations based on foundation models from different aspects. We build our survey upon 66 papers to study how foundation models can progressively engage in this process and propose a reference model categorizing the reviewed literature into four essential phases: Analysis, Narration, Visualization, and Interaction. Furthermore, we identify eight specific tasks (e.g., Insight Extraction and Authoring) where foundation models are applied across these stages to facilitate the creation of visual narratives. Detailed descriptions, related literature, and reflections are presented for each task. To make this survey a more impactful and informative resource for diverse readers, we discuss key research problems and present the strengths and weaknesses of foundation models in each task, guiding readers in identifying and seizing opportunities while navigating challenges in this field.
Does Data-Efficient Generalization Exacerbate Bias in Foundation Models?
Dilermando Queiroz, Anderson Carlos, Maíra Fatoretto
et al.
Foundation models have emerged as robust, label-efficient models across diverse domains. In medical imaging, these models contribute to the advancement of medical diagnosis, given the difficulty of obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines bias in a foundation model (RetFound) when it is fine-tuned on the Brazilian Multilabel Ophthalmological Dataset (BRSET), whose population differs from that of the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the foundation model has the potential to reduce the gap between the maximum and minimum AUC across gender and age groups. However, under data-efficient generalization, the model's bias increases as the amount of data decreases. These findings suggest that when deploying a foundation model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
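The max-min AUC gap used as the fairness measure above can be computed as in the following sketch (toy labels, scores, and subgroups are invented for illustration; this is not the paper's evaluation code). The AUC is computed via the Mann-Whitney formulation, and the gap is the spread of per-subgroup AUCs.

```python
import numpy as np

def auc(y_true, y_score):
    """Mann-Whitney AUC: probability that a positive outranks a negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auc_gap(y_true, y_score, groups):
    """Max-min AUC gap across subgroups (e.g. gender or age bins)."""
    aucs = [auc(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)]
    return max(aucs) - min(aucs)

# Toy example: two subgroups, perfect ranking in group 0, chance-level in group 1.
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])
s = np.array([0.9, 0.2, 0.8, 0.4, 0.6, 0.7, 0.5, 0.3])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = auc_gap(y, s, g)  # large gap signals unequal performance across groups
```

A gap near zero means the model ranks cases equally well in every subgroup; the abstract's finding is that this gap widens as the fine-tuning data shrinks.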
Are foundation models for computer vision good conformal predictors?
Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed
et al.
Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has received little attention. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
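The APS procedure highlighted in the abstract above can be sketched as a small split-conformal example. The calibration probabilities and labels below are invented for illustration, and none of the paper's models or benchmarks are used: the APS score of a calibration point is the cumulative probability mass needed to include the true class, and the prediction set keeps classes until that mass reaches a calibrated quantile.

```python
import numpy as np

def aps_scores(probs, labels):
    """APS nonconformity: cumulative mass down to and including the true class."""
    order = np.argsort(-probs, axis=1)                    # descending by prob
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1)   # rank of true class
    return cum[np.arange(len(labels)), ranks]

def aps_predict(probs, qhat):
    """Prediction set: keep top classes until cumulative mass reaches qhat."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, qhat)) + 1               # smallest covering prefix
    return set(order[:min(k, len(probs))])

# Hypothetical softmax outputs for 5 calibration points over 3 classes.
cal_probs = np.array([[0.7, 0.2, 0.1],
                      [0.5, 0.4, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.3, 0.1]])
cal_labels = np.array([0, 1, 1, 2, 0])
alpha = 0.1                                               # target 90% coverage
scores = aps_scores(cal_probs, cal_labels)
n = len(scores)
level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)      # finite-sample correction
qhat = np.quantile(scores, level, method="higher")
pred_set = aps_predict(np.array([0.6, 0.35, 0.05]), qhat)
```

Adaptivity comes from the set size: confident test points yield singleton sets, while ambiguous ones accumulate more classes before reaching `qhat`, which is what makes APS attractive for the vision foundation models studied above.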