Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.
Leonardo Alchieri, Matteo Garzon, Lidia Alecci
et al.
Foundation models have recently extended beyond natural language and vision to timeseries domains, including physiological signals. However, progress in electrodermal activity (EDA) modeling is hindered by the absence of large-scale, curated, and openly accessible datasets. EDA reflects sympathetic nervous system activity and is widely used to infer cognitive load, stress, and engagement. Yet very few wearable devices provide continuous, unobtrusive sensing, and the only large-scale archive to date is proprietary. To address this gap, we compile EDAMAME, a collection of EDA traces from 24 public datasets, comprising more than 25,000 hours from 634 users. Using this resource, we train UME, the first dedicated foundation model for EDA. In eight out of ten scenarios, UME outperforms baselines and matches generalist timeseries foundation models while using 20x fewer computational resources. Our findings, however, also highlight the intrinsic challenges of EDA modeling, motivating further research to unlock its full potential. All datasets, model weights, and code are released to support further research.
Mouïn Ben Ammar, Arturo Mendoza, Nacim Belkhir
et al.
In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.
Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF's slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and prediction of response to the major anticancer therapies (cytotoxic chemotherapy, targeted therapy, and immunotherapy), across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.
Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative: democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.
Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum toward learning high-quality solutions. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across local optima in policy space. We propose a family of approaches: (1) \textbf{Vanilla Foundation-Model Self-Play (vFMSP)} continually refines agent policies via competitive self-play; (2) \textbf{Novelty-Search Self-Play (NSSP)} builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, \textbf{Quality-Diversity Self-Play (QDSP)}, creates a diverse set of high-quality policies by combining the diversity of NSSP and the refinement of vFMSP. We evaluate FMSPs in Car Tag, a continuous-control pursuer-evader setting, and in Gandalf, a simple AI safety simulation in which an attacker tries to jailbreak an LLM's defenses. In Car Tag, FMSPs explore a wide variety of reinforcement learning, tree search, and heuristic-based methods, to name just a few. In terms of discovered policy quality, QDSP and vFMSP surpass strong human-designed strategies. In Gandalf, FMSPs can successfully and automatically red-team an LLM, breaking through six different, progressively stronger levels of defense. Furthermore, FMSPs can automatically proceed to patch the discovered vulnerabilities. Overall, FMSPs represent a promising new research frontier for improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
<p>The Greenhouse gas Emissions Monitoring network to Inform Net-zero Initiatives for the UK (GEMINI-UK) includes ten Bruker EM27/SUN instruments located across the UK that collect dry average volume mixing ratios of <span class="inline-formula">CO<sub>2</sub></span> and methane (<span class="inline-formula">XCO<sub>2</sub></span> and <span class="inline-formula">XCH<sub>4</sub></span>). The primary objective of GEMINI-UK is to infer regional net flux estimates of <span class="inline-formula">CO<sub>2</sub></span> and methane across the UK that can be used to provide actionable information to the UK Government. The instruments are housed in bespoke autonomous weatherproof enclosures that help maximize cloud-free data collection throughout the calendar year. The network will become fully operational in late 2025. As part of our commissioning phase, we designed the network so it would deliver the biggest uncertainty reduction in net <span class="inline-formula">CO<sub>2</sub></span> fluxes, based on prior emission inventories. The ten sites are located at UK education institutions and a national scientific research laboratory, underlining our commitment to make these data openly available to all. In this study, we use a series of closed-loop numerical experiments for the nominal calendar year of 2019 to quantify the theoretical benefit of using this new ground-based remote sensing network, accounting for cloudy scenes, to estimate spatially resolved net fluxes of <span class="inline-formula">CO<sub>2</sub></span> and methane across the UK. 
Based on our results, we expect that GEMINI-UK will deliver significant error reductions in <span class="inline-formula">CO<sub>2</sub></span> flux estimates, up to 51 <span class="inline-formula">%</span> in January and up to 59 <span class="inline-formula">%</span> in July (GEMINI-UK only), and up to 8 <span class="inline-formula">%</span> in January and up to 24 <span class="inline-formula">%</span> in July when combined with existing tall tower measurements collected across England and Ireland. Although the network was optimally designed to enhance our understanding of UK <span class="inline-formula">CO<sub>2</sub></span> fluxes, we expect, based on our calculations, that GEMINI-UK will also substantially reduce uncertainties in methane emission estimates, achieving a priori error reductions of up to 55 <span class="inline-formula">%</span> in January and up to 75 <span class="inline-formula">%</span> in July (GEMINI-UK only), and up to 10 <span class="inline-formula">%</span> in January and up to 29 <span class="inline-formula">%</span> in July when combined with the existing tall tower sites. In the context of augmenting the information collected by the established tall tower network, we find that GEMINI-UK data have the greatest potential over high flux regions in the central and southern parts of the UK during winter months, and over broader southern to northern regions during the summer months. More broadly, the data collected by GEMINI-UK will also provide the basis to evaluate satellite observations of these trace gases, thereby providing confidence in their ability to supplement data collected by GEMINI-UK and the tall tower network.</p>
<p>Meteorological satellite data have been extensively utilized in global numerical weather prediction systems and have had a positive impact on forecast accuracy. To assimilate satellite radiance observations correctly, data assimilation systems require the systematic observation biases to be corrected so that they conform to a Gaussian distribution with a mean of 0. By selecting appropriate air-mass predictors through correlation assessment, a two-step bias correction scheme (namely the scan-angle bias correction and the air-mass bias correction) is established in this paper based on radiance observations of FY-3E/HIRAS-II from 1 to 31 January 2023. The results indicate that the FY-3E/HIRAS-II <span class="inline-formula"><i>O</i>−<i>B</i></span> (observation minus simulation) bias exhibits a scan-angle dependence from nadir to the limb field of view. Statistical analysis shows that this scan-angle bias does not depend on latitude band. After scan-angle bias correction using statistical scan-angle correction coefficients, the dependence of the <span class="inline-formula"><i>O</i>−<i>B</i></span> biases on the scan angle is eliminated. The second step is air-mass correction. Our correction scheme is compared with the air-mass bias correction scheme in NCEP-GSI. Although the NCEP-GSI scheme also considers the scan-angle influence, it does not account for the effect of atmospheric water vapor. Consequently, its correction is less effective for channels whose weighting functions peak lower in the atmosphere, leaving a slight residual positive bias after correction. The combination of air-mass predictors (model surface skin temperature, model total column water vapor, thickness of 1000–300 hPa, and thickness of 200–50 hPa) selected through importance assessment in this study effectively eliminates the air-mass biases. 
The systematic biases between observed brightness temperatures and brightness temperatures simulated from the background atmospheric field decrease significantly for all HIRAS-II channels after bias correction, and the bias distribution essentially follows a Gaussian distribution with a mean of 0. The FY-3E/HIRAS-II data assimilation experiments show that the scheme using the selected air-mass predictors (EXP-2) is the most effective of the four experiments. The mean <span class="inline-formula"><i>O</i>−<i>B</i></span> and <span class="inline-formula"><i>O</i>−<i>A</i></span> in all channels are the smallest after bias correction. Compared with the independent ERA5 objective analysis fields, the EXP-2 scheme significantly improves the temperature analysis in the upper air and near the surface. The water vapor profiles of the EXP-2 scheme are the closest to ERA5 at almost all height levels.</p>
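The two-step scheme described above can be sketched on synthetic data. The predictor names follow the abstract, but all magnitudes, coefficients, and the least-squares form of the air-mass fit below are illustrative assumptions, not the operational implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic O-B (observation minus simulation) biases for one channel:
# a scan-angle-dependent part plus an air-mass-dependent part (all
# magnitudes here are invented for illustration).
n_obs, n_angles = 5000, 29
scan_idx = rng.integers(0, n_angles, n_obs)            # field-of-view index
scan_bias = 0.001 * (scan_idx - n_angles // 2) ** 2    # grows toward the limb
# Stand-ins for the four predictors: skin temperature, total column water
# vapor, 1000-300 hPa thickness, 200-50 hPa thickness (standardized).
X = rng.normal(size=(n_obs, 4))
airmass_bias = X @ np.array([0.3, -0.2, 0.15, 0.1])
omb = scan_bias + airmass_bias + rng.normal(0.0, 0.1, n_obs)

# Step 1: scan-angle correction -- subtract the mean O-B per scan position
# (legitimate here because the scan-angle bias is latitude-independent).
angle_mean = np.array([omb[scan_idx == k].mean() for k in range(n_angles)])
omb1 = omb - angle_mean[scan_idx]

# Step 2: air-mass correction -- least-squares fit of the residual bias on
# the predictors (plus an intercept), then remove the fitted part.
Xd = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(Xd, omb1, rcond=None)
corrected = omb1 - Xd @ coef

print(f"mean O-B before {omb.mean():+.3f} K, after {corrected.mean():+.3f} K")
```

After both steps the residual O−B mean is near zero and its spread shrinks to roughly the noise level, mimicking the Gaussian, zero-mean distribution the assimilation system requires.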
<p>Smartphone pressure observations have demonstrated significant potential to complement traditional pressure monitoring. However, challenges remain in correcting biases and further leveraging these observations for practical applications. In this study, we used tropical cyclones (TCs) Lekima in 2019, Hagupit in 2020 and In-fa in 2021 as examples to conduct bias correction on labeled smartphone pressure data from the Moji Weather app. We propose a quality control procedure utilizing random forest machine learning models. By applying this quality control approach to the selected TCs, we discovered that the performance of the method for labeled data significantly surpassed that for unlabeled data developed in a previous study, reducing the mean absolute error from 3.105 to 0.904 <span class="inline-formula">hPa</span>. The bias-corrected smartphone data were then supplemented with weather station data for sea-level-pressure analyses and compared with the analyses that used only weather station data. The significantly higher spatial resolution and broader coverage of the smartphone data led to notable differences between the two analysis fields. Additionally, we compared the minimum sea-level pressure of TCs derived from smartphone data, weather station observations and the best-track dataset from the Shanghai Typhoon Institute (STI) of the China Meteorological Administration. We found that the best track published by STI consistently underestimated the minimum sea-level pressure, with a median difference of 0.51 <span class="inline-formula">hPa</span> in the three TC cases.</p>
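As a rough illustration of the random-forest quality-control idea, the sketch below trains scikit-learn's RandomForestRegressor to predict, and then remove, the bias of synthetic smartphone pressure readings. The features, data, and error magnitudes are invented for the example and do not reproduce the Moji Weather pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in for labeled smartphone records: hypothetical features (e.g.
# phone altitude, temperature, speed, raw pressure) and the pressure bias
# relative to a collocated reference, in hPa. Data are synthetic.
n = 4000
X = rng.normal(size=(n, 4))
bias = 2.0 * np.tanh(X[:, 0]) - 1.5 * X[:, 1] + rng.normal(0.0, 0.3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, bias, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Correction: subtract the predicted bias from each held-out observation.
mae_before = np.abs(y_te).mean()
mae_after = np.abs(y_te - model.predict(X_te)).mean()
print(f"MAE before {mae_before:.3f} hPa, after {mae_after:.3f} hPa")
```

The held-out mean absolute error drops substantially after subtracting the predicted bias, which is the same before/after comparison the study reports (3.105 versus 0.904 hPa on real labeled data).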
Neural networks are increasingly evolving towards training large models with big data, a method that has demonstrated superior performance across many tasks. However, this approach introduces an urgent problem: current deep learning models are predominantly serial, meaning that as the number of network layers increases, so do the training and inference times. This is unacceptable if deep learning is to continue advancing. Therefore, this paper proposes a deep learning parallelization strategy based on the Universal Approximation Theorem (UAT). From this foundation, we designed a parallel network called Para-Former to test our theory. Unlike traditional serial models, the inference time of Para-Former does not increase with the number of layers, significantly accelerating the inference speed of multi-layer networks. Experimental results validate the effectiveness of this network.
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-$Σ$) have demonstrated extraordinary performance in key technological areas, such as natural language processing and visual recognition, and have become the mainstream trend of artificial general intelligence. This has led more and more major technology giants to dedicate significant human and financial resources to actively develop their foundation model systems, which drives continuous growth of these models' parameters. As a result, the training and serving of these models have posed significant challenges, including substantial computing power, memory consumption, bandwidth demands, etc. Therefore, employing efficient training and serving strategies becomes particularly crucial. Many researchers have actively explored and proposed effective methods. A comprehensive survey of these methods is therefore essential for system developers and researchers. This paper extensively explores the methods employed in training and serving foundation models from various perspectives. It provides a detailed categorization of these state-of-the-art methods, including finer aspects such as network, computing, and storage. Additionally, the paper summarizes the challenges and presents a perspective on the future development direction of foundation model systems. Through comprehensive discussion and analysis, this paper aims to provide a solid theoretical basis and practical guidance for future research and applications, promoting continuous innovation and development in foundation model systems.
In computational pathology, several foundation models have recently emerged and demonstrated enhanced learning capability for analyzing pathology images. However, adapting these models to various downstream tasks remains challenging, particularly when faced with datasets from different sources and acquisition conditions, as well as limited data availability. In this study, we benchmark four pathology-specific foundation models across 14 datasets under two scenarios, consistency assessment and flexibility assessment, covering diverse adaptation settings and downstream tasks. In the consistency assessment scenario, involving five fine-tuning methods, we found that the parameter-efficient fine-tuning approach was both efficient and effective for adapting pathology-specific foundation models to diverse datasets within the same downstream task. In the flexibility assessment scenario under data-limited environments, utilizing five few-shot learning methods, we observed that the foundation models benefited more from few-shot learning methods that modify the model during the testing phase only. These findings provide insights that could guide the deployment of pathology-specific foundation models in real clinical settings, potentially improving the accuracy and reliability of pathology image analysis. The code for this study is available at: https://github.com/QuIIL/BenchmarkingPathologyFoundationModels.
Foundation Models are neural networks that are capable of simultaneously solving many problems. Large Language Foundation Models like ChatGPT have revolutionized many aspects of daily life, but their impact for science is not yet clear. In this paper, we use a new Foundation Model for hadronic jets to solve three key challenges in collider physics. In particular, we show how experiments can (1) save significant computing power when developing reconstruction algorithms, (2) perform a complete uncertainty quantification for high-dimensional measurements, and (3) search for new physics with model agnostic methods using low-level inputs. In each case, there are significant computational or methodological challenges with current methods that limit the science potential of deep learning algorithms. By solving each problem, we take jet Foundation Models beyond proof-of-principle studies and into the toolkit of practitioners.
As natural image understanding moves towards the pretrain-finetune era, research in pathology imaging is evolving in parallel. Despite the predominant focus on pretraining pathological foundation models, how to adapt foundation models to downstream tasks remains little explored. For downstream adaptation, we propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework designed to efficiently adapt pathological or even visual foundation models to pathology-specific tasks via multi-modal prompt tuning. The proposed framework leverages Task-specific Visual Prompts and Task-specific Textual Prompts to identify task-relevant features, along with Instance-specific Visual Prompts for encoding single pathological image features. Results across multiple datasets at both patch-level and WSI-level demonstrate its superior performance over single-modality prompt tuning approaches. Significantly, PathoTune facilitates the direct adaptation of natural visual foundation models to pathological tasks, drastically outperforming pathological foundation models with simple linear probing. The code is available at https://github.com/openmedlab/PathoDuet.
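To make the prompt-tuning recipe concrete, here is a minimal PyTorch sketch of the general idea: learnable prompt tokens prepended to the input of a frozen encoder, with only the prompts and a linear head trainable. Module choices and dimensions are arbitrary; this is not the PathoTune architecture itself:

```python
import torch
import torch.nn as nn

# Sketch of visual prompt tuning: a frozen "foundation" encoder receives
# extra learnable prompt tokens; only the prompts and a linear head would
# be updated during adaptation. All sizes here are illustrative.
d_model, n_patches, n_task_prompts, n_classes = 64, 16, 4, 2

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():          # freeze the foundation model
    p.requires_grad_(False)

task_prompts = nn.Parameter(torch.zeros(1, n_task_prompts, d_model))
head = nn.Linear(d_model, n_classes)

x = torch.randn(8, n_patches, d_model)  # patch embeddings of 8 images
prompts = task_prompts.expand(x.size(0), -1, -1)
h = encoder(torch.cat([prompts, x], dim=1))
logits = head(h[:, :n_task_prompts].mean(dim=1))
print(tuple(logits.shape))
```

Instance-specific prompts would be produced per image (e.g. by a small network conditioned on the input) and concatenated alongside the task prompts in the same way.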
<p>A large convection–cloud chamber has the potential to produce drizzle-sized droplets, thus offering a new opportunity to investigate aerosol–cloud–drizzle interactions at a fundamental level under controlled environmental conditions. One key measurement requirement is the development of methods to detect the low-concentration drizzle drops in such a large cloud chamber. In particular, remote sensing methods may overcome some limitations of in situ methods.</p>
<p>Here, the potential of an ultrahigh-resolution radar to detect the radar return signal of a small drizzle droplet against the cloud droplet background signal is investigated. It is found that using a small sampling volume is critical to drizzle detection in a cloud chamber to allow a drizzle drop in the radar sampling volume to dominate over the background cloud droplet signal. For instance, a radar volume of 1 cubic centimeter (cm<span class="inline-formula"><sup>3</sup>)</span> would enable the detection of drizzle embryos with diameter larger than 40 <span class="inline-formula">µ</span>m. However, the probability of drizzle sampling also decreases as the sample volume reduces, leading to a longer observation time. Thus, the selection of radar volume should consider both the signal power and the drizzle occurrence probability. Finally, observations from the Pi Convection–Cloud Chamber are used to demonstrate the single-drizzle-particle detection concept using small radar volume. The results presented in this study also suggest new applications of ultrahigh-resolution cloud radar for atmospheric sensing.</p>
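The sixth-power dependence behind this argument can be checked with a short calculation; the cloud droplet size and concentration below are assumed round numbers for illustration, not values measured in the Pi Chamber:

```python
# In the Rayleigh regime the backscattered power from a droplet scales as
# the sixth power of its diameter, so one drizzle embryo can dominate a
# small radar volume. Sizes and concentration are assumed values.
d_cloud_um = 10.0          # typical cloud droplet diameter (um)
n_cloud_per_cm3 = 100.0    # assumed cloud droplet concentration (cm^-3)
d_drizzle_um = 40.0        # drizzle embryo diameter (um)

z_cloud = n_cloud_per_cm3 * d_cloud_um ** 6   # background in a 1 cm^3 volume
z_drizzle = d_drizzle_um ** 6                 # one drizzle drop

print(f"one 40 um drop vs cloud background: {z_drizzle / z_cloud:.1f}x")
```

Under these assumptions a single 40 µm drop returns roughly 40 times the combined signal of the cloud droplet background in 1 cm³, which is why shrinking the sampling volume makes individual drizzle embryos detectable.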
Aran Nayebi, Rishi Rajalingham, Mehrdad Jazayeri
et al.
Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts to address this question directly. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-centric objectives, to models that future predict in the latent space of purely static image-based or dynamic video-based pretrained foundation models. We find strong differentiation across these model classes in their ability to predict neural and behavioral data both within and across diverse environments. In particular, we find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation are thus far most consistent with being optimized to future predict on dynamic, reusable visual representations that are useful for Embodied AI more generally.
<p>Analyses of the mesosphere and lower thermosphere suffer from a lack of global measurements. This is problematic because this region has a complex dynamic structure, with gravity waves playing an important role. A limb-sounding spatial heterodyne interferometer (SHI) was developed to obtain atmospheric temperature retrieved from the <span class="inline-formula">O<sub>2</sub></span> A-band emission, which can be used to derive gravity wave parameters in this region. The 2-D spatial distribution of the atmospheric scene is captured by a focal plane array. The SHI superimposes the spectral information onto the horizontal axis across the line-of-sight (LOS). In the usual case, the instrument exploits the horizontal axis to obtain spectral information and uses the vertical axis to get spatial information, i.e. temperature observations at the corresponding tangent points. This results in a finely resolved 1-D vertical atmospheric temperature profile. However, this method does not make use of the horizontal across-LOS information contained in the data.</p>
<p>In this paper a new processing method is investigated, which uses single-sided interferograms to gain horizontal across-LOS information about the observed temperature field. To this end, the interferogram is split at the centre of the horizontal axis, and each side is mirrored. Each side can then be used to retrieve an individual 1-D temperature profile. The location of the two retrieved temperature profiles is analysed using prescribed horizontal temperature variations, as needed for deriving wave parameters. We show that it is feasible to derive two independent temperature profiles, which, however, increases the demands on calibration and processing accuracy.</p>
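The split-and-mirror step can be illustrated on a toy 1-D interferogram with NumPy. The fringe frequency and sampling below are invented, and real processing involves apodization, phase correction, and calibration that the sketch omits:

```python
import numpy as np

# Toy 1-D interferogram along the horizontal (spectral) detector axis: a
# cosine fringe whose frequency encodes one emission line.
n = 512
x = np.arange(n) - n // 2
sigma = 0.11                                  # fringe frequency (cycles/sample)
interferogram = 1.0 + np.cos(2 * np.pi * sigma * x)

# Split at the centre and mirror each half: two single-sided
# interferograms, each of which can be transformed on its own and thus
# yields its own spectrum, i.e. one temperature retrieval per side.
left, right = interferogram[: n // 2], interferogram[n // 2 :]
left_sym = np.concatenate([left, left[::-1]])
right_sym = np.concatenate([right[::-1], right])

spec_left = np.abs(np.fft.rfft(left_sym - left_sym.mean()))
spec_right = np.abs(np.fft.rfft(right_sym - right_sym.mean()))

# Both sides recover the same line position in this idealized case.
print(int(np.argmax(spec_left)), int(np.argmax(spec_right)))
```

In the instrument the two sides view slightly different air masses along the line of sight, so their spectra, and hence the retrieved temperatures, can legitimately differ, which is exactly the horizontal information the method exploits.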