Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
Yunbo Long, Liming Xu, Alexandra Brintrup
Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking the assessment of their effectiveness in preserving realistic event sequences and coherent entity relationships across columns. This paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset. Experimental results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data.
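A check on hierarchical consistency of the kind mentioned above could be sketched as follows. This is not one of the paper's three metrics, which the abstract does not define. It is a minimal illustration that measures the share of synthetic rows whose (child, parent) pair also occurs in the real data. The function name `hierarchy_consistency_rate`, the column names, and the toy rows are all hypothetical.

```python
def hierarchy_consistency_rate(real_rows, synthetic_rows, child, parent):
    """Fraction of synthetic rows whose (child, parent) pair occurs in the real data."""
    # Every pair observed in the real data is treated as a valid hierarchy link.
    valid_pairs = {(row[child], row[parent]) for row in real_rows}
    if not synthetic_rows:
        return 0.0
    hits = sum((row[child], row[parent]) in valid_pairs for row in synthetic_rows)
    return hits / len(synthetic_rows)


# Toy data (hypothetical): one synthetic row places Berlin in France.
real = [
    {"city": "Paris", "country": "France"},
    {"city": "Lyon", "country": "France"},
    {"city": "Berlin", "country": "Germany"},
]
synthetic = [
    {"city": "Paris", "country": "France"},
    {"city": "Berlin", "country": "France"},
    {"city": "Lyon", "country": "France"},
    {"city": "Berlin", "country": "Germany"},
]

rate = hierarchy_consistency_rate(real, synthetic, "city", "country")
print(rate)
```

A rate below 1.0 means the generator produced combinations that break the hierarchy seen in the real data, even if each column's marginal distribution looks plausible.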
A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning
Anan Yaghmour, Melba M. Crawford, Saurabh Prasad
Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.
The Probability of Food Security: A new longitudinal data set using the Panel Study of Income Dynamics
Seungmin Lee, John Hoddinott, Christopher B. Barrett
et al.
The study of food security dynamics in the U.S. has long been impeded by the lack of extended longitudinal observations of the same households or individuals. This paper applies a newly-introduced household-level food security measure, the probability of food security (PFS), to 26 waves of Panel Study of Income Dynamics (PSID) data, spanning 1979-2019, to generate a data product we describe and make newly available to the research community. We detail the construction of this unprecedentedly long food security panel data series in PSID data. Finally, we estimate key subpopulation- and national-level food security dynamics identifiable over the 40-year (1979-2019) period spanning multiple recessions and federal nutrition assistance policy changes, including disaggregated dynamics based on geography, race, sex, and educational attainment.
NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Zheyuan Zhang, Runze Li, Tasnim Kabir
et al.
Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling
Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger
Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model's integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
Genome-resolved metatranscriptomics unveils distinct microbial functionalities across aggregate sizes in aerobic granular sludge
A.Y.A. Mohamed, Laurence Gill, Alejandro Monleon
et al.
Microbial aggregates of different sizes in aerobic granular sludge (AGS) systems have been shown to exhibit distinct microbial community compositions. However, studies comparing the microbial activities of different-sized aggregates in AGS systems remain limited. In this study, genome-resolved metatranscriptomics was used to investigate microbial activity patterns within differently sized aggregates in a full-scale AGS plant. Our analysis revealed a weak correlation between the relative abundance of metagenome-assembled genomes (MAGs) and their transcriptomic activity, indicating that microbial abundance does not directly correspond to metabolic activity within the system. Flocculent sludge (FL; <0.2 mm) predominantly featured active nitrifiers and fermentative polyphosphate-accumulating organisms (PAOs) from Candidatus Phosphoribacter, while small granules (SG; 0.2–1.0 mm) and large granules (LG; >1.0 mm) hosted more metabolically active PAOs affiliated with Ca. Accumulibacter. Differential gene expression analysis further supported these findings, demonstrating significantly higher expression levels of key phosphorus uptake genes associated with Ca. Accumulibacter in granular sludge (SG and LG) compared to flocculent sludge. Conversely, Ca. Phosphoribacter showed higher expression of these genes in the FL fraction. This study highlights distinct functional roles and metabolic activities of crucial microbial communities depending on aggregate size within AGS systems, offering new insights into optimizing wastewater treatment processes.
Environmental sciences, Environmental technology. Sanitary engineering
Mexico's High Resolution Climate Database (MexHiResClimDB): a new daily high-resolution gridded climate dataset for Mexico covering 1951–2020
J. J. Carrera-Hernández
This work presents Mexico's High Resolution Climate Database (MexHiResClimDB), which is a newly developed gridded, high-resolution climate dataset comprised of daily, monthly and yearly precipitation and temperature ($T_{\min}$, $T_{\max}$, $T_{\mathrm{avg}}$). This new database provides the largest temporal coverage of the aforementioned climate variables at the highest spatial resolution (20 arcsec, or 560 m on Mexico's CCL projection) when compared to the other currently available gridded datasets for Mexico, and its development has allowed for the analysis of the country's climate extremes for the 1951–2020 period. By comparing the spatial distribution of precipitation from the MexHiResClimDB with other gridded data (Daymet, L15, CHIRPS and PERSIANN CDR), it was found that the precipitation provided by this new dataset adequately represents the spatial variation of extreme precipitation events, in particular for the precipitation that occurred during 15–16 September 2013, caused by the presence of Tropical Storm Manuel in the Pacific Ocean and Hurricane Ingrid (Cat 1) in the Gulf of Mexico. Using data from 61 days retrieved from Automated Weather Stations located throughout Mexico – and corresponding to the two months with the largest precipitation in Mexico – it was found that precipitation data from MexHiResClimDB has the lowest MAE (8.7 mm), compared to those of L15 (9.5 mm), Daymet (10.1 mm) and CHIRPS (11.7 mm). For $T_{\min}$ and $T_{\max}$, the lowest MAE was obtained with MexHiResClimDB (1.7 and 1.8 °C, respectively), followed by Daymet (2.0 °C for both temperatures) and L15 (2.4 and 2.5 °C).
With this new database an analysis of the extreme events of precipitation and temperature in Mexico for the 1951–2020 period was undertaken: the wettest year was 1958, the wettest day 26 September 1970, and September of 2013 the wettest month. It was also found that eight out of the ten days with the highest $T_{\min}$ occurred in 2020, that the two months with the highest $T_{\min}$ were July and August of 2020, and that the six years with the highest $T_{\min}$ were 2015–2020. When $T_{\max}$ was analysed, it was found that the hottest day was 15 June 1998, while June of 1998 was the hottest month and 2020 the hottest year, and that the four hottest years occurred between 2011 and 2020. Nationwide (and considering 1961–1990 as the baseline period), $T_{\min}$, $T_{\mathrm{avg}}$ and $T_{\max}$ have increased, with their anomalies drastically increasing in recent years and reaching values above 1.0 °C in 2020. At the same time, precipitation has decreased in recent years – which combined with the increase in temperature will have severe impacts on water availability. This new database provides a tool to quantify – in detail – the spatio-temporal variability of climate throughout Mexico.
The entire MexHiResClimDB dataset is available on Figshare (https://doi.org/10.6084/m9.figshare.c.7689428.v2, Carrera-Hernández, 2025a).
Environmental sciences, Geology
Lessons Learned from Developing a Massive Open Online Course (MOOC) to Support Citizen Scientists in Africa
Fiona Preston-Whyte, Toshka Barnardo, Danica Marlin
et al.
Data gaps limit solutions and policy development for environmental issues. Citizen science offers a possible solution to reduce data gaps at a limited cost while enhancing environmental education (EE). While highly effective in the latter, citizen science campaigns rarely produce reliable, comparable, and meaningful data. This often results from fragmented awareness, varying data collection methods, and little training prior to data collection. This article explores how Massive Open Online Courses (MOOCs) can be used to train citizen scientists, increase the value of citizen science data, and ensure that resources invested in citizen science initiatives are used more efficiently. We use a beach macrolitter monitoring course developed by Sustainable Seas Trust (SST) (NGO/NPO) and GRID-Arendal (a research foundation) as a case study in Africa, since the marine litter issue has widespread public support, and beaches are pleasant locations that attract potential citizen scientists. Beach macrolitter surveys utilise everyday equipment, and monitoring methods are simple if individuals are supported with appropriate training. This is especially relevant in Africa, where plastic pollution is forecast to increase faster than in other regions, and resources for research can be limited. This article gives a modified problem-solution model (mPSM) perspective, considering the challenges and solutions of MOOC development by two organisations working in the same space with limited resources. Challenges to inclusivity for online training in Africa included language barriers and limited technological access. Using Africa as a case study, we show that by combining professional abilities, inclusive digital education can be achieved using data-light MOOCs, offline engagement and other inclusive strategies to overcome the challenges of m- (mobile) and e- (electronic) learning.
This kind of EE can be a powerful tool in developing reliable data while enhancing citizens’ agency in working towards Sustainable Development Goals (SDGs).
Education, Environmental sciences
Geographic Education
Davi Laurentino da Silva
The purpose of teaching Geography is to help students learn in order to grow, live, and coexist in society. However, challenges persist in the teaching process, such as the disconnect between the content and students' reality and some teachers' incomplete knowledge of the discipline's epistemological foundations. Teaching methodologies must be devised so that the content promotes critical reasoning and effective geographic analysis, since Geography teaching should give students a reflective understanding of the space in which they live and of the social and environmental dynamics that shape it. This study discusses the role of the teacher, reflections on the teaching of Geography, and the implementation of complementary approaches, such as games and scale models, in the educational context. The research is grounded in the available record of previous research in printed documents such as books, articles, and theses. It draws on data and theoretical categories already developed by other researchers and duly recorded.
Siamese text classification network (SiamTCN) for multi-class multi-label information extraction of typhoon disasters from social media data
Zhi He, Chengle Zhou, Liwei Zou
et al.
Accurately monitoring disaster effects is a crucial task in relief efforts (e.g. typhoon rescue). Social media data plays a vital role in disaster management, while deep learning-based methods gain more attention in typhoon disaster research. However, most existing classification methods for typhoon disasters are limited to multi-class but single-label levels, contradicting the reality that a social media text may correspond to multiple types of disaster damage. This paper proposes a siamese text classification network (SiamTCN) for multi-class multi-label information extraction from typhoon disasters based on Sina Weibo data. The SiamTCN leverages a dual-path architecture with shared weights, utilizing multi-head self-attention and convolution to extract hidden features from texts. A novel multi-class multi-label contrastive loss function is proposed to optimize the model. Additionally, address information is extracted through address matching and check-in locations. The spatio-temporal characteristics provide actionable insights for disaster management, enabling timely and targeted responses to affected regions. Experiments are conducted on Sina Weibo texts collected from six typical typhoon landfalls in Chinese coastal regions from 2018 to 2023. The results demonstrate that the accuracy achieved by the proposed method is 0.9454, 0.9391, and 0.9422, respectively. The code for this paper is available at https://github.com/SiamTCN.
Mathematical geography. Cartography
Research on the Spatial Data Intelligent Foundation Model
Shaohua Wang, Xing Xie, Yong Li
et al.
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.
Multilevel Regression and Poststratification Interface: An Application to Track Community-level COVID-19 Viral Transmission
Yajuan Si, Toan Tran, Jonah Gabry
et al.
We present a novel Bayesian workflow for multilevel regression and poststratification (MRP), introducing extensions to time-varying data and granular geography and publicly available open-source computation tools, facilitating broad research adoption and reproducibility. In the absence of comprehensive or random testing throughout the COVID-19 pandemic, we have developed a proxy method for synthetic random sampling to estimate community-level viral incidence, based on viral RNA testing of asymptomatic patients who present for elective procedures within a hospital system. The approach collects routine testing data on SARS-CoV-2 exposure among outpatients and performs statistical adjustments of sample representation using MRP, a procedure that adjusts for nonrepresentativeness of the sample and yields stable small group estimates. We illustrate the MRP interface with an application to track community-level COVID-19 viral transmission in the state of Michigan.
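The poststratification half of MRP can be illustrated with a short Python sketch. It assumes the multilevel regression has already produced one estimate per demographic cell, and it weights those estimates by each cell's population count. The cell labels, estimates, population counts, and the function name `poststratify` are hypothetical and are not taken from the paper or its software.

```python
def poststratify(cell_estimates, cell_populations):
    """Population-weighted average of per-cell estimates."""
    total = sum(cell_populations.values())
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    weighted = sum(cell_estimates[cell] * n for cell, n in cell_populations.items())
    return weighted / total


# Hypothetical per-cell incidence estimates (output of a fitted multilevel model).
estimates = {"18-39": 0.02, "40-64": 0.04, "65+": 0.08}
# Hypothetical population counts for the same cells in the target community.
populations = {"18-39": 3000, "40-64": 5000, "65+": 2000}

community_estimate = poststratify(estimates, populations)
print(round(community_estimate, 4))
```

Because the weights come from the target population rather than from the sample, a sample that over-represents one group (here, patients presenting for elective procedures) no longer dominates the community-level estimate.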
Porting Large Language Models to Mobile Devices for Question Answering
Hannes Fassold
Deploying Large Language Models (LLMs) on mobile devices makes all the capabilities of natural language processing available on the device. An important use case of LLMs is question answering, which can provide accurate and contextually relevant answers to a wide array of user queries. We describe how we ported state-of-the-art LLMs to mobile devices, enabling them to operate natively on the device. We employ the llama.cpp framework, a flexible and self-contained C++ framework for LLM inference. We selected a 6-bit quantized version of the Orca-Mini-3B model with 3 billion parameters and present the correct prompt format for this model. Experimental results show that LLM inference runs at interactive speed on a Galaxy S21 smartphone and that the model delivers high-quality answers to user queries related to questions from different subjects like politics, geography or history.
Raking mortality rates across cause, population group and geography with uncertainty quantification
Ariane Ducellier, Alexander Hsu, Parkes Kendrick
et al.
The Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) is the single largest and most detailed scientific effort ever conducted to quantify levels and trends in health. This global health model to estimate mortality rates and other health metrics is run at different scales, leading to large data sets of results for, for example, a global region and its different sub-regions, or a cause of death and its different sub-causes. These models do not necessarily lead to consistent data tables where, for instance, the sum of the number of deaths for each of the sub-regions is equal to the number of deaths for the global region. Raking is widely used in survey inference and global health models to adjust the observations in contingency tables to given marginals, in the latter case reconciling estimates between models with different granularities. The results of global health models usually attach an uncertainty, such as a standard deviation or confidence interval, to the point estimates. In this paper, we propose an uncertainty propagation approach that obtains, at the cost of a single solve, nearly the same uncertainty estimates as computationally intensive Monte Carlo techniques that pass thousands of observed and marginal samples through the entire raking process. We introduce a convex optimization approach that provides a unified framework for raking extensions such as uncertainty propagation, raking with differential weights, raking with different loss functions in order to ensure that bounds on estimates are respected, verifying the feasibility of the constraints, raking to margins either as hard constraints or as aggregate observations, and handling missing data.
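To make the basic raking step concrete, the following Python sketch applies classical iterative proportional fitting to a two-dimensional table, alternately rescaling rows and columns until both sets of margins are matched. It illustrates raking in general and is not the convex optimization or uncertainty propagation method proposed in the paper. The function name `rake_2d` and the toy numbers are hypothetical.

```python
def rake_2d(table, row_targets, col_targets, iterations=100, tol=1e-10):
    """Iterative proportional fitting: rescale rows, then columns, until margins match."""
    cells = [[float(v) for v in row] for row in table]
    for _ in range(iterations):
        # Scale each row so that it sums to its target row total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            if total > 0:
                cells[i] = [v * target / total for v in cells[i]]
        # Scale each column so that it sums to its target column total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows also agree.
        row_error = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if row_error < tol:
            break
    return cells


# Toy table (hypothetical): rows could be sub-regions, columns causes of death.
raked = rake_2d([[10, 20], [30, 40]], row_targets=[40, 60], col_targets=[45, 55])
for row in raked:
    print([round(v, 3) for v in row])
```

The row and column targets must share the same grand total (100 here). Otherwise no table can satisfy both sets of margins, which is the kind of feasibility question the abstract mentions.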
The Software Complexity of Nations
Sándor Juhász, Johannes Wachs, Jermain Kaminski
et al.
Despite the growing importance of the digital sector, research on economic complexity and its implications continues to rely mostly on administrative records, e.g. data on exports, patents, and employment, that have blind spots when it comes to the digital economy. In this paper we use data on the geography of programming languages used in open-source software to extend economic complexity ideas to the digital economy. We estimate a country's software economic complexity index (ECIsoftware) and show that it complements the ability of measures of complexity based on trade, patents, and research to account for international differences in GDP per capita, income inequality, and emissions. We also show that open-source software follows the principle of relatedness, meaning that a country's entries and exits in programming languages are partly explained by its current pattern of specialization. Together, these findings help extend economic complexity ideas and their policy implications to the digital economy.
Mapping drivers of tropical forest loss with satellite image time series and machine learning
Jan Pišl, Marc Rußwurm, Lloyd Haydn Hughes
et al.
The rates of tropical deforestation remain high, resulting in carbon emissions, biodiversity loss, and impacts on local communities. To design effective policies to tackle this, it is necessary to know what the drivers behind deforestation are. Since drivers vary in space and time, producing accurate spatially explicit maps with regular temporal updates is essential. Drivers can be recognized from satellite imagery but the scale of tropical deforestation makes it unfeasible to do so manually. Machine learning opens up possibilities for automating and scaling up this process. In this study, we developed and trained a deep learning model to classify the drivers of any forest loss, including deforestation, from satellite image time series. Our model architecture allows understanding of how the input time series is used to make a prediction, showing the model learns different patterns for recognizing each driver and highlighting the need for temporal data. We used our model to classify over 588,000 sites to produce a map detailing the drivers behind tropical forest loss. The results confirm that the majority of it is driven by agriculture, but also show significant regional differences. Such data is a crucial source of information to enable targeting specific drivers locally and can be updated in the future using free satellite data.
Environmental technology. Sanitary engineering, Environmental sciences
Ecotoxicity of polylactic acid microplastic fragments to Daphnia magna and the effect of ultraviolet weathering
Alisa Luangrath, Joorim Na, Pandi Kalimuthu
et al.
Biodegradable plastics (BPs) are widely used as alternatives to non-BPs due to their inherent ability to undergo facile degradation. However, the ecotoxicological impact of biodegradable microplastics (MPs), especially on aquatic ecosystems and organisms, has rarely been scientifically documented compared with that of conventional microplastics. Therefore, this study aimed to compare the ecotoxicity of biodegradable polylactic acid (PLA) MPs to Daphnia magna with that of conventional polyethylene (PE) MPs, with and without ultraviolet (UV) treatment (4 weeks). The acute toxicity (48 h) of PLA MPs was significantly higher than that of PE MPs, potentially attributable to their elevated bioconcentration resulting from their higher density. UV treatment notably reduced the particle size of PLA MPs and induced new hydrophilic functional groups containing oxygen. Thus, the acute lethal toxicity of PLA MPs increased notably after UV treatment and was greater than that of UV-PE MPs. In addition, UV-PLA MPs showed markedly elevated reactive oxygen species concentration in D. magna compared to the positive control. However, there was no significant increase in the level of lipid peroxidation, possibly due to successful defense by antioxidant enzymes (superoxide dismutase and catalase). These findings highlight the ecotoxicological risks of biodegradable MPs to aquatic organisms, which require comprehensive long-term studies.
Environmental pollution, Environmental sciences
The Devices of State-Led Gentrification: Lessons from Argentina
Joseph Palumbo
Gentrification is recognized as one of the urban processes that, on a planetary scale, tends to sharpen socio-spatial inequalities and curtail the right to the city for lower-income groups. The decisive role of the State in this process has received growing attention, and Latin American research on the subject has contributed key evidence in this regard. Based on a review of the specialized literature on gentrification, city management, and neoliberal urban policies, this article proposes a categorization of the different concrete mechanisms through which state-led gentrification is put into practice. The following mechanisms are detailed: the urban regulatory framework and "systematic exceptionality"; the institutionalization of entrepreneurial urban governance; selective (re)investment in the territory; and strategies of territorial (re)activation. The article illustrates these mechanisms through the Argentine experience, in order to lay the groundwork for operationalizing variables capable of capturing each of them.
A Large-scale Examination of "Socioeconomic" Fairness in Mobile Networks
Souneil Park, Pavol Mulinka, Diego Perino
Internet access is a special resource: the need for it has become universal across the public, whereas the service is operated by the private sector. Mobile Network Operators (MNOs) invest effort in management, planning, and optimization; however, they do not link such activities to socioeconomic fairness. In this paper, we take a first step towards understanding the relation between the socioeconomic status of customers and network performance, and investigate potential discrimination in network deployment and management. The scope of our study spans various aspects, including urban geography, network resource deployment, data consumption, and device distribution. A novel methodology that enables a geo-socioeconomic perspective on mobile networks is developed for the study. The results are based on an actual infrastructure in multiple cities, covering millions of users who densely span the socioeconomic scale. We report a thorough examination of the fairness status, its relationship with various structural factors, and potential class-specific solutions.
New palaeomagnetic data for Palaeoproterozoic AMCG complexes of the Ukrainian Shield
V.G. Bakhmutov, O.V. Mytrokhyn, I.B. Poliachenko
et al.
A palaeomagnetic study of rocks from two Palaeoproterozoic anorthosite-mangerite-charnockite-granite (AMCG) complexes in the Ukrainian Shield was carried out to put additional constraints on the interpretation of the palaeogeography of Fennoscandia and Volgo-Sarmatia in the Palaeoproterozoic. For this study, 5 sites in the Korsun-Novomyrhorod and 3 sites in the Korosten AMCG complexes, in the central and north-western parts of the shield, respectively, were chosen for palaeomagnetic sampling on the basis of geological, modern geochronological and previous palaeomagnetic data. Primary remanent magnetization was isolated in samples of anorthosites, gabbros, and monzonites within a narrow time interval of the U-Pb geochronology dataset of 1.76–1.75 Ga. The palaeomagnetic poles calculated for the Korosten and Korsun-Novomyrhorod complexes are almost identical, which indicates that the Volyn and Ingul Domains developed within a single structure of the Ukrainian Shield since at least 1.75 Ga. The new palaeomagnetic pole calculated for all 8 sites (Plat=22.7 °N, Plon=167.4 °E, A95=3.3°) agrees well with previous studies by Elming et al. [2001, 2010]. The selection of the most reliable palaeomagnetic poles for Fennoscandia and Volgo-Sarmatia of this time indicates that the present position of the Ukrainian Shield relative to Fennoscandia differs from its position at about 1.75 Ga, when Fennoscandia occupied a subequatorial position within palaeolatitudes of 5–20 °N, and Volgo-Sarmatia was located close to the equator and rotated relative to Fennoscandia counterclockwise by about 40° compared to its present position.
Geography (General), Geology