{"results":[{"id":"arxiv_2602.09704","title":"Extended Isolation Forest with feature sensitivities","authors":[{"name":"Illia Donhauzer"}],"abstract":"Compared to theoretical frameworks that assume equal sensitivity to deviations in all features of data, the theory of anomaly detection allowing for variable sensitivity across features is less developed. To the best of our knowledge, this issue has not yet been addressed in the context of isolation-based methods, and this paper represents the first attempt to do so. This paper introduces an Extended Isolation Forest with feature sensitivities, which we refer to as the Anisotropic Isolation Forest (AIF). In contrast to the standard EIF, the AIF enables anomaly detection with controllable sensitivity to deviations in different features or directions in the feature space. The paper also introduces novel measures of directional sensitivity, which allow quantification of AIF's sensitivity in different directions in the feature space. These measures enable adjustment of the AIF's sensitivity to task-specific requirements. We demonstrate the performance of the algorithm by applying it to synthetic and real-world datasets. 
The results show that the AIF enables anomaly detection that focuses on directions in the feature space where deviations from typical behavior are more important.","source":"arXiv","year":2026,"language":"en","subjects":["stat.ME","stat.ML"],"url":"https://arxiv.org/abs/2602.09704","pdf_url":"https://arxiv.org/pdf/2602.09704","is_open_access":true,"published_at":"2026-02-10T12:03:15Z","score":70},{"id":"crossref_10.1088/1748-605x/ae2e6b/data1","title":"Optical Images","authors":null,"abstract":"","source":"CrossRef","year":2026,"language":"en","subjects":null,"doi":"10.1088/1748-605x/ae2e6b/data1","url":"https://doi.org/10.1088/1748-605x/ae2e6b/data1","is_open_access":true,"published_at":"","score":70},{"id":"crossref_10.1002/sta4.70043","title":"Augmenting Intelligence: The Convergence of ML/LLMs and Statistics","authors":[{"name":"Joaquin Carbonara"},{"name":"Ernest Fokoue"}],"abstract":"The rapid advancements in artificial intelligence (AI), machine learning (ML), neural networks (NN) and language models (LM) research, coupled with the widespread availability of large language models as a service (LLMaaS), have begun to influence most domains, particularly the field of statistics, in unprecedented ways that are difficult to forecast. The awarding of two Nobel Prizes in 2024 for computational work in AI—to Hopfield and Hinton for their foundational discoveries and inventions in machine learning with artificial neural networks and to Baker, Hassabis and Jumper for developing an AI model to solve the longstanding problem of predicting proteins' complex structures—is a testament to the significant impact of AI in these fields. Two key contributors for the current revolution are statistics and data science. The merger of data science with AI research led to the creation of tools like LLMs, profound advancements in AI as a tool and speculations of humanity being close to creating AGI. 
These transformative technologies have opened up a vast array of opportunities, but they have also presented new challenges that necessitate careful consideration. Here, we discuss what is needed to successfully navigate these stormy times in the current sea of information surrounding us.","source":"CrossRef","year":2025,"language":"en","subjects":null,"doi":"10.1002/sta4.70043","url":"https://doi.org/10.1002/sta4.70043","is_open_access":true,"citations":1,"published_at":"","score":69.03},{"id":"arxiv_2503.04358","title":"Learning Causal Response Representations through Direct Effect Analysis","authors":[{"name":"Homer Durand"},{"name":"Gherardo Varando"},{"name":"Gustau Camps-Valls"}],"abstract":"We propose a novel approach for learning causal response representations. Our method aims to extract directions in which a multidimensional outcome is most directly caused by a treatment variable. By bridging conditional independence testing with causal representation learning, we formulate an optimisation problem that maximises the evidence against conditional independence between the treatment and outcome, given a conditioning set. This formulation employs flexible regression models tailored to specific applications, creating a versatile framework. The problem is addressed through a generalised eigenvalue decomposition. We show that, under mild assumptions, the distribution of the largest eigenvalue can be bounded by a known $F$-distribution, enabling testable conditional independence. We also provide theoretical guarantees for the optimality of the learned representation in terms of signal-to-noise ratio and Fisher information maximisation. Finally, we demonstrate the empirical effectiveness of our approach in simulation and real-world experiments. 
Our results underscore the utility of this framework in uncovering direct causal effects within complex, multivariate settings.","source":"arXiv","year":2025,"language":"en","subjects":["stat.ML","cs.LG","math.ST","stat.AP"],"url":"https://arxiv.org/abs/2503.04358","pdf_url":"https://arxiv.org/pdf/2503.04358","is_open_access":true,"published_at":"2025-03-06T12:01:41Z","score":69},{"id":"arxiv_2510.09619","title":"Risk-Calibrated Bayesian Streaming Intrusion Detection with SRE-Aligned Decisions","authors":[{"name":"Michel Youssef"}],"abstract":"We present a risk-calibrated approach to streaming intrusion detection that couples Bayesian Online Changepoint Detection (BOCPD) with decision thresholds aligned to Site Reliability Engineering (SRE) error budgets. BOCPD provides run-length posteriors that adapt to distribution shift and concept drift; we map these posteriors to alert decisions by optimizing expected operational cost under false-positive and false-negative budgets. We detail the hazard model, conjugate updates, and an O(1)-per-event implementation. A concrete SRE example shows how a 99.9% availability SLO (43.2 minutes per month error budget) yields a probability threshold near 0.91 when missed incidents are 10x more costly than false alarms. We evaluate on the full UNSW-NB15 and CIC-IDS2017 benchmarks with chronological splits, comparing against strong unsupervised baselines (ECOD, COPOD, and LOF). Metrics include PR-AUC, ROC-AUC, Brier score, calibration reliability diagrams, and detection latency measured in events. Results indicate improved precision-recall at mid to high recall and better probability calibration relative to baselines. We release implementation details, hyperparameters, and ablations for hazard sensitivity and computational footprint. 
Code and reproducibility materials will be made available upon publication; datasets and implementation are available from the corresponding author upon reasonable request.","source":"arXiv","year":2025,"language":"en","subjects":["cs.CR","cs.LG"],"url":"https://arxiv.org/abs/2510.09619","pdf_url":"https://arxiv.org/pdf/2510.09619","is_open_access":true,"published_at":"2025-09-17T17:36:08Z","score":69},{"id":"arxiv_2511.12491","title":"Uncover and Unlearn Nuisances: Agnostic Fully Test-Time Adaptation","authors":[{"name":"Ponhvoan Srey"},{"name":"Yaxin Shi"},{"name":"Hangwei Qian"},{"name":"Jing Li"},{"name":"Ivor W. Tsang"}],"abstract":"Fully Test-Time Adaptation (FTTA) addresses domain shifts without access to source data and training protocols of the pre-trained models. Traditional strategies that align source and target feature distributions are infeasible in FTTA due to the absence of training data and unpredictable target domains. In this work, we exploit a dual perspective on FTTA, and propose Agnostic FTTA (AFTTA) as a novel formulation that enables the usage of off-the-shelf domain transformations during test-time to enable direct generalization to unforeseeable target data. To address this, we develop an uncover-and-unlearn approach. First, we uncover potential unwanted shifts between source and target domains by simulating them through predefined mappings and consider them as nuisances. Then, during test-time prediction, the model is enforced to unlearn these nuisances by regularizing the consequent shifts in latent representations and label predictions. Specifically, a mutual information-based criterion is devised and applied to guide nuisances unlearning in the feature space and encourage confident and consistent prediction in label space. Our proposed approach explicitly addresses agnostic domain shifts, enabling superior model generalization under FTTA constraints. 
Extensive experiments on various tasks, involving corruption and style shifts, demonstrate that our method consistently outperforms existing approaches.","source":"arXiv","year":2025,"language":"en","subjects":["cs.LG","cs.AI"],"doi":"10.1007/s10994-025-06842-w","url":"https://arxiv.org/abs/2511.12491","pdf_url":"https://arxiv.org/pdf/2511.12491","is_open_access":true,"published_at":"2025-11-16T08:01:50Z","score":69},{"id":"ss_f459b8a5a65e00a58d1658a9d3317c413ffe5294","title":"Estimating the Local Learning Coefficient at Scale","authors":[{"name":"Zach Furman"},{"name":"Edmund Lau"}],"abstract":"The \\textit{local learning coefficient} (LLC) is a principled way of quantifying model complexity, originally derived in the context of Bayesian statistics using singular learning theory (SLT). Several methods are known for numerically estimating the local learning coefficient, but so far these methods have not been extended to the scale of modern deep learning architectures or data sets. Using a method developed in {\\tt arXiv:2308.12108 [stat.ML]} we empirically show how the LLC may be measured accurately and self-consistently for deep linear networks (DLNs) up to 100M parameters. We also show that the estimated LLC has the rescaling invariance that holds for the theoretical quantity.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Computer Science","Mathematics"],"doi":"10.48550/arXiv.2402.03698","url":"https://www.semanticscholar.org/paper/f459b8a5a65e00a58d1658a9d3317c413ffe5294","is_open_access":true,"citations":5,"published_at":"","score":68.15},{"id":"ss_d59376814443016581b2d68c5614d13be0b70303","title":"Particle-MALA and Particle-mGRAD: Gradient-based MCMC methods for high-dimensional state-space models","authors":[{"name":"Adrien Corenflos"},{"name":"A. 
Finke"}],"abstract":"State-of-the-art methods for Bayesian inference in state-space models are (a) conditional sequential Monte Carlo (CSMC) algorithms; (b) sophisticated 'classical' MCMC algorithms like MALA, or mGRAD from Titsias and Papaspiliopoulos (2018, arXiv:1610.09641v3 [stat.ML]). The former propose $N$ particles at each time step to exploit the model's 'decorrelation-over-time' property and thus scale favourably with the time horizon, $T$ , but break down if the dimension of the latent states, $D$, is large. The latter leverage gradient-/prior-informed local proposals to scale favourably with $D$ but exhibit sub-optimal scalability with $T$ due to a lack of model-structure exploitation. We introduce methods which combine the strengths of both approaches. The first, Particle-MALA, spreads $N$ particles locally around the current state using gradient information, thus extending MALA to $T\u003e1$ time steps and $N\u003e1$ proposals. The second, Particle-mGRAD, additionally incorporates (conditionally) Gaussian prior dynamics into the proposal, thus extending the mGRAD algorithm to $T\u003e1$ time steps and $N\u003e1$ proposals. We prove that Particle-mGRAD interpolates between CSMC and Particle-MALA, resolving the 'tuning problem' of choosing between CSMC (superior for highly informative prior dynamics) and Particle-MALA (superior for weakly informative prior dynamics). We similarly extend other 'classical' MCMC approaches like auxiliary MALA, aGRAD, and preconditioned Crank-Nicolson-Langevin (PCNL) to $T\u003e1$ time steps and $N\u003e1$ proposals. 
In experiments, for both highly and weakly informative prior dynamics, our methods substantially improve upon both CSMC and sophisticated 'classical' MCMC approaches.","source":"Semantic Scholar","year":2024,"language":"en","subjects":["Mathematics"],"url":"https://www.semanticscholar.org/paper/d59376814443016581b2d68c5614d13be0b70303","is_open_access":true,"citations":5,"published_at":"","score":68.15},{"id":"crossref_10.3103/s1066530724700042","title":"Characterizing Existence and Location of the ML Estimate in the Conway–Maxwell–Poisson Model","authors":[{"name":"Stefan Bedbur"},{"name":"Anton Imm"},{"name":"Udo Kamps"}],"abstract":"","source":"CrossRef","year":2024,"language":"en","subjects":null,"doi":"10.3103/s1066530724700042","url":"https://doi.org/10.3103/s1066530724700042","pdf_url":"https://link.springer.com/content/pdf/10.3103/S1066530724700042.pdf","is_open_access":true,"citations":1,"published_at":"","score":68.03},{"id":"arxiv_2402.03698","title":"Estimating the Local Learning Coefficient at Scale","authors":[{"name":"Zach Furman"},{"name":"Edmund Lau"}],"abstract":"The \\textit{local learning coefficient} (LLC) is a principled way of quantifying model complexity, originally derived in the context of Bayesian statistics using singular learning theory (SLT). Several methods are known for numerically estimating the local learning coefficient, but so far these methods have not been extended to the scale of modern deep learning architectures or data sets. Using a method developed in {\\tt arXiv:2308.12108 [stat.ML]} we empirically show how the LLC may be measured accurately and self-consistently for deep linear networks (DLNs) up to 100M parameters. 
We also show that the estimated LLC has the rescaling invariance that holds for the theoretical quantity.","source":"arXiv","year":2024,"language":"en","subjects":["cs.LG","stat.ML"],"url":"https://arxiv.org/abs/2402.03698","pdf_url":"https://arxiv.org/pdf/2402.03698","is_open_access":true,"published_at":"2024-02-06T04:37:09Z","score":68},{"id":"ss_5b5ea5a69e55e1fe5f9e21aeb475c05aadfac025","title":"Improving Indoor Occupancy Detection Accuracy of the SLEEPIR Sensor Using LSTM Models","authors":[{"name":"Zhangjie Chen"},{"name":"Min Wang"},{"name":"Ya Wang"}],"abstract":"We recently developed a synchronized low-energy electronically chopped passive infrared (SLEEPIR) sensor node to detect stationary and moving occupants. It uses a liquid crystal shutter to modulate the infrared signal received by a traditional passive infrared (PIR) sensor and thus enables its capability to detect stationary occupants. However, the detection accuracy of the SLEEPIR sensor can be easily influenced by infrared environmental disturbances. To address this problem, in this article, we propose two long short-term memory (LSTM) models to filter infrared environmental disturbance, named baseline LSTM (Base.LSTM) and statistical LSTM (Stat.LSTM). They use the sensor node raw output and statistical features as their respective input. For comparison, we propose two other models: the occupancy state switch detection (SSD) algorithm that directly uses a predetermined threshold voltage value to classify the occupancy state and its status change; and the multilayer perceptron (MLP) classifier with statistical feature inputs (Stat.ML). To validate their detection performance, we designed two testing scenarios in different environment settings: 1) daily occupancy tests and 2) edge case tests. The first scenario intends to restore complex real-life environmental situations as much as possible in the lab and apartment rooms. 
The second scenario aims to verify their detection accuracy under different environmental temperatures. This scenario also considers different occupancy postures, such as lying down. Experimental results show that the detection accuracy of both LSTM models ( $\u003e$ 95%) in both testing scenarios outperforms that of the SSD (around 82%–94%) and the Stat.ML (around 80%–90%).","source":"Semantic Scholar","year":2023,"language":"en","subjects":null,"doi":"10.1109/JSEN.2023.3287565","url":"https://www.semanticscholar.org/paper/5b5ea5a69e55e1fe5f9e21aeb475c05aadfac025","is_open_access":true,"citations":15,"published_at":"","score":67.45},{"id":"arxiv_2302.02859","title":"A Fast Bootstrap Algorithm for Causal Inference with Large Data","authors":[{"name":"Matthew Kosko"},{"name":"Lin Wang"},{"name":"Michele Santacatterina"}],"abstract":"Estimating causal effects from large experimental and observational data has become increasingly prevalent in both industry and research. The bootstrap is an intuitive and powerful technique used to construct standard errors and confidence intervals of estimators. Its application however can be prohibitively demanding in settings involving large data. In addition, modern causal inference estimators based on machine learning and optimization techniques exacerbate the computational burden of the bootstrap. The bag of little bootstraps has been proposed in non-causal settings for large data but has not yet been applied to evaluate the properties of estimators of causal effects. In this paper, we introduce a new bootstrap algorithm called causal bag of little bootstraps for causal inference with large data. The new algorithm significantly improves the computational efficiency of the traditional bootstrap while providing consistent estimates and desirable confidence interval coverage. 
We describe its properties, provide practical considerations, and evaluate the performance of the proposed algorithm in terms of bias, coverage of the true 95% confidence intervals, and computational time in a simulation study. We apply it in the evaluation of the effect of hormone therapy on the average time to coronary heart disease using a large observational data set from the Women's Health Initiative.","source":"arXiv","year":2023,"language":"en","subjects":["stat.ME","stat.AP","stat.ML"],"url":"https://arxiv.org/abs/2302.02859","pdf_url":"https://arxiv.org/pdf/2302.02859","is_open_access":true,"published_at":"2023-02-06T15:26:36Z","score":67},{"id":"ss_7321b03e135a950a4380d0f5dae0073c61742715","title":"Indeterminacy in Generative Models: Characterization and Strong Identifiability","authors":[{"name":"Quanhan Xi"},{"name":"Benjamin Bloem-Reddy"}],"abstract":"Most modern probabilistic generative models, such as the variational autoencoder (VAE), have certain indeterminacies that are unresolvable even with an infinite amount of data. Different tasks tolerate different indeterminacies, however recent applications have indicated the need for strongly identifiable models, in which an observation corresponds to a unique latent code. Progress has been made towards reducing model indeterminacies while maintaining flexibility, and recent work excludes many--but not all--indeterminacies. In this work, we motivate model-identifiability in terms of task-identifiability, then construct a theoretical framework for analyzing the indeterminacies of latent variable models, which enables their precise characterization in terms of the generator function and prior distribution spaces. We reveal that strong identifiability is possible even with highly flexible nonlinear generators, and give two such examples. 
One is a straightforward modification of iVAE (arXiv:1907.04809 [stat.ML]); the other uses triangular monotonic maps, leading to novel connections between optimal transport and identifiability.","source":"Semantic Scholar","year":2022,"language":"en","subjects":["Computer Science","Mathematics"],"url":"https://www.semanticscholar.org/paper/7321b03e135a950a4380d0f5dae0073c61742715","is_open_access":true,"citations":30,"published_at":"","score":66.9},{"id":"ss_27208d9a4cf5915142c6d9996530ca2ebc29fd34","title":"On the Origin of Species of Self-Supervised Learning","authors":[{"name":"Samuel Albanie"},{"name":"Erika Lu"},{"name":"João F. Henriques"}],"abstract":"In the quiet backwaters of cs.CV, cs.LG and stat.ML, a cornucopia of new learning systems is emerging from a primordial soup of mathematics-learning systems with no need for external supervision. To date, little thought has been given to how these self-supervised learners have sprung into being or the principles that govern their continuing diversification. After a period of deliberate study and dispassionate judgement during which each author set their Zoom virtual background to a separate Galapagos island, we now entertain no doubt that each of these learning machines are lineal descendants of some older and generally extinct species. 
We make five contributions: (1) We gather and catalogue row-major arrays of machine learning specimens, each exhibiting heritable discriminative features; (2) We document a mutation mechanism by which almost imperceptible changes are introduced to the genotype of new systems, but their phenotype (birdsong in the form of tweets and vestigial plumage such as press releases) communicates dramatic changes; (3) We propose a unifying theory of self-supervised machine evolution and compare to other unifying theories on standard unifying theory benchmarks, where we establish a new (and unifying) state of the art; (4) We discuss the importance of digital biodiversity, in light of the endearingly optimistic Paris Agreement.","source":"Semantic Scholar","year":2021,"language":"en","subjects":["Computer Science"],"url":"https://www.semanticscholar.org/paper/27208d9a4cf5915142c6d9996530ca2ebc29fd34","is_open_access":true,"citations":1,"published_at":"","score":65.03},{"id":"doaj_10.46298/jnsao-2021-6480","title":"Inexact and Stochastic Generalized Conditional Gradient with Augmented Lagrangian and Proximal Step","authors":[{"name":"Antonio Silveti-Falls"},{"name":"Cesare Molinari"},{"name":"Jalal Fadili"}],"abstract":"In this paper we propose and analyze inexact and stochastic versions of the CGALP algorithm developed in [25], which we denote ICGALP , that allow for errors in the computation of several important quantities. In particular this allows one to compute some gradients, proximal terms, and/or linear minimization oracles in an inexact fashion that facilitates the practical application of the algorithm to computationally intensive settings, e.g., in high (or possibly infinite) dimensional Hilbert spaces commonly found in machine learning problems. The algorithm is able to solve composite minimization problems involving the sum of three convex proper lower-semicontinuous functions subject to an affine constraint of the form Ax = b for some bounded linear operator A. 
Only one of the functions in the objective is assumed to be differentiable, the other two are assumed to have an accessible proximal operator and a linear minimization oracle. As main results, we show convergence of the Lagrangian values (so-called convergence in the Bregman sense) and asymptotic feasibility of the affine constraint as well as strong convergence of the sequence of dual variables to a solution of the dual problem, in an almost sure sense. Almost sure convergence rates are given for the Lagrangian values and the feasibility gap for the ergodic primal variables. Rates in expectation are given for the Lagrangian values and the feasibility gap subsequentially in the pointwise sense. Numerical experiments verifying the predicted rates of convergence are shown as well.","source":"DOAJ","year":2021,"language":"","subjects":["Science"],"doi":"10.46298/jnsao-2021-6480","url":"http://jnsao.episciences.org/6480/pdf","pdf_url":"http://jnsao.episciences.org/6480/pdf","is_open_access":true,"published_at":"","score":65},{"id":"ss_8ff009c4dda6b533d46dfee32324a7d871fc6d19","title":"822 GraphITE: unsupervised graph embeddings approach to multiplex immunofluorescence image exploration reveals new insights into NSCLC and HNSCC tumor microenvironment","authors":[{"name":"M. Surace"},{"name":"H. Angell"},{"name":"Christopher Innocenti"},{"name":"Zhenning Zhang"},{"name":"Isabelle Gaffney"},{"name":"A. Spitzmüller"},{"name":"Khan Baykaner"},{"name":"B. Selvaraj"}],"abstract":"Predictive biomarkers for response to IO therapies remain insufficient. Although multiplex immunofluorescence has the potential to provide superior biomarkers, the information garnered from these studies is frequently underleveraged. Due to the large number of markers that must be analyzed (6 - 40 +), and the complexity of the spatial information, the number of hypotheses is large and must be tested systematically and automatically. 
GraphITE (Graphs-based Investigation of Tissues with Embeddings) is a novel method of converting multiplex IF image analysis results into embeddings, numerical vectors which represent the phenotype of each cell as well as the immediate neighborhood. This allows for the clustering of embeddings based on similarity as well as the discovery of novel predictive biomarkers based on both the spatial and multimarker data in multiplex IF images. Here we demonstrate initial observations from deployment of GraphITE on 564 commercially-sourced NSCLC and HNSCC resections stained with a multiplex IF panel containing CD8, PDL1, PD1, CD68, Ki67, and CK. 4 μm FFPE tumor sections were stained with CD8, PDL1, PD1, CD68, Ki67, and CK at Akoya Biosciences using OPAL TSA-linked fluorophores and imaged on a Vectra Polaris. Images were analyzed by Computational Biology (AstraZeneca). Graphs were built by mapping each cell in the mIF image as a node, using the X, Y coordinates and connecting nodes with edges according to distance. 64-dimensional embeddings were generated using Deep Graph InfoMax (DGI) [1]. Embeddings are downprojected to 2 dimensions using UMAP [2]. Details are available in the preprint of the GraphITE methods manuscript [3]. A single downprojection was developed using embeddings from 158 HNSCC and 406 NSCLC cases. 60–80 distinct clusters were observed, some of which contained embeddings from both indications and others which were exclusive to one indication. Exclusive clusters describe tissue neighborhoods observed only in one indication. Drivers of cluster exclusivity included increased cell density in HNSCC as compared to NSCLC both in PD-L1- tumor centers with few infiltrating lymphocytes as well as in PD-L1- macrophage-dominated neighborhoods. 
HNSCC and NSCLC embeddings were more colocalized in PD-L1+ tumor centers and in tumor stroma with high CD8+ or CD68+ immune cell content and high PD-L1+ expression. This study demonstrates the utility and potential of the GraphITE platform to discriminate between and describe both unique and common neighborhood-level features of the tumor microenvironment. Deploying GraphITE across multiple indications effectively leverages spatial heterogeneity and multimarker information from multiplex IF panels. 1. Veličković P, Fedus W, Hamilton WL, Liò P, Bengio Y, Devon Hjelm R. Deep Graph Infomax. 2018. arxiv:1809.10341 [stat.ML]. 2. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. 2020; arxiv:1802.03426 [stat.ML]. 3. Innocenti C, Zhang Z, Selvaraj B, Gaffney I, Frangos M, Cohen-Setton J, Dillon LAL, Surace MJ, Pedrinaci C, Hipp J, Baykaner K. An unsupervised graph embeddings approach to multiplex immunofluorescence image exploration. bioRxiv 2021.06.09.447654; doi: https://doi.org/10.1101/2021.06.09.447654. The study was approved by AstraZeneca.","source":"Semantic Scholar","year":2021,"language":"en","subjects":null,"doi":"10.1136/jitc-2021-sitc2021.822","url":"https://www.semanticscholar.org/paper/8ff009c4dda6b533d46dfee32324a7d871fc6d19","pdf_url":"https://jitc.bmj.com/content/jitc/9/Suppl_2/A860.full.pdf","is_open_access":true,"published_at":"","score":65},{"id":"ss_bf3662b41c6fbcdee9b12f2bbff8285a517f215a","title":"Identifying the Development and Application of Artificial Intelligence in Scientific Text","authors":[{"name":"James W. Dunham"},{"name":"Jennifer Melot"},{"name":"D. Murdick"}],"abstract":"We describe a strategy for identifying the universe of research publications relevant to the application and development of artificial intelligence. The approach leverages the arXiv corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. 
We compose a functional definition of AI relevance by learning these subjects from paper metadata, and then inferring the arXiv-subject labels of papers in larger corpora: Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph. This yields predictive classification $F_1$ scores between .75 and .86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO). For a single model that learns these and four other AI-relevant subjects (cs.AI, cs.LG, stat.ML, and cs.MA), we see precision of .83 and recall of .85. We evaluate the out-of-domain performance of our classifiers against other sources of topic information and predictions from alternative methods. We find that a supervised solution can generalize to identify publications that belong to the high-level fields of study represented on arXiv. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling.","source":"Semantic Scholar","year":2020,"language":"en","subjects":["Computer Science"],"url":"https://www.semanticscholar.org/paper/bf3662b41c6fbcdee9b12f2bbff8285a517f215a","is_open_access":true,"citations":18,"published_at":"","score":64.53999999999999}],"total":159330,"page":1,"page_size":20,"sources":["DOAJ","arXiv","Semantic Scholar","CrossRef"],"query":"stat.ML"}