Transfer learning aims to improve the performance of target learners on target domains by transferring knowledge contained in different but related source domains. In this way, the dependence on large amounts of target-domain data can be reduced when constructing target learners. Owing to its wide application prospects, transfer learning has become a popular and promising area in machine learning. Although there are already some valuable and impressive surveys on transfer learning, they introduce approaches in a relatively isolated way and do not cover recent advances. Given the rapid expansion of the transfer learning area, it is both necessary and challenging to review the relevant studies comprehensively. This survey attempts to connect and systematize existing transfer learning research, and to summarize and interpret the mechanisms and strategies of transfer learning in a comprehensive way, which may help readers gain a better understanding of the current research status and ideas. Unlike previous surveys, this article reviews more than 40 representative transfer learning approaches, especially homogeneous transfer learning approaches, from the perspectives of data and model. Applications of transfer learning are also briefly introduced. To compare the performance of different transfer learning models, over 20 representative models are evaluated on three data sets, namely Amazon Reviews, Reuters-21578, and Office-31; the experimental results demonstrate the importance of selecting appropriate transfer learning models for different applications in practice.
Determinantal point processes (DPPs) are elegant probabilistic models of repulsion that arise in quantum physics and random matrix theory. In contrast to traditional structured models like Markov random fields, which become intractable and hard to approximate in the presence of negative correlations, DPPs offer efficient and exact algorithms for sampling, marginalization, conditioning, and other inference tasks. While they have been studied extensively by mathematicians, giving rise to a deep and beautiful theory, DPPs are relatively new in machine learning. Determinantal Point Processes for Machine Learning provides a comprehensible introduction to DPPs, focusing on the intuitions, algorithms, and extensions most relevant to the machine learning community, and shows how DPPs can be applied to real-world tasks such as finding diverse sets of high-quality search results, building informative summaries by selecting diverse sentences from documents, modeling non-overlapping human poses in images or video, and automatically building timelines of important news stories. It presents the general mathematical background to DPPs along with a range of modeling extensions, efficient algorithms, and theoretical results that aim to enable practical modeling and learning.
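The exact inference the abstract mentions rests on determinants: for an L-ensemble DPP with kernel L, the marginal kernel is K = L(L + I)⁻¹ and the probability that a subset S is included in the sample is det(K_S). The sketch below (the three-item kernel is an invented toy example, not taken from the book) illustrates the resulting repulsion between similar items:

```python
import numpy as np

# Similarity kernel L for 3 items; large off-diagonals mean "similar".
# Items 0 and 1 are highly similar, item 2 is dissimilar to both.
L = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

# Marginal kernel K = L (L + I)^{-1}; then P(S subset of Y) = det(K_S).
K = L @ np.linalg.inv(L + np.eye(3))

def incl_prob(S):
    """Inclusion probability of item set S under the DPP."""
    sub = K[np.ix_(S, S)]
    return float(np.linalg.det(sub))

p0, p1 = incl_prob([0]), incl_prob([1])
p01 = incl_prob([0, 1])
# Negative association: the similar items 0 and 1 co-occur less often
# than they would under independence, since p01 = p0*p1 - K[0,1]^2.
```

The 2x2 determinant makes the repulsion mechanism explicit: the squared off-diagonal entry of K is subtracted from the product of the singleton probabilities.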
Background As more and more researchers turn to big data for new opportunities for biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned ever more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Moreover, because of the flexibility in specifying machine learning models, their results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consistent interpretation of model outputs. Objective To attain a set of guidelines on the use of machine learning predictive models within clinical settings, to ensure the models are correctly applied and sufficiently reported so that true discoveries can be distinguished from random coincidence. Methods A multidisciplinary panel of machine learning experts, clinicians, and traditional statisticians was interviewed, using an iterative process in accordance with the Delphi method. Results The process produced a set of guidelines that consists of (1) a list of reporting items to be included in a research article and (2) a set of practical sequential steps for developing predictive models. Conclusions A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. We believe that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community.
There is no doubt that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges. In this paper, we present a literature survey of the latest advances in machine learning research for big data processing. First, we review machine learning techniques and highlight some promising learning methods from recent studies, such as representation learning, deep learning, distributed and parallel learning, transfer learning, active learning, and kernel-based learning. Next, we focus on the analysis and discussion of the challenges and possible solutions of machine learning for big data. Following that, we investigate the close connections of machine learning with signal processing techniques for big data processing. Finally, we outline several open issues and research trends.
Machine learning (ML) is the fastest growing field in computer science, and health informatics is among its greatest challenges. The goal of ML is to develop algorithms which can learn and improve over time and can be used for predictions. Most ML researchers concentrate on automatic machine learning (aML), where great advances have been made, for example, in speech recognition, recommender systems, or autonomous vehicles. Automatic approaches greatly benefit from big data with many training sets. However, in the health domain we are sometimes confronted with small data sets or rare events, where aML approaches suffer from insufficient training samples. Here interactive machine learning (iML) may be of help, having its roots in reinforcement learning, preference learning, and active learning. The term iML is not yet well established, so we define it as "algorithms that can interact with agents and can optimize their learning behavior through these interactions, where the agents can also be human." This "human-in-the-loop" can be beneficial in solving computationally hard problems, e.g., subspace clustering, protein folding, or k-anonymization of health data, where human expertise can help to reduce an exponential search space through heuristic selection of samples. Therefore, what would otherwise be an NP-hard problem is greatly reduced in complexity through the input and assistance of a human agent involved in the learning phase.
We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, ranging from binary classifiers to sophisticated methods for estimation with structured data.
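As a concrete instance of estimation in an RKHS, kernel ridge regression expands the learned function as f(x) = Σᵢ αᵢ k(xᵢ, x) with α = (K + λI)⁻¹ y. The sketch below (the Gaussian kernel choice, data, and hyperparameters are illustrative, not specific to the survey) fits a nonlinear function while solving only a linear system:

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    d = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
         - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * d)

# Training data: a smooth nonlinear target.
X = np.linspace(0.0, 3.0, 20)[:, None]
y = np.sin(2.0 * X[:, 0])

# "Learning" is a linear solve in the RKHS coefficients alpha.
K = rbf(X, X)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# The learned function lives in the RKHS: f(x) = sum_i alpha_i k(x_i, x).
y_hat = rbf(X, X) @ alpha
max_err = float(np.max(np.abs(y_hat - y)))  # small for small lambda
```

The same template covers the survey's broader point: swapping the kernel swaps the function class (including kernels on nonvectorial data) without changing the algorithm.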
As data science becomes increasingly mainstream, there will be an ever-growing demand for data science tools that are more accessible, flexible, and scalable. In response to this demand, automated machine learning (AutoML) researchers have begun building systems that automate the process of designing and optimizing machine learning pipelines. In this chapter we present TPOT v0.3, an open source genetic programming-based AutoML system that optimizes a series of feature preprocessors and machine learning models with the goal of maximizing classification accuracy on a supervised classification task. We benchmark TPOT on a series of 150 supervised classification tasks and find that it significantly outperforms a basic machine learning analysis in 21 of them, while experiencing minimal degradation in accuracy on 4 of the benchmarks—all without any domain knowledge or human input. As such, genetic programming-based AutoML systems show considerable promise in the AutoML domain.
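The core idea TPOT automates is scoring candidate preprocessor/model pipelines and keeping the best. The toy sketch below is not TPOT (it uses a plain exhaustive search over a two-operator space on invented data, rather than genetic programming over TPOT's full operator set), but it shows that pipeline search loop in miniature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: class 1 shifted by +2 in both features.
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.5).astype(int)
X[y == 1] += 2.0

# Candidate preprocessors (tiny stand-ins for an AutoML operator set).
preprocessors = {
    "identity": lambda Z: Z,
    "standardize": lambda Z: (Z - Z.mean(0)) / Z.std(0),
}

def nearest_centroid_fit_predict(Xtr, ytr, Xte):
    """A minimal classifier: assign each point to the nearer class mean."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)

# Score every candidate pipeline on a held-out split, keep the best.
tr, te = slice(0, 150), slice(150, 200)
scores = {}
for name, pre in preprocessors.items():
    Z = pre(X)
    pred = nearest_centroid_fit_predict(Z[tr], y[tr], Z[te])
    scores[name] = float(np.mean(pred == y[te]))
best = max(scores, key=scores.get)
```

TPOT replaces the exhaustive loop with evolutionary operators (mutation and crossover over pipeline trees), which lets it explore far larger operator spaces than enumeration could.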
Radiomics is an emerging area in quantitative image analysis that aims to relate large-scale extracted imaging information to clinical and biological endpoints. The development of quantitative imaging methods along with machine learning has enabled the opportunity to move data science research towards translation for more personalized cancer treatments. Accumulating evidence has indeed demonstrated that noninvasive advanced imaging analytics, that is, radiomics, can reveal key components of tumor phenotype for multiple three-dimensional lesions at multiple time points over and beyond the course of treatment. These developments in the use of CT, PET, US, and MR imaging could augment patient stratification and prognostication, buttressing emerging targeted therapeutic approaches. In recent years, deep learning architectures have demonstrated their tremendous potential for image segmentation, reconstruction, recognition, and classification. Many powerful open-source and commercial platforms are currently available to embark on new research areas of radiomics. Quantitative imaging research, however, is complex, and key statistical principles should be followed to realize its full potential. The field of radiomics, in particular, requires a renewed focus on optimal study design and reporting practices and on standardization of image acquisition, feature calculation, and rigorous statistical analysis for the field to move forward. In this article, we review the role of machine and deep learning as major computational vehicles for advanced model building of radiomics-based signatures or classifiers, along with diverse clinical applications, working principles, research opportunities, and available computational platforms for radiomics, with examples drawn primarily from oncology. We also address issues related to common applications in medical physics, such as standardization, feature extraction, model building, and validation.
Randal S. Olson, W. L. Cava, P. Orzechowski, et al.
The selection, development, or comparison of machine learning methods in data mining can be a difficult task, depending on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly evaluate machine learning algorithms, and that there are several gaps in benchmarking problems that still need to be considered. This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
A machine learning framework is developed to estimate ocean-wave conditions. By supervised training of machine learning models on many thousands of iterations of a physics-based wave model, accurate representations of significant wave heights and periods can be used to predict ocean conditions. A model of Monterey Bay was used as the example test site; it was forced by measured wave conditions, ocean-current nowcasts, and reported winds. These input data, along with model outputs of spatially variable wave heights and characteristic period, were aggregated into supervised learning training and test data sets, which were supplied to machine learning models. These machine learning models replicated wave heights with a root-mean-squared error of 9 cm and correctly identified over 90% of the characteristic periods for the test data sets. Impressively, transforming model inputs to outputs through matrix operations requires only a fraction (<1/1,000) of the computation time compared to forecasting with the physics-based model.
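The deployment step described above, mapping forcing inputs to wave-field outputs through matrix operations, can be sketched with a linear least-squares surrogate on synthetic data (the actual study trained neural-network models on physics-based wave-model runs; the dimensions, data, and linear model below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 3 forcing inputs (e.g., offshore wave height,
# period, wind) mapped to wave height at 5 grid points by an unknown
# linear map plus noise.
n, d_in, d_out = 1000, 3, 5
X = rng.normal(size=(n, d_in))
W_true = rng.normal(size=(d_in, d_out))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_out))

# "Training": fit W by least squares on the first 800 samples.
W, *_ = np.linalg.lstsq(X[:800], Y[:800], rcond=None)

# "Forecasting": predicting new conditions is one matrix multiply,
# which is the source of the large speedup over the physics model.
Y_hat = X[800:] @ W
rmse = float(np.sqrt(np.mean((Y_hat - Y[800:]) ** 2)))
```

Once the weights are fitted offline, the per-forecast cost is a single small matrix product, independent of the cost of the original physics-based simulation.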
Philipp Wegner, Marcus Grobe-Einsler, Lara Reimer, et al.
Background Gait disturbances are the clinical hallmark of ataxia. Their severity is assessed with a well-established clinical scale, which only allows coarse scoring and does not reflect the complexity of individual gait deterioration. We investigated whether sensor-free motion capture can replicate clinical scoring and improve the assessment of gait disturbances. Methods The normal walking task during clinical assessment was videotaped in 91 ataxia patients and 28 healthy controls (HC). A full-body pose estimation model (AlphaPose) was used to extract positions, distances, and angles over time while walking. The resulting time series were analyzed with four machine learning (ML) models, which were combinations of feature extraction (tsfresh, ROCKET) and prediction methods (XGBoost, Ridge). First, in a regression and classification approach, we trained the ML models to reconstruct the clinical score. Second, we used explainable AI (SHAP) to identify the most important time series. Third, we investigated time series features to study longitudinal changes. Results Gait disturbances are assessed with high accuracy by ML models, slightly improving on human rating (i) in the categorical prediction of the clinical score (best model: 63.99% F1-score; human: 60.57%), (ii) in the detection of subtle changes (pre-symptomatic patients, clinically rated unimpaired, are differentiated from HC with an F1-score of 75.96%), and (iii) in the detection of longitudinal changes over time (Pearson's correlation coefficient model: −0.626, p < 0.01; human: −0.060, not significant). Conclusions ML-based analysis shows improved sensitivity in assessing gait disturbances in ataxia. Subtle and longitudinal changes can be captured within this study. These findings suggest that such approaches may hold promise as potential outcome parameters for early interventions, therapy monitoring, and home-based assessments.
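The pipeline shape, time-series feature extraction followed by a regression model reconstructing the clinical score, can be sketched as follows (hand-picked summary statistics and synthetic variance-driven signals stand in for the tsfresh/ROCKET features and AlphaPose keypoint series used in the study):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for pose time series: each "patient" is a 1-D
# signal whose variability grows with a latent severity score.
n = 120
severity = rng.uniform(0.0, 4.0, n)                    # clinical-score proxy
series = [rng.normal(0.0, 0.2 + 0.1 * s, 200) for s in severity]

def features(x):
    """A few hand-picked summary statistics per series."""
    return np.array([x.std(), np.abs(np.diff(x)).mean(), x.max() - x.min()])

F = np.stack([features(x) for x in series])

# Ridge regression on the extracted features (one of the predictor
# choices in the paper), solved in closed form with an intercept column.
lam = 1.0
A = np.hstack([F, np.ones((n, 1))])
w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ severity)
pred = A @ w
r = float(np.corrcoef(pred, severity)[0, 1])  # agreement with the score
```

The two-stage structure is the point: any feature extractor can feed any regressor, which is exactly the combinatorial design (tsfresh/ROCKET × XGBoost/Ridge) the study evaluated.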
We propose a method for estimating human-object contact in video using deep learning, in which images with expanded human regions are fed into the model to enable more efficient feature extraction. When estimating human-object contact in video, it is important to remove irrelevant information from the image and direct the model's attention to the relevant human regions. Our method reduces background information and focuses the model on human body parts by detecting the bounding boxes of human regions using object detection techniques and then feeding the model images in which these regions are expanded. Furthermore, we extend the input from still images to video sequences and introduce a new loss function that imposes constraints along the temporal dimension, enabling the model to learn temporal information effectively. This study aims to improve the accuracy of human-object contact estimation by modifying the model's input and loss function.
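The abstract does not specify the form of the temporal constraint; one common choice for such a loss term is a squared-difference penalty on predictions for consecutive frames, which favors smoothly varying contact estimates. The function below is a hypothetical sketch of that idea, not the paper's actual loss:

```python
import numpy as np

def temporal_loss(probs):
    """Mean squared change in per-frame contact probability.

    probs: array of shape (T,) with contact probabilities in [0, 1].
    Penalizes frame-to-frame jumps, encouraging temporal consistency.
    """
    return float(np.mean(np.diff(probs) ** 2))

smooth = np.array([0.10, 0.15, 0.20, 0.25])  # gradual contact onset
jumpy = np.array([0.10, 0.90, 0.10, 0.90])   # physically implausible flicker
# The smooth sequence incurs a much smaller penalty than the jumpy one.
```

In training, such a term would typically be added to the per-frame classification loss with a weighting coefficient, trading off accuracy against temporal smoothness.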
Jamieson D Gray, Lukasz S Wylezinski, Charles F Spurlock
Objectives To develop and evaluate a machine learning (ML) model that predicts which Crohn's disease (CD) patients will be responsible for the top quartile of healthcare spending. Methods De-identified commercial claims (2016–2018) from ~267 000 continuously enrolled members in a Midwestern state were analysed, including 994 CD cases. Monthly data for each patient were aggregated into data points that included healthcare spending amounts, encounter interactions, demographics, and binary flags for diagnoses, procedures, and drug codes. Seven algorithm families were tuned using five-fold cross-validation (January 2016 to September 2017) and tested prospectively (November 2017 to February 2018). Monthly performance evaluations assessed the accuracy of predicting high-cost healthcare spending, using 4-month and 1-month historical cost analyses for comparison. Results ML models predicted an average of 80% of the dollars spent by top-quartile members during the 4-month evaluation period, compared with 67% for the 4-month baseline and 62% for the prior-month benchmark. The models identified an average of 51 new members entering the high-cost group each month, nearly double the yield of the 4-month historical method. These ML models more accurately anticipated the inpatient encounters that drove excess spending. Discussion Claims-based ML offers actionable lead time for payers and clinicians to enhance monitoring, adjust biological therapy, or schedule elective care before emergency admissions occur. Because this framework relies exclusively on standard claim fields, it can be quickly extended to other episodic, high-variance conditions. Conclusion Prospectively tested, claims-only ML models enhance short-term risk stratification in CD by identifying future high-cost patients. Future studies should confirm the clinical impact and cost savings and ensure equitable performance across diverse populations.
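The headline metric, the share of top-quartile dollars accounted for by the members a model flags, can be computed as below (the lognormal spending distribution and noisy risk score are synthetic stand-ins for the paper's claims data and trained models):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic member-level spending (heavy-tailed, as claims data are)
# and a risk score that tracks spending up to multiplicative noise.
spend = rng.lognormal(mean=7.0, sigma=1.5, size=1000)
score = spend * rng.uniform(0.8, 1.2, size=1000)

# True top quartile by spend vs. the model's top quartile by score.
top = spend >= np.quantile(spend, 0.75)
flagged = score >= np.quantile(score, 0.75)

# Fraction of top-quartile dollars the flagged members account for
# (the quantity reported as 80% for the ML models in the study).
captured = float(spend[top & flagged].sum() / spend[top].sum())
```

Because spending is heavy-tailed, correctly ranking the few largest spenders dominates this metric, which is why it rewards models that anticipate the inpatient encounters driving excess cost.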
Computer applications to medicine. Medical informatics