Machine Learning: An Applied Econometric Approach
S. Mullainathan, Jann Spiess
Machines are increasingly doing “intelligent” things. Face recognition algorithms use a large dataset of photos labeled as having a face or not to estimate a function that predicts the presence y of a face from pixels x. This similarity to econometrics raises questions: How do these new empirical tools fit with what we know? As empirical economists, how can we use them? We present a way of thinking about machine learning that gives it its own place in the econometric toolbox. Machine learning not only provides new tools, it solves a different problem. Specifically, machine learning revolves around the problem of prediction, while many economic applications revolve around parameter estimation. So applying machine learning to economics requires finding relevant tasks. Machine learning algorithms are now technically easy to use: you can download convenient packages in R or Python. This also raises the risk that the algorithms are applied naively or their output is misinterpreted. We hope to make them conceptually easier to use by providing a crisper understanding of how these algorithms work, where they excel, and where they can stumble—and thus where they can be most usefully applied.
1585 sitasi
en
Computer Science
Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach
N Time, S. Hwang, J. Jeon
Abstract Machine learning approaches can establish the complex and non-linear relationship among input and response variables for the seismic damage assessment of structures. However, lack of explainability of complex machine learning models prevents their use in such assessment. This paper uses extensive experimental databases to suggest random forest machine learning models for failure mode predictions of reinforced concrete columns and shear walls, employs the recently developed SHapley Additive exPlanations approach to rank input variables for identification of failure modes, and explains why the machine learning model predicts a specific failure mode for a given sample or experiment. A random forest model established provides an accuracy of 84% and 86% for unknown data of columns and shear walls, respectively. The geometric variables and reinforcement indices are critical parameters that influence failure modes. The study also reveals that existing strategies of failure mode identification based solely on geometric features are not enough to properly identify failure modes.
837 sitasi
en
Computer Science
Underspecification Presents Challenges for Credibility in Modern Machine Learning
A. D'Amour, K. Heller, D. Moldovan
et al.
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
825 sitasi
en
Computer Science, Mathematics
Introduction to Machine Learning, Neural Networks, and Deep Learning
Rene Y. Choi, Aaron S. Coyner, Jayashree Kalpathy-Cramer
et al.
Purpose To present an overview of current machine learning methods and their use in medical research, focusing on select machine learning techniques, best practices, and deep learning. Methods A systematic literature search in PubMed was performed for articles pertinent to the topic of artificial intelligence methods used in medicine with an emphasis on ophthalmology. Results A review of machine learning and deep learning methodology for the audience without an extensive technical computer programming background. Conclusions Artificial intelligence has a promising future in medicine; however, many challenges remain. Translational Relevance The aim of this review article is to provide the nontechnical readers a layman's explanation of the machine learning methods being used in medicine today. The goal is to provide the reader a better understanding of the potential and challenges of artificial intelligence within the field of medicine.
824 sitasi
en
Computer Science, Medicine
ABY3: A Mixed Protocol Framework for Machine Learning
Payman Mohassel, Peter Rindal
886 sitasi
en
Computer Science
Machine Learning Testing: Survey, Landscapes and Horizons
J Zhang, M. Harman, Lei Ma
et al.
This paper provides a comprehensive survey of techniques for testing machine learning systems; Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
847 sitasi
en
Computer Science, Mathematics
Comparative analysis of image classification algorithms based on traditional machine learning and deep learning
Pinxian Wang, E. Fan, Peng Wang
Abstract Image classification is a hot research topic in today's society and an important direction in the field of image processing research. SVM is a very powerful classification model in machine learning. CNN is a type of feedforward neural network that includes convolution calculation and has a deep structure. It is one of the representative algorithms of deep learning. Taking SVM and CNN as examples, this paper compares and analyzes the traditional machine learning and deep learning image classification algorithms. This study found that when using a large sample mnist dataset, the accuracy of SVM is 0.88 and the accuracy of CNN is 0.98; when using a small sample COREL1000 dataset, the accuracy of SVM is 0.86 and the accuracy of CNN is 0.83. The experimental results in this paper show that traditional machine learning has a better solution effect on small sample data sets, and deep learning framework has higher recognition accuracy on large sample data sets.
706 sitasi
en
Computer Science
Automated machine learning: Review of the state-of-the-art and opportunities for healthcare
Jonathan Waring, C. Lindvall, Renato Umeton
OBJECTIVE This work aims to provide a review of the existing literature in the field of automated machine learning (AutoML) to help healthcare professionals better utilize machine learning models "off-the-shelf" with limited data science expertise. We also identify the potential opportunities and barriers to using AutoML in healthcare, as well as existing applications of AutoML in healthcare. METHODS Published papers, accompanied with code, describing work in the field of AutoML from both a computer science perspective or a biomedical informatics perspective were reviewed. We also provide a short summary of a series of AutoML challenges hosted by ChaLearn. RESULTS A review of 101 papers in the field of AutoML revealed that these automated techniques can match or improve upon expert human performance in certain machine learning tasks, often in a shorter amount of time. The main limitation of AutoML at this point is the ability to get these systems to work efficiently on a large scale, i.e. beyond small- and medium-size retrospective datasets. DISCUSSION The utilization of machine learning techniques has the demonstrated potential to improve health outcomes, cut healthcare costs, and advance clinical research. However, most hospitals are not currently deploying machine learning solutions. One reason for this is that health care professionals often lack the machine learning expertise that is necessary to build a successful model, deploy it in production, and integrate it with the clinical workflow. In order to make machine learning techniques easier to apply and to reduce the demand for human experts, automated machine learning (AutoML) has emerged as a growing field that seeks to automatically select, compose, and parametrize machine learning models, so as to achieve optimal performance on a given task and/or dataset. CONCLUSION While there have already been some use cases of AutoML in the healthcare field, more work needs to be done in order for there to be widespread adoption of AutoML in healthcare.
657 sitasi
en
Computer Science, Medicine
A Comprehensive Survey of Loss Functions in Machine Learning
Qi Wang, Yue Ma, Kun Zhao
et al.
656 sitasi
en
Computer Science
Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence
S. Raschka, Joshua Patterson, Corey Nolet
Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lies the tools and the methods that are driving it, from processing the massive piles of data generated each day to learning from and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely-used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.
654 sitasi
en
Computer Science, Mathematics
Supervised Machine Learning: A Brief Primer.
T. Jiang, J. Gradus, A. J. Rosellini
Machine learning is increasingly used in mental health research and has the potential to advance our understanding of how to characterize, predict, and treat mental disorders and associated adverse health outcomes (e.g., suicidal behavior). Machine learning offers new tools to overcome challenges for which traditional statistical methods are not well-suited. This paper provides an overview of machine learning with a specific focus on supervised learning (i.e., methods that are designed to predict or classify an outcome of interest). Several common supervised learning methods are described, along with applied examples from the published literature. We also provide an overview of supervised learning model building, validation, and performance evaluation. Finally, challenges in creating robust and generalizable machine learning algorithms are discussed.
634 sitasi
en
Psychology, Medicine
Certified Data Removal from Machine Learning Models
Chuan Guo, T. Goldstein, Awni Y. Hannun
et al.
Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.
617 sitasi
en
Computer Science, Mathematics
Principles and Practice of Explainable Machine Learning
Vaishak Belle, I. Papantonis
Artificial intelligence (AI) provides many opportunities to improve private and public life. Discovering patterns and structures in large troves of data in an automated manner is a core component of data science, and currently drives applications in diverse areas such as computational biology, law and finance. However, such a highly positive impact is coupled with a significant challenge: how do we understand the decisions suggested by these systems in order that we can trust them? In this report, we focus specifically on data-driven methods—machine learning (ML) and pattern recognition models in particular—so as to survey and distill the results and observations from the literature. The purpose of this report can be especially appreciated by noting that ML models are increasingly deployed in a wide range of businesses. However, with the increasing prevalence and complexity of methods, business stakeholders in the very least have a growing number of concerns about the drawbacks of models, data-specific biases, and so on. Analogously, data science practitioners are often not aware about approaches emerging from the academic literature or may struggle to appreciate the differences between different methods, so end up using industry standards such as SHAP. Here, we have undertaken a survey to help industry practitioners (but also data scientists more broadly) understand the field of explainable machine learning better and apply the right tools. Our latter sections build a narrative around a putative data scientist, and discuss how she might go about explaining her models by asking the right questions. From an organization viewpoint, after motivating the area broadly, we discuss the main developments, including the principles that allow us to study transparent models vs. opaque models, as well as model-specific or model-agnostic post-hoc explainability approaches. We also briefly reflect on deep learning models, and conclude with a discussion about future research directions.
551 sitasi
en
Computer Science, Medicine
Challenges in Deploying Machine Learning: A Survey of Case Studies
Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence
In recent years, machine learning has transitioned from a field of academic research interest to a field capable of solving real-world business problems. However, the deployment of machine learning models in production systems can present a number of issues and concerns. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries, and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. By mapping found challenges to the steps of the machine learning deployment workflow, we show that practitioners face issues at each stage of the deployment process. The goal of this article is to lay out a research agenda to explore approaches addressing these challenges.
550 sitasi
en
Computer Science
Fairness in Machine Learning
L. Oneto, S. Chiappa
Machine learning based systems are reaching society at large and in many aspects of everyday life. This phenomenon has been accompanied by concerns about the ethical issues that may arise from the adoption of these technologies. ML fairness is a recently established area of machine learning that studies how to ensure that biases in the data and model inaccuracies do not lead to models that treat individuals unfavorably on the basis of characteristics such as e.g. race, gender, disabilities, and sexual or political orientation. In this manuscript, we discuss some of the limitations present in the current reasoning about fairness and in methods that deal with it, and describe some work done by the authors to address them. More specifically, we show how causal Bayesian networks can play an important role to reason about and deal with fairness, especially in complex unfairness scenarios. We describe how optimal transport theory can be leveraged to develop methods that impose constraints on the full shapes of distributions corresponding to different sensitive attributes, overcoming the limitation of most approaches that approximate fairness desiderata by imposing constraints on the lower order moments or other functions of those distributions. We present a unified framework that encompasses methods that can deal with different settings and fairness criteria, and that enjoys strong theoretical guarantees. We introduce an approach to learn fair representations that can generalize to unseen tasks. Finally, we describe a technique that accounts for legal restrictions about the use of sensitive attributes.
544 sitasi
en
Computer Science, Mathematics
The rise of machine learning for detection and classification of malware: Research developments, trends and challenges
Daniel Gibert, Carles Mateu, Jordi Planes
Abstract The struggle between security analysts and malware developers is a never-ending battle with the complexity of malware changing as quickly as innovation grows. Current state-of-the-art research focus on the development and application of machine learning techniques for malware detection due to its ability to keep pace with malware evolution. This survey aims at providing a systematic and detailed overview of machine learning techniques for malware detection and in particular, deep learning techniques. The main contributions of the paper are: (1) it provides a complete description of the methods and features in a traditional machine learning workflow for malware detection and classification, (2) it explores the challenges and limitations of traditional machine learning and (3) it analyzes recent trends and developments in the field with special emphasis on deep learning approaches. Furthermore, (4) it presents the research issues and unsolved challenges of the state-of-the-art techniques and (5) it discusses the new directions of research. The survey helps researchers to have an understanding of the malware detection field and of the new developments and directions of research explored by the scientific community to tackle the problem.
513 sitasi
en
Computer Science
Mathematics for Machine Learning
Gaurav Kumar, R. Banerjee, Deepak Kr Singh
et al.
Machine learning is a way to study the algorithm and statistical model that is used by computer to perform a specific task through pattern and deduction [1]. It builds a mathematical model from a sample data which may come under either supervised or unsupervised learning. It is closely related to computational statistics which is an interface between statistics and computer science. Also, linear algebra and probability theory are two tools of mathematics which form the basis of machine learning. In general, statistics is a science concerned with collecting, analysing, interpreting the data. Data are the facts and figure that can be classified as either quantitative or qualitative. From the given set of data, we can predict the expected observation, difference between the outcome of two observations and how data look like which can help in better decision making process [2]. Descriptive and inferential statistics are the two methods of data analysis. Descriptive statistics summarize the raw data into information through which common expectation and variation of data can be taken. It also provides graphical methods that can be used to visualize the sample of data and qualitative understanding of observation whereas inferential statistics refers to drawing conclusions from data. Inferences are made under the framework of probability theory. So, understanding of data and interpretation of result are two important aspects of machine learning. In this paper, we have reviewed the different methods of ML, mathematics behind ML, its application in day to day life and future aspects.
Tslearn, A Machine Learning Toolkit for Time Series Data
R. Tavenard, Johann Faouzi, Gilles Vandewiele
et al.
509 sitasi
en
Computer Science
Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)
Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha
et al.
One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submission policy, a community-wide reproducibility challenge, and the inclusion of the Machine Learning Reproducibility checklist as part of the paper submission process. In this paper, we describe each of these components, how it was deployed, as well as what we were able to learn from this initiative.
498 sitasi
en
Computer Science, Mathematics
Quantum embeddings for machine learning
S. Lloyd, M. Schuld, Aroosa Ijaz
et al.
Quantum classifiers are trainable quantum circuits used as machine learning models. The first part of the circuit implements a quantum feature map that encodes classical inputs into quantum states, embedding the data in a high-dimensional Hilbert space; the second part of the circuit executes a quantum measurement interpreted as the output of the model. Usually, the measurement is trained to distinguish quantum-embedded data. We propose to instead train the first part of the circuit---the embedding---with the objective of maximally separating data classes in Hilbert space, a strategy we call quantum metric learning. As a result, the measurement minimizing a linear classification loss is already known and depends on the metric used: for embeddings separating data using the l1 or trace distance, this is the Helstrom measurement, while for the l2 or Hilbert-Schmidt distance, it is a simple overlap measurement. This approach provides a powerful analytic framework for quantum machine learning and eliminates a major component in current models, freeing up more precious resources to best leverage the capabilities of near-term quantum information processors.
420 sitasi
en
Computer Science, Physics