Central limit theorem: the cornerstone of modern statistics
S. Kwak, Jong Hae Kim
According to the central limit theorem, the mean of a random sample of size n from a population with mean µ and variance σ² is approximately normally distributed with mean µ and variance σ²/n. Using the central limit theorem, a variety of parametric tests have been developed under assumptions about the parameters that determine the population probability distribution. Compared to non-parametric tests, which require no assumptions about the population probability distribution, parametric tests produce more accurate and precise estimates with higher statistical power. However, many medical researchers use parametric tests to present their data without knowing how the central limit theorem contributed to the development of such tests. This review therefore presents the basic concepts of the central limit theorem and its role in binomial distributions and Student's t-test, and provides an example of the sampling distributions of small populations. A proof of the central limit theorem is also described, along with the mathematical concepts required for its near-complete understanding.
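The distributional claim in this abstract is easy to check by simulation. Below is a minimal sketch (not from the paper; the exponential population and all numeric values are illustrative assumptions): sample means of size n from a skewed population with mean µ = 1 and variance σ² = 1 should have mean close to µ and variance close to σ²/n.

```python
import numpy as np

# Illustrative check of the theorem stated above: draw 10,000 samples of
# size n from an exponential population (mean mu = 1, variance sigma^2 = 1)
# and compare the mean and variance of the sample means with mu and sigma^2/n.
rng = np.random.default_rng(42)
mu, sigma2, n = 1.0, 1.0, 50
means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)
print("mean of sample means: %.4f  (theory: %.4f)" % (means.mean(), mu))
print("var  of sample means: %.5f  (theory: %.5f)" % (means.var(), sigma2 / n))
```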
Causal inference in statistics: An overview
J. Pearl
This review presents empirical researchers with recent advances in causal inference, and stresses the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries: (1) queries about the effects of potential interventions (also called "causal effects" or "policy evaluation"); (2) queries about probabilities of counterfactuals (including assessment of "regret," "attribution," or "causes of effects"); and (3) queries about direct and indirect effects (also known as "mediation"). Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both.
2404 citations
Mathematics
All of Statistics: A Concise Course in Statistical Inference
L. Wasserman
2142 citations
Computer Science
Introduction to the Practice of Statistics
D. Moore, G. McCabe
3241 citations
Mathematics
Probability Theory and Mathematical Statistics
M. Fisz
947 citations
Computer Science
Mathematical Statistics: Basic Ideas and Selected Topics
P. Bickel, K. Doksum
894 citations
Mathematics
Mathematical Statistics with Applications
K. Ramachandran, C. Tsokos
730 citations
Computer Science
Introduction to Mathematical Statistics
P. Hoel
812 citations
Mathematics, Economics
Statistics in Theory and Practice
R. Lupton
Aimed at a diverse scientific audience, including physicists, astronomers, chemists, geologists, and economists, this book explains the theory underlying the classical statistical methods. Its level is between introductory "how to" texts and intimidating mathematical monographs. A reader without previous exposure to statistics will finish the book with a sound working knowledge of statistical methods, while a reader already familiar with the standard tests will come away with an understanding of their strengths, weaknesses, and domains of applicability. The mathematical level is that of an advanced undergraduate; for example, matrices and Fourier analysis are used where appropriate. Among the topics covered are common probability distributions; sampling and the distribution of sampling statistics; confidence intervals, hypothesis testing, and the theory of tests; estimation (including maximum likelihood); goodness of fit (including χ² and Kolmogorov-Smirnov tests); and non-parametric and rank tests. There are nearly one hundred problems (with answers) designed to bring out points in the text and to cover topics slightly outside the main line of development.
201 citations
Computer Science
MODEL SELECTION FOR B-SPLINE REGRESSION USING AKAIKE INFORMATION CRITERION (AIC) METHOD FOR IDR-USD EXCHANGE RATE PREDICTION
Indriani Wahyu Nur Pratiwi, Sri Utami Zuliana
Exchange rate data are time-indexed records of a foreign currency's exchange rate. The autoregressive integrated moving average (ARIMA) model is a well-known time series method, but several assumptions must be checked before fitting it: stationarity, normality, and white-noise residuals. B-spline regression, by contrast, models time series data without these assumptions. This research aims to build a forecasting model for the Rupiah exchange rate against the US Dollar using B-spline regression. Candidate B-spline regression models were generated with degrees two to four and a maximum of four knots, and the optimal model was then selected by its Akaike Information Criterion (AIC) score. The performance of the selected model was validated using the mean absolute percentage error (MAPE). The optimal degree is three and the optimal number of knot points is two, with an AIC value of 857.8322 and a MAPE value of 0.0148376. The best model is:
Probabilities. Mathematical statistics
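As a rough sketch of the model-selection loop this abstract describes, the following is a minimal plain-numpy version (not the authors' code: the toy series standing in for the IDR-USD rate, the quantile knot placement, and the Gaussian AIC form n·ln(RSS/n) + 2k are all assumptions):

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """B-spline design matrix at points x via the Cox-de Boor recursion.
    `knots` must include both endpoints; it is clamped by repeating them."""
    t = np.r_[[knots[0]] * degree, knots, [knots[-1]] * degree]
    # degree-0 basis: indicators of the half-open knot intervals
    B = np.stack([((t[i] <= x) & (x < t[i + 1])).astype(float)
                  for i in range(len(t) - 1)], axis=1)
    B[x == t[-1], -1] = 1.0                      # close the right endpoint
    for d in range(1, degree + 1):               # raise the degree step by step
        nxt = np.zeros((len(x), len(t) - d - 1))
        for i in range(len(t) - d - 1):
            left, right = t[i + d] - t[i], t[i + d + 1] - t[i + 1]
            if left > 0:
                nxt[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                nxt[:, i] += (t[i + d + 1] - x) / right * B[:, i + 1]
        B = nxt
    return B

def aic_of_fit(x, y, degree, n_knots):
    """OLS B-spline fit; Gaussian AIC up to a constant: n*ln(RSS/n) + 2k."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2))  # quantile knots
    X = bspline_basis(x, knots, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

# toy stand-in for the exchange-rate series; grid matches the paper's search space
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(6 * x) + 0.1 * rng.standard_normal(200)
best = min((aic_of_fit(x, y, d, m), d, m) for d in (2, 3, 4) for m in (1, 2, 3, 4))
print("best AIC %.2f at degree %d with %d knots" % best)
```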
Comparative analysis of machine learning algorithms for predicting Dubai property prices
Abdulsalam Elnaeem Balila, Ani Bin Shabri
Introduction: Predicting property prices is a crucial task in the real estate market, and machine learning algorithms offer valuable tools for accurate predictions. In this study, we present a comprehensive comparison of eight well-known machine learning algorithms: ensemble empirical mode decomposition (EEMD)-stochastic (S) + deterministic (D)-support vector machine (EEMD-SD-SVM), support vector machine (SVM), gradient boosting, random forest, K-nearest neighbors (KNN), linear regression, artificial neural networks (ANN), and decision trees. The focus is on predicting property prices in Dubai, with the primary objective of assessing the predictive performance of these algorithms within this specific market context.

Methods: The evaluation is based on four key performance metrics: R-squared (R2), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). These metrics provide insights into prediction errors, accuracy in percentage terms, and the proportion of variance in property prices explained by the independent variables. The study compares the strengths and limitations of each algorithm for predicting property prices in Dubai, highlighting scenarios where certain algorithms excel based on the nature of decision boundaries, handling of complex data, capture of localized patterns, and interpretability.

Results: The comparative analysis sheds light on the performance of each algorithm in predicting property prices in Dubai. EEMD-SD-SVM and SVM excel in scenarios requiring precise decision boundaries, while gradient boosting and random forests perform robustly on complex and noisy property price data. KNN captures localized patterns effectively, linear regression is suitable for straightforward regression tasks, ANN excels with extensive datasets, and decision trees offer interpretability in understanding the factors that influence property prices.

Discussion: The study emphasizes the significance of model tuning, feature selection, and data pre-processing for enhancing predictive power. Practical aspects such as computational efficiency, model interpretability, and scalability in real-world applications are also discussed. The comparative analysis provides guidance for real estate professionals, data scientists, and other stakeholders in selecting the most suitable machine learning algorithm for predicting property prices in Dubai, with a focus on the essential evaluation metrics of MSE, RMSE, MAPE, and R2. Stakeholders such as real estate agents, buyers, sellers, and investors can leverage these insights to make informed decisions in the Dubai real estate market.
Applied mathematics. Quantitative methods, Probabilities. Mathematical statistics
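For reference, the four evaluation metrics named in the Methods section above have standard closed forms; here is a minimal plain-numpy sketch (the toy price arrays are illustrative, and this MAPE definition assumes no true price is zero):

```python
import numpy as np

def report(y_true, y_pred):
    """MSE, RMSE, MAPE, and R-squared, the four metrics used in the study."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true))   # undefined if any true price is 0
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# toy example: actual vs. predicted prices (arbitrary currency units)
y_true = np.array([1.20e6, 2.45e6, 0.98e6, 3.10e6])
y_pred = np.array([1.15e6, 2.60e6, 1.05e6, 2.95e6])
print(report(y_true, y_pred))
```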
On the Regularity of the Entropy Solution of the Fractional-Semigeostrophic Equation Using Ultrafilters
Gilles Tamba Bokolo
This paper explores the regularity properties of the entropy solution to the fractional semigeostrophic (FSG) equation, leveraging the advanced mathematical framework of ultrafilters. The FSG equation, which extends the classical semigeostrophic model by incorporating fractional derivatives, is critical for modeling geophysical fluid dynamics more accurately, particularly in scenarios involving anomalous diffusion. Traditional methods for studying entropy solutions often face limitations due to the complex nature of fractional derivatives and the nonlinearity inherent in the FSG equation. By employing ultrafilters, a powerful tool from non-standard analysis, we overcome these challenges and provide new insights into the regularity of entropy solutions. Our results demonstrate that the use of ultrafilters not only simplifies the analysis but also enhances the understanding of the smoothness and stability of the entropy solutions. This work lays the groundwork for future studies of fractional partial differential equations in geophysical contexts, potentially leading to more accurate predictive models in meteorology and oceanography.
Mathematics, Probabilities. Mathematical statistics
Euclid preparation. XXIX. Forecasts for ten different higher-order weak lensing statistics
Euclid Collaboration: V. Ajani, M. Baldi, A. Barthelemy
et al.
Recent cosmic shear studies have shown that higher-order statistics (HOS) developed by independent teams now outperform standard two-point estimators in terms of statistical precision thanks to their sensitivity to the non-Gaussian features of large-scale structure. The aim of the Higher-Order Weak Lensing Statistics (HOWLS) project is to assess, compare, and combine the constraining power of ten different HOS on a common set of $Euclid$-like mocks, derived from N-body simulations. In this first paper of the HOWLS series, we computed the nontomographic ($\Omega_{\rm m}$, $\sigma_8$) Fisher information for the one-point probability distribution function, peak counts, Minkowski functionals, Betti numbers, persistent homology Betti numbers and heatmap, and scattering transform coefficients, and we compare them to the shear and convergence two-point correlation functions in the absence of any systematic bias. We also include forecasts for three implementations of higher-order moments, but these cannot be robustly interpreted as the Gaussian likelihood assumption breaks down for these statistics. Taken individually, we find that each HOS outperforms the two-point statistics by a factor of around two in the precision of the forecasts with some variations across statistics and cosmological parameters. When combining all the HOS, this increases to a $4.5$ times improvement, highlighting the immense potential of HOS for cosmic shear cosmological analyses with $Euclid$. The data used in this analysis are publicly released with the paper.
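For context, Fisher forecasts of this kind are conventionally computed, under the Gaussian-likelihood assumption the abstract flags, from the generic textbook expression below (a standard form with parameter-independent covariance, not the HOWLS pipeline itself), where $\bar{d}$ is the mean data vector of a given statistic, $C$ is its covariance estimated from the mocks, and $\theta \in \{\Omega_{\rm m}, \sigma_8\}$:

$$ F_{\alpha\beta} = \sum_{i,j} \frac{\partial \bar{d}_i}{\partial \theta_\alpha} \left(C^{-1}\right)_{ij} \frac{\partial \bar{d}_j}{\partial \theta_\beta}, \qquad \sigma(\theta_\alpha) \geq \sqrt{\left(F^{-1}\right)_{\alpha\alpha}}. $$

The forecast marginal error on each parameter is read off the inverse of the Fisher matrix.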
Modeling visceral leishmaniasis and tuberculosis co-infection dynamics
Oluwaseun F. Egbelowo, Justin B. Munyakazi
et al.
Co-infection of patients with visceral leishmaniasis (VL) and tuberculosis (TB) poses a major public health challenge. In this study, we develop a mathematical model of the transmission dynamics of VL and TB co-infection by first analyzing the VL and TB sub-models separately. The dynamics of these sub-models and of the full co-infection model are determined by the reproduction number. When the associated reproduction numbers of the TB-only model (R1) and the VL-only model (R2) are less than unity, the model exhibits backward bifurcation. If max{R1,R2}=R1, then the TB-VL co-infection model exhibits backward bifurcation for values of R1 less than unity. Furthermore, if max{R1,R2}=R2, then, choosing the transmission probability βL as the bifurcation parameter, backward bifurcation occurs for values of R2 less than unity. Consequently, the full model, whose associated reproduction number is R0, also exhibits backward bifurcation at R0=1. The equilibrium points and their stability are determined and analyzed based on the magnitudes of the respective reproduction numbers. Finally, numerical simulations are presented to demonstrate the reliability of the theoretical results.
Applied mathematics. Quantitative methods, Probabilities. Mathematical statistics
Analysis of China's Population with Uncertain Statistics
Yang Liu
Uncertain statistics is a set of mathematical techniques for collecting, analyzing, and interpreting data using uncertainty theory. This paper employs uncertain statistical tools, including uncertain time series analysis, uncertain regression analysis, and uncertain differential equations, to model China's population. An uncertain hypothesis test is then used to determine whether the estimated uncertain statistical models are appropriate. As a byproduct, the paper also shows, by analyzing the characteristics of the corresponding residuals, that probability statistics cannot model China's population.
29 citations
Computer Science
Approachable Case Studies Support Learning and Reproducibility in Data Science: An Example from Evolutionary Biology
Luna L. Sanchez Reyes, Emily Jane McTavish
Research reproducibility is essential for scientific development. Yet, rates of reproducibility are low. As increasingly more research relies on computers and software, efforts for improving reproducibility rates have focused on making research products digitally available, such as publishing analysis workflows as computer code, and raw and processed data in computer readable form. However, research products that are digitally available are not necessarily friendly for learners and interested parties with little to no experience in the field. This renders research products unapproachable, counteracts their availability, and hinders scientific reproducibility. To improve both short- and long-term adoption of reproducible scientific practices, research products need to be made approachable for learners, the researchers of the future. Using a case study within evolutionary biology, we identify aspects of research workflows that make them unapproachable to the general audience: use of highly specialized language; unclear goals and high cognitive load; and lack of trouble-shooting examples. We propose principles to improve the unapproachable aspects of research workflows and illustrate their application using an online teaching resource. We elaborate on the general application of these principles for documenting research products and teaching materials, to provide present learners and future researchers with tools for successful scientific reproducibility. Supplementary materials for this article are available online.
Probabilities. Mathematical statistics, Special aspects of education
OPTIMIZING THE PROCESS OF PICK-UP AND DELIVERY WITH TIME WINDOWS USING ANT COLONY AND TABU SEARCH ALGORITHMS
Imas Saumi Amalia, Toni Bakhtiar, Jaharuddin Jaharuddin
Goods pick-up and delivery services often face constraints such as limits on the number of vehicles, vehicle capacity, and service time, or vehicles with only a single loading access. To avoid losses, a strategy is needed for determining the optimal route and the policy for arranging goods in the vehicle, especially when two types of goods are involved. The Traveling Salesman Problem with Pick-up and Delivery, Handling Costs, and Time Windows (TSPPDHTW) is an optimization model that minimizes the total travel and goods-handling costs of pick-up and delivery under the constraints above. Solving this model with an exact method requires very long computation times, making it impractical in real-life settings. This study develops a metaheuristic combining Ant Colony Optimization (ACO) and Tabu Search (TS), called ACOTS, to solve the TSPPDHTW in reasonable computation time. The development adds functions for clustering, constraint evaluation, tour cutting, goods arrangement, and move evaluation in the TS, and modifies the ACO transition rules. The resulting solutions deviate by about 22% from the exact method while requiring 99.99% less computation time.
Probabilities. Mathematical statistics
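The unmodified ACO transition rule that such developments start from can be sketched in a few lines (a generic textbook version, not the paper's modified rule; the parameter values α = 1 and β = 2 and the toy distance matrix are illustrative):

```python
import numpy as np

def next_node(tau, eta, visited, i, alpha=1.0, beta=2.0, rng=None):
    """Pick the next node j from i with probability proportional to
    tau[i, j]**alpha * eta[i, j]**beta (pheromone x heuristic desirability)."""
    rng = rng if rng is not None else np.random.default_rng()
    w = (tau[i] ** alpha) * (eta[i] ** beta)
    w[list(visited)] = 0.0                       # never revisit a node
    return rng.choice(len(w), p=w / w.sum())

# tiny 4-node example: uniform pheromone, heuristic = 1/distance
dist = np.array([[np.inf, 2.0, 5.0, 3.0],
                 [2.0, np.inf, 4.0, 1.0],
                 [5.0, 4.0, np.inf, 6.0],
                 [3.0, 1.0, 6.0, np.inf]])
tau = np.ones_like(dist)
eta = 1.0 / dist                                 # inf distance gives eta = 0
print(next_node(tau, eta, visited={0}, i=0))
```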
A generalized Spatio-Temporal Threshold Clustering method for identification of extreme event patterns
V. Kholodovsky, X.-Z. Liang
Extreme weather and climate events such as floods, droughts, and heat waves can cause extensive societal damages. While various statistical and climate models have been developed for the purpose of simulating extremes, a consistent definition of extreme events is still lacking. Furthermore, to better assess the performance of the climate models, a variety of spatial forecast verification measures have been developed. However, in most cases, the spatial verification measures that are widely used to compare mean states do not have sufficient theoretical justification to benchmark extreme events. In order to alleviate inconsistencies when defining extreme events within different scientific communities, we propose a new generalized Spatio-Temporal Threshold Clustering method for the identification of extreme event episodes, which uses machine learning techniques to couple existing pattern recognition indices with high or low threshold choices. The method consists of five main steps: (1) construction of essential field quantities; (2) dimension reduction; (3) spatial domain mapping; (4) time series clustering; and (5) threshold selection. We develop and apply this method using a gridded daily precipitation dataset derived from rain gauge stations over the contiguous United States. We observe changes in the distribution of conditional frequency of extreme precipitation from large-scale well-connected spatial patterns to smaller-scale more isolated rainfall clusters, possibly leading to more localized droughts and heat waves, especially during the summer months. The proposed method automates the threshold selection process through a clustering algorithm and can be directly applicable in conjunction with modeling and spatial forecast verification of extremes. Additionally, it allows for the identification of synoptic-scale spatial patterns that can be directly traced to the individual extreme episodes, and it offers users the flexibility to select an extreme threshold that is linked to the desired geometrical properties. The approach can be applied to broad scientific disciplines.
Oceanography, Meteorology. Climatology
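A heavily simplified sketch of the threshold-then-cluster idea behind steps (3)-(5) of the method above (illustrative only: the synthetic precipitation field, the 98th-percentile threshold, and the use of DBSCAN are all assumptions; the paper's five-step pipeline is considerably richer):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Flag grid cells whose daily precipitation exceeds a high quantile
# threshold, then cluster the flagged (lon, lat) points into candidate
# extreme-event episodes.
rng = np.random.default_rng(1)
lon, lat = np.meshgrid(np.linspace(-125, -67, 60), np.linspace(25, 49, 30))
precip = rng.gamma(shape=0.5, scale=8.0, size=lon.shape)  # toy daily field (mm)
threshold = np.quantile(precip, 0.98)                     # high-threshold choice
pts = np.column_stack([lon[precip > threshold], lat[precip > threshold]])
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(pts)  # -1 marks noise
print(f"{labels.max() + 1} spatial clusters of extreme cells")
```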
Introductory Mathematical Statistics
B. Korin, E. Kreyszig
488 citations
Computer Science, Mathematics
Bayesian estimation for Life-Time distribution parameter under Compound Loss Function with Optimal Sample Size Determination
Safwan Nathem Rashad, Raya Al-Rassam
This research aims to find Bayes estimators under two loss functions, one symmetric and one asymmetric, namely the squared log error loss function and the entropy loss function, as well as a loss function that combines the two, called the compound loss function, which is asymmetric in nature. The Bayes estimators of the scale parameter of the Life-Time distribution, a family that includes a collection of known distributions, are compared under the proposed compound loss function and its constituent loss functions, and the optimal sample size is also estimated. Using the mean square error (MSE) criterion, random data were generated by simulation to estimate the parameters of the Weibull distribution, a special case of the Life-Time distribution, for different sample sizes (n = 10, 50, 100) and (N = 1000), taking initial values for the parameters, to obtain a balanced estimator that combines the two loss functions.
Probabilities. Mathematical statistics
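Under the usual textbook definitions (an assumption, since the abstract above does not write the losses out), with estimate $\hat\theta$ of the scale parameter $\theta$ and a mixing weight $w \in (0,1)$, the constituent and compound losses take the forms

$$ L_{\mathrm{SLE}}(\hat\theta,\theta) = \left(\ln\hat\theta - \ln\theta\right)^{2}, \qquad L_{\mathrm{E}}(\hat\theta,\theta) = \frac{\hat\theta}{\theta} - \ln\frac{\hat\theta}{\theta} - 1, \qquad L_{\mathrm{C}} = w\,L_{\mathrm{SLE}} + (1-w)\,L_{\mathrm{E}}. $$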