Reason: Access restricted by the author. A copy can be requested for private research and study by contacting your institution's library service. This copy cannot be republished
Kernel density estimation: advances and applications
thesis
posted on 2017-02-27, 22:15 authored by Polak, Julia
An accurate estimator of conditional densities is important, in part because conditional densities are already widely used in econometrics and have the potential to be used more widely still. A conditional density conveys a wide range of properties of the examined data, such as mean, dispersion, tail behavior and asymmetry. Hence it allows the researcher to investigate a wider range of hypotheses than would be the case for the regression model and its many variations. Kernel estimation provides a convenient mathematical framework without the need to assume a particular parametric form for the distribution of the examined data. For the kernel density estimator, the selected bandwidth (the tuning parameter) is the most influential factor in estimator accuracy. Therefore, to increase the utility of conditional kernel density estimators, a variety of appropriate bandwidth selection methods is needed. Moreover, the flexibility of the kernel estimator has great potential in hypothesis testing because it does not require assuming a particular parametric distribution under the null and alternative hypotheses.
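To make the role of the bandwidths concrete, the following minimal sketch computes a kernel estimate of the conditional density of y given x as a ratio of kernel-weighted sums, using a Gaussian kernel. It is an illustration only and is not taken from the thesis; the function names and the bandwidth names a (for the x direction) and b (for the y direction) are assumptions made for this example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def conditional_kde(y, x, ys, xs, a, b):
    """Kernel estimate of f(y | x) from paired samples (xs, ys).

    a and b are hypothetical names for the bandwidths in the x and y
    directions; both must be positive.
    """
    # Weight each sample point by how close its x value is to the target x.
    weights = [gaussian_kernel((x - xi) / a) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no sample points near x; try a larger bandwidth a")
    # Weighted average of kernels centred at the observed y values.
    numerator = sum(
        w * gaussian_kernel((y - yi) / b) / b for w, yi in zip(weights, ys)
    )
    return numerator / total
```

A small a makes the estimate at a given x depend on only the nearest observations, while a small b makes the estimated density in y more spiky; the bandwidth selection methods discussed in this abstract are different ways of choosing these two values.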
The purpose of this thesis is to suggest two new bandwidth selection methods for the conditional density estimator, targeted at two different types of users. Another goal is to develop a model clarification procedure that is versatile enough to be applied to testing different types of models and different types of changes. Finally, we aim to extend the model clarification procedure to the examination of functional models.
The first contribution of this thesis is the suggested implementation of the Markov chain Monte Carlo (MCMC) estimation algorithm for optimal bandwidth selection (Zhang, King & Hyndman 2006) for the conditional density estimator. In addition, we propose a generalization of the Kullback-Leibler information and of the mean squared error criterion and apply them to assessing the accuracy of conditional density estimators. We conduct a comparison of the various conditional density estimators based on several bandwidth selection methods. Our numerical study shows that when the data has two modes or there is a correlation among the conditional covariates, the least squares cross-validation for direct conditional density estimation (Hall, Racine & Li 2004) appears to be the preferred method. This, however, comes at very high computational cost, particularly for large data sets. The MCMC approach provides a density estimator that is much faster to compute and only slightly less accurate, which makes it preferable in these situations. When the data is distributed with only one mode, the conditional normal reference rule bandwidth selection method (Bashtannyk & Hyndman 2001, Hyndman, Bashtannyk & Grunwald 1996) leads to the most accurate conditional density estimator and enjoys a low computational cost. The other examined bandwidth selection methods include the normal reference rule (Scott 1992), the plug-in bandwidth selector (Duong & Hazelton 2003) and the smooth cross-validation selector (Duong & Hazelton 2005a).
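For readers unfamiliar with reference rules, the sketch below shows the simplest member of the family: the univariate normal reference rule for a Gaussian kernel, h = (4/3)^(1/5) * sigma * n^(-1/5), which is roughly 1.06 * sigma * n^(-1/5). This is the textbook unconditional rule, not the conditional normal reference rule compared in the thesis, and the function name is an assumption made for this example.

```python
import statistics


def normal_reference_bandwidth(sample):
    """Univariate normal reference rule bandwidth for a Gaussian kernel.

    Assumes the data are roughly normal; sigma is estimated by the
    sample standard deviation.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sigma = statistics.stdev(sample)
    # (4/3)^(1/5) is approximately 1.06, the constant usually quoted.
    return (4.0 / 3.0) ** 0.2 * sigma * n ** (-0.2)
```

The rule costs only one pass over the data, which is why reference rules are attractive when cross-validation is too slow, but it relies on the assumed distribution being a reasonable description of the data.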
In order to simplify the application of the conditional density kernel estimator, we derive a reference rule for bandwidth selection. In contrast to the usual simple assumption of normally or uniformly distributed data, we assume that the distribution of y given x and the distribution of x are both skew t (which includes the normal, the skew normal and the Student's t distributions as special cases). Moreover, we allow distribution parameters to change as linear functions of the conditional x values. This flexible framework allows us to capture the variations in the skewness and in the kurtosis of the conditional density, as well as the change in its location and scale, as functions of the conditioning variables. Using simulated data, we illustrate the improvement in the accuracy of the conditional density estimator when the bandwidths are chosen under the skew t distribution assumption instead of the normality assumption (Bashtannyk & Hyndman 2001, Hyndman et al. 1996).
The next contribution of this work is the development of a method for analyzing the model in use and examining whether or not the model's predictive ability is still good enough. The proposed prediction capability testing procedure is based on a nonparametric density estimation of potential realizations from the examined model. An important property of this procedure is that it can provide guidance after a relatively low number of new realizations. The procedure's ability to recognize a change in the 'reality' is demonstrated through AR(1) and linear models. We find that the procedure has correct empirical size and high power to recognize changes in the data generating process after 10 to 15 new observations, depending on the type and the extent of the change.
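The general idea of comparing a new observation with a density estimated from simulated realizations can be sketched as follows. This is not the thesis procedure, whose details are not given in this abstract; it is a simplified one-step illustration for an AR(1) model, and the function names, the parameter names phi and sigma, and the density-rank score are all assumptions made for this example.

```python
import math
import random


def simulate_ar1_next(last_value, phi, sigma, n_draws, rng):
    """Draw potential next values from y_t = phi * y_{t-1} + e_t,
    with e_t normal with mean 0 and standard deviation sigma."""
    return [phi * last_value + rng.gauss(0.0, sigma) for _ in range(n_draws)]


def density_rank_score(new_obs, draws, bandwidth):
    """Share of simulated draws whose estimated density is no higher
    than the estimated density at new_obs.

    A value near 0 means new_obs lies where the model rarely puts mass.
    """
    def kde(v):
        # Unnormalised Gaussian kernel sum; the constant cancels
        # because only comparisons between values are needed.
        return sum(math.exp(-0.5 * ((v - d) / bandwidth) ** 2) for d in draws)

    f_new = kde(new_obs)
    return sum(1 for d in draws if kde(d) <= f_new) / len(draws)


# Usage: score a plausible and an implausible new observation.
rng = random.Random(0)
draws = simulate_ar1_next(last_value=1.0, phi=0.5, sigma=1.0, n_draws=300, rng=rng)
score_plausible = density_rank_score(0.5, draws, bandwidth=0.4)
score_implausible = density_rank_score(6.0, draws, bandwidth=0.4)
```

Because the score is built from a kernel density estimate rather than a parametric formula, the same sketch would apply unchanged to any model from which realizations can be simulated.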
Finally, we propose a pattern characteristics testing procedure for validating the predictive abilities of a functional model. With the growing interest in functional data analysis in the last several decades and with the expansion of functional modeling to a diverse range of scientific disciplines, a procedure that clarifies the validity of the functional model is a vital tool. Our approach involves generating many potential paths from the examined model and summarizing their characteristic dynamics using a density of the scores resulting from a functional principal component decomposition. Two sets of simulation experiments are presented to illustrate the size and power of the procedure. An example, testing the fertility rates forecasting method suggested by Hyndman & Ullah (2007), shows the application of the procedure to Australian fertility rates for the years 1921-2002.