RT Dissertation/Thesis
T1 A general framework for prediction in generalized additive models
A1 Carballo González, Alba
AB Smoothing techniques have become one of the most popular modelling approaches in the unidimensional and multidimensional setting. However, out-of-sample prediction in the context of smoothing models is still an open problem, and solving it can significantly widen the use of these models in many areas of knowledge. The objective of this thesis is to propose a general framework for prediction in penalized regression, particularly in the P-splines context. To that end, Chapter 1 includes a review of the different proposals available in the literature, results that are useful and necessary throughout the document, the formulation of a P-spline model and its reparameterization as a mixed model. In Chapter 2, we generalize the approach given by Currie et al. (2004) to predict with any regression basis and quadratic penalty. For the particular case of penalties based on differences between adjacent coefficients, we reparameterize the extended P-spline model as a mixed model, prove that the fit remains the same as the one obtained by fitting the data alone, and show the crucial role of the penalty order, since it determines the shape of the prediction. Moreover, we adapt available methods in contexts such as mixed models (Gilmour et al. 2004) or global optimization (Sacks et al. 1989) to predict in the context of penalized regression and prove their equivalence for the particular case of P-splines. An extensive section of examples illustrates the application of the methodology. We use three real datasets with particular characteristics: one, on aboveground biomass, allows us to show that prediction can also be performed to the left of the data; another, on monthly sulphur dioxide levels, illustrates how prediction can take into account the temporal trends and seasonal effects by using the smooth modulation model based on P-splines suggested by Eilers et al. (2008); and a third, on yearly sea level, shows that prediction can also be carried out in the case of correlated errors. 
We also introduce the concept of “memory of a P-spline” as a tool to know how much of the known information we use to predict new values. In the third chapter, we propose a general framework for prediction in multidimensional smoothing: we extend the proposal of Currie et al. (2004) to predict when more than one covariate is extended. The extension of the prediction method to the multidimensional case is not straightforward in the sense that, in this context, the fit changes when the fit and the prediction are carried out simultaneously. To overcome this problem we propose an easy but elegant solution, based on Lagrange multipliers. The first part of the chapter is dedicated to showing how out-of-sample predictions can be carried out in the context of multidimensional P-splines and the properties satisfied, under certain conditions, by the coefficients that determine the prediction. We also propose the use of restrictions to maintain the fit and, in general, to incorporate any known information about the prediction. The second part of the chapter is dedicated to extending the methodology to the smooth mixed model framework. It is known that when a P-spline model is reparameterized as a mixed model, the structure of the coefficients is lost, that is, they are not ordered according to the position of the knots. This fact is not relevant when we fit the data, but if we predict and impose restrictions over the coefficients, we need to differentiate between the coefficients that determine the fit and the coefficients that determine the prediction. In order to do that, we define a particular transformation matrix that preserves the original model matrices. The prediction method and the use of restrictions are illustrated with one real data example on log mortality rates of the US male population. We show how to solve the crossover problem of adjacent ages when mortality tables are forecasted and compare the results with the well-known method developed in Delwarde et al. 
(2007). The research in Chapter 4 is motivated by the need to extend the prediction methodology in the multidimensional case to more flexible models, the so-called Smooth-ANOVA models, which allow us to include interaction terms that can be decomposed as a sum of several smooth functions. The construction of these models through B-spline bases suffers from identifiability problems. There are several alternatives to solve this problem; here we follow Lee and Durbán (2011) and reparameterize them as mixed models. The first two sections of the chapter are dedicated to introducing the Smooth-ANOVA models and to showing how out-of-sample prediction can be carried out in these models. We illustrate the prediction with Smooth-ANOVA models by reanalyzing the dataset on aboveground biomass. Now, the Smooth-ANOVA model allows us to represent the smooth function as the sum of a smooth function for the height, a smooth function for the diameter of a tree, and a smooth term for the height-diameter interaction. At the end of this chapter, we provide a simulation study in order to evaluate the accuracy of the 2D interaction P-spline models and Smooth-ANOVA models, with and without imposing invariance of the fit. From the results of the simulation study, we conclude that in most situations the constrained S-ANOVA model behaves better in the fit and out-of-sample predictions; however, results depend on the simulation scenario and on the number of dimensions in which the prediction is carried out (one or both dimensions). In the fifth chapter we generalize the developed methodology to generalized linear models (GLMs) in the context of P-splines (P-GLMs) and mixed models (P-GLMMs). In both frameworks, the coefficient and parameter estimation procedures involve nonlinear equations. 
To solve them, iterative algorithms based on the Newton-Raphson method are used, regardless of the estimation criterion used (for instance, in the GLMM context we can maximize the residual maximum likelihood (REML) or an approximate REML (based on the Laplace approximation)). These iterative algorithms are based on a working normal theory model or a set of pseudodata and weights. Based on this idea, we extend the Penalized Quasilikelihood (PQL) method to fit and predict simultaneously in the context of GLMMs. We highlight that, in the context of mixed models (even in the univariate case), to maintain the fit a transformation that preserves the original model matrices has to be used, since different transformations lead to different working vectors and therefore to different solutions. We also show how restrictions can be imposed in P-GLM and P-GLMM models. To illustrate the procedures we use a real dataset to predict deaths due to respiratory disease through 2D interaction P-splines and S-ANOVA models (both with and without the restriction that the fit has to be maintained). Finally, Chapter 6 is devoted to summarizing the main conclusions and posing a list of future lines of work.
AB Smoothing techniques have become one of the most popular modelling approaches in the unidimensional and multidimensional setting. However, prediction outside the range of known values in the context of smoothing models remains an open problem, and solving it can significantly widen the use of these models in many areas of knowledge. The objective of this document is to propose a general framework for prediction in penalized regression, particularly in the context of P-splines. To that end, Chapter 1 includes a review of the different proposals available in the literature and the results that are useful and necessary throughout the document, the formulation of a P-spline model and its reparameterization as a mixed model. In Chapter 2, we generalize the approach given by Currie et al. (2004) to predict with any regression basis and quadratic penalty. For the particular case of penalties based on differences between adjacent coefficients, we reparameterize the extended P-spline model as a mixed model and prove that the fit remains the same as the one obtained by fitting the data alone; we also show the crucial role of the penalty order, since it determines the shape of the prediction. Moreover, we adapt methods available in contexts such as mixed models (Gilmour et al. 2004) or global optimization (Sacks et al. 1989) to predict in the context of penalized regression, and prove their equivalence for the particular case of P-splines. An extensive section of examples illustrates the application of the methodology. 
We use three real datasets with particular characteristics: one, on biomass, allows us to show that prediction can also be performed to the left of the data; another, on monthly sulphur dioxide levels, illustrates how prediction can take into account temporal trends and seasonal effects by using the smooth modulation model based on P-splines suggested by Eilers et al. (2008); and a third, on yearly sea level, shows that prediction can also be carried out in the case of correlated errors. We also introduce the concept of “memory of a P-spline” as a tool to know how much of the known information we use to predict new values. In the third chapter, we propose a general framework for prediction in multidimensional smoothing: we extend the proposal of Currie et al. (2004) to predict when more than one covariate is extended. The extension of the prediction method to the multidimensional case is not straightforward in the sense that, in this context, the fit changes when the fit and the prediction are carried out simultaneously. To solve this problem, we propose an easy solution based on Lagrange multipliers. The first part of the chapter is dedicated to showing how out-of-sample predictions can be carried out in the context of multidimensional P-splines and the properties satisfied, under certain conditions, by the coefficients that determine the prediction. We also propose the use of restrictions to maintain the fit and, in general, to incorporate any known information about the prediction. The second part of the chapter is dedicated to extending the methodology to the smooth mixed model framework. It is known that when a P-spline model is reparameterized as a mixed model, the structure of the coefficients is lost, that is, they are not ordered according to the position of the knots. 
This fact is not relevant when we fit the data, but if we predict and impose restrictions on the coefficients, we need to differentiate between the coefficients that determine the fit and the coefficients that determine the prediction. To do so, we define a particular transformation matrix that preserves the original model matrices. The prediction method and the use of restrictions are illustrated with a real data example on the log mortality rates of the US male population. We show how to solve the problem of crossing projections for adjacent ages when mortality tables are forecasted, and compare the results with the method developed in Delwarde et al. (2007). The research in Chapter 4 is motivated by the need to extend the prediction methodology in the multidimensional case to more flexible models, the Smooth-ANOVA models, which allow us to include interaction terms that can be decomposed as a sum of several smooth functions. The construction of these models through B-splines suffers from identifiability problems. There are several alternatives to solve this problem; we follow Lee and Durbán (2011) and reparameterize them as mixed models. The first two sections of the chapter are dedicated to introducing the Smooth-ANOVA models and showing how prediction outside the range of observed values can be carried out in these models. We illustrate the prediction with Smooth-ANOVA models by reanalyzing the dataset on biomass. Now, the Smooth-ANOVA model allows us to represent the smooth function as the sum of a smooth function for the height, a smooth term for the diameter, and a smooth function for the height-diameter interaction. At the end of this chapter, we provide a simulation study to evaluate the accuracy of the 2D interaction P-spline models and the Smooth-ANOVA models, with and without imposing invariance of the fit. 
From the results of the simulation study, we conclude that in most situations the constrained S-ANOVA model behaves better both in the fit and in the prediction; however, the results depend on the simulation scenario and on the number of dimensions in which the prediction is carried out (one or both dimensions). In the fifth chapter we generalize the developed methodology to generalized linear models (GLMs) in the context of P-splines (P-GLMs) and mixed models (P-GLMMs). In both frameworks, the coefficient and parameter estimation procedures involve nonlinear equations. To solve them, iterative algorithms based on Newton-Raphson methods are used, regardless of the estimation criterion used (for instance, in the context of GLMMs we can maximize the residual maximum likelihood (REML) or an approximate REML (based on the Laplace approximation)). These iterative algorithms are based on a working normal theory model or on a set of pseudodata and weights. Based on this idea, we extend the Penalized Quasilikelihood (PQL) method to fit and predict simultaneously in the context of GLMMs. We highlight that, in the context of mixed models (even in the univariate case), to maintain the fit a transformation that preserves the original model matrices must be used, since different transformations lead to different working vectors and therefore to different solutions. We also show how restrictions can be imposed in P-GLM and P-GLMM models. To illustrate the procedures, we use a real dataset to predict deaths due to respiratory disease through 2D P-spline models and S-ANOVA models (with and without the restriction that the fit must be maintained). Finally, Chapter 6 is devoted to summarizing the main conclusions and posing a list of future lines of work.
YR 2020
FD 2020-01-13
LK https://hdl.handle.net/10016/31312
UL https://hdl.handle.net/10016/31312
LA eng
NO International Mention in the doctoral degree (Mención Internacional en el título de doctor)
NO The research presented in this thesis has been partially supported by the Basque Government through the BERC 2018-2021 program and by the Spanish Ministry of Economy and Competitiveness MINECO through BCAM Severo Ochoa excellence accreditation SEV-2013-0323 and through projects MTM2017-82379-R, funded by (AEI/FEDER, UE) and acronym “AFTERAM”, and MTM2014-52184-P.
DS e-Archivo
RD 1 sept. 2024