By Irantzu Barrio (DMAEIO, Universidad del País Vasco / Euskal Herriko Unibertsitatea).
In the medical field, prediction models are gaining importance as a support for decision-making, and better knowledge of potential predictors helps that process. An important consideration in the development of prediction models is the selection of the predictors (clinical variables) to be used in the model. From a statistical perspective, categorising continuous variables is not advisable, since it may entail a loss of information and power. Yet in clinical research and, more specifically, in the development of prediction models for use in clinical practice, both clinicians and health managers call for the categorisation of continuous parameters. Although categorisation is common practice in clinical research, there are no unified criteria for the selection of the cut points. Previous work on the categorisation of continuous variables has, in almost all cases, aimed at dichotomising the predictor variable.
In this work, we focus on the categorisation of continuous variables to be used in the development of prediction models, considering that the use of more than two categories may be preferable. This serves to reduce the loss of information and enables the relationship between the covariate and the response variable to be retained. Our goal is to propose a methodology to categorise continuous predictor variables in regression-based prediction models, focusing mainly on the logistic and Cox regression models, which are the most widely used in the medical field for modelling dichotomous and time-to-event outcomes, respectively. For a dichotomous response variable Y, our proposal consists of categorising the continuous covariate X in such a way that the area under the receiver operating characteristic curve (AUC) of the resulting model is maximised (Barrio et al., 2017a). The proposal can be extended to a multivariable logistic regression model with or without interactions.
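As an illustration of the idea of choosing cut points that maximise the AUC, the following is a minimal Python sketch, not the authors' algorithm or the CatPredi implementation. It assumes a single predictor and an exhaustive search over a small grid of observed values; the function names (`auc`, `categorised_auc`, `best_cut_points`) and the `grid_size` parameter are hypothetical. It relies on the fact that a logistic model with only the categorised predictor (dummy-coded) reproduces the observed event rate in each category, so those rates can serve as the model's scores.

```python
from itertools import combinations


def auc(scores, y):
    """Mann-Whitney estimate of the AUC; tied scores count one half."""
    cases = [s for s, yi in zip(scores, y) if yi == 1]
    controls = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def categorised_auc(x, y, cuts):
    """AUC of the model that uses only the categorised version of x.

    The score of each observation is the observed event rate of its
    category (what a dummy-coded logistic model would fit).
    """
    k = len(cuts) + 1
    cats = [sum(xi > c for c in cuts) for xi in x]  # category index 0..k-1
    n, events = [0] * k, [0] * k
    for c, yi in zip(cats, y):
        n[c] += 1
        events[c] += yi
    if min(n) == 0:  # discard candidate cuts that leave a category empty
        return 0.0
    rate = [e / m for e, m in zip(events, n)]
    return auc([rate[c] for c in cats], y)


def best_cut_points(x, y, n_cuts=2, grid_size=19):
    """Exhaustive search (an assumption of this sketch) over a grid of
    observed values of x for the n_cuts cut points with maximal AUC."""
    xs = sorted(x)
    grid = sorted({xs[(i * len(xs)) // (grid_size + 1)]
                   for i in range(1, grid_size + 1)})
    best = max(combinations(grid, n_cuts),
               key=lambda cuts: categorised_auc(x, y, cuts))
    return best, categorised_auc(x, y, best)


# Toy data with a non-monotone relationship: events occur only for 30 < x <= 70.
x = list(range(100))
y = [1 if 30 < xi <= 70 else 0 for xi in x]
cuts, value = best_cut_points(x, y, n_cuts=2)
print(cuts, value)
```

In this toy example a single cut point could not separate the events from the non-events, whereas two cut points can, which mirrors the argument above for allowing more than two categories. An exhaustive search becomes expensive as the grid or the number of cut points grows, so a practical implementation would likely use a more efficient search strategy.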
For time-to-event outcomes, on the other hand, we considered categorising the continuous predictor variable X in a Cox proportional hazards model. To measure the discriminative ability of the model, we considered the concordance probability, and two different estimators were studied: the c-index and the concordance probability estimator (CPE) (Barrio et al., 2017b). In this talk I will present the methodology we have developed to categorise continuous variables in prediction models, together with an empirical validation by means of simulations and an application to a real data set of patients with chronic obstructive pulmonary disease. Finally, I will present the R package CatPredi, which implements these methods and provides the user with the optimal cut points and the categorised variable to be used in practice.
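To make the concordance criterion for time-to-event outcomes more concrete, the sketch below computes a c-index for right-censored data in plain Python. It assumes that the c-index is understood in Harrell's sense, takes the risk scores as given (in practice they would come from a fitted Cox model on the categorised predictor), and simply ignores pairs with tied times; the function name `harrell_c` is hypothetical, and the CPE is not covered here.

```python
def harrell_c(time, event, risk):
    """Harrell-type c-index for right-censored data (simplified sketch).

    A pair (i, j) is usable when subject i has an observed event and a
    strictly shorter time than subject j. The pair is concordant when i
    also has the higher risk score; tied risk scores count one half.
    Pairs with tied times are ignored in this sketch.
    """
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:  # a censored subject cannot be the earlier member
            continue
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


# Four subjects; the third is censored at time 6. The risk score could be,
# for instance, a risk level attached to each category of X.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [3, 2, 2, 1]
print(harrell_c(time, event, risk))
```

Because a categorised predictor takes only a few distinct values, many pairs have tied risk scores, so the way ties are counted has a visible effect on the estimate.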