caret
We continue with our central problem: predicting main-occupation income (p21) in the EPH for the second quarter of 2015. This time, however, we want to see whether we can predict non-response; that is, train a model that estimates how likely it is that a person will NOT report their income.
We have already preprocessed the data, so we are ready to start. The first thing to do is import the libraries we will be working with:
library(caret)
library(tidyverse)
library(rpart)
Then we load the data and tidy up some of the labels:
load('../data/EPH_2015_II.RData')
data$pp03i <- factor(data$pp03i, labels = c('1-SI', '2-No', '9-NS'))
data$intensi <- factor(data$intensi, labels = c('1-Sub_dem', '2-SO_no_dem',
                                                '3-Ocup.pleno', '4-Sobreoc',
                                                '5-No trabajo', '9-NS'))
data$pp07a <- factor(data$pp07a, labels = c('0-NC',
                                            '1-Menos de un mes',
                                            '2-1 a 3 meses',
                                            '3-3 a 6 meses',
                                            '4-6 a 12 meses',
                                            '5-12 a 60 meses',
                                            '6-Más de 60 meses',
                                            '9-NS'))
data <- data %>%
  mutate(imp_inglab1 = factor(imp_inglab1, labels = c('non_miss', 'miss')))
Now our target variable is the indicator imp_inglab1, so we drop p21:
df_train <- data %>%
  select(-p21)
The first thing we do is create a data partition:
set.seed(1234)
tr_index <- createDataPartition(y = df_train$imp_inglab1,
                                p = 0.8,
                                list = FALSE)
And we generate the two datasets:
train <- df_train %>%
  slice(tr_index)
# equivalent to df_train[tr_index, ]
test <- df_train %>%
  slice(-tr_index)
# equivalent to df_train[-tr_index, ]
Let's start by training a few simple trees with train() to get a feel for the process. To train a model without tuning hyperparameters, we have to define a trainControl object with method = 'none'.
fitControl <- trainControl(method = "none", classProbs = FALSE)
Now we can train a shallow tree, with a maxdepth of, say, 3:
cart_tune <- train(imp_inglab1 ~ .,
                   data = df_train,
                   method = "rpart2",
                   trControl = fitControl,
                   tuneGrid = data.frame(maxdepth = 3),
                   control = rpart.control(minsplit = 1,
                                           minbucket = 1,
                                           cp = 0.00000001))
We can plot it the ugly way:
plot(cart_tune$finalModel)
text(cart_tune$finalModel, pretty=1)
Or the pretty way:
library(rpart.plot)
rpart.plot(cart_tune$finalModel)
Let's test this tree's performance:
table(predict(cart_tune, df_train), df_train$imp_inglab1)

           non_miss  miss
  non_miss    19397  4979
  miss            9    13
What conclusions can you draw from this?
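A hint for the question above: the tree almost never predicts 'miss', so its accuracy should be compared against the majority-class baseline. A quick check (using df_train as built above):

```r
# Class shares: always predicting the majority class ('non_miss')
# already achieves accuracy equal to the largest share
prop.table(table(df_train$imp_inglab1))
```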
Now train a second, more complex tree, with maxdepth = 10:
cart_tune <- train(imp_inglab1 ~ .,
                   data = df_train,
                   method = "rpart2",
                   trControl = fitControl,
                   tuneGrid = data.frame(maxdepth = 10),
                   control = rpart.control(cp = 0.0001))
rpart.plot(cart_tune$finalModel)
table(predict(cart_tune, df_train), df_train$imp_inglab1)

           non_miss  miss
  non_miss    18836  3830
  miss          570  1162
So far we have been cheating: we trained on the whole dataset and evaluated each tree on the very same data it was trained on. Let's now tune the depth parameter properly.
First, we set the random seed (to ensure reproducibility):
set.seed(789)
We can use the createFolds() function to generate the fold indices. Here we pass returnTrain = TRUE, so each element of the resulting list contains the training indices for one fold:
cv_index <- createFolds(y = train$imp_inglab1,
                        k = 5,
                        list = TRUE,
                        returnTrain = TRUE)
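A quick sanity check on the folds just created: with k = 5 and returnTrain = TRUE, each element of cv_index should contain roughly 4/5 of the rows of train.

```r
# Number of training rows in each fold
sapply(cv_index, length)
```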
Finally, we specify the resampling design via the trainControl function:
fitControl <- trainControl(index = cv_index,
                           method = "cv",
                           number = 5)
We also define the grid of maxdepth values to explore:
grid <- expand.grid(maxdepth = c(1, 2, 4, 8, 10, 15, 20))
And we train the model again:
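The call that produced the output below is not shown in the source; a sketch consistent with the objects defined above (train, fitControl, and grid) would be:

```r
# Tune maxdepth over the grid, using the 5-fold CV design in fitControl
cart_tune <- train(imp_inglab1 ~ .,
                   data = train,
                   method = "rpart2",
                   trControl = fitControl,
                   tuneGrid = grid)
cart_tune
```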
cart_tune
CART
19519 samples
25 predictor
2 classes: 'non_miss', 'miss'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 15615, 15615, 15616, 15615, 15615
Resampling results across tuning parameters:
maxdepth Accuracy Kappa
1 0.7986581 0.1090253
2 0.7986581 0.1090253
4 0.7986581 0.1090253
8 0.7997339 0.1261417
10 0.7997851 0.1290888
15 0.8011171 0.1513197
20 0.7707363 0.2210235
Accuracy was used to select the optimal model using
the largest value.
The final value used for the model was maxdepth = 15.
Once the hyperparameter tuning is done, we can pick the best model and fit it on the whole training set. The best model according to cross-validation is a tree that seems too complex (maxdepth = 15), so we will choose a slightly more interpretable one instead.
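The code that produced cart_final is likewise missing from the source; a sketch consistent with the printed summary below (maxdepth held constant at 6, caret's default 25-rep bootstrap resampling) would be:

```r
# Refit with a fixed, more interpretable depth; leaving trControl at its
# default yields the 25-rep bootstrap estimates shown in the summary
cart_final <- train(imp_inglab1 ~ .,
                    data = train,
                    method = "rpart2",
                    tuneGrid = data.frame(maxdepth = 6))
cart_final
```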
cart_final
CART
19519 samples
25 predictor
2 classes: 'non_miss', 'miss'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 19519, 19519, 19519, 19519, 19519, 19519, ...
Resampling results:
Accuracy Kappa
0.7945336 0.1356593
Tuning parameter 'maxdepth' was held constant at
a value of 6
We can visualize it:
rpart.plot(cart_final$finalModel)
And we generate the final predictions on the test set:
y_preds <- predict(cart_final, newdata = test)
y_preds
[1] non_miss non_miss non_miss non_miss non_miss
[6] non_miss non_miss non_miss non_miss non_miss
...
[ reached getOption("max.print") -- omitted 3879 entries ]
Levels: non_miss miss
We generate our confusion matrix:
confusionMatrix(y_preds, test$imp_inglab1)
Confusion Matrix and Statistics
Reference
Prediction non_miss miss
non_miss 3758 895
miss 123 103
Accuracy : 0.7914
95% CI : (0.7797, 0.8027)
No Information Rate : 0.7954
P-Value [Acc > NIR] : 0.7671
Kappa : 0.1003
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9683
Specificity : 0.1032
Pos Pred Value : 0.8077
Neg Pred Value : 0.4558
Prevalence : 0.7954
Detection Rate : 0.7702
Detection Prevalence : 0.9537
Balanced Accuracy : 0.5358
'Positive' Class : non_miss
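Since 'miss' (non-response) is the class we actually want to detect, it can help to recompute the statistics treating it as the positive class, via the positive argument of confusionMatrix(), so that sensitivity and specificity refer to detecting non-response:

```r
# Same confusion matrix, but with 'miss' as the positive class
confusionMatrix(y_preds, test$imp_inglab1, positive = "miss")
```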
What can be said about this decision tree? How well does it work?