Train, validation, test split model in caret in R

I would like to ask for help. I use this code to run an XGBoost model with the caret package. However, I want to use a time-based validation split: 60% training, 20% validation, and 20% testing. I have already split the data, but I do not know how to handle the validation data when it is not used for cross-validation.

Thank you.

xgb_trainControl <- trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)

xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1)

set.seed(123)
xgb1 <- train(sale ~ ., data = trans_train,
              trControl = xgb_trainControl,
              tuneGrid = xgb_grid,
              method = "xgbTree")
xgb1

pred <- predict(xgb1, trans_test)


The validation partition should not be used while the model is being built. It should be "set aside" until the model has been trained and tuned using the "training" and "tuning" partitions; only then do you apply the model to predict outcomes for the validation dataset and summarise how accurate those predictions were.

For example, in my own work I create three partitions: training (75%), tuning (10%) and testing/validation (15%), using:


# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75,
                                  list = FALSE,
                                  times = 1)

# Split the dataset using the defined partition
train_data <- data[trainIndex, , drop = FALSE]
tune_plus_val_data <- data[-trainIndex, , drop = FALSE]

# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)

# Split the remaining ~25% of the data: 40% (tune) and 60% (val)
tune_data <- tune_plus_val_data[-tune_plus_val_index, , drop = FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, , drop = FALSE]

# Outcome of this section is that the data (100%) is split into:
# training (~75%)
# tuning (~10%)
# validation (~15%)
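A quick sanity check that the three partitions really land near 75/10/15 (a small illustrative one-liner over the data frames created above, not part of the original answer):

# Proportion of rows in each partition
sapply(list(train = train_data, tune = tune_data, val = val_data),
       nrow) / nrow(data)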


These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the "training" partition to train models and the "tuning" partition to tune hyperparameters (e.g. random grid search) and to assess model training (e.g. cross-validation). This is roughly equivalent to the code in your question.
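The DMatrix conversion itself is not shown in the original answer; a minimal sketch, assuming the outcome column is named response, is coded 0/1, and the remaining columns are numeric:

library(xgboost)

# Hypothetical helper: build a numeric feature matrix and a label
# vector from a partition, then wrap them in an xgb.DMatrix
to_dmatrix <- function(df) {
  X <- as.matrix(df[, setdiff(names(df), "response"), drop = FALSE])
  xgb.DMatrix(data = X, label = df$response)
}

dtrain <- to_dmatrix(train_data)
dtune  <- to_dmatrix(tune_data)
dval   <- to_dmatrix(val_data)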


# lrn and mytune come from an earlier tuning step (mlr's setHyperPars /
# tuneParams) that is not shown here
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)

params2 <- list(booster = "gbtree",
                objective = lrn_tune$par.vals$objective,
                eta = lrn_tune$par.vals$eta, gamma = 0,
                max_depth = lrn_tune$par.vals$max_depth,
                min_child_weight = lrn_tune$par.vals$min_child_weight,
                subsample = 0.8,
                colsample_bytree = lrn_tune$par.vals$colsample_bytree)

# note: with several watchlist entries, xgboost early-stops on the
# last one listed (here "train")
xgb2 <- xgb.train(params = params2,
                  data = dtrain, nrounds = 50,
                  watchlist = list(val = dtune, train = dtrain),
                  print_every_n = 10, early_stopping_rounds = 50,
                  maximize = FALSE, eval_metric = "error")
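If early stopping fires, the returned booster records where it stopped; per the xgboost R documentation these fields are set whenever early_stopping_rounds is supplied:

# Where training actually stopped, and the metric value at that round
xgb2$best_iteration
xgb2$best_score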




Once the model is trained, I apply it to the validation data with predict():


xgbpred2_keep <- predict(xgb2, dval)

# "val" (feature matrix) and "labels_tv" (true labels) come from the
# author's earlier preprocessing, which is not shown here
xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val),
                      "Response" = val_data$response)

# Reorder patients according to response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])

ggplot(xg2_val, aes(x = Patient, y = Prediction, fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title = paste("Patient predictions (xgb2) for the validation dataset (n =",
                     length(rownames(val)), ")", sep = ""),
       subtitle = "Above 0.5 = Non-Responder, Below 0.5 = Responder",
       caption = paste("JM", Sys.Date(), sep = ""),
       x = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                   hjust = 1, size = 8)) +
  # Distance from the red line = confidence of the prediction
  geom_hline(yintercept = 0.5, colour = "red")

# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5, 1, 0)

# Results matrix (i.e. true positives/negatives & false positives/negatives)
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))

# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val),
                                 label = labels_tv,
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results
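As a cross-check against the accuracy reported by confusionMatrix, the same figure can be read straight off the summary table:

# Fraction of validation patients predicted correctly
mean(Summary_of_results$eval == "correct")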

This gives you a summary of how well the model "works" on your validation data.
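Finally, for the time-based 60/20/20 split the question actually asks for, random partitioning with createDataPartition would mix past and future rows; a minimal sketch (assuming a hypothetical date column by which rows can be ordered) slices by position instead:

# Order chronologically, then split 60/20/20 by position so the
# validation and test rows are strictly later than the training rows.
# "date" is a hypothetical column name.
data <- data[order(data$date), ]
n <- nrow(data)
trans_train <- data[1:floor(0.6 * n), , drop = FALSE]
trans_val   <- data[(floor(0.6 * n) + 1):floor(0.8 * n), , drop = FALSE]
trans_test  <- data[(floor(0.8 * n) + 1):n, , drop = FALSE]

trans_val can then play the role of dtune in the watchlist above, while trans_test stays untouched until the final evaluation.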

