Random Forest for churn prediction
As described in Chapter 1, Analyzing Insurance Severity Claims, Random Forest is an ensemble technique that takes a subset of observations and a subset of variables to build each decision tree, that is, an ensemble of DTs. More technically, it builds several decision trees and combines their predictions to get a more accurate and stable result.
![](https://epubservercos.yuewen.com/76B1B4/19470398101589806/epubprivate/OEBPS/Images/5cc25b68-25d8-404b-8419-01ce7bd71ce4.png?sign=1739993191-AAuc1qr6AFlsafw6CjiOvPpYGV3xBGeE-0-41a80b3d890041dc038c94f2850ce8af)
This is a direct consequence of majority voting: a panel of independent juries yields a final prediction that is better than that of the best individual jury (see the preceding figure). The short sketch below makes this voting concrete.
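As a toy illustration (not part of the book's pipeline), here is a self-contained Scala snippet in which three hypothetical trees vote on a binary outcome and the ensemble returns the majority class:
// Toy example: aggregate the votes of three hypothetical trees by majority
val treeVotes = Seq(1.0, 0.0, 1.0) // per-tree predictions for one sample
val ensemblePrediction = treeVotes
  .groupBy(identity)  // group identical votes together
  .maxBy(_._2.size)   // pick the most frequent vote
  ._1
println(s"Majority-vote prediction: $ensemblePrediction") // prints 1.0
Now that we know the working principle of RF, let's start using the Spark-based implementation. Let's begin by importing the required packages and libraries: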
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
Now let's create a Spark session and import the implicits:
val spark: SparkSession = SparkSessionCreate.createSession("ChurnPredictionRandomForest")
import spark.implicits._
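The SparkSessionCreate object is a helper built earlier in the book. If you are coding along without it, a minimal stand-in might look like the following sketch, assuming local mode (adapt the master URL to your cluster):
// Minimal stand-in for the book's SparkSessionCreate helper (an assumption,
// not the book's exact implementation)
object SparkSessionCreate {
  def createSession(appName: String): SparkSession =
    SparkSession.builder()
      .master("local[*]") // assumption: run locally on all available cores
      .appName(appName)
      .getOrCreate()
}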
The next task is to instantiate a RandomForestClassifier estimator, as follows (we will define and search its hyperparameters via a grid shortly):
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(1234567L) // fix the seed for reproducibility
Now that we have three transformers and an estimator ready, the next task is to chain them into a single pipeline, where each of them acts as a stage:
val pipeline = new Pipeline()
  .setStages(Array(PipelineConstruction.ipindexer,
    PipelineConstruction.labelindexer,
    PipelineConstruction.assembler,
    rf))
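The ipindexer, labelindexer, and assembler transformers come from the earlier feature engineering step. As a hedged sketch, assuming the column names used there (the feature list below is illustrative, not the book's exact one), PipelineConstruction might look roughly like this:
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object PipelineConstruction {
  // index the categorical international-plan column (hypothetical column name)
  val ipindexer = new StringIndexer()
    .setInputCol("international_plan")
    .setOutputCol("iplanIndex")
  // index the string churn column into the numeric "label" column
  val labelindexer = new StringIndexer()
    .setInputCol("churn")
    .setOutputCol("label")
  // assemble numeric columns into a single "features" vector
  // (illustrative subset; use the full feature list from the earlier step)
  val assembler = new VectorAssembler()
    .setInputCols(Array("account_length", "iplanIndex",
      "total_day_mins", "total_day_calls", "total_intl_mins"))
    .setOutputCol("features")
}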
Let's define a param grid to perform a grid search over the hyperparameter space:
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, 3 :: 5 :: 15 :: 20 :: 30 :: Nil) // Spark's tree learner requires maxDepth <= 30
  .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: Nil)
  .addGrid(rf.impurity, "gini" :: "entropy" :: Nil)
  .addGrid(rf.maxBins, 2 :: 5 :: 10 :: Nil)
  .addGrid(rf.numTrees, 10 :: 50 :: 100 :: Nil)
  .build()
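As a quick sanity check, you can print how many hyperparameter combinations the grid contains; with the cardinalities above, it should be 5 * 2 * 2 * 3 * 3 = 180:
println("Number of hyperparameter combinations: " + paramGrid.length) // 180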
Let's define a BinaryClassificationEvaluator to evaluate the model:
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
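Note that BinaryClassificationEvaluator's default metric is areaUnderROC, and that we point rawPredictionCol at the hard prediction column rather than the classifier's rawPrediction output, so the evaluator scores the hard 0/1 predictions instead of the raw scores. You can make the metric explicit (or switch it) as follows:
evaluator.setMetricName("areaUnderROC") // the default; "areaUnderPR" is the alternative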
We use a CrossValidator to perform 10-fold cross-validation for best-model selection:
val numFolds = 10 // as stated above, we use 10-fold cross-validation
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
Let's now call the fit method so that the complete, predefined pipeline, including all feature preprocessing and the Random Forest classifier, is executed multiple times, each time with a different hyperparameter vector:
val cvModel = crossval.fit(Preprocessing.trainDF)
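Before evaluating, it can be instructive to check which hyperparameters won. This inspection is not in the original listing, but bestModel is part of the standard CrossValidatorModel API:
import org.apache.spark.ml.PipelineModel
// the best pipeline found by cross-validation; its last stage is the RF model
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val bestRF = bestPipeline.stages.last.asInstanceOf[RandomForestClassificationModel]
println("Best numTrees: " + bestRF.getNumTrees)
println("Best maxDepth: " + bestRF.getMaxDepth)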
Now it's time to evaluate the predictive power of the Random Forest model on the test dataset. As a first step, we need to pass the test set through the fitted pipeline model, which will map the features according to the same mechanism we described in the previous feature engineering step:
val predictions = cvModel.transform(Preprocessing.testSet)
predictions.show(10)
>>>
![](https://epubservercos.yuewen.com/76B1B4/19470398101589806/epubprivate/OEBPS/Images/3c3565fc-44f6-4bf0-a838-72fae88c09e3.png?sign=1739993191-8W9JlstGCW4OyAPkYyjjutINpMfNavoG-0-38e313a65d56cdc92b5e7984c0cbbcec)
However, just by looking at the preceding prediction DataFrame, it is really difficult to judge the classification quality. The second step is the evaluation itself, using the BinaryClassificationEvaluator, as follows:
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy: " + accuracy)
>>>
Classification accuracy: 0.870334928229665
So, the evaluator reports about 87%. Note, however, that BinaryClassificationEvaluator's default metric is the area under the ROC curve rather than plain classification accuracy, which is why the same value reappears as the ROC figure below. Now, similar to SVM and LR, we will observe the area under the precision-recall curve and the area under the ROC curve, based on the following RDD containing the predictions and labels on the test set:
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
Now the preceding RDD can be used to compute the two previously-mentioned performance metrics:
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
println("Area under the precision-recall curve: " + metrics.areaUnderPR)
println("Area under the receiver operating characteristic (ROC) curve : " + metrics.areaUnderROC)
>>>
Area under the precision-recall curve: 0.7293101942399631
Area under the receiver operating characteristic (ROC) curve: 0.870334928229665
In this case, the evaluation gives an area under the ROC curve of about 87% but an area under the precision-recall curve of only about 73%, which is still much better than the results of SVM and LR. In the following, we calculate some more metrics; for example, false and true positive and negative predictions are also useful to evaluate the model's performance:
val lp = predictions.select("label", "prediction")
val counttotal = predictions.count()
val correct = lp.filter($"label" === $"prediction").count()
val wrong = lp.filter(not($"label" === $"prediction")).count()
val ratioWrong = wrong.toDouble / counttotal.toDouble
val ratioCorrect = correct.toDouble / counttotal.toDouble
val truep = lp.filter($"prediction" === 0.0)
  .filter($"label" === $"prediction").count() / counttotal.toDouble
val truen = lp.filter($"prediction" === 1.0)
  .filter($"label" === $"prediction").count() / counttotal.toDouble
val falsep = lp.filter($"prediction" === 1.0)
  .filter(not($"label" === $"prediction")).count() / counttotal.toDouble
val falsen = lp.filter($"prediction" === 0.0)
  .filter(not($"label" === $"prediction")).count() / counttotal.toDouble
println("Total Count : " + counttotal)
println("Correct : " + correct)
println("Wrong: " + wrong)
println("Ratio wrong: " + ratioWrong)
println("Ratio correct: " + ratioCorrect)
println("Ratio true positive : " + truep)
println("Ratio false positive : " + falsep)
println("Ratio true negative : " + truen)
println("Ratio false negative : " + falsen)
>>>
We will get the following result:
![](https://epubservercos.yuewen.com/76B1B4/19470398101589806/epubprivate/OEBPS/Images/9c1a146a-2390-4b0e-b71e-af399d8e61b3.png?sign=1739993191-mP3YfzZGr2vB2UwLR9ctEolMXp00HJd4-0-5a0fa7e800a497a33730d88d0447b245)
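As an optional cross-check that is not in the original code, the full confusion matrix can be obtained from the same prediction/label RDD using MulticlassMetrics:
import org.apache.spark.mllib.evaluation.MulticlassMetrics
// rows are true labels, columns are predicted labels
val mcMetrics = new MulticlassMetrics(predictionAndLabels)
println("Confusion matrix:\n" + mcMetrics.confusionMatrix)
println("Overall accuracy: " + mcMetrics.accuracy)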
Fantastic; looking at the preceding output, we achieved about 91% of correct predictions (ratioCorrect), but driven by what factors? Well, like a DT, a Random Forest can be debugged to retrieve the decision trees constructed during training. To print the trees and select the most important features, try the last few lines of code from the DT example, and you're done; a sketch adapted for RF follows.
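Here is a hedged sketch of what those last few lines look like when adapted to Random Forest, reusing the bestRF model extracted earlier:
// print the ensemble's decision trees and the per-feature importance vector
println("Learned Random Forest model:\n" + bestRF.toDebugString)
println("Feature importances: " + bestRF.featureImportances)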
Can you now guess how many different models were trained? Well, we have 10 folds in the CrossValidator and a five-dimensional hyperparameter grid with cardinalities 5, 2, 2, 3, and 3. Now let's do some simple math: 10 * 5 * 2 * 2 * 3 * 3 = 1,800 models!
Note that we still keep the hyperparameter space confined, with numTrees capped at 100, maxBins at 10, and maxDepth at 30 (Spark's tree learner does not support deeper trees anyway). Also, remember that bigger forests will most likely perform better. Therefore, feel free to play around with this code, add features, and use a bigger hyperparameter space, say, with more trees.