Tuesday, June 2, 2015

C4.5 Decision Trees with Missing Data

Overview

We often encounter data with missing values, and machine learning algorithms handle missing data with varying levels of grace. C4.5 is often advertised as being able to handle missing data because support for missing values is built into the algorithm.

In this post, we will walk through exactly how the C4.5 decision tree algorithm deals with missing values. We will do this by working out how the branches are created in detail.

Data

We'll work with the same data that we have used in previous posts on decision tree algorithms so that we are able to compare behaviors, but with the important difference that some of the data is missing (the missing values are marked with "?"). It is repeated below for ease of reference.

This data consists of weather conditions that we would like to use to decide whether or not to play tennis on a given day. Thus "PlayTennis" is the outcome feature that we would like to build a classifier to predict. We will continue to ignore the "Day" column, since it serves mainly as a row label rather than as an attribute that we would want to use in our model.

Original (non-missing) weather data

The original dataset (without missing values) is as follows:

Table 1: Weather data and an outcome variable, "PlayTennis", that we would like to predict. From T. Mitchell, "Machine Learning", Chapter 3, p. 59.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Weather data with missing values

Here is the same data, but with missing values. Values were removed at random, which corresponds to the Missing Completely At Random (MCAR) missing data mechanism (missing data mechanisms will be the subject of a future post); a small sketch of this kind of masking appears just after the table.

Table 2: Weather/tennis data with missing values.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 ? Hot High Strong No
D3 ? ? High ? Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool ? Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 ? Mild High ? No
D9 ? Cool Normal Weak Yes
D10 ? ? Normal ? Yes
D11 ? Mild Normal ? Yes
D12 Overcast Mild ? Strong Yes
D13 Overcast Hot ? Weak Yes
D14 Rain Mild High Strong No
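
As an aside, masking values completely at random can be simulated with a few lines of Python. This is a sketch of the general idea, not the exact procedure used to build Table 2:

    import random

    random.seed(0)  # for reproducibility

    def mask_mcar(rows, p=0.2):
        # Drop each attribute value independently with probability p (MCAR);
        # the class label in the last column is left untouched.
        return [[v if random.random() >= p else None for v in row[:-1]] + [row[-1]]
                for row in rows]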

Altered gain and split info calculations

In C4.5, information gain and split info are used to calculate the gain ratio value (see other posts on C4.5 for definitions of gain ratio and split info). However, in the presence of missing values, J.R. Quinlan defines altered versions of these quantities to make the splitting criterion robust to missing data.

Information gain is simply weighted by the proportion of non-missing values for a particular attribute:

\[ G_m(S,A) = \frac{|S_{A}| - |M_{A}|}{|S_{A}|} G(S,A) \]

where \(|S_{A}|\) is the total number of instances under consideration, \(|M_{A}|\) is the number of those instances with a missing value for attribute \(A\), and \(G(S,A)\) is the usual information gain computed using only the instances whose value of \(A\) is known.
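
To make this concrete, here is a minimal Python sketch of the weighted gain. The function names are ours, and every instance is assumed to carry a weight of 1, so this is an illustration rather than a full C4.5 implementation. Applying it to the Outlook column of Table 2 reproduces the value derived by hand below.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def modified_gain(values, labels):
        # Gain computed on the rows with a known attribute value,
        # then scaled by the proportion of non-missing values.
        known = [(v, y) for v, y in zip(values, labels) if v is not None]
        n_known = len(known)
        branches = Counter(v for v, _ in known)
        gain = entropy([y for _, y in known]) - sum(
            count / n_known * entropy([y for v, y in known if v == val])
            for val, count in branches.items()
        )
        return gain * n_known / len(values)

    # Outlook and PlayTennis columns from Table 2 (None marks a "?")
    outlook = ["Sunny", None, None, "Rain", "Rain", "Rain", "Overcast",
               None, None, None, None, "Overcast", "Overcast", "Rain"]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    print(round(modified_gain(outlook, play), 3))   # about 0.26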

The entropy term is calculated using only the instances with non-missing values. For Outlook, 8 of the 14 instances have a known value; 5 of these are PlayTennis==Yes and 3 are PlayTennis==No:

\[ H(S) = -\frac{5}{8} \log_2 \frac{5}{8} - \frac{3}{8} \log_2 \frac{3}{8} \approx 0.954 \]

Using the updated definition, the information gain calculation for the Outlook attribute proceeds as follows (using the data from the table with missing values):

\[ G_m(S, Outlook) = \frac{8}{14} \left[ H(S) - \frac{1}{8}\left( -\frac{1}{1} \log_2 \frac{1}{1} - 0 \right)_{Sunny} - \frac{3}{8} \left( -\frac{3}{3} \log_2 \frac{3}{3} - 0 \right)_{Overcast} - \frac{4}{8} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right)_{Rain} \right] = \frac{8}{14} \left( 0.954 - \frac{1}{2} \right) \approx 0.260 \]

Split info is calculated as before, but with "missing" treated as an additional state that the attribute can take.

\[ Split_m(S, Outlook) = -\frac{1}{14} \log_2 \frac{1}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} \approx 1.788 \]

Therefore the gain ratio for Outlook is the ratio of the updated information gain to the updated split info: \(\frac{0.260}{1.788} \approx 0.145\).
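
These two numbers are easy to check numerically. The following sketch reads the branch sizes straight off Table 2, with the six missing Outlook values treated as one extra branch:

    import math

    def split_info(branch_sizes):
        total = sum(branch_sizes)
        return -sum(c / total * math.log2(c / total) for c in branch_sizes)

    # Outlook in Table 2: 1 Sunny, 3 Overcast, 4 Rain, 6 missing
    s = split_info([1, 3, 4, 6])
    print(round(s, 3))           # about 1.788
    print(round(0.260 / s, 3))   # gain ratio, about 0.145 (0.260 is the modified gain from above)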

Doing similar calculations for the rest of the attributes yields the following values:

Attribute Gain Ratio
Outlook 0.145
Temperature 0.034
Humidity 0.057
Wind 0.056
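
The following self-contained sketch reproduces this table from the data in Table 2. It repeats the helper logic from the earlier sketch so that it runs on its own; None again marks a missing value, and the function names are ours.

    import math
    from collections import Counter

    # Table 2: Outlook, Temperature, Humidity, Wind, PlayTennis (None marks a "?")
    DATA = [
        ("Sunny",    "Hot",  "High",   "Weak",   "No"),
        (None,       "Hot",  "High",   "Strong", "No"),
        (None,       None,   "High",   None,     "Yes"),
        ("Rain",     "Mild", "High",   "Weak",   "Yes"),
        ("Rain",     "Cool", None,     "Weak",   "Yes"),
        ("Rain",     "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"),
        (None,       "Mild", "High",   None,     "No"),
        (None,       "Cool", "Normal", "Weak",   "Yes"),
        (None,       None,   "Normal", None,     "Yes"),
        (None,       "Mild", "Normal", None,     "Yes"),
        ("Overcast", "Mild", None,     "Strong", "Yes"),
        ("Overcast", "Hot",  None,     "Weak",   "Yes"),
        ("Rain",     "Mild", "High",   "Strong", "No"),
    ]
    ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(col):
        rows = [(r[col], r[-1]) for r in DATA]
        known = [(v, y) for v, y in rows if v is not None]
        n, n_known = len(rows), len(known)
        branches = Counter(v for v, _ in known)
        # Modified gain: computed on the known rows, scaled by the non-missing proportion.
        gain = entropy([y for _, y in known]) - sum(
            count / n_known * entropy([y for v, y in known if v == val])
            for val, count in branches.items()
        )
        gain *= n_known / n
        # Split info: the missing values form one additional branch.
        sizes = list(branches.values()) + [n - n_known]
        split = -sum(c / n * math.log2(c / n) for c in sizes if c)
        return gain / split

    for i, name in enumerate(ATTRIBUTES):
        print(f"{name:12s} {gain_ratio(i):.3f}")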

We pick the attribute with the highest gain ratio on which to define a new branch in the decision tree, which in this case is Outlook. This initial tree looks the same, at least structurally, as the one constructed without missing values: a root node that splits on Outlook, with one branch each for Sunny, Overcast, and Rain.

But there is an ambiguity that we must consider when choosing which rows are assigned to which branches of the tree.

Partitioning the training set

Now that we have decided to split on the Outlook attribute, we need to decide which branch each row is assigned to. This is a clear choice for the rows where Outlook is known, but we face a problem for the rows in which the Outlook value is missing.

For example, when we look at row D2 we see that we do not know what the Outlook value is. So which branch do we follow?

The way that Quinlan dealt with this problem was to assign these ambiguous cases to all of the newly created descendant nodes, each with a weight proportional to the number of known instances that fall into that branch. Take Outlook==Sunny as an example: we have one instance for which we know Outlook is Sunny (and its class is PlayTennis==No), while there are \(6\) instances for which Outlook is unknown, 4 of which are PlayTennis==Yes and 2 of which are PlayTennis==No. Each of these 6 rows is sent down the Sunny branch with a fractional weight equal to the number of known Outlook==Sunny instances divided by the total number of instances with a known Outlook, i.e. \(\frac{1}{8}\) (and likewise down the Overcast and Rain branches with weights \(\frac{3}{8}\) and \(\frac{4}{8}\)).
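
A small sketch of this bookkeeping for the Outlook split (the row labels and weights follow the discussion above; a full C4.5 implementation would carry these fractional weights through all subsequent splits and into the leaf counts):

    # Known Outlook values: 1 Sunny, 3 Overcast, 4 Rain  ->  branch weights 1/8, 3/8, 4/8
    branch_weight = {"Sunny": 1 / 8, "Overcast": 3 / 8, "Rain": 4 / 8}
    known = {"Sunny": ["D1"],
             "Overcast": ["D7", "D12", "D13"],
             "Rain": ["D4", "D5", "D6", "D14"]}
    unknown = ["D2", "D3", "D8", "D9", "D10", "D11"]   # rows with Outlook == "?"

    partitions = {}
    for branch, weight in branch_weight.items():
        # Known rows go down their own branch with weight 1; every unknown row goes
        # down *every* branch with the fractional weight of that branch.
        partitions[branch] = ([(day, 1.0) for day in known[branch]]
                              + [(day, weight) for day in unknown])

    for branch, rows in partitions.items():
        print(branch, round(sum(w for _, w in rows), 2))
    # Sunny 1.75, Overcast 5.25, Rain 7.0  (the weights still sum to 14)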

Classifying an instance with missing values

When we want to classify an instance that has a missing value, we explore every branch of the split that tests the missing attribute rather than committing to one. Once we reach a leaf along each of the possible paths, the class distribution at that leaf (built from the weighted counts of PlayTennis==Yes and PlayTennis==No) is weighted by the probability of the path taken, and we predict the class with the highest combined probability.
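
As a worked example (with made-up leaf values, since we have not drawn the full tree here): suppose we want to classify a new day whose Outlook is unknown. At the root we cannot commit to a single branch, so the instance is sent down all three branches with weights \(\frac{1}{8}\), \(\frac{3}{8}\) and \(\frac{4}{8}\), the fractions of known-Outlook training instances that went down each branch. Suppose, purely for illustration, that the leaves reached along the three paths give weighted probabilities of PlayTennis==Yes of \(0.1\) (Sunny path), \(1.0\) (Overcast path) and \(0.4\) (Rain path). The combined probability of Yes is then

\[ \frac{1}{8}(0.1) + \frac{3}{8}(1.0) + \frac{4}{8}(0.4) \approx 0.59, \]

so we would predict PlayTennis==Yes (and No with probability of roughly \(0.41\)).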
