Tuesday, June 2, 2015

C4.5 Decision Trees with Missing Data

Overview

We often encounter data with missing values, and machine learning algorithms handle missing data with varying levels of grace. C4.5 is often advertised as being able to handle missing data because support for missing values is built into the algorithm.

In this post, we will walk through exactly how the C4.5 decision tree algorithm deals with missing values. We will do this by working out how the branches are created in detail.

Data

We'll work with the same data that we have used in previous posts on decision tree algorithms so that we are able to compare behaviors, but with the important difference that some of the data is missing (the missing values are marked with "?"). It is repeated below for ease of reference.

This data consists of weather conditions that we would like to use to decide whether or not to play tennis on a given day. Thus "PlayTennis" is the outcome feature that we would like to build a classifier to predict. We will continue to ignore the "Day" column, since it serves mainly as a row label rather than as an attribute that we would want to use in our model.

Original (non-missing) weather data

The original dataset (without missing values) is as follows:

Table 1: Weather data and an outcome variable, "PlayTennis", that we would like to predict. From T. Mitchell, "Machine Learning", Chapter 3, p. 59.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Weather data with missing values

Here is the same data, but with missing values. Values were removed at random, which corresponds to the Missing Completely At Random (MCAR) missing data mechanism (missing data mechanisms will be the subject of a future post); a small sketch of this kind of masking appears just after the table.

Table 2: Weather/tennis data with missing values.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 ? Hot High Strong No
D3 ? ? High ? Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool ? Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 ? Mild High ? No
D9 ? Cool Normal Weak Yes
D10 ? ? Normal ? Yes
D11 ? Mild Normal ? Yes
D12 Overcast Mild ? Strong Yes
D13 Overcast Hot ? Weak Yes
D14 Rain Mild High Strong No
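
As an aside, masking values completely at random can be simulated with a few lines of Python. This is a sketch of the general idea, not the exact procedure used to build Table 2:

    import random

    random.seed(0)  # for reproducibility

    def mask_mcar(rows, p=0.2):
        # Drop each attribute value independently with probability p (MCAR);
        # the class label in the last column is left untouched.
        return [[v if random.random() >= p else None for v in row[:-1]] + [row[-1]]
                for row in rows]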

Altered gain and split info calculations

In C4.5, information gain and split info are used to calculate the gain ratio value (see other posts on C4.5 for definitions of gain ratio and split info). However, in the presence of missing values, J.R. Quinlan defines altered versions of these quantities to make the splitting criterion robust to missing data.

Information gain is simply weighted by the proportion of non-missing values for a particular attribute:

\[ G_m(S,A) = \frac{|S_{A}| - |M_{A}|}{|S_{A}|} G(S,A) \]

where \(|S_{A}|\) is the total number of instances under consideration, \(|M_{A}|\) is the number of those instances with a missing value for attribute \(A\), and \(G(S,A)\) is the usual information gain computed using only the instances whose value of \(A\) is known.
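
To make this concrete, here is a minimal Python sketch of the weighted gain. The function names are ours, and every instance is assumed to carry a weight of 1, so this is an illustration rather than a full C4.5 implementation. Applying it to the Outlook column of Table 2 reproduces the value derived by hand below.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def modified_gain(values, labels):
        # Gain computed on the rows with a known attribute value,
        # then scaled by the proportion of non-missing values.
        known = [(v, y) for v, y in zip(values, labels) if v is not None]
        n_known = len(known)
        branches = Counter(v for v, _ in known)
        gain = entropy([y for _, y in known]) - sum(
            count / n_known * entropy([y for v, y in known if v == val])
            for val, count in branches.items()
        )
        return gain * n_known / len(values)

    # Outlook and PlayTennis columns from Table 2 (None marks a "?")
    outlook = ["Sunny", None, None, "Rain", "Rain", "Rain", "Overcast",
               None, None, None, None, "Overcast", "Overcast", "Rain"]
    play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
            "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    print(round(modified_gain(outlook, play), 3))   # about 0.26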

The entropy term is calculated using only the instances with non-missing values. For Outlook, 8 of the 14 instances have a known value; 5 of these are PlayTennis==Yes and 3 are PlayTennis==No:

\[ H(S) = -\frac{5}{8} \log_2 \frac{5}{8} - \frac{3}{8} \log_2 \frac{3}{8} \approx 0.954 \]

Using the updated definition, the information gain calculation for the Outlook attribute proceeds as follows (using the data from the table with missing values):

\[ G_m(S, Outlook) = \frac{8}{14} \left[ H(S) - \frac{1}{8}\left( -\frac{1}{1} \log_2 \frac{1}{1} - 0 \right)_{Sunny} - \frac{3}{8} \left( -\frac{3}{3} \log_2 \frac{3}{3} - 0 \right)_{Overcast} - \frac{4}{8} \left( -\frac{2}{4} \log_2 \frac{2}{4} - \frac{2}{4} \log_2 \frac{2}{4} \right)_{Rain} \right] = \frac{8}{14} \left( 0.954 - \frac{1}{2} \right) \approx 0.260 \]

Split info is calculated as before, but with "missing" treated as an additional state that the attribute can take.

\[ Split_m(S, Outlook) = -\frac{1}{14} \log_2 \frac{1}{14} - \frac{3}{14} \log_2 \frac{3}{14} - \frac{4}{14} \log_2 \frac{4}{14} - \frac{6}{14} \log_2 \frac{6}{14} \approx 1.788 \]

Therefore the gain ratio for Outlook is the ratio of the updated information gain to the updated split info: \(\frac{0.260}{1.788} \approx 0.145\).
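
These two numbers are easy to check numerically. The following sketch reads the branch sizes straight off Table 2, with the six missing Outlook values treated as one extra branch:

    import math

    def split_info(branch_sizes):
        total = sum(branch_sizes)
        return -sum(c / total * math.log2(c / total) for c in branch_sizes)

    # Outlook in Table 2: 1 Sunny, 3 Overcast, 4 Rain, 6 missing
    s = split_info([1, 3, 4, 6])
    print(round(s, 3))           # about 1.788
    print(round(0.260 / s, 3))   # gain ratio, about 0.145 (0.260 is the modified gain from above)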

Doing similar calculations for the rest of the attributes yields the following values:

Attribute Gain Ratio
Outlook 0.145
Temperature 0.034
Humidity 0.057
Wind 0.056
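
The following self-contained sketch reproduces this table from the data in Table 2. It repeats the helper logic from the earlier sketch so that it runs on its own; None again marks a missing value, and the function names are ours.

    import math
    from collections import Counter

    # Table 2: Outlook, Temperature, Humidity, Wind, PlayTennis (None marks a "?")
    DATA = [
        ("Sunny",    "Hot",  "High",   "Weak",   "No"),
        (None,       "Hot",  "High",   "Strong", "No"),
        (None,       None,   "High",   None,     "Yes"),
        ("Rain",     "Mild", "High",   "Weak",   "Yes"),
        ("Rain",     "Cool", None,     "Weak",   "Yes"),
        ("Rain",     "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"),
        (None,       "Mild", "High",   None,     "No"),
        (None,       "Cool", "Normal", "Weak",   "Yes"),
        (None,       None,   "Normal", None,     "Yes"),
        (None,       "Mild", "Normal", None,     "Yes"),
        ("Overcast", "Mild", None,     "Strong", "Yes"),
        ("Overcast", "Hot",  None,     "Weak",   "Yes"),
        ("Rain",     "Mild", "High",   "Strong", "No"),
    ]
    ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(col):
        rows = [(r[col], r[-1]) for r in DATA]
        known = [(v, y) for v, y in rows if v is not None]
        n, n_known = len(rows), len(known)
        branches = Counter(v for v, _ in known)
        # Modified gain: computed on the known rows, scaled by the non-missing proportion.
        gain = entropy([y for _, y in known]) - sum(
            count / n_known * entropy([y for v, y in known if v == val])
            for val, count in branches.items()
        )
        gain *= n_known / n
        # Split info: the missing values form one additional branch.
        sizes = list(branches.values()) + [n - n_known]
        split = -sum(c / n * math.log2(c / n) for c in sizes if c)
        return gain / split

    for i, name in enumerate(ATTRIBUTES):
        print(f"{name:12s} {gain_ratio(i):.3f}")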

We pick the attribute with the highest gain ratio on which to define a new branch in the decision tree, which in this case is Outlook. This initial tree looks the same, at least structurally, as the one constructed without missing values: a root node that splits on Outlook, with one branch each for Sunny, Overcast, and Rain.

But there is an ambiguity that we must consider when choosing which rows are assigned to which branches of the tree.

Partitioning the training set

Now that we have decided to split on the Outlook attribute, we need to decide which branch each row is assigned to. This is a clear choice for the rows where Outlook is known, but we face a problem for the rows in which the Outlook value is missing.

For example, when we look at row D2 we see that we do not know what the Outlook value is. So which branch do we follow?

The way that Quinlan dealt with this problem was to assign these ambiguous cases to all of the newly created descendant nodes, each with a weight proportional to the number of known instances that fall into that branch. Take Outlook==Sunny as an example: we have one instance for which we know Outlook is Sunny (and its class is PlayTennis==No), while there are \(6\) instances for which Outlook is unknown, 4 of which are PlayTennis==Yes and 2 of which are PlayTennis==No. Each of these 6 rows is sent down the Sunny branch with a fractional weight equal to the number of known Outlook==Sunny instances divided by the total number of instances with a known Outlook, i.e. \(\frac{1}{8}\) (and likewise down the Overcast and Rain branches with weights \(\frac{3}{8}\) and \(\frac{4}{8}\)).
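
A small sketch of this bookkeeping for the Outlook split (the row labels and weights follow the discussion above; a full C4.5 implementation would carry these fractional weights through all subsequent splits and into the leaf counts):

    # Known Outlook values: 1 Sunny, 3 Overcast, 4 Rain  ->  branch weights 1/8, 3/8, 4/8
    branch_weight = {"Sunny": 1 / 8, "Overcast": 3 / 8, "Rain": 4 / 8}
    known = {"Sunny": ["D1"],
             "Overcast": ["D7", "D12", "D13"],
             "Rain": ["D4", "D5", "D6", "D14"]}
    unknown = ["D2", "D3", "D8", "D9", "D10", "D11"]   # rows with Outlook == "?"

    partitions = {}
    for branch, weight in branch_weight.items():
        # Known rows go down their own branch with weight 1; every unknown row goes
        # down *every* branch with the fractional weight of that branch.
        partitions[branch] = ([(day, 1.0) for day in known[branch]]
                              + [(day, weight) for day in unknown])

    for branch, rows in partitions.items():
        print(branch, round(sum(w for _, w in rows), 2))
    # Sunny 1.75, Overcast 5.25, Rain 7.0  (the weights still sum to 14)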

Classifying an instance with missing values

When we want to classify an instance that has a missing value, we explore every branch of the split that tests the missing attribute rather than committing to one. Once we reach a leaf along each of the possible paths, the class distribution at that leaf (built from the weighted counts of PlayTennis==Yes and PlayTennis==No) is weighted by the probability of the path taken, and we predict the class with the highest combined probability.
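
As a worked example (with made-up leaf values, since we have not drawn the full tree here): suppose we want to classify a new day whose Outlook is unknown. At the root we cannot commit to a single branch, so the instance is sent down all three branches with weights \(\frac{1}{8}\), \(\frac{3}{8}\) and \(\frac{4}{8}\), the fractions of known-Outlook training instances that went down each branch. Suppose, purely for illustration, that the leaves reached along the three paths give weighted probabilities of PlayTennis==Yes of \(0.1\) (Sunny path), \(1.0\) (Overcast path) and \(0.4\) (Rain path). The combined probability of Yes is then

\[ \frac{1}{8}(0.1) + \frac{3}{8}(1.0) + \frac{4}{8}(0.4) \approx 0.59, \]

so we would predict PlayTennis==Yes (and No with probability of roughly \(0.41\)).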
