Entropy: H(Π) = -Σ_i π_i log π_i. E.g., for a training set containing p positive and n negative examples, we have:

H(p/(p+n), n/(p+n)) = -(p/(p+n)) log(p/(p+n)) - (n/(p+n)) log(n/(p+n))

How to pick attributes? An attribute A with K distinct values divides the training set E into subsets E_1, ..., E_K. The expected entropy remaining after testing attribute A (with branches k = 1, 2, ..., K) is:

EH(A) = Σ^K_{k=1} ((p_k + n_k)/(p + n)) H(p_k/(p_k + n_k), n_k/(p_k + n_k))

i.e., each child's entropy weighted by the proportion of examples that take that attribute value, where p_k + n_k is the number of examples (positive or negative) in the kth child.

The information gain for this attribute is:

I(A) = H(p/(p+n), n/(p+n)) - EH(A)

Pick the attribute with the largest I(A)! Once the attribute is set, move down to each child and repeat the process. (A sketch of this computation is at the end of the section.)

If the data is continuous, treat each data point as a candidate cut point: use the projections of the points onto the x and y directions as attributes. In the continuous case you can also pick random x, y values as attributes and, depending on whether a point's value is smaller or greater than the cut, send it to the left or right child respectively. But if we had, say, 20,000-dimensional vectors, we would go for a random forest.

If you want to do unsupervised learning and figure out clusters using a random forest, you can fit a Gaussian (mean, variance) to each side of each candidate split, for each attribute; the split with the highest information (gain) is selected.

To avoid overfitting:
Pre-pruning: stop splitting if # records < threshold, or if information gain < threshold.
Post-pruning: start bottom-up and remove a node if its impact is below a threshold, i.e. its removal doesn't change accuracy or increases validation-set accuracy.

For continuous features: sort the values, then use all available values, or the means of consecutive values, as candidate thresholds (see the sketch below).
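A minimal Python sketch of the entropy and information-gain computation above, for binary labels. The names (entropy, information_gain) and the 0/1 label encoding are illustrative choices, not from the notes:

    import math
    from collections import defaultdict

    def entropy(p, n):
        # H(p/(p+n), n/(p+n)) in bits; an absent class contributes 0.
        total = p + n
        h = 0.0
        for c in (p, n):
            if c > 0:
                h -= (c / total) * math.log2(c / total)
        return h

    def information_gain(values, labels):
        # I(A) = H(parent) - EH(A) for one categorical attribute A.
        # values: attribute value per example; labels: parallel 0/1 list.
        p = sum(labels)
        n = len(labels) - p
        counts = defaultdict(lambda: [0, 0])   # value -> [p_k, n_k]
        for v, y in zip(values, labels):
            counts[v][0 if y == 1 else 1] += 1
        # EH(A) = sum_k (p_k + n_k)/(p + n) * H(p_k, n_k)
        eh = sum(((pk + nk) / (p + n)) * entropy(pk, nk)
                 for pk, nk in counts.values())
        return entropy(p, n) - eh

Evaluating information_gain for every attribute, picking the largest, and recursing into each child with the corresponding subset is exactly the greedy procedure described above, e.g. information_gain(['sunny', 'rain', 'sunny', 'overcast'], [1, 0, 1, 1]).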
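A sketch of the continuous-feature rule ("sort, then use all available values or means of consecutive values"), reusing entropy() from the sketch above; names are again illustrative:

    def candidate_thresholds(values):
        # Midpoints between consecutive distinct sorted values.
        xs = sorted(set(values))
        return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def best_cut(values, labels):
        # Try every candidate cut point, keep the one with largest gain.
        p = sum(labels)
        n = len(labels) - p
        parent_h = entropy(p, n)
        best = (None, -1.0)
        for t in candidate_thresholds(values):
            pl = sum(y for v, y in zip(values, labels) if v <= t)
            nl = sum(1 for v in values if v <= t) - pl
            pr, nr = p - pl, n - nl
            frac_left = (pl + nl) / (p + n)
            eh = frac_left * entropy(pl, nl) + (1 - frac_left) * entropy(pr, nr)
            if parent_h - eh > best[1]:
                best = (t, parent_h - eh)
        return best   # (threshold, information gain)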
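For the unsupervised/clustering variant, one plausible reading (along the lines of density forests; the notes are terse here, so treat this as an assumption, not the lecture's exact formulation) is to score a split by the differential entropy of a Gaussian fitted to each side, H = (1/2) log(2πeσ²):

    def gaussian_entropy(xs):
        # Differential entropy of a Gaussian fit (mean, variance) to xs;
        # reuses the math import from the first sketch.
        if len(xs) < 2:
            return 0.0
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return 0.5 * math.log(2 * math.pi * math.e * var) if var > 0 else 0.0

    def unsupervised_gain(xs, t):
        # Gain of cutting the unlabeled sample xs at threshold t:
        # parent entropy minus the size-weighted entropy of the two sides.
        left = [x for x in xs if x <= t]
        right = [x for x in xs if x > t]
        w = len(left) / len(xs)
        return gaussian_entropy(xs) - (w * gaussian_entropy(left)
                                       + (1 - w) * gaussian_entropy(right))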