### Decision Tree

The entropy of a categorical attribute with probability distribution $\Pi$ is

$$H(\Pi) = -\sum_i \pi_i \log \pi_i$$

For example, for a training set containing $p$ positive and $n$ negative examples:

$$H\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log \frac{p}{p+n} - \frac{n}{p+n} \log \frac{n}{p+n}$$

**How to pick attributes?** An attribute $A$ with $K$ distinct values divides the training set $E$ into subsets $E_1, \dots, E_K$. The expected entropy remaining after testing attribute $A$ (with branches $k = 1, 2, \dots, K$) is

$$EH(A) = \sum_{k=1}^{K} \frac{p_k + n_k}{p + n}\, H\!\left(\frac{p_k}{p_k + n_k}, \frac{n_k}{p_k + n_k}\right)$$

i.e., each child's entropy weighted by the proportion of examples that take that attribute value, where $p_k + n_k$ is the number of examples (positive plus negative) in the $k$-th child. The information gain of the attribute is

$$I(A) = H\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - EH(A)$$

Pick the attribute with the largest $I(A)$! Once the attribute is chosen, move down to each child and repeat the process recursively (a minimal code sketch follows at the end of these notes).

**Continuous attributes.** Treat each data point as a candidate cut point, using the projections of the points onto the $x$ and $y$ directions as attributes. A split compares a value against a threshold: examples with smaller values are sent to the left child, larger values to the right. As before, the split with the highest information gain is tried first. With very high-dimensional data (say, 20,000-dimensional vectors), scoring every split is impractical, so we use a random forest instead, where each tree considers only randomly sampled candidate attributes and thresholds.

**Unsupervised clustering with random forests.** To discover clusters without labels, fit a Gaussian (mean, variance) to the data on each side of a candidate split for each attribute, and select the split with the highest information gain, now measured as the reduction in the entropy of the fitted Gaussians (see the Gaussian sketch below).

**To avoid overfitting:**
- *Pre-pruning:* stop splitting a node if the number of records falls below a threshold, or if the best split's information gain is below a threshold.
- *Post-pruning:* working bottom-up, remove a node if its impact is below a threshold, i.e., if its removal doesn't change accuracy or increases validation-set accuracy.

For continuous features, candidate cut points are found by sorting the values and trying either all available values or the means of consecutive values (see the cut-point sketch below).
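
A minimal Python sketch of the attribute-selection step, assuming each example is a dict mapping attribute names to categorical values and `labels` is a parallel list of class labels (these names are illustrative, not from the notes):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H of the empirical class distribution over `labels`, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """I(A) = H(parent) - EH(A) for one categorical attribute."""
    children = defaultdict(list)
    for x, y in zip(examples, labels):
        children[x[attribute]].append(y)       # split E into E_1, ..., E_K
    n = len(labels)
    eh = sum(len(ys) / n * entropy(ys)         # (p_k + n_k)/(p + n) * H(child)
             for ys in children.values())
    return entropy(labels) - eh

def best_attribute(examples, labels, attributes):
    """Pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))
```

Building the tree then amounts to calling `best_attribute` at the root, partitioning the examples by the chosen attribute's values, and recursing into each child.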
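
For continuous features, a sketch of cut-point selection: sort the values and score the mean of each pair of consecutive distinct values as a threshold (the function name `best_cut` is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Best threshold for one continuous feature by information gain."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    parent, n = entropy(labels), len(labels)
    best_gain, best_t = float("-inf"), None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                          # mean of consecutive values
        left = [y for v, y in pairs if v <= t]     # smaller -> left child
        right = [y for v, y in pairs if v > t]     # larger  -> right child
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```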
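
One way to score a split without labels, as in the unsupervised random-forest idea above: fit a Gaussian to each side of the cut and measure the drop in Gaussian entropy, $H = \tfrac{1}{2}\ln(2\pi e \sigma^2)$. This is a sketch for 1-D data only, and the function names are illustrative:

```python
import math

def gaussian_entropy(xs):
    """Differential entropy of a 1-D Gaussian fit to xs (nats)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-12   # avoid log(0)
    return 0.5 * math.log(2 * math.pi * math.e * var)

def unsupervised_gain(values, t):
    """Entropy of the whole sample minus the size-weighted entropies
    of the Gaussians fit to each side of threshold t."""
    left = [v for v in values if v <= t]
    right = [v for v in values if v > t]
    if not left or not right:
        return float("-inf")                     # degenerate split
    n = len(values)
    return gaussian_entropy(values) - (len(left) / n * gaussian_entropy(left)
                                       + len(right) / n * gaussian_entropy(right))
```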
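
Pre-pruning reduces to a stopping test evaluated before recursing; the threshold values below are hypothetical placeholders, not from the notes:

```python
def should_stop(labels, best_gain, min_records=5, min_gain=0.01):
    """Pre-pruning: make this node a leaf when it is too small or when
    the best available split gains too little information."""
    return len(labels) < min_records or best_gain < min_gain
```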