DecisionTree Goodness

Developer
Mar 10, 2011 at 5:57 PM

Just went spelunking through the decision tree algorithm.  Which DT type did you choose to implement?  It seems to have some characteristics of the ID3 but at times suggests C4.5.... or another DT I'm missing out on all together.

Also, as an aside - do you have a preference on contributions?  May I feel free to create a fork?

Coordinator
Mar 10, 2011 at 7:21 PM

Feel free to fork! Or I would love to have you be a dev on the project. 

To answer:

  1. If I'm understanding right, ID3 uses entropy as the information gain measure. If that is the case, the DT could use entropy (which makes it ID3-ish)
  2. At the same time (*gasp*) I also handle continuous variables by splitting in half and measuring information gain with each half. I *do not* however post-process the tree to remove unnecessary nodes etc. as C4.5 does.

In the plans for DT's I have the following:

Rather than simply splitting in half I am considering having a tree width set a priori. Meaning that if the split threshold is reached, limit tree fan out to the max width. In this sense it would be slightly more granular than C4.5.

I have the basics already set up for doing this already just need to get it done.

Thoughts?

Developer
Mar 10, 2011 at 8:14 PM

I would love to join the project - for sure!  Machine Learning, among much of Data Mining, has become a passion of mine and I'm looking for exactly this kind of project to code against and solidify the algorithms I'm learning and more over - learn from others.

As for the DTs, I noticed the implementation accepted any of the Error/Entropy/Gini options - and furthermore that it computes Information Gain, the splitting threw me but I totally missed that's why you were splitting, for continuous variables.  That makes complete sense. :)

Have you considered the 3-4-5 rule?  It might be a little overkill for small data sets (and by overkill I mean it likely won't work) but perhaps the algorithm can split in half for small data sets and attempt the 3-4-5 rule for larger data sets - which would essentially build in the fan-out limit (as it were) to 5. 

I could fire a prototype or two your way - we can compare approaches.

Coordinator
Mar 11, 2011 at 4:54 AM

I added a function to do the actual splitting that creates ranges based on (Max - Min) / TreeWidth kind of approach. Looks like I still need to check it in ;)

Developer
Mar 11, 2011 at 4:51 PM

Awesome!  I look forward to seeing the implementation for that.  

Coordinator
Apr 8, 2011 at 8:24 PM

Ok, I think it is done. I checked in yesterday.