Sonny不讀不行: Applied Machine Learning in Python 15

Thursday, September 28, 2017

Applied Machine Learning in Python 15 - Random Forest

Ensemble

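An ensemble is something like a cocktail therapy: several models, possibly produced by different ML methods, are combined into one blended model. Random forest is one kind of ensemble. A single decision tree is prone to overfitting, but many randomly varied decision trees combined can generalize better.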
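To make the overfitting point concrete, here is a minimal sketch comparing a single decision tree with a random forest. The dataset (`make_moons` with noise), the split, and the forest size are assumptions chosen only for illustration, not taken from the course material.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy data (an assumption, used here only for illustration).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown single tree versus a forest of 100 randomized trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Store (train accuracy, test accuracy) for each model.
scores = {
    "tree": (tree.score(X_train, y_train), tree.score(X_test, y_test)),
    "forest": (forest.score(X_train, y_train), forest.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
```

A large gap between train and test accuracy for the single tree would be the sign of overfitting; the exact numbers depend on the data and the random seed.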

Random Forest

The idea behind a random forest breaks down into three steps:

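Step 1: bootstrapping samples
For each tree, a sample of N rows is drawn at random from X_train with replacement, so the same row may appear more than once in a sample. Each tree is therefore trained on a slightly different version of the training data.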
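The sampling step can be sketched with NumPy alone. The toy `X_train`, the number of trees, and the seed below are hypothetical values picked for illustration; here N is taken to be the number of rows in `X_train`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Toy stand-in for X_train: 10 rows, 3 features (hypothetical data).
X_train = np.arange(30).reshape(10, 3)
n_rows = X_train.shape[0]
n_trees = 3  # hypothetical forest size

bootstrap_indices = []
bootstrap_samples = []
for _ in range(n_trees):
    # Draw n_rows row indices with replacement: duplicates are allowed.
    idx = rng.integers(0, n_rows, size=n_rows)
    bootstrap_indices.append(idx)
    bootstrap_samples.append(X_train[idx])

for i, idx in enumerate(bootstrap_indices):
    n_unique = len(np.unique(idx))
    print(f"tree {i}: {len(idx)} rows drawn, {n_unique} distinct original rows")
```

Because rows are drawn with replacement, each sample usually contains repeated rows and leaves some original rows out.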

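Step 2: random feature subset at each split
Unlike building a single decision tree, each split does not search all features for the best split. Instead, it finds the best split within a random subset of the features. Together with bootstrapping, this makes the trees in the forest differ from one another.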

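This is controlled by the max_features parameter. If it equals 1, every split is made on a single randomly chosen feature; because the tree still has to fit the training data, this can produce very deep trees and overfitting. If max_features is too large, close to the total number of features, each split is chosen almost as in an ordinary decision tree, and the trees in the forest end up looking very similar.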
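One way to observe the effect of max_features is to look at how deep the fitted trees become. The sketch below uses synthetic data from `make_classification`; the data shape, the candidate values, and the forest size are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical, for illustration only): 500 rows, 20 features.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

mean_depth = {}
for max_features in (1, 5, 20):
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    forest.fit(X, y)
    # Each fitted tree exposes its depth through the low-level tree_ attribute.
    depths = [tree.tree_.max_depth for tree in forest.estimators_]
    mean_depth[max_features] = float(np.mean(depths))
    print(f"max_features={max_features}: mean tree depth {mean_depth[max_features]:.1f}")
```

If the reasoning above holds on this data, max_features=1 should give the deepest trees, though the numbers will vary with the dataset and seed.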

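Step 3: prediction
Each tree makes its own prediction, and the forest combines them by a weighted vote. In scikit-learn's classifier, the class probabilities predicted by the individual trees are averaged and the class with the highest average probability is chosen.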
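The voting step can be reproduced by hand from a fitted scikit-learn forest. This is a sketch on synthetic data (the dataset and the forest size of 25 are assumptions); it averages the per-tree `predict_proba` outputs and compares the result with what the forest itself returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (hypothetical, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

# Average the class probabilities predicted by the individual trees.
tree_probas = np.array([tree.predict_proba(X_test) for tree in forest.estimators_])
avg_proba = tree_probas.mean(axis=0)

# Pick the class with the highest averaged probability.
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

matches_proba = np.allclose(avg_proba, forest.predict_proba(X_test))
matches_pred = np.array_equal(manual_pred, forest.predict(X_test))
print("averaged probabilities match:", matches_proba)
print("manual vote matches forest.predict:", matches_pred)
```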

Results on the breast cancer dataset

With max_features = 8, the random forest reached a better score than any of the other supervised learning methods covered so far.
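A run along these lines can be sketched with scikit-learn's built-in copy of the dataset. Only max_features = 8 comes from the notes above; the train/test split, n_estimators, and random_state are assumptions, so the scores will not necessarily match the ones from the course.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Default 75/25 split; random_state is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features=8 as in the notes; n_estimators=100 is an assumed value.
clf = RandomForestClassifier(n_estimators=100, max_features=8, random_state=0)
clf.fit(X_train, y_train)  # no feature scaling is applied beforehand

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```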

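The biggest advantage is that it needs neither preprocessing such as feature scaling nor complicated parameter tuning.
The drawback is that the randomized structure is hard for humans to interpret, so it is not easy to tell why a particular prediction succeeded or failed. Random forests are also not well suited to sparse feature vectors (for example, text classification).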
