
Friday, September 8, 2017

Applied Machine Learning in Python 8 - Lab: Linear Regression

Suppose we have a dataset defined as follows:

  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split

  # noisy samples along a sine curve with a mild upward trend
  np.random.seed(0)
  n = 15
  x = np.linspace(0, 10, n) + np.random.randn(n) / 5
  y = np.sin(x) + x / 6 + np.random.randn(n) / 10

  X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
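
Note that sklearn's train_test_split defaults to a 75/25 split, so with n = 15 we get 11 training points and 4 test points. To see the shape of the data before fitting anything, here is a minimal plotting sketch (assuming matplotlib, which the lab code itself does not import):

  import matplotlib.pyplot as plt

  # scatter the noisy sine-shaped samples generated above
  plt.scatter(X_train, y_train, label='training data')
  plt.scatter(X_test, y_test, label='test data')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.legend()
  plt.show()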



Fitting the data with polynomial regression

This dataset is clearly non-linear, so we first apply a polynomial feature transformation to add higher-degree features. We train models with several polynomial degrees and use each one to predict 100 evenly spaced input points:

  def polynomial_regression():
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures
      from sklearn.pipeline import make_pipeline

      # generate points used for prediction
      x_prediction = np.linspace(0, 10, 100)
      x_prediction = x_prediction[:, np.newaxis]

      # create matrix version: sklearn expects X with shape (n_samples, n_features)
      X_poly = X_train[:, np.newaxis]

      result = []
      for degree in [1, 3, 6, 9]:
          model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
          model.fit(X_poly, y_train)
          y_predicted = model.predict(x_prediction)
          result.append(y_predicted)

      return np.array(result)

After training these four models, their predictions over the 100 points look like this:

[Figure: fitted curves for degrees 1, 3, 6, and 9]
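
The figure can be reproduced with a short matplotlib sketch (matplotlib and the polynomial_regression() defined above are assumed; the axis limits are a guess):

  import matplotlib.pyplot as plt

  # prediction points match those generated inside polynomial_regression()
  x_prediction = np.linspace(0, 10, 100)
  predictions = polynomial_regression()

  plt.scatter(X_train, y_train, label='training data')
  for degree, y_predicted in zip([1, 3, 6, 9], predictions):
      plt.plot(x_prediction, y_predicted, label='degree {}'.format(degree))
  plt.ylim(-2, 3)  # assumed limits, to keep the wild degree-9 curve in view
  plt.legend()
  plt.show()
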
Finding the R-squared score for each degree

Now let's list the polynomial regression R^2 scores for degrees 0 through 9, on both the training set and the test set:

  def polynomial_r2_score():
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures
      from sklearn.pipeline import make_pipeline

      # create matrix version with shape (n_samples, n_features)
      X_poly = X_train[:, np.newaxis]

      train_scores = []
      test_scores = []
      for degree in range(0, 10):
          model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
          model.fit(X_poly, y_train)
          # score() returns the R^2 of the model on the given data
          train_scores.append(model.score(X_poly, y_train))
          test_scores.append(model.score(X_test[:, np.newaxis], y_test))

      return (np.array(train_scores), np.array(test_scores))
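
The listing below can be produced with a small driver loop (an assumed usage snippet, not part of the lab code):

  train_scores, test_scores = polynomial_r2_score()
  for degree in range(10):
      print('degree {} training score = {}, test score = {}'.format(
          degree, train_scores[degree], test_scores[degree]))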


The results are as follows:

degree 0 training score = 0.0, test score = -0.4780864173714179
degree 1 training score = 0.4292457781234663, test score = -0.45237104233936676
degree 2 training score = 0.4510998044408247, test score = -0.06856984149915935
degree 3 training score = 0.587199536877985, test score = 0.005331052945771075
degree 4 training score = 0.9194194471769332, test score = 0.7300494281868128
degree 5 training score = 0.9757864143068216, test score = 0.8770830091535732
degree 6 training score = 0.9901823324795085, test score = 0.9214093981312127
degree 7 training score = 0.9935250927840401, test score = 0.9202150411626775
degree 8 training score = 0.9963754538774482, test score = 0.6324794961087974
degree 9 training score = 0.9980370625648909, test score = -0.6452460241564286

Judged by the test-set R^2, the best model is degree = 6.
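
That choice can also be read off programmatically from the arrays returned above (another small assumed usage snippet):

  train_scores, test_scores = polynomial_r2_score()
  best_degree = int(np.argmax(test_scores))
  print('best degree by test R^2:', best_degree)   # prints 6 for this data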


Lasso regression

Polynomial regression clearly overfits at high degrees (degree >= 8), so let's try lasso regression to regularize it.

  def lasso_regression():
      from sklearn.preprocessing import PolynomialFeatures
      from sklearn.linear_model import Lasso, LinearRegression
      from sklearn.pipeline import make_pipeline

      # create matrix versions with shape (n_samples, n_features)
      X_train_2D = X_train[:, np.newaxis]
      X_test_2D = X_test[:, np.newaxis]

      # ordinary least squares on degree-12 polynomial features
      model_lr = make_pipeline(PolynomialFeatures(12), LinearRegression())
      model_lr.fit(X_train_2D, y_train)

      # the same features, but with an L1 penalty on the coefficients
      model_lasso = make_pipeline(PolynomialFeatures(12), Lasso(alpha=0.01, max_iter=10000))
      model_lasso.fit(X_train_2D, y_train)

      lr_test_r2_score = model_lr.score(X_test_2D, y_test)
      lasso_test_r2_score = model_lasso.score(X_test_2D, y_test)

      return (lr_test_r2_score, lasso_test_r2_score)
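
One way to see the regularization at work is to count how many of the 13 polynomial coefficients Lasso drives to exactly zero. This is a sketch under the same pipeline settings as above, not part of the lab code:

  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import Lasso
  from sklearn.pipeline import make_pipeline

  model_lasso = make_pipeline(PolynomialFeatures(12),
                              Lasso(alpha=0.01, max_iter=10000))
  model_lasso.fit(X_train[:, np.newaxis], y_train)

  # the L1 penalty zeroes out most coefficients, which is what tames overfitting
  coefs = model_lasso.named_steps['lasso'].coef_
  print('non-zero coefficients:', np.sum(coefs != 0), 'of', len(coefs))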


At degree = 12, lasso regression reaches a test-set R^2 score of 0.84, while ordinary linear regression scores -4.31. Why negative? R^2 is defined as 1 - SS_res/SS_tot, so it drops below zero whenever a model's squared prediction error on the test set exceeds that of simply predicting the mean of y, which is exactly what happens when a high-degree model overfits.
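
A tiny illustration of a negative R^2, using sklearn.metrics.r2_score on made-up numbers rather than the lab data:

  from sklearn.metrics import r2_score

  y_true = np.array([1.0, 2.0, 3.0])
  y_mean_pred = np.full_like(y_true, y_true.mean())   # the mean baseline
  y_bad_pred = np.array([5.0, -1.0, 8.0])             # worse than the mean

  print(r2_score(y_true, y_mean_pred))   # 0.0: predicting the mean
  print(r2_score(y_true, y_bad_pred))    # -24.0: a large negative score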

