code

Thursday, August 31, 2017

Applied Machine Learning in Python 2 - Lab: KNN

Let's play around with KNN using the breast cancer dataset.

Loading the data

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer

    cancer = load_breast_cancer()

However, what comes back is a dict-like object, not yet a pandas DataFrame. Scikit-learn does not require a DataFrame, but having one helps with data cleaning, so let's try converting it.

    def to_dataframe():
        # Feature matrix becomes the columns; the labels are appended as 'target'
        df = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
        df['target'] = cancer['target']
        return df
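As a quick sanity check on the conversion, the following sketch rebuilds the same DataFrame outside the function and inspects it. The variable names here are only for illustration; the shape and the missing-value count are printed rather than assumed.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The returned object behaves like a dict, so its keys can be listed directly.
print(sorted(cancer.keys()))

# Same conversion as to_dataframe(): features first, labels in the last column.
df = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
df['target'] = cancer['target']

print(df.shape)                  # (rows, feature columns + 1 for 'target')
print(df.isnull().sum().sum())   # total number of missing values
```

Because 'target' is appended last, later steps can rely on slicing off the final column to get the features.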


class distribution

How many samples fall into the malignant and benign classes?

    def class_distribution():
        df = to_dataframe()
        # In this dataset, target 0 is malignant and target 1 is benign
        result = pd.Series({'malignant': len(df[df['target'] == 0]),
                            'benign': len(df[df['target'] == 1])})
        return result
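The same counts can also be obtained without hard-coding which label means what. This is a minimal alternative sketch, not the post's own method: it maps the labels through the dataset's `target_names` and lets pandas do the counting.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
target = pd.Series(cancer['target'])

# target_names is ordered by label value: position 0 is the name for label 0, and so on.
names = {i: str(name) for i, name in enumerate(cancer['target_names'])}

# Replace numeric labels with their names, then count each class.
counts = target.map(names).value_counts()
print(counts)
```

Reading the names from the dataset avoids accidentally swapping malignant and benign.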


Separating the labels from the data

    def split_data_label():
        df = to_dataframe()
        X = df[df.columns[:-1]]  # every column except the last one ('target')
        y = df['target']

        return X, y


Creating a 75% : 25% training set vs. test set

    from sklearn.model_selection import train_test_split

    def training_set():
        X, y = split_data_label()

        # With no test_size given, train_test_split holds out 25% of the rows
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        return X_train, X_test, y_train, y_test
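The 75% : 25% ratio comes from the default behavior of `train_test_split` rather than from an explicit argument. The sketch below, which loads the arrays with `return_X_y=True` purely for brevity, compares the default call with one that spells out `test_size=0.25`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Default call, as used in training_set()
default_split = train_test_split(X, y, random_state=0)
# Same call with the held-out fraction written explicitly
explicit_split = train_test_split(X, y, test_size=0.25, random_state=0)

X_train, X_test, y_train, y_test = default_split
print(X_train.shape, X_test.shape)
print(len(X_test) / len(X))  # fraction of rows held out for testing

# Both calls should select exactly the same test rows.
same_rows = (default_split[1] == explicit_split[1]).all()
print(same_rows)
```

Fixing `random_state` keeps the split reproducible, so every helper that calls `training_set()` sees the same rows.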


train 1-NN classifier

    from sklearn.neighbors import KNeighborsClassifier

    def _1NN_classifier():
        X_train, X_test, y_train, y_test = training_set()

        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X_train, y_train)

        return knn
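To make concrete what a 1-NN classifier does, here is a rough hand-written version for comparison. The helper `predict_1nn` is hypothetical (it is not part of scikit-learn), and it assumes plain Euclidean distance, which matches the default metric of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: give each query row the label of its closest training row."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the nearest training row for every query row
    return y_train[dists.argmin(axis=1)]

manual = predict_1nn(X_train, y_train, X_test)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Fraction of test rows where the hand-written version agrees with scikit-learn
agreement = (manual == knn.predict(X_test)).mean()
print(agreement)
```

"Training" a KNN model mostly means storing the training set; the real work happens at prediction time, when distances are computed.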


Predicting with the classifier

We need to build an input first. Here we use a made-up data point: a feature vector made up of the mean of each feature in the DataFrame.

    def predict():
        cancerdf = to_dataframe()
        # Mean of every feature (dropping 'target'), reshaped into a single sample
        means = cancerdf.mean()[:-1].values.reshape(1, -1)

        knn = _1NN_classifier()
        result = knn.predict(means)

        return result
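One possible variation is to keep the mean vector as a one-row DataFrame, so its column names match the data the classifier was fitted on. This is a sketch under that assumption; `mean_sample` and `label` are names chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
y = pd.Series(cancer['target'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# One-row DataFrame holding the mean of every feature, with the original column names
mean_sample = X.mean().to_frame().T

# predict() returns an array with one label per input row
label = knn.predict(mean_sample)[0]
print(label, cancer['target_names'][label])
```

Looking the label up in `target_names` turns the numeric prediction into a readable class name.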


predict test set

Of course, to evaluate how good the estimator is, we still need to run it on the test set:

    def predict_test_set():
        X_train, X_test, y_train, y_test = training_set()
        knn = _1NN_classifier()

        # One predicted label (0 or 1) for each row of the test set
        result = knn.predict(X_test)

        return result


Accuracy

Finally, we can check the accuracy of this training run:

    def answer_eight():
        X_train, X_test, y_train, y_test = training_set()
        knn = _1NN_classifier()

        # score() returns the mean accuracy on the given data and labels
        result = knn.score(X_test, y_test)

        return result
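For a classifier, `score` is simply the fraction of correct predictions. The sketch below computes that fraction by hand and compares it with `knn.score` and with `accuracy_score` from `sklearn.metrics`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

pred = knn.predict(X_test)

# Accuracy = number of correct predictions / number of test samples
manual_acc = np.mean(pred == y_test)
builtin_acc = knn.score(X_test, y_test)
metric_acc = accuracy_score(y_test, pred)

print(manual_acc, builtin_acc, metric_acc)
```

All three numbers should agree, since they are the same computation expressed in different ways.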


The accuracy of this trained model on the two classes in the training set and the test set is as follows:
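As a rough sketch of how such per-class numbers could be computed, the code below scores the classifier separately on the malignant and benign rows of each split. The dictionary keys and loop structure are assumptions made for illustration, and the values are printed rather than quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

splits = {'training': (X_train, y_train), 'test': (X_test, y_test)}
class_names = ['malignant', 'benign']  # label 0 and label 1 in this dataset

scores = {}
for split_name, (X_part, y_part) in splits.items():
    for label, class_name in enumerate(class_names):
        # Keep only the rows whose true label is this class
        mask = y_part == label
        scores[f'{class_name} ({split_name})'] = knn.score(X_part[mask], y_part[mask])

for key, value in scores.items():
    print(f'{key}: {value:.3f}')
```

Comparing the training and test numbers for each class gives a first hint of how much a 1-NN model overfits.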




