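Problem description

Train a model with the KNN (k-nearest neighbors) algorithm, then use it to predict whether a person's annual income exceeds 50K.

Load the dataset and inspect the data

# Import the required libraries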
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

df = pd.read_csv("./adults.txt")
df.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

The dataset contains 13 features: age, workclass, final_weight, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, hours_per_week, and native_country.

The last column, salary, records the person's annual income bracket (<=50K or >50K) and serves as the label.

Feature engineering

Split features and labels

# Feature data
data = df.iloc[:,:-1].copy()
data.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba

# Label data
target = df[["salary"]].copy()
target.head()

	salary
0	<=50K
1	<=50K
2	<=50K
3	<=50K
4	<=50K

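Encode non-numeric features

KNN works by computing distances between samples, and distances are only defined for numeric values, so the non-numeric features have to be encoded as numbers first.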
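To make the "distances need numbers" point concrete, here is a minimal from-scratch sketch of KNN prediction. It is only an illustration, not the scikit-learn implementation used later; the helper `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training rows."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): each row is [age, hours_per_week]
toy_x = [[25, 20], [30, 25], [45, 50], [50, 60], [48, 55]]
toy_y = ["<=50K", "<=50K", ">50K", ">50K", ">50K"]

print(knn_predict(toy_x, toy_y, [47, 52], k=3))
```

The subtraction and squaring in the distance step would fail on a value such as "State-gov", which is why every string column has to become a number before the model can use it.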

Encode the string-valued features

Encode the workclass feature

# List the distinct workclass values
ws = data.workclass.unique()
ws

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

There are 9 workclass categories in total, including the unknown "?". Below we use the integers 0 to 8 to encode them.

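# Define the conversion function
def convert_ws(item):
    # np.argwhere returns the positions where ws == item; [0,0] picks the index of the first match
    return np.argwhere(ws==item)[0,0]

# Replace each workclass value with its index in the ws array
data.workclass = data.workclass.map(convert_ws)
# Inspect the data after encoding workclass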
data.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country
0	39	0	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States
1	50	1	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States
2	38	2	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States
3	53	2	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States
4	28	2	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba

np.argwhere returns the index of the matching workclass value; for example, np.argwhere(ws == "?")[0,0] returns 5.
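As a side note, pandas has a built-in that produces the same kind of "index of first appearance" codes in one call: `pd.factorize`. The sketch below compares the two approaches on a small made-up column; it is an alternative for illustration, not the method used in this post.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical), mimicking a few workclass values
s = pd.Series(["State-gov", "Private", "?", "Private", "State-gov"])

# Approach used above: position of each value in the array of unique values
uniques = np.asarray(s.unique(), dtype=object)
codes_argwhere = s.map(lambda item: np.argwhere(uniques == item)[0, 0])

# pd.factorize assigns integer codes in order of first appearance
codes_factorize, uniques_factorize = pd.factorize(s)

print(list(codes_argwhere))
print(list(codes_factorize))
print(list(uniques_factorize))
```

Both approaches number the categories in the order they first appear, so they give the same codes; `pd.factorize` simply avoids scanning the array of unique values once per row.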

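Encode the remaining string features

The process is the same as for workclass above.

# Columns that need to be encoded
cols = ['education',"marital_status","occupation","relationship","race","sex","native_country"]

# Encode each column in turn
def convert_item(item, uni):
    # Index of item within the current column's array of unique values
    return np.argwhere(uni == item)[0,0]
for col in cols:
    uni = data[col].unique()
    # Pass uni explicitly instead of relying on a global variable
    data[col] = data[col].map(lambda item: convert_item(item, uni))

# Inspect the data after encoding all columns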
data.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country
0	39	0	77516	0	13	0	0	0	0	0	2174	40	0
1	50	1	83311	0	13	1	1	1	0	0	0	13	0
2	38	2	215646	1	9	2	2	0	0	0	0	40	0
3	53	2	234721	2	7	1	2	1	1	0	0	40	0
4	28	2	338409	0	13	1	3	2	1	1	0	40	1
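One caveat about integer codes: KNN treats them as real distances, so a category coded 1 looks "closer" to 0 than a category coded 40, even though the numbering is arbitrary. A common alternative is one-hot encoding, sketched below with `pd.get_dummies` on a made-up frame. This is not what the post does; it is shown only for comparison.

```python
import pandas as pd

# Toy frame (hypothetical) with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "race": ["White", "Black", "White"],
    "sex": ["Male", "Male", "Female"],
})

# Each category becomes its own 0/1 column, so no artificial order is introduced
encoded = pd.get_dummies(toy, columns=["race", "sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The trade-off is width: a column with many categories, such as native_country, turns into many indicator columns.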

Modeling and evaluation

Now that all features have been encoded as numbers, we can build a model with the KNN algorithm.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

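# Create the model
knn = KNeighborsClassifier(n_neighbors=8)

# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(data,target,test_size=0.01)

# Train the model (ravel() turns the single-column label DataFrame into the 1-D array that fit expects)
knn.fit(x_train,y_train.values.ravel())

# Evaluate the model's accuracy on the test set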
knn.score(x_test,y_test)

0.7822085889570553
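Because test_size=0.01 holds out only about 1% of the rows and no random_state is set, a single score like this will vary from run to run. A rough sketch of a steadier estimate is k-fold cross-validation, compared against a majority-class baseline for context. The data here is synthetic (generated with `make_classification`) and only stands in for `data` and `target`; the class weights are an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (data, target): roughly 75% of samples fall in class 0
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
knn = KNeighborsClassifier(n_neighbors=8)

# 5-fold cross-validation: every row is used for testing exactly once
baseline_scores = cross_val_score(baseline, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)

print("baseline accuracy:", baseline_scores.mean())
print("knn accuracy:", knn_scores.mean())
```

On imbalanced labels the baseline can already be fairly high, so an accuracy figure is easier to judge next to it.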

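Model optimization

As we can see, training KNN directly on the encoded features, without any further processing, gives an accuracy of only about 78%.

One likely reason is scale: a feature such as final_weight takes values in the hundreds of thousands and dominates the distance, while features such as age barely matter. Next we normalize the feature data, then build and test the KNN model again to see how the result changes.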
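A small worked example shows how strongly scale affects the distance. The two people and the column ranges below are made up for illustration; only the arithmetic matters.

```python
import numpy as np

# Two hypothetical people described by [age, final_weight]
a = np.array([25.0, 200000.0])
b = np.array([60.0, 200500.0])

# Raw Euclidean distance: dominated by final_weight
raw_distance = np.linalg.norm(a - b)
age_share_raw = (a[0] - b[0]) ** 2 / raw_distance ** 2

# Assumed column ranges (hypothetical) used for min-max scaling
col_min = np.array([17.0, 10000.0])
col_max = np.array([90.0, 1500000.0])

a_scaled = (a - col_min) / (col_max - col_min)
b_scaled = (b - col_min) / (col_max - col_min)
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

print("raw distance:", raw_distance)
print("share of squared raw distance due to age:", age_share_raw)
print("scaled distance:", scaled_distance)
```

Before scaling, a 35-year age gap accounts for less than 1% of the squared distance; after scaling, it accounts for nearly all of it, because a difference of 500 in final_weight is tiny relative to that column's range.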

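# Normalize all features to the [0, 1] range
# Define the min-max normalization function
def func(x):
    # Vectorized Series methods instead of Python's built-in min/max
    return (x - x.min()) / (x.max() - x.min())

# Apply min-max normalization to every feature column
data[data.columns] = data[data.columns].transform(func)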
data.head()

	age	workclass	final_weight	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country
0	0.301370	0.000	0.044302	0.000000	0.800000	0.000000	0.000000	0.0	0.00	0.0	0.02174	0.397959	0.00000
1	0.452055	0.125	0.048238	0.000000	0.800000	0.166667	0.071429	0.2	0.00	0.0	0.00000	0.122449	0.00000
2	0.287671	0.250	0.138113	0.066667	0.533333	0.333333	0.142857	0.0	0.00	0.0	0.00000	0.397959	0.00000
3	0.493151	0.250	0.151068	0.133333	0.400000	0.166667	0.142857	0.2	0.25	0.0	0.00000	0.397959	0.00000
4	0.150685	0.250	0.221488	0.000000	0.800000	0.166667	0.214286	0.4	0.25	1.0	0.00000	0.397959	0.02439

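# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(data,target,test_size=0.01)

# Create the model
knn = KNeighborsClassifier(n_neighbors=8)

# Train the model (ravel() turns the single-column label DataFrame into the 1-D array that fit expects)
knn.fit(x_train,y_train.values.ravel())

# Evaluate the model's accuracy on the test set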
knn.score(x_test,y_test)
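One thing to keep in mind is that the normalization above uses the minimum and maximum over all rows, including those that later end up in the test set. A way to avoid that is to fit the scaler on the training split only, for example with a scikit-learn pipeline. The sketch below uses synthetic data as a stand-in for `data` and `target`; the 0.2 test size and the random_state values are choices made for this example, not taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for (data, target)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[:, 0] *= 100000  # exaggerate one feature's scale, similar to final_weight

# Split first, so the test rows play no part in fitting the scaler
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits MinMaxScaler on the training data only,
# then applies the same learned min/max to the test data when scoring
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=8))
model.fit(x_train, y_train)

accuracy = model.score(x_test, y_test)
print(accuracy)
```

Keeping the scaler inside the pipeline also means new samples are scaled the same way at prediction time, without a separate manual step.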