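1. A Python machine learning library: scikit-learn
1.1 Features:
Simple and efficient tools for data mining and machine learning analysis
Accessible to all users and highly reusable across different needs
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable: BSD-licensed
1.2 Problem areas covered:
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing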
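2. Using scikit-learn
Install scikit-learn: pip, easy_install, or the Windows installer
Install the required packages: NumPy, SciPy, and matplotlib. Anaconda can be used instead (it bundles NumPy, SciPy, and other common scientific computing packages)
Installation caveats: the Python interpreter version (2.7 or 3.4?) and whether the system is 32-bit or 64-bit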
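
The installation caveats above can be checked from Python itself. The following is a minimal sketch using only the standard library; the function name check_environment and the package list are assumptions made for illustration, not part of the original demo.

```python
import importlib.util
import platform
import sys

# Packages the tutorial depends on; "sklearn" is the import name of scikit-learn.
REQUIRED = ["numpy", "scipy", "matplotlib", "sklearn"]


def check_environment(packages=REQUIRED):
    """Return interpreter details and, per package, whether it can be found."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],  # major.minor version
        "bits": platform.architecture()[0],        # e.g. "64bit"
    }
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

Any package reported as False still needs to be installed before the demo will run.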
3. Demo
Project repository:
https://github.com/yulong12/ML_Demo
The dataset is AllElectronics.csv

The code is in AllElectronics.py
Read the data:
import csv

allElectronicsData = open(r'./AllElectronics.csv', 'r')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # read the first line of the file (the column names)
print(headers)
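Put each row's features into a dictionary and collect the dictionaries in a list
Put the class labels into a separate list
featureList = []
labelList = []
for row in reader:
    # the last column is the class label
    labelList.append(row[len(row)-1])
    rowDict = {}
    # column 0 and the label column are not used as features
    for i in range(1, len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
print(featureList)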
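
To see what this loop produces without the real file, here is a small self-contained sketch. The two-row sample and its column names are hypothetical stand-ins (the actual AllElectronics.csv may differ), and split_features_and_labels is a helper name introduced only for this illustration.

```python
import csv
import io

# Hypothetical two-row sample; the real AllElectronics.csv may differ.
SAMPLE = """RID,age,income,student,credit_rating,class_buys_computer
1,youth,high,no,fair,no
2,senior,low,yes,excellent,yes
"""


def split_features_and_labels(lines):
    """Return (headers, list of feature dicts, list of labels)."""
    reader = csv.reader(lines)
    headers = next(reader)  # first line holds the column names
    feature_list, label_list = [], []
    for row in reader:
        label_list.append(row[-1])  # last column is the label
        # skip column 0 and the label column, as in the demo loop
        feature_list.append({headers[i]: row[i] for i in range(1, len(row) - 1)})
    return headers, feature_list, label_list


headers, features, labels = split_features_and_labels(io.StringIO(SAMPLE))
print(features[0])
print(labels)
```

Each row becomes one dictionary keyed by column name, which is the input shape DictVectorizer expects.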

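Use DictVectorizer to convert the features into 0/1 values
from sklearn.feature_extraction import DictVectorizer

# Vectorize features
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX: " + str(dummyX))
print(vec.get_feature_names_out())  # get_feature_names() on scikit-learn versions before 1.0
print("labelList: " + str(labelList))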
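
The idea behind this 0/1 conversion (one-hot encoding) can be sketched in plain Python. This is only an illustration of the concept under simplifying assumptions (string-valued features, dense output), not DictVectorizer's actual implementation; the function name one_hot and the sample rows are hypothetical.

```python
def one_hot(rows):
    """Encode a list of {feature: category} dicts as rows of 0/1 values."""
    # One column per distinct "feature=value" pair, in sorted order.
    columns = sorted({"%s=%s" % (k, v) for row in rows for k, v in row.items()})
    index = {name: i for i, name in enumerate(columns)}
    matrix = []
    for row in rows:
        encoded = [0] * len(columns)
        for k, v in row.items():
            encoded[index["%s=%s" % (k, v)]] = 1  # mark the category that is present
        matrix.append(encoded)
    return columns, matrix


# Hypothetical sample rows for illustration.
rows = [
    {"age": "youth", "student": "no"},
    {"age": "senior", "student": "yes"},
]
columns, matrix = one_hot(rows)
print(columns)
print(matrix)
```

Every categorical value gets its own column, so no artificial ordering is imposed on categories such as youth and senior.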

Convert y
from sklearn import preprocessing

# Vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))

Train the decision tree
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy')  # measure split quality with information entropy
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))
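
The criterion='entropy' setting means split quality is measured by information entropy. As a rough sketch of that measure, the code below computes the entropy of a label list and the information gain of a categorical feature. The toy data is hypothetical, and the multi-way split shown here follows the textbook ID3 style; scikit-learn's trees make binary splits on numeric features, so this illustrates the measure rather than the library's algorithm.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting the labels by a categorical feature."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # weighted average entropy of the subsets after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: the feature separates the two classes perfectly.
student = ["no", "no", "yes", "yes"]
buys = ["no", "no", "yes", "yes"]
print(entropy(buys))
print(information_gain(student, buys))
```

A feature with higher information gain leaves purer subsets after the split, which is why it is preferred near the root of the tree.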

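Predict a new value; the input is a modified copy of the first record
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
newRowX = oneRowX.copy()  # copy, so that editing newRowX does not also change dummyX
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))
predictedY = clf.predict([newRowX])  # predict with the trained classifier
print("predictedY: " + str(predictedY))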
