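While writing this post, two questions kept coming up in my mind: what logistic regression really means mathematically, and what data preprocessing I need to do before running it.
The mathematics (the hypothesis function, the cost function, the sigmoid function) is something I studied a long time ago, and I will fill those parts in gradually in later posts.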
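Until that write-up exists, here is a minimal sketch of those three functions in NumPy. It follows the usual textbook formulation, which is not necessarily how scikit-learn computes things internally, and the toy X, y and theta are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """Predicted probability of class 1 for each row of X."""
    return sigmoid(X @ theta)

def cost(theta, X, y):
    """Average cross-entropy (log loss) between predictions and labels."""
    h = hypothesis(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: a column of ones (intercept) plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])
theta = np.zeros(2)

print(sigmoid(0.0))       # 0.5
print(cost(theta, X, y))  # log(2), about 0.693, because every prediction is 0.5
```

Fitting a logistic regression amounts to searching for the theta that makes this cost as small as possible.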
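Preprocessing includes the usual steps, such as dropping missing values and slicing the data as needed. What is specific to logistic regression is that text (categorical) columns have to be converted into discrete numeric values. Why? Because fitting the model means running a numerical optimization over the hypothesis function, and that only works on numbers. The idea is similar to text vectorization for naive Bayes, where text is turned into a numeric matrix. I will leave it at that for now and write up the details separately when I have time.
Enough talk, on to the real thing.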
-
Importing the data
import pandas as pd
import numpy as np
data = pd.read_csv('c:\\PDM\\data.csv')
-
Preprocessing 1: dropping missing values
data=data.dropna()
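To make the effect of dropna() concrete, here is a small sketch on a made-up frame (the column names only mimic the real data set). With no arguments it removes every row that has a missing value in any column, so it is worth checking how many rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real data.csv
toy = pd.DataFrame({
    'Age': [34, np.nan, 51],
    'Gender': ['Male', 'Female', None],
})

cleaned = toy.dropna()          # keeps only rows with no missing value at all
print(len(toy), len(cleaned))   # row counts before and after
```

If only a few columns matter for the model, passing subset=[...] to dropna() limits the check to those columns and may keep more rows.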
-
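Preprocessing 2: converting text data to discrete numeric data
There are two ways to do this.
- Use pandas' built-in get_dummies, which expands every category into its own new column holding 0 or 1.
- Define the encoding by hand: label each text value in a column with an int, then apply the labels with the map method of the column (Series.map).
First, the get_dummies approach:
- List the names of the columns to preprocess
- For each column, collect the unique text values it contains
- Add each unique value as a new column holding 0 or 1, and drop one value per column (drop_first=True), since it is implied when all the others are 0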
dummyColumns = ['Gender', 'Home Ownership', 'Internet Connection', 'Marital Status', 'Movie Selector', 'Prerec Format', 'TV Signal']
for column in dummyColumns:
    data[column] = data[column].astype('category')
dummiesData = pd.get_dummies(data, columns=dummyColumns, prefix=dummyColumns, prefix_sep=" ", drop_first=True)
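A tiny made-up example shows where column names such as 'Gender Male' come from. The frame below is hypothetical. The categories are sorted alphabetically, and drop_first=True removes the first one ('Female' here), so only 'Gender Male' remains.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'],
                    'Age': [34, 28, 51]})
toy['Gender'] = toy['Gender'].astype('category')

encoded = pd.get_dummies(toy, columns=['Gender'], prefix=['Gender'],
                         prefix_sep=' ', drop_first=True)
print(list(encoded.columns))   # ['Age', 'Gender Male']
```

Depending on the pandas version the new column holds booleans or 0/1 integers. Both are accepted by scikit-learn.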
Second, the hand-defined approach, using map:
educationLevelDict = {
'Post-Doc': 9,
'Doctorate': 8,
'Master\'s Degree': 7,
'Bachelor\'s Degree': 6,
'Associate\'s Degree': 5,
'Some College': 4,
'Trade School': 3,
'High School': 2,
'Grade School': 1
}
dummiesData['Education Level Map'] = dummiesData['Education Level'].map(educationLevelDict)
freqMap = {
'Never': 0,
'Rarely': 1,
'Monthly': 2,
'Weekly': 3,
'Daily': 4
}
dummiesData['PPV Freq Map'] = dummiesData['PPV Freq'].map(freqMap)
dummiesData['Theater Freq Map'] = dummiesData['Theater Freq'].map(freqMap)
dummiesData['TV Movie Freq Map'] = dummiesData['TV Movie Freq'].map(freqMap)
dummiesData['Prerec Buying Freq Map'] = dummiesData['Prerec Buying Freq'].map(freqMap)
dummiesData['Prerec Renting Freq Map'] = dummiesData['Prerec Renting Freq'].map(freqMap)
dummiesData['Prerec Viewing Freq Map'] = dummiesData['Prerec Viewing Freq'].map(freqMap)
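One thing to watch with map: any label that is missing from the dictionary silently becomes NaN, which would later make the model fit fail. A small sketch with a made-up label ('Yearly' is hypothetical and not from the data set):

```python
import pandas as pd

freq_map = {'Never': 0, 'Rarely': 1, 'Monthly': 2, 'Weekly': 3, 'Daily': 4}
toy = pd.Series(['Daily', 'Never', 'Yearly'])   # 'Yearly' has no entry in freq_map

mapped = toy.map(freq_map)
print(mapped.isna().sum())   # number of labels that could not be mapped
```

Running the same isna().sum() check on each mapped column is a cheap way to catch typos or unexpected labels in the source file.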
-
Building the new matrix
dummiesSelect = [
'Age', 'Num Bathrooms', 'Num Bedrooms', 'Num Cars', 'Num Children', 'Num TVs',
'Education Level Map', 'PPV Freq Map', 'Theater Freq Map', 'TV Movie Freq Map',
'Prerec Buying Freq Map', 'Prerec Renting Freq Map', 'Prerec Viewing Freq Map',
'Gender Male',
'Internet Connection DSL', 'Internet Connection Dial-Up',
'Internet Connection IDSN', 'Internet Connection No Internet Connection',
'Internet Connection Other',
'Marital Status Married', 'Marital Status Never Married',
'Marital Status Other', 'Marital Status Separated',
'Movie Selector Me', 'Movie Selector Other', 'Movie Selector Spouse/Partner',
'Prerec Format DVD', 'Prerec Format Laserdisk', 'Prerec Format Other',
'Prerec Format VHS', 'Prerec Format Video CD',
'TV Signal Analog antennae', 'TV Signal Cable',
'TV Signal Digital Satellite', 'TV Signal Don\'t watch TV'
]
inputData = dummiesData[dummiesSelect]
# 1-D target; a one-column DataFrame also works but makes scikit-learn emit a DataConversionWarning
outputData = dummiesData['Home Ownership Rent']
-
Importing the modeling package and fitting the model
from sklearn import linear_model
lrModel = linear_model.LogisticRegression()
lrModel.fit(inputData, outputData)
# Mean accuracy on the training data itself
lrModel.score(inputData, outputData)
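The score above is computed on the same rows the model was trained on, so it tends to be optimistic. A minimal sketch of a held-out check, assuming synthetic data from make_classification in place of the real matrix (the sizes and random_state are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for inputData / outputData
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # accuracy on rows the model has seen
test_acc = model.score(X_test, y_test)      # accuracy on rows it has not seen
print(train_acc, test_acc)
```

The same split could be applied to inputData and outputData before calling fit.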
-
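Making predictions
The new data has to go through the same preprocessing as above before it can be used for prediction.
newData = pd.read_csv('C:\\PDM\\newData.csv')
newData = newData.dropna()
for column in dummyColumns:
    # Reuse the training categories so the dummy columns match the ones the model was fitted on
    newData[column] = newData[column].astype(pd.CategoricalDtype(categories=data[column].cat.categories))
dummiesNewData = pd.get_dummies(newData, columns=dummyColumns, prefix=dummyColumns, prefix_sep=" ", drop_first=True)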
# Add the mapped columns to dummiesNewData (not newData), because that is the frame selected from below
dummiesNewData['Education Level Map'] = dummiesNewData['Education Level'].map(educationLevelDict)
dummiesNewData['PPV Freq Map'] = dummiesNewData['PPV Freq'].map(freqMap)
dummiesNewData['Theater Freq Map'] = dummiesNewData['Theater Freq'].map(freqMap)
dummiesNewData['TV Movie Freq Map'] = dummiesNewData['TV Movie Freq'].map(freqMap)
dummiesNewData['Prerec Buying Freq Map'] = dummiesNewData['Prerec Buying Freq'].map(freqMap)
dummiesNewData['Prerec Renting Freq Map'] = dummiesNewData['Prerec Renting Freq'].map(freqMap)
dummiesNewData['Prerec Viewing Freq Map'] = dummiesNewData['Prerec Viewing Freq'].map(freqMap)
Build the new matrix of purely numeric data
inputNewData = dummiesNewData[dummiesSelect]
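The reason for reusing the training categories is that a new file rarely contains every label. Without them, get_dummies would produce fewer columns and the selection by dummiesSelect would fail. A sketch on made-up frames (the values are hypothetical):

```python
import pandas as pd

# Hypothetical training and new data; the new data only contains 'Cable'
train = pd.DataFrame({'TV Signal': ['Cable', 'Analog antennae', 'Digital Satellite']})
new = pd.DataFrame({'TV Signal': ['Cable', 'Cable']})

train['TV Signal'] = train['TV Signal'].astype('category')
# Give the new column exactly the categories seen during training
shared = pd.CategoricalDtype(categories=train['TV Signal'].cat.categories)
new['TV Signal'] = new['TV Signal'].astype(shared)

train_enc = pd.get_dummies(train, columns=['TV Signal'], prefix_sep=' ', drop_first=True)
new_enc = pd.get_dummies(new, columns=['TV Signal'], prefix_sep=' ', drop_first=True)
print(list(new_enc.columns) == list(train_enc.columns))   # True
```

Labels that appear only in the new data become NaN under the shared dtype, so they are worth checking for before predicting.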
Predict
lrModel.predict(inputNewData)
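predict returns hard class labels. When the estimated probability behind each label is of interest, predict_proba can be used instead. A sketch on made-up one-feature data (not the real data set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data, only to show the two calls side by side
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
X_new = np.array([[0.5], [4.5]])

labels = model.predict(X_new)        # hard class labels
probs = model.predict_proba(X_new)   # one row per sample, one column per class
print(labels)
print(probs.round(2))
```

The columns of predict_proba follow the order of model.classes_, and each row sums to 1.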