Inspecting the coefficients of the individual features of a trained model helps with feature selection. Below, different methods are used to obtain the coefficients for different kinds of features.
# The model and data follow https://www.jianshu.com/p/20456b512fa7
# Assume the data has already been preprocessed, so we go straight to training and prediction
from itertools import chain
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, CountVectorizer, VectorAssembler
import pyspark.ml.classification as cl
# The raw data looks like this
births.show(3)
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINE_AGE|CIG_BEFORE|CIG_1_TRI|CIG_2_TRI|CIG_3_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|MOTHER_DELIVERY_WEIGHT|MOTHER_WEIGHT_GAIN|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
| 0| 1| 29| 99| 0| 0| 0| 0| 99| 999| 999| 99| 0| 0| 0| 0| 0|
| 0| 1| 22| 29| 0| 0| 0| 0| 65| 180| 198| 18| 0| 0| 0| 0| 0|
| 0| 1| 38| 40| 0| 0| 0| 0| 63| 155| 167| 12| 0| 0| 0| 0| 0|
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
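The encoder, featuresCreator, logistic, birth_train and birth_test objects used below come from the linked article; the following is only a minimal sketch of how they might be set up (the column choices follow the data shown above, while the hyperparameters and the split are illustrative assumptions):
# Assumed setup: one-hot encode BIRTH_PLACE (assumed to already be numeric),
# assemble everything else as-is, and use logistic regression as the classifier
encoder = OneHotEncoder(inputCol='BIRTH_PLACE', outputCol='BIRTH_PLACE_VEC')
numeric_cols = [c for c in births.columns
                if c not in ('INFANT_ALIVE_AT_REPORT', 'BIRTH_PLACE')]
featuresCreator = VectorAssembler(inputCols=[encoder.getOutputCol()] + numeric_cols,
                                  outputCol='features')
logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol='features',
                                 labelCol='INFANT_ALIVE_AT_REPORT')
birth_train, birth_test = births.randomSplit([0.7, 0.3], seed=666)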
# Note: before training, VectorAssembler must be used to assemble all features into a single column
pipeline = Pipeline(stages=[encoder, featuresCreator, logistic])
model = pipeline.fit(birth_train)
test_res = model.transform(birth_test)
lrm = model.stages[-1]
# Get the index and name of each feature from the metadata of the features column
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*test_res
                      .schema[lrm.summary.featuresCol]
                      .metadata["ml_attr"]["attrs"]
                      .values()))
print(attrs)
# Output
[(0, 'BIRTH_PLACE_VEC_0'),
(1, 'BIRTH_PLACE_VEC_1'),
(2, 'BIRTH_PLACE_VEC_2'),
(3, 'BIRTH_PLACE_VEC_3'),
(4, 'BIRTH_PLACE_VEC_4'),
(5, 'BIRTH_PLACE_VEC_5'),
(6, 'BIRTH_PLACE_VEC_6'),
(7, 'BIRTH_PLACE_VEC_7'),
(8, 'BIRTH_PLACE_VEC_8'),
(9, 'MOTHER_AGE_YEARS'),
(10, 'FATHER_COMBINE_AGE'),
(11, 'CIG_BEFORE'),
(12, 'CIG_1_TRI'),
(13, 'CIG_2_TRI'),
(14, 'CIG_3_TRI'),
(15, 'MOTHER_HEIGHT_IN'),
(16, 'MOTHER_PRE_WEIGHT'),
(17, 'MOTHER_DELIVERY_WEIGHT'),
(18, 'MOTHER_WEIGHT_GAIN'),
(19, 'DIABETES_PRE'),
(20, 'DIABETES_GEST'),
(21, 'HYP_TENS_PRE'),
(22, 'HYP_TENS_GEST'),
(23, 'PREV_BIRTH_PRETERM')]
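The chain(*...values()) above is needed because the ml_attr metadata groups the attributes of the features vector by type, with the one-hot slots in one group and the plain numeric columns in another. Roughly, the structure being flattened looks like this:
# Illustrative only: one list of {'idx', 'name'} dicts per attribute group
meta = test_res.schema[lrm.summary.featuresCol].metadata["ml_attr"]["attrs"]
print(meta.keys())         # e.g. dict_keys(['binary', 'numeric'])
print(meta["numeric"][0])  # e.g. {'idx': 9, 'name': 'MOTHER_AGE_YEARS'}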
# Pair each feature name with its coefficient
feats_coef = [(name, lrm.coefficients[idx]) for idx, name in attrs]
print(feats_coef)
[('BIRTH_PLACE_VEC_0', 0.0),
('BIRTH_PLACE_VEC_1', 0.594420849821937),
('BIRTH_PLACE_VEC_2', 2.4075589670913335),
('BIRTH_PLACE_VEC_3', 1.7823125440410161),
('BIRTH_PLACE_VEC_4', -1.6531133349571725),
('BIRTH_PLACE_VEC_5', -0.5495784312261248),
('BIRTH_PLACE_VEC_6', -1.7332912701009395),
('BIRTH_PLACE_VEC_7', 0.039713396666346504),
('BIRTH_PLACE_VEC_8', 0.0),
('MOTHER_AGE_YEARS', 0.00576202997456978),
('FATHER_COMBINE_AGE', -0.01461223060174637),
('CIG_BEFORE', 0.011062646656450726),
('CIG_1_TRI', 0.0080557042396814),
('CIG_2_TRI', 0.004632194351793351),
('CIG_3_TRI', 0.021007970934441053),
('MOTHER_HEIGHT_IN', -0.0010835415347563793),
('MOTHER_PRE_WEIGHT', -0.002190453970910452),
('MOTHER_DELIVERY_WEIGHT', -0.0011442841260634116),
('MOTHER_WEIGHT_GAIN', 0.02308236363565165),
('DIABETES_PRE', -0.9841689991671982),
('DIABETES_GEST', 0.7913093211204729),
('HYP_TENS_PRE', -0.2552870610582304),
('HYP_TENS_GEST', 0.26936315771969194),
('PREV_BIRTH_PRETERM', -1.2085697819317305)]
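For feature screening, a simple follow-up (a sketch using the feats_coef list just built) is to rank the features by the absolute value of their coefficients; with regularization and unscaled inputs this is only a rough signal, but it makes the most and least influential features easy to spot:
# Rank features by |coefficient| as a rough indication of influence
ranked = sorted(feats_coef, key=lambda fc: abs(fc[1]), reverse=True)
for name, coef in ranked[:5]:
    print(name, coef)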
Using the method above, you can inspect the coefficient of each feature of a model and use that for feature selection. Note, however, that the summary attribute is currently only available for binary classification. Also, the features above are mostly numeric, whereas in practice some features are extracted from text and first have to be turned into term vectors with CountVectorizer. In that case, the following approach can be used to obtain the coefficient of each term:
# The data looks like this: channel is the label, os and name are the features
df.show(4, truncate=False)
+-------+-------------------------------------------------------------------------------------+-------+
|os |name |channel|
+-------+-------------------------------------------------------------------------------------+-------+
|iOS |-中国X档案:驯火奇人.mp4-娱乐-高清正版视频在线观看–爱奇艺 |综艺 |
|android|0001.土豆网-锡剧新版全本《珍珠塔》--周东亮董云华许美-综艺-高清正版视频在线观看–爱奇艺|综艺 |
|iOS |0051彝族丽江打跳 (16)_baofeng-娱乐-高清正版视频在线观看–爱奇艺 |综艺 |
|iOS |10岁男孩从军 没想到竟是个神枪狙击手 男子看傻了-电视剧-高清正版视频在线观看–爱奇艺 |电视剧 |
+-------+-------------------------------------------------------------------------------------+-------+
### Approach 1: put all the terms into one column, then vectorize them with CountVectorizer
def text2terms(sentence):
    '''Extract keywords with TextRank, falling back to TF-IDF.
    '''
    import jieba.analyse
    terms = jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
    if not terms:  # if TextRank returns nothing, fall back to TF-IDF keyword extraction
        terms = jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    # drop pure numbers and Chinese/Roman numerals; filtering into a new list avoids
    # the skipped-element bug of calling terms.remove() while iterating over terms
    numerals = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十',
                'Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ', 'Ⅷ', 'Ⅸ']
    return [t for t in terms if not (t.isnumeric() or t in numerals)]
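A quick sanity check of the helper on the first title (the exact terms depend on the jieba version and its dictionaries; the ones listed here are taken from the transformed rows shown further below):
# Example call; the output will vary with jieba's models
print(text2terms('-中国X档案:驯火奇人.mp4-娱乐-高清正版视频在线观看–爱奇艺'))
# e.g. ['视频', '正版', '驯火', '档案', '娱乐', '奇人', '观看', '中国']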
def get_features(row):
    features = []
    features += [row.os]
    terms = text2terms(row.name)
    features += terms
    # binary label: 0 for '电视剧' (TV series), 1 for everything else
    if row.channel == '电视剧':
        label = 0
    else:
        label = 1
    return row.channel, label, row.name, features
df1 = df.rdd.map(lambda row: get_features(row)).toDF(['channel', 'label', 'name', 'terms'])
df1.show(2, truncate=False)
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
|channel|label|name |terms |
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
|综艺 |1 |-中国X档案:驯火奇人.mp4-娱乐-高清正版视频在线观看–爱奇艺 |[iOS, 视频, 正版, 驯火, 档案, 娱乐, 奇人, 观看, 中国]|
|综艺 |1 |0001.土豆网-锡剧新版全本《珍珠塔》--周东亮董云华许美-综艺-高清正版视频在线观看–爱奇艺|[android, 视频, 正版, 锡剧, 珍珠, 全本, 综艺, 观看] |
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
### Fit the model
cv = CountVectorizer(inputCol='terms', outputCol='features')
cv_model = cv.fit(df1)
df1 = cv_model.transform(df1)
df2 = df1.select('label', 'features')
logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol='features',
                                 labelCol='label')
lr_model = logistic.fit(df2)
res = lr_model.transform(df2)
# Look at the vocabulary of the fitted CountVectorizer model; only the first 10 terms are shown here
cv_model.vocabulary[:10]
# Output
['视频', '观看', '正版', 'iOS', 'wp', 'android', '娱乐', '电视剧', '片花', '综艺']
# Coefficients of the first 10 terms
lr_model.coefficients[:10]
array([ 0.380, 0.357, 1.339, -0.182, -0.250, -0.587, 2.607, -2.313,
-0.966, 3.085])
# Pair them up
for i, j in zip(cv_model.vocabulary[:10], lr_model.coefficients[:10]):
    print(i, j)
视频 0.37996859807875527
观看 0.3567728962448092
正版 1.3386805525611496
iOS -0.18176377875140984
wp -0.2501881651132442
android -0.5865211142654886
娱乐 2.6067191688211433
电视剧 -2.3126880551420914
片花 -0.966167767504617
综艺 3.0854430292662474
Above we showed how to obtain the coefficient of each term in the term vector. The key is the vocabulary attribute of the CountVectorizerModel, which lists the terms of the vector so that each term can be matched with its coefficient by position. Note that the earlier summary-based trick for getting feature names does not work here: the returned feature names are empty, probably because all of the terms sit in a single vectorized column rather than in separately named input columns.
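To actually use this for screening terms, the full vocabulary can be zipped with the coefficient vector and sorted, for example to see which terms push most strongly toward either class (a sketch using the cv_model and lr_model fitted above):
# Positive coefficients push toward label 1 (channels other than '电视剧'), negative toward label 0
term_coefs = sorted(zip(cv_model.vocabulary, lr_model.coefficients.toArray()),
                    key=lambda x: x[1], reverse=True)
print(term_coefs[:5])   # strongest indicators of label 1
print(term_coefs[-5:])  # strongest indicators of label 0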
Next, we keep the os and name features in two separate columns, vectorize each column with CountVectorizer, assemble the results, and then train the model:
def get_features2(row):
    # same label encoding as before; os is wrapped in a list so CountVectorizer can consume it
    terms = text2terms(row.name)
    if row.channel == '电视剧':
        label = 0
    else:
        label = 1
    return row.channel, label, [row.os], terms
df4 = df.rdd.map(lambda x: get_features2(x)).toDF(['channel','label','os','terms'])
df4.show(2, truncate=False)
# Output
+-------+-----+---------+------------------------------------------------+
|channel|label|os |terms |
+-------+-----+---------+------------------------------------------------+
|综艺 |1 |[iOS] |[视频, 正版, 驯火, 档案, 娱乐, 奇人, 观看, 中国]|
|综艺 |1 |[android]|[视频, 正版, 锡剧, 珍珠, 全本, 综艺, 观看] |
+-------+-----+---------+------------------------------------------------+
### Vectorize the two columns, assemble them, and then fit the model
cv1 = CountVectorizer(inputCol='os',outputCol='os_vec')
cv_os = cv1.fit(df4)
df5 = cv_os.transform(df4)
cv2 = CountVectorizer(inputCol='terms', outputCol='terms_vec')
cv_term = cv2.fit(df5)
df6 = cv_term.transform(df5)
assembler = VectorAssembler(inputCols=['os_vec', 'terms_vec'], outputCol='features')
df7 = assembler.transform(df6)
df7.show(2)
# Output
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
|channel|label| os| terms| os_vec| terms_vec| features|
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
| 综艺| 1| [iOS]|[视频, 正版, 驯火, 档案, ...|(3,[0],[1.0])|(888,[0,1,2,3,9,1...|(891,[0,3,4,5,6,1...|
| 综艺| 1|[android]|[视频, 正版, 锡剧, 珍珠, ...|(3,[2],[1.0])|(888,[0,1,2,6,49,...|(891,[2,3,4,5,9,5...|
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol='features',
                                 labelCol='label')
lr2 = logistic.fit(df7)
res2 = lr2.transform(df7)
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*res2
                      .schema['features']
                      .metadata["ml_attr"]["attrs"]
                      .values()))
for i, j in zip(attrs[:10], lr2.coefficients[:10]):
    print(i, j)
# Output
(0, 'os_vec_0') -0.18176377875140926
(1, 'os_vec_1') -0.25018816511324415
(2, 'os_vec_2') -0.5865211142654884
(3, 'terms_vec_0') 0.37996859807875594
(4, 'terms_vec_1') 0.35677289624480935
(5, 'terms_vec_2') 1.33868055256115
(6, 'terms_vec_3') 2.606719168821142
(7, 'terms_vec_4') -2.312688055142091
(8, 'terms_vec_5') -0.9661677675046166
(9, 'terms_vec_6') 3.0854430292662482
Although it is not obvious at a glance, a quick comparison shows that these are the same coefficients as before: the three os_vec slots carry the coefficients previously paired with iOS, wp and android, and the terms_vec slots carry the term coefficients. The drawback is that the metadata only yields generic names such as os_vec_0 and terms_vec_0 rather than the actual categories and terms.
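That said, since VectorAssembler concatenates its input vectors in the order of inputCols, the actual names can still be recovered by concatenating the two vocabularies in that same order (a sketch using the cv_os and cv_term models fitted above):
# Slot i of the assembled 'features' vector corresponds to combined_vocab[i]:
# the 3 os categories first, then the terms vocabulary
combined_vocab = list(cv_os.vocabulary) + list(cv_term.vocabulary)
for name, coef in zip(combined_vocab[:10], lr2.coefficients[:10]):
    print(name, coef)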