Julia Version 1.5.0-rc1.0 (2020-06-26), official https://julialang.org/ release. Documentation: https://docs.julialang.org

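This is a standard classification problem: given a set of known attributes for each user, predict whether the user will buy vehicle insurance (Response). The dataset can be downloaded from Kaggle.
- Load the data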
using Queryverse, MLJ, StatsKit, PrettyPrinting, LossFunctions, Plots
train_data = Queryverse.load("D:\\data\\archive\\train.csv") |> DataFrame
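test_data = Queryverse.load("D:\\data\\archive\\test.csv") |> DataFrame
"|>" is Julia's pipe operator, equivalent to R's "%>%". It passes the previous result as the argument to the next function. In the statements above, it converts the loaded data into a DataFrame.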
- Inspect the scientific types (scitypes) of the data
train_data |> MLJ.schema

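The call returns two kinds of types:
1. types (machine types)
2. scitypes (scientific types)
Machine types are straightforward: as in R, Python, and SQL, they describe how the data is stored. Scientific types are defined by MLJ to describe how a model should interpret the data. Different models accept different scitypes, so check compatibility before use.
See the MLJ documentation for details.
- View summary statistics for the training set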
train_data |> describe |> print
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────────────────┼──────────┼──────────┼──────────┼───────────┼─────────┼──────────┼──────────┤
│ 1 │ id │ 190555.0 │ 1 │ 190555.0 │ 381109 │ │ │ Int64 │
│ 2 │ Gender │ │ Female │ │ Male │ 2 │ │ String │
│ 3 │ Age │ 38.8226 │ 20 │ 36.0 │ 85 │ │ │ Int64 │
│ 4 │ Driving_License │ 0.997869 │ 0 │ 1.0 │ 1 │ │ │ Int64 │
│ 5 │ Region_Code │ 26.3888 │ 0.0 │ 28.0 │ 52.0 │ │ │ Float64 │
│ 6 │ Previously_Insured │ 0.45821 │ 0 │ 0.0 │ 1 │ │ │ Int64 │
│ 7 │ Vehicle_Age │ │ 1-2 Year │ │ > 2 Years │ 3 │ │ String │
│ 8 │ Vehicle_Damage │ │ No │ │ Yes │ 2 │ │ String │
│ 9 │ Annual_Premium │ 30564.4 │ 2630.0 │ 31669.0 │ 540165.0 │ │ │ Float64 │
│ 10 │ Policy_Sales_Channel │ 112.034 │ 1.0 │ 133.0 │ 163.0 │ │ │ Float64 │
│ 11 │ Vintage │ 154.347 │ 10 │ 154.0 │ 299 │ │ │ Int64 │
│ 12 │ Response │ 0.122563 │ 0 │ 0.0 │ 1 │ │ │ Int64 │
id: does not help model training and should be dropped
Gender, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, and Response: categorical variables, to be one-hot encoded
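As a rough illustration of what one-hot encoding does to a categorical column, here is a minimal sketch in plain Julia. The helper `onehot` is hypothetical and is not part of MLJ; its `drop_last` keyword only mimics the idea behind the `drop_last = true` option used with `ContinuousEncoder` later in this post, and the sample values are made up.

```julia
# Hypothetical helper (not an MLJ function): build one 0/1 indicator
# column per level of a categorical column.
function onehot(column::AbstractVector; drop_last::Bool = false)
    levels = sort(unique(column))
    # Dropping the last level avoids a redundant column: it is implied
    # whenever all the other indicators are zero.
    if drop_last
        levels = levels[1:end-1]
    end
    return Dict(level => Float64.(column .== level) for level in levels)
end

vehicle_damage = ["Yes", "No", "No", "Yes"]   # made-up sample values
full    = onehot(vehicle_damage)
reduced = onehot(vehicle_damage, drop_last = true)

println(full)
println(reduced)
```

With two levels, dropping the last one leaves a single indicator column, which carries the same information as the pair.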
- Check whether the positive and negative classes are balanced
train_data.Response |> StatsKit.countmap

The classes are imbalanced. This is handled later in the model. (Undersampling the training set is another option.)
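The undersampling alternative mentioned above is not used in this post, but it could look roughly like the sketch below, applied to the training labels. The helper `undersample_indices` is hypothetical, and the labels are made up for illustration.

```julia
using Random

# Hypothetical helper: for every class, keep a random subset of rows whose
# size equals the size of the smallest class. Returns the kept row indices.
function undersample_indices(labels::AbstractVector; rng = Random.default_rng())
    counts = Dict{eltype(labels), Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    n_keep = minimum(values(counts))            # size of the smallest class
    kept = Int[]
    for label in keys(counts)
        idx = findall(==(label), labels)        # rows belonging to this class
        append!(kept, shuffle(rng, idx)[1:n_keep])
    end
    return sort(kept)
end

labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]         # made-up imbalanced labels
idx = undersample_indices(labels; rng = MersenneTwister(11))
balanced = labels[idx]

println(idx)
println(balanced)
```

The returned indices would then be used to subset both the predictors and the target, so the rows stay aligned.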
- Drop the id variable from the training set
train_data = train_data[:, Not(:id)]
- Unpack: split the data into predictors and the target
y, X = unpack(train_data, ==(:Response), colname -> true)

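- First, use automatic scitype conversion to coerce the predictors into scitypes the model accepts
X = coerce(X, autotype(X)) # automatically coerce the training set's scitypes to ones the learner supports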

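The predictors were converted into three scitypes: Multiclass (unordered categorical), OrderedFactor (ordered factor), and Continuous (continuous numeric)
- Convert to continuous values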
X = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), X)), X)
- Standardize
X = MLJ.transform(fit!(machine(Standardizer(), X)), X)

To make gradient descent more efficient, the data is standardized to mean 0 and standard deviation 1
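The standardization step amounts to a z-score per column. The following is a minimal sketch in plain Julia with a hypothetical `standardize` helper and made-up values; it uses Julia's default `std` (the sample standard deviation) and is meant to show the arithmetic, not how MLJ's `Standardizer` is implemented.

```julia
using Statistics

# Hypothetical helper: subtract the column mean, then divide by the
# column's sample standard deviation.
function standardize(column::AbstractVector{<:Real})
    μ, σ = mean(column), std(column)
    return (column .- μ) ./ σ
end

annual_premium = [2630.0, 31669.0, 30564.0, 54016.0]   # made-up values
z = standardize(annual_premium)

println(z)
println((mean(z), std(z)))
```

After the transformation every column is on the same scale, so no single feature dominates the gradient steps.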
- Coerce the target's scitype to OrderedFactor
y = coerce(y, OrderedFactor)
- Inspect the logistic regression learner's parameters
info("LogisticClassifier", pkg = "ScikitLearn") |> pprint
[ Info: Training Machine{ContinuousEncoder} @192.
name = "LogisticClassifier",
package_name = "ScikitLearn",
is_supervised = true,
docstring = "Logistic regression classifier.\n→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).\n→ do `@load LogisticClassifier pkg=\"ScikitLearn\"` to use the model.\n→ do `?LogisticClassifier` for documentation.",
hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing),
hyperparameter_types = ("String", "Bool", "Float64", "Float64", "Bool", "Float64", "Any", "Any", "String", "Int64", "String", "Int64", "Bool", "Union{Nothing, Int64}", "Union{Nothing, Float64}"),
hyperparameters = (:penalty, :dual, :tol, :C, :fit_intercept, :intercept_scaling, :class_weight, :random_state, :solver, :max_iter, :multi_class, :verbose, :warm_start, :n_jobs, :l1_ratio),
implemented_methods = [:clean!, :fit, :fitted_params, :predict],
is_pure_julia = false,
is_wrapper = true,
load_path = "MLJScikitLearnInterface.LogisticClassifier",
package_license = "BSD",
package_url = "https://github.com/cstjean/ScikitLearn.jl",
package_uuid = "3646fa90-6ef7-5e7e-9f22-8aca16db6324",
prediction_type = :probabilistic,
supports_online = false,
supports_weights = false,
input_scitype = Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous),
target_scitype = AbstractArray{_s267,1} where _s267<:Finite,
output_scitype = Unknown)
- Load the model
@load LogisticClassifier pkg="ScikitLearn"
lc = LogisticClassifier(class_weight = "balanced", # the classes are imbalanced, so let the model compute class weights automatically
solver = "sag") # optimizer: stochastic average gradient (SAG)
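For intuition about `class_weight = "balanced"`: scikit-learn documents this heuristic as `n_samples / (n_classes * count_of_class)`, so rarer classes get larger weights. The sketch below reproduces that arithmetic in plain Julia with a hypothetical `balanced_weights` helper and made-up labels; it does not call scikit-learn.

```julia
# Hypothetical helper: weight for each class = n_samples / (n_classes * class_count).
function balanced_weights(labels::AbstractVector)
    counts = Dict{eltype(labels), Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    n_samples, n_classes = length(labels), length(counts)
    return Dict(label => n_samples / (n_classes * c) for (label, c) in counts)
end

labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # made-up labels, 8 negatives and 2 positives
w = balanced_weights(labels)

println(w)
```

In this toy example the minority class is weighted four times as heavily as the majority class, mirroring the 8:2 class ratio.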
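- Train the model
r = range(lc, :max_iter, lower = 100, upper = 500) # range of max_iter (maximum solver iterations) to search
tm = TunedModel(model = lc,
tuning = Grid(), # search strategy over the parameter range
resampling = CV(rng = 11, nfolds = 10),
range = [r], # parameter range
measure = area_under_curve # metric used to pick the best model: area under the ROC curve
)
mtm = machine(tm, X, y) # construct the machine, which binds the model to the data
fit!(mtm) # fit the tuned model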
[ Info: Training Machine{ProbabilisticTunedModel{Grid,…}} @931.
[ Info: Attempting to evaluate 10 models.
Evaluating over 10 metamodels: 100%[=========================] Time: 0:07:00
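The log reports 10 candidate models, which is consistent with a 10-point grid over the `max_iter` range of 100 to 500. Assuming the grid points are evenly spaced and rounded to integers (an assumption about how the grid is built, not something confirmed by the output above), the candidates can be reproduced in plain Julia:

```julia
# Assumption: 10 evenly spaced points between the bounds, rounded to integers.
lower, upper, resolution = 100, 500, 10
grid = round.(Int, range(lower, upper, length = resolution))

println(grid)
```

Under that assumption, 278 is one of the candidate values.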
- Visualize the tuning results
res = report(mtm).plotting
scatter(res.parameter_values[:,1],
res.measurements)

best_model = fitted_params(mtm).best_model # inspect the best model's hyperparameters

AUC (area under the ROC curve) is highest at max_iter = 278
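To make the metric concrete: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Julia. The `auc` helper is hypothetical and the scores are made up; it is a quadratic-time illustration, not the implementation behind `area_under_curve`.

```julia
# Hypothetical helper: pairwise AUC. Compares every positive score with
# every negative score (O(n_pos * n_neg)), counting ties as half a win.
function auc(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool})
    pos = scores[labels]
    neg = scores[.!labels]
    wins = 0.0
    for p in pos, n in neg
        wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return wins / (length(pos) * length(neg))
end

scores = [0.9, 0.8, 0.35, 0.6, 0.1]          # made-up predicted probabilities
labels = [true, false, true, true, false]    # made-up ground truth
value = auc(scores, labels)

println(value)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric suits imbalanced data better than raw accuracy.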
- Process the prediction set with the same transformations
test_data |> describe |> pprint
id = test_data[:, :id]
test_data = select(test_data, Not(:id))
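test_data = coerce(test_data, autotype(test_data)) # automatic scitype coercion
# Note: the encoder and standardizer below are refit on the test set; reusing the machines fitted on the training data would apply identical parameters
test_data = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), test_data)), test_data) # convert to continuous scitypes
test_data = MLJ.transform(fit!(machine(Standardizer(), test_data)), test_data) # standardize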

- Predict with the trained model
result = predict_mode(mtm, test_data)

- View the distribution of the predictions
result |> countmap

- Combine id and the predictions into a DataFrame
result_data = DataFrame(id = id, Response = result)
