Julia MLJ Logistic Regression, Machine Learning, Gradient Descent, Tuning, kagg

Author: 二方亨 | Published 2020-09-28 20:58
                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.    
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.5.0-rc1.0 (2020-06-26)
     _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release  
    |__/                   |
    
    (Screenshot: dataset field overview)

    This is a standard classification problem: given various attributes of known users, predict whether a user will buy vehicle insurance (Response). The dataset can be downloaded from Kaggle.

    1. Load the data
    using Queryverse, MLJ, StatsKit, PrettyPrinting, LossFunctions, Plots
    train_data = Queryverse.load("D:\\data\\archive\\train.csv") |> DataFrame
    test_data = Queryverse.load("D:\\data\\archive\\test.csv") |> DataFrame
    

    "|>" 是Julia的管道函数, 等效于R的"%>%". 作用是将上一个结果作为下一个函数的参数传入. 在上述语句中:是将读取的数据转换为DataFrame类型

    2. Inspect the data's scientific types (scitypes)
    train_data |> MLJ.schema
    

    The schema shows two kinds of type:
    1. types (machine types)
    2. scitypes (scientific types)
    Machine types are straightforward: as in R, Python, or SQL, they describe how the data is stored. Scientific types are defined by MLJ to tell models how the data should be interpreted; different models accept different scientific types, so take care to match them. The documentation covers this in detail.

    3. View summary statistics of the training set
    train_data |> describe |> print
    
    │ Row │ variable             │ mean     │ min      │ median   │ max       │ nunique │ nmissing │ eltype   │
    │     │ Symbol               │ Union…   │ Any      │ Union…   │ Any       │ Union…  │ Nothing  │ DataType │
    ├─────┼──────────────────────┼──────────┼──────────┼──────────┼───────────┼─────────┼──────────┼──────────┤
    │ 1   │ id                   │ 190555.0 │ 1        │ 190555.0 │ 381109    │         │          │ Int64    │
    │ 2   │ Gender               │          │ Female   │          │ Male      │ 2       │          │ String   │
    │ 3   │ Age                  │ 38.8226  │ 20       │ 36.0     │ 85        │         │          │ Int64    │
    │ 4   │ Driving_License      │ 0.997869 │ 0        │ 1.0      │ 1         │         │          │ Int64    │
    │ 5   │ Region_Code          │ 26.3888  │ 0.0      │ 28.0     │ 52.0      │         │          │ Float64  │
    │ 6   │ Previously_Insured   │ 0.45821  │ 0        │ 0.0      │ 1         │         │          │ Int64    │
    │ 7   │ Vehicle_Age          │          │ 1-2 Year │          │ > 2 Years │ 3       │          │ String   │
    │ 8   │ Vehicle_Damage       │          │ No       │          │ Yes       │ 2       │          │ String   │
    │ 9   │ Annual_Premium       │ 30564.4  │ 2630.0   │ 31669.0  │ 540165.0  │         │          │ Float64  │
    │ 10  │ Policy_Sales_Channel │ 112.034  │ 1.0      │ 133.0    │ 163.0     │         │          │ Float64  │
    │ 11  │ Vintage              │ 154.347  │ 10       │ 154.0    │ 299       │         │          │ Int64    │
    │ 12  │ Response             │ 0.122563 │ 0        │ 0.0      │ 1         │         │          │ Int64    │
    

    id: carries no information for the model and should be dropped.
    Gender, Driving_License, Region_Code, Previously_Insured, Vehicle_Age, Vehicle_Damage, and Response: categorical variables; the predictors among them will be one-hot encoded.
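    A hypothetical Base-Julia sketch of what one-hot encoding does (the article itself delegates this to MLJ's ContinuousEncoder below); `onehot` here is an illustrative helper, not an MLJ function:

```julia
# One-hot encode a string column: one 0/1 indicator vector per level.
# drop_last = true drops one level to avoid perfect collinearity,
# mirroring ContinuousEncoder's drop_last option.
function onehot(col::Vector{String}; drop_last::Bool = false)
    levels = sort(unique(col))
    drop_last && (levels = levels[1:end-1])
    Dict(lv => Float64.(col .== lv) for lv in levels)
end

enc = onehot(["Male", "Female", "Male"]; drop_last = true)
# only the "Female" indicator remains: [0.0, 1.0, 0.0]
```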

    4. Check whether the classes are balanced
    train_data.Response |> StatsKit.countmap
    

    The classes are imbalanced; we will handle this inside the model later. (Alternatively, the training set could be undersampled.)
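    For intuition, a small Base-Julia sketch of both ideas: tallying class counts by hand (what `countmap` returns) and undersampling the majority class. Both helpers are illustrative, not library functions:

```julia
using Random  # stdlib, for shuffle

# Count occurrences of each class (what StatsKit.countmap computes)
tally(y) = Dict(c => count(==(c), y) for c in unique(y))

# Undersample: keep only `m` randomly chosen rows of each class,
# where m is the size of the smallest class
function undersample(y)
    counts = tally(y)
    m = minimum(values(counts))
    idx = Int[]
    for c in keys(counts)
        append!(idx, shuffle(findall(==(c), y))[1:m])
    end
    sort(idx)
end

y   = [0, 0, 0, 0, 1]   # imbalanced toy labels
idx = undersample(y)    # one index per class after balancing
```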

    5. Drop the id column from the training set
    train_data = train_data[:, Not(:id)]
    
    6. Unpack: split the data into predictors and target
    y, X = unpack(train_data, ==(:Response), colname -> true)
    
    7. Automatically coerce the predictors to scientific types the model accepts
    X = coerce(X, autotype(X)) # automatically coerce the training set's scitypes to learner-supported types
    
    (Screenshot: the coerced column scitypes)

    The predictors were coerced into three scientific types: unordered categorical, ordered factor, and continuous.

    8. Encode as continuous values
    X = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), X)), X)
    
    9. Standardize
    X = MLJ.transform(fit!(machine(Standardizer(), X)), X)
    

    To make gradient descent converge faster, standardize the data to standard deviation 1 and mean 0.
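    The per-column transformation Standardizer applies can be sketched with only the Statistics stdlib:

```julia
using Statistics  # stdlib

# Standardize a column: subtract the mean, divide by the standard deviation
zscore(x) = (x .- mean(x)) ./ std(x)

z = zscore([2.0, 4.0, 6.0])  # -> [-1.0, 0.0, 1.0]
```

    After the transform the column has mean 0 and (sample) standard deviation 1, so no feature dominates the gradient updates purely because of its scale.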

    10. Coerce the target's scientific type to OrderedFactor
    y = coerce(y, OrderedFactor)
    
    11. Inspect the logistic regression learner's parameters
    info("LogisticClassifier", pkg = "ScikitLearn") |> pprint
    
    name = "LogisticClassifier",
     package_name = "ScikitLearn",
     is_supervised = true,
     docstring = "Logistic regression classifier.\n→ based on [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl).\n→ do `@load LogisticClassifier pkg=\"ScikitLearn\"` to use the model.\n→ do `?LogisticClassifier` for documentation.",  
     hyperparameter_ranges = (nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing, nothing),
     hyperparameter_types = ("String", "Bool", "Float64", "Float64", "Bool", "Float64", "Any", "Any", "String", "Int64", "String", "Int64", "Bool", "Union{Nothing, Int64}", "Union{Nothing, Float64}"),
     hyperparameters = (:penalty, :dual, :tol, :C, :fit_intercept, :intercept_scaling, :class_weight, :random_state, :solver, :max_iter, :multi_class, :verbose, :warm_start, :n_jobs, :l1_ratio),
     implemented_methods = [:clean!, :fit, :fitted_params, :predict],
     is_pure_julia = false,
     is_wrapper = true,
     load_path = "MLJScikitLearnInterface.LogisticClassifier",
     package_license = "BSD",
     package_url = "https://github.com/cstjean/ScikitLearn.jl",
     package_uuid = "3646fa90-6ef7-5e7e-9f22-8aca16db6324",
     prediction_type = :probabilistic,
     supports_online = false,
     supports_weights = false,
     input_scitype = Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous),
     target_scitype = AbstractArray{_s267,1} where _s267<:Finite,
     output_scitype = Unknown)
    
    12. Load the model
    @load LogisticClassifier pkg="ScikitLearn"
    
    lc = LogisticClassifier(class_weight = "balanced",  # classes are imbalanced, so let the model compute weights automatically
                            solver = "sag") # optimizer: "sag" (stochastic average gradient)
    
    13. Train the model
    r = range(lc, :max_iter, lower = 100, upper = 500) # range of max_iter values to search
    
    tm = TunedModel(model = lc,
                    tuning = Grid(), # search strategy over the parameter range
                    resampling = CV(rng = 11, nfolds = 10),
                    range = [r], # parameter range(s)
                    measure = area_under_curve # metric for selecting the best model: area under the ROC curve
                    )
    
    mtm = machine(tm, X, y)  # construct the machine (learner)
    
    fit!(mtm) # fit the tuned model
    
    
    [ Info: Training Machine{ProbabilisticTunedModel{Grid,…}} @931.
    [ Info: Attempting to evaluate 10 models.
    Evaluating over 10 metamodels: 100%[=========================] Time: 0:07:00
    

    14. Visualize the tuning results

    res = report(mtm).plotting
    scatter(res.parameter_values[:,1],
            res.measurements)
    
    best_model = fitted_params(mtm).best_model # inspect the best hyperparameters
    

    AUC (area under the ROC curve) is maximized at max_iter = 278.
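    For intuition about the metric being maximized, here is a tiny Base-Julia sketch of AUC via its rank interpretation (illustrative only; MLJ's `area_under_curve` computes this from the ROC curve):

```julia
# AUC = probability that a randomly chosen positive is scored above a
# randomly chosen negative, with ties counting 1/2.
function auc(scores, labels)
    pos = scores[labels .== 1]
    neg = scores[labels .== 0]
    total = sum(p > n ? 1.0 : (p == n ? 0.5 : 0.0) for p in pos, n in neg)
    total / (length(pos) * length(neg))
end

auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # perfect ranking -> 1.0
```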

    15. Apply the same transformations to the test set

    test_data |> describe |> pprint
    id = test_data[:, :id]
    test_data = select(test_data, Not(:id))
    
    test_data = coerce(test_data, autotype(test_data)) # automatic scitype coercion
    test_data = MLJ.transform(fit!(machine(ContinuousEncoder(drop_last = true), test_data)), test_data) # continuous encoding
    test_data = MLJ.transform(fit!(machine(Standardizer(), test_data)), test_data) # standardization (note: this refits on the test set; strictly, the machines fit on the training set should be reused)
    
    16. Predict with the trained model
    result = predict_mode(mtm, test_data)
    
    17. Check the distribution of predictions
    result |> countmap
    
    18. Combine id and the predictions into a DataFrame
    result_data = DataFrame(id = id, Response = result)
    


Original link: https://www.haomeiwen.com/subject/adrjuktx.html