Environment setup

yum -y install epel-release
yum -y install python-pip
pip install numpy pandas xgboost scikit-learn
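
After the install, it can help to confirm that every package imports and to note which versions were used, since benchmark timings depend on them. The following is a minimal sketch for Python 3; the helper name `report_versions` is made up for illustration and is not part of any of these libraries.

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string}, or None for modules that fail to import."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Imports can fail for reasons other than ImportError,
            # for example a missing shared library.
            versions[name] = None
            continue
        # Most packages expose __version__; fall back to "unknown" if they do not.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # scikit-learn is imported under the name "sklearn".
    found = report_versions(["numpy", "pandas", "xgboost", "sklearn"])
    for name, version in found.items():
        print(name, version if version is not None else "NOT INSTALLED")
```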

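Installation speed

Alibaba Cloud, Tencent Cloud > Baidu Cloud

On Alibaba Cloud and Tencent Cloud the whole installation finished in under a minute.
On Baidu Cloud it took 10 minutes, downloading at roughly 50 KB/s.

Huawei Cloud was by far the slowest: because it is an ARM server, almost every package had to be compiled from source, and the installation took over an hour.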

Test code

The code below uses XGBoost to train a classifier on a dataset and predict its classes.

Dataset: https://www.kaggle.com/c/forest-cover-type-prediction/data

import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
from xgboost import XGBClassifier
from sklearn import metrics
import random
import time
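# sklearn.externals.joblib was removed from newer scikit-learn releases;
# prefer the standalone joblib package and fall back for old installs.
try:
    import joblib
except ImportError:
    from sklearn.externals import joblib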
import os

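def xgboost_train(train, test):
    # Separate the feature columns from the target column.
    features = train[[col for col in train.columns if col not in ['Cover_Type']]]
    label = train['Cover_Type']
    # Note: recent XGBoost releases expect class labels numbered from 0;
    # if fit() rejects the 1..7 labels, shift them (label - 1) before training.
    xgbc = XGBClassifier(
        booster='gbtree',
        verbosity=0,  # replaces the deprecated silent=True
        n_jobs=12,  # adjust to the number of CPU cores (replaces the deprecated nthread)
        scale_pos_weight=8,  # binary-classification setting, kept from the original run
        n_estimators=120,
        max_depth=7,
        min_child_weight=1,
        subsample=1,
        colsample_bytree=1,
        gamma=0.2,
        learning_rate=0.2,
        max_delta_step=0,
        base_score=1,
        colsample_bylevel=1,
        objective='multi:softmax',
        # num_class is omitted: the scikit-learn wrapper infers it from the labels
        # (the original value of 2 did not match the 7 cover types).
        reg_alpha=2,
        reg_lambda=2
    )
    # fit() returns the estimator itself, so model and xgbc are the same object.
    model = xgbc.fit(features, label)
    X_test = test[[col for col in test.columns if col not in ['Cover_Type']]]
    y_test = test['Cover_Type']
    # score() returns the mean accuracy on the hold-out split.
    score = xgbc.score(X_test, y_test)
    predictions = model.predict(X_test)
    # Macro-averaged precision over the seven cover types.
    macro = metrics.precision_score(y_test, predictions, average='macro', labels=[1, 2, 3, 4, 5, 6, 7])
    return score, macro, model, xgbc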

def full_predict():
    # Random batch number so repeated runs do not overwrite each other's output.
    batchNo = random.randint(0, 9999)
    dfa = pd.read_csv('train1.csv')
    dfa = dfa.dropna()
    dfa = dfa.drop(['Ground_position'], axis=1)
    train, test = train_test_split(dfa, test_size=0.2)
    # Measure wall-clock seconds for training plus scoring.
    # time.clock() was removed in Python 3.8, so time.time() is used instead.
    start = time.time()
    scores = xgboost_train(train, test)
    end = time.time()
    payForRun = end - start
    # scores[0] is the accuracy from XGBClassifier.score(), scores[1] the macro precision.
    ret = "Accuracy: " + str(scores[0]) + "; Macro precision: " + str(scores[1]) + ", elapsed (s): " + str(payForRun)
    print(ret)
    with open('xgboost_train_' + str(batchNo) + '.txt', 'w') as f:
        f.write(ret + '\n')
    model = scores[2]
    xgbc = scores[3]
    # Predict on the separate test file and write a submission CSV.
    t_test = pd.read_csv('test1.csv')
    t_test = t_test.drop(['Ground_position'], axis=1)
    X_test2 = t_test[[col for col in t_test.columns if col not in ['Cover_Type']]]
    predictions = model.predict(X_test2)
    submission = pd.DataFrame({'Cover_Type': predictions})
    submission.to_csv("full_predict_" + str(batchNo) + ".csv", index=False)
    # Persist the trained model.
    joblib.dump(xgbc, "./train_model_" + str(batchNo) + ".m")

if __name__ == '__main__':
    full_predict()
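
The thread count in the script is hard-coded and has to be edited for each machine. One possible alternative, sketched here with a hypothetical helper `pick_thread_count` (not part of XGBoost), is to derive it from the number of logical CPUs the OS reports.

```python
import os


def pick_thread_count(requested=None):
    """Clamp a requested thread count to the logical CPUs the OS reports."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    # Never go below 1 or above what the machine offers.
    return max(1, min(requested, available))


if __name__ == "__main__":
    print("threads to use:", pick_thread_count())
```

Note that `os.cpu_count()` counts logical CPUs, so a hyper-threaded machine reports more than its physical core count.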

Baidu Cloud

Instance type: Standard II (普通II型)
CPU: E5-2680 v4, 16 cores
Memory: 32 GB
OS: CentOS 7.3
Cores used: 14

Result: 28s

Tencent Cloud

Instance type: S2.3XLARGE24
CPU: Intel Xeon E5-2680 v4 (2.4 GHz), 12 cores
Memory: 24 GB
OS: CentOS 7.3
Cores used: 12

Result: 20s

Alibaba Cloud

Instance type: ecs.ic5.3xlarge
CPU: Platinum 8163, 12 cores
Memory: 12 GB
OS: CentOS 7
Cores used: 12

Result: 29s

DELL R630 physical server

CPU: 2 × Intel Xeon E5-2620 v3 (2.4 GHz), 12 cores in total
Memory: 64 GB
OS: CentOS 7 x64
Cores used: 24

Result: 21s

Cores used: 12

Result: 24s
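
The two R630 results differ only in the thread count. A comparison like this can be scripted so that every setting is timed the same way. The sketch below (Python 3) uses a hypothetical helper `time_by_threads` and a stand-in workload instead of the XGBoost training; keeping the best of several runs is an assumption made here to reduce noise, and the figures in this post may have been measured differently.

```python
import time


def time_by_threads(workload, thread_counts, repeats=3):
    """Time workload(n_threads) per thread count; keep the best wall-clock time."""
    results = {}
    for n_threads in thread_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            workload(n_threads)
            best = min(best, time.perf_counter() - start)
        results[n_threads] = best
    return results


def dummy_workload(n_threads):
    # Stand-in for a real training call; n_threads is ignored here.
    sum(i * i for i in range(100000))


if __name__ == "__main__":
    timings = time_by_threads(dummy_workload, [12, 24])
    for n_threads, seconds in timings.items():
        print(n_threads, "threads:", round(seconds, 3), "s")
```

In a real run, `dummy_workload` would be replaced by a function that builds the classifier with the given thread count and calls `fit`.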

Personal desktop

Instance type: none
CPU: i5-7400, 4 cores
Memory: 16 GB
OS: Windows 10
Cores used: 4

Result: 63s

Foreign junk (imported second-hand server)

Instance type: unknown
CPU: L5148, 4 cores
Memory: 12 GB
OS: CentOS 7
Cores used: 4

Result: 110s

Foreign junk 2 (imported second-hand server)

Instance type: unknown
CPU: Intel(R) Xeon(R) CPU E7- 4807 @ 1.87GHz, 12 cores
Memory: 12 GB
OS: CentOS 7
Cores used: 12

Result: 51s

Huawei Cloud Kunpeng 920 ARM server

Huawei Cloud

Instance type: kc1.3xlarge.2
CPU: Kunpeng general computing-plus | 12 vCPUs
Memory: 24 GB
OS: CentOS 7
Cores used: 12

# Switch to the ARM (AltArch) repository
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.huaweicloud.com/repository/conf/CentOS-AltArch-7.repo
yum clean all
yum makecache
yum install gcc -y
yum install python3 -y
yum install python3-devel -y
yum install lapack -y
yum install blas-devel lapack-devel -y
# The packages below are compiled from source, which is slow... there goes my money ):
pip3 install Cython -i https://pypi.tuna.tsinghua.edu.cn/simple
CFLAGS=-std=c99 pip3 install numpy==1.17.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install xgboost -i https://pypi.tuna.tsinghua.edu.cn/simple

Result: 27s
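
To make the figures easier to compare, the results can be ranked against the fastest run. The sketch below re-enters the timings from this post by hand; the short labels and the helper name `relative_to_fastest` are chosen here for illustration.

```python
# Training times in seconds, copied from the results in this post.
RESULTS = {
    "Baidu Cloud (14 cores used)": 28,
    "Tencent Cloud (12 cores used)": 20,
    "Alibaba Cloud (12 cores used)": 29,
    "DELL R630 (24 cores used)": 21,
    "DELL R630 (12 cores used)": 24,
    "Personal desktop i5-7400 (4 cores used)": 63,
    "Foreign junk L5148 (4 cores used)": 110,
    "Foreign junk 2 E7-4807 (12 cores used)": 51,
    "Huawei Cloud Kunpeng 920 (12 cores used)": 27,
}


def relative_to_fastest(results):
    """Return (name, seconds, slowdown vs. fastest) tuples, fastest first."""
    fastest = min(results.values())
    rows = [(name, secs, secs / fastest) for name, secs in results.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, secs, ratio in relative_to_fastest(RESULTS):
        print("{:<45} {:>4}s  x{:.2f}".format(name, secs, ratio))
```

The machines differ in core count, memory, and OS, so the ratios describe these specific configurations and should not be read as a general ranking of the providers.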