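I have recently been working with the KDD Cup 99 data, so I am writing down the problems I ran into and how I handled them, in the hope that it helps others.
Resources
KDD CUP 1999 dataset download: http://kdd.ics.uci.edu/databases/kddcup99/
KDD CUP 1999 dataset introduction: the linked article "KDD CUP 99 dataset"
KDD CUP 1999 reference project: a linked reference project whose code can be downloaded and run
WEKA study slides (PPT): https://pan.baidu.com/s/1slTz5Bf
Downloading the data
The KDD Cup 99 download page lists the following files: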
kddcup.names: A list of features.
kddcup.data.gz: The full data set (18M; 743M uncompressed)
kddcup.data_10_percent.gz: A 10% subset (2.1M; 75M uncompressed)
kddcup.newtestdata_10_percent_unlabeled.gz (1.4M; 45M uncompressed)
kddcup.testdata.unlabeled.gz (11.2M; 430M uncompressed)
kddcup.testdata.unlabeled_10_percent.gz (1.4M; 45M uncompressed)
corrected.gz: Test data with corrected labels.
training_attack_types: A list of intrusion types.
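For a description of the dataset, see the introduction linked under Resources. I use the corrected file as the training set and kddcup.data_10_percent as the test set. Note that the file list describes corrected.gz as test data with corrected labels, so the more common setup is the reverse: train on kddcup.data_10_percent and test on corrected.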
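Whichever file ends up as the training set, it is worth checking which attack labels appear in only one of the two files. The sketch below assumes both files are already loaded into pandas DataFrames with a "label" column (as in the reading step later in this post) and that the raw labels end with a trailing period, such as "normal."; the function name compare_labels is my own, not part of any library.

```python
import pandas as pd


def compare_labels(train, test, label_col="label"):
    """Return the labels that occur in only one of the two DataFrames."""
    # Strip the trailing "." that the raw KDD Cup 99 labels carry (assumption)
    train_labels = set(train[label_col].str.rstrip("."))
    test_labels = set(test[label_col].str.rstrip("."))
    return {
        "only_in_train": sorted(train_labels - test_labels),
        "only_in_test": sorted(test_labels - train_labels),
    }


# Usage, once both files are loaded (the variable names are hypothetical):
# print(compare_labels(train_df, test_df))
```

Labels listed under "only_in_test" are classes the model never sees during training, which is useful to know before reading any accuracy figure.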
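Reading the data
The downloaded files are plain text. I opened each one in Notepad++ and saved it as a .txt file so that Python can read it easily. The next step is to add the column names and then save the .txt file as a CSV file.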
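As an alternative to re-saving the file by hand, pandas can read the downloaded .gz archive directly. This is a minimal sketch rather than the method I used above: read_kdd is a hypothetical helper name, and it expects the list of 42 column names to be passed in.

```python
import pandas as pd


def read_kdd(path, col_names):
    """Read a KDD Cup 99 data file (plain text or .gz) into a DataFrame."""
    # The files are comma-separated with no header row.
    # compression="infer" makes pandas decompress based on the file extension.
    return pd.read_csv(path, header=None, names=col_names, compression="infer")


# Usage (path is an example; col_names is the 42-name list):
# data = read_kdd("corrected.gz", col_names)
```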
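Add the column names to the plain-text file. The names are the ones given in the dataset introduction linked under Resources. Python code: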
import pandas as pd

# The 42 column names: 41 features plus the label
col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

# The file is comma-separated and has no header row, so pass the names explicitly
data = pd.read_csv("corrected.txt", header=None, names=col_names)
print(data.head(10))  # show the first 10 rows
data.to_csv("corrected.csv", index=False)  # save as CSV without an extra index column
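Note: I originally created an empty corrected.csv in Excel beforehand because I got a "file does not exist" error. That step should not be necessary, since to_csv creates the file itself; the error more likely comes from a wrong input path or an output folder that does not exist. (I use absolute paths throughout.)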
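To avoid the "file does not exist" problem without touching Excel, the output folder can be created from Python before writing. This is a small sketch under my own naming (save_csv is a hypothetical helper), using pathlib from the standard library.

```python
from pathlib import Path

import pandas as pd


def save_csv(df, out_path):
    """Write df to out_path as CSV, creating missing parent folders first."""
    out_path = Path(out_path)
    # to_csv creates the file but not the folder, so make sure the folder exists
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path


# Usage (the path is an example):
# save_csv(data, "output/corrected.csv")
```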
[Figure: the CSV file after adding the column names]