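Decision Tree Algorithm
The dataset contains height and weight measurements of male and female students at a school.
Columns: height; weight; category, the class label (it takes the values 1 and 2 in the output below; judging by the measurements, 1 is male and 2 is female); rand, a random number; features, the feature vector built from height and weight.
Related source: ClassifyDecisionTree.scala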
+------+------+--------+--------------------+------------+
|height|weight|category| rand| features|
+------+------+--------+--------------------+------------+
| 177.0| 68.2| 1|0.001924715176714...|[177.0,68.2]|
| 163.2| 55.9| 2|0.002309495097651...|[163.2,55.9]|
| 160.0| 50.2| 2|0.007363393934636031|[160.0,50.2]|
| 175.3| 86.4| 1|0.010757645346996858|[175.3,86.4]|
| 172.7| 66.2| 1|0.010887433925890977|[172.7,66.2]|
| 155.0| 49.2| 2|0.013443416084848447|[155.0,49.2]|
| 162.6| 54.5| 2|0.013660993453225356|[162.6,54.5]|
| 158.0| 55.5| 2|0.017720709335255047|[158.0,55.5]|
| 177.8| 80.9| 1|0.017746365296897437|[177.8,80.9]|
| 179.1| 89.1| 1|0.017900567625543484|[179.1,89.1]|
| 182.9| 85.0| 1|0.020723550349353026|[182.9,85.0]|
| 180.3| 82.6| 1| 0.02109944226439564|[180.3,82.6]|
| 161.3| 70.5| 2|0.021308838567239197|[161.3,70.5]|
| 175.3| 70.9| 1|0.021448847738743337|[175.3,70.9]|
| 180.6| 72.7| 1|0.021585780869558646|[180.6,72.7]|
| 164.5| 70.0| 1|0.025281037559654607|[164.5,70.0]|
| 157.5| 76.8| 2|0.026563453573523965|[157.5,76.8]|
| 188.0| 85.9| 1| 0.03103171572557828|[188.0,85.9]|
| 157.0| 63.0| 2| 0.03160072136660996|[157.0,63.0]|
| 182.4| 74.5| 1|0.032118109945869944|[182.4,74.5]|
+------+------+--------+--------------------+------------+
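A table like the one above could be built and fed to a decision tree roughly as follows. This is a minimal sketch and not the contents of ClassifyDecisionTree.scala: the object name `DecisionTreeSketch`, the 70/30 split, and the seed are assumptions, while the column names are taken from the table.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.rand

object DecisionTreeSketch {
  // Adds the "rand" and "features" columns seen in the table above.
  def prepare(raw: DataFrame): DataFrame = {
    // Sorting by a random column shuffles the rows.
    val shuffled = raw.withColumn("rand", rand()).orderBy("rand")
    new VectorAssembler()
      .setInputCols(Array("height", "weight"))
      .setOutputCol("features")
      .transform(shuffled)
  }

  // Trains on one part of the data and returns (scored test rows, accuracy).
  // The 70/30 split and the seed are assumptions, not taken from the source.
  def trainAndEvaluate(prepared: DataFrame): (DataFrame, Double) = {
    val Array(train, test) = prepared.randomSplit(Array(0.7, 0.3), seed = 42L)
    val model = new DecisionTreeClassifier()
      .setLabelCol("category")
      .setFeaturesCol("features")
      .fit(train)
    val scored = model.transform(test)
    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("category")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(scored)
    (scored, accuracy)
  }
}
```

Calling `trainAndEvaluate(prepare(raw))` on a DataFrame with height, weight, and category columns yields the scored test rows (with rawPrediction, probability, and prediction columns appended) together with the accuracy.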
Training and test results
+------+------+--------+--------------------+------------+---------------+--------------------+----------+
|height|weight|category| rand| features| rawPrediction| probability|prediction|
+------+------+--------+--------------------+------------+---------------+--------------------+----------+
| 177.0| 68.2| 1|0.001924715176714...|[177.0,68.2]| [0.0,5.0,1.0]|[0.0,0.8333333333...| 1.0|
| 175.3| 70.9| 1|0.021448847738743337|[175.3,70.9]| [0.0,13.0,1.0]|[0.0,0.9285714285...| 1.0|
| 180.6| 72.7| 1|0.021585780869558646|[180.6,72.7]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 182.4| 74.5| 1|0.032118109945869944|[182.4,74.5]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 197.1| 90.9| 1| 0.03559399587317391|[197.1,90.9]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 165.7| 73.1| 2| 0.04257206175621786|[165.7,73.1]|[0.0,41.0,28.0]|[0.0,0.5942028985...| 1.0|
| 176.0| 86.4| 1| 0.04607047165343936|[176.0,86.4]| [0.0,9.0,3.0]| [0.0,0.75,0.25]| 1.0|
| 165.1| 64.1| 2| 0.0608822940489292|[165.1,64.1]|[0.0,41.0,28.0]|[0.0,0.5942028985...| 1.0|
| 159.2| 51.8| 2| 0.0781592434755799|[159.2,51.8]| [0.0,0.0,66.0]| [0.0,0.0,1.0]| 2.0|
| 177.3| 73.2| 1| 0.10412537855836879|[177.3,73.2]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 175.5| 63.2| 1| 0.11522293393520988|[175.5,63.2]| [0.0,0.0,5.0]| [0.0,0.0,1.0]| 2.0|
| 167.0| 59.8| 2| 0.13131909453868662|[167.0,59.8]| [0.0,0.0,12.0]| [0.0,0.0,1.0]| 2.0|
| 177.8| 86.4| 1| 0.14061099947455546|[177.8,86.4]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 167.4| 53.9| 1| 0.15083577087167999|[167.4,53.9]| [0.0,2.0,24.0]|[0.0,0.0769230769...| 2.0|
| 180.3| 83.2| 1| 0.15463786724535922|[180.3,83.2]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 160.0| 55.4| 2| 0.17400393305299244|[160.0,55.4]| [0.0,0.0,66.0]| [0.0,0.0,1.0]| 2.0|
| 152.4| 46.5| 2| 0.1906697626456868|[152.4,46.5]| [0.0,0.0,13.0]| [0.0,0.0,1.0]| 2.0|
| 183.5| 74.8| 1| 0.1920772640392049|[183.5,74.8]| [0.0,98.0,1.0]|[0.0,0.9898989898...| 1.0|
| 173.5| 81.8| 1| 0.20261133506541862|[173.5,81.8]|[0.0,41.0,28.0]|[0.0,0.5942028985...| 1.0|
| 158.8| 49.1| 2| 0.21372194219989293|[158.8,49.1]| [0.0,0.0,66.0]| [0.0,0.0,1.0]| 2.0|
+------+------+--------+--------------------+------------+---------------+--------------------+----------+
only showing top 20 rows
accuracy is 0.8297872340425532
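The rawPrediction and probability columns in the table above are related in a simple way: for a decision tree, rawPrediction holds the per-class counts of training rows in the leaf that a sample lands in, and probability is those counts divided by their sum. Index 0 is always zero here because the labels are 1 and 2. The sketch below reproduces that arithmetic in plain Scala; the object and method names are hypothetical.

```scala
object LeafCounts {
  // Divides each class count by the total count in the leaf.
  def toProbability(raw: Seq[Double]): Seq[Double] = {
    val total = raw.sum
    if (total == 0.0) raw else raw.map(_ / total)
  }

  // The predicted label is the index of the largest entry.
  def predict(raw: Seq[Double]): Int = raw.indexOf(raw.max)

  def main(args: Array[String]): Unit = {
    // rawPrediction of the row with height 165.7 and weight 73.1
    val raw = Seq(0.0, 41.0, 28.0)
    println(toProbability(raw))
    println(predict(raw))
  }
}
```

For that row the leaf holds 41 rows of class 1 and 28 of class 2, so the probability of class 1 is 41/69, about 0.594, and the prediction is 1 even though the true category is 2.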
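Naive Bayes Algorithm
The dataset is the Iris flower dataset.
_c0, _c1, _c2, and _c3 are the values of four measured features of each flower (in the standard Iris dataset: sepal length, sepal width, petal length, and petal width).
label is the species of the flower.
Related source: ClassifyNativeBayes.scala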
+---+---+---+---+-----+--------------------+
|_c0|_c1|_c2|_c3|label| rand|
+---+---+---+---+-----+--------------------+
|4.3|3.0|1.1|0.1| 0|0.003326979325281032|
|5.4|3.4|1.7|0.2| 0|0.009592673729602486|
|6.1|3.0|4.6|1.4| 1| 0.0152037806503027|
|7.9|3.8|6.4|2.0| 2|0.015503439675020214|
|6.7|3.0|5.0|1.7| 1|0.020042734198535972|
|6.4|3.1|5.5|1.8| 2| 0.05476692766370894|
|5.5|2.5|4.0|1.3| 1| 0.05686437116523335|
|4.7|3.2|1.3|0.2| 0| 0.0595954341070446|
|6.9|3.1|5.4|2.1| 2| 0.06726753463099477|
|7.2|3.0|5.8|1.6| 2| 0.07696980523890262|
|6.7|3.3|5.7|2.5| 2| 0.08444880519447917|
|4.6|3.1|1.5|0.2| 0| 0.08524222662857528|
|5.6|2.9|3.6|1.3| 1| 0.10158676661407073|
|4.8|3.0|1.4|0.1| 0| 0.10675364426248701|
|6.3|2.5|4.9|1.5| 1| 0.11310239503362629|
|5.6|2.7|4.2|1.3| 1| 0.11453388616504145|
|5.5|3.5|1.3|0.2| 0| 0.11468327229190811|
|5.8|2.7|5.1|1.9| 2| 0.12196158211354247|
|5.1|3.5|1.4|0.3| 0| 0.12551737888690984|
|4.8|3.4|1.6|0.2| 0| 0.15533175180704428|
+---+---+---+---+-----+--------------------+
only showing top 20 rows
Model training and test results
+---+---+---+---+-----+--------------------+-----------------+--------------------+--------------------+----------+
|_c0|_c1|_c2|_c3|label| rand| features| rawPrediction| probability|prediction|
+---+---+---+---+-----+--------------------+-----------------+--------------------+--------------------+----------+
|4.3|3.0|1.1|0.1| 0|0.003326979325281032|[4.3,3.0,1.1,0.1]|[-9.8758096263421...|[0.74188275591212...| 0.0|
|6.1|3.0|4.6|1.4| 1| 0.0152037806503027|[6.1,3.0,4.6,1.4]|[-22.662166461937...|[0.04640246775348...| 1.0|
|6.4|3.1|5.5|1.8| 2| 0.05476692766370894|[6.4,3.1,5.5,1.8]|[-26.182568778203...|[0.01557191175966...| 1.0|
|5.5|2.5|4.0|1.3| 1| 0.05686437116523335|[5.5,2.5,4.0,1.3]|[-20.166766552586...|[0.05458454513502...| 1.0|
|6.9|3.1|5.4|2.1| 2| 0.06726753463099477|[6.9,3.1,5.4,2.1]|[-27.433208055916...|[0.01236259344067...| 1.0|
|7.2|3.0|5.8|1.6| 2| 0.07696980523890262|[7.2,3.0,5.8,1.6]|[-26.494335876097...|[0.01825264119093...| 1.0|
|5.6|2.9|3.6|1.3| 1| 0.10158676661407073|[5.6,2.9,3.6,1.3]|[-19.897110451150...|[0.09237937741037...| 1.0|
|4.8|3.4|1.6|0.2| 0| 0.15533175180704428|[4.8,3.4,1.6,0.2]|[-11.997775483590...|[0.70646055942144...| 0.0|
|7.4|2.8|6.1|1.9| 2| 0.18358013977287357|[7.4,2.8,6.1,1.9]|[-28.090693830596...|[0.00891391025433...| 1.0|
|4.5|2.3|1.3|0.3| 0| 0.24053166847543628|[4.5,2.3,1.3,0.3]|[-10.370756938066...|[0.56456409063282...| 0.0|
|5.7|3.8|1.7|0.3| 0| 0.24371079476801594|[5.7,3.8,1.7,0.3]|[-13.627203949814...|[0.74744750899684...| 0.0|
|6.1|2.9|4.7|1.4| 1| 0.25897191452004664|[6.1,2.9,4.7,1.4]|[-22.747269522018...|[0.04072359198491...| 1.0|
|6.1|2.6|5.6|1.4| 2| 0.32632952248541935|[6.1,2.6,5.6,1.4]|[-24.165921622143...|[0.01746092235489...| 1.0|
|7.7|2.6|6.9|2.3| 2| 0.34150870108653764|[7.7,2.6,6.9,2.3]|[-31.090843380520...|[0.00261340157005...| 2.0|
|5.2|4.1|1.5|0.1| 0| 0.34961811399305576|[5.2,4.1,1.5,0.1]|[-12.484838515158...|[0.82939929861274...| 0.0|
|4.8|3.1|1.6|0.2| 0| 0.35223492445532156|[4.8,3.1,1.6,0.2]|[-11.671413203895...|[0.66862768689173...| 0.0|
|5.9|3.0|5.1|1.8| 2| 0.35296188357024383|[5.9,3.0,5.1,1.8]|[-24.944438710599...|[0.01785017728176...| 1.0|
|5.1|3.7|1.5|0.4| 0| 0.5390894438157275|[5.1,3.7,1.5,0.4]|[-13.069681739915...|[0.71488057398769...| 0.0|
|5.1|3.8|1.9|0.4| 0| 0.5457874776234811|[5.1,3.8,1.9,0.4]|[-13.954031113067...|[0.66260455445175...| 0.0|
|7.7|2.8|6.7|2.0| 2| 0.5473906288859796|[7.7,2.8,6.7,2.0]|[-29.829888190450...|[0.00523190601445...| 2.0|
+---+---+---+---+-----+--------------------+-----------------+--------------------+--------------------+----------+
only showing top 20 rows
accuracy is 0.7777777777777778
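In the naive Bayes output, rawPrediction holds one log score per class, which is why its entries are negative, and probability is the normalized form of those scores. The sketch below shows the usual way to do that normalization in plain Scala. It assumes the standard subtract-the-maximum approach and uses hypothetical names; it is not code from ClassifyNativeBayes.scala.

```scala
object LogScores {
  // Converts unnormalized log scores into probabilities that sum to 1.
  // Subtracting the maximum first keeps exp from underflowing to zero.
  def toProbability(logScores: Seq[Double]): Seq[Double] = {
    val maxLog = logScores.max
    val exps = logScores.map(s => math.exp(s - maxLog))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

Only the differences between the log scores matter, so scores such as -1000 and -1001 normalize just as well as 0 and -1.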
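Support Vector Machine (SVM)
This section also uses the Iris dataset.
The SVM classifier used here supports only binary classification, so the dataset was filtered to keep only the rows whose label is 0 or 1.
Related source: ClassifySVM.scala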
+---+---+---+---+-----+--------------------+
|_c0|_c1|_c2|_c3|label| rand|
+---+---+---+---+-----+--------------------+
|5.1|3.5|1.4|0.2| 0|0.005383118037440182|
|5.7|4.4|1.5|0.4| 0|0.007194431761283537|
|7.0|3.2|4.7|1.4| 1|0.033787938439531984|
|4.6|3.2|1.4|0.2| 0| 0.03515755168692547|
|6.7|3.1|4.4|1.4| 1|0.047194768581304225|
|5.5|2.4|3.8|1.1| 1|0.053851496474066396|
|4.9|3.1|1.5|0.1| 0| 0.05504111221690233|
|5.7|2.8|4.1|1.3| 1| 0.05782788372655445|
|4.9|3.0|1.4|0.2| 0|0.060189662689951184|
|4.8|3.4|1.6|0.2| 0| 0.06897490026440856|
|5.4|3.4|1.7|0.2| 0| 0.09155599582098428|
|5.8|2.7|3.9|1.2| 1| 0.09326583469757688|
|6.1|2.8|4.0|1.3| 1| 0.0982254496580297|
|4.9|2.4|3.3|1.0| 1| 0.12326679062811396|
|6.2|2.9|4.3|1.3| 1| 0.12413265352469693|
|6.0|2.9|4.5|1.5| 1| 0.13204735458660521|
|4.4|3.2|1.3|0.2| 0| 0.1403506514781847|
|5.6|3.0|4.1|1.3| 1| 0.14172346739032382|
|6.5|2.8|4.6|1.5| 1| 0.14371681994803165|
|5.1|2.5|3.0|1.1| 1| 0.18510325676932826|
+---+---+---+---+-----+--------------------+
Model training and test results
+---+---+---+---+-----+--------------------+-----------------+--------------------+----------+
|_c0|_c1|_c2|_c3|label| rand| features| rawPrediction|prediction|
+---+---+---+---+-----+--------------------+-----------------+--------------------+----------+
|6.1|2.8|4.7|1.2| 1|0.012481961930157603|[6.1,2.8,4.7,1.2]|[-1.6509634270222...| 1.0|
|5.0|3.3|1.4|0.2| 0|0.016082628759263806|[5.0,3.3,1.4,0.2]|[1.29354104026248...| 0.0|
|4.4|2.9|1.4|0.2| 0| 0.22290326246094538|[4.4,2.9,1.4,0.2]|[1.05281732989969...| 0.0|
|5.4|3.0|4.5|1.5| 1| 0.2668875621875405|[5.4,3.0,4.5,1.5]|[-1.4835570568200...| 1.0|
|5.4|3.9|1.3|0.4| 0| 0.3533812726039295|[5.4,3.9,1.3,0.4]|[1.63057317590279...| 0.0|
|6.6|3.0|4.4|1.4| 1| 0.3553239162288241|[6.6,3.0,4.4,1.4]|[-1.6624321643118...| 1.0|
|5.1|3.3|1.7|0.5| 0| 0.5343838606275636|[5.1,3.3,1.7,0.5]|[0.87191902881306...| 0.0|
|5.1|3.4|1.5|0.2| 0| 0.5482515144522366|[5.1,3.4,1.5,0.2]|[1.32991481935905...| 0.0|
|6.7|3.1|4.4|1.4| 1| 0.8046227561337921|[6.7,3.1,4.4,1.4]|[-1.5893019484248...| 1.0|
|5.6|3.0|4.5|1.5| 1| 0.8385862859176035|[5.6,3.0,4.5,1.5]|[-1.5353542100051...| 1.0|
|6.0|2.2|4.0|1.0| 1| 0.9669924229306907|[6.0,2.2,4.0,1.0]|[-1.7716397981170...| 1.0|
+---+---+---+---+-----+--------------------+-----------------+--------------------+----------+
accuracy is 1.0
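The filtering and training steps for this section might look like the sketch below. It assumes Spark's `LinearSVC`, which fits the output above (a rawPrediction column but no probability column), although the source does not name the class. The object name `SvmSketch`, the 70/30 split, the seed, and the hyperparameter values are all assumptions.

```scala
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SvmSketch {
  // Keeps only labels 0 and 1, because LinearSVC is a binary classifier.
  def binaryOnly(iris: DataFrame): DataFrame = iris.filter(col("label") < 2)

  // Returns the accuracy on a held-out part of the filtered data.
  def trainAndEvaluate(iris: DataFrame): Double = {
    val assembled = new VectorAssembler()
      .setInputCols(Array("_c0", "_c1", "_c2", "_c3"))
      .setOutputCol("features")
      .transform(binaryOnly(iris))
    val Array(train, test) = assembled.randomSplit(Array(0.7, 0.3), seed = 42L)
    val model = new LinearSVC()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(50)   // assumed value
      .setRegParam(0.1) // assumed value
      .fit(train)
    new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(model.transform(test))
  }
}
```

In the table above, a negative first entry in rawPrediction goes with prediction 1.0 and a positive one with prediction 0.0, which is consistent with a two-class margin-based classifier of this kind.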
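Summary
Overall, the classification results are satisfactory: accuracy is fairly high for all three algorithms and can probably be raised further by tuning parameters. Compared with regression algorithms, classification algorithms reach satisfactory results more easily and can predict well even on small datasets. Because only small datasets were used here, the results are suitable only for testing and parameter tuning.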
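As an illustration of parameter tuning, the sketch below searches over the depth of a decision tree with cross-validation. It is one possible approach rather than what the source files do: the object name `TuningSketch`, the grid values, and the fold count are assumptions, and the input is expected to already have a features column.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object TuningSketch {
  // Tries several tree depths with 3-fold cross-validation and keeps the best model.
  // The grid values and the fold count are assumptions chosen for illustration.
  def tuneDepth(train: DataFrame, labelCol: String): CrossValidatorModel = {
    val tree = new DecisionTreeClassifier()
      .setLabelCol(labelCol)
      .setFeaturesCol("features")
    val grid = new ParamGridBuilder()
      .addGrid(tree.maxDepth, Array(2, 4, 6))
      .build()
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol(labelCol)
      .setMetricName("accuracy")
    new CrossValidator()
      .setEstimator(tree)
      .setEstimatorParamMaps(grid)
      .setEvaluator(evaluator)
      .setNumFolds(3)
      .fit(train)
  }
}
```

The returned model exposes `avgMetrics`, the mean cross-validated accuracy for each grid entry, and `bestModel`, the tree refit with the best depth.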
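The decision tree reached a final accuracy of 0.830, naive Bayes 0.778, and SVM 1.0. The perfect SVM score is partly luck: the test output above has only 11 rows, and such a score is unlikely to hold on a large dataset. Over repeated runs the SVM accuracy stayed above 0.8. Note also that the SVM used here supports only binary classification.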
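Because a single small test set can give a lucky score, accuracy is better judged over several random splits. The helper below is a minimal sketch of that idea: the object name `RepeatedSplits`, the 70/30 split, and the `score` callback are assumptions, and the callback stands for any train-then-evaluate routine that returns an accuracy.

```scala
import org.apache.spark.sql.DataFrame

object RepeatedSplits {
  // Runs `score` on one random train/test split per seed and collects the results.
  def accuracies(data: DataFrame, seeds: Seq[Long])(
      score: (DataFrame, DataFrame) => Double): Seq[Double] =
    seeds.map { seed =>
      val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed)
      score(train, test)
    }

  // Arithmetic mean of the collected accuracies.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size
}
```

Passing a function that fits a classifier on the first DataFrame and evaluates it on the second gives one accuracy per seed; the mean and the spread of those values say more than any single run.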