Python Data Analysis and Machine Learning (Numpy, Pandas, Mat

Author: yuhuan121 | Published: 2017-10-29 23:12

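How should you study machine learning?

Machine learning combines mathematical derivation with practical application techniques, so you need to understand both how an algorithm is derived and how to apply it.
Deep learning extends the neural-network branch of machine learning and is especially effective in computer vision and natural language processing.
Take your own notes from scratch as you go.

How do you get hands-on practice, and where can you find worked examples?

The best resources are GitHub and Kaggle.
Building up a collection of worked examples pays off, because projects are rarely written from scratch. Learn to imitate first, then create.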

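Numpy, the scientific computing library

numpy (Numerical Python extensions) is a third-party Python package for scientific computing. Its predecessor was an array-computation library whose development started in 1995. After many years of development it has become the foundation of most scientific computing in Python, including the deep learning frameworks that provide a Python interface.
numpy.genfromtxt
Loads data from a text file and handles missing values as specified.

delimiter: the string used to separate values. It can be a str, an int, or a sequence. By default, any run of consecutive whitespace acts as the delimiter.
dtype: the data type of the resulting array. If None, the dtype of each column is determined from its contents.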

import numpy
world_alcohol = numpy.genfromtxt("world_alcohol.txt",delimiter=",",dtype=str)
print(type(world_alcohol))
print(world_alcohol)
help(numpy.genfromtxt)  # use help() to read the documentation for numpy.genfromtxt

Output:
<class 'numpy.ndarray'>  # every NumPy array is an ndarray
[['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value']
 ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ..., 
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
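Because every field above is read as a string, the header row and the numeric column need separate handling. The following is a minimal sketch, assuming a small in-memory sample (hypothetical rows in the same layout as world_alcohol.txt) instead of the real file; skip_header is a genfromtxt parameter that drops leading lines.

```python
import io
import numpy as np

# Hypothetical sample standing in for world_alcohol.txt (same column layout).
sample = io.StringIO(
    "Year,WHO region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
    "1985,Africa,Malawi,Spirits,0.31\n"
)

# skip_header=1 drops the header row so only data rows remain.
rows = np.genfromtxt(sample, delimiter=",", dtype=str, skip_header=1)

# Every field is still a string; convert the last column explicitly.
display_value = rows[:, 4].astype(float)
print(rows.shape)
print(display_value)
```

Converting one column at a time keeps the text columns intact while making the numeric column usable for arithmetic.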

numpy.array
Creates a vector or a matrix (a multidimensional array)

import numpy as np
a = [1, 2, 4, 3]        #vector
b = np.array(a)             # array([1, 2, 4, 3])
type(b)                     # <class 'numpy.ndarray'>

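Operations on array elements (1)

b.shape                     # (4,) the (rows, columns) of a matrix, or the number of elements in a vector
b.argmax()                  # 2, index of the maximum value
b.max()                     # 4, maximum value
b.min()                     # 1, minimum value
b.mean()                    # 2.5, mean

numpy requires every element of a numpy.array to have the same data type. The dtype attribute returns that type.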

>>> a = [1,2,3,5]
>>> b = np.array(a)
>>> b.dtype
dtype('int64')

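Operations on array elements (2)

c = [[1, 2], [3, 4]]    # 2-D list
d = np.array(c)             # 2-D numpy array
d.shape                     # (2, 2)
d[1,1]                      # 4, index by (row, column)
d.size                      # 4, number of elements in the array
d.max(axis=0)               # maximum along axis 0 (the first axis, one value per column), array([3, 4])
d.max(axis=1)               # maximum along axis 1 (the second axis, one value per row), array([2, 4])
d.mean(axis=0)              # mean along axis 0 (the first axis), array([ 2.,  3.])
d.flatten()                 # flatten a numpy array to 1-D, array([1, 2, 3, 4])
np.ravel(c)                 # flatten any array-like structure to 1-D, array([1, 2, 3, 4])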

Operations on array elements (3)

import numpy as np
matrix = np.array([
                [5,10,15],
                [20,25,30],
                [35,40,45]
                ])
print(matrix.sum(axis=1))  # axis=1: sum across each row
Output:
[ 30  75 120]

import numpy as np
matrix = np.array([
                [5,10,15],
                [20,25,30],
                [35,40,45]
                ])
print(matrix.sum(axis=0))  # axis=0: sum down each column
Output:
[60 75 90]
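One way to remember the axis argument is that it names the axis that gets collapsed. The sketch below reuses the same matrix; the keepdims=True call is an extra illustration, not part of the original notes.

```python
import numpy as np

matrix = np.array([[5, 10, 15],
                   [20, 25, 30],
                   [35, 40, 45]])

col_sums = matrix.sum(axis=0)   # collapse the rows: one value per column
row_sums = matrix.sum(axis=1)   # collapse the columns: one value per row

# keepdims=True keeps the collapsed axis with length 1,
# which is handy when the result must broadcast against the matrix.
row_sums_2d = matrix.sum(axis=1, keepdims=True)
print(col_sums, row_sums, row_sums_2d.shape)
```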

Slicing also works on matrices

import numpy as np
vector = [1, 2, 4, 3]
print(vector[0:3])  # [1, 2, 4], all elements with index >= 0 and < 3

matrix = np.array([[5,10,15],[20,25,30],[35,40,45]])
print(matrix[:,1])  # [10 25 40], the second column (index 1) of every row
print(matrix[:,0:2])   # the first and second columns of every row
# [[ 5 10]
#  [20 25]
#  [35 40]]

A comparison on an array is applied to every element of the array

import numpy as np
matrix = np.array([[5,10,15],[20,25,30],[35,40,45]])
print(matrix == 25)
Output:
[[False False False]
 [False  True False]
 [False False False]]

second_column_25 = matrix[:,1] == 25
print(second_column_25)
print(matrix[second_column_25,:])  # a boolean array can also be used as an index
Output:
[False  True False]
[[20 25 30]]

AND and OR operations on array elements

import numpy as np
vector = np.array([5,10,15,20])
equal_to_ten_and_five = (vector == 10) & (vector == 5)
print (equal_to_ten_and_five)
Output:
[False False False False]

import numpy as np
vector = np.array([5,10,15,20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
print(equal_to_ten_or_five)
vector[equal_to_ten_or_five] = 50  # with a boolean index, only the positions that are True are assigned
print(vector)
Output:
[ True  True False False]
[50 50 15 20]
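Masks can be combined, negated, and used to build a new array without modifying the original. A small sketch follows; the range test and the np.where call are illustrations added here, not part of the original notes. The parentheses around each comparison matter because & and | bind more tightly than the comparison operators.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

in_range = (vector >= 10) & (vector < 20)    # True where both conditions hold
outside = ~in_range                          # logical NOT of the mask
capped = np.where(vector > 10, 10, vector)   # new array; vector itself is unchanged

print(vector[in_range])
print(vector[outside])
print(capped)
```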

Converting the element type of an array

import numpy as np
vector = np.array(['5','10','15','20'])
vector = vector.astype(float)  # astype converts the type of every value in vector
print(vector.dtype)
print(vector)
Output:
float64
[  5.  10.  15.  20.]
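astype only succeeds when every element can be parsed as the target type. The sketch below contrasts numeric strings with arbitrary text; the sample names are made up for illustration.

```python
import numpy as np

numeric_text = np.array(["5", "10", "15", "20"])
as_float = numeric_text.astype(float)   # every string parses as a number

names = np.array(["lucy", "ch", "dd"])  # hypothetical non-numeric strings
try:
    names.astype(float)
    conversion_failed = False
except ValueError:
    # NumPy raises ValueError when a string cannot be converted to float
    conversion_failed = True

print(as_float, conversion_failed)
```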

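Commonly used Numpy functions

The reshape method changes the dimensions of a matrix

import numpy as np
print(np.arange(15))
a = np.arange(15).reshape(3,5)  # turn the vector into a 3-row, 5-column matrix
print(a)
print(a.shape)  # the shape attribute gives (rows, columns)

Output: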
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
(3, 5)

Initializing a matrix with 0s or 1s

>>> import numpy as np
>>> np.zeros((3,4))   # a 3-row, 4-column matrix filled with 0
Output:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

>>> import numpy as np
>>> np.ones((3,4),dtype=np.int32)  # specify an integer dtype
Output:
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

Building sequences

np.arange( 10, 30, 5 )   # start at 10, stop before 30, step 5
Output:
array([10, 15, 20, 25])

np.arange( 0, 2, 0.3 )
Output:
array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8])

The random module

np.random.random((2,3))  # the random function of the random module returns a 2-row, 3-column matrix of random values in [0, 1)
Output:
array([[ 0.40130659,  0.45452825,  0.79776512],
       [ 0.63220592,  0.74591134,  0.64130737]])
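Random output differs on every run, which makes notes hard to reproduce. A minimal sketch using NumPy's Generator interface with a fixed seed is shown below; the seed value 42 is arbitrary, and the rescaling to [-1, 1) is an added illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # arbitrary seed, chosen only for repeatability
sample = rng.random((2, 3))             # uniform values in [0, 1)
scaled = -1 + 2 * rng.random((2, 3))    # the same distribution rescaled to [-1, 1)

# A second generator with the same seed reproduces the first draw exactly.
repeat = np.random.default_rng(seed=42).random((2, 3))
print(sample)
print(scaled)
```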

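The linspace function returns a given number of evenly spaced values between a start value and a stop value (both included)

import numpy as np
from numpy import pi
np.linspace( 0, 2*pi, 100 )
Output: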
array([ 0.        ,  0.06346652,  0.12693304,  0.19039955,  0.25386607,
        0.31733259,  0.38079911,  0.44426563,  0.50773215,  0.57119866,
        0.63466518,  0.6981317 ,  0.76159822,  0.82506474,  0.88853126,
        0.95199777,  1.01546429,  1.07893081,  1.14239733,  1.20586385,
        1.26933037,  1.33279688,  1.3962634 ,  1.45972992,  1.52319644,
        1.58666296,  1.65012947,  1.71359599,  1.77706251,  1.84052903,
        1.90399555,  1.96746207,  2.03092858,  2.0943951 ,  2.15786162,
        2.22132814,  2.28479466,  2.34826118,  2.41172769,  2.47519421,
        2.53866073,  2.60212725,  2.66559377,  2.72906028,  2.7925268 ,
        2.85599332,  2.91945984,  2.98292636,  3.04639288,  3.10985939,
        3.17332591,  3.23679243,  3.30025895,  3.36372547,  3.42719199,
        3.4906585 ,  3.55412502,  3.61759154,  3.68105806,  3.74452458,
        3.8079911 ,  3.87145761,  3.93492413,  3.99839065,  4.06185717,
        4.12532369,  4.1887902 ,  4.25225672,  4.31572324,  4.37918976,
        4.44265628,  4.5061228 ,  4.56958931,  4.63305583,  4.69652235,
        4.75998887,  4.82345539,  4.88692191,  4.95038842,  5.01385494,
        5.07732146,  5.14078798,  5.2042545 ,  5.26772102,  5.33118753,
        5.39465405,  5.45812057,  5.52158709,  5.58505361,  5.64852012,
        5.71198664,  5.77545316,  5.83891968,  5.9023862 ,  5.96585272,
        6.02931923,  6.09278575,  6.15625227,  6.21971879,  6.28318531])
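Unlike arange, linspace takes the number of points rather than the step, and by default includes the stop value. The sketch below uses a short range so the values are easy to check by hand; retstep and endpoint are linspace parameters.

```python
import numpy as np

# 5 points from 0 to 1, both ends included; retstep=True also returns the spacing
points, step = np.linspace(0, 1, 5, retstep=True)

# endpoint=False excludes the stop value, like arange does
open_ended = np.linspace(0, 1, 5, endpoint=False)

print(points, step)
print(open_ended)
```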

Arithmetic on matrices is applied element by element to the whole array

import numpy as np
a = np.array( [20,30,40,50] )
b = np.arange( 4 )  #[0 1 2 3]
c = a-b 
print(c)  #[20 29 38 47]
print(b**2)  #[0 1 4 9]
print(a<35)  #[ True  True False False]

Matrix multiplication

import numpy as np
A = np.array( [[1,1],
               [0,1]] )
B = np.array( [[2,0],
               [3,4]] )
print(A.dot(B))  # matrix product, method 1
print(np.dot(A, B))  # matrix product, method 2
Output:
[[5 4]
 [3 4]]
[[5 4]
 [3 4]]
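It is easy to confuse the matrix product with element-wise multiplication. The sketch below puts the two side by side on the same A and B; the @ operator is Python's matrix-multiplication operator, which NumPy arrays support.

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

elementwise = A * B   # multiplies matching entries
matmul = A @ B        # matrix product, same result as A.dot(B)

print(elementwise)
print(matmul)
```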

Exponentials with base e and square roots

import numpy as np
B = np.arange(3)
print(np.exp(B))  # [ 1.          2.71828183  7.3890561 ], e raised to each element of B
print(np.sqrt(B))  # [ 0.          1.          1.41421356]

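floor rounds down

import numpy as np
a = np.floor(10*np.random.random((3,4)))  # floor rounds each value down
print(a)
print(a.ravel())  # flatten the matrix into a single row
a.shape = (6, 2)     # with a.reshape(6,-1), the -1 tells numpy to infer the number of columns from the number of rows
print(a)
print(a.T)  # transpose of a (rows and columns swapped)
Output: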
[[ 8.  7.  2.  1.]
 [ 5.  2.  5.  1.]
 [ 8.  7.  7.  2.]]
[ 8.  7.  2.  1.  5.  2.  5.  1.  8.  7.  7.  2.]
[[ 8.  7.]
 [ 2.  1.]
 [ 5.  2.]
 [ 5.  1.]
 [ 8.  7.]
 [ 7.  2.]]
[[ 8.  2.  5.  5.  8.  7.]
 [ 7.  1.  2.  1.  7.  2.]]

hstack and vstack concatenate matrices (often used to combine data)

a = np.floor(10*np.random.random((2,2)))
b = np.floor(10*np.random.random((2,2)))
print(a)
print(b)
print(np.hstack((a,b)))  # concatenate side by side
print(np.vstack((a,b)))   # stack one on top of the other
Output:
[[ 8.  6.]
 [ 7.  6.]]
[[ 3.  4.]
 [ 8.  1.]]
[[ 8.  6.  3.  4.]
 [ 7.  6.  8.  1.]]
[[ 8.  6.]
 [ 7.  6.]
 [ 3.  4.]
 [ 8.  1.]]

hsplit and vsplit split a matrix

a = np.floor(10*np.random.random((2,12)))
print(a)
print(np.hsplit(a,3))  # split the matrix horizontally into 3 equal parts
print(np.hsplit(a,(3,4)))  # split at given positions: before column index 3 and before column index 4
Output:
[[ 7.  1.  4.  9.  8.  8.  5.  9.  6.  6.  9.  4.]
 [ 1.  9.  1.  2.  9.  9.  5.  0.  5.  4.  9.  6.]]
[array([[ 7.,  1.,  4.,  9.],
       [ 1.,  9.,  1.,  2.]]), array([[ 8.,  8.,  5.,  9.],
       [ 9.,  9.,  5.,  0.]]), array([[ 6.,  6.,  9.,  4.],
       [ 5.,  4.,  9.,  6.]])]
[array([[ 7.,  1.,  4.],
       [ 1.,  9.,  1.]]), array([[ 9.],
       [ 2.]]), array([[ 8.,  8.,  5.,  9.,  6.,  6.,  9.,  4.],
       [ 9.,  9.,  5.,  0.,  5.,  4.,  9.,  6.]])]

a = np.floor(10*np.random.random((12,2)))
print(a)
print(np.vsplit(a,3))  # split the matrix vertically into 3 equal parts
Output:
[[ 6.  4.]
 [ 0.  1.]
 [ 9.  0.]
 [ 0.  0.]
 [ 0.  4.]
 [ 1.  1.]
 [ 0.  4.]
 [ 1.  6.]
 [ 9.  7.]
 [ 0.  9.]
 [ 6.  1.]
 [ 3.  0.]]
[array([[ 6.,  4.],
        [ 0.,  1.],
        [ 9.,  0.],
        [ 0.,  0.]]), array([[ 0.,  4.],
        [ 1.,  1.],
        [ 0.,  4.],
        [ 1.,  6.]]), array([[ 9.,  7.],
        [ 0.,  9.],
        [ 6.,  1.],
        [ 3.,  0.]])]
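hsplit and vsplit with an integer argument require the axis length to divide evenly. The sketch below shows what happens otherwise and uses np.array_split, a related NumPy function not covered in the notes above, which allows uneven parts.

```python
import numpy as np

a = np.arange(10).reshape(2, 5)   # 5 columns cannot be split into 2 equal parts

try:
    np.hsplit(a, 2)
    equal_split_ok = True
except ValueError:
    # raised because the split does not result in an equal division
    equal_split_ok = False

# array_split tolerates uneven division: the earlier parts get the extra columns
parts = np.array_split(a, 2, axis=1)
shapes = [p.shape for p in parts]
print(equal_split_ok, shapes)
```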

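Assigning one array to another name makes both names refer to the same memory, so an operation through one name affects the other

a = np.arange(12)
b = a    # a and b are two names for the same array object
print(b is a)
b.shape = 3,4
print(a.shape)
print(id(a))   # id identifies the object; equal ids mean a and b refer to the same object in memory
print(id(b))
Output: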
True
(3, 4)
4382560048
4382560048

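The view method creates a new array object that shares the same underlying data

import numpy as np
a = np.arange(12)
c = a.view()
print(id(a))  # the ids differ
print(id(c))
print(c is a)
c.shape = 2,6
print(a.shape)  # changing the shape of c leaves the shape of a unchanged
c[0,4] = 1234  # change an element through c
print(a)   # the element of a changes too
Output: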
4382897216
4382897136
False
(12,)
[   0    1    2    3 1234    5    6    7    8    9   10   11]

The copy method (deep copy) creates a complete copy of the array and its data

d = a.copy()
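The three cases (plain assignment, view, copy) can be compared directly. A minimal sketch follows; np.shares_memory is used here only to make the difference visible and is not part of the original notes.

```python
import numpy as np

a = np.arange(12)
alias = a            # same object under a second name
view = a.view()      # new object, same underlying data
copy = a.copy()      # new object, independent data

view[0] = 100        # visible through a, because the data is shared
copy[1] = 200        # not visible through a

print(alias is a, view is a, copy is a)
print(np.shares_memory(a, view), np.shares_memory(a, copy))
print(a[:2])
```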

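Finding the maximum along rows or columns of a matrix, and its index

import numpy as np
data = np.sin(np.arange(20)).reshape(5,4)
print(data)
ind = data.argmax(axis=0)   # row index of the maximum in each column
print(ind)
data_max = data[ind, range(data.shape[1])]  # pick values by (row, column) index pairs
print(data_max)
Output: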
[[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
[2 0 3 1]
[ 0.98935825  0.84147098  0.99060736  0.6569866 ]

The tile method repeats the original array along rows and columns

import numpy as np
a = np.arange(0, 40, 10)
b = np.tile(a, (2, 3))  # repeat 2 times along rows and 3 times along columns
print(b)
Output:
[[ 0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30]]

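Two ways to sort
sort sorts the values of a matrix; argsort returns the indices that would sort the elements in ascending order, and indexing with them gives the sorted result

a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)  # sort each row of a in ascending order and assign the result to b
print(b)
a.sort(axis=1)   # sort each row of a in place
print(a)
a = np.array([4, 3, 1, 2])
j = np.argsort(a)  # argsort returns the indices that sort the elements in ascending order
print(j)
print(a[j])  # index a with those indices to get the sorted values
Output: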
[[3 4 5]
 [1 1 2]]
-------
[[3 4 5]
 [1 1 2]]
-------
[2 3 1 0]
-------
[1 2 3 4]
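argsort is most useful when the order from one array should be applied elsewhere. The sketch below shows two common uses, a descending sort and reordering the rows of a table by one column; the small table is made up for illustration.

```python
import numpy as np

a = np.array([4, 3, 1, 2])
descending = a[np.argsort(a)[::-1]]   # reverse the ascending order

# Hypothetical 2-column table: reorder whole rows by the first column
table = np.array([[3, 30],
                  [1, 10],
                  [2, 20]])
by_first_col = table[np.argsort(table[:, 0])]

print(descending)
print(by_first_col)
```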

Pandas, the data analysis library built on Numpy

read_csv reads a csv file

import pandas as pd
food_info = pd.read_csv("food_info.csv")
print(type(food_info))  # a pandas DataFrame can be treated as a matrix-like structure
print(food_info.dtypes)  # dtypes lists the data type of each column
Output:
<class 'pandas.core.frame.DataFrame'>
NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
......
Cholestrl_(mg)     float64
dtype: object

Inspecting the file that was read

print(food_info.head(3))  # first 3 rows; head() with no argument returns the first 5 rows
print(food_info.tail())  # tail() returns the last 5 rows
print(food_info.columns)  # columns holds all the column names
print(food_info.shape)  # dimensions of the data, (8618, 36)

Selecting specific rows

print(food_info.loc[0])  # the row with index label 0
food_info.loc[8620]  # a label beyond the largest index raises an error: "KeyError: 'the label [8620] is not in the [index]'"
food_info.loc[3:6]  # rows 3 through 6 inclusive: 3, 4, 5, 6
two_five_ten = [2,5,10]
food_info.loc[two_five_ten]  # rows 2, 5 and 10
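loc selects by index label, while iloc selects by position. With the default 0, 1, 2, ... index the two look alike, so the sketch below uses a small made-up DataFrame whose labels start at 100 to show the difference.

```python
import pandas as pd

# Hypothetical frame with labels that differ from positions
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = df.loc[101:102]     # label slice: both ends included
by_position = df.iloc[1:2]     # position slice: end excluded, like a Python list

print(by_label)
print(by_position)
```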

Selecting specific columns

ndb_col = food_info["NDB_No"]  # the data in the first column, NDB_No
print(ndb_col)

columns = ["Zinc_(mg)", "Copper_(mg)"]  # to select several columns, list their names
zinc_copper = food_info[columns]
print(zinc_copper)

Selecting the columns whose names end with (g)

col_names = food_info.columns.tolist()  # tolist() puts the column names in a list
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())  # first 5 rows
Output:
  Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06   
1      15.87         0.85          81.11     2.11            0.06   
2       0.24         0.28          99.48     0.00            0.00   
3      42.41        21.40          28.74     5.11            2.34   
4      41.11        23.24          29.68     3.18            2.79   

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  
0           0.0           0.06      51.368       21.021        3.043  
1           0.0           0.06      50.489       23.426        3.012  
2           0.0           0.00      61.924       28.732        3.694  
3           0.0           0.50      18.669        7.778        0.800  
4           0.0           0.51      18.764        8.598        0.784
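The same column filter can be written more compactly. The sketch below uses a one-row made-up frame with a few of the column names from food_info.csv; the list comprehension and the vectorized str.endswith on the column index are alternatives to the loop, not the original author's code.

```python
import pandas as pd

# Hypothetical one-row frame reusing a few column names from food_info.csv
food = pd.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Energ_Kcal": [717],
    "Protein_(g)": [0.85],
    "Iron_(mg)": [0.02],
})

# List comprehension version of the loop
gram_columns = [c for c in food.columns if c.endswith("(g)")]
gram_df = food[gram_columns]

# Vectorized version: a boolean mask over the column names
same_df = food.loc[:, food.columns.str.endswith("(g)")]
print(gram_columns)
```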

Arithmetic on the data in a column

import pandas
food_info = pandas.read_csv("food_info.csv")
iron_grams = food_info["Iron_(mg)"] / 1000  # divide every value in the column by 1000
food_info["Iron_(g)"] = iron_grams  # add a new column Iron_(g) to hold the result

water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # multiply two columns element by element

Maximum, minimum, and mean of a column

max_calories = food_info["Energ_Kcal"].max()
print(max_calories)
min_calories = food_info["Energ_Kcal"].min()
print(min_calories)
mean_calories = food_info["Energ_Kcal"].mean()
print(mean_calories)
Output:
902
0
226.438616848

Sorting by a column with sort_values()

food_info.sort_values("Sodium_(mg)", inplace=True)
# ascending order by default; inplace=True sorts food_info itself instead of returning a new DataFrame
print(food_info["Sodium_(mg)"])

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
# ascending=False sorts from largest to smallest
print(food_info["Sodium_(mg)"])
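The effect of inplace is easy to get backwards, so here is a minimal sketch on a made-up three-row frame: without inplace a sorted copy is returned and the original is untouched, while with inplace=True the original is reordered and the call returns None.

```python
import pandas as pd

df = pd.DataFrame({"Sodium_(mg)": [30, 10, 20]})   # hypothetical values

sorted_copy = df.sort_values("Sodium_(mg)")         # returns a new DataFrame
order_after_copy = df["Sodium_(mg)"].tolist()       # df is still in its original order

result = df.sort_values("Sodium_(mg)", inplace=True, ascending=False)
order_after_inplace = df["Sodium_(mg)"].tolist()    # df itself is now sorted

print(order_after_copy, order_after_inplace, result)
```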

Exercises on titanic_train.csv (including the pivot_table() method)

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()

age = titanic_survival["Age"]
print(age.loc[0:20])  # print rows 0 through 20 of one column
age_is_null = pd.isnull(age)  # isnull() detects missing values: True if missing, False otherwise
print(age_is_null)
age_null_true = age[age_is_null]  # all rows where this column is missing
print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)    # number of missing rows

# with missing values present, a plain sum/len mean comes out as nan
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])  # sum() adds up the elements of the column
print(mean_age)  # nan

# drop the missing values before computing the mean
good_ages = titanic_survival["Age"][~age_is_null]  # keep only the non-missing values
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)  # 29.6991176471

# missing values are common, so pandas provides mean(), which skips them automatically
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)  #29.6991176471
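The difference between the built-in sum and the pandas mean comes down to how NaN is treated. A minimal sketch on a made-up three-value Series follows; skipna is the parameter of Series.mean that controls this behavior.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0])   # hypothetical ages with one missing value

builtin_mean = sum(ages) / len(ages)     # NaN propagates through Python's sum
pandas_mean = ages.mean()                # skipna=True by default: (22 + 38) / 2
strict_mean = ages.mean(skipna=False)    # ask pandas to propagate NaN as well

print(builtin_mean, pandas_mean, strict_mean)
```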

# mean ticket fare for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]  # the Fare column for passengers in this class
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
Output:
{1: 84.154687499999994, 2: 20.662183152173913, 3: 13.675550101832993}

# pandas offers a more convenient tool for such statistics: the pivot_table() method
# index: the column to group by
# values: the column to aggregate
# aggfunc: the aggregation function to apply
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)
Output:
Pclass  Survived
1       0.629630
2       0.472826
3       0.242363
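For a single grouping column, pivot_table gives the same numbers as a groupby followed by an aggregation. The sketch below checks this on a tiny made-up frame, not the real Titanic data; groupby is an alternative named here for comparison and the string "mean" is used as the aggregation name.

```python
import pandas as pd

# Hypothetical stand-in for titanic_train.csv
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

by_pivot = titanic.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
by_group = titanic.groupby("Pclass")["Survived"].mean()

print(by_pivot)
print(by_group)
```

pivot_table returns a DataFrame and groupby on a single column returns a Series, but the values per class are identical.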

# mean age of passengers in each class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")  # aggfunc defaults to the mean
print(passenger_age)
Output:
Pclass    Age
1       38.233441
2       29.877630
3       25.140620

# index: group by one column
# values: aggregate several columns
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc=np.sum)
print(port_stats)
Output:
Embarked       Fare  Survived                  
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217

# drop rows that contain missing values
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Cabin"])  # a row is dropped if either Age or Cabin is missing
print(new_titanic_survival)

# locate a single value by row label and column name
age_at_103 = titanic_survival.loc[103,"Age"]
pclass_at_766 = titanic_survival.loc[766,"Pclass"]
print(age_at_103)
print(pclass_at_766)

# sort_values() sorts; reset_index() renumbers the rows
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)  # ascending=False: largest first
print(new_titanic_survival[0:10])  # the index labels are still the original ones
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset_index(drop=True) assigns fresh row numbers
print(titanic_reindexed.iloc[0:10])  # iloc selects rows by position

# wrap an operation in a function, then apply it
def hundredth_row(column):   # returns the 100th item of a column
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item
hundredth_values = titanic_survival.apply(hundredth_row)  # apply() calls the function on every column
print(hundredth_values)
Output:
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object

# count the missing values in every column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
Output:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
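The per-column count of missing values can also be obtained without a custom function, because True counts as 1 when summed. A minimal sketch on a made-up frame with a few of the Titanic column names:

```python
import numpy as np
import pandas as pd

# Hypothetical rows reusing a few Titanic column names
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 38.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 8.05, 8.46],
})

# isnull() marks missing cells as True; sum() adds them up column by column
null_counts = df.isnull().sum()
print(null_counts)
```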

# convert the passenger class to a label
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)  # axis=1 makes DataFrame.apply() iterate over rows instead of columns
print(classes)

# use a custom function to label age groups, then compute the survival rate for each label
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)

titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived" ,aggfunc=np.mean)
print(age_group_survival)
Output:
age_labels  Survived         
adult       0.381032
minor       0.539823
unknown     0.293785
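Row-wise apply calls a Python function once per row, which can be slow on large tables. The labeling rule above can be sketched in vectorized form with np.select, which picks the first matching condition for each element; the ages below are made up, and this is an alternative formulation rather than the original author's method.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5.0, 40.0, np.nan, 17.9, 18.0])   # hypothetical ages

conditions = [
    ages.isnull().to_numpy(),    # checked first, so missing ages become "unknown"
    (ages < 18).to_numpy(),      # NaN < 18 is False, so NaN never lands here
]
labels = np.select(conditions, ["unknown", "minor"], default="adult")
print(labels)
```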

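The Series structure

Series (collection of values): a single row or column of a DataFrame is a Series
DataFrame (collection of Series objects): the matrix-like object returned by read_csv()
Panel (collection of DataFrame objects); note that Panel has been removed from recent pandas releases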

import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')  # read the film data into a DataFrame
series_film = fandango['FILM']  # select the "FILM" column
print(type(series_film))   # <class 'pandas.core.series.Series'>
print(series_film[0:5])    # slice by index
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

from pandas import Series # Import the Series object from pandas
film_names = series_film.values  # extract the values held by the Series
print(type(film_names))  # <class 'numpy.ndarray'>: the values of a Series are stored in an ndarray
rt_scores = series_rt.values
series_custom = Series(rt_scores , index=film_names)  # build a Series of scores indexed by film name
series_custom[['Minions (2015)', 'Leviathan (2014)']]  # label-based indexing by name works
fiveten = series_custom.iloc[5:10]  # positional indexing also works
print(fiveten)

Sorting a Series

original_index = series_custom.index.tolist()  # put the index values in a list
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)  # reindex reorders the Series to follow the sorted labels
print(sorted_by_index)

sc2 = series_custom.sort_index()  # sort by index
sc3 = series_custom.sort_values()   # sort by value
print(sc3)

The values of a Series are stored in an ndarray, the core data type of NumPy, so NumPy functions work on a Series

import numpy as np
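print(np.add(series_custom, series_custom))  # add the two Series element by element
np.sin(series_custom)  # apply sin to every value
np.max(series_custom)  # maximum value of the Series

Selecting the values of series_custom between 50 and 75
A comparison on a Series is applied to every value and returns a Series of booleans

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]  # keep the values where both conditions are True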
print(both_criteria)

Arithmetic on two Series with the same index

#data alignment same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users)/2
print(rt_mean)
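Arithmetic between two Series matches values by index label, not by position. When the indexes are identical, as above, every label has a partner; the sketch below uses two made-up Series with only one label in common to show what happens otherwise.

```python
import pandas as pd

# Hypothetical scores; only "Film B" appears in both Series
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([90, 86], index=["Film B", "Film C"])

# Labels are aligned first; labels missing on either side produce NaN
mean_scores = (critics + users) / 2
print(mean_scores)
```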

Working with a DataFrame
Setting 'FILM' as the index

fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))  #<class 'pandas.core.frame.DataFrame'>
fandango_films = fandango.set_index('FILM', drop=False)
# returns a new DataFrame indexed by 'FILM'; drop=False keeps the original FILM column

Slicing a DataFrame

# either [] or loc[] can be used to slice
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]  # a string index can be sliced too
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films[0:3]  # positional slicing still works
# selecting specific films by label
#movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']

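matplotlib, the visualization library

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a wide variety of 2D charts and some basic 3D charts.

2D charts: line charts

The most basic module in Matplotlib is pyplot. Start with the simplest point and line plots.
For more properties, see the official site: http://matplotlib.org/api/pyplot_api.html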

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

unrate = pd.read_csv('unrate.csv')
unrate['DATE'] = pd.to_datetime(unrate['DATE'])  # pd.to_datetime normalizes the date format

first_twelve = unrate[0:12]  # first 12 rows (positions 0 through 11)
plt.plot(first_twelve['DATE'], first_twelve['VALUE'])  # plot(x values, y values) draws the line
plt.xticks(rotation=45)  # rotate the x-axis tick labels
plt.xlabel('Month')  # x-axis label
plt.ylabel('Unemployment Rate')  # y-axis label
plt.title('Monthly Unemployment Trends, 1948')  # chart title
plt.show()  # show() displays the figure

Subplots

Add a subplot with add_subplot(first, second, index)
first is the number of rows, second the number of columns, and index the position of the subplot in the grid.

import matplotlib.pyplot as plt
fig = plt.figure() #Creates a new figure.
ax1 = fig.add_subplot(3,2,1)  # position 1 of a 3x2 grid
ax2 = fig.add_subplot(3,2,2)  # position 2 of a 3x2 grid
ax3 = fig.add_subplot(3,2,6)  # position 6 of a 3x2 grid
plt.show()

import numpy as np
#fig = plt.figure()
fig = plt.figure(figsize=(3, 6))  # figure size as (width, height) in inches
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)

ax1.plot(np.random.randint(1,5,5), np.arange(5))  # draw on the first subplot
ax2.plot(np.arange(10)*3, np.arange(10))  # draw on the second subplot
plt.show()
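A figure and a grid of axes can also be created in one call with plt.subplots. The sketch below is an alternative to the add_subplot calls, with made-up data; the Agg backend is selected only so the example runs without a display, and savefig or plt.show() would follow in real use.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# One call creates the figure and a 2-row, 1-column array of axes
fig, axes = plt.subplots(2, 1, figsize=(3, 6))

x = np.arange(10)
axes[0].plot(x, x)         # first subplot
axes[1].plot(x, x ** 2)    # second subplot
fig.tight_layout()         # avoid overlapping labels between the subplots
```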

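Drawing two lines on the same chart (call plot twice)

unrate['MONTH'] = unrate['DATE'].dt.month  # assumption: MONTH is the month number derived from DATE
fig = plt.figure(figsize=(6,3))
plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red')
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'], c='blue')
plt.show()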

Labeling the plotted lines

fig = plt.figure(figsize=(10,6))
colors = ['red', 'blue', 'green', 'orange', 'black']
for i in range(5):
    start_index = i*12
    end_index = (i+1)*12
    subset = unrate[start_index:end_index]
    label = str(1948 + i)  # legend label
    plt.plot(subset['MONTH'], subset['VALUE'], c=colors[i], label=label)  # x values, y values, color, label
plt.legend(loc='upper left')  # loc sets the position of the legend box: 'best', 'upper right', 'lower left', etc.; see help(plt.legend)
plt.xlabel('Month, Integer')
plt.ylabel('Unemployment Rate, Percent')
plt.title('Monthly Unemployment Trends, 1948-1952')

plt.show()

2D charts: bar charts and scatter plots

Bar chart

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
reviews = pd.read_csv('fandango_scores.csv')  # read the film ratings table
cols = ['FILM', 'RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
norm_reviews = reviews[cols]
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews.loc[0, num_cols].values  # bar heights (.ix was removed from pandas; .loc selects by label)
bar_positions = np.arange(5) + 0.75  # distance of each bar from the left edge
tick_positions = range(1,6)  # x-axis tick positions [1,2,3,4,5]
fig, ax = plt.subplots()

ax.bar(bar_positions, bar_heights, 0.5)  # bar chart: distance from the left, bar height, bar width
ax.set_xticks(tick_positions)  # x-axis tick positions
ax.set_xticklabels(num_cols, rotation=45)  # x-axis tick labels

ax.set_xlabel('Rating Source')
ax.set_ylabel('Average Rating')
ax.set_title('Average User Rating For Avengers: Age of Ultron (2015)')
plt.show()
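The bar chart can be tried without the CSV file. The sketch below uses made-up heights for three of the rating columns and puts the ticks at the same positions as the bars, so each label sits under its bar; the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

labels = ["RT_user_norm", "IMDB_norm", "Fandango_Stars"]
heights = [4.3, 3.9, 5.0]            # made-up values for illustration only
positions = np.arange(len(labels))   # one x position per bar

fig, ax = plt.subplots()
bars = ax.bar(positions, heights, 0.5)    # x positions, heights, bar width
ax.set_xticks(positions)                  # ticks at the bar positions
ax.set_xticklabels(labels, rotation=45)   # one label per tick
ax.set_ylabel("Average Rating")
```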

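Scatter plot

fig, ax = plt.subplots()  # fig controls the figure as a whole (for example its size); ax does the actual drawing
ax.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])  # scatter(x values, y values) draws a scatter plot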
ax.set_xlabel('Fandango')
ax.set_ylabel('Rotten Tomatoes')
plt.show()

Scatter plot subplots

fig = plt.figure(figsize=(8,3))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])
ax1.set_xlabel('Fandango')
ax1.set_ylabel('Rotten Tomatoes')
ax2.scatter(norm_reviews['RT_user_norm'], norm_reviews['Fandango_Ratingvalue'])
ax2.set_xlabel('Rotten Tomatoes')
ax2.set_ylabel('Fandango')
plt.show()
