pandas 0.24.1 Documentation, 3.3 Essential Basic Functionality (Part 3)

Author: Lykit01 | Published 2019-04-09 23:33


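Contents:
1 What's new in 0.24.1
2 Installation
3 Getting started
3.1 pandas overview
3.2 10 minutes to pandas
3.3 Essential basic functionality (Part 1)
3.3 Essential basic functionality (Part 2)
3.3 Essential basic functionality (Part 3)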

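3.3.11 Sorting

pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.

3.3.11.1 By index

The Series.sort_index() and DataFrame.sort_index() methods sort a pandas object by its index labels.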

In [295]: df = pd.DataFrame({
   .....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   .....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   .....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   .....: 

In [296]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
   .....:                          columns=['three', 'two', 'one'])
   .....: 

In [297]: unsorted_df
Out[297]: 
      three       two       one
a       NaN -0.867293  0.050162
d  1.215473 -0.051744       NaN
c -0.421091 -0.712097  0.953102
b  1.205223  0.632624 -1.534113

# DataFrame
In [298]: unsorted_df.sort_index()
Out[298]: 
      three       two       one
a       NaN -0.867293  0.050162
b  1.205223  0.632624 -1.534113
c -0.421091 -0.712097  0.953102
d  1.215473 -0.051744       NaN

In [299]: unsorted_df.sort_index(ascending=False)
Out[299]: 
      three       two       one
d  1.215473 -0.051744       NaN
c -0.421091 -0.712097  0.953102
b  1.205223  0.632624 -1.534113
a       NaN -0.867293  0.050162

In [300]: unsorted_df.sort_index(axis=1)
Out[300]: 
        one     three       two
a  0.050162       NaN -0.867293
d       NaN  1.215473 -0.051744
c  0.953102 -0.421091 -0.712097
b -1.534113  1.205223  0.632624

# Series
In [301]: unsorted_df['three'].sort_index()
Out[301]: 
a         NaN
b    1.205223
c   -0.421091
d    1.215473
Name: three, dtype: float64
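
Because the transcript above uses random data, here is a small deterministic sketch of the same sort_index() calls. The frame and its labels are made up for illustration; only the sort_index() parameters come from the text.

```python
import pandas as pd

# Small frame with a deliberately shuffled index and column order
# (illustrative data, not from the original session).
df = pd.DataFrame(
    {"two": [4, 2, 3, 1], "one": [40, 20, 30, 10]},
    index=["d", "b", "c", "a"],
)

by_rows = df.sort_index()                       # rows ordered a, b, c, d
by_rows_desc = df.sort_index(ascending=False)   # rows ordered d, c, b, a
by_cols = df.sort_index(axis=1)                 # columns ordered one, two

print(by_rows)
print(by_cols.columns.tolist())
```

Each call returns a new object, so df keeps its original order.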

3.3.11.2 By values

The Series.sort_values() method sorts a Series by its values. The DataFrame.sort_values() method sorts a DataFrame by its column or row values. Its by parameter specifies one or more columns to use as the sort keys.

In [302]: df1 = pd.DataFrame({'one': [2, 1, 1, 1],
   .....:                     'two': [1, 3, 2, 4],
   .....:                     'three': [5, 4, 3, 2]})
   .....: 

In [303]: df1.sort_values(by='two')
Out[303]: 
   one  two  three
0    2    1      5
2    1    2      3
1    1    3      4
3    1    4      2

The by parameter can take a list of column names, e.g.:

In [304]: df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])
Out[304]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5

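These methods have special treatment of NA values via the na_position argument (the Series s used below is not defined in this part; as its output shows, it holds strings):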

In [305]: s[2] = np.nan

In [306]: s.sort_values()
Out[306]: 
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2     NaN
5     NaN
dtype: object

In [307]: s.sort_values(na_position='first')
Out[307]: 
2     NaN
5     NaN
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: object
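
The following is a minimal, self-contained sketch of the by and na_position parameters described above. The data and variable names are illustrative assumptions; the per-key ascending list is a standard sort_values() option that the text does not show.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the session above.
df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [1.0, 4.0, 3.0, 2.0],
})

# Sort by group ascending, then by score descending within each group.
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])

# For a single Series, choose where missing values end up.
s = pd.Series([2.0, np.nan, 1.0])
nan_last = s.sort_values()                       # default: NaN at the end
nan_first = s.sort_values(na_position="first")   # NaN at the start

print(ordered)
print(nan_first)
```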

3.3.11.3 By index and values

New in version 0.23.0.
Strings passed as the by parameter to DataFrame.sort_values() may refer to either column names or index level names.

# Build MultiIndex
In [308]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
   .....:                                 ('b', 2), ('b', 1), ('b', 1)])
   .....: 

In [309]: idx.names = ['first', 'second']

# Build DataFrame
In [310]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
   .....:                         index=idx)
   .....: 

In [311]: df_multi
Out[311]: 
              A
first second   
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1

Sort by 'second' (an index level) and 'A' (a column):

In [312]: df_multi.sort_values(by=['second', 'A'])
Out[312]: 
              A
first second   
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5

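Note: If a string matches both a column name and an index level name, pandas issues a warning and the column takes precedence. In a future version this will raise an ambiguity error.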
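
As a rough sketch of what mixing index levels and columns in by amounts to, the example below compares it with a more explicit formulation that first turns the index levels into columns. The data is illustrative, and the reset_index()/set_index() round trip is just one possible equivalent, not something the original text prescribes.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 2), ("b", 1)], names=["first", "second"]
)
df = pd.DataFrame({"A": [4, 3, 2, 1]}, index=idx)

# Mixed keys: "second" is an index level, "A" is a column.
mixed = df.sort_values(by=["second", "A"])

# Explicit variant: make the index levels ordinary columns, sort, restore.
explicit = (
    df.reset_index()
      .sort_values(by=["second", "A"])
      .set_index(["first", "second"])
)

print(mixed)
```

The explicit variant also sidesteps the name-clash warning mentioned in the note, because every key is unambiguously a column at the time of sorting.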

3.3.11.4 searchsorted

Series has a searchsorted() method, which works similarly to numpy.ndarray.searchsorted().

In [313]: ser = pd.Series([1, 2, 3])

In [314]: ser.searchsorted([0, 3])
Out[314]: array([0, 2])

In [315]: ser.searchsorted([0, 4])
Out[315]: array([0, 3])

In [316]: ser.searchsorted([1, 3], side='right')
Out[316]: array([1, 3])

In [317]: ser.searchsorted([1, 3], side='left')
Out[317]: array([0, 2])

In [318]: ser = pd.Series([3, 1, 2])

In [319]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[319]: array([0, 2])

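The call takes the form numpy.searchsorted(a, v) or a.searchsorted(v): for each value in v it returns the position in the sorted array a at which that value would have to be inserted to keep a sorted. With the default side='left' the first suitable position is returned; with side='right' the last one is. If a itself is not sorted, pass sorter with the indices that would sort it, as in the last example above.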
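
A short sketch of how side changes the result when the searched value already exists in the Series. The name edges and its values are made up for illustration.

```python
import pandas as pd

edges = pd.Series([10, 20, 30])   # must already be sorted

# side="left" (default): insertion point before any equal entries.
left = edges.searchsorted([20, 25])
# side="right": insertion point after any equal entries.
right = edges.searchsorted([20, 25], side="right")

print(left, right)
```

The two results differ only for 20, which is present in edges; 25 falls strictly between two entries, so both sides agree.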

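3.3.11.5 Smallest / largest values

Series has the nsmallest() and nlargest() methods, which return the n smallest or largest values. For a very large Series this can be much faster than sorting the whole Series and calling head(n) on the result.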

In [320]: s = pd.Series(np.random.permutation(10))

In [321]: s
Out[321]: 
0    5
1    3
2    2
3    0
4    7
5    6
6    9
7    1
8    4
9    8
dtype: int64

In [322]: s.sort_values()
Out[322]: 
3    0
7    1
2    2
1    3
8    4
0    5
5    6
4    7
9    8
6    9
dtype: int64

In [323]: s.nsmallest(3)
Out[323]: 
3    0
7    1
2    2
dtype: int64

In [324]: s.nlargest(3)
Out[324]: 
6    9
9    8
4    7
dtype: int64

DataFrame also has the nlargest and nsmallest methods.

In [325]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
   .....:                    'b': list('abdceff'),
   .....:                    'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
   .....: 

In [326]: df.nlargest(3, 'a')
Out[326]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN

In [327]: df.nlargest(5, ['a', 'c'])
Out[327]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN
2   1  d  4.0
6  -1  f  4.0

In [328]: df.nsmallest(3, 'a')
Out[328]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0

In [329]: df.nsmallest(5, ['a', 'c'])
Out[329]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0
2  1  d  4.0
4  8  e  NaN
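
To make the relationship with sorting concrete, this sketch checks that nlargest(n) gives the same result as sorting in descending order and taking head(n). The seed and size are arbitrary choices for reproducibility, and the values are unique so ties do not matter.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)          # fixed seed, only for reproducibility
s = pd.Series(rng.permutation(1000))    # unique values 0..999 in random order

top = s.nlargest(3)
via_sort = s.sort_values(ascending=False).head(3)
bottom = s.nsmallest(3)

print(top.tolist(), bottom.tolist())
```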

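3.3.11.6 Sorting by a MultiIndex column

When the columns are a MultiIndex, you must be explicit when sorting by column values and fully specify every level of the column label passed to by, down to the lowest level.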

In [330]: df1.columns = pd.MultiIndex.from_tuples([('a', 'one'),
   .....:                                          ('a', 'two'),
   .....:                                          ('b', 'three')])
   .....: 

In [331]: df1.sort_values(by=('a', 'two'))
Out[331]: 
    a         b
  one two three
0   2   1     5
2   1   2     3
1   1   3     4
3   1   4     2
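
The sketch below rebuilds the same frame and additionally sorts by two fully specified keys. Passing a list of tuples to by is an assumption that follows from how by accepts a list of column labels; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]})
df.columns = pd.MultiIndex.from_tuples([("a", "one"), ("a", "two"), ("b", "three")])

# A single key is the complete column label, one entry per level.
by_two = df.sort_values(by=("a", "two"))

# Several fully specified keys go in a list of tuples.
by_one_then_three = df.sort_values(by=[("a", "one"), ("b", "three")])

print(by_one_then_three)
```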

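3.3.12 Copying

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a handful of ways to alter a DataFrame in place:

Inserting, deleting, or modifying a column.
Assigning to the index or columns attributes.
For homogeneous data, directly modifying the values via the values attribute or advanced indexing.

To be clear, no pandas method has the side effect of modifying your data. Almost every method returns a new object and leaves the original untouched. If data is modified, it is because you did so explicitly.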
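
A minimal sketch of the distinction drawn above, using made-up data: an ordinary operation returns a new object, copy() takes an independent snapshot, and only the explicit column assignment changes the original frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

doubled = df * 2         # returns a new object; df is untouched
snapshot = df.copy()     # independent copy of the underlying data
df["a"] = [10, 20, 30]   # explicit, user-requested change to a column

print(doubled["a"].tolist(), snapshot["a"].tolist(), df["a"].tolist())
```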

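3.3.13 dtypes

For the most part, pandas uses NumPy arrays and dtypes for a Series or for the individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns]. (Note that NumPy does not support timezone-aware datetimes.)
pandas and third-party libraries extend NumPy's type system in a few places. This section describes the extensions pandas makes internally. To write your own extension that works with pandas, see Extension Types. For extensions contributed by third parties, see Extension Data Types.
The following table lists all of the pandas extension types. See the documentation of each type for details.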

Kind of data	Data type	Scalar	Array	Documentation
tz-aware datetime	DatetimeTZDtype	Timestamp	arrays.DatetimeArray	Time zone handling
Categorical	CategoricalDtype	(none)	Categorical	Categorical Data
Period (time spans)	PeriodDtype	Period	arrays.PeriodArray	Time span representation
Sparse	SparseDtype	(none)	arrays.SparseArray	Sparse data structures
Intervals	IntervalDtype	Interval	arrays.IntervalArray	IntervalIndex
Nullable integer	Int64Dtype, etc.	(none)	arrays.IntegerArray	Nullable Integer Data Type

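pandas uses the object dtype to store strings.
Finally, arbitrary objects may be stored using the object dtype, but this should be avoided as far as possible (for performance and for interoperability with other libraries and methods; see Object conversion).
The dtypes attribute of a DataFrame conveniently returns a Series holding the data type of each column.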

In [332]: dft = pd.DataFrame({'A': np.random.rand(3),
   .....:                     'B': 1,
   .....:                     'C': 'foo',
   .....:                     'D': pd.Timestamp('20010102'),
   .....:                     'E': pd.Series([1.0] * 3).astype('float32'),
   .....:                     'F': False,
   .....:                     'G': pd.Series([1] * 3, dtype='int8')})
   .....: 

In [333]: dft
Out[333]: 
          A  B    C          D    E      F  G
0  0.278831  1  foo 2001-01-02  1.0  False  1
1  0.242124  1  foo 2001-01-02  1.0  False  1
2  0.078031  1  foo 2001-01-02  1.0  False  1

In [334]: dft.dtypes
Out[334]: 
A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a Series object, use the dtype attribute.

In [335]: dft['A'].dtype
Out[335]: dtype('float64')

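If a single column of a pandas object contains data of multiple dtypes, the dtype of the column is chosen to accommodate all of the data types (object is the most general).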

# these ints are coerced to floats
In [336]: pd.Series([1, 2, 3, 4, 5, 6.])
Out[336]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

# string data forces an ``object`` dtype
In [337]: pd.Series([1, 2, 3, 6., 'foo'])
Out[337]: 
0      1
1      2
2      3
3      6
4    foo
dtype: object

The number of columns of each dtype in a DataFrame can be found by calling get_dtype_counts().

In [338]: dft.get_dtype_counts()
Out[338]: 
float64           1
float32           1
int64             1
int8              1
datetime64[ns]    1
bool              1
object            1
dtype: int64
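
get_dtype_counts() belongs to the 0.24 API described here; to my knowledge it was deprecated in later pandas releases. A sketch of an equivalent count that relies only on the dtypes attribute and value_counts() is shown below; the frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [0.5, 1.5, 2.5],
    "z": [1.0, 2.0, 3.0],
    "flag": [True, False, True],
})

# Count columns per dtype; astype(str) turns the dtype objects into plain names.
counts = df.dtypes.astype(str).value_counts()

print(counts)
```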

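Numeric dtypes propagate and can coexist in a DataFrame. If a dtype is passed (either directly via the dtype keyword, via a passed ndarray, or via a passed Series), it is preserved in DataFrame operations. Furthermore, different numeric dtypes are not combined. The following examples give you a taste.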

In [339]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [340]: df1
Out[340]: 
          A
0 -1.641339
1 -0.314062
2 -0.679206
3  1.178243
4  0.181790
5 -2.044248
6  1.151282
7 -1.641398

In [341]: df1.dtypes
Out[341]: 
A    float32
dtype: object

In [342]: df2 = pd.DataFrame({'A': pd.Series(np.random.randn(8), dtype='float16'),
   .....:                     'B': pd.Series(np.random.randn(8)),
   .....:                     'C': pd.Series(np.array(np.random.randn(8),
   .....:                                             dtype='uint8'))})
   .....: 

In [343]: df2
Out[343]: 
          A         B    C
0  0.130737 -1.143729    1
1  0.289551  2.787500    0
2  0.590820 -0.708143  254
3 -0.020142 -1.512388    0
4 -1.048828 -0.243145    1
5 -0.808105 -0.650992    0
6  1.373047  2.090108    0
7 -0.254395  0.433098    0

In [344]: df2.dtypes
Out[344]: 
A    float16
B    float64
C      uint8
dtype: object

3.3.13.1 Defaults

By default, integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The following all result in int64 dtypes.

In [345]: pd.DataFrame([1, 2], columns=['a']).dtypes
Out[345]: 
a    int64
dtype: object

In [346]: pd.DataFrame({'a': [1, 2]}).dtypes
Out[346]: 
a    int64
dtype: object

In [347]: pd.DataFrame({'a': 1}, index=list(range(2))).dtypes
Out[347]: 
a    int64
dtype: object

Note that NumPy chooses platform-dependent types when creating arrays. The following results in int32 on a 32-bit platform.

In [348]: frame = pd.DataFrame(np.array([1, 2]))
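
If the platform dependence matters, one option is to state the dtype when the array is created. This is a small sketch under that assumption; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Let NumPy pick its platform default integer type.
default_frame = pd.DataFrame(np.array([1, 2]))

# Spell the dtype out so the result does not depend on the platform.
fixed_frame = pd.DataFrame(np.array([1, 2], dtype="int64"))
small_frame = pd.DataFrame(np.array([1, 2], dtype="int32"))

print(default_frame.dtypes.iloc[0], fixed_frame.dtypes.iloc[0], small_frame.dtypes.iloc[0])
```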

3.3.13.2 Upcasting

Types can be upcast when combined with other types, meaning they are promoted from the current type to a more general one (e.g. int to float).

In [349]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [350]: df3
Out[350]: 
          A         B      C
0 -1.510602 -1.143729    1.0
1 -0.024511  2.787500    0.0
2 -0.088385 -0.708143  254.0
3  1.158101 -1.512388    0.0
4 -0.867039 -0.243145    1.0
5 -2.852354 -0.650992    0.0
6  2.524329  2.090108    0.0
7 -1.895793  0.433098    0.0

In [351]: df3.dtypes
Out[351]: 
A    float32
B    float64
C    float64
dtype: object

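DataFrame.to_numpy() returns the lowest common denominator of the dtypes, meaning the dtype that can accommodate all of the types in the resulting homogeneous NumPy array. This can force some upcasting.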

In [352]: df3.to_numpy().dtype
Out[352]: dtype('float64')
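
A deterministic sketch of the same idea with made-up columns: integer and float columns share float64, while adding a text column leaves object as the only dtype that can hold everything.

```python
import pandas as pd

nums = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = nums.assign(label=["a", "b"])

nums_arr = nums.to_numpy()     # int64 and float64 columns share float64
mixed_arr = mixed.to_numpy()   # a text column forces the object dtype
total = nums["i"] + nums["f"]  # int + float is upcast to float

print(nums_arr.dtype, mixed_arr.dtype, total.dtype)
```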

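3.3.13.3 astype

You can use the astype() method to convert dtypes explicitly. By default it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior). In addition, an exception is raised if the astype operation is invalid.
Upcasting follows the NumPy rules: if an operation involves two different dtypes, the more general one is used as the result of the operation.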

In [353]: df3
Out[353]: 
          A         B      C
0 -1.510602 -1.143729    1.0
1 -0.024511  2.787500    0.0
2 -0.088385 -0.708143  254.0
3  1.158101 -1.512388    0.0
4 -0.867039 -0.243145    1.0
5 -2.852354 -0.650992    0.0
6  2.524329  2.090108    0.0
7 -1.895793  0.433098    0.0

In [354]: df3.dtypes
Out[354]: 
A    float32
B    float64
C    float64
dtype: object

# conversion of dtypes
In [355]: df3.astype('float32').dtypes
Out[355]: 
A    float32
B    float32
C    float32
dtype: object

Use astype() to convert a subset of columns to a specified type.

In [356]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [357]: dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

In [358]: dft
Out[358]: 
   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9

In [359]: dft.dtypes
Out[359]: 
a    uint8
b    uint8
c    int64
dtype: object

New in version 0.19.0.
Pass a dict to astype() to convert particular columns to particular dtypes.

In [360]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [361]: dft1 = dft1.astype({'a': np.bool, 'c': np.float64})

In [362]: dft1
Out[362]: 
       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0

In [363]: dft1.dtypes
Out[363]: 
a       bool
b      int64
c    float64
dtype: object

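Note: When converting a subset of columns to a specified type using astype() together with .loc, upcasting can occur. .loc tries to fit what is being assigned into the existing dtypes, whereas [] overwrites the columns and takes the dtype from the right-hand side. The following code therefore produces an unintended result.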

In [364]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [365]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
Out[365]: 
a    uint8
b    uint8
dtype: object

In [366]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [367]: dft.dtypes
Out[367]: 
a    int64
b    int64
c    int64
dtype: object
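
Two ways to get the intended uint8 columns, both already described earlier in this section, are sketched here side by side: the dict form of astype() and assignment through []. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Option 1: a dict passed to astype() names the target dtype per column.
via_dict = dft.astype({"a": np.uint8, "b": np.uint8})

# Option 2: assign through [] so the right-hand side dtype replaces the columns.
via_brackets = dft.copy()
via_brackets[["a", "b"]] = via_brackets[["a", "b"]].astype(np.uint8)

print(via_dict.dtypes)
```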

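3.3.13.4 Object conversion

pandas offers various functions for converting data away from the object dtype. When the data is already of the correct type but is stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft-convert it to the correct type.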

In [368]: import datetime

In [369]: df = pd.DataFrame([[1, 2],
   .....:                    ['a', 'b'],
   .....:                    [datetime.datetime(2016, 3, 2),
   .....:                     datetime.datetime(2016, 3, 2)]])
   .....: 

In [370]: df = df.T

In [371]: df
Out[371]: 
   0  1                    2
0  1  a  2016-03-02 00:00:00
1  2  b  2016-03-02 00:00:00

In [372]: df.dtypes
Out[372]: 
0    object
1    object
2    object
dtype: object

.T swaps rows and columns. Here is the data before transposing:

	0	1
0	1	2
1	a	b
2	2016-03-02 00:00:00	2016-03-02 00:00:00

Its dtypes:

0    object
1    object
dtype: object

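Before transposing, every column mixed several kinds of values and was therefore stored as object. Transposing keeps that object dtype even though each column now holds a single kind of value, and infer_objects() corrects this by inferring the proper dtype for each column.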

In [373]: df.infer_objects().dtypes
Out[373]: 
0             int64
1            object
2    datetime64[ns]
dtype: object
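
A smaller sketch of the same soft conversion, using columns that are forced to object at construction time (an illustrative setup, not the one from the session above).

```python
import pandas as pd

# Columns created with dtype=object even though the values are uniform.
raw = pd.DataFrame({
    "n": pd.Series([1, 2, 3], dtype=object),
    "x": pd.Series([1.5, 2.5, 3.5], dtype=object),
})

inferred = raw.infer_objects()

print(raw.dtypes.tolist(), inferred.dtypes.tolist())
```

Note that infer_objects() only recognizes values that already have the right Python type; it does not parse strings such as '1', which is what the conversion functions in the rest of this section are for.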

The following functions perform hard conversion of one-dimensional object arrays or scalars to a specified type:

to_numeric() (conversion to numeric dtypes)

In [374]: m = ['1.1', 2, 3]

In [375]: pd.to_numeric(m)
Out[375]: array([ 1.1,  2. ,  3. ])

to_datetime() (conversion to datetime objects)

In [376]: import datetime

In [377]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

In [378]: pd.to_datetime(m)
Out[378]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

to_timedelta() (conversion to timedelta objects)

In [379]: m = ['5us', pd.Timedelta('1day')]

In [380]: pd.to_timedelta(m)
Out[380]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

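To force a conversion, pass the errors argument, which tells pandas how to deal with elements that cannot be converted to the desired dtype. By default errors='raise', meaning that any error encountered during conversion is raised. With errors='coerce', these errors are ignored and pandas converts the problematic elements to pd.NaT (for datetime and timedelta targets) or np.nan (for numeric targets). This is useful when the data is mostly of the desired type (numeric, datetime, and so on) with a few non-conforming elements mixed in that you want to treat as missing.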

In [381]: import datetime

In [382]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [383]: pd.to_datetime(m, errors='coerce')
Out[383]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

In [384]: m = ['apple', 2, 3]

In [385]: pd.to_numeric(m, errors='coerce')
Out[385]: array([ nan,   2.,   3.])

In [386]: m = ['apple', pd.Timedelta('1day')]

In [387]: pd.to_timedelta(m, errors='coerce')
Out[387]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
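
One way this is often used in practice is to coerce first and then inspect which inputs failed. The sketch below assumes a made-up column of text values; the names raw, numbers, bad_rows and clean are illustrative.

```python
import pandas as pd

# Text input stored as object, with one entry that is not a number.
raw = pd.Series(["1.5", "2", "n/a", "4"], dtype=object)

numbers = pd.to_numeric(raw, errors="coerce")   # unparseable entries become NaN
bad_rows = raw[numbers.isna()]                  # inspect what failed to parse
clean = numbers.dropna()                        # keep only the parsed values

print(bad_rows.tolist(), clean.tolist())
```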

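The errors parameter has a third option, errors='ignore', which simply returns the passed-in data if any error is encountered while converting to the desired data type: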

In [388]: import datetime

In [389]: m = ['apple', datetime.datetime(2016, 3, 2)]

In [390]: pd.to_datetime(m, errors='ignore')
Out[390]: Index(['apple', 2016-03-02 00:00:00], dtype='object')

In [391]: m = ['apple', 2, 3]

In [392]: pd.to_numeric(m, errors='ignore')
Out[392]: array(['apple', 2, 3], dtype=object)

In [393]: m = ['apple', pd.Timedelta('1day')]

In [394]: pd.to_timedelta(m, errors='ignore')
Out[394]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

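In addition to object conversion, to_numeric() provides a downcast argument, which gives the option of downcasting newly converted (or already numeric) data to a smaller dtype. This can save a lot of memory.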

In [395]: m = ['1', 2, 3]

In [396]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
Out[396]: array([1, 2, 3], dtype=int8)

In [397]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
Out[397]: array([1, 2, 3], dtype=int8)

In [398]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
Out[398]: array([1, 2, 3], dtype=uint8)

In [399]: pd.to_numeric(m, downcast='float')     # smallest float dtype
Out[399]: array([ 1.,  2.,  3.], dtype=float32)
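
The chosen dtype depends on the values, not just on the downcast option. In this sketch (illustrative data) the value 300 does not fit into 8 bits, so the smallest usable integer types are 16 bits wide.

```python
import pandas as pd

s = pd.Series([1, 2, 300])       # int64 by default
f = pd.Series([1.0, 2.5])        # float64 by default

as_int = pd.to_numeric(s, downcast="integer")     # smallest signed type that fits
as_uint = pd.to_numeric(s, downcast="unsigned")   # smallest unsigned type that fits
as_float = pd.to_numeric(f, downcast="float")     # float32 is the smallest float used

# Bytes used by the values alone, before and after downcasting.
print(s.memory_usage(index=False), as_int.memory_usage(index=False))
```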

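These functions only work on one-dimensional arrays, lists or scalars, so they cannot be used directly on multi-dimensional objects such as a DataFrame. With apply(), however, we can apply the function to each column: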

In [400]: import datetime

In [401]: df = pd.DataFrame([
   .....:     ['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')
   .....: 

In [402]: df
Out[402]: 
            0                    1
0  2016-07-09  2016-03-02 00:00:00
1  2016-07-09  2016-03-02 00:00:00

In [403]: df.apply(pd.to_datetime)
Out[403]: 
           0          1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02

In [404]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

In [405]: df
Out[405]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [406]: df.apply(pd.to_numeric)
Out[406]: 
     0  1  2
0  1.1  2  3
1  1.1  2  3

In [407]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [408]: df
Out[408]: 
     0                1
0  5us  1 days 00:00:00
1  5us  1 days 00:00:00

In [409]: df.apply(pd.to_timedelta)
Out[409]: 
                0      1
0 00:00:00.000005 1 days
1 00:00:00.000005 1 days
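
apply() forwards extra keyword arguments to the function it calls, so the errors option discussed earlier can be combined with column-wise conversion. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2", "x"], "b": ["4.5", "oops", "6"]}, dtype=object)

# errors="coerce" is passed through apply() to pd.to_numeric for every column.
converted = df.apply(pd.to_numeric, errors="coerce")

print(converted)
```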

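3.3.13.5 Gotchas

Performing selection operations on integer data can easily upcast it to floating point. The dtype of the input data is preserved only in cases where no missing values are introduced. See also Support for integer NA.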

In [410]: dfi = df3.astype('int32')

In [411]: dfi['E'] = 1

In [412]: dfi
Out[412]: 
   A  B    C  E
0 -1 -1    1  1
1  0  2    0  1
2  0  0  254  1
3  1 -1    0  1
4  0  0    1  1
5 -2  0    0  1
6  2  2    0  1
7 -1  0    0  1

In [413]: dfi.dtypes
Out[413]: 
A    int32
B    int32
C    int32
E    int64
dtype: object

In [414]: casted = dfi[dfi > 0]

In [415]: casted
Out[415]: 
     A    B      C  E
0  NaN  NaN    1.0  1
1  NaN  2.0    NaN  1
2  NaN  NaN  254.0  1
3  1.0  NaN    NaN  1
4  NaN  NaN    1.0  1
5  NaN  NaN    NaN  1
6  2.0  2.0    NaN  1
7  NaN  NaN    NaN  1

In [416]: casted.dtypes
Out[416]: 
A    float64
B    float64
C    float64
E      int64
dtype: object

Float dtypes, however, are unchanged.

In [417]: dfa = df3.copy()

In [418]: dfa['A'] = dfa['A'].astype('float32')

In [419]: dfa.dtypes
Out[419]: 
A    float32
B    float64
C    float64
dtype: object

In [420]: casted = dfa[df2 > 0]

In [421]: casted
Out[421]: 
          A         B      C
0 -1.510602       NaN    1.0
1 -0.024511  2.787500    NaN
2 -0.088385       NaN  254.0
3       NaN       NaN    NaN
4       NaN       NaN    1.0
5       NaN       NaN    NaN
6  2.524329  2.090108    NaN
7       NaN  0.433098    NaN

In [422]: casted.dtypes
Out[422]: 
A    float32
B    float64
C    float64
dtype: object
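
The nullable integer type listed in the extension table earlier (Int64Dtype) is one way to avoid the upcast. The sketch below uses reindex() rather than boolean selection to introduce a missing value; that substitution, like the data, is an assumption made to keep the example short.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])     # plain int64

# Reindexing to a label that does not exist introduces a missing value.
as_float = ints.reindex([0, 1, 5])                      # upcast to float64 to hold NaN
as_nullable = ints.astype("Int64").reindex([0, 1, 5])   # stays integer, holds a missing marker

print(as_float.dtype, as_nullable.dtype)
```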

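3.3.14 Selecting columns based on dtype

The select_dtypes() method selects a subset of columns based on their dtype.
First, let's create a DataFrame with columns of many different dtypes: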

In [423]: df = pd.DataFrame({'string': list('abc'),
   .....:                    'int64': list(range(1, 4)),
   .....:                    'uint8': np.arange(3, 6).astype('u1'),
   .....:                    'float64': np.arange(4.0, 7.0),
   .....:                    'bool1': [True, False, True],
   .....:                    'bool2': [False, True, False],
   .....:                    'dates': pd.date_range('now', periods=3),
   .....:                    'category': pd.Series(list("ABC")).astype('category')})
   .....: 

In [424]: df['tdeltas'] = df.dates.diff()

In [425]: df['uint64'] = np.arange(3, 6).astype('u8')

In [426]: df['other_dates'] = pd.date_range('20130101', periods=3)

In [427]: df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [428]: df
Out[428]: 
  string  int64  uint8  float64  bool1  bool2                      dates category tdeltas  uint64 other_dates            tz_aware_dates
0      a      1      3      4.0   True  False 2019-03-12 22:38:38.692567        A     NaT       3  2013-01-01 2013-01-01 00:00:00-05:00
1      b      2      4      5.0  False   True 2019-03-13 22:38:38.692567        B  1 days       4  2013-01-02 2013-01-02 00:00:00-05:00
2      c      3      5      6.0   True  False 2019-03-14 22:38:38.692567        C  1 days       5  2013-01-03 2013-01-03 00:00:00-05:00

And the dtypes:

In [429]: df.dtypes
Out[429]: 
string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

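select_dtypes() has two parameters, include and exclude: include selects the columns that have the listed dtypes, and exclude selects the columns that do not have them.
For example, to select bool columns: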

In [430]: df.select_dtypes(include=[bool])
Out[430]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

You can also pass the name of a dtype in the NumPy dtype hierarchy as a string:

In [431]: df.select_dtypes(include=['bool'])
Out[431]: 
   bool1  bool2
0   True  False
1  False   True
2   True  False

select_dtypes() also works with generic dtypes.
For example, to select all numeric and boolean columns while excluding unsigned integers:

In [432]: df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])
Out[432]: 
   int64  float64  bool1  bool2 tdeltas
0      1      4.0   True  False     NaT
1      2      5.0  False   True  1 days
2      3      6.0   True  False  1 days

To select string columns you must use the object dtype:

In [433]: df.select_dtypes(include=['object'])
Out[433]: 
  string
0      a
1      b
2      c
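
Since the frame above contains the current timestamp, here is a smaller deterministic sketch of include and exclude. The column names are illustrative, and 'floating' is assumed to be accepted in the same way as the 'number' and 'unsignedinteger' names used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i": np.array([1, 2], dtype="int64"),
    "u": np.array([1, 2], dtype="uint8"),
    "f": [0.5, 1.5],
    "flag": [True, False],
})

signed_numbers = df.select_dtypes(include=["number"], exclude=["unsignedinteger"])
only_bool = df.select_dtypes(include=[bool])
not_float = df.select_dtypes(exclude=["floating"])

print(signed_numbers.columns.tolist())
```

Boolean columns are not part of 'number', which is why the earlier example lists 'bool' explicitly.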

To see all the child dtypes of a generic dtype such as numpy.number, you can define a function that returns a tree of child dtypes.

In [434]: def subdtypes(dtype):
   .....:     subs = dtype.__subclasses__()
   .....:     if not subs:
   .....:         return dtype
   .....:     return [dtype, [subdtypes(dt) for dt in subs]]
   .....:

All NumPy dtypes are subclasses of numpy.generic:

In [435]: subdtypes(np.generic)
Out[435]: 
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]
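
For a single yes/no question about this hierarchy, numpy.issubdtype() can be used instead of walking the whole tree. This is a small sketch; the labels in the dict are only for display.

```python
import numpy as np

# Each value is the result of one membership test against the NumPy hierarchy.
checks = {
    "int8 is a signed integer": np.issubdtype(np.int8, np.signedinteger),
    "uint8 is a signed integer": np.issubdtype(np.uint8, np.signedinteger),
    "float32 is a number": np.issubdtype(np.float32, np.number),
    "bool_ is a number": np.issubdtype(np.bool_, np.number),
}

for label, result in checks.items():
    print(label, "->", result)
```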

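Note: pandas also defines the category and datetime64[ns, tz] types, which are not integrated into the NumPy dtype hierarchy and therefore do not show up in the output of the function above.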
