Exploring Your Dataset With Pandas and Python, Part 2

Author: python测试开发 | Published: 2020-02-25 13:32


Accessing DataFrame Elements

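Because a DataFrame is made up of Series objects, you can use the very same tools to access its elements. The key difference is the additional dimension of the DataFrame. Use the indexing operator for columns, and the .loc and .iloc access methods for rows.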

Using the Indexing Operator

If you think of a DataFrame as a dictionary whose values are Series, then you can access its columns with the indexing operator:

>>> city_data["revenue"]
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
>>> type(city_data["revenue"])
pandas.core.series.Series
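The city_data frame used in these examples was defined in the previous part of this series and is not shown here. A minimal sketch for rebuilding it, assuming the values that appear in the outputs of this section (the construction itself is an assumption, not the original definition):

```python
import pandas as pd

# Values are copied from the outputs shown in this section; how the frame
# was originally built is an assumption.
city_data = pd.DataFrame(
    {
        "revenue": [4200, 6500, 8000],
        "employee_count": [5.0, 8.0, float("nan")],  # Toronto has no count
    },
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# The indexing operator returns the column as a Series.
revenue = city_data["revenue"]
print(type(revenue).__name__)  # Series
print(revenue["Tokyo"])        # 6500
```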

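If the column name is a string that is also a valid Python identifier, you can use attribute-style access with dot notation as well: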

>>> city_data.revenue
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
>>> toys = pd.DataFrame([
...     {"name": "ball", "shape": "sphere"},
...     {"name": "Rubik's cube", "shape": "cube"}
... ])
>>> toys["shape"]
0    sphere
1      cube
Name: shape, dtype: object
>>> toys.shape
(2, 2)

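Attribute-style access does not work when the column name coincides with a DataFrame attribute or method name, as with toys.shape above. In production code, or when manipulating data (for example, defining a new column), prefer the indexing operator over attribute access.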
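One reason to avoid attribute access when defining new columns: assigning to an attribute that does not exist yet sets a plain Python attribute on the object instead of adding a column. A small sketch, using a hypothetical price column that is not part of the original example:

```python
import warnings

import pandas as pd

toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"},
])

# Attribute assignment does NOT create a column; pandas emits a UserWarning,
# which is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toys.price = [1.5, 9.0]  # "price" is a hypothetical column name
created_by_attribute = "price" in toys.columns
print(created_by_attribute)  # False

# The indexing operator creates the column as intended.
toys["price"] = [1.5, 9.0]
created_by_indexing = "price" in toys.columns
print(created_by_indexing)  # True
```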

Using .loc and .iloc

Like a Series, a DataFrame provides the .loc and .iloc data access methods. .loc selects by label and .iloc selects by position:

>>> city_data.loc["Amsterdam"]
revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64
>>> city_data.loc["Tokyo": "Toronto"]
        revenue employee_count
Tokyo   6500    8.0
Toronto 8000    NaN
>>> city_data.iloc[1]
revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64

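For a DataFrame, the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the index, the second parameter selects columns. You can use the two together to select a subset of rows and columns from the DataFrame. Separate the parameters with a comma: the part before the comma refers to rows, the part after it to columns.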
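The row and column parameters described above can be sketched as follows. The city_data frame is rebuilt here from the values shown in the outputs of this section, which is an assumption about how it was originally defined:

```python
import pandas as pd

# Rebuilt from the outputs in this section (assumed construction).
city_data = pd.DataFrame(
    {
        "revenue": [4200, 6500, 8000],
        "employee_count": [5.0, 8.0, float("nan")],
    },
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# .loc is label-based, and label slices include both endpoints.
by_label = city_data.loc["Amsterdam":"Tokyo", "revenue"]

# .iloc is position-based, and positional slices exclude the stop position.
by_position = city_data.iloc[0:2, 0]

print(by_label.equals(by_position))  # True

# A single row label plus a single column label returns one value.
single = city_data.loc["Tokyo", "employee_count"]
print(single)  # 8.0
```

Note the difference in slice semantics: the label slice stops at "Tokyo" inclusive, while the positional slice 0:2 stops before position 2.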

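Now it's time to see the same construct on the larger nba dataset. Select all games between the labels 5555 and 5559. You're only interested in the names of the teams and the scores, so select those columns as well: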

>>> nba.loc[5555:5559, ["fran_id", "opp_fran", "pts", "opp_pts"]]


Querying Your Dataset

>>> current_decade = nba[nba["year_id"] > 2010]
>>> current_decade.shape
(12658, 23)

>>> games_with_notes = nba[nba["notes"].notnull()]  # equivalent: nba[nba["notes"].notna()]
>>> games_with_notes.shape
(5424, 23)

>>> ers = nba[nba["fran_id"].str.endswith("ers")]
>>> ers.shape
(27797, 23)

>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["pts"] > 100) &
...     (nba["opp_pts"] > 100) &
...     (nba["team_id"] == "BLB")
... ]

>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["team_id"].str.startswith("LA")) &
...     (nba["year_id"] == 1992) &
...     (nba["notes"].notnull())
... ]
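The queries above all follow one pattern: a comparison on a column yields a Boolean Series, and indexing the DataFrame with that Series keeps only the rows where it is True. A self-contained sketch on a tiny stand-in frame (the rows and values below are invented for illustration and are not taken from the nba dataset):

```python
import pandas as pd

# Hypothetical stand-in for the nba dataset; all values are invented.
games = pd.DataFrame(
    {
        "team_id": ["LAL", "BOS", "LAC", "LAL"],
        "year_id": [1992, 1992, 1992, 2011],
        "pts": [101, 95, 110, 88],
        "notes": [None, "at Hartford", "at Anaheim", None],
    }
)

# A comparison on a column produces a Boolean Series (the mask).
high_scoring = games["pts"] > 100
print(high_scoring.tolist())  # [True, False, True, False]

# Combine masks with &. Comparisons need their own parentheses because
# & binds more tightly than == in Python.
la_1992_with_notes = games[
    games["team_id"].str.startswith("LA")
    & (games["year_id"] == 1992)
    & games["notes"].notna()
]
print(la_1992_with_notes["team_id"].tolist())  # ['LAC']
```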


Grouping and Aggregating Your Data

A Series has more than twenty different methods for calculating descriptive statistics. Here are some examples:

>>> city_revenues.sum()
18700
>>> city_revenues.max()
8000
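The city_revenues Series also comes from the previous part of this series. A minimal sketch that rebuilds it from the revenue values shown earlier (an assumed construction) and tries a few more of the descriptive methods:

```python
import pandas as pd

# Rebuilt from the revenue figures shown in this article (assumed construction).
city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"],
)

total = city_revenues.sum()
print(total)                   # 18700
print(city_revenues.min())     # 4200
print(city_revenues.idxmax())  # Toronto

# describe() bundles count, mean, std, min, quartiles, and max in one Series.
summary = city_revenues.describe()
print(summary["count"])        # 3.0
```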

>>> points = nba["pts"]
>>> type(points)
<class 'pandas.core.series.Series'>
>>> points.sum()
12976235

>>> nba.groupby("fran_id", sort=False)["pts"].sum() 
fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387

>>> nba[
...     (nba["fran_id"] == "Spurs") &
...     (nba["year_id"] > 2010)
... ].groupby(["year_id", "game_result"])["game_id"].count()
year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64

>>> nba[
...     (nba["fran_id"] == "Warriors") &
...     (nba["year_id"] == 2015)
... ].groupby(["is_playoffs", "game_result"])["game_id"].count()
is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16

By default, pandas sorts the group keys during a call to .groupby(). If you don't need the result sorted, pass sort=False. This can improve performance.
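The effect of sort=False is easiest to see on a small frame. In the sketch below, the rows are invented stand-ins rather than real nba records:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the nba dataset.
games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Huskies", "Knicks", "Celtics"],
        "pts": [68, 66, 47, 55],
    }
)

# Default: group keys are sorted.
sorted_totals = games.groupby("fran_id")["pts"].sum()
print(sorted_totals.index.tolist())    # ['Celtics', 'Huskies', 'Knicks']

# sort=False: groups appear in order of first occurrence in the data.
unsorted_totals = games.groupby("fran_id", sort=False)["pts"].sum()
print(unsorted_totals.index.tolist())  # ['Knicks', 'Huskies', 'Celtics']
```

The aggregated values are identical either way; only the order of the resulting index differs.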

Manipulating Columns

>>> df = nba.copy()
>>> df.shape
(126314, 23)

>>> df["difference"] = df.pts - df.opp_pts
>>> df.shape
(126314, 24)

>>> df["difference"].max()
68

>>> renamed_df = df.rename(
...     columns={"game_result": "result", "game_location": "location"}
... )
>>> renamed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
...
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB

>>> df.shape
(126314, 24)
>>> elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
>>> df.drop(elo_columns, inplace=True, axis=1)
>>> df.shape
(126314, 20)
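The session above adds, renames, and drops columns on a copy of nba. The same steps can be sketched on a small invented frame, here using drop(columns=...) without inplace=True so that each step returns a new DataFrame (a stylistic choice, not the approach used above):

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the nba dataset.
games = pd.DataFrame(
    {
        "pts": [66, 68],
        "opp_pts": [68, 66],
        "elo_i": [1300.0, 1300.0],
        "game_result": ["L", "W"],
    }
)

df = games.copy()  # work on a copy so the original stays untouched

# Define a new column from existing ones.
df["difference"] = df["pts"] - df["opp_pts"]

# rename() returns a new DataFrame; df keeps its old column names.
renamed = df.rename(columns={"game_result": "result"})

# drop(columns=...) is equivalent to drop(..., axis=1) and also returns a copy.
trimmed = df.drop(columns=["elo_i"])

print(list(trimmed.columns))  # ['pts', 'opp_pts', 'game_result', 'difference']
print(games.shape)            # (2, 4)
```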

Changing Data Types

>>> df.info()
>>> df["date_game"] = pd.to_datetime(df["date_game"])
>>> df["game_location"].nunique()
>>> df["game_location"].value_counts()
A    63138
H    63138
N       38
>>> df["game_location"] = pd.Categorical(df["game_location"])
>>> df["game_location"].dtype
CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)

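Categorical data has a few advantages over unstructured text. When you specify the categorical data type, validation becomes easier and you save a lot of memory, because pandas only stores the unique values internally. The higher the ratio of total values to unique values, the more space you save.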
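The memory saving can be checked directly with memory_usage(deep=True). The sketch below uses an invented Series with three distinct location codes repeated many times. Exact byte counts depend on the pandas version and platform, so only the comparison is shown:

```python
import pandas as pd

# Hypothetical data: 100,000 values, but only three distinct ones.
locations = pd.Series(["A", "H", "N", "H"] * 25_000, dtype=object)
as_category = locations.astype("category")

# deep=True counts the actual string objects, not just the pointers to them.
text_bytes = locations.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(category_bytes < text_bytes)          # True
print(as_category.cat.categories.tolist())  # ['A', 'H', 'N']
```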