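Cleaning Data
You may be surprised to find this section so late in the tutorial! Normally, you would take a critical look at your dataset and fix any issues before moving on to more sophisticated analysis. In this tutorial, however, you rely on the techniques learned in the previous sections to clean your dataset.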
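Missing Values
When you inspect the nba dataset with nba.info(), you will see that it is quite neat. Only the notes column contains null values for the majority of its rows.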
>>> rows_without_missing_data = nba.dropna()
>>> rows_without_missing_data.shape
(5424, 23)
>>> data_without_missing_columns = nba.dropna(axis=1)
>>> data_without_missing_columns.shape
(126314, 22)
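To make the difference between the two dropna() calls concrete, here is a minimal sketch on a small stand-in DataFrame. The games frame and its values are hypothetical and only mimic the shape of the nba dataset, where notes is the one column with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the nba dataset.
games = pd.DataFrame(
    {
        "team": ["Knicks", "Heat", "Lakers"],
        "pts": [101, 95, 110],
        "notes": [np.nan, "at London, England", np.nan],
    }
)

# The default (axis=0) drops every row that has at least one missing value.
rows_without_missing_data = games.dropna()
print(rows_without_missing_data.shape)  # (1, 3)

# axis=1 drops every column that has at least one missing value.
data_without_missing_columns = games.dropna(axis=1)
print(data_without_missing_columns.shape)  # (3, 2)
```

Dropping rows keeps all columns but loses most observations, while dropping columns keeps every observation but loses the notes column. This matches the pattern in the shapes shown above.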
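>>> data_with_default_notes = nba.copy()
>>> # Assign the result back instead of using inplace=True on a selected
>>> # column, which may not update the DataFrame in recent pandas versions.
>>> data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
...     value="no notes at all"
... )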
>>> data_with_default_notes["notes"].describe()
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object
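As a small illustration of what filling in a default value does to the summary, the sketch below uses a hypothetical three-element notes Series. The count of non-null entries rises to the full length once the gaps are filled, and the default becomes the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notes column: two of three values are missing.
notes = pd.Series([np.nan, "at London, England", np.nan], name="notes")

# fillna() returns a new Series, so the original is left unchanged.
filled = notes.fillna("no notes at all")

print(notes.count())   # 1
print(filled.count())  # 3
print(filled.value_counts().idxmax())  # no notes at all
```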
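>>> # Summary statistics help you spot implausible values.
>>> nba.describe()
>>> # Look at any games recorded with 0 points.
>>> nba[nba["pts"] == 0]
>>> # No game with more points than the opponent is marked as anything but a win.
>>> nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
True
>>> # No game with fewer points than the opponent is marked as anything but a loss.
>>> nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty
True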
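The two .empty checks above only print True because the real data is consistent. To show what a failing check looks like, here is a sketch on a hypothetical three-game frame in which the last row is deliberately mislabeled.

```python
import pandas as pd

# Hypothetical toy data; the last game is deliberately marked "L"
# even though the team scored more points than its opponent.
games = pd.DataFrame(
    {
        "pts": [101, 95, 110],
        "opp_pts": [99, 100, 105],
        "game_result": ["W", "L", "L"],
    }
)

won_on_points = games["pts"] > games["opp_pts"]
not_marked_win = games["game_result"] != "W"

# Rows where the score and the recorded result disagree.
inconsistent = games[won_on_points & not_marked_win]
print(inconsistent.empty)           # False
print(inconsistent.index.tolist())  # [2]
```

Naming the two Boolean masks separately keeps the condition readable and makes it easy to inspect the offending rows rather than only learning that some exist.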
Combining Multiple Datasets
>>> further_city_data = pd.DataFrame(
...     {"revenue": [7000, 3400], "employee_count": [2, 2]},
... index=["New York", "Barcelona"]
... )
>>> all_city_data = pd.concat([city_data, further_city_data], sort=False)
>>> all_city_data
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
New York      7000             2.0
Barcelona     3400             2.0
>>> city_countries = pd.DataFrame({
... "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
... "capital": [1, 1, 0, 0, 0]},
... index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
... )
>>> cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
>>> cities
           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0
>>> pd.concat([all_city_data, city_countries], axis=1, join="inner")
           revenue  employee_count  country  capital
Amsterdam     4200             5.0  Holland        1
Tokyo         6500             8.0    Japan        1
Toronto       8000             NaN   Canada        0
Barcelona     3400             2.0    Spain        0
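The concat() calls above depend on city_data, which is not defined in this section. The sketch below is self-contained under the assumption that city_data looks like the first three rows of the output shown earlier; that definition is reconstructed from the printed output and is not taken from the original code.

```python
import numpy as np
import pandas as pd

# Assumed definition, reconstructed from the output shown above.
city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5.0, 8.0, np.nan]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)
further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count": [2, 2]},
    index=["New York", "Barcelona"],
)
city_countries = pd.DataFrame(
    {
        "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
        "capital": [1, 1, 0, 0, 0],
    },
    index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"],
)

# axis=0 (the default) stacks rows: 3 + 2 cities, same two columns.
all_city_data = pd.concat([city_data, further_city_data])
print(all_city_data.shape)  # (5, 2)

# axis=1 places columns side by side, aligned on the index.
# The default outer join keeps every city from either frame.
outer = pd.concat([all_city_data, city_countries], axis=1)
print(outer.shape)  # (6, 4)

# join="inner" keeps only cities present in both frames.
inner = pd.concat([all_city_data, city_countries], axis=1, join="inner")
print(inner.shape)  # (4, 4)
```

The outer join is why New York and Rotterdam appear with NaN entries, and why integer columns are shown as floats in that result. The inner join drops both cities and keeps the integer dtypes.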
>>> countries = pd.DataFrame({
... "population_millions": [17, 127, 37],
... "continent": ["Europe", "Asia", "North America"]
... }, index=["Holland", "Japan", "Canada"])
>>> pd.merge(cities, countries, left_on="country", right_index=True)
>>> pd.merge(
... cities,
... countries,
... left_on="country",
... right_index=True,
... how="left"
... )
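The two merge() calls above differ only in the how parameter. As a minimal sketch of the effect, the example below uses reduced, hypothetical versions of cities and countries in which one city (Barcelona) has a country that is missing from the lookup table.

```python
import pandas as pd

# Hypothetical, reduced versions of the frames used above.
cities = pd.DataFrame(
    {"country": ["Holland", "Japan", "Spain"]},
    index=["Amsterdam", "Tokyo", "Barcelona"],
)
countries = pd.DataFrame(
    {"population_millions": [17, 127], "continent": ["Europe", "Asia"]},
    index=["Holland", "Japan"],
)

# Default how="inner": cities whose country has no match are dropped.
inner = pd.merge(cities, countries, left_on="country", right_index=True)
print(inner.shape)  # (2, 3)

# how="left": every city is kept; unmatched ones get NaN in the new columns.
left = pd.merge(
    cities, countries, left_on="country", right_index=True, how="left"
)
print(left.shape)  # (3, 3)
print(pd.isna(left.loc["Barcelona", "continent"]))  # True
```

A left merge is usually the safer choice when the left frame is the main dataset and the right frame only adds optional detail, because no rows disappear silently.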
Visualizing Data
>>> nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()
>>> nba["fran_id"].value_counts().head(10).plot(kind="bar")
>>> nba[
... (nba["fran_id"] == "Heat") &
... (nba["year_id"] == 2013)
... ]["game_result"].value_counts().plot(kind="pie")
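Each plot above is drawn from a Series that is computed first, so it can help to inspect that Series before calling .plot() on it. The sketch below does this on a hypothetical four-game frame. It stops short of plotting, since .plot() additionally requires Matplotlib to be installed.

```python
import pandas as pd

# Hypothetical toy data standing in for the nba dataset.
games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Knicks", "Knicks", "Heat"],
        "year_id": [2012, 2012, 2013, 2013],
        "pts": [100, 90, 110, 95],
        "game_result": ["W", "L", "W", "W"],
    }
)

# Total points per season for one franchise: the data behind the line plot.
knicks_points = (
    games[games["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum()
)
print(knicks_points)

# Games per franchise, most frequent first: the data behind the bar plot.
games_per_franchise = games["fran_id"].value_counts()
print(games_per_franchise)

# Win/loss counts for one franchise and season: the data behind the pie plot.
heat_2013_results = games[
    (games["fran_id"] == "Heat") & (games["year_id"] == 2013)
]["game_result"].value_counts()
print(heat_2013_results)
```

Calling .plot(), .plot(kind="bar"), or .plot(kind="pie") on these three Series would then draw the corresponding charts.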