Exploring Datasets with Pandas and Python, Part 3

Author: python测试开发 | Published: 2020-02-25 15:29

    Cleaning Data

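    You may be surprised to find this section so late in the tutorial. Usually, you would take a critical look at your dataset and fix any issues before moving on to more sophisticated analysis. In this tutorial, however, you will rely on the techniques from the previous sections to clean the dataset.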

    Missing Values

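    When you inspect the nba dataset with nba.info(), you will see that it is quite tidy. Only the notes column contains null values, and it does so in the majority of its rows.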
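To put a number on "the majority of its rows", you can count the missing values per column. The following is a minimal sketch on a made-up stand-in for the nba dataset; the `games` frame and its values are assumptions for illustration only, not data from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the nba dataset (values are made up).
games = pd.DataFrame(
    {
        "pts": [66, 68, 63, 47],
        "game_result": ["L", "W", "W", "L"],
        "notes": [np.nan, np.nan, "at Chicago IL", np.nan],
    }
)

# Number of missing values in each column
missing_counts = games.isna().sum()

# Share of missing values in each column, between 0.0 and 1.0
missing_share = games.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real dataset, the same two calls on `nba` would show at a glance which columns need attention.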

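    >>> # Drop every row that contains at least one missing value
    >>> rows_without_missing_data = nba.dropna()
    >>> rows_without_missing_data.shape
    (5424, 23)
    
    >>> # Drop every column that contains at least one missing value
    >>> data_without_missing_columns = nba.dropna(axis=1)
    >>> data_without_missing_columns.shape
    (126314, 22)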
    
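    >>> # Replace missing notes with a default value instead of dropping data.
    >>> # Assigning the result back avoids the deprecated chained inplace=True call.
    >>> data_with_default_notes = nba.copy()
    >>> data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
    ...     value="no notes at all"
    ... )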
    >>> data_with_default_notes["notes"].describe()
    count              126314
    unique                232
    top       no notes at all
    freq               120890
    Name: notes, dtype: object
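The three strategies shown here (dropping rows, dropping columns, filling in a default) are easier to compare on a frame small enough to check by eye. This is only a sketch; the `games` frame below is hypothetical and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the "notes" column has missing values.
games = pd.DataFrame(
    {
        "pts": [66, 68, 63, 47],
        "opp_pts": [68, 66, 47, 63],
        "notes": [np.nan, np.nan, "at Chicago IL", np.nan],
    }
)

# Keep only the rows that have no missing value at all
rows_kept = games.dropna()

# Keep only the columns that have no missing value at all
columns_kept = games.dropna(axis=1)

# Keep everything and replace missing notes with a default string
filled = games.copy()
filled["notes"] = filled["notes"].fillna("no notes at all")

print(rows_kept.shape)               # (1, 3)
print(columns_kept.shape)            # (4, 2)
print(filled["notes"].isna().sum())  # 0
```

Dropping rows discards three of the four games, while dropping the column or filling it keeps every game. The same trade-off explains why the real dataset shrinks from 126314 rows to 5424 when rows are dropped.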
    
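    >>> # Summary statistics help you spot implausible minimum or maximum values
    >>> nba.describe()
    
    >>> # Inspect the games in which a team is recorded with 0 points
    >>> nba[nba["pts"] == 0]
    
    >>> # Consistency check: a team that outscored its opponent must have a "W"
    >>> nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
    True
    >>> # ...and a team that scored fewer points must have an "L"
    >>> nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty
    True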
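When such a check does not come back empty, the filtered frame itself tells you which rows to fix. The sketch below plants one inconsistent row in a hypothetical `games` frame (made-up values) to show what a failing check looks like.

```python
import pandas as pd

# Hypothetical data: the last row claims a win despite fewer points.
games = pd.DataFrame(
    {
        "pts": [66, 68, 63, 47],
        "opp_pts": [68, 66, 47, 63],
        "game_result": ["L", "W", "W", "W"],
    }
)

# Rows where the team scored more points but is not marked as the winner
won_but_not_w = games[
    (games["pts"] > games["opp_pts"]) & (games["game_result"] != "W")
]

# Rows where the team scored fewer points but is not marked as the loser
lost_but_not_l = games[
    (games["pts"] < games["opp_pts"]) & (games["game_result"] != "L")
]

print(won_but_not_w.empty)  # True
print(lost_but_not_l)       # contains the planted row with index 3
```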
    
    

    Combining Multiple Datasets

    >>> further_city_data = pd.DataFrame(
    ...     {"revenue": [7000, 3400], "employee_count":[2, 2]},
    ...     index=["New York", "Barcelona"]
    ... )
    
    >>> all_city_data = pd.concat([city_data, further_city_data], sort=False)
    >>> all_city_data
               revenue  employee_count
    Amsterdam     4200             5.0
    Tokyo         6500             8.0
    Toronto       8000             NaN
    New York      7000             2.0
    Barcelona     3400             2.0
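The snippet above relies on `pd` and `city_data` from an earlier part of this series. As a self-contained sketch, `city_data` is rebuilt here from the values visible in the output; that reconstruction is an assumption, not the original definition.

```python
import pandas as pd

# Assumed reconstruction of city_data, based on the output shown above.
city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5.0, 8.0, None]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count": [2, 2]},
    index=["New York", "Barcelona"],
)

# With the default axis=0, concat stacks the rows of both frames
# and keeps the index labels of each input.
all_city_data = pd.concat([city_data, further_city_data], sort=False)

print(all_city_data)
print(all_city_data.shape)  # (5, 2)
```

The missing employee count for Toronto survives the concatenation as NaN, which is why the whole column is displayed as floating-point numbers.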
    
    >>> city_countries = pd.DataFrame({
    ...     "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
    ...     "capital": [1, 1, 0, 0, 0]},
    ...     index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
    ... )
    >>> cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
    >>> cities
               revenue  employee_count  country  capital
    Amsterdam   4200.0             5.0  Holland      1.0
    Tokyo       6500.0             8.0    Japan      1.0
    Toronto     8000.0             NaN   Canada      0.0
    New York    7000.0             2.0      NaN      NaN
    Barcelona   3400.0             2.0    Spain      0.0
    Rotterdam      NaN             NaN  Holland      0.0
    
    >>> pd.concat([all_city_data, city_countries], axis=1, join="inner")
               revenue  employee_count  country  capital
    Amsterdam     4200             5.0  Holland        1
    Tokyo         6500             8.0    Japan        1
    Toronto       8000             NaN   Canada        0
    Barcelona     3400             2.0    Spain        0
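With axis=1, concat aligns rows by index label. The default outer join keeps the union of labels, and join="inner" keeps only the labels present in every input. A minimal sketch with two small hypothetical frames (`left` and `right` are illustrative names):

```python
import pandas as pd

# Hypothetical frames that share only two index labels.
left = pd.DataFrame(
    {"revenue": [4200, 6500, 7000]},
    index=["Amsterdam", "Tokyo", "New York"],
)
right = pd.DataFrame(
    {"country": ["Holland", "Japan", "Holland"]},
    index=["Amsterdam", "Tokyo", "Rotterdam"],
)

# Outer join (default): every label from either frame, gaps filled with NaN
outer = pd.concat([left, right], axis=1)

# Inner join: only labels that appear in both frames
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```

This mirrors the outputs above: New York and Rotterdam appear only in the outer result, each with NaN in the columns from the frame that lacks them.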
    
    >>> countries = pd.DataFrame({
    ...     "population_millions": [17, 127, 37],
    ...     "continent": ["Europe", "Asia", "North America"]
    ... }, index= ["Holland", "Japan", "Canada"])
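    >>> # Inner join (default): cities without a matching country are dropped
    >>> pd.merge(cities, countries, left_on="country", right_index=True)
    
    >>> # Left join: keep every city and fill unmatched country columns with NaN
    >>> pd.merge(
    ...     cities,
    ...     countries,
    ...     left_on="country",
    ...     right_index=True,
    ...     how="left"
    ... )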
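To see which rows found a partner in the other frame, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses trimmed-down, hypothetical versions of `cities` and `countries` so that it runs on its own.

```python
import pandas as pd

# Hypothetical, reduced versions of the frames used above.
cities = pd.DataFrame(
    {"revenue": [4200, 6500, 3400], "country": ["Holland", "Japan", "Spain"]},
    index=["Amsterdam", "Tokyo", "Barcelona"],
)
countries = pd.DataFrame(
    {"population_millions": [17, 127], "continent": ["Europe", "Asia"]},
    index=["Holland", "Japan"],
)

# Inner join: Barcelona is dropped because "Spain" is not in countries
inner = pd.merge(cities, countries, left_on="country", right_index=True)

# Left join with an indicator column that records where each row came from
left = pd.merge(
    cities,
    countries,
    left_on="country",
    right_index=True,
    how="left",
    indicator=True,
)

print(inner)
print(left[["country", "continent", "_merge"]])
```

Rows marked `left_only` are the ones an inner join would silently discard, so the indicator is a cheap way to check a merge before trusting its row count.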
    

    Data Visualization

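    >>> # Line plot: total points scored by the Knicks in each season
    >>> nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()
    >>> # Bar plot: the ten franchises with the most games played
    >>> nba["fran_id"].value_counts().head(10).plot(kind="bar")
    >>> # Pie plot: wins versus losses for the Heat in 2013
    >>> nba[
    ...     (nba["fran_id"] == "Heat") &
    ...     (nba["year_id"] == 2013)
    ... ]["game_result"].value_counts().plot(kind="pie")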
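Outside a notebook, it helps to create the axes explicitly so that each chart lands where you expect. The following sketch assumes matplotlib is installed and uses a made-up `games` frame; the non-interactive "Agg" backend is chosen only so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; draws off-screen

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature stand-in for the nba dataset (values are made up).
games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Knicks", "Knicks", "Heat", "Heat", "Knicks"],
        "year_id": [2012, 2012, 2013, 2013, 2013, 2013],
        "pts": [100, 90, 95, 103, 88, 110],
    }
)

# Aggregate first, then plot the resulting Series
knicks_points = games[games["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum()
games_per_franchise = games["fran_id"].value_counts()

# One figure with two axes, so the charts do not overwrite each other
fig, (line_ax, bar_ax) = plt.subplots(1, 2, figsize=(10, 4))
knicks_points.plot(ax=line_ax, title="Knicks points per season")
games_per_franchise.plot(kind="bar", ax=bar_ax, title="Games per franchise")
fig.tight_layout()
```

To write the result to disk from a script, `fig.savefig` with a file name of your choice is the usual next step. In a Jupyter notebook the figure is typically displayed inline without it.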
    

    https://www.somebits.com/~nelson/pandas-multiindex-slice-demo.html

          Article link: https://www.haomeiwen.com/subject/ymltchtx.html