Pandas备忘录

作者: 弃用中 | 来源:发表于2018-12-08 22:31 被阅读250次

    众所周知,Pandas是最受欢迎的Python数据科学与分析库。

    Numpy用于较低级别的科学计算。Pandas构建于Numpy之上,专为实际数据分析而设计。

    本文总结了这个库中一些常见、方便的功能。话不多说,让我们开始吧!

    数据集

    数据是数据科学与分析的“命根子”,我们需要从数据中寻找答案,从数据中发现模式……如果没有数据的话,就无从谈起。

    数据获取的方式有很多种,这里分享一些提供数据的平台。

    pandas-videos:https://github.com/justmarkham/pandas-videos

    pandas-videos上的数据集

    UCI Machine Learning Reposity:https://archive.ics.uci.edu/ml/datasets.html

    UCI Machine Learning Reposity

    The 50 Best Free Datasets for Machine Learning:https://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/

    导入数据

    任何的数据分析工作都是从导入数据开始,Pandas提供了很多导入数据的方法。

    pd.read_csv(filename) # From a CSV file
    pd.read_table(filename) # From a delimited text file (like TSV)
    pd.read_excel(filename) # From an Excel file
    pd.read_sql(query, connection_object) # Reads from a SQL table/database
    pd.read_json(json_string) # Reads from a JSON formatted string, URL or file
    pd.read_html(url) # Parses an html URL, string or file and extract tables to a list of dataframes
    pd.read_clipboard() # Takes the contents of your clipboard and passes it to read_table()
    pd.DataFrame(dict) # From a dict, keys for columns names, values for data as lists
    

    不妨举个例子,我们想分析一下IMDB高分电影,经过一番搜索,发现 https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv 上的数据就不错,
    若想把它导入,使用pd.read_csv方法就行啦!

    具体来说,就是:

    dataURL = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv'
    df = pd.read_csv(dataURL)
    df.head() # Prints first 5 rows of the DataFrame
    
    前5行数据

    关于上述导入数据方法的更多详细内容,请查看IO Tools

    探索数据

    一旦将数据导入到DataFrame中之后,就可以用以下方法来了解数据的情况。

    df.index # Index Description
    df.columns # Columns in the DataFrame
    df.shape # Prints number of row and columns in DataFrame
    df.head(n) # Prints first n rows of the DataFrame
    df.tail(n) # Prints last n rows of the DataFrame
    df.info() # Index, DataType and Memory information
    df.describe() # Summary statistics for numerical columns
    s.value_counts(dropna=False) # Views unique values and counts
    df.apply(pd.Series.value_counts) # Unique value and counts for all columns
    df.mean() # Returns the mean of all columns
    df.corr() # Returns the correlation between columns in a DataFrame
    df.count() # Returns the number of non-null values in each DataFrame column
    df.max() # Returns the highest value in each column
    df.min() # Returns the lowest value in each column
    df.median() # Returns the median of each column
    df.std() # Returns the standard deviation of each column
    df.idxmax() # Index of the lowest value
    df.idxmin() # Index of the highest value
    

    举个例子,在导入IMDB高分电影数据后统计一下每种电影类型的频数,我们就可以用:

    df['genre'].value_counts()
    
    每种电影类型的频数

    选择

    通常,我们可能需要选择单个元素或者数据的某个子集来进行深入分析。那么,这些方法就会大显身手:

    df[col] # Returns column with label col as Series
    df[[col1, col2]] # Returns columns as a new DataFrame
    s.iloc[0] # Selection by position (selects first element)
    s.loc[0] # Selection by index (selects element as index 0)
    df.iloc[0, :] # First row
    df.iloc[0, 0] # First element of first column
    df.iat[0, 0] # First element of first column. Access a single value for row/column pair by integer position
    df.at[row_label, col_label] # Access a single value for row/column label pair
    

    更多内容,请查看Indexing and Selecting Data

    数据清理

    实际上,我们拿到数据后,往往需要清理它。以下就是一些非常有用的方法:

    df.columns = ['a', 'b', 'c'] # Renames columns
    pd.isnull() # Checks for null values, return Boolean Array
    pd.notnull() # Opposite of pd.isnull()
    df.dropna() # Drops all rows that contain null values
    df.dropna(axis=1) # Drops all columns that contain null values
    df.dropna(axis=1, thresh=n) # Drops all rows hava less than non null values
    df.fillna(x) # Replaces all null values with x
    s.fillna(s.mean()) # Replaces all null values with the mean 
    s.astype(float) # Converts the datatype of the series to float
    s.replace(1, 'one') # Replaces all values equal to 1 with 'one'
    s.replace([1, 3], ['one', 'three']) # Replace all 1 with 'one' and 3 with 'three'
    df.rename(columns=lambda x: x + 1) # Mass renaming of columns
    df.rename(columns={'old_name': 'new_name'}) # Selective renaming
    df.set_index('column_one') # Changes the index
    df.rename(index=lambda x: x + 1) # Mass renaming of index
    df.drop(labels) # Drop specified labels from rows or columns
    df.drop_duplicates(subset) # Return DataFrame with duplicate rows removed, optionally only considering certain columns
    

    更多内容,请查看Working with missing data

    过滤,排序和分组

    一些对数据过滤、排序和分组的方法:

    df[df[col] > 0.5] # Rows where the col column is greater than 0.5
    df[(df[col] > 0.5) & (df[col] < 0.7)] # Rows where 0.5 < col < 0.7
    df.sort_values(col1) # Sorts values by col1 in ascending order
    df.sort_values(col2, ascending=False) # Sorts values by col2 in descending order
    df.sort_values([col1, col2], ascending=[True, False]) # Sorts values by col1 in ascending order then col2 in descending order
    df.groupby(col) # Returns a groupby object for values from one column
    df.groupby([col1,col2]) # Returns a groupby object values from multiple columns
    df.groupby(col1)[col2].mean() # Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
    df.pivot_table(index=col1, values= col2,col3], aggfunc=mean) # Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
    df.groupby(col1).agg(np.mean) # Finds the average across all columns for every unique column 1 group
    df.apply(np.mean) # Applies a function across each column
    df.apply(np.max, axis=1) # Applies a function across each row
    

    更多内容,请查看Group By: split-apply-combine

    导出数据

    最后,当需要将分析结果导出时,也有几种方便的发方:

    df.to_csv(filename) # Writes to a CSV file
    df.to_excel(filename) # Writes to an Excel file
    df.to_sql(table_name, connectiion_object) # Writes to a SQL table
    df.to_json(filename) # Writes to a file in JSON format
    df.to_html(filename) # Saves as an HTML table
    df.to_clipboard() # Writes to the clipboard
    

    详情请查看IO Tools

    连接和合并

    合并两个DataFrame的方法:

    pd.concat([df1, df2], axis=1) # Adds the columns in df1 to the end of df2
    pd.merge(df11, df2) # SQL-style merges
    df1.append(df2) # Adds the rows in df1 to the end of df2 (columns should be identical)
    df1.join(df2,on=col1,how='inner') # SQL-style joins
    

    更多内容,请查看Merge, join, and concatenate

    创建测试对象

    通用用于测试代码段。

    pd.DataFrame(np.random.rand(20,5)) # 5 columns and 20 rows of random floats
    pd.Series(my_list) # Create a series from an iterable my_list
    df.index = pd.data_range('1900/1/30', periods=df.shape[0]) # Add a date index
    

    实例

    或许我们可以从一个实例(来源于书籍《Python for Data Analysis》)出发,当作对上述内容的一个小练习。

    我们用的是bitly_usagov,数据可以在:https://github.com/wesm/pydata-book 处找到。

    导入数据

    假如将数据下载到了本地,我们可以尝试将其导入并得到DataFrame对象,便于之后的分析工作。

    import pandas as pd
    
    file_path = '../../Datasets/bitly_usagov/example.txt' # Local file path
    df = pd.read_json(file_path, lines=True)
    df.info()
    
    用info方法查看数据

    对时区计数

    df对象有一列名为tz,表示的是时区。我们可以对所有数据的时区进行统计,

    tz_counts = df['tz'].value_counts()
    tz_counts.head(10)
    

    我们可以将结果可视化,但在此之前,可以将未知或者缺失的时区补上一个替代值。

    fillna函数可以替换缺失值(NA),而未知值(空字符串)则可以通过布尔数组索引进行替换:

    clean_tz = df['tz'].fillna('Missing')
    clean_tz[clean_tz == ''] = 'Unknown'
    tz_counts = clean_tz.value_counts()
    

    之后,我们可以用seaborn包创建水平柱状图:

    df对象名为a的列中含有浏览器、设备、应用程序的相关信息,我们可以简单地将浏览器信息提取出来:

    df.a.str.split(' ').str.get(0).head(10)
    

    类似地,我们也可以将浏览器的统计信息可视化:

    参考

    [1] https://paulovasconcellos.com.br/28-useful-pandas-functions-you-might-not-know-de42c59db085

    [2] https://elitedatascience.com/python-cheat-sheet

    [3] https://www.dataquest.io/blog/pandas-cheat-sheet/

    [4] http://pandas.pydata.org/pandas-docs/stable/10min.html

    相关文章

      网友评论

        本文标题:Pandas备忘录

        本文链接:https://www.haomeiwen.com/subject/iimchqtx.html