Pandas Function Quick Reference

Author: tuimer | Published: 2019-11-01 20:22

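    pandas is one of the most important Python libraries for data analysis and the one most often used for data processing. Its functions fall into the following broad categories.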

    1. Input and output

    Reading data

    1. pd.read_csv(filename) - Reads from a CSV file
    2. pd.read_excel(filename) - Reads from an Excel file
    3. pd.read_sql(query, connection_object) - Reads from a SQL table or database
    4. pd.read_json(json_string) - Reads from a JSON-formatted string, URL, or file (recent pandas versions expect a literal string to be wrapped in io.StringIO)
    5. pd.read_html(url) - Parses an HTML URL, string, or file and extracts its tables into a list of DataFrames
    6. pd.read_clipboard() - Takes the contents of the clipboard and passes them to pd.read_csv()
    7. pd.DataFrame(dict) - Builds a DataFrame from a dict: keys become column names, values (lists) become the column data
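
As a rough illustration of the readers listed above, the sketch below loads the same small table three ways. The column names and sample values are made up for the example, and in-memory text stands in for a real file so that the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; a file path or URL could be passed instead.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dict: keys become column names.
df_dict = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# read_json accepts a file-like object as well as a path or URL.
json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))
```

read_excel, read_sql, and read_html follow the same pattern but need extra dependencies (an Excel engine, a database connection, an HTML parser), so they are left out of this sketch.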

    Writing data

    1. df.to_excel(filename) - Writes to an Excel file
    2. df.to_csv(filename) - Writes to a CSV file
    3. df.to_sql(table_name, connection_object) - Writes to a SQL table
    4. df.to_json(filename) - Writes to a file in JSON format
    5. df.to_html(filename) - Saves as an HTML table
    6. df.to_clipboard() - Writes to the clipboard
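
A minimal sketch of the writers, assuming a small made-up table. When no path is given, to_csv and to_json return the serialized text instead of writing a file, which makes a round trip easy to check without touching disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# With no path argument, these return the text rather than writing a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Reading the CSV text back should reproduce the original values.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

Passing index=False keeps the row labels out of the CSV; without it, the index is written as an extra first column.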

    2. Creating test data

    1. pd.DataFrame(np.random.rand(20,5)) - Creates a DataFrame of random floats with 20 rows and 5 columns
    2. pd.Series(my_list) - Creates a Series from an iterable my_list
    3. df.index = pd.date_range('1900/1/30', periods=df.shape[0]) - Sets a date index with one day per row
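
The three items above can be combined as in the following sketch. The seeded generator and the column names a to e are assumptions added here so that the output is reproducible; the original list uses np.random.rand directly.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" data is the same on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)), columns=list("abcde"))

# One calendar day per row, starting from the date used in the list above.
df.index = pd.date_range("1900/1/30", periods=df.shape[0])

# A Series built from a plain Python list.
s = pd.Series([1, 2, 3])
```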

    3. Inspecting data

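    1. df.head(n) - First n rows
    2. df.tail(n) - Last n rows
    3. df.shape - Number of rows and columns (an attribute, so no parentheses)
    4. df.info() - Index, datatype, and memory information
    5. df.describe() - Summary statistics for numerical columns
    6. s.value_counts(dropna=False) - Unique values and their counts, including NA
    7. df.apply(pd.Series.value_counts) - Unique values and their counts for every column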
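
A short sketch of the inspection calls on a made-up table with one missing value per column. Note that shape is read as an attribute; calling it with parentheses raises a TypeError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "a", None],
    "value": [1.0, 2.0, np.nan, 4.0],
})

n_rows, n_cols = df.shape                  # attribute, not a method call
first_two = df.head(2)
city_counts = df["city"].value_counts(dropna=False)  # missing values get a row too
summary = df.describe()                    # numeric columns only by default
```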

    4. Selecting data

    1. df[col] - Returns column col as a Series
    2. df[[col1, col2]] - Returns several columns as a new DataFrame
    3. s.iloc[0] - Selection by position
    4. s.loc[0] - Selection by index label
    5. df.iloc[0,:] - First row
    6. df.iloc[0,0] - First element of the first column
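
The difference between position-based and label-based selection is easiest to see with a non-numeric index. The sketch below assumes a small frame indexed by the hypothetical labels x, y, z.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 20, 30], "col2": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

one_col = df["col1"]               # a Series
two_cols = df[["col1", "col2"]]    # a DataFrame
first_row = df.iloc[0, :]          # by position
row_y = df.loc["y"]                # by index label
top_left = df.iloc[0, 0]           # first row, first column
```

With the default integer index, loc[0] and iloc[0] happen to return the same row, which is why the two are easy to confuse.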

    5. Data cleaning

    1. df.columns = ['a','b','c'] - Renames the columns
    2. pd.isnull(obj) - Checks for null values and returns a boolean array (also available as s.isnull() and df.isnull())
    3. pd.notnull(obj) - Opposite of pd.isnull()
    4. df.dropna() - Drops all rows that contain null values
    5. df.dropna(axis=1) - Drops all columns that contain null values
    6. df.dropna(axis=1, thresh=n) - Drops all columns that have fewer than n non-null values (with axis=0, the same rule applies to rows)
    7. df.fillna(x) - Replaces all null values with x
    8. s.fillna(s.mean()) - Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
    9. s.astype(float) - Converts the datatype of the Series to float
    10. s.replace(1,'one') - Replaces all values equal to 1 with 'one'
    11. s.replace([1,3],['one','three']) - Replaces all 1 with 'one' and 3 with 'three'
    12. df.rename(columns=lambda x: x + 1) - Mass renaming of columns (x + 1 assumes numeric column labels)
    13. df.rename(columns={'old_name': 'new_name'}) - Selective renaming
    14. df.set_index('column_one') - Changes the index
    15. df.rename(index=lambda x: x + 1) - Mass renaming of the index
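
To make the dropna variants concrete, here is a sketch on a made-up 3x3 frame where column a has two non-null values, b has one, and c has three. The replace call uses numeric replacements rather than the strings in the list above, purely to keep the result dtype simple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_complete = df.dropna()                # only the last row has no nulls
cols_complete = df.dropna(axis=1)          # only column "c" has no nulls
cols_thresh = df.dropna(axis=1, thresh=2)  # columns with at least 2 non-null values
filled = df["a"].fillna(df["a"].mean())    # NaN becomes the mean of 1.0 and 3.0
renamed = df.rename(columns={"a": "alpha"})
replaced = pd.Series([1, 3, 5]).replace([1, 3], [10, 30])
```

None of these calls modify df in place; each returns a new object, so the result has to be assigned to a name.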

    6. Filtering, sorting, and grouping

    1. df[df[col] > 0.5] - Rows where the col column is greater than 0.5
    2. df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
    3. df.sort_values(col1) - Sorts values by col1 in ascending order
    4. df.sort_values(col2, ascending=False) - Sorts values by col2 in descending order
    5. df.sort_values([col1,col2], ascending=[True,False]) - Sorts values by col1 in ascending order, then col2 in descending order
    6. df.groupby(col) - Returns a groupby object for values from one column
    7. df.groupby([col1,col2]) - Returns a groupby object for values from multiple columns
    8. df.groupby(col1)[col2].mean() or df.groupby(col1).mean()[col2] - Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section). The first form is preferable because it aggregates only col2
    9. df.pivot_table(index=col1, values=[col2,col3], aggfunc='mean') - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
    10. df.groupby(col1).agg(np.mean) - Finds the average across all columns for every unique col1 group
    11. df.apply(np.mean) - Applies a function across each column
    12. df.apply(np.max, axis=1) - Applies a function across each row
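
The sketch below runs the main operations from this section on a four-row frame with made-up values. It uses the string 'mean' for aggfunc and a lambda for the row-wise apply; both are choices made here for portability across pandas versions, not the only valid spellings.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["x", "y", "x", "y"],
    "col2": [0.2, 0.6, 0.8, 0.4],
    "col3": [1, 2, 3, 4],
})

# Boolean masks are combined with & and each condition needs parentheses.
filtered = df[(df["col2"] > 0.5) & (df["col2"] < 0.7)]

# col1 ascending, ties broken by col2 descending.
sorted_df = df.sort_values(["col1", "col2"], ascending=[True, False])

# Mean of col2 within each col1 group.
group_mean = df.groupby("col1")["col2"].mean()

# Same grouping, two value columns at once.
pivot = df.pivot_table(index="col1", values=["col2", "col3"], aggfunc="mean")

# axis=1 hands each row to the function.
row_max = df[["col2", "col3"]].apply(lambda row: row.max(), axis=1)
```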

    7. Statistics

    These can all be applied to a series as well.

    1. df.describe() - Summary statistics for numerical columns
    2. df.mean() - Returns the mean of all columns
    3. df.corr() - Returns the correlation between columns in a DataFrame
    4. df.count() - Returns the number of non-null values in each DataFrame column
    5. df.max() - Returns the highest value in each column
    6. df.min() - Returns the lowest value in each column
    7. df.median() - Returns the median of each column
    8. df.std() - Returns the standard deviation of each column
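
A brief sketch of the statistics calls, assuming two made-up numeric columns where y is exactly twice x, so the correlation comes out as 1.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

means = df.mean()
medians = df.median()
stds = df.std()      # sample standard deviation (ddof=1) by default
corr = df.corr()     # Pearson correlation by default
counts = df.count()  # non-null values per column
```

Each call returns a Series indexed by column name, except corr, which returns a square DataFrame.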

    8. Combining data

    1. df1.append(df2) - Adds the rows of df2 to the end of df1 (columns should be identical). DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) is the replacement
    2. pd.concat([df1, df2], axis=1) - Adds the columns of df2 to the end of df1 (rows should be identical)
    3. df1.join(df2, on=col1, how='inner') - SQL-style join of the columns of df1 with the columns of df2, matching the values in df1's col1 against the index of df2. how can be one of 'left', 'right', 'outer', 'inner'
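
The following sketch shows the three ways of combining frames on small made-up tables. It uses pd.concat for the row-wise case so that it also runs on pandas 2.0 and later; the frames named extra and lookup are hypothetical helpers introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "v1": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "v1": [3, 4]})

# Row-wise: the rows of df2 go after the rows of df1.
stacked = pd.concat([df1, df2], ignore_index=True)

# Column-wise: columns of the second frame follow those of the first,
# aligned on the index.
extra = pd.DataFrame({"v2": [10, 20]})
side_by_side = pd.concat([df1, extra], axis=1)

# join matches df1["key"] against the index of the other frame.
lookup = pd.DataFrame({"label": ["A", "C"]}, index=["a", "c"])
joined = df1.join(lookup, on="key", how="inner")
```

When both frames hold the key as an ordinary column rather than an index, pd.merge(df1, df2, on="key") is usually the more direct choice.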


    Article link: https://www.haomeiwen.com/subject/ocqqbctx.html