Notes on 21 Pandas Tips


Author: 开心人开发世界 | Source: published 2019-09-22 07:17

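    Introduction

    Pandas is an easy-to-use and powerful library for data analysis. Like NumPy, it vectorizes most basic operations, which makes them much faster than explicit Python loops. The operations covered here are very basic, but they are essential if you are just starting out with Pandas. By convention pandas is imported as pd (import pandas as pd), and the pd object is then used for all the other basic operations.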

    1. How do you read data from a CSV file or a text file?

    CSV files are comma-separated, so to read one, do the following:

    df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None)
    Explanation:
    The read_csv function has a plethora of parameters; only a few of the most commonly used are shown here. A few key points:
    a) header=0 means the column names are in the first row of the file. If the file has no header row, specify header=None.
    b) index_col=False tells pandas not to use the first column of the data as the index of the data frame. If the first column really is an index, pass index_col=0 instead.
    c) names=None means you are not specifying the column names and want them inferred from the CSV file, so the row given by header must contain the column names. Otherwise, pass the names here in the same order as the data in the CSV file.
    If you are reading a text file separated by spaces or tabs, simply change sep:
    sep=' ' or sep='\t'
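
A minimal sketch of the read_csv options discussed above. The CSV text, the column names, and the use of io.StringIO in place of a real file path are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "name,age,height\nAnn,20,180\nBob,18,165\n"

# header=0: the first line of the file holds the column names.
df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, index_col=False)

# Tab-separated data without a header row: supply the names explicitly.
raw_text = "Ann\t20\t180\nBob\t18\t165\n"
df_tab = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",
    header=None,
    names=["name", "age", "height"],
)
```

Both calls should produce the same two-row data frame; only the way the column names are obtained differs.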
    

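    2. How do you create a data frame from a dictionary of existing columns or from a 2D NumPy array?

    Using a dictionary

    # c1, c2, c3 hold the column data (lists or arrays of equal length).
    d_dic = {'first_col_name': c1, 'second_col_names': c2, '3rd_col_name': c3}
    df = pd.DataFrame(data=d_dic)
    

    Using a NumPy array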

    np_data = np.zeros((no_of_samples,no_of_features)) #any_numpy_array
    df = pd.DataFrame(data=np_data, columns = list_of_Col_names)
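
Both construction routes can be tried with small made-up inputs. The lists ages and heights and the column names below are hypothetical values chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical column data of equal length.
ages = [20, 18, 16]
heights = [180, 165, 170]

# Dictionary route: keys become column names, values become columns.
df_from_dict = pd.DataFrame(data={"age": ages, "height": heights})

# NumPy route: a 2D array of shape (samples, features) plus column names.
np_data = np.zeros((3, 2))
df_from_array = pd.DataFrame(data=np_data, columns=["age", "height"])
```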
    

    3. How do you view the top and bottom x rows of a data frame?

    df.head(num_of_rows_to_view) #top_values
    df.tail(num_of_rows_to_view) #bottom_values
    col = list_of_columns_to_view 
    df[col].head(num_of_rows_to_view)
    df[col].tail(num_of_rows_to_view)
    

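    4. How do you rename one or more columns?

    df = pd.DataFrame(data={'a':[1,2,3,4,5],'b':[0,1,5,10,15]})
    new_df = df.rename(columns={'a':'new_a','b':'new_b'})
    

    Because rename does not work in place by default, it is important to store the returned data frame in a new variable.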
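
A short sketch of renaming on a small example frame. It also shows a detail worth knowing: when the mapping is passed without the columns keyword, rename looks at the index labels, so the column names stay unchanged.

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 3], "b": [0, 1, 5]})

# Without columns=, the mapping is matched against index labels (0, 1, 2),
# so nothing is renamed here.
index_attempt = df.rename({"a": "new_a"})

# With columns=, the column names are renamed in the returned copy.
new_df = df.rename(columns={"a": "new_a", "b": "new_b"})
```

The original df keeps its column names in both cases, because rename returns a new data frame.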

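    5. How do you get the column names as a list?

    df.columns.tolist()
    

    If you only want to iterate over the names, you can skip tolist(): df.columns is an Index object that can be iterated directly.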

    6. How do you get the frequency of the values in a series?

    df[col].value_counts()       # returns a Series mapping each value to its frequency
    df[col].value_counts()[key]  # frequency of a single value, key
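
A small sketch of value_counts on a made-up age column. The normalize=True call is an addition not mentioned in the text; it returns relative frequencies instead of counts.

```python
import pandas as pd

# Hypothetical data: ages of six people.
df = pd.DataFrame({"age": [20, 20, 18, 18, 18, 16]})

# Series indexed by the distinct values, sorted by descending count.
counts = df["age"].value_counts()

# Frequency of one particular value.
freq_18 = counts[18]

# Relative frequencies (fractions that sum to 1).
share = df["age"].value_counts(normalize=True)
```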
    

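    7. How do you reset the index to an existing column or to another list or array?

    new_df = df.reset_index(drop=True,inplace=False)
    

    With inplace=True you do not need to store the result in new_df. Also, when you reset the index to a pandas RangeIndex, you can either keep the old index (it becomes a column) or discard it with the drop parameter. You may want to keep it, especially when it was originally one of the columns and you had only temporarily set it as the index.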

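    8. How do you drop columns?

    df.drop(columns = list_of_cols_to_drop)
    

    9. How do you change the index of a data frame?

    df.set_index(col_name,inplace=True)
    

    This sets the column col_name as the index. You can pass several columns to set all of them as the index. The inplace keyword has the same purpose as before.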
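
The round trip between set_index and reset_index can be sketched as follows. The frame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [20, 18]})

# Move the 'name' column into the index.
indexed = df.set_index("name")

# reset_index() brings the old index back as a regular column.
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column.
dropped = indexed.reset_index(drop=True)
```

Neither call uses inplace=True here, so each result is stored in its own variable and the earlier frames stay untouched.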

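    10. How do you drop rows or columns that contain NaN values?

    df.dropna(axis=0,inplace=True)
    

    axis=0 drops every row that contains a NaN value in any column. axis=1 drops every column that contains a NaN value, which is probably the last thing you want. inplace works as above.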
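
A quick sketch, on made-up data, of what each axis value removes in dropna: axis=0 removes rows that contain NaN, axis=1 removes columns that contain NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age.
df = pd.DataFrame({"age": [20, np.nan, 16], "height": [180, 175, 170]})

# axis=0: drop the rows that contain at least one NaN.
rows_kept = df.dropna(axis=0)

# axis=1: drop the columns that contain at least one NaN.
cols_kept = df.dropna(axis=1)
```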

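    11. How do you slice a data frame based on a condition?

    You always specify a mask in the form of a logical condition.
    For example, suppose you have an age column and want the rows where age equals a specific value or is one of a list of values. You can slice as follows:

    mask = df['age'] == age_value
    # or
    mask = df['age'].isin(list_of_age_values)
    result = df[mask]
    

    With multiple conditions: for example, select the rows where both height and age match specific values.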

    mask = (df['age']==age_value) & (df['height'] == height_value)
    result = df[mask]
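
A sketch of boolean masks on a small hypothetical frame. The | (or) example is an addition to the text; like &, it requires parentheses around each condition because of Python operator precedence.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 20, 18, 16], "height": [180, 175, 165, 170]})

# Single condition.
age_20 = df[df["age"] == 20]

# Both conditions must hold (element-wise and).
both = df[(df["age"] == 20) & (df["height"] == 175)]

# Either condition may hold (element-wise or).
either = df[(df["age"] == 16) | (df["height"] >= 180)]
```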
    

    12. How do you slice a data frame by column names or by the index values of rows?

    There are four options: at, iat, loc, and iloc. 'iat' and 'iloc' are similar in that they provide integer-based (positional) indexing, whereas 'loc' and 'at' provide label-based indexing.

    Also note that 'iat' and 'at' access a single element, whereas 'loc' and 'iloc' can slice multiple elements.

    Examples:
    a)
    df.iat[1,2] returns the element at row position 1 and column position 2, counting from 0. Note that the number 1 here does not correspond to the label 1 in the index of the data frame; the index may not contain 1 at all. It works like Python list indexing.
    b)
    df.at[first,col_name] returns the value in the row whose index label is first and whose column name is col_name.
    c)
    df.loc[list_of_indices,list_of_cols]
    e.g. df.loc[[4,5],['age','height']]
    Slices the data frame by matching index labels and column names.
    d)
    df.iloc[[0,1],[5,6]] uses integer-based indexing and returns the rows at positions 0 and 1 for the columns at positions 5 and 6.
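
The four accessors can be compared on one small frame. The string index labels below are hypothetical and chosen so that labels and positions cannot be confused.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 18, 16], "height": [180, 165, 170]},
    index=["ann", "bob", "cid"],
)

# Positional access to a single element: row position 1, column position 1.
by_position = df.iat[1, 1]

# Label access to the same element.
by_label = df.at["bob", "height"]

# Label-based slicing of several rows and columns.
loc_slice = df.loc[["ann", "cid"], ["age"]]

# Position-based slicing of several rows and columns.
iloc_slice = df.iloc[[0, 1], [1]]
```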
    

    13. How do you iterate over rows?

    iterrows() and itertuples()
    total = 0
    for i, row in df.iterrows():
        total += row['height']
    iterrows() yields (index, row) pairs, with each row returned as a Series. Do not rely on modifying a row to change the data frame: depending on the data types the row is a copy, so writes to it may have no effect.
    itertuples() returns named tuples
    for row in df.itertuples():
        print(row.age)
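
A runnable sketch of both iteration styles on a hypothetical frame. The last line is an addition to the text: for a simple sum, the vectorized column method gives the same result without a Python loop.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18], "height": [180, 165]})

# iterrows(): each row arrives as a Series, accessed by column name.
total = 0
for idx, row in df.iterrows():
    total += row["height"]

# itertuples(): each row arrives as a named tuple, accessed by attribute.
ages = [row.age for row in df.itertuples()]

# Vectorized alternative for the same sum.
vectorized_total = df["height"].sum()
```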
    

    14. How do you sort by columns?

    df.sort_values(by = list_of_cols,ascending=True) 
    

    15. How do you apply a function to each element of a series?

    df['series_name'].apply(f)
    where f is the function you want to apply to each element of the series. If you also want to pass arguments to the custom function, you can modify it like this:
    def f(x, **kwargs):
        # do something with x and kwargs
        return value_to_store
    df['series_name'].apply(f, a=1, b=2, c=3)
    If you want to apply a function to more than one series:
    def f(row):
        age = row['age']
        height = row['height']
        return value_to_store
    df[['age','height']].apply(f, axis=1)
    Without axis=1, f is called once per column and receives the whole column as a Series. With axis=1, f receives each row, so the age and height of that row are available for any manipulation you want.
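
A concrete sketch of both apply patterns. The functions scale and ratio, their parameters, and the data are hypothetical and exist only to make the calls runnable.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 18], "height": [180, 165]})


def scale(x, factor=1, offset=0):
    """Element-wise function that also takes keyword arguments."""
    return x * factor + offset


# Extra keyword arguments given to apply are forwarded to the function.
scaled = df["height"].apply(scale, factor=2, offset=1)


def ratio(row):
    """Row-wise function: receives one row as a Series."""
    return row["height"] / row["age"]


# axis=1 passes each row to the function, giving one result per row.
ratios = df[["age", "height"]].apply(ratio, axis=1)
```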
    

    16. How do you apply a function to all elements of a data frame?

    new_df = df.applymap(f)  # deprecated since pandas 2.1; use df.map(f) there
    

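    17. How do you slice a data frame when the values of a series are in a list?

    Use masking with isin. To select the rows whose age is in the list:

    df[df['age'].isin(age_list)]
    

    To select the opposite, the rows whose age is not in the list, use: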

    df[~df['age'].isin(age_list)]
    

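    18. How do you group by a column and aggregate another column or apply a function to it?

    df.groupby(['age']).agg({'height':'mean'})
    

    This groups the data frame by the series 'age' and, for the height column, computes the mean of each group. Sometimes you may want to group by a particular column and collect all the corresponding grouped elements of the other columns into lists. You can do that as follows: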

    df.groupby(['age']).agg(list)
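
Both aggregations can be checked on a small frame. The data is hypothetical, and the column is named height here for brevity.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [20, 20, 18, 18, 16], "height": [180, 175, 165, 163, 170]}
)

# Mean height per age group; the group key becomes the index.
mean_height = df.groupby(["age"]).agg({"height": "mean"})

# Collect the heights of each age group into a Python list.
as_lists = df.groupby(["age"]).agg(list)
```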
    

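    19. How do you duplicate the other columns for each element of a list stored in a particular column?

    This question may be a little confusing. What I mean is: suppose you have the following data frame df: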

    Age Height(in cm)
    20  180
    20  175
    18  165
    18  163
    16  170
    

    After grouping with the list aggregator, you might get something like this:

    Age Height(in cm)
    20  [180,175]
    18  [165,163]
    16  [170]
    

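    Now, what if you want to get back to the original data frame by undoing that last operation? You can do so with explode, an operation introduced in pandas version 0.25.

    df['height'].explode()   # one row per list element, for the height column only
    df.explode('height')     # the same, keeping the other columns of df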
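
A sketch of undoing a list aggregation with explode. The frame grouped is built by hand to mimic the grouped result (age as the index, lists in a column named height); these names are assumptions for illustration.

```python
import pandas as pd

# Hand-built stand-in for the grouped result: one list of heights per age.
grouped = pd.DataFrame(
    {"height": [[180, 175], [165, 163], [170]]},
    index=pd.Index([20, 18, 16], name="age"),
)

# explode() emits one row per list element and repeats the index label;
# reset_index() then turns 'age' back into a regular column.
restored = grouped.explode("height").reset_index()
```

Note that the exploded column keeps the object dtype; converting it back to integers would need an explicit astype call.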
    

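    20. How do you concatenate two data frames?

    Suppose you have two data frames, df1 and df2, both with the columns name, age, and height, and you want to concatenate them. axis=0 is the vertical axis: the rows of df2 are appended below the rows of df1.

    # df1 --> name, age, height
    # df2 --> name, age, height
    result = pd.concat([df1,df2],axis=0)
    

    For horizontal concatenation:

    # df1 --> name, age
    # df2 --> height, salary
    result = pd.concat([df1,df2], axis=1)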
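
A small sketch of both directions of concat on made-up frames. ignore_index=True is an addition to the text; it renumbers the rows of the stacked result instead of repeating the original index labels.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Ann"], "age": [20]})
df2 = pd.DataFrame({"name": ["Bob"], "age": [18]})

# Vertical: rows of df2 are placed below the rows of df1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal: columns are placed side by side, aligned on the index.
extra = pd.DataFrame({"height": [180], "salary": [1000]})
side_by_side = pd.concat([df1, extra], axis=1)
```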
    

    21. How do you merge two data frames?

    Continuing the previous example, assume you have an employee database held in two data frames:
    df1 --> name, age, height
    df2 --> name, salary, pincode, sick_leaves_taken
    You may want to combine these two data frames so that each row has all the details of an employee. To achieve this, perform a merge operation.
    df1.merge(df2, on=['name'], how='inner')
    This operation returns a data frame in which each row comprises name, age, height, salary, pincode, and sick_leaves_taken.
    how='inner' means a row is included in the result only if there is a matching name in both data frames. For more, read: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
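
A sketch of merge on two small hypothetical frames. The how='left' call is an addition to the text, included to show how unmatched rows are treated differently from how='inner'.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [20, 18, 16]})
df2 = pd.DataFrame({"name": ["Ann", "Bob", "Dee"], "salary": [100, 90, 80]})

# inner: keep only the names present in both frames.
inner = df1.merge(df2, on=["name"], how="inner")

# left: keep every row of df1; missing salaries become NaN.
left = df1.merge(df2, on=["name"], how="left")
```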
    

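    Summary

    As a beginner on any data analysis project, you will need to know these operations well. I have always found Pandas to be a very useful library, and it can now be integrated with various other data analysis tools and languages. Knowing pandas operations may even help when learning languages that support distributed algorithms.

    Connect

    If you liked this article, please clap and share it with others who may find it useful. I really love data science; if you are interested in it too, let's connect on LinkedIn, or follow me here on the Towards Data Science platform.

    Translated from: https://towardsdatascience.com/21-pandas-operations-for-absolute-beginners-5653e54f4cda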


          Article link: https://www.haomeiwen.com/subject/srxructx.html