Introduction to Pandas

Author: IntoTheVoid | Published: 2018-09-05 17:48

    Introduction to Pandas

    Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas adds two data structures to Python, namely the Pandas Series and the Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

    In the following lessons you will learn:

    • How to import Pandas
    • How to create Pandas Series and DataFrames using various methods
    • How to access and change elements in Series and DataFrames
    • How to perform arithmetic operations on Series
    • How to load data into a DataFrame
    • How to deal with Not a Number (NaN) values

    The following lessons assume that you are already familiar with NumPy and have gone over the previous NumPy lessons. Therefore, to avoid being repetitive we will omit a lot of details already given in the NumPy lessons. Consequently, if you haven't seen the NumPy lessons we suggest you go over them first.

    Downloading Pandas

    Pandas is included with Anaconda. If you don't already have Anaconda installed on your computer, please refer to the Anaconda section to get clear instructions on how to install Anaconda on your PC or Mac.

    Pandas Versions

    As with many Python packages, Pandas is updated from time to time. The following lessons were created using Pandas version 0.22. You can check which version of Pandas you have by typing !conda list pandas in your Jupyter notebook or by typing conda list pandas in the Anaconda prompt. If you have a different version of Pandas installed on your computer, you can switch to version 0.22 by typing conda install pandas=0.22 in the Anaconda prompt. As newer versions of Pandas are released, some functions may become obsolete or be replaced, so make sure you have the matching Pandas version before running the code. This makes it much more likely that the code in these lessons runs as shown.
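    The installed version can also be checked from Python itself. The following is a minimal sketch, assuming Pandas is already installed; the variable names are illustrative.

```python
import pandas as pd

# The version string of the installed Pandas package, for example '0.22.0'
installed_version = pd.__version__
print('Pandas version:', installed_version)

# Split out the major and minor parts to compare against the version used here
major, minor = installed_version.split('.')[:2]
print('Major:', major, 'Minor:', minor)
```

    If the major and minor parts differ from 0 and 22, some of the outputs in these lessons may look different on your machine.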

    Pandas Documentation

    Pandas is a remarkable data analysis library with many functions and features. In these introductory lessons we will only scratch the surface of what Pandas can do. If you want to learn more about Pandas, make sure you check out the Pandas Documentation:

    Pandas Documentation

    Why Use Pandas?

    The recent success of machine learning algorithms is partly due to the huge amounts of data that we have available to train our algorithms on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important. Large datasets often don't come ready to be fed into your learning algorithms. More often than not, they have missing values, outliers, incorrect values, and so on. Data with a lot of missing or bad values, for example, is not going to allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

    • Allows the use of labels for rows and columns
    • Can calculate rolling statistics on time series data
    • Easy handling of NaN values
    • Is able to load data of different formats into DataFrames
    • Can join and merge different datasets together
    • It integrates with NumPy and Matplotlib

    For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.

    Creating Pandas Series

    A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference between Pandas Series and NumPy ndarrays is that a single Pandas Series can hold elements of different data types.

    Let's start by importing Pandas into Python. It has become a convention to import Pandas as pd, so you can import Pandas by typing the following command in your Jupyter notebook:

    import pandas as pd
    

    Let's begin by creating a Pandas Series. You can create Pandas Series by using the command pd.Series(data, index), where index is a list of index labels. Let's use a Pandas Series to store a grocery list. We will use the food items as index labels and the quantity we need to buy of each item as our data.

    # We import Pandas as pd into Python
    import pandas as pd
    
    # We create a Pandas Series that stores a grocery list
    groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])
    
    # We display the Groceries Pandas Series
    groceries
    

    eggs 30
    apples 6
    milk Yes
    bread No
    dtype: object

    We see that Pandas Series are displayed with the indices in the first column and the data in the second column. Notice that the data is not indexed 0 to 3 but rather it is indexed with the names of the food we put in, namely eggs, apples, etc... Also notice that the data in our Pandas Series has both integers and strings.
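    Another way to build the same Series, sketched here as an illustration rather than taken from the lessons, is to pass a Python dictionary to pd.Series(). The dictionary keys become the index labels. The variable names are illustrative, and the order of the labels in the result may depend on the Pandas version.

```python
import pandas as pd

# A plain dictionary that maps each food item to the quantity we need
groceries_dict = {'eggs': 30, 'apples': 6, 'milk': 'Yes', 'bread': 'No'}

# The keys become the index labels and the values become the data
groceries_from_dict = pd.Series(groceries_dict)

print(groceries_from_dict)
print('How many eggs do we need to buy:', groceries_from_dict['eggs'])
```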

    Just like NumPy ndarrays, Pandas Series have attributes that allow us to get information from the Series in an easy way. Let's see some of them:

    # We print some information about Groceries
    print('Groceries has shape:', groceries.shape)
    print('Groceries has dimension:', groceries.ndim)
    print('Groceries has a total of', groceries.size, 'elements')
    

    Groceries has shape: (4,)
    Groceries has dimension: 1
    Groceries has a total of 4 elements

    We can also print the index labels and the data of the Pandas Series separately. This is useful if you don't happen to know what the index labels of the Pandas Series are.

    # We print the index and data of Groceries
    print('The data in Groceries is:', groceries.values)
    print('The index of Groceries is:', groceries.index)
    

    The data in Groceries is: [30 6 'Yes' 'No']
    The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

    If you are dealing with a very large Pandas Series and you are not sure whether an index label exists, you can check by using the in operator:

    # We check whether bananas is a food item (an index) in Groceries
    x = 'bananas' in groceries
    
    # We check whether bread is a food item (an index) in Groceries
    y = 'bread' in groceries
    
    # We print the results
    print('Is bananas an index label in Groceries:', x)
    print('Is bread an index label in Groceries:', y)
    

    Is bananas an index label in Groceries: False
    Is bread an index label in Groceries: True
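    One detail worth keeping in mind is that in tests the index labels, not the data. The sketch below rebuilds the grocery list and contrasts the two; using .isin() to search the data is one possible approach, not the only one.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# `in` looks at the index labels only
label_found = 'eggs' in groceries       # 'eggs' is a label
value_as_label = 'Yes' in groceries     # 'Yes' is a value, not a label

# To test the data instead, compare against the values
value_found = groceries.isin(['Yes']).any()

print('Is eggs an index label:', label_found)
print('Is Yes an index label:', value_as_label)
print('Is Yes one of the values:', value_found)
```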

    Accessing and Deleting Elements in Pandas Series

    Now let's look at how we can access or modify elements in a Pandas Series. One great advantage of Pandas Series is that they allow us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively.

    Because elements can be accessed in various ways, Pandas Series have two attributes, .loc and .iloc, that remove any ambiguity as to whether we are referring to an index label or a numerical index. The attribute .loc stands for location and explicitly states that we are using a labeled index. Similarly, the attribute .iloc stands for integer location and explicitly states that we are using a numerical index. Note that in more recent Pandas releases, using plain square brackets with integer positions on a Series that has non-integer labels is deprecated, so .iloc is the safer choice for positional access. Let's see some examples:

    # We access elements in Groceries using index labels:
    
    # We use a single index label
    print('How many eggs do we need to buy:', groceries['eggs'])
    print()
    
    # we can access multiple index labels
    print('Do we need milk and bread:\n', groceries[['milk', 'bread']]) 
    print()
    
    # we use loc to access multiple index labels
    print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']]) 
    print()
    
    # We access elements in Groceries using numerical indices:
    
    # we use multiple numerical indices
    print('How many eggs and apples do we need to buy:\n',  groceries[[0, 1]]) 
    print()
    
    # We use a negative numerical index
    print('Do we need bread:\n', groceries[[-1]]) 
    print()
    
    # We use a single numerical index
    print('How many eggs do we need to buy:', groceries[0]) 
    print()
    # we use iloc to access multiple numerical indices
    print('Do we need milk and bread:\n', groceries.iloc[[2, 3]]) 
    

    How many eggs do we need to buy: 30

    Do we need milk and bread:
    milk Yes
    bread No
    dtype: object

    How many eggs and apples do we need to buy:
    eggs 30
    apples 6
    dtype: object

    How many eggs and apples do we need to buy:
    eggs 30
    apples 6
    dtype: object

    Do we need bread:
    bread No
    dtype: object

    How many eggs do we need to buy: 30

    Do we need milk and bread:
    milk Yes
    bread No
    dtype: object
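    The difference between .loc and .iloc becomes most visible when the index labels are themselves integers. The following sketch uses a hypothetical Series named stock, which is not part of the lessons, whose labels are 10, 20 and 30.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2
stock = pd.Series(data=['low', 'medium', 'high'], index=[10, 20, 30])

by_label = stock.loc[10]       # the element labeled 10
by_position = stock.iloc[0]    # the element at position 0
last_item = stock.iloc[-1]     # the element at the last position

# stock.loc[0] would raise a KeyError, because no element is labeled 0
has_label_zero = 0 in stock

print(by_label, by_position, last_item)   # low low high
print('Is 0 an index label:', has_label_zero)
```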

    Pandas Series are also mutable like NumPy ndarrays, which means we can change the elements of a Pandas Series after it has been created. For example, let's change the number of eggs we need to buy from our grocery list:

    # We display the original grocery list
    print('Original Grocery List:\n', groceries)
    
    # We change the number of eggs to 2
    groceries['eggs'] = 2
    
    # We display the changed grocery list
    print()
    print('Modified Grocery List:\n', groceries)
    

    Original Grocery List:
    eggs 30
    apples 6
    milk Yes
    bread No
    dtype: object

    Modified Grocery List:
    eggs 2
    apples 6
    milk Yes
    bread No
    dtype: object

    We can also delete items from a Pandas Series by using the .drop() method. The Series.drop(label) method removes the given label from the given Series. We should note that Series.drop(label) drops elements out of place, meaning that it returns a new Series and doesn't change the original Series. Let's see how this works:

    # We display the original grocery list
    print('Original Grocery List:\n', groceries)
    
    # We remove apples from our grocery list. The drop function removes elements out of place
    print()
    print('We remove apples (out of place):\n', groceries.drop('apples'))
    
    # When we remove elements out of place the original Series remains intact. To see this
    # we display our grocery list again
    print()
    print('Grocery List after removing apples out of place:\n', groceries)
    

    Original Grocery List:
    eggs 2
    apples 6
    milk Yes
    bread No
    dtype: object

    We remove apples (out of place):
    eggs 2
    milk Yes
    bread No
    dtype: object

    Grocery List after removing apples out of place:
    eggs 2
    apples 6
    milk Yes
    bread No
    dtype: object

    We can delete items from a Pandas Series in place by setting the keyword inplace to True in the .drop() method. Let's see an example:

    # We display the original grocery list
    print('Original Grocery List:\n', groceries)
    
    # We remove apples from our grocery list in place by setting the inplace keyword to True
    groceries.drop('apples', inplace = True)
    
    # When we remove elements in place the original Series is modified. To see this
    # we display our grocery list again
    print()
    print('Grocery List after removing apples in place:\n', groceries)
    

    Original Grocery List:
    eggs 2
    apples 6
    milk Yes
    bread No
    dtype: object

    Grocery List after removing apples in place:
    eggs 2
    milk Yes
    bread No
    dtype: object
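    An alternative to inplace = True, sketched below, is to keep the default out-of-place behavior and assign the result to a name. The example rebuilds the original grocery list so that it runs on its own; the names without_apples and only_numbers are illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# drop() returns a new Series; the original keeps all of its labels
without_apples = groceries.drop('apples')
print('apples' in groceries)         # True
print('apples' in without_apples)    # False

# Several labels can be dropped at once by passing a list
only_numbers = groceries.drop(['milk', 'bread'])
print(only_numbers.index.tolist())   # ['eggs', 'apples']
```

    Keeping the original Series intact makes it easier to rerun a notebook cell without losing data.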

    Arithmetic Operations on Pandas Series

    Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series. In this lesson we will look at arithmetic operations between Pandas Series and single numbers. Let's create a new Pandas Series that will hold a grocery list of just fruits.

    # We create a Pandas Series that stores a grocery list of just fruits
    fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])
    
    # We display the fruits Pandas Series
    fruits
    

    apples 10
    oranges 6
    bananas 3
    dtype: int64

    We can now modify the data in fruits by performing basic arithmetic operations. Let's see some examples:

    # We print fruits for reference
    print('Original grocery list of fruits:\n ', fruits)
    
    # We perform basic element-wise operations using arithmetic symbols
    print()
    print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
    print()
    print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
    print()
    print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
    print()
    print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
    print()
    

    Original grocery list of fruits:
    apples 10
    oranges 6
    bananas 3
    dtype: int64

    fruits + 2:
    apples 12
    oranges 8
    bananas 5
    dtype: int64

    fruits - 2:
    apples 8
    oranges 4
    bananas 1
    dtype: int64

    fruits * 2:
    apples 20
    oranges 12
    bananas 6
    dtype: int64

    fruits / 2:
    apples 5.0
    oranges 3.0
    bananas 1.5
    dtype: float64

    You can also apply mathematical functions from NumPy, such as sqrt(x), to all elements of a Pandas Series.

    # We import NumPy as np to be able to use the mathematical functions
    import numpy as np
    
    # We print fruits for reference
    print('Original grocery list of fruits:\n', fruits)
    
    # We apply different mathematical functions to all elements of fruits
    print()
    print('EXP(X) = \n', np.exp(fruits))
    print() 
    print('SQRT(X) =\n', np.sqrt(fruits))
    print()
    print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2
    

    Original grocery list of fruits:
    apples 10
    oranges 6
    bananas 3
    dtype: int64

    EXP(X) =
    apples 22026.465795
    oranges 403.428793
    bananas 20.085537
    dtype: float64

    SQRT(X) =
    apples 3.162278
    oranges 2.449490
    bananas 1.732051
    dtype: float64

    POW(X,2) =
    apples 100
    oranges 36
    bananas 9
    dtype: int64

    Pandas also allows us to apply arithmetic operations to only selected items in our fruits grocery list. Let's see some examples:

    # We print fruits for reference
    print('Original grocery list of fruits:\n ', fruits)
    print()
    
    # We add 2 only to the bananas
    print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
    print()
    
    # We subtract 2 from apples
    print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
    print()
    
    # We multiply apples and oranges by 2
    print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
    print()
    
    # We divide apples and oranges by 2
    print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)
    

    Original grocery list of fruits:
    apples 10
    oranges 6
    bananas 3
    dtype: int64

    Amount of bananas + 2 = 5

    Amount of apples - 2 = 8

    We double the amount of apples and oranges:
    apples 20
    oranges 12
    dtype: int64

    We half the amount of apples and oranges:
    apples 5.0
    oranges 3.0
    dtype: float64
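    The expressions above compute new values without changing fruits itself. A minimal sketch of how the results could be stored back into the Series follows; it rebuilds fruits so that it runs on its own.

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=['apples', 'oranges', 'bananas'])

# The expression alone produces a new value; fruits is unchanged
bananas_plus_two = fruits['bananas'] + 2
print(fruits['bananas'])   # 3

# Assigning the result back updates the Series
fruits['bananas'] = fruits['bananas'] + 2
print(fruits['bananas'])   # 5

# The same works for several labels at once with .loc
fruits.loc[['apples', 'oranges']] = fruits.loc[['apples', 'oranges']] * 2
print(fruits['apples'], fruits['oranges'])   # 20 12
```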

    You can also apply arithmetic operations to a Pandas Series of mixed data types, provided that the arithmetic operation is defined for all data types in the Series; otherwise you will get an error. Let's see what happens when we multiply our grocery list by 2:

    # We multiply our grocery list by 2
    groceries * 2
    

    eggs 4
    milk YesYes
    bread NoNo
    dtype: object

    As we can see, since we multiplied by 2, Pandas doubles the data of each item, including the strings. Pandas can do this because the multiplication operation * is defined both for numbers and strings. If you were to apply an operation that is valid for numbers but not for strings, for instance /, you would get an error. So when you have mixed data types in your Pandas Series, make sure the arithmetic operations are valid for all the data types of your elements.
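    The failing case can be tried safely inside a try block. The sketch below rebuilds the original grocery list and assumes that the error Pandas raises for dividing a string is a TypeError, which is what Python itself raises for 'Yes' / 2.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# Multiplication is defined for both integers and strings
doubled = groceries * 2
print(doubled['milk'])   # YesYes

# Division is not defined for strings, so the whole operation fails
try:
    groceries / 2
    division_failed = False
except TypeError as error:
    division_failed = True
    print('Division failed:', error)
```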

    Exercise

    import pandas as pd
    
    # Create a Pandas Series that contains the distance of some planets from the Sun.
    # Use the name of the planets as the index to your Pandas Series, and the distance
    # from the Sun as your data. The distance from the Sun is in units of 10^6 km
    
    distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]
    
    planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']
    
    # Create a Pandas Series using the above data, with the name of the planets as
    # the index and the distance from the Sun as your data.
    dist_planets = 
    
    # Calculate the number of minutes it takes sunlight to reach each planet. You can
    # do this by dividing the distance from the Sun for each planet by the speed of light.
    # Since in the data above the distance from the Sun is in units of 10^6 km, you can
    # use a value for the speed of light of c = 18, since light travels 18 x 10^6 km/minute.
    time_light = 
    
    # Use Boolean indexing to select only those planets for which sunlight takes less
    # than 40 minutes to reach them.
    close_planets = 

    Solution

    import pandas as pd
    
    distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]
    
    planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']
    
    dist_planets = pd.Series(data = distance_from_sun, index = planets)
    
    time_light = dist_planets / 18
    
    close_planets = time_light[time_light < 40]
    

    Creating Pandas DataFrames

    Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file. In these lessons we will start by learning how to create Pandas DataFrames manually from dictionaries and later we will see how we can load data into a DataFrame from a data file.

    We will start by creating a DataFrame manually from a dictionary of Pandas Series. In this case the first step is to create the dictionary of Pandas Series. After the dictionary is created we can then pass the dictionary to the pd.DataFrame() function.

    We will create a dictionary that contains items purchased by two people, Alice and Bob, on an online store. The Pandas Series will use the price of the items purchased as data, and the purchased items will be used as the index labels of the Pandas Series. Let's see how this is done in code:

    # We import Pandas as pd into Python
    import pandas as pd
    
    # We create a dictionary of Pandas Series 
    items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
             'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}
    
    # We print the type of items to see that it is a dictionary
    print(type(items))
    

    <class 'dict'>

    Now that we have a dictionary, we are ready to create a DataFrame by passing it to the pd.DataFrame() function. We will create a DataFrame that could represent the shopping carts of various users; in this case we have only two users, Alice and Bob.

    # We create a Pandas DataFrame by passing it a dictionary of Pandas Series
    shopping_carts = pd.DataFrame(items)
    
    # We display the DataFrame
    shopping_carts
    
             Alice    Bob
    bike     500.0  245.0
    book      40.0    NaN
    glasses  110.0    NaN
    pants     45.0   25.0
    watch      NaN   55.0

    There are several things worth pointing out here. DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in bold. The row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary, and the column labels are taken from the keys of the dictionary. Another thing to notice is that the columns are arranged alphabetically and not in the order given in the dictionary. This is the behavior of the Pandas version used in these lessons; more recent versions keep the order of the dictionary keys. We will see later that this reordering doesn't happen when we load data into a DataFrame from a data file.

    The last thing we want to point out is that some NaN values appear in the DataFrame. NaN stands for Not a Number, and is Pandas' way of indicating that it doesn't have a value for that particular row and column index. For example, if we look at the Alice column, we see that it has NaN in the watch row. You can see why this is the case by looking at the dictionary we created at the beginning: it has no item labeled watch for Alice. So whenever a DataFrame is created, if a particular column doesn't have a value for a particular row index, Pandas will put a NaN value there. If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first. In a later lesson we will learn how to deal with NaN values and clean our data. For now, we will leave these values in our DataFrame.
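    The two observations about row labels and NaN values can be checked directly. The following is a small sketch that rebuilds shopping_carts; expected_rows is an illustrative name, and isnull() is used here only to count the missing entries.

```python
import pandas as pd

items = {'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
         'Alice': pd.Series(data=[40, 110, 500, 45],
                            index=['book', 'glasses', 'bike', 'pants'])}
shopping_carts = pd.DataFrame(items)

# The row labels are the union of the index labels of the two Series
expected_rows = set(items['Bob'].index) | set(items['Alice'].index)
print(set(shopping_carts.index) == expected_rows)   # True

# Alice has no watch, so that cell holds NaN
print(pd.isnull(shopping_carts.loc['watch', 'Alice']))   # True

# Number of missing entries in each column
missing_per_column = shopping_carts.isnull().sum()
print(missing_per_column)
```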

    In the above example we created a Pandas DataFrame from a dictionary of Pandas Series that had clearly defined indexes. If we don't provide index labels to the Pandas Series, Pandas will use numerical row indexes when it creates the DataFrame. Let's see an example:

    # We create a dictionary of Pandas Series without indexes
    data = {'Bob' : pd.Series([245, 25, 55]),
            'Alice' : pd.Series([40, 110, 500, 45])}
    
    # We create a DataFrame
    df = pd.DataFrame(data)
    
    # We display the DataFrame
    df
    

    We can see that Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.

    Now, just like with Pandas Series, we can also extract information from DataFrames using attributes. Let's print some information from our shopping_carts DataFrame:

    # We print some information about shopping_carts
    print('shopping_carts has shape:', shopping_carts.shape)
    print('shopping_carts has dimension:', shopping_carts.ndim)
    print('shopping_carts has a total of:', shopping_carts.size, 'elements')
    print()
    print('The data in shopping_carts is:\n', shopping_carts.values)
    print()
    print('The row index in shopping_carts is:', shopping_carts.index)
    print()
    print('The column index in shopping_carts is:', shopping_carts.columns)
    

    When creating the shopping_carts DataFrame we passed the entire dictionary to the pd.DataFrame() function. However, there might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords columns and index. Let's see some examples:

    # We create a DataFrame that only has Bob's data
    bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])
    
    # We display bob_shopping_cart
    bob_shopping_cart
    
    # We create a DataFrame that only has selected items for both Alice and Bob
    sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])
    
    # We display sel_shopping_cart
    sel_shopping_cart
    
    # We create a DataFrame that only has selected items for Alice
    alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])
    
    # We display alice_sel_shopping_cart
    alice_sel_shopping_cart
    

    You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we start by creating the dictionary and then pass the dictionary to the pd.DataFrame() function. In this case, however, all the lists (arrays) in the dictionary must be of the same length. Let's see an example:

    # We create a dictionary of lists (arrays)
    data = {'Integers' : [1,2,3],
            'Floats' : [4.5, 8.2, 9.6]}
    
    # We create a DataFrame 
    df = pd.DataFrame(data)
    
    # We display the DataFrame
    df
    

    Notice that since the data dictionary we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels on the row index by using the index keyword in the pd.DataFrame() function. Let's see an example:

    # We create a dictionary of lists (arrays)
    data = {'Integers' : [1,2,3],
            'Floats' : [4.5, 8.2, 9.6]}
    
    # We create a DataFrame and provide the row index
    df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])
    
    # We display the DataFrame
    df
    

    The last method for manually creating Pandas DataFrames that we want to look at is using a list of Python dictionaries. The procedure is the same as before: we start by creating the list of dictionaries and then pass it to the pd.DataFrame() function.

    # We create a list of Python dictionaries
    items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
              {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
    
    # We create a DataFrame 
    store_items = pd.DataFrame(items2)
    
    # We display the DataFrame
    store_items
    

    Again, notice that since the items2 list we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. As before, we can put labels on the row index by using the index keyword in the pd.DataFrame() function. Let's assume we are going to use this DataFrame to hold the number of items a particular store has in stock. So, we will label the row indices as store 1 and store 2.

    # We create a list of Python dictionaries
    items2 = [{'bikes': 20, 'pants': 30, 'watches': 35}, 
              {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
    
    # We create a DataFrame  and provide the row index
    store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
    
    # We display the DataFrame
    store_items
    

    Accessing Elements in Pandas DataFrames

    We can access elements in Pandas DataFrames in many different ways. In general, we can access rows, columns, or individual elements of the DataFrame by using the row and column labels. We will use the same store_items DataFrame created in the previous lesson. Let's see some examples:

    # We print the store_items DataFrame
    print(store_items)
    
    # We access rows, columns and elements using labels
    print()
    print('How many bikes are in each store:\n', store_items[['bikes']])
    print()
    print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
    print()
    print('What items are in Store 1:\n', store_items.loc[['store 1']])
    print()
    print('How many bikes are in Store 2:', store_items['bikes']['store 2'])
    

    It is important to know that when accessing individual elements in a DataFrame with chained square brackets, as we did in the last example above, the labels should always be provided with the column label first, i.e. in the form dataframe[column][row]. For example, when retrieving the number of bikes in store 2, we first used the column label bikes and then the row label store 2. If you provide the row label first you will get an error.
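    The column-first rule applies to chained square brackets. With the .loc attribute the order is reversed: the row label comes first and the column label second. The sketch below rebuilds store_items and compares the two forms; bikes_chained and bikes_loc are illustrative names.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Chained access: column label first, then row label
bikes_chained = store_items['bikes']['store 2']

# .loc takes the row label first, then the column label
bikes_loc = store_items.loc['store 2', 'bikes']

print(bikes_chained, bikes_loc)   # 15 15
```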

    We can also modify our DataFrames by adding rows or columns. Let's start by learning how to add new columns to our DataFrames. Let's suppose we decided to add shirts to the items we have in stock at each store. To do this, we will need to add a new column to our store_items DataFrame indicating how many shirts are in each store. Let's do that:

    # We add a new column named shirts to our store_items DataFrame indicating the number of
    # shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
    store_items['shirts'] = [15,2]
    
    # We display the modified DataFrame
    store_items
    

    We can see that when we add a new column, the new column is added at the end of our DataFrame.

    We can also add new columns to our DataFrame by using arithmetic operations between other columns in our DataFrame. Let's see an example:

    # We make a new column called suits by adding the number of shirts and pants
    store_items['suits'] = store_items['pants'] + store_items['shirts']
    
    # We display the modified DataFrame
    store_items
    

    Suppose now that you opened a new store and you need to add the number of items in stock at that new store to your DataFrame. We can do this by adding a new row to the store_items DataFrame. To add rows to our DataFrame we first have to create a new DataFrame and then append it to the original DataFrame. Let's see how this works:

    # We create a list with one Python dictionary that holds the number of items at the new store
    new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
    
    # We create a new DataFrame with the new_items and provide an index labeled store 3
    new_store = pd.DataFrame(new_items, index = ['store 3'])
    
    # We display the items at the new store
    new_store
    

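    We now add this row to our store_items DataFrame by using the .append() method. Note that .append() belongs to the Pandas version used in these lessons; it was removed in Pandas 2.0, where pd.concat() takes its place.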

    # We append store 3 to our store_items DataFrame
    store_items = store_items.append(new_store)
    
    # We display the modified DataFrame
    store_items
    

    Notice that by appending a new row to the DataFrame, the columns have been put in alphabetical order.
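    On Pandas versions where .append() is no longer available, the same row can be added with pd.concat(). This is a minimal sketch of that alternative, not the method used in the lessons; it rebuilds store_items without the shirts and suits columns to stay short.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
new_store = pd.DataFrame(new_items, index=['store 3'])

# pd.concat stacks the DataFrames row-wise and aligns them on column labels
store_items = pd.concat([store_items, new_store])

print(store_items.index.tolist())   # ['store 1', 'store 2', 'store 3']
print(store_items)
```

    The column order of the result may differ from the alphabetical order described above, depending on the Pandas version.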

    We can also add new columns to our DataFrame by using only data from particular rows in particular columns. For example, suppose that you want to stock stores 2 and 3 with new watches and you want the quantity of the new watches to be the same as the watches already in stock for those stores. Let's see how we can do this:

    # We add a new column using data from particular rows in the watches column
    store_items['new watches'] = store_items['watches'][1:]
    
    # We display the modified DataFrame
    store_items
    
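    The reason only stores 2 and 3 receive values is that Pandas aligns the assigned data on the row labels. The sketch below illustrates this on a simplified stand-in for store_items that has only a watches column; .iloc[1:] is used for the positional slice, and selected is an illustrative name.

```python
import pandas as pd

# Simplified stand-in for store_items with a single column
store_items = pd.DataFrame({'watches': [35, 10, 35]},
                           index=['store 1', 'store 2', 'store 3'])

# Slicing keeps the row labels of the selected rows
selected = store_items['watches'].iloc[1:]
print(selected.index.tolist())   # ['store 2', 'store 3']

# Assignment aligns on row labels; store 1 has no match and gets NaN
store_items['new watches'] = selected
print(store_items)
```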

    It is also possible to insert new columns into a DataFrame anywhere we want. The dataframe.insert(loc, label, data) method allows us to insert a new column in the DataFrame at location loc, with the given column label and the given data. Let's add a new column named shoes right before the suits column. Since suits has numerical index value 4, we will use this value as loc. Let's see how this works:

    # We insert a new column with label shoes right before the column with numerical index 4
    store_items.insert(4, 'shoes', [8,5,0])
    
    # we display the modified DataFrame
    store_items
    
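    Instead of counting columns by hand, the position of a column can be looked up with columns.get_loc(). The following sketch uses a simplified stand-in for store_items with illustrative values, so the position it finds is 2 rather than 4.

```python
import pandas as pd

# Simplified stand-in for store_items with a few of its columns
store_items = pd.DataFrame({'bikes': [20, 15, 20],
                            'pants': [30, 5, 30],
                            'suits': [45, 7, 30]},
                           index=['store 1', 'store 2', 'store 3'])

# Look up the numerical position of the suits column
suits_position = store_items.columns.get_loc('suits')

# Insert shoes at that position, which places it right before suits
store_items.insert(suits_position, 'shoes', [8, 5, 0])

print(store_items.columns.tolist())   # ['bikes', 'pants', 'shoes', 'suits']
```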

    Just as we can add rows and columns, we can also delete them. To delete rows and columns from our DataFrame we will use the .pop() and .drop() methods. The .pop() method only allows us to delete columns, while the .drop() method can be used to delete both rows and columns by use of the axis keyword. Let's see some examples:

    # We remove the new watches column
    store_items.pop('new watches')
    
    # we display the modified DataFrame
    store_items
    
    # We remove the watches and shoes columns
    store_items = store_items.drop(['watches', 'shoes'], axis = 1)
    
    # we display the modified DataFrame
    store_items
    
    # We remove the store 2 and store 1 rows
    store_items = store_items.drop(['store 2', 'store 1'], axis = 0)
    
    # we display the modified DataFrame
    store_items
    
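
    The axis keyword can be easy to mix up. As an optional alternative, .drop() also accepts index and columns keywords (available since Pandas 0.21) that name the target directly. The sketch below uses a made-up DataFrame and also shows that .pop() hands back the removed column as a Series:

```python
import pandas as pd

# Hypothetical inventory, for illustration only
df = pd.DataFrame(
    {"bikes": [20, 15, 20], "pants": [30, 5, 30],
     "watches": [35, 10, 35], "shoes": [8, 5, 10]},
    index=["store 1", "store 2", "store 3"],
)

# .pop() removes the column in place and returns it as a Series
removed = df.pop("watches")

# .drop() returns a new DataFrame; columns= is equivalent to axis = 1
df = df.drop(columns=["shoes"])

# index= is equivalent to axis = 0
df = df.drop(index=["store 2"])

print(removed)
print(df)
```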

    Sometimes we might need to change the row and column labels. Let's change the bikes column label to hats using the .rename() method:

    # We change the column label bikes to hats
    store_items = store_items.rename(columns = {'bikes': 'hats'})
    
    # we display the modified DataFrame
    store_items
    

    Now let's change the row label using the .rename() method again.

    # We change the row label from store 3 to last store
    store_items = store_items.rename(index = {'store 3': 'last store'})
    
    # we display the modified DataFrame
    store_items
    

    You can also change the index to be one of the columns in the DataFrame.

    # We change the row index to be the data in the pants column
    store_items = store_items.set_index('pants')
    
    # we display the modified DataFrame
    store_items
    
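
    To make the effect of these relabeling steps easier to follow, here is a small self-contained sketch on made-up data. It renames a column and a row in one call, moves the pants column into the index with .set_index(), and then uses .reset_index() to turn that index back into an ordinary column:

```python
import pandas as pd

# Hypothetical inventory, for illustration only
df = pd.DataFrame(
    {"bikes": [20, 15], "pants": [30, 5]},
    index=["store 1", "store 2"],
)

# Rename one column label and one row label in a single call
df = df.rename(columns={"bikes": "hats"}, index={"store 2": "last store"})

# Use the values of the 'pants' column as the new row index.
# The old row labels ('store 1', 'last store') are discarded.
by_pants = df.set_index("pants")

# Move the index back into a regular column; rows get a default 0..n-1 index
restored = by_pants.reset_index()

print(by_pants)
print(restored)
```

    Since .set_index() discards the previous row labels, it may be worth calling .reset_index() first if those labels still matter.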

    Dealing with NaN

    As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type we encounter most often is missing values. As we saw earlier, Pandas assigns NaN values to missing data. In this lesson we will learn how to detect and deal with NaN values.

    We will begin by creating a DataFrame with some NaN values in it.

    # We create a list of Python dictionaries
    items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]
    
    # We create a DataFrame  and provide the row index
    store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
    
    # We display the DataFrame
    store_items
    

    We can clearly see that the DataFrame we created has 3 NaN values: one in store 1 and two in store 3. However, in cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of NaN values is not easily visualized. For these cases, we can use a combination of methods to count the number of NaN values in our data. The following example combines the .isnull() and the .sum() methods to count the number of NaN values in our DataFrame:

    # We count the number of NaN values in store_items
    x =  store_items.isnull().sum().sum()
    
    # We print x
    print('Number of NaN values in our DataFrame:', x)
    

    Number of NaN values in our DataFrame: 3

    In the above example, the .isnull() method returns a Boolean DataFrame of the same size as store_items and indicates with True the elements that are NaN and with False the elements that are not. Let's see an example:

    store_items.isnull()
    

    In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values. In order to count the total number of logical True values we use the .sum() method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical True values along columns, as we see below:

    store_items.isnull().sum()
    

    The second sum will then add up the 1s in the above Pandas Series.

    Instead of counting the number of NaN values, we can also do the opposite and count the number of non-NaN values. We can do this by using the .count() method as shown below:

    # We print the number of non-NaN values in our DataFrame
    print()
    print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())
    
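
    The counting steps above can be put together in one self-contained sketch. It rebuilds the store_items DataFrame from this lesson; the variable names nan_mask, nan_per_column, nan_per_row, nan_total and non_nan_total are our own, introduced only for illustration.

```python
import pandas as pd

# Same data as the store_items DataFrame created in this lesson
items2 = [
    {"bikes": 20, "pants": 30, "watches": 35, "shirts": 15, "shoes": 8, "suits": 45},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5, "shirts": 2, "shoes": 5, "suits": 7},
    {"bikes": 20, "pants": 30, "watches": 35, "glasses": 4, "shoes": 10},
]
store_items = pd.DataFrame(items2, index=["store 1", "store 2", "store 3"])

# Boolean DataFrame: True where a value is missing
nan_mask = store_items.isnull()

# First sum: number of NaN values in each column (a Series)
nan_per_column = nan_mask.sum()

# Summing along axis = 1 instead gives the number of NaN values in each row
nan_per_row = nan_mask.sum(axis=1)

# Second sum: total number of NaN values in the whole DataFrame
nan_total = nan_per_column.sum()

# .count() returns the non-NaN values per column; summing gives the total
non_nan_total = store_items.count().sum()

print("NaN per column:\n", nan_per_column)
print("NaN per row:\n", nan_per_row)
print("Total NaN:", nan_total, "| total non-NaN:", non_nan_total)
```

    The two totals always add up to store_items.size, the number of cells in the DataFrame, which makes for a quick sanity check.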

    Now that we have learned how to find out whether our dataset has any NaN values in it, the next step is to decide what to do with them. In general we have two options: we can either delete or replace the NaN values. In the following examples we will show you how to do both.

    We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The .dropna(axis) method eliminates any rows with NaN values when axis = 0 is used and eliminates any columns with NaN values when axis = 1 is used. Let's see some examples:

    # We drop any rows with NaN values
    store_items.dropna(axis = 0)
    
    # We drop any columns with NaN values
    store_items.dropna(axis = 1)
    

    Notice that the .dropna() method eliminates (drops) the rows or columns with NaN values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword inplace = True inside the .dropna() method.
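
    As a quick check of the out-of-place behaviour described above, the following sketch rebuilds the store_items DataFrame from this lesson and inspects what each call keeps. The names rows_kept and cols_kept are our own.

```python
import pandas as pd

# Same data as the store_items DataFrame created in this lesson
items2 = [
    {"bikes": 20, "pants": 30, "watches": 35, "shirts": 15, "shoes": 8, "suits": 45},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5, "shirts": 2, "shoes": 5, "suits": 7},
    {"bikes": 20, "pants": 30, "watches": 35, "glasses": 4, "shoes": 10},
]
store_items = pd.DataFrame(items2, index=["store 1", "store 2", "store 3"])

# Keep only the rows without any NaN: only store 2 is complete
rows_kept = store_items.dropna(axis=0)

# Keep only the columns without any NaN
cols_kept = store_items.dropna(axis=1)

print("Rows kept:", list(rows_kept.index))
print("Columns kept:", sorted(cols_kept.columns))

# The original DataFrame still has all 3 rows and 7 columns
print("Original shape:", store_items.shape)
```

    Assigning the result back, as in store_items = store_items.dropna(axis = 0), has the same visible effect as inplace = True and is often easier to read.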

    Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose, for example, to replace all NaN values with the value 0. We can do this by using the .fillna() method as shown below.

    # We replace all NaN values with 0
    store_items.fillna(0)
    

    We can also use the .fillna() method to replace NaN values with previous values in the DataFrame. This is known as forward filling. When replacing NaN values with forward filling, we can use previous values taken from columns or rows. Calling .fillna(method = 'ffill', axis) uses the forward filling (ffill) method to replace NaN values using the previous known value along the given axis. Let's see some examples:

    # We replace NaN values with the previous value in the column
    store_items.fillna(method = 'ffill', axis = 0)
    

    Notice that the two NaN values in store 3 have been replaced with previous values in their columns. However, notice that the NaN value in store 1 didn't get replaced. That's because there are no previous values in this column, since the NaN value is the first value in that column. However, if we do forward fill using the previous values in each row, this won't happen. Let's take a look:

    # We replace NaN values with the previous value in the row
    store_items.fillna(method = 'ffill', axis = 1)
    

    We see that in this case all the NaN values have been replaced with the previous values in their rows.

    Similarly, you can choose to replace the NaN values with the values that go after them in the DataFrame. This is known as backward filling. Calling .fillna(method = 'backfill', axis) uses the backward filling (backfill) method to replace NaN values using the next known value along the given axis. Just like with forward filling, we can choose to use row or column values. Let's see some examples:

    # We replace NaN values with the next value in the column
    store_items.fillna(method = 'backfill', axis = 0)
    

    Notice that the NaN value in store 1 has been replaced with the next value in its column. However, notice that the two NaN values in store 3 didn't get replaced. That's because there are no next values in these columns, since these NaN values are the last values in those columns. However, if we do backward fill using the next values in each row, this won't happen. Let's take a look:

    # We replace NaN values with the next value in the row
    store_items.fillna(method = 'backfill', axis = 1)
    

    Notice that the .fillna() method replaces (fills) the NaN values out of place. This means that the original DataFrame is not modified. You can always replace the NaN values in place by setting the keyword inplace = True inside the .fillna() method.
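
    A side note on the forward and backward filling shown above: newer Pandas releases deprecate the method keyword of .fillna() in favour of the dedicated .ffill() and .bfill() methods, which also exist in older versions. The sketch below rebuilds the store_items DataFrame from this lesson and fills along the columns with both. Chaining the two calls at the end is our own addition, shown as one possible way to cover NaN values at both ends of a column.

```python
import pandas as pd

# Same data as the store_items DataFrame created in this lesson
items2 = [
    {"bikes": 20, "pants": 30, "watches": 35, "shirts": 15, "shoes": 8, "suits": 45},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5, "shirts": 2, "shoes": 5, "suits": 7},
    {"bikes": 20, "pants": 30, "watches": 35, "glasses": 4, "shoes": 10},
]
store_items = pd.DataFrame(items2, index=["store 1", "store 2", "store 3"])

# Forward fill down each column: store 3 takes the values from store 2,
# but the leading NaN in 'glasses' for store 1 has nothing above it
filled_down = store_items.ffill(axis=0)

# Backward fill up each column: store 1 takes 'glasses' from store 2,
# but the trailing NaN values in store 3 have nothing below them
filled_up = store_items.bfill(axis=0)

# Forward fill first, then backward fill whatever is still missing
filled_both = store_items.ffill(axis=0).bfill(axis=0)

print(filled_down)
print(filled_up)
print(filled_both)
```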

    We can also choose to replace NaN values by using different interpolation methods. For example, the .interpolate(method = 'linear', axis) method will use linear interpolation to replace NaN values using the values along the given axis. Let's see some examples:

    # We replace NaN values by using linear interpolation using column values
    store_items.interpolate(method = 'linear', axis = 0)
    

    Notice that the two NaN values in store 3 have been replaced with linear interpolated values. However, notice that the NaN value in store 1 didn't get replaced. That's because the NaN value is the first value in that column, and since there is no data before it, the interpolation function can't calculate a value. Now, let's interpolate using row values instead:

    # We replace NaN values by using linear interpolation using row values
    store_items.interpolate(method = 'linear', axis = 1)
    

    Just as with the other methods we saw, the .interpolate() method replaces NaN values out of place.
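
    To see what linear interpolation actually computes, it can help to try it on a single Series where the arithmetic is easy to check by hand. The values below are made up for illustration. With method = 'linear', Pandas treats the values as equally spaced, so the two gaps between 10 and 40 are filled in equal steps of 10.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one leading NaN and a gap of two NaN values
values = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# Linear interpolation fills the gap between the known neighbours 10 and 40
filled = values.interpolate(method="linear")

print(filled)

# The leading NaN has no known value before it, so it is left as NaN
print("First value is still NaN:", np.isnan(filled.iloc[0]))
```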

    Exercise

    import pandas as pd
    import numpy as np
    
    # Since we will be working with ratings, we will set the precision of our 
    # dataframes to one decimal place.
    pd.set_option('display.precision', 1)
    
    # Create a Pandas DataFrame that contains the ratings some users have given to a
    # series of books. The ratings given are in the range from 1 to 5, with 5 being
    # the best score. The names of the books, the authors, and the ratings of each user
    # are given below:
    
    books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
    authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])
    
    user_1 = pd.Series(data = [3.2, np.nan ,2.5])
    user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
    user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
    user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])
    
    # An np.nan value means that the user has not yet rated that book.
    # Use the data above to create a Pandas DataFrame that has the following column
    # labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
    # automatically assign numerical row indices to the DataFrame. 
    
    # Create a dictionary with the data given above
    dat = 
    
    # Use the dictionary to create a Pandas DataFrame
    book_ratings = 
    
    # If you created the dictionary correctly you should have a Pandas DataFrame
    # that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
    # 'User 4' and row indices 0 through 4.
    
    # Now replace all the NaN values in your DataFrame with the average rating in
    # each column. Replace the NaN values in place. HINT: you can use the fillna()
    # function with the keyword inplace = True, to do this. Write your code below:
    
    

     
     
     
     
     
     
    Solution

    import pandas as pd
    import numpy as np
    
    pd.set_option('display.precision', 1)
    
    books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
    authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', ' H. G. Wells', 'Lewis Carroll' ])
    user_1 = pd.Series(data = [3.2, np.nan ,2.5])
    user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
    user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
    user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])
    
    dat = {'Book Title' : books,
           'Author' : authors,
           'User 1' : user_1,
           'User 2' : user_2,
           'User 3' : user_3,
           'User 4' : user_4}
    
    book_ratings = pd.DataFrame(dat)
    
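    # numeric_only = True restricts the mean to the rating columns (the text columns have no mean)
    book_ratings.fillna(book_ratings.mean(numeric_only = True), inplace = True)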
    

    From the DataFrame above you can now pick all the books that had a rating of 5. You can do this in just one line of code. Try to do it yourself first; you'll find the answer below:

    best_rated = book_ratings[(book_ratings == 5).any(axis = 1)]['Book Title'].values

    The code above returns a NumPy ndarray that only contains the names of the books that had a rating of 5.
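
    The one-liner packs several steps together. The sketch below unpacks them on a small made-up ratings table (the titles and ratings are hypothetical). It is a slight variant of the line above: it compares only the rating columns and selects the titles with .loc, which avoids comparing text columns against a number.

```python
import pandas as pd

# Hypothetical ratings table, for illustration only
ratings = pd.DataFrame({
    "Book Title": ["Book A", "Book B", "Book C"],
    "User 1": [5.0, 3.0, 4.0],
    "User 2": [2.0, 5.0, 4.5],
})

# Step 1: keep only the rating columns
rating_cols = ratings.drop(columns=["Book Title"])

# Step 2: Boolean DataFrame, True wherever a rating equals 5
is_five = rating_cols == 5

# Step 3: collapse each row to a single True/False (any 5 in that row?)
row_has_five = is_five.any(axis=1)

# Step 4: use the Boolean Series as a row filter and pull out the titles
best_rated = ratings.loc[row_has_five, "Book Title"].values

print(best_rated)
```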

    Loading Data into a Pandas DataFrame

    In machine learning you will most likely use datasets from many sources to train your learning algorithms. Pandas allows us to load datasets of different formats into DataFrames. One of the most popular formats used to store data is CSV. CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the pd.read_csv() function. Let's load Google stock data into a Pandas DataFrame. The GOOG.csv file contains Google stock data from 8/19/2004 to 10/13/2017 taken from Yahoo Finance.

    # We load Google stock data in a DataFrame
    Google_stock = pd.read_csv('./GOOG.csv')
    
    # We print some information about Google_stock
    print('Google_stock is of type:', type(Google_stock))
    print('Google_stock has shape:', Google_stock.shape)
    

    We see that we have loaded the GOOG.csv file into a Pandas DataFrame and it consists of 3,313 rows and 7 columns. Now let's look at the stock data:

    Google_stock
    

    We can also take a look at the last 5 rows of data by using the .tail() method:

    Google_stock.tail()
    

    We can also optionally use .head(N) or .tail(N) to display the first and last N rows of data, respectively.
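
    If you don't have GOOG.csv at hand, the loading steps can still be tried on CSV text held in memory. The sketch below is a stand-in, not the real file: the four rows are made-up numbers, and only the column layout (Date, Open, High, Low, Close, Adj Close, Volume) is assumed to match a Yahoo Finance export. The parse_dates argument is optional and asks Pandas to convert the Date column to datetime values.

```python
import io

import pandas as pd

# Made-up rows in a Yahoo Finance style layout (NOT real Google prices)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2017-10-10,100.0,102.0,99.5,101.0,101.0,1200
2017-10-11,101.0,103.5,100.5,103.0,103.0,1500
2017-10-12,103.0,104.0,101.0,102.0,102.0,1100
2017-10-13,102.0,105.0,101.5,104.5,104.5,1800
"""

# pd.read_csv accepts a file path or any file-like object such as StringIO
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print("Shape:", stock.shape)

# First 2 rows and last row
print(stock.head(2))
print(stock.tail(1))
```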

    Let's do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the .isnull() method followed by the .any() method to check whether any of the columns contain NaN values.

    Google_stock.isnull().any()
    

    We see that we have no NaN values.

    When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the .describe() method to get descriptive statistics on each column of the DataFrame. Let's see how this works:

    # We get descriptive statistics on our stock data
    Google_stock.describe()
    

    If desired, we can apply the .describe() method on a single column as shown below:

    # We get descriptive statistics on a single column of our DataFrame
    Google_stock['Adj Close'].describe()
    

    Similarly, you can also look at one statistic by using one of the many statistical functions Pandas provides. Let's look at some examples:

    # We print information about our DataFrame  
    print()
    print('Maximum values of each column:\n', Google_stock.max())
    print()
    print('Minimum Close value:', Google_stock['Close'].min())
    print()
    print('Average value of each column:\n', Google_stock.mean(numeric_only = True))
    

    Another important statistical measure is data correlation. Data correlation can tell us, for example, whether the values in one column tend to rise or fall together with the values in another column. We can use the .corr() method to get the correlation between different columns, as shown below:

    # We display the correlation between columns
    Google_stock.corr()
    

    A correlation value of 1 tells us there is a perfect positive linear correlation, a value of -1 tells us there is a perfect negative linear correlation, and a value of 0 tells us that the data is not linearly correlated at all.
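
    The extreme correlation values are easy to reproduce on a tiny made-up table. In the sketch below (all numbers and column names are hypothetical), Double is exactly twice Close and Falling decreases as Close increases. Selecting the numeric columns first with .select_dtypes() is our own precaution: recent Pandas releases raise an error when .corr() or .mean() meets a text column such as Date.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Date": ["2017-10-10", "2017-10-11", "2017-10-12", "2017-10-13"],
    "Close": [10.0, 11.0, 12.0, 13.0],
    "Double": [20.0, 22.0, 24.0, 26.0],   # rises exactly with Close
    "Falling": [13.0, 12.0, 11.0, 10.0],  # falls exactly as Close rises
})

# Keep only the numeric columns before computing statistics
numeric = df.select_dtypes(include="number")

# Pairwise Pearson correlation between the numeric columns
corr = numeric.corr()

print(numeric.describe())
print(corr)
```

    Here corr.loc['Close', 'Double'] comes out as 1 and corr.loc['Close', 'Falling'] as -1, up to floating point rounding.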

    We will end this Introduction to Pandas by taking a look at the .groupby() method. The .groupby() method allows us to group data in different ways. Let's see how we can group data to get different types of information. For the next examples we are going to load fake data about a fictitious company.

    # We load fake Company data in a DataFrame
    data = pd.read_csv('./fake_company.csv')
    
    data
    

    We see that the data contains information for the years 1990 through 1992. For each year we see the names of the employees, the department they work for, their age, and their annual salary. Now, let's use the .groupby() method to get information.

    Let's calculate how much money the company spent in salaries each year. To do this, we will group the data by Year using the .groupby() method and then we will add up the salaries of all the employees by using the .sum() method.

    # We display the total amount of money spent in salaries each year
    data.groupby(['Year'])['Salary'].sum()
    

    We see that the company spent a total of 153,000 dollars in 1990, 162,000 in 1991, and 174,000 in 1992.

    Now, let's suppose we want to know the average salary for each year. In this case, we will group the data by Year using the .groupby() method, just as we did before, and then we use the .mean() method to get the average salary. Let's see how this works:

    # We display the average salary per year
    data.groupby(['Year'])['Salary'].mean()
    

    We see that the average salary in 1990 was 51,000 dollars, 54,000 in 1991, and 58,000 in 1992.

    Now let's see how much each employee got paid in those three years. In this case, we will group the data by Name using the .groupby() method and then we will add up each employee's salaries across the years. Let's see the result:

    # We display the total salary each employee received in all the years they worked for the company
    data.groupby(['Name'])['Salary'].sum()
    

    We see that Alice received a total of 162,000 dollars in the three years she worked for the company, Bob received 150,000, and Charlie received 177,000.

    Now let's see what the salary distribution per department per year was. In this case we will group the data by Year and by Department using the .groupby() method and then we will add up the salaries for each department. Let's see the result:

    # We display the salary distribution per department per year.
    data.groupby(['Year', 'Department'])['Salary'].sum()
    

    We see that in 1990 the Admin department paid 55,000 dollars in salaries, the HR department paid 50,000, and the RD department paid 48,000, while in 1992 the Admin department paid 122,000 dollars in salaries and the RD department paid 52,000.
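
    Since fake_company.csv is not included here, the groupings above can be tried on a DataFrame built by hand. The rows below are a hypothetical reconstruction, chosen only so that the totals match the figures quoted in this lesson; the real file may differ (in particular the 1991 departments are a guess), and the Age column is left out.

```python
import pandas as pd

# Hypothetical stand-in for fake_company.csv (Age column omitted)
data = pd.DataFrame({
    "Year": [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
    "Name": ["Alice", "Bob", "Charlie"] * 3,
    "Department": ["HR", "RD", "Admin", "HR", "RD", "Admin", "Admin", "RD", "Admin"],
    "Salary": [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000],
})

# Total and average salary per year
per_year = data.groupby("Year")["Salary"].sum()
avg_per_year = data.groupby("Year")["Salary"].mean()

# Total salary per employee over all years
per_employee = data.groupby("Name")["Salary"].sum()

# Total salary per department per year; the result has a two-level index
per_dept = data.groupby(["Year", "Department"])["Salary"].sum()

print(per_year)
print(avg_per_year)
print(per_employee)
print(per_dept)

# A single entry of the two-level result is selected with a (Year, Department) tuple
print("Admin salaries in 1992:", per_dept[(1992, "Admin")])
```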

    Supporting Materials

    GOOG.csv
