numpy & pandas
numpy
creating arrays
The core data structure in NumPy is the ndarray object, which stands for N-dimensional array.
An array is a collection of values, similar to a list.
N-dimensional refers to the number of indices needed to select individual values from the object.
A 1-dimensional array is often referred to as a vector while a 2-dimensional array is often referred to as a matrix.
We can directly construct arrays from lists using the numpy.array() function. To construct a vector, we need to pass in a single list (with no nesting):
import numpy as np
vector = np.array([5, 10, 15, 20])
The numpy.array() function also accepts a list of lists, which we use to create a matrix:
# a single list creates a vector
vector = np.array([1, 2, 3])
# a list of lists creates a matrix; note the outer pair of brackets
matrix = np.array([[1, 2, 3], [2, 3, 4]])
array shape
It's often useful to know how many elements an array contains. We can use the ndarray.shape property to find out how many elements the array has along each dimension.
- For vectors, the shape property contains a tuple with 1 element.
- For matrices, the shape property contains a tuple with 2 elements.
import numpy as np
vector = np.array([1, 2, 3, 4])
print(vector.shape)
# output: (4,)
matrix = np.array([[5, 10, 15], [20, 25, 30]])
print(matrix.shape)
# output: (2, 3)
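As a small sketch of how shape relates to the number of dimensions and the total element count, the example below reuses the matrix from above. The ndim and size attributes are not covered in these notes but are standard ndarray attributes; the variable names are chosen only for illustration.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30]])

n_dims = matrix.ndim           # number of indices needed to select one value
n_rows, n_cols = matrix.shape  # one entry per dimension
n_elements = matrix.size       # total element count, equal to n_rows * n_cols

print(n_dims, (n_rows, n_cols), n_elements)
```

For a matrix, the total element count is the product of the entries in shape.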
reading numpy & pandas
numpy
Instructions
- When reading in world_alcohol.csv using numpy.genfromtxt():
- Use the "U75" data type
- Skip the first line in the dataset
- Use the comma delimiter.
- Assign the result to world_alcohol.
- Use the print() function to display world_alcohol.
import numpy as np
world_alcohol = np.genfromtxt('world_alcohol.csv', dtype='U75', delimiter=',', skip_header=1)
print(world_alcohol)
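To try the same call without the real file, the sketch below feeds numpy.genfromtxt() a small in-memory CSV. The header names and the two data rows are made up for illustration and are not taken from world_alcohol.csv; io.StringIO simply stands in for the file.

```python
import io
import numpy as np

# Made-up sample standing in for world_alcohol.csv (values are illustrative only).
csv_text = (
    "Year,Region,Country,Beverage,Value\n"
    "1986,Africa,Algeria,Wine,0.5\n"
    "1989,Americas,Canada,Beer,1.2\n"
)

sample = np.genfromtxt(io.StringIO(csv_text), dtype="U75", delimiter=",", skip_header=1)

print(sample.shape)  # the header line is skipped, so only the data rows remain
print(sample.dtype)  # every value is stored as a string of up to 75 characters
print(sample)
```

Because every value is read as a string, numeric columns have to be converted before any arithmetic.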
pandas
To read a CSV file into a dataframe, we use the pandas.read_csv() function and pass in the file name as a string.
Instructions
- Import the pandas library.
- Use the pandas.read_csv() function to read the file "food_info.csv" into a dataframe named food_info.
- Use the type() and print() functions to display the type of food_info to confirm that it's a dataframe object.
import pandas
food_info = pandas.read_csv('food_info.csv')
print(type(food_info))
#output <class 'pandas.core.frame.DataFrame'>
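pandas.read_csv() also accepts a file-like object, so the call can be tried without food_info.csv on disk. The sketch below assumes a one-row CSV built from the butter example that appears later in these notes; sample_food is a hypothetical name.

```python
import io
import pandas as pd

# One-row stand-in for food_info.csv, using the butter row shown in these notes.
csv_text = "NDB_No,Shrt_Desc,Water_(g)\n1001,BUTTER WITH SALT,15.87\n"

sample_food = pd.read_csv(io.StringIO(csv_text))

print(type(sample_food))
print(sample_food.columns.tolist())  # column labels come from the header line
print(sample_food.index.tolist())    # row labels default to 0, 1, 2, ...
```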
indexing numpy & pandas
index numpy arrays
Here's how we would index a NumPy vector:
vector = np.array([5, 10, 15, 20])
print(vector[0])
# output 5
Matrix:
The first index specifies which row the data comes from, and the second index specifies which column the data comes from.
import numpy as np
matrix = np.array([
[5, 10, 15],
[20, 25, 30]
])
print(matrix[1,2])
#output 30
Instructions
- Assign the country in the third row to third_country. Country is the third column.
import numpy as np
matrix = world_alcohol
third_country = matrix[2,2]
indexing pandas
The Series object is a core data structure that pandas uses to represent rows and columns. A Series is a labelled collection of values similar to the NumPy vector.
The main advantage of Series objects is the ability to utilize non-integer labels. NumPy arrays can only utilize integer labels for indexing.
When you read a file into a dataframe, pandas uses the values in the first row (also known as the header) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index. Dataframes contain both a row index and a column index.
\ | column label (column index) | ... |
---|---|---|
row label 0 (row index) | | |
row label 1 | | |
... | | |
When you select a row from a dataframe, instead of just returning the values in that row as a list, pandas returns a Series object that contains the column labels as well as the corresponding values:
NDB_No | Shrt_Desc | Water_(g) | ... |
---|---|---|---|
1001 | BUTTER WITH SALT | 15.87 | ... |
The Series object representing the first row looks like:
NDB_No       1001
Shrt_Desc    BUTTER WITH SALT
Water_(g)    15.87
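The following sketch rebuilds that single row as a small dataframe and selects it, to show that the result is a Series labelled by the column names. The name food_sample is hypothetical, and the loc[] selection used here is the one introduced later in these notes.

```python
import pandas as pd

# Hypothetical one-row dataframe mirroring the table above.
food_sample = pd.DataFrame({
    "NDB_No": [1001],
    "Shrt_Desc": ["BUTTER WITH SALT"],
    "Water_(g)": [15.87],
})

first_row = food_sample.loc[0]   # select the row whose label is 0

print(type(first_row))           # a Series, not a list
print(first_row.index.tolist())  # the Series is labelled by the column names
print(first_row["Shrt_Desc"])    # values can be looked up by those labels
```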
slicing numpy
numpy slicing arrays
vector
We can select a subset of a vector with a slice, just as we would with a list:
vector = np.array([5, 10, 15, 20])
vector[0:3]
# output: [5, 10, 15]
# the slice runs from the first index up to, but not including, the second index
matrix
matrix = np.array ([
[5,10,15],
[20,25,30],
[35,40,45]
])
matrix[:, 1]
# output: [10, 25, 40]
- This selects all of the rows, but only the column with index 1.
- The colon (:) specifies that the entirety of a single dimension should be selected.
Instructions
- Assign the whole third column from world_alcohol to the variable countries.
countries = world_alcohol[:,2]
slicing one dimension
matrix = np.array([
[5,10,15],
[20,25,30],
[35,40,45]
])
matrix [:,0:2]
#output ([[5,10],
# [20,25],
# [35,40]])
matrix [1:3,1]
#output [25,40]
Instructions
- Assign all the rows and the first 2 columns of world_alcohol to first_two_columns.
- Assign the first 10 rows and the first column of world_alcohol to first_ten_years.
- Assign the first 10 rows and all of the columns of world_alcohol to first_ten_rows.
first_two_columns = world_alcohol[:,0:2]
first_ten_years = world_alcohol[0:10,0]
first_ten_rows = world_alcohol[0:10,:]
slicing arrays (both dimensions)
matrix = np.array([
[5,10,15],
[20,25,30],
[35,40,45]
])
matrix [1:3,0:2]
#output
# [[20,25],
# [35,40]]
Instructions
- Assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to first_twenty_regions.
first_twenty_regions = world_alcohol[0:20,1:3]
pandas
selecting a row (pandas)
We use bracket notation to access elements in a NumPy array or a standard list.
To select rows in a dataframe, we need to use the pandas loc[] instead.
The loc[] method allows you to select rows by row labels. Recall that when you read a file into a dataframe, pandas uses the row number (or position) as each row's label. Pandas uses zero-indexing, so the first row is at index 0, the second row at index 1, and so on.
# Series object representing the row at index 0.
food_info.loc[0]
# Series object representing the seventh row.
food_info.loc[6]
selecting multiple rows
Pass in either a slice of row labels or a list of row labels, and pandas will return a dataframe.
Note that unlike slicing lists in Python, a slice of a dataframe using .loc[] will include both the start and the end row.
# DataFrame containing the rows at index 3, 4, 5, and 6 returned.
food_info.loc[3:6]
# DataFrame containing the rows at index 2, 5, and 10 returned. Either of the following work.
# Method 1
two_five_ten = [2,5,10]
food_info.loc[two_five_ten]
# Method 2
food_info.loc[[2,5,10]]
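To make the inclusive end point concrete, here is a minimal sketch comparing a list slice with a loc[] slice on a small hypothetical dataframe. It also shows iloc[], which is not covered in these notes; it selects by position and excludes the end, like a list.

```python
import pandas as pd

values = list(range(10))
frame = pd.DataFrame({"value": values})  # default row labels are 0 through 9

list_slice = values[3:6]      # end position excluded
loc_slice = frame.loc[3:6]    # label-based: end label included
iloc_slice = frame.iloc[3:6]  # position-based: end excluded, like a list

print(list_slice)
print(loc_slice["value"].tolist())
print(iloc_slice["value"].tolist())
```

The same slice bounds return three items from the list and from iloc[], but four rows from loc[].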
Instructions
- Select the last 5 rows of food_info and assign to the variable last_rows.
num_rows = food_info.shape[0]
last_rows = food_info.loc[num_rows-5:num_rows-1]
print(last_rows)
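As an aside, pandas also provides DataFrame.tail(), which is not part of this exercise but returns the last rows directly. The sketch below uses a hypothetical 8-row dataframe to check that both approaches select the same rows when the row labels are the default 0, 1, 2, and so on.

```python
import pandas as pd

frame = pd.DataFrame({"value": range(100, 108)})  # hypothetical 8-row dataframe

num_rows = frame.shape[0]
last_by_label = frame.loc[num_rows - 5:num_rows - 1]  # approach used in these notes
last_by_tail = frame.tail(5)                          # built-in shortcut

print(last_by_tail)
```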
selecting a column
To access a single column, use bracket notation and pass in the column name as a string:
# Series object representing the "NDB_No" column.
ndb_col = food_info["NDB_No"]
# You can instead access a column by passing in a string variable.
col_name = "NDB_No"
ndb_col = food_info[col_name]
selecting multiple columns by name
To select multiple columns, pass in a list of strings representing the column names, and pandas will return a dataframe containing only the values in those columns.
When selecting multiple columns, the order of the columns in the returned dataframe matches the order of the column names in the list of strings that you passed in. This allows you to easily explore specific columns that may not be positioned next to each other in the dataframe.
columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
# Skipping the assignment.
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
Computation With NumPy
Learn how to select elements in arrays and perform computations with NumPy.
array comparisons
One of the most powerful aspects of the NumPy module is the ability to make comparisons across an entire array. These comparisons result in Boolean values.
import numpy
vector = numpy.array([5, 10, 15, 20])
vector == 10
# output: a Boolean vector
# [False, True, False, False]
matrix = numpy.array([
[5, 10, 15],
[20, 25, 30],
[35, 40, 45]
])
matrix == 25
#output a bool matrix
#[
# [False, False, False],
# [False, True, False],
# [False, False, False]
# ]
Instructions
The variable world_alcohol already contains the data set we're working with.
- Extract the third column in world_alcohol, and compare it to the string Canada. Assign the result to countries_canada.
- Extract the first column in world_alcohol, and compare it to the string 1984. Assign the result to years_1984.
countries_canada= (world_alcohol[:,2]=='Canada')
years_1984 = (world_alcohol[:,0]=='1984')
selecting elements
Comparisons give us the power to select elements in arrays using Boolean vectors. This allows us to conditionally select certain elements in vectors, or certain rows in matrices.
vector = numpy.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)
print(vector[equal_to_ten])
# output: [10]
matrix = numpy.array([
[5, 10, 15],
[20, 25, 30],
[35, 40, 45]
])
second_column_25 = (matrix[:,1] == 25)
matrix[second_column_25,:]
#output [[20 ,25,30]]
Instructions
- Compare the third column of world_alcohol to the string Algeria.
- Assign the result to country_is_algeria.
- Select only the rows in world_alcohol where country_is_algeria is True.
- Assign the result to country_algeria.
country_is_algeria =(world_alcohol[:,2]=='Algeria')
country_algeria = world_alcohol[country_is_algeria,:]
Comparisons with Multiple Conditions
We can join multiple conditions with an ampersand (&), which requires both conditions to be True. It's critical to put each condition in parentheses.
We can use the pipe symbol (|) to specify that either one condition or the other should be True.
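A minimal sketch of both operators on a small vector follows; the values and variable names are chosen only for illustration. The parentheses matter because & and | bind more tightly than comparison operators such as == and >, so without them Python groups the expression differently.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

both = (vector > 5) & (vector < 20)      # True where both conditions hold
either = (vector == 5) | (vector == 20)  # True where at least one condition holds

print(both)
print(either)
print(vector[both])  # the Boolean vector can then be used to select elements
```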
Instructions
- Perform a comparison with multiple conditions, and join the conditions with &.
- Compare the first column of world_alcohol to the string 1986.
- Compare the third column of world_alcohol to the string Algeria.
- Enclose each condition in parentheses, and join the conditions with &.
- Assign the result to is_algeria_and_1986.
- Use is_algeria_and_1986 to select rows from world_alcohol.
- Assign the rows that is_algeria_and_1986 selects to rows_with_algeria_and_1986.
is_algeria_and_1986 = (world_alcohol[:,0] == "1986") & (world_alcohol[:,2] == "Algeria")
rows_with_algeria_and_1986 = world_alcohol[is_algeria_and_1986,:]
Replacing Values
use comparisons to replace values
vector = numpy.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)
#output [50,50,15,20]
matrix = numpy.array([
[5, 10, 15],
[20, 25, 30],
[35, 40, 45]
])
second_column_25 = matrix[:,1] == 25
matrix[second_column_25, 1] = 10
print(matrix)
#output
#[
# [5, 10, 15],
# [20, 10, 30],
# [35, 40, 45]
# ]
Instructions
- Replace all instances of the string 1986 in the first column of world_alcohol with the string 2014.
- Replace all instances of the string Wine in the fourth column of world_alcohol with the string Grog.
world_alcohol[(world_alcohol[:,0]=='1986'),0]='2014'
#world_alcohol[(world_alcohol[:,3]=='Wine'),3] ='Grog'
world_alcohol[:,3][world_alcohol[:,3] == 'Wine'] = 'Grog'
Replacing Empty Strings
Instructions
- Compare all the items in the fifth column of world_alcohol with an empty string ''. Assign the result to is_value_empty.
- Select all the values in the fifth column of world_alcohol where is_value_empty is True, and replace them with the string 0.
is_value_empty =world_alcohol[:,4] == ''
world_alcohol[is_value_empty,4] = '0'
#world_alcohol[:,4][world_alcohol[:,4]==''] = '0'
Converting Data Types
We can convert the data type of an array with the astype() method.
Instructions
- Extract the fifth column from world_alcohol, and assign it to the variable alcohol_consumption.
- Use the astype() method to convert alcohol_consumption to the float data type.
alcohol_consumption= world_alcohol[:,4]
alcohol_consumption = alcohol_consumption.astype(float)
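The sketch below shows the same conversion on a tiny made-up string array, including the empty-string replacement from the previous section, since an empty string cannot be parsed as a float. The values are illustrative only.

```python
import numpy as np

# Made-up values; "" stands for a missing entry.
string_values = np.array(["1.5", "", "2.25"])

string_values[string_values == ""] = "0"    # replace empty strings before converting
float_values = string_values.astype(float)  # astype() returns a new array

print(string_values.dtype)  # still a string array
print(float_values.dtype, float_values)
```

Note that astype() does not modify the array in place, which is why the result is assigned back to a variable.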
Computing with NumPy
vector = numpy.array([5, 10, 15, 20])
vector.sum()
# output 50
matrix = numpy.array([
[5, 10, 15],
[20, 25, 30],
[35, 40, 45]
])
matrix.sum(axis=1) # axis=1 sums across each row; axis=0 sums down each column
# output [30, 75, 120]
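To compare the two axis values side by side, here is a short sketch using the same matrix; the variable names are illustrative.

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

row_sums = matrix.sum(axis=1)     # one total per row
column_sums = matrix.sum(axis=0)  # one total per column
grand_total = matrix.sum()        # no axis: sum of every element

print(row_sums)     # [ 30  75 120]
print(column_sums)  # [60 75 90]
print(grand_total)  # 225
```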
Instructions
- Use the sum() method to calculate the sum of the values in alcohol_consumption. Assign the result to total_alcohol.
- Use the mean() method to calculate the average of the values in alcohol_consumption. Assign the result to average_alcohol.
total_alcohol = alcohol_consumption.sum()
average_alcohol =alcohol_consumption.mean()
Total Annual Alcohol Consumption
Instructions
- Create a matrix called canada_1986 that only contains the rows in world_alcohol where the first column is the string 1986 and the third column is the string Canada.
- Extract the fifth column of canada_1986, replace any empty strings ('') with the string 0, and convert the column to the float data type. Assign the result to canada_alcohol.
- Compute the sum of canada_alcohol. Assign the result to total_canadian_drinking.
is_canada_1986 = (world_alcohol[:,2] == "Canada") & (world_alcohol[:,0] == '1986')
canada_1986 = world_alcohol[is_canada_1986,:]
canada_alcohol = canada_1986[:,4]
empty_strings = canada_alcohol == ''
canada_alcohol[empty_strings] = "0"
canada_alcohol = canada_alcohol.astype(float)
total_canadian_drinking = canada_alcohol.sum()
Calculating Consumption for Each Country
- Create an empty dictionary called totals.
- Select only the rows in world_alcohol that match a given year. Assign the result to year.
- Loop through a list of countries. For each country:
  - Select only the rows from year that match the given country.
  - Assign the result to country_consumption.
  - Extract the fifth column from country_consumption.
  - Replace any empty string values in the column with the string 0.
  - Convert the column to the float data type.
  - Find the sum of the column.
  - Add the sum to the totals dictionary, with the country name as the key.
- After the code executes, you'll have a dictionary containing all of the country names as keys, with the associated alcohol consumption totals as the values.
Instructions
- We've assigned the list of all countries to the variable countries.
- Find the total consumption for each country in countries for the year 1989.
- Refer to the steps outlined above for help.
- When you're finished, totals should contain all of the country names as keys, with the corresponding alcohol consumption totals for 1989 as values.
totals = {}
is_year = world_alcohol[:,0] == "1989"
year = world_alcohol[is_year,:]
for country in countries:
    is_country = year[:,2] == country
    country_consumption = year[is_country,:]
    alcohol_column = country_consumption[:,4]
    is_empty = alcohol_column == ''
    alcohol_column[is_empty] = "0"
    alcohol_column = alcohol_column.astype(float)
    totals[country] = alcohol_column.sum()
Finding the Country that Drinks the Most
Now that we've computed total alcohol consumption for each country in 1989, we can loop through the totals dictionary to find the country with the highest value.
The process we've outlined below will help you find the key with the highest value in a dictionary:
- Create a variable called highest_value that will keep track of the highest value. Set its value to 0.
- Create a variable called highest_key that will keep track of the key associated with the highest value. Set its value to None.
- Loop through each key in the dictionary.
  - If the value associated with the key is greater than highest_value, assign the value to highest_value, and assign the key to highest_key.
After the code runs, highest_key will be the key associated with the highest value in the dictionary.
Instructions
- Find the country with the highest total alcohol consumption.
- To do this, you'll need to find the key associated with the highest value in the totals dictionary.
- Follow the process outlined above to find the highest value in totals.
- When you're finished, highest_value will contain the highest total alcohol consumption, and highest_key will contain the country that had the highest total alcohol consumption in 1989.
highest_value = 0
highest_key = None
for country in totals:
    if totals[country] > highest_value:
        highest_value = totals[country]
        highest_key = country
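For comparison, the sketch below runs the loop on a small hypothetical totals dictionary (the numbers are made up, not results from the dataset) and then finds the same key with Python's built-in max(), which is not part of the exercise.

```python
# Hypothetical totals for illustration only; not computed from world_alcohol.
totals = {"Algeria": 0.5, "Canada": 8.1, "Chile": 3.2}

# Loop-based approach from these notes.
highest_value = 0
highest_key = None
for country in totals:
    if totals[country] > highest_value:
        highest_value = totals[country]
        highest_key = country

# Built-in alternative: compare the keys by their associated values.
max_key = max(totals, key=totals.get)

print(highest_key, highest_value)
print(max_key)
```

The loop version starts from 0, so it assumes at least one total is positive; max() has no such assumption but raises an error on an empty dictionary.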
NumPy Strengths and Weaknesses
You should now have a good foundation in NumPy, and in handling issues with your data. NumPy is much easier to work with than lists of lists, because:
- It's easy to perform computations on data.
- Data indexing and slicing is faster and easier.
- We can convert data types quickly.
Overall, NumPy makes working with data in Python much more efficient. It's widely used for this reason, especially for machine learning.
You may have noticed some limitations with NumPy as you worked through the past two missions, though. For example:
- All of the items in an array must have the same data type. For many datasets, this can make arrays cumbersome to work with.
- Columns and rows must be referred to by number, which gets confusing when you go back and forth from column name to column number.
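The single-data-type limitation can be seen in a small sketch: when a list mixes numbers and strings, NumPy stores every item as a string. The values below are made up for illustration.

```python
import numpy as np

# A list mixing an integer, a string, and a float.
mixed = np.array([1986, "Canada", 2.5])

print(mixed.dtype)  # a Unicode string type
print(mixed)        # the numbers have been converted to strings
```

This is why world_alcohol was read entirely as strings and the consumption column had to be converted with astype() before summing.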
In the next few missions, we'll learn about the Pandas library, one of the most popular data analysis libraries. Pandas builds on NumPy, but does a better job addressing the limitations of NumPy.