numpy & pandas

Author: 钊钖 | Published: 2017-12-29 20:08

numpy

creating arrays

The core data structure in NumPy is the ndarray object, which stands for N-dimensional array.
An array is a collection of values, similar to a list.

N-dimensional refers to the number of indices needed to select individual values from the object.

A 1-dimensional array is often referred to as a vector while a 2-dimensional array is often referred to as a matrix.

We can directly construct arrays from lists using the numpy.array() function. To construct a vector, we need to pass in a single list (with no nesting):

import numpy as np
vector = np.array([5, 10, 15, 20])

The numpy.array() function also accepts a list of lists, which we use to create a matrix:

# a single list
vector = np.array([1, 2, 3])

# a list of lists (note the outer pair of brackets)
matrix = np.array([[1, 2, 3], [2, 3, 4]])

array shape

It's often useful to know how many elements an array contains. We can use the ndarray.shape property to find out how many elements the array has along each dimension.

For vectors, the shape property contains a tuple with 1 element.
For matrices, the shape property contains a tuple with 2 elements.

import numpy as np
vector = np.array([1, 2, 3, 4])
print(vector.shape)
# output: (4,)
matrix = np.array([[5, 10, 15], [20, 25, 30]])
print(matrix.shape)
# output: (2, 3)
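
As a small sketch of how shape relates to the number of dimensions and the total element count, the snippet below also prints ndim and size, two other ndarray attributes. The arrays are the same illustrative ones used above.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])
matrix = np.array([[5, 10, 15], [20, 25, 30]])

# shape: length along each dimension
# ndim:  number of dimensions (indices needed to pick one value)
# size:  total number of elements
print(vector.shape, vector.ndim, vector.size)
print(matrix.shape, matrix.ndim, matrix.size)
```

For the matrix, size is simply the product of the entries in shape (2 rows times 3 columns).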

reading numpy & pandas

numpy

Instructions

When reading in world_alcohol.csv using numpy.genfromtxt():
- Use the "U75" data type
- Skip the first line in the dataset
- Use the comma delimiter.
Assign the result to world_alcohol.
Use the print() function to display world_alcohol.

import numpy as np
world_alcohol = np.genfromtxt('world_alcohol.csv', dtype='U75', delimiter=',', skip_header=1)
print(world_alcohol)
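
To try the same call without the file on disk, here is a minimal sketch that feeds genfromtxt an in-memory text buffer. The header and the three rows are made up for illustration and only imitate the layout of world_alcohol.csv; the name world_alcohol_sample is likewise hypothetical.

```python
import io
import numpy as np

# Hypothetical stand-in for world_alcohol.csv (rows invented for illustration).
sample_csv = io.StringIO(
    "Year,WHO region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
    "1985,Africa,Algeria,Beer,0.2\n"
)

# Same arguments as above: every value is read as a string of up to
# 75 characters, fields are split on commas, and the header is skipped.
world_alcohol_sample = np.genfromtxt(
    sample_csv, dtype="U75", delimiter=",", skip_header=1
)
print(world_alcohol_sample.shape)
print(world_alcohol_sample[0])
```

Because every value is stored as a string, numeric columns such as the year have to be compared against strings like "1986" until they are converted.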

pandas

To read a CSV file into a dataframe, we use the pandas.read_csv() function and pass in the file name as a string.

Instructions

Import the pandas library.
Use the pandas.read_csv() function to read the file "food_info.csv" into a dataframe named food_info.
Use the type() and print() functions to display the type of food_info to confirm that it's a dataframe object.

import pandas
food_info = pandas.read_csv('food_info.csv')
print(type(food_info))
#output <class 'pandas.core.frame.DataFrame'>
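
The same idea can be sketched for pandas without the real file: read_csv also accepts a file-like object. The two rows below are illustrative placeholders that mimic the first columns of food_info.csv, and food_sample is a hypothetical name.

```python
import io
import pandas as pd

# Illustrative stand-in for food_info.csv (only three columns, two rows).
sample_csv = io.StringIO(
    "NDB_No,Shrt_Desc,Water_(g)\n"
    "1001,BUTTER WITH SALT,15.87\n"
    "1002,BUTTER WHIPPED WITH SALT,15.87\n"
)

food_sample = pd.read_csv(sample_csv)
print(type(food_sample))
print(food_sample.shape)          # (rows, columns)
print(list(food_sample.columns))  # column labels taken from the header row
```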

indexing numpy & pandas

index numpy arrays

Here's how we would index a NumPy vector:

vector = np.array([5, 10, 15, 20])
print(vector[0])
# output 5

Matrix:
The first index specifies which row the data comes from, and the second index specifies which column the data comes from.

import numpy as np

matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30]
                ])

print(matrix[1,2])
#output 30

Instructions

Assign the country in the third row to third_country. Country is the third column.

import numpy as np
matrix = world_alcohol
third_country = matrix[2,2]

indexing pandas

The Series object is a core data structure that pandas uses to represent rows and columns. A Series is a labelled collection of values similar to the NumPy vector.

The main advantage of Series objects is the ability to utilize non-integer labels. NumPy arrays can only utilize integer labels for indexing.

When you read a file into a dataframe, pandas uses the values in the first row (also known as the header) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index. Dataframes contain both a row index and a column index.

                          column label (column index)   ...
row label (row index)  0
row label              1
...

When you select a row from a dataframe, instead of just returning the values in that row as a list, pandas returns a Series object that contains the column labels as well as the corresponding values:

NDB_No	Shrt_Desc	Water_(g)	...
1001	BUTTER WITH SALT	15.87	...

The Series object representing the first row looks like:

NDB_No       1001
Shrt_Desc    BUTTER WITH SALT
Water_(g)    15.87
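
A Series like this one can be sketched by hand to see how label-based lookup works. The values are copied from the row shown in these notes; building the Series manually (rather than selecting it from food_info) is just for illustration.

```python
import pandas as pd

# A hand-built Series: values plus string labels as the index.
first_row = pd.Series(
    [1001, "BUTTER WITH SALT", 15.87],
    index=["NDB_No", "Shrt_Desc", "Water_(g)"],
)

print(first_row["Shrt_Desc"])  # look up a value by its label
print(first_row.iloc[2])       # look up a value by its position
```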

slicing numpy

numpy slicing arrays

vector

Slicing a vector works like selecting a subset of a list:

vector = np.array([5,10,15,20])
vector[0:3]
#output [5,10,15]
# from the first index up to, but not including, the second index

matrix

matrix = np.array ([
                    [5,10,15],
                    [20,25,30],
                    [35,40,45]
])

matrix [:,1]
#output [10,25,40]

This selects all of the rows, but only the column with index 1.
The colon (:) specifies that the entirety of a single dimension should be selected.

Instructions

Assign the whole third column from world_alcohol to the variable countries.

countries = world_alcohol[:,2]

slicing one dimension

matrix = np.array([
                [5,10,15],
                [20,25,30],
                [35,40,45]
])
matrix [:,0:2]
#output ([[5,10],
#        [20,25],
#        [35,40]])

matrix [1:3,1]
#output [25,40]

Instructions

Assign all the rows and the first 2 columns of world_alcohol to first_two_columns.
Assign the first 10 rows and the first column of world_alcohol to first_ten_years.
Assign the first 10 rows and all of the columns of world_alcohol to first_ten_rows.

first_two_columns = world_alcohol[:,0:2]
first_ten_years = world_alcohol[0:10,0]
first_ten_rows = world_alcohol[0:10,:]

slicing arrays (both dimensions)

matrix = np.array([
        [5,10,15],
        [20,25,30],
        [35,40,45]
])

matrix [1:3,0:2]
#output
# [[20,25],
#  [35,40]]

Instructions

Assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to first_twenty_regions.

first_twenty_regions = world_alcohol[0:20,1:3]

pandas

selecting a row (pandas)

We use bracket notation to access elements in a NumPy array or a standard list.

To select rows in a dataframe, we need to use the pandas loc[] indexer.

loc[] allows you to select rows by row label. Recall that when you read a file into a dataframe, pandas uses the row number (or position) as each row's label. Pandas uses zero-indexing, so the first row is at index 0, the second row at index 1, and so on.

# Series object representing the row at index 0.
food_info.loc[0]

# Series object representing the seventh row.
food_info.loc[6]

selecting multiple rows

Pass in either a slice of row labels or a list of row labels, and pandas will return a dataframe.

Note that unlike slicing lists in Python, a slice of a dataframe using .loc[] will include both the start and the end row.

# DataFrame containing the rows at index 3, 4, 5, and 6 returned.
food_info.loc[3:6]

# DataFrame containing the rows at index 2, 5, and 10 returned. Either of the following work.
# Method 1
two_five_ten = [2,5,10] 
food_info.loc[two_five_ten]

# Method 2
food_info.loc[[2,5,10]]
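
The inclusive end of a .loc[] slice is easy to forget, so here is a small sketch comparing it with a plain list slice and with the position-based .iloc[] indexer. The dataframe df and its single column are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with the default row labels 0..7.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70, 80]})
values = list(df["value"])

by_label = df.loc[3:6]      # label-based: rows 3, 4, 5 and 6 (end included)
by_position = df.iloc[3:6]  # position-based: rows 3, 4 and 5 (end excluded)

print(len(values[3:6]), len(by_label), len(by_position))
```

The list slice and .iloc[] both return three items, while .loc[] returns four.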

Instructions

Select the last 5 rows of food_info and assign to the variable last_rows.

num_rows = food_info.shape[0]
last_rows = food_info.loc[num_rows-5:num_rows-1]
print(last_rows)
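
The label arithmetic above relies on the default row labels 0, 1, 2, and so on. As a sketch of two position-based alternatives that pandas also provides, iloc[-5:] and tail(5) return the same rows; df here is a hypothetical ten-row dataframe, not food_info.

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # hypothetical data, labels 0..9

num_rows = df.shape[0]
last_by_label = df.loc[num_rows - 5:num_rows - 1]  # approach used above
last_by_position = df.iloc[-5:]                    # last five positions
last_by_tail = df.tail(5)                          # convenience method

print(last_by_label.equals(last_by_position))
print(last_by_label.equals(last_by_tail))
```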

selecting a column

To access a single column, use bracket notation and pass in the column name as a string:

# Series object representing the "NDB_No" column.
ndb_col = food_info["NDB_No"]

# You can instead access a column by passing in a string variable.
col_name = "NDB_No"
ndb_col = food_info[col_name]

selecting multiple columns by name

To select multiple columns, pass in a list of strings representing the column names and pandas will return a dataframe containing only the values in those columns

When selecting multiple columns, the order of the columns in the returned dataframe matches the order of the column names in the list of strings that you passed in. This allows you to easily explore specific columns that may not be positioned next to each other in the dataframe.

columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]

# Skipping the assignment.
zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]
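
To see that the returned column order follows the list you pass in, here is a minimal sketch on a hypothetical three-column dataframe (the values are placeholders, not taken from food_info.csv).

```python
import pandas as pd

# Hypothetical data; only the column names echo food_info.
df = pd.DataFrame({
    "Zinc_(mg)": [0.09, 0.05],
    "Water_(g)": [15.87, 15.87],
    "Copper_(mg)": [0.0, 0.016],
})

# Columns come back in the order requested, not the order stored.
subset = df[["Copper_(mg)", "Zinc_(mg)"]]
print(list(subset.columns))
```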

Computation With NumPy

Learn how to select elements in arrays and perform computations with NumPy.

array comparisons

One of the most powerful aspects of the NumPy module is the ability to make comparisons across an entire array. These comparisons result in Boolean values.

import numpy

vector = numpy.array([5, 10, 15, 20])
vector == 10
# output: a Boolean vector
# [False, True, False, False]

matrix = numpy.array([
                    [5, 10, 15], 
                    [20, 25, 30],
                    [35, 40, 45]
                 ])
matrix == 25
# output: a Boolean matrix
# [
#    [False, False, False], 
#    [False, True,  False],
#    [False, False, False]
# ]

Instructions

The variable world_alcohol already contains the data set we're working with.

Extract the third column in world_alcohol, and compare it to the string Canada. Assign the result to countries_canada.
Extract the first column in world_alcohol, and compare it to the string 1984. Assign the result to years_1984.

countries_canada= (world_alcohol[:,2]=='Canada')
years_1984 = (world_alcohol[:,0]=='1984')

selecting elements

Comparisons give us the power to select elements in arrays using Boolean vectors. This allows us to conditionally select certain elements in vectors, or certain rows in matrices.


vector = numpy.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)
print(vector[equal_to_ten])
# output [10]


matrix = numpy.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
second_column_25 = (matrix[:,1] == 25)
matrix[second_column_25,:]
#output [[20 ,25,30]]

Instructions

Compare the third column of world_alcohol to the string Algeria.
Assign the result to country_is_algeria.
Select only the rows in world_alcohol where country_is_algeria is True.
Assign the result to country_algeria.

country_is_algeria =(world_alcohol[:,2]=='Algeria')
country_algeria = world_alcohol[country_is_algeria,:]

Comparisons with Multiple Conditions

We can join multiple conditions with an ampersand (&), which requires both conditions to be True. It's critical to put each condition in parentheses.

We can use the pipe symbol (|) to specify that either one condition or the other should be True.
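
A minimal sketch of both operators on a small illustrative vector. The parentheses matter because & and | bind more tightly than comparison operators such as == and <, so without them Python would try to evaluate the & or | first.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# & : True only where both conditions hold.
both = (vector > 5) & (vector < 20)

# | : True where at least one condition holds.
either = (vector == 5) | (vector == 20)

print(both)
print(either)
print(vector[both])    # elements strictly between 5 and 20
print(vector[either])  # elements equal to 5 or 20
```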

Instructions

Perform a comparison with multiple conditions, and join the conditions with &.
- Compare the first column of world_alcohol to the string 1986.
- Compare the third column of world_alcohol to the string Algeria.
- Enclose each condition in parentheses, and join the conditions with &.
- Assign the result to is_algeria_and_1986.
Use is_algeria_and_1986 to select rows from world_alcohol.
Assign the rows that is_algeria_and_1986 selects to rows_with_algeria_and_1986.

is_algeria_and_1986 = (
    (world_alcohol[:,0] == "1986") &
    (world_alcohol[:,2] == "Algeria")
)

rows_with_algeria_and_1986 = world_alcohol[is_algeria_and_1986,:]

Replacing Values

We can also use comparisons to replace values:

vector = numpy.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)
#output [50,50,15,20]


matrix = numpy.array([
            [5, 10, 15], 
            [20, 25, 30],
            [35, 40, 45]
         ])
second_column_25 = matrix[:,1] == 25
matrix[second_column_25, 1] = 10
#output
#[
#    [5, 10, 15], 
#    [20, 10, 30],
#    [35, 40, 45]
# ]

Instructions

Replace all instances of the string 1986 in the first column of world_alcohol with the string 2014.
Replace all instances of the string Wine in the fourth column of world_alcohol with the string Grog.

world_alcohol[(world_alcohol[:,0]=='1986'),0]='2014'
#world_alcohol[(world_alcohol[:,3]=='Wine'),3] ='Grog'
world_alcohol[:,3][world_alcohol[:,3] == 'Wine'] = 'Grog'

Replacing Empty Strings

Instructions

Compare all the items in the fifth column of world_alcohol with an empty string ''. Assign the result to is_value_empty.
Select all the values in the fifth column of world_alcohol where is_value_empty is True, and replace them with the string 0.


is_value_empty =world_alcohol[:,4] == ''
world_alcohol[is_value_empty,4] = '0'

#world_alcohol[:,4][world_alcohol[:,4]==''] = '0'

Converting Data Types

We can convert the data type of an array with the astype() method.

Instructions

Extract the fifth column from world_alcohol, and assign it to the variable alcohol_consumption.
Use the astype() method to convert alcohol_consumption to the float data type.

alcohol_consumption= world_alcohol[:,4]
alcohol_consumption = alcohol_consumption.astype(float)
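
The reason empty strings were replaced before this step can be sketched on a tiny illustrative array: astype(float) cannot parse an empty string, so the conversion only succeeds after the cleanup. The array raw and its values are hypothetical.

```python
import numpy as np

raw = np.array(["1.5", "", "0.25"])  # hypothetical string column

# Converting while an empty string is present raises ValueError.
conversion_failed = False
try:
    raw.astype(float)
except ValueError as error:
    conversion_failed = True
    print("conversion failed:", error)

# Replace empty strings with "0", then convert.
raw[raw == ""] = "0"
values = raw.astype(float)
print(values, values.dtype)
```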

Computing with NumPy

vector = numpy.array([5, 10, 15, 20])
vector.sum()
# output 50

matrix = numpy.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
matrix.sum(axis=1) # axis=1 sums across each row; axis=0 sums down each column
# output [30,75,120]
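
To make the axis argument concrete, here is a short sketch computing the row sums, the column sums, and the overall total for the same matrix.

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

row_sums = matrix.sum(axis=1)     # one sum per row
column_sums = matrix.sum(axis=0)  # one sum per column
total = matrix.sum()              # no axis: sum of every element

print(row_sums)     # [ 30  75 120]
print(column_sums)  # [60 75 90]
print(total)        # 225
```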

Instructions

Use the sum() method to calculate the sum of the values in alcohol_consumption. Assign the result to total_alcohol.
Use the mean() method to calculate the average of the values in alcohol_consumption. Assign the result to average_alcohol.

total_alcohol = alcohol_consumption.sum()
average_alcohol =alcohol_consumption.mean()

Total Annual Alcohol Consumption

Instructions

Create a matrix called canada_1986 that only contains the rows in world_alcohol where the first column is the string 1986 and the third column is the string Canada.
Extract the fifth column of canada_1986, replace any empty strings ('') with the string 0, and convert the column to the float data type. Assign the result to canada_alcohol.
Compute the sum of canada_alcohol. Assign the result to total_canadian_drinking.

is_canada_1986 = (
    (world_alcohol[:,2] == "Canada") &
    (world_alcohol[:,0] == '1986')
)

canada_1986 = world_alcohol[is_canada_1986,:]

canada_alcohol = canada_1986[:,4]

empty_strings = canada_alcohol == ''

canada_alcohol[empty_strings] = "0"

canada_alcohol = canada_alcohol.astype(float)

total_canadian_drinking = canada_alcohol.sum()

Calculating Consumption for Each Country

Create an empty dictionary called totals.
Select only the rows in world_alcohol that match a given year. Assign the result to year.
Loop through a list of countries. For each country:
- Select only the rows from year that match the given country.
- Assign the result to country_consumption.
- Extract the fifth column from country_consumption.
- Replace any empty string values in the column with the string 0.
- Convert the column to the float data type.
- Find the sum of the column.
- Add the sum to the totals dictionary, with the country name as the key.

After the code executes, you'll have a dictionary containing all of the country names as keys, with the associated alcohol consumption totals as the values.

Calculating Consumption for Each Country

We've assigned the list of all countries to the variable countries.
Find the total consumption for each country in countries for the year 1989.
- Refer to the steps outlined above for help.
When you're finished, totals should contain all of the country names as keys, with the corresponding alcohol consumption totals for 1989 as values.

totals = {}
is_year = world_alcohol[:,0] == "1989"
year = world_alcohol[is_year,:]

for country in countries:
    is_country = year[:,2] == country
    
    country_consumption = year[is_country,:]
    
    alcohol_column = country_consumption[:,4]
    
    is_empty = alcohol_column == ''
    
    alcohol_column[is_empty] = "0"
    
    alcohol_column = alcohol_column.astype(float)
    
    totals[country] = alcohol_column.sum()

Finding the Country that Drinks the Most

Now that we've computed total alcohol consumption for each country in 1989, we can loop through the totals dictionary to find the country with the highest value.

The process we've outlined below will help you find the key with the highest value in a dictionary:

Create a variable called highest_value that will keep track of the highest value. Set its value to 0.
Create a variable called highest_key that will keep track of the key associated with the highest value. Set its value to None.
Loop through each key in the dictionary.
- If the value associated with the key is greater than highest_value, assign the value to highest_value, and assign the key to highest_key.
  After the code runs, highest_key will be the key associated with the highest value in the dictionary.

Instructions

Find the country with the highest total alcohol consumption.
To do this, you'll need to find the key associated with the highest value in the totals dictionary.
Follow the process outlined above to find the highest value in totals.
When you're finished, highest_value will contain the highest total alcohol consumption, and highest_key will contain the country that had the highest per capita alcohol consumption in 1989.

highest_value = 0
highest_key = None
for country in totals:
    if totals[country]> highest_value:
        highest_value = totals[country]
        highest_key = country
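
As an alternative sketch to the explicit loop, Python's built-in max() can find the key with the largest value directly. The totals dictionary below is a hypothetical stand-in with made-up numbers, not the real 1989 results.

```python
# Hypothetical totals; the real dictionary comes from the loop above.
totals = {"Canada": 8.1, "Algeria": 0.5, "Uruguay": 6.3}

# max() iterates over the keys and ranks them by totals.get(key).
highest_key = max(totals, key=totals.get)
highest_value = totals[highest_key]

print(highest_key, highest_value)
```

Unlike the loop, which starts from 0 and None, max() raises a ValueError on an empty dictionary, so the loop version is the safer choice if totals might be empty.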

NumPy Strengths and Weaknesses

You should now have a good foundation in NumPy, and in handling issues with your data. NumPy is much easier to work with than lists of lists, because:

It's easy to perform computations on data.
Data indexing and slicing are faster and easier.
We can convert data types quickly.

Overall, NumPy makes working with data in Python much more efficient. It's widely used for this reason, especially for machine learning.

You may have noticed some limitations with NumPy as you worked through the past two missions, though. For example:

All of the items in an array must have the same data type. For many datasets, this can make arrays cumbersome to work with.
Columns and rows must be referred to by number, which gets confusing when you go back and forth from column name to column number.

In the next few missions, we'll learn about the Pandas library, one of the most popular data analysis libraries. Pandas builds on NumPy, but does a better job addressing the limitations of NumPy.
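
The single-data-type limitation can be sketched in a few lines: NumPy converts a mixed list to one common type (here, strings), while a pandas dataframe keeps a separate type per column. The values and column names are illustrative only.

```python
import numpy as np
import pandas as pd

# NumPy: mixed inputs are coerced to a single dtype (a string type here).
mixed = np.array([1986, "Canada", 0.5])
print(mixed.dtype.kind, mixed)

# pandas: each column keeps its own dtype and has a name.
df = pd.DataFrame({"Year": [1986], "Country": ["Canada"], "Litres": [0.5]})
print(df.dtypes)
print(df["Country"][0])  # refer to a column by name instead of by number
```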
