Intermediate Python -- DataCamp
进阶数据操作(Dictionary&DataFrame)
List
数据单独列出来就是List
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
ind_alb = countries.index("albania")
ind_alb
1
pop[ind_alb]
2.77
list使用起来不是很方便。
Dictionary
两组list的数据对应起来的话就变成Dictionary了
world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}
world["albania"]
2.77
- 新增或者删除数值
world["sealand"] = 0.000028
world
{'afghanistan': 30.55, 'albania': 2.81, 'algeria': 39.21, 'sealand': 2.8e-05}
del(world["sealand"]) world
{'afghanistan': 30.55, 'albania': 2.81, 'algeria': 39.21}
- 多层字典
说白了就是字典里面套字典
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
'france': { 'capital':'paris', 'population':66.03 },
'germany': { 'capital':'berlin', 'population':80.62 },
'norway': { 'capital':'oslo', 'population':5.084 } }
# Print out the capital of France
europe["france"]["capital"]
# Create sub-dictionary data
data={"capital":"rome","population":59.83}
# Add data to europe under key 'italy'
europe["italy"]=data
# Print europe
print(europe)
{'france': {'population': 66.03, 'capital': 'paris'}, 'italy': {'population': 59.83, 'capital': 'rome'}, 'germany': {'population': 80.62, 'capital': 'berlin'}, 'norway': {'population': 5.084, 'capital': 'oslo'}, 'spain': {'population': 46.77, 'capital': 'madrid'}}
DataFrame
数据分析里用的最多的其实还是DataFrame(数据框),操作数据框的话会用到pandas
工具包 。
- 把字典转换成数据框
关键语法
import xx as x
pd.DataFrame()
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Import pandas as pd
import pandas as pd
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict={"country":names,"drives_right":dr,"cars_per_cap":cpc}
# Build a DataFrame cars from my_dict: cars
cars=pd.DataFrame(my_dict)
# Print cars
print(cars)
cars_per_cap country drives_right
0 809 United States True
1 731 Australia False
2 588 Japan False
3 18 India False
4 200 Russia True
5 70 Morocco True
6 45 Egypt True
- 设置行名
cars.index=xxxxx
import pandas as pd
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict,)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index=row_labels
# Print cars again
print(cars)
cars_per_cap country drives_right
0 809 United States True
1 731 Australia False
2 588 Japan False
3 18 India False
4 200 Russia True
5 70 Morocco True
6 45 Egypt True
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
EG 45 Egypt True
- 读取csv文件
cars= pd.read_csv("cars.csv",index_col = 0)
Pandas的简单操作
之后会花篇幅详细学习Pandas
- 选择列
[]
和[[ ]]
的区别,带列名和不带列名
print(cars["country"])
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
Name: country, dtype: object
print(cars[["country"]])
country
US United States
AUS Australia
JPN Japan
IN India
RU Russia
MOR Morocco
EG Egypt
print(cars[["country","drives_right"]])
country drives_right
US United States True
AUS Australia False
JPN Japan False
IN India False
RU Russia True
MOR Morocco True
EG Egypt True
- 选择行
可以选取指定行,这个和R很相似
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out first 3 observations
print(cars[0:3])
# Print out fourth, fifth and sixth observation
print(cars[3:6])
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
cars_per_cap country drives_right
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
print(cars[0:3])
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
<script.py> output:
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
cars_per_cap country drives_right
IN 18 India False
RU 200 Russia True
MOR 70 Morocco True
loc and iloc
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out drives_right value of Morocco
print(cars.loc[["MOR","drives_right"]])
# Print sub-DataFrame
print(cars.loc[["RU","MOR"],["country","drives_right"]])
<script.py> output:
cars_per_cap country drives_right
MOR 70.0 Morocco True
drives_right NaN NaN NaN
country drives_right
RU Russia True
MOR Morocco True
and, or, not
看下面这个简单的例子就好,直接用and
or
# Define variables
my_kitchen = 18.0
your_kitchen = 14.0
# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen > 10 and my_kitchen <10)
# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen<14 or my_kitchen>17)
# Double my_kitchen smaller than triple your_kitchen?
print(2*my_kitchen < 3*your_kitchen)
<script.py> output:
False
True
True
not也是直接用not
In [1]:
x = 8
y = 9
not(not(x < 3) and not(y > 14 or y > 10))
Out[1]:
False
用numpy来进行类似的操作
np.logical_and
np.logical_or
np.logical_not
# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house >18.5,my_house<10))
# Both my_house and your_house smaller than 11
print(np.logical_and(my_house<11,your_house<11))
得到结果
<script.py> output:
[False True False True]
[False False False True]
if, elif, else
条件后面用的:
分隔
area = 10.0
if(area < 9) :
print("small")
elif(area < 12) :
print("medium")
else :
print("large")
比较一下,R语言里用的是{}
if (test_expression) {
statement1
} else {
statement2
}
Python的简洁易懂的优势就体现出来了。
再举个例子, Python可以没有规矩到条件不需要加()
# Define variables
room = "kit"
area = 14.0
# if-else construct for room
if room == "kit" :
print("looking around in the kitchen.")
else :
print("looking around elsewhere.")
# if-else construct for area
if area > 15 :
print("big place!")
else :
print("pretty small.")
# if statement for room
if room == "kit" :
print("looking around in the kitchen.")
# if statement for area
if area>15 :
print("big place!")
练习题
Remember about np.logical_and()
, np.logical_or()
and np.logical_not()
, the Numpy variants of the and
, or
and not
operators? You can also use them on Pandas Series to do more advanced filtering operations.
Take this example that selects the observations that have a cars_per_cap
between 10 and 80. Try out these lines of code step by step to see what's happening.
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 10, cpc < 80)
medium = cars[between]
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Import numpy, you'll need this
import numpy as np
# Create medium: observations with cars_per_cap between 100 and 500
cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
# Print medium
print(medium)
output
<script.py> output:
cars_per_cap country drives_right
RU 200 Russia True
while
语法结构,用:
隔开
while condition :
expression
x = 1
while x < 4 :
print(x)
x = x + 1
# Initialize offset
offset = 8
# Code the while loop
while offset != 0:
print("correcting...")
offset = offset-1
print(offset)
output
<script.py> output:
correcting...
7
correcting...
6
correcting...
5
correcting...
4
correcting...
3
correcting...
2
correcting...
1
correcting...
0
-
while
和if
的混搭
# Initialize offset
offset = -6
# Code the while loop
while offset != 0 :
print("correcting...")
if offset > 0 :
offset = offset -1
else :
offset = offset + 1
print(offset)
for循环
List里的for循环
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Code the for loop
for i in areas :
print(i)
-
enumerate
, 可以同时输出变量的序列和数值
l = ['Alice', 'Bob', 'Charlie']
for name in l:
print(name)
# Alice
# Bob
# Charlie
for i, name in enumerate(l):
print(i, name)
# 0 Alice
# 1 Bob
# 2 Charlie
默认是从0开始,可以指定从1或者从任意数字开始。
for i, name in enumerate(l, 1):
print(i, name)
# 1 Alice
# 2 Bob
# 3 Charlie
自定义输出格式
# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Code the for loop
for index, area in enumerate(areas) :
print("room " + str(index+1) + ": " + str(area))
output
room 1: 11.25
room 2: 18.0
room 3: 20.0
room 4: 10.75
room 5: 9.5
双层list的for循环也是毫不费力
# house list of lists
house = [["hallway", 11.25],
["kitchen", 18.0],
["living room", 20.0],
["bedroom", 10.75],
["bathroom", 9.50]]
# Build a for loop from scratch
for x, y in house:
print("the " + str(x) + " is " + str(y) + " sqm" )
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm
Dictionary里的for循环
和list不一样的是需要用到items()
这个指令
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
# Iterate over europe
for k, v in europe.items() :
print("the capital of " + k + " is " + v)
Numpy array里的for循环
numpy array可以是1D也可以是2D,1D的话和list一样,2D的话要用到np.nditer
,把两列按照一列来显示。
# Import numpy as np
import numpy as np
# For loop over np_height
for x in np_height :
print(str(x) + " inches")
# For loop over np_baseball
for val in np.nditer(np_baseball) :
print(val)
Dataframe里的for循环
iterrows()
按照行来分析。
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
cars.head()
# Iterate over rows of cars
for lab, row in cars.iterrows() :
print(lab)
print(row)
output
cars.head()
cars_per_cap country drives_right
US 809 United States True
AUS 731 Australia False
JPN 588 Japan False
IN 18 India False
RU 200 Russia True
<script.py> output:
US
cars_per_cap 809
country United States
drives_right True
Name: US, dtype: object
AUS
cars_per_cap 731
country Australia
drives_right False
Name: AUS, dtype: object
JPN
cars_per_cap 588
country Japan
drives_right False
Name: JPN, dtype: object
IN
cars_per_cap 18
country India
drives_right False
Name: IN, dtype: object
RU
cars_per_cap 200
country Russia
drives_right True
Name: RU, dtype: object
MOR
cars_per_cap 70
country Morocco
drives_right True
Name: MOR, dtype: object
EG
cars_per_cap 45
country Egypt
drives_right True
Name: EG, dtype: object
两种方法添加列
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
cars.head()
# Code for loop that adds COUNTRY column
for lab, row in cars.iterrows():
cars.loc[lab,"COUNTRY"] = str.upper(row["country"])
# Print cars
print(cars)
2.apply
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
cars["country"]
# Use .apply(str.upper)
for lab, row in cars.iterrows() :
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)
网友评论