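How do I select a pandas Series from a DataFrame?
How do you select a column or a row of a DataFrame? Why "a column or a row"? Because Series is a type in the pandas library, much like a list in Python: one Series holds one row or one column of data. In terms of dimensions, a Series is one-dimensional and a DataFrame is two-dimensional, that is, a table.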
import pandas as pd
ufo = pd.read_csv('uforeports.csv')
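# look at the first five rows
ufo.head()
# select the City column of the ufo DataFrame
ufo['City']
# a column can also be selected with dot notation
ufo.City
# dot notation does not always work: a column name may clash with a method name,
# or contain spaces, as in the Colors Reported column
ufo['Colors Reported']
# what happens when two Series are added together?
ufo.City + ufo.State
# the two Series are combined element by element: values with the same index label are paired
# a literal string can be added in between
ufo.City + ',' + ufo.State
# create a new column, Location, to store the combined values
# note: a new column cannot be created with dot notation (ufo.Location = ...); use brackets
ufo['Location'] = ufo.City + ',' + ufo.State
# the Location column now appears at the end
ufo.head()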
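The points above can be checked without the CSV file. This is a minimal sketch: `ufo_demo` and its values are made up for illustration and only mimic the column names used above.

```python
import pandas as pd

# Hypothetical stand-in for the UFO data; the values are invented.
ufo_demo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro"],
    "State": ["NY", "NJ"],
    "Colors Reported": ["RED", None],
})

# Bracket and dot notation return the same Series for a simple column name.
same_series = ufo_demo["City"].equals(ufo_demo.City)

# Only bracket notation works when the name contains a space.
colors = ufo_demo["Colors Reported"]

# Element-wise string concatenation, paired up by index label.
ufo_demo["Location"] = ufo_demo.City + ", " + ufo_demo.State
print(ufo_demo["Location"].tolist())
```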
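Why do some pandas commands end with parentheses, and other commands don't?
Why do some pandas commands take parentheses while others do not?
Some are attribute lookups, which take no parentheses; others are method calls, which do.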
import pandas as pd
movies = pd.read_csv("https://bit.ly/imdbratings")
movies.head()
# return descriptive statistics: count, mean, std, min, 25%, 50%, 75%, max
movies.describe()
# the percentiles can be specified
movies.describe(percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
# shape of the DataFrame
movies.shape
# data type of each column
movies.dtypes
# describe the int and float columns
movies.describe()
# describe the object columns: count, unique, top, freq
movies.describe(include=["object"])
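To make the attribute/method distinction concrete, here is a small sketch on a hand-made frame (`movies_demo` is a hypothetical stand-in, not the IMDb data).

```python
import pandas as pd

# Hypothetical miniature of the movies data.
movies_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": [142, 175, 200],
})

# Attributes describe the object and are read without parentheses.
n_rows, n_cols = movies_demo.shape

# Methods perform an action, accept optional arguments, and need parentheses.
summary = movies_demo.describe()

# Without parentheses you get the method object itself, not its result.
is_method = callable(movies_demo.describe)
print(n_rows, n_cols, is_method)
```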
How do I rename columns in a pandas DataFrame?
How can columns be renamed?
import pandas as pd
ufo = pd.read_csv('https://bit.ly/uforeports')
ufo.head()
ufo.columns
ufo.rename(columns={"Colors Reported": "Color_Reported"}, inplace=True)
ufo.columns
How do I remove columns from a pandas DataFrame?
How can columns be dropped from a DataFrame?
ufo = pd.read_csv("https://bit.ly/uforeports")
ufo.columns
# drop columns with axis=1
ufo.drop(["Colors Reported"], axis=1, inplace=True)
ufo.columns
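How do I delete a row from a DataFrame?
Dropping rows from a DataFrame
# drop the rows with the given index labels
# (with inplace=True, drop returns None, so call head() separately)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()
# drop the rows where State == "CO"
ufo.drop(ufo[ufo["State"] == "CO"].index, axis=0).head()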
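Dropping rows by condition is the mirror image of filtering. The sketch below, on an invented `ufo_demo` frame, compares the index-based drop used above with a plain boolean filter that keeps the other rows.

```python
import pandas as pd

# Hypothetical data; only the column names follow the UFO example.
ufo_demo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Abilene", "Denver"],
    "State": ["NY", "CO", "KS", "CO"],
})

# Look up the index labels of the matching rows, then drop them.
dropped = ufo_demo.drop(ufo_demo[ufo_demo["State"] == "CO"].index, axis=0)

# Equivalent: keep the rows where the condition does not hold.
kept = ufo_demo[ufo_demo["State"] != "CO"]

print(dropped.equals(kept))
```

Neither call modifies `ufo_demo`, because `inplace=True` is not used.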
How do I sort a pandas DataFrame or Series?
How is sorting done?
import pandas as pd
movies = pd.read_csv("https://bit.ly/imdbratings")
movies['title'].sort_values(ascending=False).head()
movies.sort_values(['title', "duration"], ascending=[True, True]).head(2)
How do I filter rows of a pandas DataFrame by column value?
How can rows be selected based on the values in a column?
movies = pd.read_csv("https://bit.ly/imdbratings")
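# demo: select the movies that run at least 200 minutes
# method 1: build the boolean list with a loop
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)
is_long = pd.Series(booleans)
movies[is_long]
# method 2: compare the whole Series at once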
is_long = movies.duration >= 200
is_long.head()
movies[movies.duration >= 200]
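As a quick check that the two methods agree, here is a sketch on a made-up `movies_demo` frame. It also shows `.loc`, which filters rows and picks a column in one step.

```python
import pandas as pd

# Hypothetical data with the same column names as the movies example.
movies_demo = pd.DataFrame({
    "title": ["Short film", "Epic", "Longer epic"],
    "duration": [95, 200, 229],
})

# Method 1, condensed into a list comprehension.
is_long_loop = pd.Series([length >= 200 for length in movies_demo.duration])

# Method 2: the comparison is applied to every element of the Series.
is_long = movies_demo.duration >= 200

# .loc takes the row mask and a column label together.
long_titles = movies_demo.loc[is_long, "title"]
print(long_titles.tolist())
```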
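How do I apply multiple filter criteria to a pandas DataFrame?
How can several conditions be combined to filter the data?
import pandas as pd
import numpy as np
drinks = pd.read_csv("https://bit.ly/drinksbycountry")
# select the numeric columns
drinks.select_dtypes(include=[np.number]).dtypes
drinks.select_dtypes(include=['float']).dtypes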
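For the question in the heading, one common approach is to combine boolean Series with `&` (and) and `|` (or), wrapping each condition in parentheses. The sketch below assumes a small invented `movies_demo` frame; `duration` and `genre` follow the column names used elsewhere in these notes.

```python
import pandas as pd

# Hypothetical data for illustration only.
movies_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "duration": [95, 200, 229, 210],
    "genre": ["Drama", "Drama", "Crime", "Action"],
})

# AND: both conditions must hold.
long_dramas = movies_demo[(movies_demo.duration >= 200) & (movies_demo.genre == "Drama")]

# OR: at least one condition must hold.
long_or_drama = movies_demo[(movies_demo.duration >= 200) | (movies_demo.genre == "Drama")]

# Several OR conditions on one column are shorter with isin().
crime_or_action = movies_demo[movies_demo.genre.isin(["Crime", "Action"])]
print(long_dramas.title.tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.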
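How do I use the "axis" parameter in pandas?
How is the axis parameter used?
import pandas as pd
drinks = pd.read_csv("https://bit.ly/drinksbycountry")
# axis=1 refers to columns
drinks.drop('continent', axis=1).head()
# axis=0 refers to rows
drinks.drop([1, 2], axis=0).head()
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
drinks.mean(axis=0, numeric_only=True)
# for mean, axis=0 is the same as axis="index"
# drinks.mean(axis="index", numeric_only=True)
drinks.mean(axis=1, numeric_only=True).head()
# axis=1 is the same as axis="columns"
# drinks.mean(axis="columns", numeric_only=True)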
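A tiny numeric frame makes the direction of each axis easier to see. `drinks_demo` and its numbers are invented for this sketch.

```python
import pandas as pd

# Hypothetical numeric-only slice of the drinks data.
drinks_demo = pd.DataFrame({
    "beer_servings": [0, 89, 25],
    "wine_servings": [0, 54, 14],
})

# axis=0 ("index") moves down the rows: one result per column.
col_means = drinks_demo.mean(axis=0)

# axis=1 ("columns") moves across the columns: one result per row.
row_means = drinks_demo.mean(axis=1)

print(col_means["beer_servings"], row_means.tolist())
```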
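How do I use string methods in pandas?
How are string methods used?
orders = pd.read_csv("https://bit.ly/chiporders", sep="\t")
orders.item_name.str.contains("Chicken").head()
# regex=False treats the brackets as literal characters
orders.choice_description.str.replace("[", "", regex=False).str.replace("]", "", regex=False).head()
# see the official documentation for more string methods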
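The same two operations can be tried on a few hand-written rows. In this sketch `orders_demo` is hypothetical, and both brackets are removed in one pass with a regular-expression character class, which is an alternative to chaining two literal replacements.

```python
import pandas as pd

# Hypothetical rows modelled on the orders data.
orders_demo = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Izze", "Chicken Burrito"],
    "choice_description": ["[Fresh Tomato Salsa]", "[Clementine]", None],
})

# str.contains returns a boolean Series that can be used as a row filter.
has_chicken = orders_demo.item_name.str.contains("Chicken")
chicken_orders = orders_demo[has_chicken]

# The character class [\[\]] matches either bracket; missing values stay missing.
cleaned = orders_demo.choice_description.str.replace(r"[\[\]]", "", regex=True)
print(cleaned.tolist())
```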
How do I change the data type of a pandas Series?
How can the data type of a Series be changed?
import pandas as pd
drinks = pd.read_csv("https://bit.ly/drinksbycountry")
drinks.dtypes
drinks["beer_servings"] = drinks.beer_servings.astype(float)
# the dtype can also be set while reading the file
drinks = pd.read_csv("https://bit.ly/drinksbycountry", dtype={"beer_servings": float})
drinks.dtypes
orders = pd.read_csv("https://bit.ly/chiporders", sep="\t")
orders.dtypes
# regex=False makes "$" a literal dollar sign instead of the end-of-string anchor
orders.item_price.str.replace("$", "", regex=False).astype(float).mean()
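A minimal sketch of the price conversion on invented values: the strings must be cleaned before `astype(float)` can parse them, and `astype` returns a new Series rather than changing the original.

```python
import pandas as pd

# Hypothetical prices stored as text, as in the orders data.
orders_demo = pd.DataFrame({"item_price": ["$2.39", "$3.39", "$16.98"]})

# Strip the literal dollar sign, then convert the remaining text to floats.
prices = orders_demo.item_price.str.replace("$", "", regex=False).astype(float)

# The source column still holds the original strings.
still_text = orders_demo.item_price.iloc[0]
print(prices.dtype, still_text)
```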
When should I use a "groupby" in pandas?
When is groupby useful?
import pandas as pd
drinks = pd.read_csv("https://bit.ly/drinksbycountry")
# compute a statistic for each group
drinks.groupby("continent").beer_servings.mean()
# drop the text column first so that "mean" only sees numeric columns
drinks.drop(columns="country").groupby("continent").agg(["count", "min", "max", "mean"])
drinks.groupby("continent").mean(numeric_only=True)
# show the plot inline in a Jupyter notebook
%matplotlib inline
drinks.groupby("continent").mean(numeric_only=True).plot(kind="bar")
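groupby is worth reaching for whenever the question is "what is X for each Y". The sketch below uses four invented rows to show that a group mean is the same number you would get by filtering one group by hand.

```python
import pandas as pd

# Hypothetical rows; the figures are made up for illustration.
drinks_demo = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [89, 25, 245, 217],
    "continent": ["Europe", "Africa", "Europe", "Africa"],
})

# Split the rows by continent, then average one column within each group.
beer_by_continent = drinks_demo.groupby("continent").beer_servings.mean()

# The same figure for a single group, computed with a plain filter.
africa_mean = drinks_demo[drinks_demo.continent == "Africa"].beer_servings.mean()

# Several aggregations at once on the selected column.
beer_stats = drinks_demo.groupby("continent").beer_servings.agg(["count", "min", "max", "mean"])
print(beer_by_continent.to_dict())
```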
How do I explore a pandas Series?
How can a single column be explored?
import pandas as pd
movies = pd.read_csv("https://bit.ly/imdbratings")
movies.dtypes
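# count: number of non-null values (979 here)
# unique: number of distinct values
# top: the most frequent value
# freq: how often the top value occurs
movies.genre.describe()
# count of each value
movies.genre.value_counts()
# share of the total for each value
movies.genre.value_counts(normalize=True)
# the distinct values
movies.genre.unique()
# the number of distinct values
movies.genre.nunique()
# cross-tabulation of two columns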
pd.crosstab(movies.genre, movies.content_rating)
How do I handle missing values in pandas?
How should missing data be handled?
import pandas as pd
ufo = pd.read_csv('uforeports.csv')
ufo.tail()
ufo.isnull().tail()
ufo.notnull().tail()
ufo.dropna(how='any').shape
ufo.shape
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
ufo['Shape Reported'].value_counts(dropna=False)
# drop rows where any value is missing, then rows where all values are missing
ufo.dropna(how='any').shape
ufo.dropna(how='all').shape
# count values with missing ones included (dropna=False) and excluded (the default)
ufo['Shape Reported'].value_counts(dropna=False)
ufo['Shape Reported'].value_counts()
# fill missing values with a given value
ufo.fillna(value='VARIOUS', inplace=True)
What do I need to know about the pandas index?
What is worth knowing about the index?
import pandas as pd
drinks = pd.read_csv('drinksbycountry.csv')
drinks.columns
drinks.set_index("country", inplace=True)
drinks.index.name = None
drinks.head()
# drinks.reset_index(inplace=True)
drinks.columns
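One practical use of a meaningful index is label-based lookup with `.loc`. The sketch below assumes a two-row invented frame; the figures are not taken from the real file.

```python
import pandas as pd

# Hypothetical data for illustration.
drinks_demo = pd.DataFrame({
    "country": ["Albania", "Brazil"],
    "beer_servings": [89, 245],
})

# After set_index, country is the row label and no longer a column.
indexed = drinks_demo.set_index("country")

# Rows can now be selected by label.
brazil_beer = indexed.loc["Brazil", "beer_servings"]

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(list(indexed.columns), brazil_beer)
```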
How do I make my pandas DataFrame smaller and faster?
How can a pandas DataFrame be made smaller and faster?
import pandas as pd
drinks = pd.read_csv('drinksbycountry.csv')
drinks.info()
# inspect the memory the DataFrame occupies
drinks.info(memory_usage='deep')
drinks.memory_usage(deep=True)
# convert one column to the category type
drinks['country'] = drinks.country.astype('category')
drinks.memory_usage(deep=True)
# category saves memory and is faster when a column has few distinct values; with many distinct values it backfires
drinks["continent"] = drinks["continent"].astype("category")
drinks.memory_usage(deep=True)
drinks.continent.cat.codes.head()
# the category order can also be defined by hand
# (astype no longer accepts categories= and ordered= directly, so build a CategoricalDtype)
quality_type = pd.CategoricalDtype(categories=["good", "very good", "excellent"], ordered=True)
df = pd.DataFrame({"id": [80, 90, 93, 100], "quality": ["good", "very good", "very good", "excellent"]})
df["quality"] = df.quality.astype(quality_type)
df.quality
df.sort_values("quality")
df.loc[df.quality > "good", :]
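The claim that category helps only when there are few distinct values can be measured directly. This sketch builds two synthetic Series (both invented: one with three repeated values, one where every value is unique) and compares deep memory usage before and after conversion.

```python
import pandas as pd

# Few distinct values repeated many times (hypothetical data).
few_values = pd.Series(["Europe", "Africa", "Asia"] * 1000)

# Every value distinct (hypothetical data).
many_values = pd.Series([f"country_{i}" for i in range(3000)])

def deep_bytes(series):
    """Total bytes used by the Series, including the strings themselves."""
    return series.memory_usage(deep=True)

few_before = deep_bytes(few_values)
few_after = deep_bytes(few_values.astype("category"))

many_before = deep_bytes(many_values)
many_after = deep_bytes(many_values.astype("category"))

print(few_before, few_after, many_before, many_after)
```

A categorical stores each distinct value once plus one small integer code per row, so it shrinks the first Series and adds overhead to the second.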
How do I create dummy variables in pandas?
How are dummy variables created?
# method 1: map the values by hand
import pandas as pd
train = pd.read_csv('https://bit.ly/kaggletrain')
train.head()
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
train.head()
# method 2: call get_dummies to create the dummy variables
pd.get_dummies(train.Sex)
pd.get_dummies(train.Sex, prefix='Sex').iloc[:5, 1:]
train['Embarked'].value_counts()
pd.get_dummies(train['Embarked'], prefix='Embarked').head()
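# concatenate columns side by side (axis=1 is an argument of concat, not part of the list)
embarked_dummies = pd.concat([train['Ticket'], train[['Sex', 'Age']]], axis=1)
embarked_dummies.head()
# the listed columns are replaced by their dummy columns
# pass drop_first=True to drop the first dummy created for each column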
pd.get_dummies(train, columns=["Sex", "Embarked"]).head()
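The drop_first parameter mentioned in the comment is not exercised above, so here is a small sketch of it. `train_demo` is an invented three-row frame that reuses the `Sex` and `Embarked` column names.

```python
import pandas as pd

# Hypothetical rows for illustration.
train_demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One dummy column per category.
sex_dummies = pd.get_dummies(train_demo.Sex, prefix="Sex")

# k categories need only k-1 dummies; drop_first=True drops the first of each set.
reduced = pd.get_dummies(train_demo, columns=["Sex", "Embarked"], drop_first=True)
print(list(reduced.columns))
```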
How do I work with dates and times in pandas?
How are dates and times handled in pandas?
import pandas as pd
ufo = pd.read_csv('uforeports.csv')
ufo.head()
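# check the data type of each column
ufo.dtypes
# slice the hour out of the string
ufo.Time.str.slice(-5,-3).astype(int).head()
# convert to the datetime type
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo.head()
ufo.dtypes
# look up the day of the week (weekday_name was removed; day_name() replaces it)
ufo.Time.dt.day_name().head()
ts = pd.to_datetime('1/1/1999')
# filter by time
ufo.loc[ufo.Time>=ts, :].head()
# extract the year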
ufo['Year'] = ufo.Time.dt.year
ufo.head()
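Once a column is a real datetime, the `.dt` accessor replaces string slicing. This sketch assumes the timestamps look like "month/day/year hour:minute"; the three values are invented.

```python
import pandas as pd

# Hypothetical timestamps stored as text.
ufo_demo = pd.DataFrame({"Time": ["6/1/1930 22:00", "12/31/2000 23:00", "1/1/1999 0:00"]})

# An explicit format avoids any guessing about day/month order.
ufo_demo["Time"] = pd.to_datetime(ufo_demo["Time"], format="%m/%d/%Y %H:%M")

# Components are available through the .dt accessor.
hours = ufo_demo.Time.dt.hour.tolist()
weekdays = ufo_demo.Time.dt.day_name().tolist()

# Datetime columns can be compared with a Timestamp.
recent = ufo_demo.loc[ufo_demo.Time >= pd.Timestamp("1999-01-01"), :]
print(hours, len(recent))
```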
How do I find and remove duplicate rows in pandas?
How can duplicate data be removed?
import pandas as pd
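The notes stop at the import here, so the following is only a sketch of one common approach, using `duplicated()` and `drop_duplicates()`. The `users_demo` frame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly.
users_demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "zip_code": ["10001", "60601", "10001", "94105"],
})

# duplicated() marks a row True when an identical row appeared earlier.
dup_mask = users_demo.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence by default.
deduped = users_demo.drop_duplicates(keep="first")

# subset= limits the comparison to the listed columns.
unique_names = users_demo.drop_duplicates(subset=["name"])
print(n_duplicates, deduped.index.tolist())
```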
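How do I avoid a SettingWithCopyWarning in pandas?
How can the SettingWithCopyWarning be avoided?
# when a subset of a DataFrame will be modified with xx.loc[XX] = value, take a copy of the source first
# df.copy()
# when assigning to part of a column, do not chain two indexing steps such as df[XX][col] = value
# use a single df.loc[XX, col] = value instead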
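Both pieces of advice can be shown on a small invented frame. This is a sketch, not a full account of when the warning appears; `movies_demo` is hypothetical and reuses the `content_rating` column name.

```python
import pandas as pd

# Hypothetical data for illustration.
movies_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "content_rating": ["NOT RATED", "R", "NOT RATED"],
})

# Assign through one .loc call on the original: row mask and column label together.
movies_demo.loc[movies_demo.content_rating == "NOT RATED", "content_rating"] = "UNRATED"

# Take an explicit copy when the subset is meant to be modified independently.
top = movies_demo.loc[movies_demo.title != "C", :].copy()
top.loc[0, "title"] = "A (edited)"

print(movies_demo.content_rating.tolist(), top.title.tolist())
```

Because `top` is a copy, the edit to its first title does not reach `movies_demo`.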
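How do I change display options in pandas?
How can the pandas display settings be changed?
import pandas as pd
drinks = pd.read_csv('drinksbycountry.csv')
# read the maximum number of rows displayed
pd.get_option("display.max_rows")
# set the maximum number of rows displayed (None shows all rows)
pd.set_option("display.max_rows", None)
# reset the option to its default
pd.reset_option("display.max_rows")
pd.get_option("display.max_columns")
pd.get_option("display.max_colwidth")
# display two digits after the decimal point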
pd.set_option("display.precision", 2)
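If a setting is only needed for one output, `pd.option_context` applies it temporarily and restores the previous value afterwards. A minimal sketch:

```python
import pandas as pd

# Remember whatever the current setting is.
before = pd.get_option("display.max_rows")

# The option is changed only inside the with block.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

# On leaving the block the earlier value is back.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```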
How do I create a pandas DataFrame from another object?
How can a DataFrame be created by hand?
import pandas as pd
# create it from a dictionary
pd.DataFrame({'id':[101, 102, 103], 'color':['red', 'blue', 'white']})
# create it from a list of lists
pd.DataFrame([[100, "red"], [101, "blue"], [109, "black"]], columns=['id', 'color'])
How do I apply a function to a pandas Series or DataFrame?
How can a user-defined function be applied to a Series or DataFrame?
import pandas as pd
train = pd.read_csv('kaggletrain.csv')
train.head()
train['name_len'] = train['Name'].apply(len)
train[['Name', 'name_len']].head()
# a user-defined function
def get_element(my_list, position):
    return my_list[position]
train.Name.str.split(',').head()
train.Name.str.split(",").apply(get_element, position=0).head()
# the lambda below does the same thing
train.Name.str.split(",").apply(lambda x: x[0]).head()
drinks = pd.read_csv("drinksbycountry.csv")
drinks.head()
# largest value in each column
drinks.loc[:, "beer_servings": "wine_servings"].apply(max, axis=0)
# largest value in each row
drinks.loc[:, "beer_servings": "wine_servings"].apply(max, axis=1)
# which column holds the largest value in each row (idxmax returns the column label)
drinks.loc[:, "beer_servings": "wine_servings"].idxmax(axis=1)
# applymap applies a function to every cell of a DataFrame
# (pandas 2.1 renamed it to DataFrame.map)
drinks.loc[:, "beer_servings": "wine_servings"].applymap(float)
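How do I reorder columns?
How can columns be reordered and new columns added at the same time?
import numpy as np
data = {'b': [1, 2], 'a': [0, 3]}
df = pd.DataFrame(data)
sort_cols = ['a', 'b', 'c', 'd']
# reindex returns a new DataFrame, so assign the result back
df = df.reindex(columns=sort_cols, fill_value=np.nan)
df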