1.8字符串处理
- 使用方法:先获取Series的str属性,然后在属性上调用函数;
- 只能在字符串列上使用,不能数字列上使用;
- Dataframe上没有str属性和处理方法
- Series.str并不是Python原生字符串,而是自己的一套方法,不过大部分和原生str很相似;
1.8.1 startwith开头
查询字符串以2018-03
开头的,等同于查询月份数据
condition = df["ymd"].str.startswith("2018-03")
1.8.2 链式调用清洗字符串
将字符串2018-01-02
转换成为201801
df["ymd"].str.replace("-", "").str[0:6]
1.8.3 正则表达式
contains
- 找出有晴天的所有数据
df[df['tianqi'].str.contains("晴")]
re表达式
- 提取所有数字
df['ymd'] = df['ymd'].str.replace(re.compile("-"), "")
image.png
1.10 Merge Syntax
Merge相当于sql的Join语法,将不同的Key关联到一个表
- Table 1:
df_ratings = pd.read_csv(
"./datas/movielens-1m/ratings.dat",
sep="::",
engine='python',
names="UserID::MovieID::Rating::Timestamp".split("::")
)
image.png
- Table 2:
df_users = pd.read_csv(
"./datas/movielens-1m/users.dat",
sep="::",
engine='python',
names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)
image.png
- Table 3:
df_movies = pd.read_csv(
"./datas/movielens-1m/movies.dat",
sep="::",
engine='python',
names="MovieID::Title::Genres".split("::")
)
image.png
1.10.1 inner join
- All entries from the left side will appear in the result, and if there's no match from the right side, it will be shown as Null.
df_ratings_users = pd.merge(
df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)
- equal to sql
select * from df_ratings a inner join df_users b on a.UserID= b.UserID
image.png
1.10.2 right join
- All entries from the left side will appear in the result, and if there's no match from the right side, it will be shown as Null
df_ratings_users = pd.merge(
df_ratings, df_users, left_on="UserID", right_on="UserID", how="left "
)
- equal to sql
select * from df_ratings a left join df_users b on a.UserID= b.UserID
1.11 Concat
concat
is used to combine excel with the same format
-
table 1
image.png -
table 2
image.png
1.11.1 use pandas.concat to combine data
- use default param
pd.concat([table1,table2])
image.png
image.png
- user
ignore_index =True
to neglect the primary index
pd.concat([df1,df2], ignore_index=True)
image.png
1.12 Batch merge and split excel
1.12.1 split to multiple equal excel
work_dir="./course_datas/c15_excel_split_merge"
splits_dir=f"{work_dir}/splits"
import os
if not os.path.exists(splits_dir):
os.mkdir(splits_dir)
网友评论