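This post is updated from time to time with tips I use often in data analysis. They are given as code snippets and mainly cover the common Python data analysis libraries numpy, pandas, and matplotlib.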
1. Convert date/time fields in the raw data to pandas.Timestamp
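import pandas as pd
from datetime import datetime
def convert_to_datetime(s):  # s: a Series of strings such as '04/18/12 13:45:00'
    return pd.to_datetime(s.apply(lambda x: datetime.strptime(x, '%m/%d/%y %H:%M:%S')))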
Common date format specifiers:
- %Y Four-digit year
- %y Two-digit year
- %m Two-digit month [01, 12]
- %d Two-digit day [01, 31]
- %H Hour (24-hour clock) [00, 23]
- %I Hour (12-hour clock) [01, 12]
- %M Two-digit minute [00, 59]
- %S Second [00, 61] (seconds 60, 61 account for leap seconds)
- %w Weekday as integer [0 (Sunday), 6]
- %U Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0”
- %W Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0”
- %z UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
- %F Shortcut for %Y-%m-%d (e.g., 2012-04-18)
- %D Shortcut for %m/%d/%y (e.g., 04/18/12)
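As a possible alternative to calling datetime.strptime row by row, pandas.to_datetime accepts the same specifiers through its format parameter and parses the whole column in one call. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration only.
raw = pd.Series(['04/18/12 13:45:00', '12/31/99 23:59:59'])

# format= uses the specifiers listed above and parses the column in one call.
parsed = pd.to_datetime(raw, format='%m/%d/%y %H:%M:%S')

# errors='coerce' turns strings that do not match the format into NaT
# instead of raising an exception.
with_bad = pd.to_datetime(
    pd.Series(['04/18/12 13:45:00', 'not a date']),
    format='%m/%d/%y %H:%M:%S',
    errors='coerce',
)

print(parsed)
print(with_bad)
```

Note that with %y, two-digit years 69 to 99 are read as 1969 to 1999 and 00 to 68 as 2000 to 2068, so %Y is the safer choice when the source data has four-digit years.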
2. Convert categorical data to the category dtype
raw_data['card_type'] = raw_data['card_type'].astype('category')
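To see what the conversion does, the sketch below builds a small hypothetical raw_data frame (the card_type values are invented, not from the original data), compares memory use before and after, and inspects the resulting categories and integer codes.

```python
import pandas as pd

# Hypothetical data: a column with only a few distinct labels, repeated many times.
raw_data = pd.DataFrame({'card_type': ['debit', 'credit', 'debit', 'prepaid'] * 1000})

# Memory used by the column while it still holds plain strings.
before = raw_data['card_type'].memory_usage(deep=True)

raw_data['card_type'] = raw_data['card_type'].astype('category')

# Memory used after conversion: one small integer code per row plus the label table.
after = raw_data['card_type'].memory_usage(deep=True)

categories = list(raw_data['card_type'].cat.categories)  # unique labels, sorted
codes = raw_data['card_type'].cat.codes                   # integer code for each row

print(before, after)
print(categories)
```

The saving is largest when a column has few distinct values relative to its length; for columns where almost every value is unique, converting to category may not help.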