Learning Python: Odds and Ends (Part 2)

Author: GPZ_Lab | Published: 2020-02-21 00:09


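...I had been piling notes onto "Learning Python: Odds and Ends" one after another, but around item 62 I wrote something (no idea what) that got the article blocked once. I don't dare touch it any more, so I'm starting a new post.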

When running pip install XXX, I keep getting ReadTimeoutError: HTTPSConnectionPool(host='....', port=443): Read timed out.
See https://github.com/pypa/warehouse/issues/3826
Try raising the timeout: pip install --default-timeout=1000 package_name
Missing dates: NaT
df.apply(lambda x:pd.to_datetime(x, errors='coerce'))
Converts string dates in df (e.g. 2020/1/26) to datetime; missing or unparseable values become NaT.
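A minimal sketch of what errors='coerce' does, assuming pandas is installed. The DataFrame and its column name visit are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one valid date, one missing value, one bad string.
df = pd.DataFrame({"visit": ["2020/1/26", None, "not a date"]})

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising.
parsed = df.apply(lambda col: pd.to_datetime(col, errors="coerce"))

# isna() treats NaT as missing, so the bad rows are easy to find afterwards.
missing_mask = parsed["visit"].isna().tolist()
print(parsed)
print(missing_mask)
```

Without errors='coerce', the "not a date" row would raise an exception rather than quietly becoming NaT, so it is worth checking isna() afterwards.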
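Converting a 13-digit Unix timestamp to a human-readable date
Unix time is also known as Epoch time or POSIX time.
It counts the seconds elapsed since January 1, 1970 (UTC). A 13-digit value counts milliseconds since 1970-01-01 instead, so divide it by 1000 first.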

from datetime import datetime
dt_object = datetime.fromtimestamp(1581162409463/1000)
print(dt_object.strftime("%Y-%m-%d %H:%M:%S")) 
print(dt_object)

# Output (local time; the machine here was on UTC+8):
2020-02-08 19:46:49
2020-02-08 19:46:49.463000
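One thing to watch: fromtimestamp() without a tz argument returns local time, so the printed hour depends on the machine's time zone. A small sketch of the same conversion pinned to UTC, using only the standard library:

```python
from datetime import datetime, timezone

ms = 1581162409463  # the 13-digit millisecond timestamp from the note above

# Passing tz makes the result independent of the machine's local time zone.
dt_utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
utc_text = dt_utc.strftime("%Y-%m-%d %H:%M:%S")
print(utc_text)  # 2020-02-08 11:46:49
```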

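plotly: drawing with make_subplots(), adjusting the gap between two subplots, their width ratio, and sharing the Y axis:

from plotly.subplots import make_subplots
make_subplots(rows=1,cols=2, # two plots side by side
              column_widths=[0.2,0.8],  # one takes 20% of the width, the other 80%
              shared_yaxes=True,  # share the Y axis
              horizontal_spacing=0.01)  # shrink the gap between the two plots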

67. plotly color reference:
Discrete colors:
https://plot.ly/python/discrete-color/
Built-in colors:
https://plot.ly/python/builtin-colorscales/#discrete-color-sequences
A great color palette website (probably meant for designers), as shown below:

os
Rename a file: os.rename('old','new')
Delete a file: os.system('rm XXX') does the job on Unix-like systems; os.remove('XXX') is the portable equivalent.
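Checking MD5 checksums in Python

import hashlib
def file_as_bytes(file):
    # read the whole file and close it when done
    with file:
        return file.read()
test = ['XXXXXXXXXXXXXXX.fastq',
        'XXXXXXXXXXXXXX.fastq']
# hexdigest() returns the hex string that tools such as md5sum print
[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).hexdigest()) for fname in test]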
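fastq files can be large, and reading one entirely into memory just to hash it may hurt. A rough sketch of a chunked variant follows; the helper name md5_of_file, the chunk size, and the throwaway demo file are all my own assumptions, not part of the note above.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file so the example runs anywhere.
content = b"@read1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".fastq") as tmp:
    tmp.write(content)
demo_path = tmp.name

checksum = md5_of_file(demo_path)
expected = hashlib.md5(content).hexdigest()
print(checksum == expected)  # True
os.remove(demo_path)
```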

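Bytes vs. unicode characters in Python 2 and Python 3
decode: bytes ---> unicode characters
encode: unicode characters ---> bytes
python2:

str: 8-bit values (bytes), unicode: unicode characters
The default encoding is ASCII
with open('XXX.bin','r') as ... reads raw bytes (str) by default
python3:
bytes: 8-bit values (bytes), str: unicode characters
bytes and str are completely different types; even their empty values do not compare equal (b'' != '').
with open('XXX.bin','r') as ... opens in text mode and decodes with the default encoding (usually utf-8), so to read a binary file in Python 3 you need to set mode to 'rb'.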
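A quick Python 3 round trip to make the encode/decode direction concrete. The sample string is arbitrary.

```python
text = "日期"                     # str: unicode characters
raw = text.encode("utf-8")       # encode: str -> bytes

print(type(raw).__name__, len(text), len(raw))  # bytes 2 6
round_trip = raw.decode("utf-8")                # decode: bytes -> str
print(round_trip == text)                       # True

empty_equal = (b"" == "")        # bytes and str never compare equal
print(empty_equal)                              # False
```

The two characters take six bytes because each one needs three bytes in UTF-8, which is why len() differs between the str and the bytes.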

71. get()
See https://stackoverflow.com/questions/2068349/understanding-get-method-in-python

t = {'a':1,'b':2,'c':3}
t['e'] # raises KeyError
t.get('e',None) # if 'e' is not among the keys, returns the default (None)
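Where get() earns its keep is with a non-None default. A small sketch, including the common counting idiom (the word "banana" is just sample input):

```python
t = {'a': 1, 'b': 2, 'c': 3}

present = t.get('a', 0)   # key exists, default is ignored
absent = t.get('e', 0)    # key missing, default is returned
print(present, absent)    # 1 0

# Counting without a KeyError on the first occurrence of each letter.
counts = {}
for letter in "banana":
    counts[letter] = counts.get(letter, 0) + 1
print(counts)  # {'b': 1, 'a': 3, 'n': 2}
```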

The second argument of enumerate()

a = ['a','b','c']
for ind, i in enumerate(a,2):
  print(ind, i)

# 2 a
# 3 b
# 4 c

zip() loop
for ai, bi in zip(a,b):
In Python 3, zip() returns a lazy iterator. In Python 2 it returns a list of all the tuples it creates, so iterating over a pair of very large lists costs a lot of memory. If you need zip in Python 2, look at izip (itertools) instead.
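Because the Python 3 zip object is lazy, it can only be consumed once. A short sketch of that behavior, plus the fact that zip stops at the shortest input:

```python
a = [1, 2, 3]
b = ['x', 'y', 'z']

pairs = zip(a, b)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left the second time
print(first_pass)   # [(1, 'x'), (2, 'y'), (3, 'z')]
print(second_pass)  # []

# zip stops as soon as the shortest input runs out.
short = list(zip(a, ['x']))
print(short)        # [(1, 'x')]
```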
try
See https://www.thegeekstuff.com/2019/05/python-try-except-examples/

a = 12
b = 'test'
try:
  print(a+b)  # raises TypeError
except TypeError: # if the code under try raises TypeError, then:
  print(str(a)+b)

# 12test

list.sort(key=)
pd.read_excel()
Read every sheet in an Excel file
pd.read_excel('XXX.xlsx', sheet_name=None)
Returns a dictionary: the keys are sheet names and the values are the DataFrames read from each sheet
Apply log10 to every element of a DataFrame
df.applymap(math.log10) (import math first)
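As an alternative to applymap, NumPy ufuncs can be applied to a numeric DataFrame directly. This is a different technique from the one in the note, sketched here on a made-up DataFrame and assuming numpy and pandas are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric DataFrame.
df = pd.DataFrame({"x": [1.0, 10.0], "y": [100.0, 1000.0]})

# np.log10 works element-wise and returns a DataFrame with the same labels.
logged = np.log10(df)
print(logged)
```

This only works when every column is numeric; with mixed dtypes, select the numeric columns first.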
Functions had better not return None
If the return value ends up in an if/else, None behaves the same as 0, an empty list, and other falsy values, which easily causes bugs.
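A tiny sketch of the kind of bug meant here. The helper find_index is hypothetical; it returns None for "not found", and a valid result of 0 then gets mistaken for a miss.

```python
def find_index(items, target):
    """Return the position of target, or None if absent (hypothetical helper)."""
    for position, item in enumerate(items):
        if item == target:
            return position
    return None

result = find_index(['a', 'b'], 'a')   # found at index 0

buggy_found = bool(result)             # 0 is falsy, so this looks like "not found"
correct_found = result is not None     # an explicit None check gets it right
print(buggy_found, correct_found)      # False True
```

If a function must signal "nothing found", either compare with `is None` explicitly at every call site or raise an exception instead.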
raise

def divide(a,b):
  try:
    return a/b
  except ZeroDivisionError as e:
    raise ValueError('what?') from e

divide(5,0)

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-9-b8e948d46537> in divide(a, b)
      2     try:
----> 3         return a/b
      4     except ZeroDivisionError as e:

ZeroDivisionError: division by zero

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-10-0a52e2eb64c8> in <module>
----> 1 divide(5,0)

<ipython-input-9-b8e948d46537> in divide(a, b)
      3         return a/b
      4     except ZeroDivisionError as e:
----> 5         raise ValueError('what?') from e
      6       
      7

ValueError: what?
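The "direct cause" line in the traceback comes from the `from e` part, which stores the original exception on `__cause__`. A short sketch that catches the ValueError and inspects it:

```python
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError as e:
        # "from e" links the new exception to the original one.
        raise ValueError('what?') from e

try:
    divide(5, 0)
except ValueError as err:
    message = str(err)
    cause = err.__cause__   # the ZeroDivisionError that triggered it
    print(message)                      # what?
    print(type(cause).__name__, cause)  # ZeroDivisionError division by zero
```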

list.sort(key=)
key takes a function that is called on each element of the list before sorting
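Two small examples of key functions, with an arbitrary word list as input:

```python
words = ['banana', 'Apple', 'fig', 'cherry']

# sorted() returns a new list; ties keep their original order (stable sort).
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'Apple', 'banana', 'cherry']

# list.sort() sorts in place; str.lower makes the comparison case-insensitive.
words.sort(key=str.lower)
print(words)      # ['Apple', 'banana', 'cherry', 'fig']
```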
Sorting a list of tuples
Tuples are ordered by their first element, then by the second, and so on

test = [(1,2),(1,19),(1,3),(1,4),(0,3),(0,9),(0,10)]
test.sort()
test
[(0, 3), (0, 9), (0, 10), (1, 2), (1, 3), (1, 4), (1, 19)]

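So if you have a list to sort but a handful of special items must go first, turn each special item into (0,x) and everything else into (1,x), using a key function as in item 80 (list.sort(key=)). Sorting then puts the special items in front.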
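A minimal sketch of that trick. The names list and the special set are invented for the example.

```python
names = ['delta', 'alpha', 'omega', 'beta', 'gamma']
special = {'omega', 'gamma'}   # hypothetical items that must come first

# The key returns (0, x) for special items and (1, x) for everything else,
# so tuple comparison puts the special group first, each group sorted by x.
names.sort(key=lambda x: (0 if x in special else 1, x))
print(names)  # ['gamma', 'omega', 'alpha', 'beta', 'delta']
```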

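Passing input to a subprocess
For subprocess basics, see item 43 of Odds and Ends (Part 1)

import subprocess
p = subprocess.Popen('XXXX',shell=True,stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,  # capture stdout too, otherwise communicate() returns None for it
                     stderr=subprocess.PIPE,
                     universal_newlines=True)  # text mode, so input can be a str in Python 3
stdout,stderr = p.communicate(input='XXX\nXXXX\nXXXXX')
# separate multiple inputs with \n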
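For a version that can actually be run, here is a sketch using subprocess.run instead of Popen. The child program is a made-up one-off Python script that upper-cases each line it reads from stdin.

```python
import subprocess
import sys

# Hypothetical child: a tiny Python program that echoes each stdin line in upper case.
child_code = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="first\nsecond\nthird\n",   # several inputs separated by \n
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,          # text mode: str in, str out
)
lines = result.stdout.splitlines()
print(lines)  # ['FIRST', 'SECOND', 'THIRD']
```

Passing the command as a list avoids shell=True, which is usually the safer choice when the command does not need shell features.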

index name

tqdm
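In jupyter notebook/lab, this import works best for tqdm (older tqdm releases spelled it from tqdm import tqdm_notebook as tqdm, which is now deprecated)

from tqdm.notebook import tqdm
for i in tqdm([1,2,3,4]):
  ...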

Extract a single index label as a plain value (e.g. a string) rather than an Index object
df.loc[df['col']==i,:].index.tolist()[0] # only one element matches here
Selecting the DataFrame columns of a given dtype
First check which dtypes are present:
df.dtypes.value_counts()
Then select:
df.select_dtypes(include=['XX','XXX'])
Missing value imputation

from sklearn.impute import SimpleImputer,KNNImputer
# fill numeric values with KNN
imputer_n = KNNImputer(n_neighbors=2,weights='uniform')
imputer_n.fit_transform(df)

# fill categorical values with the most frequent value
imputer_c = SimpleImputer(strategy='most_frequent')
imputer_c.fit_transform(df)

List comprehension with else
["Even" if i%2==0 else "Odd" for i in range(10)]
melt with a multi-index (wide --> long)

df
   ID gp     value gp2
0   1  a  0.708910  a1
1   2  a  0.273727  a1
2   3  a  0.161171  a2
3   4  a  0.920273  a2
4   5  b  0.147851  b1
5   6  b  0.957274  b1
6   7  b  0.421100  b2
7   8  b  0.807547  b2

df_mean = df.loc[:,['gp','value','gp2']].groupby(['gp','gp2']).mean()
df_mean
           value
gp gp2
a  a1   0.491318
   a2   0.540722
b  b1   0.552562
   b2   0.614323

pd.melt(df_mean.reset_index(),id_vars = ['gp','gp2'])
  gp gp2 variable     value
0  a  a1    value  0.491318
1  a  a2    value  0.540722
2  b  b1    value  0.552562
3  b  b2    value  0.614323

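90. Opening Excel files with pandas
In my Python 3 environment the file would not open even with openpyxl installed
The engine has to be passed explicitly: pd.read_excel(XXXX, sheet_name= 'XXX', engine='openpyxl')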