Converting Gene Expression Data to Categories with Clustering, in One Article

Author: 柳叶刀与小鼠标 | Published: 2019-12-02 00:52

Problem 1: I have a gene expression matrix in which the rows are samples and the columns are genes.



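I want to convert each gene's expression values into categorical data. For example, suppose gene A's expression ranges from 1 to 100 across all samples, and cluster analysis shows that most samples express gene A at around 30. We can therefore use K-means to turn the expression matrix into two labels per gene: "around 30" and "not around 30". With two clusters, the boundary between the two labels is the midpoint between the two cluster centers.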
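Before the full script, the idea can be illustrated on a single toy column. This is a minimal sketch, not the author's code: the six expression values and the name `geneA` are made up, `n_init` and `random_state` are set only to make the run repeatable, and the lowest bin edge is the column minimum rather than the fixed 0 used in the script.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up expression values for one hypothetical gene (not from 5year.csv)
expr = pd.Series([1.0, 2.0, 3.0, 29.0, 30.0, 31.0], name="geneA")

# Fit K-means with two clusters on the single column (needs a 2-D array)
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(expr.to_numpy().reshape(-1, 1))

# The cut point is the midpoint between the two cluster centers
low_center, high_center = sorted(model.cluster_centers_.ravel())
cut_point = (low_center + high_center) / 2

# Samples at or below the cut point get the first label, the rest the second
labels = pd.cut(
    expr,
    bins=[expr.min(), cut_point, expr.max()],
    labels=["geneA0", "geneA1"],
    include_lowest=True,
)
print(cut_point)
print(labels.tolist())
```

With these values the two centers sit at 2 and 30, so the cut point falls at 16: the first three samples get `geneA0` and the last three get `geneA1`.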

Today's approach uses Python:

# -*- coding: utf-8 -*-
"""
Created on Mon Dec  2 00:32:59 2019

@author: czh
"""


# In[*]
%reset -f
%clear
# In[*]
import pandas as pd
from sklearn.cluster import KMeans  # import the K-means clustering algorithm
import os
os.chdir("D:\\train\\diff")

# In[*]
data = pd.read_csv("5year.csv",header=0,index_col=0)

d = data.iloc[:,1:78]
# In[*]
data.head()

# In[*]
d.head()


# In[*]
d.columns = d.columns.map(lambda x :str(x))

d.columns = d.columns+ "gene_exp"

d.columns
# In[*]
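def f(x):
    # Cluster the values of column x into two groups with K-means
    model = KMeans(n_clusters=2)
    # as_matrix() was removed in pandas 1.0; to_numpy() returns the same 2-D array
    model.fit(d[[x]].to_numpy())

    # Sort the two cluster centers; the midpoint between them is the cut point
    centers_d = pd.DataFrame(model.cluster_centers_).sort_values(by = 0)
    group = [0] + list(centers_d.rolling(2).mean().iloc[1:][0]) + [d[x].max()]
    # include_lowest=True keeps samples whose expression is exactly 0 in the first bin
    s = pd.cut(d[x], group, labels = [ x + str(i) for i in range(2)], include_lowest=True)

    return s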
    

# In[*]
# Discretize every gene column and join the results side by side
aprioriData = pd.concat([f(col) for col in d.columns], axis=1)

# In[*]
discretization_d =  pd.concat([aprioriData,data['Class']],axis =1)


# In[*]
discretization_d.head()

The original data and the renamed feature matrix, for reference:

data.head()
Out[50]: 
      Class    ADGRA2    ANGPTL2  ...     TPST1    TSC22D3     VSTM4
id                                ...                               
AA80      0  0.776205   3.942062  ...  7.347908  10.512511  0.209625
A9TC      0  2.857827   3.229691  ...  2.324581   7.113074  0.485731
A5W6      0  1.161271   5.802349  ...  7.360124  21.058854  0.629902
A6DX      0  1.465745   7.821838  ...  6.256095  29.304477  1.290819
A8HH      0  9.702574  18.361627  ...  6.382861  29.900405  1.442875

[5 rows x 78 columns]
d.head()
Out[51]: 
      ADGRA2gene_exp  ANGPTL2gene_exp  ...  TSC22D3gene_exp  VSTM4gene_exp
id                                     ...                                
AA80        0.776205         3.942062  ...        10.512511       0.209625
A9TC        2.857827         3.229691  ...         7.113074       0.485731
A5W6        1.161271         5.802349  ...        21.058854       0.629902
A6DX        1.465745         7.821838  ...        29.304477       1.290819
A8HH        9.702574        18.361627  ...        29.900405       1.442875

[5 rows x 77 columns]
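The least obvious step in `f` is the expression `centers_d.rolling(2).mean().iloc[1:][0]`, which turns the sorted cluster centers into bin boundaries by averaging each adjacent pair. The sketch below runs that same expression in isolation. The three center values are made up for illustration (the script itself uses two clusters), and they are deliberately unsorted to show why the `sort_values` call is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster centers in the (n_clusters, 1) shape of cluster_centers_
centers = np.array([[30.0], [2.0], [80.0]])

# Sort the centers, then average each adjacent pair
centers_d = pd.DataFrame(centers).sort_values(by=0)
midpoints = list(centers_d.rolling(2).mean().iloc[1:][0])
print(midpoints)
```

The first row of the rolling mean has no predecessor and is NaN, which is why `iloc[1:]` drops it. For k centers this leaves k - 1 boundaries, so raising `n_clusters` in `f` would also require changing `range(2)` in the label list to match.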

Article link: https://www.haomeiwen.com/subject/iztxgctx.html