免疫组库数据分析||immunarch教程：GeneUsage分

作者: 周运来就是我 | 来源:发表于2020-08-26 08:30 被阅读0次

scRepertoire||单细胞免疫组库分析：R语言应用
免疫组库数据分析||immunarch教程：GeneUsage分
immunarch使用方法链接汇总
使用immunarch包进行单细胞免疫组库数据分析（四）：Bas
免疫组库数据分析||immunarch教程：Diversity
免疫组库数据分析||immunarch教程：Clonotype
免疫组库数据分析||immunarch教程：Clonotypes
免疫组库数据分析||immunarch教程：快速开始
2022-06-25 （4）过滤数据，找高频克隆型
免疫组库数据分析||immunarch教程：克隆型分析

immunarch — Fast and Seamless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires in R

前情提要：
10× Genomics单细胞免疫组库VDJ分析必知必会
 免疫组库数据分析||immunarch教程：克隆型分析
 免疫组库数据分析||immunarch教程：探索性数据分析
 免疫组库数据分析||immunarch教程：载入10X数据
 免疫组库数据分析||immunarch教程：快速开始

今天，我们继续我们的免疫组库数据分析的Demos，这一次我们来谈谈GeneUsage分析。像我这样刚入门免疫组库的人首先会问什么是GeneUsage？

我不由自主地拿出了周光炎老师的《免疫学原理》:

免疫组库的多样性是由VDJ基因重排带来的，这里的VDJ是区域名称，每个区域又有基因，虽然这些基因的名称带着明显的区域的印记。geneUsage指的就是这些基因的出现频率（次数）。在文章中它出现的形式是这样的：

Analysis of TRA variable (TRAV) and TRB variable (TRBV) gene segment usage among analyzed samples. a−d Analysis of (a, b) TRAV and (c, d) TRBV gene segment usage, including (a, c) heatmap representation of TRAV and TRBV gene usage across samples and (~) frequency of observed TRAV and TRBV gene usages among the samples $^{[1]}$

下面开始immunarch的表演。

ibrary(immunarch); data(immdata)       # Load the package and the test dataset
?geneUsage

immunarch提供了一个基因片段数据表，包含几个物种的已知基因片段数量，按照IMGT的命名法。调用gene_stats()函数可以得到基因的当前统计信息:

gene_stats()
    alias                 species ighd ighj ighv igij igkj igkv iglj iglv traj trav trbd trbj trbv trdd
1      bt               BosTaurus   21    4   25    0    1    6    5   26   46    0    0    0    0    5
2      cd      CamelusDromedarius    0    0    0    0    0    0    0    0    0    0    0    0    0    0
3     clf    CanisLupusFamiliaris    0    0    0    0    0    0    0    0    0    0    2    8   19    0
4      dr              DanioRerio    7    7    0    3    0    0    0    0    0    0    0    0    0    0
5      hs             HomoSapiens   30   13  248    0    5   64    7   69   57   60    3   14   64    3
6  macmul           MacacaMulatta   24    7   19    0    4   83    5    0    0    0    2   15   58    0
7     mmc    MusMusculusCastaneus    0    0    0    0    0    4    0    0    0    0    0    0    0    0
8     mmd   MusMusculusDomesticus    0    0    0    0    0    2    0    0    0    0    0    0    0    0
9  musmus             MusMusculus   32    8  225    0    8  109    3    5   42  145    2   14   23    2
10     oa OrnithorhynchusAnatinus    3   10    0    0    0    0    0    0    0    0    0    0    0    0
11     oc    OryctolagusCuniculus   10   11   39    0    8   26    2   20    0    0    0    0    0    0
12     om      OncorhynchusMykiss    9    7    6    0    0    0    0    0    0    0    1    9    0    0
13     rn        RattusNorvegicus   30    4  113    0    6  132    2    8    0    0    0    0    0    0
14   smth   MusMusculusMolossinus    0    0    0    0    0    1    0    0    0    0    0    0    0    0
15   smth     MusMusculusMusculus    0    0    0    0    0    1    0    0    0    0    0    0    0    0
16   smth              MusSpretus    0    0    0    0    0    2    0    2    0    0    0    0    0    0
17     ss               SusScrofa    5    5   15    0    8   19    4   14    0    0    0    0    0    0
   trdj trdv trgj trgv
1     3    0    6   15
2     0    7    2    2
3     0    0    7    8
4     0    0    0    0
5     4    6    4   10
6     0    0    0    0
7     0    0    0    0
8     0    0    0    0
9     3    7    0   11
10    0    0    0    0
11    0    0    0    0
12    0    0    0    0
13    0    0    0    0
14    0    0    0    0
15    0    0    0    0
16    0    0    0    0
17    0    0    0    0

我们来验证一下，打开IMGT 官网： http://www.imgt.org/找到http://www.imgt.org/IMGTrepertoire/LocusGenes/#H

点进去我们就看到：

HomoSapiens 的 trdd 确实是有3个。

为了计算基因的分布，immunarch包含了geneUsage函数。它接收到一个或多个免疫库（我们读进来的数据对象）作为输入，以及你想要得到统计数据的基因和物种。例如，如果你计划使用智人的TRBV基因，你需要使用hs.trbv，函数中的trbv字符串，其中hs来自别名列，trbv是基因名称。如果你计划使用Mus musmus_ighj基因，你需要使用musmus_ighj。

# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu <- geneUsage(immdata$data, "hs.trbv")
imm_gu

# A tibble: 48 x 13
   Names    `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192`   MS1   MS2   MS3   MS4   MS5   MS6
   <chr>        <int>     <int>     <int>     <int>     <int>     <int> <int> <int> <int> <int> <int> <int>
 1 TRBV10-1        24        28        NA        16        29         6    19     4    43    19    13    21
 2 TRBV10-2        42        60         8        29        28        16    25    35    63    39    22    43
 3 TRBV10-3       230       282       135       108       376       215   195   142   304   262   179   442
 4 TRBV11-1        21        14        26        17        17        16    14    10    16    32    14    20
 5 TRBV11-2       183       172       125       161        95       113    94   105   174   174   122    94
 6 TRBV11-3         8        11         5        24         2         7     4    13    13    13     3    32
 7 TRBV12-4       603       459       313       433       333       557   406   606   452   646   537   951
 8 TRBV12-5        37        54         8        38        18        17     7    17    25    32    16    24
 9 TRBV13          44        53        45        78        29        43    39    28    31    33    28    47
10 TRBV14          65        91        48        73        40        30    23    94    21    84    26     7
# ... with 38 more rows

imm_gu 不就是一个丰度表吗？行是基因，列是样本，然后我们可以走我们的那一套了，pca啦，聚类啦，巴拉巴拉。

immunarch 提供了方便的可视化函数，与ggplot2无缝衔接。

p1 <- vis(imm_gu[c(1, 2)])
 p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()

library(patchwork)
p1 + p2 +p3

当然这个函数的细节还是有一些需要注意的，就是你到底要计算什么。基因分布可以通过单个克隆型的计数来计算。quant = "count")或不使用它们(quant = NA)。要计算等位基因水平或家族水平分布，请更改.type参数。Parameter .norm控制immunarch是否将数据归一化，以确保所有频率的和是否等于1。

为了明确其计算的细节，我们用debug(geneUsage)来进入函数内部一探究竟。当然debug(geneUsage)是一个动态过程，我们仅演示主要结果。

debug(geneUsage)
imm_gu <- geneUsage(immdata$data, "hs.trbv")


Browse[3]> .gene
 [1] "TRBV10-1" "TRBV10-2" "TRBV10-3" "TRBV11-1" "TRBV11-2" "TRBV11-3" "TRBV12-3" "TRBV12-4" "TRBV12-5"
[10] "TRBV13"   "TRBV14"   "TRBV15"   "TRBV16"   "TRBV18"   "TRBV19"   "TRBV2"    "TRBV20-1" "TRBV24-1"
[19] "TRBV25-1" "TRBV27"   "TRBV28"   "TRBV29-1" "TRBV3-1"  "TRBV30"   "TRBV4-1"  "TRBV4-2"  "TRBV4-3" 
[28] "TRBV5-1"  "TRBV5-4"  "TRBV5-5"  "TRBV5-6"  "TRBV5-8"  "TRBV6-1"  "TRBV6-2"  "TRBV6-3"  "TRBV6-4" 
[37] "TRBV6-5"  "TRBV6-6"  "TRBV6-8"  "TRBV6-9"  "TRBV7-2"  "TRBV7-3"  "TRBV7-4"  "TRBV7-6"  "TRBV7-7" 
[46] "TRBV7-8"  "TRBV7-9"  "TRBV9"   
Browse[3]> 
debug: .data[[gene_col]] <- return_segments(.data[[gene_col]])
Browse[3]> 
debug: if (has_class(.data, "data.table")) {
    .data <- .data %>% lazy_dt()
}
Browse[3]> 
debug: dataset <- .data %>% select(Gene = gene_col, IMMCOL$count) %>% 
    group_by(Gene) %>% rename(Quant = IMMCOL$count)
Browse[3]> 
debug: if (is.na(.quant)) {
    dataset <- dataset %>% summarise(count = n())
} else {
    dataset <- dataset %>% summarise(count = sum(Quant))
}
Browse[3]> 
debug: dataset <- dataset %>% summarise(count = n())
Browse[3]> 
debug: dataset <- dataset %>% collect(n = Inf)
Browse[3]> dataset
# A tibble: 48 x 2
   Gene     count
   <chr>    <int>
 1 TRBV10-1    28
 2 TRBV10-2    60
 3 TRBV10-3   282
 4 TRBV11-1    14
 5 TRBV11-2   172
 6 TRBV11-3    11
 7 TRBV12-4   459
 8 TRBV12-5    54
 9 TRBV13      53
10 TRBV14      91
# ... with 38 more rows
Browse[3]> .data
# A tibble: 6,553 x 15
   Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
    <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>   <int>  <dbl>  <dbl>  <dbl>
 1    111    0.0131  TGCAGT~ CSASRG~ TRBV2~ TRBD1  TRBJ2~    11      12    17      25     -1      0      7
 2     93    0.0109  TGTGCC~ CASSVA~ TRBV9  TRBD1  TRBJ2~    15      21    23      24     -1      5      0
 3     66    0.00776 TGTGCC~ CASSRM~ TRBV13 TRBD1  TRBJ2~    11      18    24      32     -1      6      7
 4     59    0.00694 TGTGCC~ CASSPT~ TRBV6~ TRBD2  TRBJ2~    10      14    19      33     -1      3     13
 5     57    0.00671 TGCGCC~ CASSLD~ TRBV5~ TRBD2  TRBJ1~    15      17    20      26     -1      1      5
 6     47    0.00553 TGTGCC~ CASRGL~ TRBV6~ TRBD2  TRBJ2~    10      11    16      20     -1      0      3
 7     46    0.00541 TGCAGC~ CSVTGV~ TRBV2~ TRBD1  TRBJ2~     8       9    13      20     -1      0      6
 8     30    0.00353 TGTGCC~ CASSYL~ TRBV6~ TRBD2  TRBJ1~    15      17    19      20     -1      1      0
 9     29    0.00341 TGTGCC~ CASSLA~ TRBV5~ TRBD1  TRBJ1~    15      21    26      32     -1      5      5
10     29    0.00341 TGTGCC~ CASSYI~ TRBV6~ TRBD1  TRBJ1~    14      17    20      25     -1      2      4
# ... with 6,543 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> gene_col
[1] "V.name"
Browse[3]> IMMCOL$count
[1] "Clones"
Browse[3]> dataset
# A tibble: 48 x 2
   Gene     count
   <chr>    <int>
 1 TRBV10-1    28
 2 TRBV10-2    60
 3 TRBV10-3   282
 4 TRBV11-1    14
 5 TRBV11-2   172
 6 TRBV11-3    11
 7 TRBV12-4   459
 8 TRBV12-5    54
 9 TRBV13      53
10 TRBV14      91
# ... with 38 more rows
Browse[3]> .data$V.name[1]
[1] "TRBV20-1"
Browse[3]> dataset %>% filter(Gene == 'TRBV20-1')
# A tibble: 1 x 2
  Gene     count
  <chr>    <int>
1 TRBV20-1   570
Browse[3]> .data  %>% filter(V.name == 'TRBV20-1')
# A tibble: 570 x 15
   Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
    <dbl>      <dbl> <chr>   <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>   <int>  <dbl>  <dbl>  <dbl>
 1    111   0.0131   TGCAGT~ CSASRG~ TRBV2~ TRBD1  TRBJ2~    11      12    17      25     -1      0      7
 2     11   0.00129  TGCAGT~ CSATGL~ TRBV2~ TRBD2  TRBJ2~     9      12    18      23     -1      2      4
 3      7   0.000824 TGCAGT~ CSARPG~ TRBV2~ TRBD1  TRBJ1~    11      14    19      24     -1      2      4
 4      6   0.000706 TGCAGT~ CSARGG~ TRBV2~ TRBD2  TRBJ2~    12      15    23      24     -1      2      0
 5      5   0.000588 TGCAGT~ CSARDG~ TRBV2~ TRBD2  TRBJ1~    13      16    20      21     -1      2      0
 6      5   0.000588 TGCAGT~ CSGASG~ TRBV2~ TRBD1  TRBJ2~     6       7    10      23     -1      0     12
 7      4   0.000471 TGCAGT~ CSVGGG~ TRBV2~ TRBD2  TRBJ2~     6      14    20      22     -1      7      1
 8      4   0.000471 TGCAGT~ CSARDR~ TRBV2~ TRBD1  TRBJ1~    13      14    18      19     -1      0      0
 9      4   0.000471 TGCAGT~ CSASQF~ TRBV2~ TRBD1  TRBJ2~    10      12    14      17     -1      1      2
10      4   0.000471 TGCAGT~ CSAGDR~ TRBV2~ TRBD1  TRBJ1~     7      12    17      21     -1      4      3
# ... with 560 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> .data  %>% filter(V.name == 'TRBV20-1') %>% dim()
[1] 570  15
Browse[3]> .data  %>% filter(V.name == 'TRBV10-1') %>% dim()
[1] 28 15
Browse[3]> .data  %>% filter(V.name == 'TRBV10-1') %>% DT::datatable()



undebug(geneUsage)

大家看到这个28 的来源了吗？你可以从不同的角度来观察基因使用的直方图:

imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)

imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T）
vis(imm_gu, .grid = T)

如果样本有分组信息，还可以做组间的差异统计：

imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T
vis(imm_gu, .by = "Status", .meta = immdata$meta)

vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")

由于一些克隆型的基因序列不明确，geneUsage有以下选项来处理不明确的数据，也许这些不明确的基因才是潜藏在数据中的瑰宝啊。

.ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats. It is ON by default, we recommend it to leave it that way.
.ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.
.ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).
.ambig = "maj" - chooses only the first gene segment.

作为福利，我们在文章中经常看到这样的V-J基因的圈图，其实也是一种桑基图啦：

咋做？

做一个不成熟的桑基图：

library(ggforce)
immdata$data$`A2-i129` %>%
    gather_set_data(c(6,7,5)) %>%
    ggplot(aes(x, id = id, split = y, value = 1))  +
    geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
    geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
    geom_parallel_sets_labels(angle = 0) +
    theme_no_axes()

VDJ之间的流向：

然后是一个不成熟的圈图：

library(sankeywheel)
sankeywheel(
    from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
    weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
    title = " circus plots",
    subtitle = "j_gene &  v_gene",
    seriesName = "", width = "100%", height = "600px"
)

关键还是要理解这里面的数据结构和算法及其背后的生物学意义啊。咦，我们为什么要做geneUsage？

因为geneUsage也是一种异质性啊，主要是可以找到哪些geneUsage较高或者受到抑制，这样我们可以对症下药啊。一如文章所言：

CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.

[1] Farmanbar, A., Kneller, R. & Firouzi, S. RNA sequencing identifies clonal structure of T-cell repertoires in patients with adult T-cell leukemia/lymphoma. npj Genom. Med. 4, 10 (2019). https://doi.org/10.1038/s41525-019-0084-9

https://immunarch.com/articles/web_only/v5_gene_usage.html

scRepertoire||单细胞免疫组库分析：R语言应用
前情回顾 10× Genomics单细胞免疫组库VDJ分析必知必会免疫组库数据分析||immunarch教程：快速...
免疫组库数据分析||immunarch教程：GeneUsage分
immunarch — Fast and Seamless Exploration of Single-cell ...
immunarch使用方法链接汇总
01免疫组库数据分析||immunarch教程：快速开始[https://www.jianshu.com/p/9d...
使用immunarch包进行单细胞免疫组库数据分析（四）：Bas
基本分析在本节中，我们将使用immunarch包中的基本函数对免疫组库数据进行常规分析。常用函数如下： rep...
免疫组库数据分析||immunarch教程：Diversity
immunarch — Fast and Seamless Exploration of Single-cell ...
免疫组库数据分析||immunarch教程：Clonotype
immunarch — Fast and Seamless Exploration of Single-cell ...
免疫组库数据分析||immunarch教程：Clonotypes
immunarch — Fast and Seamless Exploration of Single-cell ...
免疫组库数据分析||immunarch教程：快速开始
immunarch是一个R包，旨在成为分析T细胞受体(TCR)和B细胞受体(BCR)的工具箱，主要针对医学科学家和...
2022-06-25 （4）过滤数据，找高频克隆型
下周工作安排：1、拿到健康人的免疫组库测序数据，要能与新冠病人的进行对比，并可用于immunarch包的分析。2、...
免疫组库数据分析||immunarch教程：克隆型分析
immunarch — Fast and Seamless Exploration of Single-cell ...