有人问miRNA靶基因如何预测的问题,鉴于自己也做过相关的分析,稍有点经验,总结记录一下。
一、miranda预测
- 软件安装
$ wget http://cbio.mskcc.org/microrna_data/miRanda-aug2010.tar.gz
$ tar -xvf miRanda-aug2010.tar.gz
$ cd miRanda-3.3a/
$ ./configure --prefix=/Lustre01/user/software/miRanda-3.3a/
$ make && make install
$ cd ***/miRanda-3.3a/bin #你自己的软件路径
$ ./miranda --help #产看帮助信息和参数等等,具体如下
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
miranda v3.3a microRNA Target Scanning Algorithm
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
(c) 2003 Memorial Sloan-Kettering Cancer Center, New York
Authors: Anton Enright, Bino John, Chris Sander and Debora Marks
(mirnatargets (at) cbio.mskcc.org - reaches all authors)
Software written by: Anton Enright
Distributed for anyone to use under the GNU Public License (GPL),
See the files 'COPYING' and 'LICENSE' for details
If you use this software please cite:
Enright AJ, John B, Gaul U, Tuschl T, Sander C and Marks DS;
(2003) Genome Biology; 5(1):R1.
miranda comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions; type `miranda --license' for details.
miRanda is an miRNA target scanner which aims to predict mRNA
targets for microRNAs using dynamic-programming alignment and
thermodynamics.
Usage: miranda file1 file2 [options..]
Where:
'file1' is a FASTA file with a microRNA query
'file2' is a FASTA file containing the sequence(s)
to be scanned.
OPTIONS
--help -h Display this message
--version -v Display version information
--license Display license information
Core algorithm parameters:
-sc S Set score threshold to S [DEFAULT: 140.0]
-en -E Set energy threshold to -E kcal/mol [DEFAULT: 1.0]
-scale Z Set scaling parameter to Z [DEFAULT: 4.0]
-strict Demand strict 5' seed pairing [DEFAULT: off]
Alignment parameters:
-go -X Set gap-open penalty to -X [DEFAULT: -4.0]
-ge -Y Set gap-extend penalty to -Y [DEFAULT: -9.0]
General Options:
-out file Output results to file [DEFAULT: off]
-quiet Output fewer event notifications [DEFAULT: off]
-trim T Trim reference sequences to T nt [DEFAULT: off]
-noenergy Do not perform thermodynamics [DEFAULT: off]
-restrict file Restricts scans to those between
specific miRNAs and UTRs
provided in a pairwise
tab-separated file [DEFAULT: off]
Other Options:
-keyval Key value pairs ?? [DEFAULT:]
This software will be further developed under the open source model,
coordinated by Anton Enright and Chris Sander (miranda (at) cbio.mskcc.org).
Please send bug reports to: miranda (at) cbio.mskcc.org.
- 测试example
$ cd ***/miRanda-3.3a/examples #你自己的软件路径
$ ls #查看路径下有两个fastq序列文件
bantam_stRNA.fasta hid_UTR.fasta
# 我们挨个看一下:[如下,前者是miRNA成熟序列文件(可以是下载的miRNA序列,自己鉴定的miRNA序列等等),后者是待匹配/查找的基因文件(circRNA、lncRNA、PCG等序列文件)。]
$ less -SN bantam_stRNA.fasta
1 >gi|29565487|emb|AJ550546.1| Drosophila melanogaster microRNA miR-bantam
2 GTGAGATCATTTTGAAAGCTG
$ less -SN hid_UTR.fasta
1 >gi|945100|gb|U31226.1|DMU31226 Drosophila melanogaster head involution defective protein (hid) mRNA, complete cds (3'UTR only)
2 TGACAAAAAATAAAAAACGAAATCCATCGTGAACAGTTTTGTGTTTTTAAATCAGTTCTAAACACGAAAA
3 GGGTTGATGAAAAACGCAGAAGAATCCGAAAAACTAACTAACCGAGCAAAAACTTGACTTGAGTGTTGTT
4 TGACAAATCAGGAAAGATAAAAAACAAATCATAAGAAAAAACTGCACGAAAAATGAAAAAGTTTCTAATA
5 TTCAAAATCTTGCACAAGAAATACAAAATCAATTAAAGTGAACTCTAACCAAAAGTTGTACACAAAATAA
6 AAAGCAAAACAAAGCAGCGAAGAACAATCACAAGAAGAGCAAAGTGCCAACAAAGTGCAGGAAGGAAGGA
7 AGCGGATAAGGACAAAAAGGAAGCCAGCACACACACACACACCCACACAATGGCCGTGCCCTTTTATTTG
8 CCCGAGGGCGGCGCCGATGACGTAGCGTCGAGTTCATCGGGAGCCTCGGGCAACTCCTCCCCCCACAACC
9 ACCCACTTCCCTCGAGCGCATCCTCGTCCGTCTCCTCCTCGGGCGTGTCCTCGGCCTCCGCCTCCTCGGC
10 CTCATCTTCGTCATCCGCATCGTCGGACGGCGCCAGCAGCGCCGCCTCGCAATCGCCGAACACCACCACC
11 TCGTCGGCCACGCAGACGCCGATGCAGTCTCCACTGCCCACCGACCAAGTGCTATACGCCCTCTACGAGT
12 GGGTCAGGATGTACCAGAGCCAGCAGAGTGCCCCGCAAATCTTCCAGTATCCGCCGCCAAGCCCCTCTTG
13 CAATTTCACTGGCGGCGATGTGTTCTTTCCGCACGGCCATCCGAATCCGAACTCGAATCCCCATCCGCGC
14 ACCCCCCGAACCAGCGTGAGCTTCTCCTCCGGCGAGGAGTACAACTTCTTCCGGCAGCAGCAGCCGCAAC
15 CACATCCGTCATATCCGGCGCCATCAACACCGCAGCCAATGCCACCGCAGTCAGCGCCGCCGATGCACTG
16 CAGCCACAGCTACCCGCAGCAGTCGGCGCACATGATGCCACACCATTCCGCTCCCTTCGGAATGGGCGGT
17 ACCTACTACGCCGGCTACACGCCACCACCCACTCCGAACACGGCCAGTGCGGGCACCTCCAGCTCATCGG
18 CGGCCTTCGGCTGGCACGGCCACCCCCACAGCCCCTTCACGTCGACCTCCACGCCGTTATCGGCGCCAGT
19 GGCGCCCAAGATGCGCCTGCAGCGCAGCCAGTCGGATGCGGCCAGACGCAAGCGATTGACCTCGACGGGC
20 GAGGATGAGCGCGAGTACCAGAGCGATCATGAGGCCACTTGGGACGAGTTTGGCGATCGCTACGACAACT
运行测试预测靶基因代码(提前添加bin到.bashrc):
$ miranda bantam_stRNA.fasta hid_UTR.fasta > test.out.txt # 这里只设置输入与输出,其他参数默认,或者如下添加其他参数
$ miranda bantam_stRNA.fasta hid_UTR.fasta -sc 140 -en -1 > test.out.txt
$ cat test.out.txt | grep ">" | less -SN
1 >gi|29565487|emb|AJ550546.1| gi|945100|gb|U31226.1|DMU31226 167.00 -24.54 2 20 3340 3360 18 83.33% 94.44%
2 >gi|29565487|emb|AJ550546.1| gi|945100|gb|U31226.1|DMU31226 156.00 -20.03 2 17 2505 2525 15 86.67% 93.33%
3 >gi|29565487|emb|AJ550546.1| gi|945100|gb|U31226.1|DMU31226 155.00 -14.57 2 16 2852 2872 14 78.57% 85.71%
4 >gi|29565487|emb|AJ550546.1| gi|945100|gb|U31226.1|DMU31226 152.00 -14.18 2 18 3820 3841 17 76.47% 76.47%
5 >>gi|29565487|emb|AJ550546.1| gi|945100|gb|U31226.1|DMU31226 630.00 -73.32 167.00 -24.54 1 21 3902 3340 2505 2852 3820
# 结果解读:
1)">"开头的几行,是miRNA(gi|29565487|emb|AJ550546.1|)靶标到基因(gi|945100|gb|U31226.1|DMU31226)上的不同位置
2)">"开头的行对应的信息依次是:miRNA id,基因id,打分,自由能,miRNA起始位置,miRNA终止位置,基因起始位置,基因终止位置,靶标结合miRNA长度,miRNA结合长度与miRNA总长占比,基因上结合长度(大于等于前者,可能是一个跨度)对miRNA总长占比
3)">>"的行是综合信息,依次代表:miRNA id,基因id,总打分,总自由能,最大打分,最大自由能,链信息,miRNA长度,基因长度,靶标到基因上的位置(1个或多个,这里是4个)
4)综上,我们可以抓取“>>”的行,来获取靶向的基因,如果需要其他信息也可以自己按需提取,so easy
- 再举一例(个性化分析)
miRNA鉴定等分析这里就不说了(赘述),(1)我们测miRNA数据,鉴定并得到差异miRNA(known or novel均可),是否希望看一下他们靶向与那些基因呢?(2)文献、试验等等,发现某些miRNA(可获取序列或名字的)有着重要的生物学功能,机理尚不明确,是否也想看看其靶向关系呢?可以运行miranda。按照(2)我们造几个miRNA进行举例:
3.1 下载miRBase成熟序列,提取human的miRNA进行举例
$ wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz
$ gunzip mature.fa.gz
$ less -SN mature.fa # 格式如下
1 >cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
2 UGAGGUAGUAGGUUGUAUAGUU
3 >cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
4 CUAUGCAAUUUUCUACCUUACC
5 >cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
6 UCCCUGAGACCUCAAGUGUGA
7 >cel-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
8 ACACCUGGGCUCUCCGGGUACC
我通过python编程取出10个human的miRNA,存为hsa_mature10.fa
$ cat hsa_mature10.fa
>hsa-let-7a-5p
UGAGGUAGUAGGUUGUAUAGUU
>hsa-let-7a-3p
CUAUACAAUCUACUGUCUUUC
>hsa-let-7a-2-3p
CUGUACAGCCUCCUAGCUUUCC
>hsa-let-7b-5p
UGAGGUAGUAGGUUGUGUGGUU
>hsa-let-7b-3p
CUAUACAACCUACUGCCUUCCC
>hsa-let-7c-5p
UGAGGUAGUAGGUUGUAUGGUU
>hsa-let-7c-3p
CUGUACAACCUUCUAGCUUUCC
>hsa-let-7d-5p
AGAGGUAGUAGGUUGCAUAGUU
>hsa-let-7d-3p
CUAUACGACCUGCUGCCUUUCU
>hsa-let-7e-5p
UGAGGUAGGAGGUUGUAUAGUU
接着准备靶向的基因集合,我是通过gtf位置提取了human基因组所有的转录本,取了其中17个来测试(17.fa),当然你也可以去ensembl下载CDS序列,其他方法得到的基因序列均可。
$ grep ">" 17.fa
>ENST00000000233.9 +
>ENST00000000412.7 -
>ENST00000000442.10 +
>ENST00000001008.5 +
>ENST00000001146.6 -
>ENST00000002125.8 +
>ENST00000002165.10 -
>ENST00000002501.10 -
>ENST00000002596.5 -
>ENST00000002829.7 +
>ENST00000003084.10 +
>ENST00000003100.12 -
>ENST00000003302.8 -
>ENST00000003583.12 -
>ENST00000003912.7 +
>ENST00000004103.7 +
>ENST00000004531.14 +
然后预测二者的靶标关系(通过管道输出):
$ miranda hsa_mature10.fa 17.fa -sc 140 -en -1 | grep ">>" > hsa_mature10_targets.txt
我们大概看看几行结果:
$ less -SN hsa_mature10_targets.txt
1 >>hsa-let-7a-5p ENST00000000412.7 146.00 -11.97 146.00 -11.97 2 22 2756 960
2 >>hsa-let-7a-5p ENST00000001008.5 143.00 -17.75 143.00 -17.75 4 22 3732 2611
3 >>hsa-let-7a-5p ENST00000002125.8 156.00 -21.39 156.00 -21.39 6 22 2176 1248
4 >>hsa-let-7a-5p ENST00000002829.7 146.00 -16.38 146.00 -16.38 10 22 3802 3480
5 >>hsa-let-7a-5p ENST00000003084.10 299.00 -39.65 152.00 -20.76 11 22 6132 3077 5087
6 >>hsa-let-7a-5p ENST00000003100.12 143.00 -18.40 143.00 -18.40 12 22 3210 2586
7 >>hsa-let-7a-5p ENST00000003302.8 438.00 -47.15 152.00 -16.09 13 22 4669 1132 2011 4340
8 >>hsa-let-7a-5p ENST00000004531.14 163.00 -20.40 163.00 -20.40 17 22 7560 1101
9 >>hsa-let-7a-3p ENST00000002125.8 161.00 -12.52 161.00 -12.52 23 21 2176 1476
10 >>hsa-let-7a-3p ENST00000002165.10 158.00 -14.96 158.00 -14.96 24 21 2356 1993
11 >>hsa-let-7a-3p ENST00000003084.10 148.00 -6.47 148.00 -6.47 28 21 6132 5758
12 >>hsa-let-7a-3p ENST00000003302.8 149.00 -16.95 149.00 -16.95 30 21 4669 3959
13 >>hsa-let-7a-3p ENST00000003912.7 146.00 -7.39 146.00 -7.39 32 21 5481 3640
14 >>hsa-let-7a-3p ENST00000004531.14 292.00 -16.76 146.00 -13.42 34 21 7560 3954 6657
tips:我们看到结果的格式如上面例子所描述的一致,此外,也看到同一miRNA靶向与不同的转录本(基因)上面,关于怎么去解释结果和行文、试验等等,就要看你的阅历和水平了。可以提取所有结果进行网络图的构建、可以提取最好(根据打分和自由能)的基因进行验证(相关性计算、定量、湿试验等等)、单纯查找资料讨论。【转录本与基因的对应关系需要自行转换】。下面也附上miRBase下载成熟序列的截图,也很简单。【本教程再提取序列时需要一定的编程基础,如果你只有两者的序列,也是可以进行靶向预测的】
http://www.mirbase.org
download
网友评论