我现在的问题是这样的
vcf文件转plink的格式
方法一
vcftools
[lyc@200server ~]$ vcftools --vcf Rice.recode.vcf --plink --out output
出错这样
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf Rice.recode.vcf
--out output
--plink
Warning: Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
After filtering, kept 141 out of 141 Individuals
Writing PLINK PED and MAP files ...
Unrecognized values used for CHROM: Chr1 - Replacing with 0.
ls
ls
clear
Unrecognized values used for CHROM: Chr2 - Replacing with 0.
Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
Unrecognized values used for CHROM: Chr3 - Replacing with 0.
Unrecognized values used for CHROM: Chr4 - Replacing with 0.
Unrecognized values used for CHROM: Chr5 - Replacing with 0.
Unrecognized values used for CHROM: Chr6 - Replacing with 0.
Unrecognized values used for CHROM: Chr7 - Replacing with 0.
Unrecognized values used for CHROM: Chr8 - Replacing with 0.
Unrecognized values used for CHROM: Chr9 - Replacing with 0.
Unrecognized values used for CHROM: Chr10 - Replacing with 0.
Unrecognized values used for CHROM: Chr11 - Replacing with 0.
Unrecognized values used for CHROM: Chr12 - Replacing with 0.
Unrecognized values used for CHROM: ChrUn - Replacing with 0.
Unrecognized values used for CHROM: ChrSy - Replacing with 0.
Unrecognized values used for CHROM: chrC - Replacing with 0.
Done.
After filtering, kept 7186300 out of a possible 7186300 Sites
Run Time = 1647.00 seconds
[lyc@200server ~]$ ls
file.log output.ped prettify
file-temporary.bed output-temporary.bed Rice.recode.vcf
file-temporary.bim output-temporary.bim test
file-temporary.fam output-temporary.fam toy.map
LICENSE plink toy.ped
output.log plink_linux_x86_64_20201019.zip
output.map ppp
[lyc@200server ~]$
[lyc@200server ~]$ ls
file.log output.ped prettify
file-temporary.bed output-temporary.bed Rice.recode.vcf
file-temporary.bim output-temporary.bim test
file-temporary.fam output-temporary.fam toy.map
LICENSE plink toy.ped
output.log plink_linux_x86_64_20201019.zip
output.map ppp
[lyc@200server ~]$ clear
[lyc@200server ~]$ Expected at least 2 parts in FORMAT entry: ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
-bash: 未预期的符号 `newline' 附近有语法错误
方法二plink
plink --vcf Rice.recode.vcf --recode --out file
plink --vcf Rice.recode.vcf --recode --out output --double-id
plink --vcf Rice.recode.vcf --recode --out output --const-fid family_id
都是一样的结果
773821 MB RAM detected; reserving 386910 MB for main workspace.
Error: Line 38 of .vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.
就无语
之前我的文件有这些
file.log LICENSE plink Rice.recode.vcf
file-temporary.bed output.log plink_linux_x86_64_20201019.zip test
file-temporary.bim output.map ppp toy.map
file-temporary.fam output.ped prettify toy.ped
现在是这些
file.log output.ped prettify
file-temporary.bed output-temporary.bed Rice.recode.vcf
file-temporary.bim output-temporary.bim test
file-temporary.fam output-temporary.fam toy.map
LICENSE plink toy.ped
output.log plink_linux_x86_64_20201019.zip
output.map ppp
然后看到了这些
https://www.cog-genomics.org/plink/1.9/input
然后我就重新安装了plink为plink2
http://www.cog-genomics.org/plink/2.0/
linux上安装Plink
https://blog.csdn.net/qq_40605470/article/details/108882992
但是又遇到了新的问题
[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021) www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to ccc.log.
Options in effect:
--export ped
--out ccc
--vcf Rice.recode.vcf
Start time: Thu Mar 18 20:20:58 2021
773821 MiB RAM detected; reserving 386910 MiB for main workspace.
Using up to 80 threads (change this with --threads).
--vcf: 7146k variants scanned.
Error: Invalid chromosome code 'ChrUn' on line 7146662 of --vcf file.
(Use --allow-extra-chr to force it to be accepted.)
End time: Thu Mar 18 20:21:32 2021
然后发现文件又是临时文件
[lyc@200server ~]$ ls
ccc.log plink2
ccc-temporary.psam plink2_linux_x86_64_20210302.zip
file.log plink2.log
file-temporary.bed plink_linux_x86_64_20201019.zip
file-temporary.bim ppp
file-temporary.fam prettify
LICENSE Rice.recode.vcf
output.log test
output.map toy.map
output.ped toy.ped
output-temporary.bed transform.log
output-temporary.bim transform-temporary.bed
output-temporary.fam transform-temporary.bim
plink transform-temporary.fam
说明又出错了
所以,现在或许又要解决那一行的无效染色体
然后我就使用了--allow-extra-chr
[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc --allow-extra-chr
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021) www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to ccc.log.
Options in effect:
--allow-extra-chr
--export ped
--out ccc
--vcf Rice.recode.vcf
Start time: Fri Mar 19 16:31:21 2021
773821 MiB RAM detected; reserving 386910 MiB for main workspace.
Using up to 80 threads (change this with --threads).
--vcf: 7186300 variants scanned.
Error: Line 38 of --vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.
End time: Fri Mar 19 16:32:26 2021
哈,又回到了最初的起点
image.png
可是我看意思,plink2不应该有这个问题啊
[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --recode --out ccc --allow-extra-chr --vcf-half-call
PLINK v2.00a3LM 64-bit Intel (2 Mar 2021) www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to ccc.log.
Options in effect:
--allow-extra-chr
--export ped
--out ccc
--vcf Rice.recode.vcf
--vcf-half-call
Start time: Fri Mar 19 16:33:16 2021
Error: Missing --vcf-half-call argument.
For more info, try "plink2 --help <flag name>" or "plink2 --help | more".
应该是 --vcf-half-call的格式不对
然后换回plink1
重看官网
以及https://www.jianshu.com/p/38e603979334?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation
发现格式
[lyc@200server ~]$ plink --vcf Rice.recode.vcf --allow-extra-chr --recode --vcf-half-call'missing' --out eee
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to eee.log.
Options in effect:
--allow-extra-chr
--out eee
--recode
--vcf Rice.recode.vcf
--vcf-half-callmissing
Error: Unrecognized flag ('--vcf-half-callmissing').
For more information, try "plink --help <flag name>" or "plink --help | more".
发现原来是因为少了空格
[lyc@200server ~]$ plink --vcf Rice.recode.vcf --allow-extra-chr --recode --vcf-half-call 'haploid' --out eee
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to eee.log.
Options in effect:
--allow-extra-chr
--out eee
--recode
--vcf Rice.recode.vcf
--vcf-half-call haploid
773821 MB RAM detected; reserving 386910 MB for main workspace.
--vcf: eee-temporary.bed + eee-temporary.bim + eee-temporary.fam written.
7186300 variants loaded from .bim file.
141 people (0 males, 0 females, 141 ambiguous) loaded from .fam.
Ambiguous sex IDs written to eee.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 141 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.878065.
7186300 variants and 141 people pass filters and QC.
Note: No phenotypes present.
--recode ped to eee.ped + eee.map ... done.
终于成功,哭了
https://www.jianshu.com/p/dc82fcbe3cda然后意识到自己搜索关键词有问题
[lyc@200server ~]$ plink --file eee --allow-extra-chr --make-bed --out rice
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to rice.log.
Options in effect:
--allow-extra-chr
--file eee
--make-bed
--out rice
773821 MB RAM detected; reserving 386910 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (7186300 variants, 141 people).
--file: rice-temporary.bed + rice-temporary.bim + rice-temporary.fam written.
7186300 variants loaded from .bim file.
141 people (0 males, 0 females, 141 ambiguous) loaded from .fam.
Ambiguous sex IDs written to rice.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 141 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.878065.
7186300 variants and 141 people pass filters and QC.
Note: No phenotypes present.
--make-bed to rice.bed + rice.bim + rice.fam ... done.
这回终于对了
image.png
但问题是第一种转的,不知道应该是哪一种
image.png
我直接用了
我发现用plink2依然也还是这个问题,我看到一个答案貌似说是因为版本还不够新,今年初的那个改进版本或许可以修正这个问题
[lyc@200server ~]$ plink2 --vcf Rice.recode.vcf --allow-extra-chr --make-bed --out test
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to test.log.
Options in effect:
--allow-extra-chr
--make-bed
--out test
--vcf Rice.recode.vcf
773821 MB RAM detected; reserving 386910 MB for main workspace.
Error: Line 38 of .vcf file has a GT half-call.
Use --vcf-half-call to specify how these should be processed.
网友评论