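For certain reasons I needed a local, conda-based (non-Docker) install of AlphaFold2. It took two or three days to get it working. Most of that time went into downloading the data, which is tedious and slow: downloads kept coming out incomplete and had to be restarted. I tried the Docker version first, but some of its configuration was too painful, so after an hour or two of struggling I gave up on it and switched to the conda version. Whichever version you use, the data has to be downloaded first, so start the downloads and then work on the AlphaFold software itself while they run.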
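AlphaFold does ship an official script that downloads everything, but in my experience it is unreliable: it stopped after finishing two datasets (bfd and params). So it is worth reading the download scripts and then fetching the datasets one at a time. The two most troublesome datasets are pdb70 and pdb_mmcif. For reasons I don't know, pdb70 downloads very slowly, at only a few tens of KB/s. pdb_mmcif may break off for no apparent reason near the end, and if any of it is missing, later runs will fail with errors. pdb70 goes somewhat faster with axel using several connections, and pdb_mmcif downloads faster with the script below.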
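Restarting a download by hand every time it breaks gets old quickly, so the retry loop can be wrapped in a small helper. This is only a sketch: `retry` and `RETRY_DELAY` are names made up for this example, and `PDB70_URL` is a placeholder that should be copied from the official `scripts/download_pdb70.sh` rather than guessed. `axel -n` sets the number of connections.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
# RETRY_DELAY (seconds between attempts, default 10) is a hypothetical knob.
retry() {
    max=$1
    shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-10}"
    done
}

# Example, left commented out so that sourcing this file downloads nothing.
# PDB70_URL is a placeholder; take the real URL from scripts/download_pdb70.sh.
# retry 5 axel -n 8 -o pdb70.tar.gz "$PDB70_URL"
```

The helper only reacts to the exit status of the command, so it will not notice a file that is truncated but reported as a success; comparing checksums where the dataset provides them is still worthwhile.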
Downloading the pdb_mmcif files
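#!/bin/sh
src='rsync.rcsb.org::ftp_data/structures/divided/mmCIF' # source path, no trailing slash
dst='./pdb_mmcif/raw' # destination path, no trailing slash
opt="--recursive --links --perms --times --compress --info=progress2 --delete --port=33444" # rsync options
num=10 # number of concurrent rsync processes
depth='5 4 3 2 1' # directory depths to walk, deepest first
# State files for this source; the previous run's "next" list becomes this run's skip list
task=/tmp/$(echo "$src" | md5sum | head -c 16)
[ -f "$task-next" ] && cp "$task-next" "$task-skip"
[ -f "$task-skip" ] || touch "$task-skip"
# Create the destination directory structure
rsync $opt --include "*/" --exclude "*" "$src/" "$dst"
# Sync directories from deepest to shallowest
for l in $depth; do
    # Start one rsync process per directory at this depth
    for i in $(find "$dst" -maxdepth "$l" -mindepth "$l" -type d); do
        i=$(echo "$i" | sed "s#$dst/##")
        if grep -qxF "$i" "$task-skip"; then
            echo "skip $i"
            continue
        fi
        # Wait until fewer than $num rsync processes are running
        while true; do
            now_num=$(ps axw | grep rsync | grep "$dst" | grep -v '\-\-daemon' | wc -l)
            if [ $now_num -lt $num ]; then
                echo "rsync $opt $src/$i/ $dst/$i" >>"$task-log"
                rsync $opt "$src/$i/" "$dst/$i" &
                echo "$i" >>"$task-next"
                sleep 1
                break
            else
                sleep 5
            fi
        done
    done
done
# Wait for the remaining background transfers to finish
wait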
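After the sync, the raw directory holds gzip-compressed files in per-prefix subdirectories, whereas AlphaFold expects unpacked files directly under pdb_mmcif/mmcif_files. The official download script performs a similar unpack-and-flatten step; the sketch below is an approximation of it, not a copy, and `flatten_mmcif` is a hypothetical name.

```shell
#!/bin/sh
# flatten_mmcif RAW_DIR OUT_DIR
# Decompress every *.gz under RAW_DIR, move the resulting *.cif files
# into OUT_DIR, and print how many *.cif files OUT_DIR then contains.
# OUT_DIR must not be located inside RAW_DIR.
flatten_mmcif() {
    raw_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    # Decompress in place; -f overwrites leftovers from an earlier partial run.
    find "$raw_dir" -type f -name '*.gz' -exec gunzip -f {} +
    # Move the extracted files out of their subdirectories.
    find "$raw_dir" -type f -name '*.cif' -exec mv {} "$out_dir"/ \;
    # Report the resulting file count (tr strips padding some wc versions add).
    find "$out_dir" -type f -name '*.cif' | wc -l | tr -d ' '
}

# Example, left commented out:
# flatten_mmcif ./pdb_mmcif/raw ./pdb_mmcif/mmcif_files
```

Run this only once the sync is complete. Because the files are unpacked in place and the rsync options include `--delete`, syncing again afterwards would fetch the archives a second time.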
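After the download
Once downloaded and unpacked, the data is somewhat larger than the figure shown on GitHub, presumably because quite a lot of data has been added since. The total is about 2.4 TB, most of it in the bfd directory at about 1.8 TB.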
Alpdata/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters_2018_12.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer_v2.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer_v2.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer_v2.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer_v2.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer_v2.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
├── uniprot
│   ├── uniprot_sprot.fasta
│   └── uniprot_trembl.fasta
└── uniref90
    └── uniref90.fasta

12 directories, 38 files
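Since a missing dataset only shows up as an error later, at run time, it may be worth checking the layout before going further. The sketch below tests only that the directories from the listing above exist; it does not verify their contents, and `check_data_dir` is a hypothetical name.

```shell
#!/bin/sh
# check_data_dir DATA_DIR
# Print one "missing: NAME" line for each expected subdirectory that is
# absent, and return non-zero if any is missing.
check_data_dir() {
    data_dir=$1
    missing=0
    for d in bfd mgnify params pdb70 pdb_mmcif/mmcif_files pdb_seqres \
             small_bfd uniclust30 uniprot uniref90; do
        if [ ! -d "$data_dir/$d" ]; then
            echo "missing: $d"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example, left commented out:
# check_data_dir ./Alpdata && du -sh ./Alpdata/*
```

Comparing the `du -sh` output against the sizes mentioned above (about 2.4 TB in total, about 1.8 TB for bfd) gives a rough second check for truncated downloads.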
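Setting up the AlphaFold conda environment
Expect errors caused by differences in your system and CUDA setup; they have to be resolved one at a time. This is where your own troubleshooting skills get tested.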
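One cheap thing to rule out before digging into CUDA errors is a tool that simply is not on PATH in the activated environment. The sketch below is a generic preflight check. `check_tools` is a hypothetical name, and the tool list in the usage line (jackhmmer, hhblits, hhsearch, kalign, nvidia-smi) is an assumption about what this kind of install relies on, so adjust it to your setup.

```shell
#!/bin/sh
# check_tools TOOL...
# Print "not found: TOOL" for every command that is not on PATH and
# return non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example, left commented out; the tool list is an assumption:
# check_tools jackhmmer hhblits hhsearch kalign nvidia-smi
```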
Usage
bash run_alphafold.sh -d ./alphafold_data/ -o ./dummy_test/ -f ./example/query.fasta -t 2020-05-14
Main parameters
(alphafold2) [lp@localhost alphafold]$ bash run_alphafold.sh -h
Please make sure all required parameters are given
Usage: run_alphafold.sh <OPTIONS>
Required Parameters:
-d <data_dir> Path to directory of supporting data
-o <output_dir> Path to a directory that will store the results.
-f <fasta_path> Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g <use_gpu> Enable NVIDIA runtime to run with GPUs (default: true)
-r <run_relax> Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage (default: true)
-e <enable_gpu_relax> Run relax on GPU if GPU is enabled (default: true)
-n <openmm_threads> OpenMM threads (default: all available cores)
-a <gpu_devices> Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m <model_preset> Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c <db_preset> Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
-l <num_multimer_predictions_per_model> How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer (default: 5)
-b <benchmark> Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')
Try running a prediction on a small protein sequence; if it finishes without errors, the installation is done.
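A smoke test of this kind could look like the sketch below. `write_fasta` is a hypothetical helper, `SEQ` is a placeholder for a short amino-acid sequence of your choice, and the output and FASTA paths are made up; the flags themselves come from the help text above. Using `-c reduced_dbs` is a suggestion for a faster first run, not something the original setup requires.

```shell
#!/bin/sh
# write_fasta NAME SEQUENCE PATH
# Write a single-record FASTA file: a ">NAME" header line followed by
# the sequence on one line.
write_fasta() {
    printf '>%s\n%s\n' "$1" "$2" > "$3"
}

# Example, left commented out. SEQ is a placeholder, not a real sequence.
# write_fasta small_test "$SEQ" ./small_test.fasta
# bash run_alphafold.sh -d ./alphafold_data/ -o ./smoke_test/ \
#     -f ./small_test.fasta -t 2020-05-14 -c reduced_dbs
```

Keep the file to a single record for this test: according to the help text, a FASTA file with multiple sequences is folded as a multimer.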
Collected AlphaFold2 resources: https://github.com/chenxingqiang/ref-Alphafold-Code
AlphaFold2 on GitHub: https://github.com/deepmind/alphafold
AlphaFold2 conda installation guide: https://github.com/kalininalab/alphafold_non_docker
DeepMind's blog post on AlphaFold2:
https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
AlphaFold2 papers:
https://www.nature.com/articles/s41586-021-03828-1
https://www.nature.com/articles/s41586-021-03819-2
AlphaFold2 database of predicted structures: https://alphafold.ebi.ac.uk/
ColabFold, a simplified version of AlphaFold2: https://github.com/sokrypton/ColabFold
Slides introducing the ColabFold and AlphaFold2 architecture: https://docs.google.com/presentation/d/1mnffk23ev2QMDzGZ5w1skXEadTe54l8-Uei6ACce8eI/edit
Slides, "AlphaFold2: how to apply AI to predict protein 3D structure": https://s3.jcloud.sjtu.edu.cn/c4b9d2676bcc4cd88e25abaf9d8cf068-ins_common/seminars/1959_Alphafold2-INS.pdf