Alphafold2的conda版本的本地使用

作者: 可能性之兽 | 来源:发表于2022-12-07 21:03 被阅读0次

由于xx原因，需要用Alphafold2的conda版本的本地版本。所以花了两三天终于把alphafold2的conda本地版本给安装上了，主要是下载数据比较麻烦和费时间，总是有数据下载不全又要重新下载，docker版本的话有些配置实在难搞，折腾了一两个小时之后果断放弃使用conda版本。不过无论是哪个版本，都要先把数据下好，所以可以先把数据启动下载，然后在下载过程之中再去折腾alphfold的软件本体。

虽然alphafold官方提供了一个能够全部下载的脚本，但是那个脚本太坑，下载完两个（bfd和params）就会断掉。所以还是要读一下下载脚本的内容，然后学着一个个下载。最麻烦的就是两个数据：pdb70和pdb_mmcif。pdb70不知道为何下载非常的慢，就几十k左右的速度，pdb_mmcif下载到最后可能会莫名其妙断掉，但是一旦缺少后面运行就会报错。pdb70使用axel开多几个线程下载会快一点， pdb_mmcif文件使用下面的脚本会下载的比较快。

pdb_mmcif文件下载

alphafold-multimer环境部署

#!/bin/sh

src='rsync.rcsb.org::ftp_data/structures/divided/mmCIF' #源路径,结尾不带斜线
dst='./pdb_mmcif/raw' #目标路径,结尾不带斜线
opt="--recursive --links --perms --times --compress --info=progress2 --delete --port=33444" #同步选项
num=10 #并发进程数
depth='5 4 3 2 1' #归递目录深度
task=/tmp/`echo $src$ | md5sum | head -c 16`
[ -f $task-next ] && cp $task-next $task-skip
[ -f $task-skip ] || touch $task-skip

# 创建目标目录结构
rsync $opt --include "*/" --exclude "*" $src/ $dst

# 从深到浅同步目录
for l in $depth ;do
    # 启动rsync进程
    for i in `find $dst -maxdepth $l -mindepth $l -type d`; do
        i=`echo $i | sed "s#$dst/##"`
        if `grep -q "$i$" $task-skip`; then
            echo "skip $i"
            continue
        fi
        while true; do
            now_num=`ps axw | grep rsync | grep $dst | grep -v '\-\-daemon' | wc -l`
            if [ $now_num -lt $num ]; then
                echo "rsync $opt $src/$i/ $dst/$i" >>$task-log
                rsync $opt $src/$i/ $dst/$i &
                echo $i >>$task-next
                sleep 1
                break
            else
                sleep 5
            fi
        done
    done
done

下载完

下载完解压的数据会比GitHub上显示的要大一点，看来是后面加了不少数据，大概2.4T左右大小，大头就是bfd这个文件夹，1.8T左右。

Alpdata/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters_2018_12.fa
├── params
│   ├── LICENSE
│   ├── params_model_1_multimer_v2.npz
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2_multimer_v2.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3_multimer_v2.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4_multimer_v2.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5_multimer_v2.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   └── obsolete.dat
├── pdb_seqres
│   └── pdb_seqres.txt
├── small_bfd
│   └── bfd-first_non_consensus_sequences.fasta
├── uniclust30
│   └── uniclust30_2018_08
├── uniprot
│   ├── uniprot_sprot.fasta
│   └── uniprot_trembl.fasta
└── uniref90
    └── uniref90.fasta

12 directories, 38 files

配置好alphafold的conda环境

会由于系统和cuda的各种原因报错，需要一个个解决掉。这个就是考验个人功底的时候了。

使用

bash run_alphafold.sh -d ./alphafold_data/ -o ./dummy_test/ -f ./example/query.fasta -t 2020-05-14

主要参数

(alphafold2) [lp@localhost alphafold]$ bash run_alphafold.sh -h

Please make sure all required parameters are given
Usage: run_alphafold.sh <OPTIONS>
Required Parameters:
-d <data_dir>         Path to directory of supporting data
-o <output_dir>       Path to a directory that will store the results.
-f <fasta_path>       Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g <use_gpu>          Enable NVIDIA runtime to run with GPUs (default: true)
-r <run_relax>        Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage (default: true)
-e <enable_gpu_relax> Run relax on GPU if GPU is enabled (default: true)
-n <openmm_threads>   OpenMM threads (default: all available cores)
-a <gpu_devices>      Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m <model_preset>     Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c <db_preset>        Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
-l <num_multimer_predictions_per_model> How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer (default: 5)
-b <benchmark>        Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')