GEO | Fast batch download of series matrix files

Author: 生命数据科学 | Published 2022-12-30 00:03

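When analyzing the GEO database at scale, you need to download series matrix files in batches and quickly. During the download, network problems and other issues throw up one error after another. Let's look at how to deal with them~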

1. Common errors

Error in checkForRemoteErrors(val) : 
  one node produced an error: Timeout was reached: [ftp.ncbi.nlm.nih.gov] Operation timed out after 10010 milliseconds with 0 out of 0 bytes received

Error in open.connection(x, "rb") : 
Timeout was reached: [ftp.ncbi.nlm.nih.gov] Operation timed out after 10013 milliseconds with 0 out of 0 bytes received

Warning message:
In .Internal(identical(x, y, num.eq, single.NA, attrib.as.set, ignore.bytecode,  :
  closing unused connection 3 (https://ftp.ncbi.nlm.nih.gov/geo/series/GSE31nnn/GSE31733/matrix/)
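
Timeouts like the ones above are often transient, so one common way to handle them is to retry the call a few times with a growing pause. The sketch below is only an illustration, not the code from function.R: with_retry and its arguments max_tries and wait are hypothetical names, and the flaky function simulates a request that times out twice before succeeding.

```r
# Hypothetical helper (not part of function.R): call f() up to max_tries times,
# doubling the pause after each failed attempt.
with_retry <- function(f, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(result)))
    if (attempt < max_tries) {
      Sys.sleep(wait * 2^(attempt - 1))
    }
  }
  stop(result)  # every attempt failed: re-signal the last error
}

# Simulated flaky request: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  "ok"
}
with_retry(flaky, max_tries = 5, wait = 0)
```

In real use, f would wrap the actual request, for example function() getGEO("GSE31733", GSEMatrix = TRUE, destdir = ".").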

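2. Input data

Only one input file is needed, GSE_list.txt. It contains a single column of GSE accession numbers (no header row!):

input_file <- "GSE_list.txt"  # change as needed
all_GSE <- read.table(input_file, sep = "\t", header = FALSE)  # header = TRUE if the file has a header row, FALSE if not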
> head(all_GSE)
        V1
1 GSE42301
2 GSE43065
3 GSE44961
4 GSE43969
5 GSE43356
6 GSE42247
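
Each accession in this list maps to a matrix directory on the NCBI server. The sketch below builds that directory URL from a GSE number. It assumes the layout seen in the warning message earlier (https://ftp.ncbi.nlm.nih.gov/geo/series/GSE31nnn/GSE31733/matrix/), where the bucket folder is the accession with its last three digits replaced by "nnn". gse_matrix_url is a hypothetical helper name and is not taken from function.R.

```r
# Hypothetical helper: build the matrix directory URL for GSE accessions.
# Assumption: the bucket folder replaces the last (up to) three digits with "nnn".
gse_matrix_url <- function(gse) {
  stopifnot(all(grepl("^GSE[0-9]+$", gse)))
  bucket <- sub("[0-9]{1,3}$", "nnn", gse)
  sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/", bucket, gse)
}

gse_matrix_url("GSE31733")
# [1] "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE31nnn/GSE31733/matrix/"
```

The function is vectorized, so it can be applied directly to the whole column, e.g. gse_matrix_url(all_GSE[, 1]).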

3. Required R packages

library(doParallel)
library(stringr)
library(GEOquery)
library(xml2)
library(parallel)
library(openxlsx)

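4. Code

Thirteen lines of code do the download. They automatically catch download timeouts and download errors, and run parallel workers so that the machine is used to the full for fast downloading~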

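source("function.R")  # helper functions (not shown in this post)
options(timeout = 60)  # download timeout in seconds
n.cores <- detectCores()  # use all available cores, or set a number yourself
input_file <- "GSE_list.txt"  # change as needed
all_GSE <- read.table(input_file, sep = "\t", header = FALSE)  # header = TRUE if the file has a header row, FALSE if not
GEO <- unique(unlist(str_match_all(all_GSE[, 1], pattern = "GSE[0-9]+")))  # keep unique, well-formed GSE IDs
merge <- c()
clust <- makeCluster(n.cores)
a <- parLapply(clust, GEO, fun = url, merge, getDirListing)  # look up each series in parallel
stopCluster(clust)
registerDoParallel(n.cores)
foreach(i = seq_along(a)) %dopar% try(download_fun(a = a, i = i))  # try() keeps one failure from stopping the rest
stopImplicitCluster()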
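
Because each download is wrapped in try(), a failed download does not stop the run, but it also passes silently. One possible follow-up is to keep the results and list the accessions that failed so they can be rerun. The sketch below shows the idea with lapply and a stand-in downloader: fake_download is hypothetical and only simulates a timeout for one accession. The list returned by foreach(...) %dopar% try(...) can be inspected in the same way.

```r
# Hypothetical stand-in for the real downloader: fails for one accession.
fake_download <- function(gse) {
  if (gse == "GSE43065") stop("Timeout was reached")
  TRUE
}

ids <- c("GSE42301", "GSE43065", "GSE44961")

# try() returns an object of class "try-error" when the call fails.
results <- lapply(ids, function(g) try(fake_download(g), silent = TRUE))
failed <- ids[vapply(results, inherits, logical(1), what = "try-error")]
failed
# [1] "GSE43065"
```

The failed accessions could then be written to a new list file and fed through the same script again.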