Reading plan:
After unpacking, the top level contains a handful of files and folders. Judging by their names, the three folders are docs (documentation), scvi (the main program), and tests (the test suite). The top-level files can be grouped by suffix and name: configuration files (.yml, .lock, .toml; configuration formats have a long history, and moving from .ini through XML, JSON, and YAML to TOML, the languages have grown steadily more expressive while also becoming easier to write; a small parsing sketch follows the screenshot below), the overall README in .md, the setup script in .py, and the open-source LICENSE.
![](https://img.haomeiwen.com/i18429961/49d84a973e4a5709.png)
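To make the TOML point concrete, here is a minimal sketch of reading a value out of the repository's pyproject.toml with the standard-library tomllib (Python 3.11+; the [tool.poetry] key path is an assumption based on typical Poetry projects, not verified against this exact repo):
```
import tomllib  # standard library since Python 3.11

# pyproject.toml is the TOML configuration at the repository root
with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Poetry projects usually keep their metadata under [tool.poetry]
print(config["tool"]["poetry"]["name"])     # e.g. "scvi-tools" (assumed)
print(config["tool"]["poetry"]["version"])
```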
Running `tree` on the structure shows a lot of files, so rather than starting with the main program I'll read the documentation first.
![](https://img.haomeiwen.com/i18429961/0cc96d97201b55b0.png)
The documentation consists mainly of the dedicated docs folder plus the .yml files at the top level.
So the plan is: top-level files first, then the docs folder, then the scvi folder, skipping the tests folder.
A .lock file is used to lock a resource: it signals that some resource of an application must not be used until the lock is released, which is useful for programs that need concurrent access to a critical resource. For a file lock, the application creates a new file whose name is the original name plus a .lock suffix; for example, the lock file for example.file would be example.file.lock.
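A minimal sketch of that file-lock convention (hypothetical filenames; os.O_EXCL makes creation atomic, so two processes cannot both acquire the lock):
```
import os

lock_path = "example.file.lock"  # lock file guarding example.file

try:
    # O_CREAT | O_EXCL fails atomically if the lock file already exists
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except FileExistsError:
    raise RuntimeError("example.file is locked by another process")

try:
    with open("example.file", "a") as f:  # the resource is now ours to touch
        f.write("exclusive write\n")
finally:
    os.close(fd)
    os.remove(lock_path)  # release the lock
```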
poetry.lock (note: despite the suffix, this is not a resource lock; it is the lockfile written by the Poetry package manager, pinning the exact resolved version of every dependency so that installs are reproducible)
![](https://img.haomeiwen.com/i18429961/6c43d4536775010b.png)
codecov.yml
![](https://img.haomeiwen.com/i18429961/ba3fb0006b205380.png)
The remaining .yml files are not particularly informative either.
![](https://img.haomeiwen.com/i18429961/5d7d653e37de5183.png)
Overall, the configuration files matter little for understanding the code.
On to the main program.
![](https://img.haomeiwen.com/i18429961/730e146c0fbd0aeb.png)
A package is a way of structuring Python's module namespace using "dotted module names": the module name A.B refers to a submodule named B inside a package named A. The __init__.py files are required for Python to treat a directory containing them as a package.
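To see the dotted-name mechanics end to end, here is a small self-contained sketch that builds a throwaway package on disk and imports through it (hypothetical names, mirroring the scvi/nn layout):
```
import os
import sys
import tempfile

# lay out:  mypkg/__init__.py, mypkg/nn/__init__.py, mypkg/nn/_utils.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypkg", "nn"))
open(os.path.join(root, "mypkg", "__init__.py"), "w").close()        # marks mypkg as a package
open(os.path.join(root, "mypkg", "nn", "__init__.py"), "w").close()  # marks mypkg.nn as a subpackage
with open(os.path.join(root, "mypkg", "nn", "_utils.py"), "w") as f:
    f.write("def hello():\n    return 'from mypkg.nn._utils'\n")

sys.path.insert(0, root)
from mypkg.nn import _utils  # dotted name: module _utils inside package mypkg.nn
print(_utils.hello())
```
With that in mind, let's start from nn, the folder with the fewest files.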
![](https://img.haomeiwen.com/i18429961/084125124b78c19e.png)
First, there is an __init__.py, so the whole folder is a package. _utils.py pulls in the torch package and implements one-hot encoding.
torch, better known as PyTorch, is a Python-first deep learning framework: an open-source Python machine learning library used for applications such as natural language processing. It provides strong GPU acceleration and supports dynamic neural networks (define-by-run graphs), something mainstream frameworks such as TensorFlow did not originally support (TensorFlow has since added eager execution).
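A tiny illustration of that define-by-run style (nothing scvi-specific): the computation graph is built as ordinary Python executes, so control flow can depend on tensor values:
```
import torch

x = torch.ones(3, requires_grad=True)
# ordinary Python branching decides the graph's shape at run time
if x.sum() > 0:
    y = (x * 2).sum()
else:
    y = (x * 3).sum()
y.backward()   # autograd walks the graph that was just built
print(x.grad)  # tensor([2., 2., 2.])
```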
![](https://img.haomeiwen.com/i18429961/a0faa9d5d5f28c7f.png)
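For reference, a one-hot helper in PyTorch typically looks something like the following; this is a hypothetical re-implementation for illustration, not necessarily the exact code in _utils.py:
```
import torch


def one_hot(index: torch.Tensor, n_cat: int) -> torch.Tensor:
    """One-hot encode a column of integer category indices (illustrative sketch)."""
    onehot = torch.zeros(index.size(0), n_cat, device=index.device)
    onehot.scatter_(1, index.type(torch.long), 1)  # write a 1 at each row's category column
    return onehot.type(torch.float32)


labels = torch.tensor([[0], [2], [1]])  # three samples, categories out of 3
print(one_hot(labels, 3))
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])
```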
The other file is long, but it mainly builds models with the torch package. I don't want to read the model part yet, and besides, since it's all calls into torch, why wouldn't users just build it themselves? I don't see the point yet; perhaps I should look at the examples first to understand the intent (a toy sketch of what "building a model with torch" means follows the screenshot).
![](https://img.haomeiwen.com/i18429961/63e24457f6d58fa5.png)
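For context, "building a model with torch" means subclassing torch.nn.Module, roughly like this; a deliberately tiny, hypothetical encoder, far simpler than scvi's real layers:
```
import torch
from torch import nn


class Encoder(nn.Module):
    """Toy encoder: genes in, latent mean and variance out (hypothetical)."""

    def __init__(self, n_input: int, n_hidden: int = 128, n_latent: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_input, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
        )
        self.mean = nn.Linear(n_hidden, n_latent)
        self.log_var = nn.Linear(n_hidden, n_latent)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.mean(h), torch.exp(self.log_var(h))  # mean and positive variance


x = torch.rand(4, 100)     # 4 cells, 100 genes
mu, var = Encoder(100)(x)  # both of shape (4, 10)
```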
```
import logging
import os

import anndata
import h5py
import numpy as np
import scipy.sparse as sp_sparse

from scvi.data._built_in_data._download import _download

logger = logging.getLogger(__name__)


def _load_brainlarge_dataset(
    save_path: str = "data/",
    sample_size_gene_var: int = 10000,
    max_cells_to_keep: int = None,
    n_genes_to_keep: int = 720,
    loading_batch_size: int = 100000,
) -> anndata.AnnData:
    """Loads brain-large dataset."""
    url = "http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5"
    save_fn = "brain_large.h5"
    _download(url, save_path, save_fn)
    adata = _load_brainlarge_file(
        os.path.join(save_path, save_fn),
        sample_size_gene_var=sample_size_gene_var,
        max_cells_to_keep=max_cells_to_keep,
        n_genes_to_keep=n_genes_to_keep,
        loading_batch_size=loading_batch_size,
    )
    return adata


def _load_brainlarge_file(
    path_to_file: str,
    sample_size_gene_var: int,
    max_cells_to_keep: int,
    n_genes_to_keep: int,
    loading_batch_size: int,
) -> anndata.AnnData:
    logger.info("Preprocessing Brain Large data")
    print(path_to_file)
    with h5py.File(path_to_file, "r") as f:
        data = f["mm10"]  # 10x HDF5 layout: raw sparse-matrix arrays under the genome group
        nb_genes, nb_cells = f["mm10"]["shape"]
        n_cells_to_keep = (
            max_cells_to_keep if max_cells_to_keep is not None else nb_cells
        )
        index_partitioner = data["indptr"][...]
        # estimate gene variance using a subset of cells.
        index_partitioner_gene_var = index_partitioner[: (sample_size_gene_var + 1)]
        last_index_gene_var_sample = index_partitioner_gene_var[-1]
        gene_var_sample_matrix = sp_sparse.csc_matrix(
            (
                data["data"][:last_index_gene_var_sample].astype(np.float32),
                data["indices"][:last_index_gene_var_sample],
                index_partitioner_gene_var,
            ),
            shape=(nb_genes, len(index_partitioner_gene_var) - 1),
        )
        mean = gene_var_sample_matrix.mean(axis=1)
        var = gene_var_sample_matrix.multiply(gene_var_sample_matrix).mean(
            axis=1
        ) - np.multiply(mean, mean)
        # keep the n_genes_to_keep most variable genes
        subset_genes = np.squeeze(np.asarray(var)).argsort()[-n_genes_to_keep:][::-1]
        del gene_var_sample_matrix, mean, var

        n_iters = int(n_cells_to_keep / loading_batch_size) + (
            n_cells_to_keep % loading_batch_size > 0
        )
        for i in range(n_iters):
            # slice out this batch's portion of the index pointer and re-zero it
            index_partitioner_batch = index_partitioner[
                (i * loading_batch_size) : ((1 + i) * loading_batch_size + 1)
            ]
            first_index_batch = index_partitioner_batch[0]
            last_index_batch = index_partitioner_batch[-1]
            index_partitioner_batch = (
                index_partitioner_batch - first_index_batch
            ).astype(np.int32)
            n_cells_batch = len(index_partitioner_batch) - 1
            data_batch = data["data"][first_index_batch:last_index_batch].astype(
                np.float32
            )
            indices_batch = data["indices"][first_index_batch:last_index_batch].astype(
                np.int32
            )
            matrix_batch = sp_sparse.csr_matrix(
                (data_batch, indices_batch, index_partitioner_batch),
                shape=(n_cells_batch, nb_genes),
            )[:, subset_genes]
            # stack on the fly to limit RAM usage
            if i == 0:
                matrix = matrix_batch
            else:
                matrix = sp_sparse.vstack([matrix, matrix_batch])
            logger.info(
                "loaded {} / {} cells".format(
                    i * loading_batch_size + n_cells_batch, n_cells_to_keep
                )
            )
    logger.info("%d cells subsampled" % matrix.shape[0])
    logger.info("%d genes subsampled" % matrix.shape[1])
    adata = anndata.AnnData(matrix)
    adata.obs["labels"] = np.zeros(matrix.shape[0])
    adata.obs["batch"] = np.zeros(matrix.shape[0])
    # drop cells with too few total counts or too few expressed genes
    counts = adata.X.sum(1)
    if sp_sparse.issparse(counts):
        counts = counts.A1
    gene_num = (adata.X > 0).sum(1)
    if sp_sparse.issparse(gene_num):
        gene_num = gene_num.A1
    adata = adata[counts > 1]
    adata = adata[gene_num > 1]
    return adata.copy()
```
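The trickiest part of this loader is rebuilding sparse matrices straight from the raw (data, indices, indptr) arrays stored in the 10x HDF5 file: slicing indptr over a range of cells and subtracting its first entry yields a valid local index pointer for just that batch. A tiny self-contained example of the same construction, with made-up numbers:
```
import numpy as np
import scipy.sparse as sp_sparse

# CSC triple: column j's nonzeros are data[indptr[j]:indptr[j+1]],
# with their row positions in indices over the same slice
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
indices = np.array([0, 2, 1])  # row index of each nonzero
indptr = np.array([0, 2, 3])   # 2 columns: col 0 holds 2 nonzeros, col 1 holds 1
m = sp_sparse.csc_matrix((data, indices, indptr), shape=(3, 2))
print(m.toarray())
# [[1. 0.]
#  [0. 3.]
#  [2. 0.]]
```
Feeding the same triple to csr_matrix reads the partitioner as rows instead of columns, which is exactly how the loop above turns the genes-by-cells layout of the file into cells-by-genes batches.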