Reading plan:
After unpacking, the top level contains a handful of files and folders. Judging by their names, the three folders are docs (documentation), scvi (the main program), and tests (the test suite). The top-level files can be grouped by suffix and name: configuration files (.yml, .lock, .toml; configuration formats have a long history, and moving from .ini through XML, JSON, and YAML to TOML, the languages have grown steadily more expressive while also becoming easier to write; a small parsing sketch follows the screenshot below), the overall README in .md, the setup script in .py, and the open-source LICENSE.
![](https://img.haomeiwen.com/i18429961/49d84a973e4a5709.png)
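To make the TOML point concrete, here is a minimal sketch of reading a value out of the repository's pyproject.toml with the standard-library tomllib (Python 3.11+; the [tool.poetry] key path is an assumption based on typical Poetry projects, not verified against this exact repo):
```
import tomllib  # standard library since Python 3.11

# pyproject.toml is the TOML configuration at the repository root
with open("pyproject.toml", "rb") as f:
    config = tomllib.load(f)

# Poetry projects usually keep their metadata under [tool.poetry]
print(config["tool"]["poetry"]["name"])     # e.g. "scvi-tools" (assumed)
print(config["tool"]["poetry"]["version"])
```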
Running `tree` on the structure shows a lot of files, so rather than starting with the main program I'll read the documentation first.
![](https://img.haomeiwen.com/i18429961/0cc96d97201b55b0.png)
The documentation consists mainly of the dedicated docs folder plus the .yml files at the top level.
So the plan is: top-level files first, then the docs folder, then the scvi folder, skipping the tests folder.
A .lock file is used to lock a resource: it signals that some resource of an application must not be used until the lock is released, which is useful for programs that need concurrent access to a critical resource. For a file lock, the application creates a new file whose name is the original name plus a .lock suffix; for example, the lock file for example.file would be example.file.lock.
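A minimal sketch of that file-lock convention (hypothetical filenames; os.O_EXCL makes creation atomic, so two processes cannot both acquire the lock):
```
import os

lock_path = "example.file.lock"  # lock file guarding example.file

try:
    # O_CREAT | O_EXCL fails atomically if the lock file already exists
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except FileExistsError:
    raise RuntimeError("example.file is locked by another process")

try:
    with open("example.file", "a") as f:  # the resource is now ours to touch
        f.write("exclusive write\n")
finally:
    os.close(fd)
    os.remove(lock_path)  # release the lock
```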
poetry.lock (note: despite the suffix, this is not a resource lock; it is the lockfile written by the Poetry package manager, pinning the exact resolved version of every dependency so that installs are reproducible)
![](https://img.haomeiwen.com/i18429961/6c43d4536775010b.png)
codecov.yml
![](https://img.haomeiwen.com/i18429961/ba3fb0006b205380.png)
The remaining .yml files are not particularly informative either.
![](https://img.haomeiwen.com/i18429961/5d7d653e37de5183.png)
Overall, the configuration files matter little for understanding the code.
On to the main program.
![](https://img.haomeiwen.com/i18429961/730e146c0fbd0aeb.png)
A package is a way of structuring Python's module namespace using "dotted module names": the module name A.B refers to a submodule named B inside a package named A. The __init__.py files are required for Python to treat a directory containing them as a package.
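To see the dotted-name mechanics end to end, here is a small self-contained sketch that builds a throwaway package on disk and imports through it (hypothetical names, mirroring the scvi/nn layout):
```
import os
import sys
import tempfile

# lay out:  mypkg/__init__.py, mypkg/nn/__init__.py, mypkg/nn/_utils.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypkg", "nn"))
open(os.path.join(root, "mypkg", "__init__.py"), "w").close()        # marks mypkg as a package
open(os.path.join(root, "mypkg", "nn", "__init__.py"), "w").close()  # marks mypkg.nn as a subpackage
with open(os.path.join(root, "mypkg", "nn", "_utils.py"), "w") as f:
    f.write("def hello():\n    return 'from mypkg.nn._utils'\n")

sys.path.insert(0, root)
from mypkg.nn import _utils  # dotted name: module _utils inside package mypkg.nn
print(_utils.hello())
```
With that in mind, let's start from nn, the folder with the fewest files.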
![](https://img.haomeiwen.com/i18429961/084125124b78c19e.png)
First, there is an __init__.py, so the whole folder is a package. _utils.py pulls in the torch package and implements one-hot encoding.
torch, better known as PyTorch, is a Python-first deep learning framework: an open-source Python machine learning library used for applications such as natural language processing. It provides strong GPU acceleration and supports dynamic neural networks (define-by-run graphs), something mainstream frameworks such as TensorFlow did not originally support (TensorFlow has since added eager execution).
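A tiny illustration of that define-by-run style (nothing scvi-specific): the computation graph is built as ordinary Python executes, so control flow can depend on tensor values:
```
import torch

x = torch.ones(3, requires_grad=True)
# ordinary Python branching decides the graph's shape at run time
if x.sum() > 0:
    y = (x * 2).sum()
else:
    y = (x * 3).sum()
y.backward()   # autograd walks the graph that was just built
print(x.grad)  # tensor([2., 2., 2.])
```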
![](https://img.haomeiwen.com/i18429961/a0faa9d5d5f28c7f.png)
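For reference, a one-hot helper in PyTorch typically looks something like the following; this is a hypothetical re-implementation for illustration, not necessarily the exact code in _utils.py:
```
import torch


def one_hot(index: torch.Tensor, n_cat: int) -> torch.Tensor:
    """One-hot encode a column of integer category indices (illustrative sketch)."""
    onehot = torch.zeros(index.size(0), n_cat, device=index.device)
    onehot.scatter_(1, index.type(torch.long), 1)  # write a 1 at each row's category column
    return onehot.type(torch.float32)


labels = torch.tensor([[0], [2], [1]])  # three samples, categories out of 3
print(one_hot(labels, 3))
# tensor([[1., 0., 0.],
#         [0., 0., 1.],
#         [0., 1., 0.]])
```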
The other file is long, but it mainly builds models with the torch package. I don't want to read the model part yet, and besides, since it's all calls into torch, why wouldn't users just build it themselves? I don't see the point yet; perhaps I should look at the examples first to understand the intent (a toy sketch of what "building a model with torch" means follows the screenshot).
![](https://img.haomeiwen.com/i18429961/63e24457f6d58fa5.png)
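For context, "building a model with torch" means subclassing torch.nn.Module, roughly like this; a deliberately tiny, hypothetical encoder, far simpler than scvi's real layers:
```
import torch
from torch import nn


class Encoder(nn.Module):
    """Toy encoder: genes in, latent mean and variance out (hypothetical)."""

    def __init__(self, n_input: int, n_hidden: int = 128, n_latent: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_input, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
        )
        self.mean = nn.Linear(n_hidden, n_latent)
        self.log_var = nn.Linear(n_hidden, n_latent)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.mean(h), torch.exp(self.log_var(h))  # mean and positive variance


x = torch.rand(4, 100)     # 4 cells, 100 genes
mu, var = Encoder(100)(x)  # both of shape (4, 10)
```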
```
import logging
import os

import anndata
import h5py
import numpy as np
import scipy.sparse as sp_sparse

from scvi.data._built_in_data._download import _download

logger = logging.getLogger(__name__)


def _load_brainlarge_dataset(
    save_path: str = "data/",
    sample_size_gene_var: int = 10000,
    max_cells_to_keep: int = None,
    n_genes_to_keep: int = 720,
    loading_batch_size: int = 100000,
) -> anndata.AnnData:
    """Loads brain-large dataset."""
    url = "http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5"
    save_fn = "brain_large.h5"
    _download(url, save_path, save_fn)
    adata = _load_brainlarge_file(
        os.path.join(save_path, save_fn),
        sample_size_gene_var=sample_size_gene_var,
        max_cells_to_keep=max_cells_to_keep,
        n_genes_to_keep=n_genes_to_keep,
        loading_batch_size=loading_batch_size,
    )
    return adata


def _load_brainlarge_file(
    path_to_file: str,
    sample_size_gene_var: int,
    max_cells_to_keep: int,
    n_genes_to_keep: int,
    loading_batch_size: int,
) -> anndata.AnnData:
    logger.info("Preprocessing Brain Large data")
    print(path_to_file)
    with h5py.File(path_to_file, "r") as f:
        data = f["mm10"]  # 10x HDF5 layout: raw sparse-matrix arrays under the genome group
        nb_genes, nb_cells = f["mm10"]["shape"]
        n_cells_to_keep = (
            max_cells_to_keep if max_cells_to_keep is not None else nb_cells
        )
        index_partitioner = data["indptr"][...]
        # estimate gene variance using a subset of cells.
        index_partitioner_gene_var = index_partitioner[: (sample_size_gene_var + 1)]
        last_index_gene_var_sample = index_partitioner_gene_var[-1]
        gene_var_sample_matrix = sp_sparse.csc_matrix(
            (
                data["data"][:last_index_gene_var_sample].astype(np.float32),
                data["indices"][:last_index_gene_var_sample],
                index_partitioner_gene_var,
            ),
            shape=(nb_genes, len(index_partitioner_gene_var) - 1),
        )
        mean = gene_var_sample_matrix.mean(axis=1)
        var = gene_var_sample_matrix.multiply(gene_var_sample_matrix).mean(
            axis=1
        ) - np.multiply(mean, mean)
        # keep the n_genes_to_keep most variable genes
        subset_genes = np.squeeze(np.asarray(var)).argsort()[-n_genes_to_keep:][::-1]
        del gene_var_sample_matrix, mean, var

        n_iters = int(n_cells_to_keep / loading_batch_size) + (
            n_cells_to_keep % loading_batch_size > 0
        )
        for i in range(n_iters):
            # slice out this batch's portion of the index pointer and re-zero it
            index_partitioner_batch = index_partitioner[
                (i * loading_batch_size) : ((1 + i) * loading_batch_size + 1)
            ]
            first_index_batch = index_partitioner_batch[0]
            last_index_batch = index_partitioner_batch[-1]
            index_partitioner_batch = (
                index_partitioner_batch - first_index_batch
            ).astype(np.int32)
            n_cells_batch = len(index_partitioner_batch) - 1
            data_batch = data["data"][first_index_batch:last_index_batch].astype(
                np.float32
            )
            indices_batch = data["indices"][first_index_batch:last_index_batch].astype(
                np.int32
            )
            matrix_batch = sp_sparse.csr_matrix(
                (data_batch, indices_batch, index_partitioner_batch),
                shape=(n_cells_batch, nb_genes),
            )[:, subset_genes]
            # stack on the fly to limit RAM usage
            if i == 0:
                matrix = matrix_batch
            else:
                matrix = sp_sparse.vstack([matrix, matrix_batch])
            logger.info(
                "loaded {} / {} cells".format(
                    i * loading_batch_size + n_cells_batch, n_cells_to_keep
                )
            )
    logger.info("%d cells subsampled" % matrix.shape[0])
    logger.info("%d genes subsampled" % matrix.shape[1])
    adata = anndata.AnnData(matrix)
    adata.obs["labels"] = np.zeros(matrix.shape[0])
    adata.obs["batch"] = np.zeros(matrix.shape[0])
    # drop cells with too few total counts or too few expressed genes
    counts = adata.X.sum(1)
    if sp_sparse.issparse(counts):
        counts = counts.A1
    gene_num = (adata.X > 0).sum(1)
    if sp_sparse.issparse(gene_num):
        gene_num = gene_num.A1
    adata = adata[counts > 1]
    adata = adata[gene_num > 1]
    return adata.copy()
```
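The trickiest part of this loader is rebuilding sparse matrices straight from the raw (data, indices, indptr) arrays stored in the 10x HDF5 file: slicing indptr over a range of cells and subtracting its first entry yields a valid local index pointer for just that batch. A tiny self-contained example of the same construction, with made-up numbers:
```
import numpy as np
import scipy.sparse as sp_sparse

# CSC triple: column j's nonzeros are data[indptr[j]:indptr[j+1]],
# with their row positions in indices over the same slice
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
indices = np.array([0, 2, 1])  # row index of each nonzero
indptr = np.array([0, 2, 3])   # 2 columns: col 0 holds 2 nonzeros, col 1 holds 1
m = sp_sparse.csc_matrix((data, indices, indptr), shape=(3, 2))
print(m.toarray())
# [[1. 0.]
#  [0. 3.]
#  [2. 0.]]
```
Feeding the same triple to csr_matrix reads the partitioner as rows instead of columns, which is exactly how the loop above turns the genes-by-cells layout of the file into cells-by-genes batches.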