Personal Notes

Author: 深山夕照深秋雨OvO | Published: 2023-11-14 00:52

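    tr 'A-Z' 'a-z' < tmp2 | sed -E 's/(^|\s)\w/\U&/g' | tr ' ' ',' > tmp3
    Lowercase everything first, then capitalize the first letter of each word (tmp2 is space-delimited). Note that \U, \s and \w are GNU sed extensions, and the alternation needs -E (or \| in basic regex mode).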
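
Where GNU sed is not available, the same transformation can be sketched in POSIX awk. The function name `title_case` is made up for this example. Unlike the `tr`-based pipeline, awk's default field splitting collapses runs of spaces, so consecutive spaces yield a single comma.

```shell
# title_case: lowercase each space-separated word, uppercase its first letter,
# and join the words of each line with commas.
title_case() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++) {
            word = toupper(substr($i, 1, 1)) tolower(substr($i, 2))
            out = out (i > 1 ? "," : "") word
        }
        print out
    }'
}

printf 'hOMO sapiens\nMUS musculus\n' | title_case
```

To reproduce the original step, one would run `title_case < tmp2 > tmp3`.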


    To extract the reads of a given genomic region from a BAM file, use one of the commands below (region queries need a coordinate-sorted, indexed BAM)

    samtools view -hb wgs.sort.bam chr:start-end > target.region.bam
    # extract using a BED file
    samtools view -hb -L target.bed wgs.sort.bam > target.region.bam
    
    bedtools intersect -a wgs.sort.bam -b target.bed > target.region.bam
    
    sambamba view -h -f bam wgs.sort.bam chr:start-end > target.region.bam
    # to extract using a BED file, use `sambamba slice`
    sambamba slice -L target.bed wgs.sort.bam > target.region.bam
    
    # of these, sambamba slice -L is the fastest and uses the fewest resources
    

    Converting GFF/GTF to GenBank format, ref: https://www.biostars.org/p/72220/
    The EMBOSS tool seqret would be a possible option.

    seqret   -sequence   reference.fasta   -feature   -fformat gff   -fopenfile 1.gff   -osformat genbank   -auto
    # the output still needs some details fixed by hand
    

    if and else in awk

    awk '{if ($2 < 10) print $1"\t"$2-10; else print $1"\t"$2+10}' input > output
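
The same branch can also be written with awk's conditional operator, which keeps the `print` in one place. This is only an equivalent sketch; the function name `shift_col2` is made up for this example.

```shell
# shift_col2: subtract 10 from column 2 when it is below 10, otherwise add 10.
shift_col2() {
    awk 'BEGIN { OFS = "\t" } { print $1, ($2 < 10 ? $2 - 10 : $2 + 10) }'
}

printf 'a\t5\nb\t20\n' | shift_col2
```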
    

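    Batch-generating sed command lines

    awk '{print "sed -i '\''s#" $1 "#" $2 "#g'\'' input"}' rename.txt > run.sh
    sh run.sh
    # input is the file the batch of sed commands is applied to
    # rename.txt has two columns; every occurrence of a column-1 string is replaced with the column-2 string
    # the strings must not contain "#" or regex metacharacters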
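
An alternative sketch: rather than one `sed -i` call per row, write all substitutions into a single sed script and apply it in one pass. The function name `batch_rename` and the demo file contents are made up for illustration. As with the original approach, rules are applied in order, so a later rule can rewrite the result of an earlier one.

```shell
# batch_rename RENAME_TABLE INPUT: print INPUT with every column-1 string of
# RENAME_TABLE replaced by the matching column-2 string, in a single sed pass.
batch_rename() {
    script=$(mktemp)
    # one "s#old#new#g" line per row of the two-column table
    awk '{print "s#" $1 "#" $2 "#g"}' "$1" > "$script"
    sed -f "$script" "$2"
    rm -f "$script"
}

# demo with throwaway files (hypothetical contents)
demo_dir=$(mktemp -d)
printf 'geneA\tBRCA1\ngeneB\tTP53\n' > "$demo_dir/rename.txt"
printf 'geneA is up\ngeneB is down\n' > "$demo_dir/input"
batch_rename "$demo_dir/rename.txt" "$demo_dir/input"
```

The function writes to standard output, so the input file stays untouched until the result has been checked.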
    

    Reposted from "Coordinate systems in bioinformatics file formats and how to convert between them"
    https://www.biochen.org/cn/blog/2020/%E7%94%9F%E7%89%A9%E4%BF%A1%E6%81%AF%E6%96%87%E4%BB%B6%E6%A0%BC%E5%BC%8F%E4%B8%AD%E7%9A%84%E5%9D%90%E6%A0%87%E7%B3%BB/

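    Many bioinformatics file formats are based on genomic coordinates, such as the common BED and GTF formats. The two define their coordinate systems quite differently. In BED the first position has index 0 and intervals are half-open (start included, end excluded); in GTF the first position has index 1 and intervals are closed at both ends. We call the former 0-based and the latter 1-based. The advantage of 0-based coordinates is that length is simple to compute: subtracting the start from the end gives the sequence length. The advantage of 1-based coordinates is that they are more intuitive.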


    Besides BED and GTF, the source post tabulates the conventions of other formats (the table is not reproduced here).


    Length calculation

    Length(0-based) = End(0-based) - Start(0-based)
    Length(1-based) = End(1-based) - Start(1-based) + 1

    Coordinate conversion

    0-based to 1-based
    Start(1-based) = Start(0-based) + 1
    End(1-based) = End(0-based)

    1-based to 0-based
    Start(0-based) = Start(1-based) - 1
    End(0-based) = End(1-based)
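
The conversion rules above can be sketched as two small awk filters. The function names are made up for this example, and the input is assumed to be tab-delimited with chromosome, start and end in the first three columns.

```shell
# bed_to_1based: read BED lines (0-based, end excluded) and print
# chrom, 1-based start, 1-based end, and the interval length.
bed_to_1based() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 + 1, $3, $3 - $2 }'
}

# onebased_to_bed: the reverse conversion for chrom/start/end given in
# 1-based, closed form.
onebased_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }'
}

printf 'chr1\t0\t100\n' | bed_to_1based
```

The BED interval `chr1 0 100` covers the first 100 bases, which is `1..100` in 1-based closed coordinates; both forms give a length of 100.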


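    About ChIPseeker annotation


    [Figure: chip.png]

    Sometimes the two columns on the left (geneChr/transcriptid) and the third column (distanceToTSS) do not agree.
    This is because the two left columns report which gene the input BED interval (e.g., a peak) falls on,
    while the third column, on the right, refers to the gene whose TSS is nearest to that peak.

    Without additional information, the first base of a gene's first exon is taken as the TSS.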


    Meaning of the Von Neumann Entropy (VNE) index
    This likely reflects a more disordered (the high-entropy status) and relaxed chromatin architecture at early development (E38 and E80) (Fig. 2b).
    In agreement with the phenomenon that 3D structure in early mammalian embryos is initially obscure but gradually established throughout development [45–47], the relatively loose chromatin folding highlights a highly plastic state for hepatocyte genomes at the early stages of development and may be essential for the rapid functional transitions in the liver before and after birth.


    https://doi.org/10.1038/s41421-022-00416-z; Fig. 2a

    We observed a significantly higher VNE in the POF stage (0.86, P < 0.016, Wilcoxon rank-sum test) than in the SWF (0.80) and F1 stages (0.79) (Fig. 2a). This is likely due to a more disordered and relaxed chromatin architecture in the POF stage (Fig. 2b), while the architecture is more stable and ordered in mature GCs at the F1 stages, which aligns with the relaxed genome architecture observed during senescence


    https://doi.org/10.1038/s41467-021-27800-9 ; Fig. 2a

    The following sentence is from https://doi.org/10.1080/19491034.2021.1910437, which cites two references for it.
    That paper also provides a tool for computing VNE.
    Biologically, genomic regions with high entropy likely correlate with high proportions of euchromatin, as euchromatin is more structurally permissive than heterochromatin [1, 2]
    1.Macarthur BD, Lemischka IR. Statistical mechanics of pluripotency. Cell. 2013;154(3):484–489
    2.Rajapakse I, Groudine M, Mesbahi M. What can systems theory of networks offer to biology? PLoS Comput Biol. 2012;8(6):e1002543.
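
For reference, the general textbook definition of von Neumann entropy is sketched below. Here $\rho$ is a symmetric positive semidefinite matrix with unit trace and $\lambda_i$ are its eigenvalues, with the convention $0 \ln 0 = 0$. How $\rho$ is derived from a Hi-C contact matrix (for example, from a normalized correlation matrix) depends on the tool and is an assumption here, not something stated in these notes.

```latex
S(\rho) = -\operatorname{Tr}(\rho \ln \rho) = -\sum_{i=1}^{n} \lambda_i \ln \lambda_i,
\qquad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i = 1
```

The entropy is 0 when a single eigenvalue carries all the weight and reaches its maximum, $\ln n$, when all eigenvalues are equal, which matches the reading of high VNE as a more disordered structure.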

    The following two passages are from https://doi.org/10.1016/j.neo.2020.12.010
    In the context of genome structure, the higher the entropy, the more conformations available to the system [46] . If the distant ends of a genomic region, e.g., a gene, interact to form a loop, there are fewer conformations available to the gene and thus the entropy of that genomic region is reduced.
    46.Phillips, Rob, et al. "Physical biology of the cell." American Journal of Physics 78.11 (2010): 1230-1230.
    and
    We apply one such approach - a derivative of VNE - to measure local chromatin organization of individual gene regions [59]. Higher VNE values indicate that the number of conformations available to the gene and its immediate neighborhood are higher, indicating that chromatin is more accessible.
    Judging by this author's results, VNE is positively correlated with gene expression

    The more disordered (and permissive) chromatin in the pgEpiSCs was also evident based on its high-entropy status.
    The paper then cites the figure below, whose panel d legend reads: The extent of disorder in chromatin structure (quantified by the Von Neumann Entropy (VNE))


    https://doi.org/10.1038/s41422-021-00592-9; Fig. 5d

    We found that Di-SG had higher entropy (Fig. 1C), suggestive of less compact chromatin structural organization in Di-SG.


    https://doi.org/10.1016/j.jbc.2021.101559; Fig. 1c

    MATLAB code for computing VNE:

    close all, clear %%% Close figures and reset variables
    restoredefaultpath %%% Ensure no other folders on current path
    addpath(genpath('D:\MATLAB-\toolbox\4DNvestigator')) %%% Add all 4DNvestigator folders and files to path
    
    Folder_Result = 'E:\workspace\fenshuChr.2406\juicer'; %%% Output folder
    Data_Loc2 = {'E:\workspace\fenshuChr.2406\juicer\Ebaileyi.30.hic'};
    bpFrag = 'BP';
    binSize = 1E5;
    entropyExample(Data_Loc2, Folder_Result, 1, bpFrag, binSize)
    %%% The third argument, 1, is the chromosome number
    
    %%% Equivalently:
    Folder_Result = 'E:\workspace\fenshuChr.2406\juicer'; %%% Output folder
    Data_Loc2 = {'E:\workspace\fenshuChr.2406\juicer\Ebaileyi.30.hic'};
    bpFrag = 'BP';
    binSize = 1E5;
    chrSelect = 1; %%% Chromosome number
    entropyExample(Data_Loc2, Folder_Result, chrSelect, bpFrag, binSize)
    

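    https://github.com/HuiyangYu/PanDepth computes genome (and gene-set) depth and coverage from SAM/BAM/CRAM files. It is very fast and memory-efficient: even very large BAM files (tens of GB) take only a minute or two. By default its memory use is reportedly about one tenth of what bamdeal needs.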


    New from Heng Li | compleasm: a faster and more accurate assessment tool than BUSCO
    https://github.com/huangnengCSU/compleasm


    Rather than reporting so much detail in the abstract, it might be better to make a more general statement like: "Deletions affecting introns and/or coding regions of numerous genes may have contributed to phenotypic differences between A. baiyi and other Ablax species"


    Comparative Recombination Rates in the Rat, Mouse, and Human Genomes
    10.1101/gr.1970304

    Coefficient for converting physical position to genetic distance, taken from the paper above

    awk '{print $1"\t"$4"\t"$4*0.000554779412}' Chr27.map | sort -Vk 1 | awk '{print "Chr"$1"\t"$2"\t"$3}' >  Chr27.genetic.map
    

    Genetic position = physical position of the SNP * 0.000554779412


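    Phylogenomics: a detailed tutorial on drawing a DensiTree
    A DensiTree is an overlay of the topologies of multiple phylogenetic trees, used to visualize topological conflict among the trees (or gene-tree heterogeneity). It can be drawn with the DensiTree software (now bundled with the BEAST2 installer) or with the R package phangorn. The drawing procedure is documented at the link below.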
    https://mp.weixin.qq.com/s/PvxX02Pw_NPiV8aTpxL8TQ


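    Thresholds for the four relatedness levels of the KING kinship coefficient
    0.0442 / 0.0884 / 0.177 / 0.354
    This paper removed all individuals with kinship above 0.0442, i.e., third-degree relatives or closer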


    Fig. S3. https://doi.org/10.1073/pnas.1713288114
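
The four cutoffs can be applied with a short awk filter, sketched below. The labels follow the usual KING interpretation (above 0.354 duplicate or MZ twin, then first, second and third degree). The function name `king_degree` and the handling of values that fall exactly on a cutoff are assumptions of this example.

```shell
# king_degree: label each kinship coefficient (one per line) with the
# relationship class implied by the cutoffs 0.0442 / 0.0884 / 0.177 / 0.354.
king_degree() {
    awk '{
        k = $1 + 0
        if      (k > 0.354)  label = "duplicate_or_MZ_twin"
        else if (k > 0.177)  label = "1st_degree"
        else if (k > 0.0884) label = "2nd_degree"
        else if (k > 0.0442) label = "3rd_degree"
        else                 label = "unrelated"
        print $1 "\t" label
    }'
}

printf '0.25\n0.10\n0.05\n0.01\n' | king_degree
```

Under the filter described above, every pair not labelled `unrelated` would lead to an individual being removed.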

    https://coolpuppy.readthedocs.io/en/latest/walkthrough.html
    Drawing Hi-C pileup plots


    Transferring files between servers
    yum install rsync
    rsync -azv -P -e "ssh -p 20338" kuangzhr17@202.201.1.198:/home/kuangzhr17/test.fa /opt/synData


    https://github.com/veg/hyphy-analyses/tree/master/AncestralSequences
    Ancestral sequence reconstruction with HyPhy


    Identifying conserved loops
    https://github.com/adadiehl/mapLoopLoci
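    Requires a chain file between the two genomes (query to ref) and the corresponding loop files

    The loop file format is as follows:
    columns 1-3 are the left anchor of the loop, columns 4-6 are the right anchor, column 7 is a unique identifier for the loop, column 8 is the read count, and column 9 is the p-value. In practice, columns 7, 8 and 9 can hold arbitrary values; they do not matter.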

    ./mapLoopLoci.py query.loop target.loop query.to.target.chain > query.to.target.out
    The result is reported relative to query.loop: it determines which of the loops in query.loop are conserved and which are XX
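
Since columns 7 to 9 are said to be arbitrary, a six-column anchor table can be padded to the nine-column layout with awk. This is a minimal sketch that assumes tab-delimited input. The function name `bedpe_to_loop`, the `loopN` identifiers and the placeholder values of 1 are made up for this example and have not been checked against mapLoopLoci itself.

```shell
# bedpe_to_loop: pad a 6-column anchor table (chrom1 start1 end1 chrom2 start2 end2)
# to 9 columns by adding a unique ID, a placeholder read count and a placeholder p-value.
bedpe_to_loop() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $4, $5, $6, "loop" NR, 1, 1 }'
}

printf 'chr1\t100\t200\tchr1\t900\t1000\n' | bedpe_to_loop
```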


    Extracting CDS and protein sequences
    gffread 35.gff -g 35.genomic.fa -x 35.cds -y 35.pep


    [Figure: 大鼠的遗传距离.jpg (genetic distance in rat)]

    awk '{print "'$i'""\t"$1}'  # figured it out: to use the shell variable $i inside awk, close and reopen the single quotes around it
    awk -v id="$i" '$3 == "Chr10" && $1 == id' | wc -l
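
Passing the shell variable with `-v` avoids the quote juggling entirely and still works when the value contains spaces. A minimal sketch follows; the function name `count_matches` and the demo rows are made up for this example.

```shell
# count_matches ID: count stdin rows whose column 1 equals ID and whose column 3 is Chr10.
count_matches() {
    awk -v id="$1" '$3 == "Chr10" && $1 == id { n++ } END { print n + 0 }'
}

printf 's1\tx\tChr10\ns1\ty\tChr2\ns2\tz\tChr10\n' | count_matches s1
```

Counting inside awk also removes the need for the trailing `wc -l`.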
    

    plink --vcf input.vcf --allow-extra-chr --double-id --indep-pairwise 50 10 0.1 --out ld
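    # the three parameters are: window size (in SNPs), the step by which the window moves, and the r2 threshold above which SNPs are considered linked
    
    # this writes two files, with the suffixes .prune.in and .prune.out
    ld.prune.in    # SNPs kept after pruning (pairwise r2 below the threshold)
    ld.prune.out   # SNPs removed
    
    # assumes the SNP IDs have the form chrom_pos
    awk 'BEGIN{ FS="_";OFS="\t"}{print $1,$2}' ld.prune.in > keep_SNP.list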
    
    bgzip input.vcf
    tabix -p vcf input.vcf.gz
    
    bcftools view -R keep_SNP.list input.vcf.gz > ld_pruning.vcf
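
The step that builds keep_SNP.list splits each ID at every underscore, which breaks when the chromosome name itself contains one. Assuming the IDs have the form `<chrom>_<pos>`, a sketch that splits at the last underscore only is shown below. The function name `ids_to_regions` is made up for this example.

```shell
# ids_to_regions: turn IDs of the assumed form <chrom>_<pos> into "chrom<TAB>pos",
# splitting at the LAST underscore so names like scaffold_12 stay intact.
ids_to_regions() {
    awk '{
        n = split($0, part, "_")
        pos = part[n]
        chrom = substr($0, 1, length($0) - length(pos) - 1)
        print chrom "\t" pos
    }'
}

printf 'Chr1_12345\nscaffold_12_678\n' | ids_to_regions
```

In the workflow above this would replace the awk step, e.g. `ids_to_regions < ld.prune.in > keep_SNP.list`.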
    
    XX
