YXF-Ontology and Biological Data

Author: 大宝贝喜欢徐先生 | Published: 2017-11-11 21:07

    This has all felt a bit fuzzy to me, so I'll start by writing down the parts I have figured out.

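    An ontology gives each thing or phenomenon one agreed-upon name, so that everyone describes it with the same word and nobody is confused. In other words, it defines terms. Two ontologies matter here: the Sequence Ontology (SO), which names sequence features, and the Gene Ontology (GO), which names gene functions.
      Gene Ontology:
      It links a gene to one or more of its functions.
      It has three parts: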

    1. cellular component: where does the gene product exert its effect?

    2. molecular function: how does it work?

    3. biological process: what is the purpose of the gene product?
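      The Gene Ontology is a directed acyclic graph (DAG), not a cycle: one term can be related to several other terms, including more than one parent.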
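The graph structure described above can be sketched in a few lines of Python. This is only a toy illustration: the term names (`root`, `term_a`, and so on) are invented placeholders, not real GO identifiers, and the single "parent" relation stands in for the several relation types a real ontology file may contain.

```python
from typing import Dict, List, Set

# Toy graph: child term -> list of parent terms.
# All names are invented placeholders, not real GO IDs.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "term_a": ["root"],
    "term_b": ["root"],
    "term_c": ["term_a"],
    # A term with two parents: this is why GO is a graph and not a tree.
    "term_d": ["term_c", "term_b"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("term_d", PARENTS)))
```

Because the graph has no cycles, the walk always ends at the root. A gene annotated with `term_d` is implicitly annotated with all of its ancestors as well.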
      GO data:
      It consists of a gene ontology definition file and a gene association file.
      GO association file format: GAF format
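As a rough illustration of what a GAF file looks like to a program, here is a minimal reader, assuming GAF 2.x: tab-separated lines with 17 columns, and comment lines that start with `!`. The dictionary keys are my own labels for the columns, and the example record is invented rather than copied from a real annotation file.

```python
from typing import Dict, Iterable, Iterator

# Labels (my own naming) for the 17 tab-separated GAF 2.x columns, in file order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per annotation line, skipping '!' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short lines so missing trailing columns become empty strings.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


# Invented example: a header comment plus one made-up annotation record.
example = [
    "!gaf-version: 2.1",
    "\t".join([
        "DB", "ID0001", "geneA", "", "GO:0003674", "REF:0000001", "ND",
        "", "F", "example gene product", "", "protein", "taxon:9606",
        "20171111", "DB", "", "",
    ]),
]

for record in parse_gaf(example):
    print(record["db_object_symbol"], record["go_id"], record["aspect"])
```

The `aspect` column holds a single letter (P, F, or C) for the three GO parts: biological process, molecular function, and cellular component.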
      Functional analysis:
      ORA (over-representation analysis): finds the functions that are over-represented in a list of genes
      FCS (functional class scoring):
      Gene set enrichment:
      The process of discovering the common characteristics potentially present in a list of genes.
      Tools: AgriGO, DAVID, Panther, goatools, ermineJ, GOrilla, ToppFun
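To make the ORA idea concrete, here is a small sketch of one common way to score over-representation: a one-sided hypergeometric test. The notes above do not say which statistic each tool uses, so the choice of test is an assumption on my part, and the numbers in the example are invented.

```python
from math import comb


def ora_p_value(population: int, annotated: int, study: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes that carry the term
    study:      size of the gene list being tested
    hits:       genes in the list that carry the term
    """
    total = comb(population, study)
    upper = min(annotated, study)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, study - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Invented numbers: 20 background genes, 5 of them annotated with a term,
# and a study list of 4 genes, 3 of which carry the term.
p = ora_p_value(population=20, annotated=5, study=4, hits=3)
print(f"p = {p:.4f}")
```

A small p-value suggests the term appears in the list more often than chance alone would explain. Real tools also correct for testing many terms at once, which this sketch leaves out.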

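    Data format
      Major biological data repositories include GenBank, which is hosted by NCBI.
      DNA sequence data is coordinated by the INSDC (International Nucleotide Sequence Database Collaboration), whose members are NCBI, EMBL, and DDBJ.
      The protein sequence database is UniProt (Universal Protein Resource).
      In addition, PDB (Protein Data Bank) is the repository for 3D structures of biological macromolecules.
      Automate data access:
      Sequencing data formats: GenBank, FASTA, FASTQ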
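      FASTA format

    1. A record starts with ">"

    2. The ">" is followed by a string of characters, the sequence ID

    3. The header may also include some descriptive text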
      Some rules:

    1. Sequence lines should not be too long

    2. Sequence lines should all wrap at the same width

    3. Use upper-case letters
      Some FASTA headers include structured information.
      Lower-case letters may be used to mark repetitive regions in a genome.
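The FASTA layout and wrapping rules above can be sketched as a tiny reader and writer. This is a minimal illustration, not a replacement for a real parser: it assumes the ID is the header text up to the first whitespace, and the default wrap width of 60 is an arbitrary choice of mine.

```python
from typing import Iterable, Iterator, List, Tuple


def _record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # ID = header text up to the first whitespace; the rest is description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def format_fasta(identifier: str, description: str, sequence: str,
                 width: int = 60) -> str:
    """Format one record with every sequence line wrapped at `width`."""
    header = f">{identifier} {description}".rstrip()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + wrapped)


# Invented toy data; the lower-case bases mimic a soft-masked repeat.
example = [">seq1 toy record", "ACGTACGTAC", "GTACGT", ">seq2", "ggccAATT"]
for identifier, description, sequence in read_fasta(example):
    print(format_fasta(identifier, description, sequence, width=10))
```

The reader joins wrapped lines back into one string per record, so the wrap width only matters again when the record is written out.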
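      FASTQ format
      A record has four parts:

    1. A header line that starts with "@"

    2. The sequence itself

    3. A "+" sign, optionally followed by the same ID as on the first line

    4. Characters that encode the quality of the second part; this line has the same length as the second line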
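The four-part layout above can be checked with a short reader. This sketch makes two assumptions that the notes do not state: each record occupies exactly four lines (no wrapped sequences), and quality characters use the Phred+33 encoding, so a score is the character's ASCII code minus 33. The example read is invented.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) per four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue
        # Missing lines default to "" so truncated input fails the checks below.
        sequence = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        quality = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding (score = ASCII code - 33).
        scores = [ord(ch) - 33 for ch in quality]
        identifier = header[1:].split(None, 1)[0] if len(header) > 1 else ""
        yield identifier, sequence, scores


# Invented toy read.
example = ["@read1 toy", "ACGT", "+", "II5!"]
for identifier, sequence, scores in read_fastq(example):
    print(identifier, sequence, scores)
```

The length check enforces the rule from part 4 that the quality line must be exactly as long as the sequence line.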

    How to get data
      Where to get data: NCBI, Ensembl, BioMart, UCSC Table Browser
      FASTQ manipulation
      Overview data:
      seqkit stat *.gz
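For a feel of what such an overview contains, here is a pure-Python sketch that computes similar summary numbers (sequence count and total, minimum, average, and maximum length) for FASTA data. It is not how seqkit works internally, and the function names are my own; it is only meant to show what the figures mean.

```python
import gzip
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths: List[int] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a new record begins
        elif line and lengths:
            lengths[-1] += len(line)   # add this line to the current record
    return lengths


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Count and length statistics for a list of sequence lengths."""
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0,
                "avg_len": 0.0, "max_len": 0}
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": total / len(lengths),
        "max_len": max(lengths),
    }


def summarize_gzipped_fasta(path: str) -> Dict[str, float]:
    # "rt" opens the gzip stream in text mode, so lines are str, not bytes.
    with gzip.open(path, "rt") as handle:
        return summarize(fasta_lengths(handle))


# Invented toy input: two records of length 6 and 2.
print(summarize(fasta_lengths([">a", "ACGT", "AC", ">b", "GG"])))
```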
      There are too many FASTA/Q manipulations to cover here, so I only list what you can do with a FASTA/Q file; the answers are in Chapter 7 of the Biostar Handbook.
      How to get the GC content of every sequence in a FASTA/Q file?
      How to extract a subset of sequences from a FASTA/Q file with name/ID list file?
      How to find FASTA/Q sequences containing degenerate bases and locate them?
      How to remove FASTA/Q records with duplicated sequences?
      How to locate motif/subsequence/enzyme digest sites in FASTA/Q sequence?
      How to sort a huge number of FASTA sequences by length?
      How to split FASTA sequences according to information in the header?
      How to search and replace within a FASTA header using character strings from a text file?
      How to extract paired reads from two paired-end reads files?
      How to concatenate two FASTA sequences into one?
      You can follow the answers in the Biostar Handbook if you want to do any of the above.
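To show that a few of the questions above come down to simple operations, here is a pure-Python sketch of four of them: GC content, selecting records by an ID list, removing duplicated sequences, and sorting by length. This is not the handbook's answer, which uses command-line tools. Records are plain `(identifier, sequence)` tuples, and ignoring case when comparing sequences is a choice of mine.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def gc_content(sequence: str) -> float:
    """Percentage of G and C among all bases, ignoring case."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return 100.0 * (upper.count("G") + upper.count("C")) / len(upper)


def select_by_id(records: Iterable[Record], wanted: Set[str]) -> List[Record]:
    """Keep only records whose identifier is in `wanted`."""
    return [record for record in records if record[0] in wanted]


def remove_duplicate_sequences(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen: Set[str] = set()
    unique: List[Record] = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


def sort_by_length(records: Iterable[Record]) -> List[Record]:
    """Shortest first; ties keep their original order (stable sort)."""
    return sorted(records, key=lambda record: len(record[1]))


# Invented toy records.
records = [("a", "GGCC"), ("b", "ATAT"), ("c", "ggcc"), ("d", "ACG")]
for identifier, sequence in records:
    print(identifier, f"{gc_content(sequence):.1f}")
print(select_by_id(records, {"b", "d"}))
print(remove_duplicate_sequences(records))
print(sort_by_length(records))
```

Holding every record in memory is fine for small files. For a huge FASTA file a dedicated tool is the more practical route.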
