Naming conventions for reference genome and annotation files in the Ensembl database

Author: 泼皮混混 | Published: 2022-02-13 18:55


1. Reference genome file naming

#### README ####

IMPORTANT: Please note you can download subsets of data via the
BioMart data mining tool.
See https://www.ensembl.org/info/data/biomart/ for more information.

The genome assembly represented here corresponds to GenBank Assembly ID 
GCA_000001635.9

#######################
Fasta DNA dumps
#######################

-----------
FILE NAMES
------------
The files are consistently named following this pattern:
   <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz

<species>:   The systematic name of the species.
<assembly>:  The assembly build name.
<sequence type>:
  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA.  Interspersed repeats and low
     complexity regions are detected with the RepeatMasker tool and masked
     by replacing repeats with 'N's.
  * 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions
    have been replaced with lowercased versions of their nucleic bases.
<id type>: One of the following:
  * 'chromosome'     - The top-level coordinate system in most species in Ensembl
  * 'nonchromosomal' - Contains DNA that has not been assigned a chromosome
  * 'seqlevel'       - This is usually sequence scaffolds, chunks or clones.
     -- 'scaffold'   - Larger sequence contigs from the assembly of shorter
        sequencing reads (often from whole genome shotgun, WGS) which could
        not yet be assembled into chromosomes. Often more genome sequencing
        is needed to narrow gaps and establish a tiling path.
     -- 'chunk' -  While contig sequences can be assembled into large entities,
        they sometimes have to be artificially broken down into smaller entities
        called 'chunks'. This is due to limitations in the annotation
        pipeline and the finite record size imposed by MySQL which stores the
        sequence and annotation information.
     -- 'clone' - In general this is the smallest sequence entity.  It is often
        identical to the sequence of one BAC clone, or sequence region
        of one BAC clone which forms the tiling path.
<id>:     The actual sequence identifier. Depending on the <id type> the <id>
          could represent the name of a chromosome, a scaffold, a contig, a clone, etc.
          The field is empty for seqlevel files.
fa: All files in these directories represent FASTA database files
gz: All files are compacted with GNU Zip for storage efficiency.
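
The three sequence types differ only in how repeats are written. As a minimal sketch (the function names and the toy sequence are made up for illustration and are not part of Ensembl), soft-masked text can be turned into a hard-masked or an unmasked string like this. It assumes the hard-masked and soft-masked files mask the same regions, which the README suggests but does not state outright:

```python
def to_hard_masked(soft_masked: str) -> str:
    """Replace every lowercase (soft-masked) base with 'N'."""
    return "".join("N" if base.islower() else base for base in soft_masked)


def to_unmasked(soft_masked: str) -> str:
    """Recover the plain sequence by upper-casing the soft-masked bases."""
    return soft_masked.upper()


def masked_fraction(sequence: str) -> float:
    """Fraction of positions that are lowercase (soft-masked) or 'N'."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)


# Toy sequence for illustration only, not real Ensembl data.
toy = "ACGT" + "acgtac" + "GGCC"
print(to_hard_masked(toy))   # ACGTNNNNNNGGCC
print(to_unmasked(toy))      # ACGTACGTACGGCC
print(masked_fraction(toy))
```

Most aligners ignore case, so a 'dna_sm' file can usually stand in for the 'dna' file, whereas 'dna_rm' discards the repeat sequence for good.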


EXAMPLES
   The genomic sequence of human chromosome 1:
     Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz

   The masked version of the genome sequence on human chromosome 1
   (contains '_rm' or '_sm' in the name):
     Homo_sapiens.GRCh37.dna_rm.chromosome.1.fa.gz
     Homo_sapiens.GRCh37.dna_sm.chromosome.1.fa.gz

   Non-chromosomal assembly sequences:
   e.g. mitochondrial genome, sequence contigs not yet mapped on chromosomes
     Homo_sapiens.GRCh37.dna.nonchromosomal.fa.gz
     Homo_sapiens.GRCh37.dna_rm.nonchromosomal.fa.gz
     Homo_sapiens.GRCh37.dna_sm.nonchromosomal.fa.gz

---------
TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromosomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.

EXAMPLES

  Toplevel sequences unmasked:
    Homo_sapiens.GRCh37.dna.toplevel.fa.gz
  
  Toplevel soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
    Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz

-----------------
PRIMARY ASSEMBLY
-----------------
Primary assembly contains all toplevel sequence regions excluding haplotypes
and patches. This file is best used for performing sequence similarity searches
where patch and haplotype sequences would confuse analysis. If the primary
assembly file is not present, that indicates that there are no haplotype/patch
regions, and the 'toplevel' file is equivalent.

EXAMPLES

  Primary assembly sequences unmasked:
    Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
  
  Primary assembly soft/hard masked sequences:
    Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
    Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
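
The naming pattern can be taken apart with a regular expression. The sketch below is one possible reading of the README, not an official Ensembl parser. The names `FASTA_NAME` and `parse_fasta_name` are hypothetical, and the expression assumes that the species field contains no dots while the assembly field may:

```python
import re
from typing import Dict, Optional

# Hypothetical pattern derived from the README text above.
FASTA_NAME = re.compile(
    r"^(?P<species>[^.]+)\."
    r"(?P<assembly>.+)\."
    r"(?P<sequence_type>dna_rm|dna_sm|dna)\."
    r"(?P<id_type>chromosome|nonchromosomal|seqlevel|toplevel|primary_assembly)"
    r"(?:\.(?P<id>.+))?"          # <id> is absent for whole-genome files
    r"\.fa\.gz$"
)


def parse_fasta_name(filename: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the name fields as a dict, or None if the name does not match."""
    match = FASTA_NAME.match(filename)
    return match.groupdict() if match else None


for name in (
    "Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz",
    "Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz",
    "Homo_sapiens.GRCh37.dna_rm.nonchromosomal.fa.gz",
):
    print(parse_fasta_name(name))
```

For read alignment, a common choice is to look for a `primary_assembly` file first and fall back to `toplevel` only when no primary assembly file exists, since the README states the two are equivalent in that case.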

--------------
SPECIAL CASES
--------------
Some chromosomes have alternate haplotypes which are presented in files with 
the haplotype sequence only:
   Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR6_MHC_QBL.fa.gz
   Homo_sapiens.GRCh37.dna_rm.chromosome.HSCHR17_1.fa.gz

All alternative assembly and patch regions have their sequence padded 
with N's so that alignment programs report coordinates that are correct
for the whole chromosome.

e.g. A patch region with a start position of 1,000,001 will have 1e6 N's added
to its start so an alignment program will report coordinates with respect to the
whole chromosome.
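
The effect of the padding can be checked with a few lines of code. This is only an illustration of the arithmetic: `pad_patch` is a made-up helper and the eight-base patch is a toy sequence.

```python
def pad_patch(patch_sequence: str, chromosome_start: int) -> str:
    """Prefix a patch with N's so that 0-based index i is chromosome position i + 1."""
    if chromosome_start < 1:
        raise ValueError("chromosome_start is 1-based and must be >= 1")
    return "N" * (chromosome_start - 1) + patch_sequence


# A patch that starts at chromosome position 1,000,001 gets 1,000,000 N's.
padded = pad_patch("ACGTACGT", 1_000_001)
hit_index = padded.find("ACGT")   # 0-based offset of the first match
print(hit_index + 1)              # 1000001, the 1-based chromosome coordinate
```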

The human Y chromosome has been sequenced, and the pseudoautosomal region (PAR)
on the Y is annotated.  By definition the PAR region is identical on the 
X and Y chromosome.  The Y chromosome file contains the Y chromosome 
minus these repeated PAR regions, i.e. the unique portion of Y.

2. Annotation file naming

2.1. GTF file naming

#### README ####

--------
GTF DUMP
--------

This directory includes a summary of the gene annotation information 
and GTF format.

Ensembl provides an automatic gene annotation for Mus musculus.
For some species (human, mouse, zebrafish, pig and rat), the
annotation provided through Ensembl also includes manual annotation
from HAVANA.
In the case of human and mouse, the GTF files found here are equivalent
to the GENCODE gene set.

GTF provides access to all annotated transcripts which make
up an Ensembl gene set. Annotation is based on alignments of
biological evidence (e.g. proteins, cDNAs, RNA-seq) to a genome assembly.
The annotation dumped here is transcribed and translated from the 
genome assembly and is not the original input sequence data that 
we used for alignment. Therefore, the sequences provided by Ensembl 
may differ from the original input sequence data where the genome 
assembly is different to the aligned sequence. 

Additionally, we provide a GTF file containing the predicted gene set
as generated by Genscan and other ab initio prediction tools.
This file is identified by the 'abinitio' extension.


-----------
FILE NAMES
------------
The files are consistently named following this pattern:
   <species>.<assembly>.<version>.gtf.gz

<species>:       The systematic name of the species.
<assembly>:      The assembly build name.
<version>:       The version of Ensembl from which the data was exported.
gtf : All files in these directories are in GTF format
gz : All files are compacted with GNU Zip for storage efficiency.

e.g.
Homo_sapiens.GRCh38.81.gtf.gz

For the predicted gene set, an additional abinitio flag is added to the file name.
<species>.<assembly>.<version>.abinitio.gtf.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gtf.gz
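
As a rough sketch, the annotation file names can be split into their fields as follows. The pattern and the names `ANNOTATION_NAME` and `parse_annotation_name` are assumptions made for this example. The sketch treats the version as an integer release number and does not cover per-chromosome files or other variants that may exist on the FTP site:

```python
import re
from typing import Dict, Optional, Union

# Hypothetical pattern based on the README description above.
ANNOTATION_NAME = re.compile(
    r"^(?P<species>[^.]+)\."
    r"(?P<assembly>.+)\."
    r"(?P<version>\d+)"
    r"(?P<abinitio>\.abinitio)?"
    r"\.(?P<format>gtf|gff3)\.gz$"
)


def parse_annotation_name(filename: str) -> Optional[Dict[str, Union[str, int, bool]]]:
    """Return species, assembly, Ensembl release, format and the abinitio flag."""
    match = ANNOTATION_NAME.match(filename)
    if match is None:
        return None
    return {
        "species": match.group("species"),
        "assembly": match.group("assembly"),
        "version": int(match.group("version")),
        "format": match.group("format"),
        "abinitio": match.group("abinitio") is not None,
    }


print(parse_annotation_name("Homo_sapiens.GRCh38.81.gtf.gz"))
print(parse_annotation_name("Homo_sapiens.GRCh38.81.abinitio.gtf.gz"))
```

The pattern also accepts the `gff3` suffix, because the GFF3 README later in this post describes the same naming scheme.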

--------------------------------
Definition and supported options
--------------------------------

The GTF (General Transfer Format) is an extension of GFF version 2 
and used to represent transcription models. GFF (General Feature Format) 
consists of one line per feature, each containing 9 columns of data. 

Fields

Fields are tab-separated. Also, all but the final field in each 
feature line must contain a value; "empty" columns are denoted 
with a '.'

    seqname   - name of the chromosome or scaffold; chromosome names 
                are given without a 'chr' prefix
    source    - name of the program that generated this feature, or 
                the data source (database or project name)
    feature   - feature type name. Current allowed features are
                {gene, transcript, exon, CDS, Selenocysteine, start_codon,
                stop_codon and UTR}
    start     - start position of the feature, with sequence numbering 
                starting at 1.
    end       - end position of the feature, with sequence numbering 
                starting at 1.
    score     - a floating point value indicating the score of a feature
    strand    - defined as + (forward) or - (reverse).
    frame     - one of '0', '1' or '2'. Frame indicates the number of base pairs
                before you encounter a full codon. '0' indicates the feature 
                begins with a whole codon. '1' indicates there is an extra
                base (the 3rd base of the prior codon) at the start of this feature.
                '2' indicates there are two extra bases (2nd and 3rd base of the 
                prior codon) before the first codon. All values are given with
                relation to the 5' end.
    attribute - a semicolon-separated list of tag-value pairs (separated by a space), 
                providing additional information about each feature. A key can be
                repeated multiple times.
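
A minimal sketch of reading one feature line into the nine fields listed above. `GtfRecord` and `parse_gtf_line` are names invented for this example. The code assumes real tab characters between columns and treats '.' as a missing value:

```python
from typing import NamedTuple, Optional


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int                 # 1-based
    end: int                   # 1-based, inclusive
    score: Optional[float]     # None when the column is '.'
    strand: str
    frame: Optional[int]       # None when the column is '.'
    attribute: str             # raw text of the ninth column

    @property
    def length(self) -> int:
        """Feature length in bases; both ends are inclusive."""
        return self.end - self.start + 1


def parse_gtf_line(line: str) -> Optional[GtfRecord]:
    """Parse one GTF line; return None for blank lines and '#' lines."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = columns
    return GtfRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        attribute,
    )


example = "\t".join([
    "11", "ensembl_havana", "CDS", "5422201", "5423151", ".", "+", "0",
    'gene_id "ENSG00000167360"; gene_version "4";',
])
record = parse_gtf_line(example)
print(record.feature, record.start, record.end, record.length)
```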

Attributes

The following attributes are available. All attributes are semi-colon
separated pairs of keys and values.

- gene_id: The stable identifier for the gene
- gene_version: The stable identifier version for the gene
- gene_name: The official symbol of this gene
- gene_source: The annotation source for this gene
- gene_biotype: The biotype of this gene
- transcript_id: The stable identifier for this transcript
- transcript_version: The stable identifier version for this transcript
- transcript_name: The symbol for this transcript derived from the gene name
- transcript_source: The annotation source for this transcript
- transcript_biotype: The biotype for this transcript
- exon_id: The stable identifier for this exon
- exon_version: The stable identifier version for this exon
- exon_number: Position of this exon in the transcript
- ccds_id: CCDS identifier linked to this transcript
- protein_id: Stable identifier for this transcript's protein
- protein_version: Stable identifier version for this transcript's protein
- tag: A collection of additional key value tags
- transcript_support_level: Ranking to assess how well a transcript is supported (from 1 to 5)
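
Because a key such as `tag` can occur more than once, it is safer to collect attribute values in lists than in a plain key-to-string mapping. The following is a sketch under that assumption. `parse_gtf_attributes` is a hypothetical helper, and the regular expression accepts both quoted and bare values:

```python
import re
from collections import defaultdict
from typing import Dict, List

# key, whitespace, then either a "quoted" value or a bare token, then ';'
ATTRIBUTE_PAIR = re.compile(
    r'\s*(?P<key>[^\s;]+)\s+(?:"(?P<quoted>[^"]*)"|(?P<bare>[^;"\s]+))\s*;?'
)


def parse_gtf_attributes(text: str) -> Dict[str, List[str]]:
    """Collect every value for every key, preserving the order of appearance."""
    collected: Dict[str, List[str]] = defaultdict(list)
    for match in ATTRIBUTE_PAIR.finditer(text):
        value = match.group("quoted")
        if value is None:
            value = match.group("bare")
        collected[match.group("key")].append(value)
    return dict(collected)


attributes = parse_gtf_attributes(
    'gene_id "ENSG00000167360"; gene_version "4"; '
    'transcript_id "ENST00000300778"; tag "CCDS"; tag "basic";'
)
print(attributes["gene_id"][0])   # ENSG00000167360
print(attributes["tag"])          # ['CCDS', 'basic']
```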

Tags

Tags are additional flags used to indicate attributes of the transcript.

- CCDS: Flags this transcript as one linked to a CCDS record
- seleno: Flags that this transcript has a Selenocysteine edit. Look for the Selenocysteine
feature for the position of this on the genome
- cds_end_NF: the coding region end could not be confirmed
- cds_start_NF: the coding region start could not be confirmed
- mRNA_end_NF: the mRNA end could not be confirmed
- mRNA_start_NF: the mRNA start could not be confirmed.
- basic: the transcript is part of the gencode basic geneset

Comments

Lines may be commented out by the addition of a single # character at the start. These
lines should be ignored by your parser.

Pragmas/Metadata

GTF files can contain meta-data. In the case of experimental meta-data these are 
noted by a #!. Those which are stable are noted by a ##. Meta data is a single key,
a space and then the value. Current meta data keys are:

* genome-build -  Build identifier of the assembly e.g. GRCh37.p11
* genome-version - Version of this assembly e.g. GRCh37
* genome-date - The date of this assembly's release e.g. 2009-02
* genome-build-accession - The accession and source of this accession e.g. NCBI:GCA_000001405.14
* genebuild-last-updated - The date of the last genebuild update e.g. 2013-09
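
The header pragmas can be gathered before the feature lines are parsed. The sketch below assumes that all metadata lines sit at the top of the file and that key and value are separated by whitespace. `read_metadata` is an illustrative name, and the sample lines reuse the example values quoted above:

```python
from typing import Dict, Iterable


def read_metadata(lines: Iterable[str]) -> Dict[str, str]:
    """Collect '#!key value' and '##key value' lines until the first feature line."""
    metadata: Dict[str, str] = {}
    for line in lines:
        if line.startswith("#!") or line.startswith("##"):
            parts = line[2:].strip().split(None, 1)
            if parts:
                metadata[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        elif line.startswith("#") or not line.strip():
            continue   # ordinary comment or blank line
        else:
            break      # first feature line: the header is over
    return metadata


header = [
    "#!genome-build GRCh37.p11",
    "#!genome-version GRCh37",
    "#!genome-date 2009-02",
    "11\tensembl_havana\tgene\t5422111\t5423206\t.\t+\t.\tgene_id \"ENSG00000167360\";",
]
print(read_metadata(header))
```

For a compressed file, the same function can be fed from `gzip.open(path, "rt")`, which yields text lines.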

------------------
Example GTF output
------------------

#!genome-build GRCh38
11      ensembl_havana  gene    5422111 5423206 .       +       .       gene_id "ENSG00000167360"; gene_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
11      ensembl_havana  transcript      5422111 5423206 .       +       .       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11      ensembl_havana  exon    5422111 5423206 .       +       .       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381"; exon_id "ENSE00001276439"; exon_version "4";
11      ensembl_havana  CDS     5422201 5423151 .       +       0       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381"; protein_id "ENSP00000300778"; protein_version "4";
11      ensembl_havana  start_codon     5422201 5422203 .       +       0       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11      ensembl_havana  stop_codon      5423152 5423154 .       +       0       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; exon_number "1"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11      ensembl_havana  UTR     5422111 5422200 .       +       .       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";
11      ensembl_havana  UTR     5423155 5423206 .       +       .       gene_id "ENSG00000167360"; gene_version "4"; transcript_id "ENST00000300778"; transcript_version "4"; gene_name "OR51Q1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "OR51Q1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31381";

2.2. GFF file naming

#### README ####

-----------------------
GFF FLATFILE DUMPS
-----------------------
Gene annotation is provided in GFF3 format. Detailed specification of
the format is maintained by the Sequence Ontology:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

GFF3 files are validated using GenomeTools: http://genometools.org

For chromosomal assemblies, in addition to a file containing all
genes, there are per-chromosome files. If a predicted geneset is
available (generated by Genscan and other ab initio tools), these
genes are in a separate 'abinitio' file.


The 'type' of gene features is:
 * "gene" for protein-coding genes
 * "ncRNA_gene" for RNA genes
 * "pseudogene" for pseudogenes
The 'type' of transcript features is:
 * "mRNA" for protein-coding transcripts
 * a specific type of RNA transcript such as "snoRNA" or "lnc_RNA"
 * "pseudogenic_transcript" for pseudogenes
All transcripts are linked to "exon" features.
Protein-coding transcripts are linked to "CDS", "five_prime_UTR", and
"three_prime_UTR" features.

Attributes for feature types:
(square brackets indicate data which is not available for all features)
 * region types:
    * ID: Unique identifier, format "<region_type>:<region_name>"
    * [Alias]: A comma-separated list of aliases, usually including the
      INSDC accession
    * [Is_circular]: Flag to indicate circular regions
 * gene types:
    * ID: Unique identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * gene_id: Ensembl gene stable ID
    * version: Ensembl gene version
    * [Name]: Gene name
    * [description]: Gene description
 * transcript types:
    * ID: Unique identifier, format "transcript:<transcript_stable_id>"
    * Parent: Gene identifier, format "gene:<gene_stable_id>"
    * biotype: Ensembl biotype, e.g. "protein_coding", "pseudogene"
    * transcript_id: Ensembl transcript stable ID
    * version: Ensembl transcript version
    * [Note]: If the transcript sequence has been edited (i.e. differs
      from the genomic sequence), the edits are described in a note.
 * exon
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * exon_id: Ensembl exon stable ID
    * version: Ensembl exon version
    * constitutive: Flag to indicate if exon is present in all
      transcripts
    * rank: Integer that shows the 5'->3' ordering of exons
 * CDS
    * ID: Unique identifier, format "CDS:<protein_stable_id>"
    * Parent: Transcript identifier, format "transcript:<transcript_stable_id>"
    * protein_id: Ensembl protein stable ID
    * version: Ensembl protein version

Metadata:
 * genome-build - Build identifier of the assembly e.g. GRCh37.p11
 * genome-version - Version of this assembly e.g. GRCh37
 * genome-date - The date of the release of this assembly e.g. 2009-02
 * genome-build-accession - Genome accession e.g. GCA_000001405.14
 * genebuild-last-updated - Date of the last genebuild update e.g. 2013-09

-----------
FILE NAMES
------------
The files are consistently named following this pattern:
   <species>.<assembly>.<version>.gff3.gz

<species>:       The systematic name of the species. 
<assembly>:      The assembly build name.
<version>:       The version of Ensembl from which the data was exported.
gff3 : All files in these directories are in GFF3 format
gz : All files are compacted with GNU Zip for storage efficiency.

e.g. 
Homo_sapiens.GRCh38.81.gff3.gz

For the predicted gene set, an additional abinitio flag is added to the file name.
<species>.<assembly>.<version>.abinitio.gff3.gz

e.g.
Homo_sapiens.GRCh38.81.abinitio.gff3.gz

------------------
Example GFF3 output
------------------

##gff-version 3
#!genome-build  Pmarinus_7.0
#!genome-version Pmarinus_7.0
#!genome-date 2011-01
#!genebuild-last-updated 2013-04

GL476399        Pmarinus_7.0    supercontig     1       4695893 .       .       .       ID=supercontig:GL476399;Alias=scaffold_71
GL476399        ensembl gene    2596494 2601138 .       +       .       ID=gene:ENSPMAG00000009070;Name=TRYPA3;biotype=protein_coding;description=Trypsinogen A1%3B Trypsinogen a3%3B Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:O42608];logic_name=ensembl;version=1
GL476399        ensembl transcript      2596494 2601138 .       +       .       ID=transcript:ENSPMAT00000010026;Name=TRYPA3-201;Parent=gene:ENSPMAG00000009070;biotype=protein_coding;version=1
GL476399        ensembl exon    2596494 2596538 .       +       .       Name=ENSPMAE00000087923;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=-1;rank=1;version=1
GL476399        ensembl exon    2598202 2598361 .       +       .       Name=ENSPMAE00000087929;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=2;ensembl_phase=1;rank=2;version=1
GL476399        ensembl exon    2599023 2599282 .       +       .       Name=ENSPMAE00000087937;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;rank=3;version=1
GL476399        ensembl exon    2599814 2599947 .       +       .       Name=ENSPMAE00000087952;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;rank=4;version=1
GL476399        ensembl exon    2600895 2601138 .       +       .       Name=ENSPMAE00000087966;Parent=transcript:ENSPMAT00000010026;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;rank=5;version=1
GL476399        ensembl CDS     2596499 2596538 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2598202 2598361 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599023 2599282 .       +       1       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2599814 2599947 .       +       2       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl CDS     2600895 2601044 .       +       0       ID=CDS:ENSPMAP00000009982;Parent=transcript:ENSPMAT00000010026
GL476399        ensembl five_prime_UTR  2596494 2596498 .       +       .       Parent=transcript:ENSPMAT00000010026
GL476399        ensembl three_prime_UTR 2601045 2601138 .       +       .       Parent=transcript:ENSPMAT00000010026
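
The phase column of the five CDS rows above can be reproduced from the coordinates alone. The sketch assumes the usual GFF3 meaning of phase (the number of bases to skip at the start of the segment to reach the next whole codon) and a forward-strand transcript whose CDS segments are listed in 5' to 3' order. `next_phase` is an illustrative helper, not an Ensembl function:

```python
def next_phase(start: int, end: int, phase: int) -> int:
    """Phase of the following CDS segment, given this segment's span and phase."""
    length = end - start + 1             # GFF3 coordinates are 1-based, inclusive
    leftover = (length - phase) % 3      # bases of an unfinished codon at the 3' end
    return (3 - leftover) % 3


# (start, end, phase) copied from the CDS rows of the example output.
cds_segments = [
    (2596499, 2596538, 0),
    (2598202, 2598361, 2),
    (2599023, 2599282, 1),
    (2599814, 2599947, 2),
    (2600895, 2601044, 0),
]

for (start, end, phase), (_, _, following) in zip(cds_segments, cds_segments[1:]):
    assert next_phase(start, end, phase) == following

total = sum(end - start + 1 for start, end, _ in cds_segments)
print(total, total % 3)   # 744 0
```

On the reverse strand the same rule applies, but the segments have to be visited from the highest coordinate to the lowest.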
