Reference Human Genome DNA

Author: 浩瀚之宇 | Published: 2018-11-22 10:11

    GRCh38 vs hg38


    1. Introduction
    2. File names and contents
    3. Sequence names
    4. Metadata tag-value pairs
    5. Definitions

    1. Introduction

    The files in this directory provide the FASTA format sequences for a
    genome assembly in a package convenient for use by various Next
    Generation Sequence read alignment pipelines. The sequence names,
    sequence order, and format of the sequence definition lines, were
    developed in consultation with several developers and major users of
    alignment pipelines.

    2. File names and contents

    GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz is a gzipped
    file that contains FASTA format sequences for the following:
    1. chromosomes from the GRCh38 Primary Assembly unit.
    Note: the two PAR regions on chrY have been hard-masked with Ns. The
    chromosome Y sequence provided therefore has the same coordinates as
    the GenBank sequence but it is not identical to the GenBank sequence.
    Similarly, duplicate copies of centromeric arrays and WGS on
    chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns.
    2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
    3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
    4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
    5. Epstein-Barr virus (EBV) sequence
    Note: The EBV sequence is not part of the genome assembly but is
    included in the analysis set as a sink for alignment of reads that
    are often present in sequencing samples.

    GCA_000001405.15_GRCh38_full_analysis_set.fna.gz is a gzipped file
    that contains all the same FASTA formatted sequences as
    GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
    6. alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units.
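
    To see which sequences a given analysis set actually contains, it is
    enough to scan the definition lines of the gzipped FASTA file. The
    following is a minimal sketch using only the Python standard library;
    the function names are illustrative and are not part of any NCBI or
    UCSC tool.

```python
import gzip


def iter_deflines(handle):
    """Yield FASTA definition lines (without the leading '>') from a text handle."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].rstrip("\r\n")


def deflines_from_gzip(path):
    """Yield definition lines from a gzipped FASTA file at `path`."""
    # "rt" opens the gzip stream in text mode, so lines arrive as str.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        yield from iter_deflines(handle)


# Usage (assumes the file has been downloaded to the working directory):
# for defline in deflines_from_gzip(
#         "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"):
#     print(defline.split()[0])
```

    Only the definition lines are kept in memory, so the scan works on the
    full file without loading any sequence data.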

    3. Sequence names

    The sequence names in the analysis sets follow UCSC-style naming:

    chr{chromosome number or name}
    e.g. chr1 or chrX
    chrM for the mitochondrial genome.

    Unlocalized scaffolds:
    chr{chromosome number or name}_{sequence_accession}v{sequence_version}_random
    e.g. chr17_GL000205v2_random

    Unplaced scaffolds:
    chrUn_{sequence_accession}v{sequence_version}
    e.g. chrUn_GL000220v1

    Alternate loci scaffolds:
    chr{chromosome number or name}_{sequence_accession}v{sequence_version}_alt
    e.g. chr6_GL000250v2_alt
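
    The naming scheme above is regular enough to classify with a few
    regular expressions. The sketch below is one possible way to do so. The
    category labels and the accession pattern (uppercase letters followed
    by digits, as in GL000205) are assumptions made for illustration, and
    any name that matches no pattern, such as the EBV sequence, is reported
    as "other".

```python
import re

# Chromosome part: a number, X or Y (chrM is handled separately).
_CHROM = r"(?P<chrom>[0-9]+|X|Y)"
# Accession and version, e.g. GL000205v2 (assumed shape: letters + digits).
_ACC = r"(?P<acc>[A-Z]+[0-9]+)v(?P<ver>[0-9]+)"

_PATTERNS = [
    ("unlocalized", re.compile(rf"^chr{_CHROM}_{_ACC}_random$")),
    ("alt", re.compile(rf"^chr{_CHROM}_{_ACC}_alt$")),
    ("unplaced", re.compile(rf"^chrUn_{_ACC}$")),
    ("mitochondrion", re.compile(r"^chrM$")),
    ("chromosome", re.compile(rf"^chr{_CHROM}$")),
]


def classify(name):
    """Return (category, parsed fields) for an analysis-set sequence name."""
    for category, pattern in _PATTERNS:
        match = pattern.match(name)
        if match:
            return category, match.groupdict()
    return "other", {}


# Usage:
# classify("chr17_GL000205v2_random")
# classify("chrUn_GL000220v1")
```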

    4. Metadata tag-value pairs

    The FASTA definition lines contain sequence metadata in a series of
    space-separated tag-value pairs.

    Tag Value

    AC: sequence accession.version
    gi: sequence gi
    LN: sequence length
    rg: region
    - chromosome to which unlocalized scaffolds are assigned,
    e.g. chr1
    - region on chromosome within which alt-scaffolds or patch
    scaffolds are placed, e.g. chr6:28696604-33335493
    - not present for chromosomes, other replicons, or unplaced
    scaffolds
    rl: role of the sequence in the assembly
    - possible values are: Chromosome, Mitochondrion, unlocalized,
    unplaced, alt-scaffold, fix-patch, novel-patch, decoy
    M5: md5 checksum of the sequence as a single string of uppercase
    letters without line breaks (as produced by Samtools or Picard)
    AS: assembly-name
    hm: hard-masked regions, either a single span, two spans separated by
    a comma, or "multiple" if more than two spans were hard-masked
    tp: topology
    - circular for chrM and chrEBV
    - not present for linear chromosomes and scaffolds
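
    The tag-value layout can be parsed with a few lines of Python. The
    sketch below splits each field on its first colon only, because some
    values (such as the rg region) contain colons themselves. It also shows
    how an M5-style checksum and the hm spans might be handled. All
    function names are illustrative, and the definition line in the usage
    comment is a made-up example, not a line copied from the real file.

```python
import hashlib


def parse_defline(defline):
    """Split a FASTA definition line into (sequence name, {tag: value})."""
    fields = defline.lstrip(">").split()
    name, tags = fields[0], {}
    for field in fields[1:]:
        # partition() splits on the FIRST colon, so "rg:chr6:1-2"
        # becomes tag "rg" with value "chr6:1-2".
        tag, sep, value = field.partition(":")
        if sep:
            tags[tag] = value
    return name, tags


def parse_hard_masked(value):
    """Turn an hm value into a list of (start, end) spans.

    Returns None for "multiple", where the spans are not listed.
    """
    if value == "multiple":
        return None
    spans = []
    for span in value.split(","):
        start, end = span.split("-")
        spans.append((int(start), int(end)))
    return spans


def sequence_md5(sequence_lines):
    """MD5 of the sequence as one uppercase string without line breaks."""
    digest = hashlib.md5()
    for line in sequence_lines:
        digest.update(line.strip().upper().encode("ascii"))
    return digest.hexdigest()


# Usage (illustrative values only):
# name, tags = parse_defline(
#     ">chr6_GL000250v2_alt  AC:GL000250.2  LN:100  "
#     "rg:chr6:28696604-33335493  rl:alt-scaffold  AS:GRCh38")
# tags["rg"] is then "chr6:28696604-33335493"
```

    Comparing sequence_md5() against the M5 value is a cheap way to confirm
    that a local copy of a sequence matches the one described in the
    definition line.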

    5. Definitions

    Unlocalized sequence:
    A sequence found in an assembly that is associated with a specific
    chromosome but cannot be ordered or oriented on that chromosome.

    Unplaced sequence:
    A sequence found in an assembly that is not associated with any
    chromosome.

    Alternate locus:
    A scaffold that provides an alternate representation of a locus found
    in the primary assembly. These sequences do not represent a complete
    chromosome sequence although there is no hard limit on the size of the
    alternate locus; currently these are less than 1 Mb.

    Major release:
    The formal release of a genome assembly, e.g. GRCh38.

    Minor release:
    A release of a genome assembly including patches that occurs between
    major releases.

    Genome Patch:
    A sequence contig/scaffold that corrects sequence in a major release
    of the genome, or adds sequence to it.

    FIX patch:
    A patch that corrects sequence or reduces an assembly gap in a given
    major release. FIX patch sequences are meant to be incorporated into
    the primary or existing alt-loci assembly units at the next major
    release.

    NOVEL patch:
    A patch that adds sequence to a major release. Typically, NOVEL patch
    sequences are meant to be incorporated into the assembly as new
    alternate loci at the next major release.

    Decoy:
    A sequence that is not part of the genome assembly but is included in
    an analysis set as a sink for alignment of reads that are often
    present in sequencing samples.

    Name Last modified Size Description

      [hg38.analysisSet.2bit](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.2bit)              27-Jan-2014 10:40  770M  
      [hg38.analysisSet.chroms.tar.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.chroms.tar.gz)     27-Jan-2014 11:02  905M  
      [hg38.fullAnalysisSet.2bit](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.fullAnalysisSet.2bit)          18-Mar-2014 13:23  797M  
      [hg38.fullAnalysisSet.chroms.tar.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.fullAnalysisSet.chroms.tar.gz) 18-Mar-2014 13:41  936M  
      [md5sum.txt](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/md5sum.txt)                         18-Mar-2014 13:41  250   
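
    The md5sum.txt file in the listing can be used to check a download. The
    sketch below assumes md5sum.txt uses the usual output layout of the
    md5sum utility (a hex digest, whitespace, then the file name); that
    layout is an assumption, and the function names are illustrative.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until an empty chunk (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse md5sum-style text into {file name: hex digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            checksum, name = parts
            # md5sum marks binary-mode entries with a leading '*'.
            expected[name.lstrip("*")] = checksum.lower()
    return expected


# Usage (assumes both files sit in the working directory):
# with open("md5sum.txt") as handle:
#     expected = parse_md5sum(handle.read())
# ok = file_md5("hg38.analysisSet.2bit") == expected["hg38.analysisSet.2bit"]
```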




