cutadapt 使用详解

作者: JeremyL | 来源:发表于2018-05-27 01:48 被阅读20次

    Cutadapt安装

    #使用pip安装
    $ pip install --user --upgrade cutadapt
    #使用conda安装
    $ conda install -c bioconda cutadapt
    

    基本命令

    Usage:
        cutadapt -a ADAPTER options input.fastq
        cutadapt -a AACCGGTT input.fastq > output.fastq
    For paired-end reads:
        cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
    
    输入输出文件格式
    .fasta, .fa, .fna, .fastq, .fq, .gz, .bz2, .xz
    多核运行(Python 3.3 or later)
    -j N
    

    例子:

    1.1 修剪reads 5'端adapter
    cutadapt -g ADAPTER -O 5 -e 0 -o sample.trimmed.fastq sample.fastq --minimum-length 35
    --discard-untrimmed --info-file=reads.adapter.txt
    
    1.2 修剪reads 3'端adapter
    cutadapt -a ADAPTER -O 5 -e 0 -o sample.trimmed.fastq sample.fastq --minimum-length 35
    --discard-untrimmed --info-file=reads.adapter.txt
    
    2 双端测序
    cutadapt -g ADAPTER_FWD -G ADAPTER_REV  -O 5 -e 0 -o sample.trimmed.1.fastq -p sample.trimmed.2.fastq sample.reads_1.fastq sample.reads_2.fastq --minimum-length 35
    
    --discard-untrimmed --info-file=reads.adapter.txt
    
    3 去掉poly-A
    cutadapt -a "A{10}" -o output.fastq input.fastq
    
    4 修剪reads 5’ 或3’ 固定长度
    cutadapt -u 5 -o trimmed.fastq reads.fastq
    cutadapt -u -7 -o trimmed.fastq reads.fastq
    
    5 修剪低质量reads
    cutadapt -q 10 -o output.fastq input.fastq
    cutadapt -q 15,10 -o output.fastq input.fastq
    
    6 修剪reads 3’ 端碱基使得reads到一定长度(10bp)
    cutadapt -l 10 -o output.fastq.gz input.fastq.gz
    
    7 Multiple adapters #多种adapter
    cutadapt -g TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq #最佳匹配adapter移除
    cutadapt -g file:adapters.fasta -o output.fastq input.fastq
    
    8 Demultiplexing #adapter解析;reads根据adapter不同输出到不同的文件
    cutadapt -a one=TATA -a two=GCGC -o trimmed-{name}.fastq.gz input.fastq.gz
    #result files: demulti-one.fastq.gz, demulti-two.fastq.gz and demulti-unknown.fastq.gz
    
    cutadapt -a file:barcodes.fasta --no-trim --untrimmed-o untrimmed.fastq.gz -o trimmed-{name}.fastq.gz input.fastq.gz
    #result files: trimmed-first.1.fastq.gz, trimmed-second.1.fastq.gz, trimmed-unknown.1.fastq.gz and trimmed-first.2.fastq.gz, trimmed-second.2.fastq.gz, trimmed-unknown.2.fastq.gz
    
    9 从一个read修剪多个adapter
    cutadapt -g ^ADAPTER -n 5 -o output.fastq.gz input.fastq.gz
    

    常用参数

    -g: #剪切reads 5'端adapter,加$表示congreads 5'端第一个碱基匹配adapter(无法adapter部分匹配识别)
    -a: #剪切reads 3'端adapter,加$表示congreads 3'端第一个碱基(无法adapter部分匹配识别)
    -O MINLENGTH, --overlap=MINLENGTH #adapter与reads最小overlap,才算成功识别; Default: 3
    -m LENGTH, --minimum-length=LENGTH: 根据最短长度筛选reads;Default: 0
    --discard-untrimmed, --trimmed-only #丢掉不包含adapter的reads
     -e ERROR_RATE, --error-rate=ERROR_RATE  #adapter匹配允许的最大错配率(错配/匹配片段长度);Default: 0.1
    --no-trim: 不修剪adapter,直接输出满足跳进啊的reads
    -u LENGTH, --cut=LENGTH:  #修剪reads 5'/3'端碱基,正数:从开始除移除碱基;负数:从末尾处移除碱基;
    -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF: #修剪低质量碱基
    -l LENGTH, --length=LENGTH: #将reads修剪的最终长度
    --trim-n: #修剪reads末端的'N'
    -o FILE, --output=FILE: #输出文件
    --info-file=FILE:每条reads和与reads匹配的adapter的信息
    --too-short-output=FILE: #为reads长度最大值设定阈值筛选reads后,要丢弃的部分输出到文件;         --untrimmed-output=FILE: #将没有adapter未做修剪的reads输出到一个文件;默认输出到trimmed reads结果文件
    --max-n=COUNT:#reads中N的数量,设定整数或小数(N的占比)
    
    双端测序参数
    -A ADAPTER:  #第二条reads 3'adapter
    -G ADAPTER:#第二条reads 5'adapter
    -U LENGTH: #从第二条reads上修剪的长度
    -p FILE, --paired-output=FILE: #第二条reads的输出结果
    --untrimmed-paired-output=FILE:#第一条reads没有adapter时,将第二条reads输出到文件;默认输出到trimmed reads结果文件   
    
    

    参数详解

    $ cutadapt --help
    Options:
      --version           
      -h, --help            
      --debug               Print debugging information.
      -f FORMAT, --format=FORMAT  #文件格式
                            Input file format:'fasta', 'fastq' or 'sra-fastq'.
                            Ignored when reading csfasta/qual files.
                            Default: auto-detect from file name extension.
    
      Finding adapters::
        Parameters -a, -g, -b specify adapters to be removed from each read
        (or from the first read in a pair if data is paired). If specified
        multiple times, only the best matching adapter is trimmed (but see the
        --times option). When the special notation 'file:FILE' is used,
        adapter sequences are read from the given FASTA file.
    
        -a ADAPTER, --adapter=ADAPTER #剪切reads 3'端adapter,加$表示congreads 3'端第一个碱基(无法adapter部分匹配识别)
                            Sequence of an adapter ligated to the 3' end (paired
                            data: of the first read). The adapter and subsequent
                            bases are trimmed. If a '$' character is appended
                            ('anchoring'), the adapter is only found if it is a
                            suffix of the read.匹配adapter
        -g ADAPTER, --front=ADAPTER #剪切reads 5'端adapter,加$表示congreads 5'端第一个碱基匹配adapter(无法adapter部分匹配识别)
                            Sequence of an adapter ligated to the 5' end (paired
                            data: of the first read). The adapter and any
                            preceding bases are trimmed. Partial matches at the 5'
                            end are allowed. If a '^' character is prepended
                            ('anchoring'), the adapter is only found if it is a
                            prefix of the read.
        -b ADAPTER, --anywhere=ADAPTER #adapter在 5'和3'端都可能出现时使用,慎用
                            Sequence of an adapter that may be ligated to the 5'
                            or 3' end (paired data: of the first read). Both types
                            of matches as described under -a und -g are allowed.
                            If the first base of the read is part of the match,
                            the behavior is as with -g, otherwise as with -a. This
                            option is mostly for rescuing failed library preparations
                            -do not use if you know which end your adapter was ligated to!       
        -e ERROR_RATE, --error-rate=ERROR_RATE  #adapter匹配允许的最大错配率(错配/匹配片段长度)
                            Maximum allowed error rate (no. of errors divided by
                            the length of the matching region). Default: 0.1
        --no-indels         Allow only mismatches in alignments. Default: allow
                            both mismatches and indels 禁止adapter发生Insertions和deletions
        -n COUNT, --times=COUNT #从reads行修剪adapter次数
                            Remove up to COUNT adapters from each read. Default: 1
        -O MINLENGTH, --overlap=MINLENGTH #adapter与reads最小overlap,才算成功识别;Default: 3
                            If the overlap between the read and the adapter is
                            shorter than MINLENGTH, the read is not modified.
                            Reduces the no. of bases trimmed due to random adapter
                            matches. Default: 3
        --match-read-wildcards
                            Interpret IUPAC wildcards in reads. Default: False
        -N, --no-match-adapter-wildcards
                            Do not interpret IUPAC wildcards in adapters.
        --no-trim           Match and redirect reads to output/untrimmed-output as
                            usual, but do not remove adapters. #不修剪adapter输出reads
        --mask-adapter      Mask adapters with 'N' characters instead of trimming
                            them. #识别adapter后,用'N'代替adapter
    
      Additional read modifications:
        -u LENGTH, --cut=LENGTH  #修剪reads 5'/3'端碱基,正数:从开始除移除碱基;负数:从末尾处移除碱基;
                            Remove bases from each read (first read only if
                            paired). If LENGTH is positive, remove bases from the
                            beginning. If LENGTH is negative, remove bases from
                            the end. Can be used twice if LENGTHs have different
                            signs.
        --nextseq-trim=3'CUTOFF
                            NextSeq-specific quality trimming (each read). Trims
                            also dark cycles appearing as high-quality G bases
                            (EXPERIMENTAL).
        -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF #修剪低质量碱基
                            Trim low-quality bases from 5' and/or 3' ends of each
                            read before adapter removal. Applied to both reads if
                            data is paired. If one value is given, only the 3' end
                            is trimmed. If two comma-separated cutoffs are given,
                            the 5' end is trimmed with the first cutoff, the 3'
                            end with the second.
        --quality-base=QUALITY_BASE
                            Assume that quality values in FASTQ are encoded as
                            ascii(quality + QUALITY_BASE). This needs to be set to
                            64 for some old Illumina FASTQ files. Default: 33
        -l LENGTH, --length=LENGTH #将reads修剪的最终长度
                            Shorten reads to LENGTH. This and the following
                            modificationsare applied after adapter trimming. 
        --trim-n            Trim N's on ends of reads. #修剪reads末端的'N'
        --length-tag=TAG    Search for TAG followed by a decimal number in the
                            description field of the read. Replace the decimal
                            number with the correct length of the trimmed read.
                            For example, use --length-tag 'length=' to correct
                            fields like 'length=123'. #修剪reads后,用此参数修改reads的长度
        --strip-suffix=STRIP_SUFFIX
                            Remove this suffix from read names if present. Can be
                            given multiple times.
        -x PREFIX, --prefix=PREFIX #为reads名添加前缀,使用{name}添加adapter名字
                            Add this prefix to read names. Use {name} to insert
                            the name of the matching adapter.
        -y SUFFIX, --suffix=SUFFIX #为reads名添加后缀,使用{name}添加adapter名字
                            Add this suffix to read names; can also include {name}
    
      Filtering of processed reads:
        -m LENGTH, --minimum-length=LENGTH #修建前后,reads短于最短长度时,丢弃这对reads
                            Discard trimmed reads that are shorter than LENGTH.
                            Reads that are too short even before adapter removal
                            are also discarded. In colorspace, an initial primer
                            is not counted. Default: 0
        -M LENGTH, --maximum-length=LENGTH #修建前后,reads长于最大长度时,丢弃这对reads
                            Discard trimmed reads that are longer than LENGTH.
                            Reads that are too long even before adapter removal
                            are also discarded. In colorspace, an initial primer
                            is not counted. Default: no limit
         --max-n=COUNT       Discard reads with too many N bases. If COUNT is an
                            integer, it is treated as the absolute number of N
                            bases. If it is between 0 and 1, it is treated as the
                            proportion of N's allowed in a read. #reads中N的数量,设定整数或小数(N的占比)
        --discard-trimmed, --discard #丢弃只有一个adapter的reads
                            Discard reads that contain an adapter. Also use -O to
                            avoid discarding too many randomly matching reads!
        --discard-untrimmed, --trimmed-only #丢掉不包含adapter的reads
                            Discard reads that do not contain the adapter.  
    
      Output:
        --quiet             Print only error messages.
        -o FILE, --output=FILE #输出文件
                            Write trimmed reads to FILE. FASTQ or FASTA format is
                            chosen depending on input. The summary report is sent
                            to standard output. Use '{name}' in FILE to
                            demultiplex reads into multiple files. Default: write
                            to standard output
        --info-file=FILE    Write information about each read and its adapter
                            matches into FILE. #每条reads与adapter的匹配信息
        -r FILE, --rest-file=FILE
                            When the adapter matches in the middle of a read,
                            write the rest (after the adapter) into FILE.
        --wildcard-file=FILE
                            When the adapter has N bases (wildcards), write
                            adapter bases matching wildcard positions to FILE.
                            When there are indels in the alignment, this will
                            often not be accurate.
        --too-short-output=FILE #为reads长度最大值设定阈值筛选reads后,要丢弃的部分输出到文件;
                            Write reads that are too short (according to length
                            specified by -m) to FILE. Default: discard reads
        --too-long-output=FILE #将依据最长长度筛选reads后,要丢弃的部分输出到文件;
                            Write reads that are too long (according to length
                            specified by -M) to FILE. Default: discard reads
        --untrimmed-output=FILE #将没有adapter未做修剪的reads输出到一个文件,而不是输出到trimmed reads结果文件
                            Write reads that do not contain any adapter to FILE.
                            Default: output to same file as trimmed reads
    
      Colorspace options:
        -c, --colorspace    Enable colorspace mode: Also trim the color that is
                            adjacent to the found adapter.
        -d, --double-encode
                            Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).
        -t, --trim-primer   Trim primer base and the first color (which is the
                            transition to the first nucleotide)
        --strip-f3          Strip the _F3 suffix of read names
        --maq, --bwa        MAQ- and BWA-compatible colorspace output. This
                            enables -c, -d, -t, --strip-f3 and -y '/1'.
        --no-zero-cap       Do not change negative quality values to zero in
                            colorspace data. By default, they are since many tools
                            have problems with negative qualities.
        -z, --zero-cap      Change negative quality values to zero. This is
                            enabled by default when -c/--colorspace is also
                            enabled. Use the above option to disable it.
    
      Paired-end options:  #双端测序参数
        The -A/-G/-B/-U options work like their -a/-b/-g/-u counterparts, but
        are applied to the second read in each pair.
    
        -A ADAPTER          3' adapter to be removed from second read in a pair. #第二条reads 3'adapter
        -G ADAPTER          5' adapter to be removed from second read in a pair. #第二条reads 5'adapter
        -B ADAPTER          5'/3 adapter to be removed from second read in a pair.#第二条reads 5'/3'adapter
        -U LENGTH           Remove LENGTH bases from second read in a pair #从第二条reads上修剪的长度
        -p FILE, --paired-output=FILE #第二条reads的输出结果
                            Write second read in a pair to FILE. 
        --pair-filter=(any|both)
                            Which of the reads in a paired-end read have to match
                            the filtering criterion in order for it to be
                            filtered. Default: any
        --interleaved       Read and write interleaved paired-end reads.
        --untrimmed-paired-output=FILE #第一条reads没有adapter时,将第二条reads输出到文件;默认输出到trimmed reads结果文件 
                            Write second read in a pair to this FILE when no
                            adapter was found in the first read. Use this option
                            together with --untrimmed-output when trimming paired-
                            end reads. Default: output to same file as trimmed
                            reads
        --too-short-paired-output=FILE #将reads2中太短的reads输出到文件
                            Write second read in a pair to this file if pair is
                            too short. Use together with --too-short-output.
        --too-long-paired-output=FILE #将reads2中太长的reads输出到文件
                            Write second read in a pair to this file if pair is
                            too long. Use together with --too-long-output.
    
    

    Removing adapters

    Adapter 移除

    Adapter type Command-line option
    3’ adapter -a ADAPTER
    5’ adapter -g ADAPTER
    Anchored 3’ adapter -a ADAPTER$
    Anchored 5’ adapter -g ^ADAPTER
    5’ or 3’ (both possible) -b ADAPTER
    Linked adapter -a ADAPTER1...ADAPTER2
    Non-anchored linked adapter -g ADAPTER1...ADAPTER2

    3’ adapters修剪

    Reads Adapter Reault
    MYSEQUEN -a ADAPTER MYSEQUEN
    MYSEQUENCEADAP MYSEQUENCE
    MYSEQUENCEADAPTER MYSEQUENCE
    MYSEQUENCEADAPTERSOMETHINGELSE MYSEQUENCE
    ADAPTERSOMETHING 全部修剪导致无意义,reads保留到output

    5’ adapters修剪

    Reads Adapter Reault
    ADAPTERMYSEQUENCE ADAPTER MYSEQUENCE
    DAPTERMYSEQUENCE ADAPTER MYSEQUENCE
    DAPTERMYSEQUENCE ADAPTER MYSEQUENCE
    DAPTERMYSEQUENCE ADAPTER MYSEQUENCE
    SOMETHINGADAPTER ADAPTER 全部修剪导致无意义,reads保留到output

    Anchored 5’ adapters #锚定5’ adapters修剪

    识别的barcode Adapter
    ADAPTER -g ^ADAPTER
    ADAPT -g ^ADAPTER
    ADA -g ^ADAPTER

    允许错配的话,可能会允许插入的发生:

    BADAPTERSOMETHING 也可被-g ^ADAPTER识别,为防止这种现象发生;可以加参数--no-indels
    

    Anchored 3’ adapters #锚定3’ adapters修剪

    reads adapter result
    MYSEQUENCEADAP -a ADAPTER$ MYSEQUENCEADAP
    MYSEQUENCEADAPTER -a ADAPTER$ MYSEQUENCE
    MYSEQUENCEADAPTERSOMETHINGELSE -a ADAPTER$ MYSEQUENCEADAPTERSOMETHINGELSE

    Linked adapters (combined 5’ and 3’ adapter)

    本情况下,5’ 的ADAPTER1是an anchored 5’ adapter

    -a ADAPTER1...ADAPTER2: #5’ 正常修剪,3’ 只有 5’存在adapter才会发生修剪;

    -a ADAPTER1...ADAPTER2$: #只有 5’和3’ 都存在adapter才会发生修剪;

    cutadapt -a FIRST...SECOND -o output.fastq input.fastq
    
    reads adapter result
    FIRSTMYSEQUENCESECONDEXTRABASES -a FIRST...SECOND MYSEQUENCE
    FIRSTMYSEQUENCESEC MYSEQUENCE
    FIRSTMYSEQUE MYSEQUENCE
    ANOTHERREADSECOND MYSEQUENCE

    Linked adapters without anchoring

    -g ADAPTER1...ADAPTER2

    • The 5’ adapter is not anchored by default. (So neither the 5’ nor 3’ adapter are anchored.) #5’和3’都不是锚定的

    • Both adapters are required. If one of them is not found, the read is not trimmed. #5’和3’同时存在的,adapter才会被剪切

      此模式下,可以丢掉不包含adapter的reads;使用参数:

      --discard-untrimmed, --trimmed-only
      Discard reads that do not contain the adapter.

    Linked adapter statistics

    === Adapter 1 ===
    Sequence: AAAAAAAAA...TTTTTTTTTT; Type: linked; Length: 9+10; Trimmed: 3 times; Half matches: 2
    

    只适用于 (non-anchored) 3’ adapters情况下;

    Half matches:只有5’ adapters的情况;

    5’ or 3’ adapters

    -b ADAPTER (or use the longer spelling --anywhere ADAPTER).

    adapter允许出现在reads的任何位置;允许adapter降解;允许partially match;如果adapter前面还有碱基,adapter归于3’ adapters

    Read before trimming Read after trimming Detected adapter type
    MYSEQUENCEADAPTERSOMETHING MYSEQUENCE 3’ adapter
    MYSEQUENCEADAPTER MYSEQUENCE 3’ adapter
    MYSEQUENCEADAP MYSEQUENCE 3’ adapter
    MADAPTER M 3’ adapter
    ADAPTERMYSEQUENCE MYSEQUENCE 5’ adapter
    PTERMYSEQUENCE MYSEQUENCE 5’ adapter
    TERMYSEQUENCE MYSEQUENCE 5’ adapter

    Error tolerance

    3种情况:mismatches, insertions and deletions

    -e ERROR_RATE, --error-rate=ERROR_RATE #adapter匹配允许的最大错配率=错配/匹配片段长度(adapter全长)
    Maximum allowed error rate (no. of errors divided by
    the length of the matching region). Default: 0.1

    --no-indels:禁止adapter发生Insertions和deletions

    Sequence: 'LONGADAPTER'; Length: 11; Trimmed: 2 times.
    
    No. of allowed errors:
    0-9 bp: 0; 10-11 bp: 1
    

    11bp adapter,前9bp需要完全匹配,10-11 bp允许1bp错配;

    40bp adapter,0.1的最大错误率,允许最多错误4bp,这4bp不是均匀分给reads各个位置的;

    Multiple adapter occurrences within a single read

    5’ and 3’ adapters均识别最左边的adapter

    Reducing random matches

    -O MINLENGTH, --overlap=MINLENGTH #adapter与reads最小overlap,才算成功识别,最少3bp,再少就丢掉reads

    Wildcards

    使用N,Y,T这种通配符

    cutadapt -a ACGTAANNNNTTAGC -o output.fastq input.fastq
    

    Repeated bases in the adapter sequence

    可以去除ploy-A

    AAAAAAAAAA)等价于A{10}

    cutadapt -a "A{100}" -o output.fastq input.fastq
    

    Modifying reads

    对reads的修剪

    Removing a fixed number of bases #移除一定数目碱基

    -u LENGTH, --cut=LENGTH #修剪reads 5'/3' 碱基,正数表示从 5'移除碱基,负数表示从3'移除碱基

    cutadapt -u 5 -o trimmed.fastq reads.fastq
    

    Quality trimming

    修剪低质量碱基,默认phred quality + 33;使用64时,需加参数:--quality-base=64

    -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff=[5'CUTOFF,]3'CUTOFF

    cutadapt -q 10 -o output.fastq input.fastq #默认是3'进行质量修剪 
    
    cutadapt -q 15,10 -o output.fastq input.fastq #5'/3' 低质量碱基都修剪
    
    cutadapt -q 15,0 -o output.fastq input.fastq #5'进行质量修剪 
    

    Quality trimming of reads using two-color chemistry (NextSeq)

    cutadapt --nextseq-trim=20 -o out.fastq input.fastq
    

    Quality trimming algorithm

    一段序列质量编码如下:

    42, 40, 26, 27, 8, 7, 11, 4, 2, 3

    首先减去阈值10

    32, 30, 16, 17, -2, -3, 1, -6, -8, -7

    从末端开始累加,

    (70), (38), 8, -8, -25, -23, -20, -21, -15, -7

    因为-25 最小,所以保留-25 之前的碱基, 即保留前4位碱基

    Shortening reads to a fixed length

    -l LENGTH, --length=LENGTH #修剪reads到固定长度

    cutadapt -l 10 -o output.fastq.gz input.fastq.gz
    

    Modifying read names

    修改reads名字

    -y SUFFIX, --suffix=SUFFIX #为reads名添加后缀,使用{name}添加adapter名字

    cutadapt -a adapter1=ACGT -y ' we found {name}' input.fastq
    

    如果reads名中有reads长度,使用参数--length-tag 'length='更新

    Read modification order

    Filtering reads

    reads的过滤

    --minimum-length LENGTH or -m LENGTH #根据最短长度筛选reads;

    --too-short-output FILE #为reads长度最小值设定阈值筛选reads后,要丢弃的部分输出到文件;

    --maximum-length LENGTH or -M LENGTH #根据最长长度筛选reads;

    --too-long-output FILE #为reads长度最大值设定阈值筛选reads后,要丢弃的部分输出到文件;

    --untrimmed-output FILE #将没有adapter未做修剪的reads输出到一个文件,而不是输出到trimmed reads结果文件

    --discard-trimmed #丢弃只有一个adapter的reads

    --discard-trimmed 等价于 --untrimmed-output /dev/null #丢弃没有adapter的reads

    Trimming paired-end reads

    cutadapt -a ADAPTER_FWD -A ADAPTER_REV -o out.1.fastq -p out.2.fastq reads.1.fastq reads.2.fastq
    

    -U: 修剪第二条reads的碱基;

    对成对的reads都起作用的参数:

    • -q (along with --quality-base)
    • --times applies to all the adapters given
    • --no-trim
    • --trim-n
    • --mask
    • --length
    • --length-tag
    • --prefix, --suffix
    • --strip-f3
    • --colorspace, --bwa, -z, --no-zero-cap, --double-encode, --trim-primer

    双端测序reads的过滤:

    Filtering option With --pair-filter=any, the pair is discarded if … With -pair-filter=both, the pair is discarded if …
    --minimum-length one of the reads is too short both reads are too short
    --maximum-length one of the reads is too long both reads are too long
    --discard-trimmed one of the reads contains an adapter both reads contain an adapter
    --discard-untrimmed one of the reads does not contain an adapter both reads do not contain an adapter
    --discard-untrimmed one of the reads does not contain an adapter both reads do not contain an adapter

    交叉双端测序reads:

    也是双端测序,只是所有reads在一个文件;read1在read2前面;

    cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.fastq reads.fastq
    
    cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq -p trimmed.2.fastq reads.fastq
    

    正常双端测序,输出这种格式文件:

    cutadapt --interleaved -q 20 -a ACGT -A TGCA -o trimmed.1.fastq reads.1.fastq reads.2.fastq
    

    Legacy paired-end read trimming #双端测序修剪

    双端测序文件单独处理:(双端测序没有共同的参数的处理)

    cutadapt -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
    cutadapt -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
    
    cutadapt -q 10 -a ADAPTER_FWD -o trimmed.1.fastq reads1.fastq
    cutadapt -q 15 -a ADAPTER_REV -o trimmed.2.fastq reads2.fastq
    

    双端测序文件单独处理:(双端测序经过共同参数的处理)

    #先处理read1文件,使用-p生成临时文件
    cutadapt -q 10 -a ADAPTER_FWD --minimum-length 20 -o tmp.1.fastq -p tmp.2.fastq reads.1.fastq reads.2.fastq
    #先处理read1文件,使用临时文件作为输入
    cutadapt -q 15 -a ADAPTER_REV --minimum-length 20 -o trimmed.2.fastq -p trimmed.1.fastq tmp.2.fastq tmp.1.fastq
    #移除临时文件
    rm tmp.1.fastq tmp.2.fastq
    

    Multiple adapters

    多条adapter

    两条3’ adapters

    cutadapt -a TGAGACACGCA -a AGGCACACAGGG -o output.fastq input.fastq
    

    使用adapters文件,fasta格式

    cutadapt -a file:adapters.fasta -o output.fastq input.fastq
    
    • All given adapter sequences are matched to the read.
    • Adapter matches where the overlap length (see the -O parameter) is too small or where the error rate is too high (-e) are removed from further consideration.
    • Among the remaining matches, the one with the greatest number of matching bases is chosen.
    • If there is a tie, the first adapter wins. The order of adapters is the order in which they are given on the command line or in which they are found in the FASTA file.

    Named adapters

    cutadapt -a My_Adapter=AACCGGTT -o output.fastq input.fastq
    

    Demultiplexing

    cutadapt -a one=TATA -a two=GCGC -o trimmed-{name}.fastq.gz input.fastq.gz
    #ouput files:demulti-one.fastq.gz, demulti-two.fastq.gz, demulti-unknown.fastq.gz
    
    cutadapt -a file:barcodes.fasta --no-trim --untrimmed-o untrimmed.fastq.gz -o trimmed-{name}.fastq.gz input.fastq.gz
    

    --no-trim: #不修剪reads,保留adapter

    --untrimmed-output: #输出没有adapter的reads到新文件,默认输出到正常trimmed reads结果文件

    --untrimmed-paired-output:接受PE测序,没有adapter的reads2输出

    barcodes.fasta

    >barcode01
    TTAAGGCC
    >barcode02
    TAGCTAGC
    >barcode03
    ATGATGAT
    
    cutadapt -a first=AACCGG -a second=TTTTGG -A ACGTACGT -A TGCATGCA -o trimmed-{name}.1.fastq.gz -p trimmed-{name}.2.fastq.gz input.1.fastq.gz input.2.fastq.gz
    #ouput files:
    trimmed-first.1.fastq.gz, trimmed-second.1.fastq.gz, trimmed-unknown.1.fastq.gz
    trimmed-first.2.fastq.gz, trimmed-second.2.fastq.gz, trimmed-unknown.2.fastq.gz
    --untrimmed-paired-output ##输出没有adapter的reads到文件
    PE reads,一般情况下检测reads1的adapter决定这对pair reads;如果想考虑reads2,使用参数-A/-G
    

    Trimming more than one adapter from each read

    cutadapt -g ^ADAPTER -n 5 -o output.fastq.gz input.fastq.gz 
    #直到在reads中找到5次adapter就停止寻找
    #-n COUNT, --times=COUNT #从reads行修剪adapter次数
    cutadapt -g ^FIRST -a SECOND -n 2 ...
    

    Illumina TruSeq

    单端和双端reads1: A + the “TruSeq Indexed Adapter”

    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o trimmed.fastq.gz reads.fastq.gz
    

    PE reads:

    reads2: TruSeq Universal Adapter

    cutadapt \
                -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
                -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
                -o trimmed.1.fastq.gz -p trimmed.2.fastq.gz \
                reads.1.fastq.gz reads.2.fastq.gz
    
    

    Warning about incomplete adapter sequences

    WARNING:
        One or more of your adapter sequences may be incomplete.
        Please see the detailed output above.
    
    Bases preceding removed adapters:
      A: 95.5%
      C: 1.0%
      G: 1.6%
      T: 1.6%
      none/other: 0.3%
    WARNING:
        The adapter is preceded by "A" extremely often.
        The provided adapter sequence may be incomplete.
        To fix the problem, add "A" to the beginning of the adapter sequence.
    

    扩增子测序可以忽略这些警告。

    Dealing with N bases

    --max-n COUNT #丢掉超过一定数目N 的reads;

    --trim-n #修剪reads末端的'N'

    Bisulfite sequencing (RRBS)

    需要移除3’ 端除adapter外的两个碱基;

    cutadapt -a NNADAPTER -o output.fastq input.fastq
    

    Cutadapt’s output

    No. of allowed errors:
    0-7 bp: 0; 8-15 bp: 1; 16-20 bp: 2
    

    0-7 bp,不允许错配;8-15 bp,允许错配1bp; 16-20 bp允许错配 2bp;

    Overview of removed sequences
    length  count   expect  max.err error counts
    3       140     156.2   0       140
    4       57      39.1    0       57
    5       50      9.8     0       50
    6       35      2.4     0       35
    7       13      0.3     0       1 12
    8       31      0.1     1       0 31
    ...
    100     397     0.0     3       358 36 3
    

    Format of the info file

    --info-file=FILE Write information about each read and its adapter
    matches into FILE. #每条reads与adapter的匹配信息

    The alignment algorithm

    unit costs

    mismatches, insertions and deletions are counted as one error,这样就只需要一个参数(最大错误率)就可以判断是否进行下一步操作

    基本理念:在允许的错误率下 ,adapter与reads匹配的区域最大化;

    the procedure is as follows:

    1. Consider all possible overlaps between the two sequences and compute an alignment for each, minimizing the total number of errors in each one.
    2. Keep only those alignments that do not exceed the specified maximum error rate.
    3. Then, keep only those alignments that have a maximal number of matches (that is, there is no alignment with more matches).
    4. If there are multiple alignments with the same number of matches, then keep only those that have the smallest error rate.
    5. If there are still multiple candidates left, choose the alignment that starts at the leftmost position within the read.

    In Step 1, the different adapter types are taken into account: Only those overlaps that are actually allowed by the adapter type are actually considered.

    相关文章

      网友评论

        本文标题:cutadapt 使用详解

        本文链接:https://www.haomeiwen.com/subject/jkzejftx.html