conda install -c bioconda seqkit
Sort sequences by id/name/sequence/length.
By default, all records are read into memory.
For FASTA format, use flag -2 (--two-pass) to reduce memory usage. Two-pass mode
does not support FASTQ.
First, seqkit reads the sequence header and length information.
If the file is not a plain FASTA file,
seqkit writes the sequences to temporary files and creates a FASTA index.
Second, seqkit sorts the sequences by header and length information
and extracts them using the FASTA index.
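The two-pass idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not seqkit's actual implementation: the function names (`build_index`, `read_record`, `two_pass_sort`) are hypothetical, the index stores byte offsets rather than a real .fai file, and the input is assumed to be plain, uncompressed FASTA that ends with a newline.

```python
import io

def build_index(handle):
    """Pass 1: collect (header, sequence length, byte offset) without keeping sequences."""
    index = []
    header, length, start = None, 0, 0
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line or line.startswith(b">"):
            if header is not None:
                index.append((header, length, start))
            if not line:
                break
            header, length, start = line[1:].strip().decode(), 0, pos
        else:
            length += len(line.strip())
    return index

def read_record(handle, offset):
    """Pass 2 helper: seek to a record and return its raw bytes."""
    handle.seek(offset)
    chunks = [handle.readline()]  # header line
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        chunks.append(line)
    return b"".join(chunks)

def two_pass_sort(handle, by_length=False):
    """Sort the small index in memory, then pull records from the file in sorted order."""
    index = build_index(handle)
    if by_length:
        key = lambda entry: entry[1]
    else:
        key = lambda entry: entry[0].split()[0]  # ID = first word of the header
    ordered = sorted(index, key=key)
    return b"".join(read_record(handle, offset) for _, _, offset in ordered)

fasta = io.BytesIO(b">seq2 second\nACGTAC\nGT\n>seq1 first\nAC\n>seq3\nACGT\n")
print(two_pass_sort(fasta).decode())                  # ordered by ID
print(two_pass_sort(fasta, by_length=True).decode())  # ordered by length
```

Only the headers, lengths, and offsets are held in memory, which is why this approach needs a seekable plain file and does not work directly on a stream.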
Usage:
  seqkit sort [flags]
Flags:
  -b, --by-bases                by non-gap bases
  -l, --by-length               by sequence length
  -n, --by-name                 by full name instead of just id
  -s, --by-seq                  by sequence
  -G, --gap-letters string      gap letters (default "- \t.")
  -h, --help                    help for sort
  -i, --ignore-case             ignore case
  -k, --keep-temp               keep temporary FASTA and .fai file when using 2-pass mode
  -N, --natural-order           sort in natural order, when sorting by IDs/full name
  -r, --reverse                 reverse the result
  -L, --seq-prefix-length int   length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
  -2, --two-pass                two-pass mode: reads files twice to lower memory usage (only for FASTA format)
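To make two of the sort keys above more concrete, here is a small Python sketch of what "natural order" (-N) and "non-gap bases" (-b) mean. Both helper names are hypothetical, and the exact tie-breaking rules seqkit applies may differ; the gap letters are the default listed for -G.

```python
import re

def natural_key(name):
    """Split a name into text and number chunks so that 'chr2' sorts before 'chr10'."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions hold the digit runs.
    return [(0, int(p)) if i % 2 else (1, p) for i, p in enumerate(parts) if p]

def non_gap_length(seq, gap_letters="- \t."):
    """Count the characters of seq that are not gap letters."""
    return sum(1 for ch in seq if ch not in gap_letters)

names = ["chr10", "chr2", "chr1", "chrX"]
print(sorted(names))                   # plain string order puts chr10 before chr2
print(sorted(names, key=natural_key))  # natural order puts chr2 before chr10
print(non_gap_length("AC-GT..A"))      # gaps are ignored when counting
```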
Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it is automatically detected from the first sequence) (default "auto")
  -j, --threads int                     number of CPUs (can also be set with environment variable SEQKIT_THREADS) (default 4)
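The default --id-regexp pattern takes everything up to the first whitespace as the ID. A quick Python check of that behaviour follows. `parse_id` is a hypothetical helper, falling back to the whole header when nothing matches is an assumption, and the second pattern is only an illustrative custom regexp, not necessarily what --id-ncbi uses internally.

```python
import re

DEFAULT_ID_REGEXP = r"^(\S+)\s?"  # default shown in the help text, unescaped

def parse_id(header, pattern=DEFAULT_ID_REGEXP):
    """Return the first capture group; fall back to the whole header (assumption)."""
    match = re.search(pattern, header)
    return match.group(1) if match else header

ncbi_header = "gi|110645304|ref|NC_002516.2| Pseud..."

print(parse_id("seq1 Homo sapiens chromosome 1"))  # first word only
print(parse_id(ncbi_header))                       # default keeps the whole first field
print(parse_id(ncbi_header, r"\|([^|]+)\| "))      # custom pattern keeps the accession
```

With the default pattern, sorting by ID on NCBI-style headers compares the whole `gi|...|` field, so a custom pattern may be worth setting when only the accession should count.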
Time loading forward index: 00:00:22
Time loading reference: 00:00:03
Multiseed full-index search: 00:11:30
23576060 reads; of these:
  23576060 (100.00%) were paired; of these:
    1398071 (5.93%) aligned concordantly 0 times
    21068790 (89.37%) aligned concordantly exactly 1 time
    1109199 (4.70%) aligned concordantly >1 times
    ----
    1398071 pairs aligned concordantly 0 times; of these:
      39078 (2.80%) aligned discordantly 1 time
    ----
    1358993 pairs aligned 0 times concordantly or discordantly; of these:
      2717986 mates make up the pairs; of these:
        1728911 (63.61%) aligned 0 times
        892790 (32.85%) aligned exactly 1 time
        96285 (3.54%) aligned >1 times
96.33% overall alignment rate
Time searching: 00:11:34
Overall time: 00:11:56
Time loading forward index: 00:00:25
Time loading reference: 00:00:04
Multiseed full-index search: 00:08:33
23576060 reads; of these:
  23576060 (100.00%) were paired; of these:
    1390031 (5.90%) aligned concordantly 0 times
    21078357 (89.41%) aligned concordantly exactly 1 time
    1107672 (4.70%) aligned concordantly >1 times
    ----
    1390031 pairs aligned concordantly 0 times; of these:
      39187 (2.82%) aligned discordantly 1 time
    ----
    1350844 pairs aligned 0 times concordantly or discordantly; of these:
      2701688 mates make up the pairs; of these:
        1717885 (63.59%) aligned 0 times
        890091 (32.95%) aligned exactly 1 time
        93712 (3.47%) aligned >1 times
96.36% overall alignment rate
Time searching: 00:08:37
Overall time: 00:09:02
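The "overall alignment rate" in the two summaries above can be reproduced from the other lines. Each pair contributes two mates, so the rate is the number of aligned mates divided by twice the number of pairs. The function name and argument names below are hypothetical; the numbers are copied from the two logs.

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once, mate_once, mate_multi):
    """Percentage of mates that aligned, whether as a pair or individually."""
    # Pairs aligned concordantly or discordantly contribute both mates.
    aligned_mates = 2 * (conc_once + conc_multi + disc_once) + mate_once + mate_multi
    return 100.0 * aligned_mates / (2 * pairs)

run1 = overall_alignment_rate(23576060, 21068790, 1109199, 39078, 892790, 96285)
run2 = overall_alignment_rate(23576060, 21078357, 1107672, 39187, 890091, 93712)

print(f"run 1: {run1:.2f}%")  # matches the 96.33% reported in the first log
print(f"run 2: {run2:.2f}%")  # matches the 96.36% reported in the second log
```

The same figure follows from the unaligned mates alone: for the first run, 1 - 1728911 / (2 × 23576060) also gives 96.33%.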