1. Why use UNIX in bioinformatics?
Unix philosophy:
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
-- Doug McIlroy
Advantages:
- With modular workflows, it is easier to spot errors and to see where they occur.
- Modular workflows allow testing with alternative methods, e.g. swapping in a different aligner.
- Command-line tools can be combined to explore data interactively, with each language used where it fits best, e.g. Python for scripting and R for statistical analysis.
- Modular components are reusable and applicable to many types of data.
- Text streams allow us both to couple programs together into workflows and to process data without storing huge amounts of it in our computers' memory.
2. UNIX
There are many UNIX shells:
bash: widely available
zsh: has more advanced features
The Unix chainsaw
Be careful: a single space can completely change what a command does!
$ rm -rf tmp-data/aligned-reads* # deletes all old large files
$ # versus
$ rm -rf tmp-data/aligned-reads * # deletes entire current directory
# rm: tmp-data/aligned-reads: No such file or directory
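One way to guard against this kind of mistake is to preview what a glob expands to before handing it to rm. The sketch below does this with echo inside a throwaway directory; the directory layout and file names are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is at risk
# (the layout and file names below are hypothetical).
workdir=$(mktemp -d)
cd "$workdir"
mkdir tmp-data
touch tmp-data/aligned-reads-1.sam tmp-data/aligned-reads-2.sam notes.txt

# Preview: echo prints the arguments rm would receive, without deleting anything.
echo rm -rf tmp-data/aligned-reads*
echo rm -rf tmp-data/aligned-reads *    # the stray space pulls in every file via *

# Only after checking the expansion, run the real command.
rm -rf tmp-data/aligned-reads*
ls tmp-data    # the aligned-reads files are gone
ls             # notes.txt and tmp-data are untouched
```

The second echo makes the danger visible: the lone * expands to everything in the current directory.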
3. Working with streams and redirection
Bioinformatics data is usually large text data (A, G, T, C), so Unix text streams are very useful. If you wanted to copy and paste the contents of one file into another, you would have to load both files into memory and then use extra memory for the copy. UNIX simplifies this problem: streams avoid loading unnecessarily large files into memory.
Instead, we can combine large files by printing their contents to the standard output stream and redirecting this stream from our terminal to the file we wish to save the combined results to. Use the program cat to print a file's contents to standard out (which, when not redirected, is printed to your terminal screen).
use cat to print to standard output
$ cat tb1-protein.fasta
>teosinte-branched-1 protein
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP
cat prints the contents of multiple files to the standard output stream
$ cat tb1-protein.fasta tga1-protein.fasta
>teosinte-branched-1 protein
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP
>teosinte-glume-architecture-1 protein
DSDCALSLLSAPANSSGIDVSRMVRPTEHVPMAQQPVVPGLQFGSASWFP
# the two files are printed one after the other
use > (overwrite) or >> (append) to redirect standard output to a file
$ cat tb1-protein.fasta tga1-protein.fasta > zea-proteins.fasta
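Note that when output is redirected, nothing is printed to the terminal screen. One way to confirm that the new file was created is to list the directory sorted by modification time:
$ ls -lrt
# -l: long listing format
# -rt: sorted by modification time, reversed (newest files at the bottom)
$ ls -lt # newest files at the top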
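The difference between > and >> is easy to check on small files. The following is a minimal sketch; a.fasta, b.fasta, and combined.fasta are made-up names and contents.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '>seq1\nACGT\n' > a.fasta
printf '>seq2\nTTGA\n' > b.fasta

# > creates the file, or overwrites it if it already exists.
cat a.fasta > combined.fasta
cat b.fasta > combined.fasta      # combined.fasta now holds only seq2

# >> appends to the end of the file instead.
cat a.fasta > combined.fasta
cat b.fasta >> combined.fasta     # combined.fasta holds seq1, then seq2

grep -c '^>' combined.fasta       # count the FASTA header lines
```

Because > silently overwrites, it is worth double-checking the target file name before redirecting into it.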
redirect each stream to a separate file
Standard output and standard error can be redirected to different files; note that standard error is redirected with 2>
$ ls -l tb1.fasta leafy1.fasta > listing.txt 2> listing.stderr
$ cat listing.txt
#-rw-r--r-- 1 vinceb staff 152 Jan 20 21:24 tb1.fasta
$ cat listing.stderr
# ls: leafy1.fasta: No such file or directory

Redirection can be a useful way to silence the diagnostic messages some programs print while running: just redirect that stream to a file, such as stderr.txt
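Two variations on 2> are sketched below: discarding messages by redirecting them to /dev/null, and appending with 2>> so that repeated runs accumulate in one log. The file names are placeholders, and `|| true` is there only because ls exits with a non-zero status when a file is missing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch tb1.fasta               # leafy1.fasta is deliberately missing

# Keep the listing, discard the error message entirely.
ls tb1.fasta leafy1.fasta > listing.txt 2> /dev/null || true

# 2>> appends to the log instead of overwriting it, so both errors are kept.
ls leafy1.fasta 2>> errors.log || true
ls leafy1.fasta 2>> errors.log || true

cat listing.txt
cat errors.log
```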
use standard input redirection
$ program < inputfile > outputfile
$ cat inputfile | program > outputfile # same result, using cat and a pipe
Here the placeholder file inputfile is provided to program through standard input, and all of program's standard output is redirected to the file outputfile.
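As a concrete instance of the pattern, the sketch below uses wc -l as the program; the file contents are made up, and outputfile-piped is a hypothetical name for the second result.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'chr1\nchr2\nchr3\n' > inputfile

# Form 1: the shell opens inputfile and connects it to wc's standard input.
wc -l < inputfile > outputfile

# Form 2: cat writes the file into a pipe, and wc reads from that pipe.
cat inputfile | wc -l > outputfile-piped

# Both forms feed wc the same bytes, so the two results are identical.
cmp outputfile outputfile-piped && echo "same result"
```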
use tail -f to monitor redirected standard error
Running tail stderr.txt prints the last 10 lines of the file stderr.txt.
tail can also continuously monitor a file with -f (for follow), printing new lines as they are appended.
To stop monitoring a file, press Control-C to interrupt the tail process.
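The workflow can be tried without a real long-running tool. In this sketch, long_job is a hypothetical stand-in for a program that reports progress on standard error, and kill plays the role of pressing Control-C because the example is scripted rather than interactive.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a long-running program that logs progress to standard error
# (hypothetical; a real aligner or assembler would take its place).
long_job() {
    for i in 1 2 3; do
        echo "processed chunk $i" >&2
        sleep 1
    done
}

touch stderr.txt              # make sure the file exists before tail opens it
long_job 2>> stderr.txt &     # run in the background, stderr goes to the file
job_pid=$!

tail -f stderr.txt &          # follow the file as new lines are appended
tail_pid=$!

wait "$job_pid"               # block until the job has finished
kill "$tail_pid"              # scripted equivalent of pressing Control-C
tail -n 2 stderr.txt          # -n 2: print only the last two lines
```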
4. UNIX pipes: speed and beauty in one
Rather than redirecting a program's standard output stream to a file, pipes redirect it to another program's standard input. Only standard output is piped to the next command; standard error is still printed to your terminal screen.

Passing the output of one program directly into the input of another program with pipes is a computationally efficient and simple way to interface Unix programs.
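That only standard output passes through a pipe can be checked with ls on one existing and one missing file. In this sketch the error stream is captured in a file (ls-errors.txt, a made-up name) purely to show that it bypassed the pipe; without the 2> it would appear on the terminal.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch tb1.fasta              # leafy1.fasta is deliberately missing

# ls writes "tb1.fasta" to stdout and an error about leafy1.fasta to stderr.
# Only stdout travels through the pipe, so wc counts exactly one line.
ls tb1.fasta leafy1.fasta 2> ls-errors.txt | wc -l > line-count.txt

cat line-count.txt           # the number of lines wc received
cat ls-errors.txt            # the error message never reached wc
```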
- grep and pipes
The Golden Rule of Bioinformatics is to not trust your tools or data. This scepticism requires constant sanity checking of intermediate results, which ensures your methods aren’t biasing your data, or problems in your data aren’t being exacerbated by your methods.
- pipes avoid latency issues from writing unnecessary files to disk.
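A small sanity check in this spirit is to look for characters other than A, T, C, and G in a nucleotide FASTA file by chaining two grep calls. The sketch below assumes a tiny made-up file, example.fasta, with a stray Y planted in its second sequence line.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny made-up nucleotide FASTA file; the second sequence line hides a "Y".
printf '>seq1\nCCCCAAAGACGGACCAATCC\nAGCAGCTTCTAYTGCTATCA\n' > example.fasta

# Step 1: grep -v "^>" drops the header lines, leaving only sequence lines.
# Step 2: grep -i "[^ATCG]" keeps lines containing any character other than
#         A, T, C, or G (case-insensitive).
# The sequence lines flow from one grep to the next without touching the disk.
grep -v "^>" example.fasta | grep -i "[^ATCG]"
```

Only the line containing the unexpected character is printed; a clean file would produce no output at all.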