Bioinformatics Data Skills Ch3 关

Author: Pingouin | Published: 2020-09-12 06:25

1. Why use UNIX in bioinformatics?

Unix philosophy:
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
—Doug McIlroy

Advantages:

  • Modular workflows make it easier to spot errors and to see where they occur.
  • Modular workflows allow testing with alternative methods, e.g. swapping in a different aligner.
  • Command-line tools can be combined to explore data interactively, e.g. using Python for scripting and R for statistical analysis.
  • Modular components are reusable and applicable to many types of data.
  • Text streams let us both couple programs together into workflows and process data without holding huge amounts of it in our computer's memory.

2. UNIX

many UNIX shells

bash: widely available
zsh: has more advanced features

chainsaw feature

Be careful: a single stray space can completely change what a command does!

$ rm -rf tmp-data/aligned-reads* # deletes all old large files
$ # versus
$ rm -rf tmp-data/aligned-reads * # deletes entire current directory 
# rm: tmp-data/aligned-reads: No such file or directory
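
Because the shell expands wildcards before rm ever runs, one low-risk habit is to preview the expansion with echo first. The following is a minimal sketch rather than anything taken from the book: it assumes a throwaway directory created with mktemp, and the file names are hypothetical.

```shell
# Work in a scratch directory so nothing real is at risk (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir tmp-data
touch tmp-data/aligned-reads-1.sam tmp-data/aligned-reads-2.sam notes.txt

# Preview what the shell will hand to rm: echo prints the expanded arguments.
echo rm -rf tmp-data/aligned-reads*
# With the stray space, the lone * expands to everything in the current directory.
echo rm -rf tmp-data/aligned-reads *

# Once the preview looks right, run the real command.
rm -rf tmp-data/aligned-reads*
remaining=$(ls tmp-data | grep -c '')   # number of entries left in tmp-data
echo "files left in tmp-data: $remaining"
```

The second echo line makes the problem visible: the bare * is a separate argument that matches every entry in the working directory.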

3. Working with streams and redirection

Bioinformatics data is usually large text data (A, G, T, C), so Unix text streams are very useful. If we copied the contents of one file and pasted them into another, we would not only have to load both files into memory but also use extra memory for the copy itself. Unix simplifies this problem: streams avoid loading unnecessarily large files into memory.
Instead, we can combine large files by printing their contents to the standard output stream and redirecting this stream from our terminal to the file we wish to save the combined results to. Use the program cat to print a file's contents to standard output (which, when not redirected, is printed to your terminal screen).

use cat to print to standard output

$ cat tb1-protein.fasta 
>teosinte-branched-1 protein 
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP 

cat prints the contents of multiple files to the standard output stream

$ cat tb1-protein.fasta tga1-protein.fasta
>teosinte-branched-1 protein 
LGVPSVKHMFPFCDSSSPMDLPLYQQLQLSPSSPKTDQSSSFYCYPCSPP
>teosinte-glume-architecture-1 protein 
DSDCALSLLSAPANSSGIDVSRMVRPTEHVPMAQQPVVPGLQFGSASWFP
# the contents of the two files are printed one after the other

use > or >> to redirect standard output to a file

$ cat tb1-protein.fasta tga1-protein.fasta > zea-proteins.fasta

Note that when standard output is redirected, nothing is printed to the terminal screen.
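
The difference between the two operators is easy to check on toy data: > creates the target file or truncates an existing one, while >> appends to it. A small sketch, assuming two hypothetical FASTA files written into a scratch directory:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nACGT\n' > a.fasta    # hypothetical toy inputs
printf '>seq2\nTTGA\n' > b.fasta

cat a.fasta > combined.fasta     # > creates the file, or truncates it if it exists
cat b.fasta >> combined.fasta    # >> appends to the end instead

cat a.fasta > overwritten.fasta
cat b.fasta > overwritten.fasta  # a second > replaces the earlier contents

grep -c '^>' combined.fasta      # count FASTA header lines in each result
grep -c '^>' overwritten.fasta
```

The appended file keeps both records, whereas the file written twice with > only holds the second one.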

$ ls -lrt
# -l: long listing format
# -rt: sort by modification time in reverse order, so the newest files are at the bottom
$ ls -lt # see the newest files at the top

redirect each stream to separate files

Redirect the two streams to different files. Note that standard error is redirected with 2>.

$ ls -l tb1.fasta leafy1.fasta > listing.txt 2> listing.stderr 
$ cat listing.txt
#-rw-r--r-- 1 vinceb staff 152 Jan 20 21:24 tb1.fasta
$ cat listing.stderr
# ls: leafy1.fasta: No such file or directory

Redirection can be a useful way to silence the diagnostic information some programs write to standard error: just redirect it to a file, such as stderr.txt.
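
A short sketch of both ways to quiet standard error, run in a scratch directory. The redirect to /dev/null is not shown in these notes; it is the standard Unix null device, which discards whatever is written to it.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch tb1.fasta                      # leafy1.fasta deliberately does not exist

# Keep the listing, and send the error message to a file for later inspection.
ls -l tb1.fasta leafy1.fasta > listing.txt 2> stderr.txt

# Or discard the diagnostics entirely by redirecting them to /dev/null.
ls -l tb1.fasta leafy1.fasta > listing2.txt 2> /dev/null

cat stderr.txt                       # the complaint about leafy1.fasta ends up here
```

Writing to a file keeps the messages available for debugging, so it is usually the safer choice unless the output is known to be noise.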

use standard input redirection

$ program < inputfile > outputfile
$ cat inputfile | program > outputfile

In both forms, the artificial file inputfile is provided to program through standard
input, and all of program's standard output is redirected to the file outputfile.
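
To make the pattern concrete, here is a sketch in which wc -l stands in for the placeholder program (an assumption for illustration; any tool that reads standard input would do). The input file is a hypothetical toy FASTA.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nACGT\n>seq2\nTTGA\n' > inputfile   # hypothetical toy input

# wc -l reads standard input when it is given no file name.
wc -l < inputfile > outputfile        # input redirection
cat inputfile | wc -l > outputfile2   # the same data delivered through a pipe

cat outputfile outputfile2            # both files hold the same line count
```

The two forms give the same result; the < version simply avoids starting an extra cat process.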

use tail -f to monitor redirected standard error

Running tail stderr.txt prints the last 10 lines of the file stderr.txt.
tail can also be used to constantly monitor a file with -f (-f for follow).
To stop monitoring a file, press Control-C to interrupt the tail process.
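
The following sketch simulates this in a script. The background loop is a hypothetical stand-in for a long-running program that appends diagnostics to a log, and kill plays the role of Control-C, since a script cannot press keys.

```shell
log=$(mktemp)

# Hypothetical stand-in for a long-running program writing diagnostics to a log.
( for i in 1 2 3; do echo "processed chunk $i" >> "$log"; sleep 1; done ) &
writer=$!

tail -f "$log" &      # follow the file: new lines are printed as they are appended
follower=$!

wait "$writer"        # the writer finishes after about three seconds
sleep 1               # give tail a moment to print the final line
kill "$follower"      # scripted equivalent of pressing Control-C

tail -n 2 "$log"      # -n sets how many trailing lines to print (the default is 10)
```

In interactive use, only the tail -f line is needed; the rest exists so the example terminates on its own.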

4. UNIX pipes: speed and beauty in one

Rather than redirecting a program's standard output stream to a file, pipes redirect it to another program's standard input. Only standard output is piped to the next command; standard error is still printed to your terminal screen.

Passing the output of one program directly into the input of another program with pipes is a computationally efficient and simple way to interface Unix programs.

  1. grep and pipes
    The Golden Rule of Bioinformatics is to not trust your tools or data. This skepticism requires constant sanity checking of intermediate results, which ensures that your methods aren't biasing your data and that problems in your data aren't being exacerbated by your methods.
  • Pipes also avoid the latency of writing unnecessary intermediate files to disk.
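
A sanity check of this kind can be sketched with two grep calls joined by a pipe: the first drops FASTA header lines, and the second looks for characters that are not nucleotides. The toy file and its contents are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical toy FASTA; the second sequence contains a non-nucleotide character (Y).
printf '>seq1\nACGTTGCA\n>seq2\nACGYTTGA\n' > toy.fasta

# Drop header lines, then keep only lines containing something other than A, T, C, G.
grep -v '^>' toy.fasta | grep -i '[^ATCG]'

# The same pipeline, counting the offending lines instead of printing them.
bad=$(grep -v '^>' toy.fasta | grep -ci '[^ATCG]')
echo "lines with unexpected characters: $bad"
```

No intermediate file is written: the header-free sequence lines flow straight from the first grep into the second.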
