2020-03-27 Unix, R and Python tools and resources

Author: iColors | Published 2020-03-27 17:07

    Reposted from:
    https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources

    Unix, R and Python tools and resources for genomics and data science

    A masterpiece by an expert, saved here for later study.

    Table of contents

    General

    Courses

    Some biology

    If you are from fields outside of biology, places to get you started:

    Some statistics

    linear algebra

    Bayesian Statistics

    Learning Latex

    Linux commands

    Theory and quick reference

    There are 3 file descriptors, stdin, stdout and stderr (std=standard).

    Basically you can:

    redirect stdout to a file
    redirect stderr to a file
    redirect stdout to stderr
    redirect stderr to stdout
    redirect stderr and stdout to a file
    redirect stderr and stdout to stdout
    redirect stderr and stdout to stderr
    1 'represents' stdout and 2 stderr.
    A note on observing this: when you pipe a command into less, stdout goes into the pager's buffer, while stderr is printed directly on the screen and is erased as soon as you try to 'browse' the buffer.

    • stdout 2 file

    This will cause the output of a program to be written to a file.

         ls -l > ls-l.txt
    

    Here, a file called 'ls-l.txt' will be created and it will contain what you would see on the screen if you type the command 'ls -l' and execute it.

    • stderr 2 file

    This will cause the stderr output of a program to be written to a file.

         grep da * 2> grep-errors.txt
    

    Here, a file called 'grep-errors.txt' will be created and it will contain the stderr portion of the output of the 'grep da *' command.

    • stdout 2 stderr

    This will cause the stdout output of a program to be written to the same file descriptor as stderr.

         grep da * 1>&2
    

    Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.

    • stderr 2 stdout

    This will cause the stderr output of a program to be written to the same file descriptor as stdout.

         grep da * 2>&1
    

    Here, the stderr portion of the command is sent to stdout. If you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are now kept (because they're on stdout).

    • stderr and stdout 2 file

    This will send all output of a program to a file. This is sometimes suitable for cron entries, if you want a command to run in absolute silence.

         rm -f $(find / -name core) &> /dev/null
    

    This (thinking of the cron entry) will delete every file called 'core' in any directory. Note that you should be quite sure of what a command is doing if you are going to discard its output.
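
    The redirections above can be tried safely with a small sketch. The `emit` function and the scratch directory are hypothetical helpers introduced only for illustration; they are not part of the original text. The `> file 2>&1` form is used here as a portable equivalent of bash's `&>`.

```shell
# Hypothetical helper: writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

workdir=$(mktemp -d)   # scratch directory, so the current directory stays untouched

emit > "$workdir/out.txt" 2> /dev/null    # stdout to a file, stderr discarded
emit 2> "$workdir/err.txt" > /dev/null    # stderr to a file, stdout discarded
emit > "$workdir/both.txt" 2>&1           # stdout and stderr to the same file

# Redirections are applied left to right: stderr is first pointed at the
# capture, then stdout is sent to /dev/null, so only stderr is captured.
captured_err=$(emit 2>&1 > /dev/null)
echo "$captured_err"
```

    The order of redirections matters: `> file 2>&1` sends both streams to the file, whereas `2>&1 > file` leaves stderr on the original stdout.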

    • change permissions of files
      each digit is for: user, group and other.

    chmod 754 myfile: this means the user has read, write and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.

    4 stands for "read",
    2 stands for "write",
    1 stands for "execute", and
    0 stands for "no permission."
    So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).

    The numbers are sometimes hard to remember, so you can use letters instead: u, g, and o stand for "user", "group", and "other"; r, w, and x stand for "read", "write", and "execute", respectively.

    chmod u+x myfile
    chmod g+r myfile
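
    A minimal sketch comparing the numeric and symbolic forms, assuming a scratch file in a temporary directory (the path and variable names are made up for illustration). The first ten characters of `ls -l` show the resulting permission string.

```shell
workdir=$(mktemp -d)
f="$workdir/myfile"
touch "$f"

chmod 754 "$f"                               # user rwx (7), group r-x (5), other r-- (4)
mode_numeric=$(ls -l "$f" | cut -c1-10)

chmod 600 "$f"                               # start again from rw- for the user only
chmod u+x,g+r "$f"                           # add execute for user, read for group
mode_symbolic=$(ls -l "$f" | cut -c1-10)

echo "$mode_numeric"
echo "$mode_symbolic"
```

    The numeric form sets all nine permission bits at once, while the symbolic form only changes the bits that are named.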

    Do not give me excel files!

    How to name files

    It is really important to name your files correctly! See the slides by Jenny Bryan.

    Three principles for (file) names:

    • Machine readable (do not put special characters and space in the name)
    • Human readable (Easy to figure out what the heck something is, based on its name, add slug)
    • Plays well with default ordering:
      1. Put something numeric first
      2. Use the ISO 8601 standard for dates (YYYY-MM-DD)
      3. Left pad other numbers with zeros
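
    As a rough illustration of the ordering principles, the sketch below builds file names with an ISO 8601 date first and a zero-padded counter. The date, the `assay` slug and the `sample-` prefix are hypothetical values chosen for the example.

```shell
workdir=$(mktemp -d)
run_date="2020-03-27"   # ISO 8601 date (YYYY-MM-DD); hypothetical value

for i in 1 2 10; do
    # %02d left-pads the number with zeros, so 10 sorts after 02 rather than after 1
    name=$(printf '%s_assay_sample-%02d.csv' "$run_date" "$i")
    touch "$workdir/$name"
done

sorted=$(LC_ALL=C ls "$workdir")   # default (alphabetical) ordering
echo "$sorted"
```

    Without the padding, `sample-10` would sort between `sample-1` and `sample-2` in a plain alphabetical listing.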

    If you have to rename the files...

    • brename: a cross-platform command-line tool by Shen Wei for safely batch renaming files/directories via regular expressions (supports Windows, Linux and OS X). Very useful!

    Good file names help you extract metadata from the file name

    • dirdf Create tidy data frames of file metadata from directory and file names.
    > dir("examples/dataset_1/")
    [1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
    [2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
    [3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
    [4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
    [5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"
    
    > library("dirdf")
    > dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
            date     assay           experiment well ext                                          pathname
    1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
    2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
    3 2014-02-26 BRAFWTNEG     FFPEDNA-CRC-1-41  D08 csv     2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
    4 2014-03-05 BRAFWTNEG   FFPEDNA-CRC-REPEAT  H03 csv   2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
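
    The same idea can be sketched in plain shell, without dirdf: because the names follow the `date_assay_experiment_well.ext` template, splitting on the underscore and the last dot recovers each field. This is only an illustration of why consistent names pay off, not how dirdf is implemented; the variable names are made up.

```shell
pathname="2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"

ext="${pathname##*.}"    # text after the last dot
stem="${pathname%.*}"    # file name without the extension

# Split the stem on underscores into the four fields of the template.
IFS=_ read -r file_date assay experiment well <<EOF
$stem
EOF

printf '%s\t%s\t%s\t%s\t%s\n' "$file_date" "$assay" "$experiment" "$well" "$ext"
```

    This only works because hyphens are used inside a field and underscores between fields, which is the convention the example file names follow.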
    

    parallelization

    Using these tools will greatly improve your working efficiency and get rid of most of your for loops.

    1. xargs
    2. GNU parallel. See one of my posts here
    3. gargs by Brent Pedersen. Written in Go.
    4. rush: a cross-platform command-line tool for executing jobs in parallel, by Shen Wei. I use his other tools such as brename and csvtk.
    5. future: Unified Parallel and Distributed Processing in R for Everyone
    6. furrr Apply Mapping Functions in Parallel using Futures
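
    A minimal sketch of replacing a for loop with xargs, assuming an xargs that supports `-0` and `-P` (GNU and BSD versions do; they are not part of POSIX). The input files and the per-file job (counting lines) are hypothetical stand-ins for a real workload.

```shell
workdir=$(mktemp -d)
printf 'a\n'       > "$workdir/s1.txt"
printf 'a\nb\n'    > "$workdir/s2.txt"
printf 'a\nb\nc\n' > "$workdir/s3.txt"

# Serial version, for comparison:
#   for f in "$workdir"/*.txt; do wc -l < "$f" > "$f.count"; done

# Parallel version: find emits NUL-separated names, xargs starts one job
# per file (-n 1) and keeps up to 2 jobs running at the same time (-P 2).
find "$workdir" -name '*.txt' -print0 |
    xargs -0 -n 1 -P 2 sh -c 'wc -l < "$1" > "$1.count"' _

n_done=$(find "$workdir" -name '*.count' | wc -l | tr -d ' ')
s3_lines=$(tr -d ' ' < "$workdir/s3.txt.count")
echo "$n_done files processed"
```

    The NUL separator keeps file names with spaces intact, and the number given to `-P` would normally be matched to the number of available cores.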

    Statistics

    Data transfer

    a blog post by Mark Ziemann http://genomespot.blogspot.com/2018/03/share-and-backup-data-sets-with-dat.html

    Website

    updating R

    # Install the new version of R (let's say 3.5.0 in this example)
    
    # Create a new directory for the version of R
    fs::dir_create("~/Library/R/3.5/library")
    
    # Re-start R so the .libPaths are updated
    
    # Look up which packages were in your old package library
    # (path_file() keeps only the last path component, i.e. the package name)
    pkgs <- fs::path_file(fs::dir_ls("~/Library/R/3.4/library"))
    
    # Filter these packages as needed
    
    # Install the packages in the new version
    install.packages(pkgs)
    
    

    Better R code

    Shiny App

    profile R code

    • profvis Interactive Visualizations for Profiling R Code.
    • proffer The proffer package profiles R code to find bottlenecks.
    • rco - The R Code Optimizer Make your R code run faster! rco analyzes your code and applies different optimization strategies that return an R code that runs faster.

    R tools for data wrangling, tidying and visualizing.

    If you already know the mapping in advance (like the above example) you should use the .data pronoun from rlang to make it explicit that you are referring to the drv in the layer data and not some other variable named drv (which may or may not exist elsewhere). To avoid a similar note from the CMD check about .data, use #' @importFrom rlang .data in any roxygen code block (typically this should be in the package documentation as generated by usethis::use_package_doc()).

    • If you know the mapping or facet specification is col in advance, use aes(.data$col) or vars(.data$col).
    • If col is a variable that contains the column name as a character vector, use aes(.data[[col]]) or vars(.data[[col]]).
    • If you would like the behaviour of col to look and feel like it would within aes() and vars(), use aes({{ col }}) or vars({{ col }}).

    Sankey graph

    Handling big data in R

    Write your own R package

    Documentation

    • This is a must read for writing good documentations: A blog post. I saved it to a pdf and uploaded to this repo.

    handling arguments at the command line

    visualization in general

    Javascript

    python tips and tools

    machine learning

    Amazon cloud computing

    Intro to AWS Cloud Computing

    Genomics-visualization-tools

    There are many online web-based tools for visualizing (cancer) genomic data. I put my collection here. I use R for visualization.
    See a nice post using Python by Radhouane Aniba: Genomic Data Visualization in Python

    • UCSC cancer genome browser. It has many data sets built in, including TCGA data, and can be very handy for both bench scientists and bioinformaticians.
    • UCSC Xena. A new tool, also developed by the UCSC team. Potentially very useful, but it needs more tutorials to follow.
    • UCSC genome browser. One of the most famous genome browsers and my favorite. Every person studying genetics, genomics and molecular biology needs to know how to use it. Tutorials from OpenHelix.
    • Epiviz 3 is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations.
    • Mutation Annotation & Genome Interpretation TCGA: MAGI
    • GeneProteinViz (GPViz) is a versatile Java-based software for dynamic gene-centered visualization of genomic regions and/or variants.
    • ProteinPaint: Web Application for Visualizing Genomic Data The software developed for this project highlights critical attributes about the mutations, including the form of protein variant (e.g. the new amino acid as a result of missense mutation), the name of sample from which the mutation was identified, whether the mutation is somatic or germline,

    Databases

    Large data consortium data mining

    Integrative analysis

    Interactive visualization

    Tutorials

    "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters"

    MOOC(Massive Open Online Courses)

    git and version control

    blogs

    data management

    Automate your workflow, open science and reproducible research

    Automation wins in the long run.

    image

    STEP 6 is usually missing!

    image

    The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method

    Workflow languages

    Reviews
    Snakemake

    I am using Snakemake and so far I am very happy with it!

    Nextflow

    Reproducible research

    As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.

    Survival curve

    Organize research for a group

    • Slack: a messaging app for teams.
    • Ryver.
    • Trello lets you work more collaboratively and get more done.

    Clustering

    CRISPR related

    vector arts for life sciences
