2024. 11. 27. 01:14ㆍComputational biology
Anaconda: Used to manage Python virtual environments. It is not limited to bioinformatics and is useful across data science. You can create a virtual environment and manage the version of each package within it. The following conda channels are useful for bioinformatics.
- bioconda
- conda-forge
pandas, numpy: Useful for storing and processing data as arrays, matrices, and Excel-like tables.
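As a minimal sketch of this kind of table handling (the gene names, sample names, and counts are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical read counts for three genes in two samples.
counts = np.array([[10, 20], [0, 5], [7, 7]])
df = pd.DataFrame(
    counts,
    index=["geneA", "geneB", "geneC"],
    columns=["sample1", "sample2"],
)

# Row-wise mean across the two samples.
df["mean"] = df[["sample1", "sample2"]].mean(axis=1)

# Keep only genes whose mean count is at least 5.
filtered = df[df["mean"] >= 5]
print(filtered)
```

The numpy array holds the raw numbers, while the pandas DataFrame adds row and column labels on top of it.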
SciPy: Used for scientific and technical computing. It works together with numpy and pandas above. It provides tools for solving differential equations and finding roots of equations, as well as a variety of statistical tools, including continuous and discrete probability distributions.
- Combined with pandas and numpy, it can replace MATLAB for many tasks.
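A short sketch of two of the features mentioned above, root finding and probability distributions. The equation and the distribution parameters are arbitrary examples:

```python
from scipy import optimize, stats

# Root finding: solve x**2 - 2 = 0 on the interval [0, 2].
root = optimize.brentq(lambda x: x**2 - 2, 0, 2)

# Continuous distribution: P(Z <= 1.96) for a standard normal.
p_norm = stats.norm.cdf(1.96)

# Discrete distribution: P(X = 3) for a Binomial(n=10, p=0.5).
p_binom = stats.binom.pmf(3, 10, 0.5)

print(root, p_norm, p_binom)
```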
Biopython: An open-source library that provides functions for bioinformatics software development and a variety of tasks. It is used to handle sequences and parse file formats. You can also import data from bioinformatics databases.
PyVCF: Used for processing VCF files. A VCF file is the output of software such as GATK and SAMtools, and it contains information about genetic variants such as SNPs, INDELs, and CNVs.
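To show what a VCF record holds, here is a simplified plain-Python sketch. It does not use PyVCF; the function name `parse_vcf_line` and the sample line are made up, and real files also contain header lines and per-sample columns that this ignores.

```python
# A made-up VCF data line (tab-separated):
# CHROM POS ID REF ALT QUAL FILTER INFO
line = "chr1\t12345\trs0001\tA\tG\t50\tPASS\tDP=30;AF=0.5"


def parse_vcf_line(text):
    """Split one VCF data line into a dict of its eight fixed columns."""
    fields = text.rstrip("\n").split("\t")[:8]
    chrom, pos, vid, ref, alt, qual, flt, info = fields
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),  # several ALT alleles are comma-separated
        "QUAL": qual,
        "FILTER": flt,
        "INFO": info_dict,
    }


record = parse_vcf_line(line)
# Treat the variant as a SNP when REF and every ALT allele are single bases.
is_snp = len(record["REF"]) == 1 and all(len(a) == 1 for a in record["ALT"])
print(record["CHROM"], record["POS"], is_snp)
```

A dedicated parser such as PyVCF handles the header, genotype columns, and type conversion that this sketch leaves out.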
rpy2: Provides an interface for using the R language from Python. You can call functions from R, but Python and R have different namespaces, so take care with naming.
- In a Jupyter notebook, R can also be used through a magic command, without writing rpy2 calls directly (the magic itself comes from rpy2's IPython extension, loaded with %load_ext rpy2.ipython). A cell that starts with %%R can be written entirely in R syntax, which is more intuitive than calling rpy2 functions.
gzip: Used to decompress published bioinformatics data distributed as .gz files.
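A minimal sketch of reading a .gz file with the standard library. The file is created on the spot to stand in for downloaded data, and its name and contents are made up:

```python
import gzip
import os
import tempfile

# Create a small gzip-compressed FASTA file as a stand-in for real data.
path = os.path.join(tempfile.mkdtemp(), "example.fa.gz")
with gzip.open(path, "wt") as handle:
    handle.write(">seq1\nACGT\n>seq2\nGGCC\n")

# Read it back line by line in text mode, without unpacking it to disk.
headers = []
with gzip.open(path, "rt") as handle:
    for line in handle:
        if line.startswith(">"):
            headers.append(line[1:].strip())

print(headers)
```

Opening with mode "rt" yields text lines directly, so the compressed file can be streamed like an ordinary text file.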
collections.defaultdict: A subclass of dict that lets you specify a default type for the values of missing keys. If you access a key that has not been set, it does not raise a KeyError; instead it inserts and returns an empty value of the specified type.
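A short sketch of this behavior, using made-up gene and variant names:

```python
from collections import defaultdict

# Made-up (gene, variant) pairs.
pairs = [("geneA", "var1"), ("geneB", "var2"), ("geneA", "var3")]

# Missing keys are created on first access with an empty list.
variants_by_gene = defaultdict(list)
for gene, variant in pairs:
    variants_by_gene[gene].append(variant)

# With int as the default factory, missing keys start at 0.
base_counts = defaultdict(int)
for base in "ACGTAAC":
    base_counts[base] += 1

print(dict(variants_by_gene))
print(base_counts["N"])  # key was never set; no KeyError is raised
```

Note that merely reading a missing key, as in the last line, also inserts it into the dictionary.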
pysam: Provides a Python wrapper around the SAMtools C API. This allows access to SAM and BAM files.
- SAM and BAM files are the output of alignment software such as BWA, Bowtie, and CLC Assembly Cell, and contain alignment and mapping information.
matplotlib, seaborn: matplotlib is a general-purpose plotting library for Python, and seaborn builds on matplotlib to provide a wider variety of statistical visualizations. Both can be used together with numpy and pandas.
tabix: Allows fast indexing and region lookup for TAB-delimited files such as GFF, BED, PSL, SAM, and VCF. (BAM files are binary and are indexed with SAMtools instead.)
HTSeq: Provides various functions for processing NGS data. It can handle FASTA, FASTQ, SAM, VCF, GFF, and BED files and provides summary statistics, alignment handling, and read counting.
gffutils: Provides features for handling GFF and GTF files. It can create a sqlite3 database from them for more flexible manipulation.
- GFF (General Feature Format) and GTF files are used for genome annotation.
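The idea of loading annotation into a sqlite3 database can be sketched with the standard library alone. This is not gffutils or its schema; the table layout, column names, and GFF lines are made up for illustration:

```python
import sqlite3

# Three made-up GFF3 lines:
# seqid source type start end score strand phase attributes
gff_lines = [
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=gene1",
    "chr1\tdemo\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=gene1",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (seqid TEXT, type TEXT, start_pos INTEGER, "
    "end_pos INTEGER, strand TEXT, attributes TEXT)"
)
for line in gff_lines:
    seqid, _src, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?)",
        (seqid, ftype, int(start), int(end), strand, attrs),
    )

# Query: all exons on chr1 that overlap positions 250-600.
rows = conn.execute(
    "SELECT start_pos, end_pos FROM features "
    "WHERE type = 'exon' AND seqid = 'chr1' "
    "AND start_pos <= 600 AND end_pos >= 250 "
    "ORDER BY start_pos"
).fetchall()
print(rows)
```

Once the features are in a database, questions such as "which exons overlap this region" become SQL queries instead of repeated scans over the text file.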
Requests: Provides an interface that simplifies HTTP requests, which makes it easier to query databases through their REST APIs. In bioinformatics, it can be used to retrieve Gene Ontology (GO) data.
DendroPy: A Python library for phylogenetics. It provides functions and classes for working with phylogenetic data and for manipulating, processing, and simulating phylogenetic trees.
- Biopython provides similar functions, and its phylogenetic tree visualization is also neat.
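MAFFT, MUSCLE: Both are multiple sequence alignment programs that can align nucleotide and protein sequences. Biopython has provided command-line wrappers for them in the Bio.Align.Applications module; recent Biopython releases have deprecated these wrappers in favor of running the tools directly, so check the version you use.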
Bio.PDB: Provides functions for handling the PDB file format used in the Protein Data Bank, a well-known protein structure database.
- It also includes functions for processing mmCIF, the newer protein structure file format.
PyMOL: An open-source molecular visualization tool with a Python API that lets you create three-dimensional images and animations of molecules. You can write scripts in .pml, PyMOL's own script format, using Python syntax.
Planemo: Used for developing Galaxy analysis tools.
Airflow: An open-source workflow management platform that schedules and monitors tasks using directed acyclic graphs (DAGs). Tasks have dependencies and relationships with one another, and these are expressed through the DAG. It is not suitable for dynamically changing pipelines such as streaming workflows.
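The DAG idea itself can be sketched without Airflow. The example below uses the standard library's graphlib (Python 3.9 or later), not Airflow's API, and the task names are hypothetical:

```python
from graphlib import TopologicalSorter

# Each key is a task; the set lists the tasks it depends on.
dependencies = {
    "align": {"trim"},
    "call_variants": {"align"},
    "qc_report": {"trim"},
    "summary": {"call_variants", "qc_report"},
}

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A scheduler such as Airflow applies the same ordering constraint, and additionally handles scheduling, retries, and monitoring of each task.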
h5py: A Python library for working with HDF5 (Hierarchical Data Format), a file format for storing large datasets. It can be used together with numpy and pandas.
Dask: A Python library for parallel computing. It integrates easily with numpy and pandas. It offers a virtual dataframe that lets you process all the data without loading it into memory at once, and a task scheduler that determines how tasks are executed. You can adjust the parallelization method, the number of worker processes, and so on.
PySpark: The Python API of Apache Spark, a parallel computing framework implemented in Scala. Its parallelization is more abstract than Dask's, and it uses the resilient distributed dataset (RDD) as its basic data structure. For pandas users, Dask's syntax may be more convenient.
Cython: Used to connect Python and C. Because Python is an interpreted language, it runs slower than C, a compiled language, owing to run-time interpretation of source code, type checking required by dynamic typing, and so on. To address this, Cython compiles Python code into C to speed up execution. By calling C functions with Python syntax or compiling Python functions into C, time-consuming operations in loops can be accelerated.
Numba: A JIT (just-in-time) compiler for Python that is used to optimize numpy code. Only functions marked with its decorators are JIT-compiled, and compilation happens when the function is called. That part of the code then runs at the speed of the compiled machine code. The advantage is that you do not have to learn a separate language; the disadvantage is that it supports only certain data types and does not understand pandas.
Cytoscape: An open-source platform for visualizing molecular interaction networks. It makes it easy to represent networks together with high-throughput data such as DNA and protein expression data. It is mostly used with the KEGG database.
- py2cytoscape: A library for accessing Cytoscape and Cytoscape.js from Python code.
NetworkX: Provides classes and functions for representing graphs in Python. Graphs created with NetworkX can be visualized with Cytoscape.
Some of these libraries cannot be installed with the conda command. Some may also be unsupported on certain operating systems such as Windows or macOS, so it is recommended to check the official source of each library. The official sources also provide tutorials, so you can look up and practice how to use the library you need.