R Biodconductor

2024. 11. 2. 11:09Computational biology

Bioconductor

- R package providing system for genome analysis (many package collection)
- You can check the information on each package on the bioconductor website.
- Provide analysis tool and example data

▶ INSTALLATION METHOD

# 설치
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# 기본 package 다운
BiocManager::install() 

# 특정 package 다운
BiocManager::install(원하는 package 이름) 

BioBase

- Most basic package

- Supports the ExpressionSet structure.

● ExpresssionSet

- Data structure with gene expression value

- It consists of assayData, phenoData, and featureData.

▶ assayData: matrix representing gene expression (row = gene_id, column = sample_id)

▶ phenoData: dataframe with sample information (row = sample_id, column = sample_info)

- Clinical data, age, species, etc.

▶ featureData: Dataframe with Gene information (row = gene_id, column = gene_info)

- Chrosom location, start, end number, strand, synbol, etc.

phenoData and featureData are AnnotationDataframe

Be aware when creating ExpressionSet
# ExpressionSet 만들기

ExpressionSet(assayData = assayData
              ,phenoData = AnnotationDataFrame(phenoData)
              ,featureData = AnnotationDataFrame(featureData))

 
● ExpressionSet-related functions

# assayData, phenoData, featureData 가져오기

exprs(expressionSet), fData(), pData()

# sample 이름, 유전자 이름 가져오기

sampleNames(), featureNames()

# phenoData column, featureData column 가져오기

varLabels(), fvarLabels()

Create ExpressionSet with NGS result data

- NGS Result data: Data in which the expression level for each gene of each sample is calculated as FPKM through cufflinks (gene name, id, FPKM value, etc.).

- Create assayData with FPKM_tracking data.

- Create PhenoData with data for each sample.

- Create featureData from the referance (information of already known genes) downloaded from Gencode using Gene_id..

▶ Example data information

- RNA-seq 데이터

- 4개의 sample(MUT1, MUT2, WT1, WT2)

- phenoData에 사용할 정보는 MUT, WT 이외에는 없다......

 
● assayData Making

1. Create a vector with all gene ids of tracking_data (gene expression level data), respectively.

- This is because the type and number of genes for each sample may be different.

# 각 샘플의 유전자 데이터

cuff.mut1 <- read.delim('MUT1/genes.fpkm_tracking')
cuff.mut2 <- read.delim('MUT2/genes.fpkm_tracking')
cuff.wt1 <- read.delim('WT1/genes.fpkm_tracking')
cuff.wt2 <- read.delim('WT2/genes.fpkm_tracking')

# 모든 유전자 id를 갖는 벡터 만들기

gene_id <- unique(c(cuff.mut1$tracking_id
		     ,cuff.mut2$tracking_id
                     ,cuff.wt1$tracking_id
                     ,cuff.wt2$tracking_id))
                     

2. Create an empty matrix and specify rownames = gene_id, colnames = sample_id.

# 빈 matrix

assay <- matrix(NA,nrow=length(gene_id),ncol=4) # nrow = 유전자 수, ncol = sample 수

# row, column name 지정

rownames(assay) = gene_id

columns(assay) = c('MUT1','MUT2','WT1','WT2')

3. Enter the value that matches the gene and sample using the match function.

# macth 함수를 이용하여 값이 일치하는 index 추출

matched_index <- match(gene_id,cuff.mut1$tracking_id)

# 일치하는 값 넣기

assay[,1] <- cuff.mut1$FPKM[matched_index]

# 4개 sample에 대해 진행

 
● phenoData Making

1. Create a vector with sample_id and sample_info information.

# sample_id

sample_id <- colnames(assay) # assay의 순서와 통일한다.

# 기타 info

group <- c('MUT','MUT','WT','WT') # 여기에선 하나만......

2. Combine the above data to create a dataframe and set rowname = sample_id..

# dataframe 만들기

phenoData <- data.Frame(sample_id,group,row.names=sample_id)

 
● Creating featureData

1. Import reference data

# rda data의 경우

ref = data(referencefile)

# gtf.gz 파일의 경우

library(rtracklayer)

ref = import(referancefile)

2. gene_idBring the information suitable for and create a dataframe and set rowname = gene_id.

# gene_id로 ref에 일치하는 index 찾기

matched_index <- match(gene_id, ref)

# fetureData 만들기

featureData <- ref[macthed_index,] # assay column과 순서일치

rownames(featureData) <- gene_id
# 일치 확인

sum(!(rownames(assay) == rownames(phenoData))) # 0이 나오면 일치!

# feature 도 확인하자

 
● ExpressionSet Making

1. Check if assay row == phenoData row, assay column == featureData row
2. Completed with ExpresstionSet() function (AnotationDataFrame must! )

eset <- ExpressionSet(assayData = assay
		      ,phenoData = AnnotationDataFrame(phenoData)
                      ,featureData = AnnotationDataFrame(featureData))

GenomicRanges

- GRanges PACKAGE PROVIDING DATA STRUCTURE

● GRanges

- Data with gene regions and gene information on the genome

- Additional information (dataframe) is attached to the basic format (seqnames, ranges, strand).

- seqnames: Chromosome number
- Range: Genetic start and end number (Enter using the IRanges() function).
- strand: Which strand of DNA strand is (+,-) expression

예시 데이터

3개의 유전자 [chr1, 1~10, '-'], [chr2, 1~20, '+'], [chr3, 1~30 , '+']

추가 정보

score = c(1,2,3)

name = c(a, b, c)

 
● GRanges Making

# 기본정보

gr = GRanges(seqnames = c('chr1','chr2','chr3')
             ,ranges = IRanges(start=c(1,1,1),end=c(10,20,30))
             ,strand = c('-','+','+'))
        
# 추가정보

gr$score = c(1,2,3)

gr$name = c('a','b','c')

● GRanges Function

# seqnames, start, end, strand 가져오기

seqnames(gr), start(), end(), strand() # as.character 와 함께 사용

# 유전자 개수

length()

# 각 유전자의 길이

width()

# 추가 정보 조회하기

mcol()

 
● Create GRanges with the featureData of ExpressionSet

- Use the seqname, start, end, and strand information in featureData.

# eset의 featureData

fdata <- fData(eset)

# GRanges 만들기

GRanges(seqnames = fdata$seqnames
        ,ranges = IRange(start=fdata$start,end=fdata$end)
        ,strand = fdata$strand)

# gene_name등 기타 필요한 추가정보도 넣어준다.

●Finding genes present on both sides of the two GRanges data (findOverlaps)

- Even if there is no gene name, find overlapping parts based on information about the region of a specific chromosome.

Finding genes in GRanges with only chromosome and region information using GRanges data created in Example ExpressionSet example

fdata.gr = Data created above

gr = chr17 7582285-7773654와 chr13 47501983-49283632의 GRange 데이터
# gr을 만든다.

gr = GRanges(seqnames = c('chr17','chr13')
             ,ranges = IRange(start=c(7582285,47501983),end=c(7773654,49283632)))
             
# findOverlaps() 함수로 겹치는 부분 찾기

findOverlaps(query = gr, subject = fdata.gr)

 
▶ Results

Return the index of query GRanges data where the matching gene is located and the index of subject GRagnes data.

Only each index can be extracted using queryHits() and subjectHits() functions.

'Computational biology' 카테고리의 다른 글

VCF file analysis method using Pandas  (0) 2024.11.24
certificates  (0) 2024.11.24
R self learning  (0) 2024.10.28
Bioinformatics tools  (0) 2024.10.24
BLAST  (0) 2024.10.24