Analyzing VCF files with pandas
2024. 11. 24. 13:28ㆍComputational biology
1. Data type conversion and filtering
Convert the columns needed for the analysis to appropriate data types.
For example, POS can be converted to an integer and QUAL to a float.
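import pandas as pd
# Data type conversion
df['POS'] = df['POS'].astype(int)
# QUAL can be '.' (missing) in a VCF, so coerce unparseable values to NaN
df['QUAL'] = pd.to_numeric(df['QUAL'], errors='coerce')
# Assumes an AF column was already extracted from INFO; keep the first value at multi-allelic sites
df['AF'] = df['AF'].apply(lambda x: float(str(x).split(',')[0]))
# Keep only high-quality variants (QUAL >= 30); copy so later edits do not touch df
high_quality_variants = df[df['QUAL'] >= 30].copy()
print(high_quality_variants)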
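The conversion above assumes that `df` already holds the VCF records and has an `AF` column, although AF is not one of the eight fixed VCF columns and usually sits inside INFO. The following is a minimal sketch of how such a frame might be built, not necessarily how the original `df` was created. The sample records, the `vcf_text` and `columns` names, and the choice to read every column as a string are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-record VCF, used only for illustration.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t10500\t.\tA\tG\t50.0\tPASS\tDP=20;AF=0.25\n"
    "chr1\t20500\t.\tC\tT,G\t12.5\tPASS\tAF=0.10,0.05\n"
    "chr1\t30500\t.\tG\tA\t.\tPASS\tDP=8\n"
)

columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Lines starting with '#' (meta-information and the header) are skipped,
# so the eight fixed column names are supplied explicitly.
# Caveat: comment="#" also truncates any data line that contains '#'.
df = pd.read_csv(io.StringIO(vcf_text), sep="\t", comment="#",
                 names=columns, dtype=str)

# Pull the first AF value out of INFO; records without AF become NaN.
df["AF"] = pd.to_numeric(
    df["INFO"].str.extract(r"(?:^|;)AF=([^;,]+)", expand=False),
    errors="coerce",
)
df["POS"] = df["POS"].astype(int)
df["QUAL"] = pd.to_numeric(df["QUAL"], errors="coerce")  # '.' becomes NaN

high_quality_variants = df[df["QUAL"] >= 30].copy()
print(high_quality_variants[["CHROM", "POS", "QUAL", "AF"]])
```

For a real file, the `io.StringIO(vcf_text)` argument would be replaced by the file path. Records with a missing QUAL drop out of the `QUAL >= 30` filter because comparisons with NaN are false.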
2. High-quality variant analysis
Examine the high-quality variants: how often they occur, where they are located, which genes they fall in, and so on.
# Count how many variants occur at each position
position_counts = high_quality_variants['POS'].value_counts()
print(position_counts)
# Count how often each ALT allele occurs
alt_freq = high_quality_variants['ALT'].value_counts()
print(alt_freq)
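Counting on POS alone merges variants that share a coordinate on different chromosomes. As a small sketch of one way around this, the example below counts by CHROM and POS together and reports ALT alleles as proportions. The `variants` frame and its values are hypothetical, and it assumes a CHROM column is present.

```python
import pandas as pd

# Hypothetical high-quality variants, used only for illustration.
variants = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr2", "chr2", "chr2"],
    "POS": [10500, 10500, 10500, 42000, 42000],
    "ALT": ["G", "T", "G", "A", "A"],
    "QUAL": [50.0, 45.0, 60.0, 35.0, 70.0],
})

# Number of variant records per site, where a site is (CHROM, POS).
site_counts = variants.groupby(["CHROM", "POS"]).size()
print(site_counts)

# Share of each ALT allele among all records (proportions sum to 1).
alt_share = variants["ALT"].value_counts(normalize=True)
print(alt_share)
```

Here position 10500 appears on both chr1 and chr2, and the grouped count keeps the two sites separate.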
3. Visualization
Visualize the variant data with Matplotlib and Seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
# Distribution of variant quality scores
plt.figure(figsize=(10, 6))
sns.histplot(high_quality_variants['QUAL'], bins=30, kde=True)
plt.title('Distribution of Variant Quality Scores')
plt.xlabel('Quality Score')
plt.ylabel('Frequency')
plt.show()
# Variant quality score by position
plt.figure(figsize=(10, 6))
sns.scatterplot(x='POS', y='QUAL', data=high_quality_variants)
plt.title('Variant Positions and Quality')
plt.xlabel('Position')
plt.ylabel('Quality Score')
plt.show()
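The histogram above is drawn from `high_quality_variants`, so it only covers scores of 30 and above. To see where the cutoff falls in the full distribution, one option is to plot the unfiltered scores and mark the threshold. The sketch below uses plain Matplotlib; the `plot_qual_with_threshold` helper and the sample values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_qual_with_threshold(variants, threshold=30, bins=30):
    """Histogram of all QUAL values with a vertical line at the cutoff."""
    fig, ax = plt.subplots(figsize=(10, 6))
    # Missing QUAL values are dropped before binning.
    ax.hist(variants["QUAL"].dropna(), bins=bins)
    ax.axvline(threshold, color="red", linestyle="--",
               label=f"QUAL = {threshold}")
    ax.set_title("Distribution of Variant Quality Scores")
    ax.set_xlabel("Quality Score")
    ax.set_ylabel("Count")
    ax.legend()
    return ax

# Hypothetical QUAL values, used only for illustration.
all_variants = pd.DataFrame({"QUAL": [5.0, 12.5, 28.0, 31.0, 50.0, 75.0, None]})
ax = plot_qual_with_threshold(all_variants)
```

In an interactive session, calling `plt.show()` afterwards displays the figure.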
4. One more step: analyzing variants within a specific gene
Analyze and visualize the variants that fall within a particular gene.
# For example, filter the variants in a particular gene region (example coordinates)
gene_start = 10000
gene_end = 50000
gene_variants = high_quality_variants[
    high_quality_variants['POS'].between(gene_start, gene_end)  # inclusive on both ends
]
# Plot the variants within the gene region
plt.figure(figsize=(10, 6))
sns.scatterplot(x='POS', y='QUAL', data=gene_variants)
plt.title('Variants in Specific Gene Region')
plt.xlabel('Position')
plt.ylabel('Quality Score')
plt.show()
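The filter above selects on POS only, which is enough when the data covers a single chromosome. For multi-chromosome data, a region filter would also need to match the chromosome. The following is a minimal sketch under that assumption; the `variants_in_region` helper, the CHROM column, and the sample coordinates are hypothetical.

```python
import pandas as pd

def variants_in_region(variants, chrom, start, end):
    """Return rows on `chrom` whose POS lies in [start, end], inclusive."""
    in_region = (variants["CHROM"] == chrom) & variants["POS"].between(start, end)
    return variants.loc[in_region].copy()

# Hypothetical variants, used only for illustration.
variants = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr1", "chr2"],
    "POS": [9999, 10000, 50000, 20000],
    "QUAL": [40.0, 55.0, 61.0, 90.0],
})

gene_variants = variants_in_region(variants, "chr1", 10000, 50000)
print(gene_variants)
```

The chr2 record at position 20000 lies inside the coordinate range but is excluded because it is on a different chromosome.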