Analyzing VCF files with Pandas

2024. 11. 24. 13:28 · Computational biology

1. Data type conversion and filtering

Convert the columns needed for the analysis to appropriate data types.

For example, POS can be converted to an integer and QUAL to a float.

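import pandas as pd

# Data type conversion (df is assumed to already hold the VCF records)
df['POS'] = df['POS'].astype(int)
# QUAL may be '.' (missing) in a VCF, so coerce unparseable values to NaN
df['QUAL'] = pd.to_numeric(df['QUAL'], errors='coerce')
# Multi-allelic sites list one AF per ALT allele; keep the first value
df['AF'] = df['AF'].apply(lambda x: float(x.split(',')[0]))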

# Keep only high-quality variants (QUAL >= 30)
high_quality_variants = df[df['QUAL'] >= 30]
print(high_quality_variants)
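
The snippet above assumes that df already exists and has an AF column. In a raw VCF, AF is usually stored as a key inside the INFO field, so a loading step is needed first. The following is a minimal sketch of one way to do that with plain Pandas; the sample records and the read_vcf and first_af helpers are illustrative assumptions, not part of the original workflow.

```python
import io
import pandas as pd

# Tiny made-up VCF used only for illustration (not real data).
EXAMPLE_VCF = """##fileformat=VCFv4.2
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tAC\t45.2\tPASS\tAF=0.42;DP=100
chr1\t10352\t.\tT\tTA,TC\t12.0\tPASS\tAF=0.44,0.01;DP=88
chr1\t20500\t.\tG\tA\t.\tPASS\tDP=60
"""


def read_vcf(handle):
    """Hypothetical helper: read the fixed VCF columns, skipping '##' meta lines."""
    lines = [line for line in handle if not line.startswith("##")]
    df = pd.read_csv(
        io.StringIO("".join(lines)),
        sep="\t",
        dtype=str,              # keep everything as text until converted explicitly
        keep_default_na=False,  # do not turn strings such as "NA" into NaN
    )
    return df.rename(columns={"#CHROM": "CHROM"})


def first_af(info):
    """Hypothetical helper: first AF value in an INFO string, NaN if AF is absent."""
    for field in info.split(";"):
        if field.startswith("AF="):
            return float(field[3:].split(",")[0])
    return float("nan")


df = read_vcf(io.StringIO(EXAMPLE_VCF))
df["POS"] = df["POS"].astype(int)
df["QUAL"] = pd.to_numeric(df["QUAL"], errors="coerce")  # "." becomes NaN
df["AF"] = df["INFO"].apply(first_af)

high_quality_variants = df[df["QUAL"] >= 30]
print(high_quality_variants[["CHROM", "POS", "REF", "ALT", "QUAL", "AF"]])
```

For a real file, passing an open file handle to read_vcf (or gzip.open(path, "rt") for a .vcf.gz file) should work the same way, although a dedicated VCF parser is likely more robust for large or unusual files.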



2. High-quality variant analysis

Analyze the frequency, position, associated genes, and other properties of the high-quality variants.

# Count how many variants occur at each position
position_counts = high_quality_variants['POS'].value_counts()
print(position_counts)

# Count how often each ALT allele occurs
alt_freq = high_quality_variants['ALT'].value_counts()
print(alt_freq)
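
Counting exact positions often yields mostly ones, because each position usually appears once per chromosome. A sketch of one alternative is to count variants per fixed-size window on each chromosome. The example frame, the CHROM column, and the WINDOW width below are assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-in for the high_quality_variants frame used above.
high_quality_variants = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr1", "chr2"],
    "POS": [10177, 10352, 20500, 10200],
    "ALT": ["AC", "TA", "A", "T"],
})

WINDOW = 10_000  # hypothetical bin width in base pairs

# Assign each variant to the start coordinate of its window.
binned = high_quality_variants.assign(
    window_start=(high_quality_variants["POS"] // WINDOW) * WINDOW
)

# Count variants per chromosome and window.
window_counts = (
    binned.groupby(["CHROM", "window_start"])
    .size()
    .rename("n_variants")
    .reset_index()
)
print(window_counts)
```

Grouping by CHROM as well as the window keeps identical coordinates on different chromosomes from being merged into one count.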




3. Visualization

Visualize the variant data with Matplotlib and Seaborn.

import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of variant quality scores
plt.figure(figsize=(10, 6))
sns.histplot(high_quality_variants['QUAL'], bins=30, kde=True)
plt.title('Distribution of Variant Quality Scores')
plt.xlabel('Quality Score')
plt.ylabel('Frequency')
plt.show()

# Plot quality score against variant position
plt.figure(figsize=(10, 6))
sns.scatterplot(x='POS', y='QUAL', data=high_quality_variants)
plt.title('Variant Positions and Quality')
plt.xlabel('Position')
plt.ylabel('Quality Score')
plt.show()



4. One more step: Analyzing variants within a specific gene

Analyze and visualize the variants that fall within a particular gene.

# For example, keep only the variants inside a particular gene region
gene_start = 10000
gene_end = 50000
gene_variants = high_quality_variants[(high_quality_variants['POS'] >= gene_start) & (high_quality_variants['POS'] <= gene_end)]

# Plot the variants within the gene region
plt.figure(figsize=(10, 6))
sns.scatterplot(x='POS', y='QUAL', data=gene_variants)
plt.title('Variants in Specific Gene Region')
plt.xlabel('Position')
plt.ylabel('Quality Score')
plt.show()
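
The position filter above does not look at the chromosome, so with a multi-chromosome VCF it would also match the same coordinates on other chromosomes. A minimal sketch of a region filter that also checks CHROM follows; the variants_in_region helper, the example frame, and the coordinates are hypothetical.

```python
import pandas as pd


def variants_in_region(variants, chrom, start, end):
    """Return rows on `chrom` with start <= POS <= end (both ends inclusive)."""
    mask = (variants["CHROM"] == chrom) & variants["POS"].between(start, end)
    return variants[mask]


# Made-up stand-in for the high_quality_variants frame used above.
high_quality_variants = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr2", "chr1"],
    "POS": [10177, 48000, 20000, 60000],
    "QUAL": [45.2, 60.0, 99.0, 31.5],
})

gene_variants = variants_in_region(high_quality_variants, "chr1", 10000, 50000)
print(gene_variants)
```

The resulting gene_variants frame can be passed to the same scatter plot as before.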
