scRNA-seq Raw Data Preprocessing: scRNA-seq quality control

2024. 11. 27. 01:12 Computational biology


The FastQC report summarizes each analysis item with a pass, warning, or failure status.

Basic statistics
Basic statistics contain summary information such as the file type, encoding, total number of sequences, number of sequences flagged as poor quality, sequence length, and GC content.
For good data, few sequences are flagged as poor quality, the sequence length is uniform,
and the GC content is close to the overall GC content expected for the organism being analyzed.

Per Base Sequence Quality
It shows the distribution of quality scores at each base position along the read.
For good data, quality should stay at a consistently high level across the whole read length.
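As a rough illustration of what this module measures, the sketch below computes the mean Phred score at each base position from FASTQ text. It is not FastQC's implementation (FastQC draws a box plot per position); the function name `per_base_mean_quality` and the Phred+33 offset are assumptions made for the example.

```python
import io
from collections import defaultdict


def per_base_mean_quality(fastq_handle, phred_offset=33):
    """Return the mean Phred score at each base position (index 0 = first base).

    Assumes a plain 4-line-per-record FASTQ and Phred+33 encoding.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:  # only the 4th line of each record holds quality characters
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            sums[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    return [sums[p] / counts[p] for p in sorted(counts)]


# Toy input: two 4-base reads; the second read degrades toward its 3' end.
toy_fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5#\n"
print(per_base_mean_quality(io.StringIO(toy_fastq)))
```

A profile that falls sharply toward the end of the read, as in the toy input, is the kind of pattern this module is meant to expose.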


Per Tile Sequence Quality
It shows the average quality of the reads in each flow cell tile.
A flow cell tile is a section of the flow cell that Illumina sequencers image separately.
For good data, quality should be uniform across tiles, with no tile deviating from the others.

Per Sequence Quality Scores
It shows a histogram of the mean quality score of each read in the file.
For good data, the distribution should be concentrated at a high mean quality.

Per Base Sequence Content
It shows the proportion of each base (A, C, G, and T) at each position along the read.
The beginning of the read may be biased because it is a priming site,
but good data should show a stable, even distribution at the remaining positions.

Per Sequence GC Content
The distribution of GC content across reads is compared with a theoretical distribution.
For good data, the peaks of the two distributions should coincide and their shapes should be similar.
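The observed side of that comparison can be sketched as a histogram of per-read GC percentages. This is only an illustration, not FastQC's code; `gc_percent`, `gc_histogram`, and the 10% bin width are names and choices made up for the example.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one read, ignoring bases that are not A, C, G, or T."""
    called = [b for b in seq.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(b in "GC" for b in called) / len(called)


def gc_histogram(reads, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for read in reads:
        lower = int(gc_percent(read) // bin_width) * bin_width
        lower = min(lower, 100 - bin_width)  # put exactly 100% GC into the last bin
        hist[lower] += 1
    return dict(sorted(hist.items()))


reads = ["ACGT", "GGCC", "AATT", "GCGA"]
print(gc_histogram(reads))
```

A second peak or a strongly shifted peak in such a histogram is what would make the observed curve diverge from the theoretical one.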

Per Base N Content
It shows the proportion of N bases at each position along the read.
An N is called when the sequencer cannot confidently assign one of A, C, G, or T.
Good data should contain almost no N bases.
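A minimal sketch of the same measurement, assuming the reads are already available as plain strings. The function name `per_base_n_percent` is hypothetical.

```python
def per_base_n_percent(reads):
    """Percentage of reads with an N at each position (index 0 = first base).

    Reads may differ in length; each position is normalized by the number
    of reads that are long enough to cover it.
    """
    max_len = max((len(r) for r in reads), default=0)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for read in reads:
        for pos, base in enumerate(read.upper()):
            totals[pos] += 1
            if base == "N":
                n_counts[pos] += 1
    return [100.0 * n / t for n, t in zip(n_counts, totals)]


print(per_base_n_percent(["ACGN", "ANGT", "ACG"]))
```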

Sequence Length Distribution
It shows the length distribution of the reads.
For good data, all reads should have about the same length.

Sequence Duplication Levels
It shows the degree of duplication among read sequences.
Good data should show a low duplication level.
However, FastQC is not aware of the UMIs used in scRNA-seq,
so it counts the many reads from highly expressed genes as duplicates.
This can trigger a warning, but in that case it does not indicate a data quality problem.
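The effect of ignoring UMIs can be illustrated with a toy calculation. The sketch below is not FastQC's duplication estimate; `duplication_rate` is a hypothetical helper that reports the percentage of reads whose key has already been seen, first keyed on the sequence alone and then on the (UMI, sequence) pair.

```python
def duplication_rate(keys):
    """Percentage of items that repeat an already observed key."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 100.0 * (len(keys) - len(set(keys))) / len(keys)


# Four reads from one highly expressed gene: identical cDNA sequence,
# but three different UMIs, so three distinct original molecules.
reads = [
    ("AAAC", "GATTACA"),
    ("AAAG", "GATTACA"),
    ("AACC", "GATTACA"),
    ("AAAC", "GATTACA"),  # the only true PCR duplicate (same UMI as the first read)
]

sequence_only = duplication_rate(seq for _, seq in reads)
umi_aware = duplication_rate(reads)
print(sequence_only, umi_aware)
```

Judged by sequence alone, most of these reads look like duplicates, while only one of them repeats an already seen UMI.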

Overrepresented Sequences
It shows the count and percentage of sequences that make up more than 0.1% of all reads.
As with duplication, reads from highly expressed genes are sometimes flagged as overrepresented.
If the Possible Source column shows something other than No Hit, library contamination from that source may be suspected.
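The 0.1% rule can be sketched as a simple frequency filter. This is an illustration only: FastQC's own module works on a subset of the file and also looks up possible sources, which is not reproduced here. The function name `overrepresented` is hypothetical.

```python
from collections import Counter
from itertools import product


def overrepresented(reads, threshold_percent=0.1):
    """Return (sequence, count, percent) for sequences above the threshold."""
    counts = Counter(reads)
    total = sum(counts.values())
    hits = []
    for seq, n in counts.most_common():
        pct = 100.0 * n / total
        if pct > threshold_percent:
            hits.append((seq, n, pct))
    return hits


# Toy library: 4096 distinct 6-mers seen once each, plus one sequence seen 10 times.
background = ["".join(p) for p in product("ACGT", repeat=6)]
library = ["GGGGGGGG"] * 10 + background
for seq, n, pct in overrepresented(library):
    print(seq, n, round(pct, 3))
```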

Adapter Content
It shows the cumulative percentage of reads in which an adapter sequence has been observed up to each base position.
Good data should show no substantial adapter content.


※ For more information, see FastQC's official description.
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3 Analysis Modules/


Alignment / Mapping
Alignment or mapping is the process of determining the potential genomic locus of each sequence read.
In other words, it determines which part of the genome each read is derived from.
Many alignment methods have been developed since the bulk RNA-seq era,
but there are several reasons why they are difficult to apply directly to single-cell sequencing.

The location and length of the cell barcode (CB) and UMI vary by platform.
The CB and UMI sequences interfere with sequence alignment.
Because the data cover a very large number of cells and reads, the computation runs into memory or time problems.

Therefore, scRNA-seq-specific alignment tools such as Cell Ranger have been developed.
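To make the first two points concrete, the sketch below separates the CB and UMI from a barcode read using a platform-specific layout, so that only the cDNA read would be passed to the aligner. `ReadLayout` and `split_cb_umi` are hypothetical names, and the 16-base CB followed by a 12-base UMI in the usage example is just an illustrative layout, not a statement about any particular kit.

```python
from typing import NamedTuple, Tuple


class ReadLayout(NamedTuple):
    """Where the CB and UMI sit inside the barcode read (0-based offsets)."""
    cb_start: int
    cb_len: int
    umi_start: int
    umi_len: int


def split_cb_umi(barcode_read: str, layout: ReadLayout) -> Tuple[str, str]:
    """Extract (cell barcode, UMI) from a barcode read according to the layout."""
    cb = barcode_read[layout.cb_start:layout.cb_start + layout.cb_len]
    umi = barcode_read[layout.umi_start:layout.umi_start + layout.umi_len]
    if len(cb) != layout.cb_len or len(umi) != layout.umi_len:
        raise ValueError("read is too short for the given layout")
    return cb, umi


# Illustrative layout only: 16-base CB immediately followed by a 12-base UMI.
example_layout = ReadLayout(cb_start=0, cb_len=16, umi_start=16, umi_len=12)
read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "TTTTTTTT"
print(split_cb_umi(read1, example_layout))
```

Keeping the layout as data rather than hard-coding offsets is one simple way to support several platforms with the same code.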

Before alignment, there are various things to consider:

What alignment algorithm to use
    Spliced alignment, contiguous alignment, or lightweight mapping
    Global mapping or local alignment
Which reference to use
    Whole genome
    Transcriptome
    Augmented transcriptome including intron regions


Each tool has a different strategy, so you have to choose according to your data set and purpose.

Cell barcode (CB) correction

In droplet-based single-cell isolation systems, each cell is encapsulated in a droplet together with a barcoded bead.
The bead carries a unique sequence tag, the cell barcode (CB), which marks which cell each RNA molecule came from.

Errors that can occur during barcoding of a droplet-based library:

Doublet / Multiplet: multiple cells are captured in one droplet, so they share the same CB.
Empty droplet: the droplet carries no cell, but its barcode is still detected, so the number of cells is overcounted.
Sequence error: errors that occur during PCR or sequencing.

Several methods have been developed to correct these errors. A representative one uses a known list of potential barcodes.
Each observed barcode is checked against the list of potential barcodes provided by the platform,
because the barcode itself can be sequenced incorrectly.
Errors are corrected by comparing the Hamming distance or edit distance between the potential and observed barcodes.

One limitation is that an observed CB can be equally close to more than one barcode in the potential list.
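A minimal sketch of list-based correction using Hamming distance, assuming all barcodes have the same length. `correct_barcode` is a hypothetical helper, not the procedure of any specific tool; it returns `None` both for barcodes with no close match and for the ambiguous case where several listed barcodes are equally close.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, whitelist, max_dist=1):
    """Map an observed barcode to a listed barcode, or None if unresolved.

    Exact matches are kept. Otherwise the barcode is corrected only when
    exactly one listed barcode lies within max_dist mismatches.
    """
    if observed in whitelist:
        return observed
    candidates = [
        wl for wl in whitelist
        if len(wl) == len(observed) and hamming(observed, wl) <= max_dist
    ]
    return candidates[0] if len(candidates) == 1 else None


whitelist = {"AAAA", "CCCC", "AATT"}  # toy list; real barcodes are longer
print(correct_barcode("CCCA", whitelist))  # one neighbour within 1 mismatch
print(correct_barcode("AAAT", whitelist))  # equally close to AAAA and AATT
print(correct_barcode("GGGG", whitelist))  # nothing close enough
```

Discarding ambiguous barcodes is the conservative choice made here; other strategies weigh candidates by base quality or by how often each listed barcode was observed.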

UMI resolution
Once the cell-level problem has been solved by CB correction, the next step is to measure the amount of each gene in each cell.

A unique molecular identifier (UMI) is a short sequence that uniquely tags each RNA molecule in a cell.
Library preparation includes a PCR amplification step so that there is enough material to sequence reliably.
As a result, the same molecule is amplified and sequenced multiple times.
Since the UMI is attached before amplification, the actual molecule count can be recovered through de-duplication.

That is, reads with the same UMI are assumed to have originated from one molecule,
which removes the bias that may be introduced during amplification.

UMI resolution is the process of taking the reads associated with each observed UMI and estimating
the original number of molecules from each gene, that is, the transcript expression level.
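In its simplest form, ignoring sequencing errors in the UMI, this amounts to counting distinct UMIs per cell and gene. The sketch below assumes each read has already been reduced to a (cell barcode, gene, UMI) triple; the function name `count_molecules` is hypothetical.

```python
from collections import defaultdict


def count_molecules(records):
    """Count distinct UMIs for every (cell barcode, gene) pair.

    records: iterable of (cell_barcode, gene, umi) triples, one per read.
    """
    umis = defaultdict(set)
    for cell_barcode, gene, umi in records:
        umis[(cell_barcode, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


reads = [
    ("CELL1", "GAPDH", "AAAC"),
    ("CELL1", "GAPDH", "AAAC"),  # PCR duplicate of the read above
    ("CELL1", "GAPDH", "GGTA"),
    ("CELL1", "ACTB", "AAAC"),   # same UMI, different gene: a separate molecule
    ("CELL2", "GAPDH", "AAAC"),  # same UMI, different cell: a separate molecule
]
print(count_molecules(reads))
```

Five reads collapse to four molecules here, and the resulting dictionary is effectively a sparse count matrix.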

However, there are several challenges in UMI resolution.

UMI sequencing error: the UMI is read incorrectly because of an error during PCR or sequencing.
UMI mapping error: the RNA sequence tagged by the UMI maps to more than one location.
Convergent: the same UMI is attached to different molecules from the same gene.
Divergent: two or more UMIs are observed for the same molecule.

UMI-tools is a representative tool for UMI resolution that takes these errors into account.

※ It is available through PyPI, conda, and GitHub, and supports UMI resolution and count matrix creation.
https://github.com/CGATOxford/UMI-tools
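The sequencing-error case can be handled by merging UMIs that differ by one base when one of them is much more abundant than the other. The sketch below follows the count rule described for the directional method of UMI-tools as I understand it (merge a rarer UMI into a neighbour when count(parent) >= 2 * count(child) - 1). It is a simplified illustration written for this post, not the UMI-tools code or API, and `collapse_umis` is a hypothetical name. It assumes all UMIs come from one cell and one gene and have equal length.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts):
    """Group UMIs that are likely copies of one molecule.

    umi_counts: dict mapping UMI sequence to read count.
    Returns a list of groups; the first UMI of each group is its most
    abundant member, and the number of groups is the molecule estimate.
    """
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    groups = []
    for seed in order:
        if seed in assigned:
            continue
        assigned.add(seed)
        group = [seed]
        stack = [seed]
        while stack:  # follow parent -> child links transitively
            parent = stack.pop()
            for child in order:
                if child in assigned:
                    continue
                close = hamming(parent, child) == 1
                dominant = umi_counts[parent] >= 2 * umi_counts[child] - 1
                if close and dominant:
                    assigned.add(child)
                    group.append(child)
                    stack.append(child)
        groups.append(group)
    return groups


counts = {"AAAA": 100, "AAAT": 3, "CCCC": 50, "CCCG": 40}
groups = collapse_umis(counts)
print(groups, len(groups))
```

In the toy counts, the rare `AAAT` is absorbed into `AAAA` as a probable error, while `CCCC` and `CCCG` have similar abundance and are kept as two separate molecules.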

Count matrix quality control
Quality control is important even after the count matrix has been created.
The metrics used for count matrix QC are as follows.

Total mapped reads
Distribution of UMIs per cell
Distribution of UMI deduplication rate
Distribution of detected genes per cell
Mitochondrial gene expression percentage

Empty droplets and multiplets are then filtered out in consideration of the above.
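Three of these metrics can be computed directly from the count matrix. The sketch below uses a plain dictionary of per-cell gene counts; `cell_qc` and `passing_cells` are hypothetical helpers, the `MT-` prefix for mitochondrial genes is an assumption that fits human gene symbols, and the thresholds in the usage example are arbitrary values chosen for the toy data, not recommendations.

```python
def cell_qc(counts, mito_prefix="MT-"):
    """Per-cell QC metrics from {cell: {gene: umi_count}}."""
    metrics = {}
    for cell, genes in counts.items():
        total = sum(genes.values())
        detected = sum(1 for n in genes.values() if n > 0)
        mito = sum(n for g, n in genes.items() if g.startswith(mito_prefix))
        metrics[cell] = {
            "total_umis": total,
            "detected_genes": detected,
            "pct_mito": 100.0 * mito / total if total else 0.0,
        }
    return metrics


def passing_cells(metrics, min_umis, min_genes, max_pct_mito):
    """Cells that satisfy all three thresholds."""
    return [
        cell for cell, m in metrics.items()
        if m["total_umis"] >= min_umis
        and m["detected_genes"] >= min_genes
        and m["pct_mito"] <= max_pct_mito
    ]


counts = {
    "cell1": {"ACTB": 50, "GAPDH": 30, "MT-CO1": 20},
    "cell2": {"ACTB": 1, "MT-CO1": 0},   # very few UMIs: possibly an empty droplet
    "cell3": {"ACTB": 10, "MT-CO1": 40},  # mostly mitochondrial reads
}
metrics = cell_qc(counts)
print(metrics)
print(passing_cells(metrics, min_umis=10, min_genes=2, max_pct_mito=25.0))
```

In practice the thresholds are chosen by looking at the distributions listed above rather than fixed in advance. Unusually high UMI and gene counts, which can indicate multiplets, would need an additional upper bound that this sketch leaves out.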
