M͟i͟xed Bayesian n̲etwork for i̲soform quantification (miniQuant) is a highly accurate bioinformatics tool for transcript abundance estimation. miniQuant features:
- Novel K-value metric: a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification to which researchers should pay extra attention.
- Mixed Bayesian network: a novel mixed Bayesian network model for transcript abundance estimation that can be applied to different data scenarios: long-read-alone and hybrid (i.e., long reads plus short reads), integrating the strengths of both long reads and short reads.
Linux operating system
The software has been tested with the following software versions:
Python==3.9.7
minimap2==2.24
bowtie2==2.4.1
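To compare a local environment with the tested versions listed above, a check along these lines may help. It is only a sketch: the helper name `tool_version` is ours and not part of miniQuant, and it assumes each tool responds to `--version`, which minimap2 and bowtie2 both do.

```python
import shutil
import subprocess
import sys

def tool_version(executable):
    """Return the first line printed by `<executable> --version`, or None if it is not found."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None

# Versions this README reports as tested.
TESTED = {"minimap2": "2.24", "bowtie2": "2.4.1"}

for name, expected in TESTED.items():
    found = tool_version(name)
    status = "not found on PATH" if found is None else found
    print(f"{name}: tested with {expected}, local: {status}")
print(f"python: tested with 3.9.7, local: {sys.version.split()[0]}")
```

The script only reports what it finds; it does not decide whether a different version is acceptable.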
We recommend running the software with Docker or Singularity.
Use Docker
# download and load docker image
wget https://miniquant.s3.us-east-2.amazonaws.com/miniQuant.tar.gz && docker load --input miniQuant.tar.gz && rm miniQuant.tar.gz
# run inside container
docker run -it --rm tidesun/miniquant:1.0 bash
cd / && source miniQuant/base/bin/activate
Use Singularity
Tested on singularity 4.1.3.
wget https://miniquant.s3.us-east-2.amazonaws.com/miniQuant.sif && singularity build --sandbox miniQuant_singularity miniQuant.sif && rm miniQuant.sif
Use Singularity to run a command:
singularity run miniQuant_singularity bash -c "source /miniQuant/base/bin/activate && python /miniQuant/isoform_quantification/main.py --help"
Linux operating system with conda installed
conda create -n miniQuant python=3.8 openblas
conda activate miniQuant
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/miniQuant-1.0.0.tar.gz | tar xvz --one-top-level=miniQuant --strip-components 1
cd miniQuant
python -m venv base
source base/bin/activate
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/pretrained_models.tar.gz | tar xvz
pip install --upgrade pip
pip install setuptools==57.4.0
pip install -r requirements.txt
cd miniQuant
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/SIRV_pretrained_models.tar.gz | tar xvz
miniQuant starts from a Sequence Alignment/Map (SAM) file. For FASTA/FASTQ input, refer to the Data Preparation section; otherwise, refer to the Isoform quantification by miniQuant section.
Preparation:
- Install minimap2 (v2.24) and bowtie2 (v2.4.1)
Required:
- long reads alignment data mapped to the reference genome in SAM format, example data can be found in
miniQuant/example/LR.sam
- gene isoform annotation in GTF format, example data can be found in
miniQuant/example/annotation.gtf
Optional:
- short reads alignment data mapped to reference transcriptome in SAM format, example data can be found in
miniQuant/example/SR.sam
Sequence alignment recommendation:
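Use minimap2 to map long reads data (e.g. ENCFF714YOZ.fastq.gz) to the reference genome (e.g. GRCh38.primary_assembly.genome.fa). Two variants are shown. The first passes -u f, which makes minimap2 look for canonical splice sites on the transcript strand only; the second omits it and keeps the splice preset's default of checking both strands. Run only one of them, since both write LR.sam.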
minimap2 -a --MD -t 10 -N 0 -u f -x splice \
GRCh38.primary_assembly.genome.fa ENCFF714YOZ.fastq.gz > LR.sam
minimap2 -a --MD -t 10 -N 0 -x splice \
GRCh38.primary_assembly.genome.fa ENCFF714YOZ.fastq.gz > LR.sam
Use Bowtie2 to map short reads data (e.g. paired-end reads: ENCFF892WVN.fastq.gz and ENCFF481BLH.fastq.gz) to the reference transcriptome (e.g. gencode.v39.transcripts.fa)
bowtie2-build -f gencode.v39.transcripts.fa bowtie2_index
bowtie2 -q --phred33 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 10 -k 200 \
-x bowtie2_index -1 ENCFF892WVN.fastq.gz -2 ENCFF481BLH.fastq.gz > SR.sam
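Before quantification, it can be useful to confirm that the resulting SAM files actually contain mapped reads. The sketch below counts records by the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name and the three demo records are illustrative and not part of miniQuant.

```python
from collections import Counter

def sam_alignment_summary(lines):
    """Count primary, unmapped, secondary and supplementary records in SAM text."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        counts["total"] += 1
        if flag & 0x4:
            counts["unmapped"] += 1
        elif flag & 0x100:
            counts["secondary"] += 1
        elif flag & 0x800:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    return dict(counts)

# Tiny made-up SAM content for demonstration only.
demo = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    "read3\t256\tchr1\t200\t0\t10M\t*\t0\t0\t*\t*",
]
print(sam_alignment_summary(demo))
```

For a real file, pass an open file handle, e.g. `with open("LR.sam") as handle: print(sam_alignment_summary(handle))`. With the bowtie2 command above, many secondary records are expected because -k 200 reports up to 200 alignments per read.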
miniQuant provides two options for isoform quantification:
- quantify by long reads data alone.
- quantify using short and long reads data in hybrid mode.
A toy dataset is provided in miniQuant/example/. Follow the example commands below.
miniQuant requires gene isoform annotation in GTF format (-gtf) and long reads sequence alignment mapped to the reference genome in SAM format (-lrsam) as the input.
Example: quantify using long reads data (miniQuant/example/LR.sam) with annotation (e.g. miniQuant/example/annotation.gtf), results in miniQuant_res folder
source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py quantify \
-gtf miniQuant/example/annotation.gtf \
-lrsam miniQuant/example/LR.sam \
-t 1 \
-o miniQuant_res
arguments:
-gtf GTF_ANNOTATION_PATH, --gtf_annotation_path GTF_ANNOTATION_PATH
The path of isoform annotation file in GTF format
-lrsam LONG_READ_SAM_PATH, --long_read_sam_path LONG_READ_SAM_PATH
The path of long read sam file mapping to reference genome.
-t THREADS, --threads THREADS
Number of threads. Default is 1.
-o OUTPUT_PATH, --output_path OUTPUT_PATH
The path of output directory
Isoform quantification abundance
miniQuant_res/Isoform_abundance.out
Isoform Gene TPM num_expected_LRs
ENST00000373020.9 ENSG00000000003.15 710234.9711212328 150.56981387770136
ENST00000494424.1 ENSG00000000003.15 0.06848555891537092 1.4518938490058638e-05
ENST00000496771.5 ENSG00000000003.15 103773.58490566035 21.999999999999996
ENST00000612152.4 ENSG00000000003.15 3.2608726820945185e-20 6.913050086040379e-24
ENST00000614008.4 ENSG00000000003.15 181274.39435547238 38.430171603360144
- Isoform: isoform ID
- Gene: gene ID
- TPM: isoform TPM
The result is a TSV file showing the abundance of each gene isoform, one isoform per line.
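Since the output is a TSV file with a header row, it can be loaded with Python's standard `csv` module. The sketch below sums isoform TPM per gene. It assumes the columns are tab-separated and named as in the sample above; the function name is ours, and the embedded text reuses two rows from that sample.

```python
import csv
import io
from collections import defaultdict

def gene_tpm_totals(handle):
    """Sum isoform TPM per gene from a tab-separated Isoform_abundance.out."""
    reader = csv.DictReader(handle, delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        totals[row["Gene"]] += float(row["TPM"])
    return dict(totals)

# Two rows copied from the sample output, assuming tab separators.
SAMPLE_TEXT = (
    "Isoform\tGene\tTPM\tnum_expected_LRs\n"
    "ENST00000373020.9\tENSG00000000003.15\t710234.9711212328\t150.56981387770136\n"
    "ENST00000496771.5\tENSG00000000003.15\t103773.58490566035\t21.999999999999996\n"
)

print(gene_tpm_totals(io.StringIO(SAMPLE_TEXT)))
```

For a real run, open the file instead: `with open("miniQuant_res/Isoform_abundance.out") as handle: totals = gene_tpm_totals(handle)`.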
- Integrates short and long reads sequencing data from the same organism for better quantification performance.
- In hybrid mode, miniQuant requires gene isoform annotation in GTF format (-gtf), long reads sequence alignment mapped to the reference genome in SAM format (-lrsam), and short reads sequence alignment mapped to the reference transcriptome in SAM format (-srsam) as the input.
- A pretrained machine learning model is used for optimal integration: simply set --pretrained_model_path to the long reads sequencing platform (i.e. cDNA-ONT for Oxford Nanopore cDNA sequencing, cDNA-PacBio for PacBio cDNA sequencing, and dRNA-ONT for Oxford Nanopore direct-RNA sequencing).
Example: quantify using short reads (e.g. miniQuant/example/SR.sam) and long reads data (e.g. miniQuant/example/LR.sam) with annotation (e.g. miniQuant/example/annotation.gtf), results in miniQuant_res_hybrid folder
source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py quantify \
-gtf miniQuant/example/annotation.gtf \
-lrsam miniQuant/example/LR.sam \
-srsam miniQuant/example/SR.sam \
--pretrained_model_path dRNA-ONT \
--EM_choice hybrid \
-t 1 \
-o miniQuant_res_hybrid
arguments:
-gtf GTF_ANNOTATION_PATH, --gtf_annotation_path GTF_ANNOTATION_PATH
The path of isoform annotation file in GTF format
-lrsam LONG_READ_SAM_PATH, --long_read_sam_path LONG_READ_SAM_PATH
The path of long read sam file mapping to reference genome.
-srsam SHORT_READ_SAM_PATH, --short_read_sam_path SHORT_READ_SAM_PATH
The path of short read sam file mapping to reference transcriptome.
--pretrained_model_path PRETRAINED_MODEL_PATH
The pretrained model path to identify the alpha. default: cDNA-ONT.
Can be one of the options [cDNA-ONT,dRNA-ONT,cDNA-PacBio] or file path of pretrained model.
-t THREADS, --threads THREADS
Number of threads. Default is 1.
-o OUTPUT_PATH, --output_path OUTPUT_PATH
The path of output directory
optional arguments
--eff_len_option EFF_LEN_OPTION
How to calculate the effective length [kallisto,RSEM]. Choose kallisto
or RSEM to calculate the effective length in the same way as the
corresponding method. Default is kallisto.
--EM_SR_num_iters EM_SR_NUM_ITERS
Number of maximum iterations for EM algorithm. Default is 200.
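When miniQuant is driven from a script, the two modes differ only in a few arguments. The sketch below assembles the argument list for either mode using only the flags documented above. The function `build_quantify_command` is a hypothetical helper, not part of miniQuant, and the script path assumes the directory layout used in the examples.

```python
import shlex

def build_quantify_command(gtf, lrsam, output, srsam=None, platform="cDNA-ONT", threads=1):
    """Assemble the argument list for `main.py quantify` in long-read-alone or hybrid mode."""
    command = [
        "python", "miniQuant/isoform_quantification/main.py", "quantify",
        "-gtf", gtf,
        "-lrsam", lrsam,
    ]
    if srsam is not None:
        # Hybrid mode: add the short-read alignment and the platform-specific model.
        command += [
            "-srsam", srsam,
            "--pretrained_model_path", platform,
            "--EM_choice", "hybrid",
        ]
    command += ["-t", str(threads), "-o", output]
    return command

lr_only = build_quantify_command(
    "miniQuant/example/annotation.gtf", "miniQuant/example/LR.sam", "miniQuant_res"
)
hybrid = build_quantify_command(
    "miniQuant/example/annotation.gtf", "miniQuant/example/LR.sam", "miniQuant_res_hybrid",
    srsam="miniQuant/example/SR.sam", platform="dRNA-ONT",
)
print(shlex.join(lr_only))
print(shlex.join(hybrid))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the virtual environment is active.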
Isoform quantification abundance
miniQuant_res_hybrid/Isoform_abundance.out
Isoform Gene Effective length TPM num_expected_SRs num_expected_LRs
ENST00000373020.9 ENSG00000000003.15 3535.9141630901286 728571.217176296 75.0428353691585 153.72852682419847
ENST00000494424.1 ENSG00000000003.15 587.9141630901288 0.08438441577205032 8.691594824521182e-06 1.7805111727902615e-05
ENST00000496771.5 ENSG00000000003.15 792.9141630901288 97566.37817660523 10.049336952190338 20.5865057952637
ENST00000612152.4 ENSG00000000003.15 3563.9141630901286 0.008307999564622307 8.557239551560977e-07 1.752987908135307e-06
ENST00000614008.4 ENSG00000000003.15 667.9141630901288 173862.31195468327 17.907818131332377 36.68494782243817
- Isoform: isoform ID
- Gene: gene ID
- Effective length: isoform effective length
- TPM: isoform TPM
The result is a TSV file showing the abundance of each gene isoform, one isoform per line.
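To see how the hybrid estimates differ from a long-read-alone run, the two output files can be joined on the isoform ID. The sketch below assumes both files are tab-separated with the headers shown in the sample outputs. The function names are ours, and the embedded rows are copied from those samples.

```python
import csv
import io

def read_tpm(handle):
    """Map isoform ID to TPM from a tab-separated Isoform_abundance.out."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Isoform"]: float(row["TPM"]) for row in reader}

def compare_tpm(lr_handle, hybrid_handle):
    """Return {isoform: (long_read_tpm, hybrid_tpm)} for isoforms present in both files."""
    lr = read_tpm(lr_handle)
    hybrid = read_tpm(hybrid_handle)
    return {iso: (lr[iso], hybrid[iso]) for iso in lr if iso in hybrid}

# Rows copied from the sample outputs, assuming tab separators.
LR_TEXT = (
    "Isoform\tGene\tTPM\tnum_expected_LRs\n"
    "ENST00000373020.9\tENSG00000000003.15\t710234.9711212328\t150.56981387770136\n"
    "ENST00000496771.5\tENSG00000000003.15\t103773.58490566035\t21.999999999999996\n"
)
HYBRID_TEXT = (
    "Isoform\tGene\tEffective length\tTPM\tnum_expected_SRs\tnum_expected_LRs\n"
    "ENST00000373020.9\tENSG00000000003.15\t3535.9141630901286\t728571.217176296\t75.0428353691585\t153.72852682419847\n"
    "ENST00000496771.5\tENSG00000000003.15\t792.9141630901288\t97566.37817660523\t10.049336952190338\t20.5865057952637\n"
)

for isoform, (lr_tpm, hybrid_tpm) in compare_tpm(io.StringIO(LR_TEXT), io.StringIO(HYBRID_TEXT)).items():
    print(f"{isoform}\tLR={lr_tpm:.1f}\thybrid={hybrid_tpm:.1f}")
```

With real results, replace the `io.StringIO` objects with file handles opened on miniQuant_res/Isoform_abundance.out and miniQuant_res_hybrid/Isoform_abundance.out.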
The K-value is a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification to which researchers should pay extra attention. The K-value can be calculated from a gene isoform annotation in GTF format:
source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py cal_K_value \
-gtf miniQuant/example/annotation.gtf \
-t 1 \
-o miniQuant_kvalue
optional arguments:
-t THREADS, --threads THREADS
Number of threads
--sr_region_selection SR_REGION_SELECTION
SR region selection methods
[default:read_length][read_length,num_exons]
K-value for each gene
miniQuant_kvalue/kvalues.out
Gene Chr Num_isoforms Kvalue
ENSG00000000003.15 chrX 5 14.263027941780145
- Gene: gene ID
- Chr: chromosome ID
- Num_isoforms: number of isoforms in the gene
- Kvalue: K-value
*For a gene that consists only of short isoforms (i.e. all isoforms with length <150 bp), the K-value is not calculated and an NA value is reported.
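To pick out genes that may need extra attention, kvalues.out can be filtered by K-value. The sketch below rests on several assumptions: the file is tab-separated, missing values are written as a non-numeric token such as NA, and larger K-values indicate harder-to-quantify genes (our reading of the description above). The threshold of 10 and the gene `GENE_SHORT_ONLY` are purely illustrative.

```python
import csv
import io
import math

def load_k_values(handle):
    """Map gene ID to K-value; non-numeric entries such as NA become NaN."""
    reader = csv.DictReader(handle, delimiter="\t")
    kvalues = {}
    for row in reader:
        try:
            kvalues[row["Gene"]] = float(row["Kvalue"])
        except ValueError:
            kvalues[row["Gene"]] = math.nan
    return kvalues

def genes_above(kvalues, threshold):
    """Return sorted gene IDs whose K-value is defined and exceeds the threshold."""
    return sorted(
        gene for gene, k in kvalues.items() if not math.isnan(k) and k > threshold
    )

# First data row copied from the sample output; the second gene is hypothetical.
K_TEXT = (
    "Gene\tChr\tNum_isoforms\tKvalue\n"
    "ENSG00000000003.15\tchrX\t5\t14.263027941780145\n"
    "GENE_SHORT_ONLY\tchr1\t2\tNA\n"
)

kvalues = load_k_values(io.StringIO(K_TEXT))
print(genes_above(kvalues, threshold=10.0))  # threshold chosen only for illustration
```

Genes with an undefined K-value are skipped by the filter and can be listed separately with `math.isnan` if needed.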