miniQuant

M͟i͟xed Bayesian n̲etwork for i̲soform quantification (miniQuant) is a highly accurate bioinformatics tool for transcript abundance estimation. miniQuant features:

Novel K-value metric: a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification, to which researchers should pay extra attention in their study.
Mixed Bayesian network: a novel mixed Bayesian network model for transcript abundance estimation that can be applied to different data scenarios: long-read-alone and hybrid (i.e., long reads plus short reads), the latter integrating the strengths of both long reads and short reads.

Dependency

Linux operating system

The software has been tested with the following software versions:

Python==3.9.7
minimap2==2.24
bowtie2==2.4.1

Installation

We recommend using Docker or Singularity to run the software.

[Recommended] Docker

Use Docker

# download and load docker image
wget https://miniquant.s3.us-east-2.amazonaws.com/miniQuant.tar.gz && docker load --input miniQuant.tar.gz && rm miniQuant.tar.gz
# run inside container
docker run -it --rm tidesun/miniquant:1.0 bash
cd / && source miniQuant/base/bin/activate

[Recommended] Singularity

Use Singularity

Tested on singularity 4.1.3.

wget https://miniquant.s3.us-east-2.amazonaws.com/miniQuant.sif && singularity build --sandbox miniQuant_singularity miniQuant.sif && rm miniQuant.sif

Use Singularity to run a command:

singularity run miniQuant_singularity bash -c "source /miniQuant/base/bin/activate && python /miniQuant/isoform_quantification/main.py --help"

[Not Recommended] Install from source

Dependency

Linux operating system with conda installed

conda create -n miniQuant python=3.8 openblas
conda activate miniQuant
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/miniQuant-1.0.0.tar.gz | tar xvz --one-top-level=miniQuant --strip-components 1
cd miniQuant
python -m venv base
source base/bin/activate
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/pretrained_models.tar.gz | tar xvz
pip install --upgrade pip
pip install setuptools==57.4.0
pip install -r requirements.txt

Optional: install pretrained models for SIRV set-4 real data

cd miniQuant
wget -qO- https://miniquant.s3.us-east-2.amazonaws.com/SIRV_pretrained_models.tar.gz | tar xvz

Usage

miniQuant takes Sequence Alignment/Map (SAM) files as input. If you are starting from fasta/fastq files, refer to the Data Preparation section; otherwise, go directly to the Isoform quantification by miniQuant section.

Data Preparation (if starting from fasta/fastq files)

Preparation:

Install minimap2 (v2.24) and bowtie2 (v2.4.1).

Required:

long reads alignment data mapped to the reference genome in SAM format; example data can be found in miniQuant/example/LR.sam
gene isoform annotation in GTF format; example data can be found in miniQuant/example/annotation.gtf

Optional:

short reads alignment data mapped to the reference transcriptome in SAM format; example data can be found in miniQuant/example/SR.sam

Sequence alignment recommendation:

Use `minimap2` to map long reads data (e.g. ENCFF714YOZ.fastq.gz) to the reference genome (e.g. GRCh38.primary_assembly.genome.fa)

For dRNA-ONT data

minimap2 -a --MD -t 10 -N 0 -u f -x splice \
GRCh38.primary_assembly.genome.fa ENCFF714YOZ.fastq.gz > LR.sam

For cDNA-ONT or cDNA-PacBio data

minimap2 -a --MD -t 10 -N 0 -x splice \
GRCh38.primary_assembly.genome.fa ENCFF714YOZ.fastq.gz > LR.sam
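The two minimap2 commands above differ only in the `-u f` option used for dRNA-ONT data. As a minimal sketch (the helper name `minimap2_command` is hypothetical and not part of miniQuant), the choice can be encoded in a small Python function that builds the argument list without running anything:

```python
import shlex

# Platform labels taken from this README.
PLATFORMS = {"dRNA-ONT", "cDNA-ONT", "cDNA-PacBio"}


def minimap2_command(reference_fa, reads_fq, platform, threads=10):
    """Build the minimap2 argument list shown above for one platform.

    Hypothetical helper for illustration; it only assembles the flags
    listed in this README and does not execute minimap2.
    """
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    cmd = ["minimap2", "-a", "--MD", "-t", str(threads), "-N", "0"]
    if platform == "dRNA-ONT":
        # The dRNA-ONT command in this README adds "-u f".
        cmd += ["-u", "f"]
    cmd += ["-x", "splice", reference_fa, reads_fq]
    return cmd


cmd = minimap2_command(
    "GRCh38.primary_assembly.genome.fa", "ENCFF714YOZ.fastq.gz", "dRNA-ONT"
)
# Print a shell-ready line; redirect stdout to LR.sam as in the commands above.
print(shlex.join(cmd) + " > LR.sam")
```

To actually run the alignment, the list could be passed to `subprocess.run` with `stdout` opened on `LR.sam`, assuming minimap2 is on the `PATH`.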

Use `Bowtie2` to map short reads data (e.g. paired end reads: ENCFF892WVN.fastq.gz and ENCFF481BLH.fastq.gz) to the reference transcriptome (e.g. gencode.v39.transcripts.fa)

bowtie2-build -f gencode.v39.transcripts.fa bowtie2_index

bowtie2 -q --phred33 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 10 -k 200 \
-x bowtie2_index -1 ENCFF892WVN.fastq.gz -2 ENCFF481BLH.fastq.gz > SR.sam

Isoform quantification by miniQuant

miniQuant provides two options for isoform quantification:

quantify by long reads data alone.
quantify using short and long reads data in hybrid mode.

A toy dataset is provided in miniQuant/example/. Please follow the example commands below.

1. Quantify using long reads data alone

miniQuant requires gene isoform annotation in GTF format (-gtf) and long reads sequence alignment mapped to the reference genome in SAM format (-lrsam) as the input.

Example: quantify using long reads data (`miniQuant/example/LR.sam`) with annotation (e.g. `miniQuant/example/annotation.gtf`); results are written to the `miniQuant_res` folder

source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py quantify \
-gtf miniQuant/example/annotation.gtf \
-lrsam miniQuant/example/LR.sam \
-t 1 \
-o miniQuant_res

arguments:
  -gtf GTF_ANNOTATION_PATH, --gtf_annotation_path GTF_ANNOTATION_PATH
                        The path of isoform annotation file in GTF format
  -lrsam LONG_READ_SAM_PATH, --long_read_sam_path LONG_READ_SAM_PATH
                        The path of long read sam file mapping to reference genome.
  -t THREADS, --threads THREADS
                        Number of threads. Default is 1.
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        The path of output directory

Results explanation

Isoform quantification abundance
miniQuant_res/Isoform_abundance.out

Isoform	Gene	TPM	num_expected_LRs
ENST00000373020.9	ENSG00000000003.15	710234.9711212328	150.56981387770136
ENST00000494424.1	ENSG00000000003.15	0.06848555891537092	1.4518938490058638e-05
ENST00000496771.5	ENSG00000000003.15	103773.58490566035	21.999999999999996
ENST00000612152.4	ENSG00000000003.15	3.2608726820945185e-20	6.913050086040379e-24
ENST00000614008.4	ENSG00000000003.15	181274.39435547238	38.430171603360144

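Isoform: isoform ID
Gene: gene ID
TPM: isoform abundance in transcripts per million (TPM)
num_expected_LRs: expected number of long reads for the isoform
The result is a TSV file showing the abundance of each gene isoform, one isoform per line.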
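Because the output is a plain TSV file, it can be loaded with the Python standard library. The sketch below is illustrative only: it embeds the sample rows shown above so that it runs on its own, and the helper name `read_abundance` is hypothetical. In practice, open `miniQuant_res/Isoform_abundance.out` instead of the embedded string.

```python
import csv
import io

# Sample rows copied from the example output above.
SAMPLE = (
    "Isoform\tGene\tTPM\tnum_expected_LRs\n"
    "ENST00000373020.9\tENSG00000000003.15\t710234.9711212328\t150.56981387770136\n"
    "ENST00000494424.1\tENSG00000000003.15\t0.06848555891537092\t1.4518938490058638e-05\n"
    "ENST00000496771.5\tENSG00000000003.15\t103773.58490566035\t21.999999999999996\n"
    "ENST00000612152.4\tENSG00000000003.15\t3.2608726820945185e-20\t6.913050086040379e-24\n"
    "ENST00000614008.4\tENSG00000000003.15\t181274.39435547238\t38.430171603360144\n"
)


def read_abundance(handle):
    """Parse an Isoform_abundance.out stream into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Convert the numeric columns from text to float.
        row["TPM"] = float(row["TPM"])
        row["num_expected_LRs"] = float(row["num_expected_LRs"])
        rows.append(row)
    return rows


rows = read_abundance(io.StringIO(SAMPLE))
# Rank isoforms from most to least abundant.
top = sorted(rows, key=lambda r: r["TPM"], reverse=True)
for r in top[:3]:
    print(r["Isoform"], r["Gene"], round(r["TPM"], 2))
```

For a real run, replace `io.StringIO(SAMPLE)` with `open("miniQuant_res/Isoform_abundance.out", newline="")`.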

2. Quantify using short and long reads data in hybrid mode

Hybrid mode integrates short and long reads sequencing data from the same organism for better quantification performance.
In hybrid mode, miniQuant requires gene isoform annotation in GTF format (-gtf), long reads sequence alignment mapped to the reference genome in SAM format (-lrsam), and short reads sequence alignment mapped to the reference transcriptome in SAM format (-srsam) as input.
A pretrained machine learning model is used for optimal integration: simply set --pretrained_model_path to the long reads sequencing platform (i.e. cDNA-ONT for Oxford Nanopore cDNA sequencing, cDNA-PacBio for PacBio cDNA sequencing, or dRNA-ONT for Oxford Nanopore direct-RNA sequencing).

Example: quantify using short reads (e.g. `miniQuant/example/SR.sam`) and long reads data (e.g. `miniQuant/example/LR.sam`) with annotation (e.g. `miniQuant/example/annotation.gtf`); results are written to the `miniQuant_res_hybrid` folder

source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py quantify \
-gtf miniQuant/example/annotation.gtf \
-lrsam miniQuant/example/LR.sam \
-srsam miniQuant/example/SR.sam \
--pretrained_model_path dRNA-ONT \
--EM_choice hybrid \
-t 1 \
-o miniQuant_res_hybrid

arguments:
  -gtf GTF_ANNOTATION_PATH, --gtf_annotation_path GTF_ANNOTATION_PATH
                        The path of isoform annotation file in GTF format
  -lrsam LONG_READ_SAM_PATH, --long_read_sam_path LONG_READ_SAM_PATH
                        The path of long read sam file mapping to reference genome.
  -srsam SHORT_READ_SAM_PATH, --short_read_sam_path SHORT_READ_SAM_PATH
                        The path of short read sam file mapping to reference transcriptome.
  --pretrained_model_path PRETRAINED_MODEL_PATH
                        The pretrained model path to identify the alpha. default: cDNA-ONT.
                        Can be one of the options [cDNA-ONT,dRNA-ONT,cDNA-PacBio] or file path of pretrained model.
  -t THREADS, --threads THREADS
                        Number of threads. Default is 1.
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        The path of output directory

Advanced parameters for hybrid quantification

optional arguments
  --eff_len_option EFF_LEN_OPTION
                        How to calculate the effective length [kallisto,RSEM]. Choose kallisto 
                        or RSEM to calculate the effective length in the same way as the 
                        corresponding method. Default is kallisto.
  --EM_SR_num_iters EM_SR_NUM_ITERS
                        Number of maximum iterations for EM algorithm. Default is 200.
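The long-read-alone and hybrid invocations share most of their arguments, so a pipeline script can assemble either one from the same inputs. The following is a sketch under that assumption; `quantify_command` is a hypothetical helper, and it uses only the flags that appear in the example commands in this README.

```python
import shlex

MAIN_PY = "miniQuant/isoform_quantification/main.py"


def quantify_command(gtf, lrsam, out_dir, srsam=None,
                     pretrained_model="cDNA-ONT", threads=1):
    """Assemble the `main.py quantify` argument list.

    Hypothetical helper: passing `srsam` switches to the hybrid form of
    the command; omitting it gives the long-read-alone form.
    """
    cmd = ["python", MAIN_PY, "quantify", "-gtf", gtf, "-lrsam", lrsam]
    if srsam is not None:
        # Hybrid mode: add the short-read SAM, the pretrained model
        # (platform name or model file path), and the EM choice.
        cmd += ["-srsam", srsam,
                "--pretrained_model_path", pretrained_model,
                "--EM_choice", "hybrid"]
    cmd += ["-t", str(threads), "-o", out_dir]
    return cmd


long_read_cmd = quantify_command(
    "miniQuant/example/annotation.gtf", "miniQuant/example/LR.sam",
    "miniQuant_res",
)
hybrid_cmd = quantify_command(
    "miniQuant/example/annotation.gtf", "miniQuant/example/LR.sam",
    "miniQuant_res_hybrid", srsam="miniQuant/example/SR.sam",
    pretrained_model="dRNA-ONT",
)
print(shlex.join(long_read_cmd))
print(shlex.join(hybrid_cmd))
```

Either list can be handed to `subprocess.run(cmd, check=True)` once the miniQuant virtual environment is active.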

Results explanation

Isoform quantification abundance
miniQuant_res_hybrid/Isoform_abundance.out

Isoform	Gene	Effective length	TPM	num_expected_SRs	num_expected_LRs
ENST00000373020.9	ENSG00000000003.15	3535.9141630901286	728571.217176296	75.0428353691585	153.72852682419847
ENST00000494424.1	ENSG00000000003.15	587.9141630901288	0.08438441577205032	8.691594824521182e-06	1.7805111727902615e-05
ENST00000496771.5	ENSG00000000003.15	792.9141630901288	97566.37817660523	10.049336952190338	20.5865057952637
ENST00000612152.4	ENSG00000000003.15	3563.9141630901286	0.008307999564622307	8.557239551560977e-07	1.752987908135307e-06
ENST00000614008.4	ENSG00000000003.15	667.9141630901288	173862.31195468327	17.907818131332377	36.68494782243817

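Isoform: isoform ID
Gene: gene ID
Effective length: isoform effective length
TPM: isoform abundance in transcripts per million (TPM)
num_expected_SRs: expected number of short reads for the isoform
num_expected_LRs: expected number of long reads for the isoform
The result is a TSV file showing the abundance of each gene isoform, one isoform per line.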
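Isoform-level TPM values can be summed per gene when a gene-level summary is needed. The sketch below is illustrative: it embeds the sample hybrid rows shown above so that it is self-contained, and the helper name `gene_level_tpm` is hypothetical rather than part of miniQuant.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the hybrid example output above.
SAMPLE_HYBRID = (
    "Isoform\tGene\tEffective length\tTPM\tnum_expected_SRs\tnum_expected_LRs\n"
    "ENST00000373020.9\tENSG00000000003.15\t3535.9141630901286\t728571.217176296\t75.0428353691585\t153.72852682419847\n"
    "ENST00000494424.1\tENSG00000000003.15\t587.9141630901288\t0.08438441577205032\t8.691594824521182e-06\t1.7805111727902615e-05\n"
    "ENST00000496771.5\tENSG00000000003.15\t792.9141630901288\t97566.37817660523\t10.049336952190338\t20.5865057952637\n"
    "ENST00000612152.4\tENSG00000000003.15\t3563.9141630901286\t0.008307999564622307\t8.557239551560977e-07\t1.752987908135307e-06\n"
    "ENST00000614008.4\tENSG00000000003.15\t667.9141630901288\t173862.31195468327\t17.907818131332377\t36.68494782243817\n"
)


def gene_level_tpm(handle):
    """Sum isoform TPM values per gene from an Isoform_abundance.out stream."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["Gene"]] += float(row["TPM"])
    return dict(totals)


totals = gene_level_tpm(io.StringIO(SAMPLE_HYBRID))
for gene, tpm in totals.items():
    print(gene, round(tpm, 2))
```

For the five sample rows, which all belong to one gene, the total comes to approximately one million TPM.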

Calculate K-value by miniQuant

K-value is a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification, to which researchers should pay extra attention in their study. The K-value can be calculated from a gene isoform annotation in GTF format.

Example: calculate K-value given annotation in GTF format (e.g. `miniQuant/example/annotation.gtf`)

source miniQuant/base/bin/activate
python miniQuant/isoform_quantification/main.py cal_K_value \
-gtf miniQuant/example/annotation.gtf \
-t 1 \
-o miniQuant_kvalue

optional arguments:
  -t THREADS, --threads THREADS
                        Number of threads
  --sr_region_selection SR_REGION_SELECTION
                        SR region selection methods
                        [default:read_length][read_length,num_exons]

Results explanation

K-value for each gene
miniQuant_kvalue/kvalues.out

Gene	Chr	Num_isoforms	Kvalue
ENSG00000000003.15	chrX	5	14.263027941780145

Gene: gene ID
Chr: chromosome ID
Num_isoforms: number of isoforms in the gene
Kvalue: K value

*For a gene that consists only of short isoforms (i.e. all isoforms with length <150 bp), the K-value is not calculated and an NA value is reported instead.
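When post-processing `kvalues.out`, the NA entries need to be handled before any numeric comparison. The sketch below makes two assumptions that this README does not confirm: that the missing value is written literally as `NA`, and that ranking genes from largest to smallest K-value is a useful way to surface the genes that deserve extra attention. The second sample row and the helper names are hypothetical.

```python
import csv
import io

SAMPLE_K = (
    "Gene\tChr\tNum_isoforms\tKvalue\n"
    # Row copied from the example output above.
    "ENSG00000000003.15\tchrX\t5\t14.263027941780145\n"
    # Hypothetical row illustrating a gene with only short isoforms.
    "HYPOTHETICAL_GENE\tchr1\t2\tNA\n"
)


def read_kvalues(handle):
    """Parse a kvalues.out stream; NA K-values become None."""
    genes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row["Kvalue"].strip()
        # Assumption: the missing-value token is the literal string "NA".
        kvalue = None if raw in ("NA", "") else float(raw)
        genes.append({
            "Gene": row["Gene"],
            "Chr": row["Chr"],
            "Num_isoforms": int(row["Num_isoforms"]),
            "Kvalue": kvalue,
        })
    return genes


def rank_by_kvalue(genes):
    """Return genes with a numeric K-value, largest first."""
    scored = [g for g in genes if g["Kvalue"] is not None]
    return sorted(scored, key=lambda g: g["Kvalue"], reverse=True)


genes = read_kvalues(io.StringIO(SAMPLE_K))
ranked = rank_by_kvalue(genes)
skipped = [g["Gene"] for g in genes if g["Kvalue"] is None]
print("ranked:", [(g["Gene"], g["Kvalue"]) for g in ranked])
print("no K-value:", skipped)
```

For a real run, replace `io.StringIO(SAMPLE_K)` with `open("miniQuant_kvalue/kvalues.out", newline="")`.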
