miniQuant features:
- Optimal use of long and/or short RNA-seq reads: transcript abundance estimation for two data scenarios, long-read-alone and hybrid (long reads + short reads), with hybrid mode integrating the strengths of both technologies.
- Fast RNA-seq quantification: less than 15 minutes to analyze unaligned 40 million paired-end short reads + 5 million long reads on a standard laptop computer.
- Calculation of the novel K-value metric: the K-value captures a key feature of the sequence-sharing pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with error-prone quantification, which researchers should treat with extra care in their studies.
We recommend the newest version, which is faster and performs better. To reproduce the results in our Nature Biotechnology paper, however, download the older version, miniQuant v1.0. You can also run miniQuant online without installation.
Linux operating system (tested on Red Hat 8.8)
- Download the latest binary executable (`wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz`) and decompress it with `tar -zxvf miniQuant_linux_latest.tar.gz`. Then run `cd miniQuant_linux && chmod +x miniQuant`.
- (Optional. Only if you want to call `miniQuant` directly from the command line) `cp ./miniQuant /usr/local/bin; cp ./miniQuant ~/.local/bin`
- Run `./miniQuant`
- Download the latest binary executable (`wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz`) and decompress it with `tar -zxvf miniQuant_linux_latest.tar.gz`. Then run `cd miniQuant_linux && chmod +x miniQuant`.
- Run miniQuant in a Docker or Singularity container with one of the following commands:
docker run -i -t tidesun/miniquant:latest ./miniQuant
OR
singularity run docker://tidesun/miniquant:latest ./miniQuant
OR
singularity run https://miniquant.s3.us-east-2.amazonaws.com/miniQuant_latest.sif ./miniQuant
To compile from source, you need a C compiler and GNU make installed. Then run `make` in the `src` directory.
miniQuant provides two options for gene isoform quantification:
- Quantify using long-read data alone.
- Quantify using short-read and long-read data together in hybrid mode.
A toy dataset is provided in `example/`. Please follow the example commands below.
miniQuant requires reference transcript sequences in FASTA format (`-r`) and long-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (`-l`) as input.
Example: quantify using long-read data (`example/LR.fasta.gz`) with reference transcript sequences (e.g. `example/reference.fa`) and write the results to the `miniQuant_LR_alone_res` folder:
miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -t 1 -o miniQuant_LR_alone_res
Required arguments:
-r, --reference arg Reference sequence file in plain or gzipped
FASTA format
-l, --long_reads arg Input long reads file in plain or gzipped
FASTA/FASTQ format.(default: "")
Optional arguments:
-o arg, --output arg The path of output folder. (default: ./miniQuant_res/)
--long_reads_library_prep arg The library preparation for long reads.
Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio]
(default: cDNA-ONT)
-t arg, --threads arg Number of threads. Default is 1.
--mem arg Max RAM usage in GB allowed when aligning
the reads (default: 20.0)
The result is written in TSV format (`miniQuant_LR_alone_res/abundance.tsv`) and reports the abundance of each transcript, one transcript per line, with the following columns:
- `Transcript_id`: transcript ID provided in the reference sequences (`--reference`).
- `TPM`: relative transcript abundance in TPM (Transcripts Per Million).
- `Expected_num_long_reads`: expected count of long reads for the transcript, corresponding to the total number of input long reads.
| Transcript_id | TPM | Expected_num_long_reads |
|---|---|---|
| ENST00000379080.5 | 0 | 0 |
| ENST00000379081.5 | 30326.9 | 9.85623 |
| ENST00000379084.5 | 0 | 0 |
| ENST00000379087.5 | 0.000665636 | 0.000000216332 |
| ENST00000379089.5 | 0 | 0 |
| ENST00000651358.1 | 2.68181 | 0.00087159 |
| ENST00000445726.5 | 2.76325 | 0.000898056 |
| ENST00000297620.8 | 31447 | 10.2203 |
| ENST00000422409.5 | 0.0521039 | 0.0000169338 |
| ENST00000379078.1 | 12294.7 | 3.99577 |
| ENST00000294244.9 | 807593 | 262.468 |
| ENST00000540893.1 | 56604.9 | 18.3966 |
| ENST00000535820.1 | 61728.4 | 20.0617 |
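As a quick sanity check on a long-read-alone result, the sketch below parses an abundance table and computes the ratio `TPM / Expected_num_long_reads` for every expressed transcript. It assumes that TPM in this mode is proportional to the expected long-read count, which is consistent with the example rows above but is not confirmed as miniQuant's internal computation. The function names are illustrative, and the column names are assumed to match the table header shown above.

```python
import csv
import io


def read_abundance(handle):
    """Parse a tab-separated abundance table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def tpm_per_read(rows):
    """Return TPM / expected long-read count for each expressed transcript.

    If TPM is proportional to the expected count (an assumption), these
    ratios should be nearly identical across transcripts.
    """
    ratios = {}
    for row in rows:
        count = float(row["Expected_num_long_reads"])
        if count > 0:
            ratios[row["Transcript_id"]] = float(row["TPM"]) / count
    return ratios


# Three rows copied from the example table above (not the full file).
SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379081.5\t30326.9\t9.85623\n"
    "ENST00000297620.8\t31447\t10.2203\n"
    "ENST00000294244.9\t807593\t262.468\n"
)

rows = read_abundance(io.StringIO(SAMPLE))
ratios = tpm_per_read(rows)
for transcript_id, ratio in ratios.items():
    print(transcript_id, round(ratio, 2))

# Relative spread of the ratios; a small value supports proportionality.
spread = max(ratios.values()) / min(ratios.values()) - 1
print("relative spread:", spread)
```

To run the same check on a real result, replace the in-memory sample with `open("miniQuant_LR_alone_res/abundance.tsv")`.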
- Integrates short-read and long-read RNA-seq data from the same organism for better quantification performance.
- In hybrid mode, miniQuant requires reference transcript sequences in `FASTA` format (`-r`), long-read RNA-seq sequences in plain or gzipped `FASTA/FASTQ` format (`-l`), and paired-end short-read RNA-seq sequences in plain or gzipped `FASTA/FASTQ` format (`-1` and `-2`) as input.
Example: quantify using short reads (e.g. example/SR_R1.fasta.gz and example/SR_R2.fasta.gz) and long reads (e.g. example/LR.fasta.gz) with reference transcripts sequences (e.g. example/reference.fa), results in miniQuant_hybrid_res folder
miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -1 example/SR_R1.fasta.gz -2 example/SR_R2.fasta.gz -t 1 -o miniQuant_hybrid_res
Required arguments:
-r, --reference arg Reference sequence file in plain or gzipped
FASTA format
-l, --long_reads arg Input long reads file in plain or gzipped
FASTA/FASTQ format.(default: "")
-1, --short_reads_pair_1 arg Input short reads pair 1 in plain or
gzipped FASTA/FASTQ format. Leave blank if using
only long reads. (default: "")
-2, --short_reads_pair_2 arg Input short reads pair 2 in plain or
gzipped FASTA/FASTQ format. Leave blank if using
only long reads. (default: "")
Optional arguments:
-o arg, --output arg The path of output folder. (default: ./miniQuant_res/)
--long_reads_library_prep arg The library preparation for long reads. Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio] (default: cDNA-ONT)
--short_reads_strandness arg The strandness of short reads. Choices:[unstranded,fr-stranded,rf-stranded] (default: unstranded)
*fr-stranded: Strand specific reads, first
read forward
*rf-stranded: Strand specific reads, first
read reverse
-t arg, --threads arg Number of threads. Default is 1.
--mem arg Max RAM usage in GB allowed when aligning
the reads (default: 20.0)
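When miniQuant is run over many samples, it can help to assemble the command line programmatically. The following is a minimal sketch that builds the argument list for `miniQuant quant` from the options listed above and rejects values outside the documented choices. The helper name `build_quant_command` and its parameter names are hypothetical, and the check that `-1` and `-2` must be supplied together is an assumption based on the paired-end requirement.

```python
import shlex

# Choices copied from the option descriptions above.
LIBRARY_PREPS = ("cDNA-ONT", "dRNA-ONT", "cDNA-PacBio")
STRANDNESS = ("unstranded", "fr-stranded", "rf-stranded")


def build_quant_command(reference, long_reads, output,
                        short_reads_1=None, short_reads_2=None,
                        library_prep="cDNA-ONT", strandness="unstranded",
                        threads=1, mem_gb=20.0, executable="miniQuant"):
    """Return the argument list for `miniQuant quant` (hypothetical helper)."""
    if library_prep not in LIBRARY_PREPS:
        raise ValueError(f"unknown library prep: {library_prep}")
    if strandness not in STRANDNESS:
        raise ValueError(f"unknown strandness: {strandness}")
    # Assumption: hybrid mode needs both mates of the paired-end reads.
    if (short_reads_1 is None) != (short_reads_2 is None):
        raise ValueError("provide both short-read files or neither")

    cmd = [executable, "quant",
           "-r", reference,
           "-l", long_reads,
           "--long_reads_library_prep", library_prep,
           "-t", str(threads),
           "--mem", str(mem_gb),
           "-o", output]
    if short_reads_1 is not None:
        # Short reads present: add the hybrid-mode arguments.
        cmd += ["-1", short_reads_1, "-2", short_reads_2,
                "--short_reads_strandness", strandness]
    return cmd


cmd = build_quant_command(
    reference="example/reference.fa",
    long_reads="example/LR.fasta.gz",
    output="miniQuant_hybrid_res",
    short_reads_1="example/SR_R1.fasta.gz",
    short_reads_2="example/SR_R2.fasta.gz",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with unusual file names.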
The result is written in TSV format (`miniQuant_hybrid_res/abundance.tsv`) and reports the abundance of each transcript, one transcript per line, with the following columns:
- `Transcript_id`: transcript ID provided in the reference sequences (`--reference`).
- `TPM`: relative transcript abundance in TPM (Transcripts Per Million), calculated by integrating both short and long reads.
- `Expected_num_long_reads`: expected count of long reads for the transcript, calculated by integrating both short and long reads and corresponding to the total number of input long reads.
- `Expected_num_short_read_pairs`: expected count of short-read pairs for the transcript, corresponding to the total number of input short-read pairs.
- `Effective_length`: effective length of the transcript.
| Transcript_id | TPM | Expected_num_long_reads | Expected_num_short_read_pairs | Effective_length |
|---|---|---|---|---|
| ENST00000379080.5 | 0.000143216 | 0.0000000465451 | 0.00000118102 | 3357 |
| ENST00000379081.5 | 10983.9 | 3.56975 | 89.2826 | 3309 |
| ENST00000379084.5 | 0 | 0 | 0 | 659 |
| ENST00000379087.5 | 11371 | 3.69557 | 93.2673 | 3339 |
| ENST00000379089.5 | 283.145 | 0.0920222 | 2.35789 | 3390 |
| ENST00000651358.1 | 9.77862 | 0.00317805 | 0.081936 | 3411 |
| ENST00000445726.5 | 10.9382 | 0.00355493 | 0.0916257 | 3410 |
| ENST00000297620.8 | 27378.1 | 8.89788 | 225.57 | 3354 |
| ENST00000422409.5 | 4603.86 | 1.49626 | 6.37848 | 564 |
| ENST00000379078.1 | 11825 | 3.84313 | 15.5698 | 536 |
| ENST00000294244.9 | 841246 | 273.405 | 3554.4 | 1720 |
| ENST00000540893.1 | 44896.1 | 14.5912 | 35.2918 | 320 |
| ENST00000535820.1 | 47392.2 | 15.4025 | 57.6272 | 495 |
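Because both modes report TPM per transcript, the two result tables can be joined on the transcript ID to see where adding short reads changes the estimates most. The sketch below is one hedged way to do that. The helper names are illustrative, the column names are assumed to match the table headers shown above, and the sample rows are a small subset copied from the two example tables.

```python
import csv
import io


def load_tpm(handle):
    """Map transcript ID -> TPM from a tab-separated abundance table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Transcript_id"]: float(row["TPM"]) for row in reader}


def compare_tpm(long_read_tpm, hybrid_tpm):
    """List (id, long-read TPM, hybrid TPM, difference) for shared IDs.

    Rows are sorted by absolute difference, largest first.
    """
    shared = long_read_tpm.keys() & hybrid_tpm.keys()
    table = [
        (tid, long_read_tpm[tid], hybrid_tpm[tid],
         hybrid_tpm[tid] - long_read_tpm[tid])
        for tid in shared
    ]
    return sorted(table, key=lambda entry: abs(entry[3]), reverse=True)


# Subsets of the long-read-alone and hybrid example tables.
LONG_READ_SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379081.5\t30326.9\t9.85623\n"
    "ENST00000379087.5\t0.000665636\t0.000000216332\n"
    "ENST00000294244.9\t807593\t262.468\n"
)
HYBRID_SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\t"
    "Expected_num_short_read_pairs\tEffective_length\n"
    "ENST00000379081.5\t10983.9\t3.56975\t89.2826\t3309\n"
    "ENST00000379087.5\t11371\t3.69557\t93.2673\t3339\n"
    "ENST00000294244.9\t841246\t273.405\t3554.4\t1720\n"
)

comparison = compare_tpm(
    load_tpm(io.StringIO(LONG_READ_SAMPLE)),
    load_tpm(io.StringIO(HYBRID_SAMPLE)),
)
for tid, lr_tpm, hybrid_tpm, diff in comparison:
    print(f"{tid}\t{lr_tpm:.4g}\t{hybrid_tpm:.4g}\t{diff:+.4g}")
```

Sorting by absolute difference is a simple choice for a first look; a ratio or log fold change may be more informative for low-abundance transcripts.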
The K-value captures a key feature of the sequence-sharing pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with error-prone quantification, which researchers should treat with extra care in their studies. The K-value can be calculated from a gene isoform annotation in GTF/GFF3/genePred format.
Example: calculate K-value given annotation in GTF/GFF3/genePred format (e.g. example/annotation.gtf)
miniQuant kvalue -a example/annotation.gtf -o miniQuant_kvalue -t 1
Required arguments:
-a, --annotation arg Gene isoform annotation file in GTF, GFF or
genePred format
Optional arguments:
-o, --output arg The path of output folder (default: ./miniQuant_kvalue/)
-t, --threads arg Num of threads (default: 1)
--short_reads_mean_fragment_length arg
Mean value of short reads fragment lengths
(default: 235.0)
--not_normalize_entry Whether NOT normalize region-isoform matrix
(A matrix) before calculating K-value
--kvalue_entry_type arg What kind of entry to use for
region-isoform matrix (A matrix).
Choices:[effective_length,binary] (default:
effective_length)
The result is written in TSV format (`miniQuant_kvalue/kvalues.tsv`) and reports the K-value of each gene, one gene per line, with the following columns:
- `Gene_id`: gene ID.
- `K-value`: K-value. A larger K-value indicates higher quantification error.
| Gene_id | K-value |
|---|---|
| ENSG00000164970.15 | 331.422233 |
| ENSG00000168005.9 | 1.320074 |
*For a gene that consists only of short isoforms (i.e. all isoforms are shorter than `--short_reads_mean_fragment_length`), the K-value is not calculated and NA is reported instead.
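
To act on the K-values, one option is to rank genes and flag those above a cutoff while skipping genes with no K-value. The sketch below assumes the column names match the table header shown above and that missing values are written literally as `NA`. The helper names, the cutoff of 100, and the third sample row are hypothetical and serve only as illustration; the README does not recommend a specific threshold.

```python
import csv
import io


def load_kvalues(handle):
    """Map gene ID -> K-value, storing None where the value is NA."""
    kvalues = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row["K-value"].strip()
        # Assumption: missing K-values appear as the literal string "NA".
        kvalues[row["Gene_id"]] = None if raw in ("NA", "") else float(raw)
    return kvalues


def flag_high_kvalue(kvalues, threshold):
    """Return gene IDs with K-value above threshold, largest first."""
    flagged = [gene for gene, k in kvalues.items()
               if k is not None and k > threshold]
    return sorted(flagged, key=lambda gene: kvalues[gene], reverse=True)


# First two rows come from the example table; the last row is made up
# to show how an NA entry is handled.
SAMPLE = (
    "Gene_id\tK-value\n"
    "ENSG00000164970.15\t331.422233\n"
    "ENSG00000168005.9\t1.320074\n"
    "HYPOTHETICAL_SHORT_GENE\tNA\n"
)

kvalues = load_kvalues(io.StringIO(SAMPLE))
flagged = flag_high_kvalue(kvalues, threshold=100.0)  # hypothetical cutoff
print(flagged)
```

Transcripts of flagged genes can then be marked in the abundance table so that downstream analyses treat their estimates with more caution.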