Improving gene isoform quantification with miniQuant

miniQuant features:

Optimal use of long and/or short RNA-seq reads: transcript abundance estimation for different data scenarios, either long reads alone or hybrid (long reads + short reads), integrating the strengths of both technologies.
Fast RNA-seq quantification: less than 15 minutes to analyze 40 million unaligned paired-end short reads + 5 million long reads on a standard laptop computer.
Novel K-value metric: a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification, to which researchers should pay extra attention in their study.

Our newest version is recommended for its faster speed and better performance. However, to reproduce the results in our Nature Biotechnology paper, download the old version from miniQuant v1.0. Feel free to run miniQuant online without installation!

Dependency

Linux operating system (tested on Red Hat 8.8)

Installation

Download the latest binary executable (wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz) and decompress it with tar -zxvf miniQuant_linux_latest.tar.gz.
cd miniQuant_linux && chmod +x miniQuant
(Optional, only if you want to call miniQuant directly from the command line) cp ./miniQuant /usr/local/bin or cp ./miniQuant ~/.local/bin
Run ./miniQuant

If your operating system doesn't have GLIBC 2.28 or later

Download the latest binary executable (wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz) and decompress it with tar -zxvf miniQuant_linux_latest.tar.gz.
cd miniQuant_linux && chmod +x miniQuant
Run miniQuant in a Docker or Singularity container with one of the following commands:
docker run -i -t tidesun/miniquant:latest ./miniQuant

OR

singularity run docker://tidesun/miniquant:latest ./miniQuant

OR

singularity run https://miniquant.s3.us-east-2.amazonaws.com/miniQuant_latest.sif ./miniQuant

Build by source

To compile from source, you need a C compiler and GNU make installed. Then run make in the src directory.

Usage

miniQuant provides two options for gene isoform quantification:

quantify by long reads data alone.
quantify using short and long reads data in hybrid mode.
A toy dataset is provided in example/. Please follow the example commands below.

1. Quantify using long-read data alone

miniQuant requires reference transcript sequences in FASTA format (-r) and long-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (-l) as input.

Example: quantify using long-read data (`example/LR.fasta.gz`) with reference transcript sequences (e.g. `example/reference.fa`); results are written to the `miniQuant_LR_alone_res` folder

miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -t 1 -o miniQuant_LR_alone_res

Available parameters

Required arguments:
  -r, --reference arg           Reference sequence file in plain or gzipped
                                FASTA format
  -l, --long_reads arg          Input long reads file in plain or gzipped
                                FASTA/FASTQ format.(default: "")

Optional arguments:
  -o arg, --output arg          The path of output folder. (default: ./miniQuant_res/)
  --long_reads_library_prep arg The library preparation for long reads.
                                Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio]
                                (default: cDNA-ONT)
  -t arg, --threads arg         Number of threads. Default is 1.
  --mem arg                     Max RAM usage in GB allowed when aligning 
                                the reads (default: 20.0)

Results explanation

The result is written in TSV format (miniQuant_LR_alone_res/abundance.tsv) and shows the abundance of each transcript, one transcript per line, with the following columns:

Transcript_id: transcript ID provided in the reference sequences (--reference).
TPM: transcript relative abundance in TPM (Transcripts Per Million).
Expected_num_long_reads: expected count of long reads, corresponding to the total number of long reads in the input.

Example output:

Transcript_id	TPM	Expected_num_long_reads
ENST00000379080.5	0	0
ENST00000379081.5	30326.9	9.85623
ENST00000379084.5	0	0
ENST00000379087.5	0.000665636	0.000000216332
ENST00000379089.5	0	0
ENST00000651358.1	2.68181	0.00087159
ENST00000445726.5	2.76325	0.000898056
ENST00000297620.8	31447	10.2203
ENST00000422409.5	0.0521039	0.0000169338
ENST00000379078.1	12294.7	3.99577
ENST00000294244.9	807593	262.468
ENST00000540893.1	56604.9	18.3966
ENST00000535820.1	61728.4	20.0617
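
For downstream analysis, the abundance table can be read with any TSV parser. The following is a minimal Python sketch, assuming the header row shown above (`Transcript_id`, `TPM`, `Expected_num_long_reads`). The helper names `read_abundance` and `top_transcripts` are hypothetical and not part of miniQuant.

```python
import csv
import io


def read_abundance(handle):
    """Parse a miniQuant abundance TSV into a list of dicts with float values."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {"Transcript_id": row["Transcript_id"]}
        for key, value in row.items():
            if key != "Transcript_id":
                parsed[key] = float(value)
        rows.append(parsed)
    return rows


def top_transcripts(rows, n=3):
    """Return the n transcripts with the highest TPM (hypothetical helper)."""
    return sorted(rows, key=lambda r: r["TPM"], reverse=True)[:n]


# A few lines copied from the example output above.
SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379081.5\t30326.9\t9.85623\n"
    "ENST00000379084.5\t0\t0\n"
    "ENST00000294244.9\t807593\t262.468\n"
)

for r in top_transcripts(read_abundance(io.StringIO(SAMPLE)), n=2):
    print(r["Transcript_id"], r["TPM"])
```

To read a real result, replace `io.StringIO(SAMPLE)` with an opened file such as `open("miniQuant_LR_alone_res/abundance.tsv", newline="")`.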

2. Quantify using short-read and long-read data in hybrid mode

Hybrid mode integrates short-read and long-read RNA-seq data from the same organism for better quantification performance.
In hybrid mode, miniQuant requires reference transcript sequences in FASTA format (-r), long-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (-l), and paired-end short-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (-1 and -2) as input.

Example: quantify using short reads (e.g. `example/SR_R1.fasta.gz` and `example/SR_R2.fasta.gz`) and long reads (e.g. `example/LR.fasta.gz`) with reference transcript sequences (e.g. `example/reference.fa`); results are written to the `miniQuant_hybrid_res` folder

miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -1 example/SR_R1.fasta.gz -2 example/SR_R2.fasta.gz -t 1 -o miniQuant_hybrid_res
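
When running many samples, it can help to assemble the command programmatically. The sketch below builds the argument list for either mode using only the flags shown in this README (`-r`, `-l`, `-1`, `-2`, `-t`, `-o`). The function `build_quant_command` is a hypothetical helper, not part of miniQuant, and it assumes the `miniQuant` executable is on your PATH.

```python
import shlex


def build_quant_command(reference, long_reads, short_r1=None, short_r2=None,
                        threads=1, output="miniQuant_res"):
    """Assemble a `miniQuant quant` argument list (hypothetical helper)."""
    # Hybrid mode needs both mates; long-read-alone mode needs neither.
    if (short_r1 is None) != (short_r2 is None):
        raise ValueError("provide both -1 and -2 for hybrid mode, or neither")
    cmd = ["miniQuant", "quant", "-r", reference, "-l", long_reads]
    if short_r1 is not None:
        cmd += ["-1", short_r1, "-2", short_r2]
    cmd += ["-t", str(threads), "-o", output]
    return cmd


hybrid = build_quant_command(
    "example/reference.fa", "example/LR.fasta.gz",
    short_r1="example/SR_R1.fasta.gz", short_r2="example/SR_R2.fasta.gz",
    output="miniQuant_hybrid_res",
)
# Print the command as a single shell-safe string for inspection.
print(shlex.join(hybrid))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.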

Available parameters

Required arguments:
  -r, --reference arg           Reference sequence file in plain or gzipped
                                FASTA format
  -l, --long_reads arg          Input long reads file in plain or gzipped
                                FASTA/FASTQ format.(default: "")
  -1, --short_reads_pair_1 arg  Input short reads pair 1 in plain or
                                gzipped FASTA/FASTQ format. Leave blank if using
                                only long reads. (default: "")
  -2, --short_reads_pair_2 arg  Input short reads pair 2 in plain or
                                gzipped FASTA/FASTQ format. Leave blank if using
                                only long reads. (default: "")

Optional arguments:
  -o arg, --output arg          The path of output folder. (default: ./miniQuant_res/)
  --long_reads_library_prep arg The library preparation for long reads.
                                Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio]
                                (default: cDNA-ONT)
  --short_reads_strandness arg  The strandness of short reads.
                                Choices:[unstranded,fr-stranded,rf-stranded]
                                (default: unstranded)
                                *fr-stranded: Strand specific reads, first
                                read forward
                                *rf-stranded: Strand specific reads, first
                                read reverse
  -t arg, --threads arg         Number of threads. Default is 1.
  --mem arg                     Max RAM usage in GB allowed when aligning 
                                the reads (default: 20.0)

Results explanation

The result is written in TSV format (miniQuant_hybrid_res/abundance.tsv) and shows the abundance of each transcript, one transcript per line, with the following columns:

Transcript_id: transcript ID provided in the reference sequences (--reference).
TPM: transcript relative abundance in TPM (Transcripts Per Million). It is calculated by integrating both short and long reads.
Expected_num_long_reads: expected count of long reads. It is calculated by integrating both short and long reads, corresponding to the total number of long reads in the input.
Expected_num_short_read_pairs: expected count of short read pairs, corresponding to the total number of short read pairs in the input.
Effective_length: effective length of each transcript.

Example output:

Transcript_id	TPM	Expected_num_long_reads	Expected_num_short_read_pairs	Effective_length
ENST00000379080.5	0.000143216	0.0000000465451	0.00000118102	3357
ENST00000379081.5	10983.9	3.56975	89.2826	3309
ENST00000379084.5	0	0	0	659
ENST00000379087.5	11371	3.69557	93.2673	3339
ENST00000379089.5	283.145	0.0920222	2.35789	3390
ENST00000651358.1	9.77862	0.00317805	0.081936	3411
ENST00000445726.5	10.9382	0.00355493	0.0916257	3410
ENST00000297620.8	27378.1	8.89788	225.57	3354
ENST00000422409.5	4603.86	1.49626	6.37848	564
ENST00000379078.1	11825	3.84313	15.5698	536
ENST00000294244.9	841246	273.405	3554.4	1720
ENST00000540893.1	44896.1	14.5912	35.2918	320
ENST00000535820.1	47392.2	15.4025	57.6272	495
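
Because both modes report TPM keyed by `Transcript_id`, results from a long-read-alone run and a hybrid run can be compared by joining on that column. The sketch below illustrates this with two rows copied from the example outputs in this README. The function names `read_tpm` and `compare_tpm` are hypothetical and not part of miniQuant.

```python
import csv
import io


def read_tpm(handle):
    """Map Transcript_id -> TPM from a miniQuant abundance TSV."""
    return {row["Transcript_id"]: float(row["TPM"])
            for row in csv.DictReader(handle, delimiter="\t")}


def compare_tpm(lr_alone, hybrid):
    """Yield (transcript, LR-alone TPM, hybrid TPM) for shared transcripts."""
    for tid in sorted(lr_alone.keys() & hybrid.keys()):
        yield tid, lr_alone[tid], hybrid[tid]


# Rows copied from the long-read-alone example output.
LR_SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379087.5\t0.000665636\t0.000000216332\n"
    "ENST00000294244.9\t807593\t262.468\n"
)
# Rows copied from the hybrid example output.
HYBRID_SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\t"
    "Expected_num_short_read_pairs\tEffective_length\n"
    "ENST00000379087.5\t11371\t3.69557\t93.2673\t3339\n"
    "ENST00000294244.9\t841246\t273.405\t3554.4\t1720\n"
)

lr = read_tpm(io.StringIO(LR_SAMPLE))
hy = read_tpm(io.StringIO(HYBRID_SAMPLE))
for tid, tpm_lr, tpm_hy in compare_tpm(lr, hy):
    print(f"{tid}\t{tpm_lr}\t{tpm_hy}")
```

Transcripts whose estimates differ sharply between the two modes, such as ENST00000379087.5 in this toy example, may be worth inspecting alongside the K-value of their gene.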

Calculate K-value with miniQuant

K-value is a key feature of the sequence share pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms with erroneous quantification, to which researchers should pay extra attention in their study. K-value can be calculated from a gene isoform annotation in GTF/GFF3/genePred format.

Example: calculate K-value given an annotation in GTF/GFF3/genePred format (e.g. `example/annotation.gtf`)

miniQuant kvalue -a example/annotation.gtf -o miniQuant_kvalue -t 1

Available parameters

Required arguments:
  -a, --annotation arg  Gene isoform annotation file in GTF, GFF or 
                        genePred format

 Optional arguments:
  -o, --output arg      The path of output folder (default: ./miniQuant_kvalue/)
  -t, --threads arg             Num of threads (default: 1)
      --short_reads_mean_fragment_length arg
                                Mean value of short reads fragment lengths 
                                (default: 235.0)
      --not_normalize_entry     Whether NOT normalize region-isoform matrix
                                (A matrix) before calculating K-value
      --kvalue_entry_type arg   What kind of entry to use for
                                region-isoform matrix (A matrix).
                                Choices:[effective_length,binary] (default:
                                effective_length)

Results explanation

The result is written in TSV format (miniQuant_kvalue/kvalues.tsv) and shows the K-value of each gene, one gene per line, with the following columns:

Gene_id: gene ID.
K-value: K-value of the gene. A larger K-value indicates higher quantification error.

Example output:

Gene_id	K-value
ENSG00000164970.15	331.422233
ENSG00000168005.9	1.320074

*For a gene that consists only of short isoforms (i.e. all isoforms are shorter than --short_reads_mean_fragment_length), the K-value is not calculated and an NA value is reported.
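
To act on the K-values, one option is to flag genes above a cutoff and treat their isoform abundances with extra caution. The sketch below is illustrative only: the cutoff of 10 is an arbitrary assumption rather than a recommendation from miniQuant, the exact missing-value string is assumed to be `NA`, the third sample row is hypothetical, and the function names are not part of miniQuant.

```python
import csv
import io
import math


def read_kvalues(handle):
    """Map Gene_id -> K-value; entries reported as NA become NaN."""
    kvalues = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row["K-value"]
        # Assumption: missing K-values are written literally as "NA".
        kvalues[row["Gene_id"]] = float("nan") if raw == "NA" else float(raw)
    return kvalues


def flag_high_k(kvalues, threshold):
    """Return sorted gene IDs whose K-value exceeds the threshold."""
    return sorted(g for g, k in kvalues.items()
                  if not math.isnan(k) and k > threshold)


SAMPLE = (
    "Gene_id\tK-value\n"
    "ENSG00000164970.15\t331.422233\n"   # from the example output above
    "ENSG00000168005.9\t1.320074\n"      # from the example output above
    "GENE_SHORT_ONLY\tNA\n"              # hypothetical gene with only short isoforms
)

kvalues = read_kvalues(io.StringIO(SAMPLE))
# The threshold 10.0 is an arbitrary illustrative cutoff.
print(flag_high_k(kvalues, threshold=10.0))
```

Genes with an NA K-value are skipped rather than flagged, since no error estimate is available for them.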
