EpiScan: High-Throughput mapping of antibody-specific epitopes on viral proteins using sequence information
- cuda >= 9.0
- cudnn >= 7.0
- python=3.7
- pytorch=1.11
- numpy >= 1.19.1
- scikit-learn >= 0.23.2
- biopython
- h5py
- matplotlib
- tqdm
#***********************
DB1: 162 antibody/antigen complexes
DB2: 30 antibody/antigen complexes
Antigen samples (.pickle): available in ‘./dataProcess/’.
Antibody samples (.h5): to be published on Zenodo later; they can also be generated with the language model embedding (see [Models]).
The antigen sample data format is as follows:
list of dictionaries with complex_code(str)
Each dictionary has the following keys:
Indices [0-2] represent the 3D coordinates of the amino acids of the antigenic molecules.
Indices [3-22] represent a one-hot encoding of the amino acid types.
Indices [23-43] represent a conservation profile for that position across a set of homologous proteins obtained from PSI-BLAST (the first column is a bias column with a value of zero).
Indices [44-63] represent a local amino acid profile that indicates the frequency of each amino acid type within 8 Å of the residue.
Indices [64-65] represent the absolute and relative solvent accessible surface area of the residue, calculated by STRIDE.
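The index layout above can be sketched as a set of column slices. This is a minimal illustration only: it assumes the per-residue features are available as a NumPy array of shape (n_residues, 66), which the description above does not state explicitly, and the names `FEATURE_SLICES`, `split_features`, and `drop_one_hot` are hypothetical helpers rather than part of EpiScan.

```python
import numpy as np

# Column ranges taken from the index description above (end-exclusive slices).
FEATURE_SLICES = {
    "coords": slice(0, 3),            # indices 0-2
    "one_hot": slice(3, 23),          # indices 3-22
    "conservation": slice(23, 44),    # indices 23-43 (bias column + 20 columns)
    "local_profile": slice(44, 64),   # indices 44-63
    "accessibility": slice(64, 66),   # indices 64-65
}


def split_features(features):
    """Split an (n_residues, 66) array into named blocks (assumed layout)."""
    features = np.asarray(features)
    if features.ndim != 2 or features.shape[1] != 66:
        raise ValueError(f"expected shape (n_residues, 66), got {features.shape}")
    return {name: features[:, cols] for name, cols in FEATURE_SLICES.items()}


def drop_one_hot(features):
    """Remove the one-hot block (indices 3-22), leaving 46 columns."""
    features = np.asarray(features)
    return np.delete(features, np.arange(3, 23), axis=1)


example = np.zeros((5, 66))  # hypothetical antigen with 5 residues
blocks = split_features(example)
reduced = drop_one_hot(example)
print({name: block.shape for name, block in blocks.items()}, reduced.shape)
```

Dropping the 20 one-hot columns from 66 leaves 46, which matches the feature count this README reports as used in practice.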
#***********************
#***********************
## Feature Extraction Details
The antigen features consist of several components that can be obtained as follows:
### 1. Conservation Profile (Indices [23-43])
- Generated using PSI-BLAST
- Command: `psiblast -query input.fasta -db nr -num_iterations 3 -out_ascii_pssm output.pssm`
- The PSSM (Position-Specific Scoring Matrix) values are extracted from the output file
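Extracting the PSSM values could look roughly like the sketch below. It assumes the usual ASCII PSSM layout, in which each data row starts with the position and residue letter followed by 20 log-odds scores; whitespace splitting can fail if two scores run together in the file. The helper names and the prepended zero column (to reach the 21 values of indices [23-43]) are assumptions, not the authors' confirmed procedure.

```python
import numpy as np


def parse_ascii_pssm(lines):
    """Return (sequence, scores) from the lines of an ASCII PSSM file.

    scores has shape (sequence_length, 20) and holds the log-odds columns.
    """
    residues, rows = [], []
    for line in lines:
        fields = line.split()
        # Data rows start with the 1-based position, then the residue letter.
        if len(fields) >= 22 and fields[0].isdigit() and len(fields[1]) == 1:
            residues.append(fields[1])
            rows.append([int(v) for v in fields[2:22]])
    return "".join(residues), np.array(rows, dtype=float).reshape(-1, 20)


def add_bias_column(scores):
    """Prepend a zero column, giving 21 values per residue (assumed layout)."""
    return np.hstack([np.zeros((scores.shape[0], 1)), scores])


# Usage with the file name from the command above:
# with open("output.pssm") as handle:
#     sequence, scores = parse_ascii_pssm(handle)
# profile = add_bias_column(scores)
```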
### 2. Local Amino Acid Profile (Indices [44-63])
- Calculated using structural information
- Tools needed: BioPython or similar structural analysis packages
- Process:
1. Load PDB structure
2. Calculate distances between residues
3. Count amino acid frequencies within 8Å radius to get frequency profile
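The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions: one representative coordinate per residue (for example the Cα atom), the residue itself excluded from its own neighbourhood, counts normalised to frequencies, and an alphabetical one-letter column order. None of these choices is confirmed by this README, and `local_profile` is a hypothetical helper name.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order for the 20 types


def local_profile(coords, sequence, radius=8.0):
    """Frequency of each amino acid type within `radius` angstroms of each residue.

    coords: (n_residues, 3) array with one representative point per residue.
    sequence: string of one-letter codes, same length as coords.
    """
    coords = np.asarray(coords, dtype=float)
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    types = np.array([AMINO_ACIDS.index(aa) for aa in sequence], dtype=int)

    # Pairwise Euclidean distances, shape (n_residues, n_residues).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Neighbours within the radius, excluding the residue itself (assumption).
    neighbours = (dist <= radius) & ~np.eye(len(coords), dtype=bool)

    one_hot = np.eye(len(AMINO_ACIDS))[types]      # (n_residues, 20)
    counts = neighbours.astype(float) @ one_hot    # neighbour counts per type
    totals = counts.sum(axis=1, keepdims=True)
    # Normalise counts to frequencies; residues with no neighbours keep zeros.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```

The pairwise distance matrix keeps the code short but uses memory quadratic in the number of residues; a neighbour search such as BioPython's `NeighborSearch` would scale better for large structures.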
### 3. Solvent Accessibility (Indices [64-65])
- Calculated using STRIDE
- Installation: Download STRIDE from http://webclu.bio.wzw.tum.de/stride/
- Command: `stride input.pdb > output.stride`
- Extract both absolute and relative accessibility values
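A rough sketch of reading the accessibility values from the STRIDE output is shown below. It assumes the per-residue `ASR` record layout, with the solvent accessible area as the tenth whitespace-separated field, and it assumes that relative accessibility is obtained by dividing by a per-residue maximum supplied by the caller. Both the `parse_stride_asa` helper and the `max_asa` table are hypothetical; this README does not say which reference values were used.

```python
def parse_stride_asa(lines, max_asa=None):
    """Collect per-residue accessibility from STRIDE `ASR` records.

    Returns a list of (chain, residue_number, residue_name, absolute, relative).
    `relative` is None unless `max_asa` has an entry for the residue name.
    """
    records = []
    for line in lines:
        fields = line.split()
        # Assumed ASR layout: ASR, name, chain, PDB number, ordinal number,
        # one-letter state, state name, phi, psi, area, PDB code.
        if len(fields) < 10 or fields[0] != "ASR":
            continue
        name, chain, number = fields[1], fields[2], fields[3]
        absolute = float(fields[9])
        reference = (max_asa or {}).get(name)
        relative = absolute / reference if reference else None
        records.append((chain, number, name, absolute, relative))
    return records


# Usage with the file name from the command above:
# with open("output.stride") as handle:
#     records = parse_stride_asa(handle, max_asa=my_reference_table)
```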
### 4. 3D Coordinates (Indices [0-2])
- Directly extracted from PDB files
- Can be obtained using BioPython:
```python
from Bio import PDB

# Parse the structure file ('ID' is an arbitrary label for the structure)
parser = PDB.PDBParser()
structure = parser.get_structure('ID', 'protein.pdb')
# One (x, y, z) array per atom in the structure
coordinates = [atom.get_coord() for atom in structure.get_atoms()]
```
Note: While con_pdb_dict_AgAb.pickle encodes all 66 feature dimensions described above, in practice the one-hot encoding was removed and only the remaining 46 dimensions were used.
#***********************
The trained model is available at ./trained_model/Seq_final.pth.
Generate the .h5 file (language model embedding):
#***********************
python ./EpiScan/commands/embed.py --seqs ./dataProcess/public/DB1.fasta --outfile ./dataProcess/public/DB1.h5 --device 0
#***********************Training script:
#***********************
python ./EpiScan/commands/train_sep-auc.py --train ./dataProcess/public/public_sep_trainAg.tsv --test ./dataProcess/public/public_sep_valAg.tsv --embedding ./dataProcess/public/DB1.h5 --lr 1e-4 --save-prefix ./save_model/2023 --no-augment --device 0 --num-epochs 250 --batch-size 15
#***********************Prediction script:
#***********************
# Use the 'epimapping.py' script to execute inference with the following command. This script will output predictions for antibody-specific epitopes on the antigen sequences.
python ./EpiScan/commands/epimapping.py --test INPUT_DATA_FILE --embedding EMBEDDING_DATA_FILE
# For INPUT_DATA_FILE: this file should list the identifier for each antigen-antibody pair along with the sequence of the antigen that you want to test. It should be formatted as a tab-separated values (TSV) file with each line representing one pair.
# For EMBEDDING_DATA_FILE: this file contains the language model embeddings for the antigen sequences in HDF5 (.h5) format. You should ensure that you have already generated the embeddings for the antigens you wish to analyze. The file path provided should point to this HDF5 file that was generated using the 'embed.py' script.
# For instance, with 'public_sep_testAg.tsv' as the input file containing the antigen sequences and 'DB1.h5' as the embeddings file, the command would be:
python ./EpiScan/commands/epimapping.py --test ../dataProcess/public/public_sep_testAg.tsv --embedding ../dataProcess/public/DB1.h5
# Before running, change the paths to the absolute path of your EpiScan directory. If you encounter any issues, refer to the demo_test.py script.
#***********************
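When preparing a custom INPUT_DATA_FILE, a quick sanity check is to confirm that it is tab-separated and uses the same number of columns as the shipped example file. The sketch below does only that; it does not assume what each column contains, and `tsv_column_counts`, `same_layout`, and the file name `my_pairs.tsv` are hypothetical.

```python
import csv


def tsv_column_counts(path):
    """Return the set of column counts found across non-empty rows of a TSV file."""
    with open(path, newline="") as handle:
        return {len(row) for row in csv.reader(handle, delimiter="\t") if row}


def same_layout(reference_path, candidate_path):
    """True when both files use one consistent column count and it matches."""
    reference = tsv_column_counts(reference_path)
    candidate = tsv_column_counts(candidate_path)
    return len(reference) == 1 and reference == candidate


# Usage (the first path is the example file from the command above):
# same_layout("../dataProcess/public/public_sep_testAg.tsv", "my_pairs.tsv")
```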