Research Article

Pepfun 2.0: Improved Protocols for the Analysis of Natural and Modified Peptides

Article: FDD82 | Received 03 Apr 2023, Accepted 09 Aug 2023, Published online: 24 Aug 2023

Abstract

Aim: Peptides are now relevant in fields such as drug discovery and biotechnology, and computational analyses are required to study their properties and gain insights into rational design strategies. Materials & methods: PepFun 2.0 is a new version of the Python package for the study of peptides, with a set of modules to analyze the sequence and structure of the molecules. Both natural and modified peptides containing non-natural amino acids can be studied with the provided functionalities. Results: PepFun 2.0 comprises five main modules for tasks such as sequence alignments, prediction of properties, generation of conformers, detection of interactions and extra functions to include peptides containing non-natural amino acids. Conclusion: The code, tutorial and specific examples are open source and available at: https://github.com/rochoa85/PepFun2.

Plain language summary

Peptides are now relevant in fields such as drug discovery and biotechnology. Computational analyses are required to study their properties and gain insights into rational design strategies. PepFun 2.0 is a new version of a Python package for the study of natural and modified peptides, with a set of modules to analyze the sequence and structure of the molecules. PepFun 2.0 comprises five main modules for different tasks such as sequence alignments, prediction of properties, generation of conformers, modification of structures, detection of protein–peptide interactions and extra functions to include peptides containing non-natural amino acids.

Motivated by the functional design of peptides as therapeutic agents, several computational packages are currently available to optimize different peptide properties during pre-clinical studies [Citation1–3]. For example, using tools to run predictions with amino acid scales is standard in analyzing natural peptides [Citation4,Citation5]. Additional tools use the peptides’ chemical structures to predict conformers and other properties [Citation6,Citation7]. Given the peptides’ hybrid nature, these methods usually combine approaches designed for proteins or small molecules that are not necessarily peptide-oriented. However, packages like PepFun address this problem through open bioinformatics and cheminformatics protocols to run the analysis with natural peptides in Python [Citation8]. PepFun can help study different physicochemical properties of peptides and guide, for example, the generation of analog libraries with potentially similar activities [Citation9,Citation10].

The original PepFun package is a single command line interface to execute peptide analysis tasks, emphasizing sequence and structure functionalities and methods to build peptide libraries based on patterns. However, a modular software architecture and additional functions are required for a broad implementation to study natural and modified peptide sequences composed of non-natural amino acids [Citation11]. Nowadays, including subtle modifications in the peptide sequences is crucial to increase the peptide’s structural and metabolic stability, among other applications [Citation12,Citation13]. For example, automatic pipelines that predict conformers, analyze interactions and replace residues with non-natural monomers are commonly used in both industrial and academic environments [Citation14].

Here PepFun 2.0 is described, a new optimized version of PepFun with functionalities required for researchers designing and studying new natural and modified peptides. The methods have been adapted to work with non-cyclic peptides that can form secondary structure motifs. Details of the package’s five modules, technical assumptions, and best practices are shared. A complete tutorial on implementing the code is available in the repository’s README file at: https://github.com/rochoa85/PepFun2.

Methods

PepFun 2.0 is written in Python 3, and the code is split into five modules with various functionalities. The package relies on external dependencies, including main packages such as RDKit (https://rdkit.org/), Biopython (https://biopython.org/) [Citation15] and Modeller [Citation16]. Installing them in a Conda environment through the official package channels is recommended. Additional complementary packages and PepFun 2.0 can be installed using a provided setup file for a quick installation. A set of tests is provided to verify the correct installation. If the user runs functions based on Modeller, an academic or commercial license should be requested from the Modeller authors (https://salilab.org/modeller/).

The package contains the scripts sequence, conformer, modifications, interactions and extra, which include multiple classes to run analyses such as aligning natural and modified peptide sequences, calculating amino acid and structure-based physicochemical properties, predicting conformers with and without secondary structure restrictions, extracting interaction patterns from protein-peptide complexes, and modifying existing peptide structures by incorporating new residues or adding cap groups. In addition, a method to generate descriptors for modified peptides is included for training machine learning models. The tasks are summarized in Figure 1, with specific details in the next section.

Figure 1. PepFun 2.0 main functionalities.

The package is split into five modules: sequence, conformer, interactions, modifications and extra. Information on the required inputs and the main functionalities is described using a color-coded scheme. Each function can be called from any Python script after following the installation instructions.

AA: Amino acid; NNAA: Non-natural amino acid.

Results & discussion

PepFun 2.0 can be called from any Python script after installing the package. The following are the main scripts and classes.

Sequence module

The input is a natural peptide sequence in FASTA format. This module contains different methods based on amino acid scales, such as the Eisenberg hydrophobicity scale [Citation17], to calculate properties. One method uses a list of pKa values to calculate the peptide’s net charge at a given pH. Other methods are functions from the ProtParam package that calculate metrics such as aromaticity, instability index and statistics of the amino acid composition [Citation18]. These methods require that the peptide contain only natural amino acids.
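As a rough illustration of a pH-dependent net charge calculation, the sketch below applies the Henderson-Hasselbalch relation to each ionizable group of a linear peptide with free termini. The pKa values are approximate, textbook-style numbers chosen for this example, and the function name `net_charge` is hypothetical; neither is taken from the PepFun 2.0 code.

```python
# Approximate pKa values for illustration only (assumption, not PepFun's table).
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide with free termini."""
    positive = ["Nterm"] + [aa for aa in sequence if aa in PKA_POSITIVE]
    negative = ["Cterm"] + [aa for aa in sequence if aa in PKA_NEGATIVE]
    charge = 0.0
    for group in positive:
        # Fraction of the basic group that is protonated (charge +1).
        charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    for group in negative:
        # Fraction of the acidic group that is deprotonated (charge -1).
        charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


if __name__ == "__main__":
    for ph in (2.0, 7.0, 12.0):
        print(ph, round(net_charge("SDVAFRGNLLD", ph), 2))
```

The estimate treats every group independently, so it ignores any shift in pKa caused by neighboring residues.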

In addition, the module generates the SMILES representation of the molecule to be subsequently used in packages like RDKit, allowing the calculation of the peptide molecular weight, lipophilicity by the Crippen LogP [Citation19] and the number of hydrogen bond acceptors and donors. The inclusion of RDKit allows similarity comparisons between pairs of peptides by matching their molecular fingerprints with the Tanimoto coefficient [Citation20]. All the RDKit-based calculations rely on the 2D molecular information of the peptides without considering the conformational flexibility. Pairs of sequences can also be aligned using a classical position-by-position comparison, which can be weighted using scoring matrices available in the Biopython package. A wrapper function to run online BLAST searches with parameters adjusted for peptide sequences is also available [Citation21].
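The two comparison ideas in this paragraph can be sketched in plain Python. The example below computes a Tanimoto coefficient from two sets of "on" fingerprint bits and an unweighted position-by-position identity between equal-length sequences. The function names and the toy bit sets are hypothetical; in practice the bit sets would come from a fingerprint generator such as the ones in RDKit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def positional_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of positions with identical residues (no gaps, no weights)."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


if __name__ == "__main__":
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
    print(positional_identity("SDVAFRGNLLD", "SDVAYRGNLLD"))
```

A weighted variant would replace the `a == b` test with a lookup in a substitution matrix.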

Finally, the module contains a set of empirical rules to account for solubility and synthesis issues associated with peptides. The rules describe violations by specific patterns or amino acids in the peptide sequence [Citation22]. The higher the number of violations, the lower the probability of validating the peptides experimentally. The solubility rules violations are:

  • Warning if the fraction of charged and/or hydrophobic amino acids exceeds 45%.

  • Warning if the absolute value of the total peptide charge at pH 7 is greater than 1.

  • Warning if the number of glycines or prolines is more than one in the sequence.

  • Warning if the first or the last amino acid is charged.

  • Warning if any amino acid represents more than 25% of the total sequence.
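The five solubility rules above could be checked along the following lines. This is a minimal sketch under stated assumptions, not the PepFun 2.0 implementation: the residue groupings, the integer approximation of the charge at pH 7, the reading of the glycine/proline rule as a combined count, and the function name `solubility_violations` are all choices made for this example.

```python
from collections import Counter

CHARGED = set("DEKR")         # assumption: histidine treated as neutral at pH 7
HYDROPHOBIC = set("AVILMFW")  # assumption: one common hydrophobic grouping


def solubility_violations(seq: str) -> list:
    """Return the names of the solubility rules that the sequence violates."""
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    counts = Counter(seq)
    charged = sum(counts[aa] for aa in CHARGED)
    hydrophobic = sum(counts[aa] for aa in HYDROPHOBIC)
    # Integer approximation of the total charge at pH 7 (assumption).
    charge = counts["K"] + counts["R"] - counts["D"] - counts["E"]

    violations = []
    if charged / n > 0.45 or hydrophobic / n > 0.45:
        violations.append("charged_or_hydrophobic_fraction")
    if abs(charge) > 1:
        violations.append("total_charge")
    if counts["G"] + counts["P"] > 1:  # combined count (assumption)
        violations.append("glycine_proline_count")
    if seq[0] in CHARGED or seq[-1] in CHARGED:
        violations.append("charged_terminus")
    if max(counts.values()) / n > 0.25:
        violations.append("dominant_residue")
    return violations


if __name__ == "__main__":
    print(solubility_violations("GGKKKK"))
```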

The synthesis rules violations are:

  • Warning if two prolines are consecutive.

  • Warning if the motifs DG (aspartic acid and glycine) and DP (aspartic acid and proline) are present in the sequence. Two rules, one per motif.

  • Warning if the sequences end with asparagine (N) or glutamine (Q) residues.

  • Warning if there are charged residues every five amino acids.

  • Warning if there are oxidation-sensitive amino acids like methionine (M), cysteine (C) or tryptophan (W). Three rules, one per amino acid.
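Most of the synthesis rules above are simple pattern checks, as the sketch below illustrates. The function name `synthesis_violations` is hypothetical and the code is not taken from PepFun 2.0. The rule on charged residues every five amino acids is left out because its exact definition is not given in the text, and "end with" is read here as the C-terminal residue.

```python
def synthesis_violations(seq: str) -> list:
    """Return the names of the synthesis rules that the sequence violates."""
    checks = [
        ("consecutive_prolines", "PP" in seq),
        ("DG_motif", "DG" in seq),
        ("DP_motif", "DP" in seq),
        ("terminal_N_or_Q", seq.endswith(("N", "Q"))),
        ("methionine", "M" in seq),
        ("cysteine", "C" in seq),
        ("tryptophan", "W" in seq),
    ]
    # Keep only the rules whose condition is met.
    return [name for name, violated in checks if violated]


if __name__ == "__main__":
    print(synthesis_violations("MPPDGN"))
    print(synthesis_violations("AKAFIAWLVRG"))
```

Counting the returned names gives a simple score that can be used to rank or filter candidates.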

Overall, the rules allow the filtering of candidates during screening stages, which is crucial to reduce the number of false positives for the prediction of hits.

Conformer module

The peptide sequence in FASTA format is also required to predict the conformers. For this purpose, two methods are available. One uses RDKit as the engine to generate the structures. Specifically, the sequence is converted to HELM notation and parsed to annotate the amino acids with correct atom names based on the IUPAC nomenclature [Citation23]. Then, the ETKDGv3 method generates the most probable conformer per peptide [Citation24]. The method combines distance geometry with knowledge-based terms, has been optimized for macrocycles, and can be useful in the case of small extended peptides with fewer than 10–15 amino acids [Citation25].
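The first step of this protocol, converting a natural sequence to HELM notation, can be illustrated without any dependency. The sketch below builds the simple linear-peptide form of a HELM string; the function name `fasta_to_helm` is hypothetical and the code is not the PepFun 2.0 implementation.

```python
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")


def fasta_to_helm(sequence: str) -> str:
    """Build a HELM string for a linear peptide of natural amino acids."""
    sequence = sequence.strip().upper()
    unknown = sorted(set(sequence) - NATURAL)
    if not sequence or unknown:
        raise ValueError(f"unsupported sequence (unknown residues: {unknown})")
    # Monomers are separated by dots inside a single PEPTIDE polymer; the
    # trailing "$$$$" leaves the optional connection and annotation
    # sections of the notation empty.
    return "PEPTIDE1{" + ".".join(sequence) + "}$$$$"


if __name__ == "__main__":
    print(fasta_to_helm("SDVAFRGNLLD"))
```

A string of this form can be read by RDKit's HELM parser (`Chem.MolFromHELM`) to obtain a molecule for conformer generation.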

For larger peptides, a second method relies on predicting the peptide’s secondary structure. First, the PSIPRED package assigns possible α-helices, β-sheets, or coils [Citation26]. Although the tool focuses on proteins, it can be applied to peptide sequences. However, if the user knows the required secondary structure, it can be provided as a direct input for the Modeller pipeline. Second, a loop refinement protocol with Modeller uses one single amino acid of the peptide as a template (i.e., a predicted amino acid structure using RDKit). Then the rest of the residues are modeled following the secondary structure restraints. If the peptide is cyclic, the function can add restrictions on the formation of disulfide bonds.
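Before secondary structure restraints can be applied, a per-residue assignment has to be turned into residue ranges. The sketch below does this for a string of H (helix), E (strand) and C (coil) letters, which is the three-state alphabet used by PSIPRED. The function name `ss_segments` and the choice of a plain string as input are assumptions for this example, not the PepFun 2.0 interface.

```python
from itertools import groupby


def ss_segments(ss: str) -> list:
    """Return (state, first_residue, last_residue) for helix and strand runs.

    Residue numbers are 1-based and inclusive; coil residues are skipped
    because no restraint is applied to them in this sketch.
    """
    segments = []
    position = 1
    for state, run in groupby(ss):
        length = len(list(run))
        if state in ("H", "E"):
            segments.append((state, position, position + length - 1))
        position += length
    return segments


if __name__ == "__main__":
    # Hypothetical assignment for an 11-residue peptide with a central helix.
    print(ss_segments("CHHHHHHHHHC"))
```

Each returned range could then be translated into one helix or strand restraint in the modeling step.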

Figure 2 shows an example of conformers predicted with the two methodologies. In Figure 2A, an extended conformation is predicted for an 11-mer peptide (SDVAFRGNLLD) using the RDKit protocol. In Figure 2B, a second 11-mer peptide (ITFEDLLDYYG) is predicted as a helical conformer using the Modeller pipeline, which is consistent with the experimental folding of the bound peptide (PDB id 2buo) [Citation27].

Figure 2. Examples of the conformer and modifications modules.

(A) Prediction of an extended conformer for the peptide sequence SDVAFRGNLLD using the ETKDGv3 module from RDKit [Citation24]. (B) Prediction of a helical conformer for peptide ITFEDLLDYYG using a combination of PSIPRED [Citation26] and Modeller [Citation16] to predict the secondary structure and generate the 3D model using the secondary structure restraints in a loop refinement protocol. (C) Filling a peptide by adding two amino acids, LK at the N-terminal and KL at the C-terminal part, using the sequence AKAFIAWLVRG. (D) Capping the peptide AKAFIAWLVRG using the acetyl group (ACE) at the N-terminal and methylamine (NME) at the C-terminal part.

Modifications module

This module’s purpose is to modify existing peptide structures by adding missing amino acids or capping groups, or by mutating specific residues, without altering the rest of the original structure. The input is a PDB file of the peptide alone or in a complex with a protein target. It is important to assign the correct atom names and numbering in the input file to facilitate its manipulation using the Biopython and Modeller packages.

For the filling functionality, the user provides a PDB file of the peptide, alone or in a complex, together with the amino acid sequence of the input peptide and the sequence of the new peptide with new residues at different positions. The function uses the loop refinement functionality of Modeller to add the missing residues without altering the template conformer. The Modeller pipeline can also modify the peptide when bound to a protein target. For the capping functionality, it is possible to add glycines at any flanking parts using Modeller and then convert them to an acetyl group at the N-terminal or methylamine at the C-terminal part. An example of the filling function is shown in Figure 2C, where two residues are added at each flanking part of the existing peptide PDB structure. Figure 2D shows an example of capping a peptide at both terminal parts.
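To make the relation between the two input sequences concrete, the sketch below pads the existing (template) sequence with gap characters so that it lines up with the extended sequence, which is the kind of template-target alignment that homology modeling tools consume. It handles only residues added at the flanks, and the function name `pad_template` is hypothetical; how PepFun 2.0 prepares its Modeller input is not described here.

```python
def pad_template(template: str, target: str) -> str:
    """Align the template to the target by adding gaps at the flanks."""
    start = target.find(template)
    if start < 0:
        raise ValueError("template is not a contiguous part of the target")
    trailing = len(target) - start - len(template)
    return "-" * start + template + "-" * trailing


if __name__ == "__main__":
    # Sequences from the filling example: LK added at the N-terminal part
    # and KL at the C-terminal part of AKAFIAWLVRG.
    target = "LKAKAFIAWLVRGKL"
    print(pad_template("AKAFIAWLVRG", target))  # --AKAFIAWLVRG--
    print(target)
```

Positions aligned to a gap in the template are the residues that the modeling step has to build.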

Finally, a mutation class is added to the module to replace a natural amino acid with a non-natural amino acid (NNAA) of interest. PepFun is limited to analyzing alpha NNAAs that only contain chemical modifications on the residue side chain but have typical backbone atoms that can form peptide bonds with other natural or non-natural monomers. This class is one of the methods that allow the analysis of non-natural residues in this version of PepFun. The method is inspired by the PeptideBuilder package [Citation28] and uses Biopython as the basis to manipulate the structures, calculate parameters, and assign new atoms in the NNAA side chain. To run the tool, a PDB file of the NNAA with the coordinates and correct atom naming is necessary. The file can be obtained from the PDB or predicted using conformer generators. The mutated peptide will incorporate the adapted coordinates of the NNAA side chain without affecting the original backbone atom coordinates, and it can be used as input for simulations if the force field parameters are available.

Interactions module

A protein-peptide complex in PDB format is required as input for the interactions module. The module contains different functions. The first one runs the DSSP program to assign secondary structure elements to each residue in the peptide [Citation29]. The second one uses the same DSSP program to detect the possible hydrogen bonds between the peptide and the protein, which can be analyzed in the context of the solvent accessible surface area that DSSP also annotates. A detailed report is generated with the detected hydrogen bonds per residue/atom, together with a graph-based figure of the interactions, where the nodes are the residues and the edges are the interactions. The nodes are colored based on the chain, and the width of each edge indicates whether more than one potential hydrogen bond is present. For this module, it is possible to have NNAAs in the peptide.

The module also allows the detection of hydrophobic contacts. For that goal, a function based on the Biopython package detects all the pairs of protein and peptide residues whose distance falls under a defined threshold, which will depend on the user’s requirements. By default, a threshold of 4 Å is used. An example for the protein-peptide complex with PDB id 1xn2 [Citation30] is shown in Figure 3. The graph-based representation is generated using the igraph module for Python [Citation31].
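The distance-based contact detection can be sketched with the standard library alone. In the example below each atom is a pair of residue label and Cartesian coordinates in Å, and a peptide residue is in contact with a protein residue when any pair of their atoms is closer than the threshold. The function name, the input layout and the toy coordinates are hypothetical; PepFun 2.0 reads the atoms from the PDB file through Biopython.

```python
import math


def count_contacts(peptide_atoms, protein_atoms, threshold=4.0):
    """Count protein residues in contact with each peptide residue.

    Both inputs are lists of (residue_label, (x, y, z)) tuples. Peptide
    residues without any contact are not included in the result.
    """
    partners = {}
    for pep_res, pep_xyz in peptide_atoms:
        for prot_res, prot_xyz in protein_atoms:
            if math.dist(pep_xyz, prot_xyz) < threshold:
                partners.setdefault(pep_res, set()).add(prot_res)
    return {res: len(found) for res, found in partners.items()}


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    peptide = [("W1", (0.0, 0.0, 0.0)), ("W2", (10.0, 0.0, 0.0))]
    protein = [("D32", (3.0, 0.0, 0.0)), ("Y71", (0.0, 3.5, 0.0)),
               ("G230", (20.0, 0.0, 0.0))]
    print(count_contacts(peptide, protein))
```

The double loop is quadratic in the number of atoms; for large complexes a spatial index such as Biopython's `NeighborSearch` is the usual choice.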

Figure 3. Interactions detected for a protein-peptide complex with PDB id 1xn2.

The peptide sequence (shown in green in the structure) is WWSEVN[1OL]AEF, where 1OL is an NNAA available in the PDB. The hydrogen bond graph represents the interactions between the protein (shown in orange in the structure) and the peptide as edges between nodes, which have different colors depending on whether they are part of the protein or the peptide. A plot with the number of contacts per residue using a threshold of 4 Å is also shown.

NNAA: Non-natural amino acid.

Extra module

The extra module includes specialized functions to analyze modified peptides with NNAAs. The inputs for the modified peptides are sequences using the BILN notation [Citation32]. In this notation, the monomers are separated by dashes, and the NNAAs are described based on codes reported in official monomer repositories. PepFun 2.0 includes the open HELM monomer database to map 322 non-natural elements (https://github.com/PistoiaHELM/HELMMonomerSets).
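For a simple linear peptide, reading a dash-separated sequence amounts to splitting the string and looking up each code. The sketch below covers only that case; the full notation has further syntax for more complex peptides that is ignored here. The function names are hypothetical, and `Aib` is used as an illustrative non-natural monomer code.

```python
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")


def parse_linear_biln(biln: str) -> list:
    """Split a dash-separated linear peptide into its monomer codes."""
    monomers = biln.strip().split("-")
    if any(not code for code in monomers):
        raise ValueError("empty monomer code in input")
    return monomers


def non_natural_positions(monomers: list) -> list:
    """Return (position, code) for monomers that are not natural amino acids."""
    return [(i, code) for i, code in enumerate(monomers, start=1)
            if code not in NATURAL]


if __name__ == "__main__":
    monomers = parse_linear_biln("A-K-Aib-F")
    print(monomers)
    print(non_natural_positions(monomers))
```

Working with a list of monomer codes, rather than a string of one-letter symbols, is what lets later steps treat natural and non-natural residues uniformly.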

The first method is the alignment and score of modified peptides using the pairwise2 functionality of Biopython. The alignment can be unweighted (i.e., matches and mismatches) or weighted by a similarity-based scoring matrix generated between all the monomers. A function to update the matrix using any similarity threshold is also included. The formation of gaps is allowed in the alignment, and the score denotes the similarity between the peptides.
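The scoring idea can be illustrated with a small global alignment over lists of monomer codes. The sketch below is a plain Needleman-Wunsch dynamic program written for this example; it stands in for the pairwise2 functionality and is not the PepFun 2.0 code. With the default parameters (match 1, mismatch 0, no gap penalty) the score is the number of matched monomers, and an optional `similarity` dictionary, a hypothetical stand-in for the monomer scoring matrix, gives partial credit to similar monomers.

```python
def pair_score(x, y, match, mismatch, similarity):
    """Score one aligned pair of monomers."""
    if x == y:
        return match
    if similarity is not None:
        # Look the pair up in either order; fall back to the mismatch score.
        return similarity.get((x, y), similarity.get((y, x), mismatch))
    return mismatch


def global_alignment_score(a, b, match=1.0, mismatch=0.0, gap=0.0,
                           similarity=None):
    """Best global alignment score between two lists of monomer codes."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + pair_score(
                a[i - 1], b[j - 1], match, mismatch, similarity)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]


if __name__ == "__main__":
    first = ["A", "K", "Aib", "F"]
    second = ["A", "K", "F"]
    print(global_alignment_score(first, second))
    print(global_alignment_score(first, second, gap=-1.0))
```

Dividing the score by the length of the longer peptide is one simple way to turn it into a similarity between 0 and 1 when the default parameters are used.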

A second method is the generation of descriptors for modified peptides based on pre-calculated properties of the HELM monomer dataset. These properties are molecular weight, topological polar surface area (TPSA), partition coefficient (LogP), and the number of rotatable bonds. To obtain these properties, only the monomers’ 2D molecular information is used as input. Then a function generates the autocorrelation of these properties using the Moran equation [Citation33], described as:

$$M(d) = \frac{\dfrac{1}{N-d}\displaystyle\sum_{i=1}^{N-d}\left(P_i-\bar{P}_t\right)\left(P_{i+d}-\bar{P}_t\right)}{\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\left(P_i-\bar{P}_t\right)^{2}}, \qquad d = 1, 2, 3, \ldots, n$$

where each property $P_i$ is normalized according to:

$$P_i = \frac{P_i - \bar{P}}{\sigma}$$

On the right-hand side of the normalization, $P_i$ is the original property value, $\bar{P}$ is the mean value and $\sigma$ is the standard deviation, both calculated from the property values of the available amino acids. In the autocorrelation, $N$ is the number of amino acids in the peptide and $\bar{P}_t$ is the average of the property over all of them. The number of descriptors will depend on the variable $d$, which is adjusted to the peptide sequence length. A graphical example of the alignment of modified peptides and the autocorrelation of properties to generate amino acid-based descriptors is shown in Figure 4.
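The Moran autocorrelation can be worked through numerically with a short script. The sketch below first normalizes a property over a reference set of monomers and then computes one descriptor per lag $d$ for a peptide. The function names are hypothetical, the property values are placeholders rather than entries of the HELM monomer dataset, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def normalize(reference: dict) -> dict:
    """Normalize one property over all available monomers (z-score)."""
    mu = mean(reference.values())
    sigma = pstdev(reference.values())  # population std (assumption)
    return {code: (value - mu) / sigma for code, value in reference.items()}


def moran(values: list, d: int) -> float:
    """Moran autocorrelation of a per-residue property at lag d."""
    n = len(values)
    if not 1 <= d < n:
        raise ValueError("lag must satisfy 1 <= d < sequence length")
    p_bar = mean(values)
    numerator = sum((values[i] - p_bar) * (values[i + d] - p_bar)
                    for i in range(n - d)) / (n - d)
    denominator = sum((v - p_bar) ** 2 for v in values) / n
    if denominator == 0:
        raise ValueError("property is constant along the sequence")
    return numerator / denominator


def moran_descriptors(values: list, max_lag: int) -> list:
    """One descriptor per lag, from 1 to max_lag."""
    return [moran(values, d) for d in range(1, max_lag + 1)]


if __name__ == "__main__":
    # Placeholder property values for four monomers (illustration only).
    table = normalize({"A": 89.0, "K": 146.0, "Aib": 103.0, "F": 165.0})
    peptide = ["A", "K", "Aib", "F", "K", "A"]
    print(moran_descriptors([table[m] for m in peptide], max_lag=3))
```

Repeating the calculation for each of the four properties and concatenating the results gives a fixed-length vector once the maximum lag is fixed.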

Figure 4. Examples of analysis for modified peptides with non-natural amino acids.

(A) Alignment of two peptide sequences having each a single non-natural amino acid (in blue). The alignment allows the generation of gaps to fit the matches. (B) Schema of the Moran equation to generate amino acid-based descriptors based on the autocorrelation of properties per monomer. The properties are the molecular weight, TPSA, partition coefficient (LogP), and the number of rotatable bonds.

TPSA: Topological polar surface area.

The last method allows converting the SMILES notation of a peptide to its FASTA sequence. A function maps natural amino acids by detecting the masses of the single residues based on RDKit calculations. The input can be adapted to accept standard input formats like SDF files.

Conclusion

PepFun 2.0 is a package that can be installed under a Linux environment and called through its independent modules in a Python script. The functionalities included are motivated by everyday tasks required to study and design new peptide variants, with the possibility now to include modified peptides composed of non-natural amino acids available in open repositories. The package is open for academic purposes and comes with a general tutorial on running the main functions and an examples folder with scripts and output files that use different natural and modified peptides as input. Together, the included cases guide the user through running each module in the code repository.

Summary points

  • Open cheminformatics tools for studying natural and modified peptides are required.

  • PepFun 2.0 is a Python package with modules to study peptide sequences and structures.

  • Conformers can be generated using secondary structure restraints.

  • The code is open source and available on GitHub.

  • Methods to align peptides with non-natural amino acids are available, which can be scored by a pre-built similarity matrix of a monomer dictionary.

  • Amino acid-based descriptors can be obtained to be used in machine learning models.

  • The functions can be adapted to analyze massive datasets of sequences or peptide structures obtained from the PDB or molecular dynamics simulations.

  • Different line notations can be used according to the functionalities. Some examples are provided in the code repository.

Availability

  • Project name: PepFun (version 2.0).

  • Project home page: https://github.com/rochoa85/PepFun2.

  • Operating system(s): Linux.

  • Programming language: Python 3.

  • Other requirements: RDKit 2020 or higher; Biopython 1.79 or higher; Modeller 10.3 or higher.

  • License: MIT.

The code is available as a GitHub repository. Any questions related to the implementation can be directed to the author by email.

Acknowledgments

I thank P Cossio for valuable insights into the conceptualization of PepFun. I also thank T Fox for the feedback on some of the package’s tools.

Financial & competing interests disclosure

This work has been supported by MinCiencias, University of Antioquia and Ruta N, Colombia, the Max Planck Society, Germany. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.

References

  1. Paul DS, Gautham N. MOLS 2.0: software package for peptide modeling and protein–ligand docking. J. Mol. Model. 22(10), 1–9 (2016).
  2. Baylon JL, Ursu O, Muzdalo A et al. PepSeA: peptide sequence alignment and visualization tools to enable lead optimization. J. Chem. Inf. Model. 62(5), 1259–1267 (2022).
  3. Ochoa R, Soler MA, Laio A, Cossio P. PARCE: protocol for amino acid refinement through computational evolution. Comput. Phys. Commun. 260, 107716 (2021).
  4. Duvaud S, Gabella C, Lisacek F, Stockinger H, Ioannidis V, Durinx C. Expasy, the Swiss bioinformatics resource portal, as designed by its users. Nucleic Acids Res. 49(W1), W216–W227 (2021).
  5. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 28(1), 374–374 (2000).
  6. Shen Y, Maupetit J, Derreumaux P, Tufféry P. Improved PEP-FOLD approach for peptide and miniprotein structure prediction. J. Chem. Theory Comput. 10(10), 4745–4758 (2014).
  7. Yan Y, Zhang D, Huang SY. Efficient conformational ensemble generation of protein-bound peptides. J. Cheminformatics 9(1), 1–13 (2017).
  8. Ochoa R, Cossio P. PepFun: open source protocols for peptide-related computational analysis. Molecules 26(6), 1664 (2021).
  9. Tu M, Cheng S, Lu W, Du M. Advancement and prospects of bioinformatics analysis for studying bioactive peptides from food-derived protein: sequence, structure, and functions. Trends Analyt. Chem. 105, 7–17 (2018).
  10. Joshi J, Blankenberg D. PDAUG: a galaxy based toolset for peptide library analysis, visualization, and machine learning modeling. BMC Bioinform. 23(1), 1–17 (2022).
  11. Amarasinghe KN, De Maria L, Tyrchan C, Eriksson LA, Sadowski J, Petrovic D. Virtual screening expands the non-natural amino acid palette for peptide optimization. J. Chem. Inf. Model. 62(12), 2999–3007 (2022).
  12. Zimmermann T, Thomas L, Baader-Pagler T et al. BI 456906: discovery and preclinical pharmacology of a novel GCGR/GLP-1R dual agonist with robust anti-obesity efficacy. Mol. Metab. 66, doi: https://doi.org/10.1016/j.molmet.2022.101633 (2022).
  13. Gfeller D, Michielin O, Zoete V. SwissSideChain: a molecular and structural database of non-natural sidechains. Nucleic Acids Res. 41(D1), D327–D332 (2012).
  14. Ochoa R, Cossio P, Fox T. Protocol for iterative optimization of modified peptides bound to protein targets. J. Comput. Aided Mol. Des. 36(11), 825–835 (2022).
  15. Cock PJ, Antao T, Chang JT et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11), 1422–1423 (2009).
  16. Eswar N, Eramian D, Webb B, Shen MY, Sali A. Protein structure modeling with MODELLER. Methods Mol. Biol. 1137, 145–159 (2008).
  17. Eisenberg D, Weiss RM, Terwilliger TC. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc. Natl Acad. Sci. USA 81(1), 140–144 (1984).
  18. Artimo P, Jonnalagedda M, Arnold K et al. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res. 40(W1), W597–W603 (2012).
  19. Mannhold R, van de Waterbeemd H. Substructure and whole molecule approaches for calculating log P. J. Comput. Aided Mol. Des. 15(4), 337–354 (2001).
  20. Flower DR. On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci. 38(3), 379–386 (1998).
  21. Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res. 34(Suppl. 2), W6–W9 (2006).
  22. Santos GB, Ganesan A, Emery FS. Oral administration of peptide-based drugs: beyond Lipinski’s rule. ChemMedChem 11(20), 2245–2251 (2016).
  23. Milton J, Zhang T, Bellamy C et al. HELM software for biopolymers. J. Chem. Inf. Model. 57(6), 1233–1239 (2017).
  24. Riniker S, Landrum GA. Better informed distance geometry: using what we know to improve conformation generation. J. Chem. Inf. Model. 55(12), 2562–2574 (2015).
  25. Wang S, Krummenacher K, Landrum GA et al. Incorporating NOE-derived distances in conformer generation of cyclic peptides with distance geometry. J. Chem. Inf. Model. 62(3), 472–485 (2022).
  26. Buchan DW, Jones DT. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res. 47(W1), W402–W407 (2019).
  27. Ternois F, Sticht J, Duquerroy S, Kräusslich HG, Rey FA. The HIV-1 capsid protein C-terminal domain in complex with a virus assembly inhibitor. Nat. Struct. 12(8), 678–682 (2005).
  28. Tien MZ, Sydykova DK, Meyer AG, Wilke CO. PeptideBuilder: a simple python library to generate model peptides. PeerJ 1, e80 (2013).
  29. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983).
  30. Turner RT, Hong L, Koelsch G, Ghosh AK, Tang J. Structural locations and functional roles of new subsites S5, S6, and S7 in memapsin 2 (β-secretase). Biochemistry 44(1), 105–112 (2005).
  31. Csardi G, Nepusz T. The igraph software package for complex network research. Int. J. Complex Syst. 1695(5), 1–9 (2006).
  32. Fox T, Bieler M, Haebel P, Ochoa R, Peters S, Weber A. BILN: a human-readable line notation for complex peptides. J. Chem. Inf. Model. 62(17), 3942–3947 (2022).
  33. Chen Z, Zhao P, Li F et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 34(14), 2499–2502 (2018).