Note This tutorial should be run on gnoto JupyterHub and not on noto JupyterHub. This is because we will use GPU-accelerated techniques to speed up the simulations.

To this end, please use the following link instead of the usual rocket button: gnoto

Select the correct kernel for this exercise CH-351

6.3. Practical: TRP Cage using OpenMM#

We will use a modern and open source simulation program to run our molecular dynamics simulation of the TRP cage protein.

OpenMM is different from most other simulations tools in that it is not controlled by executing different subprograms in the command line but instead one uses a python program to control the flow of our simulation. This allows us to easily rerun parts or the whole simulation.

Below is a detailed description of our workflow. The respective parts you need to change during the exercise are indicated.

6.3.1. Imports#

First we need to install the necessary modules and import them. We install our simulation package openmm and parmed, which we use to load our Amber trajectory files.

import openmm
import py3Dmol

Exercise 7

Follow the instructions in this file and run the folding simulation. Calculate the correct number of steps for the simulations to run 50 ps of heating and 20 ns of dynamics and report those numbers in your report. These sections are marked in the following with #CHANGE HERE.

6.3.2. Creating our system#

Creating the system will be done using tleap which is part of the Amber software package. Linear peptides can be easily created using this tool.

Below we create this file using python and save it as leap.in to the filesystem.

Using !cat filename we can check whether it was sucessfully created.

Then we execute tleap using our input file using the command !tleap -f leap.in.

leapin = """source leaprc.protein.ff19SB
  set default PBradii mbondi3
  TRP = sequence {  NASN LEU TYR ILE GLN TRP LEU LYS ASP GLY GLY PRO SER SER GLY ARG PRO PRO PRO CSER}
  savepdb TRP trp_cage_linear.pdb
  saveamberparm TRP trp_cage.prmtop trp_cage_linear.nc
  quit
  """

with open("leap.in","w+") as f:
  f.writelines(leapin)
!cat leap.in
!/opt/CH-351/bin/tleap -f leap.in

You will be greeted with plenty output, lets go through it step by step.

>$ !/opt/CH-351/bin/tleap
-I: Adding /opt/CH-351/dat/leap/prep to search path.
-I: Adding /opt/CH-351/dat/leap/lib to search path.
-I: Adding /opt/CH-351/dat/leap/parm to search path.
-I: Adding /opt/CH-351/dat/leap/cmd to search path.
-f: Source leap.in.
Welcome to LEaP!
>(no leaprc in search path)

This is leap starting up telling us that it has added several directories to its search path. These directories contain the parameters of the amber force field that we now will load.

Next we need to load a specific force field. We do this using the following command:

source leaprc.protein.ff19SB

If you want you can take a look at this file you can make a new cell and use the following command:

!cat  /opt/CH-351/dat/leap/parm19.dat

Here you will find the masses, bond distance and force constants etc. that make up our forcefield.

This produces the following output which are all force field parameters that are loaded.

----- Source: /opt/CH-351/dat/leap/cmd/leaprc.protein.ff19SB
----- Source of /opt/CH-351/dat/leap/cmd/leaprc.protein.ff19SB done
Log file: ./leap.log
Loading parameters: /opt/CH-351/dat/leap/parm/parm19.dat
Reading title:
PARM99 + frcmod.ff99SB + frcmod.parmbsc0 + OL3 for RNA + ff19SB
Loading parameters: /opt/CH-351/dat/leap/parm/frcmod.ff19SB
Reading force field modification type file (frcmod)
Reading title:
ff19SB AA-specific backbone CMAPs for protein 07/25/2019
Loading library: /opt/CH-351/dat/leap/lib/amino19.lib
Loading library: /opt/CH-351/dat/leap/lib/aminoct12.lib
Loading library: /opt/CH-351/dat/leap/lib/aminont12.lib

In the next line, we set the radii of the atoms to be used for the implicit solvation.

set default PBradii mbondi3

This produces the following line in the leap output:

Using ArgH and AspGluO modified Bondi2 radii

Next, we create a new UNIT in AMBER using the sequence command.

TRP = sequence {  NASN LEU TYR ILE GLN TRP LEU LYS ASP GLY GLY PRO SER SER GLY ARG PRO PRO PRO CSER}

Then, we save a PDB file of our system.

savepdb TRP trp_cage_linear.pdb

This creates the pdb file on disk and produces the following output:

Writing pdb file: trp_cage_linear.pdb

/opt/CH-351/bin/teLeap: Warning!
 Converting N-terminal residue name to PDB format: NASN -> ASN

/opt/CH-351/bin/teLeap: Warning!
 Converting C-terminal residue name to PDB format: CSER -> SER
Checking Unit.

As you can see, the N and C-terminal residues have a different residue namer in AMBER compared e.g to a normal serine. This is due to the fact that here the force field parameters slightly deviate from the parameters of a serine involved in two peptide bonds versus one single peptide bond for a terminal uncapped serine.

In the last step, we now generate the parameters in AMBER format and the input coordinates in the NetCDF format (which is a binary format).

saveamberparm TRP trp_cage.prmtop trp_cage_linear.nc

This produces plenty of output and a warning:

/opt/CH-351/bin/teLeap: Warning!
The unperturbed charge of the unit (1.000000) is not zero.

/opt/CH-351/bin/teLeap: Note.
Ignoring the warning from Unit Checking.

Building topology.
Building atom parameters.
Building bond parameters.
Building angle parameters.
Building proper torsion parameters.
Building improper torsion parameters.
 total 63 improper torsions applied
Building H-Bond parameters.
Incorporating Non-Bonded adjustments.
Not Marking per-residue atom chain types.
Marking per-residue atom chain types.
  (Residues lacking connect0/connect1 - 
   these don't have chain types marked:

	res	total affected

	CSER	1
	NASN	1
  )
 (no restraints)

The first warning we can safely ignore as in the implicit solvent treatment we will use, we cannot use counter ions to neutralize our system. Instead we will use a screened charge model to emulate the presence of ions in the solvent. Next we see terms, that align with the functional form of the AMBER forcefield:

Again, we can look at the contents of the parameter file !cat trp_cage.prmtop and we see that it does not contain the coordinates anymore in contrast to the pdb file.

6.3.2.1. Looking at the starting structure#

# show starting structure here
with open("trp_cage_linear.pdb") as ifile:
    system = "".join([x for x in ifile])

view = py3Dmol.view(width=400, height=300)
view.addModelsAsFrames(system)
view.setStyle({'model': -1}, {"stick": {'color': 'spectrum'}})
view.zoomTo()
view.show()

Exercise 8

Include an image of the starting structure in your report. Are there any residues which would contribute to the instability of the starting structure in its current conformation, why?

6.3.3. Simulating the system#

Now we can start the simulation by writing our simulation script in python.

First, lets import the required modules:

  • openmm for running the simulation

  • parmed to read/write the input and output files

# OpenMM Imports
from openmm.app import *
from openmm import *

from openmm.app.dcdreporter import DCDReporter

# ParmEd Imports
from parmed import load_file, unit as u
from parmed.openmm import StateDataReporter

# System modules
import sys

6.3.3.1. Loading Amber files#

In this stage, we simply instantiate the AmberParm object from the input topology and coordinate files. After this command, trp_cage_gas will contain a full description of every particle, the parameters defining their interactions, and their position.

# Load the Amber files
print('Loading AMBER files...')
trp_cage_gas = load_file('trp_cage.prmtop', 'trp_cage_linear.nc')

6.3.3.2. Create the OpenMM System#

This command creates an OpenMM System object from the information stored in trp_cage_gas. It contains multiple Force instances for the bonds, angles, periodic torsions, and nonbonded (electrostatic and van der Waals) interactions. It is in this function that we define the potential parameters we want to use. In this example, we have chosen the default values for each parameter except the ones specified. In particular:

  • nonbondedMethod=app.NoCutoff indicates we do not want to use a cutoff for nonbonded interactions.

  • constraints=app.HBonds indicates we want to constrain all bonds in which at least one atom is a Hydrogen (i.e., SHAKE or SETTLE for water). Other options are None (no constraints), app.AllBonds, or app.HAngles.

  • implicitSolvent=app.GBn2 indicates we want to use the second GBneck model described in Nguyen et al., J. Chem. Theory Comput., 2014 9(4) p.2020-2034.

  • implicitSolventSaltConc=0.1*u.liters/u.mole indicates we want to model a ca. 0.1 molar solution of monovalent ions using a Debye screening model.

# Create the OpenMM system
print('Creating OpenMM System')
system = trp_cage_gas.createSystem(nonbondedMethod=app.NoCutoff,
                               constraints=app.HBonds, implicitSolvent=app.GBn2,
                               implicitSolventSaltConc=0.1*u.moles/u.liter,
)

6.3.3.3. Create the integrator to do Langevin Dynamics#

In this stage we specify an integrator. Common choices are LangevinIntegrator (as we’ve chosen here) to do simulations in the NVT ensemble. In this example, we’ve chosen the Langevin integrator with a target temperature of 325 K, a friction coefficient of 1/ps and a time step of 0.5 fs.

# Create the integrator to do Langevin dynamics
integrator = openmm.LangevinIntegrator(
                        0*u.kelvin,       # Temperature of heat bath
                        1.0/u.picoseconds,  # Friction coefficient
                        0.5*u.femtoseconds, # Time step
)

6.3.3.4. Create the Simulation object#

This step creates a Simulation object that will be used to run the actual simulations. OpenMM picks the fastest platform for us if we do not specify a platform.

# Create the Simulation object
sim = app.Simulation(trp_cage_gas.topology, system, integrator,)

This stage is very important. In this step, we set the particle positions stored in the trp_cage_gas object to our object. If you omit this step, you can get strange results or other errors like segmentation violations. These particle positions have been parsed from the input coordinate file.

# Set the particle positions
sim.context.setPositions(trp_cage_gas.positions)

6.3.3.5. Minimize the energy#

This stage performs a basic energy minimization to relax particle positions. This particular invocation will perform at most 500 iterations.

# Minimize the energy
print('Minimizing energy')
sim.minimizeEnergy(maxIterations=500)

state = sim.context.getState(getEnergy=True)
print(state.getPotentialEnergy())

6.3.3.6. Heating phase#

After having obtained the minimized coordinates of our system we can begin slowly heating it. We define reporters that will “report” on the status of the simulation periodically throughout the heating phase. The first is a StateDataReporter which will print out a summary of energies and temperatures every 500 steps. This reporter directs the printout to standard output (the screen), sys.stdout or to a file name (if set).

The second reporter is a NetCDF trajectory reporter, which is written in the Amber NetCDF format.

print('Heating phase')
sim.reporters.append(StateDataReporter('heating.csv', 500, step=True, potentialEnergy=True,  kineticEnergy=True, temperature=True)
 )
sim.reporters.append(
         DCDReporter('trp_cage_heating.dcd', 500)
)

for i in range(5):
    integrator.setTemperature(65*(i+1)*u.kelvin)
    sim.step(#CHANGE HERE)  ## Change this in order to run 50 ps of heating in 5 steps

state = sim.context.getState(getPositions=True)
minimized_and_heated=state.getPositions()
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.read_csv('heating.csv')

fig, ax = plt.subplots(1)

ax.plot(df['Time (ps)'],df['Potential Energy (kilocalorie/mole)'], label='Epot')
ax.plot(df['Time (ps)'],df['Kinetic Energy (kilocalorie/mole)'], label='Ekin')
ax.plot(df['Time (ps)'],df['Total Energy (kilocalorie/mole)'], label='Etot')
ax.set_xlabel('t (ps)')
ax.set_ylabel('Energy (kcal/mol)')
ax.legend()
plt.show()
fig, ax = plt.subplots(1)

ax.plot(df['Time (ps)'],df['Temperature (K)'], label='Temperature')
ax.set_xlabel('t (ps)')
ax.set_ylabel('Temp (K)')
ax.legend()
plt.show()

6.3.4. Running dynamics#

We first reset our Integrator to update the timestep, create the simulation object and then we actually run the MD of our heated system to see if our linear peptide folds up into its 3D form. We use again the NetCDF and StateDataReporter to retrieve information about our simulation.

# Create the integrator to do Langevin dynamics
integrator = openmm.LangevinIntegrator(
                        325*u.kelvin,       # Temperature of heat bath
                        1.0/u.picoseconds,  # Friction coefficient
                        2.0*u.femtoseconds, # Time step
)

# Create the Simulation object
sim = app.Simulation(trp_cage_gas.topology, system, integrator)

# Set the particle positions
sim.context.setPositions(minimized_and_heated)

# Set up the reporters to report energies and coordinates every 50000 steps
sim.reporters.append(
        StateDataReporter('production.csv', 50000, step=True, potentialEnergy=True,
                               kineticEnergy=True, temperature=True)
)
# Additionally print to the output of the notebook, so that we can track the simulation.
sim.reporters.append(
        StateDataReporter(sys.stdout, 50000, step=True, potentialEnergy=True,
                               kineticEnergy=True, temperature=True)
)
sim.reporters.append(
        DCDReporter('trp_cage_gb.dcd', 50000)
)

# Run dynamics
print('Running dynamics')
sim.step(#CHANGE HERE) # change this to run 20 ns of simulation
# Plotting energy from production simulation

df=pd.read_csv('production.csv')

fig, ax = plt.subplots(1)

ax.plot(df['Time (ps)'],df['Potential Energy (kilocalorie/mole)'], label='Epot')
ax.plot(df['Time (ps)'],df['Kinetic Energy (kilocalorie/mole)'], label='Ekin')
ax.plot(df['Time (ps)'],df['Total Energy (kilocalorie/mole)'], label='Etot')
ax.set_xlabel('t (ps)')
ax.set_ylabel('Energy (kcal/mol)')
ax.legend()
plt.show()
print('Average temperature')
print(f"{np.mean(df['Temperature (K)']):.3f} +- {np.std(df['Temperature (K)']):.3f} ")
fig, ax = plt.subplots(1)

ax.plot(df['Time (ps)'],df['Temperature (K)'], label='Temperature')
ax.set_xlabel('t (ps)')
ax.set_ylabel('Temp (K)')
ax.legend()
plt.show()

6.4. TrpCage Analysis#

import sys
sys.path.append("..")
import helpers
from helpers import *
helpers.set_style()
import numpy as np

import matplotlib.pyplot as plt

from matplotlib.ticker import MaxNLocator
from scipy import stats

import warnings

6.4.1. Lets load the trajectory files#

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    u = MDAnalysis.Universe("trp_cage.prmtop","trp_cage_gb.dcd")  # always start with a Universe
show_trajectory(u)

Exercise 9

What type of structure is the folded Trp-cage miniprotein? List the main components contributing to this structure, including the residues which are responsible for their formation.

We will also load an experimental NMR structure from an experimental NMR ensemble.

Experimental structures Quick overview

Experimental structures can be obtained using three main methods:

X-Ray diffraction Freeze the proteins and shoot x-rays at it. The diffraction pattern allows to reconstruct the electron density map in which we can fit a protein model. The models are usually very accurate, may have crystal artefacts due to packing and low temperature and can be obtained for proteins from large to small. Some proteins are difficult to crystallize.

NMR In solution structure using nuclear magnetic resonance (usually carbon and hydrogen) using complicated pulse sequences. Allows to resolve dynamical properties of the protein but normally is limited to smaller proteins.

Cryo EM The new kid on the block of structural biology. Works very well especially for large proteins, models are fit into a reconstructed map of the protein. Resolution tends to be lower than using X-Ray crystallography but sample preparation is much easier.

6.4.2. Align trajectory#

Aligning a trajectory is an important step when analysing molecular dynamics simulations. We will choose a reference structure/frame and project each frame to the reference (e.g by minimizing RMSD to reference).

In this case we align on the backbone nitrogen and carbon atoms.

As a reference we use one structure from the experimental NMR ensemble (PDB 1l2y).

from MDAnalysis.analysis import align
ref = MDAnalysis.Universe("1L2Y.pdb",  topology_format="PDB")
align.AlignTraj(u,  # trajectory to align
                ref,  # reference
                select='name CA',  # selection of atoms to align
                filename='aligned.xtc',  # file to write the trajectory to
                match_atoms=True,  # whether to match atoms based on mass
               ).run()
# load new aligned trajectory
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    u = MDAnalysis.Universe("trp_cage.prmtop", "aligned.xtc")

We can also compute the RMSD to the reference structure using a differnt function.

from MDAnalysis.analysis import rms
R = rms.RMSD(u,  # universe to align
             ref,  # reference universe or atomgroup
             select='backbone',  # group to superimpose and calculate RMSD
             
             ref_frame=0)  # frame index of the reference
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    R.run()

Next, we plot this RMSD.

import pandas as pd
df = pd.DataFrame(R.results.rmsd,
                  columns=['Frame', 'Time (ps)', 'RMSD (A)'])
df["Time (ns)"]=df["Time (ps)"]/1000
ax = df.plot(x='Time (ns)', y=['RMSD (A)'],
             kind='line')
ax.set_title('RMSD')
ax.set_ylabel(r'RMSD ($\AA$)');
ax.set_title('RMSD to NMR structure')
ax.set_xlabel('time [ns]')
plt.show()

an alternative is the RMSF

c_alphas = u.select_atoms('protein and name CA')
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    R = rms.RMSF(c_alphas).run()
fig, ax = plt.subplots(1)

ax.set_title('RMSF')
ax.plot( c_alphas.resids, R.results.rmsf)
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlim(0,20)
ax.set_xlabel('residue')
ax.set_ylabel('RMSF [$\AA$]')
plt.show()

Exercise 10

Explain the RMSD and RMSF plots. Does the trajectory reach the same conformation as the experimental structure? Which metric is more useful for the problem at hand? Bonus: Provide a use case for the other metric.

6.4.3. Visualize trajectory#

Now let’s compare the trajectory to one of the structures from the NMR structural ensemble to see how our linear Trp cage folded.

show_trajectory([u, "1L2Y.pdb"])

6.4.4. Emergence of secondary structure#

6.4.4.1. Hbonds#

We know that hydrogen bonds are very important for the formation of secondary structure elements. Let’s count the number of hbonds per frame and bin them in 1 ns bins by taking the mean of 10 frames.

from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    hbonds = HBA(universe=u)
    hbonds.run()
hbond_data = pd.DataFrame(hbonds.results.hbonds,
                  columns=['frame', 'donor index',
        'hydrogen index',
        'acceptor index',
        'distance',
        'angle'])
n_hbonds_per_frame = hbond_data.groupby("frame").size()
# Bin every nanosecond (i.e. 10 snapshots)
statistic, bin_edges, binnumber = stats.binned_statistic(n_hbonds_per_frame.index, n_hbonds_per_frame.values, bins=20)
fig, ax = plt.subplots(1)

times = np.linspace(0,20, num=u.trajectory.n_frames)

ax.set_title('Average number of Hbonds')
ax.plot(statistic)
ax.set_xlabel('time [ns]')
ax.set_ylabel('n hbonds')
plt.show()

Exercise 11

Include the hbond graph in your report, and explain the observed trend with reference to the structural components of the Trp-cage miniprotein ?

Exercise 12

Perform the Q1 and Q2 analysis explained below and provide the graph of the number of contacts. Can you infer at which interval (in nanoseconds) the secondary structure forms?

Here we calculate a Q1 vs Q2 plot, where Q1 refers to fraction of native contacts along a trajectory with reference to the first frame, and Q2 represents the fraction of native contacts with reference to the last.

from MDAnalysis.analysis import distances, contacts
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    q1q2 = contacts.q1q2(u, 'name CA', radius=8).run()
q1q2_df = pd.DataFrame(q1q2.results.timeseries,
                       columns=['Frame',
                                'Q1',
                                'Q2'])
q1q2_df['Frame']/=10
fig, ax = plt.subplots(1)

q1q2_df.plot(x='Frame', ax=ax)
ax.set_ylabel('Fraction of native contacts');

ax.set_title('Native contacts')
ax.set_xlabel('time [ns]')
plt.show()

6.4.4.2. Dihedral#

The central tryptophan in the Trp cage protein is located in the alpha helix. Here we track the dihedral angle along our simulation. When the protein forms ordered secondary structure elements we expect to see the dihedral only fluctuate a little.

from MDAnalysis.analysis import dihedrals
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    dihs = dihedrals.Dihedral( [u.residues[6].phi_selection()]).run()
fig, ax = plt.subplots(1)


ax.set_title('Trp6 $\phi$ dihedral')
ax.plot(np.linspace(0,20,num=200),dihs.results.angles.flatten())
ax.set_xlabel('time [ns]')
ax.set_ylabel('dihedral $[degree]$')
plt.show()

Exercise 13

Why is it useful to constrain bond lengths for larger MD simulations (typically with the SHAKE algorithm)? Which bonds would you typically constrain in such a scenario, and why?

Exercise 14 - Bonus

Which properties do you need to take into account in order to select an appropriate timestep for your MD simulation? Are there any other reasons you might wish to reduce or increase this timestep?

Exercise 15 - Bonus

Is it better to sample 2 x 10 ns from the same starting structure or 1 x 20 ns in order to explore conformational space efficiently?