← Back
BODIPY-Ruthenium(II) Bis-Terpyridine Complexes for Cellular Imaging and Type-I/-II Photodynamic Therapy.
RESEARCH ARTICLE
MDverse, shedding light on the dark
matter of molecular dynamics simulations
Johanna KS Tiemann1*†, Magdalena Szczuka2, Lisa Bouarroudj3,
Mohamed Oussaren3, Steven Garcia4, Rebecca J Howard5, Lucie Delemotte6,
Erik Lindahl5,6, Marc Baaden7, Kresten Lindorff-Larsen1, Matthieu Chavent2*,
Pierre Poulain3*
1
*For correspondence:
johanna.tiemann@gmail.com
(JKST);
matthieu.chavent@ipbs.fr (MC);
pierre.poulain@u-paris.fr (PP)
Present address:
NovozymesA/S, Lyngby,
Denmark
†
Competing interest: See page
16
Funding: See page 16
Preprint posted
02 May 2023
Sent for Review
28 June 2023
Reviewed preprint posted
20 September 2023
Reviewed preprint revised
01 July 2024
Version of Record published
30 August 2024
Reviewing Editor: Shozeb
Haider, University College
London, United Kingdom
Copyright Tiemann et al. This
article is distributed under the
terms of the Creative Commons
Attribution License, which
permits unrestricted use and
redistribution provided that the
original author and source are
credited.
Linderstrøm-Lang Centre for Protein Science, Department of Biology, University
of Copenhagen, Copenhagen, Denmark; 2Institut de Pharmacologie et Biologie
Structurale, CNRS, Université de Toulouse, Toulouse, France; 3Université Paris
Cité, CNRS, Institut Jacques Monod, Paris, France; 4Independent researcher,
Amsterdam, Netherlands; 5Department of Biochemistry and Biophysics, Science for
Life Laboratory, Stockholm University, Stockholm, Sweden; 6Department of applied
physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm,
Sweden; 7Laboratoire de Biochimie Théorique, CNRS, Université Paris Cité, Paris,
France
Abstract The rise of open science and the absence of a global dedicated data repository for
molecular dynamics (MD) simulations has led to the accumulation of MD files in generalist data
repositories, constituting the dark matter of MD — data that is technically accessible, but neither
indexed, curated, or easily searchable. Leveraging an original search strategy, we found and indexed
about 250,000 files and 2000 datasets from Zenodo, Figshare and Open Science Framework. With
a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the
mining of publicly available MD data. We identified systems with specific molecular composition and
were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this
analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To
continue in this direction, we call on the community to pursue the effort of sharing MD data, and to
report and standardize metadata to reuse this valuable matter.
eLife assessment
The study presents a valuable tool for searching molecular dynamics simulation data, making such
datasets accessible for open science. The authors provide convincing evidence that it is possible
to identify noteworthy molecular dynamics simulation datasets and that their analysis can produce
information of value to the community.
Introduction
The volume of data available in biology has increased tremendously (Marx, 2013; Stephens et al.,
2015), through the emergence of high-throughput experimental technologies, often referred to as
-omics, and the development of efficient computational techniques, associated with high-performance
computing resources. The Open Access (OA) movement to make research results free and available to
anyone (including e.g. the Budapest Open Access Initiative and the Berlin declaration on Open Access
to Knowledge) has led to an explosive growth of research data made available by scientists (Wilson
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
1 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
et al., 2021). The FAIR (Findable, Accessible, Interoperable and Reusable) principles Wilkinson et al.,
2016 have emerged to structure the sharing of these data with the goals of reusing research data and
to contribute to the scientific reproducibility. This leads to a world where research data has become
widely available and exploitable, and consequently new applications based on artificial intelligence
(AI) emerged. One example is AlphaFold (Jumper et al., 2021), which enables the construction of a
structural model of any protein from its sequence. However, it is important to be aware that the development of AlphaFold was only possible because of the existence of extremely well annotated and
cleaned open databases of protein structures (wwPDB Berman et al., 2003) and sequences (UniProt
Consortium, 2022). Similarly, accurate predictions of NMR chemical shifts and chemical-shift-driven
structure determination was only made possible via a community-driven collection of NMR data in the
Biological Magnetic Resonance Data Bank (Hoch et al., 2023). One can easily imagine novel possibilities of AI and deep learning reusing previous research data in other fields, if that data is curated and
made available at a large scale (Fan and Shi, 2022; Mahmud et al., 2021).
Molecular Dynamics (MD) is an example of a well-established research field where simulations give
valuable insights into dynamic processes, ranging from biological phenomena to material science
(Perilla et al., 2015; Hollingsworth and Dror, 2018; Yoo et al., 2020; Alessandri et al., 2021;
Krishna et al., 2021). By unraveling motions at details and timescales invisible to the eye, this well-
established technique complements numerous experimental approaches (Bottaro and Lindorff-
Larsen, 2018; Marklund and Benesch, 2019; Fawzi et al., 2021). Nowadays, large amounts of
MD data could be generated when modelling large molecular systems (Gupta et al., 2022) or when
applying biased sampling methods (Hénin et al., 2022). Most of these simulations are performed
to decipher specific molecular phenomena, but typically they are only used for a single publication.
We have to confess that many of us used to believe that it was not worth the storage to collect
all simulations (in particular since all might not have the same quality), but in hindsight this was
wrong. Storage is exceptionally cheap compared to the resources used to generate simulations data,
and they represent a potential goldmine of information for researchers wanting to reanalyze them
(Antila et al., 2021), in particular when modern machine-learning methods are typically limited by
the amount of training data. In the era of open and data-driven science, it is critical to render the
data generated by MD simulations not only technically available but also practically usable by the
scientific community. In this endeavor, discussions started a few years ago (Abraham et al., 2019;
Abriata et al., 2020; Merz et al., 2020) and the MD data sharing trend has been accelerated with
the effort of the MD community to release simulation results related to the COVID-19 pandemic
(Amaro and Mulholland, 2020; Mulholland and Amaro, 2020) in a centralized database (https://
covid.bioexcel.eu). Specific databases have also been developed to store sets of simulations related
to protein structures (MoDEL: Meyer et al., 2010), membrane proteins in general (MemProtMD:
Stansfeld et al., 2015; Newport et al., 2019), G-protein-coupled receptors in particular (GPCRmd:
Rodríguez-Espigares et al., 2020), or lipids (Lipidbook:Domański et al., 2010, NMRLipids Databank: Kiirikki et al., 2023).
Albeit previous attempts in the past (Tai et al., 2004; Meyer et al., 2010), there is, as of now, no
central data repository that could host all kinds of MD simulation files. This is not only due to the huge
volume of data and its heterogeneity, but also because interoperability of the many file formats used
adds to the complexity. Thus, faced with the deluge of biosimulation data (Hospital et al., 2020),
researchers often share their simulation files in multiple generalist data repositories. This makes it difficult to search and find available data on, for example, a specific protein or a given set of parameters.
We are qualifying this amount of scattered data as the dark matter of MD, and we believe it is essential
to shed light onto this overlooked but high-potential volume of data. When unlocked, publicly available MD files will gain more visibility. This will help people to access and reuse these data more easily
and overall, by making MD simulation data more FAIR (Wilkinson et al., 2016), it will also improve the
reproducibility of MD simulations (Elofsson et al., 2019; Porubsky et al., 2020; Bonomi et al., 2019).
In this work, we have employed a search strategy to index scattered MD simulation files deposited
in generalist data repositories. With a focus on the files generated by the Gromacs MD software, we
performed a proof-of-concept large-scale analysis of publicly available MD data. We revealed the
high value of these data and highlighted the different categories of the simulated molecules, as well
as the biophysical conditions applied to these systems. Based on these results and our annotations,
we proposed a search engine prototype to easily explore this dark matter of MD. Finally, building on
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
2 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
Figure 1. Explore and Expand ($Ex^2$) strategy used to index MD-related files and number of deposited files in generalist data repositories, identified
by this strategy. (A) Explore and Expand (Ex2 ) strategy used to index and collect MD-related files. Within the explore phase, we search in the respective
data repositories for datasets that contain specific keywords (e.g. ‘molecular dynamics’, ‘md simulation’, ‘namd’, ‘martini’...) in conjunction with specific
file extensions (e.g. ‘mdp’, ‘psf’, ‘parm7’...), depending on their uniqueness and level of trust to not report false-positives (i.e. not MD related). In the
expand phase, the content of the identified datasets is fully cataloged, including files that individually could result in false positives (such as e.g. ‘.log’
files). (B) Number of deposited files in generalist data repositories, identified by our Ex2 strategy.
this experience, we provide simple guidelines for data sharing to gradually improve the FAIRness of
MD data.
Results
With the rise of open science, researchers increasingly share their data and deposit them into generalist data repositories, such as Zenodo (https://zenodo.org), Figshare (https://figshare.com), Open
Science Framework (OSF, https://osf.io), and Dryad (https://datadryad.org/). In this first attempt to
find out how many files related to MD are deposited in data repositories, we focused our exploration
on three major data repositories: Figshare (∼3.3 million files, ∼112 TB of data, as of January 2023),
OSF (∼2 million files, as of November 2022) [Figures provided by Figshare and OSF user support
teams.], and Zenodo (∼9.9 million files, ∼1.3 PB of data, as of December 2022; Panero and Benito,
2022).
One immediate strategy to index MD simulation files available in data repositories is to perform a
text-based Google-like search. For that, one queries these repositories with keywords such as ‘molecular dynamics’ or ‘Gromacs’. Unfortunately, we experienced many false positives with this search
strategy. This could be explained by the strong discrepancy we observed in the quantity and quality
of metadata (title, description) accompanying datasets and queried in text-based search. For instance,
a description text could be composed of a couple of words to more than 1200 words. Metadata is
provided by the user depositing the data, with no incentive to issue relevant details to support the
understanding of the simulation. For the three data repositories studied, no human curation other by
that of the providers is performed when submitting data. It is also worth mentioning that title and
description are provided as free-text and do not abide to any controlled vocabulary such as a specific
MD ontology.
Table 1. Statistics of the MD-related datasets and files found in the data repositories Figshare, OSF, and Zenodo.
Data repository
datasets
first dataset
latest dataset
files
total size (GB)
zip files
files within zip
total files
Zenodo
1011
19/11/2014
05/03/2023
20,250
12,851
1780
141,304
161,554
Figshare
913
20/08/2012
03/03/2023
3336
736
590
74,720
78,056
OSF
55
24/05/2017
05/02/2023
6146
495
14
0
6146
Total
1979
–
–
29,732
14,082
2384
216,024
245,756
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
3 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
To circumvent this issue, we developed an original and specific search strategy that we called
Explore and Expand (Ex2) (see Figure 1A and Materials and methods section) and that relies on a
combination of file types and keywords queries. In the Explore phase, we searched for files based
on their file types (for instance: .xtc, .gro, etc) with MD-related keywords (for instance: ‘molecular
dynamics’, ‘Gromacs’, ‘Martini’, etc). Each of these hit files belonged to a dataset, which we further
screened in the Expand phase. There, we indexed all files found in a dataset identified in the previous
Explore phase with, this time, no restriction to the collected file types (see Figure 1A and details on
the data scraping procedure in the Materials and methods section).
Globally, we indexed about 250,000 files and 2000 datasets that represented 14 TB of data deposited between August 2012 and March 2023 (see Table 1). One major difficulty were the numerous
files stored in zipped archives, about seven times more than files steadily available in datasets (see
Table 1). While this choice is very convenient for depositing the files (as one just needs to provide
one big zip file to upload to the data repository server), it hinders the analysis of MD files as data
repositories only provide a limited preview of the content of the zip archives and completely inhibits,
for example, data streaming for remote analysis and visualization. Files within zip files are not indexed
and cannot be searched individually. The use of zip archives also hampers the reusability of MD data,
since a specific file cannot be downloaded individually. One has to download the entire zip archive
(sometimes with a size up to several gigabytes) to extract the one file of interest.
The first dataset we found related to MD data that has been deposited in August 2012 in Figshare
and corresponds to the work of Fuller et al., 2012 (see Table 1) but we may consider the start of
more substantial deposition of the MD data to be 2016 with more than 20,000 files deposited, mainly
in Figshare (see Figure 1B). While the number of files deposited in Zenodo was first relatively limited,
the last few years (2020–2022) saw a steep increase, passing from a few thousands files in 2018 to
almost 50,000 files in 2022 (see Figure 1B). In 2018, the number of MD files deposited in OSF was
similar to those in the two other data repositories, but did not take off as much as the other data
repositories. Zenodo seems to be favored by the MD community since 2019, even though Figshare
in 2022 also saw a sharp increase in deposited MD files. The preference for Zenodo could also be
explained by the fact that it is a publicly funded repository developed under the European OpenAIRE
program and operated by CERN (European Organization For Nuclear Research, 2013). Overall,
the trend showed a rise of deposited data with a steep increase in 2022 (Figure 1B). We believe that
this trend will continue in future years, which will lead to a greater amount of MD data available. It is
thus urgent to deploy a strategy to index this vast amount of data, and to allow the MD community
to easily explore and reuse such gigantic resource. The following describes what is already feasible
in terms of meta analysis, in particular what types of data are deposited in data repositories and the
simulation setup parameters used by MD experts that have deposited their data.
With our Ex2strategy (see Figure 1A), we assigned the deposited files to the MD packages: AMBER
(Salomon‐Ferrer et al., 2013), DESMOND (Bowers et al., 2006), Gromacs (Berendsen et al., 1995;
Abraham et al., 2015), and NAMD/CHARMM (Phillips et al., 2020; Brooks et al., 2009), based on
Figure 2. Categorization of index files based on their file types and assigned MD engine. (A) Distribution of files among MD simulation engines
(B) Expansion of (A) MD Engine category ‘Unknown’ into the 10 most observed file types.
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
4 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
their corresponding file types (see Materials and methods section). In the case of NAMD/CHARMM,
file extensions were mostly identical, which prevented us from distinguishing the respective files from
these two MD programs. With 87,204 files deposited, the Gromacs program was most represented
(see Figure 2A), followed by NAMD/CHARMM, AMBER, and DESMOND. This statistic is limited as it
does not consider more specific databases related to a particular MD program. For example, the DE
Shaw Research website contains a large amount of simulation data related to SARS-CoV-2 that has
been generated using the ANTON supercomputer (https://www.deshawresearch.com/downloads/
download_trajectory_sarscov2.cgi/) or other extensively simulated systems of interest to the community. However, this in itself might also serve as a good example, since few automated search strategies
will be able to find custom stand-alone web servers as valuable repositories. Here, our goal was not
to compare the availability of all data related to each MD program but to give a snapshot of the type
of data available at a given time (i.e. March 2023) in generalist data repositories. Interestingly, many
files (>133,000) were not directly associated to any MD program (see Figure 2A label ‘Unknown’).
We categorized these files based on their extensions (see Figure 2B). While 10% of these files were
without file extension (Figure 2B, column none), we found numerous files corresponding to structure coordinates such as .pdb (∼12,000) and .xyz (∼6800) files. We also got images (.tiff files) and
graphics (.xvg files). Finally, we found many text files such as .txt, .dat, and .out which can potentially
hold details about how simulations were performed. Focusing further on files related to the Gromacs
program, being currently most represented in the studied data repositories, we demonstrated in the
following present possibilities to retrieve numerous information related to deposited MD simulations.
First, we were interested in what file types researchers deposited and thereby find potentially of
great value to share. We therefore quantified the types of files generated by Gromacs (Figure 3A).
The most represented file type is the.xtc file (28,559 files, representing 8.6 TB). This compressed
(binary) file is used to store the trajectory of an MD simulation and is an important source of information to characterize the evolution of the simulated molecular system as a function of time. It is thus
logical to mainly find this type of file shared in data repositories, as it is of great value for reusage and
new analyses. Nevertheless, it is not directly readable but needs to be read by a third-party program,
such as Gromacs itself, a molecular viewer like VMD (Humphrey et al., 1996) or an analysis library
such as MDAnalysis (Gowers et al., 2016; Michaud-Agrawal et al., 2011). In addition, this trajectory
file can only be of use in combination with a matching coordinates file, in order to correctly access the
dynamics information stored in this file. Thus, as it is, this file is not easily mineable to extract useful
information, especially if multiple .xtc and coordinate files are available in one dataset. Interestingly,
we found 1406 .trr files, which contain trajectory but also additional information such as velocities,
energy of the system, etc. While this file is especially useful in terms of reusability, the large size (can go
up to several 100 GB) limits its deposition in most data repositories. For instance, a file cannot usually
exceed 50 GB in Zenodo, 20 GB in Figshare (for free accounts) and 5 GB in OSF. Altogether, Gromacs
trajectory files represented about 30,000 files in the three explored generalist repositories (34% of
Gromacs files). This is a large number in comparison to existing trajectories stored in known databases
dedicated to MD with 1700 MD trajectories available in MoDEL, 1737 trajectories (as of November
2022) available in GPCRmd, 5971 (as of January 2022) trajectories available in MemProtMD and 726
trajectories (as of March 2023) available in the NMRLipids Databank. Although fewer in count, these
numbers correspond to manually or semi-automatically curated trajectories of specific systems, mostly
proteins and lipids. Thus, ∼30,000 MD trajectories available in generalist data repositories may represent a wider spectrum of simulated systems but need to be further analyzed and filtered to separate
usable data from less interesting trajectories such as minimization or equilibration runs.
Given the large volume of data represented by .xtc files (see above), we could only scratch the
surface of the information stored in these trajectory files by analyzing a subset of 779 .xtc files - one
per dataset in which this type of file was found. We were able to get the size of the molecular systems
and the number of frames available in these files (Figure 3B). The system size was up to more than one
million atoms for a simulation of the TonB protein (Virtanen et al., 2020). The cumulative distribution
of the number of frames showed that half of the files contain more than 10,000 frames. This conformational sampling can be very useful for other research fields besides the MD community that study, for
instance, protein flexibility or protein engineering where diverse backbones can be of value. We found
an .xtc file containing more than 5 million frames, where the authors probe the picosecond–nanosecond dynamics of T4 lysozyme and guide the MD simulation with NMR relaxation data (Kümmerer
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
5 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Figure 3. Content analysis of .xtc and .gro files. (A) Number of Gromacs-related files available in searched data repositories. In red, files used for further
analyses. (B) Simple analyze of a subset of .xtc files with the cumulative distribution of the number of frames (in green) and the system size (in orange).
(C) Cumulative distribution of the system sizes extracted from .gro files. (D) Upset plot of systems grouped by molecular composition, inferred from the
analysis of .gro files. For this figure, 3D structures of representative systems were displayed, including soluble proteins such as TonB and T4 Lysozyme,
membrane proteins such as Kir Channels and the Gasdermin prepore, Protein-/RNA and G-quadruplex and other non-protein molecules.
et al., 2021). Extending this analysis to all 28,559 .xtc files detected would be of great interest for a
more holistic view, but this would require an initial step of careful checking and cleaning to be sure
that these files are analyzable. Of note, as .xtc files also contain time stamps, it would be interesting
to study the relationship between the time and the number of frames to get useful information about
the sampling. Nevertheless, this analysis would be possible only for unbiased MD simulations. So, we
would need to decipher if the .xtc file is coming from biased or unbiased simulations, which may not
be trivial.
These results bring a first explanation on why there is not a single special-purpose repository
for MD trajectory files. Databases dedicated to molecular structures such as the Protein Databank
(Berman et al., 2000; Kinjo et al., 2017; Armstrong et al., 2020), or even the recent PDB-dev
(Burley et al., 2017), designed for integrative models, cannot accept such large-size files, even less
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
6 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
if complete trajectories without reducing the number of frames would be uploaded. This would also
require implementing extra steps of data curation and quality control. In addition, the size of the IT
infrastructure and the human skills required for data curation represents a significant cost that could
probably not be supported by a single institution.
Subsequently, our interest shifted towards exploring which systems are being investigated by MD
researchers who deposit their files. We found 9718 .gro files which are text files that contain the
number of particles and the Cartesian coordinates of the system modelled. By parsing the number
of particles and the type of residue, we were able to give an overview of all Gromacs systems deposited (Figure 3C, D). In terms of system size, they ranged from very small - starting with two coarse-
grain (CG) particles of graphite (Piskorz et al., 2019), followed by coordinates of a water molecule
(3 atoms) (Ivanov et al., 2017), CG model of benzene (3 particles) (Dandekar and Mondal, 2020)
and atomistic model of ammonia (4 atoms) (Kelly and Smith, 2020) — to go up to atomistic and
coarse-grain systems composed of more than 3 million particles (Duncan et al., 2020; Schaefer and
Hummer, 2022; Figure 3C). Interestingly, the system sizes in .gro files exceeded those of the analyzed
.xtc files (Figure 3B). Even if we cannot exclude that the limited number of .xtc files analyzed (779 .xtc
files selected from 28,559 .xtc files indexed) could explain this discrepancy, an alternate hypothesis is
that the size of an .xtc file also depends on the number of frames stored. To reduce the size of .xtc files
deposited in data repositories, besides removing some frames, researchers might also remove parts
of the system, such as water molecules. As a consequence for reusability, this solvent removal could
limit the number of suitable datasets available for researchers interested in re-analysing the simulation
with respect to, in this case, water diffusion. While the size of systems extracted from .gro files was
homogeneously spread, we observed a clear bump around system sizes of circa 8500 atoms/particles. This enrichment of data could be explained by the deposition of ∼340 .gro files related to the
simulation of a peptide translocation through a membrane (Figure 3C; Kabelka et al., 2021). Beyond
1 million particles/atoms, the number of systems is, for the moment, very limited.
We then analyzed residues in .gro files and inferred different types of molecular systems (see
Figure 3D). Two of the most represented systems contained lipid molecules. This may be related
to NMRLipids initiative (http://nmrlipids.blogspot.com). For several years, this consortium has been
actively working on lipid modelling with a strong policy of data sharing and has contributed to share
numerous datasets of membrane systems. As illustrated in Figure 3C, a variety of membrane systems,
especially membrane proteins, were deposited. This highlights the vitality of this research field, and
the will of this community to share their data. We also found numerous systems containing solvated
proteins. This type of data, combined with .xtc trajectory files (see above), could be invaluable to
describe protein dynamics and potentially train new artificial intelligence models to go beyond the
current representation of the static protein structure (Lane, 2023). There was also a good proportion of systems containing nucleic acids alone or in interaction with proteins (1237 systems). At this
time, we found only few systems containing carbohydrates that also contained proteins and corresponded to one study to model hyaluronan–CD44 interactions (Vuorio et al., 2017). Maybe a reason
for this limited number is that systems containing sugars are often modelled using AMBER force field
(Salomon‐Ferrer et al., 2013), in combination with GLYCAM (Kirschner et al., 2008). A future study
on the ∼10,200 AMBER files deposited could retrieve more data related to carbohydrate containing
systems. Given the current developments to model glycans (Fadda, 2022), we expect to see more
deposited systems with carbohydrates in the coming years.
Finally, we found 1029 gro files which did not belong to the categories previously described. These
files were mostly related to models of small molecules, or molecules used in organic chemistry (Young
et al., 2020) and material science (Piskorz et al., 2019; Zheng et al., 2022) (see central panel,
Figure 3D). Several datasets contained lists of small molecules used for calculating free energy of
binding (Aldeghi et al., 2016), solubility of molecules (Liu et al., 2016), or osmotic coefficient (Zhu,
2019). Then, we identified models of nanoparticles (Kyrychenko et al., 2012; Pohjolainen et al.,
2016), polymers (Sarkar et al., 2020; Karunasena et al., 2021; Gertsen et al., 2020), and drug molecules like EPI-7170, which binds disordered regions of proteins (Zhu et al., 2022). Finally, an interesting
case from material sciences was the modelling of the PTEG-1 molecule, an addition of polar triethylene glycol (TEG) onto a fulleropyrrolidine molecule (see central panel, Figure 3D). This molecule was
synthesized to improve semiconductors (Jahani et al., 2014). We found several models related to this
peculiar molecule and its derivatives, both atomistic (Qiu et al., 2017; Sami et al., 2022) and coarse
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
7 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
grained (Alessandri et al., 2020). With a good indexing of data and appropriate metadata to identify
modelled molecules, a simple search, which was previously to this study missing, could easily retrieve
different models of the same molecule to compare them or to run multi-scale dynamics simulations.
Beyond .gro files, we would like to analyze the ensemble of the ∼12,000 .pdb extracted in this study
(see Figure 2B) to better characterize the types of molecular structures deposited.
Another important category of deposited files are those containing information about the topology
of the simulated molecules, including file extensions such as .itp and .top. Further, they are often the
results of long parametrization processes (Vanommeslaeghe and MacKerell, 2012; Souza et al.,
2021; Wang et al., 2004) and therefore of significant value for reusability. Based on our analysis,
we indexed almost 20,000 topology files which could spare countless efforts to the MD community
if these files could be easily found, annotated and reused. Interestingly, the number of .itp files was
elevated (13,058 files) with a total size of 2 GB, while there were less .top files (7009 files) with a total
size of 17 GB. Thus, .itp files seemed to contain much less information than the .top files. Among the
remaining file types, .tpr files contain all the information to potentially directly run a simulation. Here,
we found 4987 .tpr files, meaning that it could virtually be possible to rerun almost 5000 simulations
without the burden of setting up the system to simulate. Finally, the 3730 .log files are also a source of
useful information as it is relatively easy to parse this text file to extract details on how MD simulations
were run, such as the version of Gromacs, which command line was used to run the simulation, etc.
Our next step was to gain insight into the parameter settings employed by the MD community,
which may aid us in identifying preferences in MD setups and potential necessity for further education
to avoid suboptimal or outdated configurations. We therefore analyzed 10,055 .mdp files stored in
the different data repositories. These text files contain information regarding the input parameters to
run the simulations such as the integrator, the number of steps, the different algorithms for barostat
and thermostat, etc. (for more details see: https://manual.gromacs.org/documentation/current/user-
guide/mdp-options.html).
We determined the expected simulation time corresponding to the product of two parameters
found in .mdp files: the number of steps and the time step. Here, we acknowledge that one can set up
a very long simulation time and stop the simulation before the end or, on contrary, use a limited time
(especially when calculations are performed on HPC resources with wall-time) and then extend the
simulation for a longer duration. Using only the .mdp file, we cannot know if the simulation reached
its term. To do so, comparison with an .xtc file from the same dataset may help to answer this specific
question. However, in this study, we were interested in MD setup practices, in particular what simulation time researchers would set up their system with - likely in the mindset to reach that ending time.
We restricted this analysis to the 4623 .mdp files that used the md or sd integrator, and that have a
simulation time above 1 ns. We found that the majority of the .mdp files were used for simulations of
50 ns or less (see Figure 4A). Further, 697 .mdp files with simulations times set-up between 50 ns and
1 µs and 585 .mdp files with simulation time above 1 µs were identified. As analyzing .gro files showed
a good proportion of coarse-grained models (Figure 3B, C), we discriminated simulations setups for
these two types of models using the time step as a simple cutoff. We considered that a time step
greater than 10 fs (i.e. dt = 0.01) corresponded to MD setups for coarse grained models (Ingólfsson
et al., 2014). Globally, we found that over all simulations, the setups for atomistic simulations were
largely dominant. However, for simulations with a simulation time above 1 µs specifically, coarse-grain
simulations represented 86% of all.
We then looked into the combinations of thermostat and barostat (see Figure 4B) from 9199 .mdp
files. The main thermostat used is by far the V-rescale (Bussi et al., 2007) often associated with the
Parrinello-Rahman barostat (Parrinello and Rahman, 1981). This thermostat was also used with the
Berendsen barostat (Berendsen et al., 1984). In a few cases, we observed the use of the V-rescale
thermostat with the very recently developed C-rescale barostat (Bernetti and Bussi, 2020). A total of
2021 .mdp files presented neither thermostat nor barostat, which means they would not be used in
production runs. This could correspond to setups used for energy minimization, or to add ions to the
system (with the genion command), or for molecular mechanics with Poisson–Boltzmann and surface
area solvation (MM/PBSA) and molecular mechanics with generalised Born and surface area solvation
(MM/GBSA) calculations (Genheden and Ryde, 2015).
Finally, we analyzed the range of starting temperatures used to perform simulations (see Figure 4C).
We found a clear peak around the temperatures 298 K - 310 K which corresponds to the range
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
8 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Figure 4. Content analysis of .mdp files. (A) Cumulative distribution of .mdp files versus the simulation time for all-atom and coarse-grain simulations.
(B) Sankey graph of the repartition between different values for thermostat and barostat. (C) Temperature distribution, full scale in upper panel and
zoom-in in lower panel.
between ambient room (298 K - 25 °C) and physiological (310 K - 37 °C) temperatures. Nevertheless,
we also observed lower temperatures, which often relate to studies of specific organic systems or
simulations of Lennard-Jones models (Jeon et al., 2016). Interestingly, we noticed the appearance
of several pikes at 400 K, 600 K, and 800 K, which were not present before the end of the year 2022.
These peaks corresponded to the same study related to the stability of hydrated crystals (Dybeck
et al., 2023). Overall, this analysis revealed that a wide range of temperatures have been explored,
starting mostly from 100 K and going up to 800 K.
To encourage further analysis of the collected files, we shared our data collection with the community in Zenodo (see Data availability statement). The data scrapping procedure and data analysis is
available on GitHub with a detailed documentation. To let researchers having a quick glance and
explore this data collection, we created a prototype web application called MDverse data explorer
available at https://mdverse.streamlit.app/ and illustrated in Figure 5A. With this web application, it is
easy to use keywords and filters to access interesting datasets for all MD engines, as well as .gro and
.mdp files. Furthermore, when available, a description of the found data is provided and searchable
for keywords (Figure 5A, on the left sidebar). The sets of data found can then be exported as a tab-
separated values (.tsv) file for further analysis (Figure 5B).
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
9 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Figure 5. Snapshots of the MDverse data explorer, a prototype search engine to explore collected files and datasets. (A) General view of the web
application. (B) Focus on the .mdp and .gro files sets of data exported as.tsv files. The web application also includes links to their original repository.
Towards a better sharing of MD data
With this work, we have shown that it was possible to not only retrieve MD data from the generalist
data repositories Zenodo, Figshare and OSF, but to shed light onto the dark matter of MD data in
terms of learning current scientific practice, extracting valuable topology information, and analysing
how the field is developing. Our objective was not to assess the quality of the data but only to show
what kind of data was available. The Ex2 strategy to find files related to MD simulations relied on
the fact that many MD software output files with specific file extensions. This strategy could not be
applied in research fields where data exhibits non-specific file types. We experienced this limitation
while indexing zip archives related to MD simulations, where we were able to decide if a zip archive
was pertinent for this work only by accessing the list of files contained in the archive. This valuable
feature is provided by data repositories like Zenodo and Figshare, with some caveats, though.
As of March 2023, we managed to index 245,756 files from 1979 datasets, representing altogether
14 TB of data. This is a fraction of all files stored in data repositories. For instance, as of December
2022, Zenodo hosted about 9.9 million files for ∼1.3 PB of data (Panero and Benito, 2022). All these
files are stored on servers available 24/7. This high availability costs human resources, IT infrastructures and energy. Even if MD data represents only 1% of the total volume of data stored in Zenodo,
we believe it is our responsibility, as a community, to develop a better sharing and reuse of MD
simulation files - and it will neither have to be particularly cumbersome nor expensive. To this end,
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
10 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
we are proposing two solutions. First, improve practices for sharing and depositing MD data in data
repositories. Second, improve the FAIRness of already available MD data notably by improving the
quality of the current metadata.
Guidelines for better sharing of MD simulation data
Without a community-approved methodology for depositing MD simulation files in data repositories,
and based on the current experience we described here, we propose a few simple guidelines when
sharing MD data to make them more FAIR (Findable, Accessible, Interoperable and Reusable):
Avoid zip or tar archives whose content cannot be properly indexed by data repositories. As
much as possible, deposit original data files directly.
• Describe the MD dataset with extensive metadata. Provide adequate information along your
dataset, such as:
The scope of the study, e.g. investigate conformation dynamics, benchmark force field,...
The method on a basic (e.g. quantum mechanics, all-atom, coarse-grain) or advanced (accelerated, metadynamics, well-tempered) level.
The MD software: name, version (tag) and whether modifications have been made.
The simulation settings (for each of the steps, including minimization, equilibration and
production): temperature(s), thermostat, barostat, time step, total runtime (simulation
length), force field, additional force field parameters.
The composition of the system, with the precise names of the molecules and their numbers,
if possible also PDB, UniProt or Ensemble identifiers and whether the default structure has
been modified.
Give information about any post-processing of the uploaded files (e.g. truncation or stripping of the trajectory), including before and after values of what has been modified e.g.
number of frames or number of atoms of uploaded files.
Highlight especially valuable data, e.g. excessively QM-based parameterized molecules, and
their parameter files.
Store this metadata in the description of the dataset. An adaptation of the Minimum Information
About a Simulation Experiment (MIASE) guidelines Waltemath et al., 2011 in the context of MD
simulations would be useful to define required metadata.
•
Link the MD dataset to other associated resources, such as:
The research article (if any) for which these data have been produced. Datasets are usually
mentioned in the research articles, but rarely the other way around, since the deposition has
to be done prior to publication. However, it is eminently possible to submit a revised version,
and providing a link to the related research paper in updated metadata of the MD dataset
will ease the reference to the original publication upon data reuse.
The code used to analyze the data, ideally deposited in the repository to guarantee availability, or in a GitHub or GitLab repository.
Any other datasets that belong to the same study.
• Provide sufficient files to reproduce simulations and use a clear naming convention to make
explicit links between related files. For instance, for the Gromacs MD engine, t rajectory.xtc files
could share the same names as s tructure.gro files (e.g. p
roteinA.gro and p
roteinA.xtc).
• Revisit your data deposition after paper acceptance and update information if necessary.
Zenodo and Figshare provide a DOI for every new version of a dataset as well as a ‘master’ DOI
that always refers to the latest version available.
•
These guidelines are complementary to the reliability and reproducibility checklist for molecular
dynamics simulations (Commun Biol, 2023). Eventually, they could be implemented in machine
actionable Data Management Plan (maDMP) (Miksa et al., 2019). So far, MD metadata is formalized
as free text. We advocate for the creation of a standardized and controlled vocabulary to describe artifacts and properties of MD simulations. Normalized metadata will, in turn, enable scientific knowledge
graphs (Auer, 2018; Färber and Lamprecht, 2021) that could link MD data, research articles and MD
software in a rich network of research outputs.
Converging on a set of metadata and format requires a large consensus of different stakeholders,
from users, to MD program developers, and journal editors. It would be especially useful to organize
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
11 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
specific workshops with representatives of all these communities to collectively tackle this specific
issue.
Improving metadata of current MD data
While indexing about 2000 MD datasets, we found that title and description accompanying these
datasets were very heterogeneous in terms of quality and quantity and were difficult for machines to
process automatically. It was sometimes impossible to find even basic information such as the identity
of the molecular system simulated, the temperature or the length of the simulation. Without appropriate metadata, sharing data is pointless, and its reuse is doomed to fail (Musen, 2022). It is thus
important to close the gap between the availability of MD data and its discoverability and description
through appropriate metadata. We could gradually improve the metadata by following two strategies.
First, since MD engines produce normalized and well-documented files, we could extract parameters
of the simulation by parsing specific files. We already explored this path with Gromacs, by extracting
the molecular size and composition from .gro files and the simulation time (with some limitations),
thermostat and barostat from .mdp files. We could go even further, by extracting for instance Gromacs
version from .log file (if provided) or by identifying the simulated system from its atomic topology
stored in.gro files. This strategy can in principle be applied to files produced by other MD engines.
A second approach that we are currently exploring uses data mining and named entity recognition
(NER) methods (Perera et al., 2020) to automatically identify the molecular system, the temperature,
and the simulation length from existing textual metadata (dataset title and description), providing
they are of sufficient length. Finally, the possibilities afforded by large language models supplemented
by domain-specific tools (Bran et al., 2023) might help interpret the heterogenous metadata that is
often associated with the simulations.
Future works
In the future, it is desirable to go further in terms of analysis and integrate other data repositories,
such as Dryad and Dataverse instances (for example Recherche Data Gouv in France). The collaborative platform for source code GitHub could also be of interest. Albeit dedicated to source code and
not designed to host large-size binary files, GitHub handles small to medium-size text files like tabular
.csv and .tsv data files and has been extensively used to record cases of the Ebola epidemic in 2014
(Perkel, 2016) and the Covid-19 pandemic (Johns Hopkins University, 2020). Thus, GitHub could
probably host small text-based MD simulation files. For Gromacs, we already found 70,000 parameter
.mdp files and 55,000 structure .gro files. Scripts found along these files could also provide valuable
insights to understand how a given MD analysis was performed. Finally, GitHub repositories might
also be an entry point to find other datasets by linking to simulation data, such as institutional repositories (see for instance Pesce and Lindorff-Larsen, 2023). However, one potential point of concern
is that repositories like GitHub or GitLab do not make any promises about long-term availability of
repositories, in particular ones not under active development. Archiving of these repositories could
be achieved in Zenodo (for data-centric repositories) or Software Heritage (Di Cosmo and Zacchiroli,
2017; for source-code-centric repositories).
An obvious next step is the enrichment of metadata with the hope to render open MD data more
findable, accessible and ultimately reusable. Possible strategies have already been detailed previously in this paper. We could also go further by connecting MD data in the research ecosystem. For
this, two apparent resources need to be linked to MD datasets: their associated research papers to
mine more information and to establish a connection with the scientific context, and their simulated
biomolecular systems, which ultimately could cross-reference MD datasets to reference databases
such as UniProt Consortium, 2022, the PDB (Berman et al., 2000) or Lipid Maps (Sud et al., 2007).
For already deposited datasets, the enrichment of metadata can only be achieved via systematic
computational approaches, while for future depositions, a clear and uniformly used ontology and
dedicated metadata reference file (as it is used by the PLUMED-NEST: Bonomi et al., 2019) would
facilitate this task.
Eventually, front-
end solutions such as the MDverse data explorer tool can evolve to being
more user-friendly by interfacing the structures and dynamics with interactive 3D molecular viewers
(Tiemann et al., 2017; Kampfrath et al., 2022; Martinez and Baaden, 2021).
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
12 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
Conclusion
In this work, we showed that sharing data generated from MD simulations is now a common practice.
From Zenodo, Figshare and OSF alone, we indexed about 250,000 files from 2000 datasets, and we
showed that this trend is increasing. This data brings incentive and opportunities at different levels.
First, for researchers who cannot access high-performance computing (HPC) facilities, or do not want
to rerun a costly simulation to save time and energy, simulations of many systems are already available. These simulations could be useful to reanalyze existing trajectories, to extend simulations with
already equilibrated systems or to compare simulations of a dedicated molecular system modelled
with different settings. Second, building annotated and highly curated datasets for artificial intelligence will be invaluable to develop dynamic generative deep-learning models. Then, improving
metadata along available data will foster their reuse and will mechanically increase the reproducibility
of MD simulations. At last, we see here the occasion to push for good practices in the setup and
production of MD simulations.
Materials and methods
Initial data collection
We searched for MD-
related files in the data repositories Zenodo, Figshare and Open Science
Framework (OSF). Queries were designed with a combination of file types and optionally keywords,
depending on how a given file type was solely associated to MD simulations. We therefore built a list
of manually curated and cross-checked file types and keywords (https://github.com/MDverse/mdws/
blob/main/params/query.yml; Poulain et al., 2023). All queries were automated by Python scripts that
utilized Application Programming Interfaces (APIs) provided by data repositories. Since APIs offered
by data repositories were different, all implementations were performed in dedicated Python (van
Rossum, 1995) (version 3.9.16) scripts with the NumPy (Oliphant, 2007) (version 1.24.2), Pandas
(McKinney, 2010) (version 1.5.3) and Requests (version 2.28.2) libraries.
We made the assumption that files deposited by researchers in data repositories were coherent
and all related to a same research project. Therefore, when an MD-related file was found in a dataset,
all files belonging to this dataset were indexed, regardless of whether their file types were actually
identified as MD simulation files. This is the core of the Explore and Expand strategy (Ex2) we applied
in this work and illustrated in Figure 1. By default, the last version of the datasets was collected.
When a zip file was found in a dataset, its content was extracted from a preview provided by
Zenodo and Figshare. This preview was not provided through APIs, but as HTML code, which we
parsed using the Beautiful Soup library (version 4.11.2). Note that the zip file preview for Zenodo was
limited to the first 1000 files. To avoid false-positive files collected from zip archives, a final cleaning
step was performed to remove all datasets that did not share at least one file type with the file type
list mentioned above. In the case of OSF, there was no preview for zip files, so their content has not
been retrieved.
Gromacs files
After the initial data collection, Gromacs .mdp and .gro files were downloaded with the Pooch library
(version 1.6.0). When a .mdp or .gro file was found to be in a zip archive, the latter was downloaded
and the targeted .mdp or .gro file was selectively extracted from the archive. The same procedure was
applied for a subset of .xtc files that consisted of about one .xtc file per Gromacs datasets.
Once downloaded, .mdp files were parsed to extract the following parameters: integrator, time
step, number of steps, temperature, thermostat, and barostat. Values for thermostat and barostat
were normalized according to values provided by the Gromacs documentation. For the simulation
time analysis, we selected .mdp files with the md or sd integrator and with simulation time above
1 ns to exclude most minimization and equilibrating simulations. For the thermostat and barostat
analysis, only files with non-missing values and with values listed in the Gromacs documentation were
considered.
The .gro files were parsed with the MDAnalysis library (Michaud-Agrawal et al., 2011) to extract
the number of particles of the system. Values found in the residue name column were also extracted
and compared to a list of residues we manually associated to the following categories: protein,
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
13 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
lipid, nucleic acid, glucid and water or ions (https://github.com/MDverse/mdws/blob/main/params/
residue_names.yml; Poulain et al., 2023).
The .xtc files were analyzed using the gmxcheck command (https://manual.gromacs.org/current/
onlinehelp/gmx-check.html) to extract the number of particles and the number of frames.
MDverse data explorer web app
The MDverse data explorer web application was built in Python with the Streamlit library. Data was
downloaded from Zenodo (see the Data availability statement).
System visualization and molecular graphics
Molecular graphics were performed with VMD (Humphrey et al., 1996) and Chimera (Pettersen
et al., 2004). For all visualizations, .gro files containing molecular structure were used. In the case of
the two structures in Figure 3B, .xtc files were manually assigned to their corresponding .gro (for the
TonB protein) or .tpr (for the T4 Lysozyme) files based on their names in their datasets.
Origin of the structures displayed in this work:
TonB
Dataset URL: https://zenodo.org/record/3756664
Publication (DOI): https://doi.org/10.1039/D0CP03473H
T4 Lyzozyme
Dataset URL: https://zenodo.org/record/3989044
Publication (DOI): https://doi.org/10.1021/acs.jctc.0c01338
Benzene
Dataset URL: https://figshare.com/articles/dataset/Capturing_Protein_Ligand_Recognition_
Pathways_in_Coarse-Grained_Simulation/12517490/1
Publication (DOI): https://doi.org/10.1021/acs.jpclett.0c01683
Ammonia
Dataset URL: https://figshare.com/articles/dataset/Alchemical_Hydration_Free-Energy_Calculations_Using_Molecular_Dynamics_with_Explicit_Polarization_and_Induced_Polarity_Decoupling_An_On_the_Fly_Polarization_Approach/11702442
Publication (DOI): https://doi.org/10.1021/acs.jctc.9b01139
Peptide with membrane
Dataset URL: https://zenodo.org/record/4371296
Publication (DOI): https://doi.org/10.1021/acs.jcim.0c01312
Kir channels
Dataset URL: https://zenodo.org/record/3634884
Publication (DOI): https://doi.org/10.1073/pnas.1918387117
Gasdermin
Dataset URL: https://zenodo.org/record/6797842
Publication (DOI): https://doi.org/10.7554/eLife.81432
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
14 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
Protein-RNA
Dataset URL: https://zenodo.org/record/1308045
Publication (DOI): https://doi.org/10.1371/journal.pcbi.1006642
G-quadruplex
Dataset URL: https://zenodo.org/record/5594466
Publication (DOI): https://doi.org/10.1021/jacs.1c11248
Ptb
Dataset URL: https://osf.io/4aghb/
Publication (DOI): https://doi.org/10.1073/pnas.2116543119
EPI-7170
Dataset URL: https://zenodo.org/record/7120845
Publication (DOI): https://doi.org/10.1038/s41467-022-34077-z
Gold nanoparticle
Dataset URL: https://acs.figshare.com/articles/dataset/Fluorescence_Probing_of_Thiol_Functionalized_Gold_Nanoparticles_Is_Alkylthiol_Coating_of_a_Nanoparticle_as_Hydrophobic_as_
Expected_/2481241
Publication (DOI): https://doi.org/10.1021/jp3060813
Gd(DOTA)
Dataset URL: https://acs.figshare.com/articles/dataset/Modeling_Gd_sup_3_sup_Complexes_
for_Molecular_Dynamics_Simulations_Toward_a_Rational_Optimization_of_MRI_Contrast_
Agents/20334621
Publication (DOI): https://doi.org/10.1021/acs.inorgchem.2c01597
Metalo cage
Dataset URL: https://acs.figshare.com/articles/dataset/Rationalizing_the_Activity_of_an_Artificial_Diels-Alderase_Establishing_Efficient_and_Accurate_Protocols_for_Calculating_Supramolecular_Catalysis/11569452
Publication (DOI): https://doi.org/10.1021/jacs.9b10302
AL1
Dataset
URL:
https://acs.figshare.com/articles/dataset/Nucleation_Mechanisms_of_Self-
Assembled_Physisorbed_Monolayers_on_Graphite/8846045
Publication (DOI): https://doi.org/10.1021/acs.jpcc.9b01234
PTEG-1 (all-atom)
Dataset URL: https://figshare.com/articles/dataset/PTEG-1_PP_and_N-DMBI_atomistic_force_
fields/5458144
Publication (DOI): https://doi.org/10.1039/C7TA06609K
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
15 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
PTEG-1 (coarse-grain)
Dataset URL: https://figshare.com/articles/dataset/Neat_and_P3HT-Based_Blend_Morphologies_for_PCBM_and_PTEG-1/12338633
Publication (DOI): https://doi.org/10.1002/adfm.202004799
Theophylline
Dataset
URL:
https://figshare.com/articles/dataset/A_Comparison_of_Methods_for_
Computing_Relative_Anhydrous_Hydrate_Stability_with_Molecular_Simulation/21644393
Publication (DOI): https://doi.org/10.1021/acs.cgd.2c00832
Acknowledgements
We thank Lauri Mesilaakso, Bryan White, Jorge Hernansanz Biel, Zihwei Li and Kirill Baranov for their
participation in the Copenhagen BioHackathon 2020, whose results showcased the need for a more
advanced search strategy. We acknowledge Massimiliano Bonomi, Giovanni Bussi, Patrick Fuchs and
Elise Lehoux for helpful discussions and suggestions. We also thank the Zenodo, Figshare and OSF
support teams for providing figures on the content of their respective data repository and for their
help in using APIs. This work was supported by Institut français du Danemark (Blåtand program, 2021),
the Data Intelligence Institute of Paris (diiP, IdEx Université Paris Cité, ANR-18-IDEX-0001, 2023).
JKST and KL-L acknowledge funding by the Novo Nordisk Foundation [NNF18OC0033950 to KL-L],
and workshops funded by The BioExcel Center-of-Excellence (grant agreements 823830, 101093290).
Additional information
Competing interests
Lucie Delemotte: Reviewing editor, eLife. The other authors declare that no competing interests exist.
Funding
Funder
Grant reference number
Author
Institut francais du
Danemark
Blatand program 2021
Matthieu Chavent
Pierre Poulain
Data Intelligence Institute
of Paris
IdEx Université Paris Cité
ANR-18-IDEX-0001 2023
Mohamed Oussaren
Pierre Poulain
Novo Nordisk Foundation NNF18OC0033950
Johanna KS Tiemann
Kresten Lindorff-Larsen
BioExcel Center-ofExcellence
823830
Erik Lindahl
BioExcel Center-ofExcellence
101093290
Erik Lindahl
The funders had no role in study design, data collection and interpretation, or the
decision to submit the work for publication.
Author contributions
Johanna KS Tiemann, Conceptualization, Data curation, Software, Formal analysis, Supervision,
Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and
editing; Magdalena Szczuka, Lisa Bouarroudj, Mohamed Oussaren, Data curation, Software, Formal
analysis, Visualization; Steven Garcia, Data curation, Software, Formal analysis; Rebecca J Howard,
Lucie Delemotte, Erik Lindahl, Marc Baaden, Kresten Lindorff-Larsen, Conceptualization, Writing –
original draft, Writing – review and editing; Matthieu Chavent, Pierre Poulain, Conceptualization, Data
curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
16 of 22
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Research article
Author ORCIDs
Rebecca J Howard https://orcid.org/0000-0003-2049-3378
Erik Lindahl https://orcid.org/0000-0002-2734-2794
Marc Baaden https://orcid.org/0000-0001-6472-0486
Kresten Lindorff-Larsen https://orcid.org/0000-0002-4750-6039
Matthieu Chavent https://orcid.org/0000-0003-4524-4773
Pierre Poulain https://orcid.org/0000-0003-4177-3619
Peer review material
Reviewer #1 (Public Review): https://doi.org/10.7554/eLife.90061.3.sa1
Reviewer #2 (Public Review): https://doi.org/10.7554/eLife.90061.3.sa2
Reviewer #3 (Public Review): https://doi.org/10.7554/eLife.90061.3.sa3
Author response https://doi.org/10.7554/eLife.90061.3.sa4
Additional files
Supplementary files
• MDAR checklist
Data availability
Data files produced from the data collection and processing are shared in Parquet format in Zenodo.
They are freely available under the Creative Commons Attribution 4.0 International license (CC-BY).
Python scripts to search and index MD files, and to download and parse .mdp a
nd.gro files are open-
source (under the AGPL-3.0 license), freely available on GitHub and archived in Software Heritage
(Poulain et al., 2023). A detailed documentation is provided along the scripts to easily reproduce the
data collection and processing. Jupyter notebooks used to analyze results and create the figures of
this paper are open-source (under the BSD 3-Clause license), freely available on GitHub and archived
in Software Heritage (Poulain, 2023). The code of the MDverse data explorer web application is
open-source (under the BSD 3-Clause license), freely available on GitHub and archived in Software
Heritage (Poulain and Oussaren, 2023).
The following dataset was generated:
Author(s)
Year
Dataset title
Dataset URL
Database and Identifier
Johanna KST,
Mathieu C, Pierre P
2023
MDverse datasets
https://zenodo.org/
records/7856806
Zenodo, 10.5281/
zenodo.7856806
References
Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E. 2015. GROMACS: High performance
molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25.
DOI: https://doi.org/10.1016/j.softx.2015.06.001
Abraham M, Apostolov R, Barnoud J, Bauer P, Blau C, Bonvin AMJJ, Chavent M, Chodera J, Čondić-Jurkić K,
Delemotte L, Grubmüller H, Howard RJ, Jordan EJ, Lindahl E, Ollila OHS, Selent J, Smith DGA, Stansfeld PJ,
Tiemann JKS, Trellet M, et al. 2019. Sharing data from molecular simulations. Journal of Chemical Information
and Modeling 59:4093–4099. DOI: https://doi.org/10.1021/acs.jcim.9b00665, PMID: 31525920
Abriata LA, Lepore R, Dal Peraro M. 2020. About the need to make computational models of biological
macromolecules available and discoverable. Bioinformatics 36:2952–2954. DOI: https://doi.org/10.1093/
bioinformatics/btaa086, PMID: 32053168
Aldeghi M, Heifetz A, Bodkin MJ, Knapp S, Biggin PC. 2016. Accurate calculation of the absolute free energy of
binding for drug molecules. Chemical Science 7:207–218. DOI: https://doi.org/10.1039/c5sc02678d, PMID:
26798447
Alessandri R, Sami S, Barnoud J, de Vries AH, Marrink SJ, Havenith RWA. 2020. Resolving donor–acceptor
interfaces and charge carrier energy levels of organic semiconductors with polar side chains. Advanced
Functional Materials 30:2004799. DOI: https://doi.org/10.1002/adfm.202004799
Alessandri R, Grünewald F, Marrink SJ. 2021. The martini model in materials science. Advanced Materials
33:e2008635. DOI: https://doi.org/10.1002/adma.202008635, PMID: 33956373
Amaro RE, Mulholland AJ. 2020. A community letter regarding sharing biomolecular simulation data for
COVID-19. Journal of Chemical Information and Modeling 60:2653–2656. DOI: https://doi.org/10.1021/acs.
jcim.0c00319, PMID: 32255648
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
17 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Antila HS, M Ferreira T, Ollila OHS, Miettinen MS. 2021. Using open data to rapidly benchmark biomolecular
simulations: Phospholipid conformational dynamics. Journal of Chemical Information and Modeling 61:938–
949. DOI: https://doi.org/10.1021/acs.jcim.0c01299, PMID: 33496579
Armstrong DR, Berrisford JM, Conroy MJ, Gutmanas A, Anyango S, Choudhary P, Clark AR, Dana JM,
Deshpande M, Dunlop R, Gane P, Gáborová R, Gupta D, Haslam P, Koča J, Mak L, Mir S, Mukhopadhyay A,
Nadzirin N, Nair S, et al. 2020. PDBe: improved findability of macromolecular structure data in the PDB.
Nucleic Acids Research 48:D335–D343. DOI: https://doi.org/10.1093/nar/gkz990, PMID: 31691821
Auer S. 2018. Towards an open research knowledge graph. Version 1. Zenodo. https://doi.org/10.5281/zenodo.
1157185
Berendsen HJC, Postma JPM, van Gunsteren WF, DiNola A, Haak JR. 1984. Molecular dynamics with coupling to
an external bath. The Journal of Chemical Physics 81:3684–3690. DOI: https://doi.org/10.1063/1.448118
Berendsen HJC, van der Spoel D, van Drunen R. 1995. GROMACS: A message-passing parallel molecular
dynamics implementation. Computer Physics Communications 91:43–56. DOI: https://doi.org/10.1016/0010-
4655(95)00042-E
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The protein
data bank. Nucleic Acids Research 28:235–242. DOI: https://doi.org/10.1093/nar/28.1.235, PMID: 10592235
Berman H, Henrick K, Nakamura H. 2003. Announcing the worldwide protein data bank. Nature Structural
Biology 10:980. DOI: https://doi.org/10.1038/nsb1203-980, PMID: 14634627
Bernetti M, Bussi G. 2020. Pressure control using stochastic cell rescaling. The Journal of Chemical Physics
153:114107. DOI: https://doi.org/10.1063/5.0020514, PMID: 32962386
Bonomi M, Bussi G, Camilloni C, Tribello GA, Banáš P, Barducci A, Bernetti M. 2019. Promoting transparency and
reproducibility in enhanced molecular simulations. Nature Methods 16:670–673. DOI: https://doi.org/10.1038/
s41592-019-0506-8
Bottaro S, Lindorff-Larsen K. 2018. Biophysical experiments and biomolecular simulations: A perfect match?
Science 361:355–360. DOI: https://doi.org/10.1126/science.aat4010, PMID: 30049874
Bowers KJ, Chow DE, Xu H, Dror RO, Eastwood MP, Gregersen BA, Klepeis JL, Kolossvary I, Moraes MA,
Sacerdoti FD, Salmon JK, Shan Y, Shaw DE. 2006. Scalable Algorithms for Molecular Dynamics Simulations on
Commodity Clusters. SC 2006 Proceedings Supercomputing. . DOI: https://doi.org/10.1109/SC.2006.54
Bran AM, Cox S, White AD, Schwaller P. 2023 ChemCrow: augmenting large-language models with chemistry
tools. arXiv. https://arxiv.org/abs/2304.05376
Brooks BR, Brooks CL, Mackerell AD, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S,
Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, et al. 2009.
CHARMM: the biomolecular simulation program. Journal of Computational Chemistry 30:1545–1614. DOI:
https://doi.org/10.1002/jcc.21287, PMID: 19444816
Burley SK, Kurisu G, Markley JL, Nakamura H, Velankar S, Berman HM, Sali A, Schwede T, Trewhella J. 2017.
PDB-Dev: A prototype system for depositing integrative/hybrid structural models. Structure 25:1317–1318.
DOI: https://doi.org/10.1016/j.str.2017.08.001, PMID: 28877501
Bussi G, Donadio D, Parrinello M. 2007. Canonical sampling through velocity rescaling. The Journal of Chemical
Physics 126:014101. DOI: https://doi.org/10.1063/1.2408420, PMID: 17212484
Commun Biol. 2023. Reliability and reproducibility checklist for molecular dynamics simulations. Communications
Biology 6:268. DOI: https://doi.org/10.1038/s42003-023-04653-0, PMID: 36918708
Dandekar BR, Mondal J. 2020. Capturing protein-ligand recognition pathways in coarse-grained simulation. The
Journal of Physical Chemistry Letters 11:5302–5311. DOI: https://doi.org/10.1021/acs.jpclett.0c01683, PMID:
32520567
Di Cosmo R, Zacchiroli S. 2017. Software Heritage: Why and How to Preserve Software Source Code. iPRES
2017 - 14th International Conference on Digital Preservation. 1–10.
Domański J, Stansfeld PJ, Sansom MSP, Beckstein O. 2010. Lipidbook: a public repository for force-field
parameters used in membrane simulations. The Journal of Membrane Biology 236:255–258. DOI: https://doi.
org/10.1007/s00232-010-9296-8, PMID: 20700585
Duncan AL, Corey RA, Sansom MSP. 2020. Defining how multiple lipid species interact with inward rectifier
potassium (Kir2) channels. PNAS 117:7803–7813. DOI: https://doi.org/10.1073/pnas.1918387117, PMID:
32213593
Dybeck EC, Thiel A, Schnieders MJ, Pickard FC, Wood GPF, Krzyzaniak JF, Hancock BC. 2023. A comparison of
methods for computing relative anhydrous–hydrate stability with molecular simulation. Crystal Growth &
Design 23:142–167. DOI: https://doi.org/10.1021/acs.cgd.2c00832
Elofsson A, Hess B, Lindahl E, Onufriev A, van der Spoel D, Wallqvist A. 2019. Ten simple rules on how to create
open access and reproducible molecular simulations of biological systems. PLOS Computational Biology
15:e1006649. DOI: https://doi.org/10.1371/journal.pcbi.1006649, PMID: 30653494
European Organization For Nuclear Research. 2013. Zenodo. OpenAIRE. https://catalogue.openaire.eu/
service/openaire.zenodo/overview
Fadda E. 2022. Molecular simulations of complex carbohydrates and glycoconjugates. Current Opinion in
Chemical Biology 69:102175. DOI: https://doi.org/10.1016/j.cbpa.2022.102175, PMID: 35728307
Fan FJ, Shi Y. 2022. Effects of data quality and quantity on deep learning for protein-ligand binding affinity
prediction. Bioorganic & Medicinal Chemistry 72:117003. DOI: https://doi.org/10.1016/j.bmc.2022.117003
Färber M, Lamprecht D. 2021. The data set knowledge graph: Creating a linked open data source for data sets.
Quantitative Science Studies 2:1324–1355. DOI: https://doi.org/10.1162/qss_a_00161
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
18 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Fawzi NL, Parekh SH, Mittal J. 2021. Biophysical studies of phase separation integrating experimental and
computational methods. Current Opinion in Structural Biology 70:78–86. DOI: https://doi.org/10.1016/j.sbi.
2021.04.004, PMID: 34144468
Fuller JC, Jackson RM, Edwards TA, Wilson AJ, Shirts MR. 2012. Modeling of arylamide helix mimetics in the P53
peptide binding site of hDM2 suggests parallel and anti-parallel conformations are both stable. PLOS ONE
7:e43253. DOI: https://doi.org/10.1371/journal.pone.0043253, PMID: 22916232
Genheden S, Ryde U. 2015. The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert
Opinion on Drug Discovery 10:449–461. DOI: https://doi.org/10.1517/17460441.2015.1032936, PMID:
25835573
Gertsen AS, Sørensen MK, Andreasen JW. 2020. Nanostructure of organic semiconductor thin films: Molecular
dynamics modeling with solvent evaporation. Physical Review Materials 4:075405. DOI: https://doi.org/10.
1103/PhysRevMaterials.4.075405
Gowers R, Linke M, Barnoud J, Reddy T, Melo M, Seyler S, Domański J, Dotson D, Buchoux S, Kenney I,
Beckstein O. 2016. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations.
Python in Science Conference. . DOI: https://doi.org/10.25080/Majora-629e541a-00e
Gupta C, Sarkar D, Tieleman DP, Singharoy A. 2022. The ugly, bad, and good stories of large-scale biomolecular
simulations. Current Opinion in Structural Biology 73:102338. DOI: https://doi.org/10.1016/j.sbi.2022.102338,
PMID: 35245737
Hénin J, Lelièvre T, Shirts MR, Valsson O, Delemotte L. 2022. Enhanced sampling methods for molecular
dynamics simulations [Article v1.0]. Living Journal of Computational Molecular Science 4:1583. DOI: https://
doi.org/10.33011/livecoms.4.1.1583
Hoch JC, Baskaran K, Burr H, Chin J, Eghbalnia HR, Fujiwara T, Gryk MR, Iwata T, Kojima C, Kurisu G, Maziuk D,
Miyanoiri Y, Wedell JR, Wilburn C, Yao H, Yokochi M. 2023. Biological magnetic resonance data bank. Nucleic
Acids Research 51:D368–D376. DOI: https://doi.org/10.1093/nar/gkac1050, PMID: 36478084
Hollingsworth SA, Dror RO. 2018. Molecular dynamics simulation for all. Neuron 99:1129–1143. DOI: https://
doi.org/10.1016/j.neuron.2018.08.011, PMID: 30236283
Hospital A, Battistini F, Soliva R, Gelpí JL, Orozco M. 2020. Surviving the deluge of biosimulation data. WIREs
Computational Molecular Science 10:e1449. DOI: https://doi.org/10.1002/wcms.1449
Humphrey W, Dalke A, Schulten K. 1996. VMD: visual molecular dynamics. Journal of Molecular Graphics
14:33–38, . DOI: https://doi.org/10.1016/0263-7855(96)00018-5, PMID: 8744570
Ingólfsson HI, Lopez CA, Uusitalo JJ, de Jong DH, Gopal SM, Periole X, Marrink SJ. 2014. The power of coarse
graining in biomolecular simulations. Wiley Interdisciplinary Reviews. Computational Molecular Science
4:225–248. DOI: https://doi.org/10.1002/wcms.1169, PMID: 25309628
Ivanov P, Mu J, Leay L, Chang SY, Sharrad CA, Masters AJ, Schroeder SLM. 2017. Organic and Third Phase in
HNO3/TBP/n-Dodecane System: No Reverse Micelles. Solvent Extraction and Ion Exchange 35:251–265. DOI:
https://doi.org/10.1080/07366299.2017.1336048
Jahani F, Torabi S, Chiechi RC, Koster LJA, Hummelen JC. 2014. Fullerene derivatives with increased dielectric
constants. Chemical Communications 50:10645–10647. DOI: https://doi.org/10.1039/c4cc04366a, PMID:
25075465
Jeon JH, Javanainen M, Martinez-Seara H, Metzler R, Vattulainen I. 2016. Protein crowding in lipid bilayers gives
rise to non-gaussian anomalous lateral diffusion of phospholipids and proteins. Physical Review X 6:021006.
DOI: https://doi.org/10.1103/PhysRevX.6.021006
Johns Hopkins University. 2020. COVID-19 data repository by the center for systems science and engineering
(CSSE) at johns hopkins university [GitHub]. . https://github.com/CSSEGISandData/COVID-19
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A,
Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R,
Adler J, Back T, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589.
DOI: https://doi.org/10.1038/s41586-021-03819-2, PMID: 34265844
Kabelka I, Brožek R, Vácha R. 2021. Selecting collective variables and free-energy methods for peptide
translocation across membranes. Journal of Chemical Information and Modeling 61:819–830. DOI: https://doi.
org/10.1021/acs.jcim.0c01312, PMID: 33566605
Kampfrath M, Staritzbichler R, Hernández GP, Rose AS, Tiemann JKS, Scheuermann G, Wiegreffe D,
Hildebrand PW. 2022. MDsrv: visual sharing and analysis of molecular dynamics simulations. Nucleic Acids
Research 50:W483–W489. DOI: https://doi.org/10.1093/nar/gkac398, PMID: 35639717
Karunasena C, Li S, Heifner MC, Ryno SM, Risko C. 2021. Reconsidering the roles of noncovalent intramolecular
“locks” in π-conjugated molecules. Chemistry of Materials 33:9139–9151. DOI: https://doi.org/10.1021/acs.
chemmater.1c02335
Kelly BD, Smith WR. 2020. Alchemical hydration free-energy calculations using molecular dynamics with explicit
polarization and induced polarity decoupling: An on-the-fly polarization approach. Journal of Chemical Theory
and Computation 16:1146–1161. DOI: https://doi.org/10.1021/acs.jctc.9b01139, PMID: 31930918
Kiirikki A, Antila H, Bort L, Buslaev P, Fernando F, Mendes Ferreira T, Fuchs P, Garcia-Fandino R, Gushchin I,
Kav B, Kučerka N, Kula P, Kurki M, Kuzmin A, Madsen J, Miettinen M, Nencini R, Piggot T, Pineiro A,
Suarez-Leston F, et al. 2023. NMRlipids Databank Makes Data-Driven Analysis of Biomembrane Properties
Accessible for All. ChemRxiv. DOI: https://doi.org/10.26434/chemrxiv-2023-jrpwm-v2
Kinjo AR, Bekker GJ, Suzuki H, Tsuchiya Y, Kawabata T, Ikegawa Y, Nakamura H. 2017. Protein Data Bank Japan
(PDBj): updated user interfaces, resource description framework, analysis tools for large structures. Nucleic
Acids Research 45:D282–D288. DOI: https://doi.org/10.1093/nar/gkw962, PMID: 27789697
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
19 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Kirschner KN, Yongye AB, Tschampel SM, González-Outeiriño J, Daniels CR, Foley BL, Woods RJ. 2008.
GLYCAM06: a generalizable biomolecular force field. Carbohydrates. Journal of Computational Chemistry
29:622–655. DOI: https://doi.org/10.1002/jcc.20820, PMID: 17849372
Krishna S, Sreedhar I, Patel CM. 2021. Molecular dynamics simulation of polyamide-based materials – A review.
Computational Materials Science 200:110853. DOI: https://doi.org/10.1016/j.commatsci.2021.110853
Kümmerer F, Orioli S, Harding-Larsen D, Hoffmann F, Gavrilov Y, Teilum K, Lindorff-Larsen K. 2021. Fitting
side-chain nmr relaxation data using molecular simulations. Journal of Chemical Theory and Computation
17:5262–5275. DOI: https://doi.org/10.1021/acs.jctc.0c01338, PMID: 34291646
Kyrychenko A, Karpushina GV, Svechkarev D, Kolodezny D, Bogatyrenko SI, Kryshtal AP, Doroshenko AO. 2012.
Fluorescence probing of thiol-functionalized gold nanoparticles: Is alkylthiol coating of a nanoparticle as
hydrophobic as expected? The Journal of Physical Chemistry C 116:21059–21068. DOI: https://doi.org/10.
1021/jp3060813
Lane TJ. 2023. Protein structure prediction has reached the single-structure frontier. Nature Methods 20:170–
173. DOI: https://doi.org/10.1038/s41592-022-01760-4, PMID: 36639584
Liu S, Cao S, Hoang K, Young KL, Paluch AS, Mobley DL. 2016. Using MD simulations to calculate how solvents
modulate solubility. Journal of Chemical Theory and Computation 12:1930–1941. DOI: https://doi.org/10.
1021/acs.jctc.5b00934, PMID: 26878198
Mahmud M, Kaiser MS, McGinnity TM, Hussain A. 2021. Deep learning in mining biological data. Cognitive
Computation 13:1–33. DOI: https://doi.org/10.1007/s12559-020-09773-x, PMID: 33425045
Marklund EG, Benesch JL. 2019. Weighing-up protein dynamics: the combination of native mass spectrometry
and molecular dynamics simulations. Current Opinion in Structural Biology 54:50–58. DOI: https://doi.org/10.
1016/j.sbi.2018.12.011
Martinez X, Baaden M. 2021. UnityMol prototype for FAIR sharing of molecular-visualization experiences: from
pictures in the cloud to collaborative virtual reality exploration in immersive 3D environments. Acta
Crystallographica. Section D, Structural Biology 77:746–754. DOI: https://doi.org/10.1107/
S2059798321002941, PMID: 34076589
Marx V. 2013. Biology: The big challenges of big data. Nature 498:255–260. DOI: https://doi.org/10.1038/
498255a, PMID: 23765498
McKinney W. 2010. Data Structures for Statistical Computing in Python. Python in Science Conference. 56–61.
DOI: https://doi.org/10.25080/Majora-92bf1922-00a
Merz KM, Amaro R, Cournia Z, Rarey M, Soares T, Tropsha A, Wahab HA, Wang R. 2020. Editorial: Method and
data sharing and reproducibility of scientific results. Journal of Chemical Information and Modeling 60:5868–
5869. DOI: https://doi.org/10.1021/acs.jcim.0c01389, PMID: 33378854
Meyer T, D’Abramo M, Hospital A, Rueda M, Ferrer-Costa C, Pérez A, Carrillo O, Camps J, Fenollosa C,
Repchevsky D, Gelpí JL, Orozco M. 2010. MoDEL (Molecular Dynamics Extended Library): a database of
atomistic molecular dynamics trajectories. Structure 18:1399–1409. DOI: https://doi.org/10.1016/j.str.2010.07.
013, PMID: 21070939
Michaud-Agrawal N, Denning EJ, Woolf TB, Beckstein O. 2011. MDAnalysis: A toolkit for the analysis of
molecular dynamics simulations. Journal of Computational Chemistry 32:2319–2327. DOI: https://doi.org/10.
1002/jcc.21787, PMID: 21500218
Miksa T, Simms S, Mietchen D, Jones S. 2019. Ten principles for machine-actionable data management plans.
PLOS Computational Biology 15:e1006750. DOI: https://doi.org/10.1371/journal.pcbi.1006750, PMID:
30921316
Mulholland AJ, Amaro RE. 2020. COVID19 - Computational Chemists Meet the Moment. Journal of Chemical
Information and Modeling 60:5724–5726. DOI: https://doi.org/10.1021/acs.jcim.0c01395, PMID: 33378852
Musen MA. 2022. Without appropriate metadata, data-sharing mandates are pointless. Nature 609:222. DOI:
https://doi.org/10.1038/d41586-022-02820-7, PMID: 36064801
Newport TD, Sansom MSP, Stansfeld PJ. 2019. The MemProtMD database: a resource for membrane-embedded
protein structures and their lipid interactions. Nucleic Acids Research 47:D390–D397. DOI: https://doi.org/10.
1093/nar/gky1047
Oliphant TE. 2007. Python for scientific computing. Computing in Science & Engineering 9:10–20. DOI: https://
doi.org/10.1109/MCSE.2007.58
Panero P, Benito J. 2022 OpenAIRE webinar: Zenodo - open digital repository. Version v1. Zenodo. https://doi.
org/10.5281/zenodo.7417839
Parrinello M, Rahman A. 1981. Polymorphic transitions in single crystals: A new molecular dynamics method.
Journal of Applied Physics 52:7182–7190. DOI: https://doi.org/10.1063/1.328693
Perera N, Dehmer M, Emmert-Streib F. 2020. Named entity recognition and relation detection for biomedical
information extraction. Frontiers in Cell and Developmental Biology 8:673. DOI: https://doi.org/10.3389/fcell.
2020.00673, PMID: 32984300
Perilla JR, Goh BC, Cassidy CK, Liu B, Bernardi RC, Rudack T, Yu H, Wu Z, Schulten K. 2015. Molecular dynamics
simulations of large macromolecular complexes. Current Opinion in Structural Biology 31:64–74. DOI: https://
doi.org/10.1016/j.sbi.2015.03.007, PMID: 25845770
Perkel J. 2016. Democratic databases: science on GitHub. Nature 538:127–128. DOI: https://doi.org/10.1038/
538127a, PMID: 27708327
Pesce F, Lindorff-Larsen K. 2023. Combining Experiments and Simulations to Examine the Temperature-
Dependent Behaviour of a Disordered Protein. bioRxiv. DOI: https://doi.org/10.1101/2023.03.04.531094
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
20 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. 2004. UCSF Chimera--A
visualization system for exploratory research and analysis. Journal of Computational Chemistry 25:1605–1612.
DOI: https://doi.org/10.1002/jcc.20084, PMID: 15264254
Phillips JC, Hardy DJ, Maia JDC, Stone JE, Ribeiro JV, Bernardi RC, Buch R, Fiorin G, Hénin J, Jiang W,
McGreevy R, Melo MCR, Radak BK, Skeel RD, Singharoy A, Wang Y, Roux B, Aksimentiev A, Luthey-Schulten Z,
Kalé LV, et al. 2020. Scalable molecular dynamics on CPU and GPU architectures with NAMD. The Journal of
Chemical Physics 153:044130. DOI: https://doi.org/10.1063/5.0014475, PMID: 32752662
Piskorz TK, Gobbo C, Marrink SJ, De Feyter S, de Vries AH, van Esch JH. 2019. Nucleation mechanisms of
self-assembled physisorbed monolayers on graphite. The Journal of Physical Chemistry C 123:17510–17520.
DOI: https://doi.org/10.1021/acs.jpcc.9b01234
Pohjolainen E, Chen X, Malola S, Groenhof G, Häkkinen H. 2016. A Unified AMBER-compatible molecular
mechanics force field for thiolate-protected gold nanoclusters. Journal of Chemical Theory and Computation
12:1342–1350. DOI: https://doi.org/10.1021/acs.jctc.5b01053, PMID: 26845636
Porubsky VL, Goldberg AP, Rampadarath AK, Nickerson DP, Karr JR, Sauro HM. 2020. Best practices for making
reproducible biochemical models. Cell Systems 11:109–120. DOI: https://doi.org/10.1016/j.cels.2020.06.012,
PMID: 32853539
Poulain P. 2023. MDverse data analysis swh:1:rev:4562c50d1b51a51fdf952ae6e9efaa407dd06e20. Software
Heritage. https://archive.softwareheritage.org/swh:1:dir:fc72ac7a9c9f0489a361cb2b7fcf8ba48898e4ee;origin=
https://github.com/MDverse/mdda;visit=swh:1:snp:dbfe8b4401ac98d3728ebb00241429274c619beb;anchor=
swh:1:rev:4562c50d1b51a51fdf952ae6e9efaa407dd06e20
Poulain P, Bouarroudj L, Tiemann JKS, Bussi G. 2023. MDverse web scrapper.
swh:1:rev:0524199041e84be2d69993540ad8e2223d3b4698. Software Heritage. https://archive.
softwareheritage.org/swh:1:dir:ce91602834cf79e634d26aff585a9fea22b0fea3;origin=https://github.com/
MDverse/mdws;visit=swh:1:snp:540580756b211c116bd602423e0262d3055b8251;anchor=swh:1:rev:05241990
41e84be2d69993540ad8e2223d3b4698
Poulain P, Oussaren M. 2023. MDverse data explorer. swh:1:rev:52604906f80f96b27fd61209a78a93cd36be9a45.
Software Heritage. https://archive.softwareheritage.org/swh:1:dir:1fc8b8eaabf4a9087e6d5b0ec5ed9703
1482bcbf;origin=https://github.com/MDverse/mdde;visit=swh:1:snp:5a3326fd135f604290fb799470f52438
4a959b04;anchor=swh:1:rev:52604906f80f96b27fd61209a78a93cd36be9a45
Qiu L, Liu J, Alessandri R, Qiu X, Koopmans M, Havenith RWA, Marrink SJ, Chiechi RC, Anton Koster LJ,
Hummelen JC. 2017. Enhancing doping efficiency by improving host-dopant miscibility for fullerene-based
n-type thermoelectrics. Journal of Materials Chemistry A 5:21234–21241. DOI: https://doi.org/10.1039/
C7TA06609K
Rodríguez-Espigares I, Torrens-Fontanals M, Tiemann JKS, Aranda-García D, Ramírez-Anguita JM,
Stepniewski TM, Worp N, Varela-Rial A, Morales-Pastor A, Medel-Lacruz B, Pándy-Szekeres G, Mayol E,
Giorgino T, Carlsson J, Deupi X, Filipek S, Filizola M, Gómez-Tamayo JC, Gonzalez A, Gutiérrez-de-Terán H,
et al. 2020. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nature Methods 17:777–787. DOI: https://
doi.org/10.1038/s41592-020-0884-y, PMID: 32661425
Salomon‐Ferrer R, Case DA, Walker RC. 2013. An overview of the Amber biomolecular simulation package.
WIREs Computational Molecular Science 3:198–210. DOI: https://doi.org/10.1002/wcms.1121
Sami S, Alessandri R, W. Wijaya JB, Grünewald F, de Vries AH, Marrink SJ, Broer R, Havenith RWA. 2022.
Strategies for enhancing the dielectric constant of organic materials. The Journal of Physical Chemistry C
126:19462–19469. DOI: https://doi.org/10.1021/acs.jpcc.2c05682
Sarkar A, Sasmal R, Empereur-mot C, Bochicchio D, Kompella SVK, Sharma K, Dhiman S, Sundaram B, Agasti SS,
Pavan GM, George SJ. 2020. Self-sorted, random, and block supramolecular copolymers via sequence
controlled, multicomponent self-assembly. Journal of the American Chemical Society 142:7606–7617. DOI:
https://doi.org/10.1021/jacs.0c01822
Schaefer SL, Hummer G. 2022. Sublytic gasdermin-D pores captured in atomistic molecular simulations. eLife
11:e81432. DOI: https://doi.org/10.7554/eLife.81432, PMID: 36374182
Souza PCT, Alessandri R, Barnoud J, Thallmair S, Faustino I, Grünewald F, Patmanidis I, Abdizadeh H,
Bruininks BMH, Wassenaar TA, Kroon PC, Melcr J, Nieto V, Corradi V, Khan HM, Domański J, Javanainen M,
Martinez-Seara H, Reuter N, Best RB, et al. 2021. Martini 3: a general purpose force field for coarse-grained
molecular dynamics. Nature Methods 18:382–388. DOI: https://doi.org/10.1038/s41592-021-01098-3, PMID:
33782607
Stansfeld PJ, Goose JE, Caffrey M, Carpenter EP, Parker JL, Newstead S, Sansom MSP. 2015. MemProtMD:
Automated insertion of membrane protein structures into explicit lipid membranes. Structure 23:1350–1361.
DOI: https://doi.org/10.1016/j.str.2015.05.006, PMID: 26073602
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. 2015.
Big data: Astronomical or genomical? PLOS Biology 13:e1002195. DOI: https://doi.org/10.1371/journal.pbio.
1002195, PMID: 26151137
Sud M, Fahy E, Cotter D, Brown A, Dennis EA, Glass CK, Merrill AH Jr, Murphy RC, Raetz CRH, Russell DW,
Subramaniam S. 2007. LMSD: LIPID MAPS structure database. Nucleic Acids Research 35:D527–D532. DOI:
https://doi.org/10.1093/nar/gkl838, PMID: 17098933
Tai K, Murdock S, Wu B, Ng MH, Johnston S, Fangohr H, Cox SJ, Jeffreys P, Essex JW, P. Sansom MS. 2004.
BioSimGrid: towards a worldwide repository for biomolecular simulations. Organic &Biomolecular Chemistry
2:3219. DOI: https://doi.org/10.1039/b411352g
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
21 of 22
Research article
Computational and Systems Biology | Structural Biology and Molecular Biophysics
Tiemann JKS, Guixà-González R, Hildebrand PW, Rose AS. 2017. MDsrv: viewing and sharing molecular
dynamics simulations on the web. Nature Methods 14:1123–1124. DOI: https://doi.org/10.1038/nmeth.4497,
PMID: 29190271
UniProt Consortium. 2022. UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Research
51:D523–D531. DOI: https://doi.org/10.1093/nar/gkac1052
Vanommeslaeghe K, MacKerell AD. 2012. Automation of the CHARMM General Force Field (CGenFF) I: bond
perception and atom typing. Journal of Chemical Information and Modeling 52:3144–3154. DOI: https://doi.
org/10.1021/ci300363c, PMID: 23146088
van Rossum G. 1995. Python Tutorial. Amsterdam, The Netherlands: Centrum voor Wiskunde en Informatica.
https://ir.cwi.nl/pub/5007
Virtanen SI, Kiirikki AM, Mikula KM, Iwaï H, Ollila OHS. 2020. Heterogeneous dynamics in partially disordered
proteins. Physical Chemistry Chemical Physics 22:21185–21196. DOI: https://doi.org/10.1039/d0cp03473h,
PMID: 32929427
Vuorio J, Vattulainen I, Martinez-Seara H. 2017. Atomistic fingerprint of hyaluronan-CD44 binding. PLOS
Computational Biology 13:e1005663. DOI: https://doi.org/10.1371/journal.pcbi.1005663, PMID: 28715483
Waltemath D, Adams R, Beard DA, Bergmann FT, Bhalla US, Britten R, Chelliah V, Cooling MT, Cooper J,
Crampin EJ, Garny A, Hoops S, Hucka M, Hunter P, Klipp E, Laibe C, Miller AK, Moraru I, Nickerson D,
Nielsen P, et al. 2011. Minimum Information About a Simulation Experiment (MIASE). PLOS Computational
Biology 7:e1001122. DOI: https://doi.org/10.1371/journal.pcbi.1001122
Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA. 2004. Development and testing of a general amber force
field. Journal of Computational Chemistry 25:1157–1174. DOI: https://doi.org/10.1002/jcc.20035, PMID:
15116359
Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW,
da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S,
Evelo CT, Finkers R, Gonzalez-Beltran A, et al. 2016. The FAIR Guiding Principles for scientific data
management and stewardship. Scientific Data 3:160018. DOI: https://doi.org/10.1038/sdata.2016.18, PMID:
26978244
Wilson SL, Way GP, Bittremieux W, Armache JP, Haendel MA, Hoffman MM. 2021. Sharing biological data: why,
when, and how. FEBS Letters 595:847–863. DOI: https://doi.org/10.1002/1873-3468.14067, PMID: 33843054
Yoo J, Winogradoff D, Aksimentiev A. 2020. Molecular dynamics simulations of DNA-DNA and DNA-protein
interactions. Current Opinion in Structural Biology 64:88–96. DOI: https://doi.org/10.1016/j.sbi.2020.06.007,
PMID: 32682257
Young TA, Martí-Centelles V, Wang J, Lusby PJ, Duarte F. 2020. RAtionalizing the activity of an “artificial
diels-alderase”: Establishing efficient and accurate protocols for calculating supramolecular catalysis. Journal of
the American Chemical Society 142:1300–1310. DOI: https://doi.org/10.1021/jacs.9b10302, PMID: 31852191
Zheng X, Chan MHY, Chan AKW, Cao S, Ng M, Sheong FK, Li C, Goonetilleke EC, Lam WWY, Lau TC, Huang X,
Yam VWW. 2022. Elucidation of the key role of Pt···Pt interactions in the directional self-assembly of platinum(II)
complexes. PNAS 119:e2116543119. DOI: https://doi.org/10.1073/pnas.2116543119, PMID: 35298336
Zhu S. 2019. Validation of the Generalized Force Fields GAFF, CGenFF, OPLS-AA, and PRODRGFF by Testing
Against Experimental Osmotic Coefficient Data for Small Drug-Like Molecules. Journal of Chemical Information
and Modeling 59:4239–4247. DOI: https://doi.org/10.1021/acs.jcim.9b00552
Zhu J, Salvatella X, Robustelli P. 2022. Small molecules targeting the disordered transactivation domain of the
androgen receptor induce the formation of collapsed helical states. Nature Communications 13:6390. DOI:
https://doi.org/10.1038/s41467-022-34077-z, PMID: 36302916
Tiemann et al. eLife 2023;12:RP90061. DOI: https://doi.org/10.7554/eLife.90061
22 of 22