151 changes: 151 additions & 0 deletions 10_Chemical_Format_Conversion_and_Metadata.ipynb
@@ -0,0 +1,151 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 10: Chemical Format Conversion and Metadata Handling\n",
"\n",
"Round-trip molecules between common chemical formats while preserving metadata fields.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n",
"\n",
"- Load structure-data files (SDF) that contain rich metadata.\n",
"- Convert the records into pandas DataFrames for analysis.\n",
"- Export SMILES and SDF files with selected properties.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from rdkit import Chem\n",
"from rdkit.Chem import PandasTools\n",
"import pandas as pd\n",
"from io import StringIO\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read SDF Data\n",
"\n",
"The snippet below emulates reading from disk by parsing an in-memory SDF string that holds a single benzene record with two metadata fields.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"# The first molfile line is the (blank) title, so the string must not be stripped.\n",
"sdf_block = \"\"\"\n",
"  Mrv2108 07152116512D\n",
"\n",
"  6  6  0  0  0  0            999 V2000\n",
"    1.2990   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"    0.0000   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"   -1.2990   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"   -1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"    0.0000    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n",
"  1  2  2  0  0  0  0\n",
"  2  3  1  0  0  0  0\n",
"  3  4  2  0  0  0  0\n",
"  4  5  1  0  0  0  0\n",
"  5  6  2  0  0  0  0\n",
"  6  1  1  0  0  0  0\n",
"M  END\n",
">  <Name>\n",
"Benzene\n",
"\n",
">  <Source>\n",
"Example\n",
"\n",
"$$$$\n",
"\"\"\"\n",
"supplier = Chem.SDMolSupplier()\n",
"supplier.SetData(sdf_block, sanitize=True)\n",
"molecules = [mol for mol in supplier if mol is not None]\n",
"len(molecules)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert to a DataFrame\n",
"\n",
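"`PandasTools.LoadSDF` loads every SD data field into its own column next to the molecule column, making downstream analytics straightforward. When given a file-like object instead of a path, it expects a binary stream.\n"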
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
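"from io import BytesIO\n",
"\n",
"# LoadSDF reads through ForwardSDMolSupplier, which needs a binary stream rather than StringIO.\n",
"sdf_buffer = BytesIO(sdf_block.encode('utf-8'))\n",
"df = PandasTools.LoadSDF(sdf_buffer, smilesName='smiles', molColName='ROMol')\n",
"df\n"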
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write SMILES and SDF Outputs\n",
"\n",
"Export the curated data to SMILES or SDF files. String buffers allow inspection without touching disk.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
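"smiles_buffer = StringIO()\n",
"# SaveSMILESFromFrame is the PandasTools SMILES exporter; NamesCol selects the column used as the record name.\n",
"PandasTools.SaveSMILESFromFrame(df, smiles_buffer, molCol='ROMol', NamesCol='Name', isomericSmiles=True)\n",
"smiles_buffer.getvalue()\n"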
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"sdf_output = StringIO()\n",
"# SDWriter takes its destination (a filename or file-like object) at construction time.\n",
"sdf_writer = Chem.SDWriter(sdf_output)\n",
"for mol in molecules:\n",
"    # Properties set on the molecule are written as SD data fields.\n",
"    mol.SetProp('Processed', 'True')\n",
"    sdf_writer.write(mol)\n",
"sdf_writer.close()\n",
"sdf_output.getvalue().splitlines()[:10]\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
160 changes: 160 additions & 0 deletions 5_Conformer_Generation_and_3D_Analysis.ipynb
@@ -0,0 +1,160 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 5: 3D Conformer Generation and Analysis\n",
"\n",
"Learn how to generate three-dimensional conformers for a molecule, optimise their geometry, and compare the resulting ensemble.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n",
"\n",
"- Prepare a molecule with explicit hydrogens so that the force field has the atoms it expects.\n",
"- Embed several conformers with the ETKDG algorithm and perform force-field minimisation.\n",
"- Analyse conformer energies and pairwise RMS values to identify the most representative structures.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"from rdkit import Chem\n",
"from rdkit.Chem import AllChem, Draw\n",
"from rdkit.Chem import rdMolAlign\n",
"import pandas as pd\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare an Example Molecule\n",
"\n",
"We will work with ibuprofen, a small drug-like molecule that exhibits several low-energy conformations.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"ibuprofen = Chem.AddHs(Chem.MolFromSmiles('CC(C)Cc1ccc(cc1)[C@@H](C)C(=O)O'))\n",
"ibuprofen\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Conformers with ETKDG\n",
"\n",
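"The ETKDG method (distance geometry augmented with experimental torsion-angle preferences and basic chemical knowledge) provides a robust starting point for 3D coordinates. Fixing the random seed makes the embedding reproducible.\n"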
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"params = AllChem.ETKDGv3()\n",
"params.randomSeed = 0xF00D\n",
"conformer_ids = list(AllChem.EmbedMultipleConfs(ibuprofen, numConfs=10, params=params))\n",
"print(f\"Generated {len(conformer_ids)} conformers\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimise with a Force Field\n",
"\n",
"Each conformer is refined with the Universal Force Field (UFF). The final energy (in kcal/mol) helps rank conformers.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"energy_records = []\n",
"for cid in conformer_ids:\n",
"    # Minimise this conformer in place, then evaluate its final UFF energy.\n",
"    AllChem.UFFOptimizeMolecule(ibuprofen, confId=cid)\n",
"    ff = AllChem.UFFGetMoleculeForceField(ibuprofen, confId=cid)\n",
"    energy_records.append((cid, ff.CalcEnergy()))\n",
"energy_df = pd.DataFrame(energy_records, columns=['conformer_id', 'uff_energy_kcal'])\n",
"energy_df.sort_values('uff_energy_kcal').reset_index(drop=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Conformer Geometries\n",
"\n",
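"The RMS distance matrix quantifies structural differences between conformers; smaller values indicate more similar geometries. `GetConformerRMSMatrix` returns the pairwise values as a flattened lower triangle, so they must be expanded into a square matrix before tabulating.\n"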
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
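"import numpy as np\n",
"\n",
"# The result is a flat lower triangle ordered (1,0), (2,0), (2,1), ...; expand it to a symmetric matrix.\n",
"rms_flat = AllChem.GetConformerRMSMatrix(ibuprofen, prealigned=False)\n",
"n_conf = len(conformer_ids)\n",
"rms_square = np.zeros((n_conf, n_conf))\n",
"rms_square[np.tril_indices(n_conf, k=-1)] = rms_flat\n",
"rms_square += rms_square.T\n",
"labels = [f\"conf_{i}\" for i in conformer_ids]\n",
"rms_df = pd.DataFrame(rms_square, index=labels, columns=labels)\n",
"rms_df.round(3)\n"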
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualise the Lowest-Energy Conformers\n",
"\n",
"Drawing the lowest-energy conformers helps communicate which geometry the force field prefers.\n"
]
},
{
"cell_type": "code",
"metadata": {},
"execution_count": null,
"outputs": [],
"source": [
"ranked = energy_df.sort_values('uff_energy_kcal').head(4)['conformer_id'].tolist()\n",
"# Copy the molecule once per selected conformer and keep only that conformer in each copy.\n",
"mols = [Chem.Mol(ibuprofen) for _ in ranked]\n",
"for new_conf, cid in zip(mols, ranked):\n",
"    new_conf.RemoveAllConformers()\n",
"    new_conf.AddConformer(ibuprofen.GetConformer(id=cid), assignId=True)\n",
"Draw.MolsToGridImage([Chem.RemoveHs(m) for m in mols], legends=[f\"conf {cid}\" for cid in ranked], molsPerRow=2)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}