AC 209 Data Science Final project

by Benjamin Sanchez Lengeling (bsanchezlengeling@g.harvard.edu)

(Gifs loading, give it a minute, it's worth it)

(Ab initio Molecular Dynamics of Molecules doing the Harlem Shake)

The Idea
Data
Exploration Visualization
Finding Patterns
Future Directions
Requirements

The Idea

Can we "discover" chemistry concepts via patterns in data? (Obviously, let me explain more about what I mean)

Molecules are small objects whose chemical behaviour is governed by physical laws (Quantum Mechancis) and the dynamics between protons and electrons. Below is Quantum Monte Carlo simulation stocastically sampling the space of possible arrangements for the electrons.

Turns out we can simulate these laws in a computer, solving the Schrödinger equation to obtain molecular properties that you could otherwise obtain via an experiment.

Computational chemistry is finally in a position to conduct studies on hundreds or even thousands of molecules within a reasonable amount of time. That means a lot of data :).

Recently Ramakrishnan et Al. reported in the Journal Scientific Data, "Quantum chemistry structures and properties of 134 kilo molecules" a data set 134k calculated molecules.

Part of the philosophy of calculating molecules on a computer is the "First Principles", "Ab-initio", "Without knowing anything" mind set, we want to predict and model reality (molecular reality that is) without previous knowledge, without empirical observations (at least to certain degree).

For example if a computer had molecules of the skin of tomato, would it know the tomato is red? By simulating how light hits the molecules and bounces back we can find out the wavelength of this returned light, and find out it is actually within the spectrum of the color "red". We can predict the tomato is red "without knowing anything".

The aim of this project is to play around with this dataset and see what chemistry we can learn.

Data

Dataset is hosted on figshare, it is a few hundred MB's. Each calculation is in a specially formated text file (xyz file) which cotains the optimized geometry, the energetics and thermodynamic properties for a single molecule. There are in total 134k stable small organic molecules made up of CHONF atoms. The first step was to parse each file as specified in the README.txt included in the data set.

After that I created a pandas dataframe, with a row for each molecule, containing all the information within the txt file. This data structure can easily be converted to other more information-friendly formats (excel, databases, json). A pandas dataframe pickle can be downloaded here. I also computed a few other descriptors not included in the article, such as enthalpies of Atomization and molecular weight.

The entire data set once filtered is a 130830 molecules by 25 features dataframe. Data Wrangling

Exploration Visualization

I carried out first an inital exploration of all variables, to see which ones were more correlated, which ones were redundant and also to classify molecules based on their rotation.

The variables to explore were:

Size: Number of atoms, Molecular Weight, Spatial Extensivity.
Rotation: Rotational Constants and Clases.
Energetics: Internal energy, Enthalpy, Gibbs free energy, Zero Point Energy, Atomization Energies.
Special Properties: Electronic Band Gap, Molar Heat Capacity, Dipole Moment and Polarizability

Some examples:

Here we have a random selection of ten molecules for the dataset: Random Selection of molecules

Distribution of the number of atoms on each Molecule: Imgur

Distribution of the molecular weight, two groups are apparent: Imgur

*Violinplots of Rot Constants (A,B,C) for different types of rotors * = (Asymmetrical, Linear, Oblate, Prolate and Spherical) respectively: Imgur

Due to the sheer number of samples, many variables followed a normal distribution (like Heat Capacity), these types of variables can sometimes be removed for modelling since there is little variability. Imgur

While other properties were much more interesting, suggesting possible clusters of molecules, here we have the Distribution of Electronic Band gaps. Imgur

Distribution of Enthalpies of Atomization also displays some peaks. Imgur

Finding Patterns

Machine Learning enters the foray!

Having explored the dataset we can trim and have a better idea of what types of techniques might work. I decided to tryout to train a molecular classifier. The criteria was common substructures. In this specific case I focused on benzene rings, which are carbon atoms arranged in form of a hexagon. The idea was to see what type of molecular properties are going to be influenced by the presence of these substructures.

Here we have one molecule that does have a benzene ring and another that dosen't. Easy Peasy (lemon-squeezy). Imgur What if we could predict the absence or presence of this substructure via it's molecular properties, without reference of geometry? I set to do try this out.

Using RDKit, I did a substructure matching of this ring in all molecules, this would be for validation purposes. And so after this I then took a sample of ~250 molecules for each class (equal representation), with Ring and without. Below are a subset of these molecules. Imgur

To start the Machine Learning I reduced the number of variables from 25 to 12 using only: rot constants A, B and C, the dipole moment, polarizability, electronic bandgap, electronic spatial extent, zero point energy, 0 temperature enthalpy, Heat Capacity, molecular Weight and enthalpy of Atomization.

Initially I tried with a RandomForest classifier, first I had to find out the optimal number of trees to use. Using a grid search and KFold=10 cross validation I found out that the best number of estimators was 11, resulting in a really good mean score of 0.928 (49). These results can be seen below:

Imgur

Then with a good estimator, I extracted the most important features used in the classification. With the idea of finding what makes "Ringed" molecules different from a random subset.

Imgur

The top 5 features were in order of increasing importance were: Heat Capacity, Molecular Weight, Zero Point Energy, Rotational Constant C and Electronic Band Gap! These resulted motivated me to look at the top two features and make sense of their distribution using violinplots (Yellow="Ringed", Green="Non-Ringed". Imgur Imgur

We can clearly see different distributions in both features, this can be contrasted with Molecular Weight, which is much less important, it has the has the same distribution, just slightly shifted. Imgur

The electronic bandgap makes sense, this property indicated how well a material will serve as a conductor, semi-conductor or insulator. The lower the bandgap, the more conductive a material will be since the gap between the HOMO and LUMO will be small and electrons can jump from on to the other easily. The random samples have a much higher mean gap. This maskes sense chemistry-wise. Currently there are multiple attempts to create organic semiconductors and the building blocks for these are aromatic rings composed by chains of these simple rings. This can be considered a small success!

I also tried other classifers such as Naive-Bayes, K-Nearest Neighbors and Support Vector Machine. The results are summarized in the following table, in each case KFold=10 along with cross validation were used:

Method	Mean Score	Std Score
Naive-Bayes	0.865	0.066
K-NN	0.585	0.087
SVM	0.745	0.121
RandomForest	0.928	0.043

Future Directions

I only tried matching one substructure, there are many functional groups, substructures that influenced strongly the chemical behavior of a molecule. I had a priori knowledge of this substructure, with more time it would be interesting to see if you can discover these substructures by themselves. There are many techniques I did not try out. Also regression was not explored. A more tough problem ywoulbe be to predict accurately all chemical properties from just the geometric coordinates and atoms list, without having to simulate the physics, atleast not entirely.

There is certainly much more work to be done and many venues to explore with data science and Quantum Chemistry. For the most part I stuck to the available dataset, yet I feel there could be more interesting data embedded in the calculations which is not included in the data set. Examples:

Frequencies, only frequencies were reported but amplitudes were not. Using amplitudes and frequencies together you could devise spectra features such as spectral centroids, brightness o cSpectrum.
Coulombic Matrix, The matrix of electromagnetic interaction between different molecules.
Orbital Occupation Numbers, Occupation numbers for the electrons within the molecular orbitals.

The main idea would be design and find descriptors that can not be found in a classical way, to demonstrate that quantum calculations have more conection to reality than a classical ones. (at the molecular level).

Anyways, had a lot of fun, here is another gif as a reward for reading.

Requirement

Sci-kit learn
Pandas
Matplotlib
Seaborn
Numpy
Ipython notebook
RDKit
Pybel and Openbabel
Sense of data adventure

A data driven, Quantum Mechanical understanding of Chemistry

AKA Trying to make sense of 134k quantum calculations