Official SEING Documentation

SEING is a software package from the Clancy Group at Cornell University (https://clancygroup.cbe.cornell.edu) created to compute fingerprints of molecular systems for machine learning applications.

Author: Mardochee Reveil

Date first released: Feb 2018

SEING is distributed as free and open-source code, available on GitHub: https://github.com/mreveil/seing

Introduction

SEING streamlines the process of computing fingerprints of molecular systems with a focus on those that explicitly encode spatial coordinates.

Fingerprints, in this context, are numerical representations of chemical structures designed to be invariant under property-preserving operations such as permutation of atoms of the same species, geometric rotation, etc.

Inspired by similar representations in cheminformatics, these structural representations were created as alternatives to Cartesian coordinates, which are not suitable for machine learning studies. The hope is that these fingerprints will open the door to new structure-property explorations and the development of improved predictive capability in materials science.

SEING (an Old French word for “signature”) was created and released to the community to streamline and facilitate the use of these fingerprints in ML applications, in the hope of accelerating the use of AI in materials science for better understanding and easier discovery of a wide range of new materials.

Download and Installation

The source code is available on GitHub: https://github.com/mreveil/seing

SEING is built with minimal requirements and can be easily compiled with any C/C++ compiler. A generic Makefile is provided in the src folder. As a starting point, you can just type:

cd src
make seing

If this doesn’t work, changes might be necessary to adapt the Makefile to your operating system and/or environment.

License

This program is free and open-source software distributed under the terms of the GNU GPL version 3 (or later) which can be found here: www.gnu.org/licenses/gpl-3.0.en.html

Please note that SEING is provided WITHOUT WARRANTY OF ANY KIND, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Please see the full terms of the GNU GPL license for more details.

User Support

SEING is provided with no dedicated user support. However, questions and suggestions are welcome, and the author(s) will do their best to provide answers in a timely fashion. Please email mr937@cornell.edu with any questions or concerns.

How to Use

To use SEING, a coordinate file and an option file are required. Example usage looks like this:

/path/to/seing coordinates.xyz optionfile.in

Trajectory File

The coordinates of each atom have to be provided in a coordinate file in the xyz format, which is the only format currently supported. Trajectory files (i.e. coordinate files with more than one frame) are not supported at the moment, but support is planned.
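For reference, a single-frame xyz file starts with the number of atoms, followed by a comment line, then one line per atom giving the species name and its x, y and z coordinates. The sketch below parses that layout from a stream. It is only an illustration of the expected input: XyzAtom and read_xyz are hypothetical names and are not part of SEING, whose own reader may behave differently.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record for one atom; SEING's own Atom class is not used here.
struct XyzAtom {
    std::string type;  // species name as written in the file, e.g. "Si"
    double x, y, z;    // Cartesian coordinates
};

// Reads one frame: atom count, comment line, then "type x y z" rows.
std::vector<XyzAtom> read_xyz(std::istream& in) {
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("missing atom count");
    const int natoms = std::stoi(line);
    if (natoms < 0) throw std::runtime_error("negative atom count");
    if (!std::getline(in, line)) throw std::runtime_error("missing comment line");

    std::vector<XyzAtom> atoms;
    atoms.reserve(natoms);
    for (int i = 0; i < natoms; ++i) {
        if (!std::getline(in, line)) throw std::runtime_error("file ends early");
        std::istringstream row(line);
        XyzAtom a;
        if (!(row >> a.type >> a.x >> a.y >> a.z))
            throw std::runtime_error("malformed atom line");
        atoms.push_back(a);
    }
    return atoms;
}
```

The species names read this way are the ones that the atomtypes option is expected to match.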

Option File

The option input file contains “key = value” pairs specifying the type of calculation to perform, the input parameters for the method chosen, etc. Current keys and possible values are as follows:

General Options

type (optional)

The type of fingerprinting scheme to use. Possible values are:

  • gaussian (default): Behler-Parrinello symmetry-function fingerprints

  • zernike: Zernike fingerprints

  • bispectrum: bispectrum fingerprints

natomtypes (required)
The number of different species in the molecular system.
  • Integer values only

atomtypes (required)

A space separated list of the abbreviated names of the different species in the system. Names should correspond to the ones used in the coordinate file.

  • Number of species provided has to match the value for the natomtypes option above

strategy (optional)

How to account for more than one species. Possible values are:

  • augmented (default): the fingerprint size is increased, with one sub-fingerprint for each distinct atom pair or triplet

  • weighted: the size of the fingerprint remains the same as with a single atom type, but the contribution of each atom type is weighted according to the chosen weight type (see weight_type below). Please note that this strategy doesn’t work with all fingerprints.

weight_type (optional)
Defines how contributions are weighted for the weighted strategy explained above. Possible values are:
  • atomic number (default)

  • electronegativity

Derivatives Options

calculate_derivatives (optional)
Whether or not to calculate fingerprint derivatives. If true, derivatives are added to the fingerprint. Please see the documentation of your specific fingerprint to check whether derivatives are supported and, if so, how they are calculated and incorporated into the fingerprint vector or matrix.
  • true

  • false (default)

ndirections (required)

The number of derivative components to calculate (see directions below).

directions (optional)
The list of directions (x, y or z) in which derivatives of the fingerprints are calculated. Possible values are:
  • 0 (default): Calculate only in the x direction

  • 1: Calculate only in the y direction

  • 2: Calculate only in the z direction

  • 3: Calculate in all three directions

nderivatives (optional)

The number of derivatives to calculate. The default is 1, in which case the derivative is taken with respect to the center atom. If a value greater than 1 is provided, derivatives are also calculated with respect to neighboring atoms, in order of increasing distance to the center atom.

Output Options

output_file (optional)

Name of the output file to write the fingerprints to. The output file is written in the current directory (where the coordinate and option files are). If the file already exists, the behavior of the program is determined by the output_mode keyword explained below.

  • Default output name is fingerprint_type+”_fingerprints.sg”

output_mode (optional)

Whether to append fingerprints to the given output file if it already exists, or to overwrite it. Possible values are:

  • append

  • overwrite (default)

Neighbor Searching Options

cutoff (required)

Defines the cutoff value used to build the neighbor list.

box_size (optional)

Defines the size of the simulation box in the following format: xmin ymin zmin xmax ymax zmax

Fingerprint-Specific Options

Bispectrum

jmax

Zernike

nmax

Behler-Parrinello (Gaussian)

nzetas

zetas

ngammas

gammas

netas

etas

netas2

etas2

AGNI
width (required)

The width of the Gaussians.

dimensionality (required)

The dimensionality of the fingerprint. This determines how many Gaussian centers are used. The Gaussians are uniformly placed from the center atom (distance = 0) to the cutoff distance.

alpha (required)

The direction of the fingerprint (0=x, 1=y, 2=z)
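As an illustration of how the keys above fit together, a minimal option file for a Zernike calculation might look like the fragment below. This is a sketch built only from the keys documented in this section: the species (Si, O) and the cutoff value are arbitrary placeholders, and since ndirections is listed as required, it may also need to be set even when derivatives are turned off.

```
type = zernike
natomtypes = 2
atomtypes = Si O
cutoff = 6.0
nmax = 5
calculate_derivatives = false
output_file = zernike_fingerprints.sg
output_mode = overwrite
```

With this file saved as optionfile.in, the program would be run as /path/to/seing coordinates.xyz optionfile.in.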

Fingerprints

Below we discuss fingerprints currently implemented in SEING as well as others in the pipeline for implementation.

Behler-Parrinello

Behler-Parrinello (BP) fingerprints, also called “Gaussian” fingerprints, are local fingerprints based on symmetry functions. Two such symmetry functions are the radial and angular components \(G^{rad}\) and \(G^{ang}\) below, where summations run over all neighbors \(j\) and \(k\) of atom \(i\) within a cutoff distance \(R_c\) around \(i\), and \(R_{ij}\), \(R_{ik}\) and \(R_{jk}\) are the corresponding interatomic distances. \(\theta_{ijk}\) is the angle between atoms \(i\), \(j\) and \(k\), centered on \(i\). \(\eta\), \(R_s\), \(\lambda\) and \(\zeta\) are parameters whose values are chosen by the user. \(f_c\) is a cutoff function used to ensure a smooth transition to zero at \(R_c\). For more information, see: [BP]

\[G^{rad}_i = \sum_j e^{-\eta(R_{ij}-R_s)^2}f_c(R_{ij})\]
\[G^{ang}_i = 2^{1-\zeta}\sum_{j,k\neq i} (1+\lambda \cos \theta_{ijk})^\zeta e^{-\eta(R_{ij}^2+R_{ik}^2+R_{jk}^2)}f_c(R_{ij})f_c(R_{ik})f_c(R_{jk})\]
\[f_c(R_{ij}) = \begin{cases} 0.5\left[\cos\left(\frac{\pi R_{ij}}{R_c}\right)+1\right] & \text{for } R_{ij}\leq R_c \\ 0 & \text{for } R_{ij} > R_c \end{cases}\]

The SEING implementation of the BP fingerprint only requires the parameters \(\eta\), \(\lambda\) and \(\zeta\) as \(R_s\) is automatically set to zero.
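To make the radial term concrete, the sketch below evaluates the cosine cutoff and \(G^{rad}\) with \(R_s = 0\) for a list of neighbor distances. It is a minimal illustration of the formulas in this section, not SEING's own code: the names cutoff_cosine and g_radial are hypothetical, and the distances are assumed to have been computed beforehand.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cosine cutoff: 0.5*[cos(pi*r/rc) + 1] for r <= rc, and 0 beyond rc.
double cutoff_cosine(double r, double rc) {
    assert(rc > 0.0);
    const double pi = std::acos(-1.0);
    return (r <= rc) ? 0.5 * (std::cos(pi * r / rc) + 1.0) : 0.0;
}

// Radial symmetry function for one atom with R_s = 0:
// sum over neighbors of exp(-eta * r^2) * f_c(r).
double g_radial(const std::vector<double>& distances, double eta, double rc) {
    double g = 0.0;
    for (double r : distances)
        g += std::exp(-eta * r * r) * cutoff_cosine(r, rc);
    return g;
}
```

Calling g_radial once per value of \(\eta\) gives one fingerprint component per value; neighbors beyond the cutoff contribute nothing because the cutoff function is zero there.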

AGNI

The [AGNI] method was developed as a framework for machine learning force field development in which the forces are calculated directly without going through energy predictions. The associated fingerprint is given by \(V_{i,\alpha,k}\) below, where \(\alpha\) denotes the direction (x, y or z) of the force between atoms \(i\) and \(j\) separated by distance \(r_{ij}\), and \(r_{ij}^\alpha\) is the component of their separation along that direction. The parameter \(w\) corresponds to the width of Gaussians placed at positions \(a_k\) within a cutoff distance \(R_c\). As in the BP fingerprint, \(f_c\) is the cutoff function ensuring a smooth transition to zero at \(R_c\).

\[V_{i,\alpha,k} = \sum_{j\neq i} \frac{r_{ij}^\alpha}{r_{ij}} \frac{1}{\sqrt{2\pi}w}e^{-0.5(\frac{r_{ij}-a_k}{w})^2}f_c(r_{ij})\]

For the SEING implementation of the AGNI fingerprint, Gaussian centers are uniformly chosen between 0 and the cutoff distance \(R_c\). The only parameter necessary is the dimensionality of the fingerprint, which determines the number of such Gaussian centers to generate. This fingerprint does not support derivative calculations.
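The following sketch shows how such a fingerprint could be evaluated for one atom and one direction. It rests on several assumptions that are not taken from the SEING source: the Gaussian centers include both end points (\(a_k = k R_c/(d-1)\) for dimensionality \(d \geq 2\)), the prefactor is the normalized-Gaussian form \(1/(\sqrt{2\pi}w)\), the cutoff is the cosine function used for the BP fingerprint, and Vec3 and agni_fingerprint are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical displacement vector from the center atom to a neighbor.
struct Vec3 { double x, y, z; };

// Returns V_{i,alpha,k} for k = 0..dim-1, given the neighbor displacements.
std::vector<double> agni_fingerprint(const std::vector<Vec3>& neighbors,
                                     int alpha, int dim, double w, double rc) {
    assert(alpha >= 0 && alpha <= 2 && dim >= 2 && w > 0.0 && rc > 0.0);
    const double pi = std::acos(-1.0);
    const double norm = 1.0 / (std::sqrt(2.0 * pi) * w);
    std::vector<double> v(dim, 0.0);
    for (const Vec3& d : neighbors) {
        const double r = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
        if (r == 0.0 || r > rc) continue;  // skip self and out-of-range atoms
        const double comp = (alpha == 0) ? d.x : (alpha == 1) ? d.y : d.z;
        const double fc = 0.5 * (std::cos(pi * r / rc) + 1.0);
        for (int k = 0; k < dim; ++k) {
            const double ak = rc * k / (dim - 1);  // assumed center placement
            const double t = (r - ak) / w;
            v[k] += comp / r * norm * std::exp(-0.5 * t * t) * fc;
        }
    }
    return v;
}
```

Because each term is weighted by the direction cosine of the neighbor, a neighbor lying exactly along x contributes nothing to the y and z fingerprints.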

Bispectrum

Bispectrum fingerprints for the representation of chemical environments were proposed by Bartók et al. and are based on the decomposition of a local atomic density function with respect to 4D spherical harmonics. The bispectrum representation is then built from the coefficients \(c_{m'm}^j\) of the decomposition, as given below. For more information, please consult the original paper on the development of [Bispectrum] fingerprints.

\[B_{j_1,j_2,j} = \sum_{m'_1,m_1=-j_1}^{j_1} \sum_{m'_2,m_2=-j_2}^{j_2} \sum_{m',m=-j}^{j} c_{m'm}^jC_{j_1m_1j_2m_2}^{jm}C_{j_1m'_1j_2m'_2}^{jm'}c_{m'_1m_1}^{j_1}c_{m'_2m_2}^{j_2}\]

Within SEING, only the parameter \(j_{max}\) is needed (suggested value: 5) to generate bispectrum fingerprints. Please note that this type of fingerprint is relatively slow to compute compared to the other fingerprints currently implemented. Derivatives are supported.

Zernike

Zernike fingerprints are similar to Bispectrum fingerprints in the sense that they are based on the decomposition of a local atomic density with respect to a basis set. However, Zernike fingerprints are based on a decomposition with respect to Zernike polynomials and 3D spherical harmonics. The general formula is given below. Please consult the paper describing the [Zernike] method for details.

\[\rho(\tilde{r},\theta,\phi) = \sum_{n=0}^{\infty} \sum_l \sum_{m=-l}^l c_{nl}^mZ_{nl}^m(\tilde{r},\theta, \phi) ~~~~ \text{for} ~~ n-l \geq 0\]

To use the Zernike fingerprint in SEING, only the parameter \(n_{max}\) is needed (suggested value: 5). Derivatives are supported.

PRDF (Coming Soon)

Contact Matrix (Coming Soon)

SPRINT (Coming Soon)

Citations

BP
  J. Behler and M. Parrinello, Phys. Rev. Lett., 2007, 98, 146401.

AGNI
  T. D. Huan, R. Batra, J. Chapman, S. Krishnan, L. Chen and R. Ramprasad, npj Comput. Mater., 2017, 3, 89–109.

Bispectrum
  A. P. Bartók, M. C. Payne, R. Kondor and G. Csányi, Phys. Rev. Lett., 2010, 104, 136403.

Zernike
  A. Khorshidi and A. A. Peterson, Comput. Phys. Commun., 2016, 207, 310–324.

Developer Information

To contribute to the development of SEING, assuming you have cloned the GitHub repository and created a new branch, there are three (3) sets of files to work with:

  • inputs (.cpp, .h) : This is where all inputs are read and processed. You will want to add any input specific to your fingerprint scheme (such as a name for your fingerprint) to this file.

  • genericlocalcalculator (.cpp, .h) or genericglobalcalculator (.cpp, .h): This class acts as a way to switch between different fingerprints based on user inputs. Open those files and add in your own fingerprints as a new possibility.

  • your_own_fingerprint (.cpp, .h): files that implement the code specific to the new fingerprint. An existing fingerprint can be used as a starting point for code structure, but only the calculate_fingerprint and get_size functions are required for a fingerprint to be valid.

After implementation and validation, please submit a pull request for addition to the master branch.
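As a rough picture of what a new calculator involves, the sketch below implements the two required functions for a toy fingerprint (the number of neighbors of an atom). The constructor and function signatures follow the ones listed under Fingerprint Calculators in the Code Documentation; everything else is an assumption. In particular, the AtomicSystem, NeighborList and fingerprintProperties types here are reduced stand-ins so that the example compiles on its own, CoordinationCalculator is a hypothetical name, and the ownership of the returned array is a guess.

```cpp
#include <vector>

// Stand-ins so the sketch compiles on its own; in SEING these come from the
// existing headers and have many more members.
struct fingerprintProperties { double cutoff; };
class AtomicSystem {};
class NeighborList {
public:
    explicit NeighborList(std::vector<int> counts) : counts_(counts) {}
    int get_n_neighbors(int atom_index) { return counts_.at(atom_index); }
private:
    std::vector<int> counts_;
};

// Hypothetical fingerprint: the number of neighbors within the cutoff.
class CoordinationCalculator {
public:
    // This toy fingerprint needs nothing from either argument.
    CoordinationCalculator(AtomicSystem&, fingerprintProperties) {}

    // Length of the array returned by calculate_fingerprint.
    int get_size() { return 1; }

    // Assumption: the caller takes ownership of the returned array.
    double* calculate_fingerprint(int atom_index, NeighborList& nlist) {
        double* fp = new double[get_size()];
        fp[0] = static_cast<double>(nlist.get_n_neighbors(atom_index));
        return fp;
    }
};
```

A real fingerprint would read its parameters from fingerprintProperties in the constructor and would then need a matching branch in the generic calculator so that it can be selected from the option file.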

Code Documentation

Basic Classes

class AtomicSystem

Class that holds information about the molecular system including atom types and spatial coordinates.

Public Functions

AtomicSystem(void)

Default constructor.

AtomicSystem(string, bool, bool, bool, double)

Constructor that creates the AtomicSystem object from a coordinate file.

Parameters
  • filename – name of coordinate file

  • pbcx – boundary condition for x

  • pbcy – boundary condition for y

  • pbcz – boundary condition for z

  • skin – size of the skin around the box used in neighbor list generation

void set_box_size(double, double, double, double, double, double)

Sets the boundaries of the box xmin, ymin, zmin, xmax, ymax, zmax.

double get_distance_component(Atom, Atom, int)

Calculates the x (=0), y (=1) or z (=2) component of the distance between two atoms.

Parameters
  • A – the first atom object.

  • B – the second atom object.

  • direction – the component of the distance to return 0=x, 1=y, 2=z

double get_square_distance(Atom, Atom)

Calculates the square distance between two atoms.

double get_square_distance(int, int)

Calculates the square distance between two atoms given by their index in the AtomicSystem.

vector<string> get_atom_types()

Finds the unique atomic species and returns them as a vector of atom names ordered by atomic number.

Atom get_atom(int)

Returns an Atom given by its index in the atomic system.

int get_n_atoms()

Returns the total number of atoms in the system.

class Atom

Class that holds information about an atom and its coordinates.

Public Functions

Atom(void)

Default constructor.

Atom(string, double, double, double)

Constructor that creates an Atom object from its name and coordinates.

Parameters
  • attype – name of atom (e.g. H, C, Si, Au, etc.)

  • cx – cartesian coordinates - x value

  • cy – cartesian coordinates - y value

  • cz – cartesian coordinates - z value

string get_atom_type()

Returns the name of this atom.

double get_x()

Returns the x coordinate of this atom.

double get_y()

Returns the y coordinate of this atom.

double get_z()

Returns the z coordinate of this atom.

Neighbor Searching

class NeighborList

Class to generate and hold neighbor lists.

This class divides the box into bins and, for a given atom, loops over the atoms in neighboring bins to generate its neighbor list. This speeds up the process because it is no longer necessary to loop over all atoms in the system. For additional speed-up, the neighbor list is kept in a linked list.
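The binning step can be pictured with the sketch below, which maps a coordinate to a flat bin index for a rectangular box. It is an illustration of the general cell-list idea rather than SEING's get_bin_number: the BinGrid structure, the x-fastest index ordering and the clamping of out-of-box points into the edge bins are all assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical box description: lower corner, upper corner, bins per axis.
struct BinGrid {
    double lo[3];
    double hi[3];
    int n[3];
};

// Maps a point to a flat bin index; points outside the box (or exactly on
// the upper face) are clamped into the nearest edge bin.
int bin_index(const BinGrid& g, double x, double y, double z) {
    const double p[3] = {x, y, z};
    int idx[3];
    for (int d = 0; d < 3; ++d) {
        assert(g.n[d] > 0 && g.hi[d] > g.lo[d]);
        const double frac = (p[d] - g.lo[d]) / (g.hi[d] - g.lo[d]);
        idx[d] = std::min(g.n[d] - 1,
                          std::max(0, static_cast<int>(frac * g.n[d])));
    }
    return idx[0] + g.n[0] * (idx[1] + g.n[1] * idx[2]);
}
```

If the bins are at least as wide as the cutoff, the neighbors of an atom can only be in its own bin or in the adjacent ones, which is what makes the search cheaper than a loop over all atoms.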

Public Functions

NeighborList()

Default Constructor.

NeighborList(AtomicSystem&, double, int, int, int, int)

Creates a NeighborList object based on an AtomicSystem.

Parameters
  • asystem – reference to the atomic system for which the neighbor lists have to be generated.

  • cutoff – maximum distance for which two atoms are considered neighbors

  • nxb – number of bins in the x direction

  • nyb – number of bins in the y direction

  • nzb – number of bins in the z direction

void build()

Function that actually generates the neighbor list for each atom.

int *get_atoms_in_bin(int)

Returns a pointer to the list of atoms in a given bin.

int get_atoms_per_bin(int)

Returns the number of atoms in a given bin.

int get_bin_number(double, double, double)

Returns the number of the bin to which the atom with the given x, y, and z coordinates belongs.

int *get_neighboring_bins(int)

Returns the neighboring bins for a bin given by its index.

int *get_neighbors(int)

Returns a pointer to the list of neighbors of an atom given by its index.

int *get_sorted_neighbors(int)

Returns a pointer to the neighbor list of a given atom sorted by distance.

int *get_sorted_neighbors(int, string)

Returns only atoms of a given type from the neighbor list of a given atom.

int *get_sorted_neighbors(int, vector<string>)

Returns only atoms of specific types from the neighbor list of a given atom.

int get_n_neighbors(int)

Returns the number of neighbors for a given atom.

int get_n_neighbors(int, string)

Returns the number of neighbors of a given type for a given atom.

int get_n_neighbors(int, vector<string>)

Returns the number of neighbors of given types for a given atom.

Input Parsing

struct fingerprintProperties

Structure to hold fingerprint parameters as provided by the user in the input file.

Public Members

string type

The type of fingerprint to compute (e.g. AGNI, Zernike, Behler-Parrinello, etc.)

string calculate_derivatives

Values: yes or no.

int nderivatives

The number of derivatives to calculate.

Derivatives are calculated with respect to the central atom first, then wrt neighbor atoms ordered by increasing distance.

int ndirections

How many directions (x,y,z) to calculate derivatives for.

int *directions

Directions to calculate derivatives for 0=x, 1=y, 2=z.

string strategy

The strategy to use for incorporating multiple species in the fingerprint. Values: weighted or augmented.

string weight_type

For the “weighted” strategy, what type of weight to use (e.g. atomic number, electronegativity, etc.)

double cutoff

In case of a local fingerprint, the cutoff to use.

bool is_box_size_provided

A boolean to save whether the box size is provided as part of the input file or not.

double *box_size

A pointer to the list that holds the dimensions of the box if they are provided in input file.

int natomtypes

The number of different species to consider for fingerprint generation.

string *atomtypes

A pointer to the list of atomic species for fingerprint generation.

Fingerprint Generation Utilities

class FingerprintGenerator

Class that handles fingerprint generation.

Public Functions

FingerprintGenerator(AtomicSystem&, fingerprintProperties)

Constructor that instantiates the fingerprint calculator and performs the fingerprint generation.

bool write2file(string, string)

Writes the fingerprints to a file in the same order as the atoms were given in the coordinate file.

class GenericLocalCalculator

Class that holds a generic calculator to switch between actual fingerprint calculators.

Public Functions

GenericLocalCalculator(AtomicSystem&, fingerprintProperties)

Constructor based on the atomic system and fingerprintProperties.

int get_size()

Returns the dimensionality of the fingerprint.

double *calculate_fingerprint(int, NeighborList&)

Calls the calculate_fingerprint function of the actual fingerprint calculator and returns the fingerprint.

Fingerprint Calculators

All fingerprint calculator classes have the same constructor signature and expose get_size and calculate_fingerprint functions to allow for easy switching between different types. The AGNI fingerprint calculator is used here as an example.

class AGNICalculator

An AGNI fingerprint calculator.

Public Functions

AGNICalculator(AtomicSystem&, fingerprintProperties)

Constructor that instantiates an AGNI fingerprint object.

int get_size()

Returns the dimensionality of this AGNI fingerprint.

double *calculate_fingerprint(int, NeighborList&)

Function to calculate the fingerprint of an atom given its neighbor list.

Utility Functions

double cutoff_func(double, double)

Function that returns the value of the cutoff function given the cutoff value and the current distance.
