Understanding ligand properties is essential for computational high-throughput screening of transition metal complexes. However, ligand properties such as net charge and other information such as thei Show more
Understanding ligand properties is essential for computational high-throughput screening of transition metal complexes. However, ligand properties such as net charge and other information such as their application area are often absent or inconsistently recorded in crystallographic datasets. Here, we construct a ligand dataset from 126,985 mononuclear transition metal complexes curated from the Cambridge Structural Database. Using an iterative charge-balancing workflow that combines complex charges, metal oxidation states, and consensus across crystallographic observations, we confidently assign net charges to 66,810 ligands among 94,581 identified unique ligand structures to curate the Boston Open-Shell Ligand (BOS-Lig) dataset. The workflow assigns ligand charges in homoleptic complexes first and then iteratively propagates these assignments across heteroleptic environments, allowing charges to be inferred even when direct charge information is unavailable. We analyze cases where simple heuristics such as the octet rule would have failed and introduce a purity metric to identify when our charge assignments may be incorrect. Each ligand is also classified in terms of its metal coordinating atoms and whether there are multiple variants (i.e., hemilability). We then link complexes to their associated journal abstracts and apply a topic-modeling workflow to link 25,146 ligands with functional application areas spanning reactivity, redox chemistry, biological chemistry, and photophysical chemistry. Together, we provide an experimentally grounded dataset of ligand chemical space that connects charge and functional application as a foundation for computational screening and data-driven ligand design. Show less
The InChI (International Chemical Identifier) standard stands as a cornerstone in chemical informatics, facilitating the structure-based identification and exchange chemical information about Show more
The InChI (International Chemical Identifier) standard stands as a cornerstone in chemical informatics, facilitating the structure-based identification and exchange chemical information about compounds across various platforms and databases. The InChI as a unique canonical line notation has made chemical structures searchable on the internet at a broad scale. The largest repositories working with InChIs contain more than 1 billion structures. Central to the functionality of the InChI is its codebase, which orchestrates a series of intricate steps to generate unique identifiers for chemical compounds. Up to now, these steps have been sparsely documented and the InChI algorithm had to be seen as a black box. For the new v1.07 release, the code has been analyzed and the major steps documented, more than 3000 bugs and security issues, as well as nearly 60 Google OSS-Fuzz issues have been fixed. New test systems have been implemented that allow users to directly test the code developments. The move to GitHub has not only made the development more transparent but will also enable external contributors to join the further development of the InChI code. Motivation for this modernisation was the urgency to treat molecular inorganic compounds by the InChI in a meaningful way. Until now, no classic string representation fulfills this need of molecular inorganic chemistry. Currently bonds to metal centers are by definition disconnected which makes most inorganic InChIs meaningless at the moment. Herein, we propose new routines to remedy this problem in the representation of molecular inorganic compounds by the InChI.
Show less