ChEMBL Project meeting summary

HP 2011-01-06

Focus

The aim of the meeting was to foster the creation of a pre-competitive alliance of Pipeline Pilot and ChEMBL users in which data, methods, components and protocols that work with ChEMBL derived data can be developed and shared.

Introduction

The meeting started with an introduction from Richard Compton describing the events leading to the meeting and referring to the Pistoia Alliance as an example of how an industrially sponsored initial meeting can lead to a persistent, pre-competitive outcome. Brad Sherborne followed with a description of the practical motivators that inspired the call for such a meeting, chiefly the suspected wasted effort in many companies developing similar Pipeline Pilot components and protocols, particularly those associated with the preparation, integration, presentation and exploration of the ChEMBL database.

Aspirations

Aspirations for the day and beyond were delivered by Noj Malcolm, John Overington and Brad Sherborne. They emphasised the hope of creating an open, collaborative community effort whose protocols and components mature and evolve, and which are made available either purely for download and installation or also as ChEMBL-hosted services or at work behind ChEMBL interfaces. The challenge of bringing together open and commercial interests was noted.

Communication, Membership and Projects

It was agreed that the current membership of the group (the attendees) would need to be expanded, but only to include Pipeline Pilot (PP) users, and that care would need to be taken to maintain focus on the first projects as new members accrete.

In order to initiate efforts the aim for the day was to identify a handful of key “projects” and assign folks to these to ensure ownership and leadership. Contributions to the projects might be guidance on functionality and commenting on beta versions or some code or tables from industry.

To facilitate communication within projects and within this community, a set of closed groups will be created on the Accelrys Forums resource. It was noted that the necessity and incentive for initially closed communication revolve around the opportunity for publications.

ChEMBL architecture and content

John gave an overview of the ChEMBL architecture and led discussions on plans for the future. Opportunities were noted in steering the ChEMBL team from forums such as this, in combining cheminformatics and bioinformatics approaches, and in the future need to approach PPIs and biologics rationally.

John revealed a four-week cycle for updates to the database, with an aspirational target of being 4 – 6 weeks behind publication of data in journals. Biological data is available via a MySQL service; REST webservices and an RDF triplestore will become available later. An https version of the website is available and the team will be more than willing to address any security concerns that users may have. The issue of whether to download the database or search it on site was debated: it is clear that on-site searching and webservices may alleviate the need for some downloads, but that some who integrate ChEMBL will always require them.

The need for shared, “standard” definitions of actives at a target (active, active < uM etc.) and of comparably active (same assay same paper, same assay different paper) was discussed, along with the need for alternative means of classifying targets. Here it is clear that ChEMBL will not be a repository of an alternative target classification scheme, but it could include links to additional external resources, whilst this community could choose to maintain a standard table or database.

The planned content of ChEMBL raised much debate, e.g. could it be made a prerequisite of journal publication that the data is also filed at ChEMBL? The competing intentions of the RSC and ACS were noted, as were open access article requirements by e.g. NIH and some open access journals. Capturing inactive data was a need raised by many and, though some thought that typical LO publications captured sufficient inactive data, it was clear that for HTS data this is not true. Links to academic efforts like Euroscreen were noted, as well as plans to include PubChem ids and data. Currently patent data is accessed via SIMPLE (IBM) and is not included. Data that is planned, or under investigation, for inclusion is clinical, drug development and PSDPI data.

Identified Projects

Round table discussions identified the outlines of a number of projects which are captured here:

1. Pipeline Pilot ChEMBL integration components

The aim would be a hierarchical set of components that allowed basic and more elaborate interactions with the ChEMBL database. Whilst the aim was that they would be implemented via a ChEMBL webservice, in the short term they could be implemented via a local MySQL/Oracle instance. The problem of webservice users who download everything or otherwise monopolise the resources was discussed, but no clear answer was identified. Establishing some guidelines or rules for component construction early on seems prudent.

Priority components were name → prefname, target → compounds → +activity data, company → targets or compounds, drug or target synonyms, substructure and similarity searching and combined target and structural searching. Some issues were noted with the handling of units and how to cope with multiple activity results. More complex components were those that direct queries externally, e.g. using external sources of target or compound analogues. Integration with PDB data was felt to be of high importance and should be addressed.
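As an illustration of the units and multiple-results issue noted above, the following is a minimal sketch only, not an agreed component design. It assumes activities arrive as (compound, target, value, unit) records, converts each concentration to a negative log10 molar value, and collapses repeats to a median. The record layout, the unit spellings, the identifiers and the choice of the median are all assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import median

# Conversion factors to nanomolar; the unit spellings are illustrative assumptions.
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_p_value(value, unit):
    """Convert a concentration to a negative log10 molar value (e.g. pIC50)."""
    nanomolar = value * TO_NM[unit]
    return 9.0 - math.log10(nanomolar)


def summarise(records):
    """Collapse repeated measurements to one median p-value per (compound, target)."""
    grouped = defaultdict(list)
    for compound, target, value, unit in records:
        grouped[(compound, target)].append(to_p_value(value, unit))
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical records: two results for CPD-1 reported in different units.
records = [
    ("CPD-1", "TGT-A", 10.0, "nM"),
    ("CPD-1", "TGT-A", 0.1, "uM"),
    ("CPD-2", "TGT-A", 1.0, "uM"),
]
summary = summarise(records)
```

A median is used here simply because it is robust to a single outlying result; a community standard might equally prefer a mean, or might keep results from different assays separate.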

2. Structure clean-up

Whilst each company is likely to have its own rules here, this project could benefit all with starting points and a shared repository of alternative “standards”. Perhaps the ChEMBL clean-up procedure could be the first to be deposited?

3. Series and related structural analyses

The most interesting opportunities for community sharing were felt to be around means to summarise data by structure, and a number of existing protocols were discussed which, if shared, could kickstart this project.

Indexing of structures by scaffolds or scaffold trees was described, as well as the related task of annotating compound series by substructure. Other related topics included clustering and activity/selectivity cliff exploration. The slightly different task of matched molecular pairs would also find a home here.
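To make the activity cliff idea concrete, here is a small sketch under stated assumptions. It treats each compound as a set of fingerprint feature identifiers (in practice these would come from a cheminformatics toolkit or a Pipeline Pilot component), measures similarity with the Tanimoto coefficient, and flags pairs that are similar in structure but far apart in potency. The compound names, feature sets and both thresholds are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def activity_cliffs(compounds, min_similarity=0.7, min_delta=2.0):
    """Return pairs that are structurally similar but differ strongly in potency.

    compounds maps a name to (feature set, p-value). Both thresholds are
    illustrative assumptions, not agreed community standards.
    """
    cliffs = []
    for name_a, name_b in combinations(sorted(compounds), 2):
        fp_a, p_a = compounds[name_a]
        fp_b, p_b = compounds[name_b]
        similarity = tanimoto(fp_a, fp_b)
        delta = abs(p_a - p_b)
        if similarity >= min_similarity and delta >= min_delta:
            cliffs.append((name_a, name_b, round(similarity, 2), delta))
    return cliffs


# Hypothetical data: CPD-1 and CPD-2 share most features but differ in potency.
compounds = {
    "CPD-1": ({1, 2, 3, 4, 5}, 8.5),
    "CPD-2": ({1, 2, 3, 4, 5, 6}, 5.0),
    "CPD-3": ({7, 8, 9}, 8.0),
}
cliffs = activity_cliffs(compounds)
```

The all-pairs loop is quadratic in the number of compounds, so for anything the size of ChEMBL a shared protocol would presumably cluster or index by scaffold first and compare only within groups.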

4. Visualisation

Visualising either aggregate data or multi-target or ADMET data was felt to be of common interest and an opportunity for development. In-depth characterisation at ChEMBL currently uses Spotfire, so this would be a timely effort to focus on reproducing the current capabilities and building a project to extend them further. Interactive or other means to encode large activity or stereochemical cliffs, any classifications introduced in the above project and simple classifications such as non-peptidic were discussed.

5. Predictions

Two categories of prediction were noted. The first was structural alerts / PAINS etc., where a stepwise approach of introducing annotations and a means for users to select a subset for filtering was felt to be ideal, and which could be a user-community-led exercise. Alerts could be shared anonymously, as could some published data sets.

The second was an Accelrys-led effort which generated excitement and would give some edge to the ChEMBL site: the naïve Bayes classifier prediction of activities of ChEMBL compounds based on ChEMBL data. Model components could be made available to PP customers to allow predictions on non-ChEMBL, proprietary compounds.
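For readers unfamiliar with the technique, the sketch below shows a generic Bernoulli naïve Bayes classifier with Laplace smoothing over sets of structural features. It is not the Accelrys implementation, whose details were not discussed. The feature names, labels and training examples are hypothetical.

```python
import math


def train(examples):
    """Fit a Bernoulli naive Bayes model from (feature set, label) pairs."""
    vocab = set().union(*(features for features, _ in examples))
    labels = {label for _, label in examples}
    model = {"vocab": vocab, "classes": {}}
    for label in labels:
        members = [features for features, lab in examples if lab == label]
        n = len(members)
        # Laplace smoothing: one pseudo-count each for presence and absence.
        probs = {f: (sum(f in m for m in members) + 1) / (n + 2) for f in vocab}
        model["classes"][label] = (math.log(n / len(examples)), probs)
    return model


def log_scores(model, features):
    """Log prior plus log likelihood of the feature pattern for each class."""
    scores = {}
    for label, (log_prior, probs) in model["classes"].items():
        total = log_prior
        for f in model["vocab"]:
            p = probs[f]
            total += math.log(p) if f in features else math.log(1.0 - p)
        scores[label] = total
    return scores


def predict(model, features):
    """Return the class with the highest log score."""
    scores = log_scores(model, features)
    return max(scores, key=scores.get)


# Hypothetical training set: feature sets labelled active or inactive at one target.
training = [
    ({"amide", "pyridine"}, "active"),
    ({"amide", "pyridine", "fluoro"}, "active"),
    ({"ester"}, "inactive"),
    ({"ester", "fluoro"}, "inactive"),
]
model = train(training)
```

Features in a query compound that were never seen in training are ignored by this sketch. Because a fitted model holds only per-feature probabilities, it could in principle be applied to proprietary compounds without those structures leaving the user's site.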