CompBio Description - Long Version

Lists of molecular entities (genes/transcripts, proteins, metabolites) identified as differentially expressed across experimental conditions or correlated with phenotypic measures can be further analyzed using the CompBio platform (GTAC@MGI, WashU School of Medicine, https://gtac-compbio-ex.wustl.edu – Academic/Non-Profit). CompBio is an artificial intelligence platform designed to interpret multi-omics data through the systematic analysis of literature related to a given list of molecular entities. CompBio first creates a biological knowledge base from the information contained within PubMed (>33 million abstracts, >3 million full-text articles) using natural language processing and conditional probability. The platform evaluates the input molecular entities against this knowledge base to identify statistically enriched biological concepts associated with them. These concepts are further analyzed for contextually relevant associations to form biological themes, representing pathways and processes. For interpretability, these themes and their interrelationships are displayed as an interactive 3-dimensional knowledge map. Themes are depicted as spheres, with size and rank determined by the overall absolute enrichment observed between the concepts and entities contained within them. Closely related themes are generally positioned proximally and may share edges, with edge thickness indicating the number of genes shared between the adjoined themes. Normalized enrichment scores (NES) and empirical p-values are derived by comparing the absolute enrichment score of a ranked theme to themes of the same rank from thousands of randomized data sets of similar size to the user's input list. Given the holistic nature of CompBio analysis, stringent fold change or p-value cutoffs are not required to eliminate random noise within the input list. Significant themes (NES > 1.2 and p-value < 0.1) can be reviewed for biological interpretation.
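The rank-matched comparison behind the NES and the empirical p-value can be illustrated with a minimal sketch. The exact formulas used by CompBio are not given here, so the code assumes that the NES is the observed score divided by the mean score of same-rank themes from the randomized data sets, and that the p-value is the add-one fraction of randomized scores at or above the observed score. The function names and parameters are hypothetical and are not part of CompBio.

```python
def theme_significance(observed_score, null_scores):
    """Return (nes, p_value) for one ranked theme.

    observed_score: absolute enrichment score of the theme at a given rank.
    null_scores: scores of the theme at the same rank in each randomized
        data set (assumed to be positive on average).
    """
    if not null_scores:
        raise ValueError("at least one randomized score is required")
    null_mean = sum(null_scores) / len(null_scores)
    if null_mean <= 0:
        raise ValueError("mean randomized score must be positive")

    # Assumed normalization: ratio of observed score to the null mean.
    nes = observed_score / null_mean

    # Assumed empirical p-value with the common add-one correction, so that
    # the estimate is never exactly zero.
    exceed = sum(1 for s in null_scores if s >= observed_score)
    p_value = (exceed + 1) / (len(null_scores) + 1)
    return nes, p_value


def is_significant(nes, p_value, nes_cutoff=1.2, p_cutoff=0.1):
    """Apply the thresholds quoted in the text (NES > 1.2 and p-value < 0.1)."""
    return nes > nes_cutoff and p_value < p_cutoff


if __name__ == "__main__":
    nes, p = theme_significance(10.0, [1.0, 2.0, 3.0])
    print(nes, p, is_significant(nes, p))
```

With only three randomized scores the smallest attainable p-value is 0.25, which shows why thousands of randomized data sets are needed before a threshold such as 0.1 becomes meaningful.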
The annotation of themes is also a fully contextualized process. In short, CompBio assesses the most relevant annotations for each theme based not only on the concepts and their associated enrichments within that theme, but also on the context provided by related concepts in adjoining themes. This ensures that the automated annotation process is aware of the experiment as a whole. Furthermore, since the assembly of the themes may identify multiple specific components of closely related biological processes, each theme can be identified by up to three biological labels to capture these related components comprehensively.

The CompBio knowledge maps can be further compared using the Assertion Engine. The Assertion Engine is a machine-learning tool that identifies the preservation of concepts, as well as their associations with neighboring concepts, within complex biological maps. The result is a 2-dimensional map that is a projection of strongly preserved concepts and relationships (shown as edges between concepts). For a concept relationship to form, two conditions must be met. First, the shared concepts must be present in both CompBio knowledge maps being compared. Second, they must be interconnected with other concepts, multiple layers deep, with sufficient similarity to be considered a “signal” event. P-values for global-level association signals are computed empirically through comparison of signal-containing CompBio maps to tens of thousands of randomized data sets.
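The two conditions for a concept relationship can be made concrete with a toy sketch. This is not the Assertion Engine's algorithm, which is a machine-learning tool whose internals are not described here. The sketch assumes each knowledge map is an undirected graph stored as a dictionary of concept to neighboring concepts, measures "multiple layers deep" as a fixed number of hops, and uses Jaccard similarity with an arbitrary cutoff as a stand-in for "sufficient similarity". All names and default values are hypothetical.

```python
def neighborhood(graph, node, depth):
    """Concepts reachable from `node` within `depth` hops, excluding `node`."""
    seen = {node}
    frontier = {node}
    for _ in range(depth):
        frontier = {n for f in frontier for n in graph.get(f, ())} - seen
        seen |= frontier
    seen.discard(node)
    return seen


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def preserved_edges(map_a, map_b, depth=2, min_similarity=0.5):
    """Edges present in both maps whose endpoints have similar neighborhoods.

    Condition 1: both concepts, and the edge between them, occur in both maps.
    Condition 2 (assumed proxy): each endpoint's multi-hop neighborhood is
    similar across the two maps.
    """
    shared = set(map_a) & set(map_b)
    edges = set()
    for u in shared:
        for v in map_a[u]:
            # u < v keeps each undirected edge once.
            if v in shared and v in map_b[u] and u < v:
                sim_u = jaccard(neighborhood(map_a, u, depth),
                                neighborhood(map_b, u, depth))
                sim_v = jaccard(neighborhood(map_a, v, depth),
                                neighborhood(map_b, v, depth))
                if min(sim_u, sim_v) >= min_similarity:
                    edges.add((u, v))
    return edges


if __name__ == "__main__":
    # Two small maps that share the chain A-B-C but differ at the end.
    map_a = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
    map_b = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "E"}, "E": {"C"}}
    print(sorted(preserved_edges(map_a, map_b)))
```

Raising `min_similarity` above 0.5 removes both edges in this example, because the neighborhoods of B and C differ between the two maps by one concept each.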