PhD in Informatics Seminar

Evolving meaning for supervised learning in complex biomedical domains using knowledge graphs

Transmissão através de Videoconferência

Por Rita Sousa.

In recent years, the explosion in complexity and heterogeneity of data has motivated a new paradigm, where millions of semantically-described entities are available in knowledge graphs (KGs). Different KGs have been exploited in a wide variety of data mining and machine learning tasks, namely the prediction of specific relations between entities that correspond to KG instances but whose relationship is not encoded in the graph. This scenario has various bioinformatics applications, such as predicting protein-protein interactions, drug-drug interactions or gene-disease associations.

Many of the existing KG-based approaches for machine learning use KGs for generating static semantic representations, which are then used as features. These semantic representations can be considered static since they consider the full graph, blind to the fact that unnecessary information for representations can introduce noise. In applications where the target is encoded in the KG, this problem is mitigated. However, when the classification targets are not a part of the KG, representations cannot be trained on the targets. The problem is exacerbated in complex domains, such as the biomedical, where KGs represent multiple views (or semantic aspects) over the underlying data, some of which may be less relevant to train the model. This brings up the challenge of tailoring the semantic representation of the KGs entities to a specific goal.

This Ph.D. project aims to address this challenge by developing novel machine learning-based approaches to learn suitable semantic representations of data objects extracted from KGs to support specific supervised learning tasks in bioinformatics applications. The novel approaches are anchored in a framework that integrates semantic representation approaches and learning approaches, and allows a comparative evaluation using benchmarks. This framework was successfully applied to protein-protein interaction prediction with significant improvements over manually defined static semantic representations, and to learn similarity functions adapted to different biological perspectives.

More information: