
xai-proteins

Insights into the inner workings of transformer models for protein function prediction

About

Fine-tuning pretrained universal protein language models on downstream tasks yields large benefits in protein function prediction. At the same time, the neural networks used are notorious for having millions and sometimes billions of trainable parameters, which makes it very difficult to interpret the decision-making logic or strategy of these complex models.

Consequently, explainable machine learning is starting to gain traction in proteomics as well. We explore how explainability methods can help shed light on the inner workings of transformers for protein function prediction.

Attribution methods, such as integrated gradients, make it possible to identify the features in the input space that the model apparently focuses on, because these features turn out to be relevant for the model's final classification decision. We extended integrated gradients so that latent representations inside transformers can be inspected as well (separately for each head and layer).
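As a rough illustration of the basic idea behind integrated gradients, the following minimal sketch attributes the output of a toy model to its input features. It covers only the standard input-space formulation, not the per-head and per-layer extension described above, and it is not the code from this repository. The names `integrated_gradients`, `model`, `model_grad`, and the weight vector `w` are hypothetical and chosen for this example only.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output with respect to its input.
    """
    # Interpolation coefficients at the midpoints of `steps` equal intervals in [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradient along the straight path from the baseline to x.
    avg_grad = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        avg_grad += grad_fn(baseline + alpha * (x - baseline))
    avg_grad /= steps
    # Scale the averaged gradient by the distance from the baseline.
    return (x - baseline) * avg_grad


# Hypothetical toy model: a logistic score over a linear combination of features.
w = np.array([2.0, -1.0, 0.0, 0.5])


def model(x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))


def model_grad(x):
    # Analytic gradient of the logistic score with respect to the input.
    p = model(x)
    return p * (1.0 - p) * w


x = np.array([1.0, 1.0, 1.0, 1.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(model_grad, x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(attributions.sum() - (model(x) - model(baseline)))
```

In this toy setting, the feature with zero weight receives zero attribution, and the attributions sum (up to the approximation error of the Riemann sum) to the difference between the model output at the input and at the baseline.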

To find out whether the identified relevant sequence regions match expectations informed by knowledge from biology or chemistry, we combined this method with a subsequent statistical analysis across proteins, in which we correlated the obtained relevance with annotations of interest from sequence databases. In this way, we identified heads inside the transformer architecture that are specialized for specific protein function prediction tasks.
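The correlation step can be sketched as follows, assuming that relevance is available as one value per residue and that an annotation is encoded as a binary per-residue mask. The Pearson coefficient between a continuous and a binary vector (the point-biserial correlation) is used here as one plausible choice; it is an assumption made for illustration and not necessarily the exact statistic used in the article. The function name and the toy data are hypothetical.

```python
import numpy as np


def relevance_annotation_correlation(relevance, annotation):
    """Pearson (point-biserial) correlation between relevance and a binary mask."""
    relevance = np.asarray(relevance, dtype=float)
    annotation = np.asarray(annotation, dtype=float)
    # The coefficient is undefined if either vector is constant.
    if relevance.std() == 0 or annotation.std() == 0:
        return float("nan")
    return float(np.corrcoef(relevance, annotation)[0, 1])


# Hypothetical toy data: per-residue relevance and binary annotation masks
# (1 = residue lies inside the annotated region) for two short proteins.
proteins = [
    ([0.1, 0.9, 0.8, 0.0, 0.1], [0, 1, 1, 0, 0]),
    ([0.7, 0.1, 0.2, 0.6], [1, 0, 0, 1]),
]

# One coefficient per protein, then a summary across proteins.
coefficients = [relevance_annotation_correlation(r, a) for r, a in proteins]
mean_coefficient = float(np.nanmean(coefficients))
```

Computing one coefficient per protein and then summarizing across proteins keeps sequences of different lengths separate; a significance test on the per-protein coefficients would be a natural next step.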

The two folders of this repository are dedicated to the explainability analysis for Gene Ontology (GO) term prediction and Enzyme Commission (EC) number prediction (see the GO and EC README files).

Publication

You can find more information in our article:

Markus Wenzel, Erik Grüner, Nils Strodthoff (2024). Insights into the inner workings of transformer models for protein function prediction, Bioinformatics, btae031.

@article{10.1093/bioinformatics/btae031,
  author = {Wenzel, Markus and Grüner, Erik and Strodthoff, Nils},
  title = "{Insights into the inner workings of transformer models for protein function prediction}",
  journal = {Bioinformatics},
  pages = {btae031},
  year = {2024},
  month = {01},
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btae031},
  url = {https://doi.org/10.1093/bioinformatics/btae031}
}

Related works

If you are interested in this topic, you are welcome to have a look at our related papers:

Datasets

EC and GO data were preprocessed as detailed on https://github.com/nstrodt/UDSMProt with https://github.com/nstrodt/UDSMProt/blob/master/code/create_datasets.sh, resulting in six files for EC40 and EC50 on levels L0, L1, and L2, and in two files for GO "2016" (a.k.a. "temporalsplit") and GO "CAFA3". Preprocessed data can be accessed here (EC) and here (GO).

Authors

Markus Wenzel, Erik Grüner, Nils Strodthoff (2024)

