Image:
Torger Grytå/Petter Bjørklund

VI Seminar #73: Layer-wise Analysis of Transformer Models in Vision and Audio Processing



Presenter: Teresa Dorszewski, PhD student at the Technical University of Denmark (DTU Compute)

Abstract: Recent advancements in transformer models have revolutionized the fields of vision and audio processing. However, a deeper understanding of how and where these models process information remains limited. In this talk, I will present a layer-wise analysis of Vision Transformer models (ViTs) and speech representation models, providing a detailed understanding of state-of-the-art transformer architectures. This analysis will highlight how such insights can lead to optimized models in terms of performance and efficiency.

In the image domain, I will share novel findings on the emergence of visual concepts and the progressive complexity of these concepts across layers in ViTs. In the audio domain, I will demonstrate how a layer-wise understanding can be leveraged to adapt transformer models for specific tasks, resulting in significantly smaller and faster models.

In compliance with GDPR consent requirements, presentations given in a Visual Intelligence context may be recorded with the consent of the speaker. All recordings are edited to remove the faces, names, and voices of other participants; questions and comments from the audience will therefore not appear in the recording. With the speaker's freely given consent, recorded presentations may be posted on the Visual Intelligence YouTube channel.

This seminar is open to members of the consortium. If you would like to participate as a guest, please sign up.

Sign up here