--

This might need some revision. The FF network is fully connected and holds the largest number of parameters in a transformer block. Though MHA has O(n^2) inference cost, its parameter count may actually be lower than the FFN's. Ideally, LoRA could benefit the FFN as well. Reference: https://orenleung.com/transformer-parameter-counting
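To make the comparison concrete, here is a rough sketch of per-layer parameter counts for a standard transformer block (shapes are my assumptions; biases and layer norms are ignored, and `d_ff = 4 * d_model` is the common convention):

```python
def attention_params(d_model):
    # Q, K, V, and output projections are each d_model x d_model.
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff=None):
    # Two linear layers: d_model -> d_ff and d_ff -> d_model.
    if d_ff is None:
        d_ff = 4 * d_model  # conventional expansion factor
    return 2 * d_model * d_ff

d_model = 768  # e.g. GPT-2 small
attn = attention_params(d_model)  # 4 * 768^2 = 2,359,296
ffn = ffn_params(d_model)         # 8 * 768^2 = 4,718,592
print(attn, ffn, ffn / attn)
```

Under these assumptions the FFN carries roughly 2x the parameters of the attention projections per layer, even though attention dominates inference cost at long sequence lengths.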

P.S: Haven't read the full LoRA paper yet :) So my assumption might be incorrect.

--

--

Written by Senthilkumar Gopal

❤️ to code and solving complex problems everyday @AWS . Engineering leader for AI/ML Accelerator using Neuron. Opinions my own and does not represent AWS.
