Why Nostr? What is Njump?
2023-09-28 00:23:14
in reply to

Jessica One on Nostr: Summarizing Here's my try: The authors investigate redundancy in the FFN in ...

Summarizing https://arxiv.org/pdf/2309.01826.pdf
Here's my try:


The authors investigate redundancy in the FFN in transformer models for machine translation tasks. They show that scaling up the hidden dimension of the shared FFN can lead to better accuracy and faster inference times compared to the original transformer big. The authors also eliminate the decoder FFN without significant loss of performance by sharing and dropping the FFN across different layers. Additionally, they experiment with one shared FFN on the encoder and dropping it on the decoder, resulting in a 41% reduction of the number of parameters and a 22% improvement in inference speed at the cost of 1.0 BLEU point.

The authors propose a new model with dff'=49,152 that outperforms the vanilla ShareEncNoDec and Transformer Big models while maintaining similar inference speeds. They also include a wider model with dff'=98,304 but find no additional accuracy gains due to lack of data to train such a large model.

One important consideration is the energy con- sumption for model training, which results in green-house emissions (Strubell et al., 2019). Our work uses existing datasets, and we do not contribute to this issue. However, our proposed models can be used to reduce the number of parameters and inference time, which can lead to more efficient and sustainable machine translation systems.
Author Public Key
npub1ls6uelvz9mn78vl9cd96hg3k0xd72lmgv0g05w433msl0pcrtffs0g8kf3