Papers - Custom Layers
• Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning (arXiv:2310.20587)
• JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention (arXiv:2310.00535)
• Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (arXiv:2307.09458)
• The Impact of Depth and Width on Transformer Language Model Generalization (arXiv:2310.19956)
• Veagle: Advancements in Multimodal Representation Learning (arXiv:2403.08773)
• Hash Layers For Large Sparse Models (arXiv:2106.04426)
• Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers (arXiv:2311.10642)
• DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging (arXiv:2402.02622)
• The Unreasonable Ineffectiveness of the Deeper Layers (arXiv:2403.17887)
• Lumiere: A Space-Time Diffusion Model for Video Generation (arXiv:2401.12945)
• RWKV: Reinventing RNNs for the Transformer Era (arXiv:2305.13048)
• Condition-Aware Neural Network for Controlled Image Generation (arXiv:2404.01143)
• Locating and Editing Factual Associations in GPT (arXiv:2202.05262)
• MLP Can Be A Good Transformer Learner (arXiv:2404.05657)
• Toward a Better Understanding of Fourier Neural Operators: Analysis and Improvement from a Spectral Perspective (arXiv:2404.07200)
• MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv:2402.15627)
• Scaling MLPs: A Tale of Inductive Bias (arXiv:2306.13575)
• GLIGEN: Open-Set Grounded Text-to-Image Generation (arXiv:2301.07093)
• All you need is a good init (arXiv:1511.06422)
• LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710)
• Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3 (arXiv:2405.00664)
• pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (arXiv:2403.07809)
• TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (arXiv:2410.23168)
• Augmenting Self-attention with Persistent Memory (arXiv:1907.01470)
• arXiv:2412.09764