Attention-Only Transformers and Implementing MLPs with Attention Heads
Authors:
Robert Huben,
Valerie Morris
Abstract:
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
Submitted 15 September, 2023;
originally announced September 2023.
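One way to see why SiLU fits naturally into an attention head, as the abstract above claims, is that softmax over the two scores (x, 0) puts weight sigmoid(x) on the first position, so an attention-weighted sum of the values (x, 0) equals x * sigmoid(x) = SiLU(x). The sketch below checks that identity numerically. It is a minimal illustration only, not the authors' construction: the two-position setup and the function names `silu` and `two_position_attention` are assumptions made for this example.

```python
import math


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def two_position_attention(x: float) -> float:
    """Scalar attention over two positions (hypothetical setup).

    Position 0 has score x and value x; position 1 has score 0 and value 0.
    Softmax over (x, 0) gives weight sigmoid(x) to position 0, so the
    weighted sum of values is x * sigmoid(x).
    """
    scores = [x, 0.0]
    values = [x, 0.0]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
        print(x, silu(x), two_position_attention(x))
```

The two columns printed for each x should agree to floating-point precision, which is the scalar core of the claim that a head with internal dimension 1 can reproduce a SiLU neuron.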
Anomaly Detection in Paleoclimate Records using Permutation Entropy
Authors:
Joshua Garland,
Tyler R. Jones,
Michael Neuder,
Valerie Morris,
James W. C. White,
Elizabeth Bradley
Abstract:
Permutation entropy techniques can be useful in identifying anomalies in paleoclimate data records, including noise, outliers, and post-processing issues. We demonstrate this using weighted and unweighted permutation entropy of water-isotope records in a deep polar ice core. In one region of these isotope records, our previous calculations revealed an abrupt change in the complexity of the traces: specifically, in the amount of new information that appeared at every time step. We conjectured that this effect was due to noise introduced by an older laboratory instrument. In this paper, we validate that conjecture by re-analyzing a section of the ice core using a more advanced version of the laboratory instrument. The anomalous noise levels are absent from the permutation entropy traces of the new data. In other sections of the core, we show that permutation entropy techniques can be used to identify anomalies in the raw data that are not associated with climatic or glaciological processes, but rather with effects occurring during field work, laboratory analysis, or data post-processing. These examples make it clear that permutation entropy is a useful forensic tool for identifying sections of data that require targeted re-analysis, and can even be useful in guiding that analysis.
Submitted 29 November, 2018; v1 submitted 3 November, 2018;
originally announced November 2018.
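For readers unfamiliar with the quantity named in the abstract above, the following is a minimal sketch of the standard unweighted permutation entropy: slide a window over the series, record the ordinal pattern (rank order) of each window, and take the Shannon entropy of the pattern frequencies. It is not the authors' code and does not cover the weighted variant; the function name and the parameters `order`, `delay`, and `normalize` are assumptions chosen for this example.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Unweighted permutation entropy of a 1-D sequence, in bits.

    order: number of points per window (must be at least 2).
    delay: spacing between the points of a window.
    normalize: divide by log2(order!) so the result lies in [0, 1].
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for this order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + j * delay] for j in range(order)]
        # Ordinal pattern: the indices that sort the window.
        # sorted() is stable, so ties are broken by position.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1

    probs = [count / n_windows for count in patterns.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    if normalize:
        entropy /= math.log2(math.factorial(order))
    return entropy


if __name__ == "__main__":
    ramp = list(range(20))     # a single ordinal pattern
    zigzag = [0, 1] * 5 + [0]  # two patterns, equally frequent
    print(permutation_entropy(ramp, order=3))
    print(permutation_entropy(zigzag, order=2))
```

A monotone ramp contains only one ordinal pattern and so has zero entropy, while a strictly alternating series at order 2 splits evenly between two patterns. Computing this quantity over a sliding section of a record is one plausible way to look for the kind of abrupt complexity changes the abstract describes.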