rau.models¶
This module contains implementations of neural network architectures.
Notes on the Transformer Architecture¶
The following notes apply to all instances of the transformer architecture [Vaswani et al., 2017].
They use pre-norm instead of post-norm [Nguyen and Salazar, 2019, Wang et al., 2019].
Dropout is applied in the same places as in Vaswani et al. [2017] and also to the hidden units of feedforward sublayers and the attention probabilities of the attention mechanism.
They use the sinusoidal positional encodings as originally proposed by Vaswani et al. [2017].
- rau.models.get_unidirectional_transformer_encoder(input_vocabulary_size, output_vocabulary_size, tie_embeddings, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings=None, positional_encoding_cacher=None, tag=None)¶
Construct a causally-masked transformer encoder (also called a “decoder-only” model). This can be used as a language model.
It includes a scaled input embedding layer with sinusoidal positional encodings and an output layer for predicting logits.
- Parameters:
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - tie_embeddings (bool) – Whether to tie the input and output embeddings.
  - num_layers (int) – Number of layers.
  - d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).
  - num_heads (int) – Number of attention heads per layer.
  - feedforward_size (int) – Number of hidden units in each feedforward sublayer.
  - dropout (float) – Dropout rate used throughout the transformer.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
  - shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.
  - positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.
  - tag (str | None) – An optional tag to add to the inner UnidirectionalTransformerEncoderLayers for argument routing.
- Return type:
- Returns:
A module. Unless tag is given, it accepts the same arguments as UnidirectionalTransformerEncoderLayers.
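For example, here is a minimal sketch of constructing a decoder-only language model and computing logits for a batch of token IDs. The vocabulary size, hyperparameters, and tensor shapes are illustrative assumptions; the call accepts the same arguments as UnidirectionalTransformerEncoderLayers.forward() documented below.

    import torch
    from rau.models import get_unidirectional_transformer_encoder

    model = get_unidirectional_transformer_encoder(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        tie_embeddings=True,
        num_layers=4,
        d_model=256,
        num_heads=8,
        feedforward_size=1024,
        dropout=0.1,
        use_padding=False,
    )
    # A batch of 8 sequences, each with 32 token IDs.
    tokens = torch.randint(0, 1000, (8, 32))
    # The module accepts the same arguments as
    # UnidirectionalTransformerEncoderLayers.forward(); defaults are used here.
    logits = model(tokens)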
- class rau.models.UnidirectionalTransformerEncoderLayers¶
Bases:
Unidirectional
A causally-masked transformer encoder without input or output layers.
- class State¶
Bases:
State
State(encoder: 'UnidirectionalTransformerEncoderLayers', previous_inputs: torch.Tensor, is_padding_mask: torch.Tensor | None)
- __init__(encoder, previous_inputs, is_padding_mask)¶
- forward(input_sequence, include_first, return_state=False, return_output=True)¶
- Return type:
- __init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)¶
- forward(input_sequence, is_padding_mask=None, initial_state=None, return_state=False, include_first=True)¶
- Parameters:
  - is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the input are padding tokens and should be ignored. Since the model is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{input length}\). A value of true indicates that a token is padding.
- Return type:
- rau.models.get_transformer_encoder(vocabulary_size, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings, positional_encoding_cacher, tag=None)¶
Construct a bidirectional transformer encoder.
It includes a scaled input embedding layer with sinusoidal positional encodings but no separate output layer. Its outputs are the outputs of layer norm applied to the outputs of the last layer.
- Parameters:
  - vocabulary_size (int) – The size of the input vocabulary.
  - num_layers (int) – Number of layers.
  - d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).
  - num_heads (int) – Number of attention heads per layer.
  - feedforward_size (int) – Number of hidden units in each feedforward sublayer.
  - dropout (float) – Dropout rate used throughout the transformer.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
  - shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.
  - positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.
  - tag (str | None) – An optional tag to add to the inner TransformerEncoderLayers for argument routing.
- Return type:
- Returns:
A module. Unless tag is given, it accepts the same arguments as TransformerEncoderLayers.
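A sketch of constructing the encoder and applying it to a padded minibatch. The hyperparameters and shapes are illustrative; the mask convention (true marks padding) follows the is_padding_mask description for TransformerEncoderLayers.forward() below.

    import torch
    from rau.models import get_transformer_encoder

    encoder = get_transformer_encoder(
        vocabulary_size=1000,
        num_layers=4,
        d_model=256,
        num_heads=8,
        feedforward_size=1024,
        dropout=0.1,
        use_padding=True,
        shared_embeddings=None,
        positional_encoding_cacher=None,
    )
    tokens = torch.randint(0, 1000, (8, 32))
    lengths = torch.randint(1, 33, (8,))
    # Bool mask of size batch size x input length; true marks padding positions.
    is_padding = torch.arange(32)[None, :] >= lengths[:, None]
    hidden = encoder(tokens, is_padding_mask=is_padding)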
- class rau.models.TransformerEncoderLayers¶
Bases:
Module
A bidirectional transformer encoder without input or output layers.
- __init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)¶
- forward(source_sequence, is_padding_mask=None)¶
- Parameters:
  - is_padding_mask (Tensor | None) – A bool tensor indicating which positions in the input should be treated as padding symbols and ignored. This always needs to be used if you are using a minibatch with sequences of different lengths. Its size should be \(\text{batch size} \times \text{input length}\). A value of true indicates that a token is padding.
- Return type:
- rau.models.get_transformer_decoder(input_vocabulary_size, output_vocabulary_size, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings, positional_encoding_cacher, tag=None)¶
Construct a transformer decoder with cross-attention.
It includes a scaled input embedding layer with sinusoidal positional encodings and an output layer for predicting logits.
- Parameters:
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - num_layers (int) – Number of layers.
  - d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).
  - num_heads (int) – Number of attention heads per layer.
  - feedforward_size (int) – Number of hidden units in each feedforward sublayer.
  - dropout (float) – Dropout rate used throughout the transformer.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
  - shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.
  - positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.
  - tag (str | None) – An optional tag to add to the inner TransformerDecoderLayers for argument routing.
- Return type:
- Returns:
A module. Unless tag is given, it accepts the same arguments as TransformerDecoderLayers.
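A sketch of constructing the decoder and running it over a batch of target-side inputs together with a batch of encoder outputs. The shapes and hyperparameters are assumptions, the encoder outputs here are stand-in tensors, and the call follows the forward() signature of TransformerDecoderLayers documented below.

    import torch
    from rau.models import get_transformer_decoder

    decoder = get_transformer_decoder(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        num_layers=4,
        d_model=256,
        num_heads=8,
        feedforward_size=1024,
        dropout=0.1,
        use_padding=True,
        shared_embeddings=None,
        positional_encoding_cacher=None,
    )
    target_tokens = torch.randint(0, 1000, (8, 20))
    # Stand-ins for real encoder outputs and the corresponding padding mask.
    encoder_outputs = torch.zeros(8, 32, 256)
    encoder_is_padding = torch.zeros(8, 32, dtype=torch.bool)
    logits = decoder(
        target_tokens,
        encoder_sequence=encoder_outputs,
        encoder_is_padding_mask=encoder_is_padding,
    )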
- class rau.models.TransformerDecoderLayers¶
Bases:
Unidirectional
A transformer decoder without input or output layers.
- class State¶
Bases:
State
State(decoder: 'TransformerDecoderLayers', encoder_sequence: torch.Tensor, input_is_padding_mask: torch.Tensor | None, encoder_is_padding_mask: torch.Tensor | None, previous_inputs: torch.Tensor)
- __init__(decoder, encoder_sequence, input_is_padding_mask, encoder_is_padding_mask, previous_inputs)¶
- forward(input_sequence, include_first, return_state=False, return_output=True)¶
- Return type:
- decoder: TransformerDecoderLayers¶
- __init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)¶
- forward(input_sequence, encoder_sequence, input_is_padding_mask=None, encoder_is_padding_mask=None, initial_state=None, return_state=False, include_first=True)¶
- Parameters:
  - encoder_sequence (Tensor) – The output sequence of the encoder.
  - input_is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the decoder input correspond to padding symbols that should be ignored. Since the decoder is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{decoder input length}\). A value of true indicates that a token is padding.
  - encoder_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the encoder input correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with input sequences of different lengths. Its size should be \(\text{batch size} \times \text{encoder input length}\). A value of true indicates that a token is padding.
- Return type:
- rau.models.get_transformer_encoder_decoder(source_vocabulary_size, target_input_vocabulary_size, target_output_vocabulary_size, tie_embeddings, num_encoder_layers, num_decoder_layers, d_model, num_heads, feedforward_size, dropout, use_source_padding=True, use_target_padding=True)¶
Construct a transformer encoder-decoder.
It includes scaled input embedding layers with sinusoidal positional encodings in the encoder and decoder and an output layer in the decoder for predicting logits.
- Parameters:
  - source_vocabulary_size (int) – The size of the vocabulary used by the encoder.
  - target_input_vocabulary_size (int) – The size of the input vocabulary used by the decoder.
  - target_output_vocabulary_size (int) – The size of the output vocabulary used by the decoder.
  - tie_embeddings (bool) – Whether to tie the input and output embeddings.
  - num_encoder_layers (int) – Number of layers to use in the encoder.
  - num_decoder_layers (int) – Number of layers to use in the decoder.
  - d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).
  - num_heads (int) – Number of attention heads per layer.
  - feedforward_size (int) – Number of hidden units in each feedforward sublayer.
  - dropout (float) – Dropout rate used throughout the transformer.
  - use_source_padding (bool) – Whether to ensure that the embedding matrix for the encoder is big enough to accommodate an index for a reserved padding symbol.
  - use_target_padding (bool) – Whether to ensure that the embedding matrix for the decoder is big enough to accommodate an index for a reserved padding symbol.
- Return type:
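A sketch of constructing an encoder-decoder model; all hyperparameter and vocabulary values here are illustrative assumptions.

    from rau.models import get_transformer_encoder_decoder

    model = get_transformer_encoder_decoder(
        source_vocabulary_size=1000,
        target_input_vocabulary_size=1200,
        target_output_vocabulary_size=1200,
        tie_embeddings=True,
        num_encoder_layers=6,
        num_decoder_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
        use_source_padding=True,
        use_target_padding=True,
    )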
- class rau.models.TransformerEncoderDecoder¶
Bases:
Module
A transformer encoder-decoder.
- __init__(encoder, decoder)¶
- forward(source_sequence, target_sequence, source_is_padding_mask=None, target_is_padding_mask=None)¶
- Parameters:
  - source_sequence (Tensor) – A batch of source sequences. A tensor of ints of size \(\text{batch size} \times \text{source length}\).
  - target_sequence (Tensor) – A batch of target sequences. A tensor of ints of size \(\text{batch size} \times \text{target length}\).
  - source_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the source correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with source sequences of different lengths. Its size should be \(\text{batch size} \times \text{source length}\). A value of true indicates that a token is padding.
  - target_is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the target correspond to padding symbols that should be ignored. Since the decoder is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{target length}\). A value of true indicates that a token is padding.
- Return type:
- Returns:
The output logits of the decoder, of size \(\text{batch size} \times \text{target length} \times \text{target vocabulary size}\).
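Continuing the construction sketch above, a forward pass on a padded minibatch might look like the following; the shapes and the way the source mask is built are assumptions.

    import torch

    source = torch.randint(0, 1000, (8, 32))
    target = torch.randint(0, 1200, (8, 20))
    source_lengths = torch.randint(1, 33, (8,))
    # True marks padding positions; size is batch size x source length.
    source_is_padding = torch.arange(32)[None, :] >= source_lengths[:, None]
    logits = model(source, target, source_is_padding_mask=source_is_padding)
    # logits has size batch size x target length x target vocabulary size.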
- initial_decoder_state(source_sequence, source_is_padding_mask)¶
Given a batch of source sequences, compute the initial state of the decoder.
- Parameters:
  - source_sequence (Tensor) – A batch of source sequences. A tensor of ints of size \(\text{batch size} \times \text{source length}\).
  - source_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the source correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with source sequences of different lengths. Its size should be \(\text{batch size} \times \text{source length}\). A value of true indicates that a token is padding.
- Return type:
- Returns:
The initial state of the decoder, conditioned on the source sequences.
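For step-by-step decoding, the state returned here is the starting point. A minimal sketch, reusing the source batch and mask from the sketch above:

    decoder_state = model.initial_decoder_state(source, source_is_padding)
    # The returned state can then be advanced one target token at a time
    # through the decoder's Unidirectional state API.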
- class rau.models.SinusoidalPositionalEncodingCacher¶
Bases:
Module
A module that caches a tensor of sinusoidal positional encodings.
This module can dynamically resize the cached tensor as needed, but it is highly recommended to set a maximum size up-front at the beginning of your program using get_encodings() (for example, by looping through the training data to find the longest sequence) and then disable dynamic resizing using set_allow_reallocation() to avoid CUDA memory fragmentation. Otherwise, you may run out of memory in a way that is very hard to debug.
- __init__()¶
- get_encodings(sequence_length, d_model)¶
Get a tensor of positional encodings of the requested size.
- set_allow_reallocation(value)¶
Set whether the cached tensor may be reallocated dynamically based on requested sizes. This is enabled by default. If it is disabled, requesting a size bigger than the currently cached tensor will cause an error. After setting a maximum size with get_encodings(), the advantage of disabling reallocation is that requests for bigger sizes (which would imply that the way you determined the maximum length has a bug) fail loudly rather than silently causing memory fragmentation.
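A minimal sketch of the recommended usage pattern; the maximum sequence length and d_model values are assumptions that you would determine from your own data and model.

    from rau.models import SinusoidalPositionalEncodingCacher

    cacher = SinusoidalPositionalEncodingCacher()
    # Allocate the cache once, up-front, for the longest sequence you expect.
    cacher.get_encodings(512, 256)
    # Afterwards, requests for bigger sizes raise an error instead of
    # silently reallocating and fragmenting CUDA memory.
    cacher.set_allow_reallocation(False)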
- class rau.models.SimpleRNN¶
Bases:
UnidirectionalBuiltinRNN
A simple RNN (also known as an Elman RNN) wrapped in the Unidirectional API.
- __init__(input_size, hidden_units, layers=1, dropout=None, nonlinearity='tanh', bias=True, learned_initial_state=True, use_extra_bias=False)¶
- Parameters:
  - input_size (int) – The size of the input vectors to the RNN.
  - hidden_units (int) – The number of hidden units in each layer.
  - layers (int) – The number of layers in the RNN.
  - dropout (float | None) – The amount of dropout applied in between layers. If layers is 1, then this value is ignored.
  - nonlinearity (Literal['tanh', 'relu']) – The non-linearity applied to hidden units. Either 'tanh' or 'relu'.
  - bias (bool) – Whether to use bias terms.
  - learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.
  - use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.
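For example, a construction sketch (the sizes are illustrative):

    from rau.models import SimpleRNN

    rnn = SimpleRNN(
        input_size=128,
        hidden_units=256,
        layers=2,
        dropout=0.1,
        nonlinearity='tanh',
    )
    # rnn can now be run over sequences of input vectors via the
    # Unidirectional API.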
- class rau.models.LSTM¶
Bases:
UnidirectionalBuiltinRNN
An LSTM wrapped in the Unidirectional API.
- __init__(input_size, hidden_units, layers=1, dropout=None, bias=True, learned_initial_state=True, use_extra_bias=False)¶
- Parameters:
  - input_size (int) – The size of the input vectors to the LSTM.
  - hidden_units (int) – The number of hidden units in each layer.
  - layers (int) – The number of layers in the LSTM.
  - dropout (float | None) – The amount of dropout applied in between layers. If layers is 1, then this value is ignored.
  - bias (bool) – Whether to use bias terms.
  - learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the tanh activation function. If false, the initial state will be zeros. The initial memory cell is always zeros.
  - use_extra_bias (bool) – The built-in PyTorch implementation of the LSTM includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.
- rau.models.get_simple_rnn_language_model(input_vocabulary_size, output_vocabulary_size, hidden_units, layers=1, dropout=0, nonlinearity='tanh', bias=True, learned_initial_state=True, use_extra_bias=False, use_padding=False)¶
Construct a simple RNN language model.
The embedding size and input size are assumed to be the same as the hidden size.
The input and output embeddings are tied.
- Parameters:
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - hidden_units (int) – The number of hidden units in each layer.
  - layers (int) – The number of layers in the RNN.
  - dropout (float) – The amount of dropout applied to inputs, in between layers, and to the last layer outputs.
  - nonlinearity (Literal['tanh', 'relu']) – The non-linearity applied to hidden units. Either 'tanh' or 'relu'.
  - bias (bool) – Whether to use bias terms.
  - learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.
  - use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
- Return type:
- Returns:
A simple RNN language model with input and output layers. It accepts the same arguments as SimpleRNN.
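A sketch of constructing the language model and computing logits over a batch of token IDs; the vocabulary size and shapes are assumptions, and the call uses the default arguments of the model's forward(), which accepts the same arguments as SimpleRNN.

    import torch
    from rau.models import get_simple_rnn_language_model

    model = get_simple_rnn_language_model(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        hidden_units=256,
        layers=2,
        dropout=0.1,
    )
    tokens = torch.randint(0, 1000, (8, 32))
    logits = model(tokens)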
- rau.models.get_lstm_language_model(input_vocabulary_size, output_vocabulary_size, hidden_units, layers=1, dropout=0, bias=True, learned_initial_state=True, use_extra_bias=False, use_padding=False)¶
Construct an LSTM language model.
The embedding size and input size are assumed to be the same as the hidden size.
The input and output embeddings are tied.
- Parameters:
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - hidden_units (int) – The number of hidden units in each layer.
  - layers (int) – The number of layers in the RNN.
  - dropout (float) – The amount of dropout applied to inputs, in between layers, and to the last layer outputs.
  - bias (bool) – Whether to use bias terms.
  - learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.
  - use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
- Return type:
- Returns:
An LSTM language model with input and output layers. It accepts the same arguments as LSTM.
- rau.models.get_rnn_language_model(recurrence, input_vocabulary_size, output_vocabulary_size, hidden_units, dropout=0, use_padding=False)¶
Wrap any recurrent network with input and output layers to make it a language model.
- Parameters:
  - recurrence (Unidirectional) – A Unidirectional module representing the recurrent part of the language model that will be wrapped with input and output layers.
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - hidden_units (int) – The size of the output vectors from recurrence.
  - dropout (float) – The dropout rate applied to the input embeddings and to the hidden states before the output layer.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
- Return type:
- Returns:
A module that accepts the same arguments as recurrence.
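A sketch of wrapping a custom recurrence, here an LSTM, with input and output layers to obtain a language model. Matching the LSTM's input_size to its hidden_units is an assumption made for this illustration, since the embedding size used by the wrapper is not specified here.

    from rau.models import LSTM, get_rnn_language_model

    recurrence = LSTM(input_size=256, hidden_units=256, layers=2, dropout=0.1)
    model = get_rnn_language_model(
        recurrence,
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        hidden_units=256,  # the size of the vectors the recurrence outputs
        dropout=0.1,
    )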
Construct a matrix of embedding vectors that can be used as tied input embeddings and output embeddings.
The size of the output vocabulary must be no greater than the size of the input vocabulary.
- Parameters:
  - tie_embeddings (bool) – If false, None is returned, indicating that a shared embedding matrix should not be used.
  - input_vocabulary_size (int) – The size of the input vocabulary.
  - output_vocabulary_size (int) – The size of the output vocabulary.
  - embedding_size (int) – The size of the embedding vectors.
  - use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.
- Return type:
- Returns:
A matrix of size \(\text{input vocabulary size} \times \text{embedding size}\). If use_padding is true, then 1 will be added to input vocabulary size. If tie_embeddings is false, None is returned.