rau.models

This module contains implementations of neural network architectures.

Notes on the Transformer Architecture

The following notes apply to all instances of the transformer architecture [Vaswani et al., 2017].

  • They use pre-norm instead of post-norm [Nguyen and Salazar, 2019, Wang et al., 2019].

  • Dropout is applied in the same places as in Vaswani et al. [2017], as well as to the hidden units of feedforward sublayers and to the attention probabilities of the attention mechanism.

  • They use the sinusoidal positional encodings as originally proposed by Vaswani et al. [2017].

rau.models.get_unidirectional_transformer_encoder(input_vocabulary_size, output_vocabulary_size, tie_embeddings, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings=None, positional_encoding_cacher=None, tag=None)

Construct a causally-masked transformer encoder (also called a “decoder-only” model). This can be used as a language model.

It includes a scaled input embedding layer with sinusoidal positional encodings and an output layer for predicting logits.

Parameters:
  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • tie_embeddings (bool) – Whether to tie the input and output embeddings.

  • num_layers (int) – Number of layers.

  • d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).

  • num_heads (int) – Number of attention heads per layer.

  • feedforward_size (int) – Number of hidden units in each feedforward sublayer.

  • dropout (float) – Dropout rate used throughout the transformer.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

  • shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.

  • positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.

  • tag (str | None) – An optional tag to add to the inner UnidirectionalTransformerEncoderLayers for argument routing.

Return type:

Unidirectional

Returns:

A module. Unless tag is given, it accepts the same arguments as UnidirectionalTransformerEncoderLayers.
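
For orientation, a minimal usage sketch (the hyperparameter values, tensor layout, and include_first behavior are illustrative assumptions, not guarantees of this API):

    import torch
    from rau.models import get_unidirectional_transformer_encoder

    # Small, illustrative hyperparameters.
    lm = get_unidirectional_transformer_encoder(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        tie_embeddings=True,
        num_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
        use_padding=False,
    )

    # Assumed input layout: a batch of token ids of size batch size x sequence length.
    tokens = torch.randint(0, 1000, (16, 32))
    logits = lm(tokens, include_first=True)  # logits over the output vocabulary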

class rau.models.UnidirectionalTransformerEncoderLayers

Bases: Unidirectional

A causally-masked transformer encoder without input or output layers.

class State

Bases: State

State(encoder: ‘UnidirectionalTransformerEncoderLayers’, previous_inputs: torch.Tensor, is_padding_mask: torch.Tensor | None)

__init__(encoder, previous_inputs, is_padding_mask)
batch_size()
Return type:

int

forward(input_sequence, include_first, return_state=False, return_output=True)
Return type:

Tensor | ForwardResult

next(input_tensor)
Return type:

State

output()
Return type:

Tensor | tuple[Tensor, Unpack[tuple[Any, ...]]]

transform_tensors(func)
Return type:

State

encoder: UnidirectionalTransformerEncoderLayers
previous_inputs: Tensor
is_padding_mask: Tensor | None
__init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)
forward(input_sequence, is_padding_mask=None, initial_state=None, return_state=False, include_first=True)
Parameters:

is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the input are padding tokens and should be ignored. Since the model is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{input length}\). A value of true indicates that a token is padding.

Return type:

Tensor | ForwardResult

initial_state(batch_size, is_padding_mask=None)
Return type:

State
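
A sketch of driving these layers both over a whole sequence and step by step through the State interface documented above; because no input or output layers are included, the inputs are assumed to be d_model-sized vectors rather than token ids:

    import torch
    from rau.models import UnidirectionalTransformerEncoderLayers

    layers = UnidirectionalTransformerEncoderLayers(
        num_layers=2,
        d_model=64,
        num_heads=4,
        feedforward_size=256,
        dropout=0.0,
        use_final_layer_norm=True,
    )

    # Parallel mode over a whole sequence (assumed layout: batch x length x d_model).
    x = torch.zeros(8, 10, 64)
    y = layers(x, include_first=False)

    # Incremental mode: advance the state one position at a time.
    state = layers.initial_state(batch_size=8)
    state = state.next(torch.zeros(8, 64))  # one d_model-sized input per sequence (assumed)
    out = state.output()                    # output at the current position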

rau.models.get_transformer_encoder(vocabulary_size, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings, positional_encoding_cacher, tag=None)

Construct a bidirectional transformer encoder.

It includes a scaled input embedding layer with sinusoidal positional encodings but no separate output layer. Its output is the result of applying layer norm to the output of the last layer.

Parameters:
  • vocabulary_size (int) – The size of the input vocabulary.

  • num_layers (int) – Number of layers.

  • d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).

  • num_heads (int) – Number of attention heads per layer.

  • feedforward_size (int) – Number of hidden units in each feedforward sublayer.

  • dropout (float) – Dropout rate used throughout the transformer.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

  • shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.

  • positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.

  • tag (str | None) – An optional tag to add to the inner TransformerEncoderLayers for argument routing.

Return type:

Module

Returns:

A module. Unless tag is given, it accepts the same arguments as TransformerEncoderLayers.
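
A sketch of constructing and calling the encoder with a padding mask, as required for minibatches of unequal lengths (the padding index and tensor layouts are assumptions for illustration):

    import torch
    from rau.models import get_transformer_encoder

    encoder = get_transformer_encoder(
        vocabulary_size=1000,
        num_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
        use_padding=True,
        shared_embeddings=None,
        positional_encoding_cacher=None,
    )

    # With use_padding=True, index 1000 is assumed here to be reserved for padding.
    tokens = torch.tensor([[5, 7, 9, 1000], [3, 4, 1000, 1000]])
    is_padding = tokens == 1000
    encoded = encoder(tokens, is_padding_mask=is_padding)  # batch x length x d_model (assumed)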

class rau.models.TransformerEncoderLayers

Bases: Module

A bidirectional transformer encoder without input or output layers.

__init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)
forward(source_sequence, is_padding_mask=None)
Parameters:

is_padding_mask (Tensor | None) – A bool tensor indicating which positions in the input should be treated as padding symbols and ignored. This always needs to be used if you are using a minibatch with sequences of different lengths. Its size should be \(\text{batch size} \times \text{input length}\). A value of true indicates that a token is padding.

Return type:

Tensor

rau.models.get_transformer_decoder(input_vocabulary_size, output_vocabulary_size, num_layers, d_model, num_heads, feedforward_size, dropout, use_padding, shared_embeddings, positional_encoding_cacher, tag=None)

Construct a transformer decoder with cross-attention.

It includes a scaled input embedding layer with sinusoidal positional encodings and an output layer for predicting logits.

Parameters:
  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • num_layers (int) – Number of layers.

  • d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).

  • num_heads (int) – Number of attention heads per layer.

  • feedforward_size (int) – Number of hidden units in each feedforward sublayer.

  • dropout (float) – Dropout rate used throughout the transformer.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

  • shared_embeddings (Tensor | None) – An optional matrix of embeddings that can be shared elsewhere.

  • positional_encoding_cacher (SinusoidalPositionalEncodingCacher | None) – Optional cache for computing positional encodings that can be shared elsewhere.

  • tag (str | None) – An optional tag to add to the inner TransformerDecoderLayers for argument routing.

Return type:

Unidirectional

Returns:

A module. Unless tag is given, it accepts the same arguments as TransformerDecoderLayers.
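
A sketch of constructing the decoder and cross-attending to a precomputed encoder output; the tensor layouts are assumptions, and in practice the encoder sequence would come from an encoder such as the one returned by get_transformer_encoder():

    import torch
    from rau.models import get_transformer_decoder

    decoder = get_transformer_decoder(
        input_vocabulary_size=1200,
        output_vocabulary_size=1200,
        num_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
        use_padding=False,
        shared_embeddings=None,
        positional_encoding_cacher=None,
    )

    encoder_sequence = torch.zeros(4, 12, 512)      # batch x source length x d_model (assumed)
    target_tokens = torch.randint(0, 1200, (4, 7))  # batch x target length
    logits = decoder(target_tokens, encoder_sequence=encoder_sequence)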

class rau.models.TransformerDecoderLayers

Bases: Unidirectional

A transformer decoder without input or output layers.

class State

Bases: State

State(decoder: ‘TransformerDecoderLayers’, encoder_sequence: torch.Tensor, input_is_padding_mask: torch.Tensor | None, encoder_is_padding_mask: torch.Tensor | None, previous_inputs: torch.Tensor)

__init__(decoder, encoder_sequence, input_is_padding_mask, encoder_is_padding_mask, previous_inputs)
batch_size()
Return type:

int

forward(input_sequence, include_first, return_state=False, return_output=True)
Return type:

Tensor | ForwardResult

next(input_tensor)
Return type:

State

output()
Return type:

Tensor | tuple[Tensor, Unpack[tuple[Any, ...]]]

transform_tensors(func)
Return type:

State

decoder: TransformerDecoderLayers
encoder_sequence: Tensor
input_is_padding_mask: Tensor | None
encoder_is_padding_mask: Tensor | None
previous_inputs: Tensor
__init__(num_layers, d_model, num_heads, feedforward_size, dropout, use_final_layer_norm)
forward(input_sequence, encoder_sequence, input_is_padding_mask=None, encoder_is_padding_mask=None, initial_state=None, return_state=False, include_first=True)
Parameters:
  • encoder_sequence (Tensor) – The output sequence of the encoder.

  • input_is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the decoder input correspond to padding symbols that should be ignored. Since the decoder is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{decoder input length}\). A value of true indicates that a token is padding.

  • encoder_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the encoder input correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with input sequences of different lengths. Its size should be \(\text{batch size} \times \text{encoder input length}\). A value of true indicates that a token is padding.

Return type:

Tensor | ForwardResult

initial_state(batch_size, encoder_sequence, input_is_padding_mask=None, encoder_is_padding_mask=None)
Return type:

State

rau.models.get_transformer_encoder_decoder(source_vocabulary_size, target_input_vocabulary_size, target_output_vocabulary_size, tie_embeddings, num_encoder_layers, num_decoder_layers, d_model, num_heads, feedforward_size, dropout, use_source_padding=True, use_target_padding=True)

Construct a transformer encoder-decoder.

It includes scaled input embedding layers with sinusoidal positional encodings in both the encoder and the decoder, and an output layer in the decoder for predicting logits.

Parameters:
  • source_vocabulary_size (int) – The size of the vocabulary used by the encoder.

  • target_input_vocabulary_size (int) – The size of the input vocabulary used by the decoder.

  • target_output_vocabulary_size (int) – The size of the output vocabulary used by the decoder.

  • tie_embeddings (bool) – Whether to tie the input and output embeddings.

  • num_encoder_layers (int) – Number of layers to use in the encoder.

  • num_decoder_layers (int) – Number of layers to use in the decoder.

  • d_model (int) – The size of the vector representations used in the model, or \(d_\mathrm{model}\).

  • num_heads (int) – Number of attention heads per layer.

  • feedforward_size (int) – Number of hidden units in each feedforward sublayer.

  • dropout (float) – Dropout rate used throughout the transformer.

  • use_source_padding (bool) – Whether to ensure that the embedding matrix for the encoder is big enough to accommodate an index for a reserved padding symbol.

  • use_target_padding (bool) – Whether to ensure that the embedding matrix for the decoder is big enough to accommodate an index for a reserved padding symbol.

Return type:

TransformerEncoderDecoder
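
A minimal construction sketch with illustrative hyperparameter values:

    from rau.models import get_transformer_encoder_decoder

    model = get_transformer_encoder_decoder(
        source_vocabulary_size=1000,
        target_input_vocabulary_size=1200,
        target_output_vocabulary_size=1200,
        tie_embeddings=True,
        num_encoder_layers=6,
        num_decoder_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
    )

See the sketch after initial_decoder_state() below for how forward() and incremental decoding fit together.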

class rau.models.TransformerEncoderDecoder

Bases: Module

A transformer encoder-decoder.

__init__(encoder, decoder)
forward(source_sequence, target_sequence, source_is_padding_mask=None, target_is_padding_mask=None)
Parameters:
  • source_sequence (Tensor) – A batch of source sequences. A tensor of ints of size \(\text{batch size} \times \text{source length}\).

  • target_sequence (Tensor) – A batch of target sequences. A tensor of ints of size \(\text{batch size} \times \text{target length}\).

  • source_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the source correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with source sequences of different lengths. Its size should be \(\text{batch size} \times \text{source length}\). A value of true indicates that a token is padding.

  • target_is_padding_mask (Tensor | None) – Optional bool tensor indicating which positions in the target correspond to padding symbols that should be ignored. Since the decoder is already causally masked, this should usually not be necessary, and it is better not to use it. Its size should be \(\text{batch size} \times \text{target length}\). A value of true indicates that a token is padding.

Return type:

Tensor

Returns:

The output logits of the decoder, of size \(\text{batch size} \times \text{target length} \times \text{target vocabulary size}\).

initial_decoder_state(source_sequence, source_is_padding_mask)

Given a batch of source sequences, compute the initial state of the decoder.

Parameters:
  • source_sequence (Tensor) – A batch of source sequences. A tensor of ints of size \(\text{batch size} \times \text{source length}\).

  • source_is_padding_mask (Tensor | None) – Bool tensor indicating which positions in the source correspond to padding symbols that should be ignored. This always needs to be used if you are using a minibatch with source sequences of different lengths. Its size should be \(\text{batch size} \times \text{source length}\). A value of true indicates that a token is padding.

Return type:

State

Returns:

The initial state of the decoder, conditioned on the source sequences.
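
A sketch of a teacher-forced forward pass followed by a hypothetical greedy decoding loop built on the State interface; the conventions that next() consumes token ids and output() yields logits, and the use of index 0 as a BOS symbol, are assumptions for illustration:

    import torch

    # `model` is a TransformerEncoderDecoder, e.g. from the construction sketch above.
    source = torch.randint(0, 1000, (4, 12))
    target = torch.randint(0, 1200, (4, 9))
    source_mask = torch.zeros(4, 12, dtype=torch.bool)  # no padding in this toy batch

    # Teacher-forced pass: logits of size batch x target length x target vocabulary size.
    logits = model(source, target, source_is_padding_mask=source_mask)

    # Hypothetical greedy decoding, one target position at a time.
    state = model.initial_decoder_state(source, source_mask)
    token = torch.zeros(4, dtype=torch.long)  # assumed BOS index 0
    for _ in range(20):
        state = state.next(token)
        token = state.output().argmax(dim=-1)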

class rau.models.SinusoidalPositionalEncodingCacher

Bases: Module

A module that caches a tensor of sinusoidal positional encodings.

This module can dynamically resize the cached tensor as needed, but it is highly recommended to set a maximum size up front, at the beginning of your program, using get_encodings() (for example, by looping through the training data) and then to disable dynamic resizing using set_allow_reallocation() to avoid CUDA memory fragmentation. Otherwise, you may run out of memory in a way that is very hard to debug.

__init__()
clear()

Clear the cache.

Return type:

None

get_encodings(sequence_length, d_model)

Get a tensor of positional encodings of the requested size.

Parameters:
  • sequence_length (int) – Get positional encodings up to this length.

  • d_model (int) – The \(d_\mathrm{model}\) of the positional encodings.

Return type:

Tensor

Returns:

A tensor of positional encodings of the requested size.

set_allow_reallocation(value)

Set whether the cached tensor may be reallocated dynamically based on requested sizes. By default, reallocation is enabled. If it is disabled, requesting a size bigger than the currently cached tensor raises an error. After setting a maximum size with get_encodings(), the advantage of disabling reallocation is that requests for bigger sizes (which would indicate a bug in how you determined the maximum length) are treated as errors rather than silently being allowed to cause memory fragmentation.

Parameters:

value (bool) – Whether to allow reallocation.

Return type:

None
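
The recommended pattern, sketched in code (the maximum length here is an illustrative value you would normally derive from your data):

    from rau.models import SinusoidalPositionalEncodingCacher

    cacher = SinusoidalPositionalEncodingCacher()

    # Warm the cache once with the largest size you will ever need, then freeze it so
    # that any larger request fails loudly instead of fragmenting CUDA memory.
    max_length, d_model = 512, 256
    cacher.get_encodings(max_length, d_model)
    cacher.set_allow_reallocation(False)

    encodings = cacher.get_encodings(128, d_model)  # within the cached size, so this succeeds

The same cacher instance can then be passed to constructors such as get_transformer_encoder() via their positional_encoding_cacher parameter so that the encodings are shared.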

class rau.models.SimpleRNN

Bases: UnidirectionalBuiltinRNN

A simple RNN (also known as an Elman RNN) wrapped in the Unidirectional API.

__init__(input_size, hidden_units, layers=1, dropout=None, nonlinearity='tanh', bias=True, learned_initial_state=True, use_extra_bias=False)
Parameters:
  • input_size (int) – The size of the input vectors to the RNN.

  • hidden_units (int) – The number of hidden units in each layer.

  • layers (int) – The number of layers in the RNN.

  • dropout (float | None) – The amount of dropout applied in between layers. If layers is 1, then this value is ignored.

  • nonlinearity (Literal['tanh', 'relu']) – The non-linearity applied to hidden units. Either 'tanh' or 'relu'.

  • bias (bool) – Whether to use bias terms.

  • learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.

  • use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.

class rau.models.LSTM

Bases: UnidirectionalBuiltinRNN

An LSTM wrapped in the Unidirectional API.

__init__(input_size, hidden_units, layers=1, dropout=None, bias=True, learned_initial_state=True, use_extra_bias=False)
Parameters:
  • input_size (int) – The size of the input vectors to the LSTM.

  • hidden_units (int) – The number of hidden units in each layer.

  • layers (int) – The number of layers in the LSTM.

  • dropout (float | None) – The amount of dropout applied in between layers. If layers is 1, then this value is ignored.

  • bias (bool) – Whether to use bias terms.

  • learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the tanh activation function. If false, the initial state will be zeros. The initial memory cell is always zeros.

  • use_extra_bias (bool) – The built-in PyTorch implementation of the LSTM includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.
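
A sketch covering both recurrent wrappers documented above (SimpleRNN and LSTM); since neither includes input or output layers, the inputs are assumed to be batches of input_size-dimensional vectors:

    import torch
    from rau.models import LSTM, SimpleRNN

    rnn = SimpleRNN(input_size=32, hidden_units=64, layers=2, dropout=0.1, nonlinearity='tanh')
    lstm = LSTM(input_size=32, hidden_units=64, layers=2, dropout=0.1)

    # Whole-sequence processing (assumed layout: batch x length x input_size).
    x = torch.zeros(8, 20, 32)
    h = lstm(x, include_first=False)

    # Step-by-step processing through the Unidirectional State interface
    # (assumed to work as for the transformer classes above).
    state = rnn.initial_state(batch_size=8)
    state = state.next(torch.zeros(8, 32))
    out = state.output()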

rau.models.get_simple_rnn_language_model(input_vocabulary_size, output_vocabulary_size, hidden_units, layers=1, dropout=0, nonlinearity='tanh', bias=True, learned_initial_state=True, use_extra_bias=False, use_padding=False)

Construct a simple RNN language model.

The embedding size and input size are assumed to be the same as the hidden size.

The input and output embeddings are tied.

Parameters:
  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • hidden_units (int) – The number of hidden units in each layer.

  • layers (int) – The number of layers in the RNN.

  • dropout (float) – The amount of dropout applied to the inputs, between layers, and to the outputs of the last layer.

  • nonlinearity (Literal['tanh', 'relu']) – The non-linearity applied to hidden units. Either 'tanh' or 'relu'.

  • bias (bool) – Whether to use bias terms.

  • learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.

  • use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

Return type:

Unidirectional

Returns:

A simple RNN language model with input and output layers. It accepts the same arguments as SimpleRNN.

rau.models.get_lstm_language_model(input_vocabulary_size, output_vocabulary_size, hidden_units, layers=1, dropout=0, bias=True, learned_initial_state=True, use_extra_bias=False, use_padding=False)

Construct an LSTM language model.

The embedding size and input size are assumed to be the same as the hidden size.

The input and output embeddings are tied.

Parameters:
  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • hidden_units (int) – The number of hidden units in each layer.

  • layers (int) – The number of layers in the LSTM.

  • dropout (float) – The amount of dropout applied to the inputs, between layers, and to the outputs of the last layer.

  • bias (bool) – Whether to use bias terms.

  • learned_initial_state (bool) – Whether the initial hidden state should be a learned parameter. If true, the initial hidden state will be the result of passing learned parameters through the activation function. If false, the initial state will be zeros.

  • use_extra_bias (bool) – The built-in PyTorch implementation of the RNN includes redundant bias terms, resulting in more parameters than necessary. If this is true, the extra bias terms are kept. Otherwise, they are removed.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

Return type:

Unidirectional

Returns:

An LSTM language model with input and output layers. It accepts the same arguments as LSTM.
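
A sketch of constructing and calling both language-model constructors above (the token-id input layout is an assumption):

    import torch
    from rau.models import get_lstm_language_model, get_simple_rnn_language_model

    rnn_lm = get_simple_rnn_language_model(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        hidden_units=256,
        layers=2,
        dropout=0.1,
    )
    lstm_lm = get_lstm_language_model(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        hidden_units=256,
        layers=2,
        dropout=0.1,
    )

    tokens = torch.randint(0, 1000, (16, 32))     # batch x length token ids (assumed)
    logits = lstm_lm(tokens, include_first=True)  # logits over the output vocabulary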

rau.models.get_rnn_language_model(recurrence, input_vocabulary_size, output_vocabulary_size, hidden_units, dropout=0, use_padding=False)

Wrap any recurrent network with input and output layers to make it a language model.

Parameters:
  • recurrence (Unidirectional) – A Unidirectional module representing the recurrent part of the language model, which will be wrapped with input and output layers.

  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • hidden_units (int) – The size of the output vectors from recurrence.

  • dropout (float) – The dropout rate applied to the input embeddings and to the hidden states before the output layer.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

Return type:

Unidirectional

Returns:

A module that accepts the same arguments as recurrence.
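
A sketch wrapping an LSTM recurrence; setting the recurrence's input_size equal to hidden_units is an assumption, mirroring the note for the constructors above that the embedding size matches the hidden size:

    from rau.models import LSTM, get_rnn_language_model

    # Assumption: the embeddings fed to `recurrence` have size hidden_units, so its
    # input_size is set to match.
    recurrence = LSTM(input_size=256, hidden_units=256, layers=2)
    lm = get_rnn_language_model(
        recurrence,
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        hidden_units=256,  # the size of the output vectors from recurrence
        dropout=0.1,
    )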

rau.models.get_shared_embeddings(tie_embeddings, input_vocabulary_size, output_vocabulary_size, embedding_size, use_padding)

Construct a matrix of embedding vectors that can be used as tied input embeddings and output embeddings.

The size of the output vocabulary must be no greater than the size of the input vocabulary.

Parameters:
  • tie_embeddings (bool) – If false, None is returned, indicating that a shared embedding matrix should not be used.

  • input_vocabulary_size (int) – The size of the input vocabulary.

  • output_vocabulary_size (int) – The size of the output vocabulary.

  • embedding_size (int) – The size of the embedding vectors.

  • use_padding (bool) – Whether to ensure that the embedding matrix is big enough to accommodate an index for a reserved padding symbol.

Return type:

Tensor | None

Returns:

A matrix of size \(\text{input vocabulary size} \times \text{embedding size}\). If use_padding is true, then 1 is added to the input vocabulary size. If tie_embeddings is false, None is returned.
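
A sketch of passing the shared matrix to a model constructor via its shared_embeddings parameter; the values are illustrative, and how the constructor reuses the tensor is up to the constructor itself:

    from rau.models import get_shared_embeddings, get_unidirectional_transformer_encoder

    shared = get_shared_embeddings(
        tie_embeddings=True,
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        embedding_size=512,
        use_padding=False,
    )

    # Pass the same matrix wherever embeddings should be shared.
    lm = get_unidirectional_transformer_encoder(
        input_vocabulary_size=1000,
        output_vocabulary_size=1000,
        tie_embeddings=True,
        num_layers=6,
        d_model=512,
        num_heads=8,
        feedforward_size=2048,
        dropout=0.1,
        use_padding=False,
        shared_embeddings=shared,
    )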