Composable Neural Networks
==========================

For those using the Python API of Rau, a useful feature that the library
provides is the ability to easily create new neural network modules by
composing simpler modules with the ``|`` operator, so that the output of one
module is used as the input to the next. (The choice of ``|`` as the
composition operator is meant to evoke piping from shell languages.) If ``A``
and ``B`` are :py:class:`~torch.nn.Module`\ s and ``A`` is also an instance of
Rau's :py:class:`~rau.tools.torch.compose.BasicComposable` class, then the
expression ``A | B`` creates a new :py:class:`~torch.nn.Module` whose ``()``
operator passes its input to ``A``, feeds the output of ``A`` as input to
``B``, and returns the output of ``B``. This module is also an instance of
:py:class:`~rau.tools.torch.compose.BasicComposable`, so you can easily create
a pipeline of more than two modules like ``A | B | C | D | ...``. You can make
any :py:class:`~torch.nn.Module` an instance of
:py:class:`~rau.tools.torch.compose.BasicComposable` by wrapping it in
:py:class:`~rau.tools.torch.compose.Composable`.

.. code-block:: python

    import torch
    from torch.nn import Linear
    from rau.tools.torch.compose import Composable

    # Create a simple pipeline of Linear modules.
    # We only need to wrap the first module in Composable to kick
    # off composition.
    # Note that the sizes of connecting linear layers match.
    M = Composable(Linear(3, 7)) | Linear(7, 5) | Linear(5, 11)

    # Feed an input to the composed module.
    x = torch.ones(3)
    y = M(x)

    # The output has the size of the last module's output.
    assert y.size() == (11,)

This saves you the trouble of defining a custom :py:class:`~torch.nn.Module`
subclass that implements this pipeline.
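Because the composed pipeline is itself an ordinary :py:class:`~torch.nn.Module`,
it can be trained like any other PyTorch module. The following is only a
minimal sketch, assuming that the composed module registers the wrapped
modules as submodules so that ``parameters()`` collects the weights of all
three ``Linear`` layers.

.. code-block:: python

    import torch
    from torch.nn import Linear
    from torch.nn.functional import mse_loss
    from rau.tools.torch.compose import Composable

    M = Composable(Linear(3, 7)) | Linear(7, 5) | Linear(5, 11)
    # Assumption: M.parameters() includes the weights of all three
    # Linear layers, as it would for any other PyTorch module.
    optimizer = torch.optim.SGD(M.parameters(), lr=0.1)

    x = torch.ones(3)
    target = torch.zeros(11)

    # One toy optimization step on the composed pipeline.
    optimizer.zero_grad()
    loss = mse_loss(M(x), target)
    loss.backward()
    optimizer.step()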
The full API is documented in :py:mod:`rau.tools.torch.compose`.

Composable Sequential Neural Networks
-------------------------------------

This composition feature is especially useful when dealing with sequential
neural networks in Rau. Rau uses an abstraction for sequential neural networks
called :py:class:`~rau.unidirectional.Unidirectional`. A
:py:class:`~rau.unidirectional.Unidirectional` is a
:py:class:`~torch.nn.Module` that receives a variable-length sequence of
:py:class:`~torch.Tensor`\ s as input and produces an output
:py:class:`~torch.Tensor` for each input :py:class:`~torch.Tensor`. Moreover,
each output :py:class:`~torch.Tensor` **must not** have any data dependencies
on future inputs.

As usual, a :py:class:`~rau.unidirectional.Unidirectional` has a ``()``
operator, which receives the inputs stacked into a single
:py:class:`~torch.Tensor` along dimension 1 (the batch dimension is 0) and
returns the outputs similarly stacked into a single
:py:class:`~torch.Tensor`.

.. code-block:: python

    import torch
    from rau.models.transformer.unidirectional_encoder import (
        get_unidirectional_transformer_encoder
    )

    # This instantiates a causally-masked transformer encoder (also
    # known as a "decoder-only" transformer). It is an instance of
    # Unidirectional.
    M = get_unidirectional_transformer_encoder(
        # This module will receive sequences of integer token IDs in
        # the range [0, 5) as input.
        input_vocabulary_size=5,
        # This module will produce an output vector of size 3 for each
        # input position.
        output_vocabulary_size=3,
        # Turn off dropout in order to make the outputs deterministic
        # for this example.
        dropout=0,
        # The remaining arguments are not relevant for this example.
        tie_embeddings=False,
        num_layers=5,
        d_model=32,
        num_heads=4,
        feedforward_size=64,
        use_padding=False
    )

    # Batch size.
    B = 7
    # Sequence length.
    n = 11

    # Create a batch of sequences of integer inputs in the range [0, 5)
    # of length n. These are the "tokens" given to the transformer
    # encoder.
    x = torch.randint(5, (B, n))

    # Use the () operator to get an output sequence of vectors.
    # The argument include_first=False tells the module that we do not
    # want it to attempt to produce an output before reading the first
    # input. This is not possible for transformers, but it is for RNNs,
    # which have an initial hidden state. For transformers, an output
    # corresponding to an initial BOS input can serve the same purpose,
    # but the BOS would need to be added to the input x, which we have
    # not done in this example.
    y = M(x, include_first=False)
    assert y.size() == (B, n, 3)

A :py:class:`~rau.unidirectional.Unidirectional` *also* has an
:py:meth:`~rau.unidirectional.Unidirectional.initial_state` method that
returns a :py:class:`~rau.unidirectional.Unidirectional.State` object, which
can be used to receive inputs and return outputs iteratively using its
:py:meth:`~rau.unidirectional.Unidirectional.State.next` and
:py:meth:`~rau.unidirectional.Unidirectional.State.output` methods.

.. code-block:: python

    from torch.testing import assert_close

    state = M.initial_state(batch_size=B)

    # Call .next() to feed a new input to the current state and produce
    # a new state.
    state = state.next(x[:, 0])

    # Call .output() to get the output tensor of this state.
    # Because transformers have no initial output vector before reading
    # any inputs, calling .output() before .next() would have raised an
    # error.
    y1 = state.output()

    # The output of this state is a single vector of size 3 and is
    # equivalent to the first element of the output of ().
    assert y1.size() == (B, 3)
    assert_close(y1, y[:, 0])

    # Do the same thing for a second iteration.
    state = state.next(x[:, 1])
    y2 = state.output()
    assert y2.size() == (B, 3)
    assert_close(y2, y[:, 1])

These two modes are useful in different scenarios. The ``()`` method can be
overridden to parallelize computation across the sequence dimension, making it
more efficient than the iterative mode. This makes the ``()`` method useful
for training, where future inputs are always known in advance. The iterative
mode is useful when future inputs are *not* known in advance, namely when
generating sequences from language models or decoders in machine translation
systems.
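For instance, the iterative mode can be used to implement a simple generation
loop. The following is only an illustrative sketch of greedy decoding that
reuses the transformer encoder ``M`` and batch size ``B`` defined above; it is
not Rau's built-in generation code, and because ``M``'s input and output
vocabularies differ, feeding the predicted symbols back in as inputs is purely
for illustration.

.. code-block:: python

    # Start from the initial state and feed an arbitrary first token
    # (a real generator would feed a BOS symbol here).
    state = M.initial_state(batch_size=B)
    state = state.next(torch.zeros(B, dtype=torch.long))

    generated = []
    for _ in range(5):
        # Greedily pick the highest-scoring output symbol at each
        # step...
        next_token = state.output().argmax(dim=-1)
        generated.append(next_token)
        # ...and feed it back in as the next input.
        state = state.next(next_token)

    # The generated symbols, stacked along the sequence dimension.
    generated = torch.stack(generated, dim=1)
    assert generated.size() == (B, 5)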
:py:class:`~rau.unidirectional.Unidirectional`\ s can also be composed with
the ``|`` operator. If ``A`` and ``B`` are both
:py:class:`~rau.unidirectional.Unidirectional`\ s, then the expression
``A | B`` returns another :py:class:`~rau.unidirectional.Unidirectional` that
feeds its inputs to ``A``, feeds the outputs of ``A`` as inputs to ``B``, and
returns the outputs of ``B``. Like ``A`` and ``B``, the
:py:class:`~rau.unidirectional.Unidirectional` returned by ``A | B`` also
supports both the ``()`` and iterative modes. If ``A`` and ``B`` implement
their ``()`` and iterative modes efficiently, then ``A | B`` gives you a
composed module that implements both modes efficiently for free.

The full API is documented in :doc:`rau.unidirectional`.

Argument Routing
----------------

What if you try to compose modules that require multiple arguments? For
example, if you have a module ``A`` that takes no keyword arguments, a module
``B`` that requires a keyword argument ``foo``, and a module ``C`` that
requires keyword arguments ``bar`` and ``baz``, how do you invoke
``A | B | C``? Rau handles this by allowing you to add tags to modules that
signal which modules should receive which arguments.

.. code-block:: python

    # Create a pipeline where individual modules have been tagged.
    M = A | B.tag('b') | C.tag('c')

    x = torch.rand(B, n, A_input_size)
    y = M(
        # x will be passed as input to A, whose output will be passed
        # as input to B, whose output will be passed as input to C,
        # whose output will be returned as y.
        x,
        # tag_kwargs is a dict that maps tags to dicts of keyword
        # arguments. The keyword argument foo=123 will be passed to B,
        # and the keyword arguments bar=456 and baz=789 will be passed
        # to C.
        tag_kwargs=dict(
            b=dict(foo=123),
            c=dict(
                bar=456,
                baz=789
            )
        )
    )

You can make this more succinct by designating at most one module in a
pipeline as the "main" module, which will receive any extra positional or
keyword arguments. This is useful when wrapping a module with input and output
layers.

.. code-block:: python

    # Create a pipeline where B is tagged with 'b' and C is the main
    # module.
    M = A | B.tag('b') | C.main()

    x = torch.rand(B, n, A_input_size)
    y = M(
        x,
        bar=456,
        baz=789,
        tag_kwargs=dict(
            b=dict(foo=123)
        )
    )
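To make the routing example concrete, here is a minimal, self-contained
sketch with toy modules. It assumes that modules wrapped in
:py:class:`~rau.tools.torch.compose.Composable` provide the same ``tag()``
method used above; the ``Scale`` module and its ``factor`` keyword argument
are purely hypothetical.

.. code-block:: python

    import torch
    from torch import nn
    from rau.tools.torch.compose import Composable

    class Scale(nn.Module):
        # A toy module whose forward() requires a keyword argument.
        def forward(self, x, *, factor):
            return x * factor

    # Wrap both modules so that they can be composed and tagged.
    M = Composable(nn.Linear(3, 4)) | Composable(Scale()).tag('s')

    x = torch.ones(2, 3)
    # The keyword argument factor=2.0 is routed only to the module
    # tagged 's'; the Linear module receives no extra arguments.
    y = M(x, tag_kwargs=dict(s=dict(factor=2.0)))
    assert y.size() == (2, 4)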