Getting Started
===============

Installation
------------

Install Rau from PyPI using your favorite package manager:

.. code-block:: sh

    pip install rau

This should install the command ``rau``, which serves as the library's
command-line interface.

Examples
--------

Below are some quick examples of setting up pipelines for language modeling
and sequence-to-sequence transduction. We will use the pretokenized datasets
from :cite:t:`mccoy-etal-2020-syntax`; their simplicity and small size make
them convenient for our purposes.

Language Modeling
^^^^^^^^^^^^^^^^^

For this example, we'll train a transformer language model on simple
declarative sentences in English (the data comes from the source side of the
question formation task of :cite:t:`mccoy-etal-2020-syntax`).

Download the dataset:

.. code-block:: sh

    mkdir language-modeling-example
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train | sed 's/[a-z]\+\t.*//' > language-modeling-example/main.tok
    mkdir language-modeling-example/datasets
    mkdir language-modeling-example/datasets/validation
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/validation/main.tok
    mkdir language-modeling-example/datasets/test
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/test/main.tok

Note the directory structure:

* ``language-modeling-example/``: a directory representing this dataset

  * ``main.tok``: the pretokenized training set

  * ``datasets/``: additional datasets that should be processed with the
    training set's vocabulary

    * ``validation/``

      * ``main.tok``: the pretokenized validation set

    * ``test/``

      * ``main.tok``: the pretokenized test set

Now, "prepare" the data by figuring out the vocabulary of the training data
and converting all tokens to integers:

.. code-block:: sh

    rau lm prepare \
        --training-data language-modeling-example \
        --more-data validation \
        --more-data test \
        --never-allow-unk

The flag ``--training-data`` refers to the directory containing our dataset.
The flag ``--more-data`` indicates the name of a directory under
``language-modeling-example/datasets`` to prepare using the vocabulary of the
training data. The flag ``--never-allow-unk`` indicates that the training data
does not contain a designated unknown (UNK) token and that out-of-vocabulary
tokens should be treated as errors at inference time.

Note the new files generated:

* ``language-modeling-example/``

  * ``main.prepared``: the prepared training set

  * ``main.vocab``: the vocabulary of the training data

  * ``datasets/``

    * ``validation/``

      * ``main.prepared``: the prepared validation set

    * ``test/``

      * ``main.prepared``: the prepared test set

Now, train a transformer language model:

.. code-block:: sh

    rau lm train \
        --training-data language-modeling-example \
        --architecture transformer \
        --num-layers 6 \
        --d-model 64 \
        --num-heads 8 \
        --feedforward-size 256 \
        --dropout 0.1 \
        --init-scale 0.1 \
        --max-epochs 10 \
        --max-tokens-per-batch 2048 \
        --optimizer Adam \
        --initial-learning-rate 0.01 \
        --gradient-clipping-threshold 5 \
        --early-stopping-patience 2 \
        --learning-rate-patience 1 \
        --learning-rate-decay-factor 0.5 \
        --examples-per-checkpoint 50000 \
        --output saved-language-model

This saves a transformer language model to the directory
``saved-language-model``.
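Before moving on, it can be reassuring to verify that the expected artifacts
exist. The commands below use only standard shell tools; the exact contents of
``saved-language-model`` depend on the library, and the assumption that
``main.vocab`` is a plain-text file with one entry per line is ours, not
something guaranteed by Rau:

.. code-block:: sh

    # How many sentences are in each split?
    wc -l language-modeling-example/main.tok \
          language-modeling-example/datasets/validation/main.tok \
          language-modeling-example/datasets/test/main.tok

    # Rough size of the training vocabulary
    # (assumes main.vocab is plain text, one entry per line).
    wc -l language-modeling-example/main.vocab

    # Confirm that training wrote something to the output directory.
    ls saved-language-model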
Finally, calculate the perplexity of this language model on the test set:

.. code-block:: sh

    rau lm evaluate \
        --load-model saved-language-model \
        --training-data language-modeling-example \
        --input test \
        --batching-max-tokens 2048

Sequence-to-Sequence
^^^^^^^^^^^^^^^^^^^^

For this example, we'll train a transformer encoder-decoder on the question
formation task of :cite:t:`mccoy-etal-2020-syntax`, which involves converting
a declarative sentence in English to question form.

Download the dataset:

.. code-block:: sh

    mkdir sequence-to-sequence-example
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train > sequence-to-sequence-example/train.tsv
    cut -f 1 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/source.tok
    cut -f 2 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/target.tok
    mkdir sequence-to-sequence-example/datasets
    mkdir sequence-to-sequence-example/datasets/validation
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev > sequence-to-sequence-example/validation.tsv
    cut -f 1 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/source.tok
    cut -f 2 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/target.tok
    mkdir sequence-to-sequence-example/datasets/test
    curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | head -100 > sequence-to-sequence-example/test.tsv
    cut -f 1 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/source.tok
    cut -f 2 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/target.tok
    rm sequence-to-sequence-example/{train,validation,test}.tsv

Note the directory structure:

* ``sequence-to-sequence-example/``: a directory representing this dataset

  * ``source.tok``: the source side of the pretokenized training set

  * ``target.tok``: the target side of the pretokenized training set

  * ``datasets/``: additional datasets that should be processed with the
    training set's vocabulary

    * ``validation/``

      * ``source.tok``: the source side of the pretokenized validation set

      * ``target.tok``: the target side of the pretokenized validation set

    * ``test/``

      * ``source.tok``: the source side of the pretokenized test set

      * ``target.tok``: the target side of the pretokenized test set

Now, "prepare" the data by figuring out the vocabulary of the training data
and converting all tokens to integers:

.. code-block:: sh

    rau ss prepare \
        --training-data sequence-to-sequence-example \
        --vocabulary-types shared \
        --more-data validation \
        --more-source-data test \
        --never-allow-unk

The flag ``--training-data`` refers to the directory containing our dataset.
The flag ``--vocabulary-types shared`` means that the script will generate a
single vocabulary that is shared by both the source and target sides; this
makes it possible to tie source and target embeddings. The flag
``--more-data`` indicates the name of a directory under
``sequence-to-sequence-example/datasets`` to prepare using the vocabulary of
the training data (both the source and target sides will be prepared). The
flag ``--more-source-data`` does the same thing, but it only prepares the
source side (only the source side is necessary for generating translations on
a test set). The flag ``--never-allow-unk`` indicates that the training data
does not contain a designated unknown (UNK) token and that out-of-vocabulary
tokens should be treated as errors at inference time.

Note the new files generated:

* ``sequence-to-sequence-example/``

  * ``source.shared.prepared``

  * ``target.shared.prepared``

  * ``shared.vocab``: a shared vocabulary of tokens that appear in either the
    source or target side of the training set

  * ``datasets/``

    * ``validation/``

      * ``source.shared.prepared``

      * ``target.shared.prepared``

    * ``test/``

      * ``source.shared.prepared``

      * ``target.shared.prepared``
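As a quick sanity check before training, you can verify that the source and
target files are parallel (same number of lines) and get a rough idea of the
size of the shared vocabulary. These commands use only standard shell tools;
the assumption that ``shared.vocab`` is a plain-text file with one entry per
line is ours and may not match the actual file format:

.. code-block:: sh

    # The source and target sides of the training set should have the same
    # number of lines.
    wc -l sequence-to-sequence-example/source.tok sequence-to-sequence-example/target.tok

    # Inspect a few aligned source/target pairs side by side.
    paste sequence-to-sequence-example/source.tok sequence-to-sequence-example/target.tok | head -3

    # Rough size of the shared vocabulary (assumes one entry per line).
    wc -l sequence-to-sequence-example/shared.vocab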
Now, train a transformer encoder-decoder model:

.. code-block:: sh

    rau ss train \
        --training-data sequence-to-sequence-example \
        --vocabulary-type shared \
        --num-encoder-layers 6 \
        --num-decoder-layers 6 \
        --d-model 64 \
        --num-heads 8 \
        --feedforward-size 256 \
        --dropout 0.1 \
        --init-scale 0.1 \
        --max-epochs 10 \
        --max-tokens-per-batch 2048 \
        --optimizer Adam \
        --initial-learning-rate 0.01 \
        --label-smoothing-factor 0.1 \
        --gradient-clipping-threshold 5 \
        --early-stopping-patience 2 \
        --learning-rate-patience 1 \
        --learning-rate-decay-factor 0.5 \
        --examples-per-checkpoint 50000 \
        --output saved-sequence-to-sequence-model

This saves a model to the directory ``saved-sequence-to-sequence-model``.

Finally, translate the source sequences in the test data using beam search:

.. code-block:: sh

    rau ss translate \
        --load-model saved-sequence-to-sequence-model \
        --input sequence-to-sequence-example/datasets/test/source.shared.prepared \
        --beam-size 4 \
        --max-target-length 50 \
        --batching-max-tokens 256 \
        --shared-vocabulary-file sequence-to-sequence-example/shared.vocab
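The command above has no output flag, so the sketch below assumes translations
are written to standard output, one tokenized sequence per line, and that the
output tokenization matches ``target.tok`` exactly; both are assumptions on
our part rather than guarantees of the library, and the filename
``test-translations.tok`` is just an example. Under those assumptions, you can
save the translations and compute exact-match accuracy against the reference
target side of the test set:

.. code-block:: sh

    # Capture the beam-search translations with an ordinary shell redirect,
    # assuming they are printed to standard output.
    rau ss translate \
        --load-model saved-sequence-to-sequence-model \
        --input sequence-to-sequence-example/datasets/test/source.shared.prepared \
        --beam-size 4 \
        --max-target-length 50 \
        --batching-max-tokens 256 \
        --shared-vocabulary-file sequence-to-sequence-example/shared.vocab \
        > test-translations.tok

    # Exact-match accuracy against the reference questions.
    paste test-translations.tok sequence-to-sequence-example/datasets/test/target.tok \
        | awk -F '\t' '{ total++; if ($1 == $2) correct++ } END { printf "exact match: %d/%d\n", correct, total }'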