Getting Started
Installation
Install Rau from PyPI using your favorite package manager:
pip install rau
This should install the command rau, which serves as the library's command-line interface.
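If you prefer to keep things isolated, a standard Python virtual environment works as usual (nothing here is specific to Rau):

python -m venv rau-env
source rau-env/bin/activate
pip install rau
rau --help

The last line is a quick sanity check; it assumes rau follows the usual --help convention for command-line tools.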
Examples
Below are some quick examples of setting up pipelines for language modeling and sequence-to-sequence transduction. We will use the pretokenized datasets from McCoy et al. [2020]; their simplicity and small size make them convenient for our purposes.
Language Modeling
For this example, we’ll train a transformer language model on simple declarative sentences in English (the data comes from the source side of the question formation task of McCoy et al. [2020]).
Download the dataset:
mkdir language-modeling-example
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train | sed 's/[a-z]\+\t.*//' > language-modeling-example/main.tok
mkdir language-modeling-example/datasets
mkdir language-modeling-example/datasets/validation
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/validation/main.tok
mkdir language-modeling-example/datasets/test
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | sed 's/[a-z]\+\t.*//' > language-modeling-example/datasets/test/main.tok
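Before continuing, a quick sanity check (plain shell, nothing Rau-specific) confirms that the downloads succeeded and that each file contains one pretokenized sentence per line:

wc -l language-modeling-example/main.tok \
    language-modeling-example/datasets/validation/main.tok \
    language-modeling-example/datasets/test/main.tok
head -3 language-modeling-example/main.tok

Each line of main.tok should be a sequence of space-separated tokens.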
Note the directory structure:
language-modeling-example/ : a directory representing this dataset
    main.tok : the pretokenized training set
    datasets/ : additional datasets that should be processed with the training set's vocabulary
        validation/
            main.tok : the pretokenized validation set
        test/
            main.tok : the pretokenized test set
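You can verify this layout on disk with a standard tool such as find:

find language-modeling-example -type f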
Now, “prepare” the data by figuring out the vocabulary of the training data and converting all tokens to integers:
rau lm prepare \
--training-data language-modeling-example \
--more-data validation \
--more-data test \
--never-allow-unk
The flag --training-data
refers to the directory containing our dataset.
The flag --more-data
indicates the name of a directory under
language-modeling-example/datasets
to prepare using the vocabulary of the
training data. The flag --never-allow-unk
indicates that the training data
does not contain a designated unknown (UNK) token, and out-of-vocabulary tokens
should be treated as errors at inference time.
Note the new files generated:
language-modeling-example/
    main.prepared : the prepared training set
    main.vocab : the vocabulary of the training data
    datasets/
        validation/
            main.prepared : the prepared validation set
        test/
            main.prepared : the prepared test set
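Again, a plain find call will confirm that the prepared files and vocabulary are in place:

find language-modeling-example -name '*.prepared' -o -name '*.vocab'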
Now, train a transformer language model:
rau lm train \
--training-data language-modeling-example \
--architecture transformer \
--num-layers 6 \
--d-model 64 \
--num-heads 8 \
--feedforward-size 256 \
--dropout 0.1 \
--init-scale 0.1 \
--max-epochs 10 \
--max-tokens-per-batch 2048 \
--optimizer Adam \
--initial-learning-rate 0.01 \
--gradient-clipping-threshold 5 \
--early-stopping-patience 2 \
--learning-rate-patience 1 \
--learning-rate-decay-factor 0.5 \
--examples-per-checkpoint 50000 \
--output saved-language-model
This saves a transformer language model to the directory saved-language-model.
Finally, calculate the perplexity of this language model on the test set:
rau lm evaluate \
--load-model saved-language-model \
--training-data language-modeling-example \
--input test \
--batching-max-tokens 2048
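For reference, perplexity is the standard exponentiated average negative log-likelihood of the test tokens; lower is better. For a test set of N tokens w_1, ..., w_N:

\mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i}) \right)

Toolkits differ in details such as whether end-of-sequence tokens are counted, so be cautious when comparing perplexities across implementations.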
Sequence-to-Sequence
For this example, we’ll train a transformer encoder-decoder on the question formation task of McCoy et al. [2020], which involves converting a declarative sentence in English to question form.
Download the dataset:
mkdir sequence-to-sequence-example
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.train > sequence-to-sequence-example/train.tsv
cut -f 1 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/source.tok
cut -f 2 < sequence-to-sequence-example/train.tsv > sequence-to-sequence-example/target.tok
mkdir sequence-to-sequence-example/datasets
mkdir sequence-to-sequence-example/datasets/validation
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.dev > sequence-to-sequence-example/validation.tsv
cut -f 1 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/source.tok
cut -f 2 < sequence-to-sequence-example/validation.tsv > sequence-to-sequence-example/datasets/validation/target.tok
mkdir sequence-to-sequence-example/datasets/test
curl -s https://raw.githubusercontent.com/tommccoy1/rnn-hierarchical-biases/master/data/question.test | head -100 > sequence-to-sequence-example/test.tsv
cut -f 1 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/source.tok
cut -f 2 < sequence-to-sequence-example/test.tsv > sequence-to-sequence-example/datasets/test/target.tok
rm sequence-to-sequence-example/{train,validation,test}.tsv
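Since line i of each source file must align with line i of the corresponding target file, it is worth checking that the two sides of every split have matching line counts (plain shell):

for d in sequence-to-sequence-example \
         sequence-to-sequence-example/datasets/validation \
         sequence-to-sequence-example/datasets/test; do
    wc -l "$d"/source.tok "$d"/target.tok
done

(The test split was truncated to 100 lines by the head -100 above, so expect 100 for both of its sides.)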
Note the directory structure:
sequence-to-sequence-example/ : a directory representing this dataset
    source.tok : the source side of the pretokenized training set
    target.tok : the target side of the pretokenized training set
    datasets/ : additional datasets that should be processed with the training set's vocabulary
        validation/
            source.tok : the source side of the pretokenized validation set
            target.tok : the target side of the pretokenized validation set
        test/
            source.tok : the source side of the pretokenized test set
            target.tok : the target side of the pretokenized test set
Now, “prepare” the data by figuring out the vocabulary of the training data and converting all tokens to integers:
rau ss prepare \
--training-data sequence-to-sequence-example \
--vocabulary-types shared \
--more-data validation \
--more-source-data test \
--never-allow-unk
The flag --training-data
refers to the directory containing our dataset.
The flag --vocabulary-types shared
means that the script will generate a
single vocabulary that is shared by both the source and target sides. This
makes it possible to tie source and target embeddings. The flag --more-data
indicates the name of a directory under
sequence-to-sequence-example/datasets
to prepare using the vocabulary of
the training data (both the source and target sides will be prepared). The flag
--more-source-data
does the same thing, but it only prepares the source
side (only the source side is necessary for generating translations on a test
set). The flag --never-allow-unk
indicates that the training data does not
contain a designated unknown (UNK) token, and out-of-vocabulary tokens should
be treated as errors at inference time.
Note the new files generated:
sequence-to-sequence-example/
    source.shared.prepared
    target.shared.prepared
    shared.vocab : a shared vocabulary of tokens that appear in either the source or target side of the training set
    datasets/
        validation/
            source.shared.prepared
            target.shared.prepared
        test/
            source.shared.prepared
            target.shared.prepared
Now, train a transformer encoder-decoder model:
rau ss train \
--training-data sequence-to-sequence-example \
--vocabulary-type shared \
--num-encoder-layers 6 \
--num-decoder-layers 6 \
--d-model 64 \
--num-heads 8 \
--feedforward-size 256 \
--dropout 0.1 \
--init-scale 0.1 \
--max-epochs 10 \
--max-tokens-per-batch 2048 \
--optimizer Adam \
--initial-learning-rate 0.01 \
--label-smoothing-factor 0.1 \
--gradient-clipping-threshold 5 \
--early-stopping-patience 2 \
--learning-rate-patience 1 \
--learning-rate-decay-factor 0.5 \
--examples-per-checkpoint 50000 \
--output saved-sequence-to-sequence-model
This saves a model to the directory saved-sequence-to-sequence-model.
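One flag that did not appear in the language-modeling recipe is --label-smoothing-factor. As a reminder, label smoothing (in its standard form; Rau's exact variant may differ in detail) replaces the one-hot training target y with a softened distribution over the vocabulary of size V, using smoothing factor \epsilon:

q(k) = (1 - \epsilon)\,\mathbf{1}[k = y] + \frac{\epsilon}{V}

With \epsilon = 0.1, most of the probability mass stays on the correct token while the rest is spread uniformly, which discourages overconfident predictions.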
Finally, translate the source sequences in the test data using beam search:
rau ss translate \
--load-model saved-sequence-to-sequence-model \
--input sequence-to-sequence-example/datasets/test/source.shared.prepared \
--beam-size 4 \
--max-target-length 50 \
--batching-max-tokens 256 \
--shared-vocabulary-file sequence-to-sequence-example/shared.vocab
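As a rough quality measure, you can compare the generated questions against the held-out target side. The redirection below assumes, and this is an assumption rather than documented behavior, that rau ss translate writes one tokenized translation per line to standard output; adjust accordingly if your version writes somewhere else:

rau ss translate \
    --load-model saved-sequence-to-sequence-model \
    --input sequence-to-sequence-example/datasets/test/source.shared.prepared \
    --beam-size 4 \
    --max-target-length 50 \
    --batching-max-tokens 256 \
    --shared-vocabulary-file sequence-to-sequence-example/shared.vocab \
    > translations.tok

# Exact-match accuracy over the 100 test sentences (translations.tok is the
# hypothetical output file captured above).
paste translations.tok sequence-to-sequence-example/datasets/test/target.tok \
    | awk -F '\t' '{ n++; if ($1 == $2) c++ } END { printf "exact match: %d/%d\n", c, n }'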