rau.vocab

This module provides tools for mapping token types to integer IDs.

class rau.vocab.Vocabulary

Bases: object

An abstract base class that represents a mapping between token types and integer IDs.

class rau.vocab.VocabularyBuilder

Bases: Generic[V]

An abstract base class for constructing Vocabulary objects of a particular type V.

catchall(token)

Build a vocabulary that maps every token type to a single catchall token type. This implements the behavior of an UNK (unknown) token.

Parameters:

token (str) – A token string used to represent the catchall token.

Return type:

TypeVar(V)

Returns:

A vocabulary that maps all token strings to the specified token.
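
To illustrate the catchall idea, here is a minimal standalone sketch that does not use rau. The helper `make_to_int` and the choice of ID 2 for the catchall are hypothetical and are not taken from the library. It only shows the general UNK pattern: known tokens keep their own IDs, and every other string falls back to one shared ID.

```python
# Standalone sketch; does not import rau. All names here are hypothetical.
def make_to_int(content_tokens, catchall_id):
    """Return a lookup function that falls back to catchall_id."""
    table = {token: i for i, token in enumerate(content_tokens)}

    def to_int(token):
        # Known tokens get their own ID; everything else shares the catchall ID.
        return table.get(token, catchall_id)

    return to_int


to_int = make_to_int(["the", "cat"], catchall_id=2)
known_id = to_int("cat")
unknown_id = to_int("zebra")
```

In this sketch any unseen string, not just "zebra", resolves to the same catchall ID.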

content(tokens)

Build a vocabulary that assigns consecutive integer IDs to a list of token strings. These are “content” tokens in the sense that they come from a corpus and are not special tokens.

Parameters:

tokens (list[str]) – A list of token type strings.

Return type:

TypeVar(V)

Returns:

A vocabulary containing the specified tokens.
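
As a rough picture of "consecutive integer IDs", the standalone sketch below assigns IDs in list order. The function `assign_content_ids` and its `first_id` parameter are hypothetical; the source does not say where rau starts numbering or how content tokens are positioned relative to other tokens.

```python
# Standalone sketch; does not import rau. Names are hypothetical.
def assign_content_ids(tokens, first_id=0):
    """Map each token string to a consecutive integer ID, in list order."""
    return {token: first_id + i for i, token in enumerate(tokens)}


ids = assign_content_ids(["a", "b", "c"])
shifted = assign_content_ids(["x", "y"], first_id=5)
```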

reserved(tokens)

Build a vocabulary that assigns consecutive integer IDs to a list of special reserved tokens. The token strings of these special tokens are for display purposes only and will never conflict with content tokens.

Parameters:

tokens (list[str]) – A list of display names for the reserved tokens.

Return type:

TypeVar(V)

Returns:

A vocabulary containing the specified tokens.
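
The "never conflict" property can be pictured with a toy vocabulary in which reserved names are stored for display only and are never used for string lookup. This is a standalone sketch under assumptions: the class `ToyVocab` and the choice to place reserved IDs after content IDs are hypothetical and are not rau's implementation.

```python
# Standalone sketch; does not import rau. Names and ID layout are hypothetical.
class ToyVocab:
    def __init__(self, content, reserved):
        # Content tokens are the only ones that can be looked up by string.
        self.content_ids = {token: i for i, token in enumerate(content)}
        # Every ID has a display string; reserved names come after content.
        self.display = list(content) + list(reserved)

    def to_int(self, token):
        return self.content_ids[token]

    def to_string(self, index):
        return self.display[index]


# The corpus happens to contain the literal string "<eos>".
vocab = ToyVocab(content=["hello", "<eos>"], reserved=["<eos>"])
content_eos_id = vocab.to_int("<eos>")
reserved_eos_id = len(vocab.content_ids)  # first ID after the content tokens
```

Here both IDs 1 and 2 display as "<eos>", but looking up the string "<eos>" can only ever return the content ID.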

rau.vocab.build_to_int_vocabulary(func)
Return type:

ToIntVocabulary
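
The signature suggests that func is a callback that receives a builder and returns the vocabulary it builds, though this page does not state that explicitly. Under that assumption, the standalone sketch below shows why such a pattern is useful: one description of the vocabulary can be run against different builders to get both mapping directions. All names here (`ToyToIntBuilder`, `ToyToStringBuilder`, `build_with`, `describe`) are hypothetical and none of this code calls rau.

```python
# Standalone sketch; does not import rau. All classes and functions are hypothetical.
class ToyToIntBuilder:
    def content(self, tokens):
        # String-to-ID direction.
        return {token: i for i, token in enumerate(tokens)}


class ToyToStringBuilder:
    def content(self, tokens):
        # ID-to-string direction: the list index is the ID.
        return list(tokens)


def build_with(builder, func):
    # Assumed pattern: hand the builder to the caller's description function.
    return func(builder)


def describe(builder):
    # Written once, independent of which builder is supplied.
    return builder.content(["a", "b"])


to_int = build_with(ToyToIntBuilder(), describe)
to_string = build_with(ToyToStringBuilder(), describe)
```

Because both results come from the same `describe` function, the two directions stay in agreement by construction.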

class rau.vocab.ToIntVocabulary

Bases: Vocabulary

has_catchall()
Return type:

bool

to_int(token)
Return type:

int

class rau.vocab.ToIntVocabularyBuilder

Bases: VocabularyBuilder[ToIntVocabulary]

catchall(token)
Return type:

ToIntVocabulary

content(tokens)
Return type:

ToIntVocabulary

reserved(tokens)
Return type:

ToIntVocabulary

rau.vocab.build_to_string_vocabulary(func)
Return type:

ToStringVocabulary

class rau.vocab.ToStringVocabulary

Bases: Vocabulary

__init__(reserved_names)

to_string(index)
Return type:

str

class rau.vocab.ToStringVocabularyBuilder

Bases: VocabularyBuilder[ToStringVocabulary]

catchall(token)
Return type:

ToStringVocabulary

content(tokens)
Return type:

ToStringVocabulary

reserved(tokens)
Return type:

ToStringVocabulary