
Feature request: steps to produce the SentencePiece tokenizer's vocabulary.proto (chapter 18) #282

@jchwenger

Description

Hi @mattdangerw,

I've been looking into the material for this chapter (whilst teaching the whole book, glad to see the updates in this third edition!), and for educational purposes it would be quite nice to be able to say: by the way, this vocabulary.proto file for Mini-C4 was produced in this way.

Currently, I have this, close but no cigar:

import io
import pathlib
import sentencepiece
import tensorflow as tf
import keras
import keras_hub

TEXTGEN_DIR = pathlib.Path("text-generation")
TEXTGEN_DIR.mkdir(exist_ok=True)

extract_dir = keras.utils.get_file(
    fname="mini-c4",
    origin=("https://hf.co/datasets/mattdangerw/mini-c4/resolve/main/mini-c4.zip"),
    cache_dir=TEXTGEN_DIR,
    extract=True,
)
DS_DIR = pathlib.Path(extract_dir) / "mini-c4"

files = [str(file) for file in DS_DIR.glob("*.txt")]
ds = tf.data.Dataset.from_tensor_slices(files)

def read_file(filename):
    lines = tf.data.TextLineDataset(filename).filter(lambda x: tf.strings.length(x) > 0)
    # avoid `Found null character.` warning
    # lines = lines.map(lambda x: tf.strings.regex_replace(x, r"\x00", ""))
    # '\n' is escaped in the dataset, so each line holds one multiline document
    lines = lines.map(lambda x: tf.strings.regex_replace(x, r"\\n", "\n"))
    # each line is a 'document' -> append the 'end of document' token
    lines = lines.map(lambda x: tf.strings.join([x, "<|endoftext|>"]))
    return lines


ds = ds.interleave(read_file, cycle_length=32, num_parallel_calls=32)
bytes_io = io.BytesIO()
# options: https://github.com/google/sentencepiece/blob/master/doc/options.md
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=ds.as_numpy_iterator(),
    model_type="bpe",  # default is "unigram"
    model_writer=bytes_io,
    # size extracted using Matthew Watson's proto file -> tokenizer.vocabulary_size()
    vocab_size=32000,
    # we can add a special token ourselves (not in Watson's file)
    # user_defined_symbols=["<|endoftext|>"],
    # overkill: guarantees we catch even very long lines (default: 4192)
    max_sentence_length=500_000,
    # no cap on the number of sentences used; adjust to available RAM (10_000_000 would be ok)
    input_sentence_size=0,
    # the default: random sample rather than just the first N files
    shuffle_input_sentence=True,
    # default: 0.9995
    character_coverage=0.99995,
    # make sure we have all 256 bytes in the vocab
    byte_fallback=True,
    # optional tweaks
    split_digits=True,
    allow_whitespace_only_pieces=True,
)
proto = bytes_io.getvalue()

# save proto
vocabulary_file = str(TEXTGEN_DIR / "vocabulary.from-scratch.proto")
with open(vocabulary_file, "wb") as o:
    o.write(proto)

tok = keras_hub.tokenizers.SentencePieceTokenizer(proto)

vocabulary_file_ref = keras.utils.get_file(
    origin="https://hf.co/mattdangerw/spiece/resolve/main/vocabulary.proto",
)
tok_ref = keras_hub.tokenizers.SentencePieceTokenizer(vocabulary_file_ref)

# get vocabs
ref_vocab = [tok_ref.id_to_token(i) for i in range(tok_ref.vocabulary_size())]
scratch_vocab = [tok.id_to_token(i) for i in range(tok.vocabulary_size())]

vocabs_path = TEXTGEN_DIR / "vocabs.txt"
print(f"saving vocabs to {vocabs_path}")
with open(vocabs_path, "w") as o:
    for ref, scr in zip(ref_vocab, scratch_vocab):
        o.write(f"{ref}\t{scr}\n")
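For a quicker check than eyeballing vocabs.txt, one could also compute set overlap and positional agreement between the two vocabularies. This is just a sketch, not anything from the book: `compare_vocabs` is a hypothetical helper, shown here on toy lists; in the script above you would call it on `ref_vocab` and `scratch_vocab`:

```python
def compare_vocabs(ref, scratch):
    """Return (Jaccard overlap of token sets, fraction of ids with the same token)."""
    ref_set, scratch_set = set(ref), set(scratch)
    jaccard = len(ref_set & scratch_set) / len(ref_set | scratch_set)
    same_id = sum(r == s for r, s in zip(ref, scratch)) / min(len(ref), len(scratch))
    return jaccard, same_id

# toy example: two small "vocabularies" that share most pieces
ref = ["<unk>", "<s>", "</s>", "▁the", "▁a", "ing"]
scratch = ["<unk>", "<s>", "</s>", "▁the", "ing", "▁of"]
jaccard, same_id = compare_vocabs(ref, scratch)
```

A high Jaccard with low positional agreement would suggest similar training data but different piece ordering (e.g. different frequencies or merge order), whereas both being near 1.0 would mean the runs are essentially equivalent.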

I'm left with a few questions: is it close-ish, and if so, what options did I overlook that would make it identical? (If not, was the tokenizer trained on full C4 rather than Mini-C4?) I'm also surprised you didn't add "<|endoftext|>" as a user_defined_symbol, for instance, but perhaps that's because you use this tokenizer more broadly, not just for the Mini-GPT example?
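On the "identical" question: before diffing vocabularies token by token, a quick first check is whether the two proto files are byte-identical. (A mismatch doesn't by itself prove different settings, since SentencePiece training isn't guaranteed to be deterministic across versions or thread counts.) A minimal sketch; the paths in the comment are the ones produced by the script above:

```python
import hashlib

def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# identical training settings *and* deterministic training would give equal digests:
# sha256_of("text-generation/vocabulary.from-scratch.proto") == sha256_of(vocabulary_file_ref)
```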

Thanks in advance!
