Build a baseline 'word-based' tokenizer
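
One way such a baseline could look is sketched below. It is a minimal illustration, not a definitive implementation: the class name `WordTokenizer`, its methods, the `<unk>` token, the `min_freq` parameter, lowercasing, and the regex split rule are all assumptions, since the source names only the goal.

```python
import re
from collections import Counter

# Assumed split rule: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


class WordTokenizer:
    """Hypothetical baseline: one vocabulary entry per distinct word."""

    def __init__(self, unk_token="<unk>"):
        # Id 0 is reserved for words not seen during fit().
        self.unk_token = unk_token
        self.token_to_id = {unk_token: 0}
        self.id_to_token = [unk_token]

    def tokenize(self, text):
        # Lowercasing is an assumption; drop it to keep case distinctions.
        return TOKEN_PATTERN.findall(text.lower())

    def fit(self, texts, min_freq=1):
        # Count every token, then assign ids from most to least frequent.
        counts = Counter(tok for text in texts for tok in self.tokenize(text))
        for tok, freq in counts.most_common():
            if freq >= min_freq and tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.id_to_token)
                self.id_to_token.append(tok)
        return self

    def encode(self, text):
        # Unknown words map to the reserved <unk> id.
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in self.tokenize(text)]

    def decode(self, ids):
        # Joining with spaces is lossy: original spacing and case are not restored.
        return " ".join(self.id_to_token[i] for i in ids)


if __name__ == "__main__":
    tokenizer = WordTokenizer().fit(["The cat sat.", "The dog sat."])
    ids = tokenizer.encode("The bird sat.")
    print(ids)
    print(tokenizer.decode(ids))
```

Two limitations make this useful mainly as a reference point: the vocabulary grows with every distinct word in the corpus, and any word unseen during `fit` collapses to `<unk>`.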