LLM 25-Day Course - Day 15: Tokenizers in Depth

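A tokenizer is the translator between text and the model. Without correct tokenization, even a very good model will not behave properly. Today covers the advanced tokenizer topics that come up most often in practice.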

encode/decode and special tokens

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, how are you?"

# encode: text -> list of token IDs
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
# [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# 101=[CLS] and 102=[SEP] are special tokens, added automatically

# decode: token IDs -> text (special tokens are kept unless skip_special_tokens=True)
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# encode without special tokens
token_ids_no_special = tokenizer.encode(text, add_special_tokens=False)
print(f"Without special tokens: {token_ids_no_special}")

# Inspect how the text is split into tokens
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
# ['hello', ',', 'how', 'are', 'you', '?']
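
To make the role of the special tokens concrete, here is a minimal pure-Python sketch. It is a toy, not the real WordPiece algorithm or the library's implementation: the vocabulary is a hypothetical eight-entry table whose IDs are copied from the bert-base-uncased output above, and toy_encode / toy_decode are made-up helper names.

```python
# Toy illustration only: a hypothetical word-level vocabulary, not real WordPiece.
# The IDs are copied from the bert-base-uncased output shown above.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, ",": 1010,
         "how": 2129, "are": 2024, "you": 2017, "?": 1029}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}
SPECIAL_IDS = {VOCAB["[CLS]"], VOCAB["[SEP]"]}


def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to IDs, optionally wrapping them in [CLS] ... [SEP]."""
    ids = [VOCAB[t] for t in tokens]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens, optionally dropping the special ones."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return " ".join(ID_TO_TOKEN[i] for i in ids)


tokens = ["hello", ",", "how", "are", "you", "?"]
with_special = toy_encode(tokens)
without_special = toy_encode(tokens, add_special_tokens=False)

print(with_special)      # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
print(without_special)   # [7592, 1010, 2129, 2024, 2017, 1029]
print(toy_decode(with_special))                            # [CLS] hello , how are you ? [SEP]
print(toy_decode(with_special, skip_special_tokens=True))  # hello , how are you ?
```

The real tokenizer.decode() accepts the same skip_special_tokens flag, which is usually what you want when showing model output to a user.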

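padding and truncation strategies

In batch processing, every input in a batch must have the same length. Padding extends shorter inputs with a pad token, and truncation cuts inputs that exceed the length limit.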

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

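sentences = [
    "A short sentence",
    "This one is a slightly longer sentence",
    "This is a very, very, very long sentence that shows the difference between padding and truncation",
]

# Batch tokenization: apply padding and truncation together
encoded = tokenizer(
    sentences,
    padding=True,          # pad to the longest sequence in the batch
    truncation=True,       # cut anything longer than max_length
    max_length=20,         # maximum number of tokens, special tokens included
    return_tensors="pt",   # return PyTorch tensors
)

print(f"input_ids shape: {encoded['input_ids'].shape}")
print(f"attention_mask shape: {encoded['attention_mask'].shape}")

# attention_mask: 1 marks a real token, 0 marks padding
# padding=True pads to the longest sequence in the batch, which can be shorter than max_length
padded_len = encoded["input_ids"].shape[1]
for i, sent in enumerate(sentences):
    real_tokens = encoded["attention_mask"][i].sum().item()
    print(f"Sentence {i+1}: {real_tokens} real tokens, {padded_len - real_tokens} padding")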
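
The mechanics behind those two options can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's code: pad_and_truncate is a hypothetical helper, pad_id=0 is assumed (it matches [PAD] in bert-base-uncased), and truncation here is a plain slice, whereas a real tokenizer also keeps trailing special tokens such as [SEP] in place.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each ID list to max_length, then right-pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)  # "pad to longest in batch"
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Three made-up ID sequences of different lengths
batch = [[5, 6], [5, 6, 7, 8], [5, 6, 7, 8, 9, 10, 11]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask, f"padding: {mask.count(0)}")
# [5, 6, 0, 0, 0] [1, 1, 0, 0, 0] padding: 3
# [5, 6, 7, 8, 0] [1, 1, 1, 1, 0] padding: 1
# [5, 6, 7, 8, 9] [1, 1, 1, 1, 1] padding: 0
```

Padding to the longest sequence keeps batches as small as possible; passing padding="max_length" to the real tokenizer instead pads every sequence to max_length, which gives fixed shapes at the cost of extra padding.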

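Building conversation formats with chat_template

Recent instruction-tuned LLMs expect their input in a conversation format (system/user/assistant). apply_chat_template() applies the correct format for each model automatically.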

from transformers import AutoTokenizer

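# Note: this repository is gated on the Hugging Face Hub, so access must be requested first
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a friendly Korean-language AI assistant."},
    {"role": "user", "content": "Tell me three advantages of Python."},
]

# Convert the conversation into the model's own format
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,             # return a string (True returns token IDs)
    add_generation_prompt=True, # append the header that opens the assistant's reply
)
print(formatted)

# Format and tokenize in one step
encoded = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,           # explicit dict output, independent of the default return type
    return_tensors="pt",
)
print(f"Token count: {encoded['input_ids'].shape[-1]}")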

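Chat templates differ from model to model. Llama 3 opens the prompt with <|begin_of_text|>, wraps each role name in <|start_header_id|> and <|end_header_id|>, and ends each turn with <|eot_id|>, while ChatML-style models delimit turns with <|im_start|> and <|im_end|>. With apply_chat_template() you do not have to handle these differences by hand.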
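
To see what a template does under the hood, the following sketch renders a conversation in the ChatML layout by hand. It is an approximation for illustration: render_chatml is a hypothetical helper, and the real templates are Jinja strings stored with each tokenizer, so details such as whitespace or default system prompts can differ per model.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name three strengths of Python."},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice, prefer apply_chat_template() over hand-written formatting like this: a prompt that deviates from the format the model was trained on, even by a missing newline, can degrade output quality.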

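Today's exercises

Tokenize the same English sentence with the gpt2 and bert-base-uncased tokenizers, then compare the token counts and how each one splits the text.
Batch-tokenize five sentences of different lengths with padding="max_length", max_length=32, then compute the padding ratio of each sentence from its attention_mask.
Compare the apply_chat_template() output of two models (for example, Llama and Mistral) to see how the same conversation is formatted differently by each.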