[Offline RL Chatbot] Attempting an Implementation with Policy Gradient - Reinforcement Learning (2)


※ This is only a record of an attempt, not a correct method, so please do not follow it. ※

3. First Attempt at Reinforcement Learning with KoGPT2

1) Original Paper Code

(1) Function That Computes the Loss and the Model-Generated Response for a Single Episode

def rl(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, batch_size, teacher_forcing_ratio):
    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for rnn packing should always be on the cpu
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    response = []

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, n_total = mask_nll_loss(
                decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * n_total)
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor(
                [[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, n_total = mask_nll_loss(
                decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * n_total)

            #ni or decoder_output
            response.append(topi)
    print("rl-print_losses", print_losses)

    return loss, max_target_len, response

-input_variable: input, target_variable: ground-truth target, mask: mask tensor for target_variable

-With teacher forcing: the current ground-truth token from target_variable is used as the next decoder input

-Without teacher forcing: the token the decoder itself just generated (the top-1 of decoder_output) is used as the next decoder input

(Could the later failure possibly be because teacher forcing was not used... I'll add it to the list of things to consider for now.)

-Generate tokens up to the maximum target length and compute the loss for them with mask_nll_loss (a sketch of this helper follows the list below)

-Return the loss, the maximum target length, and the generated response
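The rl function calls mask_nll_loss, which is not reproduced in this post. Below is a minimal sketch, assuming it matches the masked negative log-likelihood helper from the PyTorch chatbot tutorial this code is adapted from; the exact name and signature here are assumptions.

import torch

def mask_nll_loss(decoder_output, target, mask):
    # Negative log-likelihood of the target tokens, averaged only over positions
    # where mask is True (real tokens), so padded positions are ignored.
    n_total = mask.sum()
    cross_entropy = -torch.log(
        torch.gather(decoder_output, 1, target.view(-1, 1)).squeeze(1))
    loss = cross_entropy.masked_select(mask).mean()
    return loss, n_total.item()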

 

(2) Full Training Loop Code

def training_rl_loop(model_name, voc, pairs, batch_size, forward_encoder, forward_encoder_optimizer, forward_decoder, forward_decoder_optimizer, backward_encoder, backward_encoder_optimizer, backward_decoder, backward_decoder_optimizer,teacher_forcing_ratio,n_iteration, print_every, save_every, save_dir):

    dull_responses = ["i do not know what you are talking about.", "i do not know.", "you do not know.", "you know what i mean.", "i know what you mean.", "you know what i am saying.", "you do not know anything."]

    # Load batches for each iteration
    training_batches = [batch_2_train_data(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0


    #Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        print("Iteration", iteration)
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        ##MODIFS HERE
        # Zero the optimizers' gradients
        forward_encoder_optimizer.zero_grad()
        forward_decoder_optimizer.zero_grad()

        backward_encoder_optimizer.zero_grad()
        backward_decoder_optimizer.zero_grad()

        #Forward
        forward_loss, forward_len, _ = rl(input_variable, lengths, target_variable, mask, max_target_len, forward_encoder, forward_decoder, batch_size, teacher_forcing_ratio)

        #Calculate reward
        reward = calculate_rewards(input_variable, lengths, target_variable, mask, max_target_len, forward_encoder, forward_decoder, backward_encoder, backward_decoder, batch_size, teacher_forcing_ratio)

        #Update forward seq2seq with loss scaled by reward
        loss = forward_loss * reward

        loss.backward()
        forward_encoder_optimizer.step()
        forward_decoder_optimizer.step()

        # Run a training iteration with batch
        print_loss += loss / forward_len

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        #SAVE CHECKPOINT TO DO
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name)#, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': forward_encoder.state_dict(),
                'de': forward_decoder.state_dict(),
                'en_opt': forward_encoder_optimizer.state_dict(),
                'de_opt': forward_decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

-Repeat for the number of iterations I specified

-Compute the loss with the rl function and the final reward value with the calculate_rewards function

→ Multiply the two to get the final loss, and use it to update the policy parameters (here, the parameters of the Seq2seq model); a toy illustration of this REINFORCE-style update follows below
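calculate_rewards itself is not shown in the post, so the snippet below is only a toy illustration with made-up tensors (not the author's code) of why scaling the NLL loss by a scalar reward gives a REINFORCE-style update: the gradient becomes reward × ∇(−log p(generated tokens)), so responses that earn a larger reward receive a proportionally larger push.

import torch

# Toy stand-ins: `logits` plays the role of the decoder outputs over one generated
# response, and `reward` plays the role of the value returned by calculate_rewards.
logits = torch.randn(1, 5, 100, requires_grad=True)   # (batch, seq_len, vocab)
sampled = torch.randint(0, 100, (1, 5))               # tokens sampled by the policy
log_probs = torch.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(2, sampled.unsqueeze(-1)).squeeze(-1)
nll = -token_log_probs.sum()                          # analogous to the loss accumulated in rl()
reward = 0.7                                          # assumed scalar reward value
(nll * reward).backward()                             # gradient is scaled by the reward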

 

(3) Code That Actually Runs the Training

# Load/Assemble voc and pairs
voc, pairs = load_prepare_data(corpus, corpus_name, datafile, save_dir)
for pair in pairs[:10]:
    print(pair)

# Configure models
model_name = 'cb_model'
attn_model = 'dot'
# attn_model = 'general'
# attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 10000  # 4000
# loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))
# print(loadFilename)

# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    #checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']

print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(
    attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 500  # 4000
print_every = 1
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# If you have cuda, configure cuda to call
for state in encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

for state in decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

# Run training iterations
print("Starting Training!")

forward_encoder = encoder
forward_decoder = decoder
forward_encoder = forward_encoder.to(device)
forward_decoder = forward_decoder.to(device)

backward_encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
backward_decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
backward_encoder = backward_encoder.to(device)
backward_decoder = backward_decoder.to(device)

# Configure RL model
model_name='RL_model_seq'
n_iteration = 10000
print_every=100
save_every=500
learning_rate = 0.0001
decoder_learning_ratio = 5.0
teacher_forcing_ratio = 0.5

# Ensure dropout layers are in train mode
forward_encoder.train()
forward_decoder.train()

backward_encoder.train()
backward_decoder.train()

# Initialize optimizers
print('Building optimizers ...')
forward_encoder_optimizer = optim.Adam(forward_encoder.parameters(), lr=learning_rate)
forward_decoder_optimizer = optim.Adam(forward_decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
backward_encoder_optimizer = optim.Adam(backward_encoder.parameters(), lr=learning_rate)
backward_decoder_optimizer = optim.Adam(backward_decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

# If you have cuda, configure cuda to call
for state in forward_encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

for state in forward_decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

for state in backward_encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

for state in backward_decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

# Run training iterations
print("Starting Training!")
training_rl_loop(model_name, voc, pairs, batch_size, forward_encoder, forward_encoder_optimizer, forward_decoder, forward_decoder_optimizer, backward_encoder, backward_encoder_optimizer, backward_decoder, backward_decoder_optimizer,teacher_forcing_ratio,n_iteration, print_every, save_every, save_dir)

-Set up the models, optimizers, and other configuration, then run the training_rl_loop function

 

 

2) Training Code for Reinforcement Learning Based on KoGPT2

(1) Modified rl Function Code

def RL(token_ids, mask, labels_ids, forward_model, forward_optimizer, criterion):
  # Forward pass through the GPT-2 model
  output = forward_model(token_ids)
  output = output.logits

  # Calculate and accumulate loss
  mask_3d = mask.unsqueeze(dim=2).repeat(1, 1, output.shape[2]).to(device)
  mask_out = torch.where(mask_3d == 1, output, Sneg * torch.ones_like(output)).to(device)
  loss = criterion(mask_out.transpose(2, 1), labels_ids).to(device)
  avg_loss = loss.sum() / mask.sum()

  return avg_loss, output

-token_ids: input tokens, mask: mask tensor over token_ids, labels_ids: ground-truth labels

-Feed token_ids into the model and obtain the output logits

-Expand the mask to the shape of the model output, mask out the non-target positions, then compute the loss against the original labels labels_ids (see the sketch below)
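To make the masking arithmetic above concrete, here is a shape-level sketch with dummy tensors (batch size, lengths, and vocabulary size are made up). Positions with mask == 0 get constant logits, so no gradient flows back to the model for them; dividing by mask.sum() normalizes the loss by the number of answer tokens.

import torch

B, T, V = 2, 4, 10                                     # dummy batch, sequence length, vocab size
Sneg = -1e18
logits = torch.randn(B, T, V, requires_grad=True)      # stands in for output.logits
labels_ids = torch.randint(0, V, (B, T))
mask = torch.tensor([[0, 1, 1, 1], [0, 0, 1, 1]])      # 1 = answer token to train on

criterion = torch.nn.CrossEntropyLoss(reduction="none")
mask_3d = mask.unsqueeze(dim=2).repeat(1, 1, V)
# Where mask == 0 the logits are replaced by a constant, so those positions carry no gradient
mask_out = torch.where(mask_3d == 1, logits, Sneg * torch.ones_like(logits))
loss = criterion(mask_out.transpose(2, 1), labels_ids)  # per-token loss, shape (B, T)
avg_loss = loss.sum() / mask.sum()                      # normalized by the number of answer tokens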

 

(2) Modified Full Training Loop Code

def training_rl_loop(data, epochs, forward_model, forward_optimizer, backward_model, backward_optimizer, criterion):
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  forward_model.to(device)
  backward_model.to(device)

  # Training loop
  print("Training...")
  for epoch in range(epochs):  # use the epochs argument (set it to whatever you want)
    print(f"Epoch: {epoch} Start")
    for batch_idx, samples in tqdm(enumerate(data)):  # `data` is the DataLoader passed in
      token_ids, mask, labels_ids = samples
      token_ids = token_ids.to(device)
      mask = mask.to(device)
      labels_ids = labels_ids.to(device)

      forward_optimizer.zero_grad()
      backward_optimizer.zero_grad()

      # Forward
      forward_loss, _ = RL(token_ids, mask, labels_ids, forward_model, forward_optimizer, criterion)

      # Calculate reward
      reward = calculate_rewards(token_ids, mask, labels_ids, forward_model, backward_model, criterion)

      loss = forward_loss.mean() * reward
      loss.backward()
      forward_optimizer.step()
      backward_optimizer.step()

    # Print
    print(f"Epoch: {epoch}; Average loss: {loss}")

    # Save
    checkpoint = {
          "epoch": epoch,
          "forward_model_state_dict": forward_model.state_dict(),
          "backward_model_state_dict": backward_model.state_dict(),
          "forward_optimizer_state_dict": forward_optimizer.state_dict(),
          "backward_optimizer_state_dict": backward_optimizer.state_dict()}
    torch.save(checkpoint, f"./train_RL/5_{epoch}_kogpt2_checkpoint.pt")

-Run reinforcement learning for the specified number of epochs

-Save the model at the end of every epoch

 

 

3) Training Data for Reinforcement Learning

*For the first RL attempt, the data-splitting scheme used in the reference paper was applied

-200,000 rows of data in this format were used for training

 

 

4) Code to Start the Reinforcement Learning

learning_rate = 3e-5
Sneg = -1e18
epochs = 10
batch_size = 2
max_length = 64

forward_model = GPT2LMHeadModel.from_pretrained("skt/kogpt2-base-v2")
forward_model.resize_token_embeddings(len(TOKENIZER))
forward_checkpoint = torch.load("./train_RL/5_0_kogpt2_checkpoint.pt")
forward_model.load_state_dict(forward_checkpoint["forward_model_state_dict"])

backward_model = GPT2LMHeadModel.from_pretrained("skt/kogpt2-base-v2")
backward_model.resize_token_embeddings(len(TOKENIZER))
backward_checkpoint = torch.load("./train_RL/5_0_kogpt2_checkpoint.pt")
backward_model.load_state_dict(backward_checkpoint["backward_model_state_dict"])

forward_model = forward_model.to(device)
backward_model = backward_model.to(device)

train_set = ChatbotDataset(ChatbotData, max_len=64)
train_dataloader = DataLoader(train_set, batch_size, num_workers=0, shuffle=True, collate_fn=collate_batch)

criterion = torch.nn.CrossEntropyLoss(reduction="none").to(device)
forward_optimizer = optim.Adam(forward_model.parameters(), lr=learning_rate)
forward_optimizer.load_state_dict(forward_checkpoint["forward_optimizer_state_dict"])
backward_optimizer = optim.Adam(backward_model.parameters(), lr=learning_rate)
backward_optimizer.load_state_dict(backward_checkpoint["backward_optimizer_state_dict"])

# Ensure models are in train mode
forward_model.train()
backward_model.train()

# If you have cuda, configure cuda to call
for state in forward_optimizer.state.values():
  for k, v in state.items():
    if isinstance(v, torch.Tensor):
      state[k] = v.cuda()

for state in backward_optimizer.state.values():
  for k, v in state.items():
    if isinstance(v, torch.Tensor):
      state[k] = v.cuda()

# Run training iterations
print("Starting Training!")
training_rl_loop(train_dataloader, epochs, forward_model, forward_optimizer, backward_model, backward_optimizer, criterion)

*Set the learning rate and the other hyperparameters

*batch_size is 2 because Colab kept running out of memory... sadly

*Load the KoGPT2 model and optimizer states that had only gone through fine-tuning

*Reason for splitting into a forward model and a backward model: the semantic_coherence function needs both a forward loss and a backward loss, so each is computed with its own model (see the sketch after this list)

*Then training proceeds
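semantic_coherence itself is not shown in the post; the sketch below is only my guess at the idea (the mutual-information-style reward from Li et al. 2016): the forward model scores log p(response | query) and the backward model scores log p(query | response), each averaged over the scored tokens. The function name, argument layout, and the assumption of unpadded 2-D LongTensors for query_ids and response_ids are hypothetical, not the author's implementation.

import torch

def semantic_coherence(query_ids, response_ids, forward_model, backward_model):
    # Hypothetical sketch: average log-likelihood of the response given the query
    # (forward model) plus that of the query given the response (backward model).
    def avg_log_prob(model, prefix_ids, target_ids):
        input_ids = torch.cat([prefix_ids, target_ids], dim=1)
        logits = model(input_ids).logits[:, :-1, :]        # position i predicts token i + 1
        labels = input_ids[:, 1:]
        log_probs = torch.log_softmax(logits, dim=-1)
        token_lp = log_probs.gather(2, labels.unsqueeze(-1)).squeeze(-1)
        start = prefix_ids.size(1) - 1                      # first position that predicts the target segment
        return token_lp[:, start:].mean(dim=1)

    return (avg_log_prob(forward_model, query_ids, response_ids)
            + avg_log_prob(backward_model, response_ids, query_ids))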

 

 

5) Chatbot Performance Evaluation

*After training for about 2 epochs, the chatbot's performance was evaluated

-The results are hardly any different from when only fine-tuning was done.