Research Article

Reward estimation with scheduled knowledge distillation for dialogue policy learning

Article: 2174078 | Received 09 Oct 2022, Accepted 24 Jan 2023, Published online: 07 Feb 2023
 

Abstract

Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from challenges such as sparse and delayed rewards. Moreover, with the user goal unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we propose to leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and the task-completion success rate is improved by 0.47%∼9.01% compared with several strong baselines.
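To make the framework described above more concrete, the following is a minimal sketch, assuming a PyTorch setup; the model class, feature dimensions, threshold schedule, and all names below are illustrative placeholders rather than the authors' implementation. It shows a compact student reward estimator regressed onto teacher-provided reward targets, with a hard self-paced weighting that admits harder examples as training proceeds.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical compact student reward estimator over state-action features.
class StudentRewardEstimator(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def self_paced_weights(losses: torch.Tensor, threshold: float) -> torch.Tensor:
    # Hard self-paced weighting: keep only examples whose current loss is below
    # the threshold; growing the threshold admits harder examples later.
    return (losses < threshold).float()

def distillation_step(student, optimizer, feats, teacher_rewards, threshold):
    # One scheduled-distillation update: regress the student's reward prediction
    # onto the teacher's goal-aware reward, weighted by the self-paced schedule.
    pred = student(feats)
    per_example = F.mse_loss(pred, teacher_rewards, reduction="none")
    w = self_paced_weights(per_example.detach(), threshold)
    loss = (w * per_example).sum() / w.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for dialogue state-action features
# and teacher-estimated rewards.
student = StudentRewardEstimator()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
feats = torch.randn(32, 128)
teacher_rewards = torch.randn(32)
for threshold in [0.5, 1.0, 2.0, 4.0]:  # growing self-paced threshold
    distillation_step(student, opt, feats, teacher_rewards, threshold)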

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability

The Microsoft Dialogue Challenge and MultiWOZ dialogue datasets analysed during the current study are available in E2EDialog and ConvLab-2 repositories, respectively.

Notes

1 r = 2T if the dialogue succeeds and r = −T if it fails, where T is the maximum number of dialogue turns (T = 40 in this paper); r = −1 for every turn while the dialogue is not yet finished.
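Read literally, the rule in this note amounts to the following handcrafted reward; this is a small illustrative sketch, and the function name and signature are not from the paper.

MAX_TURNS = 40  # T, the maximum number of dialogue turns in this paper

def handcrafted_reward(done: bool, success: bool, T: int = MAX_TURNS) -> int:
    # Note 1: +2T on success, -T on failure, -1 per turn while still ongoing.
    if not done:
        return -1
    return 2 * T if success else -T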

2 Ra is the same as that in Equation (7).

3 The simulated experience buffer is only used in DDQ-based agents.

4 Readers can refer to the E2Edialog (Li et al., Citation2018) repository for implementation details.

5 The hidden size of the GRU in the teacher model is set to 768 to make it compatible with the output dimension of the BERT model.
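As a concrete illustration of that sizing choice (a sketch in PyTorch; the variable names are not from the paper), a GRU whose hidden size equals BERT's 768-dimensional output can consume BERT token representations directly, without an extra projection layer.

import torch
import torch.nn as nn

BERT_DIM = 768  # output dimension of the (base) BERT encoder

# Teacher-side recurrent encoder: hidden size matches BERT's output width.
gru = nn.GRU(input_size=BERT_DIM, hidden_size=BERT_DIM, batch_first=True)

bert_outputs = torch.randn(8, 20, BERT_DIM)  # stand-in for BERT sequence outputs
seq_repr, last_hidden = gru(bert_outputs)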