Research Article

Reward estimation with scheduled knowledge distillation for dialogue policy learning

Article: 2174078 | Received 09 Oct 2022, Accepted 24 Jan 2023, Published online: 07 Feb 2023
 

Abstract

Formulating dialogue policy as a reinforcement learning (RL) task enables a dialogue system to act optimally by interacting with humans. However, typical RL-based methods suffer from challenges such as sparse and delayed rewards. Moreover, with the user goal unavailable in real scenarios, the reward estimator cannot generate rewards that reflect action validity and task completion. These issues can significantly slow down and degrade policy learning. In this paper, we present a novel scheduled knowledge distillation framework for dialogue policy learning, which trains a compact student reward estimator by distilling prior knowledge of user goals from a large teacher model. To further improve the stability of dialogue policy learning, we propose to leverage self-paced learning to arrange a meaningful training order for the student reward estimator. Comprehensive experiments on the Microsoft Dialogue Challenge and MultiWOZ datasets indicate that our approach significantly accelerates learning, and the task-completion success rate is improved by 0.47%∼9.01% compared with several strong baselines.
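To make the framework described above more concrete, the following is a minimal sketch, assuming a PyTorch setup; the model class, feature dimensions, threshold schedule, and all names below are illustrative placeholders rather than the authors' implementation. It shows a compact student reward estimator regressed onto teacher-provided reward targets, with a hard self-paced weighting that admits harder examples as training proceeds.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical compact student reward estimator over state-action features.
class StudentRewardEstimator(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def self_paced_weights(losses: torch.Tensor, threshold: float) -> torch.Tensor:
    # Hard self-paced weighting: keep only examples whose current loss is below
    # the threshold; growing the threshold admits harder examples later.
    return (losses < threshold).float()

def distillation_step(student, optimizer, feats, teacher_rewards, threshold):
    # One scheduled-distillation update: regress the student's reward prediction
    # onto the teacher's goal-aware reward, weighted by the self-paced schedule.
    pred = student(feats)
    per_example = F.mse_loss(pred, teacher_rewards, reduction="none")
    w = self_paced_weights(per_example.detach(), threshold)
    loss = (w * per_example).sum() / w.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for dialogue state-action features
# and teacher-estimated rewards.
student = StudentRewardEstimator()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
feats = torch.randn(32, 128)
teacher_rewards = torch.randn(32)
for threshold in [0.5, 1.0, 2.0, 4.0]:  # growing self-paced threshold
    distillation_step(student, opt, feats, teacher_rewards, threshold)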

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability

The Microsoft Dialogue Challenge and MultiWOZ dialogue datasets analysed during the current study are available in E2EDialog and ConvLab-2 repositories, respectively.

Notes

1 r = 2T if the dialogue succeeds and r = −T if it fails, where T is the maximum number of dialogue turns (T = 40 in this paper); r = −1 for every turn while the dialogue is not yet finished.
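Read literally, the rule in this note amounts to the following handcrafted reward; this is a small illustrative sketch, and the function name and signature are not from the paper.

MAX_TURNS = 40  # T, the maximum number of dialogue turns in this paper

def handcrafted_reward(done: bool, success: bool, T: int = MAX_TURNS) -> int:
    # Note 1: +2T on success, -T on failure, -1 per turn while still ongoing.
    if not done:
        return -1
    return 2 * T if success else -T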

2 Ra is the same as that in Equation (7).

3 The simulated experience buffer is only used in DDQ-based agents.

4 Readers can refer to the E2Edialog (Li et al., Citation2018) repository for implementation details.

5 The hidden size of the GRU in the teacher model is set to 768 to make it compatible with the output dimension of the BERT model.
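As a concrete illustration of that sizing choice (a sketch in PyTorch; the variable names are not from the paper), a GRU whose hidden size equals BERT's 768-dimensional output can consume BERT token representations directly, without an extra projection layer.

import torch
import torch.nn as nn

BERT_DIM = 768  # output dimension of the (base) BERT encoder

# Teacher-side recurrent encoder: hidden size matches BERT's output width.
gru = nn.GRU(input_size=BERT_DIM, hidden_size=BERT_DIM, batch_first=True)

bert_outputs = torch.randn(8, 20, BERT_DIM)  # stand-in for BERT sequence outputs
seq_repr, last_hidden = gru(bert_outputs)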