https://arxiv.org/abs/2401.10020

Self-Rewarding Language Models

Abstract

Existing LLMs are aligned using reward models trained on human preferences. This creates two problems: the reward signal is bottlenecked by human performance, and the separate, frozen reward model cannot continue to improve while the LLM trains. The paper therefore proposes Self-Rewarding Language Models, where the model itself supplies its own rewards via LLM-as-a-Judge prompting.
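As a rough illustration of that idea, here is a minimal sketch of one self-rewarding round, assuming a hypothetical `llm_generate(prompt)` helper that calls the model being trained; the 0-5 rubric and the `Score:` parsing below are simplified stand-ins for the paper's actual judge prompt, not its exact wording.

```python
import re

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for a sampling call to the model being trained."""
    raise NotImplementedError

# Simplified LLM-as-a-Judge prompt: the same model scores its own response.
JUDGE_PROMPT = """Review the user's question and the candidate response.
Award a score from 0 to 5 for how helpful and relevant the response is.
Conclude with the line: "Score: <total points>"

Question: {question}
Response: {response}"""

def self_reward(question: str, response: str) -> float:
    """Ask the model to judge its own output and parse the numeric score."""
    judgement = llm_generate(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else 0.0

def build_preference_pair(question: str, n_samples: int = 4) -> dict:
    """Sample several candidate responses, self-score them, and return the
    (chosen, rejected) pair that would serve as DPO training data."""
    candidates = [llm_generate(question) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda r: self_reward(question, r))
    return {"prompt": question, "chosen": ranked[-1], "rejected": ranked[0]}
```

Each round would then train on these pairs with DPO and repeat with the updated model, so the judge improves together with the generator instead of staying frozen.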