Paper Reviews (39)

[Paper Review] Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

https://arxiv.org/abs/2310.12921
Training RL in vision-based environments ..
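The paper's approach, in short: use a pretrained VLM such as CLIP as a zero-shot reward model, scoring each visual observation by its similarity to a natural-language description of the task. Below is a minimal sketch of that similarity reward, assuming OpenAI's clip package; the observation file frame.png and the task prompt are illustrative placeholders, not the paper's exact setup.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: a rendered environment frame and a natural-language task description.
frame = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
task = clip.tokenize(["a humanoid robot kneeling"]).to(device)

with torch.no_grad():
    img = model.encode_image(frame)
    txt = model.encode_text(task)

# Cosine similarity between observation and task embeddings, used as the scalar reward.
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
reward = (img @ txt.T).item()
print(reward)
```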

Paper Reviews/RL 2025.04.09

[Paper Review] Vision-Language Models as a Source of Rewards

https://arxiv.org/abs/2312.09187
Reinforcement learning (RL) has achieved strong results in domains with a well-defined reward function, but generali..

Paper Reviews/RL 2025.04.07

[Paper Review] Vision-Language Models Provide Promptable Representations for Reinforcement Learning

https://arxiv.org/pdf/2402.02651
This paper proposes PR2L (Promptable Representations for Reinforcement Learning), a framework that uses a Vision-Language Model (VLM) for representation learning in RL agents. PR2L uses the VLM's prompt-conditioned representations to extract semantic features from visual observations and feeds them into RL policy learning. In particular, prompting yields semantically rich representations that let the agent exploit background knowledge to learn behaviors quickly. PR2L - Promptable Representations for Re..
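Below is a minimal, self-contained sketch of the data flow just described (frozen VLM queried with a task-relevant prompt, its embedding fed to a small policy head). The vlm_promptable_features stub, the example prompt, and the dimensions are illustrative assumptions standing in for the actual VLM, not the paper's implementation.

```python
import torch
import torch.nn as nn

FEAT_DIM = 512  # assumed embedding size; the real VLM's hidden size would be used

def vlm_promptable_features(obs: torch.Tensor, prompt: str) -> torch.Tensor:
    """Stand-in for a frozen VLM queried with a task-relevant prompt.

    In the actual framework these would be the VLM's embeddings of (image, prompt);
    here a random tensor keeps the sketch runnable and self-contained.
    """
    return torch.randn(obs.shape[0], FEAT_DIM)

class PromptedPolicy(nn.Module):
    """Small policy head trained by RL on top of prompt-conditioned VLM features."""

    def __init__(self, n_actions: int = 6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(FEAT_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor, prompt: str) -> torch.Tensor:
        with torch.no_grad():  # the VLM stays frozen; only the head is updated
            feats = vlm_promptable_features(obs, prompt)
        return self.head(feats)  # action logits fed to the RL objective

policy = PromptedPolicy()
# Hypothetical prompt; the framework uses task-relevant questions about the observation.
logits = policy(torch.rand(2, 3, 224, 224), "Is there a tree in front of the agent?")
print(logits.shape)  # torch.Size([2, 6])
```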

Paper Reviews/RL 2025.04.01

[Paper Review] Diffusion Paper Equation Summary

https://www.youtube.com/watch?v=ybvJbvllgJk
https://youtu.be/uFoGaIVHfoE?si=eRGixNZxxetAPi_1
A summary of the diffusion loss. Forward Diffusion Process: gradually add Gaussian noise to the original data (image), i.e. the noising process $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$ As the steps progress, the data turns more and more into noise, ending as pure Gaussian noise. Reverse Denoising Process: a trained neural network is used to..
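A standard consequence of the forward process above (textbook DDPM algebra, not part of the truncated excerpt): with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, a noisy sample at any step $t$ can be drawn directly from the clean data,

$$q(x_t | x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right), \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

and the reverse step is parameterized as $p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$, trained in practice by predicting the added noise with the simplified loss $\mathbb{E}_{t, x_0, \epsilon} \|\epsilon - \epsilon_\theta(x_t, t)\|^2$.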

Paper Reviews/ML&DL 2025.03.30

[Paper Review] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

https://arxiv.org/abs/2501.12948
Abstract: DeepSeek-..

Paper Reviews/RL 2025.03.16

[Paper Review] R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning

https://arxiv.org/abs/2503.05379
Abstract: This study ..

Paper Reviews/RL 2025.03.14

[Paper Review] Enhancing multi-modal Relation Extraction with Reinforcement Learning Guided Graph Diffusion Framework

https://aclanthology.org/2025.coling-main.65/
Rui Yang, Rajiv Gupta. Proceedings of the 31st International Conference on Computational Linguistics, 2025.
Introduction & Abstract: This paper proposes an RL-guided graph diffusion framework to address the problem of analyzing and aligning multimodal information. Using pre-trained models, the multimodal data is sc..

Paper Reviews/RL 2025.03.14

[Paper Review] Detecting Deepfakes Without Seeing Any

https://arxiv.org/pdf/2311.01458
https://github.com/talreiss/FACTOR?tab=readme-ov-file
Abstract: Existing deepfake detection can only catch fakes that resemble deepfakes seen before. A deepfake attack involves a claim of false facts (identity, speech content, motion, appearance), and current generative technology ..

[Paper Review] Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

https://arxiv.org/html/2410.03742v2
Abstract: In the conventional approach, preferen..

Paper Reviews/RL 2025.02.14

[Paper Review] Self-Rewarding Language Models

https://arxiv.org/abs/2401.10020
Abstract: Existing LLMs are aligned with reward models built from human preferences. This creates a bottleneck problem, and what the LLM learns ..

Paper Reviews/RL 2025.02.06