JuYoung Suk (@scott_sjy)'s Twitter Profile
JuYoung Suk

@scott_sjy

MS student at KAIST AI

ID: 1636010791135744004

Joined: 15-03-2023 14:26:17

49 Tweets

332 Followers

1.1K Following

hyunji amy lee (@hyunji_amy_lee)

New preprint "Semiparametric Token-Sequence Co-Supervision"   

We introduce semiparametric token-sequence co-supervision, which trains an LM by simultaneously leveraging supervision from a parametric token embedding space and a nonparametric sequence embedding space.

arxiv.org/abs/2403.09024
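
As a rough illustration of the idea (not the paper's actual training code), one way to set up this kind of co-supervision in PyTorch is to add a contrastive loss over pooled sequence embeddings on top of the usual next-token cross-entropy. The function, tensor shapes, and the weighting `alpha` below are all assumptions made for the sketch.

```python
# Illustrative sketch only -- not the paper's implementation.
# (1) Parametric token supervision: standard next-token cross-entropy.
# (2) Nonparametric sequence supervision: InfoNCE over sequence embeddings.
import torch
import torch.nn.functional as F

def co_supervision_loss(token_logits, token_targets,
                        query_emb, positive_seq_emb, negative_seq_embs,
                        temperature=0.05, alpha=0.5):
    """token_logits: (B, T, V), token_targets: (B, T)
    query_emb: (B, D) pooled representation produced by the LM
    positive_seq_emb: (B, D) embedding of the gold evidence sequence
    negative_seq_embs: (B, K, D) embeddings of other sequences
    """
    # Token-level loss over the parametric vocabulary.
    ce = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                         token_targets.reshape(-1), ignore_index=-100)

    # Sequence-level loss over the nonparametric embedding space.
    q = F.normalize(query_emb, dim=-1)
    pos = F.normalize(positive_seq_emb, dim=-1)
    neg = F.normalize(negative_seq_embs, dim=-1)
    pos_sim = (q * pos).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, neg)         # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    nce = F.cross_entropy(logits, labels)

    return alpha * ce + (1 - alpha) * nce
```

In this sketch the sequence-level side never materializes a vocabulary-sized softmax; the sequence embeddings themselves act as the supervision targets, which is what makes that half of the objective nonparametric.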
Hyeonbin Hwang (@ronalhwang)

🚨 New LLM Reasoning Paper 🚨

Q. How can LLMs self-improve their reasoning ability?

⇒ Introducing Self-Explore⛰️🧭, a training method specifically designed to help LLMs avoid reasoning pits by learning from their own outputs! [1/N]
Aran Komatsuzaki (@arankomatsuzaki)

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Presents a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements

repo: github.com/prometheus-eva…
abs: arxiv.org/abs/2405.01535
AK (@_akhaliq)

Prometheus 2

An Open Source Language Model Specialized in Evaluating Other Language Models

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability
elvis (@omarsar0)

An Open Source LM Specialized in Evaluating Other LMs

Open-source Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments. 

They support both direct assessments and pair-wise ranking formats grouped with user-defined
Seungone Kim @ NAACL2025 (@seungonekim)

#NLProc
Introducing 🔥Prometheus 2, an open-source LM specialized in evaluating other language models.

✅ Supports both direct assessment & pairwise ranking.
✅ Improved evaluation capabilities compared to its predecessor.
✅ Can assess based on user-defined evaluation criteria.
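
For readers who want to try it, here is a minimal, unofficial sketch of direct assessment with Hugging Face transformers. The model ID (prometheus-eval/prometheus-7b-v2.0) and the simplified prompt wording are assumptions for illustration; the repo linked above has the exact prompt template and score rubrics.

```python
# Unofficial sketch: direct assessment with Prometheus 2 via transformers.
# Model ID and prompt wording are assumptions; consult the repo for the
# canonical template and the 1-5 score-rubric format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-7b-v2.0"  # assumed HF repo name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "###Task Description: Evaluate the response below against the rubric "
    "and give a score from 1 to 5 with feedback.\n"
    "###Instruction: Explain why the sky is blue.\n"
    "###Response: The sky is blue because of Rayleigh scattering...\n"
    "###Rubric: Is the explanation scientifically accurate and clear?\n"
    "###Feedback:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Pairwise ranking follows the same pattern with two candidate responses in the prompt and a request to pick the better one; the repo's relative-grading template covers that case.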
Alvaro Bartolome (@alvarobartt)

🔥 Prometheus 2 was recently released by KAIST AI as an alternative that closely mirrors both human and GPT-4 evaluation and surpasses the former Prometheus!

And we at Argilla have already implemented it in `distilabel`, but first, let’s see what Prometheus 2 has to offer:
Seungone Kim @ NAACL2025 (@seungonekim)

🤔How can we systematically assess an LM's proficiency in a specific capability without using summary measures like helpfulness or simple proxy tasks like multiple-choice QA?

Introducing the ✨BiGGen Bench, a benchmark that directly evaluates nine core capabilities of LMs.
Hoyeon Chang (@hoyeon_chang)

🚨 New paper 🚨
How Do Large Language Models Acquire Factual Knowledge During Pretraining?

I’m thrilled to announce the release of my new paper! 🎉

This research explores how LLMs acquire and retain factual knowledge during pretraining. Here are some key insights:
Doyoung Kim (@doyoungkim_ml)

🤔 Humans excel at generalizing planning to extrapolated data or rapidly adapting with limited training data. How is this possible for language models?
Introducing 🧠Cognitive Map for Language Models, a framework achieving Optimal Planning via Verbally Representing the World Model🌍
MiyoungKo (@miyoung_ko)

📢 Excited to share our latest paper on the reasoning capabilities of LLMs! Our research dives into how these models recall and utilize factual knowledge while solving complex questions. [🧵1 / 10]
arxiv.org/abs/2406.19502
Rohan Paul (@rohanpaul_ai)

Prometheus-2 is a great alternative to GPT-4 evaluation when doing fine-grained evaluation of an underlying LLM & a Reward model for Reinforcement Learning from Human Feedback (RLHF).✨

It's a state-of-the-art evaluator language model offering significant improvements over its
Seonghyeon Ye (@seonghyeonye)

🚀 First step to unlocking Generalist Robots!

Introducing 🤖LAPA🤖, a new SOTA open-sourced 7B VLA pretrained without using action labels.

💪SOTA VLA trained with Open X (outperforming OpenVLA on cross and multi embodiment)
😯LAPA enables learning from human videos, unlocking

JuYoung Suk (@scott_sjy)

MM-Eval is a new multilingual benchmark for LLM-as-Judge and reward models! We hope this work can contribute towards better multilingual alignment for LLMs :)

Nathan Lambert (@natolambert)

As more and more attention shifts back to on-policy RL for LLM post training, thx o1, (away from just using DPO-like methods for alignment) it's been clear we need a better reward model ecosystem. The good news is, we're starting to get a lot of evals. True progress only comes

Dongkeun Yoon (@dongkeun_yoon)

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
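
A minimal sketch of what verbalized confidence looks like in practice (this is illustrative, not the paper's protocol): ask the model to append a self-reported probability, parse it, and then check whether those numbers are calibrated against actual accuracy. The prompt wording, helper names, and the `generate_fn` stand-in below are made up for the example; expected calibration error is a standard metric used here as the follow-up check.

```python
# Illustrative sketch: eliciting and checking verbalized confidence.
# `generate_fn` is a stand-in for any LLM call that maps prompt -> text.
import re

CONFIDENCE_SUFFIX = (
    "\nAfter your answer, state how likely it is to be correct on its own line "
    "as 'Confidence: NN%'."
)

def answer_with_confidence(question: str, generate_fn):
    """Returns (answer_text, confidence in [0, 1] or None if not stated)."""
    output = generate_fn(question + CONFIDENCE_SUFFIX)
    match = re.search(r"Confidence:\s*(\d{1,3})\s*%", output)
    confidence = min(int(match.group(1)), 100) / 100 if match else None
    answer = re.sub(r"Confidence:\s*\d{1,3}\s*%", "", output).strip()
    return answer, confidence

def expected_calibration_error(confidences, correctness, n_bins=10):
    """A model is well calibrated if answers given with ~60% confidence
    are actually correct ~60% of the time."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correctness):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    total, ece = len(confidences), 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / total) * abs(avg_conf - acc)
    return ece
```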
Hoyeon Chang (@hoyeon_chang)

New preprint 📄 (with Jinho Park)

Can neural nets really reason compositionally, or just match patterns?  
We present the Coverage Principle: a data-centric framework that predicts when pattern-matching models will generalize (validated on Transformers). 🧵👇
Hyeonbin Hwang (@ronalhwang)

🚨 New Paper co-led with byeongguk jeon 🚨

Q. Can we adapt Language Models, trained to predict the next token, to reason at the sentence level?

I think LMs operating at a higher level of abstraction would be a promising path towards advancing their reasoning, and I am excited to share our
Zeyuan Allen-Zhu, Sc.D. (@zeyuanallenzhu)

Phase 1 of Physics of Language Models code release
✅our Part 3.1 + 4.1 = all you need to pretrain strong 8B base model in 42k GPU-hours
✅Canon layers = strong, scalable gains
✅Real open-source (data/train/weights)
✅Apache 2.0 license (commercial ok!)
🔗github.com/facebookresear…