Tanya Goyal (@tanyaagoyal) 's Twitter Profile
Tanya Goyal

@tanyaagoyal

NLP-ing @Cornell_CS (since Fall 2024). she/her

ID: 1171859782396981253

Joined: 11-09-2019 18:55:27

169 Tweets

1.1K Followers

370 Following

Alex Wettig (@_awettig) 's Twitter Profile Photo

How to train long-context LMs? (and beat Llama-3.1 🏆)

Many takeaways from our new paper!
- Focus on diverse & reliable evaluations (not just perplexity)
- Find good sources of long data and high-quality short data
- ...

A 🧵 on how we produced ProLong, a SoTA 8B 512K model
Lucy Zhao (@lucy_xyzhao) 's Twitter Profile Photo

1/ When does synthetic data help with long-context extension and why?

🤖 while more realistic data usually helps, symbolic data can be surprisingly effective

🔍effective synthetic data induces similar retrieval heads–but often only subsets of those learned on real data!
Jessy Li (@jessyjli) 's Twitter Profile Photo

Our department UT Linguistics Dept is hiring 2 new faculty in computational linguistics!
NLP at UT is an absolutely lovely family so join us 🥰

apply.interfolio.com/158280
Marzena Karpinska (@mar_kar_) 's Twitter Profile Photo

Will be presenting #nocha at #EMNLP2024 (Tue 16:00-17:30, Riverfront Hall). Also happy to share that we have updated the dataset and our analysis! 🧚‍♀️🔮
Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

I’m at #EMNLP2024 this week presenting our work on reformulating unanswerable questions (Nov 12, 16-17:30). These days, I think about how to use formal tools and harder evals to get LMs closer to intelligence. I’m also on the faculty job market for 2024-2025! Please come say hi!
John Thickstun (@jwthickstun) 's Twitter Profile Photo

I am recruiting PhD students for Fall '25 at Cornell! I plan to admit multiple students interested in building more controllable generative models, music technologies, or both! 🎶 Please apply to Cornell Computer Science.

Niloofar (on faculty job market!) (@niloofar_mire) 's Twitter Profile Photo

I'm on the faculty market and at #NeurIPS!👩‍🏫
homes.cs.washington.edu/~niloofar/

I work on privacy, memorization, and emerging challenges in data use for AI.

Privacy isn't about PII removal but about controlling the flow of information contextually, & LLMs are still really bad at this!
Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

Eval platforms like Chatbot Arena attract users to provide preference votes. But what are the incentives of these users? Are they apathetic, or are they adversarial and just aiming to inflate their model rankings? We show 10% adversarial votes change the model rankings by a lot!
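To see why a small adversarial bloc matters, here is a toy illustration (not the paper's experimental setup): two models ranked by raw pairwise win rate, where honest voters slightly prefer model "A" and 10% extra adversarial traffic always votes for model "B". Real leaderboards like Chatbot Arena use Bradley-Terry/Elo-style scoring rather than raw win rates, and the 52%/10% numbers below are assumptions chosen for illustration.

```python
def rank_by_winrate(votes):
    """Rank models by their share of wins over pairwise (winner, loser) votes."""
    wins, totals = {}, {}
    for winner, loser in votes:
        wins[winner] = wins.get(winner, 0) + 1
        totals[winner] = totals.get(winner, 0) + 1
        totals[loser] = totals.get(loser, 0) + 1
    return sorted(totals, key=lambda m: wins.get(m, 0) / totals[m], reverse=True)

# Honest voters: model "A" genuinely beats "B" in 52% of 1,000 head-to-head votes.
honest_votes = [("A", "B")] * 520 + [("B", "A")] * 480
# An adversarial group adds 10% extra traffic that always votes for "B".
adversarial_votes = [("B", "A")] * 100

print(rank_by_winrate(honest_votes))                      # ['A', 'B']
print(rank_by_winrate(honest_votes + adversarial_votes))  # ['B', 'A']
```

With only 100 adversarial votes on top of 1,000 honest ones, "B" overtakes "A" (580 vs. 520 wins), so the reported ranking no longer reflects the true preference.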
Tanya Goyal (@tanyaagoyal) 's Twitter Profile Photo

Getting high-quality human annotations is always tricky, even for targeted domains/tasks. Check out Wenting Zhao's work where we analyze how this manifests in open community data collection efforts with minimal quality checks by design.

Sasha Rush (@srush_nlp) 's Twitter Profile Photo

This year, I have an exceptional student on the academic market.

Wenting Zhao (@wzhao_nlp) builds systems that reason in natural settings. She combines AI & NLP to study newly emerging problems.

She recently released WildChat (wildchat.allen.ai) and Commit-0
Marzena Karpinska (@mar_kar_) 's Twitter Profile Photo

We've added #o1 and #Llama 3.3 70B to the #Nocha leaderboard for long-context narrative reasoning! Surprisingly, o1 performs worse than o1-preview, and Llama 3.3 70B matches proprietary models like gpt4o-mini & gemini-Flash. Check out our website for more results! More in 🧵
Alex Wettig (@_awettig) 's Twitter Profile Photo

🤔 Ever wondered how prevalent some type of web content is during LM pre-training?

In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐

Key takeaway: domains help us curate better pre-training data! 🧵/N
Fangyuan Xu (@brunchavecmoi) 's Twitter Profile Photo

Can we generate long text from compressed KV cache? We find existing KV cache compression methods (e.g., SnapKV) degrade rapidly in this setting. We present 𝐑𝐞𝐟𝐫𝐞𝐬𝐡𝐊𝐕, an inference method that ♻️ refreshes the smaller KV cache and better preserves performance.
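The tweet only names the method, so the following is a minimal, hypothetical sketch of what "periodically refreshing a small working cache from the retained full KV cache" can look like in a decoding loop. The dot-product selection rule, the `generate_with_refresh` interface, and the dummy decode step are all stand-ins for illustration, not RefreshKV's actual algorithm.

```python
import numpy as np

def select_top_k(keys, values, query, k):
    # Keep the k cache entries with the highest dot-product relevance to the query
    # (one possible scoring heuristic, used here purely for illustration).
    idx = np.argsort(keys @ query)[-k:]
    return keys[idx], values[idx]

def generate_with_refresh(decode_step, query, full_k, full_v,
                          n_steps=128, k=64, refresh_every=16):
    """Decode against a small working cache that is periodically rebuilt from the
    retained full cache, instead of compressing once and never looking back."""
    small_k, small_v = select_top_k(full_k, full_v, query, k)
    tokens = []
    for step in range(1, n_steps + 1):
        token, query, new_k, new_v = decode_step(query, small_k, small_v)
        # Newly generated entries are appended to both the full and working caches.
        full_k, full_v = np.vstack([full_k, new_k]), np.vstack([full_v, new_v])
        small_k, small_v = np.vstack([small_k, new_k]), np.vstack([small_v, new_v])
        tokens.append(token)
        if step % refresh_every == 0:
            # Refresh: re-select entries from the *full* cache for the current query,
            # so context dropped by earlier compression can be recovered later on.
            small_k, small_v = select_top_k(full_k, full_v, query, k)
    return tokens

# Dummy single-layer "decode step" so the sketch runs end to end.
rng = np.random.default_rng(0)
def dummy_decode_step(query, keys, values):
    attn = np.exp(keys @ query); attn /= attn.sum()
    out = attn @ values
    return int(out[0] > 0), out, out[None, :], rng.normal(size=(1, len(query)))

ctx_k, ctx_v = rng.normal(size=(1024, 8)), rng.normal(size=(1024, 8))
print(generate_with_refresh(dummy_decode_step, rng.normal(size=8), ctx_k, ctx_v)[:10])
```

The contrast with one-shot compression (e.g., SnapKV-style selection done once at prefill) is the `refresh_every` branch: entries dropped early can re-enter the working cache when the generation moves on to need them.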
Wenting Zhao (@wzhao_nlp) 's Twitter Profile Photo

Time to revisit our paper: Open community-driven evaluation platforms could be corrupted by a few sources of bad annotations, making their results not as trustworthy as we'd like.

arxiv.org/pdf/2412.04363
Kabir (@kabirahuja004) 's Twitter Profile Photo

📢 New Paper!

Tired 😴 of reasoning benchmarks full of math & code? In our work we consider the problem of reasoning for plot holes in stories -- inconsistencies in a storyline that break the internal logic or rules of a story’s world 🌎

W/ Melanie Sclar and tsvetshop

1/n
Philippe Laban (@philippelaban) 's Twitter Profile Photo

🆕paper: LLMs Get Lost in Multi-Turn Conversation

In real life, people don’t speak in perfect prompts.
So we simulate multi-turn conversations — less lab-like, more like real use.

We find that LLMs get lost in conversation.
👀What does that mean? 🧵1/N
📄arxiv.org/abs/2505.06120
Oliver Li (@oliver54244160) 's Twitter Profile Photo

🤯 GPT-4o knows H&M left Russia in 2022 but still recommends shopping at H&M in Moscow.

🤔 LLMs store conflicting facts from different times, leading to inconsistent responses. We dig into how to better update LLMs with fresh facts that contradict their prior knowledge.

🧵 1/6
Tanya Goyal (@tanyaagoyal) 's Twitter Profile Photo

Check out Oliver's paper on learning new knowledge and resolving knowledge conflicts in LLMs! Surprising finding: conditioning on self-generated contexts during training gives massive performance gains! We are excited to extend these ideas to other domains!

Anmol Mekala (@anmol_mekala) 's Twitter Profile Photo

📢 New Paper 📢
Struggling to fit in very long contexts on your LLM? Considering 4-bit quantization to 2x your context window?

Prior work says 4-bit is “good enough,” but on long-context tasks it can drop 16%, with up to 59% drops on specific models❗❗
Details in 🧵👇
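For context on where the "2x your context window" framing comes from, here is back-of-the-envelope memory arithmetic. The architecture numbers (8B parameters, 32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16 KV cache) are illustrative assumptions for a Llama-3-8B-like model, not figures from the paper.

```python
# Back-of-the-envelope arithmetic behind "quantize weights to 4-bit to fit a longer
# context". All architecture numbers are illustrative assumptions, not paper figures.

GB = 1024**3
params     = 8e9
n_layers   = 32
n_kv_heads = 8      # grouped-query attention
head_dim   = 128
kv_bytes   = 2      # KV cache kept in fp16 in both scenarios

# KV cache cost per token: K and V, across all layers and KV heads.
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # bytes

weights_fp16 = params * 2.0   # 16-bit weights
weights_int4 = params * 0.5   # 4-bit weights

freed = weights_fp16 - weights_int4
extra_tokens = freed / kv_per_token

print(f"KV cache per token : {kv_per_token / 1024:.0f} KiB")
print(f"Weights fp16 / int4: {weights_fp16 / GB:.1f} GB / {weights_int4 / GB:.1f} GB")
print(f"Memory freed       : {freed / GB:.1f} GB -> ~{extra_tokens:,.0f} extra context tokens")
```

The memory saving from 4-bit weights is real; the tweet's point is that on long-context tasks the accompanying accuracy drop can be far larger than short-context benchmarks suggest.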