Avi Caciularu (@clu_avi) 's Twitter Profile
Avi Caciularu

@clu_avi

Research Scientist @GoogleAI | previously ML & NLP PhD student @biunlp, intern at @allen_ai, @Microsoft, @AIatMeta.

ID: 61239647

Link: https://aviclu.github.io · Joined: 29-07-2009 16:52:44

255 Tweets

520 Followers

452 Following

Royi Rassin (@royirassin) 's Twitter Profile Photo

How diverse are the outputs of text-to-image models and how can we measure that? In our new work, we propose a measure based on LLMs and Visual-QA (VQA), and show NONE of the 12 models we experiment with are diverse. 🧵 1/11

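The tweet above describes measuring diversity with an LLM plus Visual-QA. As a purely hypothetical sketch (not the paper's actual metric), one way such a measure could work: generate several images per prompt, ask a VQA model the same attribute question about each, and score diversity as the normalized entropy of the answers. The `vqa_answer` stub and the example answers below are invented for illustration.

```python
from collections import Counter
import math

def vqa_answer(image, question):
    """Placeholder for a real VQA model call; returns an answer string."""
    raise NotImplementedError

def answer_entropy(answers):
    """Normalized Shannon entropy of VQA answers (0.0 = identical, 1.0 = all distinct)."""
    counts = Counter(answers)
    total = len(answers)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(total) if total > 1 else 1.0
    return h / max_h if max_h > 0 else 0.0

# e.g. answers from 4 generated images for "a photo of a doctor":
print(answer_entropy(["man", "man", "man", "woman"]))  # ≈ 0.406
```

A score near 0 would flag a model that renders the same attribute for every sample of a prompt; see the thread for the paper's real measure and results.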
Sasha Goldshtein (@goldshtn) 's Twitter Profile Photo

I am hiring a Senior SWE to work on Gemini post-training, improving Gemini factuality. Factuality is a top blocker for LLM adoption and a critical priority for Gemini. Prior experience with LLM training and evaluation is a major advantage. Apply here: google.com/about/careers/…

lmarena.ai (formerly lmsys.org) (@lmarena_ai) 's Twitter Profile Photo


Massive News from Chatbot Arena🔥

Google DeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap — matching 4o-latest and surpassing o1-preview! It also claims #1 on Vision
AK (@_akhaliq) 's Twitter Profile Photo


Google just released gemini-exp-1121
 
- significant gains on coding performance 
- stronger reasoning capabilities 
- improved visual understanding  

Now available on Anychat
Jeff Dean (@jeffdean) 's Twitter Profile Photo


What a way to celebrate one year of incredible Gemini progress -- #1🥇across the board on overall ranking, as well as on hard prompts, coding, math, instruction following, and more, including with style control on.

Thanks to the hard work of everyone in the Gemini team and
Yonatan Bitton (@yonatanbitton) 's Twitter Profile Photo


🚨 Happening NOW at #NeurIPS2024 with nitzan guetta!
🎭 #VisualRiddles: A Commonsense and World Knowledge Challenge for Vision-Language Models.
📍 East Ballroom C, Creative AI Track
🔍 visual-riddles.github.io
Sasha Goldshtein (@goldshtn) 's Twitter Profile Photo

Today we published FACTS Grounding, a benchmark and leaderboard for evaluating the factuality of LLMs when grounding to the input context. The leaderboard is on Kaggle and we plan to maintain it and track progress. deepmind.google/discover/blog/… kaggle.com/facts-leaderbo…

Mor Geva (@megamor2) 's Twitter Profile Photo

How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New

Ori Yoran (@oriyoran) 's Twitter Profile Photo

New #ICLR2024 paper! The KoLMogorov Test: can CodeLMs compress data by code generation? The optimal compression for a sequence is the shortest program that generates it. Empirically, LMs struggle even on simple sequences, but can be trained to outperform current methods! 🧵1/7
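The premise above is Kolmogorov complexity: the best compression of a sequence is the shortest program that regenerates it. A minimal illustrative sketch (not the paper's benchmark or method), using a Python expression as the "program":

```python
# 1000 characters of raw data with an obvious regularity.
seq = "01" * 500

# A short "program" (a Python expression) that regenerates the sequence exactly.
program = '"01" * 500'

assert eval(program) == seq  # the program reproduces the data
ratio = len(program) / len(seq)
print(f"program: {len(program)} chars vs data: {len(seq)} chars (ratio {ratio:.3f})")
```

The paper asks whether code LMs can find such generating programs for real sequences; per the thread, they struggle even on simple ones.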

omer goldman (@omernlp) 's Twitter Profile Photo


Wanna check how well a model can share knowledge between languages? Of course you do! 🤩

But can you do it without access to the model’s weights? Now you can with ECLeKTic 🤯
Gabrielle Kaili-May Liu (@pybeebee) 's Twitter Profile Photo


🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥

How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"?
Check out our new preprint to find out!
Details in 🧵(1/n):
Eran Hirsch (@hirscheran) 's Twitter Profile Photo


🚨 Introducing LAQuer, accepted to #ACL2025 (main conf)!

LAQuer provides more granular attribution for LLM generations: users can just highlight any output fact (top), and get attribution for that input snippet (bottom). This reduces the amount of text the user has to read by 2
Arie Cattan (@ariecattan) 's Twitter Profile Photo


🚨 RAG is a popular approach but what happens when the retrieved sources provide conflicting information?🤔

We're excited to introduce our paper: 
“DRAGged into CONFLICTS: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs”🚀

A thread 🧵👇
Sundar Pichai (@sundarpichai) 's Twitter Profile Photo


Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦

Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the
Arman Cohan (@armancohan) 's Twitter Profile Photo

Excited for the release of SciArena with Ai2! LLMs are now an integral part of research workflows, and SciArena helps measure progress on scientific literature tasks. Also check out the preprint for many more results and analyses. Led by: Yilun Zhao, Kaiyan Zhang 📄 paper:

Nathan Lambert (@natolambert) 's Twitter Profile Photo

This new benchmark created by Valentina Pyatkin should be the new default replacing IFEval. Some of the best frontier models get <50% and it comes with separate training prompts so people don’t effectively train on test. Wild gap from o3 to Gemini 2.5 pro of like 30 points.

Gabrielle Kaili-May Liu (@pybeebee) 's Twitter Profile Photo

I will be presenting our work 𝗠𝗗𝗖𝘂𝗿𝗲 at #ACL2025NLP in Vienna this week! 🇦🇹
Come by if you’re interested in multi-doc reasoning and/or scalable creation of high-quality post-training data!
📍 Poster Session 4 @ Hall 4/5
🗓️ Wed, July 30 | 11-12:30
🔗 aclanthology.org/2025.acl-long.…