Elie Bursztein (@elie) Twitter Tweets • TwiCopy

Elie Bursztein

a year ago

Very excited to participate in the #LLM Agent MOOC Hackathon. Can't wait to see what participants will come up with! #hackathon #A

[Weekend Read] Human Creativity in the Age of LLMs - arxiv.org/abs/2410.03703 - Worryingly, this #research shows that #AI might boost short-term creativity at the expense of long-term creativity. Figuring out how to leverage #LLM without degrading long-term human capabilities is vital.

Elie Bursztein

@elie

a year ago

[Weekend Read] AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild - arxiv.org/pdf/2405.11697 This comprehensive study clearly highlights the magnitude of the problem with great examples and measurements. #research #AI #misinformation #disinformation

François Chollet

@fchollet

a year ago

Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task

Elie Bursztein

@elie

a year ago

[Tool Tuesday] LLM, the best CLI utility for interacting with large models - github.com/simonw/llm This comprehensive tool supports images, local/remote models, and shell workflows. You can, for example, type: cat mycode.py | llm -s "Explain this code" #LLM #tool #AI

Elie Bursztein

@elie

10 months ago

[Weekend Read] Humanity’s Last Exam - static.scale.com/uploads/654197… A new large-scale knowledge benchmark where the best models barely reach 9%, with DeepSeek-R1 outperforming everyone. #benchmark #research #AI #LLM #deepseek #openai #anthropic #gemini

Elie Bursztein

@elie

10 months ago

[Weekend Read] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - arxiv.org/pdf/2501.17161 - Using Reinforcement Learning (RL) helps with generalizing and SFT helps with stabilizing. #AI #LLM #research #RL

Elie Bursztein

@elie

10 months ago

[Weekend Read] The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence - microsoft.com/en-us/research… The more confident people are in #GenAI, the less they think critically. #Research #AI #education

Elie Bursztein

@elie

9 months ago

[Weekend Read] Do Not Trust Licenses You See—Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing - lgresearch.ai/data/upload/LG… Presents LG's new dataset legal risk framework and the #AI agent built to use it to automatically review dataset compliance at scale.

Elie Bursztein

@elie

9 months ago

[Weekend Read] Reasoning Language Models: A Blueprint - arxiv.org/abs/2501.11223 All you need to know on how thinking/reasoning models are trained and evaluated. #AI #RLM #LLM

Elie Bursztein

@elie

9 months ago

Gemma 3 is here! Our new open models are incredibly efficient - the largest 27B model runs on just one H100 GPU. You'd need at least 10x the compute to get similar performance from other models 👇 #AI #LLM

Elie Bursztein

@elie

9 months ago

[Weekend Read] AI Search Has A Citation Problem - cjr.org/tow_center/we-… TL;DR: Building an agent is one thing; getting to the level of reliability where an agent can be trusted is a totally different ball game. Evaluating reliability is critical to true progress. #AI #LLM

Elie Bursztein

@elie

9 months ago

Accelerating Large-Scale Test Migration with LLMs - medium.com/airbnb-enginee… Airbnb was able to leverage AI to reduce migration time by about 90% (6 weeks vs. 1.5 years) #AI #airbnb
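
As a quick sanity check on the "about 90%" figure, here is a minimal sketch of the arithmetic. It assumes 52 weeks per year and takes the two durations straight from the tweet; the function name is hypothetical and not from the Airbnb article.

```python
WEEKS_PER_YEAR = 52  # assumption: calendar year rounded to 52 weeks


def reduction_percent(actual_weeks: float, baseline_years: float) -> float:
    """Percentage of time saved when a job takes actual_weeks instead of baseline_years."""
    baseline_weeks = baseline_years * WEEKS_PER_YEAR
    return (1 - actual_weeks / baseline_weeks) * 100


# Durations quoted in the tweet: 6 weeks instead of 1.5 years.
airbnb_reduction = reduction_percent(6, 1.5)
print(f"Reduction: {airbnb_reduction:.1f}%")
```

Under these assumptions the reduction comes out slightly above 90%, which is consistent with the rounded figure in the tweet.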

Elie Bursztein

@elie

9 months ago

[Weekend Read] Measuring #AI Ability to Complete Long Tasks - arxiv.org/pdf/2503.14499 The length of tasks that models can complete roughly doubles every 7 months. That's encouraging; however, I am unsure whether it holds at the 99% success rate needed to trust agents. #LLM
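
To get a feel for what a 7-month doubling time implies, here is a minimal sketch. It assumes clean exponential growth, which is a simplification of the trend described in the tweet; the function name is hypothetical and not from the paper.

```python
DOUBLING_MONTHS = 7  # doubling period quoted in the tweet


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplier on achievable task length after `months`, assuming steady doubling."""
    return 2 ** (months / doubling_months)


# Multipliers after one, two, and four doubling periods.
for months in (7, 14, 28):
    print(f"after {months} months: x{growth_factor(months):.0f}")
```

The sketch says nothing about the success rate at which tasks are completed, which is the open question raised in the tweet.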

Elie Bursztein

@elie

8 months ago

[Weekend Read] Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification - arxiv.org/pdf/2502.01839 If you are interested in understanding how scaling computation at generation time helps improve model performance, this is the paper to read. #AI #LLM

Elie Bursztein

@elie

8 months ago

[Weekend Read] RealHarm: A Collection of Real-World Language Model Application Failures - arxiv.org/abs/2504.10277 By looking at real-world examples of AI failure, this paper highlights the disconnect between what safety filters block and what goes wrong in practice. #safety #ai

Elie Bursztein

@elie

7 months ago

[Weekend Read] Exploring LLM Reasoning Through Controlled Prompt Variations - arxiv.org/abs/2504.02111 Shows how critical it is to have only relevant data in the model context. Accurately filtering out data is very difficult, and simple vector search is NOT the answer. #AI #RAG

Elie Bursztein

@elie

7 months ago

Key insights from the Phare benchmark results include that popularity on benchmarks like LMArena doesn't guarantee factual reliability, and that the more confidently a user phrases a query, the less willing models are to refute controversial claims - giskard.ai/knowledge/good…

Elie Bursztein

@elie

6 months ago

[Weekend Read] Large Language Models, Small Labor Market Effects - nber.org/system/files/w… A recent study by the National Bureau of Economic Research shows that in Denmark, despite massive investment in AI, productivity gains are minimal at about 3%. #AI #LLM #Economy

Elie Bursztein

@elie

6 months ago

[Weekend Read] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity arxiv.org/abs/2506.06941 This paper highlights that while thinking models outperform regular LLMs on reasoning tasks, they still experience
