Tom Hosking (@tomhosking) 's Twitter Profile
Tom Hosking

@tomhosking

Model merging lead for Command A @cohere. Prev: PhD student in NLP @EdinburghNLP @Edin_CDT_NLP, @BloomsburyAI @UCL @DRWTrading

ID: 30673001

Website: http://tomho.sk | Joined: 12-04-2009 16:16:43

1.1K Tweets

931 Followers

640 Following

Nick Frosst (@nickfrosst) 's Twitter Profile Photo

UPDATE: my numbers were off, external benchmarking actually shows we are faster and better. GPQA-diamond: 53%; milliseconds per token: 5.36. artificialanalysis.ai/providers/cohe…

Max Bartolo (@max_nlp) 's Twitter Profile Photo

I really enjoyed my Machine Learning Street Talk chat with Tim at #NeurIPS2024 about some of the research we've been doing on reasoning, robustness and human feedback. If you have an hour to spare and are interested in some semi-coherent thoughts revolving around AI robustness, it may be worth

Max Bartolo (@max_nlp) 's Twitter Profile Photo

I'm excited to share the tech report for our @Cohere Cohere For AI Command A and Command R7B models. We highlight our novel approach to model training including the use of self-refinement algorithms and model merging techniques at scale. Command A is an efficient, agent-optimised

Tom Hosking (@tomhosking) 's Twitter Profile Photo

I'm really proud to have led the model merging work that went into cohere Command A and R7B, all made possible by an amazing group of collaborators. Check out the report for loads of details on how we trained a GPT-4o level model that fits on 2xH100!

Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Following the open-weight release of Command A and Command R7B models, we're excited to have collaborated with @Cohere colleagues on a tech report highlighting our novel approach to model training, including self-refinement algorithms and model merging techniques at scale.

cohere (@cohere) 's Twitter Profile Photo

We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency.

aakanksha (@____aakanksha) 's Twitter Profile Photo

the complete cooking guide with all the ingredients, seasonings and garnishes for this soup of a model is here! 🍲🧂🌶️🔥

couldn’t be more proud to have continued my exploration of model merging and translated it to our A-class flagship model, Command A, with the best team! ✨
wh (@nrehiew_) 's Twitter Profile Photo

The most interesting part of their post-training is just how much they use model merging, both in SFT and RL. Their process is:
- Train an instruct model 
- Train 6 SFT models in 6 domains (Code, Safety, RAG, Math, Multilingual, and General Long-Context)
- Merge 
- Use this merge to train
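
A minimal sketch of what that expert-merging step could look like, assuming the six SFT experts were all fine-tuned from the same base model and are available as PyTorch state dicts; the file names and the `linear_merge` helper below are illustrative, not taken from the Cohere report:

```python
# Hedged sketch of a uniform linear merge over domain-expert checkpoints.
# Assumes every expert shares the base model's architecture, so all state
# dicts have identical keys and tensor shapes.
import torch


def linear_merge(state_dicts, weights=None):
    """Return a parameter-wise weighted average of expert state dicts."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


# Illustrative usage: one expert checkpoint per domain (paths are assumptions).
domains = ["code", "safety", "rag", "math", "multilingual", "long_context"]
experts = [torch.load(f"sft_expert_{d}.pt") for d in domains]
merged_sft = linear_merge(experts)            # the "Merge" step in the list above
torch.save(merged_sft, "merged_sft.pt")       # starting point for further training
```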
wh (@nrehiew_) 's Twitter Profile Photo

The next section on Merging is the most interesting imo. 

As a summary of what we discussed earlier, they used expert merging: merge the SFT experts, then merge the preference-tuned models.
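
Read that way, the two stages compose roughly as below. This is a sketch under the assumption that both stages use a simple parameter average; every checkpoint name is made up for illustration:

```python
import torch


def average(state_dicts):
    """Uniform parameter average over checkpoints with identical keys."""
    n = len(state_dicts)
    return {k: sum(sd[k].float() for sd in state_dicts) / n for k in state_dicts[0]}


domains = ["code", "safety", "rag", "math", "multilingual", "long_context"]

# Stage 1: merge the domain-expert SFT checkpoints into one SFT model.
merged_sft = average([torch.load(f"sft_{d}.pt") for d in domains])

# Stage 2: after preference-tuning per-domain copies of that merged SFT model
# (done elsewhere), merge the resulting preference-tuned checkpoints as well.
merged_pref = average([torch.load(f"pref_{d}.pt") for d in domains])
torch.save(merged_pref, "final_merged_model.pt")
```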
wh (@nrehiew_) 's Twitter Profile Photo

They find that linear merging is pretty interpretable, i.e. upweighting an expert leads to better performance in that domain. However, the corresponding drop in performance elsewhere is unpredictable.

Interestingly, they add cross-domain data for each expert as a form of regularization.
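
To make the interpretability claim concrete, here is a hedged, self-contained sketch of a non-uniform linear merge; the weights are invented for illustration and are not the ones used for Command A:

```python
import torch


def weighted_merge(state_dicts, weights):
    """Weighted linear combination of expert parameters (weights sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}


# Upweighting the code expert should lift code benchmarks; per the tweet, the
# size of the off-domain drop is harder to predict. The cross-domain data mixed
# into each expert's training set acts as a regulariser against that drop.
domains = ["code", "safety", "rag", "math", "multilingual", "long_context"]
weights = [0.30, 0.14, 0.14, 0.14, 0.14, 0.14]   # code expert upweighted (illustrative)
experts = [torch.load(f"expert_{d}.pt") for d in domains]  # assumed file names
merged_code_heavy = weighted_merge(experts, weights)
```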
wh (@nrehiew_) 's Twitter Profile Photo

Some thoughts: first, the paper is pretty well-written and easy to follow. They have so many benchmarks and results. I think it's similar in spirit to the Llama 3 paper, but they are more complementary imo, as this paper focuses heavily on post-training while Llama 3 didn't do a good

Tom Hosking (@tomhosking) 's Twitter Profile Photo

Now feels like a good time to plug cohere Command A:
- model evaled on lmarena.ai is the same as hosted on Hugging Face
- claimed performance is reproducible
- not trained on the test set
- uses the cohere hybrid attention architecture for long context
- fits on 2xH100, not 8x

Douwe Kiela (@douwekiela) 's Twitter Profile Photo

When we came up with RAG five years ago, we weren't creating a workaround for small context windows—we were designing a principled approach to augment models with external knowledge.

The core challenges RAG addresses remain unsolved with just larger context windows:
• Accessing
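
For readers new to the pattern, a toy retrieve-then-generate loop looks roughly like this; the word-overlap scorer and prompt template are stand-ins for illustration, not any particular library's API:

```python
# Toy retrieval-augmented generation: score documents against the query,
# keep the top-k, and prepend them to the prompt before calling a generator.
# Real systems would use dense embeddings or BM25 instead of word overlap.
def score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


corpus = [
    "RAG augments a model with external knowledge retrieved at inference time.",
    "Command A is optimised for agentic and multilingual enterprise tasks.",
    "Sparse attention changes the scaling trade-offs of transformer LLMs.",
]
print(build_prompt("How does RAG give a model external knowledge?", corpus))
# The assembled prompt would then be sent to whichever generator model you use.
```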
Matthias Gallé (@mgalle) 's Twitter Profile Photo

You like Code, you like LLMs, you are looking for a leadership position? We are searching for somebody who can support our amazing team and bring code agents for enterprises to new heights! jobs.ashbyhq.com/cohere/758e544…

Arduin Findeis @ ICLR2025 (@arduinfindeis) 's Twitter Profile Photo

How exactly was the initial Chatbot Arena version of Llama 4 Maverick different from the public HuggingFace version?🕵️

I used our Feedback Forensics app to quantitatively analyse how exactly these two models differ. An overview…👇🧵
Cohere Labs (@cohere_labs) 's Twitter Profile Photo

How does sparse attention reshape LLM scaling? 🔍 We’re excited to share this work by former @Cohere intern Piotr Nawrot, “The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.”