Jose Javier Gonzalez (@jjgort)'s Twitter Profile
Jose Javier Gonzalez

@jjgort

Research Scientist at Mosaic AI, Databricks. Working on LLMs

ID: 815647100100808705

Website: http://josejg.com · Joined: 01-01-2017 19:53:17

28 Tweets

369 Followers

91 Following

Katie Lewis (@katielewismit):

ML+art project with Divya Shanmugam, Jose Javier Gonzalez, and artist Agnieszka Kurant! Our GAN-based approach generates signatures containing features learned from a collection of MIT and Cambridge residents’ signatures. #creativeAI #MachineLearning MIT Clinical and Applied Machine Learning listart.mit.edu/agnieszka-kura…

Vitaliy Chiley (@vitaliychiley):

Introducing DBRX: A New Standard for Open LLM 🔔

databricks.com/blog/introduci…

💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.

Is this thread mostly written by DBRX? Yes!
🧵

Cameron R. Wolfe, Ph.D. (@cwolferesearch):

🧱DBRX🧱 is so good that it forced 3-4 companies to release "competing" LLMs in the last two days (and we've barely heard about them). Some of my thoughts are summarized below...

Prior research from Mosaic. DBRX is the next model in the series of Open LLMs released by Mosaic.

Andrej Karpathy (@karpathy):

(Replying to Bilal, AI at Meta, and LMSYS Org) no. people misunderstand chinchilla. chinchilla doesn't tell you the point of convergence. it tells you the point of compute optimality. if all you care about is perplexity, for every FLOPs compute budget, how big a model on how many tokens should you train? for reasons not fully
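
For intuition, here is a minimal sketch of what compute optimality means in practice, assuming the common C ≈ 6·N·D FLOPs approximation and the roughly 20-tokens-per-parameter ratio from the Chinchilla paper; the constants are rules of thumb, not exact fits, and the code is my own illustration rather than anything from the tweet.

```python
# Rough sketch of Chinchilla-style compute-optimal sizing (illustration only).
# Assumptions: training FLOPs C ~= 6 * N * D, and a compute-optimal ratio of
# roughly D/N ~= 20 tokens per parameter (Hoffmann et al., 2022).
# Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).
import math

def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOPs budget compute-optimally."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

for budget in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e}: ~{n/1e9:.1f}B params on ~{d/1e12:.2f}T tokens")
```

Nothing here says anything about training to convergence or about serving cost; it is only the best loss per training FLOP, which is exactly the distinction being drawn.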

Cody Blakeney (@code_star):

An interesting bit of nuance missing from throughput charts like this is that tokens != generated text. Because the DBRX / Llama 3 / GPT-4 tokenizer has a larger vocabulary (100k+), these models actually generate text much faster (20-30%) than token counts alone will measure, compared to, say, Mixtral
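
To make the tokens-versus-text distinction concrete, here is a small sketch of my own (not from the tweet) that compares characters per token for tiktoken's ~100k-vocabulary cl100k_base encoding against the ~50k-vocabulary gpt2 encoding, which stands in here for a smaller-vocabulary tokenizer like Mixtral's, and converts a token-level decode speed into an effective text throughput.

```python
# Sketch: tokens/sec is not text/sec. A larger-vocabulary tokenizer packs more
# characters into each token, so the same tokens/sec produces more generated text.
# cl100k_base (~100k vocab) vs. gpt2 (~50k vocab) is used purely for illustration.
import tiktoken

sample = (
    "Mixture-of-experts language models route each token to a small subset of "
    "experts, which keeps inference FLOPs low relative to total parameter count."
)

tokens_per_sec = 100.0  # hypothetical decode speed, identical for both tokenizers
for name in ("cl100k_base", "gpt2"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(sample))
    chars_per_token = len(sample) / n_tokens
    print(f"{name}: {n_tokens} tokens, {chars_per_token:.2f} chars/token, "
          f"~{tokens_per_sec * chars_per_token:.0f} chars/sec at {tokens_per_sec:.0f} tok/s")
```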

Dan Biderman (@dan_biderman):

People think LoRA is a magic bullet for LLMs. Is it? Does it deliver the same quality as full finetuning but on consumer GPUs?

Though LoRA has the advantage of a lower memory footprint, we find that it often substantially underperforms full finetuning. However, it forgets less
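
For background on the memory-footprint point, here is a minimal sketch of the LoRA idea (a generic illustration, not the paper's code): the pretrained weight stays frozen and only a low-rank update B·A is trained, so gradients and optimizer state exist only for the small A and B matrices.

```python
# Minimal LoRA-style linear layer (illustrative sketch, not the paper's implementation).
# The base weight W is frozen; only the low-rank factors A and B receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} params")  # ~16k of ~1.06M
```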

Databricks Mosaic Research (@dbrxmosaicai):

Popular #LLM scaling laws only factor in training costs, and ignore the costs of deployment. In a paper presented at ICML Conference 2024, Databricks Mosaic AI researchers Nikhil Sardana, Jacob Portes, and Sasha Doubov propose a modified scaling law that considers the cost of both
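
The intuition can be sketched numerically; the following is my own illustration rather than the paper's method, using the parametric loss L(N, D) = E + A/N^α + B/D^β with constants approximately as fit in the Chinchilla paper, training cost ≈ 6·N·D FLOPs, and inference cost ≈ 2·N FLOPs per served token. As expected inference volume grows, the cheapest model that reaches a fixed loss target shifts toward fewer parameters trained on more tokens.

```python
# Sketch of an inference-aware scaling tradeoff (illustration only, not the paper's code).
# Loss model: L(N, D) = E + A / N**alpha + B / D**beta, with constants roughly matching
# the Chinchilla fit. Cost model: ~6*N*D training FLOPs plus ~2*N FLOPs per inference token.
import math

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.336, 0.283  # approximate published fit

def tokens_for_loss(n_params: float, target_loss: float) -> float:
    """Training tokens needed for an N-parameter model to reach target_loss (inf if unreachable)."""
    rem = target_loss - E - A / n_params**alpha
    return math.inf if rem <= 0 else (B / rem) ** (1.0 / beta)

def total_flops(n_params: float, target_loss: float, inference_tokens: float) -> float:
    return 6.0 * n_params * tokens_for_loss(n_params, target_loss) + 2.0 * n_params * inference_tokens

target = 2.0  # arbitrary quality target for illustration
candidate_sizes = [n * 1e9 for n in (3, 7, 13, 34, 70, 180, 400)]
for inf_tokens in (0.0, 1e13, 1e15):
    best = min(candidate_sizes, key=lambda n: total_flops(n, target, inf_tokens))
    print(f"expected inference tokens {inf_tokens:.0e}: cheapest size ~{best/1e9:.0f}B params")
```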

Sasha Doubov (@sashadoubov):

Some notes from the paper!
- 405B trained on 15.6T tokens, 3.8e25 FLOPs
- uses SFT, rejection sampling, and DPO
- annealing is used to judge quality of domain-specific data (s/o DBRX paper)
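
As a quick sanity check of those numbers (my own arithmetic, using the standard ≈6·N·D approximation for dense-transformer training FLOPs):

```python
# 6 * N * D with N = 405B parameters and D = 15.6T tokens
n_params, n_tokens = 405e9, 15.6e12
print(f"{6 * n_params * n_tokens:.2e}")  # 3.79e+25, consistent with the quoted 3.8e25 FLOPs
```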

Mansheej Paul (@mansiege):

Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: icml.cc/virtual/2024/w…

Dan Biderman (@dan_biderman):

*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR 🚀 Check out the latest numbers fresh from the Databricks Mosaic Research oven 👨‍🍳

Davis Blalock (@davisblalock):

Deep learning training is a mathematical dumpster fire.

But it turns out that if you *fix* the math, everything kinda just works…fp8 training, hyperparameter transfer, training stability, and more. [1/n]

Jonathan Frankle (@jefrankle):

RLVR and test-time compute are a powerful combo for enterprises, so much so that Databricks now leads overall BIRD single-model leaderboard. This isn't about BIRD, though. It's an example of what our customers are accomplishing in their domains with our RL recipe in Agent Bricks

Thinking Machines (@thinkymachines):

Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.

Hallee Wong (@halleewong):

Presenting MultiverSeg — a scalable in-context system for interactively segmenting new datasets — at #ICCV2025 today!
📍poster 110 (10:45 AM–12:45 PM)