Jose Javier Gonzalez (@jjgort)'s Twitter Profile
Jose Javier Gonzalez

@jjgort

Research Scientist at Mosaic AI, Databricks. Working on LLMs

ID: 815647100100808705

Website: http://josejg.com · Joined: 01-01-2017 19:53:17

28 Tweets

369 Followers

91 Following

Katie Lewis (@katielewismit):

ML+art project with Divya Shanmugam, Jose Javier Gonzalez, and artist Agnieszka Kurant! Our GAN-based approach generates signatures containing features learned from a collection of MIT and Cambridge residents’ signatures. #creativeAI #MachineLearning MIT Clinical and Applied Machine Learning listart.mit.edu/agnieszka-kura…

Vitaliy Chiley (@vitaliychiley):

Introducing DBRX: A New Standard for Open LLM 🔔

databricks.com/blog/introduci…

💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens
🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks.

Is this thread mostly written by DBRX? Yes!
🧵

Cameron R. Wolfe, Ph.D. (@cwolferesearch):

🧱DBRX🧱 is so good that it forced 3-4 companies to release "competing" LLMs in the last two days (and we've barely heard about them). Some of my thoughts are summarized below...

Prior research from Mosaic. DBRX is the next model in the series of Open LLMs released by Mosaic.

Andrej Karpathy (@karpathy):

(Replying to Bilal, AI at Meta, and LMSYS Org) no. people misunderstand chinchilla. chinchilla doesn't tell you the point of convergence. it tells you the point of compute optimality. if all you care about is perplexity, for every FLOPs compute budget, how big a model on how many tokens should you train? for reasons not fully
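
For intuition, here is a minimal sketch of what compute optimality means in practice, assuming the common C ≈ 6·N·D FLOPs approximation and the roughly 20-tokens-per-parameter ratio from the Chinchilla paper; the constants are rules of thumb, not exact fits, and the code is my own illustration rather than anything from the tweet.

```python
# Rough sketch of Chinchilla-style compute-optimal sizing (illustration only).
# Assumptions: training FLOPs C ~= 6 * N * D, and a compute-optimal ratio of
# roughly D/N ~= 20 tokens per parameter (Hoffmann et al., 2022).
# Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).
import math

def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOPs budget compute-optimally."""
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

for budget in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e}: ~{n/1e9:.1f}B params on ~{d/1e12:.2f}T tokens")
```

Nothing here says anything about training to convergence or about serving cost; it is only the best loss per training FLOP, which is exactly the distinction being drawn.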

Cody Blakeney (@code_star):

An interesting bit of nuance missing from throughput charts like this is that tokens != generated text. Because the DBRX / Llama 3 / GPT-4 tokenizer has a larger vocabulary (100k+), these models actually generate text much faster (20-30%) than token counts alone will measure, compared to, say, Mixtral
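
To make the tokens-versus-text distinction concrete, here is a small sketch of my own (not from the tweet) that compares characters per token for tiktoken's ~100k-vocabulary cl100k_base encoding against the ~50k-vocabulary gpt2 encoding, which stands in here for a smaller-vocabulary tokenizer like Mixtral's, and converts a token-level decode speed into an effective text throughput.

```python
# Sketch: tokens/sec is not text/sec. A larger-vocabulary tokenizer packs more
# characters into each token, so the same tokens/sec produces more generated text.
# cl100k_base (~100k vocab) vs. gpt2 (~50k vocab) is used purely for illustration.
import tiktoken

sample = (
    "Mixture-of-experts language models route each token to a small subset of "
    "experts, which keeps inference FLOPs low relative to total parameter count."
)

tokens_per_sec = 100.0  # hypothetical decode speed, identical for both tokenizers
for name in ("cl100k_base", "gpt2"):
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(sample))
    chars_per_token = len(sample) / n_tokens
    print(f"{name}: {n_tokens} tokens, {chars_per_token:.2f} chars/token, "
          f"~{tokens_per_sec * chars_per_token:.0f} chars/sec at {tokens_per_sec:.0f} tok/s")
```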

Dan Biderman (@dan_biderman):

People think LoRA is a magic bullet for LLMs. Is it? Does it deliver the same quality as full finetuning but on consumer GPUs?

Though LoRA has the advantage of a lower memory footprint, we find that it often substantially underperforms full finetuning. However, it forgets less
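
For background on the memory-footprint point, here is a minimal sketch of the LoRA idea (a generic illustration, not the paper's code): the pretrained weight stays frozen and only a low-rank update B·A is trained, so gradients and optimizer state exist only for the small A and B matrices.

```python
# Minimal LoRA-style linear layer (illustrative sketch, not the paper's implementation).
# The base weight W is frozen; only the low-rank factors A and B receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} params")  # ~16k of ~1.06M
```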

Databricks Mosaic Research (@dbrxmosaicai):

Popular #LLM scaling laws only factor in training costs, and ignore the costs of deployment. In a paper presented at ICML Conference 2024, Databricks Mosaic AI researchers Nikhil Sardana, Jacob Portes, and Sasha Doubov propose a modified scaling law that considers the cost of both
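
The intuition can be sketched numerically; the following is my own illustration rather than the paper's method, using the parametric loss L(N, D) = E + A/N^α + B/D^β with constants approximately as fit in the Chinchilla paper, training cost ≈ 6·N·D FLOPs, and inference cost ≈ 2·N FLOPs per served token. As expected inference volume grows, the cheapest model that reaches a fixed loss target shifts toward fewer parameters trained on more tokens.

```python
# Sketch of an inference-aware scaling tradeoff (illustration only, not the paper's code).
# Loss model: L(N, D) = E + A / N**alpha + B / D**beta, with constants roughly matching
# the Chinchilla fit. Cost model: ~6*N*D training FLOPs plus ~2*N FLOPs per inference token.
import math

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.336, 0.283  # approximate published fit

def tokens_for_loss(n_params: float, target_loss: float) -> float:
    """Training tokens needed for an N-parameter model to reach target_loss (inf if unreachable)."""
    rem = target_loss - E - A / n_params**alpha
    return math.inf if rem <= 0 else (B / rem) ** (1.0 / beta)

def total_flops(n_params: float, target_loss: float, inference_tokens: float) -> float:
    return 6.0 * n_params * tokens_for_loss(n_params, target_loss) + 2.0 * n_params * inference_tokens

target = 2.0  # arbitrary quality target for illustration
candidate_sizes = [n * 1e9 for n in (3, 7, 13, 34, 70, 180, 400)]
for inf_tokens in (0.0, 1e13, 1e15):
    best = min(candidate_sizes, key=lambda n: total_flops(n, target, inf_tokens))
    print(f"expected inference tokens {inf_tokens:.0e}: cheapest size ~{best/1e9:.0f}B params")
```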

Sasha Doubov (@sashadoubov):

Some notes from the paper!
- 405B trained on 15.6T tokens, 3.8e25 FLOPs
- uses SFT, rejection sampling, and DPO
- annealing is used to judge quality of domain-specific data (s/o DBRX paper)
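
As a quick sanity check of those numbers (my own arithmetic, using the standard ≈6·N·D approximation for dense-transformer training FLOPs):

```python
# 6 * N * D with N = 405B parameters and D = 15.6T tokens
n_params, n_tokens = 405e9, 15.6e12
print(f"{6 * n_params * n_tokens:.2e}")  # 3.79e+25, consistent with the quoted 3.8e25 FLOPs
```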

Mansheej Paul (@mansiege):

Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: icml.cc/virtual/2024/w…

Dan Biderman (@dan_biderman):

*LoRA Learns Less and Forgets Less* is now out in its definitive edition in TMLR 🚀 Check out the latest numbers fresh from the Databricks Mosaic Research oven 👨‍🍳

Davis Blalock (@davisblalock):

Deep learning training is a mathematical dumpster fire.

But it turns out that if you *fix* the math, everything kinda just works…fp8 training, hyperparameter transfer, training stability, and more. [1/n]

Jonathan Frankle (@jefrankle):

RLVR and test-time compute are a powerful combo for enterprises, so much so that Databricks now leads overall BIRD single-model leaderboard. This isn't about BIRD, though. It's an example of what our customers are accomplishing in their domains with our RL recipe in Agent Bricks

Thinking Machines (@thinkymachines):

Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices.

Hallee Wong (@halleewong):

Presenting MultiverSeg — a scalable in-context system for interactively segmenting new datasets — at #ICCV2025 today!
📍poster 110 (10:45 AM–12:45 PM)