Zhe Gan (@zhegan4) 's Twitter Profile
Zhe Gan

@zhegan4

Research Scientist and Manager @Apple AI/ML. Ex-Principal Researcher @Microsoft Azure AI. Working on building vision and multimodal foundation models.

ID: 1095758911284473856

Joined: 13-02-2019 18:57:37

203 Tweets

2.2K Followers

345 Following

Ruoming Pang (@ruomingpang) 's Twitter Profile Photo

At WWDC we introduce a new generation of LLMs developed to enhance the Apple Intelligence features. We also introduce the new Foundation Models framework, which gives app developers direct access to the on-device foundation language model. machinelearning.apple.com/research/apple…

Ruoming Pang (@ruomingpang) 's Twitter Profile Photo

In this report we describe the 2025 Apple Foundation Models ("AFM"). We also introduce the new Foundation Models framework, which gives app developers direct access to the on-device AFM model. machinelearning.apple.com/research/apple…
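
For readers curious what "direct access" looks like, here is a minimal Swift sketch of calling the on-device model through the Foundation Models framework. It assumes the SystemLanguageModel / LanguageModelSession API surface from Apple's developer documentation; exact names and availability handling may differ.

```swift
import FoundationModels

// Minimal sketch: ask the on-device Apple foundation model to summarize a note.
// Assumes the SystemLanguageModel / LanguageModelSession API from Apple's docs.
func summarize(_ note: String) async throws -> String {
    // The on-device model can be unavailable (e.g. Apple Intelligence turned off),
    // so check availability before opening a session.
    guard case .available = SystemLanguageModel.default.availability else {
        return note
    }
    let session = LanguageModelSession(
        instructions: "Summarize the user's note in one sentence."
    )
    let response = try await session.respond(to: note)
    return response.content
}
```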

Jiasen Lu (@jiasenlu) 's Twitter Profile Photo

Vision tokenizers are stuck in 2020 🤔 while language models revolutionized AI 🚀
Language: One tokenizer for everything
Vision: Fragmented across modalities & tasks
Introducing AToken: The first unified visual tokenizer for images, videos & 3D that does BOTH reconstruction AND

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

Apple presents Manzano: Simple & scalable unified multimodal LLM

• Hybrid vision tokenizer (continuous ↔ discrete) cuts task conflict
• SOTA on text-rich benchmarks, competitive in gen vs GPT-4o/Nano Banana
• One model for both understanding & generation
• Joint recipe:
Jiasen Lu (@jiasenlu) 's Twitter Profile Photo

Checked out Apple's latest flagship project! While I was focused on AToken, I admired the compute and resources put into this flagship project, and the result is also amazing 🎉

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Just read the Manzano paper. They really wrote it in Pages, Sans font, plots in Numbers O.ô

But after getting past that superficial weirdness, it's nice work. I like this figure, which clearly illustrates "big model smell"

I also liked it when Parti did that already, see below.
Zhe Gan (@zhegan4) 's Twitter Profile Photo

Check out our recent flagship research project Manzano, a simple, scalable unified multimodal model for image understanding and generation. 🚀🚀

Why Manzano?
- It is a Spanish word for an apple tree 🍎🌳
- Just one unified hybrid vision tokenizer with support for both

Zhe Gan (@zhegan4) 's Twitter Profile Photo

🤔 Knowledge stored in multimodal LLM weights can be inherently limited. How can we empower multimodal LLMs with multimodal web search?

💡 In DeepMMSearch-R1, we aim to train a multimodal search agent capable of performing on-demand, multi-turn web searches and dynamically
Zhe Gan (@zhegan4) 's Twitter Profile Photo

💡 Computer-use agents (CUAs) rely exclusively on primitive actions (click, type, scroll) that require lengthy execution chains, which can be cumbersome and error-prone. How to improve this?

🔥🔥 In our native agent UltraCUA, we advocate the idea of "hybrid action" --
Zhe Gan (@zhegan4) 's Twitter Profile Photo

🍎🍎 We release Pico-Banana-400K, a large-scale, high-quality image editing dataset distilled from Nano-Banana across 35 editing types.

🔗 Data link: github.com/apple/pico-ban…

🔗 Paper link: arxiv.org/abs/2510.19808

It includes 258K single-turn image editing data, 72K multi-turn
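
As a purely illustrative aid, the sketch below shows one way a single-turn editing example from such a dataset could be modeled and loaded in Swift. The field names (sourceImage, instruction, editedImage, editType) and the JSON layout are hypothetical, not the actual Pico-Banana-400K schema, which is defined in the linked repo.

```swift
import Foundation

// Hypothetical record layout for a single-turn image-editing example.
// The real Pico-Banana-400K schema lives in the linked repo and may differ.
struct EditExample: Codable {
    let sourceImage: String   // path or URL of the original image
    let instruction: String   // natural-language edit instruction
    let editedImage: String   // path or URL of the edited result
    let editType: String      // one of the ~35 editing categories
}

// Decode a JSON array of examples and group them by edit type.
do {
    let data = try Data(contentsOf: URL(fileURLWithPath: "single_turn.json"))
    let examples = try JSONDecoder().decode([EditExample].self, from: data)
    let byType = Dictionary(grouping: examples, by: \.editType)
    print("Loaded \(examples.count) examples across \(byType.count) edit types")
} catch {
    print("Failed to load examples: \(error)")
}
```
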
WebAgentlab (@webagentlab) 's Twitter Profile Photo

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action 

UltraCUA is a foundation model that enhances computer-use agents by integrating low-level GUI actions with high-level programmatic tool calls through a hybrid action mechanism, significantly improving their
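
To make the "hybrid action" idea concrete, here is a small illustrative Swift data model (not UltraCUA's actual interface): the agent can emit either a primitive GUI action or a higher-level programmatic tool call, and one tool call can stand in for a long chain of primitives.

```swift
// Illustrative only: a hybrid action space mixing GUI primitives with tool calls.
enum AgentAction {
    // Low-level GUI primitives operating on screen coordinates and text.
    case click(x: Int, y: Int)
    case type(text: String)
    case scroll(deltaY: Int)
    // High-level programmatic tool call exposed by the environment,
    // e.g. an "open_file" helper (hypothetical name).
    case toolCall(name: String, arguments: [String: String])
}

// A lengthy, error-prone chain of primitives...
let primitiveChain: [AgentAction] = [
    .click(x: 24, y: 980),        // open the file manager
    .type(text: "report.xlsx"),   // search for the file
    .click(x: 300, y: 210),       // click the first result
]

// ...versus a single hybrid step that achieves the same goal.
let hybridStep: [AgentAction] = [
    .toolCall(name: "open_file", arguments: ["path": "report.xlsx"])
]
```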