Gene Chou (@gene_ch0u)'s Twitter Profile
Gene Chou

@gene_ch0u

CS PhD student @Cornell; previously Princeton '22

ID: 1582075953169285138

Website: http://genechou.com · Joined: 17-10-2022 18:28:12

48 Tweets

232 Followers

240 Following

Andi Marafioti (@andimarafioti)'s Twitter Profile Photo

I came up with a technique for dynamic token selection in Vision-Language Models. Instead of wasting compute on every part of an image, this method adapts the number of tokens based on the complexity of each region. Here’s an example of how it works: 👇

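A minimal sketch of that general idea (not the author's exact method), assuming a simple pixel-variance proxy for region complexity and a hypothetical `select_tokens` helper: flat regions contribute few tokens, detailed regions contribute more. In a real VLM the complexity score would come from a learned module rather than raw variance, but the compute saving comes from the same top-k selection.

```python
# Hypothetical sketch (not the author's exact method): score each image patch by a
# simple complexity proxy (pixel variance) and keep only the top-k most complex
# patches as visual tokens for the VLM.
import torch

def select_tokens(image: torch.Tensor, patch: int = 16, keep_ratio: float = 0.25):
    """image: (C, H, W). Returns (kept patch tokens, their indices)."""
    C, H, W = image.shape
    # Split into non-overlapping patches and flatten each: (num_patches, C*patch*patch)
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)
    complexity = patches.var(dim=1)                  # flat regions score low
    k = max(1, int(keep_ratio * patches.shape[0]))
    keep = complexity.topk(k).indices                # spend tokens on detailed regions
    return patches[keep], keep

tokens, idx = select_tokens(torch.rand(3, 224, 224))
print(tokens.shape)  # torch.Size([49, 768]) instead of 196 full-resolution tokens
```
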
Bingyi Kang (@bingyikang)'s Twitter Profile Photo

Thrilled to introduce Video Depth Anything to support depth estimation for super-long videos (over 5 minutes). 👉 It enjoys all the benefits of #DepthAnything: high-quality, fast, robust, etc. Project page: videodepthanything.github.io

Qianqian Wang (@qianqianwang5)'s Twitter Profile Photo

Introducing CUT3R! An online 3D reasoning framework for many 3D tasks directly from just RGB. For static or dynamic scenes. Video or image collections, all in one!

youming.deng (@denghilbert)'s Twitter Profile Photo

How can we use wide-FOV cameras for reconstruction? We propose self-calibrating Gaussian Splatting, which jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations to reconstruct directly from a set of wide-angle captures. Page: denghilbert.github.io/self-cali/
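
A rough sketch of the joint-optimization idea, with hypothetical parameter names and a placeholder differentiable renderer (the actual splatting and distortion model is more involved): the point is simply that intrinsics, distortion coefficients, and Gaussian parameters all sit in the same optimizer and are driven by the same photometric loss.

```python
# Rough sketch of the joint optimization (placeholder renderer, hypothetical names):
# intrinsics, lens distortion, and Gaussian parameters are all trained by one
# photometric loss on the wide-angle captures.
import torch

N = 10_000
means     = torch.nn.Parameter(torch.randn(N, 3))
colors    = torch.nn.Parameter(torch.rand(N, 3))
opacities = torch.nn.Parameter(torch.zeros(N, 1))
focal     = torch.nn.Parameter(torch.tensor(500.0))   # simplified intrinsics
distort   = torch.nn.Parameter(torch.zeros(4))        # radial/tangential coefficients

opt = torch.optim.Adam([means, colors, opacities, focal, distort], lr=1e-3)

def render(means, colors, opacities, focal, distort, H=64, W=64):
    # Stand-in for a differentiable splatting renderer that applies the learned
    # distortion when projecting Gaussians onto the wide-angle image plane.
    scale = torch.sigmoid(opacities).mean() + 0.0 * (focal + distort.sum() + means.sum())
    return scale * colors.mean() * torch.ones(H, W, 3)

target = torch.rand(64, 64, 3)                         # one wide-angle capture
for step in range(100):
    loss = (render(means, colors, opacities, focal, distort) - target).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()       # one loss updates everything jointly
```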

Yuncong Yang (@yuncongyy)'s Twitter Profile Photo

Excited to introduce 3D-Mem! Spatial Intelligence simply isn’t possible without robust 3D Scene Memory. That’s why we developed 3D-Mem, an effective framework for lifelong exploration and reasoning. Thrilled to share that it’s been accepted to #CVPR2025!

Ning Yu (@realningyu)'s Twitter Profile Photo

The first project I led at Netflix Eyeline Studios is headed to #CVPR2025 with 5,5,4 review scores: 🌊Go-with-the-Flow🌊 warps noise for effortless motion control in video diffusion — no pipeline changes, same compute. Direct camera/object motion, transfer movement between…

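A minimal sketch of the noise-warping idea, assuming a given per-frame optical flow and a hypothetical `warp_noise` helper (illustrative, not the paper's exact algorithm): warping the previous frame's noise along the flow makes the diffusion noise move with the scene, which is what provides motion control without changing the sampling pipeline.

```python
# Minimal sketch of flow-based noise warping (illustrative, not the paper's exact algorithm):
# instead of sampling independent noise per frame, warp the previous frame's noise along
# the optical flow so the diffusion noise moves with the scene.
import torch
import torch.nn.functional as F

def warp_noise(noise: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """noise: (1, C, H, W); flow: (1, 2, H, W) in pixels (dx, dy). Returns warped noise."""
    _, _, H, W = noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float() + flow[0].permute(1, 2, 0)  # (H, W, 2)
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1      # normalize to [-1, 1] for grid_sample
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(noise, grid.unsqueeze(0), align_corners=True)

noise_t  = torch.randn(1, 4, 64, 64)                    # latent noise for frame t
flow     = torch.zeros(1, 2, 64, 64); flow[:, 0] = 3.0  # e.g. 3-px horizontal motion
noise_t1 = warp_noise(noise_t, flow)                    # correlated noise for frame t+1
```
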
David Fan (@davidjfan)'s Twitter Profile Photo

Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.

Karan Dalal (@karansdalal)'s Twitter Profile Photo

Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training. We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency. Every video below is produced directly by…
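
A greatly simplified sketch of what a TTT layer does, with hypothetical names (the layers in the paper are more elaborate): the layer's hidden state is itself a tiny linear model whose weights take a gradient step on a self-supervised reconstruction loss for every incoming token, which keeps long-context cost linear.

```python
# Greatly simplified sketch of a Test-Time Training (TTT) layer (illustrative only):
# the hidden state is a tiny linear model W that takes one gradient step on a
# self-supervised reconstruction loss for every token it sees.
import torch

class TTTLinear(torch.nn.Module):
    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        self.dim, self.inner_lr = dim, inner_lr
        self.key   = torch.nn.Linear(dim, dim)    # "corrupted" view of the token
        self.value = torch.nn.Linear(dim, dim)    # reconstruction target
        self.query = torch.nn.Linear(dim, dim)    # read-out probe

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = x.new_zeros(self.dim, self.dim)        # fast weights, updated at test time
        outs = []
        for t in range(x.shape[0]):                # x: (seq_len, dim), processed causally
            k, v, q = self.key(x[t]), self.value(x[t]), self.query(x[t])
            err = W @ k - v                        # inner loss: reconstruct v from k
            W = W - self.inner_lr * torch.outer(err, k)   # one inner gradient step
            outs.append(W @ q)                     # output uses the freshly updated W
        return torch.stack(outs)

layer = TTTLinear(dim=32)
y = layer(torch.randn(128, 32))                    # long contexts cost O(T), not O(T^2)
```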

Xichen Pan (@xichen_pan)'s Twitter Profile Photo

We find training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. The MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!

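A hypothetical sketch of that recipe with placeholder module names (not the authors' code): the MLLM is frozen, and only a small connector that maps its hidden states into the generator's conditioning space receives gradients. Because the MLLM's weights never change, whatever knowledge and in-context behavior it already has is what conditions the pixel-output branch.

```python
# Hypothetical sketch of the frozen-MLLM recipe (placeholder modules, not the authors' code):
# only a small connector between the frozen MLLM and the image generator is trained.
import torch

class FrozenMLLMToGenerator(torch.nn.Module):
    def __init__(self, mllm: torch.nn.Module, generator: torch.nn.Module,
                 dim_mllm: int = 4096, dim_gen: int = 1024):
        super().__init__()
        self.mllm, self.generator = mllm, generator
        for p in self.mllm.parameters():
            p.requires_grad_(False)                # the MLLM stays FROZEN
        self.connector = torch.nn.Sequential(      # the only trainable piece
            torch.nn.Linear(dim_mllm, dim_gen),
            torch.nn.GELU(),
            torch.nn.Linear(dim_gen, dim_gen),
        )

    def forward(self, tokens):
        with torch.no_grad():
            h = self.mllm(tokens)                  # (B, T, dim_mllm) hidden states
        cond = self.connector(h)                   # map understanding features into the
        return self.generator(cond)                # generator's conditioning space
```
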
Jon Barron (@jon_barron)'s Twitter Profile Photo

Here's my 3DV talk, in chapters:

1) Intro / NeRF boilerplate.
2) Recent reconstruction work.
3) Recent generative work.
4) Radiance fields as a field.
5) Why generative video has bitter-lessoned 3D.
6) Why generative video hasn't bitter-lessoned 3D.

5 & 6 are my favorites.
Gordon Wetzstein (@gordonwetzstein)'s Twitter Profile Photo

Most video models
🤯 forget the past
🐌 slow down over time
🔁 rely on bidirectional (not causal) attention

Our state-space video world models (SSM)
🧠 remember across hundreds of frames
⚡️ generate at constant speed
⏩ are fully causal, enabling real-time rollout

1/3
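
A toy sketch of why a state-space formulation gives constant speed and causality, with hypothetical names (not the authors' architecture): a fixed-size recurrent state is updated once per frame, so per-frame compute and memory do not grow with video length, unlike attention over all past frames.

```python
# Toy sketch of the constant-cost, causal recurrence behind a state-space video model
# (illustrative; not the authors' architecture). A fixed-size state h is updated once per
# frame latent, so per-frame compute and memory stay constant for arbitrarily long videos.
import torch

class FrameSSM(torch.nn.Module):
    def __init__(self, latent_dim: int, state_dim: int = 256):
        super().__init__()
        self.state_dim = state_dim
        self.A = torch.nn.Linear(state_dim, state_dim, bias=False)   # state transition
        self.B = torch.nn.Linear(latent_dim, state_dim, bias=False)  # input projection
        self.C = torch.nn.Linear(state_dim, latent_dim)              # readout

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (T, latent_dim). Returns a causal prediction for each next frame."""
        h = frames.new_zeros(self.state_dim)       # the entire memory of the past
        preds = []
        for x_t in frames:                         # strictly causal: only past frames seen
            h = torch.tanh(self.A(h) + self.B(x_t))
            preds.append(self.C(h))
        return torch.stack(preds)

model = FrameSSM(latent_dim=64)
out = model(torch.randn(300, 64))                  # hundreds of frames, same cost per step
```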

Xun Huang (@xunhuang1995)'s Twitter Profile Photo

Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
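
An illustrative sketch of the "simulate inference during training" idea, with a dummy model standing in for the real autoregressive diffusion transformer: each frame is generated conditioned on the model's own earlier outputs (which a KV cache would hold at inference), rather than on ground-truth frames, and the training loss is computed on that rollout.

```python
# Illustrative sketch of simulating inference during training (not the paper's code;
# DummyFrameModel is a stand-in for the autoregressive video diffusion transformer).
# Each frame is generated conditioned on the model's OWN earlier outputs, as at
# inference, rather than on ground-truth frames.
import torch

def self_forcing_rollout(model, noise_frames: torch.Tensor) -> torch.Tensor:
    """noise_frames: (T, dim). Rolls the model out on its own generations."""
    generated = []
    for t in range(noise_frames.shape[0]):
        context = torch.stack(generated) if generated else noise_frames[:0]
        generated.append(model(context, noise_frames[t]))   # autoregressive step
    return torch.stack(generated)

class DummyFrameModel(torch.nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)
    def forward(self, context, noise):
        ctx = context.mean(dim=0) if context.shape[0] > 0 else torch.zeros_like(noise)
        return self.proj(noise + ctx)              # condition on previously generated frames

rollout = self_forcing_rollout(DummyFrameModel(), torch.randn(8, 16))
# Training losses are then computed on `rollout` itself, closing the train/inference gap.
```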

Gene Chou (@gene_ch0u)'s Twitter Profile Photo

I'll be presenting our work with Kai Zhang at #cvpr2025. We finetune video models to be 3D consistent without any 3D supervision! Feel free to stop by our poster or reach out to chat: Sunday, Jun 15, 4-6pm, ExHall D, poster #168. cvpr.thecvf.com/virtual/2025/p…

Sukjun (June) Hwang (@sukjun_hwang)'s Twitter Profile Photo

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
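
A toy sketch of the dynamic-chunking idea, far simpler than H-Net itself and with hypothetical names: a learned scorer predicts chunk boundaries over raw bytes, and the byte embeddings within each chunk are pooled into a single vector, so the units the main model operates on are discovered from data rather than fixed by a tokenizer.

```python
# Toy sketch of dynamic chunking (far simpler than H-Net, names are hypothetical):
# a learned scorer predicts chunk boundaries over raw bytes, and byte embeddings
# within each chunk are mean-pooled into one vector, so the "tokens" are discovered.
import torch

class DynamicChunker(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(256, dim)          # raw bytes, no tokenizer
        self.boundary = torch.nn.Linear(dim, 1)            # boundary score per byte

    def forward(self, byte_ids: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        x = self.embed(byte_ids)                            # (T, dim)
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)     # (T,) boundary probabilities
        chunks, current = [], []
        for t in range(byte_ids.shape[0]):
            current.append(x[t])
            if p[t] > threshold or t == byte_ids.shape[0] - 1:
                chunks.append(torch.stack(current).mean(dim=0))  # pool one chunk
                current = []
        return torch.stack(chunks)                          # data-dependent units

chunker = DynamicChunker()
byte_ids = torch.tensor(list("hello world".encode("utf-8")))
units = chunker(byte_ids)    # the main model now operates on these discovered units
```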