
Sam Park
@smsampark
Postdoc at Stanford CS. Previously: MIT PhD, Cornell BS
ID: 1607751613
https://sungminpark.com
Joined: 20-07-2013 07:54:39
53 Tweets
403 Followers
476 Following


Contributed talks at ATTRIB 2023 (Rm 271-273): starting with Teddi Worledge on Corroborative vs Contributive data attribution for language models!


I gave a keynote this week at the fantastic ATTRIB Workshop #NeurIPS2023: "What does scale give us: Why we are building a ladder to the moon". Some of you asked for my slides, sharing below: docs.google.com/presentation/d… Thanks to the organizers for a great workshop!

How do we attribute an image generated by a diffusion model back to the training data? w/ Kristian Georgiev, Josh Vendrow, Hadi Salman, Sam Park: we show that it's useful to look at each step of the diffusion process:

We tend to choose LM training data via intuitive notions of text quality... but LMs are often *un*intuitive. Is there a better way? w/ Logan Engstrom, Axel Feldmann: we select better data by modeling how models learn from data. Our method, DsDm, can greatly improve
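The core idea — scoring candidate training data by its modeled effect on target-task loss rather than by surface "quality" — can be sketched as follows. This is an illustrative toy, not the actual DsDm implementation; the function names and the `estimated_effect` scores are hypothetical stand-ins for what a learned datamodel would provide.

```python
# Hypothetical sketch of model-aware data selection (illustrative names only):
# rank candidate examples by a datamodel's estimate of how much including
# each one changes held-out target-task loss, and keep the most helpful k.

def select_training_data(candidates, estimated_effect, k):
    """Pick the k candidates whose estimated effect on target-task
    loss is most negative (i.e. most loss-reducing)."""
    ranked = sorted(candidates, key=lambda ex: estimated_effect[ex])
    return ranked[:k]

# Toy scores standing in for datamodel estimates (negative = helpful).
effects = {"doc_a": -0.30, "doc_b": 0.05, "doc_c": -0.10, "doc_d": 0.20}
selected = select_training_data(list(effects), effects, k=2)
print(selected)  # the two most loss-reducing documents
```

The point of the sketch is the ranking criterion: selection is driven by a model of learning, not by heuristics about the text itself.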



In work w/ Andrew Ilyas Jennifer Allen Hannah Li Aleksander Madry we give experimental evidence that users strategize on recommender systems! We find that users react to their (beliefs about) *algorithms* (not just content!) to shape future recs. Paper: arxiv.org/abs/2405.05596 1/8


In ML, we train on biased (huge) datasets → models encode spurious correlations and fail on minority groups. Can we scalably remove "bad" data? w/ Saachi Jain, Kimia Hamidieh, Kristian Georgiev, Andrew Ilyas, Marzyeh, we propose D3M, a method for exactly this: gradientscience.org/d3m/
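The removal step can be sketched in the spirit of the tweet: estimate each training example's contribution to the loss on the worst-off group, then drop the examples estimated to hurt that group most. This is a hypothetical illustration, not the actual D3M method; `group_harm` stands in for real attribution scores.

```python
# Hypothetical sketch of attribution-driven data debiasing (names illustrative):
# drop the training examples with the largest estimated contribution to
# worst-group loss, keeping the rest of the dataset intact.

def drop_most_harmful(train_ids, group_harm, num_remove):
    """Remove the num_remove examples with the largest estimated
    contribution to worst-group loss."""
    worst_first = sorted(train_ids, key=lambda i: group_harm[i], reverse=True)
    dropped = set(worst_first[:num_remove])
    return [i for i in train_ids if i not in dropped]

harm = {"a": 0.02, "b": 0.90, "c": -0.10, "d": 0.40}  # toy attribution scores
kept = drop_most_harmful(list(harm), harm, num_remove=2)
print(kept)  # remaining training set after removing the two most harmful
```

Because the criterion targets worst-group loss directly, this differs from generic "low quality" filtering: an example can be fluent text and still be harmful to a minority group.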


At #ICML2024? Our tutorial "Data Attribution at Scale" will be tomorrow at 9:30 AM CEST in Hall A1! I will not be able to make it (but will arrive later that day), but my awesome students Andrew Ilyas, Sam Park, Logan Engstrom will carry the torch :)

Starting now in Hall A1! With accompanying notes (WIP) at ml-data-tutorial.org (compiled w/ Sam Park, Logan Engstrom, Kristian Georgiev, Aleksander Madry)

Attending #ICML2024? Check out our work on decomposing predictions and editing model behavior via targeted interventions to model internals! Poster: #2513, Hall C 4-9, 1:30p (Tue) Paper: arxiv.org/abs/2404.11534 w/ Harshay Shah Andrew Ilyas


Stop by our poster on model-aware dataset selection at ICML! Location/time: 1:30pm Hall C 4-9 #1010 (Tuesday) Paper: arxiv.org/abs/2401.12926 with: Axel Feldmann Aleksander Madry

Thanks to all who attended our tutorial "Data Attribution at Scale" at ICML (w/ Sam Park Logan Engstrom Kristian Georgiev Aleksander Madry)! We're really excited to see the response to this emerging topic. Slides, notes, ICML video: ml-data-tutorial.org Public recording soon!


The ATTRIB workshop is back @ NeurIPS 2024! We welcome papers connecting model behavior to data, algorithms, parameters, scale, or anything else. Submit by Sep 18! More info: attrib-workshop.cc Co-organizers: Tolga Bolukbasi Logan Engstrom Sadhika Malladi Elisa Nguyen Sam Park



Machine unlearning ("removing" training data from a trained ML model) is a hard, important problem. Datamodel Matching (DMM): a new unlearning paradigm with strong empirical performance! w/ Kristian Georgiev Roy Rinberg Sam Park Shivam Garg Aleksander Madry Seth Neel (1/4)
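The unlearning target in this paradigm can be sketched with a linear datamodel: if a model output is approximated as a sum of per-training-example contributions, then the output of a model retrained without the forget set can be estimated by summing contributions over the retained examples only, and the original model fine-tuned toward that prediction. A minimal toy, with hypothetical names and toy weights (not the actual DMM code):

```python
# Hypothetical sketch of the Datamodel Matching idea (illustrative only):
# a linear datamodel predicts a model output as a sum of per-example
# contributions, so "retrain without the forget set" has a cheap proxy:
# sum contributions over the retained examples.

def predicted_margin(datamodel_weights, included_examples):
    """Linear datamodel: output ≈ sum of weights of included examples."""
    return sum(datamodel_weights[i] for i in included_examples)

weights = {"ex1": 0.4, "ex2": -0.1, "ex3": 0.25}  # toy per-example weights
forget = {"ex2"}
retained = [i for i in weights if i not in forget]

full = predicted_margin(weights, weights)      # trained on everything
target = predicted_margin(weights, retained)   # predicted retrain-from-scratch output
print(round(full, 2), round(target, 2))
```

The `target` values play the role of the matching objective: fine-tuning the original model toward them approximates retraining without the forget set, without paying the cost of actual retraining.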


At #ICLR, check out Perplexity Correlations: a statistical framework to select the best pretraining data with no LLM training! I can't make the trip, but Tatsunori Hashimoto will present the poster for us! Continue reading for the latest empirical validations of PPL Correlations:
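The statistical idea — no LLM training required — can be sketched as: across many existing public models, correlate each candidate domain's perplexity with downstream benchmark score, and prefer domains where lower perplexity tracks higher accuracy. This is a toy illustration of that correlation criterion, not the actual Perplexity Correlations framework; the model scores and domain perplexities below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: three public models' benchmark accuracies, and each model's
# perplexity on two candidate pretraining domains (values invented).
bench = [0.40, 0.55, 0.70]
ppl = {
    "web":  [120.0, 90.0, 60.0],  # lower perplexity tracks higher accuracy
    "spam": [50.0, 55.0, 61.0],   # lower perplexity tracks lower accuracy
}

# Most negative correlation = most promising pretraining domain.
ranked = sorted(ppl, key=lambda d: pearson(ppl[d], bench))
print(ranked[0])  # → web
```

The appeal is that everything needed — public models, their benchmark scores, and perplexities on candidate data — already exists, so data selection becomes a statistical estimation problem rather than a training problem.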
