
Adrien Barbaresi
@adbarbaresi
Distant reader of digital texts ⦀ Research scientist @bbaw_de ⦀
Corpus Linguistics, NLProc, Digital Humanities, Open Source Software
ID: 222584301
https://adrien.barbaresi.eu 03-12-2010 21:03:59
1.1K Tweets
770 Followers
451 Following


Further refining web data actually works! Penedo et al. show how using Trafilatura along with a series of filters and corrections can improve the zero-shot performance of large language models on a task aggregate: falconllm.tii.ae/Falcon_LLM_Ref… LightOn Technology Innovation Institute @LP_ENS_ #LLM


Info via Alexander Doria: Falcon is currently the best model in Hugging Face's LLM benchmark, it is now also completely open source tii.ae/news/uaes-falc…






Just added nice tutorials to the Trafilatura playlist 📽️ 👉 Samuel on GPT & LlamaIndex integration 👉 Marcel Samyn on a chatbot for web searches 👉 Jorge Hudson with a quickstart in Spanish Exciting, thanks for your interest! #webscraping youtube.com/playlist?list=…



I want to highlight the incredible work of Adrien Barbaresi and his Trafilatura project, released in 2019 and the fruit of a thesis on web corpora defended in 2015. mywebintelligence champions this non-trivial issue in the social sciences and humanities, one that makes every map of the web an ambiguous or even false object. Welcom



Latentscope by Ian Johnson 🔬🤖 is like a microscope that allows you to get a new perspective on what's happening to your data when it's embedded. You can try similarity search with different embeddings, peruse automatically labeled clusters and zoom in: github.com/enjalot/latent…


What do we mean when we talk about web corpora and how are they built? Video out today as part of the DiLCo Video Reader series Universität Hamburg #CorpusLinguistics youtube.com/watch?v=SGcy79…



Next week, the Haystack team is co-working in Berlin, and we're taking the chance to host a meetup at deepset HQ (and online)! 🌐 Join to learn about Trafilatura with Adrien Barbaresi and deployment with Hayhooks & Open WebUI. Sign up here 👇 lu.ma/opennlp-15