Adrien Barbaresi (@adbarbaresi) 's Twitter Profile
Adrien Barbaresi

@adbarbaresi

Distant reader of digital texts ⦀ Research scientist @bbaw_de ⦀
Corpus Linguistics, NLProc, Digital Humanities, Open Source Software

ID: 222584301

Link: https://adrien.barbaresi.eu · Joined: 03-12-2010 21:03:59

1.1K Tweets

770 Followers

451 Following

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Using the fork py3langid as an example, I show how to maintain and optimize a machine learning package in three practical steps: 1. Pickling the model 2. Reviewing computationally expensive loops 3. Adjusting data types adrien.barbaresi.eu/blog/language-… #Python #NumPy #NLProc

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Further refining web data actually works! Penedo et al. show how using Trafilatura along with a series of filters and corrections can improve the zero-shot performance of large language models on a task aggregate: falconllm.tii.ae/Falcon_LLM_Ref… LightOn Technology Innovation Institute @LP_ENS_ #LLM

Haruhiko Okumura (@h_okumura) 's Twitter Profile Photo

I asked Bing Chat to recommend a tool that takes a URL and fetches only the main text, and it pointed me to Trafilatura adrien.barbaresi.eu/blog/trafilatu… Long pages are now easy to read.

Morgan McGuire (@morgymcg) 's Twitter Profile Photo

Anyone else notice what Falcon 40b does (and doesn’t) like to say about Abu Dhabi > !falcon tell me something interesting “Would you like me to tell you something interesting about technology or something about Abu Dhabi?” Cool, cool, cool

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

“You are guided through 10 weeks of mlcourse.ai. For each week, from Pandas to Gradient Boosting, instructions are given on which articles to read, lectures to watch, what assignments to accomplish.” github.com/Yorko/mlcourse…

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Find original and updated publication dates of any web page, on the command-line or with Python – htmldate v1.5.0 is out 🚀 Between information extraction, #NLProc and #webscraping, now with higher accuracy, better performance and updated setup: github.com/adbar/htmldate

Cornelius Puschmann (@cbpuschmann) 's Twitter Profile Photo

Working with digital tracking/social media and need to determine whether a set of URLs contains news? We've compiled a list of 1k+ (mostly) German-language domains (including hyperpartisan/altnews) from 5 different sources so you don't have to: osf.io/s5uhb/ 🗞️👩‍🔬🖥️🇩🇪

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Just added nice tutorials to the Trafilatura playlist 📽️ 👉 Samuel on GPT & LlamaIndex integration 👉 Marcel Samyn on a chatbot for web searches 👉 Jorge Hudson with a quickstart in Spanish Exciting, thanks for your interest! #webscraping youtube.com/playlist?list=…

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Clean, filter and sample URLs to optimize data collection – now available under Apache 2.0 license! → Save bandwidth and processing time → Filter based on language or text content → Pinpoint pages for efficient web crawling Version 1 out 🚀 github.com/adbar/courlan #DataScience

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Trafilatura 1.9.0 is out🚀 A few highlights: - Markdown available as explicit output format - "Recall" preset much more effective - Major parts of the code base refactored for more modularity and efficiency - Faster underlying Readability fork github.com/adbar/trafilat…

mywebintelligence (@mywebintel) 's Twitter Profile Photo

I want to highlight the incredible work of Adrien Barbaresi and his project Trafilatura, released in 2019 and the outcome of a thesis on web corpora defended in 2015. mywebintelligence champions this non-trivial issue in the humanities and social sciences, which turns every map of the web into an ambiguous or even false object. Welcom

Nathan Benaich (@nathanbenaich) 's Twitter Profile Photo

should you use the default text extracted common crawl or take the raw data and extract yourself with trafilatura? turns out that doing it yourself - while more expensive - gives you better performance (red line) "While the resulting dataset is about 25% larger for the WET data

Pao Ramen (@masylum) 's Twitter Profile Photo

The unsung heroes of AI are two libraries to extract content from an html page: python's trafilatura and js's readability. It is a particularly difficult problem since the web is messy and heuristics fall short. Who is working on a model to do this?

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

Latentscope by Ian Johnson 🔬🤖 is like a microscope that allows you to get a new perspective on what's happening to your data when it's embedded. You can try similarity search with different embeddings, peruse automatically labeled clusters and zoom in: github.com/enjalot/latent…

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

New Trafilatura version out today 🚀 Recent highlights include: ✍️ Additional formats: Markdown & HTML ⏱️ Faster extraction (main & baseline) 🔍 Improved encoding detection 🎯 Better accuracy 🔩 More stable code base 📚 Extended documentation & evaluation github.com/adbar/trafilat…

Adrien Barbaresi (@adbarbaresi) 's Twitter Profile Photo

What do we mean when we talk about web corpora and how are they built? Video out today as part of the DiLCo Video Reader series Universität Hamburg #CorpusLinguistics youtube.com/watch?v=SGcy79…

HackerNewsX (@hackernewsx) 's Twitter Profile Photo

Comments suggest that while simplifying HTML can aid LLMs, using structured data formats like markdown or enhancing semantic clarity with tools like Trafilatura may yield better results than mere tag removal. HN: news.ycombinator.com/item?id=414566…

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Hybrid RAG using trafilatura and BeautifulSoup Python lib, enhances complex reasoning through optimized retrieval. **The Problem with regular RAG** 🔍: Retrieval-augmented generation (RAG) systems face challenges in complex reasoning tasks, including lack of domain expertise,

Bilge (@bilgeycl) 's Twitter Profile Photo

Next week, the Haystack team is co-working in Berlin, and we're taking the chance to host a meetup at deepset HQ (and online)! 🌐 Join to learn about Trafilatura with Adrien Barbaresi and deployment with Hayhooks & Open WebUI Sign up here👇lu.ma/opennlp-15