Language Models to understand unstructured data

Understanding unstructured data with language models

Presented By Alex Peattie
Co-founder and CTO at Peg

Alex Peattie is the co-founder and CTO of Peg, a technology platform helping multinational brands and agencies to find and work with top YouTubers. Peg is used by over 1500 organisations worldwide including Coca-Cola, L'Oreal and Google.

An experienced digital entrepreneur, Alex spent six years as a developer and consultant for the likes of Grubwithus, Huckberry, UNICEF and Nike, before joining coding bootcamp Makers Academy as senior coach, where he trained hundreds of junior developers. Alex was also a technical judge at this year's TechCrunch Disrupt conference.

Presentation Description

As data scientists, we've seen rapid improvement in recent decades in the tools available for working with structured data (be it tabular data, graph data or sensor data). Yet the vast majority of our data (Merrill Lynch puts the figure at roughly 90%) is *unstructured* and lives in the form of documents, emails, reviews, reports and chat logs.

Many of us are far less familiar with how to analyze and understand this trove of unstructured data. This talk focuses on language models, one of the most fundamental tools for working with unstructured data. Language models are all around us (although we're probably unaware of them), underpinning everything from Word's spellchecker to home assistants like Alexa. 

While plenty of "out of the box" language modeling libraries exist, the first part of the talk focuses on getting a thorough understanding of what a language model is, and how it works. We'll touch on key ideas from statistics and information theory, and see how Alan Turing, in developing techniques to break Nazi codes at Bletchley Park, created the smoothing techniques that remain widely used in language models today. We'll then proceed to the present day, looking at how techniques like word vectors and transfer learning have yielded an improved generation of tools.

In the second half of the talk, we'll look at how we can practically use language models to understand unstructured data. Specifically, we'll explore:
- Classification - the canonical application of language models, which can help us identify spam, analyze sentiment or perform unsupervised clustering. We'll look at a famous case where language models were able to successfully identify a Shakespeare forgery.
- Predictive modeling - if I were to look at your Tweets (and nothing else), could I guess your gender? It turns out state-of-the-art techniques can predict it with a success rate above 80%. We'll look at how language models can enrich your datasets with additional demographic or contextual data.
- Information retrieval - finally, we'll see how language models have been used extensively (for example, in the legal sector) to extract targeted insights from enormous data sets.
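
To make the classification idea above concrete, here is a minimal sketch in Python. It is an illustration rather than material from the talk: it trains one unigram language model per class and picks the class whose model gives the text the highest probability. It uses add-one (Laplace) smoothing, which is simpler than the Turing-era smoothing the talk describes. The training sentences, labels and function names are all made up for the example, and equal class priors are assumed.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count word frequencies across a list of training texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def log_prob(text, counts, vocab_size):
    """Log-probability of text under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + vocab_size))
        for word in text.lower().split()
    )


def classify(text, models):
    """Return the label whose model assigns the text the highest probability.

    Assumes equal prior probability for every label.
    """
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # +1 reserves probability mass for unseen words
    return max(models, key=lambda label: log_prob(text, models[label], vocab_size))


# Hypothetical toy training data, invented for this sketch.
models = {
    "spam": train_unigram(["win free money now", "free prize claim now"]),
    "ham": train_unigram(["meeting at noon tomorrow", "see you at lunch tomorrow"]),
}

print(classify("claim your free prize", models))
print(classify("lunch meeting tomorrow", models))
```

Smoothing matters here because a word that never appeared in a class's training data (such as "your") would otherwise get probability zero and rule that class out entirely.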

Presentation Curriculum

Understanding unstructured data with language models
69:24