Extracting Today’s Trends from Newsletters: Building Heybunny’s Keyword Trend Engine

Jinhyuk Choi

November 10, 2024

5 min read

Background

This project started with a simple yet powerful idea: help users make sense of the massive amount of newsletter content they receive every day. Heybunny (heybunny.io) wanted to give users a snapshot of what’s trending right now by analyzing the keywords in their newsletters and presenting summarized trend insights.

Newsletters are curated information sources, often focused on specific industries or themes. But they come in various formats, are delivered asynchronously, and are rarely synthesized. We wanted to fix that by delivering "today’s trends at a glance."

Tech Stack

  • Language: Python 3.x
  • Cloud Platform: AWS Lambda
  • Keyword Extraction Engine: Custom hb_keyword_extraction_engine
  • Trend Generation Logic: hb_trend_generation_lambda
  • NLP Tools: NLTK, spaCy, custom SNR filter
  • Repo: heybunny-api-lambda-py

We chose a serverless architecture for flexibility and cost-efficiency, since newsletter analysis happens in bursts rather than under constant load.
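
To make the burst-driven flow concrete, here is a minimal sketch of how a handler could wire the pipeline together. The event shape and wiring are illustrative assumptions, not the production code; preprocess_newsletter and extract_keywords are shown in the Solution Process below.

import json

def lambda_handler(event, context):
    # Illustrative entry point: a batch of newsletters comes in, keywords go out.
    newsletters = json.loads(event["body"])["newsletters"]
    # Preprocess each newsletter, then re-join tokens for the TF-IDF stage
    docs = [" ".join(preprocess_newsletter(n["content"])) for n in newsletters]
    keywords = extract_keywords(docs)
    return {"statusCode": 200, "body": json.dumps({"keywords": keywords})}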

Problem Definition

  1. Incoming newsletter formats were inconsistent (HTML-heavy, plaintext, or hybrid).
  2. Keyword extraction alone was not sufficient—we needed semantically coherent "trends".
  3. Processing had to be fast, anonymized, and scalable to support many users concurrently.

If we missed a key topic or served outdated information, the user value would collapse—speed and insight quality were non-negotiable.

Solution Process

1. Newsletter Preprocessing

We first normalized newsletter content by stripping HTML, removing PII, and filtering stopwords.

from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download("punkt") and nltk.download("stopwords")

def strip_html_tags(text):
    # Parse the HTML and keep only the visible text
    return BeautifulSoup(text, "html.parser").get_text()

def preprocess_newsletter(text):
    text = strip_html_tags(text)
    tokens = nltk.word_tokenize(text)
    tokens = [t.lower() for t in tokens if t.isalpha()]
    stop_words = set(stopwords.words('english'))  # build the set once, not per token
    tokens = [t for t in tokens if t not in stop_words]
    return tokens
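
The snippet above omits the PII step. As a rough sketch of what it could look like (the regexes here are illustrative, not our production rules), obvious identifiers can be masked before tokenization:

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def remove_pii(text):
    # Replace e-mail addresses and phone-like numbers with placeholders
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text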

2. Keyword Extraction Engine (hb_keyword_extraction_engine_lambda.py)

This Lambda function performed:

  • TF-IDF ranking
  • SNR (Signal-to-Noise Ratio) filtering
  • Optional fallback to TextRank or YAKE (a fallback sketch follows the TF-IDF code below)

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(documents, top_k=10):
    # Drop terms in >90% of newsletters (too generic) or in fewer than 2 (too rare)
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=2)
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Aggregate each term's TF-IDF weight across all documents
    scores = tfidf_matrix.sum(axis=0).A1
    keywords = [(word, scores[idx]) for word, idx in vectorizer.vocabulary_.items()]
    keywords.sort(key=lambda x: x[1], reverse=True)
    return [kw for kw, _ in keywords[:top_k]]
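
The optional fallback is useful when corpus-level TF-IDF statistics are unreliable (e.g. a very small batch). A minimal YAKE-based version might look like the following; the parameter choices are illustrative, not our production settings:

import yake  # pip install yake

def extract_keywords_fallback(text, top_k=10):
    # YAKE scores are "lower is better", unlike TF-IDF weights
    extractor = yake.KeywordExtractor(lan="en", n=1, top=top_k)
    return [kw for kw, score in extractor.extract_keywords(text)]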

We also applied a custom filter:

def snr_filter(keywords, noise_threshold=0.5):
    return [kw for kw in keywords if compute_snr(kw) > noise_threshold]
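
compute_snr belongs to our internal engine. As a rough illustration only (not the production formula), one way to score signal-to-noise is to compare a keyword's frequency in today's batch against its long-run baseline:

# Illustrative stand-in for the internal scorer, assuming per-keyword
# frequency counts are available from a statistics store.
BATCH_FREQ = {}     # keyword -> frequency in today's newsletters
BASELINE_FREQ = {}  # keyword -> rolling historical frequency

def compute_snr(keyword):
    today = BATCH_FREQ.get(keyword, 0)
    baseline = max(BASELINE_FREQ.get(keyword, 0), 1e-6)  # avoid division by zero
    return today / baseline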

3. Trend Generation Module (hb_trend_generation_lambda.py)

This aggregated and clustered keywords into semantically grouped trends. Each trend was accompanied by a summary and linked newsletter references.

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_keywords(keywords, n_clusters=5):
    # Embed keywords into a semantic vector space, then group nearby ones
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(keywords)
    # Guard: KMeans requires n_clusters <= number of samples
    kmeans = KMeans(n_clusters=min(n_clusters, len(keywords)), random_state=0).fit(embeddings)
    clusters = {}
    for i, label in enumerate(kmeans.labels_):
        clusters.setdefault(label, []).append(keywords[i])
    return clusters
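
To turn a raw cluster into a named trend, one simple heuristic (a sketch, not necessarily our exact production logic) is to label each cluster with the keyword nearest its centroid:

import numpy as np

def label_cluster(keywords, embeddings, centroid):
    # Pick the keyword whose embedding lies closest to the cluster centroid
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return keywords[int(np.argmin(dists))]

Here centroid would come from the fitted model’s cluster_centers_ attribute, and embeddings are the encoded vectors for that cluster’s keywords.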

Trend objects were structured like this:

trend = {
    "topic": "Generative AI",
    "keywords": ["ChatGPT", "OpenAI", "LLM", "GPT-4o"],
    "summary": "Recent newsletters highlight new capabilities of GPT-4o and OpenAI’s expanding dominance in enterprise AI adoption.",
    "sources": [
        {"title": "The Rise of GPT-4o", "link": "https://example.com/newsletter1"},
        {"title": "OpenAI’s Next Move", "link": "https://example.com/newsletter2"}
    ]
}

Results & Outcomes

  • Users could understand core trends without reading entire newsletters.
  • Avg. processing time for ~200 newsletters: 3.2 seconds
  • 87% of beta users said the trend summaries “saved time” and “felt insightful”
  • Several B2B clients integrated our API as a market research plugin

Lessons & Next Steps

What We Learned

  • Trends are more than keyword counts—they need structure, context, and clarity.
  • Users appreciate clear summaries more than raw data, especially in time-sensitive environments.

Next Steps

  • Integrate GPT-4 based summarization layer
  • Expand input sources (e.g. news, blogs, Twitter, research papers)
  • Add time-based trend visualizations (e.g. rolling 7-day trends)
