November 10, 2024
5 min read
Jinhyuk Choi
This project started with a simple yet powerful idea: help users make sense of the massive amount of newsletter content they receive every day.
Heybunny (heybunny.io) wanted to give users a snapshot of what’s trending right now—by analyzing the keywords in their newsletters and presenting summarized trend insights.
Newsletters are curated information sources, often focused on specific industries or themes. But they come in various formats, are delivered asynchronously, and are rarely synthesized. We wanted to fix that by delivering "today’s trends at a glance."
Core components: hb_keyword_extraction_engine, hb_trend_generation_lambda
NLP stack: NLTK, spaCy, custom SNR filter

We chose a serverless architecture for flexibility and cost-efficiency, since newsletter analysis happens in bursts rather than under constant load.
If we missed a key topic or served outdated information, the user value would collapse—speed and insight quality were non-negotiable.
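At a high level, the pipeline is two Lambdas chained together. Below is a minimal sketch of the handoff, using the helpers defined later in this post; the handler name, event shape, and payload fields are assumptions for illustration, not production code.

import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # Assumed event shape: {"newsletters": [raw_html_body, ...]}.
    docs = [" ".join(preprocess_newsletter(n)) for n in event["newsletters"]]
    keywords = extract_keywords(docs)

    # Fire-and-forget handoff to the trend-generation stage, so each
    # function stays small and scales independently with bursts.
    lambda_client.invoke(
        FunctionName="hb_trend_generation_lambda",
        InvocationType="Event",
        Payload=json.dumps({"keywords": keywords}),
    )
    return {"statusCode": 200, "keyword_count": len(keywords)}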
We first normalized newsletter content by stripping HTML, removing PII, and filtering stopwords.
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt'), nltk.download('stopwords')

def strip_html_tags(text):
    # Keep only the visible text from an HTML newsletter body.
    return BeautifulSoup(text, "html.parser").get_text()

def preprocess_newsletter(text):
    text = strip_html_tags(text)
    tokens = nltk.word_tokenize(text)
    # Keep alphabetic tokens only, lowercased.
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Build the stopword set once instead of per token.
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in stop_words]
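The PII-removal step is not shown above. A minimal sketch of the idea is below; the regex patterns and placeholder strings are illustrative only, not the rules we shipped, and production PII scrubbing needs considerably more care.

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text):
    # Replace emails and phone-like strings with neutral placeholders
    # before any tokens are extracted or stored.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)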
The first Lambda (hb_keyword_extraction_engine_lambda.py) performed keyword extraction. We scored terms with TF-IDF, with graph-based extractors such as TextRank or YAKE as complementary options:
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(documents, top_k=10):
    # Drop terms that appear in >90% of newsletters (too generic)
    # or in fewer than two of them (too rare to be a trend).
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=2)
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Sum each term's TF-IDF weight across all documents.
    scores = tfidf_matrix.sum(axis=0).A1
    keywords = [(word, scores[idx]) for word, idx in vectorizer.vocabulary_.items()]
    keywords.sort(key=lambda x: x[1], reverse=True)
    return [kw for kw, _ in keywords[:top_k]]
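Called on a day's preprocessed newsletters, this returns the highest-weighted terms. In the snippet below, raw_newsletters is a stand-in name for the day's fetched newsletter bodies:

# raw_newsletters: list of raw HTML newsletter bodies (illustrative name)
docs = [" ".join(preprocess_newsletter(raw)) for raw in raw_newsletters]
top_terms = extract_keywords(docs, top_k=10)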
We also applied a custom signal-to-noise (SNR) filter:

def snr_filter(keywords, noise_threshold=0.5):
    # compute_snr is our custom signal-to-noise score (see below).
    return [kw for kw in keywords if compute_snr(kw) > noise_threshold]
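The full compute_snr implementation is internal. As a simplified stand-in, one can contrast a keyword's frequency in today's batch against its long-run background rate, so that terms which suddenly spike score high; the frequency tables and smoothing constant below are illustrative only:

def compute_snr(keyword, batch_freq=None, background_freq=None, smoothing=1e-6):
    # Illustrative only: "signal" is the keyword's frequency today,
    # "noise" is its long-run baseline. A sudden spike yields a high ratio.
    batch_freq = batch_freq or {}
    background_freq = background_freq or {}
    signal = batch_freq.get(keyword, 0.0)
    noise = background_freq.get(keyword, smoothing)
    return signal / noise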
The second Lambda (hb_trend_generation_lambda.py) aggregated and clustered keywords into semantically grouped trends. Each trend was accompanied by a summary and linked newsletter references.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_keywords(keywords, n_clusters=5):
    # Embed each keyword, then group semantically similar ones with k-means.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(keywords)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)
    clusters = {}
    for i, label in enumerate(kmeans.labels_):
        clusters.setdefault(label, []).append(keywords[i])
    return clusters
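For example, with a small keyword batch and two clusters (the exact labels and grouping will vary with the embedding model):

clusters = cluster_keywords(
    ["ChatGPT", "OpenAI", "GPT-4o", "rate hikes", "inflation"],
    n_clusters=2,
)
# e.g. {0: ["ChatGPT", "OpenAI", "GPT-4o"], 1: ["rate hikes", "inflation"]}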
Trend objects were structured like this:

trend = {
    "topic": "Generative AI",
    "keywords": ["ChatGPT", "OpenAI", "LLM", "GPT-4o"],
    "summary": "Recent newsletters highlight new capabilities of GPT-4o and OpenAI's expanding dominance in enterprise AI adoption.",
    "sources": [
        {"title": "The Rise of GPT-4o", "link": "https://example.com/newsletter1"},
        {"title": "OpenAI's Next Move", "link": "https://example.com/newsletter2"}
    ]
}