November 10, 2024
5 min read
Jinhyuk Choi
This project started with a simple yet powerful idea: help users make sense of the massive amount of newsletter content they receive every day.
Heybunny (heybunny.io) wanted to give users a snapshot of what’s trending right now—by analyzing the keywords in their newsletters and presenting summarized trend insights.
Newsletters are curated information sources, often focused on specific industries or themes. But they come in various formats, are delivered asynchronously, and are rarely synthesized. We wanted to fix that by delivering "today’s trends at a glance."
Core components: hb_keyword_extraction_engine, hb_trend_generation_lambda
NLP stack: NLTK, spaCy, custom SNR filter

We chose a serverless architecture for flexibility and cost-efficiency, since newsletter analysis happens in bursts rather than under constant load.
If we missed a key topic or served outdated information, the user value would collapse—speed and insight quality were non-negotiable.
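At a high level, the pipeline is two Lambdas chained together. Below is a minimal sketch of the handoff, using the helpers defined later in this post; the handler name, event shape, and payload fields are assumptions for illustration, not production code.

import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # Assumed event shape: {"newsletters": [raw_html_body, ...]}.
    docs = [" ".join(preprocess_newsletter(n)) for n in event["newsletters"]]
    keywords = extract_keywords(docs)

    # Fire-and-forget handoff to the trend-generation stage, so each
    # function stays small and scales independently with bursts.
    lambda_client.invoke(
        FunctionName="hb_trend_generation_lambda",
        InvocationType="Event",
        Payload=json.dumps({"keywords": keywords}),
    )
    return {"statusCode": 200, "keyword_count": len(keywords)}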
We first normalized newsletter content by stripping HTML, removing PII, and filtering stopwords.
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt'), nltk.download('stopwords')

def strip_html_tags(text):
    # Keep only the visible text from an HTML newsletter body.
    return BeautifulSoup(text, "html.parser").get_text()

def preprocess_newsletter(text):
    text = strip_html_tags(text)
    tokens = nltk.word_tokenize(text)
    # Keep alphabetic tokens only, lowercased.
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Build the stopword set once instead of per token.
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in stop_words]
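The PII-removal step is not shown above. A minimal sketch of the idea is below; the regex patterns and placeholder strings are illustrative only, not the rules we shipped, and production PII scrubbing needs considerably more care.

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text):
    # Replace emails and phone-like strings with neutral placeholders
    # before any tokens are extracted or stored.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)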
The first Lambda (hb_keyword_extraction_engine_lambda.py) performed keyword extraction. We scored terms with TF-IDF, with graph-based extractors such as TextRank or YAKE as complementary options:
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(documents, top_k=10):
    # Drop terms that appear in >90% of newsletters (too generic)
    # or in fewer than two of them (too rare to be a trend).
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=2)
    tfidf_matrix = vectorizer.fit_transform(documents)
    # Sum each term's TF-IDF weight across all documents.
    scores = tfidf_matrix.sum(axis=0).A1
    keywords = [(word, scores[idx]) for word, idx in vectorizer.vocabulary_.items()]
    keywords.sort(key=lambda x: x[1], reverse=True)
    return [kw for kw, _ in keywords[:top_k]]
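Called on a day's preprocessed newsletters, this returns the highest-weighted terms. In the snippet below, raw_newsletters is a stand-in name for the day's fetched newsletter bodies:

# raw_newsletters: list of raw HTML newsletter bodies (illustrative name)
docs = [" ".join(preprocess_newsletter(raw)) for raw in raw_newsletters]
top_terms = extract_keywords(docs, top_k=10)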
We also applied a custom signal-to-noise (SNR) filter:

def snr_filter(keywords, noise_threshold=0.5):
    # compute_snr is our custom signal-to-noise score (see below).
    return [kw for kw in keywords if compute_snr(kw) > noise_threshold]
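The full compute_snr implementation is internal. As a simplified stand-in, one can contrast a keyword's frequency in today's batch against its long-run background rate, so that terms which suddenly spike score high; the frequency tables and smoothing constant below are illustrative only:

def compute_snr(keyword, batch_freq=None, background_freq=None, smoothing=1e-6):
    # Illustrative only: "signal" is the keyword's frequency today,
    # "noise" is its long-run baseline. A sudden spike yields a high ratio.
    batch_freq = batch_freq or {}
    background_freq = background_freq or {}
    signal = batch_freq.get(keyword, 0.0)
    noise = background_freq.get(keyword, smoothing)
    return signal / noise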
The second Lambda (hb_trend_generation_lambda.py) aggregated and clustered keywords into semantically grouped trends. Each trend was accompanied by a summary and linked newsletter references.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_keywords(keywords, n_clusters=5):
    # Embed each keyword, then group semantically similar ones with k-means.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(keywords)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)
    clusters = {}
    for i, label in enumerate(kmeans.labels_):
        clusters.setdefault(label, []).append(keywords[i])
    return clusters
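For example, with a small keyword batch and two clusters (the exact labels and grouping will vary with the embedding model):

clusters = cluster_keywords(
    ["ChatGPT", "OpenAI", "GPT-4o", "rate hikes", "inflation"],
    n_clusters=2,
)
# e.g. {0: ["ChatGPT", "OpenAI", "GPT-4o"], 1: ["rate hikes", "inflation"]}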
Trend objects were structured like this:

trend = {
    "topic": "Generative AI",
    "keywords": ["ChatGPT", "OpenAI", "LLM", "GPT-4o"],
    "summary": "Recent newsletters highlight new capabilities of GPT-4o and OpenAI's expanding dominance in enterprise AI adoption.",
    "sources": [
        {"title": "The Rise of GPT-4o", "link": "https://example.com/newsletter1"},
        {"title": "OpenAI's Next Move", "link": "https://example.com/newsletter2"}
    ]
}