Discovering my favorite topics in Hacker News with NLP

TL;DR: Using NLP (spaCy and Gensim) for topic modelling of my Hacker News favorite links, scraped with Selenium.

  • Into how many topics can I classify them?
  • What technologies seem to interest me the most?
  • How many “upvotes” links do I have?

Let’s go

You know the saying: “Divide and conquer”. I split the job into the following three parts:

  1. Grab the favorite links from Hacker News
  2. Scrape each favorite with Selenium
  3. Pre-processing and topic modelling with spaCy and Gensim

Grab the favorites from Hacker News

Scraping Hacker News is pretty straightforward.
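
A minimal sketch of this step, not the original code. It assumes the favorites are public at https://news.ycombinator.com/favorites?id=<username>, that story links sit inside a span with class "titleline", and that pagination uses an anchor with class "morelink"; this markup may change. Selenium loads each page and the standard-library HTMLParser pulls out the links. The names FavoritesParser, make_driver and collect_favorite_links are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"


class FavoritesParser(HTMLParser):
    """Collect story links and the "More" link from one favorites page.

    Assumed markup: <span class="titleline"><a href="..."> for stories and
    <a class="morelink" href="..."> for pagination.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.next_href = None
        self._expect_story_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "titleline" in classes:
            self._expect_story_link = True
        elif tag == "a" and "morelink" in classes:
            self.next_href = attrs.get("href")
        elif tag == "a" and self._expect_story_link and attrs.get("href"):
            # Only the first anchor inside the title span is the story itself.
            self.links.append(urljoin(BASE_URL, attrs["href"]))
            self._expect_story_link = False


def make_driver():
    """Start headless Firefox (assumes Selenium 4 and geckodriver are installed)."""
    from selenium import webdriver  # imported lazily: only needed for real runs

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def collect_favorite_links(driver, username, max_pages=50):
    """Follow the "More" links and return every favorite URL in page order."""
    url = urljoin(BASE_URL, f"favorites?id={username}")
    links = []
    for _ in range(max_pages):
        driver.get(url)
        parser = FavoritesParser()
        parser.feed(driver.page_source)
        links.extend(parser.links)
        if parser.next_href is None:
            break
        url = urljoin(BASE_URL, parser.next_href)
    return links
```

A typical run would create the browser with make_driver(), call collect_favorite_links(driver, "some_user") and finish with driver.quit(). The max_pages limit is only a safety net against endless pagination.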

Scraping the favorites

Once all the favorite links are obtained, it’s time to scrape them.
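
As a rough sketch of what scraping each link could look like (an illustration, not the original code): the browser loads every URL and the visible text is extracted from the rendered HTML with the standard-library HTMLParser. The helpers html_to_text and scrape_pages are hypothetical names, and dropping script, style and noscript content is my own simplification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text nodes, ignoring anything inside script/style/noscript tags."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def scrape_pages(driver, urls):
    """Load each URL with a Selenium-like driver and map url -> page text."""
    texts = {}
    for url in urls:
        try:
            driver.get(url)
            texts[url] = html_to_text(driver.page_source)
        except Exception as exc:  # dead links, timeouts, blocked pages, ...
            print(f"skipping {url}: {exc}")
    return texts
```

Favorites that can no longer be loaded are skipped instead of aborting the whole run, since old links tend to die.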

Pre-processing and topic modelling

Now comes the most interesting part (for an NLP fan!).


We use a custom spaCy pipeline to process the scraped content and convert it into features.

Custom spaCy Pipeline for pre-processing
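
The idea can be sketched roughly as follows. This is not the original pipeline: it uses plain functions over spaCy tokens instead of a registered pipeline component, the model name en_core_web_sm is an assumption, and the filtering rules (drop stop words, punctuation, whitespace, URLs and numbers, keep lowercased lemmas of at least two characters) are my own guess at a reasonable default.

```python
def keep_token(token, min_length=2):
    """True for tokens that are worth keeping as features."""
    if token.is_stop or token.is_punct or token.is_space:
        return False
    if token.like_url or token.like_num:
        return False
    return len(token.lemma_) >= min_length


def to_features(doc):
    """Turn one spaCy Doc into a list of lowercased lemmas."""
    return [token.lemma_.lower() for token in doc if keep_token(token)]


def preprocess(nlp, texts):
    """Run the texts through spaCy in batches and extract the features."""
    return [to_features(doc) for doc in nlp.pipe(texts)]


def load_nlp(model="en_core_web_sm"):
    """Load a spaCy model (assumed to be downloaded) without parser and NER."""
    import spacy  # third-party; imported lazily so the helpers stay importable

    return spacy.load(model, disable=["parser", "ner"])
```

With the scraped texts in a list, preprocess(load_nlp(), texts) would return one list of keywords per favorite.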

Topic modelling

It is time to go from text to a numeric model. We generate a vocabulary in which each word has a unique index number.
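
To make this concrete, here is a plain-Python sketch of the concept. In Gensim the same role is played by corpora.Dictionary and its doc2bow method; the functions and the toy documents below are only illustrative, and the ids Gensim assigns may differ.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct word to a unique integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def doc_to_bow(doc, vocab):
    """Convert one tokenised document into sorted (word_id, count) pairs."""
    counts = Counter(vocab[word] for word in doc if word in vocab)
    return sorted(counts.items())


# Toy documents standing in for the pre-processed favorites.
docs = [["kubernetes", "cluster", "kubernetes"], ["startup", "cluster"]]
vocab = build_vocabulary(docs)
corpus = [doc_to_bow(doc, vocab) for doc in docs]
```

Each document becomes a sparse list of (id, count) pairs, which is the kind of input a topic model such as LDA works with.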

Number of topics, perplexity and coherence results
X = number of topics, Y = coherence
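
A comparison like this one can be produced with a small sweep over the number of topics. The sketch below is an assumption about how that could be done with Gensim's LdaModel and CoherenceModel, not the original code; the c_v coherence measure, the number of passes and the random seed are my own choices. Note that log_perplexity returns a per-word likelihood bound rather than the perplexity itself.

```python
def coherence_by_num_topics(corpus, dictionary, texts, topic_counts, passes=10):
    """Train one LDA model per topic count; return {k: (coherence, log_perplexity)}."""
    from gensim.models import CoherenceModel, LdaModel  # third-party: gensim

    results = {}
    for k in topic_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=passes, random_state=42)
        coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        results[k] = (coherence, lda.log_perplexity(corpus))
    return results


def best_num_topics(results):
    """Pick the topic count with the highest coherence score."""
    return max(results, key=lambda k: results[k][0])
```

Here corpus is the bag-of-words corpus, dictionary the Gensim Dictionary and texts the tokenised documents. The highest coherence is only a starting point: it is still worth reading the top words of each topic before settling on a number.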
My HN topics


I seem to have answered the main questions raised. My favorites revolve mainly around these topics:

  • Infrastructure: Kubernetes, GCP, AWS, …
  • Hardware stuff.
  • ML and AI.
  • Startups news.

However, the results have some weaknesses:

  • Pre-processing. Some words or punctuation marks have slipped in that shouldn’t be there.
  • The content itself is highly cohesive within the IT domain. As a result, the topics overlap: the coherence is not very high and the topics are not well differentiated. LDA is probably not the best model for this case.

Next steps

I can think of the following improvements:

  • Use of bigrams and trigrams. Instead of using only single keywords, it may be clearer to use pairs or triples of words.
  • Better scraping and feature extraction. As mentioned before, without good input data there is not much that can be done.
  • Use of other algorithms such as lda2vec. The idea is to better capture the relationships between words, as word2vec does.
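
As an illustration of the first idea, here is a simplified plain-Python sketch that joins frequently adjacent words into a single token. Gensim's Phrases model does this with a statistical score; the raw count threshold used here is a simplification of mine, and the function names are made up.

```python
from collections import Counter


def frequent_bigrams(docs, min_count=2):
    """Return the adjacent word pairs seen at least min_count times."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(doc, bigrams):
    """Replace each known pair in a document with one 'first_second' token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
            merged.append(f"{doc[i]}_{doc[i + 1]}")
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


# Toy documents standing in for the pre-processed favorites.
docs = [["machine", "learning", "model"],
        ["machine", "learning", "startup"],
        ["learning", "go"]]
bigrams = frequent_bigrams(docs)
merged_docs = [merge_bigrams(doc, bigrams) for doc in docs]
```

Running the same two steps again on merged_docs would produce trigrams from pairs that keep appearing next to a third word.
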
Thanks so much! =)


[1] “Building a Topic Modeling Pipeline with spaCy and Gensim” by Jonathan Keller @ towards data science

CTO @ & Beyond-Full-stack developer #go #python #kubernetes
