Discovering my favorite topics on Hacker News with NLP

TL;DR: Using NLP (spaCy and Gensim) for topic modelling of my Hacker News favorite links, scraped with Selenium.

Freud icon from iconspng.com

  • In how many topics can I classify them?
  • What technologies seem to interest me the most?
  • How many “upvotes” links do I have?

Let’s go

You know the saying: “Divide and conquer”. I split the job into the following three parts:

  1. Grab the list of favorites from Hacker News
  2. Scrape each favorite with Selenium
  3. Pre-processing and topic modelling with spaCy and Gensim

Grab the favorites from Hacker News

Scraping Hacker News is pretty straightforward.
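
As a rough idea of what this step can look like, here is a minimal sketch with Selenium 4. It is not the original script: the `p` page parameter, the `span.titleline > a` selector (which reflects Hacker News markup at the time of writing and may change), the headless Firefox driver and the function names are all assumptions made for illustration.

```python
from urllib.parse import urlencode

HN_FAVORITES = "https://news.ycombinator.com/favorites"


def favorites_url(username: str, page: int = 1) -> str:
    """Build the URL of one page of a user's favorites (assumed URL scheme)."""
    return f"{HN_FAVORITES}?{urlencode({'id': username, 'p': page})}"


def collect_favorite_links(username: str, max_pages: int = 50) -> list[str]:
    """Walk the favorites pages and return the outgoing story links."""
    # Imported lazily so the URL helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    links: list[str] = []
    try:
        for page in range(1, max_pages + 1):
            driver.get(favorites_url(username, page))
            # Assumed selector for story title links on the listing page.
            anchors = driver.find_elements(By.CSS_SELECTOR, "span.titleline > a")
            if not anchors:
                break  # An empty page means we ran past the last favorite.
            links.extend(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()
    return links
```

Calling `collect_favorite_links("your_username")` would then return a flat list of URLs to feed into the next step.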

Scraping the favorites

Once all the favorite links are collected, it’s time to scrape them.
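
A possible shape for this step, again as a sketch under assumptions rather than the original code: visit each link, take the visible text of the page body, and skip anything that fails to load. The function names, the timeout value and the choice to keep only the `<body>` text are hypothetical.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def scrape_pages(urls, timeout_seconds: int = 20) -> dict:
    """Return a mapping of URL to visible page text, skipping failed pages."""
    # Imported lazily so the text helper works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(timeout_seconds)
    texts = {}
    try:
        for url in urls:
            try:
                driver.get(url)
                body = driver.find_element(By.TAG_NAME, "body")
                texts[url] = normalize_whitespace(body.text)
            except WebDriverException:
                # Dead links, timeouts and pages without a body are skipped.
                continue
    finally:
        driver.quit()
    return texts
```

Taking the whole body text is the crudest option; it also pulls in menus and footers, which is one reason some noise can reach the topic model later on.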

Pre-processing and topic modelling

Now comes the most interesting part (for an NLP fan!).

Pre-processing

We use a custom spaCy pipeline to turn the scraped content into features.

Custom spaCy pipeline for pre-processing
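
The exact components of the pipeline are not listed here, so the following is only a minimal sketch of the idea using the spaCy v3 component API. The component name `keep_content_words`, the `features` extension and the filtering rules are assumptions. It uses a blank English pipeline, so no lemmatization takes place; with a trained model such as `en_core_web_sm` you would probably keep `token.lemma_` instead of the lowercased text.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute that will hold the cleaned tokens of each document.
if not Doc.has_extension("features"):
    Doc.set_extension("features", default=None)


@Language.component("keep_content_words")
def keep_content_words(doc):
    """Keep lowercased alphabetic tokens that are not stop words."""
    doc._.features = [
        token.lower_
        for token in doc
        if token.is_alpha and not token.is_stop and len(token.text) > 2
    ]
    return doc


def build_pipeline():
    """Blank English tokenizer followed by the custom filtering component."""
    nlp = spacy.blank("en")
    nlp.add_pipe("keep_content_words", last=True)
    return nlp


nlp = build_pipeline()
doc = nlp("The cluster is running on Kubernetes.")
print(doc._.features)
```

Each scraped page then becomes a list of tokens, which is the input format the topic modelling step expects.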

Topic modelling

It is time to go from text to a numeric model. We generate a vocabulary in which each word has a unique index number.
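
To make the idea concrete, here is a plain-Python illustration of a vocabulary and the bag-of-words vectors built from it. In practice Gensim's `corpora.Dictionary` and its `doc2bow` method do this job (and may assign different ids); the helper names and the toy documents below are made up for the example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct token a unique integer id, in order of appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bag_of_words(tokens, vocabulary):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


docs = [
    ["kubernetes", "cluster", "cluster"],
    ["startup", "funding", "cluster"],
]
vocab = build_vocabulary(docs)
corpus = [to_bag_of_words(d, vocab) for d in docs]
print(vocab)
print(corpus)
```

The list of `(token_id, count)` pairs per document is the sparse representation that an LDA model is trained on.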

Number of topics, perplexity and coherence results

X = number of topics, Y = coherence
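
A comparison like this one can be produced by training one LDA model per candidate number of topics and scoring each. The sketch below uses Gensim's `LdaModel` and `CoherenceModel`; the `u_mass` coherence measure, the number of passes, the seed and the function names are assumptions, and the original experiment may well have used a different measure such as `c_v`.

```python
def coherence_by_topic_count(corpus, dictionary, topic_counts, seed=42):
    """Train one LDA model per topic count and collect its scores."""
    # Imported lazily so the selection helper works without Gensim installed.
    from gensim.models import CoherenceModel, LdaModel

    scores = {}
    for k in topic_counts:
        lda = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            random_state=seed,
            passes=10,
        )
        coherence = CoherenceModel(
            model=lda,
            corpus=corpus,
            dictionary=dictionary,
            coherence="u_mass",
        ).get_coherence()
        scores[k] = {
            "coherence": coherence,
            # Per-word likelihood bound reported by Gensim; higher is better.
            "log_perplexity": lda.log_perplexity(corpus),
        }
    return scores


def best_topic_count(scores):
    """Pick the topic count with the highest coherence score."""
    return max(scores, key=lambda k: scores[k]["coherence"])
```

Here `corpus` and `dictionary` are the bag-of-words corpus and the Gensim `Dictionary` from the previous step, for example `coherence_by_topic_count(corpus, dictionary, range(2, 21))`. The highest score is only a starting point; it is still worth reading the top words of each topic before settling on a number.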

My HN topics

Conclusion

I seem to have answered the main questions raised.

  • Infrastructure: Kubernetes, GCP, AWS, …
  • Hardware stuff.
  • ML and AI.
  • Startup news.
  • Pre-processing. Some words and punctuation marks that shouldn’t be there have slipped through.
  • The content itself is highly cohesive within the IT domain. This means that the topic coherence is not very high and the topics are not well differentiated. LDA is probably not the best model for this case.

Next steps

To improve the results, I can think of the following:

  • Use bigrams and trigrams. Instead of using only single keywords, pairs or trios of words can be more informative.
  • Better scraping and feature extraction. As mentioned before, without good input data there is little to be done.
  • Use other algorithms such as lda2vec. The idea is to better capture the relationships between words, as word2vec does.
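
To illustrate the first idea, here is a simplified, count-only sketch that joins adjacent word pairs appearing often enough in the corpus. It is a stand-in for illustration: Gensim's `Phrases` model does this with a statistical score instead of a raw count, and the function name and `min_count` threshold are assumptions.

```python
from collections import Counter


def merge_frequent_bigrams(documents, min_count=2):
    """Join adjacent token pairs seen at least min_count times across the corpus."""
    pair_counts = Counter(
        pair for tokens in documents for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in documents:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                # Replace the pair with a single joined token.
                merged.append(f"{tokens[i]}_{tokens[i + 1]}")
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "startup"],
]
print(merge_frequent_bigrams(docs))
```

Running the same function a second time on its own output would start producing trigrams, since a joined pair can then combine with a neighbouring word.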
Thanks so much! =)

References

[1] “Building a Topic Modeling Pipeline with spaCy and Gensim” by Jonathan Keller @ towards data science

CTO @ Digitalilusion.com & DigitalSecured.net Beyond-Full-stack developer #go #python #kubernetes
