Hashtag segmentation using Wikipedia n-grams

by Oskar Kosch | First published:

For anyone who has ever wondered how to split Twitter or Facebook hashtags so they can be included in a social-content natural language processing (NLP) project, for example to perform latent semantic analysis (LSA), there is a quite handy solution: n-gram segmentation.

The idea is simply to create artificial hashtags out of n-grams spotted in a corpus of Twitter or Facebook posts (or any other source), and to boost that list with n-grams used in Wikipedia. This approach has some limitations (primarily its memory requirements) that may make it cumbersome, but those obstacles also create additional value: instead of using a model that may split strings that should not be split (such as abbreviations), here we can be sure that the string replacing a hashtag is a real, existing entity. That brings more quality to analyses done with latent semantic analysis.

Now, apart from posting the .csv file with synthetic hashtags, I should also show how to use them. As my favourite tool for scientific analysis is R, the whole solution will be given in this language.
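One plausible way to use such a lookup table is to replace every recognised hashtag in a text with its multi-word phrase before feeding the text to an NLP pipeline. Again, the article's solution is in R; this is only a hedged Python sketch of that replacement step, with a hypothetical in-memory lookup standing in for the .csv file.

```python
import re

def expand_hashtags(text, lookup):
    """Replace each hashtag whose body appears in `lookup` with the
    corresponding phrase; leave unknown hashtags untouched."""
    def repl(match):
        body = match.group(1).lower()
        # Fall back to the original hashtag when no phrase is known.
        return lookup.get(body, match.group(0))
    return re.sub(r"#(\w+)", repl, text)

# Hypothetical lookup table, as might be loaded from the synthetic-hashtag .csv.
lookup = {"latentsemanticanalysis": "latent semantic analysis"}
print(expand_hashtags("Trying #LatentSemanticAnalysis today", lookup))
# Trying latent semantic analysis today
```

Leaving unknown hashtags untouched keeps the substitution conservative: only strings confirmed to be real entities are ever rewritten.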
