January 10, 2017


Building a Text Classifier in R – Part 1

This post is the first in a 3-part series. I am going to share each step for building a Text Classifier in R, assuming you have prior knowledge of the R programming language. If you need to learn R first, I recommend the free course from DataCamp. You can also watch the webinar Data Science for the Rest of Us for an introduction to Machine Learning that starts from the very beginning. With that out of the way, let’s explore the steps involved.

The first step of a new project is Data Engineering, which involves gathering the data and transforming it into the appropriate form for processing. Most data scientists agree that it is often the most time-consuming step. If you don’t have your own data, or just want to have fun learning, try building a sentiment classifier using Sentiment140, a data set gathered from Twitter.
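To make the "transforming" half of Data Engineering a little more concrete, here is a minimal sketch in base R that turns a handful of labelled tweets into a word-count matrix, one common input form for a text classifier. The four example tweets, the column names `sentiment` and `text`, and the helper `clean_text` are all made up for illustration; real data would be loaded from a file and would need more careful cleaning.

```r
# Hypothetical toy data standing in for a real labelled tweet data set.
tweets <- data.frame(
  sentiment = c("positive", "negative", "positive", "negative"),
  text = c("I love this phone!",
           "Worst service ever...",
           "Love the new update",
           "This update is the worst"),
  stringsAsFactors = FALSE
)

# Illustrative helper: lower-case the text, replace anything that is not
# a letter or a space, then collapse repeated spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  trimws(gsub(" +", " ", x))
}

# Split each tweet into words and collect the vocabulary.
tokens <- strsplit(clean_text(tweets$text), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per tweet, one column per word,
# each cell holding how often that word occurs in that tweet.
dtm <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

print(dtm)
```

Each column of `dtm` is one feature, which is why text problems end up with so many of them: every distinct word in the data adds a column.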

The second step is Choosing an Algorithm that best suits your task, data, and scenario. I suggest checking this cheat sheet, which helps narrow down the choices. Text classification generally involves more than 100 features, so in this case we use an SVM (Support Vector Machine). Next, you must Train your model and Test it to optimize its performance. Finally, in the Reporting step, the results of your new text classifier are presented visually so that people can read them easily.
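As a rough preview of the Train and Test steps, the sketch below fits a linear SVM on part of a small synthetic data set and measures accuracy on the held-out rest. It assumes the `e1071` package, which is one widely used SVM implementation for R; this post has not yet said which package the series will use. The word names, the simulated counts, and the 75/25 split are likewise invented for illustration.

```r
library(e1071)  # assumption: one common SVM package, not named in this post

set.seed(42)

# Hypothetical features: simulated counts of four words in 40 short texts.
n <- 40
label <- factor(rep(c("negative", "positive"), each = n / 2))
is_pos <- label == "positive"
x <- cbind(
  love  = rpois(n, ifelse(is_pos, 2.0, 0.2)),
  great = rpois(n, ifelse(is_pos, 1.5, 0.2)),
  worst = rpois(n, ifelse(is_pos, 0.2, 2.0)),
  hate  = rpois(n, ifelse(is_pos, 0.2, 1.5))
)
storage.mode(x) <- "double"

# Hold out 25% of the rows for testing; train on the remainder.
test_idx <- sample(n, size = n / 4)
train_x <- x[-test_idx, , drop = FALSE]
train_y <- label[-test_idx]
test_x  <- x[test_idx, , drop = FALSE]
test_y  <- label[test_idx]

# Train: a factor response makes svm() fit a classifier.
model <- svm(x = train_x, y = train_y, kernel = "linear",
             cost = 1, scale = FALSE)

# Test: predict on rows the model has never seen.
pred <- predict(model, test_x)
confusion <- table(predicted = pred, actual = test_y)
accuracy <- mean(pred == test_y)

print(confusion)
print(accuracy)
```

Measuring accuracy on held-out rows, rather than on the training rows, is what makes the number a fair estimate of how the classifier will behave on new text.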

In the next part, I will cover Data Collection and Preparation as well as Training.

External Resources

  1. Basics of Data Science
    • Data Science for the Rest of Us
      • Math-free, jargon-free, pictorial webinar on what data science is all about.
    • Machine Learning Algorithm Cheat Sheet
  2. Tutorials
    • DataCamp
      • Free online course for R
    • R Inferno
      • A guide through the trials and tribulations of R programming
  3. Andrew Ng’s Online Machine Learning Course
    • Free!
    • Best way to learn is by doing
  4. Why stop at visualizing results?
    • Visualize the process as well!
    • Also gives a look into the challenges of getting a sample that is representative of the full data.
  5. R distribution (MRO, previously RRO)
    • RStudio (IDE)