Suppose I’m working on some classification problem. (Fraud detection and comment spam are two problems I’m working on right now, but I’m curious about any classification task in general.)

How do I know which classifier I should use?

  1. Decision tree
  2. SVM
  3. Bayesian
  4. Neural network
  5. K-nearest neighbors
  6. Q-learning
  7. Genetic algorithm
  8. Markov decision processes
  9. Convolutional neural networks
  10. Linear regression or logistic regression
  11. Boosting, bagging, ensembling
  12. Random hill climbing or simulated annealing

In which cases is one of these the “natural” first choice, and what are the principles for choosing that one?

Examples of the type of answers I’m looking for (from Manning et al.’s Introduction to Information Retrieval book):

a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).

I’m guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.

b. If you have a ton of data, then the choice of classifier matters much less, so you should probably just pick one that scales well.
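To make guidelines (a) and (b) concrete, here is a minimal sketch on synthetic data. It is not from the book, and everything in it is an assumption chosen for illustration: the two overlapping 1-D Gaussian classes, a nearest-centroid rule standing in for a high-bias classifier, and k-nearest neighbors (with k set near the square root of the training size) standing in for a more flexible one. All function names are hypothetical.

```python
import heapq
import math
import random


def make_data(n_per_class, rng, spread=1.5):
    """Balanced synthetic data: class 0 centered at -1, class 1 at +1."""
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            data.append((rng.gauss(center, spread), label))
    return data


def fit_centroid(train):
    """High-bias model: store one mean per class, predict the closer mean."""
    means = {}
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        means[c] = sum(xs) / len(xs)
    return lambda x: 0 if abs(x - means[0]) <= abs(x - means[1]) else 1


def odd_k_near_sqrt(n):
    """Illustrative rule of thumb: odd k close to sqrt(n), so votes never tie."""
    k = max(1, round(math.sqrt(n)))
    return k if k % 2 == 1 else k + 1


def fit_knn(train, k):
    """More flexible model: majority vote among the k nearest training points."""
    def predict(x):
        nearest = heapq.nsmallest(k, train, key=lambda p: abs(p[0] - x))
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)


def compare(n_per_class, trials=100, n_test_per_class=100, seed=0):
    """Average test accuracy of both models over many random train/test draws."""
    rng = random.Random(seed)
    k = odd_k_near_sqrt(2 * n_per_class)
    totals = {"centroid": 0.0, "knn": 0.0}
    for _ in range(trials):
        train = make_data(n_per_class, rng)
        test = make_data(n_test_per_class, rng)
        totals["centroid"] += accuracy(fit_centroid(train), test)
        totals["knn"] += accuracy(fit_knn(train, k), test)
    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    for n in (5, 50, 300):
        print(n, compare(n, trials=30))
```

On a setup like this, I would expect the centroid rule to lead when there are only a few training points per class and the gap to narrow as the training set grows. The exact numbers depend on the seed and on how well the simple model matches the data, so treat this as a way to probe the guideline rather than a proof of it.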

  1. What are other guidelines? Even answers like “if you’ll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent” are good. I care less about implementation/library issues, though.

  2. Also, for a somewhat separate question, besides standard Bayesian classifiers, are there ‘standard state-of-the-art’ methods for comment spam detection (as opposed to email spam)?

