Classification is a technique for categorizing data based on how previous data was categorized. It's considered a “supervised” learning technique, which means that it bases its predictions on labeled (already-categorized) training data.
I have a set of 10,000 patient records containing information on age, gender, height, weight, blood type, cholesterol level, and blood pressure. 250 of these patients had heart attacks; the rest did not. I wish to predict whether a new patient is at risk for a heart attack based on this data.
k-Nearest Neighbor, Decision Trees, Naïve Bayes, Bayesian Networks, Neural Networks, Support Vector Machines, Radial Basis Function Networks, Bagging, Boosting.
Classification performance is typically measured in terms of percentage of items accurately classified. The dataset can be split into a training and test set to measure performance, or cross-validation can be done. It is also possible to plot true positive rate (y-axis) against false positive rate (x-axis) to measure accuracy; such a plot is called an ROC curve.