Refining AI Methods for Medical Diagnostics Management

Abstract

This paper extends the utility of support vector machines (SVMs), a relatively recent AI innovation being adopted for cancer diagnostics, by analytically modelling the impact of imperfect class labelling in training data. It uses ROC computations to study the SVM's ability to classify new examples correctly even when some training examples are mislabelled. Design of experiments (DOE) reveals that mislabelled training data degrade training quality, and hence performance, though not as strongly as the SVM's primary design parameters do. Still, our results argue strongly for developing the best-trained SVM possible when it is intended for uses such as medical diagnostics, because mislabelled training data shrink the distance to the decision boundary and increase generalization error. Further, this study affirms that, to be effective, the SVM design optimization objective should incorporate the real-life costs or consequences of incorrect classification.

Two-class Classification and the SVM

A missed timely detection of cancer may cost a life, whereas falsely classifying a benign case as cancerous can cost $250,000 in agonizing therapies, along with their psychological effects, for no just reason. Clinical diagnostics has always depended on the clinician's ability to diagnose pathologies by observing the symptoms a patient exhibits and then classifying his or her condition. Correct diagnosis can make a large difference by enabling correct and timely intervention, whether for hypertension, diabetes, or the various types of malignancy. Similar situations arise wherever the precise links between cause and effect are not yet established and management must process the available information as best it can to draw inferences that guide decisions. For hypertension, for example, recent attempts have gone beyond the measurement of systolic/diastolic blood pressure; such studies attempt to predict the occurrence of hypertension from observations of age, sex, family history, smoking habits, lipoprotein, triglyceride, uric acid, total cholesterol, body mass index, and so on. In many such situations, the treatment given is based on a binary classification: the ailment is present, or it is not (Ture et al. 2005).

Classifying a pathology is challenging not only with respect to acquiring relevant data, through tests on factors known to be associated with the pathology, but also with respect to the data analytics used to produce reliable and correct predictions. This paper examines one such data analysis technique, now in use for about 20 years and known as the support vector machine (SVM), which helps one develop classification models based on statistical principles of learning. Like an artificial neural network, an SVM is data driven: it is trained on a dataset of examples with known class (label) and then used to predict the class of new examples. How well an SVM works is measured by the accuracy with which it predicts the class of unseen examples, that is, examples not included in training the SVM.
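A minimal sketch of this train-then-predict workflow is shown below. It is not the authors' code: the synthetic dataset, the scikit-learn library, and the parameter values are illustrative assumptions only.

```python
# Sketch: train an SVM on labelled examples, then measure how accurately it
# predicts the class of unseen (held-out) examples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Feature vectors x_i with known binary class labels y_i (synthetic data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out examples not used in training to estimate generalization performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # kernel and C are the primary design parameters
clf.fit(X_train, y_train)        # training: learn from the labelled examples

y_pred = clf.predict(X_test)     # predict the class of unseen examples
print("Accuracy on unseen examples:", accuracy_score(y_test, y_pred))
```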

The present work moves beyond the simple notion of "accuracy," the conventional classifier performance measure, by analytically modelling correct and incorrect classification of instances in the training sample, which to our knowledge has not been done before. We focus specifically on the effect of imperfect labelling of the input data on the ROC.
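The kind of experiment this implies can be sketched as follows: flip a fraction of the training labels to mimic imperfect labelling, then compare ROC performance on cleanly labelled test data. This is an assumed illustration, not the paper's actual experimental design or data.

```python
# Sketch: effect of imperfect training labels on ROC AUC (illustrative only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for noise_rate in (0.0, 0.1, 0.2):               # fraction of mislabelled training examples
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise_rate  # randomly chosen labels to flip
    y_noisy[flip] = 1 - y_noisy[flip]

    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_noisy)                     # train on imperfectly labelled data
    scores = clf.decision_function(X_test)        # signed distances from the decision boundary
    print(f"label noise {noise_rate:.0%}: ROC AUC = {roc_auc_score(y_test, scores):.3f}")
```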

The support vector machine is an algorithmic approach, proposed by Vapnik and his colleagues (Boser et al. 1992), to the classification of instances (for example, patients who may or may not have diabetes); it falls within the broader context of supervised learning in artificial intelligence (AI) in computer science. It begins with a dataset comprising the feature vectors {x_i} of instances and a class tag or label {y_i} attached to each of those instances. The most common application of the SVM trains a model that learns from those instances and estimates the model parameters. That model is then used to predict, with a high degree of correctness, the class label y of an instance for which only feature values are available. Through an elaborate optimization procedure, an SVM is designed to display minimum classification error on unseen instances, an attribute measured by its "generalization error." Binary classification is its most common use.
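The optimization referred to here is, in its standard soft-margin textbook form (not reproduced from this paper), a trade-off between maximizing the margin around the decision boundary and penalizing training errors:

```latex
% Standard soft-margin SVM primal: w, b define the decision boundary,
% xi_i are slack variables for violated margins, and C weighs margin
% width against training error.
\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad  & y_i \bigl( w^{\top} x_i + b \bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
```

A new instance x is then classified by the sign of w^T x + b, which is why mislabelled training points that shrink the margin can be expected to raise the generalization error.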
