CAD system: a content based image retrieval approach for pulmonary nodule detection in CT images

This paper proposes a Computer-Aided Detection (CAD) system that uses Content-Based Image Retrieval (CBIR) to detect cancer nodules in an image. The CAD system is intended to help radiologists identify lung cancer at an early stage, when the nodules are very small and cannot be seen by the naked eye. In recent years, image processing techniques have played a key role in detecting diseases at early stages, particularly cancers such as liver cancer and breast cancer. The proposed method comprises four steps: i) the image is preprocessed to reduce the noise level and improve image quality, so that detection accuracy is higher; ii) the image is segmented using Marker-Controlled Watershed Segmentation; iii) the features of the nodules in the image are extracted using the Grey Level Co-occurrence Matrix (GLCM); iv) the nodules are classified from the extracted features using a K-Nearest Neighbour (KNN) classifier. CBIR is used to retrieve images from the database that match a query image by combining feature extraction and similarity matching. For the experiments, CT images from the Lung Image Database Consortium (LIDC) database are used.


Introduction
Lung cancer is one of the leading causes of cancer death among men and women. Every year, more people die of lung cancer than of other cancers such as skin cancer, breast cancer, bladder cancer and bone cancer [1]. The risk factors for lung cancer include smoking, exposure to radon, exposure to asbestos, air pollution, arsenic in drinking water and a family history of lung cancer. The symptoms of lung cancer usually do not appear until the disease is at an advanced stage. Sometimes lung cancer is detected at an early stage as a result of tests taken for other medical conditions such as pneumonia, heart disease or lung disease [2]. Some of the manifestations of lung cancer are continuous coughing, chest pain, hoarseness, weight loss, shortness of breath, wheezing and tiredness. Lung cancer is detected using the following tests: sputum cytology, chest X-ray, Computed Tomography (CT) scan and Positron Emission Tomography (PET) scan [3]. Detecting lung cancer at an early stage helps patients to survive longer. To do so, the cancer must be detected accurately, and a Computer-Aided Detection (CAD) system helps radiologists to identify lung cancer nodules correctly. The radiologist's reading is the current clinical standard for diagnosing the presence of cancer nodules [4]. The CAD system must determine features such as margin, size and shape, and give a brief description of the nodules present in the image. The proposed system is based on Content-Based Image Retrieval (CBIR), which finds images with features similar to those of the given image in a large database. The major advantage of CBIR is that the radiologist can retrieve similar, previously diagnosed images from the database, which supports the process of interpretation [5].

Literature review
Several approaches have been proposed and implemented for the detection of lung cancer using various image processing techniques. Choi and Choi [6] proposed a nodule detection system based on hierarchical block classification. A median filter and a Gaussian filter are used for image enhancement, and a support vector machine (SVM) classifier is used for feature extraction. The system achieves 95.28% sensitivity. Ye et al. [7] proposed 3D local geometric and statistical intensity features for nodule detection. The detection rate of the system is 90.2%, with 8.2 false positives per scan. Taher et al. [2] proposed a system in which a Hopfield Neural Network and the Fuzzy C-Means algorithm are used for segmentation, and a threshold classifier is used to extract features. The system achieves 98% accuracy, 83% sensitivity and 99% specificity. Sharma and Jindal [8] proposed a system in which bit-plane slicing is used for extraction and a region growing algorithm is used for segmentation. The accuracy of the system is 80%. Sharma and Jindal [9] proposed a system in which contrast enhancement, thresholding, filtering and blob analysis are used as preprocessing techniques. The image is segmented using Otsu thresholding, and texture features are extracted using an artificial neural network (ANN). The accuracy of the system is 85%. Gattass and Acatauassu [10] proposed a system in which the growing neural gas (GNG) technique is used to obtain the lung nodules present in an image, and an SVM is used to obtain the texture features of the nodules. The system achieves 86% sensitivity, 91% specificity and 91% accuracy.
Lee et al. [11] proposed a system in which genetic algorithm template matching (GATM) is used to detect the lung nodules present in an image. The accuracy of the system is 72%. Valdivieso and Amin [12] also proposed a system in which genetic algorithm template matching (GATM) is used to detect lung nodules; the features are extracted using lung nodule phantom images as reference images. The accuracy of the system is 90%. Choi [13] used a shape-based feature descriptor to detect the lung nodules present in an image. The features are extracted and used to train an SVM classifier. The sensitivity of the system is 97.5%. Kim et al. [14] used grey-level thresholding to segment the image and a contour-following method to extract the features. The sensitivity of the system is 96%. The authors of [5] proposed a system based on content-based image retrieval, whose performance is close to that of an SVM.

Proposed system
The proposed CAD system uses Content-Based Image Retrieval (CBIR) to detect pulmonary nodules in a CT image by analyzing the nodule database as well as the feature database [5]. The first step is image acquisition. The second step is preprocessing, which reduces or removes the noise present in the image. The third step is segmentation: Marker-Controlled Watershed Segmentation is applied to identify the Region of Interest (ROI). The fourth step is feature extraction, which measures selected properties of the image and reduces the original dataset using GLCM features. The fifth step is feature selection based on a genetic algorithm. The sixth step is classification, which detects the lung nodules present in the image using a KNN classifier together with content-based image retrieval. The architecture of the proposed system is shown in Fig. 1.

Image acquisition
The first stage of any vision system is image acquisition. Input images may be stored as intensity, binary, indexed or RGB images. In an intensity image, each value represents brightness. In an indexed image, each value is an index into a lookup table from which the actual pixel value is read. In an RGB image, the values represent the intensities of the red, green and blue wavelengths. In the proposed CAD system, LIDC images are used as input. Fig. 2 shows an image collected by the acquisition process.

Image pre-processing
Image preprocessing removes unwanted noise or distortions from an image. In the proposed CAD system, the original image is first converted to a grayscale image using the MATLAB rgb2gray function [8]. This function converts the input image to grayscale by removing the hue and saturation information while retaining the luminance. Fig. 3 shows the grayscale image.
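The luminance conversion can be illustrated with a short Python sketch. It assumes the weighted sum documented for MATLAB's rgb2gray (0.2989 R + 0.5870 G + 0.1140 B); the function name rgb_to_gray and the nested-list image layout are illustrative choices, not part of the proposed system.

```python
def rgb_to_gray(image):
    """Convert a nested list of (R, G, B) pixels to luminance values.

    Assumption: the weights follow the formula documented for MATLAB's
    rgb2gray; hue and saturation are discarded, luminance is kept.
    """
    weights = (0.2989, 0.5870, 0.1140)
    return [
        [sum(w * c for w, c in zip(weights, pixel)) for pixel in row]
        for row in image
    ]


# Illustrative 2x2 RGB image: red, green, blue and white pixels.
rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
gray = rgb_to_gray(rgb)
```

CT slices are normally single-channel already, so this step only matters when the images have been exported in an RGB format.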

Order statistics filter
An order statistics filter is based on ordering the pixels present in the image [15]. The filtering algorithm uses a sliding window to operate pixel by pixel: the values in the neighbourhood of each pixel are sorted, and the output is chosen from the ordered values [16]. Examples of order statistics filters are the median filter, the max and min filters, the midpoint filter and the alpha-trimmed mean filter. Fig. 4 shows the output of the order statistics filter.
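A minimal sketch of the sliding-window idea in Python is given here. The text does not state which order statistic or window size the system uses, so the 3x3 window, the edge-replication border handling and the function name order_statistic_filter are assumptions made for illustration.

```python
from statistics import median


def order_statistic_filter(image, size=3, statistic=median):
    """Replace each pixel by an order statistic of its size x size window.

    Assumption: borders are handled by replicating the nearest edge pixel.
    Pass statistic=max or statistic=min for the max and min filters.
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clamping coordinates at the border.
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            output[y][x] = statistic(window)
    return output


# Illustrative image: a flat background with one bright and one dark outlier.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 0, 10],
]
denoised = order_statistic_filter(noisy)                  # median filter
brightest = order_statistic_filter(noisy, statistic=max)  # max filter
```

In this example the median filter removes both outliers, while the max filter spreads the bright outlier over its neighbourhood, which is why the median is the usual choice for impulse noise.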

Image segmentation
In the proposed system, the Sobel edge detection algorithm is used to find the edges in an image [17]. The Sobel operator performs a 2D spatial gradient measurement on the image [16] and is used to calculate the gradient magnitude at each point of the input image. It uses a pair of 3x3 convolution masks, one for each gradient direction. Fig. 5 shows the output of the Sobel edge detection algorithm.
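The gradient-magnitude computation can be sketched in a few lines of Python. The two 3x3 masks are the standard Sobel kernels; leaving the one-pixel border at zero and the function name sobel_magnitude are simplifying assumptions for this example.

```python
import math

# Standard Sobel masks for the horizontal and vertical gradient components.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude sqrt(Gx^2 + Gy^2) for interior pixels.

    Assumption: the one-pixel border is left at 0.0 for simplicity. The masks
    are applied by correlation; flipping them only changes the sign of Gx and
    Gy, not the magnitude.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            output[y][x] = math.hypot(gx, gy)
    return output


# Illustrative image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 100, 100] for _ in range(4)]
edges = sobel_magnitude(step)
```

The gradient magnitude image is also the usual input to a watershed transform, since region boundaries appear in it as ridges.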

Marker-controlled watershed segmentation
In the proposed system, Marker-Controlled Watershed Segmentation is used to avoid over-segmentation into many regions by reducing the number of regional minima [18]. The method consists of three steps: i) morphological reconstruction; ii) extraction of the region markers; iii) application of the watershed transform. It is carried out using repeated morphological operations. The foreground and background markers are obtained by opening-by-reconstruction and closing-by-reconstruction. As a result, only a limited number of regions is obtained [19]. Fig. 6 shows the output of Marker-Controlled Watershed Segmentation.
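The flooding step of a marker-controlled watershed can be sketched as a priority flood in Python. This is a simplified illustration, not the procedure of the proposed system: the markers are supplied by hand rather than derived by opening and closing by reconstruction, no explicit watershed lines are drawn, and the function name marker_watershed and the 4-connectivity are assumptions.

```python
import heapq


def marker_watershed(gradient, markers):
    """Flood a gradient image from labelled markers, lowest values first.

    gradient: 2D list of gradient magnitudes.
    markers:  2D list of the same shape; 0 means unlabelled, any other
              integer is a region label (hypothetical, hand-placed here).
    Returns a 2D list in which every pixel carries a region label.
    """
    height, width = len(gradient), len(gradient[0])
    labels = [row[:] for row in markers]
    heap = []
    order = 0  # insertion counter used as a tie-breaker in the heap
    for y in range(height):
        for x in range(width):
            if labels[y][x] != 0:
                heapq.heappush(heap, (gradient[y][x], order, y, x))
                order += 1
    while heap:
        _, _, y, x = heapq.heappop(heap)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < height and 0 <= nx < width and labels[ny][nx] == 0:
                # An unlabelled neighbour joins the region that reaches it first.
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (gradient[ny][nx], order, ny, nx))
                order += 1
    return labels


# Illustrative gradient image: two flat basins separated by a ridge.
gradient = [
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
    [1, 1, 9, 1, 1],
]
markers = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
]
regions = marker_watershed(gradient, markers)
```

Because flooding starts only from the two markers, the result contains exactly two regions regardless of how many local minima the gradient image has, which is the point of using markers.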

Feature extraction
In the feature extraction stage, the original dataset is reduced by finding the features that distinguish one input image pattern from another. The significant properties of the image are mapped into a feature space, and this representation is what the classifier receives as input. In the proposed system, the Grey Level Co-occurrence Matrix (GLCM) is used. The GLCM is a second-order statistical method that characterizes the spatial relationship of texture in an image [20]. It is used to extract statistical parameters such as energy, entropy, correlation and Inverse Difference Moment. The number of rows and columns of the GLCM is equal to the number of grey levels in the image [21]. In the formulas that follow, $P(i,j)$ denotes the normalized GLCM entry for grey levels $i$ and $j$. The following statistical features are used for the feature selection process:
• Contrast measures the intensity difference between a pixel and its neighbour over the whole image. It is determined by the difference in colour and brightness between an object and the other objects within the same field of view.
• Correlation measures the joint probability of occurrence of the specified pixel pairs.
• Cluster prominence measures the skewness, or asymmetry, of an image. When cluster prominence is low, the co-occurrence matrix has a peak around the mean values.
• Cluster shade also measures the skewness of an image. When cluster shade is high, the image is asymmetric.
• Dissimilarity measures how far segmentations produced by different algorithms, or segmentations of different images, differ from one another. The two feature vectors being compared are built from texture features. Each feature vector represents a relative frequency distribution, and the dissimilarity is calculated by the Kullback-Leibler (K-L) divergence, or relative entropy.
• Energy measures the extent of pixel-pair repetitions. It quantifies the uniformity of an image using the formula: $\text{Energy} = \sum_i \sum_j P(i,j)^2$
• Entropy characterizes the randomness of the input image. Its value is maximal when all elements of the co-occurrence matrix are equal. Entropy is calculated using the formula: $\text{Entropy} = -\sum_i \sum_j P(i,j) \log P(i,j)$
• Homogeneity measures how close the distribution of elements in the grey-level co-occurrence matrix is to its diagonal, and it helps to improve segmentation.
• Sum of Squares, Variance puts a relatively high weight on the elements that differ from the average value of $P(i,j)$: $\text{Variance} = \sum_i \sum_j (i - \mu)^2 P(i,j)$, where $\mu$ is the mean of $P(i,j)$.
Scaling is used to obtain the local spatial statistics of the segmented image, and orientation characterizes the texture regions for similarity matching. The image is partitioned into regions of homogeneous texture, and the texture features associated with these regions are indexed in the image data.
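As an illustration of how such texture features are obtained, the following Python sketch builds a normalized GLCM for a single pixel offset and evaluates a few of the features from it. The standard Haralick definitions are assumed, including the inverse difference moment form of homogeneity; the offset (one pixel to the right), the non-symmetric matrix and the function names glcm and texture_features are choices made for this example, not details taken from the proposed system.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized grey-level co-occurrence matrix for the offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts))
    return [[count / total for count in row] for row in counts]


def texture_features(p):
    """A few Haralick-style features of a normalized GLCM p."""
    size = len(p)
    cells = [(i, j, p[i][j]) for i in range(size) for j in range(size)]
    mu = sum(i * value for i, _, value in cells)  # mean row grey level
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        # Assumption: inverse difference moment form of homogeneity.
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "variance": sum((i - mu) ** 2 * v for i, _, v in cells),
    }


# Illustrative 4x4 image with four grey levels (0 to 3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(image, levels=4)
features = texture_features(matrix)
```

In practice the matrix is usually computed for several offsets and angles, and the resulting feature values are averaged or concatenated into one feature vector per region.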

Feature selection
Feature selection is the process of removing irrelevant information while retaining significant information. It plays a major role in pattern recognition problems such as image classification [22]. Feature selection reduces the dimensionality of the feature space, which improves classification accuracy and reduces computation time. In the proposed system, a genetic algorithm (GA) is used for feature selection [11]. The genetic algorithm operates in a binary search space: each candidate solution is a sequence of 0s and 1s called a chromosome, in which a 1 marks a selected feature and a 0 a discarded one. The goal of the proposed feature selection method is to find the feature subset S on which the target class depends.
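The binary-chromosome search can be sketched as follows in Python. The text does not specify the fitness function, operators or parameters used, so everything here is an assumption for illustration: tournament selection, single-point crossover, bit-flip mutation, elitism, and a toy fitness function that rewards a hypothetical set of relevant features. In a real system the fitness would typically be the classification accuracy obtained with the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Search for a binary feature mask (1 = keep feature) that maximizes fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_generation = [best[:]]  # elitism: always keep the best chromosome
        while len(next_generation) < pop_size:
            # Tournament selection of two parents.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Single-point crossover.
            cut = rng.randrange(1, n_features)
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
        best = max(population, key=fitness)
    return best


# Hypothetical ground truth used only by the toy fitness function.
RELEVANT = {0, 3, 5}


def toy_fitness(chromosome):
    """Reward relevant features and penalize large subsets (illustrative only)."""
    hits = sum(chromosome[i] for i in RELEVANT)
    return hits - 0.1 * sum(chromosome)


selected = genetic_feature_selection(8, toy_fitness)
```

The size penalty in the toy fitness plays the role of the dimensionality reduction goal: between two masks with equal accuracy, the one with fewer features scores higher.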

Image classification
The K-Nearest Neighbour (KNN) classifier is a classification technique that uses the distance between feature vectors to predict the category of an image [23]. Training consists of storing the feature vectors of the training images together with their labels. An unlabelled query is assigned the majority label among its k nearest neighbours. To apply KNN classification, the similarity between two feature vectors $x$ and $y$ with $n$ components is computed using the Euclidean distance formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
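A minimal Python sketch of this classification step is given here, assuming Euclidean distance and majority voting. The feature vectors, labels, the value k = 3 and the function name knn_classify are illustrative and not taken from the experiments.

```python
import math
from collections import Counter


def knn_classify(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    # Euclidean distance from the query to every stored feature vector.
    distances = sorted(
        (math.dist(query, vector), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Illustrative two-dimensional feature vectors (values are made up).
train_vectors = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_labels = ["non-cancer", "non-cancer", "cancer", "cancer", "cancer"]

prediction_a = knn_classify((0.8, 0.8), train_vectors, train_labels)
prediction_b = knn_classify((0.15, 0.15), train_vectors, train_labels)
```

The same sorted distance list also serves the retrieval use case: returning the k nearest stored images instead of their majority label gives the radiologist the most similar previously diagnosed cases.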

Performance measures
Several performance measures are used to evaluate the proposed CAD system [22]. A true positive denotes an abnormal lung nodule that is correctly identified as abnormal. A true negative denotes a normal lung nodule that is correctly identified as normal. A false positive denotes a normal lung nodule that is erroneously identified as abnormal. A false negative denotes an abnormal lung nodule that is erroneously identified as normal. The confusion matrix [24] is used to calculate the performance of the CAD system. Table 1 shows the confusion matrix of the proposed system, which helps to visualize the performance of the learning system. Each row represents the actual class and each column represents the predicted class. Here, True Positive means that a segmented slice containing a cancer nodule is identified as cancer. True Negative means that a segmented slice without a nodule is identified as non-cancer. False Positive means that a segmented slice without a cancer nodule is identified as cancer. False Negative means that a segmented slice containing a cancer nodule is identified as non-cancer. Two types of training images are used in this research: cancer images and non-cancer images. In total, 150 images are used to train the classifier, of which 100 are cancer images and 50 are non-cancer images. The classification accuracy of the system is 98.2%, the sensitivity is 97.1% and the specificity is 96.5%.
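The three reported measures follow from the confusion matrix counts by the standard definitions, which can be sketched in Python as follows. The counts in the example are hypothetical and chosen only to show the arithmetic; they are not the entries of Table 1 and do not reproduce the reported results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical counts for illustration only (not the paper's Table 1).
metrics = classification_metrics(tp=95, tn=45, fp=5, fn=5)
```

With these illustrative counts the accuracy is 140/150, the sensitivity is 95/100 and the specificity is 45/50.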

Conclusion
In this paper we present a CAD system to detect the lung nodules present in LIDC images. The proposed system uses Marker-Controlled Watershed Segmentation to segment the image, and the features are extracted using the GLCM method. Feature selection is carried out using a genetic algorithm. Content-based image retrieval with a KNN classifier is used to detect the cancer nodules present in the image. The proposed system thus uses a hybrid methodology that combines a genetic algorithm with a KNN classifier to help radiologists detect lung cancer and take correct decisions. The proposed system produces few false positives, and its performance is evaluated using statistical measures. The system achieves 98.2% accuracy, 97.1% sensitivity and 96.5% specificity.