Interrelationship identification between humans from images using a two-class classifier

The paper proposes an automatic interrelationship identification algorithm between human beings. The image database contains two interrelationship classes i


Introduction
The interrelationship between objects in an image is an important area of research that focuses on generating linguistic expressions from images as a human observer would see them. Identifying relevant and useful information in the visual world around us is a challenging aspect of computer vision. Alongside the visual categorization of objects, understanding the relationships between them is another important research problem. In this paper, we propose an algorithm to identify the interrelationship between human beings in images. We believe that identifying the action will provide a better description of the images. A considerable amount of work has already been done on human action identification [1] using motion analysis [2], but these methods are limited by manual segmentation of objects and manual detection of actions using body pose analysis. Recent approaches in computer vision and machine learning provide better and more autonomous solutions to these problems. We use one of them, Bag of Words [3], to solve our problem of interest. It corresponds to a histogram of the number of occurrences of particular image patterns in a given image. The vector representation of such an image pattern is known as a descriptor. We generate unique descriptors for each of the classes. The process of descriptor generation [4], [5] starts with the detection of interest points [6]. We use two feature detectors, SURF [7] and FAST [8], to find these interest points and extract the features around them. The descriptors are then used to train a classifier with a support vector machine (SVM). The classifier is tested against a set of test images to identify their classes.
In section 2 we present a few important works related to visual categorization of images and identification of relationships between objects. In section 3 we explain the classification methodology in detail. In section 4 we explain the learning of the classifier using SVM. In section 5 we demonstrate the accuracy of the classifier by applying it to test and training data. Finally, in section 6 we conclude.

Related works
Berg et al. [9] and Bernard et al. [10] visually categorized faces with names, which is comparatively easy because every face has distinguishing features that can be readily detected. Aker et al. [11] and Farhadi et al. [12] went a step further and proposed methods for visual categorization of general objects. This is a difficult task because of the variability in the visual appearance of similar objects, but their work was limited to object identification. Yang et al. [13] and Yao et al. [14] used a content-based image retrieval approach, in which the content of an image is recognized first and a class is then constructed for that image; however, this affected the accuracy of the classification. Feng et al. [5] proposed another method for captioning images using extractive generation, but it assumes that the corresponding object labels are available as input for the test image and that the labels match the objects. Agrawal et al. [15] proposed a visual sense model using textual descriptions. They defined the visual categorization problem as a machine learning problem and solved it with learned models such as SVM. More closely related to our approach is work that detects feature points and defines feature vectors [16], [17], [18] for the objects in the image. These feature vectors are then used to learn a classifier. We use Bag of Words to create the feature vectors and apply SVM to learn the classifier.

Proposed method
The main steps of the method are:
1. Feature point detection and feature extraction.
2. Generation of the feature metric and code vectors.
3. Assignment of the code vectors to predefined clusters using k-means clustering to generate the bag of words.
4. Learning of a two-class classifier using SVM to separate the two classes, i.e. Handshaking and Hugging.

Dataset
We took 50 random images from internet sources for the handshaking class and, similarly, 50 images for the second class (hugging).
The main advantage of the Bag of Words model is that the database images do not have to be resized to a common pixel size, since the model can handle variability in image size. It is assumed that every training image in the database contains only two persons, showing the corresponding interrelationship, against a constant background.

Preprocessing
All the images in the dataset are converted to grayscale because we use SURF and FAST feature points. Both feature point detectors require grayscale images as input.
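The conversion step can be sketched as follows. This is a minimal illustration, assuming images stored as nested lists of (R, G, B) tuples and the common ITU-R BT.601 luma weights; the paper does not state which conversion formula was used, and `to_grayscale` is a hypothetical helper name.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of gray levels.

    Assumption: BT.601 luma weights (0.299, 0.587, 0.114); the paper does
    not specify the weights it used.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# A 2x2 toy image: red, green, blue and white pixels.
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(image)
```

Green contributes most to the gray level and blue least, which is why the green pixel ends up brighter than the red or blue one.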

Bag of words model
In this model, an image is treated as a document and the image features are called visual words. The concept of the bag of features model is easy to understand by analogy with text: in a document, certain words occur much more frequently than others. Similarly, the model detects similar patches in the training images and treats them as descriptors for those images.
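The document analogy can be made concrete with a small sketch. It assumes that each local patch of an image has already been assigned the index of a visual word; the function name `word_histogram` and the toy indices are illustrative and not taken from the paper.

```python
from collections import Counter


def word_histogram(word_ids, vocabulary_size):
    """Count how often each visual word occurs in one image."""
    counts = Counter(word_ids)
    return [counts.get(word, 0) for word in range(vocabulary_size)]


# Six patches of one image, each already mapped to a visual word index.
hist = word_histogram([0, 2, 2, 1, 2, 0], vocabulary_size=4)
```

The resulting histogram has one entry per visual word, so images with different numbers of patches are still represented by vectors of the same length.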

Feature detection and extraction
For feature detection, we use two different feature detectors, Speeded Up Robust Features (SURF) and Features from Accelerated Segment Test (FAST). The model has been tested with both. SURF [7] is a robust, scale-invariant feature detector with high repeatability that detects blob-like structures around the interest points in each image, which are also called SURF points. It uses a Hessian matrix approximation for blob (feature) detection. For a given point $\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $H(\mathbf{x}, \sigma)$ in $\mathbf{x}$ at scale $\sigma$ is defined as follows:

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \qquad (1)$$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ at point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$.

The other feature detector we use is the FAST algorithm, proposed by Edward Rosten and Tom Drummond [8], which is fast enough for real-time applications. FAST detects interest points that are also called corner points. A pixel $p$ is a corner if there are $n$ contiguous pixels in a circle of 16 pixels around $p$ that are either all darker than $I_p - h$ or all brighter than $I_p + h$, where $I_p$ is the intensity of $p$ and $h$ is a chosen threshold. Each pixel $x$ of these 16 pixels can have one of the following three states:

$$S_{p \to x} = \begin{cases} d, & I_{p \to x} \le I_p - h & \text{(darker)} \\ s, & I_p - h < I_{p \to x} < I_p + h & \text{(similar)} \\ b, & I_p + h \le I_{p \to x} & \text{(brighter)} \end{cases} \qquad (2)$$

where $x \in \{1, 2, \ldots, 16\}$.

For feature extraction we use the same SURF-based method for both detectors. The resulting feature vectors are descriptors calculated from the neighborhood around the interest points, and they represent local patches in the image.

$$D = [f_1 \; f_2 \; f_3 \; f_4 \; \ldots \; f_n] \qquad (3)$$
where $D$ is a descriptor, i.e. the collection of feature vectors for an image, and $n$ is the number of local patches detected. In each image of each class, a set of feature points is identified, and a feature vector is calculated around each of these points. For similar points in different images, these feature vectors take similar values. Figure 2 shows a few similar points identified in different images. For the handshaking class, the fingertip is clearly the most common feature. Similarly, for the hugging class, the point of contact of the two persons' arms or elbows generates a distinctive and common feature point.
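The FAST segment test described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the circle offsets are the usual radius-3 ring of 16 pixels, the run length `n=12` is an assumed default (the paper does not give its value), and `pixel_state` and `is_corner` are hypothetical names.

```python
# Offsets (dx, dy) of the 16 ring pixels at radius 3, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def pixel_state(center, ring_value, h):
    """Classify one ring pixel as darker ('d'), similar ('s') or brighter ('b')."""
    if ring_value <= center - h:
        return "d"
    if ring_value >= center + h:
        return "b"
    return "s"


def is_corner(image, row, col, h, n=12):
    """Segment test: True if n contiguous ring pixels are all 'd' or all 'b'."""
    center = image[row][col]
    states = [pixel_state(center, image[row + dy][col + dx], h) for dx, dy in CIRCLE]
    for target in ("d", "b"):
        run = 0
        for state in states + states:  # doubled list handles wrap-around runs
            run = run + 1 if state == target else 0
            if run >= n:
                return True
    return False


# A flat patch has no corner; a dark pixel on a bright background passes the test.
flat = [[100] * 7 for _ in range(7)]
spot = [[200] * 7 for _ in range(7)]
spot[3][3] = 100
flat_is_corner = is_corner(flat, 3, 3, h=20)
spot_is_corner = is_corner(spot, 3, 3, h=20)
```

The caller must keep `row` and `col` at least 3 pixels away from the image border, since the sketch does no bounds checking.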

Generation of feature metric and codebook
For each feature vector, a feature metric is calculated by assigning a weight to the feature vector. Stronger features are assigned larger weights, which lets us remove weak features before learning. Another important reason for defining the feature metric is that different images may have different numbers of local image patches; the feature metric sorts the features in order of strength and selects the strongest ones. After the feature metrics have been generated, codebook generation starts. The codebook is the collection of code vectors, and each code vector represents a group of similar patches. For codebook generation, we define a fixed number of clusters. This number is the number of groups of similar patches (code vectors), which correspond to words in the Bag of Words analogy. Each feature metric is mapped to a code vector using K-means clustering [19]. After mapping we have K clusters, i.e. K code vectors. Each code vector is thus considered a visual word and the codebook a visual dictionary, as shown in figure 3. The length of each code vector represents its frequency in the database. This codebook is called the bag of words.
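Codebook generation and the mapping of features to code vectors can be sketched as follows. This is a simplified illustration: the descriptors are 2-D toy points rather than real SURF descriptors, the centers are initialized with the first K points (an assumption made for reproducibility), and `kmeans`, `nearest` and `encode` are hypothetical names.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the code vector closest to the given feature vector."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns k code vectors (the codebook)."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                centers[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def encode(descriptors, codebook):
    """Histogram of code vector occurrences for one image."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist


# Toy descriptors pooled from all training images, forming two groups.
training = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
codebook = kmeans(training, k=2)
image_hist = encode([(0.2, 0.1), (9, 9), (10, 12)], codebook)
```

Each image is then represented by its histogram over the K code vectors, which is the fixed-length vector passed to the classifier.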

Learning with SVM
Once the code vectors have been assigned to the clusters, our visual classification problem reduces to a multi-class supervised learning problem, here with two classes. The classifier separates the images into two classes, i.e. Handshaking and Hugging, and is trained using an SVM classifier [21]. The SVM classifier computes the hyperplane that best separates the two-class data using the maximal margin approach. In its standard form, the decision function is

$$f(x) = \operatorname{sign}(w^T x + b) \qquad (4)$$

where the feature vectors $x_i \in X$ and the output labels $y_i \in \{+1, -1\}$; $w$ and $b$ represent the parameters of the hyperplane. The hyperplane separates the feature matrices generated for the two classes. As already mentioned, we use two feature detectors, SURF and FAST, so we have generated two classifiers, one for each feature detector. Each classifier can then be used to identify the class to which a new image belongs.
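A linear two-class SVM of this kind can be sketched with subgradient descent on the regularized hinge loss. This is an illustrative stand-in, not the solver used in the paper (which is not specified): the learning rate, regularization strength, epoch count and the toy word histograms are all assumptions, and `train_linear_svm` and `predict` are hypothetical names.

```python
def train_linear_svm(xs, ys, lam=0.01, epochs=50, lr=0.1):
    """Minimize lam/2 * ||w||^2 + hinge loss with per-sample subgradient steps."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge term is active.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with margin: only the regularizer contributes.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Decision function sign(w.x + b), returning +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy 2-bin word histograms: +1 = handshaking, -1 = hugging (labels assumed).
xs = [[5, 1], [4, 0], [6, 2], [1, 5], [0, 4], [2, 6]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
label = predict(w, b, [5, 0])
```

With one such classifier per feature detector, a new image is classified by encoding it as a histogram and evaluating `predict` on it.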

Evaluation of results
The classifiers are tested against both the training data and the test data. We use the 100 training images defined above and 100 new test images to check the average accuracy of the classifiers. The results are shown in the figures below. Figure 4 and figure 5 show the confusion matrices for the training images for SURF and FAST features respectively. With SURF features, 97 of the 100 training images are assigned to the correct class (average accuracy 97%), whereas with FAST features 95 of the 100 training images are assigned to the correct class (average accuracy 95%). On the training images, SURF features therefore give better results than FAST features. Figure 6 and figure 7 show the confusion matrices for the test images for SURF and FAST features respectively. With SURF features, 68 of the 100 test images are assigned to the correct class (average accuracy 68%), whereas with FAST features 74 of the 100 test images are assigned to the correct class (average accuracy 74%). On the test images, FAST features therefore give better results than SURF features.
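The confusion matrix and average accuracy used in this evaluation can be computed as in the sketch below. The labels in the example are toy values chosen for illustration and are not the paper's results; `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of images on the diagonal, i.e. assigned to the correct class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


classes = ["handshaking", "hugging"]
truth = ["handshaking"] * 3 + ["hugging"] * 3
predicted = ["handshaking", "handshaking", "hugging",
             "hugging", "hugging", "handshaking"]
cm = confusion_matrix(truth, predicted, classes)
acc = accuracy(cm)
```

Applied to the reported counts, 68 correct out of 100 test images corresponds to an accuracy of 0.68 and 74 correct out of 100 to 0.74.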

Conclusion
We have presented a simple and novel approach for identifying the interrelationship between two humans. Both feature detectors, SURF and FAST, work with the database: the two-class classifier reaches 95% to 97% accuracy on the training images and 68% to 74% on the test images. In the near future, more interrelationship classes can be added to this problem. The classifier is currently trained using SVM, and the approach can be further extended with artificial neural networks.