Abstract
Attributes are mid-level semantic concepts that describe the visual appearance, functional affordances, or other human-understandable aspects of objects and scenes. In recent years, several works have investigated the use of attributes to solve various computer vision problems, including attribute-based image retrieval, zero-shot learning of unseen object categories, part localization, and face recognition.
This thesis proposes two novel attribute-based approaches for solving (i) the top-down visual saliency estimation problem and (ii) the unsupervised zero-shot object classification problem. For top-down saliency estimation, we propose a simple yet efficient approach based on Conditional Random Fields (CRFs), in which attribute classifier outputs serve as visual features. For zero-shot learning, we propose a novel approach that solves the unsupervised zero-shot object classification problem via attribute-class relationships. Unlike other attribute-based approaches, however, ours requires attribute definitions only at training time and needs only the names of the novel classes of interest at test time. Our detailed experimental results show that our methods perform on par with or better than the state of the art.
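To give a rough picture of the first contribution, the sketch below shows one way attribute classifier outputs could feed a binary saliency CRF on a grid. This is a minimal sketch under assumptions of our own choosing, not the thesis implementation: the Potts pairwise term, the ICM inference loop, and the function name saliency_crf_icm are illustrative stand-ins for whatever learning and inference the thesis actually uses.

```python
# Hypothetical sketch: binary saliency CRF on a 4-connected grid, with unary
# potentials derived from attribute classifier outputs and a Potts pairwise
# term, solved by simple ICM. Not the thesis implementation.
import numpy as np

def saliency_crf_icm(unary_fg, unary_bg, pairwise_weight=1.0, n_iters=5):
    """ICM inference on a grid CRF.

    unary_fg, unary_bg: HxW cost maps for labeling each cell salient /
    non-salient, e.g. derived from per-cell attribute classifier scores.
    Returns an HxW binary saliency map.
    """
    H, W = unary_fg.shape
    labels = (unary_fg < unary_bg).astype(np.int32)  # initialize from unaries
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                nbrs = []
                if i > 0:
                    nbrs.append(labels[i - 1, j])
                if i < H - 1:
                    nbrs.append(labels[i + 1, j])
                if j > 0:
                    nbrs.append(labels[i, j - 1])
                if j < W - 1:
                    nbrs.append(labels[i, j + 1])
                nbrs = np.asarray(nbrs)
                # cost of each candidate label = unary + Potts disagreement penalty
                cost_bg = unary_bg[i, j] + pairwise_weight * np.sum(nbrs != 0)
                cost_fg = unary_fg[i, j] + pairwise_weight * np.sum(nbrs != 1)
                labels[i, j] = int(cost_fg < cost_bg)
    return labels
```

For instance, if s[i, j] in (0, 1) were an attribute-driven saliency score for a cell, unary_fg could be taken as -log(s) and unary_bg as -log(1 - s).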
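The second contribution can likewise be illustrated with a small sketch of zero-shot classification via attribute-class relationships. Again this is an assumption-laden sketch rather than the thesis method: the similarity-weighted transfer of attribute signatures from seen to unseen classes via class-name word embeddings, and the helper names infer_unseen_signatures and classify_zero_shot, are hypothetical choices, and a separately trained attribute classifier is assumed to supply per-image attribute scores.

```python
# Hypothetical sketch: unsupervised zero-shot classification via attribute-class
# relationships. Unseen-class attribute signatures are estimated from class
# names alone (here by similarity-weighted transfer from seen-class signatures
# using class-name word embeddings). Not the thesis method.
import numpy as np

def infer_unseen_signatures(seen_signatures, seen_emb, unseen_emb):
    """Estimate unseen-class attribute signatures as a similarity-weighted
    combination of seen-class signatures (one common heuristic)."""
    seen_n = seen_emb / np.linalg.norm(seen_emb, axis=1, keepdims=True)
    unseen_n = unseen_emb / np.linalg.norm(unseen_emb, axis=1, keepdims=True)
    sim = unseen_n @ seen_n.T                        # (n_unseen, n_seen) cosine sims
    weights = np.maximum(sim, 0.0)
    weights /= weights.sum(axis=1, keepdims=True) + 1e-8
    return weights @ seen_signatures                 # (n_unseen, n_attributes)

def classify_zero_shot(attr_scores, unseen_signatures):
    """Assign each image to the unseen class whose estimated attribute
    signature is nearest to the image's predicted attribute scores."""
    dists = np.linalg.norm(
        attr_scores[:, None, :] - unseen_signatures[None, :, :], axis=2)
    return dists.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seen_sig = rng.random((5, 12))                 # 5 seen classes, 12 attributes
    seen_emb = rng.standard_normal((5, 50))        # embeddings of seen class names
    unseen_emb = rng.standard_normal((3, 50))      # embeddings of unseen class names
    unseen_sig = infer_unseen_signatures(seen_sig, seen_emb, unseen_emb)
    attr_scores = rng.random((4, 12))              # attribute scores for 4 test images
    print(classify_zero_shot(attr_scores, unseen_sig))
```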