The State of the Art of Image Search Engines: A Brief Survey of Image Search Engines
Today, the image search experience offered by all major commercial image search engines is embarrassingly poor. This is because these image search engines are:
1. Using non-image correlations, such as image file names and the text in the vicinity of the images, to guess what the images are about;
2. Using low-level image features, such as colors, textures and primitive shapes, for content-based indexing and retrieval.
The first kind of image search engine is very efficient at finding objects and scenes that have precise, unambiguous and unique text descriptions, such as “Times Square” and “Golden Gate Bridge”. However, even in this case there are still many problems. First, since such objects and scenes are usually well known and documented by thousands of images and videos, one might need to narrow the results down to a more specific subset; say, one might be more interested in searching for “Times Square in rain”, “Times Square three yellow cabs” or “Golden Gate Bridge and one man”. Since this kind of search engine in fact uses the information in related texts and titles to guess the contents of the images, it is entirely blind to what is really in an image. The more specific the information the user wants, the worse the user’s experience becomes.
Another scenario that makes this kind of text-based image search engine inefficient is the
ambiguity and vagueness of the text around the images. For example, “White House” can mean
many things to us. It can be the White House building itself. It can also be some event that
happened near the White House. Or it can even be any house that is white. Try searching for
“White House” in Google Image and you will see what this is all about.
The problem with the second kind of image search engine is that while low-level image features
are important for describing images, they fail to represent the high-level semantic and
cognitive features of images, because they are only the basic components from which cognitive
features are built. This problem can be easily understood by taking a brief look at the
mainstream porn-detection software available on the market. These software packages only
inspect skin-tone regions in images and may misclassify many innocent images, as shown in the
survey of porn-detection software packages.
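
The skin-tone problem described above can be illustrated with a minimal sketch. It is not the method of any particular product: the RGB rule in `is_skin_tone`, the 0.5 threshold, and the three synthetic "images" are all assumptions chosen for illustration. It only shows how a detector that looks at color alone ends up flagging innocent pictures.

```python
def is_skin_tone(r, g, b):
    """Rough RGB skin-tone rule (an assumed heuristic, for illustration only)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_fraction(pixels):
    """Fraction of pixels that pass the skin-tone rule."""
    if not pixels:
        return 0.0
    return sum(is_skin_tone(*p) for p in pixels) / len(pixels)


def naive_flag(pixels, threshold=0.5):
    """Flag an image purely by its share of skin-toned pixels (hypothetical threshold)."""
    return skin_fraction(pixels) > threshold


# Synthetic "images" as flat lists of (R, G, B) pixels; the values are made up.
face_closeup = [(224, 172, 140)] * 70 + [(40, 40, 45)] * 30    # portrait: face plus dark hair
sand_dunes = [(210, 180, 140)] * 80 + [(120, 170, 230)] * 20   # desert sand plus sky
meadow = [(60, 140, 70)] * 60 + [(120, 170, 230)] * 40         # grass plus sky

for name, image in [("face_closeup", face_closeup),
                    ("sand_dunes", sand_dunes),
                    ("meadow", meadow)]:
    print(name, naive_flag(image))
```

In this sketch both the portrait and the sand dunes are flagged, because the rule sees only color and has no notion of what the colored regions depict.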
Another problem with the second kind of image search engine is its lack of scale-up ability.
As the number of images and the number of categories in the image base grow, the classifiers
in this kind of image search engine can easily be overwhelmed by inter-class mix-ups and
intra-class diversity.
Just look at the embarrassing progress we have made in face recognition software, and it is
easy to see how serious the scale-up issue is. While many face recognition packages claim to
be over 99% accurate on a fixed database, none of them can really be applied to recognizing
suspects in the real-time video streams of an airport.
Why does this happen? The secret is that while a database might contain 1 million face
samples, this is still very few compared with a video stream generating 20 images per second.
Just imagine an Internet with thousands of webcams and millions of digital cameras, scanners,
cell phone cameras and digital video cameras; scale-up ability is the first problem that any
image search engine must solve.
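
The comparison above can be worked through as simple arithmetic. The two constants come from the text; the figure of 1,000 simultaneous streams in the last line is an assumption added only to show how the numbers grow.

```python
DATABASE_SIZE = 1_000_000   # face samples in the database (figure from the text)
FRAMES_PER_SECOND = 20      # images generated per second by one stream (from the text)
SECONDS_PER_DAY = 60 * 60 * 24


def frames_per_day(streams=1, fps=FRAMES_PER_SECOND):
    """Number of frames produced in one day by the given number of streams."""
    return streams * fps * SECONDS_PER_DAY


def hours_to_match_database(streams=1, fps=FRAMES_PER_SECOND, size=DATABASE_SIZE):
    """Hours of video needed to produce as many frames as the database has samples."""
    return size / (streams * fps) / 3600


print(frames_per_day())                      # 1728000 frames per day from one stream
print(round(hours_to_match_database(), 1))   # 13.9 hours for one stream to pass 1 million
print(frames_per_day(streams=1000))          # 1728000000 (1,000 streams is an assumed figure)
```

A single stream therefore produces more frames in under 14 hours than the whole database contains.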
Then what is the ultimate solution for building a smart image search engine? The answer is to build real image recognizers for all objects in the world, piece by piece, based on hierarchical structures that mimic the cognitive image understanding abilities of the human brain. There are two levels of structure to be addressed.
1. At the lower level of these hierarchical structures, we must build a set of feature detectors that are capable of recognizing all low-level features, such as mouths, eyes, faces, trees, poles, light bulbs, mugs, tables, pandas, tigers, President Washington, Times Square and Wall Street. Yes, this is a great deal of work, and it seems to be a mission impossible with mainstream image understanding technology, because the prohibitive amounts of manpower and computing resources it requires grow exponentially with the number of recognizers.
Therefore, the first problem in building a smart image search engine is to find a way to build a bank of image feature recognizers whose demands on manpower and computing resources grow only linearly with the number of recognizers. This is known as the scale-up challenge.
2. At the higher level, we need to make out the “layout”, the “meaning” and the “intuition” behind the image. When a human being looks at an image, the “meaning” of the image is a much more important aspect of what s/he is looking for. In a word, we look at everything, then focus on the most interesting portion of the image and try to make sense of it. The cognitive features of images play the most important role in the understanding of images. This is the level at which people search for images: we search using cognitive features rather than signal features. Since cognitive features are coded in natural language while signal features are coded in data, searching images indexed by cognitive features is much more efficient and accurate than searching images indexed by signal features.
What are the cognitive features of images? From the point of view of computational cognition,
a cognitive feature of an image is a feature that can be described using computational nouns
and computational verbs, which are two indispensable components of Physical Linguistics.
Unlike many image search engines based on low-level features, which view images under
context-free assumptions, PicSeer views each picture under context-rich scenarios. For example,
when PicSeer looks at a picture of a person, it doesn’t look only at clues of color and
texture as other image search engines do. Instead, PicSeer looks for eyes, face, hands, legs,
hair, clothes, facial expressions, gestures and background. PicSeer uses its Physical
Linguistic Modeling Engine to organize the layout of the picture, to arrange the relations
between the different cognitive features in the image, and to provide a cognitive model for
the entire image. In a word, PicSeer translates any image of interest into a story coded in a
pseudo-natural language.
For example, PicSeer can translate the following picture into the story “A boy smiles”.

How can PicSeer have this kind of understanding of images? The Physical Linguistic Vision
Technologies can represent cognitive features as nouns and verbs, called computational nouns
and computational verbs, respectively. In this case, the image of the boy is represented by
the computational noun “boy” and the facial expression of the boy is represented by the
computational verb “smile”. All these steps are done automatically by the computer itself.
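
As a rough illustration of the noun-plus-verb representation described above, the sketch below composes a one-sentence story from a recognized noun and verb. It is not PicSeer’s implementation, which is not described here: the class names `ComputationalNoun` and `ComputationalVerb`, the helper functions, and the simple English inflection rules are all hypothetical, and the recognition step itself is assumed to have already happened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputationalNoun:
    """Hypothetical container for a recognized object, e.g. "boy"."""
    label: str


@dataclass(frozen=True)
class ComputationalVerb:
    """Hypothetical container for a recognized action or expression, e.g. "smile"."""
    label: str


def third_person(verb):
    """Very small English inflection helper (an assumption, far from complete)."""
    base = verb.label
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return base[:-1] + "ies"
    if base.endswith(("s", "sh", "ch", "x", "z", "o")):
        return base + "es"
    return base + "s"


def tell_story(noun, verb):
    """Compose a one-sentence story from one noun and one verb."""
    article = "An" if noun.label[0].lower() in "aeiou" else "A"
    return f"{article} {noun.label} {third_person(verb)}"


# The recognizers are assumed to have produced these two labels for the picture.
print(tell_story(ComputationalNoun("boy"), ComputationalVerb("smile")))  # A boy smiles
```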
Without using high-level cognitive features, an image search engine can still play many
tricks to make out the contents of an image. For example, working on the assumption that the
images one puts on a webpage must be closely related to its text, Google categorizes the
images from a webpage based on all the related text near them, such as file names, the webpage
title and more. However, the search results can be entirely surprising! The following are
some examples that test the technologies behind Google.
On November 12, 2005, Google was queried with the keywords “boy smiles”, and the following is
the first page of the search results. The third thumbnail in the first row is a surprise
because it shows neither a boy nor a smile. This shows that Google knows neither the
cognitive features of a boy nor the cognitive features of a smile.

On November 12, 2005, Google was queried with the keywords “boy smile”, and the following is
the first page of the search results. Comparing this with the previous result, we draw the
following conclusions:
1. Google doesn’t take into account the meaning behind the query terms. To Google, “boy smile” and “boy smiles” are entirely different search criteria. This is, of course, cognitively incorrect.
2. The image features used by Google have no cognitive significance.
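
The first conclusion can be reproduced with a toy text index. This is only a sketch of exact keyword matching over the text near images; it is not Google’s actual method, and the image names, captions and the hand-written `NORMAL_FORMS` table (standing in for a real stemmer or lemmatizer) are invented for illustration.

```python
from collections import defaultdict

# Made-up text found near three hypothetical images.
NEARBY_TEXT = {
    "img1.jpg": "boy smiles at camera",
    "img2.jpg": "a boy smile",
    "img3.jpg": "the boy smiled",
}

# Hand-written table standing in for a real stemmer or lemmatizer (assumption).
NORMAL_FORMS = {"smiles": "smile", "smiled": "smile", "boys": "boy"}


def exact(token):
    """Keep the token as typed."""
    return token


def normalized(token):
    """Map inflected forms onto one base form where the table knows them."""
    return NORMAL_FORMS.get(token, token)


def build_index(docs, normalize=exact):
    """Inverted index: token -> set of image ids whose nearby text contains it."""
    index = defaultdict(set)
    for image_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(image_id)
    return index


def search(index, query, normalize=exact):
    """Return the images whose nearby text contains every query token."""
    hits = [index.get(normalize(t), set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()


exact_index = build_index(NEARBY_TEXT)
normalized_index = build_index(NEARBY_TEXT, normalized)

for query in ("boy smiles", "boy smile", "boy smiled"):
    print(query,
          sorted(search(exact_index, query)),
          sorted(search(normalized_index, query, normalized)))
```

With exact matching each of the three queries returns a different single image, while the normalized index returns the same three images for all of them. Even then, the index knows only the words near the images, not what the images show.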

On November 12, 2005, Google was queried with the keywords “boy smiled”, and the following is
the first page of the search results. Confused? Yes, the computers did their jobs well, but
the results were not quite what we had in mind.

Other mainstream commercial image search engines perform similarly to what is shown above,
because the principles behind them are much the same. The failure of these image search
engines is caused by the low-level image features they use and by the inconsistency and
randomness in the relations between the images and the text surrounding them.





PicSeer will give priority to searching for the images that most people are interested in.
Would you like to know what results PicSeer can give for the images that interest you most? You are welcome to email your search terms to


