The State of the Art of Image Search Engines: A Brief Survey of Image Search Engines
Today, the search experience offered by all major commercial image search engines is embarrassing. This is because these search engines rely on one of two approaches:
1. Using non-image cues, such as image file names and the text in the vicinity of the images, to guess what the images are about;
2. Using low-level image features, such as colors, textures and primitive shapes, for content-based indexing and retrieval.
The first kind of image search engine is very efficient at finding objects and scenes that have precise, unambiguous and unique text descriptions, such as “Times Square” and “Golden Gate Bridge”. Even in this case, however, many problems remain. First, since such objects and scenes are usually well known and documented by thousands of images and videos, one often needs to narrow the general search results down to a more specific subset; say, one might be more interested in the search text “Times Square in rain”, “Times Square three yellow cabs”, or “Golden Gate Bridge and one man”. Since this kind of search engine in fact uses the information in nearby text and titles to guess the contents of an image, it is entirely blind to what is really in the image. The more specific the information the user wants, the worse the experience the user suffers.
Another scenario in which this kind of text-based image search engine performs poorly is
when the text around the images is ambiguous or vague. For example, the phrase
“White House” can have many meanings. It can refer to the White House building itself.
It can also refer to an event that happened near the White House.
Or it can even refer to any house that is white. Try searching for “White House” in
Google Image and you will see what this is all about.
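To illustrate why an index built only from surrounding text is blind to image content, here is a minimal Python sketch. The image records, the page text, and the helper names `build_text_index` and `search` are hypothetical and invented for illustration; the sketch makes no claim about how any real search engine is implemented.

```python
from collections import defaultdict

# Hypothetical records: each image is known only by its file name and nearby text.
IMAGES = [
    {"file": "white_house_north.jpg", "nearby_text": "The White House seen from the north lawn"},
    {"file": "protest.jpg", "nearby_text": "Crowd gathers near the White House"},
    {"file": "cottage.jpg", "nearby_text": "A small white house by the sea"},
]

def tokenize(text):
    """Lowercase the text and split it into words, treating '_' and '.' as separators."""
    return text.lower().replace("_", " ").replace(".", " ").split()

def build_text_index(images):
    """Map every word found in a file name or nearby text to the set of image ids."""
    index = defaultdict(set)
    for image_id, image in enumerate(images):
        for word in tokenize(image["file"] + " " + image["nearby_text"]):
            index[word].add(image_id)
    return index

def search(index, query):
    """Return the ids of images whose text contains every word of the query."""
    word_sets = [index.get(word, set()) for word in tokenize(query)]
    if not word_sets:
        return []
    return sorted(set.intersection(*word_sets))

INDEX = build_text_index(IMAGES)

# The building, the nearby event and the white cottage are indistinguishable.
print(search(INDEX, "White House"))
# A more specific query finds nothing, because no nearby text mentions rain.
print(search(INDEX, "White House in rain"))
```

The first query returns all three images and the second returns none, even though the pixels of the images were never examined in either case.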
The problem with the second kind of image search engine is that, while low-level image features
are important for describing images, they fail to represent the high-level semantic and cognitive
features of images, because they are only the basic components from which cognitive features are built.
This problem is easy to understand by taking a brief look at the mainstream porn-detection
software available on the market. These software packages inspect only the
skin-tone regions in images and may misclassify many innocent images, as shown in the survey
of porn-detection software packages.
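The failure mode of a skin-tone-only detector can be sketched in a few lines of Python. The RGB rule in `is_skin_tone` is a commonly quoted rule of thumb used here as an assumption, the 50% threshold is arbitrary, and the two tiny "images" are made-up pixel lists; none of this is taken from any specific product.

```python
def is_skin_tone(pixel):
    """Rule-of-thumb RGB skin test (an assumption, not any product's actual rule)."""
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(pixel) - min(pixel) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )

def skin_ratio(pixels):
    """Fraction of pixels that pass the skin-tone test."""
    return sum(is_skin_tone(p) for p in pixels) / len(pixels)

def looks_suspicious(pixels, threshold=0.5):
    """Flag an image purely by how much of it is skin-colored."""
    return skin_ratio(pixels) > threshold

# Hypothetical 10-pixel "images": a close-up portrait and a landscape.
PORTRAIT = [(220, 170, 140)] * 8 + [(30, 60, 200)] * 2
LANDSCAPE = [(40, 120, 40)] * 6 + [(30, 60, 200)] * 4

print(looks_suspicious(PORTRAIT))   # the innocent face close-up is flagged
print(looks_suspicious(LANDSCAPE))  # the landscape is not
```

Because the detector looks only at color, an innocent close-up of a face is flagged simply because most of its pixels are skin-colored.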
Another problem with the second kind of image search engine is its lack of scalability.
As the number of images and the number of categories in the image base grow,
the classifiers in this kind of image search engine are easily overwhelmed by
inter-class confusion and intra-class diversity.
Just look at the embarrassing
progress we have made in face recognition software, and it is easy to understand how
serious the scale-up issue is. While many face recognition packages claim to
be over 99% accurate when recognizing faces from a fixed database, none of them can really be applied to
recognizing suspects in the real-time video streams of an airport.
Why does this happen? The secret is that while a database might contain
1 million face samples, that is still too few compared with a video stream generating 20
images per second. Just imagine an Internet with thousands of webcams and millions of
digital cameras, scanners, cell phone cameras and digital video cameras; scalability is
the first problem that any image search engine must solve.
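A back-of-the-envelope calculation makes the mismatch concrete. The 1 million samples and the 20 images per second come from the text above. The continuously running stream and the 1% per-frame error rate (the complement of the claimed 99% accuracy, assumed here to carry over unchanged to live video) are illustrative assumptions.

```python
DATABASE_SAMPLES = 1_000_000      # from the text
FRAMES_PER_SECOND = 20            # from the text
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: one camera streams continuously for a whole day.
frames_per_day = FRAMES_PER_SECOND * SECONDS_PER_DAY

# Time for a single stream to produce as many images as the whole database holds.
hours_to_match_database = DATABASE_SAMPLES / FRAMES_PER_SECOND / 3600

# Assumption: 1% of frames are misjudged (complement of the claimed 99% accuracy).
ASSUMED_ERROR_PERCENT = 1
expected_errors_per_day = frames_per_day * ASSUMED_ERROR_PERCENT // 100

print(f"frames per day from one stream: {frames_per_day:,}")
print(f"hours to match the database:    {hours_to_match_database:.1f}")
print(f"expected errors per day:        {expected_errors_per_day:,}")
```

Under these assumptions a single camera outproduces the whole database in under 14 hours, and even a 1% error rate translates into more than seventeen thousand wrong decisions per camera per day.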
Then what is the ultimate solution for building a smart image search engine? The answer is to build real image recognizers for all objects in the world, piece by piece, based on hierarchical structures that mimic the cognitive image-understanding abilities of the human brain. There are two levels of structure to be addressed.
1. At the lower level of these hierarchical structures, we must build a set of feature detectors capable of recognizing all low-level features, such as mouths, eyes, faces, trees, poles, light bulbs, mugs, tables, pandas, tigers, President Washington, Times Square and Wall Street. Yes, this is a whole lot of work, and it seems to be a mission impossible with mainstream image-understanding technology, because of the prohibitive amount of manpower and computing resources, which grows exponentially with the number of recognizers.
Therefore, the first problem in building a smart image search engine is to find a way to build a bank of image feature recognizers whose demands on manpower and computing resources grow only linearly with the number of recognizers. This is known as the scale-up challenge.
2. At the higher level, we need to make out the “layout”, the “meaning”, and the “intuition” behind the image. When a human being looks at an image, the “meaning” of the image is by far the most important aspect that he or she is looking for. In a word, we look at everything, then focus on the most interesting portion of the image and try to understand it. The cognitive features of images play the most important role in image understanding. This is the level at which people search for images: people search using cognitive features rather than signal features. Since cognitive features are coded in natural language while signal features are coded in data, searching images indexed by cognitive features is much more efficient and accurate than searching images indexed by signal features.
What are the cognitive features of images? From the point of view of computational cognition,
a cognitive feature of an image is a feature that can be described using computational
nouns and computational verbs, which are two indispensable components of Physical Linguistics.
Unlike many image search engines based on low-level features, which view images under
context-free assumptions, PicSeer views each picture in a context-rich scenario. For example,
when PicSeer looks at a picture of a person, it doesn’t look only at clues of color and
texture, as other image search engines do. Instead, PicSeer looks for eyes, face,
hands, legs, hair, clothes, facial expressions, gestures and background. PicSeer uses its
Physical Linguistic Modeling Engine to organize the layout of the picture, to arrange the
relations between the different cognitive features in the image, and to provide a cognitive model
for the entire image. In a word, PicSeer translates any image of interest into a story coded
in a pseudo-natural language.
For example, PicSeer can translate the following picture into the story “A boy smiles”.

How can PicSeer have this kind of understanding of images? The Physical Linguistic
Vision Technologies can represent cognitive features as nouns and verbs, called
computational nouns and computational verbs, respectively. In this case, the image of the
boy is represented by the computational noun “boy” and the facial expression of the boy is
represented by the computational verb “smile”. All these steps are performed automatically
by the computer itself.
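As a purely illustrative sketch of what a story built from a computational noun and a computational verb might look like as a data structure, consider the Python below. The class names `ComputationalNoun`, `ComputationalVerb` and `Story`, and the naive inflection rule, are hypothetical; the text does not describe PicSeer's internal representation, and the recognition step that would produce these objects from pixels is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputationalNoun:
    """Hypothetical stand-in for a recognized object, e.g. 'boy'."""
    name: str

@dataclass(frozen=True)
class ComputationalVerb:
    """Hypothetical stand-in for a recognized action or expression, e.g. 'smile'."""
    lemma: str

    def third_person(self):
        # Naive English inflection, sufficient for this example only.
        return self.lemma + "s"

@dataclass(frozen=True)
class Story:
    """A one-sentence description of an image: who does what."""
    subject: ComputationalNoun
    action: ComputationalVerb

    def to_sentence(self):
        return f"A {self.subject.name} {self.action.third_person()}"

# The example from the text: the noun "boy" and the verb "smile".
story = Story(ComputationalNoun("boy"), ComputationalVerb("smile"))
print(story.to_sentence())  # A boy smiles
```

Indexing images by such structured descriptions, rather than by raw strings, is what would allow a query to match on who is in the picture and what they are doing.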
Without using high-level cognitive features, an image search engine can still play many
tricks to make out the contents of an image. For example, on the assumption that the
images placed on a webpage must be closely related to its text, Google categorizes
the images on a webpage based on all the related text near them, such as file names, the
webpage title, and more. However, the search results can be entirely surprising! The following are
some examples that test the technologies behind Google.
On November 12, 2005, Google was queried with the keyword “boy smiles”, and the following is
the first page of the search results. The third thumbnail in the first row is a surprise
because it contains neither a boy nor a smile. This shows that Google knows neither the
cognitive features of a boy nor the cognitive features of a smile.

On November 12, 2005, Google was queried with the keyword “boy smile”, and the following is
the first page of the search results. Comparing it with the previous result, we can draw the
following conclusions:
1. Google doesn’t take into account the meanings behind the query terms. To Google, “boy smile” and “boy smiles” are entirely different search criteria. This is, of course, cognitively incorrect.
2. The image features used by Google have no cognitive significance.
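The difference between treating “boy smile” and “boy smiles” as unrelated strings and treating them as the same query can be sketched as follows. The function `crude_stem` is a deliberately crude stand-in for a real stemmer, written only for this example, and nothing here is claimed to reflect Google's actual implementation.

```python
def exact_terms(query):
    """Treat every distinct spelling as a distinct search term."""
    return frozenset(query.lower().split())

def crude_stem(word):
    """Very rough suffix stripping (illustrative only, not a real stemmer)."""
    if len(word) > 3 and word.endswith("ed"):
        word = word[:-2]
    elif len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]
    return word

def normalized_terms(query):
    """Reduce each word to a crude stem before comparing queries."""
    return frozenset(crude_stem(word) for word in query.lower().split())

queries = ["boy smile", "boy smiles", "boy smiled"]

# Exact matching sees three different queries.
print(len({exact_terms(q) for q in queries}))
# After crude normalization all three collapse into one.
print(len({normalized_terms(q) for q in queries}))
```

Normalizing word forms only addresses the text side of the problem; it still says nothing about whether an image actually shows a boy or a smile.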

On November 12, 2005, Google was queried with the keyword “boy smiled”, and the following is
the first page of the search results. Confused? Yes, the computers did their job well,
but the results were not quite what we had in mind.

Other mainstream commercial image search engines perform similarly to what is shown
above, because the principles behind them are much the same. The failure of these image
search engines is caused by the low-level image features they use and by the
inconsistency and randomness in the relations between the images and the text surrounding them.











