How can machines see?

7 minute read.

Can machines see? Well, we people are biological machines, and we can see. So can we also make mechanical machines, like our computers, see? If so, a lot of tasks could be automated.

Machine vision was a dream of many researchers. Some of them dedicated years to the problem, which turned out to be harder than expected. But why is vision so hard? Something so easy and intuitive for us turns out to be exceptionally hard for our smartest machines. The answer lies in the fact that a computer “sees” every image as a matrix of numbers.


So, what was needed were smart algorithms that could convert these numbers into meaningful interpretations and understanding. Researchers started developing mathematical tools and algorithms as sets of rules to recognise visual elements. Unfortunately, due to the complexity and richness of visual information, these rules became very complex and practically unmanageable.

But wait: no one teaches a child a complex set of rules for how to recognize a banana. Instead of teaching rules, we show them lots of examples. During the first three years of life, a child sees approximately one billion images, and that is how it learns: by linking those images to the concepts they represent. Neurons in the child's brain are where this link happens.

Enter the fields of machine learning (ML) and artificial intelligence (AI), where people got the idea to use a simplified version of a biological neuron, called an artificial neuron, to imitate the human and animal ability to learn from examples. It is, in fact, quite simple.


It computes a weighted sum of its inputs followed by a non-linear activation function. When we connect many such neurons together, they form an artificial neural network.
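As a toy illustration, here is a single artificial neuron in plain Python. The input values, weights and the choice of a sigmoid activation are arbitrary picks for the example, not anything from a real trained network:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum + non-linear activation."""
    # Weighted sum of the inputs, plus a bias term
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid non-linearity squashes the result into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Example with two inputs: z = 0.5*0.8 + (-1.0)*0.2 + 0.1 = 0.3
out = neuron([0.5, -1.0], [0.8, 0.2], bias=0.1)
```

Stacking many of these, where the outputs of one layer become the inputs of the next, is all a basic neural network is.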


One layer of neurons is thus just a matrix multiplication followed by a non-linearity. Moreover, a neural network with at least one hidden layer is a universal approximator, meaning it can, in principle, approximate practically any continuous function. It doesn't require a set of complex rules, but learns from examples using one very simple rule. If we want to teach it to distinguish an apple from a banana, for example, we just need to provide it with example images. Recall how a computer perceives images as sets of numbers. Well, these numbers are fed into the input layer of the neural network, and at every learning step the network guesses what the numbers represent: apple or banana. Based on the correctness of the guess, it updates its weights accordingly. The network calculates the first derivatives of its error with respect to the weights. In other words, it calculates how much each of the weights has contributed to its guess, and adjusts them appropriately. Quite simple and elegant, isn't it?
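That update rule can be sketched for a single neuron with a squared-error loss. The inputs, target label and learning rate below are invented for illustration; real networks apply the same chain-rule idea across many layers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, x, target, lr=0.5):
    """One learning step: guess, measure the error, nudge the weights."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    guess = sigmoid(z)
    # Chain rule: derivative of squared error w.r.t. the pre-activation z
    d_out = (guess - target) * guess * (1.0 - guess)
    # Each weight moves against its own contribution to the error
    new_w = [w - lr * d_out * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * d_out
    return new_w, new_b, guess

# Toy task: target 1.0 ("banana") for a made-up two-pixel input
w, b = [0.0, 0.0], 0.0
for _ in range(500):
    w, b, guess = train_step(w, b, [1.0, 0.2], target=1.0)
# After repeated steps the guess climbs from 0.5 towards the target
```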

Unfortunately, these types of algorithms still can't “see”. Animals and people still outperform them. But why? It turns out that this kind of structure needs an enormous number of neurons (over a billion), and it is extremely hard and inefficient to train.
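A quick back-of-the-envelope calculation shows why a fully connected network blows up on images. The image size and layer width here are arbitrary but typical choices:

```python
# A modest 224x224 RGB image fed into a single fully connected layer
# of just 1000 neurons:
pixels = 224 * 224 * 3        # every pixel value is a separate input
dense_weights = pixels * 1000 # one weight per input, per neuron
# dense_weights is over 150 million -- and that is only one layer
```

Stack a few such layers and the parameter count quickly reaches the billions the text mentions.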

So how do people and animals do it? That is what scientists tried to discover by measuring neuron activity in the visual cortex of a kitten while showing it different images. They found no pattern whatsoever at first, but the neurons fired while the images were changing. It turns out that neurons in the visual cortex actually recognize simple shapes like lines and edges, not complex objects as we initially thought. Furthermore, neurons in the visual cortex are organized hierarchically, so that each layer recognizes more and more complex patterns, leading up to the recognition of whole objects.

This discovery led to the creation of a better kind of neural network, the convolutional neural network, whose organization in a way resembles that of the visual cortex. Each layer consists of convolutions: a convolution is just a small weight matrix (such as 3x3 or 5x5) that computes a weighted sum over a small patch of the image, and the same weights are reused across the entire image. This structure dramatically reduces the number of parameters compared to a classical, fully connected neural network. The convolution matrices are learned during training, and although a single layer extracts only simple features like edges, corners and lines, stacking multiple layers combines the shapes of the previous ones, leading to the recognition of complex objects. At layer 3, for example, shapes like circles or rectangles are recognised; at layer 5, eyes, mouths, wheels and windows; and at the end the network recognises faces, cars, fruits and much more.
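The convolution operation itself fits in a few lines. Below is a plain-Python sketch on a tiny made-up image, using a classic hand-crafted vertical-edge kernel (in a real CNN the kernel values would be learned, not fixed like this):

```python
def convolve2d(image, kernel):
    """Slide a small weight matrix over the image, summing the products.
    The same few kernel weights are reused at every position."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector: responds where brightness changes left to right.
# Only 9 weights, no matter how large the image is.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
feature_map = convolve2d(image, edge_kernel)
# Every position in the feature map reacts strongly to the dark-to-bright edge
```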


These types of networks have already outperformed humans at various tasks. We use them to diagnose cancer, drive our cars, test medicines, recognize people, detect and track pedestrians, and many more tasks, such as detecting poor-quality food, as we described in our previous blog.

We made computers see, but that was not enough. We wanted to peek inside the artificial brain we had created and ask the question: why? Why does the computer see what it sees? How can it distinguish an African elephant from an Asian one, for example? Does it look at the ears (as we people do), or is it some form of magic? Everything in this world is magic, except to the magician. As already described, the neural network learns from examples by calculating how much each weight of each neuron has contributed to the final answer. So there is a way to check how the computer arrives at its answer. Using a technique called GradCAM, we can examine the derivatives and propagate them backwards onto the image to compute a heatmap, showing which parts of the image most significantly influenced the decision.


GradCAM computes the gradients of the predicted class score with respect to the feature maps of the last convolutional layer, and averages them to obtain an importance weight for each feature map. The weighted feature maps are then combined, keeping only the positive contributions, and projected back onto the image, producing a heatmap of its most important parts. Here are some interesting examples that we have generated at Createsi.
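The core of the idea fits in a short plain-Python sketch. The feature maps and gradients below are invented toy numbers, not outputs of a real network; in practice they come from a trained CNN via backpropagation:

```python
def grad_cam(feature_maps, gradients):
    """Toy GradCAM: weight each feature map by the average of its
    gradients, sum the weighted maps, and keep only positive values."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for fmap, grad in zip(feature_maps, gradients):
        # Channel importance = mean gradient over all spatial positions
        weight = sum(sum(row) for row in grad) / (h * w)
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU: highlight only regions that push the class score up
    return [[max(0.0, v) for v in row] for row in heatmap]

# Two 2x2 feature maps with their gradients w.r.t. the class score
fmaps = [[[1.0, 0.0], [0.0, 0.0]],
         [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],      # mean +1.0: important channel
         [[-1.0, -1.0], [-1.0, -1.0]]]  # mean -1.0: suppressed channel
heatmap = grad_cam(fmaps, grads)
# Only the top-left region of the first (important) map survives
```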

Image: Elephants (original) and Elephants GradCAM (heatmap)

Original image source: ImageNet.

On the left is the original image of elephants, and on the right is the output of the GradCAM module, showing that this elephant is recognised as African because of the shape of its head and ears. It turns out that machine vision algorithms learn to distinguish visual elements based on unique characteristics, much as we people do.

Here are some other examples of animals, showing the unique visual elements that contribute to recognising this frog as a frog and this flamingo as a flamingo.
Image: Frog (original) and Frog GradCAM (heatmap)
Image: Flamingo (original) and Flamingo GradCAM (heatmap)

The final example shows a wristwatch and how the machine correctly learned to recognise it, based on the watch face and dial.

Image: Watch (original) and Watch GradCAM (heatmap)

Did we solve all our problems with machine vision? Certainly not. Even though computer vision outperforms humans at many tasks, there are still areas in which machines struggle to achieve any reasonable performance. Our brains use much more knowledge and experience than what comes from sight alone, and there are now artificial networks that try to mimic that (Omninet, for example, which will be addressed in future blog posts). We should, however, not always aim to imitate everything in nature. Airplanes do not flap their wings as birds do, yet they still fly and transport people faster than ever before. Similarly in AI, we should aim at discovering basic concepts and creating dedicated solutions that are very good at solving concrete problems. If we do this right, machines could (and already do) help us see things we are not able to see with our eyes alone.

Do you like our blogs? Drop us a comment and feel free to contact us via e-mail or through social media. We would love to hear from you.

