What is Computer Vision? What are the different Computer Vision tasks?

An example of how computer vision can be used to detect people and cars in an autonomous driving situation in Barcelona.
Title: Example of Computer Vision in Autonomous Driving
Source: Wikimedia Commons

What is Computer Vision?

Computer vision refers to the extraction of information from visual data, in the form of predictions, identifications, captions, and more. It encompasses any task involving image, video, and other signal data, and its methods extend from the 2D case to 3D and temporal data. Healthcare, defense, and several other industries actively apply computer vision; from diagnostic tools to facial recognition, it appears in tools of many forms and use cases.

Computer vision as a field stems from traditional image processing approaches. Image processing involves the use of mathematical techniques to enhance or transform images. It can also be used to derive analytics from images, and tends to favor a more rigid mathematical approach compared to fluid approaches such as those taken by machine learning. 

If we look at the image processing supercategory, we can split it into four main tasks:

  1. Image acquisition and representation
  2. Image transformations
  3. Image translation to parameters
  4. Parameter translation to decisions

Computer Vision is concerned with tasks 3 and 4: taking images from their innate data to parameterized representations through machine learning methods, then using these parameterizations to generate outputs. These can range from classification labels and text captions to new images.

Computer Vision Pipeline

In practice, the computer vision pipeline consists of the following steps:

A pipeline displaying the six major steps in the computer vision pipeline: Data Collection, Preprocessing, Feature Extraction, ML/AI approaches, Output/decision metrics, Analysis/explainability
Title: General Computer Vision Pipeline
Source: AIML.com Research
  1. Data collection – Regardless of whether the data is collected specifically for a task or sourced from a publicly available dataset, it must align with the requirements of the intended task. For example, a supervised approach may require the dataset to include labels or probability atlases.

  2. Preprocessing – Data may require cleaning to remove NaN values, outliers, and other inconsistencies.

  3. Feature extraction – Approaches such as PCA, Chi-square tests, and Lasso regression reduce data dimensionality. Additionally, SIFT, SURF, and HOG provide methods to represent the data in feature spaces with fewer dimensions.

  4. ML/AI approaches – Depending on the problem, different architectures succeed at parameterizing it. See “History of Computer Vision” below for examples.

  5. Output/decision metrics – In this step, the parameterized architecture assigns labels, estimates probabilities, or predicts future values. This may involve multiple steps, which many AI/ML approaches encapsulate.

  6. Analysis/explainability – Good models do not behave like black boxes. Developing robust solutions involves clearly visualizing and representing the effect of individual features and architecture choices. Common techniques include Grad-CAM, SHAP, and other explainable-AI (XAI) methods.
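The preprocessing, feature-extraction, and modeling steps above can be sketched end to end on synthetic data. This is a minimal illustration, not a production pipeline: the data, shapes, and hyperparameters are all invented, PCA stands in for feature extraction, and a hand-rolled logistic regression stands in for the ML stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. "Collected" data: 200 flattened 8x8 grayscale images in two classes
#    (class 1 is brighter on average), with a few NaN pixels mixed in.
X = rng.normal(0.0, 1.0, size=(200, 64))
y = np.repeat([0, 1], 100)
X[y == 1] += 1.5
X[rng.integers(0, 200, 5), rng.integers(0, 64, 5)] = np.nan

# 2. Preprocessing: impute NaNs with the column mean, then standardize.
col_mean = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_mean, X)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Feature extraction: PCA via SVD, keeping the top 10 components.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:10].T

# 4. ML stage: logistic regression trained by gradient descent.
w, b = np.zeros(10), 0.0
for _ in range(500):
    z = np.clip(X_pca @ w + b, -30, 30)  # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X_pca.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# 5. Output/decision: predicted labels and a simple accuracy metric.
z = np.clip(X_pca @ w + b, -30, 30)
pred = (1.0 / (1.0 + np.exp(-z)) > 0.5).astype(int)
accuracy = np.mean(pred == y)
print(f"training accuracy: {accuracy:.2f}")
```

Because the two classes are well separated, even this tiny pipeline reaches high training accuracy; real pipelines add validation splits and far more careful evaluation.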

Common Computer Vision Tasks

| Task | Description | Commonly used models |
| --- | --- | --- |
| Image Classification | Assign a label to images | LeNet, AlexNet, ResNet |
| Object Detection | Identify objects by type and location | YOLO, R-CNN |
| Image Segmentation | Separate image into parts | UNet, ResNet, Mask R-CNN |
| Facial Recognition | Identify individuals based on features | VGGFace, DeepFace |
| Optical Character Recognition | Recognize content of handwritten or typed text | CRNN, Tesseract |
| Image Captioning | Assign descriptions to images | Show and Tell, Image Transformers, Reinforcement Learning |
| Visual Question Answering | Use images to answer text queries | Vision Language Transformers |
| Image Retrieval | Use images to query a database | SIFT, CNNs |
| Image Generation | Create new images based on expectation set | GANs |
| Pose and Motion Estimation | Describe future events using indications in image structure | Mask R-CNN, PoseFlow |
| 3D Reconstruction | Stitch 2D slices or timepoints into a higher-dimension representation | SLAM, SfM, Depth Estimation |

Image Classification

In image classification, the goal is to assign labels to images, such as “cat” or “dog”. MNIST, CIFAR, and ImageNet are some of the most popular datasets for training these tasks. The task appears across many industries; Google Search, for example, uses lightweight classification algorithms to improve search results.
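As a toy illustration of the task (not a production classifier), the sketch below labels synthetic 8×8 “images” with a nearest-centroid rule; the two classes and their pixel statistics are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic dataset: 8x8 images; a "dark" class centered at intensity 0.2
# and a "bright" class centered at 0.8.
def make_images(level, n):
    return np.clip(rng.normal(level, 0.1, size=(n, 8, 8)), 0.0, 1.0)

train = {"dark": make_images(0.2, 50), "bright": make_images(0.8, 50)}

# "Training" amounts to computing one mean image (centroid) per label.
centroids = {label: imgs.mean(axis=0) for label, imgs in train.items()}

def classify(image):
    # Assign the label whose centroid is closest in pixel space.
    return min(centroids, key=lambda lbl: np.linalg.norm(image - centroids[lbl]))

test_image = make_images(0.75, 1)[0]
print(classify(test_image))  # → bright
```

Real classifiers such as ResNet learn far richer features than raw pixel distances, but the structure (fit on labeled examples, then assign the best-matching label) is the same.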

Object Detection

This is an expansion upon image classification that additionally identifies the locations of objects within the image. Surveillance systems and automated vehicles often employ object detection techniques.
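A minimal way to see the “classification plus location” idea is a sliding-window detector. The sketch below scans a synthetic image for bright regions and reports bounding boxes; the image, window size, and threshold are invented, and real detectors such as YOLO or R-CNN replace the window scan with learned region proposals.

```python
import numpy as np

# Synthetic 32x32 grayscale scene with one bright 6x6 "object".
image = np.zeros((32, 32))
image[10:16, 20:26] = 1.0

def detect(image, win=6, threshold=0.9):
    """Slide a win x win window over the image; return (row, col, height,
    width) boxes whose mean intensity exceeds the threshold."""
    boxes = []
    h, w = image.shape
    for r in range(h - win + 1):
        for c in range(w - win + 1):
            if image[r:r + win, c:c + win].mean() > threshold:
                boxes.append((r, c, win, win))
    return boxes

boxes = detect(image)
print(boxes)  # → [(10, 20, 6, 6)]
```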

Image Segmentation

This task involves breaking up an image into partitions. This can be used to identify contours, sections of an object, or just simplify the “gray levels” within an image. Medical imaging often relies on segmentation to separate anatomical components.
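The simplest form of segmentation is thresholding followed by connected-component labeling. The sketch below partitions a synthetic image into background and labeled foreground regions using a small flood fill; the image and threshold are invented, and models like UNet learn the partitioning instead.

```python
import numpy as np
from collections import deque

# Synthetic image: two bright blobs on a dark background.
image = np.zeros((20, 20))
image[2:6, 2:6] = 0.9
image[12:17, 10:15] = 0.8

def segment(image, threshold=0.5):
    """Threshold, then label connected foreground regions (4-connectivity)."""
    mask = image > threshold
    labels = np.zeros(image.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first flood fill from the seed pixel
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not labels[nr, nc]):
                    labels[nr, nc] = current
                    queue.append((nr, nc))
    return labels, current

labels, n_regions = segment(image)
print(n_regions)  # → 2
```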

Facial Recognition

This is an expansion of image classification in which each individual receives a unique identifier: rather than labeling many different cats with the same label “cat”, the label “Person X” is applied only to images of Person X, even when they appear from different angles. Facial recognition systems can be seen in smartphones, airport security systems, and more.
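Modern recognition systems map each face to an embedding vector and compare distances against enrolled identities. The sketch below shows that comparison step only; the embedding vectors are randomly invented stand-ins for what a network such as a VGGFace-style model would produce, and the threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented 128-dimensional embeddings standing in for network outputs.
enrolled = {"person_a": rng.normal(0, 1, 128), "person_b": rng.normal(0, 1, 128)}

def identify(embedding, gallery, threshold=1.0):
    """Return the closest enrolled identity, or None if no one is close enough."""
    name, dist = min(
        ((n, np.linalg.norm(embedding - e)) for n, e in gallery.items()),
        key=lambda pair: pair[1],
    )
    return name if dist < threshold else None

# A new capture of person_a: their embedding plus a little noise.
probe = enrolled["person_a"] + rng.normal(0, 0.02, 128)
print(identify(probe, enrolled))          # → person_a
print(identify(rng.normal(0, 1, 128), enrolled))  # → None (unknown face)
```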

Optical Character Recognition

OCR refers to the “reading” of images of text. Just as the human brain can identify the letter “d” regardless of font or handwriting style, OCR aims to be stylistically blind. It is often used in document digitization.
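One classical, pre-neural way to “read” a character is template matching: compare the glyph against stored letter bitmaps and pick the closest. The sketch below uses tiny invented 3×3 glyphs; real OCR systems such as CRNNs or Tesseract learn style-invariant features instead of matching pixels directly.

```python
import numpy as np

# Tiny 3x3 bitmap "glyphs" standing in for rendered characters.
templates = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
    "T": np.array([[1, 1, 1], [0, 1, 0], [0, 1, 0]]),
}

def read_glyph(glyph):
    # Pick the template with the fewest mismatched pixels.
    return min(templates, key=lambda ch: np.sum(templates[ch] != glyph))

# A slightly corrupted "L": one pixel flipped, as a noisy scan might produce.
noisy_L = templates["L"].copy()
noisy_L[2, 1] = 0
print(read_glyph(noisy_L))  # → L
```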

Image Captioning

Generating captions for images, and by extension video, necessitates a deeper understanding not only of the objects within the frame, but of their relationships to other objects. Transformer-based models are often applied to this task, as seen in automatic captioning and report generation.

Visual Question Answering

An extension of captioning, VQA allows textual input from the user alongside the image data. Vision Language Models generate textual responses, building off object detection, classification, and relationship recognition approaches. This can be seen in some chatbots and medical diagnostic software.

Image Retrieval

This task involves querying a database using images or text. Images can be stored as feature representations, so a query that highlights certain features can quickly evaluate large databases for close matches. A prime example of this is the image search offered by Google and other search engines.
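The feature-space matching described above can be sketched directly. Below, the “database” is a set of invented feature vectors (in practice these would come from SIFT descriptors or a CNN), and retrieval ranks entries by cosine similarity to the query.

```python
import numpy as np

rng = np.random.default_rng(3)

# A "database" of 1000 images stored as precomputed 256-d feature vectors.
database = rng.normal(0, 1, size=(1000, 256))

def retrieve(query, database, k=3):
    """Return indices of the k most similar entries by cosine similarity."""
    db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = db_norm @ q_norm
    return np.argsort(scores)[::-1][:k]  # highest similarity first

# Query with a noisy copy of entry 42; it should rank first.
query = database[42] + rng.normal(0, 0.1, 256)
print(retrieve(query, database))
```

Because similarity is a single matrix-vector product over normalized features, this scales to large databases far better than comparing raw pixels would.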

Image Generation

Generation unlocks the predictive capability of computer vision models. Often utilizing GANs, or even diffusion models, image generation seeks to understand the distributions within images of interest and generate new images that fall within the same distributions. It can be heavily driven by Bayesian statistics or traditional ML approaches to fine-tune stochastic generations. This can be seen in synthetic data generation, art style transfer, and even image editing software.
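As a toy version of the “learn a distribution, then sample from it” idea (far simpler than a GAN or diffusion model), the sketch below fits an independent per-pixel Gaussian to synthetic training images and draws new samples; the training set and its diagonal pattern are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# "Training set": 500 synthetic 8x8 images with a bright diagonal pattern.
base = np.eye(8) * 0.8 + 0.1
train = np.clip(base + rng.normal(0, 0.05, size=(500, 8, 8)), 0.0, 1.0)

# "Learn" the distribution: an independent mean and std per pixel.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# "Generate": sample new images from the fitted distribution.
samples = np.clip(rng.normal(mu, sigma, size=(10, 8, 8)), 0.0, 1.0)

# New samples reproduce the diagonal structure of the training data.
print(samples[0].diagonal().mean() > samples[0, 0, 1:].mean())  # → True
```

A per-pixel Gaussian ignores correlations between pixels, which is exactly the gap that GANs and diffusion models exist to close.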

Pose and Motion Estimation

This task estimates structural features of images, such as position, orientation, and arrangement, to understand static and dynamic scenes. Athletic analyses, gesture-based recognition, and navigation systems all employ these techniques.

3D Reconstruction

3D reconstruction represents a physical object, or its evolution in time, by stitching 2D slices or timepoints into a higher-dimensional structure. It is a highly important problem in healthcare and virtual twin modeling, but it can also be found in many industries that study historical evolution or structures only visible under particular imaging protocols.

History of Computer Vision

A visual timeline of major architectures within computer vision
Title: Timeline of Computer Vision History
Source: AIML.com Research

Commonly Used Datasets

| Dataset | Description | Typical tasks | Image specifications |
| --- | --- | --- | --- |
| ImageNet | Wide variety of images, most generalizable | Image Classification | 14M images, 21,000 categories |
| CIFAR | Wide variety of 32×32 images, often used for benchmarking | Image Classification | 60,000 images, 10 categories (CIFAR-10) or 100 categories (CIFAR-100) |
| MNIST | Handwritten digits | Optical Character Recognition | 70,000 images, 10 categories |
| COCO | Objects presented within several contexts | Object Detection, Image Segmentation, Image Captioning | 330,000 images, 80 categories, 1.5M object instances |
| Pascal VOC | Objects labeled with bounding boxes, pixel-wise classification | Object Detection, Image Segmentation | 10,000 images, 20 categories |
| LFW | Facial recognition dataset | Facial Recognition | 13,000 images, 5,749 individuals |

What are some typical challenges in computer vision?

Listed below are some of the challenges one might face when working with computer vision tasks:

Data Complexity and Ambiguity

The complexity of perception and the ambiguity in how images are interpreted can hinder vision tasks. For example, human vision relies heavily on repeated exposure to visual instances to learn, and the same is true of computers. Most commonly, researchers struggle with the sheer volume of data required for visual training. Many publicly available datasets address this problem, but they often remain too general and provide insufficient context, especially in cases such as medical diagnostics. Data quality is also immensely important, as it affects the model’s accuracy and its sensitivity to noise and other aberrations.
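One common mitigation for limited data volume is augmentation: generating label-preserving variants of each training image. The sketch below uses flips and 90-degree rotations with numpy; the batch size and image dimensions are invented, and these particular transforms only preserve labels for tasks where orientation does not matter.

```python
import numpy as np

def augment(image):
    """Return simple variants of one image: the original, horizontal and
    vertical flips, and 90/180/270-degree rotations."""
    variants = [image, np.fliplr(image), np.flipud(image)]
    variants += [np.rot90(image, k) for k in (1, 2, 3)]
    return variants

rng = np.random.default_rng(5)
batch = rng.random((16, 32, 32))

# A 16-image batch becomes 96 training examples.
augmented = [v for img in batch for v in augment(img)]
print(len(augmented))  # → 96
```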

Runtime Considerations

Another constant pursuit in vision is runtime efficiency. Adding more layers and approaches such as region proposals can lead to large improvements in accuracy, but also require more computational time and resources. 

Lack of Generalizability

Models are also highly specialized to the tasks they are trained for, and often struggle when applied to similar problems in different contexts. Transfer learning and zero-shot prediction aim to address these problems.

Beyond the Model

Most importantly, vision is always subject to ethical and open-source concerns. It is important to evaluate the long-term effects of vision solutions, and to recognize that both privacy and bias-free learning are integral to their success. Watch the videos below to see how privacy and fairness are taken into consideration as computer vision solutions are developed.

Video Explanations:

  • In this “Eye on AI” episode, Alice Xiang discusses the privacy and fairness challenges involved in vision systems, covering both ethical and technical issues. Watch to see some proposed solutions. (Runtime: 42 mins)

Privacy vs Fairness in Computer Vision with Alice Xiang by The TWIML AI Podcast with Sam Charrington
  • This video titled “How to Avoid Bias in Computer Vision Models” by Roboflow details how to identify and address biases within computer vision systems, highlighting approaches such as data first mentality, active learning, and model error analysis. (Runtime: 29 mins)
Preventing Algorithmic Bias in Computer Vision Models by Roboflow


Help us improve this post by suggesting in comments below:

– modifications to the text, and infographics
– video resources that offer clear explanations for this question
– code snippets and case studies relevant to this concept
– online blogs, and research publications that are a “must read” on this topic
