Computer Vision (1)
Generative AI (2)
Machine Learning Basics (18)
Deep Learning (52)
- DL Basics (16)
- DL Architectures (17)
  - Feedforward Network / MLP (2)
  - Sequence models (6)
  - Transformers (9)
- DL Training and Optimization (17)
Natural Language Processing (27)
- NLP Data Preparation (18)
Supervised Learning (115)
- Regression (41)
  - Linear Regression (26)
  - Generalized Linear Models (9)
  - Regularization (6)
- Classification (70)
  - Logistic Regression (10)
  - Support Vector Machine (9)
  - Ensemble Learning (24)
  - Other Classification Models (9)
  - Classification Evaluations (9)
Unsupervised Learning (55)
- Clustering (37)
  - Distance Measures (9)
  - K-Means Clustering (9)
  - Hierarchical Clustering (3)
  - Gaussian Mixture Models (5)
  - Clustering Evaluations (6)
- Dimensionality Reduction (9)
Statistics (34)
Data Preparation (35)
- Feature Engineering (30)
- Sampling Techniques (5)

What is Computer Vision? What are the different Computer Vision tasks?

Updated: April 21, 2025

An example of how computer vision can be used to detect people and cars in an autonomous driving situation in Barcelona. — Title: Example of Computer Vision in Autonomous Driving
Source: Wikimedia Commons

What is Computer Vision?

Computer vision refers to the extraction of information from visual data, in the form of predictions, identification, captioning, and more. It encompasses any task that involves image, video, and other signal data, and can be modified from the 2D case to 3D and temporal data. Healthcare, defense, and several other industries actively apply computer vision. From diagnostic tools to facial recognition, it appears in tools of many forms and use cases.

Computer vision as a field stems from traditional image processing approaches. Image processing involves the use of mathematical techniques to enhance or transform images. It can also be used to derive analytics from images, and tends to favor a more rigid mathematical approach compared to fluid approaches such as those taken by machine learning.

If we look at the image processing supercategory, we can split this into 4 main tasks:

Image acquisition and representation
Image transformations
Image translation to parameters
Parameter translation to decisions

Computer Vision is concerned with tasks 3 and 4: taking images from their innate data to a parameterized representations through machine learning methods, then using these parameterizations to definitively generate outputs. These can range from classification labels and text captions, to new images.

Computer Vision Pipeline

In practice, the computer vision pipeline encompasses the following pipeline:

Data collection – Regardless of whether the data is collected specifically for a task or sourced from a publicly available dataset, it must align with the requirements of the intended task. For example, a supervised approach may require the dataset to include labels or probability atlases.

Preprocessing – Data may require cleaning to remove NaN values, outliers, and other inconsistencies.

Feature extraction – Approaches such as PCA, Chi-square tests, and Lasso regression reduce data dimensionality. Additionally, SIFT, SURF, HOG and more provide methods to represent the data in feature spaces with fewer dimensions.

ML/AI approaches – Depending on the problem, different architectures succeed at parameterizing the problem. Reference “History of Computer Vision” for examples.

Output/decision metrics – In this step, the parameterized architecture assigns labels, identifies probability, or predicts future values. This may involve multiple steps, which many AI/ML approaches encapsulate.

Analysis/explainability – Good models do not behave like black boxes. Developing robust solutions involves clearly visualizing and representing the effect of individual features, architecture choices. Some common visual techniques include GradCAM, as well as modified SHAP and xAI.

Common Computer Vision Tasks

Task	Description	Commonly used models
Image Classification	Assign a label to images	LeNet, AlexNet, ResNet
Object Detection	Identify objects by type and location	YOLO, R-CNN
Image Segmentation	Separate image into parts	UNet, ResNet, Mask R-CNN
Facial Recognition	Identify individuals based on features	VGGFace, DeepFace
Optical Character Recognition	Recognize content of handwritten or typed text	CRNN, Tesseract
Image Captioning	Assign descriptions to images	Show and Tell, Image Transformers, Reinforcement Learning
Visual Question Answering	Use images to answer text queries	Vision Language Transformers
Image Retrieval	Use images to query a database	SIFT, CNNs
Image Generation	Create new images based on expectation set	GANs
Pose and Motion Estimation	Describe future events using indications in image structure	Mask R-CNN, PoseFlow
3D Reconstruction	Stitch 2D slices or timepoints into a higher dimension representation	SLAM, SfM, Depth Estimation

Image Classification

In image classification, the goal is to assign labels to images, such as “cat” or “dog”. MNIST, CIFAR, and ImageNet are some of the most popular datasets to train these tasks. Many industries demonstrate this, with Google Search implementing lightweight classification algorithms to improve search results.

Object Detection

This is an expansion upon image classification, additionally identifying the locations of the object within the image. Surveillance systems or automated vehicles often employ object detection techniques.

Image Segmentation

This task involves breaking up an image into partitions. This can be used to identify contours, sections of an object, or just simplify the “gray levels” within an image. Medical imaging often relies on segmentation to separate anatomical components.

Facial Recognition

This is an expansion of image classification, where there are unique identifiers for each individual; rather than identifying multiple cats with the same label “cat”, the same label “Person X” will only be applied to images of Person X that may appear from different angles, not every person. Facial recognition systems can be seen in smartphones, airport security systems, and more.

Optical Character Recognition

OCR refers to the “reading” of images of text. Just as the human brain can identify the letter “d” regardless of font or handwriting style, OCR aims to be stylistically blind. It is often used in document digitization.

Image Captioning

Generating captions for images, and by extension video, necessitates a deeper understanding not only of the objects within the frame, but their relationships to other objects. Transformer based models often receive this task, as seen in automatic captioning and report generation.

Visual Question Answering

An extension of captioning, VQA allows textual input from the user alongside the image data. Vision Language Models generate textual responses, building off object detection, classification, and relationship recognition approaches. This can be seen in some chatbots and medical diagnostic software.

Image Retrieval

This task involves querying a database using images or text. Images can be stored in their feature representations so a query which highlights certain features is able to evaluate larger databases for close matches quickly. A prime example of this is Google and other browsers’ image search.

Image Generation

Generation unlocks the predictive capability of computer vision models. Often utilizing GAN networks, or even diffusion models, image generation seeks to understand the distributions within images of interest, and generate new images that fall within the same distributions. It can be heavily driven by Bayesian statistics, or traditional ML approaches to fine tune stochastic generations. This can be seen in synthetic data generation, art style transfers, and even image editing softwares.

Pose and Motion Estimation

Estimating structural features of images such as position, orientation, arrangement, and more to understand static and dynamic scenes. Athletic analyses, gesture-based recognition and navigation systems all employ these techniques.

3D Reconstruction

A 3d structure represents a physical object or its evolution in time using 2D slices or timepoints. It is a highly important problem in healthcare and virtual twin modeling, but can also be found in many industries that study historical evolution or structures only visible under particular imaging protocols.

History of Computer Vision

A visual timeline of major architectures within computer vision — Title: Timeline of Computer Vision History
Source: AIML.com Research

LeNet: Gradient-based learning applied to document recognition
AlexNet: ImageNet Classification with Deep Convolu t ional Neural Networks
VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition
GoogleNet: Going Deeper with Convolutions
R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation
Fast R-CNN: Fast R-CNN
Faster R-CNN: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Mask R-CNN: Mask R-CNN
ResNet: Deep Residual Learning for Image Recognition
YOLO: You Only Look Once: Unified, Real-Time Object Detection
DenseNet: Densely Connected Convolutional Networks
Vision Transformers: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Commonly Used Datasets

Dataset	Description	Typical tasks	Image Specifications
ImageNet	Wide variety of images, most generalizable	Image Classification	14M images, 21000 categories
CIFAR	Wide variety of images of size 32×32, often used for benchmarking	Image Classification	60000 images, 10 categories (CIFAR-10) or 100 categories (CIFAR-100)
MNIST	Handwritten digits	Optical Character Recognition	70000 images, 10 categories
COCO	Objects presented within several contexts	Object Detection, Image Segmentation, Image Captioning	330000 images, 80 categories, 1.5M object instances
Pascal VOC	Objects labeled with bounding boxes, pixel-wise classification	Object Detection, Image Segmentation	10000 images, 20 categories
LFW	Facial recognition dataset	Facial Recognition	13000 images, 5749 individuals

What are some typical challenges in computer vision?

Listed below are some of the challenges one might face when working with computer vision tasks:

Data Complexity and Ambiguity

The complexity of perception and the ambiguity in how images are interpreted can hinder vision tasks. For example, human vision relies heavily on repeated exposure to visual instances to learn, and the same is true of computers. Most commonly, researchers find challenge in the sheer lack of the high volume data required by visual training. Many publicly available datasets address this problem, but they often remain too general and provide insufficient context, especially in cases such as medical diagnostics. The quality of data is also immensely important, as this can affect the model’s accuracy and sensitivity to noise and other aberrations.

Runtime Considerations

Another constant pursuit in vision is runtime efficiency. Adding more layers and approaches such as region proposals can lead to large improvements in accuracy, but also require more computational time and resources.

Lack of Generalizability

Models are also highly personalized to the tasks they are trained for, and often struggle when applied to similar problems in different contexts. Transfer learning and zero-shot prediction work to target these problems.

Beyond the Model

Most importantly, vision is always subject to ethical and open-source concerns. It is always important to evaluate the long-standing effect of vision solutions, and how both privacy and bias-free learning are integral to their success. Watch the videos below to see how privacy and fairness are taken into consideration as computer vision solutions are developed.

Video Explanations:

In this “Eye on AI” episode, Alice Xiang speaks on privacy and fairness challenges involved in vision systems, such as ethical and technical issues. Watch to see some proposed solutions (Runtime: 42 mins)

YouTube video — Privacy vs Fairness in Computer Vision with Alice Xiang by The TWIML AI Podcast with Sam Charrington

This video titled “How to Avoid Bias in Computer Vision Models” by Roboflow details how to identify and address biases within computer vision systems, highlighting approaches such as data first mentality, active learning, and model error analysis. (Runtime: 29 mins)

Related Questions:

What are Transformers? Discuss the evolution, advantages, and major breakthroughs in transformer models

Author

Krithika Thanigaivelan

Help us improve this post by suggesting in comments below:

– modifications to the text, and infographics
– video resources that offer clear explanations for this question
– code snippets and case studies relevant to this concept
– online blogs, and research publications that are a “must read” on this topic

Leave the first comment (Cancel Reply)

You must be logged in to post a comment.

Partner Ad

Join us on:

Find out all the ways that you can

Contribute