Computer vision is everywhere these days. It is an exciting area of research that involves teaching machines to “see” and interpret the world around them using digital images or videos. For a long time, computer vision was its own distinct area of science and technology. One of the key differences between modern computer vision and its earlier incarnations is the shift from rule-based to data-driven approaches. With the advent of deep learning, computer vision has moved away from hand-crafted features and towards end-to-end models that learn directly from raw data. This shift has enabled more accurate and robust models that scale to large data sets. Thanks to this change, and in particular to transformer-based architectures, computer vision now has much in common with natural language processing. However, there are also some important differences.
A variety of start-ups base their success on computer vision techniques. AI Clearing, for example, uses computer vision to automatically monitor construction progress. Thanks to image processing models combined with drone imagery, construction managers can easily track the progress of large construction sites, such as solar parks or motorways. Nomagic uses robots to automate repetitive logistics tasks in warehouses. Its computer vision solutions determine the optimal picking position for unknown products, scan barcodes, and place items optimally in containers. Focal Systems offers a solution for retail automation: its cameras automatically scan store shelves to get accurate information about what is on them, making it easier to manage the store automatically. Another start-up, RespoVision, is developing a soccer-specific solution that automatically analyses TV game recordings and produces accurate game telemetry (such as ball position or player speed).
There are a number of computer vision tasks, and classifying all of them is a painstaking job. One way to do this would be to look at the input data: photos (from a camera, a drone, a satellite...), videos, point clouds, or various remote sensing devices. Another important distinction in computer vision is the difference between generative and discriminative models. Discriminative models focus on drawing boundaries between classes in observed images, such as recognizing objects or faces. Generative models, on the other hand, are designed to create something new, such as realistic images or videos. Examples of generative models include Midjourney and DALL-E 2, both of which generate images from text descriptions.
Generative computer vision tasks involve creating something new, such as images or videos that resemble a set of training data. These tasks are usually tackled with generative models, which are designed to generate new data based on the patterns observed in the training data. Let's start with text-to-image synthesis, which creates an image from a textual description. Think DALL-E or Midjourney! Another common generative task is image synthesis, which involves creating new images that are similar to a series of training images. This can be used for a wide range of applications, such as generating realistic images of people, animals, or objects for use in video games or virtual reality environments. Style transfer is closely related: it takes the style of one image and transfers it to another. For example, you could transfer the style of a famous painting to a photo. There is no need to stay in two dimensions. With 3D object creation (or 3D reconstruction), you can, for example, create 3D models of objects from 2D images or videos. 3D reconstruction can be used to create virtual tours of buildings or to visualize complex scientific data in 3D.
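To make style transfer a little more concrete: the classic approach compares the Gram matrices of feature maps extracted from the style image and the generated image, and optimizes the generated image to reduce that difference. Here is a minimal sketch of the Gram-matrix style loss, using random arrays as stand-ins for the feature maps that a pretrained CNN (such as VGG) would normally provide:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a feature map with shape (channels, height, width)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # channel-by-channel correlations, normalized by the feature map size
    return flat @ flat.T / (c * h * w)

def style_loss(features_a: np.ndarray, features_b: np.ndarray) -> float:
    """Mean squared difference between the Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(features_a) - gram_matrix(features_b)) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16, 16))
print(style_loss(a, a))                               # 0.0 — identical styles
print(style_loss(a, rng.normal(size=(8, 16, 16))))    # > 0 for differing styles
```

In a full style-transfer pipeline, this loss would be computed on features from several CNN layers and minimized with gradient descent on the generated image's pixels.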
Sometimes we don't want to create anything new; we just want to work with what we have. Image inpainting, for example, involves filling in missing or damaged parts of an image. This can be useful in applications such as image restoration or video editing, where damaged or missing parts of an image need to be reconstructed. Another task of this kind, widely known from FBI television series, is image super-resolution. Essentially, a high-resolution image is generated from a low-resolution one. This is useful for applications that require improved image quality, such as satellite imagery. Sometimes we want to work with visual content, but the target output is text. For example, image captioning creates a textual description of an image or video. The closely related task of video summarization creates a shorter, more concise version of a longer video by selecting and summarizing keyframes or segments. Video summarization is used in applications such as news broadcasts, sports analytics, and security monitoring.
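To illustrate the idea behind inpainting, here is a deliberately simple diffusion-style baseline: unknown pixels are filled in by repeatedly averaging their neighbours until the values settle. Real inpainting systems use learned generative models rather than this toy scheme, but the goal is the same — plausibly reconstructing missing pixels from their surroundings:

```python
import numpy as np

def inpaint(image: np.ndarray, mask: np.ndarray, iterations: int = 200) -> np.ndarray:
    """Fill masked pixels (mask == True) by repeatedly averaging their 4-neighbours."""
    result = image.astype(float).copy()
    result[mask] = 0.0  # unknown pixels start at zero
    for _ in range(iterations):
        # average of up/down/left/right neighbours (border pixels reuse themselves)
        padded = np.pad(result, 1, mode="edge")
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        result[mask] = neighbours[mask]  # only masked pixels are updated
    return result

# a 5x5 horizontal-gradient image with one "damaged" pixel in the middle
img = np.tile(np.arange(5, dtype=float), (5, 1))
mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True
restored = inpaint(img, mask)
print(restored[2, 2])  # 2.0 — the value implied by the surrounding gradient
```

The undamaged pixels are never touched; only the masked region is diffused in from the outside.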
One of the main tasks of computer vision is image classification, which assigns an image to one of several predefined classes or categories. For example, an image processing system can be trained to classify images of animals into various types, such as cats, dogs, and birds. Another important task is semantic segmentation, in which each pixel of an image is assigned a class label. Essentially, this is classification at the pixel level. It can be used to separate the foreground and background of an image or to segment various objects within it. In medical imaging, semantic segmentation can be used, for example, to identify and segment tumors in MRI scans. The background blur in video calls relies on the same idea! Classification is not limited to images. Point cloud classification assigns the points in a 3D point cloud to different categories or classes. It can be used, for example, to identify and classify various types of objects in a 3D map of a warehouse or to identify road markings and signs in a 3D map of a road.
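Modern image classifiers are deep networks trained on large labeled data sets, but the core idea — map pixels to a class label — can be shown with a much simpler toy model. The sketch below classifies synthetic 8x8 "images" with a nearest-centroid rule on raw pixel values; it is an illustration of the task, not of how production classifiers work:

```python
import numpy as np

class NearestCentroidClassifier:
    """Assign each image to the class whose mean training image is closest."""

    def fit(self, images: np.ndarray, labels: np.ndarray) -> "NearestCentroidClassifier":
        self.classes_ = np.unique(labels)
        flat = images.reshape(len(images), -1)  # flatten each image to a vector
        self.centroids_ = np.stack(
            [flat[labels == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, images: np.ndarray) -> np.ndarray:
        flat = images.reshape(len(images), -1)
        # Euclidean distance from every image to every class centroid
        distances = np.linalg.norm(flat[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[distances.argmin(axis=1)]

# synthetic data: class 0 images are dark, class 1 images are bright
rng = np.random.default_rng(1)
dark = rng.uniform(0.0, 0.3, size=(20, 8, 8))
bright = rng.uniform(0.7, 1.0, size=(20, 8, 8))
X = np.concatenate([dark, bright])
y = np.array([0] * 20 + [1] * 20)

clf = NearestCentroidClassifier().fit(X, y)
print(clf.predict(np.full((1, 8, 8), 0.9)))  # [1] — a bright test image
```

A real pipeline would replace the raw pixel vectors with features learned by a CNN or vision transformer, but the "compare against what each class looks like" intuition carries over.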
Object detection is another important task of computer vision: identifying and locating objects in an image or video. This task is crucial for applications such as security monitoring and autonomous vehicles. Object detection algorithms can be used, for example, to identify and track suspicious behavior in a crowd or to spot and avoid obstacles on the road. A special case is facial recognition, which recognizes and identifies people based on their facial features. Facial recognition is used in security monitoring, identity verification, and social media tagging. Multi-object tracking follows multiple objects over a period of time. It is often used in video analysis and monitoring to track the movements of several people or vehicles in real time. One use case would be tracking customer behavior in a store to analyze and improve the store layout, or identifying bottlenecks in warehouse operations.
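Detection and tracking pipelines constantly need to score how well two bounding boxes overlap — a predicted box against a ground-truth box, or a detection in the current frame against a track from the previous one. The standard measure is intersection over union (IoU). A minimal sketch, with boxes given as (x1, y1, x2, y2):

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
print(iou((0, 0, 10, 10), (20, 20, 30, 30))) # 0.0 — no overlap
```

A simple multi-object tracker can be built on top of this by greedily matching each new detection to the existing track with the highest IoU above some threshold.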
There are many more tasks. Depth estimation is about estimating the distance of objects from a camera or sensor. It is used in applications such as robotics, augmented reality, and autonomous vehicles. Pose estimation is the process of estimating the 3D position and orientation of a human body or object from a 2D image or video. It is used in applications such as sports analysis, motion capture, and virtual try-on. Edge detection is another computer vision task, which involves identifying the edges or boundaries between objects in an image. It is useful for feature extraction and can be used in applications such as face recognition, image segmentation, and object detection. These are just a few examples of the many computer vision tasks that exist; the list is not exhaustive. Some tasks are not as easy to classify, such as optical character recognition (OCR), which recognizes text in an image or video and converts it into machine-readable text. OCR is often used when scanning documents and archiving them digitally. As computer vision technology continues to improve, we can expect even more advanced applications and use cases in the future.
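Edge detection is one of the few tasks above that is still often solved with classic, hand-crafted filters rather than learned models. A common choice is the Sobel operator: two small kernels that estimate the horizontal and vertical intensity gradients, whose magnitude peaks at edges. A self-contained sketch:

```python
import numpy as np

def sobel_edges(image: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a grayscale image using 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    padded = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # accumulate the 3x3 cross-correlation one kernel tap at a time
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)  # gradient magnitude

# a vertical step edge: left half dark, right half bright
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
print(edges[3, 2] > edges[3, 0])  # True — the response peaks at the step
```

Libraries such as OpenCV or scikit-image provide optimized versions of this and of more sophisticated detectors like Canny.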
Knowing which task you need is an important step in building an image processing pipeline, but there is much more to do. Once you've identified the specific tasks that can be solved with computer vision, the next step is to develop a plan for building the pipeline. This usually includes selecting appropriate models and hardware components, designing and training machine learning models, and integrating the models into your existing systems and workflows. You can always start with pre-trained models and adapt them to your specific use case. This approach can reduce the time and cost of training new models from scratch while still delivering the benefits of computer vision technology. The smallest models run comfortably on consumer GPUs; often they are good enough for the job, and there is no need to spend large amounts on server rooms.
The applications of computer vision are diverse, and we know this can feel complex, especially because the field is constantly changing and every new week brings new discoveries. At Perelyn, we love computer vision and would love to help! We specialize in creating custom computer vision solutions for businesses of all sizes. Whether you want to automate a specific task, analyze large amounts of image or video data, or expand the capabilities of an existing system, we can help. Our team of experienced computer vision engineers and machine learning specialists has the skills and expertise to deliver even the most complex projects.