Human Pose Estimation Technology Capabilities and Use Cases in 2022

2022-10-08 06:39:52

Illustration: © IoT For All

What is Human Pose Estimation?

Human Pose Estimation (HPE) is a computer vision task that focuses on identifying the position of a human body in a specific scene. Most HPE methods detect body parts and the overall pose from RGB images recorded with an optical sensor. HPE can be used in conjunction with other computer vision technologies for fitness and rehabilitation, augmented reality applications, and surveillance.

'Fitness applications and AI-driven coaches are some of the most obvious use cases for body pose estimation.' - MobiDev

The essence of the technology lies in detecting points of interest on the limbs, joints, and even the face of a human. These key points are used to produce a 2D or 3D representation of a human body model.

2D representation of an Albert Einstein body pose

These models are basically a map of the body joints we track during movement. This lets a computer not only tell the difference between a person sitting and squatting, but also calculate the angle of flexion in a specific joint and determine whether a movement is performed correctly.
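As a concrete illustration, the flexion angle at a joint can be computed from three tracked key points with basic vector math. This is a minimal sketch with invented coordinates, not the article's actual implementation:

```python
import numpy as np

def flexion_angle(a, b, c):
    """Angle at joint b (in degrees) formed by keypoints a-b-c,
    e.g. hip-knee-ankle for knee flexion."""
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point values slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Straight leg: hip, knee, and ankle collinear -> roughly 180 degrees
print(flexion_angle((0, 2), (0, 1), (0, 0)))
# Deep squat: ankle displaced forward -> roughly 90 degrees
print(flexion_angle((0, 2), (0, 1), (1, 1)))
```

The same function works unchanged for 3D key points, since the dot product generalizes to any dimension.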

There are three common types of human body models: skeleton-based, contour-based, and volume-based. The skeleton-based model is the most widely used in human pose estimation because of its flexibility: it consists of a set of joints such as ankles, knees, shoulders, elbows, and wrists, along with limb orientations, which together comprise the skeletal structure of a human body.

Body models in human pose estimation

A skeleton-based model is used for 2D as well as 3D representation, but, generally, the 2D and 3D methods are used in conjunction. 3D human pose estimation grants better accuracy to an application's measurements, since it considers depth coordinates and feeds them into the calculation. Depth matters for the majority of movements, because the human body doesn't move within a single 2D plane.

Now let's look at how 3D human pose estimation works from a technical perspective and examine the current capabilities of such systems.

How 3D Human Pose Estimation Works

The overall flow of a body pose estimation system starts with capturing the initial data and uploading it for a system to process. As we’re dealing with motion detection, we need to analyze a sequence of images rather than a still photo since we need to extract how key points change during the movement pattern. 

Once the image is uploaded, the HPE system will detect and track the required key points for analysis. In a nutshell, different software modules are responsible for tracking 2D key points, creating a body representation, and converting it into a 3D space. So, generally, when we speak about creating a body pose estimation model, we mean implementing two different modules for 2D and 3D planes.

The difference between 2D and 3D pose estimation reconstructions

So, for the majority of human pose estimation tasks, the flow will be broken into two parts:

  1. Detecting and extracting 2D key points from the sequence of images. This entails using horizontal and vertical coordinates that build up a skeleton structure.
  2. Converting 2D key points into 3D by adding the depth dimension.

During this process, the application will make the required calculations to perform pose estimation. 
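The two-step flow above can be sketched with stub functions standing in for the real detector and lifting network. Both stubs are hypothetical placeholders (a real 2D detector such as BlazePose's and a real lifting model such as VideoPose3D would go in their place); the shapes follow a common 17-keypoint skeleton:

```python
import numpy as np

N_KEYPOINTS = 17  # common COCO-style skeleton size

def detect_2d(frame):
    """Stage 1 (stub): a 2D detector would return (x, y) image
    coordinates for each keypoint. Here we fake them randomly."""
    h, w = frame.shape[:2]
    return np.random.rand(N_KEYPOINTS, 2) * [w, h]

def lift_to_3d(kp2d_sequence):
    """Stage 2 (stub): a lifting network maps a temporal window of 2D
    keypoints to 3D. Here we just append a dummy depth coordinate."""
    depth = np.zeros((*kp2d_sequence.shape[:2], 1))
    return np.concatenate([kp2d_sequence, depth], axis=-1)

# A fake 30-frame clip of 720p images
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(30)]
kp2d = np.stack([detect_2d(f) for f in frames])  # shape (30, 17, 2)
kp3d = lift_to_3d(kp2d)                          # shape (30, 17, 3)
print(kp2d.shape, kp3d.shape)
```

The key structural point is that the two stages are independent modules, so either one can be swapped out or accelerated separately.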

Estimating human pose during exercise is just one example in the fitness industry. Some models can also detect key points on the human face and track head position, which can be applied for entertainment applications like Snapchat masks. But we’ll discuss the use cases of HPE later in the article. 

You can check our demo to see how it works: just upload a short video of yourself performing a movement and wait for processing to finish to see the pose analysis.

3D Pose Estimation Performance and Accuracy

Depending on the chosen algorithm, the HPE system will provide different performance and accuracy results. Let’s see how they correlate in terms of our experiment with two of the most popular human pose estimation models, VideoPose3D and BlazePose. 

We tested BlazePose and VideoPose3D on the same hardware using a 5-second video at 2160×3840 resolution and 60 frames per second. VideoPose3D took a total of 8 minutes to process the video and delivered good accuracy. In contrast, BlazePose processed 3-4 frames per second, which allows use in real-time applications, but its accuracy results, shown below, fall short of the objectives of most HPE tasks.
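For context, the figures above imply the following throughput (simple arithmetic on the numbers reported from our experiment):

```python
# A 5-second clip at 60 FPS
total_frames = 5 * 60                      # 300 frames
videopose3d_seconds = 8 * 60               # 8 minutes of processing time
videopose3d_fps = total_frames / videopose3d_seconds
blazepose_fps = 3.5                        # middle of the reported 3-4 FPS

print(f"VideoPose3D: {videopose3d_fps:.3f} FPS")  # well below real time
print(f"BlazePose is roughly {blazepose_fps / videopose3d_fps:.0f}x faster")
```

So VideoPose3D processes under one frame per second on this hardware, which explains why its accuracy advantage comes at the cost of real-time use.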

VideoPose3D and BlazePose processing results

The processing time depends on the movement complexity, video and lighting quality, and the 2D pose detector module. Even though BlazePose and VideoPose3D use different 2D detectors, this stage turns out to be the performance bottleneck in both cases.

One possible way to optimize HPE performance is to accelerate 2D key point detection. Existing 2D detectors can be modified or augmented with post-processing stages to improve overall accuracy.

Real-time 3D Human Pose Estimation

Whether we're dealing with a fitness app, a rehabilitation app, face masks, or surveillance, real-time processing is often a hard requirement. Of course, a model's performance depends on the chosen algorithm and hardware, but the majority of existing open-source models either respond too slowly or, when sped up, sacrifice accuracy. So, is it possible to improve existing 3D human pose estimation models to achieve acceptable accuracy with real-time processing?

While models like BlazePose are able to provide real-time processing, their tracking accuracy is not suitable for commercial use or complex tasks. In our experiment, we combined the 2D component of BlazePose with a modified 3D-pose-baseline model, implemented in Python.

In terms of speed, our model achieves about 46 FPS on the above-mentioned hardware without video rendering, whereas the 2D pose detection model alone produces key points at about 50 FPS. In comparison, the modified 3D baseline model can produce key points at about 780 FPS. A detailed breakdown of the processing time of our approach is presented below.

BlazePose 2D + 3D-pose-baseline performance in percent
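Assuming the 2D and 3D stages run sequentially on each frame (an assumption on our part, not a statement about the production setup), per-frame times add, so the stage rates combine harmonically, which lines up well with the measured end-to-end figure:

```python
def pipeline_fps(*stage_fps):
    """End-to-end throughput of sequential stages: per-frame
    processing times add, so rates combine harmonically."""
    return 1.0 / sum(1.0 / f for f in stage_fps)

fps_2d = 50    # BlazePose 2D key point detection
fps_3d = 780   # modified 3D-pose-baseline lifting
combined = pipeline_fps(fps_2d, fps_3d)
print(round(combined))  # close to the measured ~46 FPS
```

This also shows why the fast 3D stage barely matters: the slow 2D detector dominates the combined rate, reinforcing that the 2D stage is the bottleneck.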

While this approach doesn't guarantee reliability in complex scenarios with dim lighting or unusual poses, standard videos can be processed in real time. Generally, though, the accuracy of a model's predictions depends on its training and the chosen architecture. With the true capabilities of human pose estimation in mind, we can now analyze some common business applications and general use cases for this technology.

Human Pose Estimation Use Cases

HPE can be considered a fairly mature technology, with established groundwork in application areas like fitness, rehabilitation, augmented reality, animation, gaming, robotics, and even surveillance. So let's look at the existing use cases.

AI Fitness and Self-Coaching

Fitness applications and AI-driven coaches are some of the most obvious use cases for body pose estimation. A model implemented in a phone app can use the device's camera as a sensor to record someone performing an exercise and analyze it.

By tracking the movement of the human body, an exercise can be split into eccentric and concentric phases to analyze joint flexion angles and overall posture. This is done by tracking the key points and providing analytics in the form of hints or graphic analysis, either in real time or after a short delay, giving the user insight into major movement patterns and body mechanics.
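As a toy illustration of phase splitting, the sign of the hip's vertical velocity in image coordinates (where y grows downward) can label each frame of a squat. This is a hypothetical simplification; a real app would use smoothing and more robust signals:

```python
import numpy as np

def squat_phases(hip_y):
    """Label each frame transition of a hip-height trajectory as
    eccentric (lowering), concentric (rising), or hold.
    hip_y uses image coordinates, so larger y means lower in frame."""
    dy = np.diff(hip_y)
    labels = np.where(dy > 0, "eccentric",
                      np.where(dy < 0, "concentric", "hold"))
    return labels.tolist()

# Synthetic squat: descend for 3 frames, pause, rise for 3 frames
hip_y = [100, 120, 140, 160, 160, 140, 120, 100]
print(squat_phases(hip_y))
```

Once frames are labeled, per-phase metrics such as tempo or depth can be computed separately for the lowering and rising portions of each rep.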

Rehabilitation and Physiotherapy

The physiotherapy industry is another human activity tracking use case with similar rules of application. In the era of telemedicine, in-home consultations have become much more flexible and diverse, and AI technologies have enabled more sophisticated forms of online treatment.

The analysis of rehab activities applies similar concepts to fitness applications, but with stricter accuracy requirements. Since we're dealing with recovery from injury, this category of applications falls under healthcare, which means it must meet healthcare industry standards and the data protection laws of the relevant country.

Augmented Reality 

Augmented reality applications like virtual fitting rooms can benefit from human pose estimation as one of the most advanced methods of detecting and recognizing the position of a human body in space. This is useful in e-commerce, where shoppers can't try clothes on before buying.

Human pose estimation can be applied to track key points on the human body and pass this data to the augmented reality engine that fits clothes onto the user. This works for any body part and type of clothing, and even face masks. We've described our experience of using human pose estimation for virtual fitting rooms in a dedicated article.

Animation and Gaming

Game development is a tough industry with a lot of complex tasks that require knowledge of human body mechanics. Body pose estimation is widely used in the animation of game characters to simplify this process by transferring tracked key points in a certain position to the animated model. 

The process resembles the motion tracking technology used in video production, but it doesn't require a large number of sensors placed on the actor. Instead, multiple cameras can detect the motion pattern and recognize it automatically. The captured data can then be transformed and transferred to the actual 3D model in the game engine.

Surveillance and Human Activity Analysis

Some surveillance cases don’t require spotting a crime in a crowd of people. Instead, cameras can be used to automate everyday processes like shopping at a grocery store. 

Cashierless store systems like Amazon Go, for example, apply human pose estimation to understand whether a person took an item from a shelf. HPE is used in combination with other computer vision technologies, which allows Amazon to automate checkout in its stores using a network of camera sensors, IoT devices, and HPE systems.

Human pose estimation is responsible for the part of the process where the actual area of contact with the product is not visible to the camera. So here, the HPE model analyzes the position of customers’ hands and heads to understand if they took the product from the shelf, or left it in place.
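A simplified sketch of the kind of geometric test such a system could run on a tracked wrist keypoint. The shelf region and coordinates are invented for illustration; a production system would reason over many cameras and frames:

```python
def hand_in_shelf(hand_xy, shelf_box):
    """Return True if a tracked wrist keypoint falls inside a shelf's
    2D region of interest, given as (x1, y1, x2, y2)."""
    x, y = hand_xy
    x1, y1, x2, y2 = shelf_box
    return x1 <= x <= x2 and y1 <= y <= y2

shelf = (200, 100, 400, 180)          # hypothetical shelf bounding box
print(hand_in_shelf((250, 150), shelf))  # hand reached into the shelf
print(hand_in_shelf((500, 150), shelf))  # hand stayed outside
```

Combined with inventory sensors, a sequence of such in/out events per customer is enough to infer whether an item was taken or put back.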

How to Train a Human Pose Estimation Model?

Human pose estimation is a machine learning technology, which means you'll need data to train it. HPE tackles the fairly difficult task of detecting and recognizing multiple objects on screen, and neural networks serve as its engine. Training a neural network requires enormous amounts of data, so the most practical approach is to use available datasets like the following:

  • HumanEva
  • COCO
  • MPII Human Pose
  • Human3.6M

The majority of these datasets are suitable for fitness and rehab applications with human pose estimation. But this doesn’t guarantee high accuracy in terms of more unusual movements or specific tasks like surveillance or multi-person pose estimation. 
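As an example of what such datasets look like, COCO stores each annotated person's keypoints as a flat list of 17 (x, y, visibility) triplets, where visibility 0 means not labeled, 1 labeled but occluded, and 2 visible. That layout is easy to unpack:

```python
import numpy as np

COCO_NUM_KEYPOINTS = 17  # nose, eyes, ears, shoulders, ..., ankles

def parse_coco_keypoints(flat):
    """Unpack a COCO-style flat keypoint list into an (17, 2) array of
    coordinates and a boolean mask of fully visible keypoints."""
    kp = np.asarray(flat, dtype=float).reshape(COCO_NUM_KEYPOINTS, 3)
    xy = kp[:, :2]
    visible = kp[:, 2] == 2
    return xy, visible

# Toy annotation with only the first keypoint (the nose) visible
ann = [320, 180, 2] + [0, 0, 0] * 16
xy, visible = parse_coco_keypoints(ann)
print(xy.shape, int(visible.sum()))  # (17, 2) 1
```

The visibility mask matters during training: keypoints with visibility 0 are typically excluded from the loss.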

For the remaining cases, data collection is inevitable, since a neural network requires quality samples to provide accurate object detection and tracking. Experienced data science and machine learning teams can be helpful here, since they can advise on how to gather data and handle the actual development of the model.


  • Artificial Intelligence
  • Augmented Reality
  • Fitness
  • Machine Learning

